New materials are updated #4

showmic09 · 2025-02-21T18:10:51Z

No description provided.

ChristinaLK

A mix of grammar/spelling fixes and some content suggestions. There's more context that could be added about why jobs go on hold.

README.md

ChristinaLK · 2025-04-17T16:47:26Z

README.md

-## Diagnostics with condor_q
+ - [Job is held](#job-is-held)
+ - [Job completed but was unsuccessful](#job-completed-but-was-unsuccessful)
+ - [Job does not start](#Job-does-not-start)


Suggested change

- [Job does not start](#Job-does-not-start)

- [Job does not start](#job-does-not-start)

ChristinaLK · 2025-04-17T16:48:06Z

README.md

+	$ condor_q -hold
+	ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 08/01/24 15:01:50
+ 	ID         OWNER          HELD_SINCE  HOLD_REASON
+	130.0       alice           8/1  14:56 Transfer input files failure at ⋯


Suggested change

130.0 alice 8/1 14:56 Transfer input files failure at ⋯

130.0 alice 8/1 14:56 Transfer output files failure at ...
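The truncated `HOLD_REASON` column above can be hard to read, so a short example of pulling the full reason might help here. The following is a minimal sketch, assuming `condor_q` is on the `PATH`; `hold_reason` is a hypothetical helper name, and it relies on the `-af` (autoformat) option together with the `HoldReasonCode` and `HoldReason` job attributes.

```shell
# Sketch: print the hold code and the full hold reason for one job.
# `hold_reason` is a hypothetical helper name, not an HTCondor command.
hold_reason() {
    job_id="${1:-}"
    if [ -z "$job_id" ]; then
        echo "usage: hold_reason JOB-ID" >&2
        return 2
    fi
    # -af prints only the listed job attributes, one line per job.
    condor_q "$job_id" -af HoldReasonCode HoldReason
}

# Example, using the job ID from the listing above:
#   hold_reason 130.0
```

Wrapping the call in a function keeps the attribute names in one place, which is only a convenience; the plain `condor_q` command works the same way.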

ChristinaLK · 2025-04-17T16:50:33Z

README.md


-	$ condor_q -better-analyze JOB-ID -pool POOL-NAME
+In this particular case, a user had this in his or her HTCondor submit file:


Suggested change

In this particular case, a user had this in his or her HTCondor submit file:

In this particular case, the message indicates that HTCondor cannot transfer back output files, likely because they were never created. This hold condition can be triggered if a user has this option in their HTCondor submit file:

ChristinaLK · 2025-04-17T16:52:07Z

README.md

+
+	transfer_output_files = outputfile
+
+However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:


Suggested change

However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:

However, when the job executed, it went into the Held state. To learn more, `condor_q -hold JOB-ID` can be used. An example from this scenario is shown below:
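It might also be worth showing readers how to catch this before submitting. As a rough sketch, assuming the job script has been run by hand in a scratch directory, a small check can list which of the files named in `transfer_output_files` were never created. The helper name `missing_outputs` and the `./scratch` path are hypothetical.

```shell
# Sketch: list expected output files that a test run did not create.
# `missing_outputs` is a hypothetical helper, not an HTCondor command.
missing_outputs() {
    dir="$1"
    shift
    status=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    # Returns 0 when every file exists, 1 when at least one is missing.
    return "$status"
}

# Example: check the file named in the submit file above.
#   missing_outputs ./scratch outputfile
```

The function only inspects the local file system, so it says nothing about what happens on the execution point; it is meant as a quick pre-submission sanity check.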

ChristinaLK · 2025-04-17T16:55:41Z

README.md

+	command2
+## Job does not start
+
+Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag 


"Matchmaking" is a little jargon-y. Can you explain a little more what is meant by that here?

ChristinaLK · 2025-04-17T16:57:00Z

README.md

+	command2
+## Job does not start
+
+Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag 


Suggested change

Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag

The matchmaking cycle can take more than 5 minutes to complete, and it can take longer if the server is busy. In addition, the more resources (or the more specific the resources) a job requests, the longer the wait typically is. For example, if you are submitting a lot of GPU jobs, there may be fewer GPUs (than other resources) available to run your jobs. This type of issue can be diagnosed using the `-better-analyze` flag
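A concrete invocation could make the `-better-analyze` advice easier to follow. The sketch below wraps the command that the README previously showed (`condor_q -better-analyze JOB-ID -pool POOL-NAME`); `why_idle` is a hypothetical helper name, and the pool argument is treated as optional here as an assumption.

```shell
# Sketch: ask HTCondor why a job has not matched any slot.
# `why_idle` is a hypothetical helper name, not an HTCondor command.
why_idle() {
    job_id="${1:-}"
    pool="${2:-}"
    if [ -z "$job_id" ]; then
        echo "usage: why_idle JOB-ID [POOL-NAME]" >&2
        return 2
    fi
    if [ -n "$pool" ]; then
        # Analyze against a specific pool, as in the original README command.
        condor_q -better-analyze "$job_id" -pool "$pool"
    else
        condor_q -better-analyze "$job_id"
    fi
}

# Example (JOB-ID and POOL-NAME are placeholders):
#   why_idle JOB-ID POOL-NAME
```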

ChristinaLK · 2025-04-17T16:57:43Z

README.md

+<li>The job has been continuously running on the same slot</li>
+<ul>
+Debugging tools: 
+<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job


Suggested change

<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job

<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job's standard output or error file.
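An example call might be useful next to this item. The following is a minimal sketch, assuming the `-stderr` and `-no-stdout` options behave as described in the linked `condor_tail` man page; `tail_job` is a hypothetical helper name.

```shell
# Sketch: peek at the standard output or standard error of a running job.
# `tail_job` is a hypothetical helper name, not an HTCondor command.
tail_job() {
    job_id="${1:-}"
    stream="${2:-stdout}"
    if [ -z "$job_id" ]; then
        echo "usage: tail_job JOB-ID [stdout|stderr]" >&2
        return 2
    fi
    case "$stream" in
        stdout) condor_tail "$job_id" ;;
        # Option names assumed from the condor_tail man page.
        stderr) condor_tail -no-stdout -stderr "$job_id" ;;
        *)
            echo "unknown stream: $stream" >&2
            return 2
            ;;
    esac
}

# Example (JOB-ID is a placeholder):
#   tail_job JOB-ID stderr
```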

ChristinaLK · 2025-04-17T16:58:20Z

README.md

+<ul>
+Debugging tools: 
+<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
+<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems. 


Suggested change

<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems.

<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows you to log in to the execution point. Note that it **does not work on all nodes**.

ChristinaLK · 2025-04-17T16:58:41Z

README.md

+<li>The job is stuck on the file transfer step</li>
+<ul>
+Debugging tips:
+<li>Check the `.log` file for meaning errors</li>


Suggested change

<li>Check the `.log` file for meaning errors</li>

<li>Check the `.log` file for meaningful errors</li>

New materials are updated

b7f2cd0

showmic09 requested review from ChristinaLK and aowen-uwmad February 21, 2025 18:10

ChristinaLK reviewed Apr 17, 2025


newer changes

378e4c4


