New materials are updated #4
base: master
a mix of grammar/spelling things, and some content suggestions. there's more context that could be added about why jobs go on hold.
README.md
Outdated
## Diagnostics with condor_q
- [Job is held](#job-is-held)
- [Job completed but was unsuccessful](#job-completed-but-was-unsuccessful)
- [Job does not start](#Job-does-not-start)
- [Job does not start](#Job-does-not-start)
- [Job does not start](#job-does-not-start)
README.md
Outdated
$ condor_q -hold
ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 08/01/24 15:01:50
ID OWNER HELD_SINCE HOLD_REASON
130.0 alice 8/1 14:56 Transfer input files failure at ⋯
130.0 alice 8/1 14:56 Transfer input files failure at ⋯
130.0 alice 8/1 14:56 Transfer output files failure at ...
README.md
Outdated
$ condor_q -better-analyze JOB-ID -pool POOL-NAME
In this particular case, a user had this in his or her HTCondor submit file:
In this particular case, a user had this in his or her HTCondor submit file:
In this particular case, the message indicates that HTCondor cannot transfer back output files, likely because they were never created. This hold condition can be triggered if a user has this option in their HTCondor submit file:
README.md
Outdated
transfer_output_files = outputfile

However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:
However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:
However, when the job executed, it went into the Held state. To learn more, `condor_q -hold JOB-ID` can be used. An example from this scenario is shown below:
README.md
Outdated
command2
## Job does not start

Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag
"Matchmaking" is a little jargon-y. Can you explain a little more what is meant by that here?
Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag
Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. In addition, the more/more-specific resources are requested, typically the longer the wait. For example if you are submitting a lot of GPU jobs, there may be fewer GPUs (than other resources) to run your jobs. This type of issue can be diagnosed using the `-better-analyze` flag
README.md
Outdated
<li>The job has been continuously running on the same slot</li>
<ul>
Debugging tools:
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job standard output or error files.
README.md
Outdated
<ul>
Debugging tools:
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems.
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems.
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work on all nodes**.
README.md
Outdated
<li>The job is stuck on the file transfer step</li>
<ul>
Debugging tips:
<li>Check the `.log` file for meaning errors</li>
<li>Check the `.log` file for meaning errors</li>
<li>Check the `.log` file for meaningful errors</li>
No description provided.