New materials are updated #4

Open: wants to merge 2 commits into base: master


showmic09

No description provided.


ChristinaLK left a comment

A mix of grammar/spelling fixes and some content suggestions. There's more context that could be added about why jobs go on hold.

README.md Outdated
## Diagnostics with condor_q
- [Job is held](#job-is-held)
- [Job completed but was unsuccessful](#job-completed-but-was-unsuccessful)
- [Job does not start](#Job-does-not-start)

Suggested change
- [Job does not start](#Job-does-not-start)
- [Job does not start](#job-does-not-start)

README.md Outdated
$ condor_q -hold
ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 08/01/24 15:01:50
ID OWNER HELD_SINCE HOLD_REASON
130.0 alice 8/1 14:56 Transfer input files failure at ⋯

Suggested change
130.0 alice 8/1 14:56 Transfer input files failure at
130.0 alice 8/1 14:56 Transfer output files failure at ...
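For readers who want to act on this output programmatically, here is a minimal sketch that turns the table printed by `condor_q -hold` into one record per held job. It assumes the four-column layout shown in this hunk (ID, OWNER, HELD_SINCE, HOLD_REASON); the function name `parse_held_jobs` and the `SAMPLE` text are illustrative and not part of HTCondor, and real output may carry extra lines that this sketch simply skips.

```python
from typing import Dict, List

# Sample text copied from the suggested README output above.
SAMPLE = """\
ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 08/01/24 15:01:50
ID OWNER HELD_SINCE HOLD_REASON
130.0 alice 8/1 14:56 Transfer output files failure at ...
"""


def parse_held_jobs(text: str) -> List[Dict[str, str]]:
    """Return one dict per held job found in `condor_q -hold` style text."""
    rows: List[Dict[str, str]] = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ID":
            # Header row: job rows follow from here on.
            in_table = True
            continue
        # Skip anything before the header, and any row whose first field
        # is not a "cluster.proc" job ID (for example a summary line).
        looks_like_id = fields[0].replace(".", "", 1).isdigit()
        if not in_table or not looks_like_id or len(fields) < 5:
            continue
        # HELD_SINCE spans two fields (date and time); the rest is the reason.
        job_id, owner, date, time, reason = line.split(maxsplit=4)
        rows.append(
            {
                "id": job_id,
                "owner": owner,
                "held_since": f"{date} {time}",
                "reason": reason,
            }
        )
    return rows


for job in parse_held_jobs(SAMPLE):
    print(job["id"], job["reason"])
```

Grouping the resulting records by `reason` is one way to see whether many jobs are held for the same cause.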

README.md Outdated

$ condor_q -better-analyze JOB-ID -pool POOL-NAME
In this particular case, a user had this in his or her HTCondor submit file:

Suggested change
In this particular case, a user had this in his or her HTCondor submit file:
In this particular case, the message indicates that HTCondor cannot transfer back output files, likely because they were never created. This hold condition can be triggered if a user has this option in their HTCondor submit file:

README.md Outdated

transfer_output_files = outputfile

However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:

Suggested change
However, when the job executed, it went into Held state. To see more about the error message `condor_q -hold JOB-ID` can be used. An example error message in this scenario is shown below:
However, when the job executed, it went into the Held state. To learn more, `condor_q -hold JOB-ID` can be used. An example from this scenario is shown below:
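Since this hold usually means a file listed in `transfer_output_files` was never created, a small check at the end of the job script can make the cause visible in the job's own output. The sketch below is one possible approach, not something the README or HTCondor prescribes; `missing_outputs` is a hypothetical helper, and `outputfile` is the name used in the submit file example above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(expected: Iterable[str], workdir: str = ".") -> List[str]:
    """Return the expected output names that do not exist under workdir."""
    base = Path(workdir)
    return [name for name in expected if not (base / name).exists()]


if __name__ == "__main__":
    # Names mirror the transfer_output_files line in the submit file.
    missing = missing_outputs(["outputfile"])
    if missing:
        print("Expected output files were not created:", ", ".join(missing))
```

Printing the missing names to standard output means they show up in the job's output file, which is often easier to read than the hold reason alone.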

README.md Outdated
command2
## Job does not start

Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag

"Matchmaking" is a little jargon-y. Can you explain a little more what is meant by that here?

README.md Outdated
command2
## Job does not start

Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag

Suggested change
Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. Specially, the more/more-specific resources are requested, the longer is the wait. For example if you are submitting a lot of GPU jobs, there may not be enough `GPUs` in the pool. This type of issues can be diagnosed using the `-better-analyze` flag
Matchmaking cycle can take more than 5 minutes to complete and it can be longer if the server is busy. In addition, the more/more-specific resources are requested, typically the longer the wait. For example if you are submitting a lot of GPU jobs, there may be fewer GPUs (than other resources) to run your jobs. This type of issue can be diagnosed using the `-better-analyze` flag
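One way to make "matchmaking" less jargon-y is a toy model: each slot in the pool advertises resources, and a job can only run on a slot that offers at least what the job requests. The sketch below is a deliberate simplification and not HTCondor's actual algorithm; the `SLOTS` pool, the resource names, and `matching_slots` are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical pool: each slot advertises the resources it offers.
SLOTS: List[Dict[str, int]] = [
    {"cpus": 4, "memory_gb": 16, "gpus": 0},
    {"cpus": 8, "memory_gb": 32, "gpus": 0},
    {"cpus": 8, "memory_gb": 64, "gpus": 1},
    {"cpus": 1, "memory_gb": 4, "gpus": 0},
]


def matching_slots(
    request: Dict[str, int], slots: List[Dict[str, int]]
) -> List[Dict[str, int]]:
    """Return the slots that offer at least every requested resource."""
    return [
        slot
        for slot in slots
        if all(slot.get(name, 0) >= amount for name, amount in request.items())
    ]


# A small CPU-only request fits every slot in this toy pool.
cpu_job = {"cpus": 1, "memory_gb": 2}
# Adding a GPU requirement narrows the candidates to a single slot.
gpu_job = {"cpus": 1, "memory_gb": 2, "gpus": 1}

print(len(matching_slots(cpu_job, SLOTS)), len(matching_slots(gpu_job, SLOTS)))
# prints: 4 1
```

With fewer candidate slots, a job is more likely to wait for one to become free, which is the effect the paragraph describes for GPU jobs.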

README.md Outdated
<li>The job has been continuously running on the same slot</li>
<ul>
Debugging tools:
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job

Suggested change
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job standard output or error files.

README.md Outdated
<ul>
Debugging tools:
<li>[condor_tail](https://htcondor.readthedocs.io/en/latest/man-pages/condor_tail.html)</li>- It returns the last X bytes of the job
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems.

Suggested change
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work** on all systems.
<li>[condor_ssh_to_job](https://htcondor.readthedocs.io/en/latest/man-pages/condor_ssh_to_job.html). It allows to log in to the execution point. Point to be noted-It **does not work on all nodes**.

README.md Outdated
<li>The job is stuck on the file transfer step</li>
<ul>
Debugging tips:
<li>Check the `.log` file for meaning errors</li>

Suggested change
<li>Check the `.log` file for meaning errors</li>
<li>Check the `.log` file for meaningful errors</li>
