new method for ingesting tarballs via a single staging PR #232

Draft
wants to merge 25 commits into
base: main
from

Conversation

trz42
Contributor

@trz42 trz42 commented Jun 29, 2025

This PR is based on the proof-of-concept #213. It aims to keep only the necessary code from the proof-of-concept and to move the main loop into a separate script, thereby leaving the existing ingestion code unchanged.

Summary of the ideas/changes/additions:

  • Adds CI to enforce code quality (flake8; existing code is not validated) and to run pytest.
  • We aim to use type hints for all function arguments and return values.
  • Improves logging by combining levels (as before), scopes (to limit logging to parts of the code), and a decorator for logging function entry and exit. We aim to use the decorator for all functions to provide detailed debugging output.
  • Models the client used to fetch files and ETags from a remote storage service.
  • Models an S3 bucket (e.g., hosted on AWS or MinIO).
  • Models a file and its signature, including functions to download them, to use ETags so that they are only downloaded if they have changed on the remote storage, and to verify the signature. A file can be the payload (tarball), a metadata (or task) file, or any other file of interest.
  • Models a task description (essentially the metadata or task file as read in, plus some associated convenience functions such as obtaining the architecture from the name of the metadata/task file).
  • Models a task payload (this could be a list of directories/files to be removed from a CVMFS repo, a tarball containing software installations, or anything else that should be applied to a CVMFS repo).
  • Models a task (combines the task description and the task payload, provides most of the logic to process a task, ensures that a task for a single payload is bundled in a single staging PR, updates its information in the staging repo, ...).
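
The entry/exit logging decorator mentioned in the list above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's implementation: the name `log_entry_exit` and the logger name are hypothetical, and the scope filtering described above is left out.

```python
import functools
import logging
from typing import Any, Callable


def log_entry_exit(logger: logging.Logger) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that logs entry to and exit from the wrapped function.

    Hypothetical helper for illustration; the PR's decorator may differ.
    """
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            logger.debug("entering %s", func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on normal return and when an exception propagates.
                logger.debug("leaving %s", func.__name__)
        return wrapper
    return decorator


logger = logging.getLogger("ingest_sketch")  # hypothetical logger name


@log_entry_exit(logger)
def add(a: int, b: int) -> int:
    return a + b
```

Using `try`/`finally` means the exit message is emitted even if the function raises, which tends to be what one wants when tracing a failing run.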

States, repository directory structure ... in a picture

[Image: ingest_bundles_infographics]

High-level overview of state handler functions

_handle_add_undetermined

  1. Determine sequence number (corresponds to open or yet-to-be-opened pull request)
  2. Create files and directories with a single commit in default branch (see picture above)

_handle_add_new_task

  1. Init payload object (EESSITaskPayload) by downloading payload
  2. Update TaskState file

_handle_add_payload_staged

  1. Determine the feature branch name
  2. Create the feature branch if it doesn't exist (TaskState is still PAYLOAD_STAGED in both the default and the feature branch after it was created)
  3. Search for a PR for the feature branch
  4. None found: update states (default branch: PULL_REQUEST, feature branch: APPROVED) and create a pull request
  5. Found and closed: open an issue (TO BE IMPLEMENTED)
  6. Found and open: update states (default branch: PULL_REQUEST, feature branch: APPROVED) and update the pull request
    Creating or updating a pull request also creates or updates a TaskSummary.html file and the description of the pull request.
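
The three outcomes of the PR search in the steps above can be written down as a small decision function. This is only a sketch: `PullRequestStatus`, `NextStep`, and `decide_next_step` are hypothetical names introduced here, and no GitHub API is called.

```python
from enum import Enum, auto
from typing import Optional


class PullRequestStatus(Enum):
    """Status of a PR found for the feature branch (hypothetical type)."""
    OPEN = auto()
    CLOSED = auto()


class NextStep(Enum):
    """What the handler does next (hypothetical type)."""
    CREATE_PULL_REQUEST = auto()
    OPEN_ISSUE = auto()
    UPDATE_PULL_REQUEST = auto()


def decide_next_step(found: Optional[PullRequestStatus]) -> NextStep:
    """Map the search result to the action listed in the overview.

    `None` stands for "no PR found for the feature branch".
    """
    if found is None:
        return NextStep.CREATE_PULL_REQUEST
    if found is PullRequestStatus.CLOSED:
        return NextStep.OPEN_ISSUE  # marked TO BE IMPLEMENTED in the overview
    return NextStep.UPDATE_PULL_REQUEST
```

Keeping the decision separate from the side effects (state updates, API calls) would make this part easy to cover with pytest.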

_handle_add_pull_request

  1. Determine the state of the PR
  2. If the PR was closed, change the state to REJECTED
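
The handler names above follow the pattern `_handle_<action>_<state>`, so one way to dispatch to them is by name. The sketch below assumes exactly that and nothing more: `TaskSketch`, its constructor arguments, and the lowercase state strings are hypothetical, and only the PULL_REQUEST handler is filled in, using the rule from the overview.

```python
from typing import Callable, Optional


class TaskSketch:
    """Toy stand-in for a task that dispatches on action and state."""

    def __init__(self, action: str, state: str, pr_closed: bool = False) -> None:
        self.action = action
        self.state = state
        self.pr_closed = pr_closed  # hypothetical flag replacing a real PR lookup

    def handle(self) -> str:
        """Call the handler for the current action/state and store the new state."""
        name = f"_handle_{self.action}_{self.state}"
        handler: Optional[Callable[[], str]] = getattr(self, name, None)
        if handler is None:
            raise NotImplementedError(f"no handler named {name}")
        self.state = handler()
        return self.state

    def _handle_add_pull_request(self) -> str:
        # From the overview: a closed PR moves the task to REJECTED.
        # Staying in the same state otherwise is an assumption of this sketch.
        return "rejected" if self.pr_closed else "pull_request"
```

For example, `TaskSketch("add", "pull_request", pr_closed=True).handle()` returns `"rejected"`, while an action/state pair without a handler raises `NotImplementedError`.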

@trz42 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Jun 29, 2025
        return EESSITaskAction.ADD
    elif action_str == "update":
        return EESSITaskAction.UPDATE
    return EESSITaskAction.UNKNOWN
Collaborator

It would probably be good to return EESSITaskAction.ADD here, at least for now, allowing us to test and use this new functionality with the existing metadata files (which don't have an action defined yet). In the future we can make this field required and revert to UNKNOWN.

Contributor Author

@trz42 trz42 Aug 6, 2025

Yes, we should work around this by returning ADD.

Contributor Author

Changed in b1663f1
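
For readers following this thread, the fallback being discussed could look roughly like the sketch below. It is not necessarily what b1663f1 does: `parse_action` is a hypothetical name, the enum is redefined here only to keep the example self-contained, and this variant falls back to ADD only when the action is missing, whereas the suggestion above returns ADD at the final `return`.

```python
from enum import Enum, auto
from typing import Optional


class EESSITaskAction(Enum):
    """Redefined for this example; members taken from the snippet above."""
    ADD = auto()
    UPDATE = auto()
    UNKNOWN = auto()


def parse_action(action_str: Optional[str]) -> EESSITaskAction:
    """Translate the action string from a metadata/task file into an enum member.

    Assumption of this sketch: a missing or empty action means ADD, so that
    existing metadata files without an action field can still be processed.
    """
    if action_str is None or not action_str.strip():
        return EESSITaskAction.ADD
    if action_str == "add":
        return EESSITaskAction.ADD
    if action_str == "update":
        return EESSITaskAction.UPDATE
    return EESSITaskAction.UNKNOWN
```

Separating "missing" from "unrecognized" keeps typos in the action field visible as UNKNOWN; once the field becomes required, the first branch could return UNKNOWN as well.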

Comment on lines 80 to 90
    other = [ # anything that is not in <prefix>/software nor <prefix>/modules
        member.path
        for member in members
        if (
            not PurePosixPath(prefix).joinpath("software") in PurePosixPath(member.path).parents
            and not PurePosixPath(prefix).joinpath("modules") in PurePosixPath(member.path).parents
        )
        # if not fnmatch.fnmatch(m.path, os.path.join(prefix, 'software', '*'))
        # and not fnmatch.fnmatch(m.path, os.path.join(prefix, 'modules', '*'))
    ]
    members_list = sorted(swdirs + modfiles + other)
Collaborator

@bedroge bedroge Aug 5, 2025

An initial attempt to run this on some real 2025.06 tarballs (containing EB 5.1.1) ultimately resulted in the PR description no longer being updated after it had processed four of them, probably because the message became too long. The cause is that the code does not take the reprod dirs in the tarball into account, which was fixed in #235. It can be fixed here in the same way:

Suggested change
-    other = [ # anything that is not in <prefix>/software nor <prefix>/modules
-        member.path
-        for member in members
-        if (
-            not PurePosixPath(prefix).joinpath("software") in PurePosixPath(member.path).parents
-            and not PurePosixPath(prefix).joinpath("modules") in PurePosixPath(member.path).parents
-        )
-        # if not fnmatch.fnmatch(m.path, os.path.join(prefix, 'software', '*'))
-        # and not fnmatch.fnmatch(m.path, os.path.join(prefix, 'modules', '*'))
-    ]
-    members_list = sorted(swdirs + modfiles + other)
+    reprod_dirs = [
+        member.path
+        for member in members
+        if member.isdir() and PurePosixPath(member.path).match(os.path.join(prefix, 'reprod', '*', '*', '*'))
+    ]
+    other = [ # anything that is not in <prefix>/software nor <prefix>/modules nor <prefix>/reprod
+        member.path
+        for member in members
+        if (
+            not PurePosixPath(prefix).joinpath('software') in PurePosixPath(member.path).parents
+            and not PurePosixPath(prefix).joinpath('modules') in PurePosixPath(member.path).parents
+            and not PurePosixPath(prefix).joinpath('reprod') in PurePosixPath(member.path).parents
+        )
+        # if not fnmatch.fnmatch(m.path, os.path.join(prefix, 'software', '*'))
+        # and not fnmatch.fnmatch(m.path, os.path.join(prefix, 'modules', '*'))
+    ]
+    members_list = sorted(swdirs + modfiles + reprod_dirs + other)
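
The `parents` membership test used in the suggestion above can be tried in isolation on plain path strings. The following is a self-contained sketch rather than the PR's code: `classify_paths` is a hypothetical helper, it works on strings instead of tarfile members, and it groups every path below `<prefix>/reprod` instead of selecting only directories at a fixed depth.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def classify_paths(paths: List[str], prefix: str) -> Dict[str, List[str]]:
    """Group paths by whether they live below <prefix>/software, /modules or /reprod."""
    base = PurePosixPath(prefix)
    groups: Dict[str, List[str]] = {"software": [], "modules": [], "reprod": [], "other": []}
    for path in paths:
        parents = PurePosixPath(path).parents
        for name in ("software", "modules", "reprod"):
            # True when <prefix>/<name> is an ancestor directory of this path.
            if base / name in parents:
                groups[name].append(path)
                break
        else:
            # No break occurred: the path is outside all three subtrees.
            groups["other"].append(path)
    return groups


example_prefix = "versions/2025.06/software/linux/x86_64/generic"  # made-up prefix
example_paths = [
    f"{example_prefix}/software/Foo/1.0/bin/foo",
    f"{example_prefix}/modules/all/Foo/1.0.lua",
    f"{example_prefix}/reprod/Foo/1.0/20250805/build.log",
    f"{example_prefix}/.lmod/cache",
]
example_groups = classify_paths(example_paths, example_prefix)
```

Note that the made-up prefix itself contains a `software` component; the test still works because it compares against the full `<prefix>/software` path rather than a bare directory name.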

Contributor Author

Ah, yes. I wonder if we need to be a little more aggressive about shortening/limiting the size of the description. Since we may have plenty of tarballs for GPU builds, it will require a more concise presentation.

Contributor Author

Changed in d7952d8
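
One possible way to cap the size of the description, as raised in this thread, is to stop listing paths once a character budget is reached and summarize the rest. The sketch below is not what d7952d8 implements: `summarize_paths` and `max_chars` are hypothetical, and the actual limit GitHub applies to a PR description is not stated in this thread and would need to be checked.

```python
from typing import List


def summarize_paths(paths: List[str], max_chars: int) -> str:
    """Join paths one per line, replacing the tail with a short notice if needed.

    The check is conservative: room for the notice is reserved even when the
    remaining paths would have fit. If even the first path does not fit, only
    the notice is returned, which may itself exceed a very small max_chars.
    """
    lines: List[str] = []
    used = 0
    for index, path in enumerate(paths):
        remaining = len(paths) - index
        notice = f"... and {remaining} more entries"
        cost = len(path) + 1  # +1 for the newline that follows the path
        if used + cost + len(notice) > max_chars:
            lines.append(notice)
            break
        lines.append(path)
        used += cost
    return "\n".join(lines)
```

With a generous budget the list is returned unchanged, for example `summarize_paths(["a", "b"], 100)` gives the two names on separate lines; with a tight budget the output ends in a line such as `... and 4 more entries`.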

Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
3 participants