Delete git/RO directories when not needed, delete old files/dirs in cron #426

Merged
merged 4 commits into from
Dec 30, 2022

Conversation

kinow
Member

@kinow kinow commented Jun 14, 2022

Closes #424
Closes #200
Closes #279
Supersedes and closes #344

Description

See #424

Motivation and Context

See #424

How Has This Been Tested?

  • Unit tests
  • Local docker-compose cluster
  • Cron expression (i.e. create files with certain file age, start server, wait for cron/scheduler to run)

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@kinow
Member Author

kinow commented Jun 14, 2022

Note to self: confirm we are deleting these files when they are no longer necessary, and not too early.

@kinow kinow force-pushed the heal-cwlviewer branch 2 times, most recently from 888269b to 9516d2a Compare June 20, 2022 23:00
@kinow kinow changed the title from "Fix intermittent issues in CWL Viewer (disk space, DB transactions, ...)" to "Delete git/RO directories when not needed, delete old files/dirs in cron" Jun 21, 2022
Member Author

@kinow kinow left a comment

Still in draft as I want to see if we can have unit tests — it's hard to mock some I/O features, not sure if testing is easy/feasible for deleting files, especially by age. I also need to do some more manual testing on my environment 👍

But I think it's going in the right direction! The GitHub issues regarding temporary files and directories appear to date from about 2 years ago. Hopefully we will be able to close these issues in the next release. 🤞

List<String> temporaryDirectories = Stream.of(bundleStorage, graphvizStorage, gitStorage)
.distinct()
.toList();
temporaryDirectories.forEach(this::clearDirectory);
Member Author

@kinow kinow Jun 21, 2022

@mr-c , @tetron , this change should delete all files older than a certain limit from the server.

But I was thinking, maybe we want to keep the files from the workflows listed in the first page of the CWL Viewer app (or two first, or three first pages, or first 15 workflows, etc.)?

I can probably use the same code used when you visit "Explore", but requesting just one page or a certain number of workflows. Then I would get the git and RO directories locations from the PostgreSQL table and skip these.

Or is that a bit too much? WDYT?

Member

Can we keep the ROs but delete the old Git checkouts?

Member Author

Sure. Do you mean keep all the ROs, or just the ones in the first page of the workflows listing?

I was going to send a screenshot of the first page, but the server appears to be unwell again.

image

I'll leave a message on the Matrix CWLViewer and in some room at Curii to see if someone can kick that server or container, etc.

Member

@mr-c mr-c Jun 21, 2022

How much disk space do the ROs take up? An estimate based upon some "typical" ROs is fine as well.

Member Author

Good idea, cc'ing @tetron

Member Author

Updated the code above to skip bundles 👍
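For readers following the thread, here is a minimal sketch of what "delete files older than a certain limit" could look like. It is not the code from this PR: the class name `OldFileCleaner` and the method `deleteOlderThan` are hypothetical, and the sketch only looks at regular files directly inside one directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not the PR's Scheduler or FileUtils. */
final class OldFileCleaner {

    private OldFileCleaner() {}

    /**
     * Deletes regular files directly inside {@code dir} whose last-modified
     * time is more than {@code maxAge} before {@code now}.
     *
     * @return the paths that were deleted
     */
    static List<Path> deleteOlderThan(Path dir, Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<Path> deleted = new ArrayList<>();
        try {
            // Snapshot the listing first so nothing is deleted while iterating.
            List<Path> entries;
            try (Stream<Path> listing = Files.list(dir)) {
                entries = listing.collect(Collectors.toList());
            }
            for (Path entry : entries) {
                boolean expired = Files.isRegularFile(entry)
                        && Files.getLastModifiedTime(entry).toInstant().isBefore(cutoff);
                if (expired) {
                    Files.delete(entry);
                    deleted.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return deleted;
    }
}
```

Skipping the bundles then amounts to not passing the bundle directory to the helper, for example by calling it only for the graphviz and git storage paths. Passing `now` in explicitly is a design choice that keeps the age check easy to unit test.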

@kinow
Member Author

kinow commented Jun 24, 2022

Things to think about (from last week's meetings):

  • Can we improve compression of bundle files? Choose a different compression level?
    • That's controlled by Taverna. It creates a ZipOutputStream that's not exposed. It uses the Deflater's DEFAULT_COMPRESSION of -1. The levels go from 0 to 9 (best compression is 9). The values, including -1, get passed down to zlib. zlib then chooses the default level, which from this SO answer appears to be 6. Not sure how much we would gain from going from 6 to 9. But I don't think that will help much if we have multiple large files.
  • Do we have any workflows with very large files? Or maybe a different approach could be to just use dd or some binary file, put it in a git repo with a dummy CWL workflow file, import it into CWL Viewer, and confirm whether the sizes of the files in the git repo are reflected in the bundle file size.
  • The metagenomics workflows have many files, but not sure if there are large files.
  • We need to confirm that, having the bundle ZIP, the bundle directory is really not necessary. If so, we can safely delete the bundle directories.
    • @mr-c, @tetron, after removing the directories of the RO bundle and of the Git repository, the graphs and the RO bundle ZIP are still served fine. After removing the graphs as well, the RO bundle ZIP is still served with no problems, and the Viewer re-creates the graphs from the data stored in the database. So I believe we can safely delete the directories 👍
  • If the RO bundle ZIP files are not too large, then maybe we can just leave the ZIP files on disk? Or maybe delete the ZIP files using a higher file age limit (1-3 months for the ZIPs? a few hours or days for the cron job that deletes directories, graphs?)
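To get a feel for how much the compression level matters, a small probe like the one below could be run against representative payloads. This is only a sketch using the JDK's own `java.util.zip` classes directly; it does not go through Taverna, and the class name `ZipLevelProbe` and the entry name `payload.bin` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical probe; it bypasses Taverna and uses java.util.zip directly. */
final class ZipLevelProbe {

    private ZipLevelProbe() {}

    /**
     * Zips {@code payload} as a single entry at the given Deflater level
     * (0 to 9, or -1 for DEFAULT_COMPRESSION) and returns the archive size.
     */
    static int zippedSize(byte[] payload, int level) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.setLevel(level);
            zip.putNextEntry(new ZipEntry("payload.bin"));
            zip.write(payload);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}
```

Comparing `zippedSize(bytes, Deflater.DEFAULT_COMPRESSION)` with `zippedSize(bytes, Deflater.BEST_COMPRESSION)` on a few real workflow files would show whether a higher level is worth pursuing. Payloads that are already compressed or random are unlikely to shrink at any level.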

@tetron
Member

tetron commented Jul 5, 2022

A data point about the lingering git repositories:

There's 1 checked-out repo of 1.2GB (250MB of which are gone in the examples/ subdir).
Then, there are multiple copies of ~700MB.

So it might also be a good idea to introduce a cap on the size of research object bundles. Even if we clean up the git repos, keeping around a bunch of 500-1000 MB RO bundles is a huge disk space hog.
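A size cap along these lines could be checked before a bundle is kept. The sketch below is only an illustration of the idea: `BundleSizeCap`, `sizeInBytes`, and `exceedsCap` are hypothetical names, and the cap value itself would have to be decided by the maintainers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical size check; the cap value is an assumption left to the caller. */
final class BundleSizeCap {

    private BundleSizeCap() {}

    /** Sums the sizes of all regular files under {@code root} (a file or a directory). */
    static long sizeInBytes(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .mapToLong(p -> p.toFile().length())
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true when the total size under {@code root} is above {@code capBytes}. */
    static boolean exceedsCap(Path root, long capBytes) {
        return sizeInBytes(root) > capBytes;
    }
}
```

The same check works on a cloned repository directory or on a single bundle ZIP, since walking a regular file yields just that file.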

@kinow
Member Author

kinow commented Jul 11, 2022

Note: I spent some time debugging the main flow of importing a workflow, and there are a few things that are not perfect in this PR. Namely, an exception is sometimes triggered because a graph already exists, and that breaks the process due to the changes here (the exception itself is old, I've seen it before). I also think I am deleting the .git directory, not the git repository. I took notes of what appeared to be wrong while debugging, along with a few other things, and will take care of this next time I work on the viewer 👍

@kinow
Member Author

kinow commented Aug 15, 2022

Rebased, fixed the code I remembered wasn't working as expected, and fixed the unit tests. Manual testing found no issues.

Waiting for CI to review the code coverage, then either mark as ready for review or add new unit tests. Also need to test the scheduler/cron.

Note to self & reviewers, here's how I tested it:

# create containers
$ docker-compose up --no-start
# start DB & Jena/SPARQL
$ docker-compose start postgres sparql
# create temporary directory, and separate directories for each service (production uses a single one, but it's handy for testing)
$ mkdir -p /tmp/cwlviewer/{bundles,git,graphviz}
# confirm the layout
$ tree /tmp/cwlviewer/
/tmp/cwlviewer/
├── bundles
├── git
└── graphviz
# in IntelliJ, start the `CwlViewerApplication` with the following JVM settings:
# -Djava.io.tmpdir=/tmp/cwlviewer/ -DbundleStorage=/tmp/cwlviewer/bundles -DgitStorage=/tmp/cwlviewer/git -DgraphvizStorage=/tmp/cwlviewer/graphviz
# Visit http://localhost:8080/

Now import compile.cwl. After a few seconds, when it succeeds, you should have:

  • /tmp/cwlviewer/graphviz with three graphs for compile.cwl
  • /tmp/cwlviewer/bundles with a single ZIP bundle
  • /tmp/cwlviewer/git empty (the git repository was cloned there, but the ROBundleService now deletes the git repository once it has succeeded - no further need for it)
  • /tmp/cwlviewer/ with the three directories above, a couple created by Tomcat/Spring, and no other file or directory

The most important part here is the removal of the Git repositories after they have been successfully used. There are cases where we could have an issue in the middle of the process execution, and the git and bundle directories could be left behind. Unfortunately the only way to remove them is periodically based on some file-age (hence the Scheduler/cron change).

-Bruno

@kinow kinow marked this pull request as ready for review August 15, 2022 04:01
@kinow
Member Author

kinow commented Aug 15, 2022

Just pending some manual testing, and I will try to add a few mocked tests, but it looks like the code that's not covered is mostly exception handling, which is hard to cover. For the Scheduler, tomorrow I will start the server with a shorter cron expression (run every 2 minutes or so) and confirm it's deleting the files correctly.

@kinow kinow marked this pull request as draft August 15, 2022 23:50
@kinow
Member Author

kinow commented Aug 16, 2022

Something strange with the coverage reports. FileUtils.java is now fully covered by FileUtilsTest.java 😕

The code changed in the services & bundle factory is calling FileUtils, so that and some manual tests should be enough to validate this change.

@@ -195,7 +203,7 @@ public Bundle createBundle(Workflow workflow, GitDetails gitInfo) throws IOExcep
addAggregation(bundle, manifestAnnotations,
"merged.cwl", cwlTool.getPackedVersion(rawUrl));
} catch (CWLValidationException ex) {
-logger.error("Could not pack workflow when creating Research Object", ex.getMessage());
+logger.error(String.format("Could not pack workflow when creating Research Object: %s", ex.getMessage()), ex);
Member Author

Bug found with IDE.

} finally {
gitSemaphore.release(gitInfo.getRepoUrl());
org.commonwl.view.util.FileUtils.deleteGitRepository(gitRepo);
Member Author

Remove git folder in the finally statement since we won't need it after this point (i.e. the graphs are created, the cwltool has been executed and populated the DB, and now the git artefacts are already in the bundle).
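The pattern of cleaning up in a finally block can be sketched as below. This is not the PR's `FileUtils.deleteGitRepository`; the names `CheckoutCleanup`, `workTreeOf`, `deleteRecursively`, and `useThenDelete` are hypothetical. The sketch also assumes that a path ending in `.git` should be mapped to its parent, so that the whole working tree is removed and not just the metadata directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical cleanup helper; not the PR's FileUtils implementation. */
final class CheckoutCleanup {

    private CheckoutCleanup() {}

    /** Maps a ".git" metadata directory to its enclosing working tree. */
    static Path workTreeOf(Path path) {
        Path name = path.getFileName();
        boolean isGitDir = name != null && name.toString().equals(".git");
        return (isGitDir && path.getParent() != null) ? path.getParent() : path;
    }

    /** Deletes {@code root} and everything below it; does nothing if it is absent. */
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(root)) {
                // Reverse order puts children before their parent directories.
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : paths) {
                Files.delete(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs {@code action} on the working tree, then removes it even if the action fails. */
    static <T> T useThenDelete(Path gitDirOrWorkTree, Function<Path, T> action) {
        Path workTree = workTreeOf(gitDirOrWorkTree);
        try {
            return action.apply(workTree);
        } finally {
            deleteRecursively(workTree);
        }
    }
}
```

Because the delete sits in `finally`, the checkout is removed whether the action returns normally or throws. A process that is killed midway would still leave files behind, which is what the periodic age-based cleanup is for.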

@kinow
Member Author

kinow commented Aug 16, 2022

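Tested the Scheduler code, which is executed periodically in the server (akin to cron). In application.properties I changed the schedule from running every midnight (cron.clearTmpDir = 0 0 0 * * ?) to a much shorter interval, intended as every five minutes (cron.clearTmpDir = 5 * * * * ?). Note that in Spring's six-field cron syntax (seconds first) this expression fires at second 5 of every minute; 0 */5 * * * ? would fire every five minutes.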

Then started the docker cluster and imported compile.cwl and make-to-cwl.cwl. Both succeeded and created some images and two bundles.

kinow@ranma:/tmp/cwlviewer/graphviz$ ls -lah
total 96K
drwxrwxr-x  2 kinow kinow 4.0K Aug 16 18:34 .
drwxrwxr-x 10 kinow kinow 4.0K Aug 16 18:34 ..
-rw-rw-r--  1 kinow kinow  15K Aug 15 18:35 03e3eb2c-fe6b-4cab-ab8d-db6747bd03bf.png
-rw-rw-r--  1 kinow kinow 7.0K Aug 15 18:35 03e3eb2c-fe6b-4cab-ab8d-db6747bd03bf.svg
-rw-rw-r--  1 kinow kinow  12K Aug 16 18:34 2b3c799c-a9a4-4e15-9cb3-475930324952.png
-rw-rw-r--  1 kinow kinow  22K Aug 16 18:34 8053cbf6-df50-412e-85dc-7465a2303764.png
-rw-rw-r--  1 kinow kinow 7.2K Aug 16 18:34 8053cbf6-df50-412e-85dc-7465a2303764.svg
-rw-rw-r--  1 kinow kinow  19K Aug 16 18:34 da75c0ec-9565-4f2e-9fcf-ef2a453cfda9.png

The bundles are never touched as they are in another directory.

Then I changed the access and modification times of two files.

kinow@ranma:/tmp/cwlviewer/graphviz$ date
Tue 16 Aug 2022 18:35:47 NZST
kinow@ranma:/tmp/cwlviewer/graphviz$ touch -a -m -t 202208151835.00 03e3eb2c-fe6b-4cab-ab8d-db6747bd03bf.png
kinow@ranma:/tmp/cwlviewer/graphviz$ touch -a -m -t 202208151835.00 03e3eb2c-fe6b-4cab-ab8d-db6747bd03bf.svg

After a few minutes the scheduler runs and deletes the old files.

kinow@ranma:/tmp/cwlviewer/graphviz$ ls -lah
total 72K
drwxrwxr-x  2 kinow kinow 4.0K Aug 16 18:38 .
drwxrwxr-x 10 kinow kinow 4.0K Aug 16 18:34 ..
-rw-rw-r--  1 kinow kinow  12K Aug 16 18:34 2b3c799c-a9a4-4e15-9cb3-475930324952.png
-rw-rw-r--  1 kinow kinow  22K Aug 16 18:34 8053cbf6-df50-412e-85dc-7465a2303764.png
-rw-rw-r--  1 kinow kinow 7.2K Aug 16 18:34 8053cbf6-df50-412e-85dc-7465a2303764.svg
-rw-rw-r--  1 kinow kinow  19K Aug 16 18:34 da75c0ec-9565-4f2e-9fcf-ef2a453cfda9.png

Accessing both workflows in CWL Viewer re-created the images almost instantaneously.
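The `touch -a -m -t` step used in this manual check has a direct equivalent in Java, which might make the age-based deletion easier to cover in a unit test. The sketch below is an assumption about how such a fixture could look; `FileAgeFixture`, `backdate`, and `isOlderThan` are hypothetical names and not part of the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical test fixture that mimics `touch -a -m -t` on a file. */
final class FileAgeFixture {

    private FileAgeFixture() {}

    /** Sets both the modification and access times of {@code file} to {@code now - age}. */
    static void backdate(Path file, Duration age, Instant now) throws IOException {
        FileTime stamp = FileTime.from(now.minus(age));
        BasicFileAttributeView view =
                Files.getFileAttributeView(file, BasicFileAttributeView.class);
        // Argument order: lastModifiedTime, lastAccessTime, createTime (null = unchanged).
        view.setTimes(stamp, stamp, null);
    }

    /** Returns true when the file was last modified more than {@code maxAge} before {@code now}. */
    static boolean isOlderThan(Path file, Duration maxAge, Instant now) throws IOException {
        return Files.getLastModifiedTime(file).toInstant().isBefore(now.minus(maxAge));
    }
}
```

A test could create a file in a temporary directory, backdate it by a day, run the cleanup, and assert that only the backdated file is gone, with no need to wait for the scheduler.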

Bundles still OK:

kinow@ranma:/tmp/cwlviewer/graphviz$ ls -lah /tmp/cwlviewer/bundles/
total 60K
drwxrwxr-x  2 kinow kinow 4.0K Aug 16 18:34 .
drwxrwxr-x 10 kinow kinow 4.0K Aug 16 18:34 ..
-rw-rw-r--  1 kinow kinow  28K Aug 16 18:34 bundle-47e8c47c-91af-44de-a02a-3cb6f20a89a9.zip
-rw-rw-r--  1 kinow kinow  21K Aug 16 18:34 bundle-ac6285ef-e6d5-49df-b988-6105412798e9.zip

Screenshot from 2022-08-16 18-38-31
Screenshot from 2022-08-16 18-38-08

@kinow kinow marked this pull request as ready for review August 16, 2022 06:48
@kinow
Member Author

kinow commented Aug 16, 2022

@mr-c , @tetron ready for review. See previous comments for notes on how to test/how it was tested. One thing I just realized is that the Scheduler.java I included here assumes that, in production, we have distinct folders for bundles, graphs, and git repositories.

The default settings are pointing to the same temporary directory. I do not have access to production, so someone would have to confirm. If we have a single directory, then we can either change it, or I can remove Scheduler.java from this PR and move it to another draft PR for later.
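One way to guard against the single-directory case would be to check the configured locations before any scheduled cleanup runs. The sketch below is only an idea and not part of this PR; `StorageLayoutCheck` and `areDisjoint` are hypothetical names, and how the paths are read from the configuration is left out.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical guard that could run before any scheduled cleanup. */
final class StorageLayoutCheck {

    private StorageLayoutCheck() {}

    /**
     * Returns true when no directory in {@code dirs} is equal to, or nested
     * inside, another one. Paths are compared after normalization.
     */
    static boolean areDisjoint(List<Path> dirs) {
        List<Path> normalized = dirs.stream()
                .map(p -> p.toAbsolutePath().normalize())
                .collect(Collectors.toList());
        for (int i = 0; i < normalized.size(); i++) {
            for (int j = 0; j < normalized.size(); j++) {
                // Path.startsWith compares whole name elements, not string prefixes.
                if (i != j && normalized.get(i).startsWith(normalized.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

If the check fails, the cleanup could log a warning and skip the run, so that bundles stored in a shared directory are not removed by accident.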

Thanks! 🖖
Bruno

Member

@mr-c mr-c left a comment

I'm going to merge this, make a pre-release, and ask UniTo to test it

@mr-c mr-c merged commit 213b5dd into common-workflow-language:main Dec 30, 2022
@kinow kinow deleted the heal-cwlviewer branch December 30, 2022 10:19
@kinow
Member Author

kinow commented Dec 30, 2022

> I'm going to merge this, make a pre-release, and ask UniTo to test it

Great! I had forgotten about this PR. Hopefully it will help us with disk space issues 🤞 Thanks @mr-c !
