[Project] TrainCheck #1

Merged 1 commit on Jun 11, 2025
Binary file added assets/img/project/traincheck_logo.png
Binary file modified assets/img/team/yuxuan.jpg
22 changes: 22 additions & 0 deletions index.html
@@ -85,6 +85,28 @@ <h2 class="mb-3">Recent Projects</h2>
<!-- single course -->
<div class="col-lg-12">
<div class="owl-theme owl-carousel active_course">
<div class="single_recent_project">
<div class="recent_project_head">
<img class="img-fluid" src="/assets/img/project/traincheck_logo.png" alt="TrainCheck" />
</div>
<div class="recent_project_content">
<h4 class="mb-3">
<a href="#">Catching Silent Errors in Deep Learning Training</a>
</h4>
<p>
Silent errors in deep learning training can waste thousands of
GPU hours and produce low-quality models. We introduce
TrainCheck, a proactive checking framework that learns semantic
invariants from correct training runs and enforces them at
runtime to catch failures early, before they accumulate cost
and damage model reliability.
</p>
<div class="recent_project_meta d-flex justify-content-lg-between align-items-lg-center flex-lg-row flex-column mt-4">
<a class="button button-light" href="paper/traincheck-osdi25-preprint.pdf" target="_blank">Read More</a>
</div>
</div>
</div>

<div class="single_recent_project">
<div class="recent_project_head">
<img class="img-fluid" src="/assets/img/project/watchdog.jpg" alt="" />
8 changes: 8 additions & 0 deletions news.html
@@ -5,6 +5,14 @@
<section class="section-margin">
<div class="container">
<ul class="newslist">
<li>
<span class="newsicon"><i class="flaticon-document"></i></span><span class="newsdate">Mar 2025</span>
<a href="https://github.com/OrderLab/TrainCheck">TrainCheck</a> has been accepted to appear at <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>
<details>
<summary>[...]</summary>
Training deep learning (DL) models is a complex task involving multiple steps and various libraries, making DL training pipelines prone to silent bugs that lead to suboptimal or incorrect models. These issues are challenging to detect and diagnose. TrainCheck is the first framework that takes a proactive checking approach to systematically address silent issues. TrainCheck automatically infers invariants tailored for DL training. It uses these invariants to enhance a training task and proactively detect silent issues while providing debugging help.
</details>
</li>
<li>
<span class="newsicon"><i class="flaticon-distance"></i></span><span class="newsdate">May 2024</span>
<span class="text-danger">Yigong will join Boston University as an Assistant Professor!</span>
Binary file added paper/traincheck-osdi25-preprint.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions paper/traincheck-osdi25.bib
@@ -0,0 +1,10 @@
@inproceedings{TrainCheckOSDI2025,
author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
series = {OSDI '25},
month = {July},
year = {2025},
address = {Boston, MA, USA},
publisher = {USENIX Association},
}
5 changes: 3 additions & 2 deletions pubs.html
@@ -7,9 +7,10 @@
<h2 id="publications">2025</h2>
<ul class="publications">
<li>
<a target="_blank" href="#">Training with Confidence: Catching Silent DL Training Bugs with Automated Proactive Checks</a><br>
<a target="_blank" href="paper/traincheck-osdi25-preprint.pdf">Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks</a><br>
<span class="authorlist"><i><a href="https://essoz.github.io" class="nodec">Yuxuan Jiang</a>, </i><i>Ziming Zhou, </i><i>Boyu Xu, </i><i>Beijie Liu, </i><i>Runhui Xu, </i><i><a href="https://web.eecs.umich.edu/~ryanph" class="nodec">Peng Huang</a><br></i></span>
<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="paper/traincheck-osdi25.bib">BibTeX</a>
&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
</li>
<li>
<a target="_blank" href="#">Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems</a><br>
8 changes: 8 additions & 0 deletions software.html
@@ -10,6 +10,14 @@ <h3 class="section-intro__title">Group GitHub Repository</h3>
</div>
</div>
</section>
<section class="section-padding bg-magnolia">
<div class="container">
<div class="section-intro pb-85px text-center">
<h3 class="section-intro__title">TrainCheck [<a href="/paper/traincheck-osdi25-preprint.pdf">OSDI '25</a>]</h3>
<p class="section-intro__subtitle">TrainCheck is a proactive checking framework for detecting silent errors in deep learning training. We are excited to open-source TrainCheck. Explore the project and get involved on <a href="https://github.com/OrderLab/TrainCheck">GitHub</a>!</p>
</div>
</div>
</section>
<section class="section-padding bg-magnolia">
<div class="container">
<div class="section-intro pb-85px text-center">
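
The project descriptions added in this PR say that TrainCheck infers invariants from correct training runs and then checks them at runtime. As a rough illustration of that idea only, here is a minimal Python sketch. It is not TrainCheck's API or algorithm: the predicate names (`params_changed`, `loss_finite`, `grad_clipped`), the trace format, and both helper functions are hypothetical and chosen for this example.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical trace format: one record per training step,
# mapping a predicate name to whether it held at that step.
StepRecord = Dict[str, bool]


def infer_invariants(correct_runs: Iterable[List[StepRecord]]) -> Set[str]:
    """Keep only the predicates that held at every step of every correct run."""
    candidates: Optional[Set[str]] = None
    for run in correct_runs:
        for step in run:
            held = {name for name, ok in step.items() if ok}
            candidates = held if candidates is None else candidates & held
    return candidates or set()


def check_step(invariants: Set[str], step: StepRecord) -> List[str]:
    """Return the inferred invariants that this step violates.

    A predicate missing from the record is treated as a violation.
    """
    return sorted(name for name in invariants if not step.get(name, False))


# Illustrative traces from two "correct" runs (made-up data).
correct_runs = [
    [
        {"params_changed": True, "loss_finite": True, "grad_clipped": False},
        {"params_changed": True, "loss_finite": True, "grad_clipped": True},
    ],
    [
        {"params_changed": True, "loss_finite": True, "grad_clipped": True},
    ],
]

invariants = infer_invariants(correct_runs)

# A new step in which the parameters silently stopped updating.
violations = check_step(invariants, {"params_changed": False, "loss_finite": True})
print(violations)
```

In this sketch `grad_clipped` is dropped as a candidate because it did not hold at every step of the correct runs, so only stable properties are enforced on the new run.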