finetune.cpp command-line arg #13873

graehl · 2025-05-28T20:26:00Z

add to ggml-opt learning rate (adamw alpha) cmdline arg, and an optimizer enum defaulting to adamw,
preparatory to work to support SGD

these are in common args a set of optimizer options active only for the new FINETUNE example (which includes all the previous finetune.cpp PERPLEXITY options as a precaution)

perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h

as proposed in
#13835

graehl · 2025-05-28T20:31:41Z

perhaps no need to review until i have an actual SGD impl in a follow-on, @JohannesGaessler - but a few general questions about contributing:

is it ok to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?
is what i've got hee the expected way to add example-specific command line arguments? for finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.
were the PERPLEXITY args which i blindly added to the new FINETUNE example actually doing anything interesting? perhaps some should be dropped from finetune.
could you direct me to a .clang-format style file that might save me from accidentally re-indenting? i know i can set up clang-format to operate only on regions i've already changed ...

WilliamTambellini

you should better keep that change as it time to get more feedbacks/approval.

JohannesGaessler · 2025-05-28T21:08:35Z

is it ok to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?

Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.

is what i've got hee the expected way to add example-specific command line arguments? for finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.

My opinion is that people serious about training should be writing a program rather than use a command line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.

were the PERPLEXITY args which i blindly added to the new FINETUNE example actually doing anything interesting? perhaps some should be dropped from finetune.

I don't remember whether those args were put in by me when I copypasted code or by Georgi when he later refactored it but I myself definitely did not make an intentional choice to use these exact arguments.

could you direct me to a .clang-format style file that might save me from accidentally re-indenting? i know i can set up clang-format to operate only on regions i've already changed ...

I don't know, sorry.

WilliamTambellini · 2025-05-28T21:12:28Z

@ggerganov

JohannesGaessler

None of the previous perplexity-specific arguments are needed.

common/arg.cpp

common/common.h

ggml/include/ggml-opt.h

JohannesGaessler · 2025-05-28T21:26:17Z

For adding an SDG optimizer, add a new ggml op like OPT_STEP_SDG. Add a CPU implementation as a fallback for any backend without an implementation. Add a CUDA implementation since that is (I assume) the backend which you intend to use in production. Add a test to tests/test_backend_ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SDG optimizer step, condition the allocation of the optimizer momenta on the optimizer type.

graehl · 2025-05-29T16:15:21Z

For adding an SDG optimizer, add a new ggml op like OPT_STEP_SDG. Add a CPU implementation as a fallback for any backend without an implementation. Add a CUDA implementation since that is (I assume) the backend which you intend to use in production. Add a test to tests/test_backend_ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SDG optimizer step, condition the allocation of the optimizer momenta on the optimizer type.

yes, will do. should the actual SGD impl be a subsequent pull req (or several, e.g. starting first w/ just CPU impl) or do you want it all in one pull req?

JohannesGaessler · 2025-05-29T16:34:10Z

Either way would be fine with me as long as there are at no point broken or unfinished features on master.

matiaslin

Looking forward to the next PR(s).

graehl · 2025-05-29T18:56:23Z

you should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (using git-clang-format). if there's something undesireable we could figure out what in the format config does it

JohannesGaessler · 2025-05-29T19:20:55Z

Don't autoformat code en masse unless it's done in a dedicated PR, it makes it unnecessarily difficult to track what was actually changed in a PR.

JohannesGaessler · 2025-05-29T19:25:03Z

Sorry, I didn't read the

only on lines changed in the PR

part.

graehl · 2025-05-30T16:59:33Z

Hi @WilliamTambellini @JohannesGaessler I think this is usable now, inviting code nitpicks etc :)
pretty new to the github interface honestly so let me know if this needs to be two separate PRs one for each commit or if it's reasonable to just review both commits here (obv. better to merge separately, first doesn't break any behavior, second impacts the finetune cmdline default learning rate but that should hurt no one)

graehl · 2025-05-30T17:01:47Z

Second (actual usable SGD) commit is 48a16bf (also shows above here)

WilliamTambellini

Mix up different projects: change of CLI/renaming and SGD. Need to split in 2 PRs.
@slaren ?

common/arg.cpp

examples/training/finetune.cpp

ggml/include/ggml-opt.h

ggml/src/ggml-opt.cpp

JohannesGaessler · 2025-05-30T18:17:21Z

ggml/src/ggml-opt.cpp

@@ -770,7 +814,7 @@ void ggml_opt_eval(ggml_opt_context_t opt_ctx, ggml_opt_result_t result) {
        // beta1, beta2 after applying warmup
        const float beta1h = 1.0f/(1.0f - powf(opt_pars.adamw.beta1, opt_ctx->iter));
        const float beta2h = 1.0f/(1.0f - powf(opt_pars.adamw.beta2, opt_ctx->iter));
-
+        const float keep           = 1.0f - opt_pars.adamw.alpha * opt_pars.adamw.wd;


Optimizer steps are going to be I/O bound and optimizing compute is not going to make a meaningful difference for the runtime of the steps, for the runtime of the total probram it's completely negligible. So please revert this change, I think the other variant is easier to understand.

I agree that it's not likely to matter, but it's 1. per parameter per epoch (ok, does seem unimportant now that I think further) and 2. i'm not confident cuda CC optimizes this and was hoping to learn more - would seem possible that w/o this we're loading repeatedly two floats instead of one - and mostly 3. this is exactly following precedent established for beta1h and beta2h, which are stored in the tensor just as i stored this quantity.

Anyway, totally willing, just curious what you think about the existing practice of saving beta1h and beta2h in light of this opinion that we're not compute bound.

i checked it out - doesn't seem to change runtime noticeably as you predicted

My biggest concern with the code is the amount of effort needed to maintain it, particularly when it comes to debugging and asserting that the code on master works correctly. It is quite likely that I will at some point be in a situation where a user reports bad training results and I will not know whether that is the due to a bug in ggml or due to bad hyperparamters or something similar. So it is very important to me that the data layout is consistent across multiple levels.

The correct way to implement the micro-optimization of pre-computing a parameter derived from the human-interpretable parameters is as follows:

Pass the human-interpretable parameters to ggml_opt_step_adamw / ggml_opt_step_sdg.

In the CUDA host code, pre-compute some derived parameters from the human-interpretable parameters.

Change the CUDA device code to accept the derived parameters instead.

The way CUDA works is that the CPU schedules the GPU kernels in a CUDA stream and then waits for said stream to finish all kernels. Scheduling the kernels is of course much faster and it doesn't matter how fast you are as long as you are fast enough to keep the GPU busy. So adding a bit of overhead to the scheduling has essentially no impact on the runtime of a CUDA program even if you do it once per CUDA kernel launch instead of once per epoch.

Thanks for explaining all that, the bottom line for me is that you were right and the micro-optimization has no visible benefit in this case.

tests/test-backend-ops.cpp

JohannesGaessler · 2025-06-10T21:58:12Z

Want opt_period removed?

I want it removed as a CLI argument, the parameterization should be done via --batch and --ubatch as is done on master.

graehl · 2025-06-11T04:05:18Z

Want opt_period removed?

I want it removed as a CLI argument, the parameterization should be done via --batch and --ubatch as is done on master.

ok, no problem. keep in mind that the parameterization w/ those args has no effect because of code that chops off batch to match ubatch. do you want me to open it up? it seems erroneous to limit it thus and opt_period can be computed (assuming physical batch divides logical evenly). otherwise i just kill the option i was using to get opt_period and save it for another day

void llama_context::opt_epoch_iter( ...) { const uint32_t n_ctx = llama_model_n_ctx_train(&model); const uint32_t n_batch = std::min(this->n_batch(), n_ctx); // logical batch limited to no more than n_ctx const uint32_t n_ubatch = std::min(this->n_ubatch(), n_batch); // physical batch limited to no more than (already limited) logical batch
i suppose in my testing before i didn't realize n_ctx was what was stopping me from having a large batch size. anyway, removing the opt_period arg. reading error on my part.

confirmed: llama_context: n_ctx = 2688
llama_context: n_ctx_per_seq = 2688
llama_context: n_batch = 2688
llama_context: n_ubatch = 1344

fits in memory exceeding the larget possible w/ opt_period 1:
llama_context: n_ctx = 1728
llama_context: n_ctx_per_seq = 1728
llama_context: n_batch = 1728
llama_context: n_ubatch = 1728

(couldn't do double, because of the extra accumulation memory needed w/ period>1)

graehl · 2025-06-11T05:01:11Z

i think we're caught up - double checked test-opt to make sure it still passes, do you want it re-disabled?

JohannesGaessler

Prior to my review there were already unresolved requests for changes that will need to be addressed prior to merging.

common/arg.cpp

JohannesGaessler · 2025-06-11T09:19:01Z

common/arg.cpp

+                { "-lr-half", "--learning-rate-halflife-epochs" }, "N",
+                string_format("reduce lr in half every N epochs (default: %.3g)", (double) params.lr.halflife_epochs),
+                [](common_params & params, const std::string & value) { params.lr.halflife_epochs = std::stof(value); })
+                .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
+    add_opt(common_arg({ "-lr-halvings", "--learning-rate-halvings" }, "N",
+                       string_format("max N lr halvings (default: %.3g)", (double) params.lr.halvings),
+                       [](common_params & params, const std::string & value) { params.lr.halvings = std::stof(value); })
+                .set_examples({ LLAMA_EXAMPLE_FINETUNE }));


With the current code the final learning rate you get is relative to the initial learning rate. That is what I am taking issue with because to me it seems like it would make more sense to adjust the two values independently from one another. Please change the argument to something like -lr-min.

common/arg.cpp

common/common.h

examples/training/finetune.cpp

JohannesGaessler · 2025-06-11T09:59:01Z

tests/test-opt.cpp

    return result;
 }

+static enum ggml_opt_optimizer_type g_optimizer_type = GGML_OPT_OPTIMIZER_TYPE_ADAMW;


Don't introduce global state to the tests.

you're asking me to pass this around as a parameter everywhere instead, correct?

common/common.cpp

JohannesGaessler · 2025-06-11T11:01:03Z

I looked at the git history. When I overhauled the ggml training code I re-used a pre-existing path tests/test-opt.cpp for the unit tests. In the ggml repository, where I did the implementation, these tests are enabled. But when the tests were synced with the llama.cpp repository the previously disabled compilation of tests/test-opt.cpp was not enabled. So the tests should be enabled and the stale comment about them being slow should be removed.

graehl · 2025-06-11T18:37:58Z

if you prefer lr-min enough to do it yourself, great. otherwise i'll decline

graehl · 2025-06-11T18:40:14Z

other currently outstanding questions are (let me know if i missed any): no globals allowed in unit test (!), and 'set these values' (lr_opt)

JohannesGaessler · 2025-06-11T19:03:56Z

If we can't reach an agreement regarding the exact implementation regarding a decaying learning rate I would say to just strip the feature from this PR.

The other pre-existing, unresolved issues I mentioned were the ones in ggml-opt.cpp. If you go through the file on Github, you should still be seeing the comments. I will add additional comments if there is anything.

Sorry for being difficult, social skills are not my strong point.

graehl · 2025-06-13T17:27:10Z

If we can't reach an agreement regarding the exact implementation regarding a decaying learning rate I would say to just strip the feature from this PR.

The other pre-existing, unresolved issues I mentioned were the ones in ggml-opt.cpp. If you go through the file on Github, you should still be seeing the comments. I will add additional comments if there is anything.
^^^
I'll try to figure out how to do this.

Sorry for being difficult, social skills are not my strong point.

same :)

graehl · 2025-06-16T23:25:38Z

any update? all handled on my end afaict

JohannesGaessler · 2025-06-17T06:34:42Z

As I already said, please look at my comments in ggml-opt.cpp.

I am not willing to maintain an interface that has two parameterizations for the same thing. I am willing to maintain the -lr-min interface, not the -lr-halvings interface. Either remove -lr-halvings or a decaying learning rate altogether.

I started the CI. Once the above two issues are addressed and the CI works I will push some purely cosmetic changes and then I would be willing to merge this PR.

JohannesGaessler · 2025-06-17T06:43:13Z

Also there is still global state in the unit tests.

graehl · 2025-06-17T18:20:43Z

Also there is still global state in the unit tests.

i'll fix, thanks for the reminder

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the inital -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wd*alpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs)

graehl · 2025-06-17T19:24:08Z

pushed: removed 'half' related args, verify lr0/lr-min behavior, removed global var from test-opt (verified still passes)

please let me know if you find any other change you want as i'm keen to wrap this up

graehl · 2025-06-17T19:24:41Z

and no objection to any cosmetics you want to slap on yourself, of course!

graehl requested a review from JohannesGaessler as a code owner May 28, 2025 20:26

github-actions bot added examples ggml changes relating to the ggml tensor library for machine learning labels May 28, 2025

WilliamTambellini approved these changes May 28, 2025

View reviewed changes

JohannesGaessler reviewed May 28, 2025

View reviewed changes

common/arg.cpp Outdated Show resolved Hide resolved

common/common.h Outdated Show resolved Hide resolved

ggml/include/ggml-opt.h Outdated Show resolved Hide resolved

graehl force-pushed the finelayer branch 2 times, most recently from e752031 to e689af8 Compare May 29, 2025 17:07

matiaslin reviewed May 29, 2025

View reviewed changes

graehl force-pushed the finelayer branch from e689af8 to aa59aa3 Compare May 29, 2025 18:42

graehl force-pushed the finelayer branch from 3f6b262 to b3be58d Compare May 30, 2025 08:04

github-actions bot added build Compilation issues testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs labels May 30, 2025

graehl force-pushed the finelayer branch 3 times, most recently from 7534bbf to 48a16bf Compare May 30, 2025 16:57

WilliamTambellini suggested changes May 30, 2025

View reviewed changes

JohannesGaessler reviewed May 30, 2025

View reviewed changes

graehl force-pushed the finelayer branch from 48a16bf to 96c3988 Compare May 30, 2025 18:59

graehl force-pushed the finelayer branch 3 times, most recently from 9fbe596 to 7b9a0f2 Compare June 10, 2025 21:39

This comment was marked as resolved.

Sign in to view

graehl force-pushed the finelayer branch 6 times, most recently from 3a41cf4 to e30abe6 Compare June 11, 2025 04:56

JohannesGaessler reviewed Jun 11, 2025

View reviewed changes

graehl force-pushed the finelayer branch 3 times, most recently from fcc241e to 856d37c Compare June 13, 2025 23:44

graehl force-pushed the finelayer branch from 856d37c to 7c03387 Compare June 17, 2025 19:23

finetune.cpp command-line arg #13873

Are you sure you want to change the base?

finetune.cpp command-line arg #13873

Conversation

graehl commented May 28, 2025

Uh oh!

graehl commented May 28, 2025

Uh oh!

WilliamTambellini left a comment

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler commented May 28, 2025

Uh oh!

WilliamTambellini commented May 28, 2025

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JohannesGaessler commented May 28, 2025

Uh oh!

graehl commented May 29, 2025

Uh oh!

JohannesGaessler commented May 29, 2025

Uh oh!

matiaslin left a comment

Choose a reason for hiding this comment

Uh oh!

graehl commented May 29, 2025

Uh oh!

JohannesGaessler commented May 29, 2025

Uh oh!

JohannesGaessler commented May 29, 2025

Uh oh!

graehl commented May 30, 2025

Uh oh!

graehl commented May 30, 2025

Uh oh!

WilliamTambellini left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JohannesGaessler commented Jun 10, 2025

Uh oh!

This comment was marked as resolved.

graehl commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

graehl commented Jun 11, 2025

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

WilliamTambellini left a comment •

edited

Loading

graehl commented Jun 11, 2025 •

edited

Loading