// how it works
Committed is a complete pipeline, not just a model. I started from CommitChronicle (roughly 10.7M real GitHub commits) and wrote a filter to extract clean, single-file diffs paired with well-formed Conventional Commit subjects, normalizing them into a consistent training target. I fine-tuned Qwen3-1.7B with QLoRA on the result, evaluated it against the un-tuned base model on a multi-metric harness with an LLM judge I validated against my own hand-ratings, then served it locally through llama.cpp with grammar-constrained decoding that guarantees every output is syntactically valid. Most of the work was the data, not the model, and all four stages (data, training, evaluation, serving) are here.
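To make the filtering step concrete, here is a minimal sketch of what a check for "clean, single-file diff plus well-formed Conventional Commit subject" could look like. It is an illustration, not the project's actual filter: the type list, the 72-character limit, and the helper names `parse_subject` and `touches_single_file` are all assumptions.

```python
import re

# Assumed type list: the commonly used Conventional Commits set.
# The project's real filter may accept a different set.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

# type, optional (scope), optional "!", then ": " and a non-empty description.
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_subject(subject, max_len=72):
    """Return (type, scope, description), or None if the subject is not well formed.

    max_len is a hypothetical cutoff, not a value taken from the project.
    """
    line = subject.strip()
    if "\n" in line or len(line) > max_len:
        return None
    match = SUBJECT_RE.match(line)
    if match is None:
        return None
    return match.group("type"), match.group("scope"), match.group("description")


def touches_single_file(diff_text):
    """True if a git-style unified diff contains exactly one 'diff --git' header."""
    headers = [ln for ln in diff_text.splitlines() if ln.startswith("diff --git ")]
    return len(headers) == 1
```

A training pair would be kept only when both checks pass, for example `parse_subject("fix(parser): handle empty input")` returns a tuple while `parse_subject("Update README")` returns `None`.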
// results
I evaluated the fine-tune against the un-tuned Qwen3-1.7B base on a 442-example test sample, scored by an LLM judge on four orthogonal axes. The headline numbers are reweighted to the true commit-type distribution of the test split, so they reflect realistic deployment behavior.
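The reweighting can be sketched as a post-stratified average: compute accuracy per commit type on the sample, then weight each type by its share of the full test split. This is a sketch under assumptions, not the evaluation harness itself; the function name `reweighted_accuracy`, the record layout, and the toy numbers in the usage note are hypothetical.

```python
from collections import defaultdict


def reweighted_accuracy(records, true_dist):
    """Post-stratified accuracy.

    records:   iterable of (true_type, is_correct) pairs from the scored sample.
    true_dist: mapping of commit type -> share of that type in the full split.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_type, is_correct in records:
        totals[true_type] += 1
        hits[true_type] += int(bool(is_correct))

    # Every type that carries weight must appear in the sample,
    # otherwise its per-type accuracy is undefined.
    missing = [t for t, w in true_dist.items() if w > 0 and t not in totals]
    if missing:
        raise ValueError(f"no sampled examples for types: {missing}")

    weight_sum = sum(true_dist.values())
    weighted = sum(w * hits[t] / totals[t] for t, w in true_dist.items() if w > 0)
    return weighted / weight_sum
```

With a toy sample that is half fix (2 of 4 correct) and half feat (4 of 4 correct), the unweighted accuracy is 0.75, but if fix is actually 75% of the split the reweighted figure is 0.75 × 0.5 + 0.25 × 1.0 = 0.625. The gap is the reason for reporting the reweighted number.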
The base model's dominant failure mode was “feat-collapse”: it labeled roughly 95% of all diffs as feat, regardless of what the change actually did. Fix commits alone make up about 49% of real-world commits, so a trivial always-guess-fix baseline scores 0.489 on type accuracy. A model that almost never predicts fix falls below that floor, and the un-tuned base did exactly that, at 0.131. Fine-tuning broke the collapse and lifted type accuracy well above the trivial floor.
One axis regressed, which I didn't expect: specificity dropped from 0.81 to 0.71. The fine-tune learned the terse, normalized subject style of the training targets so well that it sometimes produces messages slightly more generic than the base model's wordier output. It's a real trade-off, traceable to a normalization choice in the training data, and the next training iteration targets it directly. I include it because a complete evaluation covers the trade-offs, not just the wins.
An LLM judge is only trustworthy if it agrees with a human. I hand-rated 50 examples blind and measured the judge against them: raw agreement ran 0.68–0.84 across the four axes (Cohen's κ 0.25–0.54), strongest on completeness. That's a fair-to-moderate proxy, good enough to trust for relative comparisons, with the honest caveat that n=50 gives wide confidence intervals.
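For reference, raw agreement and Cohen's κ can be computed from two parallel lists of ratings as in the sketch below. It assumes each axis is scored with discrete labels; the function names and the toy ratings in the usage note are hypothetical and are not the 50 hand-rated examples.

```python
from collections import Counter


def raw_agreement(human, judge):
    """Fraction of items on which the two raters give the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("ratings must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: agreement corrected for what chance alone would produce."""
    n = len(human)
    p_observed = raw_agreement(human, judge)

    # Chance agreement: probability both raters pick the same label
    # if each drew independently from their own label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    p_expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

On the toy ratings `human = [1, 1, 0, 0]` and `judge = [1, 0, 0, 0]`, raw agreement is 0.75 while κ is 0.5, which shows why κ runs lower than raw agreement when one label dominates.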
// sample outputs
// run it locally
Committed runs entirely on your machine. There is no API call, and no diff ever leaves your laptop. Install it from the repo, pipe a diff in, and get a commit message back:
The model is a quantized GGUF served through llama.cpp on CPU; the first run downloads it once (~1 GB), then it's fully offline. The hosted demo above runs the identical model.