~/projects/committed $ generate

committed.

Conventional Commit messages from your code diffs.

0.13 → 0.64 commit-type accuracy (vs. base, reweighted)

0.43 → 0.86 faithfulness (vs. base model)

~1 GB quantized GGUF, runs locally on CPU

Fine-tuning a 1.7B model (Qwen3, QLoRA) on ~58k real commits taught it to turn a code diff into a clean Conventional Commit subject line. On a 442-diff held-out eval, that lifted commit-type accuracy from 0.13 to 0.64 and faithfulness from 0.43 to 0.86 over the base model. It is served as a ~1 GB quantized GGUF through llama.cpp, CPU-only, so when you run it locally your diffs never leave your machine. Most tools like this ship your code off to an API, and the point was to build one small enough that nothing has to. A GBNF grammar constrains decoding, so every output is a syntactically valid commit subject by construction. Paste a diff below, or try an example.

This hosted demo sends your diff over the network to the model Space to generate a message. The model itself is small and runs fully offline. If you would rather not send code anywhere, run it locally.

// how it works

Committed is a complete pipeline, not just a model. I started from CommitChronicle (roughly 10.7M real GitHub commits) and wrote a filter to extract clean, single-file diffs paired with well-formed Conventional Commit subjects, normalizing them into a consistent training target. I fine-tuned Qwen3-1.7B with QLoRA on the result, evaluated it against the un-tuned base model on a multi-metric harness with an LLM judge I validated against my own hand-ratings, then served it locally through llama.cpp with grammar-constrained decoding that guarantees every output is syntactically valid. Most of the work was the data, not the model, and all four stages (data, training, evaluation, serving) are here.
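The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual filter: the list of commit types, the regular expression, the 72-character limit, and the function names are all assumptions made for the example.

```python
import re

# Assumed set of Conventional Commit types; the project's real list may differ.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

# type, optional (scope), optional "!", then ": " and a non-empty subject.
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?!?: (?P<subject>\S.*)$"
)


def count_files(diff: str) -> int:
    """Count files touched by a git diff via its 'diff --git' headers."""
    return sum(1 for line in diff.splitlines() if line.startswith("diff --git "))


def keep_example(diff: str, message: str, max_len: int = 72) -> bool:
    """Keep single-file diffs whose first message line is a well-formed subject."""
    subject = message.splitlines()[0].strip() if message.strip() else ""
    return (
        count_files(diff) == 1
        and len(subject) <= max_len  # hypothetical length cap
        and SUBJECT_RE.match(subject) is not None
    )
```

A pair such as a one-file diff with the message `fix(parser): handle empty input` would be kept, while a free-form message like `Update stuff` or a diff spanning two files would be dropped.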

git diff → committed → feat(scope): subject

// results

I evaluated the fine-tune against the un-tuned Qwen3-1.7B base on a 442-example test sample, scored by an LLM judge on four orthogonal axes. The headline numbers are reweighted to the true commit-type distribution of the test split, so they reflect realistic deployment behavior.
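One plausible way to compute a reweighted accuracy like this is to score each commit type separately and then average the per-type accuracies using the target distribution as weights. The sketch below assumes that approach; the function name, the toy labels, and the toy shares are invented for illustration and are not the project's data.

```python
def reweighted_accuracy(labels, predictions, target_share):
    """Type accuracy reweighted to a target label distribution.

    labels / predictions: parallel lists of commit types for the eval sample.
    target_share: dict mapping each type to its share in the target distribution.
    """
    totals, correct = {}, {}
    for gold, pred in zip(labels, predictions):
        totals[gold] = totals.get(gold, 0) + 1
        correct[gold] = correct.get(gold, 0) + (gold == pred)
    # Only types present in the sample can be scored; renormalize their shares.
    seen = {t: s for t, s in target_share.items() if t in totals}
    norm = sum(seen.values())
    return sum(s / norm * correct[t] / totals[t] for t, s in seen.items())


# Toy data, invented for illustration only.
labels = ["fix", "fix", "feat", "feat", "docs", "docs"]
predictions = ["fix", "feat", "feat", "feat", "docs", "fix"]
share = {"fix": 0.5, "feat": 0.3, "docs": 0.2}
print(round(reweighted_accuracy(labels, predictions, share), 3))
```

The reweighted score differs from plain accuracy whenever the eval sample's type mix differs from the target's, which is why a sample that over- or under-represents a type can otherwise give a misleading headline number.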

0.637 type accuracy (deployment-reweighted)

0.471 conjunctive pass-rate (all four axes pass)

2.188 graded mean (0–3), LLM-judge score

Qwen3-1.7B base model, fine-tuned with QLoRA

metric                                   base     fine-tuned
Type correctness                         0.33     0.81
Faithfulness                             0.43     0.86
Completeness                             0.52     0.73
Specificity                              0.81     0.71
Type accuracy (deployment-reweighted)    0.131    0.637
Conjunctive pass-rate                    0.181    0.471
Graded mean (0–3)                        1.207    2.188
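The conjunctive pass-rate sits below every per-axis score because an example only counts if it passes all four axes at once. A minimal sketch of that calculation, assuming each axis yields a pass/fail verdict per example (the dictionary keys and the toy verdicts are hypothetical):

```python
AXES = ("type_correctness", "faithfulness", "completeness", "specificity")


def axis_pass_rates(verdicts):
    """Per-axis share of examples that pass."""
    n = len(verdicts)
    return {axis: sum(v[axis] for v in verdicts) / n for axis in AXES}


def conjunctive_pass_rate(verdicts):
    """Share of examples that pass every axis at once."""
    return sum(all(v[axis] for axis in AXES) for v in verdicts) / len(verdicts)


# Toy verdicts, invented for illustration only.
verdicts = [
    dict(type_correctness=True, faithfulness=True, completeness=True, specificity=True),
    dict(type_correctness=True, faithfulness=True, completeness=True, specificity=False),
    dict(type_correctness=False, faithfulness=True, completeness=True, specificity=True),
    dict(type_correctness=True, faithfulness=True, completeness=True, specificity=True),
]
print(axis_pass_rates(verdicts))
print(conjunctive_pass_rate(verdicts))
```

In the toy data no single axis fails more than one example in four, yet only half the examples pass all four, because the failures fall on different examples.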

The base model's dominant failure mode was “feat-collapse”: it labeled roughly 95% of all diffs as feat, regardless of what the change actually did. Because fix commits alone make up about 49% of real-world commits, a model that almost never predicts fix scores worse on type than a trivial always-guess-fix baseline (0.489), and the un-tuned base, at 0.131, did exactly that. Fine-tuning broke the collapse and lifted type accuracy well above the trivial floor.
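The trivial floor works because a classifier that always guesses one type is right exactly as often as that type occurs. A small sketch of the arithmetic, where only the roughly 49% fix share comes from the text above and the rest of the toy distribution is invented for illustration:

```python
from collections import Counter


def constant_baseline_accuracy(labels, guess):
    """Accuracy of a classifier that predicts `guess` for every example."""
    return Counter(labels)[guess] / len(labels)


# Toy label mix: 49% fix as in the text; the feat/docs split is hypothetical.
labels = ["fix"] * 49 + ["feat"] * 30 + ["docs"] * 21
print(constant_baseline_accuracy(labels, "fix"))
print(constant_baseline_accuracy(labels, "feat"))
```

Under this toy mix, always guessing feat can never beat the share of feat commits, which is why a model that collapses onto feat lands below the always-guess-fix baseline.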

One axis regressed, which I didn't expect: specificity dropped from 0.81 to 0.71. The fine-tune learned the terse, normalized subject style of the training targets so well that it sometimes produces messages slightly more generic than the base model's wordier output. It's a real trade-off, traceable to a normalization choice in the training data, and the next training iteration targets it directly. I include it because a complete evaluation covers the trade-offs, not just the wins.

An LLM judge is only trustworthy if it agrees with a human. I hand-rated 50 examples blind and measured the judge against them: raw agreement ran 0.68–0.84 across the four axes (Cohen's κ 0.25–0.54), strongest on completeness. That's a fair-to-moderate proxy, good enough to trust for relative comparisons, with the honest caveat that n=50 gives wide confidence intervals.
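Raw agreement and Cohen's κ can diverge sharply when most ratings fall in one class, which is why both are worth reporting. The sketch below computes both for two raters, assuming binary pass/fail ratings on a single axis and unweighted κ; the source does not say which κ variant was used, and the ratings here are invented for illustration.

```python
def raw_agreement(a, b):
    """Share of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(a)
    p_o = raw_agreement(a, b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(label) / n) * (b.count(label) / n)
              for label in set(a) | set(b))
    if p_e == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy ratings (1 = pass, 0 = fail), invented for illustration only.
human = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
judge = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
print(raw_agreement(human, judge))
print(round(cohens_kappa(human, judge), 3))
```

In the toy data the raters agree on 7 of 10 items, but because both mostly say "pass", much of that agreement is expected by chance and κ comes out far lower than the raw figure.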

// sample outputs

input diff (Python)

@@ -1639,7 +1639,7 @@ def moveaxis(a, source, destination):
 >>> np.transpose(x).shape
 (5, 4, 3)
->>> np.swapaxis(x, 0, -1).shape
+>>> np.swapaxes(x, 0, -1).shape
 (5, 4, 3)
 >>> np.moveaxis(x, [0, 1], [-1, -2]).shape
 (5, 4, 3)

base Qwen3-1.7B → feat(additional-function): Add `swapaxes` function with same behavior as `swapaxis` but using `swapaxes` notation. 📦

committed → docs: Fix typo in np.swapaxis docstring

input diff (C#)

@@ -20,13 +20,17 @@ public partial class RestClient {
 /// <param name="request">Request to be executed</param>
-public RestResponse Execute(RestRequest request) => AsyncHelpers.RunSync(() => ExecuteAsync(request));
+/// <param name="cancellationToken">The cancellation token</param>
+public RestResponse Execute(RestRequest request, CancellationToken cancellationToken = default)
+    => AsyncHelpers.RunSync(() => ExecuteAsync(request, cancellationToken));

base Qwen3-1.7B → feat(adds-parameter): Adds a `CancellationToken` parameter to `Execute` and `DownloadStream` methods, allowing for cancellation support. 📦

committed → feat(RestClient): add support for cancellation tokens

// run it locally

Committed runs entirely on your machine. No API, no diff ever leaving your laptop. Install it from the repo, pipe a diff in, and get a commit message back:

pip install git+https://github.com/marzoukbaig14/Committed.git
git diff | committed

The model is a quantized GGUF served through llama.cpp on CPU; the first run downloads it once (~1 GB), then it's fully offline. The hosted demo above runs the identical model.
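If you want to call the tool from a script instead of a shell pipe, a thin wrapper might look like the sketch below. It assumes only what the shell example shows, namely that the `committed` command reads a diff on stdin and prints a message on stdout; the function names are hypothetical, and `git diff --cached` is used here to pick up staged changes.

```python
import subprocess


def suggest_message(diff_text, command=("committed",)):
    """Pipe a diff into the CLI on stdin and return its stdout, stripped."""
    result = subprocess.run(
        list(command),
        input=diff_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the command exits with a non-zero status
    )
    return result.stdout.strip()


def staged_diff():
    """Return the staged changes of the current repository as text."""
    return subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
```

Calling `suggest_message(staged_diff())` inside a repository with staged changes would then return the generated subject line. The `command` parameter is there so a different executable path can be swapped in without changing the wrapper.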

// built with

Qwen3-1.7B · QLoRA / PEFT · llama.cpp · GBNF grammar · FastAPI · Docker · Hugging Face · Next.js

GitHub repo ↗ · Model card (adapter) ↗ · Model card (GGUF) ↗ · Dataset ↗