51 Downloads Updated 1 week ago
ollama run mannix/gemma4-98e:Q5_K_M
Updated 1 week ago
1 week ago
3c3bdd40eebe · 13GB ·
20.8B parameters · 98 experts (30 dropped) · single-class teacher-force drop map
First publishable 98e variant. Takes Gemma-4-26B-A4B-it and drops 30 of 128 experts per layer using a pooled teacher-force scoring recipe (score = wnorm × α + tc, α=2.0, protect_top=16). Same router, same attention, same norms as base — only the expert keep-set changes.
v3 matches 128e parity on GPQA Diamond at 23.4% expert reduction. Later variants (v4, v5-coder) refine the drop map with multi-class signals and free-knob router boosts; v3 is kept available as the simplest recipe and the baseline against which they're measured.
Full model card, methodology, ablations: ManniX-ITA/gemma-4-A4B-98e-v3-it on Hugging Face.
| Format | Repo | Notes |
|---|---|---|
| GGUF (this repo, llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF |
Bartowski tier sweep (Q3_K → Q6_K_L, IQ-series) + 5 ContribDynamic CD-* per-layer quants. |
| BF16 source weights | ManniX-ITA/gemma-4-A4B-98e-v3-it |
20.8B bf16; base for any further surgery / quant. |
Pick v3 for: the simplest, most-conservative 98e recipe — pooled teacher-force scoring, no free-knob router adjustments, fewer moving parts to reason about. Good baseline for downstream experiments.
Pick v4 for: general-purpose chat with the multi-class CD-map drop map. Slightly better balance across class axes; same 20.8B parameter count.
Pick v5-coder for: Python / JS / Rust code generation, HumanEval / LCB workloads, MATH-500-class problems. Code-targeted drop map (C6 layer-relevance-weighted v4-floor, breadth=50) plus shared-FFN α=1.2 boost.
# recommended default for most setups (≈14 GB VRAM)
ollama pull mannix/gemma4-98e:Q4_K_M
# best quality at moderate size (≈17 GB VRAM)
ollama pull mannix/gemma4-98e:Q6_K
# size-conscious (≈8 GB VRAM) — minimal quality loss
ollama pull mannix/gemma4-98e:CD-Q3_K_M
CD-* variants are ContribDynamic per-layer mixed quants — expert layers get more bits, attention/norm less. Roughly 5–10% faster than the matching plain quant at similar quality.
GGUF Q6_K via llama.cpp on RTX 3090. Reasoning-format deepseek, reasoning-budget 8192. Apples-to-apples against 128e (Gemma-4-26B-A4B-it) on the same harness.
| Benchmark (n) | 128e ref | 98e v3 | Δ (v3 − 128e) |
|---|---|---|---|
| GPQA Diamond (198, flex) | 75.25 | 75.25 | 0.00 |
| HumanEval-164 chat (BF16) | 76.83 | 73.78 | −3.05 |
| MBPP-500 (BF16) | 89.60 | 85.60 | −4.00 |
Reading the deltas: 23.4% expert reduction costs ~4–5% relative on code generation; zero on knowledge recall (GPQA Diamond matches 128e parity exactly). v3 was scored on the older BF16+Q6_K rig before the canonical 9-bench vLLM/NVFP4A16 suite was introduced in v4 — the GPQA number is directly comparable across all variants, the BF16 HE/MBPP numbers use a different harness than the v4/v5-coder cards. See the HF card for the full methodology note.
Uses the Gemma 4 chat template with tool-use support and a 2nd-turn workaround for nested function calls. Default parameters baked into every tag:
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER repeat_last_n 256
PARAMETER num_ctx 256000
PARAMETER stop <turn|>
PARAMETER stop <|tool_response>
Gemma Terms of Use. Use of this model implies acceptance.
mannix/gemma4-98e-v4mannix/gemma4-98e-v5-coder