51 1 week ago

Pruned to 98 experts gemma-4 a4b 26b v3

tools thinking
ollama run mannix/gemma4-98e:IQ3_M

Details

1 week ago

6f2613f8eb19 · 9.8GB ·

gemma4
·
19.9B
·
(!unknown_file_type 27!)
{{- if or .System .Tools }}<bos><|turn>system {{ if .System }}{{ .System }} {{ end }}{{- if .Tools }
{ "num_ctx": 256000, "repeat_last_n": 256, "repeat_penalty": 1.15, "stop": [

Readme

Gemma 4 26B-A4B 98e v3 — pooled teacher-force expert prune

20.8B parameters · 98 experts (30 dropped) · single-class teacher-force drop map

First publishable 98e variant. Takes Gemma-4-26B-A4B-it and drops 30 of 128 experts per layer using a pooled teacher-force scoring recipe (score = wnorm × α + tc, α=2.0, protect_top=16). Same router, same attention, same norms as base — only the expert keep-set changes.

v3 matches 128e parity on GPQA Diamond at 23.4% expert reduction. Later variants (v4, v5-coder) refine the drop map with multi-class signals and free-knob router boosts; v3 is kept available as the simplest recipe and the baseline against which they're measured.

Full model card, methodology, ablations: ManniX-ITA/gemma-4-A4B-98e-v3-it on Hugging Face.

Other formats

FormatRepoNotes
GGUF (this repo, llama.cpp / ollama) ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF Bartowski tier sweep (Q3_K → Q6_K_L, IQ-series) + 5 ContribDynamic CD-* per-layer quants.
BF16 source weights ManniX-ITA/gemma-4-A4B-98e-v3-it 20.8B bf16; base for any further surgery / quant.

When to use this vs. v4 / v5-coder

Pick v3 for: the simplest, most-conservative 98e recipe — pooled teacher-force scoring, no free-knob router adjustments, fewer moving parts to reason about. Good baseline for downstream experiments.

Pick v4 for: general-purpose chat with the multi-class CD-map drop map. Slightly better balance across class axes; same 20.8B parameter count.

Pick v5-coder for: Python / JS / Rust code generation, HumanEval / LCB workloads, MATH-500-class problems. Code-targeted drop map (C6 layer-relevance-weighted v4-floor, breadth=50) plus shared-FFN α=1.2 boost.

Quick start

# recommended default for most setups (≈14 GB VRAM)
ollama pull mannix/gemma4-98e:Q4_K_M

# best quality at moderate size (≈17 GB VRAM)
ollama pull mannix/gemma4-98e:Q6_K

# size-conscious (≈8 GB VRAM) — minimal quality loss
ollama pull mannix/gemma4-98e:CD-Q3_K_M

CD-* variants are ContribDynamic per-layer mixed quants — expert layers get more bits, attention/norm less. Roughly 5–10% faster than the matching plain quant at similar quality.

Scores

GGUF Q6_K via llama.cpp on RTX 3090. Reasoning-format deepseek, reasoning-budget 8192. Apples-to-apples against 128e (Gemma-4-26B-A4B-it) on the same harness.

Benchmark (n) 128e ref 98e v3 Δ (v3 − 128e)
GPQA Diamond (198, flex) 75.25 75.25 0.00
HumanEval-164 chat (BF16) 76.83 73.78 −3.05
MBPP-500 (BF16) 89.60 85.60 −4.00

Reading the deltas: 23.4% expert reduction costs ~4–5% relative on code generation; zero on knowledge recall (GPQA Diamond matches 128e parity exactly). v3 was scored on the older BF16+Q6_K rig before the canonical 9-bench vLLM/NVFP4A16 suite was introduced in v4 — the GPQA number is directly comparable across all variants, the BF16 HE/MBPP numbers use a different harness than the v4/v5-coder cards. See the HF card for the full methodology note.

Template & parameters

Uses the Gemma 4 chat template with tool-use support and a 2nd-turn workaround for nested function calls. Default parameters baked into every tag:

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER repeat_last_n 256
PARAMETER num_ctx 256000
PARAMETER stop <turn|>
PARAMETER stop <|tool_response>

License

Gemma Terms of Use. Use of this model implies acceptance.

Related