126 1 week ago

Pruned to 98 experts gemma-4 a4b 26b v4

tools thinking
ollama run mannix/gemma4-98e-v4:Q6_K

Details

1 week ago

25d1dbb9e448 · 15GB ·

gemma4
·
19.9B
·
Q6_K
{{- if or .System .Tools }}<bos><|turn>system {{ if .System }}{{ .System }} {{ end }}{{- if .Tools }
{ "num_ctx": 256000, "repeat_last_n": 256, "repeat_penalty": 1.15, "stop": [

Readme

Gemma 4 26B-A4B 98e v4 — multi-class CD-map expert prune

20.8B parameters · 98 experts (30 dropped) · multi-class ContribDynamic drop map

Research checkpoint that takes Gemma-4-26B-A4B-it and drops 30 of 128 experts per layer using a multi-class CD-map (max-over-normalized-classes) recipe. Five task classes — math, logic, code, science, creative — are scored per layer; an expert is kept if it scores high on any single class, rescuing specialists that a pooled single-class score would have dropped. Same router, same attention, same norms as base — only the expert keep-set changes.

Full model card, methodology, drop-map rationale, ablations: ManniX-ITA/gemma-4-A4B-98e-v4-it on Hugging Face.

Other formats

FormatRepoNotes
GGUF (this repo, llama.cpp / ollama) ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF Bartowski tier sweep (Q2_K → Q8_0, IQ-series) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included.
NVFP4A16 (vLLM) ManniX-ITA/gemma-4-A4B-98e-v4-NVFP4A16 ~13 GB, native vLLM, produced via modelopt==0.43.0.
BF16 source weights ManniX-ITA/gemma-4-A4B-98e-v4-it 20.8B bf16; base for any further surgery / quant.

When to use this vs. v5-coder

Pick v4 for: general-purpose chat, knowledge recall, instruction-following, mixed workloads. Balanced across the 5 task classes by construction.

Pick v5-coder for: Python / JS / Rust code generation, HumanEval / LCB workloads, MATH-500-class problems. Same parameter count, code-targeted drop map (C6 layer-relevance-weighted v4-floor, breadth=50) — wins on every code bench and on MATH-500.

Quick start

# recommended default for most setups (≈14 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q4_K_M

# best quality at moderate size (≈17 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q6_K

# size-conscious (≈8 GB VRAM) — minimal quality loss
ollama pull mannix/gemma4-98e-v4:CD-Q3_K_M

CD-* variants are ContribDynamic per-layer mixed quants — expert layers get more bits, attention/norm less. Roughly 5–10% faster than the matching plain quant at similar quality.

Scores

NVFP4A16, vLLM, greedy decoding, thinking-token budget 12 288. Apples-to-apples against the 128e reference (Gemma-4-26B-A4B-it) on the same harness.

Benchmark (n) 128e ref 98e v4 Δ (v4 − 128e)
HumanEval-164 chat (pass@1) 96.95 96.95 0.00
HumanEval+-164 chat (pass@1) 92.07 91.46 −0.61
MATH-500-100 (math_verify) 89.00 89.00 0.00
AIME 2024 (30) 36.67 36.67 0.00
IFEval-100 (prompt_strict) 95.00 93.00 −2.00
GSM8K-100 (flex) 91.00 86.00 −5.00
GPQA Diamond (198, flex) 73.23 69.19 −4.04
LCB-medium-55 v4 (pass@1) 87.27 78.18 −9.09

Reading the deltas: code chat (HE/HE+) and pure math (MATH-500, AIME) hold at 128e parity. Knowledge / multi-step reasoning (GPQA, GSM8K) and competitive code (LCB-medium) take a measured hit — the cost of dropping 23.4% of experts with a class-balanced drop map. Use v5-coder if those code/math axes matter for your workload.

Template & parameters

Uses the Gemma 4 chat template with tool-use support and a 2nd-turn workaround for nested function calls. Default parameters baked into every tag:

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER repeat_last_n 256
PARAMETER num_ctx 256000
PARAMETER stop <turn|>
PARAMETER stop <|tool_response>

License

Gemma Terms of Use. Use of this model implies acceptance.

Related