126 Downloads Updated 1 week ago
ollama run mannix/gemma4-98e-v4:Q4_0
Updated 1 week ago
1 week ago
0de450e3274f · 11GB ·
20.8B parameters · 98 experts (30 dropped) · multi-class ContribDynamic drop map
Research checkpoint that takes Gemma-4-26B-A4B-it and drops 30 of 128 experts per layer using a multi-class CD-map (max-over-normalized-classes) recipe. Five task classes — math, logic, code, science, creative — are scored per layer; an expert is kept if it scores high on any single class, rescuing specialists that a pooled single-class score would have dropped. Same router, same attention, same norms as base — only the expert keep-set changes.
Full model card, methodology, drop-map rationale, ablations: ManniX-ITA/gemma-4-A4B-98e-v4-it on Hugging Face.
| Format | Repo | Notes |
|---|---|---|
| GGUF (this repo, llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF |
Bartowski tier sweep (Q2_K → Q8_0, IQ-series) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v4-NVFP4A16 |
~13 GB, native vLLM, produced via modelopt==0.43.0. |
| BF16 source weights | ManniX-ITA/gemma-4-A4B-98e-v4-it |
20.8B bf16; base for any further surgery / quant. |
Pick v4 for: general-purpose chat, knowledge recall, instruction-following, mixed workloads. Balanced across the 5 task classes by construction.
Pick v5-coder for: Python / JS / Rust code generation, HumanEval / LCB workloads, MATH-500-class problems. Same parameter count, code-targeted drop map (C6 layer-relevance-weighted v4-floor, breadth=50) — wins on every code bench and on MATH-500.
# recommended default for most setups (≈14 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q4_K_M
# best quality at moderate size (≈17 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q6_K
# size-conscious (≈8 GB VRAM) — minimal quality loss
ollama pull mannix/gemma4-98e-v4:CD-Q3_K_M
CD-* variants are ContribDynamic per-layer mixed quants — expert layers get more bits, attention/norm less. Roughly 5–10% faster than the matching plain quant at similar quality.
NVFP4A16, vLLM, greedy decoding, thinking-token budget 12 288. Apples-to-apples against the 128e reference (Gemma-4-26B-A4B-it) on the same harness.
| Benchmark (n) | 128e ref | 98e v4 | Δ (v4 − 128e) |
|---|---|---|---|
| HumanEval-164 chat (pass@1) | 96.95 | 96.95 | 0.00 |
| HumanEval+-164 chat (pass@1) | 92.07 | 91.46 | −0.61 |
| MATH-500-100 (math_verify) | 89.00 | 89.00 | 0.00 |
| AIME 2024 (30) | 36.67 | 36.67 | 0.00 |
| IFEval-100 (prompt_strict) | 95.00 | 93.00 | −2.00 |
| GSM8K-100 (flex) | 91.00 | 86.00 | −5.00 |
| GPQA Diamond (198, flex) | 73.23 | 69.19 | −4.04 |
| LCB-medium-55 v4 (pass@1) | 87.27 | 78.18 | −9.09 |
Reading the deltas: code chat (HE/HE+) and pure math (MATH-500, AIME) hold at 128e parity. Knowledge / multi-step reasoning (GPQA, GSM8K) and competitive code (LCB-medium) take a measured hit — the cost of dropping 23.4% of experts with a class-balanced drop map. Use v5-coder if those code/math axes matter for your workload.
Uses the Gemma 4 chat template with tool-use support and a 2nd-turn workaround for nested function calls. Default parameters baked into every tag:
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER repeat_last_n 256
PARAMETER num_ctx 256000
PARAMETER stop <turn|>
PARAMETER stop <|tool_response>
Gemma Terms of Use. Use of this model implies acceptance.
mannix/gemma4-98e-v5-codermannix/gemma4-98e