ZEPHYRIA / I-MATRIX / 37-48B / I-QUANT
This model has been recommended often in forums and even in person. It comes in multiple versions, each finetuned from Mistral Small Instruct for different tasks. The versions listed are as follows (a rough parameter-count sketch follows the list):
Early Duplication: The largest, 48-billion-parameter version, with roughly two thirds of its layers duplicated; this is said to benefit “low-level feature processing.”
Balanced with Extended Duplication: The 45-billion-parameter model, which leans more heavily on layers duplicated earlier in the finetuning process. Better suited to complex generative tasks, and may be the go-to for storytelling.
Mid Duplication: The 42-billion-parameter model, with a 1:1 ratio between unique and duplicated layers. Presented as a general-purpose model.
Balanced: The 39-billion-parameter model, with an approximate 1:1:1 ratio between duplicated, non-duplicated, and unique layers after the process. On low-VRAM devices, this is the recommended general-purpose model.
Late Duplication: The 37-billion-parameter model, with a 4:3 ratio between non-duplicated and duplicated layers. “Ideal for tasks requiring extensive unique feature processing,” according to the card.
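For intuition on how these duplication ratios translate into parameter counts, here is a minimal back-of-the-envelope sketch in Python. The base-model figures and the layer splits passed to the function are placeholder assumptions for illustration (the creator's exact recipe isn't published here), so the outputs won't reproduce the listed sizes exactly.

```python
# Rough sketch: how duplicating decoder layers scales total parameter count.
# All concrete numbers below are illustrative placeholders, not figures from the model card.

BASE_PARAMS = 24e9    # assumed total parameters of the base instruct model
BASE_LAYERS = 40      # assumed decoder layer count
EMBED_PARAMS = 0.7e9  # assumed embedding/output-head parameters, which are not duplicated

per_layer = (BASE_PARAMS - EMBED_PARAMS) / BASE_LAYERS  # parameters in one decoder layer

def total_params(unique_layers: int, duplicated_layers: int) -> float:
    """Parameter count after a merge that repeats some decoder layers."""
    return EMBED_PARAMS + per_layer * (unique_layers + duplicated_layers)

# Duplicating every layer roughly doubles the non-embedding parameters:
print(f"{total_params(40, 40) / 1e9:.1f}B")  # ~47.3B with these placeholder numbers
# Duplicating a bit over half of the layers lands in the high 30s:
print(f"{total_params(40, 24) / 1e9:.1f}B")  # ~38.0B with these placeholder numbers
```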
To stuff as many parameters into as little VRAM as possible, weighted K-quants and I-quants are listed for each version the creator featured. Since this is an “experimental” model, the medium 1-bit I-quants are also included just for fun. They may prove effective: heavily quantized larger models can still retain enough parameters to avoid generating garbled nonsense (see Ubergarm’s 671b 1-bit DeepSeek); whether that holds at 37-48b remains to be seen. Note that I-quants trade some token generation speed relative to K-quants in exchange for storage efficiency.
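As a quick sanity check on how quantization level maps to file size (and therefore VRAM footprint), here is a minimal sketch. The bits-per-weight figures are rough community approximations for llama.cpp quant types, not values taken from this model card, and real GGUF files add some overhead for non-quantized tensors and metadata.

```python
# Minimal sketch: approximate weight size of a model at a given quantization level.
# Bits-per-weight values are rough approximations for common llama.cpp quant types;
# real GGUF files also carry some non-quantized tensors and metadata.

APPROX_BPW = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q4_K_S": 4.5,
    "IQ4_XS": 4.3, "Q3_K_M": 3.9, "Q3_K_S": 3.5, "IQ3_S": 3.4, "IQ3_XXS": 3.1,
    "IQ2_XS": 2.3, "IQ1_M": 1.8,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate weight size in GB: parameters * bits-per-weight / 8 bits per byte."""
    return n_params_billion * APPROX_BPW[quant] / 8

for quant in ("Q4_K_M", "IQ3_XXS", "IQ2_XS", "IQ1_M"):
    print(f"48B @ {quant}: ~{approx_size_gb(48, quant):.1f} GB")
```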
As this is a large model collection, VRAM requirements can climb quickly once you look for performant quantizations. The following recommendations are tailored for consumer setups (a rough fit-check sketch follows these recommendations):
For >=48 GB (2x24, 4x12, etc.) GPUs: Anything will be fine.
For 40 GB (2x20, etc.) GPUs: Any non-8-bit model should work. If you want to run the largest, early-duplication model at 6-bit, you may have to set layer splits manually to prevent any CPU offloading. To err on the side of caution, use the 5-bit models if you want the largest model.
For 32 GB (2x16, 1x32, etc.) GPUs: 5-bit quantizations of the models below 48 billion parameters will fit in 32GB of VRAM. Note that the smallest model can be used at 6-bit.
For 24 GB (2x12, 1x24, etc.) GPUs: The small 3-bit K-quant at 48b, the medium 3-bit at 45b, the extra-small 4-bit I-quant at 42b, the small 4-bit K-quant at 39b, or the medium 4-bit K-quant at 37b.
For 20 GB GPUs: The small 3-bit K-quant for the 42b model, or the medium 3-bit K-quant for the 39b and 37b models, will fit into 20GB. If you would prefer lower perplexity over faster generation, use the small 3-bit I-quants instead. You can also use the XXS 3-bit I-quant if you want to run the 48b model.
For 16GB GPUs: The XXS 3-bit I-quant for the 37b and 39b models, and the extra-small 2-bit I-quant for the 42b, 45b, and 48b models, all fit in 16GB.
For 12GB GPUs: The extra-small 2-bit I-quant for the 37b model will work fine, along with the XXS 2-bit I-quants for 42b and below.
For 10GB GPUs: Take a look at the 1-bit I-quants to see if they produce any worthwhile results.
These models were taken from GGUF quantizations hosted on Hugging Face.
Original model (SteelStorage):
GGUF weighted quantizations (mradermacher):