7 models:
CLOUDY_MIXTRAL-22b:Q2_K_S · 7.3GB · 4K context window
CLOUDY_MIXTRAL-22b:Q3_K_M · 10GB · 4K context window
CLOUDY_MIXTRAL-22b:Q4_K_M · 13GB · 4K context window
CLOUDY_MIXTRAL-22b:Q5_K_S · 15GB · 4K context window
CLOUDY_MIXTRAL-22b:Q6_K · 18GB · 4K context window
CLOUDY_MIXTRAL-22b:IQ3_XS · 8.8GB · 4K context window
CLOUDY_MIXTRAL-22b:IQ4_XS · 11GB · 4K context window
MIXTRAL MOE / I-MATRIX / 22B (2X13B) / I-QUANT
The strongest contender for a “go-to” storytelling/roleplay model I have come across in my search for the best model (a fool’s errand). Although all 22 billion parameters are loaded into memory, only about 13 billion are accessed per token; this not only drastically increases generation speed, it also keeps the total size on disk down: on paper the merge is 26 billion parameters (2x13B), but because the two experts share their non-expert layers, the stored model comes to roughly 22 billion. In cases where the context needs to be constantly reprocessed in its entirety, or where generation speed for standard dense models is low (when splitting across multiple GPUs without a high-speed interconnect, for instance), an MoE (Mixture of Experts) model may prove beneficial. On top of that, its Mixtral base gives it distinctive prose in the vein of 22-24B models, but at higher speed. To stuff as many parameters into as little VRAM as possible, weighted K- and I-quants are listed. Whenever model size allows, quantizations are picked to fit within 8, 10, 12, and 16GB GPUs; a rough sizing sketch follows below.
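To make the "fit within 8, 10, 12, and 16GB" claim concrete, here is a minimal sizing sketch. The file sizes come from the tag listing above; the fixed overhead allowance for the 4K-context KV cache and runtime buffers is an assumed round figure, not a measured one.

```python
# Rough quant-picking heuristic: choose the largest listed quantization whose
# file size plus an assumed overhead (KV cache at 4K context + runtime buffers)
# still fits in a given VRAM budget. Sizes are taken from the tag listing above.
QUANT_SIZES_GB = {
    "Q2_K_S": 7.3, "Q3_K_M": 10.0, "Q4_K_M": 13.0,
    "Q5_K_S": 15.0, "Q6_K": 18.0,
    "IQ3_XS": 8.8, "IQ4_XS": 11.0,
}
OVERHEAD_GB = 0.5  # assumed headroom for KV cache and buffers, not measured

def largest_fitting_quant(vram_gb: float) -> str | None:
    """Return the biggest listed quant that fits the VRAM budget, or None."""
    candidates = [(size, name) for name, size in QUANT_SIZES_GB.items()
                  if size + OVERHEAD_GB <= vram_gb]
    return max(candidates)[1] if candidates else None

for budget in (8, 10, 12, 16):
    print(f"{budget:>2} GB GPU -> {largest_fitting_quant(budget)}")
```

With these assumptions the picks land on Q2_K_S, IQ3_XS, IQ4_XS, and Q5_K_S for the four tiers; real headroom varies with driver, display output, and batch settings, so treat it as a starting point rather than a guarantee.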
Note that I-quants forfeit some token generation speed relative to K-quants in exchange for storage efficiency. With that efficiency, the 5-bit quantization (Q5_K_S, 15GB) can still fit inside the VRAM of a 16GB GPU. These quantizations were taken from GGUF files on Hugging Face; a sketch of that route is below.
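For anyone who would rather reproduce the import than pull the prepared tags, here is a minimal sketch of the GGUF route. The repository and file names are placeholders (the actual source repository is not spelled out here); the `huggingface_hub` download call and the Modelfile `FROM` syntax are standard, but treat the whole thing as an illustration, not the exact recipe used for these tags.

```python
# Hypothetical sketch: fetch a GGUF quantization from Hugging Face and stage it
# for import into Ollama. repo_id and filename are placeholders, not the actual
# source repository behind these tags.
from pathlib import Path
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="someuser/Some-Mixtral-22B-i1-GGUF",  # placeholder repo
    filename="Some-Mixtral-22B.i1-Q4_K_M.gguf",   # placeholder file
)

# A minimal Modelfile pointing Ollama at the downloaded weights.
Path("Modelfile").write_text(f"FROM {gguf_path}\n")

# Then, from a shell:
#   ollama create cloudy_mixtral-22b:Q4_K_M -f Modelfile
```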
GGUF weighted quantizations (mradermacher):
[No obligatory model picture. Ollama would not like it.]
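As a closing usage note, here is a minimal sketch of pulling and prompting one of the tags above through the ollama Python client. The tag name follows the listing at the top; depending on how the model is namespaced on ollama.com, a `username/` prefix may be required.

```python
# Minimal sketch using the ollama Python client (pip install ollama).
# The tag follows the listing above; prepend the publishing user's namespace
# if the model lives under one (e.g. "someuser/CLOUDY_MIXTRAL-22b:Q4_K_M").
import ollama

MODEL = "CLOUDY_MIXTRAL-22b:Q4_K_M"  # 13GB quant from the listing above

ollama.pull(MODEL)  # fetch the chosen quantization if it is not already local

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user",
               "content": "Open a storm-at-sea scene in two paragraphs."}],
)
print(response["message"]["content"])
```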