
Quantized versions of the Qwen3 models (4B, 8B, 14B, 32B, and 30B-MoE), optimized for tool usage in Cline / Roo Code and for solving complex problems.



Qwen3 Models

https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support. Available quantized models: 4B, 8B, 14B, 32B, and 30B-MoE.

Interested in learning more about vibe coding? Check out the following articles:

Optimizing for Tool Calling

PARAMETER num_ctx 65536 # Or higher if supported/needed. Maximize context.
PARAMETER temperature 0.25 # Low for precision, but slightly higher than 0.1/0.2 for minor flexibility
PARAMETER top_p 0.9      # Focuses on probable tokens, cutting off the long tail (less likely than top_p=1)
PARAMETER top_k 40       # Further restricts sampling pool (often works well with top_p)
PARAMETER repeat_penalty 1.1 # Mild penalty to discourage nonsensical loops, but allows necessary code repetition.
PARAMETER num_keep 1024   # Keep initial instructions/context
PARAMETER num_predict 16384 # Generous prediction length for substantial code blocks
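
As a minimal sketch, these parameters can be baked into a Modelfile and built into a local tag with ollama create. The base model tag qwen3:14b and the name qwen3-tools below are placeholders; use whichever quantized variant you pulled:

FROM qwen3:14b

PARAMETER num_ctx 65536
PARAMETER temperature 0.25
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_keep 1024
PARAMETER num_predict 16384

Then build and run it:

ollama create qwen3-tools -f Modelfile
ollama run qwen3-tools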

Rationale for Key Parameters:

num_ctx (65536+): Maximize this. Code generation relies heavily on understanding the surrounding code, dependencies, and instructions. A larger context window is almost always better for accuracy and relevance. Use the largest value your model and hardware comfortably support.

temperature (0.25): Low temperature ensures the model picks high-probability tokens, leading to more predictable, syntactically correct, and logically consistent code. Setting it slightly above the extreme lows (like 0.1) gives it just enough leeway to potentially find slightly better or alternative valid structures without becoming random.

top_p (0.9): This works well with low temperature. It prunes the least likely tokens (the bottom 10% cumulative probability mass), further focusing the output on sensible code constructs without being as drastically limiting as top_p=0.1.

top_k (40): Often used alongside top_p and temperature. It provides a hard limit on the number of choices, ensuring the model only considers the top 40 most likely tokens before top_p filtering is applied (or sometimes after, depending on implementation). It helps prevent esoteric token choices.

repeat_penalty (1.1): Code often requires repeating variable names, keywords (like if, for, return), and structures. A penalty of 1.0 means no penalty. A slight penalty like 1.1 gently discourages unnecessary or runaway repetition/loops while still easily allowing the model to repeat tokens when appropriate for coding. Avoid high values (e.g., > 1.2) which could hinder correct code generation.

num_keep (1024): Preserves the initial part of the context (like system prompts or specific coding instructions) when the context window needs to slide, ensuring the core task isn’t forgotten.

num_predict (16384): Sets a maximum length for the generated response. This needs to be large enough to accommodate reasonably sized functions or code snippets. Adjust as needed.
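
If you prefer not to bake these values into a Modelfile, the same settings can be passed per request through the options field of the Ollama API. Here is a rough Python sketch using the official ollama client; the model tag qwen3:14b and the prompt are only assumptions for illustration:

import ollama

response = ollama.chat(
    model="qwen3:14b",  # placeholder tag; use the quantized model you actually pulled
    messages=[{"role": "user", "content": "Write a Python function that parses a CSV file."}],
    options={
        "num_ctx": 65536,
        "temperature": 0.25,
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "num_keep": 1024,
        "num_predict": 16384,
    },
)
print(response["message"]["content"])

Request-time options override whatever is set in the Modelfile, which makes it easy to experiment with temperature or num_predict without rebuilding the model.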

Reference starting points for num_predict

GPT-4 (32k context): A reasonable starting point is num_predict = 8192 or 16384. The 32k context window provides significant headroom, but be mindful of cost.

GPT-4 (8k context): Start with num_predict = 2048 or 4096.

Claude 2 (100k context): You can often start with num_predict = 16384 or 32768, but closely monitor token usage.

Smaller Models: Start with a smaller num_predict (e.g., 512 or 1024) and increase it as needed.

Example Scenario:

Let’s say you have a tool that generates a JSON response, and the average JSON response size is 500 tokens. Your model needs around 100 tokens to formulate the tool call. You anticipate the tool might occasionally return a larger response of up to 1000 tokens. In this case, a good starting point would be:

num_predict = 100 (reasoning) + 1000 (max tool response) + 200 (buffer) = 1300 tokens
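
Expressed as a quick calculation (the token counts are just the assumptions from the scenario above):

# Rough num_predict budget for a tool-calling turn
reasoning_tokens = 100     # tokens the model needs to formulate the tool call
max_tool_response = 1000   # largest tool response you expect to handle
buffer = 200               # safety margin

num_predict = reasoning_tokens + max_tool_response + buffer
print(num_predict)  # 1300 -> pass as options={"num_predict": num_predict}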