Quantized version of Phi4-14B optimized for tool usage with Cline / Roo Code and complex problem solving.

tools · 14b · 4 months ago
650f20b83ad2 · 9.1GB · phi3 · 14.7B · Q4_K_M

Phi4-14B

https://huggingface.co/microsoft/phi-4

phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Optimizing for Tool Calling

num_ctx 65536: Absolutely crucial. Tool calling often involves complex workflows that span multiple turns. A large context window enables the model to remember previous instructions, tool responses, and the overall goal. 128k or 200k is even better if available.

temperature 0.15: Significantly reduced from the coding configuration (0.25). Tool calling prioritizes accuracy and reliability above all else. The model needs to confidently select the correct tool and format its call precisely. Low temperature minimizes the risk of incorrect tool selection or poorly formatted calls.

top_p 0.7: A more restrained exploration. While a little creativity can be helpful in some cases, we don’t want the model to drastically deviate from the most likely tool choices.

repeat_penalty 1.2: Aggressively penalizes repetition. This is particularly important in tool calling to prevent the model from getting stuck in loops and repeatedly calling the same tool.

num_keep 1024: Significantly increased from the standard coding configuration (512). Tool calling benefits enormously from retaining a longer history. This allows the model to better understand the context of the task, remember the tools available, and track the progress of the workflow.

min_p 0.03: A slightly higher min_p than the standard coding configuration (0.02). Tool calling often requires creative problem-solving and exploring different tool combinations. While we still want grounded responses, a bit more exploration is beneficial. Caveat: Monitor for nonsensical tool choices or unexpected behavior. Reduce if needed.

num_predict 16384 - 32768: num_predict (or the more common max_tokens in many LLM APIs) sets the maximum number of tokens the model is allowed to generate in a single response. It’s a safeguard to prevent runaway generation and control costs. However, in tool-calling scenarios, it can be quite tricky to set correctly.
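Taken together, the settings above can be baked into an Ollama Modelfile. A minimal sketch (the FROM phi4 base and the choice of 16384 for num_predict are assumptions; adjust them to your setup):

```
FROM phi4

PARAMETER num_ctx 65536
PARAMETER temperature 0.15
PARAMETER top_p 0.7
PARAMETER repeat_penalty 1.2
PARAMETER num_keep 1024
PARAMETER min_p 0.03
PARAMETER num_predict 16384
```

Build and run it with `ollama create phi4-tools -f Modelfile`, then `ollama run phi4-tools`.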

Reference num_predict starting points for other models

GPT-4 (32k context): A reasonable starting point is num_predict = 8192 or 16384. The 32k context window provides significant headroom, but be mindful of cost.

GPT-4 (8k context): Start with num_predict = 2048 or 4096.

Claude 2 (100k context): You can often start with num_predict = 16384 or 32768, but closely monitor token usage.

Smaller Models: Start with a smaller num_predict (e.g., 512 or 1024) and increase it as needed.
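If you are serving this model with Ollama, the same parameters can also be supplied per request via the options field of the REST API instead of being baked into a Modelfile. A minimal sketch of building such a request body (the model name phi4-tools is a placeholder; the payload shape follows Ollama's /api/generate endpoint, and per-request options override Modelfile PARAMETER values):

```python
import json

# Per-request options mirroring the tool-calling configuration above.
TOOL_CALLING_OPTIONS = {
    "num_ctx": 65536,
    "temperature": 0.15,
    "top_p": 0.7,
    "repeat_penalty": 1.2,
    "num_keep": 1024,
    "min_p": 0.03,
    "num_predict": 16384,
}

def build_generate_request(model: str, prompt: str) -> str:
    """Serialize a request body for POST /api/generate."""
    return json.dumps({
        "model": model,       # placeholder name; use your local tag
        "prompt": prompt,
        "stream": False,
        "options": TOOL_CALLING_OPTIONS,
    })

print(build_generate_request("phi4-tools", "Summarize the failing test output."))
```

POSTing this body to http://localhost:11434/api/generate applies the tool-calling settings for that single request only.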

Example Scenario:

Let’s say you have a tool that generates a JSON response, and the average JSON response size is 500 tokens. Your model needs around 100 tokens to formulate the tool call. You anticipate the tool might occasionally return a larger response of up to 1000 tokens. In this case, a good starting point would be:

num_predict = 100 (reasoning) + 1000 (max tool response) + 200 (buffer) = 1300 tokens
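That arithmetic generalizes into a small budgeting helper; a sketch, using the hypothetical token counts from the scenario above:

```python
def num_predict_budget(reasoning: int, max_tool_response: int, buffer: int = 200) -> int:
    """Budget num_predict as: tokens needed to formulate the tool call,
    plus the largest expected tool response, plus a safety buffer."""
    return reasoning + max_tool_response + buffer

# Worked example from the scenario above: 100 + 1000 + 200
print(num_predict_budget(100, 1000))  # → 1300
```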