Quantized version of Phi4-14B optimized for tool usage with Cline / Roo Code and complex problem solving.

tools · 14b · 4 months ago
650f20b83ad2 · 9.1GB · phi3 · 14.7B · Q4_K_M

Phi4-14B

https://huggingface.co/microsoft/phi-4

phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Optimizing for Tool Calling

num_ctx 65536: Absolutely crucial. Tool calling often involves complex workflows that span multiple turns. A large context window enables the model to remember previous instructions, tool responses, and the overall goal. 128k or 200k is even better if available.

temperature 0.15: Significantly reduced from the coding configuration (0.25). Tool calling prioritizes accuracy and reliability above all else. The model needs to confidently select the correct tool and format its call precisely. Low temperature minimizes the risk of incorrect tool selection or poorly formatted calls.

top_p 0.7: A more restrained exploration. While a little creativity can be helpful in some cases, we don’t want the model to drastically deviate from the most likely tool choices.

repeat_penalty 1.2: Aggressively penalizes repetition. This is particularly important in tool calling to prevent the model from getting stuck in loops and repeatedly calling the same tool.

num_keep 1024: Significantly increased from the standard coding configuration (512). Tool calling benefits enormously from retaining a longer history. This allows the model to better understand the context of the task, remember the tools available, and track the progress of the workflow.

min_p 0.03: A slightly higher min_p than the standard coding configuration (0.02). Tool calling often requires creative problem-solving and exploring different tool combinations. While we still want grounded responses, a bit more exploration is beneficial. Caveat: Monitor for nonsensical tool choices or unexpected behavior. Reduce if needed.

num_predict 16384 - 32768: num_predict (or the more common max_tokens in many LLM APIs) sets the maximum number of tokens the model is allowed to generate in a single response. It’s a safeguard to prevent runaway generation and control costs. However, in tool-calling scenarios, it can be quite tricky to set correctly.
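Taken together, the settings above can be baked into an Ollama Modelfile. A minimal sketch (the FROM phi4 base and the choice of 16384 for num_predict are assumptions; adjust them to your setup):

```
FROM phi4

PARAMETER num_ctx 65536
PARAMETER temperature 0.15
PARAMETER top_p 0.7
PARAMETER repeat_penalty 1.2
PARAMETER num_keep 1024
PARAMETER min_p 0.03
PARAMETER num_predict 16384
```

Build and run it with `ollama create phi4-tools -f Modelfile`, then `ollama run phi4-tools`.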

Reference num_predict starting points for other models

GPT-4 (32k context): A reasonable starting point is num_predict = 8192 or 16384. The 32k context window provides significant headroom, but be mindful of cost.

GPT-4 (8k context): Start with num_predict = 2048 or 4096.

Claude 2 (100k context): You can often start with num_predict = 16384 or 32768, but closely monitor token usage.

Smaller Models: Start with a smaller num_predict (e.g., 512 or 1024) and increase it as needed.
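If you are serving this model with Ollama, the same parameters can also be supplied per request via the options field of the REST API instead of being baked into a Modelfile. A minimal sketch of building such a request body (the model name phi4-tools is a placeholder; the payload shape follows Ollama's /api/generate endpoint, and per-request options override Modelfile PARAMETER values):

```python
import json

# Per-request options mirroring the tool-calling configuration above.
TOOL_CALLING_OPTIONS = {
    "num_ctx": 65536,
    "temperature": 0.15,
    "top_p": 0.7,
    "repeat_penalty": 1.2,
    "num_keep": 1024,
    "min_p": 0.03,
    "num_predict": 16384,
}

def build_generate_request(model: str, prompt: str) -> str:
    """Serialize a request body for POST /api/generate."""
    return json.dumps({
        "model": model,       # placeholder name; use your local tag
        "prompt": prompt,
        "stream": False,
        "options": TOOL_CALLING_OPTIONS,
    })

print(build_generate_request("phi4-tools", "Summarize the failing test output."))
```

POSTing this body to http://localhost:11434/api/generate applies the tool-calling settings for that single request only.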

Example Scenario:

Let’s say you have a tool that generates a JSON response, and the average JSON response size is 500 tokens. Your model needs around 100 tokens to formulate the tool call. You anticipate the tool might occasionally return a larger response of up to 1000 tokens. In this case, a good starting point would be:

num_predict = 100 (reasoning) + 1000 (max tool response) + 200 (buffer) = 1300 tokens
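That arithmetic generalizes into a small budgeting helper; a sketch, using the hypothetical token counts from the scenario above:

```python
def num_predict_budget(reasoning: int, max_tool_response: int, buffer: int = 200) -> int:
    """Budget num_predict as: tokens needed to formulate the tool call,
    plus the largest expected tool response, plus a safety buffer."""
    return reasoning + max_tool_response + buffer

# Worked example from the scenario above: 100 + 1000 + 200
print(num_predict_budget(100, 1000))  # → 1300
```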