FunctionGemma can call tools out of the box, but fine-tuning is what makes tool choice click. Can it pick the right function every time, or close to it? This article shows how to get there.
Fine-tuning FunctionGemma helps it choose tools with clearer judgment. Out of the box, it may hesitate. It might call the wrong tool when signals are weak. With targeted data, you nudge it toward the right choice more often.
Function calling depends on context. The model reads hints in the prompt and tools. If those hints are thin, choices get noisy. Fine-tuning adds consistent patterns the model can rely on. It learns what matters and what to ignore.
When the model sees steady patterns, it forms stable habits. That means fewer random calls and fewer empty queries.
Policies guide use. They define safety, privacy, and compliance rules. Fine-tuning turns those rules into muscle memory. The model learns when to refuse, when to ask for consent, and when to mask data.
This doesn’t make it perfect, but it raises refusal accuracy and consistency. You’ll also cut risky tool calls.
Many tasks can fit more than one tool. Think “knowledge base” versus “web search.” The best answer depends on freshness and scope. Fine-tuning teaches routing decisions through simple rules and worked examples.
Ambiguity won’t vanish, but it will drop. The model will explain its pick more clearly, too.
Use supervised fine-tuning (SFT) for patterns and formats. It imitates demonstrated good behavior. Add preference training, such as DPO (Direct Preference Optimization, a method that favors better outputs over worse ones), to rank subtle choices. Keep examples short, focused, and varied. Cover edge cases, not just happy paths.
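To make the two training formats concrete, here is a minimal sketch of what one SFT example and one DPO preference pair might look like. The field names (`prompt`, `tools`, `target`, `chosen`, `rejected`) are illustrative assumptions, not a fixed FunctionGemma schema.

```python
import json

# Hypothetical shape of one SFT example for tool routing.
# Field names are illustrative, not a fixed FunctionGemma schema.
sft_example = {
    "prompt": "What is our return policy for opened items?",
    "tools": ["knowledge_base", "web_search"],
    "target": {
        "tool": "knowledge_base",
        "arguments": {"query": "return policy opened items"},
        "reason": "Internal policy question; no freshness signal.",
    },
}

# A DPO-style preference pair ranks a better output over a worse one
# for the same prompt, which teaches the subtler routing choices.
dpo_pair = {
    "prompt": sft_example["prompt"],
    "chosen": json.dumps(sft_example["target"]),
    "rejected": json.dumps({
        "tool": "web_search",
        "arguments": {"query": "return policy"},
        "reason": "Search the web.",
    }),
}

print(sft_example["target"]["tool"])  # knowledge_base
```

Note that the target carries a short rationale alongside the tool and arguments; keeping that rationale brief is what lets the model learn the pattern without memorizing prose.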
Track metrics that reflect real use. Aim for clarity and reliability, not only raw accuracy.
With these metrics and examples, FunctionGemma can route with less guesswork. It should stay within policy, use context well, and reduce noisy tool calls.
This case study shows how FunctionGemma routes between a knowledge base and Google search. The goal is clear choices with fewer wrong calls. The setup uses two tools and simple, transparent rules. Queries need the right path based on freshness, scope, and privacy.
Good routing depends on strong, readable signals. We use a few reliable cues. Each cue maps to a tool in a simple way.
Examples make signals stick. “What changed this week?” hints at news. “What is our return policy?” points to internal docs.
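The cue-to-tool mapping above can be sketched as a tiny rule-based router. The keyword lists here are assumptions chosen for illustration; a real system would learn these signals from data rather than hand-code them.

```python
import re

# Illustrative cues; the keyword lists are assumptions, not model internals.
FRESHNESS_CUES = re.compile(r"\b(today|this week|latest|news|changed)\b", re.I)
INTERNAL_CUES = re.compile(r"\b(our|policy|internal|handbook)\b", re.I)

def route(query: str) -> str:
    """Map a query to a tool name using simple, readable signals."""
    if FRESHNESS_CUES.search(query):
        return "web_search"      # fresh or external information
    if INTERNAL_CUES.search(query):
        return "knowledge_base"  # internal docs and policies
    return "knowledge_base"      # default: try internal first

print(route("What changed this week?"))     # web_search
print(route("What is our return policy?"))  # knowledge_base
```

The point is not the rules themselves but their readability: if a human can state why a cue maps to a tool, the fine-tuning data can encode the same mapping consistently.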
Quality data drives routing quality. We build pairs of prompts and gold tool choices. We include short rationales that explain the pick in plain words. Arguments for tools are kept simple and valid.
We split data by topic to avoid leakage. A common split is 70% train, 15% dev, and 15% test. Stratify by tool type and difficulty. Deduplicate similar prompts across splits. Time-based splits help test freshness logic.
SFT teaches patterns, formats, and steady habits. The target output includes the chosen tool, arguments, and a brief reason. The reason is short and avoids fancy words.
We keep sequences short to reduce noise. We cap context to the parts that matter. Tool schemas are clear and minimal.
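A “clear and minimal” tool schema might look like the sketch below. The tool names and parameter layout are assumptions for illustration; the design goal is simply that each schema fits in a few lines, which keeps context short and routing signals clean.

```python
import json

# Minimal, readable tool schemas (names and fields are illustrative).
# Small schemas keep context short and reduce routing noise.
tools = [
    {
        "name": "knowledge_base",
        "description": "Search internal company docs and policies.",
        "parameters": {"query": {"type": "string"}},
    },
    {
        "name": "web_search",
        "description": "Search the web for fresh, public information.",
        "parameters": {"query": {"type": "string"}},
    },
]

print(json.dumps([t["name"] for t in tools]))
```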
We measure what users feel: right tool picks, fewer wasted calls, and faster answers. We also track safety and stability.
A typical lift looks like this in practice. Tool accuracy rises from around 70% to near 88–90%. Over-call rates drop from the low 20-percent range to under 10%. Invalid calls fall from about 8% to near 2%. Time-to-answer improves because fewer hops are needed.
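The three headline metrics can be computed from simple evaluation records. The record shape below (predicted tool, gold tool, argument validity) is an assumption; `gold` is `None` when no tool call was needed, which is what makes over-calls measurable.

```python
def routing_metrics(records):
    """Compute tool accuracy, over-call rate, and invalid-call rate.
    Each record: predicted tool (or None), gold tool (or None when
    no tool should be called), and whether the arguments were valid."""
    n = len(records)
    correct = sum(r["pred"] == r["gold"] for r in records)
    over = sum(r["pred"] is not None and r["gold"] is None for r in records)
    calls = [r for r in records if r["pred"] is not None]
    invalid = sum(not r["valid_args"] for r in calls)
    return {
        "tool_accuracy": correct / n,
        "over_call_rate": over / n,
        "invalid_call_rate": invalid / len(calls) if calls else 0.0,
    }

records = [
    {"pred": "web_search", "gold": "web_search", "valid_args": True},
    {"pred": "knowledge_base", "gold": "knowledge_base", "valid_args": True},
    {"pred": "web_search", "gold": None, "valid_args": True},   # over-call
    {"pred": "knowledge_base", "gold": "knowledge_base", "valid_args": False},
]
m = routing_metrics(records)
print(m["tool_accuracy"])  # 0.75
```

Tracking all three together matters: accuracy alone can rise while over-calls rise with it, which is exactly the failure mode users feel.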
Error review finds clear themes. Time-sensitive prompts that lack dates still confuse the model. Very close intents, like “policy vs press release,” may blur. Add counterexamples that show why one path wins. Teach a fallback: try internal first, then search if no match is found.
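The fallback pattern above can be sketched in a few lines. The two tool functions here are stand-ins for real tool calls, and the document store is a hypothetical placeholder.

```python
# Fallback routing sketch: try the internal knowledge base first,
# then fall back to search when no match is found.
def knowledge_base(query):
    docs = {"return policy": "30-day returns on unopened items."}  # stand-in store
    for key, text in docs.items():
        if key in query.lower():
            return text
    return None  # no internal match

def web_search(query):
    return f"[search results for: {query}]"  # stand-in for a real search call

def answer(query):
    hit = knowledge_base(query)
    return hit if hit is not None else web_search(query)

print(answer("What is our return policy?"))  # internal doc text
print(answer("What changed this week?"))     # falls back to search
```

Encoding this two-step path in training examples gives the model a safe default when signals are weak, instead of forcing a single guess.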
Keep improving with a tight loop. Log mistakes by signal type. Sample fresh prompts weekly to catch drift. Add small, focused batches to the SFT set. Re-test on a frozen benchmark to confirm gains. When metrics stall, refine labels and shorten rationales. FunctionGemma tends to learn faster from clear, simple fixes than from big, complex changes.
The Tuning Lab offers a simple, no-code way to tune FunctionGemma. You can shape tool choice without writing scripts. The workflow stays visual and clear. Teams move fast, and errors drop. It supports safe policies and clean evaluation, right in one place.
Run A/B comparisons between recent models and a trusted baseline. Tag mistakes by cause, like freshness or scope confusion. Add small, focused fixes to the dataset. Re-test on the same split and confirm stable gains.
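Tagging mistakes by cause is easy to operationalize. The cause labels below (“freshness”, “scope”) follow the article's own taxonomy; the error records are illustrative.

```python
from collections import Counter

# Illustrative error log; cause tags follow the article's taxonomy.
errors = [
    {"prompt": "latest outage status", "cause": "freshness"},
    {"prompt": "policy vs press release", "cause": "scope"},
    {"prompt": "holiday schedule", "cause": "scope"},
]

by_cause = Counter(e["cause"] for e in errors)
for cause, count in by_cause.most_common():
    print(cause, count)  # most frequent cause first
```

Sorting causes by frequency tells you which small, focused batch to add to the SFT set next, which is the tight loop the article recommends.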