
Needle: Cactus Compute's 26M Tool-Calling Model Explained


By Muhammad Tayyab · 12 min read

Tool selection doesn't need a 7B model

Most AI agents still assume a large model is in the loop for every decision, including the simplest one: which tool to call. Cactus Compute is betting that's overkill. On May 12, 2026, the company released Needle, a 26-million-parameter open-source model trained to do one thing well — pick the right function and fill its arguments — and to do it on a phone, a watch, or a pair of glasses without a network round-trip.

The one-line summary: Needle is distilled from Gemini 3.1 Flash Lite, runs at roughly 6,000 tokens/second prefill and 1,200 tokens/second decode on Cactus's runtime, and ships MIT-licensed with weights on Hugging Face. The architecture is the more interesting story — it drops the MLP/feed-forward blocks entirely and keeps only attention and gating, an arrangement Cactus calls a Simple Attention Network (SAN).

This guide goes deeper than the GitHub README and is more hands-on than the Simple Attention Networks doc. You'll get the architecture rationale, the distillation pipeline with real numbers, code to run Needle locally, and a clear answer to the question every founder is asking right now: when does the on-device model replace the API call?

What is Needle?

Needle is a specialist model. It is not built to write code, summarize a PDF, or hold a conversation. It is built to read a user query plus a list of available tools and emit a single JSON-formatted function call. That narrowness is the point. Tool selection on a fixed surface — "open my calendar," "set a timer," "send a message to Mom" — does not require general reasoning. It requires accurate retrieval and copying.

Three design choices follow from that goal:

  • Distilled, not pre-trained from scratch on instruction data. The training signal comes from Gemini 3.1 Flash Lite. Cactus used the frontier model to generate millions of (query, tool list, expected call) triplets across 15 task categories — timers, messaging, navigation, smart home, media playback, calendar, search, and more.
  • Open and small enough to ship. 26 million parameters is roughly two orders of magnitude smaller than the "mobile-friendly" baselines considered viable a year ago, and an order of magnitude smaller than Google's FunctionGemma at 270M. At INT8 the weights are around 26 MB — easy to drop into an app bundle.
  • Paired with a runtime that handles the cloud handoff. Cactus's broader thesis is that on-device handles ~80% of simple intents and the cloud (Gemini, Claude, GPT) handles the remaining ~20% that need multi-step reasoning. Needle is the cheap, fast first stage of that pipeline.

For builders, the practical surface is the Python API and a needle CLI with a Gradio playground. Weights are MIT-licensed, so commercial use, fine-tuning, and redistribution are all explicitly permitted.

The architecture: Simple Attention Networks

This is the section other Needle coverage has skipped. A standard transformer block alternates two operations: an attention layer that mixes information across positions, and a multilayer perceptron (MLP, also called a feed-forward network or FFN) that transforms each position individually. The MLP is where the model stores most of its learned knowledge — facts, idioms, reasoning patterns. It is also where most of the parameters live: in a typical decoder, the FFN accounts for roughly two-thirds of the weight count.

Cactus's bet is that for function calling, the FFN is dead weight. From the Simple Attention Networks doc:

"MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source."

The assumption is defensible specifically for tool calling because the tools list itself is the external knowledge source. Cactus puts it this way: "Tool calling is retrieval-and-assembly. Match query to tool name, extract argument values, assemble JSON. All three are aligning and copying between input and output — exactly what cross-attention does."

A function-calling model does not need to know what set_timer means in some abstract sense. It needs to look at the tool's name and parameter schema (which are right there in the prompt) and copy structured values from the user's query into the right slots. That is precisely the work that attention performs.
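To make that concrete, here is the shape of the alignment. The set_timer schema below is illustrative, not taken from Needle's docs:

python
# Illustrative copy task: "25" and "minutes" appear verbatim in the query
# and must land in the schema's slots. No world knowledge is required.
tools = '[{"name":"set_timer","parameters":{"duration":"int","unit":"string"}}]'
query = "set a timer for 25 minutes"
expected = '[{"name":"set_timer","arguments":{"duration":25,"unit":"minutes"}}]'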

So Needle keeps:

  • A 12-layer encoder with masked self-attention and gated residuals, no FFN.
  • An 8-layer decoder with self-attention, cross-attention back to the encoded prompt, and gated residuals.
  • Rotary Position Embeddings (RoPE), an 8-head/4-KV-head attention configuration, a hidden width of 512, an 8,192-entry BPE tokenizer, and ZCRMSNorm normalization.
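Stripped to essentials, a SAN layer looks like the sketch below. This is a single-head simplification with hypothetical weight names; the real model adds the causal mask, RoPE, grouped KV heads, and ZCRMSNorm, all omitted here for clarity.

python
import jax
import jax.numpy as jnp

def attention(q, k, v):
    # plain scaled dot-product attention (mask, RoPE, and GQA omitted)
    scores = (q @ k.T) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def san_block(x, p):
    # one simplified SAN layer: attention plus a gated residual, no FFN.
    # the sigmoid gate decides, per position, how much attention output to
    # mix in; it is the only per-position transform left once the MLP is gone
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    attn_out = attention(q, k, v) @ p["wo"]
    gate = jax.nn.sigmoid(x @ p["wg"])
    return x + gate * attn_out

x = jnp.ones((5, 512))  # 5 tokens at Needle's hidden width of 512
params = {name: 0.02 * jnp.ones((512, 512))
          for name in ("wq", "wk", "wv", "wo", "wg")}
print(san_block(x, params).shape)  # (5, 512)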

The result is a network that is small enough to be cheap and big enough to be accurate on this single task. It is also a research artifact worth watching: if the SAN architecture generalizes to other retrieval-and-assembly tasks (autocomplete on a fixed schema, slot-filling dialogue, structured extraction from documents), the broader cost of running narrow agents collapses. The trend is consistent with what we covered in the Stanford AI Index 2026 report: the gap between frontier and "good enough" is narrowing fastest at the specialist tier.

The distillation pipeline

Needle was not trained on internet text and then taught to call tools. It was built specifically to call tools, in three stages, on a budget any reasonable startup can afford.

Stage 1 — pretrain. Cactus pretrained the SAN on 200 billion tokens of general language data from the PleIAs/SYNTH dataset, on 16 Google Cloud TPU v6e chips, in 27 hours of wall time. At public TPU v6e rates, 16 chips for 27 hours is in the low four-figure range of compute cost — not millions, not tens of thousands. The point of pretraining is to give the model basic linguistic competence (subword segmentation, syntax, common vocabulary) so the fine-tuning stage does not have to teach English from scratch.
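The arithmetic behind that estimate, with the per-chip rate as an assumed illustrative figure rather than a quoted price:

python
chip_hours = 16 * 27           # 16 TPU v6e chips for 27 hours = 432 chip-hours
assumed_rate = 2.70            # illustrative on-demand $/chip-hour, not a quote
print(f"~${chip_hours * assumed_rate:,.0f}")  # ~$1,166, i.e. low four figures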

Stage 2 — generate synthetic data. Cactus used Gemini 3.1 Flash Lite as a data generator. For each of 15 task categories, the frontier model produced query/tool/call triplets at scale. The total: 2 billion tokens of synthetic single-shot function-calling data. Crucially, this stage uses the expensive model once, offline, to produce a training corpus — not on every inference call.

Stage 3 — post-train (distill). Needle was fine-tuned on the 2B synthetic tokens. Wall time: 45 minutes. This is the step where the model actually learns the task — match a natural-language query to the correct tool and emit valid JSON.

That's the entire recipe. The pattern is clean and worth memorizing: the frontier model is the data factory; the deployed model is the specialist student. If you've been wondering when "use the API forever" stops being the only option for production agents, this is the workflow that changes it. We've covered the broader distillation/compression trend before — see the Google TurboQuant guide on 5x LLM compression and the Gemma 4 developer guide for related small-model strategies.
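That factory loop, in sketch form. The prompt, client wrapper, and return format below are assumptions for illustration, not Cactus's actual pipeline:

python
import json

CATEGORIES = ["timers", "messaging", "navigation", "smart home"]  # 4 of the 15

def synthesize(frontier_call, category, n):
    # frontier_call wraps any teacher-model API; it runs offline, once,
    # to build a corpus, never in the inference hot path
    rows = []
    for _ in range(n):
        prompt = (f"Invent a realistic '{category}' user query, a matching "
                  "tool schema, and the correct JSON function call. "
                  "Return one JSON object with keys query, tools, call.")
        rows.append(json.loads(frontier_call(prompt)))
    return rows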

How to run Needle locally

Setup is three commands on Mac or Linux:

bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

The source ./setup script provisions a Python environment and installs JAX, the tokenizer, and the rest of the dependencies. needle playground boots a Gradio UI at http://127.0.0.1:7860 where you can paste a tools list, type queries, watch the model emit calls, and run a fine-tune from the same screen.

For programmatic use, the Python API takes a JSON-string tools list and a natural-language query:

python
from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]

Two things to notice. First, the output is a structured call object, not a natural-language reply — you wire it up to your own dispatcher. Second, tools is just a JSON string; you can swap your app's whole tool surface in and out without retraining as long as the tool names and parameter schemas are clear.
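A minimal dispatcher under those assumptions might look like this; the handler and registry are hypothetical, since Needle only produces the call object:

python
import json

def get_weather(location: str) -> str:
    return f"Sunny in {location}"          # stub handler for illustration

DISPATCH = {"get_weather": get_weather}    # your app's name -> handler registry

def run_calls(result: str):
    # result is the JSON string Needle emits; map each call to a handler
    return [DISPATCH[c["name"]](**c["arguments"]) for c in json.loads(result)]

print(run_calls('[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'))
# ['Sunny in San Francisco']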

The needle CLI exposes the rest of the workflow:

needle run --query "..." --tools     Single inference
needle finetune <data.jsonl>         Fine-tune on your own tools
needle eval --checkpoint <path>      Evaluate a checkpoint
needle generate-data                 Synthesize training data via Gemini
needle pretrain                      Pretrain on PleIAs/SYNTH
needle train                         Full training run
needle tokenize                      Tokenize a dataset
needle tpu <action>                  TPU management

A typical custom deployment looks like this: write 50–500 example (query, tools, expected call) pairs that match your product's actual tool surface, save them as JSONL, run needle finetune your-data.jsonl, and ship the resulting checkpoint inside your app. Fine-tuning on a few hundred examples is a single-GPU job that finishes in minutes.
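The row shape below follows the (query, tools, expected call) framing used throughout this guide; the exact field names the finetune command expects are an assumption here, so check the repo's data format before training.

python
import json

# Hypothetical training rows; verify field names against the repo before use
examples = [{
    "query": "turn off the living room lights",
    "tools": [{"name": "set_light",
               "parameters": {"room": "string", "state": "string"}}],
    "call":  [{"name": "set_light",
               "arguments": {"room": "living room", "state": "off"}}],
}]
with open("your-data.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")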

Where Needle fits in your agent stack

Needle is not a replacement for a large model. It is a router and argument extractor you can run before the large model is invoked — or instead of it, when the request is simple enough that no large model is needed at all.

Use Needle when:

  • Your tool list is finite and known. Mobile apps, in-car assistants, smart-home hubs, wearables — all have a fixed catalog of intents.
  • Latency matters. A round-trip to a cloud LLM is ~300–800 ms in the best case. Needle responds in single-digit milliseconds on-device.
  • Privacy matters. The user's query never leaves the device. This is structural, not policy.
  • Cost matters at scale. A 26M model running on a phone has effectively zero marginal cost per call. A frontier-model API call does not.

Don't use Needle when:

  • The user's request requires reasoning over long context ("given this 30-page contract, draft a counter-proposal").
  • The agent must compose multi-step tool plans, not single shots.
  • Your tool set changes frequently and you can't afford a quick retrain.

The hybrid pattern is the obvious win. Needle handles the simple 80% on-device. When confidence is low or the request looks complex, you hand off to a frontier model. The decision boundary itself can be a single bit returned by Needle. The broader shift this enables — agents that do more, locally, with less — connects to the agent view UX shift in Claude Code, where managing many small agents has become easier than feeding one giant one.
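In code, the handoff can be as small as the sketch below. needle_call, confidence, and frontier_call are hypothetical: Needle's public API does not expose a confidence score today, so you would derive one, for example from token log-probs, or use a heuristic.

python
def route(query, tools, needle_call, frontier_call, threshold=0.8):
    # needle_call returns (call_json, confidence): a hypothetical wrapper
    call, confidence = needle_call(query, tools)
    if confidence >= threshold:
        return call                         # the simple ~80%: handled on-device
    return frontier_call(query, tools)      # the hard ~20%: escalate to cloud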

Mental model: Needle is to Gemini what a GPS receiver is to a satellite network — a cheap, narrow consumer of expensive infrastructure that someone else paid for. The infrastructure is the frontier model; Needle just listens.

Frontier models as factories: the strategy shift

Here's the editorial take that matters more than any benchmark. The default mental model of an AI product, for the last three years, has been: call OpenAI/Anthropic/Google forever, pay per token, and hope the prices come down. Needle is one of the first widely visible examples of a different default — use the frontier model once, offline, to generate training data, and deploy a narrow distilled student that runs locally for free.

This is not a new idea. It is the textbook student–teacher distillation pattern that produced Gemma, Phi, and other small open models. What Needle changes is the task framing. Instead of distilling a generalist into a smaller generalist, Cactus distilled a generalist into a specialist: a model that does only one thing, on hardware it would never normally fit on.

Three implications for builders:

  1. Build-vs-buy flips again. If your product surface is narrow enough — and most consumer apps are — you may not need the API at all in steady state. You need the API to generate your training set, then you ship the student.
  2. The frontier API business has a new ceiling. Token revenue from "use my model in your hot path" is fundamentally undermined by "use my model once to teach yours." This is a long-term pressure on margins for the frontier labs.
  3. The interesting models aren't the biggest ones anymore. As we covered in the Kimi K2.6 review, open specialist models are closing the gap with closed generalists at a startling pace.

Needle matters less as a model than as a template. If you're building an AI feature in 2026 and you haven't asked "could we distill this into something we ship in the app bundle?" — start asking.

Limitations and caveats

Needle is a research preview shipping at 0.1.0, and the caveats are honest:

  • Specialist, not generalist. Don't ask Needle to write code, summarize a document, or hold a conversation. It will not do those things at all.
  • Synthetic-data ceiling. Training entirely on Gemini-generated data caps real-world generalization at "what Gemini thought looked plausible." Real user queries will surface gaps you'll need to fine-tune away on actual production traces.
  • Mac/Linux first. Mobile deployment is on the roadmap, not in the box. You'll need to bring your own quantization and inference runtime for phones and wearables.
  • Single-shot only. Needle picks one tool per inference call. Multi-step plans, conditional branches, and tool chaining are not in scope — those still belong to a frontier model.
  • The 80/20 handoff isn't free. You still need a cloud model for the hard 20% and a confidence signal to decide when to escalate. Building that signal is the part of the system Cactus does not give you.

If you're shipping consumer AI today, that's a reasonable risk profile. If you're staffing an agent platform, treat Needle as a building block, not a product. As discussed in our piece on whether software engineering is dead, the role of the human builder is increasingly to compose these specialist pieces rather than write each one from scratch.

Frequently Asked Questions

What is Needle AI?
Needle is a 26-million-parameter open-source AI model released by Cactus Compute on May 12, 2026. It is distilled from Google's Gemini 3.1 Flash Lite and trained specifically for single-shot function calling — given a natural-language query and a list of available tools, it emits a JSON-formatted function call. Weights are MIT-licensed and hosted on Hugging Face at huggingface.co/Cactus-Compute/needle.
Is Needle free to use?
Yes. Needle is released under the MIT license, which permits commercial use, modification, fine-tuning, and redistribution. You can clone the repository, download the weights, and run it locally without registration, API keys, or per-call fees.
What's the difference between Needle and Gemini?
Gemini is Google's frontier general-purpose model, available primarily via paid API. Needle is a 26M-parameter specialist distilled from Gemini 3.1 Flash Lite that runs entirely on-device and only handles function calling. Gemini handles open-ended generation and reasoning; Needle handles the narrow task of mapping a query to the right tool call — fast, free, and offline.
Can I run Needle on my phone?
Mobile deployment is on Cactus's roadmap but not turnkey today. The initial release targets Mac and Linux first. The model is small enough (~26 MB at INT8) to fit in a mobile app bundle, but you will need to bring your own quantization and on-device inference runtime to ship it on iOS or Android.
What is a Simple Attention Network?
A Simple Attention Network (SAN) is the architecture Needle introduces. It is a transformer with the multilayer perceptron (FFN) layers removed — only attention layers and gated residual connections remain. Cactus argues that MLPs are unnecessary when the model can rely on an external knowledge source, which is exactly the case for function calling: the tools list is in the prompt.
How fast is Needle?
On Cactus's runtime, Needle reaches roughly 6,000 tokens/second prefill and 1,200 tokens/second decode. In practical terms, a function call returns in a few milliseconds, well below the latency floor of any cloud API call.
Who built Needle?
Needle was built by Cactus Compute. The listed authors are Henry Ndubuaku, Jakub Mroz, Karen Mosoyan, Roman Shemet, Parkirat Sandhu, Satyajit Kumar, Noah Cylich, and Justin H. Lee.

Written by Muhammad Tayyab, CEO & Founder at Mergemain. He builds free, privacy-first developer tools at DevPik and writes about AI trends, developer tools, and web technologies.
