Tool selection doesn't need a 7B model
Most AI agents still assume a large model is in the loop for every decision, including the simplest one: which tool to call. Cactus Compute is betting that's overkill. On May 12, 2026, the company released Needle, a 26-million-parameter open-source model trained to do one thing well — pick the right function and fill its arguments — and to do it on a phone, a watch, or a pair of glasses without a network round-trip.
The one-line summary: Needle is distilled from Gemini 3.1 Flash Lite, runs at roughly 6,000 tokens/second prefill and 1,200 tokens/second decode on Cactus's runtime, and ships MIT-licensed with weights on Hugging Face. The architecture is the more interesting story — it drops the MLP/feed-forward blocks entirely and keeps only attention and gating, an arrangement Cactus calls a Simple Attention Network (SAN).
This guide goes deeper than the GitHub README and is more hands-on than the Simple Attention Networks doc. You'll get the architecture rationale, the distillation pipeline with real numbers, code to run Needle locally, and a clear answer to the question every founder is asking right now: when does the on-device model replace the API call?
What is Needle?
Needle is a specialist model. It is not built to write code, summarize a PDF, or hold a conversation. It is built to read a user query plus a list of available tools and emit a single JSON-formatted function call. That narrowness is the point. Tool selection on a fixed surface — "open my calendar," "set a timer," "send a message to Mom" — does not require general reasoning. It requires accurate retrieval and copying.
Three design choices follow from that goal:
- Distilled, not pre-trained from scratch on instruction data. The training signal comes from Gemini 3.1 Flash Lite. Cactus used the frontier model to generate millions of (query, tool list, expected call) triplets across 15 task categories — timers, messaging, navigation, smart home, media playback, calendar, search, and more.
- Open and small enough to ship. 26 million parameters is roughly two orders of magnitude smaller than the "mobile-friendly" baselines considered viable a year ago, and an order of magnitude smaller than Google's FunctionGemma at 270M. At INT8 the weights are around 26 MB — easy to drop into an app bundle.
- Paired with a runtime that handles the cloud handoff. Cactus's broader thesis is that on-device handles ~80% of simple intents and the cloud (Gemini, Claude, GPT) handles the remaining ~20% that need multi-step reasoning. Needle is the cheap, fast first stage of that pipeline.
For builders, the practical surface is the Python API and a needle CLI with a Gradio playground. Weights are MIT-licensed, so commercial use, fine-tuning, and redistribution are all explicitly permitted.
The architecture: Simple Attention Networks
This is the section other Needle coverage has skipped. A standard transformer block alternates two operations: an attention layer that mixes information across positions, and a multilayer perceptron (MLP, also called a feed-forward network or FFN) that transforms each position individually. The MLP is where the model stores most of its learned knowledge — facts, idioms, reasoning patterns. It is also where most of the parameters live: in a typical decoder, the FFN accounts for roughly two-thirds of the weight count.
Cactus's bet is that for function calling, the FFN is dead weight. From the Simple Attention Networks doc:
"MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source."
The assumption is defensible specifically for tool calling because the tools list itself is the external knowledge source. Cactus puts it this way: "Tool calling is retrieval-and-assembly. Match query to tool name, extract argument values, assemble JSON. All three are aligning and copying between input and output — exactly what cross-attention does."
A function-calling model does not need to know what set_timer means in some abstract sense. It needs to look at the tool's name and parameter schema (which are right there in the prompt) and copy structured values from the user's query into the right slots. That is precisely the work that attention performs.
So Needle keeps:
- A 12-layer encoder with masked self-attention and gated residuals, no FFN.
- An 8-layer decoder with self-attention, cross-attention back to the encoded prompt, and gated residuals.
- Rotary Position Embeddings (RoPE), an 8-head/4-KV-head attention configuration, a hidden width of 512, an 8,192-entry BPE tokenizer, and ZCRMSNorm normalization.
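The shape of a SAN layer is easy to state in code. Below is a minimal numpy sketch of one masked-self-attention layer with a gated residual and no feed-forward block. It is illustrative only: the sigmoid gate form is an assumption, and normalization (Needle uses ZCRMSNorm), multi-head splitting, and RoPE are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def san_block(x, Wq, Wk, Wv, Wo, Wg):
    """One simplified SAN layer: masked self-attention plus a learned
    gate on the residual -- and, crucially, no MLP/FFN afterward."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask
    scores[mask] = -1e9
    attn_out = softmax(scores) @ v @ Wo
    gate = 1 / (1 + np.exp(-(x @ Wg)))                # sigmoid gate (assumed form)
    return x + gate * attn_out                        # gated residual, no FFN

# Tiny smoke test: 4 tokens, width 8
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
y = san_block(x, *Ws)
print(y.shape)  # (4, 8)
```

Dropping the FFN means a layer's parameter count is just the five small projection matrices above, which is how the whole model fits in 26M parameters.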
The result is a network that is small enough to be cheap and big enough to be accurate on this single task. It is also a research artifact worth watching: if the SAN architecture generalizes to other retrieval-and-assembly tasks (autocomplete on a fixed schema, slot-filling dialogue, structured extraction from documents), the broader cost of running narrow agents collapses. The trend is consistent with what we covered in the Stanford AI Index 2026 report: the gap between frontier and "good enough" is narrowing fastest at the specialist tier.
The distillation pipeline
Needle was not trained on internet text and then taught to call tools. It was built specifically to call tools, in three stages, on a budget any reasonable startup can afford.
Stage 1 — pretrain. Cactus pretrained the SAN on 200 billion tokens of general language data from the PleIAs/SYNTH dataset, on 16 Google Cloud TPU v6e chips, in 27 hours of wall time. At public TPU v6e rates, 16 chips for 27 hours is in the low four-figure range of compute cost — not millions, not tens of thousands. The point of pretraining is to give the model basic linguistic competence (subword segmentation, syntax, common vocabulary) so the fine-tuning stage does not have to teach English from scratch.
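The "low four figures" claim is easy to check with back-of-the-envelope arithmetic. The per-chip-hour rate below is an assumption roughly matching published on-demand pricing; committed-use rates are lower.

```python
chips = 16
hours = 27
rate_per_chip_hour = 2.70  # ASSUMED on-demand TPU v6e list price, USD

chip_hours = chips * hours           # 432 chip-hours
cost = chip_hours * rate_per_chip_hour
print(chip_hours, f"${cost:,.0f}")   # 432 chip-hours, roughly $1,166
```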
Stage 2 — generate synthetic data. Cactus used Gemini 3.1 Flash Lite as a data generator. For each of 15 task categories, the frontier model produced query/tool/call triplets at scale. The total: 2 billion tokens of synthetic single-shot function-calling data. Crucially, this stage uses the expensive model once, offline, to produce a training corpus — not on every inference call.
Stage 3 — post-train (distill). Needle was fine-tuned on the 2B synthetic tokens. Wall time: 45 minutes. This is the step where the model actually learns the task — match a natural-language query to the correct tool and emit valid JSON.
That's the entire recipe. The pattern is clean and worth memorizing: the frontier model is the data factory; the deployed model is the specialist student. If you've been wondering when "use the API forever" stops being the only option for production agents, this is the workflow that changes it. We've covered the broader distillation/compression trend before — see the Google TurboQuant guide on 5x LLM compression and the Gemma 4 developer guide for related small-model strategies.
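The data-factory loop in Stage 2 can be sketched in a few lines. Everything here is illustrative: fake_teacher is a stand-in for the real frontier-model call (in Cactus's pipeline, a prompt to Gemini asking it to invent a query, a tools list, and the correct call), and the category names are a subset of the 15 mentioned above.

```python
import json
import random

CATEGORIES = ["timers", "messaging", "navigation", "smart_home", "media"]

def fake_teacher(category: str) -> dict:
    """Stand-in for the frontier-model call that invents one training triplet."""
    return {
        "query": f"example {category} request",
        "tools": [{"name": f"do_{category}", "parameters": {}}],
        "call": {"name": f"do_{category}", "arguments": {}},
    }

def generate_corpus(n_examples: int, teacher=fake_teacher) -> list:
    """Run the (expensive) teacher once per example, offline, to build JSONL."""
    lines = []
    for _ in range(n_examples):
        category = random.choice(CATEGORIES)
        lines.append(json.dumps(teacher(category)))
    return lines

corpus = generate_corpus(5)
print(len(corpus))  # 5 JSONL lines
```

The economics follow from the loop's structure: the teacher is invoked once per training example during corpus construction, never per user request in production.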
How to run Needle locally
Setup is three commands on Mac or Linux:
```bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```

The source ./setup script provisions a Python environment and installs JAX, the tokenizer, and the rest of the dependencies. needle playground boots a Gradio UI at http://127.0.0.1:7860 where you can paste a tools list, type queries, watch the model emit calls, and run a fine-tune from the same screen.
For programmatic use, the Python API takes a JSON-string tools list and a natural-language query:
```python
from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```

Two things to notice. First, the output is a structured call object, not a natural-language reply — you wire it up to your own dispatcher. Second, tools is just a JSON string; you can swap your app's whole tool surface in and out without retraining as long as the tool names and parameter schemas are clear.
The needle CLI exposes the rest of the workflow:
```
needle run --query "..." --tools    Single inference
needle finetune <data.jsonl>        Fine-tune on your own tools
needle eval --checkpoint <path>     Evaluate a checkpoint
needle generate-data                Synthesize training data via Gemini
needle pretrain                     Pretrain on PleIAs/SYNTH
needle train                        Full training run
needle tokenize                     Tokenize a dataset
needle tpu <action>                 TPU management
```

A typical custom deployment looks like this: write 50–500 example (query, tools, expected call) pairs that match your product's actual tool surface, save them as JSONL, run needle finetune your-data.jsonl, and ship the resulting checkpoint inside your app. Fine-tuning on a few hundred examples is a single-GPU job that finishes in minutes.
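Writing that JSONL is plain Python. The field names below are an assumption mirroring the (query, tools, expected call) triplet format described earlier — check the repo's README for the exact schema needle finetune expects before shipping.

```python
import json

examples = [
    {
        "query": "Turn off the living room lights",
        "tools": [{"name": "set_light",
                   "parameters": {"room": "string", "state": "string"}}],
        "call": {"name": "set_light",
                 "arguments": {"room": "living room", "state": "off"}},
    },
    # ...50-500 more pairs, matching your product's real tool surface
]

# One JSON object per line, the usual JSONL convention
with open("your-data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```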
Where Needle fits in your agent stack
Needle is not a replacement for a large model. It is a router and argument extractor you can run before the large model is invoked — or instead of it, when the request is simple enough that no large model is needed at all.
Use Needle when:
- Your tool list is finite and known. Mobile apps, in-car assistants, smart-home hubs, wearables — all have a fixed catalog of intents.
- Latency matters. A round-trip to a cloud LLM is ~300–800 ms in the best case. Needle responds in single-digit milliseconds on-device.
- Privacy matters. The user's query never leaves the device. This is structural, not policy.
- Cost matters at scale. A 26M model running on a phone has effectively zero marginal cost per call. A frontier-model API call does not.
Don't use Needle when:
- The user's request requires reasoning over long context ("given this 30-page contract, draft a counter-proposal").
- The agent must compose multi-step tool plans, not single shots.
- Your tool set changes frequently and you can't afford a quick retrain.
The hybrid pattern is the obvious win. Needle handles the simple 80% on-device. When confidence is low or the request looks complex, you hand off to a frontier model. The decision boundary itself can be a single bit returned by Needle. The broader shift this enables — agents that do more, locally, with less — connects to the agent view UX shift in Claude Code, where managing many small agents has become easier than feeding one giant one.
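That handoff can be sketched as a two-stage router. Everything below is illustrative: needle_call and frontier_call are placeholder functions, and the confidence heuristic shown (does the output parse, and does the named tool exist?) is one simple choice among many — Cactus does not prescribe one.

```python
import json

TOOL_NAMES = {"get_weather", "set_timer", "send_message"}

def needle_call(query: str) -> str:
    """Placeholder for on-device Needle inference."""
    return '[{"name":"get_weather","arguments":{"location":"SF"}}]'

def frontier_call(query: str) -> str:
    """Placeholder for a cloud frontier-model call (the expensive 20% path)."""
    return "cloud-handled: " + query

def route(query: str) -> str:
    """Try Needle first; escalate to the cloud when confidence is low."""
    raw = needle_call(query)
    try:
        calls = json.loads(raw)
        confident = bool(calls) and all(c["name"] in TOOL_NAMES for c in calls)
    except (json.JSONDecodeError, KeyError, TypeError):
        confident = False
    return raw if confident else frontier_call(query)

print(route("What's the weather in SF?"))
```

A production confidence signal would likely also look at token log-probabilities or a schema validator, but the control flow — cheap model first, single bit decides escalation — stays the same.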
Mental model: Needle is to Gemini what a GPS receiver is to a satellite network — a cheap, narrow consumer of expensive infrastructure that someone else paid for. The infrastructure is the frontier model; Needle just listens.
Frontier models as factories: the strategy shift
Here's the editorial take that matters more than any benchmark. The default mental model of an AI product, for the last three years, has been: call OpenAI/Anthropic/Google forever, pay per token, and hope the prices come down. Needle is one of the first widely-visible examples of a different default — use the frontier model *once*, offline, to generate training data, and deploy a narrow distilled student that runs locally for free.
This is not a new idea. It is the textbook student–teacher distillation pattern that produced Gemma, Phi, and other small open models. What Needle changes is the task framing. Instead of distilling a generalist into a smaller generalist, Cactus distilled a generalist into a specialist: a model that does only one thing, on hardware it would never normally fit on.
Three implications for builders:
- Build-vs-buy flips again. If your product surface is narrow enough — and most consumer apps are — you may not need the API at all in steady state. You need the API to generate your training set, then you ship the student.
- The frontier API business has a new ceiling. Token revenue from "use my model in your hot path" is fundamentally undermined by "use my model once to teach yours." This is a long-term pressure on margins for the frontier labs.
- The interesting models aren't the biggest ones anymore. As we covered in the Kimi K2.6 review, open specialist models are closing the gap with closed generalists at a startling pace.
Needle matters less as a model than as a template. If you're building an AI feature in 2026 and you haven't asked "could we distill this into something we ship in the app bundle?" — start asking.
Limitations and caveats
Needle is a research preview shipping at 0.1.0, and the caveats are honest:
- Specialist, not generalist. Don't ask Needle to write code, summarize a document, or hold a conversation. It will not do those things at all.
- Synthetic-data ceiling. Training entirely on Gemini-generated data caps real-world generalization at "what Gemini thought looked plausible." Real user queries will surface gaps you'll need to fine-tune away on actual production traces.
- Mac/Linux first. Mobile deployment is on the roadmap, not in the box. You'll need to bring your own quantization and inference runtime for phones and wearables.
- Single-shot only. Needle picks one tool per inference call. Multi-step plans, conditional branches, and tool chaining are not in scope — those still belong to a frontier model.
- The 80/20 handoff isn't free. You still need a cloud model for the hard 20% and a confidence signal to decide when to escalate. Building that signal is the part of the system Cactus does not give you.
If you're shipping consumer AI today, that's a reasonable risk profile. If you're staffing an agent platform, treat Needle as a building block, not a product. As discussed in our piece on whether software engineering is dead, the role of the human builder is increasingly to compose these specialist pieces rather than write each one from scratch.