Model choice sounds like the part that should be easy. Pick the best one. Pick the fastest one. Pick the cheap one unless the expensive one is obviously better. That is a comforting way to think about it, mostly because it avoids the harder question.
What is the model being asked to do with the user's data?
That is where privacy starts in The Wheel: when data needs to be used. Privacy is the right to expose information deliberately, under conditions the user chose. Send this context here. Use it for this answer. Do not keep it afterward. Let this outside model read the minimum it needs. Let this application act with the facts the user chose to provide. The choice to reveal is the privacy right being exercised.
The routing question is which model should see which context for this particular job. A source-backed lookup, a messy draft, a comparison, a rewrite, and a judgment call do not need the same route. They do not ask the same thing of the model, and they do not warrant the same exposure.
The Wheel application is built for the work before a clean artifact exists: drafting from team knowledge, turning a codebase wiki into product language, testing an argument against the record, or deciding what matters in a pile of source material. AI is a partner in that space. The models handle retrieval, classification, drafting, reasoning, summarizing, checking, and rewriting. The human frames the work, pushes on the answer, and decides what becomes part of the record.
Running a model on their own hardware gives a user the tightest control, but the upkeep is not something most knowledge workers should have to carry for a private answer. The practical version is a model hosted inside a private environment, with visible logs for what was sent, what came back, and what was retained. For this kind of work, that model can be small and still do the job.
The hosted generative route today is Qwen3-4B-VL, running inside The Wheel's private cloud environment on modest hardware that scales itself. This is the small model used for generative answers, separate from the classification models and other non-generative pieces around it.
When a task can run there, The Wheel can verify how the model interaction is handled because it is our system. That does not make Qwen3-4B-VL the right answer for everything. It makes it one candidate in a decision about what the work needs, what context the task requires, and whether the base model clears the quality bar.
The test had to look like real work
I was not trying to prove that a small model could win a benchmark. I needed it to handle the work The Wheel exists for: thinking through a question, getting the answer or source you need from your own record, and using that information to make the next thing. A reply. A decision. A plan. A draft. A comparison. The everyday loop of asking your materials what they know and turning that into work.
The Wheel's query engine is tuned for the part where a human has to think, shape, choose, and decide what good enough means. Sometimes that is the work before the work, when you are trying to find the question. Sometimes it is the work after the work, when you have an answer and need to turn it into something usable. For some people, that is most of the job.
Privacy, speed, cost, and quality are not fixed dials you set once. They depend on the task and the user. The hosted generative model may be right for one question because the record already has the answer. A larger outside model may be right for another because the user wants deeper reasoning, a better draft, or a different quality bar. Pick wrong and you can lose on all four at the same time. The router has to make that call for each question, not for models in the abstract.
The data had to be built
Testing that way required data that is not for sale: thousands of examples of real knowledge work, with real questions, real documents, and real people in real roles. The only naturally occurring version of that is the material The Wheel is built specifically not to look at: what users store.
So I built it with Tonic's Fabricate. Ask a model for "a thousand workplace questions" and you get a thousand of the same beige sentence. Instead I built a small world and let the questions fall out of it.
Organizations first: companies, agencies, individual consultancies, research institutions. Then the roles inside them. Then the person in each role and how they type: the one who sends three words, the one who pastes a whole email thread and adds "thoughts?", the one who explains far too much. Then the documents that organization would have, then the questions someone in that chair would ask of them.
By the end a row was not a prompt floating in space. It was a situation. A specific person, at a specific organization, with a reason to be asking.
| Layer | Example row |
|---|---|
| Organization | Small team, shifting roadmap, investor notes, customer calls, and a codebase wiki that is ahead of the website. |
| Need | Needs to turn messy internal context into a decision, a draft, or a clear next step without pretending the record says more than it does. |
| Question | "Do we have enough support for this positioning, or am I making it up?" |
| Catch | The answer exists, but only if the model keeps source age, audience, and product reality separate. |
That layering matters because a question does not tell you what it needs until it is sitting in context. "What did we decide?" is a two-second lookup if the decision is in the attached notes. It is a real reasoning problem if it is smeared across three threads and a meeting that half-happened. Same five words. Different job entirely.
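The layered construction can be sketched in a few lines. This is a minimal illustration only, not Fabricate's API and not the generator used for the corpus: the `Situation` fields, the `WORLD` seed data, and `build_situations` are all hypothetical names chosen to mirror the layers described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """One test row: a person in a role, at an organization, asking for a reason."""
    organization: str
    role: str
    typing_style: str
    documents: tuple
    question: str


# Hypothetical seed data. A real corpus would generate each layer
# conditioned on the layer above it instead of hard-coding it.
WORLD = {
    "seed-stage startup": {
        "documents": ("investor notes", "customer call notes", "codebase wiki"),
        "roles": {
            "founder": [
                "Do we have enough support for this positioning, or am I making it up?",
            ],
        },
    },
    "research institution": {
        "documents": ("grant proposal", "lab meeting notes"),
        "roles": {
            "lab manager": ["What did we decide?"],
        },
    },
}

TYPING_STYLES = (
    "sends three words",
    "pastes a whole thread and adds 'thoughts?'",
    "explains far too much",
)


def build_situations(world, styles):
    """Walk the layers top-down so every question inherits its full context."""
    rows = []
    for organization, layer in world.items():
        for role, questions in layer["roles"].items():
            for style in styles:
                for question in questions:
                    rows.append(
                        Situation(organization, role, style, layer["documents"], question)
                    )
    return rows


rows = build_situations(WORLD, TYPING_STYLES)
```

Because each question is attached to an organization, a role, and a document set, the same five words can appear in two rows and still describe two different jobs.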
Leaderboards get you onto the field
The public leaderboards are genuinely useful for understanding overall model behavior and broad use cases. LMArena lets humans test prompts across models and vote on the answers they prefer. Artificial Analysis, Vellum, and llm-stats give you a way to compare models across broad signals like reasoning, coding, speed, and price. GPQA Diamond adds a hard reasoning benchmark to that picture. I used that kind of public signal to narrow the field.
But public benchmarks and The Wheel's corpus answer different questions. A leaderboard tells you what a model can do in general. It cannot tell you what it does on your work, for your users, under your constraints. It will not tell you whether that same model keeps a fact the user stated in their own question, says "that is not in here" instead of bluffing, or hands back a clean list when a list was the whole ask.
So the leaderboards decide who gets on the field. They shortlist the models worth running and expose obvious mismatches on speed, cost, or capability. The Wheel's corpus decides who plays, by running that shortlist against work that looks like our work and scoring it the way a user would.
The first cut died fast
I started small: lookup, summary, draft, comparison. If a question fit one of those buckets, the router would know where to send it. That held for a few rows, then broke. Two questions with the same label kept needing different models.
"Lookup" was supposed to be the easy case for a small model. Plenty of lookups scored badly, and the label was not lying about the topic. It was hiding the part that mattered: whether the user had stated a fact the model needed to keep instead of overwrite, or whether the honest answer was "that is not in here" and the model bluffed. The word lookup carries none of that.
| Same bucket | Question | Actual job |
|---|---|---|
| Lookup | "What did we decide?" | Return the decision from the attached notes. No reasoning needed. |
| Lookup | "What did we decide?" | Resolve three conflicting sources, find the current one, then answer. |
| Lookup | "Is this covered?" | Preserve the user's stated fact instead of overwriting it with outside assumptions. |
So I stopped tagging tasks with one word and started describing them by what moves the answer: how hard, whether it needs outside information or lives entirely in what the user handed over, whether it wants a straight answer or worked reasoning, and what the output should even be, prose or structure.
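One way to picture the shift from a single label to a set of axes is a small record type. This is a sketch under assumptions: the `TaskAxes` name, the enum values, and the specific scales are illustrative, not the actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ResponseMode(Enum):
    DIRECT = "direct"        # a straight answer
    REASONED = "reasoned"    # worked reasoning


class OutputFormat(Enum):
    PROSE = "prose"
    STRUCTURED = "structured"


@dataclass(frozen=True)
class TaskAxes:
    """Hypothetical description of a task by what moves the answer."""
    family: str
    difficulty: Difficulty
    needs_outside_info: bool
    response_mode: ResponseMode
    output_format: OutputFormat


# The same question, "What did we decide?", lands in the same family
# but needs different handling once the context is known.
in_the_notes = TaskAxes(
    "lookup", Difficulty.LOW, False, ResponseMode.DIRECT, OutputFormat.STRUCTURED
)
across_three_threads = TaskAxes(
    "lookup", Difficulty.HIGH, False, ResponseMode.REASONED, OutputFormat.STRUCTURED
)
```

Both rows would have been tagged "lookup" under the first scheme. With axes, they compare as different tasks, which is the distinction the router needs.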
Sometimes the paragraph is the problem
We have all gotten used to AI handing back a fluent, confident paragraph, and it is easy to mistake the polish for the product. If I ask what we decided about the migration, I do not want three graceful sentences. I want the decision and who made it, in something I can read in two seconds and act on. In that task, the paragraph is friction. It makes the model look thorough while making the answer harder to use.
But sometimes the paragraph is the entire point. When you are drafting a reply to an angry customer, a tidy list of facts is useless. The sentence is the deliverable. It is not facts over prose or prose over facts. The right output is a property of the task.
Once you can name what a task needs, a lot of the supposed gap between a small model and a frontier one turns out to be the frontier model doing work you never wanted. You can read the right output straight off the task, as long as the system does it before the model answers, not during. If you let the model decide whether you wanted a paragraph or a list, it has already made the call you wanted the system to make deliberately.
The prompt work had levels
The first runs started after I had narrowed the private candidate set to small models in the Qwen and Gemma families. I ran the same prompts across them, looking for the speed-and-quality tradeoff that made sense: fast enough to use in the product, good enough not to route outside by default, and small enough to run on infrastructure I could afford.
Then I tried static prompt variants: longer context windows, different system language, different penalties for over-answering, more or less instruction about source use. That helped, but it still treated the prompt as one object. The lookup row, the artifact row, and the evaluation row were getting the same kind of instruction even when they needed different behavior.
The next layer was backward-looking. For every row, after scoring, I could ask which prompt would have worked best on that exact example. That was useful because it showed the ceiling. It was not deployable, because production chooses the prompt before the answer exists.
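The gap between the ceiling and a deployable choice is simple arithmetic. The sketch below uses made-up scores and hypothetical prompt names purely to show the calculation: the row-by-row best case takes the maximum per row, while a static choice has to commit to one prompt for every row.

```python
from statistics import mean

# Hypothetical judge scores (0 to 1) per row and prompt variant.
scores = {
    "row-1": {"baseline": 0.4, "answer_first": 0.9, "evidence_split": 0.6},
    "row-2": {"baseline": 0.5, "answer_first": 0.5, "evidence_split": 0.8},
    "row-3": {"baseline": 0.7, "answer_first": 0.6, "evidence_split": 0.3},
}


def prompt_average(scores, prompt):
    """Average score if this one prompt were used on every row."""
    return mean(row[prompt] for row in scores.values())


def best_single_prompt(scores):
    """The best prompt you could pick in advance and apply everywhere."""
    prompts = next(iter(scores.values())).keys()
    return max(prompts, key=lambda prompt: prompt_average(scores, prompt))


def oracle_ceiling(scores):
    """Average score if the best prompt were picked per row, after scoring."""
    return mean(max(row.values()) for row in scores.values())


static_choice = best_single_prompt(scores)
static_score = prompt_average(scores, static_choice)
ceiling = oracle_ceiling(scores)
```

With these invented numbers the best static prompt averages about 0.67 and the per-row ceiling is 0.80. The difference is the room that rules chosen before generation can try to recover.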
The next step was turning the backward-looking winners into rules that can be applied before generation. Use the user's own words. Preserve facts the user supplied. For lookup, answer first and qualify second. For evaluation, separate evidence from judgment. For artifact work, produce the requested shape without inventing missing facts. For missing retrieval, say what can and cannot be known.
| Level | What changed | What it taught |
|---|---|---|
| Bare baseline | One plain task prompt | Where the small model starts before tuning. |
| Static variants | Longer prompts, different source rules, different answer budgets | A better prompt helps, but one prompt cannot fit every task. |
| Row-by-row best case | Pick the best prompt after seeing scores | The ceiling, not the production plan. |
| Task rules | Prompt behavior chosen from the task shape | The model improves when the task shape is decided before generation. |
| Prompt builder | Model-specific system prompt, user prompt, context format, guardrails | Prompting is infrastructure, not copywriting. |
The prompt builder is not a better paragraph glued onto the front of the question. It takes the user prompt, conversation context, retrieved context, labels that describe the task, selected model, the user's speed or quality preference, privacy setting, output requirements, and source-use rules, then builds the prompt in the shape that model expects.
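A stripped-down sketch of that idea follows. It covers only a few of the inputs listed above, and everything in it is an assumption made for illustration: the `PromptRequest` fields, the rule wording (paraphrased from the task rules earlier in this post), and the per-model context templates are hypothetical, not the production builder.

```python
from dataclasses import dataclass, field

COMMON_RULES = ["Use the user's own words.", "Preserve facts the user supplied."]
TASK_RULES = {
    "lookup": ["Answer first, qualify second."],
    "evaluate": ["Separate evidence from judgment."],
    "artifact": ["Produce the requested shape without inventing missing facts."],
}
MISSING_RETRIEVAL_RULE = "Say what can and cannot be known from the record."

# Hypothetical per-model context formats; real models may expect other shapes.
CONTEXT_TEMPLATES = {
    "default": "[source {i}] {text}",
    "hosted-small": '<source id="{i}">{text}</source>',
}


@dataclass
class PromptRequest:
    user_prompt: str
    task_family: str
    model: str
    retrieved_context: list = field(default_factory=list)
    output_format: str = "prose"


def build_prompt(req):
    """Assemble system and user messages from the task shape, before generation."""
    rules = COMMON_RULES + TASK_RULES.get(req.task_family, [])
    if not req.retrieved_context:
        rules = rules + [MISSING_RETRIEVAL_RULE]
    system = "\n".join(f"- {rule}" for rule in rules)
    system += f"\nOutput format: {req.output_format}"

    template = CONTEXT_TEMPLATES.get(req.model, CONTEXT_TEMPLATES["default"])
    context = "\n".join(
        template.format(i=i, text=text)
        for i, text in enumerate(req.retrieved_context, start=1)
    )
    question = f"Question: {req.user_prompt}"
    user = f"{context}\n\n{question}" if context else question
    return {"system": system, "user": user}
```

The point of the shape is that nothing about the answer is needed to build the prompt: the task family, the model, and the presence or absence of retrieved context decide the rules.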
By the end, the prompts did not look like one master instruction with different model names swapped in. They looked more like small packets built for the job: same user request, different structure depending on the model, the context, and the answer shape. The rows below are examples of prompt shapes, not fixed rules. Not every lookup goes to Qwen3-4B-VL, not every evaluation goes to Kimi, and not every draft goes to MiniMax.
| Example route | Prompt packet | Why that shape |
|---|---|---|
| Qwen3-4B-VL lookup | User question, selected source chunks, direct answer mode, answer-first instruction, preserve user facts, say when the record is missing. | The hosted model needs a narrow lane: use the user's words, answer from the record, avoid filling gaps from prior knowledge. |
| Kimi evaluate | Question, relevant context, evaluation criteria, compare-or-judge mode, concise rationale, separate evidence from recommendation. | The model needs to weigh evidence without turning the answer into a memo. |
| DeepSeek reasoning | Question, context bundle, task axes, reasoning-heavy mode, requested output format, instruction to show only the useful conclusion. | Some cells need more reasoning pressure, but the user still needs the final answer in the requested shape. |
| MiniMax draft | User goal, supplied facts, tone or format constraints, draft shape, unsupported-claim rule, source-use rules. | For a draft, the sentence is the work. The prompt has to protect facts while still letting the model write. |
Each model needed its own prompt
After Qwen3-4B-VL, I ran the same experiment pattern across outside models and, where useful, across more than one provider for the same model. Provider mattered because the same model can behave differently once speed, cost, reliability, and serving limits are part of the product decision. The shortlist narrowed to GPT-OSS 120B and MiniMax M2.7 through SambaNova, Nemotron 120B and Kimi K2.6 through Baseten, and DeepSeek V4 Pro and DeepSeek V4 Flash through DeepInfra. Anthropic Haiku was in the loop as the judge. Closed frontier models were tested only on the combinations where none of the other models met the bar.
The mistake would have been to take Qwen3-4B-VL's best prompt and hand it to every other model. Qwen3-4B-VL needed help preserving user-supplied facts. Kimi failed in different ways. DeepSeek needed different answer-shape pressure. Some models over-structured. Some models wrote too much. Some were fast and cheap but only good in specific cells. The policy could be based on the task, but the prompt had to be model-specific.
| Model family | Where it mattered | Why it was not universal |
|---|---|---|
| Qwen3-4B-VL | Hosted route for many lookup and meta cells | Some task cells still misstepped even with a better prompt. |
| Kimi K2.6 | Strong evaluate and compare behavior | Using it everywhere would externalize work the hosted model could handle. |
| DeepSeek Pro and Flash | Different quality, speed, and cost lanes | Latency and task fit varied by cell. |
| MiniMax, GPT-OSS, Nemotron | Specific fallback and specialty cells | Aggregate score was less important than where each model cleared the bar. |
The judge had to be chosen too
The judging layer went through its own version of this. I started with Kimi because I wanted a strong, independent judge for the other model families. Kimi comes from Moonshot AI. At first I could not find a US-hosted Kimi path that was reliable enough to be part of the candidate list, but Moonshot's own version, hosted in China, performed very well as a judge. That would not have been an acceptable path for real user context. For this test, the data was synthetic, so I could use it to compare model behavior. For a while, the practical answer was to run Kimi directly from the source and use it to judge the non-Kimi families.
During that process, I got an email from Moonshot pointing me to Baseten as their US partner. I tried the Baseten-hosted Kimi models, and Kimi cleared the bar to be included as a candidate model. Since the test data was synthetic, I could choose the judge for comparison quality. I did not want Kimi judging a bakeoff in which Kimi was also competing, so the main judge moved to Anthropic Haiku. With optimized judging prompts, Haiku performed best in the closed frontier class when I looked at judging quality, speed, and cost together.
The final bakeoff used multiple prompt versions for each candidate model, Haiku as a judge scoring from the user's perspective, and a rule-based judge alongside it for the pieces that should not depend on model taste: required elements, forbidden claims, source-use checks, and obvious format failures. I treated the judge as another part of the system, not as truth.
| Stage | What I used | Why it changed |
|---|---|---|
| Early judge | Kimi from Moonshot | Strong enough to evaluate other families, but not the right judge once Kimi entered the candidate set. |
| Main bakeoff judge | Anthropic Haiku | With optimized judging prompts, it performed best in the closed frontier class across judging quality, speed, and cost. |
| Deterministic checks | Rule-based judge | Some failures should be checked mechanically: missing required elements, forbidden claims, source-use failures, and format violations. |
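The mechanical checks can be very plain. The sketch below is an assumed shape, not the actual rule-based judge: `RuleSpec` and `rule_judge` are hypothetical names, and the substring matching is deliberately naive.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RuleSpec:
    """Hypothetical per-row expectations that do not depend on model taste."""
    required: list = field(default_factory=list)
    forbidden: list = field(default_factory=list)
    expect_json: bool = False


def rule_judge(answer, spec):
    """Return a list of failures; an empty list means every check passed."""
    failures = []
    lowered = answer.lower()
    for phrase in spec.required:
        if phrase.lower() not in lowered:
            failures.append(f"missing required element: {phrase}")
    for phrase in spec.forbidden:
        if phrase.lower() in lowered:
            failures.append(f"forbidden claim present: {phrase}")
    if spec.expect_json:
        try:
            json.loads(answer)
        except json.JSONDecodeError:
            failures.append("format violation: not valid JSON")
    return failures
```

For example, `rule_judge("We chose Postgres.", RuleSpec(required=["postgres"], forbidden=["guaranteed"]))` returns an empty list. Checks like these run alongside the model judge and catch failures that should never be a matter of opinion.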
The route policy came last
Only after that did routing become a policy problem. A question comes in. The system first describes the work: task family, difficulty, whether it needs retrieval, whether it wants reasoning or a short answer, and what shape the output should take. Then it checks which models have cleared the bar for that kind of work, applies the user's privacy rules, and ranks what is left by the mode the user or product selected: default, speed, quality, or privacy.
The policy starts with the work, then the user's privacy constraints. If Qwen3-4B-VL can do the job well enough, and the user or product wants a route The Wheel can verify directly, use it. If not, choose the best outside model that fits the user's rules and has already cleared that kind of task. If the record has to stay inside The Wheel, keep it inside and record what would have been chosen without that constraint. If the user chooses an outside route, make the exposure explicit: what context moved, which destination received it, why it was sent, and what the retention rules are. It is the same restraint problem I wrote about in The Hardest Part of AI Is Shutting Up, applied to model choice.
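Making exposure explicit implies a record with those four parts. A minimal sketch, assuming a hypothetical `ExposureRecord` type and storing identifiers of the context that moved instead of the content itself:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExposureRecord:
    """Hypothetical log entry written whenever context leaves the private environment."""
    context_sent: tuple   # identifiers of the chunks that moved, not their text
    destination: str      # which outside route received them
    reason: str           # why the outside route was chosen
    retention: str        # the retention rule that applies at the destination
    sent_at: str          # UTC timestamp in ISO 8601 format


def record_exposure(chunk_ids, destination, reason, retention):
    """Build the record at send time so the log matches what actually moved."""
    return ExposureRecord(
        context_sent=tuple(chunk_ids),
        destination=destination,
        reason=reason,
        retention=retention,
        sent_at=datetime.now(timezone.utc).isoformat(),
    )


example = record_exposure(
    ["meeting-notes#3"],
    "outside-model",
    "hosted route did not clear the bar for this task",
    "not retained after the response",
)
```

Calling `asdict(example)` gives a plain dictionary that could be shown to the user or written to a log.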
I spent most of the optimization work on US-hosted open or open-weight models. By that I mean models whose weights are available, or whose behavior can be tested outside one closed provider. If one of those models clears the quality bar, or can clear it with a better prompt, the speed and cost can improve a lot. Closed frontier models, the large proprietary models, still belong in the system for cases where nothing else holds up: a hard synthesis across conflicting sources, a high-stakes draft where tone and accuracy both matter, or a question where smaller routes keep missing a user-supplied fact. Once two routes clear the bar under the user's privacy rules, the decision is no longer only "best model." It is whether a small judge-score gain is worth a much higher price, slower response, and broader exposure. A lot of the time, the difference is style, and the answer is no.
In the original validation panel, the hosted model handled a substantial share of the work. Kimi handled another large share. DeepSeek, MiniMax, GPT-OSS, and Nemotron each earned smaller slices. A small remainder needed a larger model or a path that asked for more source material. The exact mix will change with real traffic, but the shape is the product: choose the route that fits the work and make any exposure explicit, scoped, and recorded.
| Step | Decision | Recorded |
|---|---|---|
| Classify | Task family, difficulty, retrieval need, response mode, output format | Axes and probabilities |
| Check hosted route | Does Qwen3-4B-VL do this task well enough? | Qualified or failed, with reason |
| Apply policy | Privacy, speed, quality, or default mode | Minimum quality bar, failure limit, selected policy |
| Choose model | Model that clears the job under the user's privacy, speed, quality, and cost preference | Model, route, latency, cost estimate, fallback |
| Escalate if needed | Use a closed frontier model only when smaller or US-hosted open-model routes do not hold up | Escalation reason and what context was exposed |
| Build prompt | Use model-specific prompt shape and context format | Prompt rules and source-use rules |
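The first steps of the table can be read as filter, constrain, rank. The sketch below is one possible reading under stated assumptions: the candidate names and numbers are invented, and the sort keys for each mode are illustrative choices, not the actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    hosted: bool        # True if it runs inside the private environment
    quality: float      # judge score on the validation rows
    latency_s: float
    cost: float
    cleared: frozenset  # task families where it cleared the quality bar


# Assumed orderings: default and privacy prefer the hosted route first.
MODE_KEYS = {
    "default": lambda c: (not c.hosted, c.cost, c.latency_s),
    "privacy": lambda c: (not c.hosted, c.cost),
    "speed": lambda c: (c.latency_s, c.cost),
    "quality": lambda c: (-c.quality, c.cost),
}


def choose_route(task_family, candidates, mode="default", keep_inside=False):
    """Filter by cleared task family, apply the privacy rule, rank by mode."""
    cleared = [c for c in candidates if task_family in c.cleared]
    unconstrained = sorted(cleared, key=MODE_KEYS[mode])
    allowed = [c for c in unconstrained if c.hosted] if keep_inside else unconstrained
    return {
        "selected": allowed[0].model if allowed else None,
        # Recorded even when the privacy rule blocks it.
        "without_constraint": unconstrained[0].model if unconstrained else None,
        "exposes_context": bool(allowed) and not allowed[0].hosted,
    }


CANDIDATES = [
    Candidate("hosted-small", True, 0.80, 1.0, 0.1, frozenset({"lookup"})),
    Candidate("outside-a", False, 0.90, 2.0, 1.0, frozenset({"lookup", "evaluate"})),
    Candidate("outside-b", False, 0.85, 0.5, 0.5, frozenset({"evaluate"})),
]
```

With these invented candidates, a default-mode lookup stays on the hosted route, a quality-mode lookup goes outside, and an evaluation that must stay inside selects nothing while still recording which outside route would have been chosen.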
The factory makes the whole thing replaceable
The implementation piece is the factory. The service asking for AI does not name a model. It names a use case. The catalog maps that use case to a provider, model, context setup, and prompt rules. The prompts live there too, versioned. When a prompt changes, the version changes in one place. When a model changes, the catalog entry changes. The service calling it does not need to know.
Repeatability comes from that separation. When a new model lands, open or closed, it can be added to the catalog, run against the same test rows with its own prompt builder, judged with the same context, and compared against the existing routes. If it wins, the route table changes. If it does not, nothing changes.
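The separation between use case and model can be shown with a small lookup. This is a sketch only: the use-case names, provider labels, prompt versions, and the `resolve` and `swap_route` helpers are hypothetical stand-ins for the catalog described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CatalogEntry:
    provider: str
    model: str
    prompt_version: str
    context_format: str


# Hypothetical entries; callers only ever see the keys.
CATALOG = {
    "lookup.direct": CatalogEntry("private-cloud", "Qwen3-4B-VL", "lookup-v3", "source-chunks"),
    "draft.reply": CatalogEntry("outside-provider", "example-draft-model", "draft-v1", "facts-then-goal"),
}


def resolve(use_case, catalog=CATALOG):
    """A service names a use case and gets back the current route for it."""
    try:
        return catalog[use_case]
    except KeyError:
        raise LookupError(f"no route registered for use case {use_case!r}") from None


def swap_route(catalog, use_case, **changes):
    """Return a new catalog with one entry updated; callers are untouched."""
    updated = dict(catalog)
    updated[use_case] = replace(catalog[use_case], **changes)
    return updated


new_catalog = swap_route(CATALOG, "draft.reply", model="other-model", prompt_version="draft-v2")
```

A service that calls `resolve("draft.reply", new_catalog)` gets the new model and prompt version without any change to its own code, which is what makes a route replaceable.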
The simulated rows make the starting point solid, but they are not the finish line. User feedback and usage will test the assumptions we used to choose the routing, prompts, and fallback paths as more question types show up. Real work will shape the system in ways a generated corpus cannot, because users will ask stranger questions, bring messier context, and notice different failures than a test set can predict. Once the beta opens in a few weeks, we will publish more of the testing behind these choices.
For most real knowledge work, a small hosted model with the right prompt is not a compromise. It is faster and cheaper, uses less compute, and when it clears the job, The Wheel can verify the handling inside its own private cloud environment. The frontier models are still there for the tasks that need them.
The Wheel has to ask for the smallest exposure that can do the job: enough context for the destination to help, not a silent transfer of the user's record. The user still has to decide whether what came back was the answer they needed.
This post is part of the build journal. For the adjacent product decisions, read The Hardest Part of AI Is Shutting Up, Privacy Is Not Protection, and The Unsolved Search.