Designing a Token-Efficient MCP Server: the OctoPerf Approach
In the first two articles of this series we showed what the OctoPerf MCP Server does. This one is for the builders: how we designed it, and specifically how we kept its token cost under control.
Because here is the thing nobody tells you when you start writing a Model Context Protocol server: the hard part is not exposing your API to an LLM. The hard part is not exposing too much of it. Every byte a tool returns lands in the model's context window, where it costs money, adds latency, and dilutes the model's attention. A server that naively mirrors a REST API produces an agent that is expensive, slow, and confused.
This article walks through the five patterns we applied to avoid that fate. None of them is specific to load testing: if you are building an MCP server for your own product, they should transfer directly.
Table of Contents
- The Context Window Is the Scarce Resource
- Pattern 1 - Presigned URLs: Keep the Bytes Out of the Conversation
- Pattern 2 - Return Listings, Not Entities
- Pattern 3 - Patch, Don't Replace
- Pattern 4 - Validate Globally, Nit-Pick Locally
- Pattern 5 - Read Reports from the General to the Specific
- What It Adds Up To
- Conclusion
The Context Window Is the Scarce Resource

A REST API and an MCP server look deceptively similar: both expose operations over HTTP, both move JSON around. But they serve radically different consumers. A web UI fetches a 200 KB entity, renders the 2% it needs, and throws the rest away for free. An LLM cannot throw anything away: everything a tool returns is read, token by token, on every subsequent reasoning step of the conversation.
That changes the economics in three ways:
- Money: tool results are input tokens, and in an agentic loop they are re-read at every turn, so a single oversized response is billed many times,
- Latency: bigger contexts mean slower responses, and an agent chains dozens of tool calls per task,
- Quality: this is the sneaky one. Long contexts degrade reasoning; a model digging through 50,000 tokens of irrelevant JSON is measurably worse at spotting the one field that matters.
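To make the "billed many times" point concrete, here is a back-of-the-envelope sketch. It assumes, as a simplification, that the whole context is re-sent as input on every remaining turn and that no prompt caching or summarization applies; the class name and the numbers are illustrative, not measurements.

```java
/** Back-of-the-envelope cost of a tool result that stays in the context. */
final class ContextCost {

    /**
     * Input tokens billed for one tool result over the rest of the conversation,
     * assuming the full context is re-sent on each remaining turn
     * (no caching, no summarization).
     */
    static long billedInputTokens(long resultTokens, int remainingTurns) {
        if (resultTokens < 0 || remainingTurns < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return resultTokens * remainingTurns;
    }

    public static void main(String[] args) {
        // Hypothetical figures: a 50,000-token entity vs. a 60-token listing,
        // each re-read on 30 further turns.
        System.out.println(billedInputTokens(50_000, 30)); // prints 1500000
        System.out.println(billedInputTokens(60, 30));     // prints 1800
    }
}
```

Under those assumptions the oversized result is not paid for once but once per turn, which is why trimming a response early in the conversation matters most.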
A load-testing platform is close to a worst-case scenario here. A Virtual User is a deeply nested action tree, a test run produces gigabytes of results, and a single HTTP response body captured during validation can be larger than the entire context window. Bridging OctoPerf to an LLM without a token strategy was never going to work. So we made the token cost a first-class design constraint, and it shaped the server's whole API surface.
Pattern 1 - Presigned URLs: Keep the Bytes Out of the Conversation

Load testing is a file-heavy discipline: CSV datasets go up, JTL results, HAR archives, Playwright traces and PDF reports come down. Pushing file content through tool results would be absurd: a modest 2 MB results file is roughly half a million tokens, several times the context budget of the conversation, spent on bytes the model would mostly not read.
Our answer is presigned URLs. A file tool never returns content; it returns instructions for fetching it:
/**
 * Instructions handed to the MCP host so its code interpreter
 * (or the user's browser) can GET a file directly from the
 * OctoPerf REST API, bypassing the MCP process for the bytes.
 */
public record PresignedDownload(
    String url,          // embeds a single-use ephemeral token
    String method,       // "GET"
    DateTime expiresAt,  // ~5 minutes
    String instructions) { ... }
The url embeds a single-use, short-lived token, so the link can be handed around safely: it works once, then dies. The agent host's code interpreter (or a plain curl in Claude Code) fetches the bytes directly from the OctoPerf REST API, and the LLM only ever sees the few dozen tokens of the envelope. Uploads work symmetrically with a PresignedUpload that the client POSTs to.
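The "works once, then dies" behaviour can be sketched with an in-memory token store. This is only an illustration under stated assumptions: `EphemeralTokens`, `issue` and `redeem` are hypothetical names, not OctoPerf's implementation, and a production service would likely persist grants and use longer random tokens.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory sketch of single-use, short-lived download tokens. */
final class EphemeralTokens {

    private record Grant(String fileId, Instant expiresAt) {}

    private final Map<String, Grant> grants = new ConcurrentHashMap<>();
    private final Duration ttl;

    EphemeralTokens(Duration ttl) {
        this.ttl = ttl;
    }

    /** Issues a token for one file; a file tool would embed it in the returned URL. */
    String issue(String fileId, Instant now) {
        String token = UUID.randomUUID().toString();
        grants.put(token, new Grant(fileId, now.plus(ttl)));
        return token;
    }

    /**
     * Redeems a token. The grant is removed atomically, so a second call with
     * the same token yields an empty result even under concurrent requests.
     */
    Optional<String> redeem(String token, Instant now) {
        Grant grant = (token == null) ? null : grants.remove(token);
        if (grant == null || now.isAfter(grant.expiresAt())) {
            return Optional.empty(); // unknown, already used, or expired
        }
        return Optional.of(grant.fileId());
    }
}
```

Passing the clock in as a parameter keeps the expiry logic easy to test; the REST endpoint behind the URL would call `redeem` before streaming any bytes.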
The payoff goes beyond cost. Because the bytes bypass the model entirely, file size stops being the LLM's problem: the agent can pull a 50 MB Playwright trace, unzip it locally, and grep the one failing selector, something no context window could ever absorb. The conversation stays light while the heavy lifting happens where it belongs.
Pattern 2 - Return Listings, Not Entities
Files were the obvious offender; entities are the insidious one. A reflex to unlearn: a tool that lists or creates entities should not return the entities. In OctoPerf's REST API, a VirtualUser carries its full recursive children action tree, which for a recorded checkout journey easily reaches thousands of lines of JSON. An agent that calls list_virtual_users to find an id does not need any of it.
So every list/create/import tool returns a compact projection instead. Here is the actual VirtualUserListing from the server's source:
/**
 * Compact projection of an OctoPerf VirtualUser returned by every
 * MCP tool that creates or lists VUs.
 *
 * Drops the heavy fields that aren't useful to an LLM (children tree,
 * userId/projectId already known by the caller, type discriminator)
 * while keeping the metadata the agent typically reasons about.
 */
public record VirtualUserListing(
    String id,
    String name,
    String description,
    Set<String> tags,
    DateTime created,
    DateTime lastModified,
    String url) { ... }
The selection logic is in the Javadoc: keep what the agent reasons about (the id to chain into the next tool call, the human-readable name, tags, timestamps), drop what it does not (the tree, the ids it already knows, internal discriminators). The server has fifteen of these projections, one per entity family: workspaces, projects, scenarios, bench reports, correlation rules, scheduled jobs, HTTP servers, variables, and so on.
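As a rough illustration of how such a projection might be built, the sketch below maps a full entity to a listing. `FullVirtualUser`, `VuListing` and the `/virtual-users/` URL path are hypothetical stand-ins chosen for the example, with fewer fields than the real records.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the full REST entity, reduced to what the sketch needs. */
record FullVirtualUser(
    String id,
    String projectId,
    String name,
    Set<String> tags,
    List<Object> children) {}

/** Trimmed-down listing in the spirit of VirtualUserListing. */
record VuListing(String id, String name, Set<String> tags, String url) {

    /**
     * Copies only the fields the agent reasons about. The children tree and
     * the projectId are deliberately left behind. The URL layout is assumed.
     */
    static VuListing of(FullVirtualUser vu, String uiBaseUrl) {
        return new VuListing(
            vu.id(),
            vu.name(),
            Set.copyOf(vu.tags()),
            uiBaseUrl + "/virtual-users/" + vu.id());
    }
}
```

Because the listing is a separate type rather than the entity with fields nulled out, a heavy field cannot leak into a tool result by accident: it simply has nowhere to go.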
The last field deserves a special mention. Every listing carries a url deep-link to the matching page in the OctoPerf UI. It costs a handful of tokens and buys two things: the agent can hand the user a clickable link whenever it summarizes a result, and more subtly, it gives the agent a graceful exit. When a question is better answered by an interactive chart than by another round of tool calls, the agent can simply point the user at the right page instead of burning tokens trying to reproduce it.
The ballpark: a full VirtualUser entity weighs tens of thousands of tokens; its listing weighs about sixty. On a list_virtual_users over a busy project, the difference is at least two orders of magnitude, on the very first call of the conversation.
Pattern 3 - Patch, Don't Replace
Reading was the easy half. Editing is where a naive design really bleeds tokens: if the only write tool is update_virtual_user(fullEntity), then renaming one action in a 500-action tree forces the agent to read the whole tree, regenerate it entirely with one field changed, and send it back. That is two full copies of the entity passing through the context, with a real risk of the model mangling a field it should not have touched.
Instead, every entity family gets a patch_* tool built on RFC 6902 JSON Patch. The agent sends only the operations:
[
  { "op": "replace", "path": "/children/3/name", "value": "Submit payment" },
  { "op": "add", "path": "/children/7/enabled", "value": true }
]
A surgical edit costs a few dozen tokens regardless of how big the entity is. Server-side, the patch is applied to the entity's JSON representation, then re-deserialized through Jackson before persisting: a round-trip validation that rejects any patch producing a structurally invalid entity. The agent can be wrong, but it cannot corrupt your script.
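The apply-then-revalidate idea can be sketched without any library. The example below is a simplification: it implements only the `replace` operation of RFC 6902 on a tree of `Map`/`List` values, and `toAction` is a hand-written stand-in for the Jackson round trip described above. `MiniPatch` and `Action` are hypothetical names.

```java
import java.util.List;
import java.util.Map;

/** Simplified RFC 6902 "replace" on a JSON-like tree of Map / List / scalar values. */
final class MiniPatch {

    /** Hypothetical typed entity used for the validation round trip. */
    record Action(String name, boolean enabled) {}

    /** Replaces the value at an RFC 6901 pointer; the target must already exist. */
    @SuppressWarnings("unchecked")
    static void replace(Object root, String pointer, Object value) {
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException("unsupported pointer: " + pointer);
        }
        String[] tokens = pointer.substring(1).split("/", -1);
        Object node = root;
        for (int i = 0; i < tokens.length - 1; i++) {
            node = child(node, unescape(tokens[i]), pointer);
        }
        String last = unescape(tokens[tokens.length - 1]);
        child(node, last, pointer); // throws if the target does not exist
        if (node instanceof Map<?, ?>) {
            ((Map<String, Object>) node).put(last, value);
        } else {
            ((List<Object>) node).set(Integer.parseInt(last), value);
        }
    }

    /** Resolves one reference token against an object member or an array index. */
    private static Object child(Object node, String token, String pointer) {
        if (node instanceof Map<?, ?> map && map.containsKey(token)) {
            return map.get(token);
        }
        if (node instanceof List<?> list) {
            int index = Integer.parseInt(token);
            if (index >= 0 && index < list.size()) {
                return list.get(index);
            }
        }
        throw new IllegalArgumentException("path not found: " + pointer);
    }

    /** RFC 6901 unescaping: "~1" becomes "/", then "~0" becomes "~". */
    private static String unescape(String token) {
        return token.replace("~1", "/").replace("~0", "~");
    }

    /** Stand-in for the round trip: rebuild the typed entity or reject the patch. */
    static Action toAction(Map<String, Object> json) {
        Object name = json.get("name");
        Object enabled = json.get("enabled");
        if (name instanceof String n && enabled instanceof Boolean e) {
            return new Action(n, e);
        }
        throw new IllegalArgumentException("patched entity is not a valid Action");
    }
}
```

In this sketch the tree passed to `replace` is meant to be a working copy: the stored entity would only be overwritten once `toAction` (or, in the real server, deserialization) succeeds on the patched result.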
There is a catch, though: to write a correct patch against a polymorphic action tree, the model needs to know the shape of every node type. Guessing burns tokens in failed attempts. So the server publishes its entity schemas as MCP resources (JSON Schema 2020-12, one oneOf branch per subtype): octoperf://schema/vu, octoperf://schema/scenario, and friends, with a plain-HTTP fallback for clients that do not read MCP resources. The agent loads the schema once, on demand, instead of rediscovering field names by trial and error. And when a patch does fail validation, the error message points back at the relevant schema, turning the retry into a one-shot fix.
Pattern 4 - Validate Globally, Nit-Pick Locally

The patterns above are generic. The next two are about shaping workflows, and the first concerns Virtual User validation.
A validation run replays the script and captures, for every action, four HTTP entities: the request as recorded, the request as replayed, and both responses. For a 24-action journey, that is easily megabytes of payload. The one thing the server must not do is hand all of it to the model at once.
So the validation API is deliberately layered:
- get_virtual_user_validation_index returns one tiny entry per action: success and failure counts, plus timestamps. No bodies. For a 24-action VU this is a few hundred tokens, and it is usually enough to classify the failures into groups (auth, data, server-side...),
- get_validation_failure_detail fetches the four HTTP entities for one representative action of a group, the few KB that confirm or refute the diagnosis,
- fetch_validation_http_body goes one level deeper and retrieves a single body of a single exchange (recorded vs replayed, request vs response), for the cases where one side fits in context but both would not.
The agent reads like a good engineer debugs: global picture first, then one representative failure, then one specific body if needed. The validation triage skill we demonstrated in part 1 is precisely this discipline written down; the layered API is what makes it cheap. Triaging a red validation typically costs a few thousand tokens instead of the hundreds of thousands a "return everything" design would burn.
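The "index first, one representative per group" discipline can be sketched as follows. The record shape and method names are assumptions made for the example (the real index also carries timestamps), and the grouping itself is left to the agent: the code only shows that detail calls scale with the number of groups, not the number of failures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the index-then-detail triage order. */
final class ValidationTriage {

    /** Hypothetical index entry: counts only, no HTTP bodies. */
    record IndexEntry(String actionId, int successes, int failures) {}

    /** Step 1: from the cheap index, keep only the actions that failed. */
    static List<String> failingActions(List<IndexEntry> index) {
        return index.stream()
            .filter(entry -> entry.failures() > 0)
            .map(IndexEntry::actionId)
            .toList();
    }

    /**
     * Step 2: given failure groups (label to action ids) decided by the agent,
     * pick the first action of each group as the only one worth a detail call.
     */
    static Map<String, String> representatives(Map<String, List<String>> groups) {
        Map<String, String> picks = new LinkedHashMap<>();
        groups.forEach((label, actionIds) -> {
            if (!actionIds.isEmpty()) {
                picks.put(label, actionIds.get(0));
            }
        });
        return picks;
    }
}
```

With three failing actions sorted into two groups, the agent would make two detail calls instead of three; on a long journey with many similar failures the saving grows accordingly.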
Pattern 5 - Read Reports from the General to the Specific
The same layering applies to result analysis, with one extra twist: an OctoPerf bench report is a polymorphic document holding twenty-plus widget types, and the data behind a single line chart of a one-hour run is thousands of points. Serializing a full report would be both enormous and useless.
So there is no get_full_report tool, by design. Instead, get_bench_report returns the report's structure (which widgets exist, with their ids), and each widget family has its own narrow value tool: get_report_insights, get_report_summary_values, get_report_errors, get_report_line_chart_values, get_report_top_values, and so on.
The reading order is encoded in the scenario diagnosis and bench-report skills, and it always flows from the general to the specific:
- start with Insights: one call, up to ~15 pre-computed heuristics tagged by severity, the platform's own classification of the run,
- then the summary: a dozen test-wide numbers (error rate, percentiles, throughput) that decide which class of problem you are in,
- only then drill into the widget that the global view designates: the error table if errors dominate, the percentile chart if response times drift, the per-action top if one endpoint stands out.
Most diagnoses never need the expensive widgets at all: the insights and the summary, a few hundred tokens combined, answer "is this run healthy and if not, where does it hurt". The detailed tools exist for the minority of cases that warrant them. And when the user wants to explore rather than ask, the deep-link from Pattern 2 hands them the interactive report, where exploring is free.
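The general-to-specific order amounts to a small decision step after the summary. The sketch below is purely illustrative: the `Summary` fields, the baseline comparison and the thresholds are assumptions, as is the mapping of a response-time drift to get_report_line_chart_values. In the actual server this judgment lives in the skills, not in code.

```java
import java.util.Optional;

/** Sketch of "summary first, then at most one targeted widget read". */
final class ReportReadingOrder {

    /** Hypothetical test-wide figures, as a summary read might provide them. */
    record Summary(double errorRatePercent, double p95Millis, double p95BaselineMillis) {}

    /**
     * Chooses the next widget tool to call, or empty when the summary alone
     * looks healthy and no expensive read is warranted.
     */
    static Optional<String> nextTool(Summary s, double maxErrorRatePercent, double maxP95Ratio) {
        if (s.errorRatePercent() > maxErrorRatePercent) {
            return Optional.of("get_report_errors"); // errors dominate
        }
        if (s.p95Millis() > s.p95BaselineMillis() * maxP95Ratio) {
            return Optional.of("get_report_line_chart_values"); // response times drift
        }
        return Optional.empty();
    }
}
```

The point of the shape is that the expensive tools are only ever reached through a cheap gate, so a healthy run never pays for chart data.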
What It Adds Up To
Take the workflow from part 1 of this series: import a HAR, triage a red validation, auto-correlate, re-validate, run a 500-user scenario, diagnose the result. With a naive mirror of the REST API, the entity reads alone (full VU trees, full validation payloads, full report data) would blow through the context window before the workflow's midpoint, forcing summarization, losing precision, multiplying cost.
With the five patterns combined, the same workflow holds comfortably in a single conversation, and the agent's context contains almost nothing but signal: listings, indexes, one confirmed failure detail, patch operations, insight verdicts. That is what makes the chat-driven experience of the previous articles actually viable in production, not just in a demo.
If you are building your own MCP server, the patterns to steal are honestly simple:
- Move bytes out of band: presigned URLs cost dozens of tokens, files can cost millions,
- Project your entities: return what the model reasons about, never what your UI renders,
- Edit by patch, validate server-side: publish your schemas so the model patches right the first time,
- Layer your reads: index before detail, detail before body,
- Encode the reading order: tools define what is possible, skills define what is wise.
Conclusion
The Model Context Protocol makes connecting an LLM to a product almost trivially easy, and that is exactly why design discipline matters: the protocol will happily let you ship a token furnace. Treating the context window as the scarce resource it is changed every API decision we made, and the result is an agent that stays fast, cheap and sharp across a full performance-testing session.
The OctoPerf MCP Server is live, the skills are on GitHub, and the first two parts of this series show it all in action. If you are building an MCP server of your own and these patterns help, or if you have found better ones, we would genuinely love to compare notes.