
When you use AI agents for document analysis, volume quickly becomes a problem. Reading a single 1,000-line file consumes roughly 10,000 tokens, and token consumption costs both money and time. Real-world codebases with dozens or hundreds of files can easily exceed 100,000 tokens when the whole thing must be considered, yet the agent has to read all of it, comprehend it, and work out how the files relate to one another. Costs multiply rapidly when a task requires multiple passes over the same documents, say one pass to map the structure and another to mine the details.

Matryoshka is a document analysis tool that achieves over 80% token savings while enabling interactive, exploratory analysis. Its key insight is to save tokens by caching earlier analysis results and reusing them, so the same document lines never need to be processed twice. The ideas draw on recent research into recursive language models and retrieval-augmented generation, with a focus on efficiency. We'll see how Matryoshka unifies these ideas into one system that maintains a persistent analytical state. Finally, we'll take a look at some real-world results analyzing the anki-connect codebase.


The Problem: Context Rot and Token Costs

A common task is to analyze a codebase to answer a question such as “What is the API surface of this project?” Such work includes identifying and cataloguing all the entry points exposed by the codebase.

Traditional approach:

  1. Read all source files into context (~95,000 tokens for a medium project)
  2. The LLM analyzes the entire codebase’s structure and component relationships
  3. For follow-up questions, the full context is round-tripped every turn

This creates two problems:

Token Costs Compound

The entire context must be sent to the API on every turn. In a 10-turn conversation about a 7,000-line codebase, the system might process close to a million tokens, and most of them are the same document contents being dutifully resent over and over: every new question carries the same core code along with it. This redundancy is a massive waste. It forces the model to re-process identical blocks of text instead of concentrating on what's actually new.
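
To make that concrete, here is a rough back-of-the-envelope model in TypeScript. The ~10 tokens per line figure comes from the estimate above; the per-turn question/answer overhead is an assumption for illustration.

// Naive approach: the full codebase plus the growing conversation history is
// resent to the API on every turn.
const tokensPerLine = 10;        // ~10,000 tokens per 1,000-line file
const codebaseLines = 7_000;
const turns = 10;
const qaPerTurn = 2_000;         // assumed tokens of new question/answer text per turn

const documentTokens = codebaseLines * tokensPerLine;  // ~70,000 tokens, resent every turn

let total = 0;
let history = 0;
for (let turn = 0; turn < turns; turn++) {
  total += documentTokens + history;  // the document plus all Q&A accumulated so far
  history += qaPerTurn;
}

console.log(`~${total.toLocaleString()} tokens processed`); // ~790,000, approaching a million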

Context Rot Degrades Quality

As described in the Recursive Language Models paper, even the most capable models exhibit context degradation: their performance declines as input length grows. The deterioration is task-dependent and tied to task complexity. It is especially steep in information-dense contexts, where the correct answer requires synthesizing facts scattered across widely separated parts of the prompt. That steep decline can set in at relatively modest context lengths, long before the model reaches its maximum token capacity, and is understood to reflect a failure to maintain the connections between large numbers of informational fragments.

The authors argue that we should stop stuffing entire documents into the prompt, since this clutters the model's working memory and compromises its performance. Instead, documents should be treated as external environments the LLM can interact with: querying them, navigating their structure, and retrieving specific information on an as-needed basis. This treats the document as a separate knowledge base and frees the model from having to hold everything in context at once.


Prior Work: Two Key Insights

Matryoshka builds on two research directions:

Recursive Language Models (RLM)

The RLM paper introduces a methodology that treats documents as external state which can be queried step by step, without ever loading them in their entirety. Symbolic operations such as search, filter, and aggregate are issued against this state, and only the specific, relevant results are returned. This keeps the context window small while permitting analysis of arbitrarily large documents.

The key point is that the documents stay outside the model and only the search results enter the context. With this separation of concerns, the model never sees complete files; it issues a search to retrieve exactly the information it needs.

Barliman: Synthesis from Examples

Barliman, a tool developed by William Byrd and Greg Rosenblatt, shows that program synthesis does not require a precise code specification. Instead, the user supplies input/output examples, and a relational programming system in the spirit of miniKanren synthesizes functions that satisfy them. The examples are interpreted as relational constraints, and the synthesis engine searches for a program that meets them all, letting you describe what you want in terms of concrete test cases.

The approach is to simply show examples of the behavior you want and let the system derive the implementation on its own. The emphasis shifts from writing long, step-by-step recipes for behavior to declaratively portraying the desired goal.


Matryoshka: Combining the Insights

Matryoshka turns these insights into a working system for LLM agents: a practical tool that lets agents decompose challenging tasks into a sequence of smaller, more manageable steps.

1. Nucleus: A Declarative Query Language

Instead of issuing step-by-step commands, the LLM describes what it wants using Nucleus, a simple S-expression query language. This shifts the focus from describing each step to specifying the desired outcome.

(grep "class ")           ; Find all class definitions
(count RESULTS)           ; Count them
(map RESULTS (lambda x    ; Extract class names
  (match x "class (\\w+)" 1)))

The declarative interface stays robust even when the LLM phrases its request differently, because the system resolves the underlying intent of the query rather than depending on any particular surface wording.

2. Pointer-Based State

The key new idea is separating results from context: instead of living in the conversation, results are stored in the REPL state.

When the agent runs (grep "def ") and gets 150 matches:

  • Traditional tools: All 150 lines are fed into context, and round-tripped every turn
  • Matryoshka: Binds matches to RESULTS in the REPL, returning only "Found 150 results"

The variable RESULTS is bound to the actual value in the REPL. The binding acts as a pointer to the data in the server's memory; subsequent operations such as queries or updates access the data through this reference, but the data itself never enters the conversation:

Turn 1: (grep "def ")         → Server stores 150 matches as RESULTS
                              → Context gets: "Found 150 results"

Turn 2: (count RESULTS)       → Server counts its local RESULTS
                              → Context gets: "150"

Turn 3: (filter RESULTS ...)  → Server filters locally
                              → Context gets: "Filtered to 42 results"

The LLM never sees the 150 function definitions, only the aggregated answers produced by these operations.
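
Here is a minimal sketch of the idea in TypeScript. The class and method names are illustrative, not Matryoshka's actual internals: query results are bound in a server-side environment, and only a short summary string is handed back to the model.

// Illustrative pointer-based state: matches live in server memory under a
// binding name; the model only ever receives the summary strings.
class ReplState {
  private bindings = new Map<string, string[]>();

  grep(lines: string[], pattern: string): string {
    const re = new RegExp(pattern);
    const matches = lines.filter((line) => re.test(line));
    this.bindings.set("RESULTS", matches);        // data stays server-side
    return `Found ${matches.length} results`;     // only this enters the context
  }

  count(name: string): string {
    return String((this.bindings.get(name) ?? []).length);
  }

  filter(name: string, pred: (line: string) => boolean): string {
    const filtered = (this.bindings.get(name) ?? []).filter(pred);
    this.bindings.set(name, filtered);            // refine the same binding in place
    return `Filtered to ${filtered.length} results`;
  }
}

Each call returns a string small enough to quote verbatim in the conversation, which is exactly what the turn-by-turn trace above shows.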

3. Synthesis from Examples

When queries need custom parsing, Matryoshka synthesizes functions from examples:

(synthesize_extractor
  "$1,250.00" 1250.00
  "€500" 500
  "$89.99" 89.99)

The synthesizer learns the pattern directly from the examples, extracting numeric values from the currency strings without anyone having to hand-write a regex.
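
One way to picture what synthesize_extractor does, as a hedged TypeScript sketch: enumerate candidate extractors and keep the first one consistent with every input/output pair. Matryoshka's real synthesizer uses miniKanren-style relational search; this enumeration only illustrates the "find a function that satisfies the examples" idea.

// Try a small pool of candidate extractors and keep the first one that
// satisfies every input/output example.
type Example = [string, number];

const candidates: Array<(s: string) => number> = [
  (s) => parseFloat(s),                          // plain number
  (s) => parseFloat(s.replace(/[^0-9.]/g, "")),  // strip currency symbols and commas
];

function synthesizeExtractor(examples: Example[]) {
  return candidates.find((fn) =>
    examples.every(([input, expected]) => fn(input) === expected)
  );
}

const extractor = synthesizeExtractor([
  ["$1,250.00", 1250.0],
  ["€500", 500],
  ["$89.99", 89.99],
]);
console.log(extractor?.("$42.50")); // 42.5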


The Lifecycle

A typical Matryoshka session:

1. Load Document

(load "./plugin/__init__.py")
→ "Loaded: 2,244 lines, 71.5 KB"

The document is parsed and stored server-side. Only metadata enters the context.

2. Query Incrementally

(grep "@util.api")
→ "Found 122 results, bound to RESULTS"
   [402] @util.api()
   [407] @util.api()
   ... (showing first 20)

Each query returns a preview plus the count. Full data stays on server.

3. Chain Operations

(count RESULTS)           → 122
(filter RESULTS ...)      → "Filtered to 45 results"
(map RESULTS ...)         → Transforms bound to RESULTS

Operations chain through the RESULTS binding. Each step refines without re-querying.

4. Close Session

(close)
→ "Session closed, memory freed"

Sessions auto-expire after 10 minutes of inactivity.
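
The auto-expiry can be pictured as a simple inactivity timer. This is a sketch with assumed names; only the 10-minute figure comes from the text above.

// Reset the expiry timer whenever the session is touched; close it after
// 10 minutes without activity.
const SESSION_TTL_MS = 10 * 60 * 1000;

function makeSessionTimer(close: () => void) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return function touch() {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(close, SESSION_TTL_MS);
  };
}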


How Agents Discover and Use Matryoshka

Matryoshka integrates with LLM agents via the Model Context Protocol (MCP).

Tool Discovery

When the agent starts, it launches Matryoshka as an MCP server and receives a tool manifest:

{
  "tools": [
    {
      "name": "lattice_load",
      "description": "Load a document for analysis..."
    },
    {
      "name": "lattice_query",
      "description": "Execute a Nucleus query..."
    },
    {
      "name": "lattice_help",
      "description": "Get Nucleus command reference..."
    }
  ]
}

The agent sees the available tools and their descriptions. When a user asks to analyze a file, it decides which tools to use based on the task.
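
For readers who want to see what serving that manifest might look like in code, here is a minimal sketch using the standard @modelcontextprotocol/sdk TypeScript pattern. The handler bodies, input schemas, and version string are assumptions for illustration, not Matryoshka's actual source.

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "lattice", version: "0.1.0" },
  { capabilities: { tools: {} } }
);

// Advertise the tool manifest shown above.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "lattice_load",
      description: "Load a document for analysis",
      inputSchema: {
        type: "object",
        properties: { path: { type: "string" } },
        required: ["path"],
      },
    },
    {
      name: "lattice_query",
      description: "Execute a Nucleus query",
      inputSchema: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
    },
  ],
}));

// Dispatch calls to the stateful tool layer and return only summary text.
server.setRequestHandler(CallToolRequestSchema, async (request) => ({
  content: [{ type: "text", text: `handled ${request.params.name}` }],
}));

await server.connect(new StdioServerTransport());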

Guided Discovery

The lattice_help tool returns a command reference, teaching the LLM the query language on-demand:

; Search commands
(grep "pattern")              ; Regex search
(fuzzy_search "query" 10)     ; Fuzzy match, top N
(lines 10 20)                 ; Get line range

; Aggregation
(count RESULTS)               ; Count items
(sum RESULTS)                 ; Sum numeric values

; Transformation
(map RESULTS fn)              ; Transform each item
(filter RESULTS pred)         ; Keep matching items

The agent learns capabilities incrementally rather than needing upfront training.

Session Flow

User: "How many API endpoints does anki-connect have?"

Agent: [Calls lattice_load("plugin/__init__.py")]
        → "Loaded: 2,244 lines"

Agent: [Calls lattice_query('(grep "@util.api")')]
        → "Found 122 results"

Agent: [Calls lattice_query('(count RESULTS)')]
        → "122"

Agent: "The anki-connect plugin exposes 122 API endpoints,
         decorated with @util.api()."

State persists across tool invocations within the conversation: once a document is loaded, its content stays in server memory, and the results of each query are saved and remain available to later queries.


Real-World Example: Analyzing anki-connect

Let's walk through a complete analysis of the anki-connect Anki plugin, a real-world codebase with 7,770 lines across 17 files.

The Task

"Analyze the anki-connect codebase: find all classes, count API endpoints, extract configuration defaults, and document the architecture."

The Workflow

Guided by Matryoshka's prompt hints, the agent follows this workflow (a sketch of the size-based routing appears after the list):

  1. Discover files with Glob
  2. Read small files directly (<300 lines)
  3. Use Matryoshka for large files (>500 lines)
  4. Aggregate across all files
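
A sketch of the size-based routing implied by steps 2 and 3. The thresholds are the ones from the hints above; how mid-sized files are treated is an assumption that happens to match the file table in Step 1.

// Route each file by line count: small files are read whole, large files go
// through Matryoshka. Files between the two hinted thresholds default to a
// direct read here, matching how edit.py (458 lines) was handled below.
function chooseStrategy(lineCount: number): "read_directly" | "matryoshka" {
  if (lineCount < 300) return "read_directly";
  if (lineCount > 500) return "matryoshka";
  return "read_directly"; // assumption for the 300-500 range
}

console.log(chooseStrategy(2_244)); // "matryoshka"     (plugin/__init__.py)
console.log(chooseStrategy(107));   // "read_directly"  (plugin/util.py)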

Step 1: File Discovery

Glob **/*.py → 15 Python files
Glob **/*.md → 2 markdown files

File sizes:
  plugin/__init__.py    2,244 lines  → Matryoshka
  plugin/edit.py          458 lines  → Read directly
  plugin/web.py           301 lines  → Read directly
  plugin/util.py          107 lines  → Read directly
  README.md             4,660 lines  → Matryoshka
  tests/*.py           11 files      → Skip (tests)

Step 2: Read Small Files

Reading util.py (107 lines) reveals configuration defaults:

DEFAULT_CONFIG = {
    'apiKey': None,
    'apiLogPath': None,
    'apiPollInterval': 25,
    'apiVersion': 6,
    'webBacklog': 5,
    'webBindAddress': '127.0.0.1',
    'webBindPort': 8765,
    'webCorsOrigin': None,
    'webCorsOriginList': ['http://localhost/'],
    'ignoreOriginList': [],
    'webTimeout': 10000,
}

Reading web.py (301 lines) reveals the server architecture:

  • Classes: WebRequest, WebClient, WebServer
  • JSON-RPC style API with jsonschema validation
  • CORS support with configurable origins

Step 3: Query Large Files with Matryoshka

; Load the main plugin file
(load "plugin/__init__.py")
→ "Loaded: 2,244 lines, 71.5 KB"

; Find all classes
(grep "^class ")
→ "Found 1 result: [65] class AnkiConnect:"

; Count methods
(grep "def \\w+\\(self")
→ "Found 148 results"

; Count API endpoints
(grep "@util.api")
→ "Found 122 results"

; Load README for documentation
(load "README.md")
→ "Loaded: 4,660 lines, 107.2 KB"

; Find documented action categories
(grep "^### ")
→ "Found 13 sections"
   [176] ### Card Actions
   [784] ### Deck Actions
   [1231] ### Graphical Actions
   ...

Complete Findings

Metric                   Value
Total files              17 (15 .py + 2 .md)
Total lines              7,770
Classes                  8 (1 main + 3 web + 4 edit)
Instance methods         148
API endpoints            122
Config settings          11
Imports                  48
Documentation sections   8 categories, 120 endpoints

Token Usage Comparison

Approach          Lines Processed   Tokens Used   Coverage
Read everything   7,770             ~95,000       100%
Matryoshka only   6,904             ~6,500        65%
Hybrid            7,770             ~17,000       100%

The hybrid method achieves an 82% token savings while retaining 100% coverage. It combines two strategies: direct reads that preserve every detail of the small files, and Matryoshka queries that compress the large ones.
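
The headline figure follows directly from the table:

// 82% savings = 1 - (hybrid tokens / read-everything tokens)
const readEverything = 95_000;
const hybrid = 17_000;
const savings = 1 - hybrid / readEverything;
console.log(`${Math.round(savings * 100)}% token savings`); // "82% token savings"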

The pure Matryoshka approach misses details from the small files (configuration defaults, web server classes), because the agent only uses the tool to query the large ones. The hybrid workflow does direct, full-content reads on small files while leveraging Matryoshka for the bigger ones, a kind of divide-and-conquer. All that's needed is an explicit hint to the agent about which strategy to use.

Why Hybrid Works

Small files (<300 lines) contain critical details:

  • util.py: All configuration defaults, the API decorator implementation
  • web.py: Server architecture, CORS handling, request schema

These fit comfortably in context, and there's no need to do anything different. Matryoshka adds value for:

  • __init__.py (2,244 lines): Query specific patterns without loading everything
  • README.md (4,660 lines): Search documentation sections on demand

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Adapters                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐ │
│  │   Pipe   │  │   HTTP   │  │   MCP Server          │ │
│  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘ │
│       │             │                     │             │
│       └─────────────┴─────────────────────┘             │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │   LatticeTool     │                    │
│                │   (Stateful)      │                    │
│                │   • Document      │                    │
│                │   • Bindings      │                    │
│                │   • Session       │                    │
│                └─────────┬─────────┘                    │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │  NucleusEngine    │                    │
│                │  • Parser         │                    │
│                │  • Type Checker   │                    │
│                │  • Evaluator      │                    │
│                └─────────┬─────────┘                    │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │    Synthesis      │                    │
│                │  • Regex          │                    │
│                │  • Extractors     │                    │
│                │  • miniKanren     │                    │
│                └───────────────────┘                    │
└─────────────────────────────────────────────────────────┘

Getting Started

Install from npm:

npm install matryoshka-rlm

As MCP Server

Add to your MCP configuration:

{
  "mcpServers": {
    "lattice": {
      "command": "npx",
      "args": ["lattice-mcp"]
    }
  }
}

Programmatic Use

import { NucleusEngine } from "matryoshka-rlm";

const engine = new NucleusEngine();
await engine.loadFile("./document.txt");

const result = engine.execute('(grep "pattern")');
console.log(result.value); // Array of matches
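
Building on that snippet, chained queries would reuse the RESULTS binding held inside the engine. This continuation is a sketch that assumes execute() preserves state between calls, consistent with the REPL behavior described earlier.

// Follow-up queries operate on the binding created by the previous call;
// only the small result values cross back into application code.
const matches = engine.execute('(grep "ERROR")');
console.log(matches.value);              // array of matching lines

const total = engine.execute("(count RESULTS)");
console.log(total.value);                // the count, computed inside the engine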

Interactive REPL

npx lattice-repl
lattice> :load ./data.txt
lattice> (grep "ERROR")
lattice> (count RESULTS)

Conclusion

Matryoshka embodies the principle, emerging from RLM research, that documents should be treated as external environments rather than as context to be parsed. That principle changes the character of the model's engagement: no longer a passive reader, it becomes an active agent that navigates and interrogates a document to extract specific information, much as a programmer browses code. Combined with Barliman-style synthesis from examples and pointer-based state management, it achieves:

  • 82% token savings on real-world codebase analysis
  • 100% coverage when combined with direct reads for small files
  • Incremental exploration where each query builds on previous results
  • No context rot because documents stay outside the model

Variable bindings such as RESULTS refer to REPL state rather than holding data in the model's context. Queries carry only these references, indicating where the actual computation should happen; the server does the substantive work and returns only the distilled results.

source here: https://git.sr.ht/~yogthos/matryoshka

all 31 comments
[-] dannoffs@hexbear.net 9 points 1 month ago

Cutting LLM token usage by 100% by not using LLMs.

[-] yogthos@lemmygrad.ml 13 points 1 month ago

I absolutely love how every single thread about LLMs will necessarily have at least one vapid comment like this.

[-] Le_Wokisme@hexbear.net 5 points 1 month ago

how's it feel being a garbage apologist? is someone paying you or is this just love of the game?

[-] yogthos@lemmygrad.ml 11 points 1 month ago* (last edited 1 month ago)

Fascinating. You’ve managed to distill a lack of critical thought into a single question. My motivation is curiosity... yours appears to be performance.

[-] WokePalpatine@hexbear.net 3 points 1 month ago

Fuck off, dude. You might as well be working on an internal combustion engine that's 10% more efficient. It's a technological dead end.

[-] yogthos@lemmygrad.ml 6 points 1 month ago* (last edited 1 month ago)

maybe find yourself a hobby other than opining on things you have no clue about dude

One if you're lucky. But it's not surprising to get a contrarian quip on the contrarian quip website

[-] yogthos@lemmygrad.ml 10 points 1 month ago

At least people could try being creative or something with their critique. This is the hexbear equivalent of libs on reddit sealioning into threads to tell you how China is authoritarian akchually!

[-] PM_ME_VINTAGE_30S@lemmy.sdf.org 7 points 1 month ago* (last edited 1 month ago)

We'll see how Matryoshka unifies these ideas into one system that maintains a persistent analytical state.

As a dynamical systems guy, this warms my heart. Thank you for putting this out there.

[-] yogthos@lemmygrad.ml 6 points 1 month ago* (last edited 1 month ago)

Most of the credit goes to the RLM paper to be honest, but it was really fun to try to implement the idea and combine it with a logic engine. And it really shows just how inefficient current tooling is. It blows my mind that there's so much low hanging fruit available, that's fairly simple to implement.

[-] git@hexbear.net 7 points 1 month ago

This is pretty cool, I’ll need to have a play and see how it improves my local model usage.

[-] yogthos@lemmygrad.ml 3 points 1 month ago

Let me know how it works out. I've been dogfooding it locally for a few days, but always good to hear reports from other people with these sorts of things. :)

[-] JoeByeThen@hexbear.net 6 points 1 month ago

Shoot, was the YouTube thumbnail right? Do I need to learn MCP?

Nice, btw. F the haters. stalin-approval

[-] yogthos@lemmygrad.ml 5 points 1 month ago* (last edited 1 month ago)

Haha I don't think I had a youtube link there. Yeah, using MCPs is handy, tools like crush can actually figure them out on their own. And thanks!

[-] JoeByeThen@hexbear.net 1 points 1 month ago* (last edited 1 month ago)

Oh, no. Lol not you, some NetworkChuck video youtube is constantly pushing on me.

[-] yogthos@lemmygrad.ml 2 points 1 month ago

oh haha that one's not me

[-] krakhead@hexbear.net 6 points 1 month ago

Pretty cool! I'm a novice programmer with spotty knowledge of tech so I wanted to ask you - where do you think we're at with limiting/eliminating hallucinations in LLMs? Is that something even possible?

[-] yogthos@lemmygrad.ml 7 points 1 month ago

The main approach right now is just doing reinforcement learning and minimizing operational context for the model. Another more expensive approach is using a quorum as seen here. You have several agents produce a solution and take the majority vote. While that sounds expensive, it might actually not be that much of a problem if you make smaller specialized models for particular domains. Another track people are pursuing is using a neurosymbolic hybrid approach, where you couple the LLM with a symbolic logic engine. This is my personal favorite route, and what I did with matryoshka is a variation on that. The LLM sits at the very top and its job is to analyze natural language user input, then form declarative queries to the logic engine, and evaluate the output. This way the actual logic of solving the problem is handled in a deterministic way.

[-] Ekranoplane@hexbear.net 6 points 1 month ago

You should be setting up your environment in a way the type system and integration tests prove the code is correct. Then let the AI spin in a closed loop until the tests pass. It will automatically deal with its own hallucinations this way. I've ported tens of thousands of lines of code from .net framework to .net9 this way.

You should be programming like that any way since human written code is also unreliable and filled with mistakes lol.

[-] yogthos@lemmygrad.ml 4 points 1 month ago

Oh yeah, TDD works great with LLMs I find. They just have to make all the pieces fit, and since there's a spec for what the code should be doing there's no way to cheat. Either the tests pass or they don't.

[-] Ekranoplane@hexbear.net 1 points 1 month ago

Yeah and if you use a highly opinionated framework like Blazor the type system is so good you don't even need integration tests, the AI will simply make whatever page you ask it to. It's pretty incredible. Rubbish in anything other than Blazor though... It could probably handle JSP or something but Blazor is amazing even without the AI. Can't believe it's a MS product tbh they must keep the C# people in a different building.

[-] yogthos@lemmygrad.ml 1 points 1 month ago

I'm still kinda sad F# never caught on myself.

[-] PM_ME_VINTAGE_30S@lemmy.sdf.org 4 points 1 month ago* (last edited 1 month ago)

So I'm not an AI guy, I'm more of a dynamical systems/control theory guy with interests in AI. I.e. my background is more mathematics than programming.

I took a course last year on theoretical machine learning where one of my classmates did a literature review where he basically concluded that the current research indicates that hallucinations are a structural part of LLMs, but that we can work to reduce the probability of hallucinations to statistically negligible levels. (Math warning for both papers.)

The non-math version: we can limit (bound) the probability of a LLM hallucinating, but we cannot completely eliminate them.

[-] BigWeed@hexbear.net 6 points 1 month ago

Hey, good job, looks cool. A few observations: It makes sense to aggressively use token caching. It makes sense to improve context accuracy by treating docs as external to the input prompt. Smaller context = more calls, faster CoT, higher llm concurrency. We see a lot of this now, where coding agents aggressively cache documents and use tools to do more complicated tasks, e.g. code gen as tool use. I also see more moving away from RAG and more towards using good ole gnu tools like grep as exploration hints. RAG approaches are definitely more prone to contextual bandit problems. But IMO, balancing exploration with exploitation is fundamentally a reinforcement learning problem rather than a linear flow from recursive llm calls, but it acts as a reasonable surrogate model, and our RL algorithms consistently fall short. For example, you could help govern exploration behavior using deep-q or ppo to guide search, e.g. have your llm produce a top-5 exploration s-expression horizon, feed the exploration into the rl algorithm with cosine similarity on vectors + metadata (file size, LOC, etc) as features, then rank the exploration space. This would further reduce token count and increase parallelization in exploration (since top-n results are computed as batch in the llm with only marginal overhead).

Personally, I just wish I had a good search tool that didn't expose all of the LLM tomfoolery. For example, today I needed to find where a particular sql table was defined in the code since it used some crazy ORM and an initial grep didn't yield results, so I jammed it into openai codex and it found it. This was useful to me where github search was useless. I need more context-aware search, and it would be better if I could combine it with all of the other docs and slack convos floating around. Actual coding tasks require a greater level of human interaction where opaque results are less useful.

But again, good job, looks cool.

[-] yogthos@lemmygrad.ml 7 points 1 month ago

Thanks, honestly I'm really shocked it took people this long to realize that keeping state would be useful. The way MCP works by default is completely brain dead. And agree about RAGs, they're good for biasing the model, but that's about it. My thesis is that you don't even need reinforcement learning, you can have a symbolic logic engine that the LLM drives instead. The job of the LLM is to parse noisy outside inputs like user queries, and to analyze the results to decide what to do next. The context of the LLM should focus on this, I have a task, and I make some declarative queries to the logic engine. The engine then takes over and does actual genuine reasoning to produce the result. The key failing of symbolic AI was the ontological problem where you had a combinatory explosion trying to create the ontologies for it to operate on. But with LLMs we can have it build the context for the logic engine to operate within on the fly. And you can even do code generation in this way!

[-] aanes_appreciator@hexbear.net 5 points 1 month ago* (last edited 1 month ago)

This is cool.

The thing with LLM chuds is that the goal is to use LLMs as little as possible. The less data we must feed into the torment nexus, the less torment the nexus must produce for me to convert some random data into yaml

[-] yogthos@lemmygrad.ml 9 points 1 month ago

Importantly, this actually makes local models more capable because they don't need to hold as much context anymore.
