from-the-lab · debugging · reliability · production

From the Lab: Why Your AI Agent Keeps Hallucinating Production Data

Three weeks of debugging revealed a counter-intuitive pattern: more context made our agent worse. Here's what we learned about effective agent memory management.

by RakanLabs Team · 5 min read


Last month, a client's booking agent started confidently inventing appointment times that didn't exist. Users would receive confirmations for slots that were already taken. The agent seemed so certain that support initially blamed the database.

Spoiler: The database was fine. The agent was drowning in context.

The Symptoms

User: "Book me for next Tuesday afternoon"
Agent: "I've confirmed your appointment for Tuesday at 2:30pm"
Database: [No record created]
Actual availability: 2:30pm was booked three days ago

This happened in ~3% of requests. Frequent enough to be painful, rare enough to be hard to reproduce.
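This failure mode is detectable mechanically: a confirmation is only trustworthy if the exact slot the agent claims actually exists and is free in the database. A minimal sketch of that check (the `Slot` shape and function name are ours, not the client's schema):

```typescript
// A slot as it exists in the booking database (hypothetical shape).
interface Slot {
  date: string;      // e.g. "2024-01-02"
  time: string;      // e.g. "14:30"
  available: boolean;
}

// Returns true only if the claimed slot exists in the database AND is free.
// Anything else means the agent invented or mis-read a slot.
function verifyConfirmation(
  claimed: { date: string; time: string },
  dbSlots: Slot[]
): boolean {
  return dbSlots.some(
    (s) => s.date === claimed.date && s.time === claimed.time && s.available
  );
}
```

Running every agent confirmation through a check like this is how a 3% failure rate becomes visible in logs instead of in support tickets.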

The Investigation

Hypothesis 1: Model Hallucination (Wrong)

First instinct: the LLM is making stuff up. We added stricter prompts:

You MUST only use real data from the database.
NEVER invent information.
If a time slot doesn't exist, say so.

Result: Slightly worse. The agent now argued with itself about what was real.

Hypothesis 2: Stale Cache (Wrong)

Maybe we're showing outdated availability? We disabled all caching.

Result: No change. Plus we made the system slower.

Hypothesis 3: Context Overflow (Bingo)

We logged the full context sent to the agent:

Tokens: 3,847
- System prompt: 342 tokens
- Available slots: 2,891 tokens (187 slots)
- User history: 421 tokens
- Business rules: 193 tokens

The agent was trying to reason over 187 time slots at once. Sometimes it would latch onto slot #47 from the context, sometimes #103. Both were "afternoons" but weeks apart.
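Getting a per-section breakdown like the one above is straightforward to automate. A rough sketch, using a crude characters-per-token approximation in place of a real tokenizer (the `countTokens` heuristic and function names are ours; in practice you'd plug in your model's tokenizer):

```typescript
// Crude stand-in for a real tokenizer: ~4 characters per token on average
// for English text. Swap in your model's actual tokenizer for real numbers.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Log token counts per context section plus the total, so overflow shows up
// in metrics instead of as mystery hallucinations.
function tokenBreakdown(sections: Record<string, string>) {
  const perSection: Record<string, number> = {};
  let total = 0;
  for (const [name, text] of Object.entries(sections)) {
    perSection[name] = countTokens(text);
    total += perSection[name];
  }
  return { total, perSection };
}
```

Logging this on every call is what surfaced the 2,891-token slot list dominating the prompt.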

The Fix: Aggressive Context Pruning

Before: Show Everything

const context = {
  availableSlots: await db.getAllSlots({
    startDate: today,
    endDate: addMonths(today, 3), // 3 months ≈ 90 days of slots
  }),
  userPreferences: await getUserHistory(userId),
  businessRules: await getAllRules(),
};

This felt safer. More information = better decisions, right?

Wrong. More information = more noise.

After: Show Just Enough

// Step 1: Parse intent first
const intent = await agent.parseUserIntent(message);
// Output: { preferredDate: "next Tuesday", timePreference: "afternoon" }

// Step 2: Fetch ONLY relevant slots
const targetDate = parseRelativeDate(intent.preferredDate);
const context = {
  availableSlots: await db.getSlots({
    date: targetDate,
    startTime: intent.timePreference === "afternoon" ? "12:00" : "09:00",
    endTime: intent.timePreference === "afternoon" ? "17:00" : "12:00",
    limit: 10, // max 10 slots, sorted by preference
  }),
};

Result: Context dropped from 3,847 to 412 tokens. Hallucinations dropped from 3% to 0.1%.
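The `parseRelativeDate` helper in the snippet above is doing real work: turning "next Tuesday" into a concrete date before any database query runs. One minimal interpretation, handling only the "next &lt;weekday&gt;" form (a sketch, not the production parser):

```typescript
const WEEKDAYS = [
  "sunday", "monday", "tuesday", "wednesday",
  "thursday", "friday", "saturday",
];

// Resolves phrases like "next Tuesday" to a Date, relative to `from`.
// "Next" always advances at least one day, so "next Tuesday" spoken on a
// Tuesday means a week out, not today.
function parseRelativeDate(phrase: string, from: Date = new Date()): Date {
  const m = phrase.toLowerCase().match(/next\s+(\w+)/);
  if (!m) throw new Error(`unsupported phrase: ${phrase}`);
  const target = WEEKDAYS.indexOf(m[1]);
  if (target < 0) throw new Error(`unknown weekday: ${m[1]}`);
  const result = new Date(from);
  const delta = ((target - from.getDay() + 7) % 7) || 7;
  result.setDate(from.getDate() + delta); // setDate handles month rollover
  return result;
}
```

A real parser would also need "tomorrow", "in two weeks", explicit dates, and timezone handling; the point is that intent parsing happens before, and independently of, the slot fetch.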

The Counter-Intuitive Lesson

More context isn't better context. Agents work best with:

  1. Narrow context windows: Only the data needed for THIS decision
  2. Multi-step reasoning: Parse intent → fetch data → decide
  3. Explicit constraints: "These are the ONLY 5 options" works better than showing 100

This mirrors how humans work. If you ask me to pick a restaurant, don't read me the entire Yelp database. Give me 3-5 options that match my criteria.
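Principle 3 above is cheap to implement at the prompt level. One way to phrase the "only these options" constraint (the wording and function name are ours, not a library API):

```typescript
// Builds a prompt section that enumerates a closed set of options and
// explicitly forbids inventing others. Numbered options also give the
// agent an unambiguous way to refer back to its choice.
function buildSlotPrompt(slots: { date: string; time: string }[]): string {
  const lines = slots.map((s, i) => `${i + 1}. ${s.date} at ${s.time}`);
  return [
    `These are the ONLY ${slots.length} available options:`,
    ...lines,
    "Pick one by number, or say none fit. Do not invent other times.",
  ].join("\n");
}
```

The closed enumeration plus "pick by number" turns an open-ended generation problem into a multiple-choice one, which is exactly where hallucination risk drops.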

The Pattern We Use Now

For Any Agent Decision:

// ❌ Don't do this
const hugeContext = await getAllPossibleData();
const decision = await agent.decide(hugeContext);

// ✅ Do this
const intent = await agent.understand(userInput);
const relevantData = await fetchRelevantOnly(intent, { limit: 10 });
const decision = await agent.decide(relevantData);

Concrete Rules:

  • < 500 tokens per agent call when possible
  • < 20 options for the agent to choose from
  • Multi-step if you need to reason over large datasets
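These rules are easy to enforce mechanically rather than by convention. A guard that could sit in front of every agent call (the thresholds mirror the rules above; the function and constant names are ours):

```typescript
// Budget limits from the rules above: they are heuristics, not hard laws,
// so tune them per agent.
const MAX_CONTEXT_TOKENS = 500;
const MAX_OPTIONS = 20;

// Throws before the agent call if the context exceeds budget, so overflow
// fails loudly in development instead of degrading silently in production.
function assertContextBudget(tokens: number, options: number): void {
  if (tokens > MAX_CONTEXT_TOKENS) {
    throw new Error(`context too large: ${tokens} tokens > ${MAX_CONTEXT_TOKENS}`);
  }
  if (options > MAX_OPTIONS) {
    throw new Error(`too many options: ${options} > ${MAX_OPTIONS}`);
  }
}
```

In production you might log and alert instead of throwing, but the check itself is the same.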

When More Context Actually Helps

There are exceptions. Diagnostic agents benefit from comprehensive context:

// Bug diagnosis: show me everything
const context = {
  recentLogs: last100Lines,
  errorTrace: fullStackTrace,
  systemState: completeSnapshot,
};

const diagnosis = await agent.diagnose(context);

The difference: diagnostic tasks are one-shot and expensive by design. Booking agents run hundreds of times per day and need to be fast + reliable.

Measuring Context Effectiveness

We now track:

metrics.record({
  contextTokens: prompt.tokenCount,
  optionsProvided: context.availableSlots.length,
  responseTime: duration,
  hallucinationDetected: !decision.matchesDatabase,
  userAccepted: feedback.accepted,
});

If contextTokens correlates with hallucinationDetected, you're giving the agent too much.
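That correlation check is a one-liner over your metrics log. A sketch using plain Pearson correlation between context size and a 0/1 hallucination flag (any stats library would do the same):

```typescript
// Pearson correlation coefficient between two equal-length series.
// With ys as a 0/1 hallucination flag, a clearly positive r says bigger
// contexts go hand in hand with more hallucinations.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx2 = 0, dy2 = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    num += dx * dy;
    dx2 += dx * dx;
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}
```

Correlation is not causation, of course, but in our experience it was the fastest signal that context pruning was worth trying.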

Try This Tomorrow

Pick your highest-volume agent call. Log:

  1. Context size in tokens
  2. Number of "options" in the context
  3. Error rate

If context > 1,000 tokens or options > 50, try narrowing it. You might be surprised.


Update: This pattern has held up across 6 different agent migrations. Context pruning is now step one of our debugging checklist.

Need help debugging your agent? Let's talk.
