Guardrails: How Gola Keeps Your Agents from Going Off the Rails
Agents can behave unexpectedly. They get stuck in loops, exceed your context window, or forget what they're doing. Gola's guardrails catch these problems before they become serious issues.
What Can Go Wrong?
Here's what agents do in the wild:
- Get stuck calling the same tool over and over
- Fill up their context window and crash
- Forget what they were doing halfway through
- Hit rate limits and give up
- Send malformed messages that break the API
- Wander off into completely unrelated tasks
Core Guardrail Systems
1. Loop Detection & Prevention
The problem: Your agent keeps doing the same thing, slightly differently each time.
How Gola fixes it:
- Watches recent tool calls for near-identical repetitions
- Scores how similar they are using edit distance
- Steps in once repetitions cross the configured threshold
Configuration:
agent:
  behavior:
    loop_detection:
      enabled: true
      similarity_threshold: 0.85
      window_size: 5
      max_repetitions: 3
Example scenario:
# Agent repeatedly calls the same tool with similar arguments
1. search_files("config.yaml")
2. search_files("config.yml")
3. search_files("config.yaml") # Loop detected!
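For intuition, here's a minimal sketch of this kind of check in Python. It uses difflib's similarity ratio as a stand-in for the edit-distance comparison described above; the function names are illustrative, not Gola's internals.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; 1.0 means identical call signatures."""
    return SequenceMatcher(None, a, b).ratio()

def loop_detected(recent_calls: list[str],
                  threshold: float = 0.85,
                  max_repetitions: int = 3) -> bool:
    """True if the newest call closely resembles enough earlier calls."""
    if not recent_calls:
        return False
    newest = recent_calls[-1]
    repeats = sum(similarity(newest, earlier) >= threshold
                  for earlier in recent_calls[:-1])
    return repeats + 1 >= max_repetitions  # the newest call counts as one

calls = ['search_files("config.yaml")',
         'search_files("config.yml")',
         'search_files("config.yaml")']
print(loop_detected(calls))  # True: the agent is circling
```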
2. Context Window Management
The problem: Conversations get too long and the LLM throws a "context exceeded" error.
How it's handled:
- Summarize large outputs: Tool outputs exceeding limits get condensed automatically
- Prioritize recent messages: Recent messages are retained, older ones are dropped
- Selective retention: Never drops system prompts or critical errors
- Transparency: Adds notes when truncating so nothing's hidden
Configuration:
agent:
  behavior:
    memory:
      max_history_steps: 20
      eviction_strategy: "intelligent"
      summarization_threshold: 500  # tokens
      min_recent_steps: 5
Auto-recovery behavior:
- Detects context length errors (413, 429 status codes)
- Summarizes large tool messages first
- Removes older messages if needed
- Retries with reduced context
- Adds truncation notices for transparency
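As a rough illustration of that recovery order, the sketch below condenses oversized tool outputs first and only then drops older messages. `summarize` and `count_tokens` are hypothetical helpers; this is not Gola's internal API.

```python
def recover_context(messages, summarize, count_tokens, max_tokens,
                    summarization_threshold=500, min_recent_steps=5):
    """Shrink the conversation until it fits, preserving system prompts."""
    # 1. Condense oversized tool outputs first.
    for msg in messages:
        if msg["role"] == "tool" and count_tokens(msg["content"]) > summarization_threshold:
            msg["content"] = summarize(msg["content"]) + "\n[tool output summarized]"

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # 2. Still too large? Drop the oldest non-system messages, keeping
    #    the system prompt and the most recent steps intact.
    while total(messages) > max_tokens and len(messages) > min_recent_steps + 1:
        idx = next((i for i, m in enumerate(messages[:-min_recent_steps])
                    if m["role"] != "system"), None)
        if idx is None:
            break
        del messages[idx]

    # 3. The caller retries the request with the reduced context.
    return messages
```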
3. Message Sequence Validation
The problem: LLM APIs are picky about message order. One message out of place and they refuse to work.
What gets checked:
- Every tool call has a response (no orphans)
- No two messages in a row from the same role
- Tool responses actually match their tool calls
- System messages are where they belong
Auto-repair capabilities:
- Injects missing tool responses
- Merges consecutive same-role messages
- Reorders messages for valid sequence
- Removes orphaned tool responses
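One of those repairs, injecting a response for an orphaned tool call, might look roughly like this. The message shape follows the common OpenAI-style schema; Gola's internal types may differ.

```python
def repair_orphaned_tool_calls(messages: list[dict]) -> list[dict]:
    """Insert a synthetic response for any tool call that never got one."""
    answered = {m.get("tool_call_id") for m in messages if m["role"] == "tool"}
    repaired = []
    for msg in messages:
        repaired.append(msg)
        for call in msg.get("tool_calls") or []:
            if call["id"] not in answered:
                # Place the placeholder right after the call so the
                # sequence stays valid for strict APIs.
                repaired.append({
                    "role": "tool",
                    "tool_call_id": call["id"],
                    "content": "[no response recorded; synthetic placeholder]",
                })
    return repaired
```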
4. Memory Eviction Strategies
The problem: You can't remember everything forever. Something's gotta go.
Your options:
FIFO (First In, First Out)
- Very simple: oldest content gets deleted first
- Good for: Simple agents and short-lived tasks
- Bad for: Long workflows where the original goal matters
Intelligent Eviction
- Keeps important context (initial task, errors, key tool outputs)
- Good for: Complex agents that need to remember their purpose
- Bad for: Nothing really, this is usually the right choice
Chunk-Based
- Removes whole conversation segments at once
- Good for: Multi-turn workflows with distinct phases
- Bad for: Simple Q&A agents
Configuration:
agent:
  behavior:
    memory:
      eviction_strategy: "intelligent"
      preserve_strategy:
        preserve_initial_task: true
        preserve_errors: false
        preserve_successful_observations: false
        preserve_recent_count: 5
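To make the behavior concrete, here is a sketch of what "intelligent" eviction could do with the options above. The field names mirror the config; the selection logic and the `kind` tag on each step are assumptions, not Gola's exact algorithm.

```python
def evict(steps: list[dict], max_history_steps: int, preserve: dict) -> list[dict]:
    """Drop the least important old steps once history exceeds the limit."""
    if len(steps) <= max_history_steps:
        return steps

    n_recent = preserve["preserve_recent_count"]
    recent, older = steps[-n_recent:], steps[:-n_recent]

    def must_keep(step):
        kind = step.get("kind")  # hypothetical tag: "initial_task", "error", "observation"
        return ((preserve["preserve_initial_task"] and kind == "initial_task")
                or (preserve["preserve_errors"] and kind == "error")
                or (preserve["preserve_successful_observations"] and kind == "observation"))

    kept = [s for s in older if must_keep(s)]
    # Fill any leftover budget with the newest of the evictable steps.
    budget = max(max_history_steps - len(recent) - len(kept), 0)
    evictable = [s for s in older if not must_keep(s)]
    filler = evictable[-budget:] if budget else []

    survivors = {id(s) for s in kept} | {id(s) for s in filler}
    # Reassemble in the original order.
    return [s for s in older if id(s) in survivors] + recent
```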
5. Token Management
The problem: Tokens cost money and have limits.
What Gola does:
- Count tokens before sending (no surprises)
- Switch to bigger models if needed
- Budget tokens per step so you don't use them all at once
- Compress prompts when they get too long
Token limits by model (examples):
- GPT-4 Turbo: 128K tokens
- Claude 3: 200K tokens
- Gemini 1.5: 1M tokens
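Counting before sending is straightforward with a tokenizer library. The sketch below uses tiktoken as one common option, not necessarily what Gola uses, and the limit table holds example values only.

```python
import tiktoken  # pip install tiktoken

MODEL_LIMITS = {"gpt-4-turbo": 128_000}  # example values only

def fits(messages: list[dict], model: str = "gpt-4-turbo",
         reserve_for_reply: int = 4_000) -> bool:
    """Check the prompt against the model's window before sending."""
    enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base works for this family
    used = sum(len(enc.encode(m["content"])) for m in messages)
    return used + reserve_for_reply <= MODEL_LIMITS[model]
```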
6. Rate Limit Handling
The problem: Hit the API too fast and you get rate limited.
Our approach:
- See a 429 error? Gola catches it
- Wait a bit (1 second)
- Try again (wait 2 seconds if it fails)
- Keep doubling the wait time
- Eventually it works (or gives up gracefully)
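That retry loop is classic exponential backoff. A generic sketch, using the 1-second base and doubling from the description above; the jitter is an extra assumption, and `request` is any callable returning an HTTP-style response.

```python
import random
import time

def call_with_backoff(request, max_attempts: int = 5):
    """Retry on 429 responses, doubling the wait each time."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        response = request()
        if response.status_code != 429:
            return response
        if attempt == max_attempts:
            raise RuntimeError("rate limited; giving up after retries")
        time.sleep(delay + random.uniform(0, 0.5))  # small jitter avoids thundering herd
        delay *= 2
```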
7. Error Recovery & Fallback
The problem: Things break. The agent needs to keep going.
How degradation works:
- First try: Drop tool outputs that might be causing issues
- Still not working?: Strip it down to just the conversation
- Really stuck?: Start fresh with just the latest message
- Give up: At least tell the user what went wrong and how to fix it
Configuration:
agent:
  behavior:
    continue_on_error: true
    max_retries: 3
    enable_memory_clearing: false
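The degradation ladder can be pictured as a list of progressively smaller views of the conversation. This is an illustrative sketch, not Gola's implementation; it assumes the first message is the system prompt.

```python
def run_with_fallback(send, messages: list[dict]):
    """Try progressively smaller contexts until one request succeeds."""
    strategies = [
        lambda msgs: msgs,                                      # 1. full context
        lambda msgs: [m for m in msgs if m["role"] != "tool"],  # 2. drop tool outputs
        lambda msgs: [msgs[0]] + [m for m in msgs[1:]           # 3. bare conversation,
                                  if m["role"] in ("user", "assistant")][-6:],  # recent turns only
        lambda msgs: [msgs[0], msgs[-1]],                       # 4. fresh start: system + latest
    ]
    last_error = None
    for degrade in strategies:
        try:
            return send(degrade(list(messages)))
        except Exception as err:  # in practice, catch the provider's specific errors
            last_error = err
    raise RuntimeError(f"all fallback strategies failed: {last_error}")
```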
8. Step Limiting
The problem: Without limits, agents can run forever (and cost you a fortune).
Simple solution:
- Set a max number of steps
- When it hits the limit, wrap things up
- Generate a summary of what got done
Configuration:
agent:
  max_steps: 15  # Maximum reasoning steps
  behavior:
    tool_timeout: 30  # Per-tool execution timeout
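Conceptually, step limiting is just a bounded loop that hands off to a summarizer when the budget runs out. The helper names here (`think`, `act`, `summarize`) are placeholders, not Gola's API.

```python
def run_agent(task: dict, think, act, summarize, max_steps: int = 15):
    """Run the reasoning loop, wrapping up gracefully at the step limit."""
    history = [task]
    for _ in range(max_steps):
        decision = think(history)
        if decision.get("done"):
            return decision["answer"]
        history.append(act(decision))
    # Limit reached: report what was accomplished instead of running forever.
    return summarize(history)
```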
9. Tool Execution Timeouts
Purpose: Prevents hanging on slow or unresponsive tools.
Features:
- Per-tool configurable timeouts
- Global default timeout
- Automatic cancellation and cleanup
- Error injection for timeout events
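A per-tool timeout with cancellation can be sketched with asyncio, purely as an illustration; Gola's runtime may implement this differently.

```python
import asyncio

async def run_tool(tool, args: dict, timeout: float = 30.0):
    """Run a tool coroutine, cancelling it if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(tool(**args), timeout=timeout)
    except asyncio.TimeoutError:
        # Surface the timeout to the agent as an observation instead of hanging.
        return {"error": f"tool timed out after {timeout}s"}
```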
10. Conversation Consistency
Purpose: Maintains coherent conversation flow.
Mechanisms:
- Message ordering validation
- Role alternation enforcement
- Context preservation across retries
- State consistency checks
11. API Error Resilience
Purpose: Handles various API failure modes gracefully.
Handled error types:
- Network failures (retry with backoff)
- Authentication errors (fail fast)
- Validation errors (auto-repair)
- Server errors (fallback strategies)
- Timeout errors (reduce payload)
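Routing failures to those strategies amounts to a small classifier. The status codes below are conventional values, not confirmed Gola behavior.

```python
def choose_strategy(status_code: int | None, timed_out: bool = False) -> str:
    """Map an API failure to one of the handling strategies listed above."""
    if timed_out:
        return "reduce_payload"        # shrink the request and retry
    if status_code is None:
        return "retry_with_backoff"    # network failure, no response at all
    if status_code in (401, 403):
        return "fail_fast"             # bad credentials won't fix themselves
    if status_code in (400, 422):
        return "auto_repair"           # often a malformed message sequence
    if status_code >= 500:
        return "fallback_strategy"     # degrade the request progressively
    return "retry_with_backoff"
```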
Keeping Complex Workflows on Track
When you build agents with multiple states (like the Sherpa travel agent), they can drift off course. Here's how that's prevented.
Things That Go Wrong
Agent forgets where it is: "Wait, did I already search for flights?"
- Fix: Keep state variables protected in memory
Stuck in a state loop: Searching flights → selecting flight → searching flights again
- Fix: Detect when states repeat
State history gets huge: Every little state change fills up context
- Fix: Smart eviction that keeps the important state markers
Error recovery causes loops: Error → retry → error → retry forever
- Fix: Progressive fallback that eventually resets state
Best Practices for State Machines
- Use memory preservation to protect state variables:
agent:
  behavior:
    memory:
      preserve_strategy:
        preserve_initial_task: true  # Keep original goal
        preserve_recent_count: 3     # Keep recent state transitions
- Configure loop detection for state-aware monitoring:
agent:
  behavior:
    loop_detection:
      state_aware: true  # Consider state in similarity calculation
- Set appropriate step limits for workflow complexity:
agent:
  max_steps: 25  # Enough for complex workflows
Debugging When Things Go Wrong
See What's Happening
Turn on verbose mode to watch the guardrails work:
agent:
  behavior:
    verbose: true
    show_reasoning: true
logging:
  level: "debug"
What You'll See in Logs
[WARN] Loop detected: Similar tool calls in recent history
[INFO] Context truncated: 5 messages removed to fit within limits
[INFO] Successfully recovered from error on attempt 2
[WARN] Context window exceeded, applying truncation
[INFO] Injected synthetic tool response for orphaned call
These aren't errors - they're guardrails doing their job.
Metrics to Monitor
- Loop detection triggers per session
- Context truncation frequency
- Error recovery success rate
- Average retry count
- Token usage efficiency
Configuration Examples
Production Config (Maximum Safety)
When you can't afford failures:
agent:
  max_steps: 20
  behavior:
    memory:
      max_history_steps: 30
      eviction_strategy: "intelligent"
      preserve_strategy:
        preserve_initial_task: true
        preserve_errors: true
        preserve_successful_observations: true
    loop_detection:
      enabled: true
      similarity_threshold: 0.8
      max_repetitions: 2
    continue_on_error: true
    max_retries: 5
    tool_timeout: 45
Development Config (See Everything)
When you're building and need to see what's happening:
agent:
  max_steps: 50
  behavior:
    memory:
      max_history_steps: 100
      eviction_strategy: "fifo"
    loop_detection:
      enabled: false  # Disable during development
    verbose: true
    show_reasoning: true
    continue_on_error: false  # Fail fast for debugging
Cheap and Fast Config
When you need to keep costs down:
agent:
  max_steps: 10
  behavior:
    memory:
      max_history_steps: 10
      eviction_strategy: "chunk-based"
      summarization_threshold: 200
      min_recent_steps: 3
    tool_timeout: 15
    max_retries: 2
Fixing Common Problems
Agent keeps doing the same thing?
- Make loop detection more sensitive (lower `similarity_threshold`)
- Let it repeat less before stopping (`max_repetitions: 2`)
Still getting context errors?
- Keep less history (`max_history_steps: 10`)
- Summarize more aggressively (lower `summarization_threshold`)
- Switch to FIFO eviction if intelligent isn't working
Agent forgets important context?
- Use intelligent eviction (it's smarter about what to keep)
- Increase `preserve_recent_count`
- Make sure `preserve_initial_task` is true
Agent stops too soon?
- Bump up `max_steps`
- Give tools more time (`tool_timeout: 60`)
- Check if loop detection is too aggressive
The Bottom Line
Guardrails keep your agents from misbehaving. Most of the time, you won't even know they're there—they just quietly fix problems before you notice them.
The defaults work for most use cases. If your agent encounters issues, check the logs. The guardrails will indicate exactly what they're doing and why.
Pro-tip: Start with `eviction_strategy: "intelligent"` and `loop_detection.enabled: true`. Those two settings prevent most problems.