Design Philosophy

Claw Code Agent’s defining feature is its backend independence. While Claude Code is locked to Anthropic’s API, Claw Code Agent targets any inference server that exposes an OpenAI-compatible /chat/completions endpoint. This means the same agent architecture runs against a local GPU, a cloud endpoint, or a proxy aggregator — the agent code never changes.

OpenAI-Compatible API Client

The openai_compat.py module implements a minimal OpenAICompatClient class that handles both synchronous and streaming completions.

Request Construction

The client constructs standardized payloads to /chat/completions:

{
  "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "messages": [...],
  "tools": [...],           # Tool definitions in OpenAI format
  "temperature": 0.7,
  "stream": true,           # Optional
  "response_format": {...}  # Optional structured output schema
}

Authentication uses Bearer tokens. HTTP URLs are automatically upgraded to HTTPS for remote endpoints.
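The Bearer-auth and HTTP-to-HTTPS upgrade behavior might be sketched as follows, using only the standard library. The `build_request` helper and the loopback check are illustrative assumptions, not the module's actual API:

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble a POST to /chat/completions with Bearer authentication.

    Remote HTTP URLs are upgraded to HTTPS; loopback addresses are left
    alone so local vLLM/Ollama servers keep working.
    """
    if base_url.startswith("http://") and not any(
        host in base_url for host in ("127.0.0.1", "localhost")
    ):
        base_url = "https://" + base_url[len("http://"):]
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```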

Streaming via SSE

When streaming is enabled, the client parses Server-Sent Events (SSE):

  1. Read response lines incrementally
  2. Strip the data: prefix and accumulate the payload
  3. Yield complete JSON objects as they arrive
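The steps above can be sketched as a small generator (a minimal illustration, not the module's actual parser; the `[DONE]` sentinel is the standard end-of-stream marker in OpenAI-style SSE):

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON events from an SSE stream of 'data: ...' lines.

    Blank lines separate events; the '[DONE]' sentinel ends the stream.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```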

The client emits typed events:

Event            Content
message_start    Initial response metadata
content_delta    Incremental text chunks
tool_call_delta  Function invocation fragments
usage            Token count statistics
message_stop     Completion signal

Tool Call Parsing

Two response formats are supported:

Format  Structure
Modern  tool_calls array with function objects (name + arguments)
Legacy  function_call object (single tool call)

Arguments parse flexibly: dictionaries pass through, JSON strings decode to objects, and null values default to empty dictionaries. Invalid JSON raises OpenAICompatError.
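That argument normalization might look like this sketch (the `parse_tool_arguments` helper and exact error messages are illustrative assumptions; only the `OpenAICompatError` name comes from the source):

```python
import json


class OpenAICompatError(Exception):
    """Raised when a tool call's arguments cannot be decoded."""


def parse_tool_arguments(raw) -> dict:
    """Normalize tool-call arguments to a dictionary.

    Dicts pass through, JSON strings decode to objects, None becomes {},
    and invalid JSON raises OpenAICompatError.
    """
    if raw is None:
        return {}
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            decoded = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise OpenAICompatError(f"invalid tool arguments: {raw!r}") from exc
        if not isinstance(decoded, dict):
            raise OpenAICompatError(f"expected a JSON object, got: {raw!r}")
        return decoded
    raise OpenAICompatError(f"unsupported argument type: {type(raw).__name__}")
```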

Usage Statistics

The client normalizes diverse naming conventions across backends:

Metric         vLLM               Ollama             OpenAI
Input tokens   prompt_tokens      prompt_eval_count  input_tokens
Output tokens  completion_tokens  eval_count         output_tokens
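A minimal sketch of that normalization, trying each known alias in order (the `normalize_usage` name and the zero fallback are assumptions for illustration):

```python
def normalize_usage(usage: dict) -> dict:
    """Map backend-specific usage field names onto a common shape.

    Checks the vLLM/OpenAI-compatible, Ollama, and OpenAI aliases in
    turn, falling back to 0 when a backend reports nothing.
    """
    def first(*keys: str) -> int:
        for key in keys:
            if key in usage:
                return usage[key]
        return 0

    return {
        "input_tokens": first("prompt_tokens", "prompt_eval_count", "input_tokens"),
        "output_tokens": first("completion_tokens", "eval_count", "output_tokens"),
    }
```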

Supported Backends

vLLM (Primary)

vLLM is the recommended backend for its native tool-calling support and high throughput.

Launch command:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

Key flags:

  • --enable-auto-tool-choice — Lets vLLM detect when the model wants to call tools
  • --tool-call-parser qwen3_xml — Uses the Qwen3-specific XML parser for extracting tool calls from model output

Environment setup:

export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=local-token
export OPENAI_MODEL=Qwen/Qwen3-Coder-30B-A3B-Instruct

Ollama

Ollama provides a simpler setup for models that support tool use.

export OPENAI_BASE_URL=http://127.0.0.1:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen3-coder:30b

Ollama handles model downloading and quantization automatically. The prompt_eval_count / eval_count naming convention is normalized by the client.

LiteLLM Proxy

LiteLLM acts as a unified proxy, allowing the agent to target 100+ model providers through a single endpoint.

export OPENAI_BASE_URL=http://127.0.0.1:4000/v1
export OPENAI_API_KEY=your-litellm-key
export OPENAI_MODEL=your-model-name

OpenRouter

OpenRouter provides cloud-hosted model access with automatic routing.

export OPENAI_BASE_URL=https://openrouter.ai/api/v1
export OPENAI_API_KEY=your-openrouter-key
export OPENAI_MODEL=qwen/qwen3-coder-30b

Recommended Model

The project recommends Qwen3-Coder-30B-A3B-Instruct [1] as the primary model:

  • Architecture — Mixture-of-Experts with 30B total parameters, 3B active per token
  • Strengths — Strong code generation, instruction following, and tool-use capabilities
  • Tool calling — Native support via vLLM’s qwen3_xml parser
  • Efficiency — MoE architecture means inference cost scales with active parameters (3B), not total parameters (30B)

Other models work if they support function/tool calling through the OpenAI API format, but Qwen3-Coder has been the primary development and testing target.

Cost Tracking

The cost_tracker.py module provides budget enforcement integrated into the agent loop. The CostTracker records events with labels and unit counts, and the agent’s _check_budget() method validates against multiple constraints each turn:

Budget Type           What's Measured
Total tokens          Sum of input + output tokens
Input tokens          Prompt tokens consumed
Output tokens         Completion tokens generated
Reasoning tokens      Tokens used for chain-of-thought (if applicable)
Estimated cost (USD)  Calculated from token counts and model pricing
Tool calls            Number of tool invocations
Model calls           Number of API requests
Session turns         Number of agent loop iterations

Budget violations halt execution immediately with diagnostic messages. Budgets are configurable via CLI flags and can be overridden by hook policy manifests.
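The record-then-check pattern might be sketched like this. The real CostTracker's interface isn't shown in the source, so the method names, the label scheme, and the `BudgetExceeded` exception are illustrative assumptions:

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Halts the agent loop with a diagnostic message."""


@dataclass
class CostTracker:
    """Accumulates per-label unit counts and enforces per-label limits."""
    budgets: dict = field(default_factory=dict)  # label -> limit
    totals: dict = field(default_factory=dict)   # label -> running count

    def record(self, label: str, units: int) -> None:
        self.totals[label] = self.totals.get(label, 0) + units

    def check_budget(self) -> None:
        """Called each turn; raises on the first exceeded budget."""
        for label, limit in self.budgets.items():
            used = self.totals.get(label, 0)
            if used > limit:
                raise BudgetExceeded(f"{label}: {used} used, limit {limit}")
```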

CLI Invocation Modes

The main.py CLI supports multiple execution modes:

Command            Mode          Use Case
agent "prompt"     Synchronous   Single task, returns result
agent-chat         Interactive   Multi-turn REPL with session support
agent-bg "prompt"  Background    Async execution with process management
agent-resume       Continuation  Resume saved session with modified params
agent-ps           Monitoring    List background sessions
agent-logs         Monitoring    View background session output
agent-attach       Monitoring    Attach to running session
agent-kill         Control       Terminate background session

Common Flags

python3 -m src.main agent "task" \
  --cwd . \
  --allow-write \
  --allow-shell \
  --unsafe \
  --stream \
  --max-turns 20 \
  --max-tokens 100000 \
  --temperature 0.7

  • --cwd — Working directory
  • --allow-write — Enable file modifications
  • --allow-shell — Enable shell commands
  • --unsafe — Enable destructive operations
  • --stream — Token-by-token streaming output
  • --max-turns — Limit agent loop iterations
  • --max-tokens — Token budget
  • --temperature — Model temperature

Footnotes


  1. Qwen3-Coder-30B-A3B-Instruct on Hugging Face