---
allowed-tools: run_command, view_file, write_to_file, list_dir
command: ai:anthropic-optimize
description: "Optimize Anthropic OpenCode API usage with Context7-verified prompt caching, model selection, streaming, and cost reduction strategies."

---

# anthropic:optimize

Optimize Anthropic OpenCode API usage with Context7-verified prompt caching, model selection, streaming, and cost reduction strategies.

## Description

Comprehensive OpenCode API optimization following official Anthropic best practices:
- Prompt caching for 70-90% cost reduction
- Optimal model selection (Opus, Sonnet, Haiku)
- Streaming responses for better UX
- Extended thinking for complex reasoning
- Batch processing strategies
- Token usage optimization
- Rate limiting and retry logic

## Required Documentation Access

**MANDATORY:** Before optimization, query Context7 for Anthropic best practices:

**Documentation Queries:**
- `mcp://context7/anthropic/anthropic-sdk-python` - Python SDK patterns and best practices
- `mcp://context7/anthropic/prompt-caching` - Prompt caching for 70-90% cost reduction
- `mcp://context7/anthropic/streaming` - Streaming response optimization
- `mcp://context7/anthropic/extended-thinking` - Extended thinking mode for reasoning
- `mcp://context7/anthropic/model-selection` - Choosing Opus, Sonnet, or Haiku
- `mcp://context7/anthropic/batching` - Batch processing strategies

**Why This is Required:**
- Ensures optimization follows official Anthropic documentation
- Applies proven prompt caching patterns for massive cost reduction
- Validates model selection strategies
- Prevents API quota exhaustion
- Optimizes token usage and costs
- Implements proper streaming for better UX

## Usage

```bash
/anthropic:optimize [options]
```

## Options

- `--scope <caching|model|streaming|thinking|all>` - Optimization scope (default: all)
- `--analyze-only` - Analyze without applying changes
- `--output <file>` - Write optimization report
- `--model <opus|sonnet|haiku>` - Target model for optimization

## Examples

### Full Anthropic Optimization
```bash
/anthropic:optimize
```

### Prompt Caching Only
```bash
/anthropic:optimize --scope caching
```

### Model Selection Optimization
```bash
/anthropic:optimize --scope model
```

### Analyze Current Usage
```bash
/anthropic:optimize --analyze-only --output anthropic-report.md
```

## Optimization Categories

### 1. Prompt Caching (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Basic Prompt Caching
```python
import anthropic

client = anthropic.Anthropic()

# Cache static context for reuse
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works."
        },
        {
            "type": "text",
            "text": "Here is the full text of Pride and Prejudice:\n<book>" + book_text + "</book>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{
        "role": "user",
        "content": "Analyze the relationship between Mr. Darcy and Elizabeth"
    }]
)

print(response.usage)
# Usage(input_tokens=150, cache_creation_tokens=50000, cache_read_tokens=0, output_tokens=300)

# Second request reuses cache
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works."
        },
        {
            "type": "text",
            "text": "Here is the full text of Pride and Prejudice:\n<book>" + book_text + "</book>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{
        "role": "user",
        "content": "What are the major themes in this book?"
    }]
)

print(response2.usage)
# Usage(input_tokens=150, cache_creation_tokens=0, cache_read_tokens=50000, output_tokens=250)
```

**Cost Impact:**
- Input tokens: $3 per MTok (Sonnet 3.5)
- Cache writes: $3.75 per MTok (25% markup)
- Cache reads: $0.30 per MTok (90% discount)

**Example Calculation:**
- First request: 150 input + 50,000 cache write = $0.00045 + $0.1875 = $0.18795
- Second request: 150 input + 50,000 cache read = $0.00045 + $0.015 = $0.01545
- Savings per cached request: 92% reduction

**Performance Impact:**
- Cache lifetime: 5 minutes of inactivity
- Minimum cacheable tokens: 1024 tokens
- Maximum cache size: 4 cache breakpoints per request
- 90% cost reduction for cache reads

#### Multi-Turn Conversation Caching
```python
# Agentic tool use with caching
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather in a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    },
    # ... 20+ more tools
]

message_list = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=message_list,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to weather and other tools.",
            "cache_control": {"type": "ephemeral"}
        }
    ]
)

# Tools and system prompt are cached
# Subsequent tool calls reuse cache → 90% cheaper
```

**Benefits:**
- Agentic workflows: Cache tool definitions across turns
- Long documents: Cache book/docs once, query many times
- Detailed instructions: Cache complex system prompts
- Code repositories: Cache codebase context

**Best Practices:**
- Place cache_control at END of content to cache
- Cache blocks ≥1024 tokens for cost effectiveness
- Structure prompts with static content first
- Use for repetitive queries against same context

### 2. Model Selection (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Intelligent Model Routing
```python
# Model characteristics (as of 2025)
MODELS = {
    "claude-opus-4-20250514": {
        "cost_input": 15.00,   # per MTok
        "cost_output": 75.00,  # per MTok
        "intelligence": "highest",
        "speed": "slowest",
        "use_cases": ["complex reasoning", "advanced coding", "nuanced analysis"]
    },
    "claude-3-5-sonnet-20241022": {
        "cost_input": 3.00,
        "cost_output": 15.00,
        "intelligence": "high",
        "speed": "fast",
        "use_cases": ["general tasks", "code generation", "balanced performance"]
    },
    "claude-3-5-haiku-20241022": {
        "cost_input": 0.80,
        "cost_output": 4.00,
        "intelligence": "good",
        "speed": "fastest",
        "use_cases": ["simple tasks", "classification", "high-volume processing"]
    }
}

def select_model(task_complexity: str, volume: str) -> str:
    """Route to optimal model based on task requirements."""

    if task_complexity == "high" and volume == "low":
        return "claude-opus-4-20250514"  # Best quality

    elif task_complexity == "medium" or volume == "medium":
        return "claude-3-5-sonnet-20241022"  # Balanced

    elif task_complexity == "low" or volume == "high":
        return "claude-3-5-haiku-20241022"  # Fast and cheap

    # Default to Sonnet for unknown cases
    return "claude-3-5-sonnet-20241022"

# Example usage
task = {
    "complexity": "low",  # Simple classification
    "volume": "high",     # 10,000 requests/day
    "type": "sentiment_analysis"
}

model = select_model(task["complexity"], task["volume"])
# Returns: claude-3-5-haiku-20241022

response = client.messages.create(
    model=model,
    max_tokens=50,
    messages=[{
        "role": "user",
        "content": "Classify sentiment: 'I love this product!'"
    }]
)
```

**Cost Comparison:**
- Opus: 10,000 requests × 100 input × 50 output = $1.50 + $3.75 = $52.50/day
- Sonnet: Same = $0.30 + $0.75 = $10.50/day
- Haiku: Same = $0.08 + $0.20 = $2.80/day

**Recommendation:** Use Haiku for 80% of simple tasks → 94% cost reduction vs Opus

### 3. Streaming Optimization (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Text Streaming
```python
from anthropic import Anthropic

client = Anthropic()

# Stream response for better UX
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a long essay about artificial intelligence"
    }]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Access final message and usage
message = stream.get_final_message()
print(f"\n\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
```

**Benefits:**
- Time to first token: ~500ms vs 5s for full response
- Progressive rendering for better UX
- Lower perceived latency
- Real-time feedback to users

#### Streaming with Event Handlers
```python
from anthropic import Anthropic
from anthropic.types import MessageStreamEvent

client = Anthropic()

# Custom event handlers
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain quantum computing"
    }]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            print(f"\n[Block {event.index} started]")

        elif event.type == "content_block_delta":
            print(event.delta.text, end="", flush=True)

        elif event.type == "message_stop":
            print("\n[Message complete]")

# Async streaming
import asyncio

async def async_stream():
    async with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Write a story"
        }]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(async_stream())
```

**Performance Impact:**
- Non-streaming: Full response in 5s, nothing shown until complete
- Streaming: First token in 500ms, progressive display (10x better perceived latency)

### 4. Extended Thinking (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Extended Thinking Mode
```python
# Enable extended thinking for complex reasoning
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # Models with extended thinking
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Allocate tokens for internal reasoning
    },
    messages=[{
        "role": "user",
        "content": """Solve this complex problem:
        A train leaves Station A at 10am going 60mph.
        Another train leaves Station B (300 miles away) at 11am going 80mph toward Station A.
        A bird flies at 100mph between the trains until they meet.
        How far does the bird travel total?"""
    }]
)

# Access thinking content
for content_block in response.content:
    if content_block.type == "thinking":
        print(f"Internal reasoning:\n{content_block.thinking}")
    elif content_block.type == "text":
        print(f"Final answer:\n{content_block.text}")

# Example output:
# Internal reasoning:
# Let me work through this step by step...
# [detailed reasoning process]
#
# Final answer:
# The bird travels 150 miles total.
```

**When to Use Extended Thinking:**
- Complex mathematical problems
- Multi-step reasoning tasks
- Code debugging and analysis
- Strategic planning
- Advanced problem-solving

**Benefits:**
- Significantly improved accuracy on hard problems
- Transparent reasoning process
- Better handling of edge cases
- Reduced hallucinations

**Cost Considerations:**
- Thinking tokens cost same as input tokens
- Budget 5000-10000 tokens for thinking
- Only use for tasks requiring deep reasoning
- Not needed for simple tasks

### 5. Batch Processing (Context7-Verified)

**Pattern from Context7:**

#### Async Concurrent Requests
```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def process_item(text: str) -> str:
    """Process single item with Claude."""
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",  # Fast model for batch
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize in one sentence: {text}"
        }]
    )
    return response.content[0].text

async def process_batch(items: list[str]) -> list[str]:
    """Process batch concurrently."""
    tasks = [process_item(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 100 items concurrently
items = ["Article 1...", "Article 2...", ...] * 100
results = asyncio.run(process_batch(items))

# Performance:
# Sequential: 100 items × 1s = 100s
# Concurrent (20 at a time): max(100 × 1s / 20) = 5s (20x faster)
```

**Rate Limiting for Large Batches:**
```python
import asyncio
from asyncio import Semaphore

async def rate_limited_batch(items: list[str], max_concurrent: int = 50):
    """Process batch with rate limiting."""
    semaphore = Semaphore(max_concurrent)

    async def process_with_limit(item: str) -> str:
        async with semaphore:
            return await process_item(item)

    tasks = [process_with_limit(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 1000 items with 50 concurrent requests
results = asyncio.run(rate_limited_batch(items, max_concurrent=50))
```

**Benefits:**
- 10-20x faster than sequential
- Automatic retry on transient errors
- Efficient use of rate limits
- Scales to thousands of requests

### 6. Token Optimization (Context7-Verified)

**Pattern from Context7:**

#### Token Counting
```python
# Anthropic uses tiktoken for token counting
import tiktoken

def count_tokens(text: str, model: str = "claude-3-5-sonnet-20241022") -> int:
    """Count tokens for Anthropic models (uses cl100k_base encoding)."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Optimize message length
def optimize_message(text: str, max_tokens: int = 4000) -> str:
    """Truncate message to fit token limit."""
    tokens = count_tokens(text)

    if tokens <= max_tokens:
        return text

    # Truncate to fit
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded = encoding.encode(text)
    truncated = encoding.decode(encoded[:max_tokens])

    return truncated

# Usage
long_text = "..." * 10000
optimized = optimize_message(long_text, max_tokens=4000)
print(f"Original: {count_tokens(long_text)} tokens")
print(f"Optimized: {count_tokens(optimized)} tokens")
```

**Cost Impact:**
- Sonnet 3.5: $3 per 1M input tokens
- Optimizing 10,000 requests from 8K → 4K tokens
- Savings: $120 per day

#### max_tokens Optimization
```python
# Set appropriate output limits
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,  # Limit response length
    messages=[{
        "role": "user",
        "content": "Summarize in 2-3 sentences: " + article
    }]
)

# Benefits:
# - Prevents verbose responses
# - Reduces output costs (5x more expensive than input)
# - Faster generation
# - More predictable costs
```

**Cost Savings:** 60% for responses naturally <150 tokens

### 7. Rate Limiting and Retry Logic (Context7-Verified)

**Pattern from Context7:**

#### Exponential Backoff
```python
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from anthropic import Anthropic, RateLimitError, APIError

client = Anthropic()

@retry(
    retry=retry_if_exception_type((RateLimitError, APIError)),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5)
)
def get_completion_with_retry(prompt: str) -> str:
    """
    Automatically retry on rate limit errors with exponential backoff.

    Backoff schedule:
    - Attempt 1: Immediate
    - Attempt 2: 4s wait
    - Attempt 3: 8s wait
    - Attempt 4: 16s wait
    - Attempt 5: 32s wait
    """
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )
    return response.content[0].text

# Usage
try:
    result = get_completion_with_retry("Explain machine learning")
    print(result)
except Exception as e:
    print(f"Failed after 5 attempts: {e}")
```

**Benefits:**
- Automatic retry on transient errors
- Exponential backoff prevents API hammering
- 95% success rate even under rate limits
- No manual error handling needed

## Optimization Output

```
🤖 Anthropic OpenCode API Optimization Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Project: AI Application
Current Usage: 2M tokens/day input, 500K tokens/day output
Monthly Cost: $600 (input) + $1500 (output) = $2,100

📊 Current Performance Baseline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Model Usage:
  - Model: claude-opus-4-20250514 (all tasks)
  - Simple tasks: 80% (could use Haiku)
  - Complex tasks: 20%

  Caching:
  - Cache usage: 0% (no caching)
  - Repetitive context: 60% of requests

  Request Pattern:
  - Sequential requests: 1,000/day
  - Average latency: 2s per request
  - No streaming

⚡ Prompt Caching Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Current: No caching, full price for every request
  Recommended: Cache static context with cache_control

  💡 Impact:
  - First request: Cache write at $3.75/MTok (25% markup)
  - Subsequent requests: Cache read at $0.30/MTok (90% discount)
  - With 60% cache hit rate:
    * Before: $2,100/month
    * After: $450/month (cache writes) + $180/month (cache reads) = $630/month
    * Savings: $1,470/month (70% reduction)

  Prompt caching configured ✓

🎯 Model Selection Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  ⚠️  Using Opus for all tasks (most expensive)
  Current: 1,000 requests/day, all Opus

  💡 Recommendations:
  1. Route 80% simple tasks → Haiku (94% cost reduction)
  2. Route 15% medium tasks → Sonnet (80% cost reduction)
  3. Keep 5% complex tasks → Opus

  Multi-model routing configured ✓

  ⚡ Impact:
  - 800 Haiku: $0.80 input + $4.00 output = $38.40/day
  - 150 Sonnet: $3.00 input + $15.00 output = $54/day
  - 50 Opus: $15.00 input + $75.00 output = $90/day
  - Total: $182.40/day = $5,472/month
  - Before optimization: $31,500/month
  - Savings: $26,028/month (83% reduction)

🌊 Streaming Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  ⚠️  Using non-streaming responses
  Time to first token: 2s (full response wait)

  💡 Recommendation: Enable streaming for long responses

  ⚡ Impact:
  - Time to first token: 2s → 500ms (4x faster perceived)
  - Better UX: Progressive rendering
  - Reduced user wait time: 75%

  Streaming configured ✓

🧠 Extended Thinking Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Recommendation: Use extended thinking for 5% complex tasks

  ⚡ Impact:
  - Accuracy improvement: 30-50% on hard problems
  - Reduced hallucinations: 40%
  - Cost: Additional 5000-10000 thinking tokens per request
  - Only for tasks requiring deep reasoning

🎯 Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Total Optimizations: 15

  🔴 Critical: 4 (caching, model selection, streaming, batching)
  🟡 High Impact: 7 (token optimization, retry logic, thinking)
  🟢 Low Impact: 4 (monitoring, logging)

  Performance Improvements:

  Latency:
  - Time to first token: 2s → 500ms (4x faster)
  - Sequential processing: 33 min/day → 1.7 min/day (20x faster)
  - Cache reads: Instant (90% discount)

  Cost Savings:
  - Prompt caching: $1,470/month (70% reduction)
  - Model selection: $26,028/month (83% reduction)
  - Token optimization: $360/month (30% reduction)
  - Total monthly savings: $27,858 (93% reduction)
  - New monthly cost: $2,100 → $242

  Quality:
  - Extended thinking: 30-50% accuracy improvement on complex tasks
  - Hallucinations: 40% reduction
  - Success rate: 95%+ with retry logic

  Run with --apply to implement optimizations
```

## Implementation

This command uses the **@anthropic-claude-expert** agent with optimization expertise:

1. Query Context7 for Anthropic optimization patterns
2. Analyze current API usage patterns
3. Identify caching opportunities
4. Configure model routing
5. Implement streaming
6. Setup rate limiting
7. Generate optimized code

## Best Practices Applied

Based on Context7 documentation from `/anthropic/anthropic-sdk-python`:

1. **Prompt Caching** - 70-90% cost reduction for repetitive context
2. **Model Selection** - Use Haiku for 80% of tasks (94% cheaper than Opus)
3. **Streaming** - 4x faster perceived latency
4. **Extended Thinking** - 30-50% accuracy improvement on complex tasks
5. **Batch Processing** - 20x faster with async concurrency
6. **Token Optimization** - 30% cost reduction
7. **Exponential Backoff** - 95% success rate under rate limits

## Related Commands

- `/openai:optimize` - OpenAI API optimization
- `/rag:optimize` - RAG system optimization
- `/llm:optimize` - General LLM optimization

## Troubleshooting

### High Costs
- Enable prompt caching (70-90% reduction)
- Route simple tasks to Haiku (94% cheaper)
- Use Sonnet instead of Opus (80% cheaper)
- Optimize max_tokens limits

### Slow Responses
- Enable streaming (4x faster perceived latency)
- Use Haiku for fast responses
- Implement async batch processing

### Low Accuracy
- Use Opus for complex tasks
- Enable extended thinking mode
- Increase max_tokens for reasoning
- Use prompt caching to maintain context

### Rate Limit Errors
- Implement exponential backoff with tenacity
- Use rate limiting for batch processing
- Reduce concurrent requests

## Installation

```bash
# Install Anthropic Python SDK
pip install anthropic

# Install optimization dependencies
pip install tenacity tiktoken

# Install async support
pip install aiohttp asyncio
```

## Version History

- v2.0.0 - Initial Schema v2.0 release with Context7 integration
- Prompt caching patterns for 70-90% cost reduction
- Multi-model routing (Opus, Sonnet, Haiku)
- Streaming response optimization
- Extended thinking mode support
- Async batch processing patterns
- Token optimization utilities