---
command: ai:llm-optimize
description: "ai:llm-optimize"
---

# llm:optimize

Optimize LLM inference performance with Context7-verified model selection, prompt engineering, and deployment strategies.

## Description

Comprehensive LLM optimization following industry best practices:
- Model selection and sizing
- Prompt engineering optimization
- Inference acceleration techniques
- Context window management
- Token usage optimization
- Multi-model routing strategies

## Required Documentation Access

**MANDATORY:** Before optimization, query Context7 for LLM best practices:

**Documentation Queries:**
- `mcp://context7/llm/model-selection` - Model selection strategies
- `mcp://context7/llm/prompt-engineering` - Prompt optimization
- `mcp://context7/llm/inference-optimization` - Inference acceleration
- `mcp://context7/llm/token-optimization` - Token usage reduction
- `mcp://context7/llm/context-management` - Context window strategies

**Why This is Required:**
- Ensures optimization follows industry best practices
- Applies proven model selection criteria
- Validates prompt engineering techniques
- Prevents performance bottlenecks
- Optimizes cost and latency

## Usage

```bash
/llm:optimize [options]
```

## Options

- `--scope` - Optimization scope (default: all)
- `--analyze-only` - Analyze without applying changes
- `--output` - Write optimization report

## Optimization Categories

### 1. Model Selection (Context7-Verified)

**Pattern: Right-Sizing Models**

```python
# Use the smallest model that can handle the task
MODEL_ROUTING = {
    "simple_classification": "gpt-4o-mini",  # Fast, cheap
    "code_generation": "gpt-4o",             # Balanced
    "complex_reasoning": "gpt-4-turbo",      # Powerful
}

def select_model(task_type: str) -> str:
    # Unknown task types fall back to the cheapest model
    return MODEL_ROUTING.get(task_type, "gpt-4o-mini")
```

**Cost Impact:**
- gpt-4o-mini: $0.15/$0.60 per 1M tokens (input/output)
- gpt-4o: $2.50/$10.00 per 1M tokens
- gpt-4-turbo: $10.00/$30.00 per 1M tokens

**Recommendation:** Route about 80% of tasks to gpt-4o-mini → roughly 90% cost reduction compared with running every task on gpt-4-turbo

### 2. Prompt Engineering (Context7-Verified)

**Pattern: System Prompts**

```python
# Concise system prompt (saves tokens)
SYSTEM_PROMPT = """You are a helpful assistant. Be concise."""

# vs

# Verbose system prompt (wastes tokens)
SYSTEM_PROMPT_VERBOSE = """
You are a highly knowledgeable AI assistant designed to help users
with a wide variety of tasks. Your responses should be detailed,
accurate, and helpful. Always maintain a professional tone...
"""
```

**Token Savings:** 70% (15 tokens vs 50 tokens)

**Pattern: Few-Shot Examples**

```python
# Optimal: 2-3 examples
FEW_SHOT = """
Classify sentiment:

Text: "I love this!" → Positive
Text: "It's okay" → Neutral
Text: "Terrible!" → Negative

Now classify: "{text}"
"""

# Too many: 10+ examples (wastes tokens and context)
```

**Performance Impact:**
- 2-3 examples: 95% accuracy
- 10+ examples: 96% accuracy (a 1-point gain for 3x the cost)

### 3. Inference Optimization (Context7-Verified)

**Pattern: Streaming for Long Responses**

```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        # Some chunks carry no choices or no text; skip them
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Benefits:
# - Time to first token: 500ms vs 5s
# - Better UX (progressive rendering)
# - Lower perceived latency
```

**Pattern: Parallel Requests**

```python
import asyncio

# get_completion is assumed to be an async helper that sends one
# prompt and returns the completion text.
async def process_batch(prompts: list[str]) -> list[str]:
    tasks = [get_completion(p) for p in prompts]
    # gather runs the requests concurrently and keeps input order
    return await asyncio.gather(*tasks)

# Illustrative timings, subject to API rate limits:
# 10 sequential: 20s
# 10 parallel: 2s (10x faster)
```

### 4. Token Optimization (Context7-Verified)

**Pattern: max_tokens Limit**

```python
# Set appropriate limits
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150  # Hard cap: output is cut off at 150 tokens
)

# A finish_reason of "length" means the cap truncated the answer
truncated = response.choices[0].finish_reason == "length"

# Benefits:
# - Bounds response length
# - Reduces output costs
# - Faster generation
```

**Cost Savings:** about 40% on tasks whose useful answer fits in under 150 tokens

**Pattern: Truncate Long Inputs**

```python
import tiktoken

def truncate_to_tokens(text: str, max_tokens: int = 4000) -> str:
    encoding = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Keep the first max_tokens tokens and drop the rest
    return encoding.decode(tokens[:max_tokens])

# Usage
long_text = "..." * 10000
optimized = truncate_to_tokens(long_text, max_tokens=4000)
```

### 5. Context Window Management (Context7-Verified)

**Pattern: Sliding Window**

```python
import tiktoken

ENCODING = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    """One possible token counter, based on the model's encoding."""
    return len(ENCODING.encode(text))

def get_relevant_context(history: list, max_tokens: int = 4000):
    """Keep only recent messages that fit in context."""
    total_tokens = 0
    relevant = []

    # Walk from newest to oldest until the token budget is used up
    for msg in reversed(history):
        msg_tokens = count_tokens(msg["content"])
        if total_tokens + msg_tokens > max_tokens:
            break
        relevant.insert(0, msg)
        total_tokens += msg_tokens

    return relevant

# Benefits:
# - Prevents context overflow
# - Maintains relevant history
# - Avoids errors
```

**Pattern: Summarization**

```python
# client is assumed to be an openai.AsyncOpenAI instance
async def summarize_old_context(history: list[dict]) -> list[dict]:
    """Replace all but the last 5 messages with a short summary."""
    old_messages = history[:-5]     # All but last 5
    recent_messages = history[-5:]  # Last 5

    if not old_messages:
        return history  # Nothing old enough to summarize

    # Render old messages as a plain-text transcript
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)

    # Summarize old context
    summary = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation:\n{transcript}"
        }],
        max_tokens=200
    )

    # Return summary + recent messages
    return [{
        "role": "system",
        "content": f"Previous context: {summary.choices[0].message.content}"
    }] + recent_messages

# Token savings: 70-80% for long conversations
```

### 6. Multi-Model Routing (Context7-Verified)

**Pattern: Task-Based Routing**

```python
# client is assumed to be an openai.AsyncOpenAI instance
async def route_request(task: dict) -> str:
    """Route to optimal model based on task complexity."""

    # Classify complexity
    if task["tokens"] < 500 and task["type"] == "simple":
        model = "gpt-4o-mini"  # Cheap, fast
    elif task["requires_reasoning"]:
        model = "gpt-4-turbo"  # Powerful
    else:
        model = "gpt-4o"  # Balanced

    response = await client.chat.completions.create(
        model=model,
        messages=task["messages"]
    )

    return response.choices[0].message.content

# Cost optimization: 85% of tasks use gpt-4o-mini
# Performance: No quality degradation for simple tasks
```

## Optimization Output

```
🚀 LLM Inference Optimization Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Usage:
- Model: gpt-4-turbo (all tasks)
- Prompts: Verbose system prompts
- Context: No management
- Monthly Cost: $2,500

📊 Optimization Recommendations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Model Selection (90% cost reduction)
   - 80% tasks → gpt-4o-mini
   - 15% tasks → gpt-4o
   - 5% tasks → gpt-4-turbo
   Cost: $2,500 → $250/month

2. Prompt Engineering (70% token reduction)
   - Concise system prompts
   - 2-3 few-shot examples
   - Direct instructions
   Token savings: 40%

3. Context Management (50% savings)
   - Sliding window for history
   - Summarize old context
   - Truncate long inputs
   Token savings: 50%

4. Inference Optimization
   - Streaming: Better UX
   - Parallel requests: 10x faster
   - max_tokens limits: 40% savings

🎯 Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Estimated Impact:
- Cost: $2,500 → $250/month (90% reduction)
- Latency: -60% (streaming + parallel)
- Quality: Maintained (task-appropriate models)

Run with --apply to implement optimizations
```

## Implementation

Uses **@openai-python-expert** and **@gemini-api-expert** agents:

1. Analyze current model usage
2. Classify tasks by complexity
3. Implement multi-model routing
4. Optimize prompts
5. Add context management
6. Enable streaming
7. Implement parallel processing

## Best Practices Applied

1. **Model Right-Sizing** - Use smallest capable model (90% savings)
2. **Concise Prompts** - 70% token reduction
3. **Streaming** - 10x better perceived latency
4. **Context Management** - 50% token savings
5. **Parallel Processing** - 10x throughput
6. **Token Limits** - 40% cost reduction

## Related Commands

- `/openai:optimize` - OpenAI-specific optimization
- `/rag:optimize` - RAG system optimization
- `/ai:model-deployment` - Model deployment

## Installation

```bash
pip install openai tiktoken tenacity redis
```

## Version History

- v2.0.0 - Initial release with Context7 patterns
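
## Appendix: Checking the Cost Estimate

The 90% cost reduction quoted in the sample report can be sanity-checked from the per-token prices listed under Model Selection. The sketch below is a rough check, not part of the command. `blended_price` and `reduction` are hypothetical helper names, the prices and the 80/15/5 split are copied from this document, and it assumes every task uses the same number of tokens no matter which model serves it.

```python
# Prices in USD per 1M tokens as (input, output), copied from the
# Cost Impact list in this document.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "gpt-4-turbo": (10.00, 30.00),
}

# Share of tasks per model, copied from the sample report.
# Assumption: a task's share of traffic equals its share of tokens.
MIX = {"gpt-4o-mini": 0.80, "gpt-4o": 0.15, "gpt-4-turbo": 0.05}


def blended_price(mix: dict[str, float]) -> tuple[float, float]:
    """Return the traffic-weighted (input, output) price per 1M tokens."""
    blended_in = sum(share * PRICES[model][0] for model, share in mix.items())
    blended_out = sum(share * PRICES[model][1] for model, share in mix.items())
    return blended_in, blended_out


def reduction(baseline: float, optimized: float) -> float:
    """Fractional saving of the optimized price relative to the baseline."""
    return 1 - optimized / baseline


if __name__ == "__main__":
    # Baseline: every task runs on gpt-4-turbo, as in the sample report
    base_in, base_out = PRICES["gpt-4-turbo"]
    new_in, new_out = blended_price(MIX)
    print(f"input saving:  {reduction(base_in, new_in):.3f}")
    print(f"output saving: {reduction(base_out, new_out):.3f}")
```

Under these assumptions the input-side saving comes out at about 90% and the output-side saving at about 88%, so the headline figure matches the input-token case. A workload where complex tasks use far more tokens than simple ones would save less.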