---
name: image-generation
description: Generate logos, icons, UI mockups, hero images, product shots, and other design assets using OpenAI gpt-image-1.5 or Google Gemini (Nano Banana 2 Flash, Nano Banana Pro). Use whenever the user asks to create, design, generate, mock up, render, or illustrate a visual asset, or asks for image prompt engineering. Picks the right model for the job, writes a model-specific prompt following the official prompting rules of each provider, calls the API, and saves the PNG to disk.
allowed-tools: Read, Write, Bash, WebFetch
---

# Image Generation

Two model families, three jobs.

## Models covered

| Model ID | Family | Strengths | Default cost |
|---|---|---|---|
| `gemini-3.1-flash-image-preview` | Google (Nano Banana 2 Flash) | Fast, cheap, top-tier text-to-image quality | ~$0.07 / image (1K) |
| `gemini-3-pro-image-preview` | Google (Nano Banana Pro) | 4K, thinking mode, best-in-class text rendering, up to 14 reference images | ~$0.13 / image (2K) |
| `gpt-image-1.5` | OpenAI | Native transparent backgrounds, 5-image high-fidelity reference preservation, strong English text | ~$0.13 / image (high, 1024²) |

## The three jobs of this skill

1. **Pick the model** for the asset and brief. See [reference/model-selection.md](reference/model-selection.md).
2. **Write the prompt** following the model's house rules. **The two providers have OPPOSITE prompt structures** — get this wrong and outputs degrade badly:
   - **OpenAI gpt-image-1.5** wants labeled segments / line breaks, accepts negative prompts ("no watermark"), and rewards explicit constraints. Read [reference/openai-gpt-image-1-5.md](reference/openai-gpt-image-1-5.md).
   - **Gemini** wants narrative paragraphs. **Negative phrasing actively backfires** (rewrite "no people" as "empty street"), and aspect ratio goes in `imageConfig`, not in the prompt text. Read [reference/gemini-image.md](reference/gemini-image.md).
3. **Run the API** via the bundled scripts. See [scripts/README.md](scripts/README.md).
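
To make the Gemini rule concrete, here is a minimal sketch of a request body that keeps the aspect ratio out of the prompt text. The helper name `build_gemini_request` is hypothetical, and the exact nesting (`generationConfig` → `imageConfig` → `aspectRatio`) is an assumption; confirm it against [reference/gemini-image.md](reference/gemini-image.md) before relying on it.

```python
def build_gemini_request(prompt: str, aspect_ratio: str = "16:9") -> dict:
    """Build a JSON-serializable request body (assumed shape, not verified here).

    The prompt stays a narrative paragraph; the aspect ratio travels in
    imageConfig instead of being written into the prompt text.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Assumption: imageConfig sits under generationConfig.
        "generationConfig": {"imageConfig": {"aspectRatio": aspect_ratio}},
    }


# Positive phrasing: "empty street" rather than "no people".
request_body = build_gemini_request(
    "A quiet, empty street at dawn, wet cobblestones reflecting warm shop lights.",
    aspect_ratio="16:9",
)
```

The bundled scripts may already handle this for you; the sketch only shows where each piece of information belongs.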

## Default model selection — the 30-second rule

Default to **Gemini Flash** (`gemini-3.1-flash-image-preview`). Promote when:

| Promote to | When |
|---|---|
| **Gemini Pro** (`gemini-3-pro-image-preview`) | brand-critical final asset · ≥2K resolution · multi-reference brand-style work (5+ refs) · text-heavy image · Hebrew/Arabic/CJK text |
| **OpenAI gpt-image-1.5** | need native transparent-PNG output without post-processing · identity-preservation edits where you must keep up to 5 reference images at high fidelity · strong English text rendering at low cost · `gpt-image-1.5`-specific edit endpoint workflows (virtual try-on, inpainting masks) |

**Stay on Flash** for: exploration, thumbnails, batch variants, quick iteration, anything you'd throw away half of.

Full decision tree per asset type: [reference/model-selection.md](reference/model-selection.md).
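
The 30-second rule can be sketched as a small function. The `Brief` fields and the precedence (OpenAI triggers checked before Gemini Pro triggers) are illustrative assumptions, not part of the skill; the full decision tree in the reference file takes priority.

```python
from dataclasses import dataclass

FLASH = "gemini-3.1-flash-image-preview"
PRO = "gemini-3-pro-image-preview"
GPT = "gpt-image-1.5"


@dataclass
class Brief:
    """Hypothetical summary of a request; field names are illustrative."""
    needs_transparent_png: bool = False
    identity_preserving_edit: bool = False
    brand_critical_final: bool = False
    min_resolution_k: int = 1  # 1, 2, or 4
    reference_images: int = 0
    text_heavy: bool = False
    non_latin_text: bool = False  # Hebrew / Arabic / CJK


def pick_model(brief: Brief) -> str:
    # Assumed precedence: OpenAI-only capabilities win first.
    if brief.needs_transparent_png or brief.identity_preserving_edit:
        return GPT
    # Any single Pro trigger from the table is enough to promote.
    if (
        brief.brand_critical_final
        or brief.min_resolution_k >= 2
        or brief.reference_images >= 5
        or brief.text_heavy
        or brief.non_latin_text
    ):
        return PRO
    # Exploration, thumbnails, batch variants: stay on Flash.
    return FLASH
```

For example, `pick_model(Brief(min_resolution_k=4))` promotes to Pro, while an empty `Brief()` stays on Flash.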

## Workflow per request

1. **Confirm scope.** Ask the user for: asset type, brand context, color/style direction, target aspect ratio + size, and where to save. If the request already covers all of this, skip this step.
2. **Pick the model.** Use the rule above; explain your choice in one sentence.
3. **Read the model-specific reference** — `reference/openai-gpt-image-1-5.md` or `reference/gemini-image.md`. Do not skip this. The two models have opposite prompt structures and you will get it wrong from memory.
4. **Open the matching template** in `templates/` for the asset type. Fill in the placeholders.
5. **For Hebrew or RTL text in the image, read [reference/hebrew-rtl.md](reference/hebrew-rtl.md) FIRST.** Defaults change.
6. **Show the prompt to the user before calling the API** unless they explicitly asked you to "just do it." One round of prompt review prevents most expensive re-generations.
7. **Run the script.** Save to `./generated-images/<descriptive-name>.png` in the cwd unless told otherwise. Create the directory if needed.
8. **Read the saved image with the Read tool — Claude is multimodal and will actually see the pixels.** This is the most important step in the loop. Critique the result against the brief: did the composition land? Is text legible? Are colors accurate? Did anything mangle (extra fingers, broken letterforms, wrong product geometry)? Spot the issues *before* the user has to.
9. **Show the user.** Use `open <path>` on macOS to surface the file, summarize what you see (good and bad), and propose either (a) ship it, (b) iterate with a specific change, or (c) regenerate from a revised prompt.

## Templates by asset type

| Asset | Template |
|---|---|
| Brand logo / mark / wordmark | [templates/logo.md](templates/logo.md) |
| Icon set (consistent style across icons) | [templates/icon-set.md](templates/icon-set.md) |
| Mobile app UI screen | [templates/ui-mobile.md](templates/ui-mobile.md) |
| Web dashboard / SaaS UI | [templates/ui-dashboard.md](templates/ui-dashboard.md) |
| Marketing hero image / banner | [templates/hero-image.md](templates/hero-image.md) |
| Product photography / hero shot | [templates/product-shot.md](templates/product-shot.md) |

For verbatim worked examples, see [examples.md](examples.md).

## API keys

Both keys live in `~/.claude/projects/-Users-shaharshavit/memory/api-keys.md` (already in your global CLAUDE.md instructions). Sections:

- **OpenAI (image generation)** → export as `OPENAI_IMAGE_API_KEY` before calling `scripts/openai-image.sh`
- **Google AI Studio (image generation)** → export as `GEMINI_IMAGE_API_KEY` before calling `scripts/gemini-image.sh`

These keys are *image-gen scoped*. Do not reuse them for chat completions or embeddings.
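
A small pre-flight check can catch a missing key before a script call fails. This is a sketch only: `require_key` is a hypothetical helper, and the variable names are the two listed above.

```python
import os

# Environment variable expected by each bundled script.
REQUIRED_KEYS = {
    "openai": "OPENAI_IMAGE_API_KEY",
    "gemini": "GEMINI_IMAGE_API_KEY",
}


def require_key(provider: str, env=None) -> str:
    """Return the key for `provider`, or raise if it has not been exported."""
    env = os.environ if env is None else env
    name = REQUIRED_KEYS[provider]
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before calling the script")
    return value
```

Passing `env` explicitly keeps the check testable without touching the real environment.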

## Output convention

- Default save location: `./generated-images/<descriptive-name>.png` in the current working directory.
- Filename: descriptive kebab-case based on the brief (e.g., `logo-agentleh-monochrome-v1.png`, `hero-saas-landing-blue-v2.png`).
- Versioning: append `-v1`, `-v2`, etc. when iterating. Never overwrite a previous generation without asking.
- Use `open <path>` on macOS to show the user immediately after generation.
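
The naming and versioning rules above can be sketched as follows. `slugify` and `next_output_path` are hypothetical helpers, not part of the bundled scripts; the sketch picks the first unused `-vN` suffix so a previous generation is never overwritten.

```python
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase kebab-case: runs of non-alphanumerics become single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def next_output_path(brief_name: str, out_dir: str = "generated-images") -> Path:
    """Return the first free '<slug>-vN.png' path, creating the directory if needed."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    slug = slugify(brief_name)
    version = 1
    # Walk forward until an unused version number is found.
    while (directory / f"{slug}-v{version}.png").exists():
        version += 1
    return directory / f"{slug}-v{version}.png"
```

Call `next_output_path("Logo Agentleh Monochrome")` before each generation and pass the result to the script as the save location.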

## Iteration patterns — autonomous "iterate until perfect" loop

The agent's job is to drive the image to "good enough to ship" *before* asking the user. Run this loop:

```
1. Generate (script call)
2. Read the saved file with the Read tool — Claude is multimodal and SEES the pixels
3. Self-critique against the brief. Score yourself honestly:
   - Did the composition land? (subject placement, framing, balance)
   - Is text legible and spelled correctly? (especially logos/UI labels)
   - Are colors right? (hex match, palette adherence)
   - Any defects? (mangled letterforms, extra fingers, broken geometry,
     wrong product details, drifted style)
   - Is the brief actually satisfied?
4. Decide:
   - SHIP → it meets the brief; show the user, summarize, suggest next steps
   - EDIT → specific localized fix needed; call the script again with --ref
            and a "change X, keep Y" prompt (Gemini) or /edits + input_fidelity=high
            (OpenAI). Re-enter the loop.
   - REWRITE → fundamental prompt issue; rewrite the prompt from scratch
              (don't tweak). Re-enter the loop.
5. Stop conditions (check all three every iteration; stop as soon as any one is met):
   - 5 iterations consumed → stop and consult the user before continuing
   - $2 spent on this single asset → stop and consult the user
   - The result is shippable → stop, present, await user verdict
```
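
The stop conditions in step 5 can be sketched as a single check run after every iteration. The function name and return strings are illustrative; the limits (5 iterations, $2 per asset) come from the loop above.

```python
from typing import Optional

MAX_ITERATIONS = 5     # iterations before consulting the user
MAX_SPEND_USD = 2.00   # spend cap for a single asset


def stop_reason(iterations_used: int, spend_usd: float, shippable: bool) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    if shippable:
        return "shippable: present and await the user's verdict"
    if iterations_used >= MAX_ITERATIONS:
        return "iteration cap reached: consult the user"
    if spend_usd >= MAX_SPEND_USD:
        return "budget cap reached: consult the user"
    return None
```

A `None` result means another EDIT or REWRITE pass is allowed; any string means stop and talk to the user.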

**The user is the final judge — but Claude is the first judge.** Don't show the user a flawed result and ask "is this good?" Show them after you've already verified it meets the brief, OR show them with an explicit critique ("the wordmark is off, I'm about to fix it — here's the plan") so they know you saw what they'd see.

### Mechanics by model

- **Gemini multi-turn editing** is the workhorse for refinement. `scripts/gemini-image.sh --ref <previous-output> --prompt "change X, keep Y"`. Always include the explicit preservation clause: *"Keep everything else in the image exactly the same — composition, lighting, colors, all other elements."*
- **OpenAI iteration** uses `/v1/images/edits` with `input_fidelity=high`. On gpt-image-1.5 the **first 5 input images** are preserved at higher fidelity, so pass logo + color swatch + typography reference + product photo + moodboard simultaneously when doing brand-consistent edits. The script supports `--ref` repeatable up to 5.
- **Drift discipline.** After 3 unsuccessful iterations on the *same* image, switch strategy — don't keep tweaking. Either rewrite the base prompt or change models (Flash → Pro for quality lift; Gemini → OpenAI for transparent-bg specific issues).
- **Budget tracking.** Keep a rough running total of spend (see [reference/pricing.md](reference/pricing.md)). Five Pro iterations at 4K cost about $1.20 (5 × ~$0.24). If you blow past the asset's reasonable budget, surface it to the user before continuing.

## Pricing

Per-image cost matters for iteration discipline. Quick reference:

- Logo exploration on Gemini Flash 1K: ~$0.07/call
- Final logo on Gemini Pro 2K: ~$0.13/call
- Final logo on Gemini Pro 4K: ~$0.24/call
- Quick GPT logo at low quality: ~$0.009/call
- Final GPT logo at high quality: ~$0.13/call

Full table: [reference/pricing.md](reference/pricing.md).
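
As a worked example of the arithmetic, the sketch below multiplies the approximate per-call prices from the quick reference by a call count. The lookup keys and the `estimate_spend` helper are hypothetical, and the prices are the rough figures listed above, not authoritative rates.

```python
# Approximate USD per call, copied from the quick reference above.
APPROX_COST_USD = {
    ("gemini-3.1-flash-image-preview", "1K"): 0.07,
    ("gemini-3-pro-image-preview", "2K"): 0.13,
    ("gemini-3-pro-image-preview", "4K"): 0.24,
    ("gpt-image-1.5", "low"): 0.009,
    ("gpt-image-1.5", "high"): 0.13,
}


def estimate_spend(model: str, tier: str, calls: int) -> float:
    """Estimated total for `calls` generations, rounded to a tenth of a cent."""
    return round(APPROX_COST_USD[(model, tier)] * calls, 3)


# Five Pro iterations at 4K: 5 x 0.24 = 1.20
pro_4k_total = estimate_spend("gemini-3-pro-image-preview", "4K", 5)
```

Ten Flash explorations at 1K come to roughly $0.70 by the same calculation, which is why exploration stays on Flash.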

## Hebrew / RTL — known weak spot

Both Gemini models have unreliable Hebrew/Arabic text rendering. **Default workflow for Hebrew text in images: generate text-free background with the image model, then composite Hebrew text on top via SVG/Canvas/Figma.** See [reference/hebrew-rtl.md](reference/hebrew-rtl.md) for the full pipeline before generating any Hebrew asset.

## When NOT to use this skill

- The user wants editable vector files (SVG with paths). These models output rasters; "vector-like" is aesthetic only. Use Illustrator / Figma / SVG by hand.
- The user wants exact pixel-perfect dimensions outside the supported sizes (e.g., 1200×630 OG image). Generate at the closest supported aspect, then crop/resize in post.
- The user wants a real-world photo of a specific real person. Both providers either refuse or produce inaccurate likenesses, and such requests raise policy issues.
- The user is asking for general "AI art" without a design brief — point them to the consumer Gemini app or ChatGPT instead; this skill is tuned for design work.
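
For the crop-then-resize case in the list above, the crop box is simple arithmetic. This sketch computes the largest centered crop matching a target aspect ratio; `center_crop_box` is a hypothetical helper, and the 1536×1024 source size in the example is an assumed generation size, not a statement about what any model supports.

```python
def center_crop_box(src_w: int, src_h: int, target_w: int, target_h: int):
    """Return (left, top, right, bottom) of the largest centered crop
    with the target's aspect ratio."""
    # Integer cross-multiplication compares aspect ratios without float error.
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = src_h
        crop_w = src_h * target_w // target_h
    else:
        # Source is taller (or equal): keep full width, trim top and bottom.
        crop_w = src_w
        crop_h = src_w * target_h // target_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# Assumed 1536x1024 source cropped toward a 1200x630 OG image.
og_box = center_crop_box(1536, 1024, 1200, 630)  # (0, 109, 1536, 915)
```

Crop to the returned box with your image tool of choice, then resize the 1536×806 result down to 1200×630.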