Why Your Annotations Aren't Working

Developers annotate screenshots the same way they would for a human colleague — a thin red arrow, maybe a circle, occasionally some text. This works for humans because humans understand visual conventions: an arrow means "look here," a circle means "this area," and red means "this is important."

AI vision models don't share these conventions. They process annotations as colored pixels in an image grid, not as semantic markers. A red arrow looks, to the model, like a diagonal red line overlaid on an interface. The model must independently determine that this line is an annotation (not a UI element), figure out which end is the arrowhead (indicating what it's pointing at), and infer the developer's intent from the position and direction.

This works about 60 to 70% of the time with well-placed annotations. It fails the other 30 to 40% of the time — the AI fixes the wrong element, misidentifies the arrow's target, or interprets the annotation as part of the interface itself.

The failures aren't random. They follow predictable patterns tied to how vision models process images. By adjusting annotation practices to account for these patterns, developers can raise interpretation accuracy from the 60 to 70% baseline to over 90%.


Rule 1: One Annotation, One Screenshot

This is the single highest-impact change you can make.

When a screenshot has multiple annotations, the AI must parse each one independently, determine what each is pointing at, and figure out the relationship between the issues. Attention is distributed across all annotations, and each additional mark dilutes the model's focus on the others.

Instead of this: One screenshot with 4 arrows pointing at 4 different problems, plus a text prompt listing all 4 issues.

Do this: Four screenshots, each with one arrow and a brief description. Send them in a single message if the issues are related.

The per-screenshot overhead is small (a few seconds of capture time). The interpretation accuracy improvement is substantial — single-annotation screenshots are interpreted correctly over 90% of the time versus roughly 60% for multi-annotation images.


Rule 2: Thick Lines, Sharp Arrowheads

Vision models downsample screenshots before processing. A 2880-pixel-wide Retina screenshot is typically reduced to 1024 pixels or less. A 2px arrow stroke at original resolution becomes sub-pixel after downsampling — effectively invisible.

Minimum stroke width: 4px at the output resolution. If your annotation tool draws at the original capture resolution, use 6 to 8px to survive downsampling.
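The arithmetic behind this rule is easy to sanity-check. A minimal sketch (the function name is illustrative, and the 1024px model width is an assumption — actual model input sizes vary):

```python
def effective_stroke_px(capture_width, model_width, stroke_px):
    """Approximate stroke width after the model downsamples the image."""
    scale = model_width / capture_width
    return stroke_px * scale

# A 2px stroke on a 2880px Retina capture, downsampled to 1024px:
effective_stroke_px(2880, 1024, 2)   # ~0.71px: sub-pixel, effectively invisible
effective_stroke_px(2880, 1024, 6)   # ~2.13px: survives downsampling
```

Run this against your own capture and target resolutions to decide how thick to draw.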

Arrowhead style: Triangular filled arrowheads survive better than open arrowheads or simple line endpoints. The triangular head is a distinct shape that remains recognizable at low resolution. An open "V" arrowhead may look like a fork or junction after downsampling.

Arrow length: Longer arrows are more reliably detected than short ones. An arrow that spans 100+ pixels (at output resolution) crosses multiple vision patches and generates a stronger signal. A 30-pixel arrow might fall within a single patch and be lost in the patch embedding.
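If your annotation tool only draws open arrowheads, a filled triangular head is simple to construct yourself. A sketch of the vertex math (function name and default sizes are illustrative; the resulting triangle can be filled by any drawing library, e.g. Pillow's ImageDraw.polygon):

```python
import math

def arrowhead(tail, tip, head_len=24, head_width=18):
    """Vertices of a filled triangular arrowhead pointing at `tip`.

    Triangular heads remain recognizable after downsampling, unlike
    open "V" endpoints, which can read as a fork or junction.
    """
    angle = math.atan2(tip[1] - tail[1], tip[0] - tail[0])
    # Center of the triangle's base, head_len back from the tip.
    base_x = tip[0] - head_len * math.cos(angle)
    base_y = tip[1] - head_len * math.sin(angle)
    # Perpendicular offsets for the two base corners.
    dx = (head_width / 2) * math.sin(angle)
    dy = (head_width / 2) * math.cos(angle)
    return [tip, (base_x - dx, base_y + dy), (base_x + dx, base_y - dy)]
```

Draw the shaft as a thick line from tail to tip, then fill this triangle at the tip.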


Rule 3: Use Colors That Contrast with the UI

The default annotation color in most tools is red — and red is the worst default for a significant portion of interfaces. Error states, required field indicators, delete buttons, notification badges, alert banners — all red. A red arrow on a page with red UI elements is visually ambiguous.

Before annotating, glance at the UI color scheme:

UI uses...                            Annotate with...
Red/orange (errors, alerts, CTAs)     Blue or green
Blue (links, buttons, navigation)     Red or orange
Green (success states, badges)        Red or magenta
Gray/neutral (minimal color)          Red (the default works here)
Dark theme                            White, yellow, or bright green
Light theme                           Red, blue, or dark green

The goal is maximum contrast between the annotation and the surrounding content. The vision model needs to distinguish "this is an annotation" from "this is part of the interface." Color contrast is the primary signal it uses for this distinction.


Rule 4: Crop Before You Annotate

A full-screen screenshot with an arrow pointing at a small element is asking the model to find a needle in a haystack. The arrow occupies a small fraction of the total image area, and after downsampling, both the arrow and the target element may be too small to interpret reliably.

Crop to the relevant region first, then annotate. A tight crop — just the section of the interface where the issue exists, with enough surrounding context to understand the layout — dramatically improves accuracy.

Practical sizing: crop to roughly 400×400 to 800×600 pixels around the area of interest. This gives the vision model enough context to understand the layout while keeping the annotation large relative to the image area.
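Computing a crop box around a point of interest, clamped so it never runs off the image, is a few lines. A sketch (function name and the 600×450 default are illustrative; the returned tuple matches Pillow's Image.crop convention):

```python
def crop_box(center, img_size, crop_size=(600, 450)):
    """Crop box around a point of interest, clamped to image bounds.

    Returns (left, top, right, bottom), shifted inward when the
    requested box would extend past an edge.
    """
    cx, cy = center
    w, h = crop_size
    iw, ih = img_size
    left = min(max(cx - w // 2, 0), max(iw - w, 0))
    top = min(max(cy - h // 2, 0), max(ih - h, 0))
    return (left, top, min(left + w, iw), min(top + h, ih))
```

Annotate after cropping, so stroke widths and labels are sized relative to the final image.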

Exception: If the issue is about the relationship between distant elements (e.g., "the header and the footer are too close"), a wider shot is necessary. In this case, use two annotations — one at each element — and describe the relationship in text.


Rule 5: Pair Arrows with Text Labels

An arrow says "look here." It doesn't say "the gap is too large" or "this color should be blue" or "this element should be left-aligned."

Adding a brief text label near the arrow tells the AI exactly what's wrong. Vision models read text in images with near-perfect accuracy — far higher accuracy than they achieve interpreting the meaning of a graphical arrow.

Effective labels are short and concrete, naming the problem or the target value:

  • "gap too large, should be 16px"
  • "this should be blue"
  • "left-align this"

Label placement: Place the label near the arrowhead, not at the tail. The model's attention follows the arrow to its point, and text near the point reinforces what the arrow is indicating.

Label styling: Use a contrasting color, a readable size (at least 14px at output resolution), and ideally a background swatch so the text is readable against any UI background.
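Placing the label swatch is mostly edge-case handling: sit just past the arrowhead, flip sides when the label would run off the image. A sketch of that geometry (function name and default paddings are illustrative):

```python
def label_box(arrowhead_xy, text_w, text_h, img_size, pad=6, offset=10):
    """Background-swatch rectangle for a text label near the arrowhead.

    Placed just right of the arrow tip; flips to the left and clamps
    vertically if it would run off the image. Draw the rectangle
    first, then render the text inside it.
    """
    x, y = arrowhead_xy
    iw, ih = img_size
    w, h = text_w + 2 * pad, text_h + 2 * pad
    left = x + offset
    if left + w > iw:            # would overflow: flip to the left side
        left = x - offset - w
    top = min(max(y - h // 2, 0), ih - h)
    return (left, top, left + w, top + h)
```

Measure text_w and text_h with your drawing library (e.g. Pillow's textbbox) before calling this.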


Rule 6: Choose the Right Annotation Shape

Arrows aren't always the best tool. Different shapes communicate different types of issues:

Arrow (→): Best for pointing at a specific element. "This button is wrong." The arrow's direction identifies the target.

Rectangle (□): Best for highlighting a region or area. "Everything in this box needs to change." Rectangles define boundaries.

Ellipse (○): Best for drawing attention to a small detail. "Notice this icon" or "see this text." Circles work like a spotlight.

Double arrow (↔): Best for indicating spacing or distance. "The gap between these two elements." The two endpoints define the measurement.

When in doubt, use a rectangle. Rectangles are the least ambiguous annotation shape for vision models because they define a clear, bounded region. The model can be confident that "everything inside this rectangle is the subject."
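The shape choice above reduces to a small lookup with a rectangle fallback. A trivial sketch (the issue-type names are illustrative, not a standard taxonomy):

```python
SHAPE_FOR_ISSUE = {
    "specific_element": "arrow",      # "this button is wrong"
    "region": "rectangle",            # "everything in this box"
    "small_detail": "ellipse",        # "notice this icon"
    "spacing": "double_arrow",        # "the gap between these elements"
}

def choose_shape(issue_type):
    # Rectangles are the least ambiguous fallback for vision models.
    return SHAPE_FOR_ISSUE.get(issue_type, "rectangle")
```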


Rule 7: Describe What You Want, Not Just What's Wrong

The screenshot and annotation show the current state. Your text prompt should describe the desired state. This pairing — visual current state + textual desired state — gives the AI everything it needs to generate a targeted fix.

Weak prompt: "Fix this" + screenshot with arrow

Strong prompt: "The gap between the card and the header is ~32px. It should be 16px. Also the card border-radius should be 12px not 8px." + screenshot with rectangle around the card

The annotation draws the AI's visual attention. The text specifies the action. Neither is sufficient alone. Together, they produce the highest accuracy.


Rule 8: Use Before/After Pairs When Possible

For visual changes, sending two screenshots — one showing the current state and one showing a reference — is more effective than describing the difference in text.

Labeling: Add a text label directly on each image: "Current" and "Expected" (or "Before" and "Reference"). The AI reads these labels and understands the comparison context.
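Composing the labeled pair is a layout problem: a label strip on top, the two images side by side below it. A sketch of the offsets (function name and defaults are illustrative; paste each image at its offset and draw the label text at the label coordinates):

```python
def pair_layout(cur_size, ref_size, gap=20, label_h=28):
    """Canvas size and paste offsets for a labeled Current/Expected pair.

    Returns the canvas (width, height), the top-left paste position
    for each image, and where to draw each label above its image.
    """
    (cw, ch), (rw, rh) = cur_size, ref_size
    canvas = (cw + gap + rw, label_h + max(ch, rh))
    return {
        "canvas": canvas,
        "current_at": (0, label_h),
        "expected_at": (cw + gap, label_h),
        "labels": {"Current": (0, 0), "Expected": (cw + gap, 0)},
    }
```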

This is especially effective for:

  • Color mismatches between the current build and a reference design
  • Spacing and alignment differences that are hard to describe precisely in text
  • Layout regressions, where an earlier build or mockup shows the correct state


Rule 9: Include Context Metadata

The AI doesn't know your viewport size, browser, OS, or display resolution from the screenshot pixels. These are important for generating correct CSS fixes.

Minimum context to include in your prompt or in an AI Context Banner on the image:

  • Viewport dimensions (e.g., 1440×900)
  • Browser and version
  • Operating system
  • Device pixel ratio or display resolution

If your annotation tool adds an AI Context Banner automatically, you don't need to type this. If it doesn't, add a one-line context string to your prompt.
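Assembling that one-line context string is trivial to script. A sketch (the function name is illustrative; viewport, device pixel ratio, and browser must come from your own environment, since they can't be read from the screenshot pixels):

```python
import platform

def context_line(viewport, dpr, browser):
    """One-line context string to append to the prompt when the
    annotation tool doesn't add a context banner automatically."""
    return (f"Context: {viewport[0]}x{viewport[1]} viewport, "
            f"{dpr}x DPR, {browser}, {platform.system()}")

context_line((1440, 900), 2, "Chrome 126")
```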


The Annotation Decision Checklist

Before pasting an annotated screenshot into an AI tool, run through this checklist:

  1. One issue per screenshot? If not, split into multiple screenshots.
  2. Annotation visible at 50% zoom? If not, increase stroke width or crop tighter.
  3. Annotation color contrasts with UI? If not, switch to a contrasting color.
  4. Text label near the annotation? If not, add a brief label describing the issue.
  5. Context included? Either as an AI Context Banner or in the text prompt.
  6. Desired state described in text? The screenshot shows what exists — your prompt should say what you want.

Frequently Asked Questions

Does it matter which annotation tool I use?

The tool matters less than the annotation practices. However, tools that auto-copy after annotation, add AI Context Banners, and default to high-contrast colors reduce friction and eliminate steps where accuracy is lost.

Should I annotate programmatic output (terminal, logs) or just UI?

Text-based output rarely needs annotation. For terminal output, console logs, and JSON, copy-paste the text directly. Text-as-text is always more accurate than text-in-a-screenshot. Reserve screenshot annotation for genuinely visual content — layout, styling, visual bugs.

How many colors should I use in annotations?

One color per screenshot is ideal. If you must use multiple (e.g., annotating two different issues on one image), use two strongly contrasting colors and reference them in your text prompt: "The red arrow shows the spacing issue. The blue rectangle shows the color problem."

Do rectangles and circles work as well as arrows?

Yes, for identifying regions, rectangles work better than arrows because they define clear boundaries. For pointing at specific elements, arrows work better because they indicate direction. For general attention-drawing, both work. The AI's interpretation of rectangles (bounded area) is more reliable than its interpretation of arrow direction.


Key Takeaways

  • Single-annotation screenshots are interpreted correctly 90%+ of the time. Multi-annotation screenshots drop to ~60%.
  • Minimum 4px stroke width at output resolution. Thinner lines disappear after downsampling.
  • Choose annotation colors that contrast with the UI being annotated. Red-on-red-UI is the most common cause of misinterpretation.
  • Crop tightly before annotating. A 600×400 crop with one clear annotation is far more effective than a full-screen screenshot with a distant arrow.
  • Pair annotations with text labels. Vision models read text more reliably than they interpret graphical markers.
  • Describe the desired state in your text prompt. The screenshot shows what exists. The prompt says what you want.
