What Is Annotation-Aware Capture?

Annotation-aware capture is a screenshot and screen recording methodology where visual annotations — arrows, rectangles, circles, highlights — are not only rendered as pixels in the image but also encoded as structured text metadata. This metadata includes the annotation type, position coordinates, dimensions, color, draw order, source application, and timestamp. It is embedded directly in the captured image in a format that large language models can extract through OCR and text processing.

Annotation-aware capture solves the Visual Context Gap — the disconnect between what a developer sees on screen and what an AI coding tool can interpret from a screenshot. (For a full technical explanation of the Visual Context Gap, read Part 1 of this series.)

The core principle: rather than relying on vision models to infer the purpose of drawn shapes from pixel patterns — a task they perform unreliably — annotation-aware capture explicitly tells the model what each annotation means, using the model's strongest capability: text comprehension.


Why This Approach Works

Large language models are, fundamentally, language models. Their text processing capabilities far exceed their visual reasoning abilities. A model that struggles to determine whether a red line on a screenshot is an annotation arrow or a native error indicator can trivially parse a text string that reads: Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first.

This isn't a workaround — it's working with the architecture. Vision Transformers process images by converting pixel patches into tokens. Text is already tokens. Structured text metadata about visual annotations reaches the model in its native format, bypassing every limitation of the vision pipeline: no patch-level ambiguity, no metadata stripping, no resolution-dependent information loss.


How Stash Implements Annotation-Aware Capture

Stash is a native macOS menu bar application that combines clipboard management, screenshot capture, and image annotation — designed specifically for AI-assisted development workflows. It implements annotation-aware capture through two mechanisms: the AI Context Banner for screenshots and AI Capture Reports for screen recordings.

The AI Context Banner

When a developer captures a screenshot with Stash (via ⌘⌃S or the in-app capture button) and draws annotations, Stash generates an AI Context Banner — a structured text overlay composited onto the image in a distinct visual region.

The AI Context Banner encodes:

  • Annotation type (single arrow, double arrow, rectangle, or ellipse)
  • Position coordinates and dimensions
  • Color (palette name, e.g. Signal Red)
  • Draw order (which annotation was drawn first, second, and so on)
  • Source application
  • Capture timestamp

This metadata is rendered as readable text in a dedicated banner area of the image. Because modern LLMs are highly effective at extracting text from images via OCR, the AI Context Banner gives the AI a precise, unambiguous description of the developer's intent — in the format (text) that models process most reliably.
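The underlying idea is plain serialization: structured annotation state in, a human- and OCR-readable line out. Here is an illustrative Python sketch; the `Annotation` fields and `banner_line` helper are assumptions for illustration, not Stash's internal API, though the output reproduces the banner line format shown in the examples in this article.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    # Hypothetical fields, mirroring what the banner encodes.
    kind: str               # "Arrow", "Rectangle", "Ellipse", ...
    start: tuple[int, int]  # start coordinates
    end: tuple[int, int]    # end coordinates
    color: str              # palette name, e.g. "Signal Red"
    order: int              # 1 = drawn first

def banner_line(a: Annotation) -> str:
    """Serialize one annotation into a banner line an LLM can read via OCR."""
    ordinal = {1: "first", 2: "second", 3: "third"}.get(a.order, f"#{a.order}")
    return (f"{a.kind} annotation at {a.start} → {a.end}, "
            f"color: {a.color}, drawn {ordinal}.")

print(banner_line(Annotation("Arrow", (340, 220), (510, 180), "Signal Red", 1)))
# → Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first.
```

The point is not the formatting code but the direction of travel: intent is captured as data at draw time, so no downstream model ever has to reverse-engineer it from pixels.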

Example: What the AI Actually Receives

Without annotation-aware capture: The model sees red pixels in an arrow-like pattern near coordinates (340, 220). It might interpret this as an error indicator, a UI element, a tooltip pointer, or decorative styling. It has no reliable mechanism to determine that a developer drew it to indicate a problem.

With Stash's AI Context Banner: The banner reads: Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first. Source: Safari. Captured 2:14 PM. The model now knows exactly what was pointed at, that this was the developer's primary area of concern (drawn first), what application was being examined, and when.

Video Capture Decomposition and AI Capture Reports

The Visual Context Gap is especially severe for animated content. Most LLM vision APIs treat screen recordings as unsupported input and reject video files entirely. A 5-second recording demonstrating a multi-step bug is reduced to a single screenshot of the initial state.

Stash addresses this through video decomposition — a process that converts screen recordings into two complementary outputs that LLMs can fully process.

Output 1: Key Frame Extraction

When a developer records with Stash's Video Capture (via ⌘⌃R), the recording engine tracks the full interaction. On completion, Stash extracts key frames — individual annotated PNG images representing moments where the UI state meaningfully changed.

These are not arbitrary interval samples. Stash uses change detection to identify frames where:

  • a tracked interaction occurred (e.g. a click), or
  • the visible UI state meaningfully changed (new elements, error states, layout shifts)

The result is typically 3–10 static images, each representing a discrete interaction step. Instead of one animation the model can't read, the developer gets a series of meaningful snapshots the LLM can process individually — with full visual detail preserved at each step.
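As a rough illustration of change-detection key framing (this is a minimal sketch, not Stash's actual algorithm), a frame can be kept only when its pixel difference from the last kept frame crosses a threshold, so steady states collapse and transitions survive:

```python
def key_frames(frames: list[list[int]], threshold: float = 0.05) -> list[int]:
    """Return indices of frames whose fraction of changed pixels,
    relative to the last kept frame, exceeds `threshold`."""
    kept = [0]  # always keep the initial state
    for i in range(1, len(frames)):
        prev = frames[kept[-1]]
        changed = sum(a != b for a, b in zip(frames[i], prev)) / len(prev)
        if changed > threshold:
            kept.append(i)
    return kept

# Three identical frames, then a UI change, then steady state again:
frames = [[0, 0, 0, 0]] * 3 + [[0, 0, 9, 9]] * 2
print(key_frames(frames))  # → [0, 3]
```

A real implementation would compare downsampled image buffers rather than flat integer lists, but the selection logic is the same: only frames that represent a new state earn a spot in the output set.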

Output 2: The Capture Report

Alongside key frames, Stash generates a structured Capture Report — a text document that narrates the entire interaction sequence:

  • Timestamped interaction logs (clicks with position and target element)
  • Voice transcript of any recorded narration
  • OCR text extracted at each key frame state
  • Focus tracking and state change descriptions between frames
  • Recording metadata

The Capture Report converts temporal information — which is completely lost when LLMs process screen recordings — into structured text that models can reason about sequentially.

Example: Video Decomposition in Practice

A developer records a screen capture of a login form bug. Without decomposition, the LLM sees a static image of a login form with email and password fields.

With Stash's Capture Report, the LLM receives:

Frame 1 (0.0s): Login form visible. Email field contains "user@example.com". Password field is empty. Submit button is enabled.

Frame 2 (0.8s): Click 1 at (412, 305) — Submit button clicked. Password field is empty at time of submission.

Frame 3 (1.2s): Error state. Red border appeared on password field. Error message: "Password is required." Submit button remains enabled (should be disabled during validation).

The model can now identify the bug (submit button doesn't disable during validation), understand the trigger (clicking Submit with an empty password field), and see the exact error state — none of which was available from the static first frame.
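The structure of such a report can be sketched as a fold over tracked events. The event schema below (`time`, `kind`, `pos`, `detail`) is a hypothetical simplification for illustration, not Stash's internal format; the output mirrors the frame entries shown above.

```python
def capture_report(events: list[dict]) -> str:
    """Narrate a sequence of tracked events as numbered, timestamped frames."""
    lines = []
    clicks = 0
    for n, e in enumerate(events, start=1):
        if e["kind"] == "click":
            clicks += 1
            desc = f"Click {clicks} at {e['pos']} — {e['detail']}"
        else:
            desc = e["detail"]
        lines.append(f"Frame {n} ({e['time']:.1f}s): {desc}")
    return "\n".join(lines)

report = capture_report([
    {"time": 0.0, "kind": "state", "detail": "Login form visible. Password field is empty."},
    {"time": 0.8, "kind": "click", "pos": (412, 305), "detail": "Submit button clicked."},
    {"time": 1.2, "kind": "state", "detail": 'Error state. "Password is required."'},
])
print(report)
```

Because each line pairs a timestamp with a state or interaction description, the model can reason about ordering and causality ("the error appeared after the click") purely through text.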


Two Copy Modes: Humans and Machines

Stash provides two output modes for every capture, addressing both human and LLM audiences:

| Mode | What Gets Copied | Best For |
| --- | --- | --- |
| Standard mode | Annotated image (screenshot) or screen recording | Sharing with humans — Slack, Jira, documentation |
| LLM mode | Image with AI Context Banner (screenshot) or Key Frames + Capture Report (video) | Pasting into Claude Code, ChatGPT, Cursor, Copilot, or any AI coding tool |

This dual-output approach reflects a practical reality of vibe coding workflows: the same visual feedback often needs to go to both human teammates (in a pull request comment or Slack thread) and AI coding tools (in a chat prompt). One capture produces both formats.
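The mode split amounts to a simple dispatch. This is a minimal sketch with hypothetical field names standing in for Stash's internals; the real app composites images natively rather than shuffling dictionaries.

```python
def copy_payload(capture: dict, mode: str) -> dict:
    """Select what lands on the clipboard for a given audience."""
    if mode == "standard":
        # Human audience: just the annotated pixels or the recording.
        return {"image": capture["annotated_image"]}
    if mode == "llm":
        if capture["type"] == "video":
            # LLM audience, video: key frames plus the text Capture Report.
            return {"frames": capture["key_frames"], "report": capture["report"]}
        # LLM audience, screenshot: image with the AI Context Banner composited.
        return {"image": capture["image_with_banner"]}
    raise ValueError(f"unknown mode: {mode}")

shot = {"type": "screenshot", "annotated_image": "img", "image_with_banner": "img+banner"}
print(copy_payload(shot, "standard"))  # → {'image': 'img'}
print(copy_payload(shot, "llm"))       # → {'image': 'img+banner'}
```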


The Workflow: Before and After Annotation-Aware Capture

Traditional Visual Feedback (7+ Steps)

  1. Take a screenshot with macOS native tools (⌘⇧4)
  2. Open an image editor (Preview, Skitch, or a dedicated tool)
  3. Draw annotations — arrows, circles, highlights
  4. Save the edited image
  5. Paste or drag the image into the AI coding tool
  6. Write a text description explaining what the annotations mean and where they point
  7. Wait for the model to interpret — hope it connects the description to the correct image regions

For video-based bugs: record a screen capture separately, convert to a format the tool accepts (if possible), manually take sequential screenshots of key moments, write a step-by-step narrative describing each frame. Typical time: 3–5 minutes per visual bug report.

Annotation-Aware Capture with Stash (3 Steps)

  1. Press ⌘⌃S to capture a region (or ⌘⌃R for video recording)
  2. Draw annotations in the Stash popover — each edit auto-copies with the AI Context Banner embedded
  3. Paste into your AI coding tool — structured metadata is already included

For video captures: key frames and Capture Report are generated automatically when the recording stops. Select LLM copy mode and paste. Typical time: under 15 seconds.


How Stash Fits Into the Vibe Coding Tool Stack

Vibe coding workflows in 2026 typically involve multiple specialized tools: an AI-powered IDE (Cursor, Windsurf, Claude Code, ChatGPT), a model provider (Claude, GPT-4o, Gemini), version control, and various context-providing integrations via MCP.

Stash operates at the visual feedback layer — the point where a developer needs to communicate what they see on screen back to the AI. It complements rather than replaces other tools in the stack:

| Tool Category | Role in Vibe Coding | Examples |
| --- | --- | --- |
| AI IDE | Code generation, refactoring, debugging | Cursor, Windsurf, Claude Code, ChatGPT |
| LLM Provider | Language and vision model inference | Claude, GPT-4o, Gemini |
| Context Layer | Structured data from external sources via MCP | Sentry (traceability), Figma (design tokens) |
| Visual Feedback Layer | Screenshot capture, annotation, structured output for LLMs | Stash |
| Browser Automation | AI-initiated screenshots and DOM interaction | Claude Code --chrome, Cline browser |

Browser automation tools (like Claude Code's --chrome flag) let the AI take its own screenshots — useful for verifying output but limited to web contexts and still subject to pixel-level interpretation. DOM annotation tools (like Vibe Annotations) send element-level context for live web pages. Stash handles the broader case: any application, any platform, any content the developer can see on screen — with structured metadata that makes annotations machine-readable regardless of what was captured.


Complete Feature Reference

Screenshot and Annotation

| Capability | Specification |
| --- | --- |
| Capture trigger | ⌘⌃S hotkey (customizable) or in-app button |
| Capture modes | Region selection, window, full screen |
| Annotation tools | Single arrow, double arrow, rectangle, ellipse |
| Stroke style | 2.5px professional stroke, sharp triangular arrowheads |
| Color palette | 7 preset colors; default: Signal Red |
| Auto-copy | Every annotation edit immediately copies to clipboard |
| AI Context Banner | Structured metadata overlay with annotation details |
| Undo | Single-step undo (removes last annotation) |

Video Capture and Decomposition

| Capability | Specification |
| --- | --- |
| Capture trigger | ⌘⌃R hotkey |
| Key frame extraction | Change-detection based (not interval sampling) |
| Click tracking | Position, timestamp, and target element |
| OCR | Text extraction at each key frame state |
| Capture Report | Structured text narrative of the interaction |
| Copy modes | Copy All (report + frames + audio) or Copy Folder Path |

Clipboard Management

| Capability | Specification |
| --- | --- |
| History | 30-day retention, auto-cleanup |
| Bookmarks | Permanent, custom names, drag-to-reorder |
| Hotkeys | Up to 20 custom keyboard shortcuts |
| Open Stash | ⌘⇧V from any application |
| Content types | Text and images |
| Compression | JPEG output with configurable max size |

Platform

| Specification | Value |
| --- | --- |
| Platform | macOS (native Swift/SwiftUI + AppKit) |
| Minimum OS | macOS 13.0 (Ventura) |
| Architecture | Universal Binary (Apple Silicon + Intel) |
| Memory usage (idle) | < 30 MB |
| CPU usage (idle) | < 0.1% |

Frequently Asked Questions

What is annotation-aware capture?

Annotation-aware capture is a screenshot methodology where drawn annotations (arrows, rectangles, circles) are not only rendered as visible pixels but also encoded as structured text metadata — including annotation type, position coordinates, color, draw order, source application, and timestamp. This metadata is embedded in the image in a format that LLMs can read through text extraction, allowing AI coding tools to understand the developer's intent rather than guessing from pixel patterns.

How is Stash different from regular screenshot tools?

Standard screenshot tools (macOS native capture, CleanShot X, Snagit) render annotations as pixels only. The AI coding tool receives colored shapes with no way to determine what they mean. Stash generates an AI Context Banner — structured text metadata composited onto the image — that explicitly tells the AI what each annotation is, where it points, and in what order it was drawn. Additionally, Stash decomposes screen recordings into key frames and AI Capture Reports that LLMs can process sequentially.

How is Stash different from Vibe Annotations?

Vibe Annotations is a browser extension that attaches feedback to DOM elements on live web pages, sending element selectors and computed styles to AI coding agents. Stash captures any content visible on screen — including native applications, design tools, terminal windows, and mobile simulators — and embeds structured annotation metadata in the image itself. Vibe Annotations requires a live web page in a browser. Stash works with any visual content.

Does Stash work with Claude Code, ChatGPT, Cursor, and Copilot?

Yes. Stash copies annotated images (with AI Context Banners) and video decomposition outputs (key frames + AI Capture Reports) to the system clipboard. Any AI coding tool that accepts pasted images or text can receive Stash's structured visual feedback. No plugins, extensions, or MCP configuration are required.

What is an AI Capture Report?

An AI Capture Report is a structured text document generated by Stash when a screen recording is decomposed. It includes timestamped interaction logs, voice transcript, click target identification, console OCR, focus tracking, state change descriptions between frames, and recording metadata. It converts temporal interaction information — which LLMs lose when processing raw video — into structured text that models can reason about.

What is the Visual Context Gap?

The Visual Context Gap is the disconnect between visual information a developer sees on screen and what an AI coding tool can interpret from a screenshot. It encompasses annotation ambiguity (drawn shapes are indistinguishable from UI elements after patch tokenization), metadata stripping (all non-pixel data is discarded by LLM vision APIs), and temporal information loss (screen recordings are rejected or reduced to a single frame). Annotation-aware capture is designed to bridge this gap.


Key Takeaways

  • Annotation-aware capture encodes drawn annotations as structured text metadata — not just pixels — so AI coding tools can read developer intent through text processing rather than pixel inference
  • Stash's AI Context Banner embeds annotation type, position, color, draw order, source app, and timestamp into every captured screenshot
  • Video decomposition converts screen recordings into key frame images (change-detection based) and structured AI Capture Reports (timestamped interactions, voice transcript, OCR text, state changes)
  • Two copy modes serve both human teammates (standard images and recordings) and AI coding tools (AI Context Banners and AI Capture Reports)
  • The full workflow — capture, annotate, paste with structured metadata — takes under 15 seconds versus 3–5 minutes with traditional tools
  • Stash works with any AI coding tool that accepts pasted images or text — Claude Code, ChatGPT, Cursor, Copilot, Windsurf, and others — with no plugins required
  • As a native macOS menu bar app, Stash captures any on-screen content — not just web pages — including native applications, design tools, terminals, and simulators

References and Further Reading

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020) — foundational Vision Transformer paper explaining patch tokenization
  • Anthropic, "Vision documentation — Image best practices" — confirms metadata stripping and supported formats
  • Karpathy, A., "Vibe Coding" (February 2025) — origin of the term
  • OpenAI, "Introducing Codex" (May 2025) — cloud-based coding agent with screenshot sharing and sandboxed task execution
  • OpenAI, "Introducing GPT-5.3-Codex" (2026) — Codex-native agent with CLI, IDE extension, and cloud surfaces for vibe coding workflows
  • OpenAI, "Introducing Canvas" (2024) — visual workspace for side-by-side code and chat in ChatGPT
  • Sentry Engineering Blog, "Vibe Coding: Closing the Feedback Loop with Traceability" (2025)
  • "The Eyes Have It: Closing the Agentic Design Loop" — DEV Community (2026)
  • "Vibe Coding for UX Design" — arXiv research on multimodal limitations in AI-assisted workflows