What Is Annotation-Aware Capture?

Annotation-aware capture is a screenshot and screen recording methodology where visual annotations — arrows, rectangles, circles, highlights — are not only rendered as pixels in the image but also encoded as structured text metadata. This metadata includes the annotation type, position coordinates, dimensions, color, draw order, source application, and timestamp. It is embedded directly in the captured image in a format that large language models can extract through OCR and text processing.

Annotation-aware capture solves the Visual Context Gap — the disconnect between what a developer sees on screen and what an AI coding tool can interpret from a screenshot. (For a full technical explanation of the Visual Context Gap, read Part 1 of this series.)

The core principle: rather than relying on vision models to infer the purpose of drawn shapes from pixel patterns — a task they perform unreliably — annotation-aware capture explicitly tells the model what each annotation means, using the model's strongest capability: text comprehension.


Why This Approach Works

Large language models are, fundamentally, language models. Their text processing capabilities far exceed their visual reasoning abilities. A model that struggles to determine whether a red line on a screenshot is an annotation arrow or a native error indicator can trivially parse a text string that reads: Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first.

This isn't a workaround — it's working with the architecture. Vision Transformers process images by converting pixel patches into tokens. Text is already tokens. Structured text metadata about visual annotations reaches the model in its native format, bypassing every limitation of the vision pipeline: no patch-level ambiguity, no metadata stripping, no resolution-dependent information loss.


How Stash Implements Annotation-Aware Capture

Stash is a native macOS menu bar application that combines clipboard management, screenshot capture, and image annotation — designed specifically for AI-assisted development workflows. It implements annotation-aware capture through two mechanisms: the AI Context Banner for screenshots and AI Capture Reports for screen recordings.

The AI Context Banner

When a developer captures a screenshot with Stash (via ⌘⌃S or the in-app capture button) and draws annotations, Stash generates an AI Context Banner — a structured text overlay composited onto the image in a distinct visual region.

The AI Context Banner encodes:

  • Annotation type (single arrow, double arrow, rectangle, or ellipse)
  • Position coordinates and dimensions
  • Color (palette name, e.g. Signal Red)
  • Draw order (which annotation was drawn first, second, and so on)
  • Source application
  • Capture timestamp

This metadata is rendered as readable text in a dedicated banner area of the image. Because modern LLMs are highly effective at extracting text from images via OCR, the AI Context Banner gives the AI a precise, unambiguous description of the developer's intent — in the format (text) that models process most reliably.
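The underlying idea is plain serialization: structured annotation state in, a human- and OCR-readable line out. Here is an illustrative Python sketch; the `Annotation` fields and `banner_line` helper are assumptions for illustration, not Stash's internal API, though the output reproduces the banner line format shown in the examples in this article.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    # Hypothetical fields, mirroring what the banner encodes.
    kind: str               # "Arrow", "Rectangle", "Ellipse", ...
    start: tuple[int, int]  # start coordinates
    end: tuple[int, int]    # end coordinates
    color: str              # palette name, e.g. "Signal Red"
    order: int              # 1 = drawn first

def banner_line(a: Annotation) -> str:
    """Serialize one annotation into a banner line an LLM can read via OCR."""
    ordinal = {1: "first", 2: "second", 3: "third"}.get(a.order, f"#{a.order}")
    return (f"{a.kind} annotation at {a.start} → {a.end}, "
            f"color: {a.color}, drawn {ordinal}.")

print(banner_line(Annotation("Arrow", (340, 220), (510, 180), "Signal Red", 1)))
# → Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first.
```

The point is not the formatting code but the direction of travel: intent is captured as data at draw time, so no downstream model ever has to reverse-engineer it from pixels.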

Example: What the AI Actually Receives

Without annotation-aware capture: The model sees red pixels in an arrow-like pattern near coordinates (340, 220). It might interpret this as an error indicator, a UI element, a tooltip pointer, or decorative styling. It has no reliable mechanism to determine that a developer drew it to indicate a problem.

With Stash's AI Context Banner: The banner reads: Arrow annotation at (340, 220) → (510, 180), color: Signal Red, drawn first. Source: Safari. Captured 2:14 PM. The model now knows exactly what was pointed at, that this was the developer's primary area of concern (drawn first), what application was being examined, and when.

Video Capture Decomposition and AI Capture Reports

The Visual Context Gap is especially severe for animated content. Most LLM vision APIs treat screen recordings as unsupported input and reject video files entirely. A 5-second recording demonstrating a multi-step bug is reduced to a single screenshot of the initial state.

Stash addresses this through video decomposition — a process that converts screen recordings into two complementary outputs that LLMs can fully process.

Output 1: Key Frame Extraction

When a developer records with Stash's Video Capture (via ⌘⌃R), the recording engine tracks the full interaction. On completion, Stash extracts key frames — individual annotated PNG images representing moments where the UI state meaningfully changed.

These are not arbitrary interval samples. Stash uses change detection to identify frames where:

  • a tracked interaction occurred (e.g. a click), or
  • the visible UI state meaningfully changed (new elements, error states, layout shifts)

The result is typically 3–10 static images, each representing a discrete interaction step. Instead of one animation the model can't read, the developer gets a series of meaningful snapshots the LLM can process individually — with full visual detail preserved at each step.
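As a rough illustration of change-detection key framing (this is a minimal sketch, not Stash's actual algorithm), a frame can be kept only when its pixel difference from the last kept frame crosses a threshold, so steady states collapse and transitions survive:

```python
def key_frames(frames: list[list[int]], threshold: float = 0.05) -> list[int]:
    """Return indices of frames whose fraction of changed pixels,
    relative to the last kept frame, exceeds `threshold`."""
    kept = [0]  # always keep the initial state
    for i in range(1, len(frames)):
        prev = frames[kept[-1]]
        changed = sum(a != b for a, b in zip(frames[i], prev)) / len(prev)
        if changed > threshold:
            kept.append(i)
    return kept

# Three identical frames, then a UI change, then steady state again:
frames = [[0, 0, 0, 0]] * 3 + [[0, 0, 9, 9]] * 2
print(key_frames(frames))  # → [0, 3]
```

A real implementation would compare downsampled image buffers rather than flat integer lists, but the selection logic is the same: only frames that represent a new state earn a spot in the output set.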

Output 2: The Capture Report

Alongside key frames, Stash generates a structured Capture Report — a text document that narrates the entire interaction sequence:

  • Timestamped interaction logs (clicks with position and target element)
  • Voice transcript of any recorded narration
  • OCR text extracted at each key frame state
  • Focus tracking and state change descriptions between frames
  • Recording metadata

The Capture Report converts temporal information — which is completely lost when LLMs process screen recordings — into structured text that models can reason about sequentially.

Example: Video Decomposition in Practice

A developer records a screen capture of a login form bug. Without decomposition, the LLM sees a static image of a login form with email and password fields.

With Stash's Capture Report, the LLM receives:

Frame 1 (0.0s): Login form visible. Email field contains "user@example.com". Password field is empty. Submit button is enabled.

Frame 2 (0.8s): Click 1 at (412, 305) — Submit button clicked. Password field is empty at time of submission.

Frame 3 (1.2s): Error state. Red border appeared on password field. Error message: "Password is required." Submit button remains enabled (should be disabled during validation).

The model can now identify the bug (submit button doesn't disable during validation), understand the trigger (clicking Submit with an empty password field), and see the exact error state — none of which was available from the static first frame.
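The structure of such a report can be sketched as a fold over tracked events. The event schema below (`time`, `kind`, `pos`, `detail`) is a hypothetical simplification for illustration, not Stash's internal format; the output mirrors the frame entries shown above.

```python
def capture_report(events: list[dict]) -> str:
    """Narrate a sequence of tracked events as numbered, timestamped frames."""
    lines = []
    clicks = 0
    for n, e in enumerate(events, start=1):
        if e["kind"] == "click":
            clicks += 1
            desc = f"Click {clicks} at {e['pos']} — {e['detail']}"
        else:
            desc = e["detail"]
        lines.append(f"Frame {n} ({e['time']:.1f}s): {desc}")
    return "\n".join(lines)

report = capture_report([
    {"time": 0.0, "kind": "state", "detail": "Login form visible. Password field is empty."},
    {"time": 0.8, "kind": "click", "pos": (412, 305), "detail": "Submit button clicked."},
    {"time": 1.2, "kind": "state", "detail": 'Error state. "Password is required."'},
])
print(report)
```

Because each line pairs a timestamp with a state or interaction description, the model can reason about ordering and causality ("the error appeared after the click") purely through text.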


Two Copy Modes: Humans and Machines

Stash provides two output modes for every capture, addressing both human and LLM audiences:

| Mode | What Gets Copied | Best For |
| --- | --- | --- |
| Standard mode | Annotated image (screenshot) or screen recording | Sharing with humans — Slack, Jira, documentation |
| LLM mode | Image with AI Context Banner (screenshot) or Key Frames + Capture Report (video) | Pasting into Claude Code, ChatGPT, Cursor, Copilot, or any AI coding tool |

This dual-output approach reflects a practical reality of vibe coding workflows: the same visual feedback often needs to go to both human teammates (in a pull request comment or Slack thread) and AI coding tools (in a chat prompt). One capture produces both formats.
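The mode split amounts to a simple dispatch. This is a minimal sketch with hypothetical field names standing in for Stash's internals; the real app composites images natively rather than shuffling dictionaries.

```python
def copy_payload(capture: dict, mode: str) -> dict:
    """Select what lands on the clipboard for a given audience."""
    if mode == "standard":
        # Human audience: just the annotated pixels or the recording.
        return {"image": capture["annotated_image"]}
    if mode == "llm":
        if capture["type"] == "video":
            # LLM audience, video: key frames plus the text Capture Report.
            return {"frames": capture["key_frames"], "report": capture["report"]}
        # LLM audience, screenshot: image with the AI Context Banner composited.
        return {"image": capture["image_with_banner"]}
    raise ValueError(f"unknown mode: {mode}")

shot = {"type": "screenshot", "annotated_image": "img", "image_with_banner": "img+banner"}
print(copy_payload(shot, "standard"))  # → {'image': 'img'}
print(copy_payload(shot, "llm"))       # → {'image': 'img+banner'}
```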


The Workflow: Before and After Annotation-Aware Capture

Traditional Visual Feedback (7+ Steps)

  1. Take a screenshot with macOS native tools (⌘⇧4)
  2. Open an image editor (Preview, Skitch, or a dedicated tool)
  3. Draw annotations — arrows, circles, highlights
  4. Save the edited image
  5. Paste or drag the image into the AI coding tool
  6. Write a text description explaining what the annotations mean and where they point
  7. Wait for the model to interpret — hope it connects the description to the correct image regions

For video-based bugs: record a screen capture separately, convert to a format the tool accepts (if possible), manually take sequential screenshots of key moments, write a step-by-step narrative describing each frame. Typical time: 3–5 minutes per visual bug report.

Annotation-Aware Capture with Stash (3 Steps)

  1. Press ⌘⌃S to capture a region (or ⌘⌃R for video recording)
  2. Draw annotations in the Stash popover — each edit auto-copies with the AI Context Banner embedded
  3. Paste into your AI coding tool — structured metadata is already included

For video captures: key frames and Capture Report are generated automatically when the recording stops. Select LLM copy mode and paste. Typical time: under 15 seconds.


How Stash Fits Into the Vibe Coding Tool Stack

Vibe coding workflows in 2026 typically involve multiple specialized tools: an AI-powered IDE (Cursor, Windsurf, Claude Code, ChatGPT), a model provider (Claude, GPT-4o, Gemini), version control, and various context-providing integrations via MCP.

Stash operates at the visual feedback layer — the point where a developer needs to communicate what they see on screen back to the AI. It complements rather than replaces other tools in the stack:

| Tool Category | Role in Vibe Coding | Examples |
| --- | --- | --- |
| AI IDE | Code generation, refactoring, debugging | Cursor, Windsurf, Claude Code, ChatGPT |
| LLM Provider | Language and vision model inference | Claude, GPT-4o, Gemini |
| Context Layer | Structured data from external sources via MCP | Sentry (traceability), Figma (design tokens) |
| Visual Feedback Layer | Screenshot capture, annotation, structured output for LLMs | Stash |
| Browser Automation | AI-initiated screenshots and DOM interaction | Claude Code --chrome, Cline browser |

Browser automation tools (like Claude Code's --chrome flag) let the AI take its own screenshots — useful for verifying output but limited to web contexts and still subject to pixel-level interpretation. DOM annotation tools (like Vibe Annotations) send element-level context for live web pages. Stash handles the broader case: any application, any platform, any content the developer can see on screen — with structured metadata that makes annotations machine-readable regardless of what was captured.


Complete Feature Reference

Screenshot and Annotation

| Capability | Specification |
| --- | --- |
| Capture trigger | ⌘⌃S hotkey (customizable) or in-app button |
| Capture modes | Region selection, window, full screen |
| Annotation tools | Single arrow, double arrow, rectangle, ellipse |
| Stroke style | 2.5px professional stroke, sharp triangular arrowheads |
| Color palette | 7 preset colors; default: Signal Red |
| Auto-copy | Every annotation edit immediately copies to clipboard |
| AI Context Banner | Structured metadata overlay with annotation details |
| Undo | Single-step undo (removes last annotation) |

Video Capture and Decomposition

| Capability | Specification |
| --- | --- |
| Capture trigger | ⌘⌃R hotkey |
| Key frame extraction | Change-detection based (not interval sampling) |
| Click tracking | Position, timestamp, and target element |
| OCR | Text extraction at each key frame state |
| Capture Report | Structured text narrative of the interaction |
| Copy modes | Copy All (report + frames + audio) or Copy Folder Path |

Clipboard Management

| Capability | Specification |
| --- | --- |
| History | 30-day retention, auto-cleanup |
| Bookmarks | Permanent, custom names, drag-to-reorder |
| Hotkeys | Up to 20 custom keyboard shortcuts |
| Open Stash | ⌘⇧V from any application |
| Content types | Text and images |
| Compression | JPEG output with configurable max size |

Platform

| Specification | Value |
| --- | --- |
| Platform | macOS (native Swift/SwiftUI + AppKit) |
| Minimum OS | macOS 13.0 (Ventura) |
| Architecture | Universal Binary (Apple Silicon + Intel) |
| Memory usage (idle) | < 30 MB |
| CPU usage (idle) | < 0.1% |

Frequently Asked Questions

What is annotation-aware capture?

Annotation-aware capture is a screenshot methodology where drawn annotations (arrows, rectangles, circles) are not only rendered as visible pixels but also encoded as structured text metadata — including annotation type, position coordinates, color, draw order, source application, and timestamp. This metadata is embedded in the image in a format that LLMs can read through text extraction, allowing AI coding tools to understand the developer's intent rather than guessing from pixel patterns.

How is Stash different from regular screenshot tools?

Standard screenshot tools (macOS native capture, CleanShot X, Snagit) render annotations as pixels only. The AI coding tool receives colored shapes with no way to determine what they mean. Stash generates an AI Context Banner — structured text metadata composited onto the image — that explicitly tells the AI what each annotation is, where it points, and in what order it was drawn. Additionally, Stash decomposes screen recordings into key frames and AI Capture Reports that LLMs can process sequentially.

How is Stash different from Vibe Annotations?

Vibe Annotations is a browser extension that attaches feedback to DOM elements on live web pages, sending element selectors and computed styles to AI coding agents. Stash captures any content visible on screen — including native applications, design tools, terminal windows, and mobile simulators — and embeds structured annotation metadata in the image itself. Vibe Annotations requires a live web page in a browser. Stash works with any visual content.

Does Stash work with Claude Code, ChatGPT, Cursor, and Copilot?

Yes. Stash copies annotated images (with AI Context Banners) and video decomposition outputs (key frames + AI Capture Reports) to the system clipboard. Any AI coding tool that accepts pasted images or text can receive Stash's structured visual feedback. No plugins, extensions, or MCP configuration are required.

What is an AI Capture Report?

An AI Capture Report is a structured text document generated by Stash when a screen recording is decomposed. It includes timestamped interaction logs, voice transcript, click target identification, console OCR, focus tracking, state change descriptions between frames, and recording metadata. It converts temporal interaction information — which LLMs lose when processing raw video — into structured text that models can reason about.

What is the Visual Context Gap?

The Visual Context Gap is the disconnect between visual information a developer sees on screen and what an AI coding tool can interpret from a screenshot. It encompasses annotation ambiguity (drawn shapes are indistinguishable from UI elements after patch tokenization), metadata stripping (all non-pixel data is discarded by LLM vision APIs), and temporal information loss (screen recordings are rejected or reduced to a single frame). Annotation-aware capture is designed to bridge this gap.


Key Takeaways

  • Annotation-aware capture encodes drawn annotations as structured text metadata — not just pixels — so AI coding tools can read developer intent through text processing rather than pixel inference
  • Stash's AI Context Banner embeds annotation type, position, color, draw order, source app, and timestamp into every captured screenshot
  • Video decomposition converts screen recordings into key frame images (change-detection based) and structured AI Capture Reports (timestamped interactions, voice transcript, OCR text, state changes)
  • Two copy modes serve both human teammates (standard images and recordings) and AI coding tools (AI Context Banners and AI Capture Reports)
  • The full workflow — capture, annotate, paste with structured metadata — takes under 15 seconds versus 3–5 minutes with traditional tools
  • Stash works with any AI coding tool that accepts pasted images or text — Claude Code, ChatGPT, Cursor, Copilot, Windsurf, and others — with no plugins required
  • As a native macOS menu bar app, Stash captures any on-screen content — not just web pages — including native applications, design tools, terminals, and simulators

References and Further Reading

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020) — foundational Vision Transformer paper explaining patch tokenization
  • Anthropic, "Vision documentation — Image best practices" — confirms metadata stripping and supported formats
  • Karpathy, A., "Vibe Coding" (February 2025) — origin of the term
  • OpenAI, "Introducing Codex" (May 2025) — cloud-based coding agent with screenshot sharing and sandboxed task execution
  • OpenAI, "Introducing GPT-5.3-Codex" (2026) — Codex-native agent with CLI, IDE extension, and cloud surfaces for vibe coding workflows
  • OpenAI, "Introducing Canvas" (2024) — visual workspace for side-by-side code and chat in ChatGPT
  • Sentry Engineering Blog, "Vibe Coding: Closing the Feedback Loop with Traceability" (2025)
  • "The Eyes Have It: Closing the Agentic Design Loop" — DEV Community (2026)
  • "Vibe Coding for UX Design" — arXiv research on multimodal limitations in AI-assisted workflows