I spent a day testing which AI vision models can actually read what's on my screen. Not for fun — I'm building automation that needs to understand UI state.
The Setup
Three image sizes: 4K (396 KB), 1080p (129 KB), 540p (36 KB). Twelve models. One question: "What do you see?"
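For reference, here's a minimal sketch of what a harness like this can look like: resize once per size, send the same prompt to each model through an OpenAI-compatible endpoint (OpenRouter, given the model IDs below), and record latency plus token usage. The file name and API key are placeholders.

```python
import base64
import io
import time

import requests
from PIL import Image

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # OpenAI-compatible
API_KEY = "sk-or-..."  # placeholder

SIZES = [(3840, 2160), (1920, 1080), (960, 540)]
MODELS = [
    "qwen/qwen-2.5-vl-7b-instruct",
    "google/gemini-2.5-flash-lite",
    "openai/gpt-4o-mini",
    # ...the rest of the twelve
]

def to_data_url(path: str, size: tuple[int, int]) -> str:
    """Downscale the screenshot and return it as a base64 PNG data URL."""
    img = Image.open(path).resize(size, Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def ask(model: str, data_url: str):
    """One 'What do you see?' request; returns (answer, seconds, usage dict)."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
    start = time.time()
    resp = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    answer = body["choices"][0]["message"]["content"]
    return answer, time.time() - start, body.get("usage", {})

for size in SIZES:
    url = to_data_url("screenshot.png", size)  # hypothetical input file
    for model in MODELS:
        answer, seconds, usage = ask(model, url)
        print(f"{model} @ {size[0]}x{size[1]}: "
              f"{seconds:.1f}s, {usage.get('prompt_tokens', '?')} prompt tokens")
```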
The Surprise
GPT-4o-mini uses 36,000 tokens per image. That's not a typo. Same image, Qwen-7b uses 75 tokens. That's a 480x difference.
Price per million requests:
- Qwen-2.5-vl-7b: $50
- GPT-4o-mini: $5,551
I expected OpenAI to be expensive. I didn't expect it to be 100x more expensive.
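The gap follows directly from the token counts: per-request cost is prompt tokens times the per-token rate, so the cost of a million requests is just `prompt_tokens x (USD per million tokens)`. A quick sanity check with illustrative input rates (assumptions, not quoted pricing):

```python
# Per-request cost = prompt_tokens x rate-per-token. Scaled to a million
# requests, that is simply prompt_tokens x (USD per million tokens).
def usd_per_million_requests(prompt_tokens: int, usd_per_mtok: float) -> float:
    return prompt_tokens * usd_per_mtok

# Illustrative input rates (assumptions; check current pricing):
print(usd_per_million_requests(36_000, 0.15))  # GPT-4o-mini: ~$5,400 input alone
print(usd_per_million_requests(75, 0.20))      # Qwen-7b: ~$15 input alone
# Output tokens and per-request overhead account for the rest of the
# $5,551 vs $50 figures above.
```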
Image Size Matters (A Lot)
Most models failed on small images (960x540). They'd hallucinate — confidently describing applications that weren't open, URLs that didn't exist.
Only one model scored 3/3 accuracy across all sizes: gemini-2.5-flash-lite. Everyone else needed at least 1080p to work reliably.
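One practical takeaway for automation: enforce a per-model resolution floor instead of hoping. A sketch based on the results below, with 1080p as the default minimum; treating undersized input as unusable rather than upscaling it is my assumption:

```python
# Smallest input size each model handled reliably in this benchmark.
MIN_RELIABLE = {
    "google/gemini-2.5-flash-lite": (960, 540),   # the only 3/3 at every size
    "qwen/qwen-2.5-vl-7b-instruct": (1920, 1080),
    "openai/gpt-4o-mini": (1920, 1080),
}

def safe_to_send(model: str, width: int, height: int) -> bool:
    """Refuse sub-threshold screenshots instead of risking hallucinations."""
    min_w, min_h = MIN_RELIABLE.get(model, (1920, 1080))  # default: 1080p floor
    return width >= min_w and height >= min_h
```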
Full Benchmark Results
| Model | Size | File size | Speed | Quality | Price/req |
|---|---|---|---|---|---|
| amazon/nova-2-lite-v1:free | 3840x2160 | 396 KB | 13.5s | 2/3 partial | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 13.5s | 2/3 partial | FREE |
| amazon/nova-lite-v1 | 3840x2160 | 396 KB | 4.7s | 1/3 bad | - |
| | 1920x1080 | 129 KB | 3.8s | 1/3 bad | - |
| | 960x540 | 36 KB | 3.6s | 1/3 bad | - |
| bytedance/ui-tars-1.5-7b | 3840x2160 | 396 KB | 5.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 3.1s | 2/3 partial | - |
| | 960x540 | 36 KB | 2.6s | 2/3 partial | - |
| google/gemini-2.0-flash-exp:free | 960x540 | 36 KB | 2.3s | 3/3 OK | FREE |
| google/gemini-2.5-flash-lite | 3840x2160 | 396 KB | 4.0s | 3/3 OK | $0.00026 |
| | 1920x1080 | 129 KB | 2.6s | 3/3 OK | $0.00021 |
| | 960x540 | 36 KB | 2.1s | 3/3 OK | $0.00021 |
| google/gemma-3-4b-it:free | 3840x2160 | 396 KB | 7.3s | 1/3 bad | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | FREE |
| meta-llama/llama-3.2-11b-vision | 3840x2160 | 396 KB | 38.8s | 2/3 partial | - |
| | 1920x1080 | 129 KB | 31.2s | 3/3 OK | - |
| | 960x540 | 36 KB | 24.0s | 2/3 partial | - |
| microsoft/phi-4-multimodal | 1920x1080 | 129 KB | 10.2s | 2/3 partial | $0.00002 |
| | 960x540 | 36 KB | 6.8s | 2/3 partial | $0.00002 |
| openai/gpt-4o-mini | 3840x2160 | 396 KB | 4.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 5.5s | 3/3 OK | - |
| | 960x540 | 36 KB | 3.5s | 2/3 partial | - |
| qwen/qwen-2.5-vl-7b-instruct | 3840x2160 | 396 KB | 6.8s | 3/3 OK | $0.00005 |
| | 1920x1080 | 129 KB | 3.4s | 3/3 OK | $0.00005 |
| | 960x540 | 36 KB | 2.0s | 1/3 bad | $0.00004 |
| qwen/qwen2.5-vl-72b-instruct | 3840x2160 | 396 KB | 7.8s | 3/3 OK | $0.00008 |
| | 1920x1080 | 129 KB | 7.4s | 3/3 OK | $0.00009 |
| | 960x540 | 36 KB | 2.9s | 2/3 partial | $0.00002 |
| qwen/qwen3-vl-8b-instruct | 3840x2160 | 396 KB | 7.6s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 9.1s | 3/3 OK | - |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | - |
Quality scoring: one point each for correctly identifying the URL, the dashboard stats, and VS Code. 3/3 OK means all three.
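In code form, scoring is just one point per ground-truth fact found in the answer. The patterns below are hypothetical stand-ins; the real checks would target the actual URL and stats visible in the screenshot:

```python
import re

# One point per fact the answer mentions; 3/3 = OK, 2/3 = partial, 1/3 = bad.
CHECKS = [
    re.compile(r"example\.com/dashboard", re.I),                 # placeholder URL
    re.compile(r"\b\d[\d,.]*\s*(users|visits|requests)", re.I),  # placeholder stats
    re.compile(r"(vs ?code|visual studio code)", re.I),
]

def score(answer: str) -> int:
    return sum(1 for pattern in CHECKS if pattern.search(answer))
```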
The Winners
| Use Case | Model | Speed | Price/req |
|---|---|---|---|
| Free | gemini-2.0-flash-exp:free | 2.3s | $0 |
| Cheap + accurate | qwen-2.5-vl-7b | 3.4s | $0.00005 |
| Any image size | gemini-2.5-flash-lite | 2.1s | $0.0002 |
What I Learned
- Token counting varies wildly — same image, same prompt, 75 vs 36,000 tokens (see the guard sketched below)
- "Mini" doesn't mean cheap — GPT-4o-mini is more expensive than GPT-4o for images
- Resolution is quality — 1080p is the sweet spot, 540p breaks most models
- Free models work — gemini-2.0-flash-exp:free handles small images perfectly
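Since token counting is the silent budget killer here, one cheap guard worth wiring into any harness: log prompt tokens per request and flag outliers before they hit the invoice. The threshold is an assumption to tune per model:

```python
MAX_EXPECTED_PROMPT_TOKENS = 5_000  # assumption: tune per model

def check_usage(model: str, usage: dict) -> None:
    """Flag tokenizer surprises like 36,000 prompt tokens for one image."""
    tokens = usage.get("prompt_tokens", 0)
    if tokens > MAX_EXPECTED_PROMPT_TOKENS:
        print(f"WARNING: {model} billed {tokens} prompt tokens for a single image")
```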