Benchmarking Vision Models for Screen Automation

I spent a day testing which AI vision models can actually read what's on my screen. Not for fun — I'm building automation that needs to understand UI state.

The Setup

Three image sizes: 4K (396 KB), 1080p (129 KB), 540p (36 KB). Twelve models. One question: "What do you see?"
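
The harness itself is just one request per model per image. Here's a minimal sketch of a single call, assuming the models are reached through OpenRouter's OpenAI-compatible chat-completions endpoint (the `ask_model` helper and the API key variable are mine, not part of any published benchmark code):

```python
import base64
import time

import requests  # pip install requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = "sk-or-..."  # assumption: an OpenRouter key; any OpenAI-compatible endpoint works the same way

def ask_model(model: str, image_path: str, prompt: str = "What do you see?") -> dict:
    """Send one screenshot to one model, return the answer, timing, and token usage."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

    start = time.time()
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "model": model,
        "seconds": round(time.time() - start, 1),
        "answer": data["choices"][0]["message"]["content"],
        "usage": data.get("usage", {}),  # prompt/completion token counts, if the provider reports them
    }

# e.g. ask_model("qwen/qwen-2.5-vl-7b-instruct", "screen_1080p.png")
```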

The Surprise

GPT-4o-mini uses 36,000 tokens per image. That's not a typo. The same image costs Qwen-7b 75 tokens. A 480x difference.

Price per million requests:

  • Qwen-2.5-vl-7b: $50
  • GPT-4o-mini: $5,551

I expected OpenAI to be expensive. I didn't expect it to be over 100x more expensive.
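
The gap follows straight from the token counts. A back-of-the-envelope check, assuming roughly $0.15 per 1M input tokens for gpt-4o-mini and $0.20 per 1M for qwen-2.5-vl-7b (both rates are assumptions based on list prices at the time of writing, not quotes):

```python
# Input-token cost per request = image tokens x (price per 1M input tokens) / 1,000,000.
gpt4o_mini = 36_000 * 0.15 / 1_000_000   # ~$0.0054 per request
qwen_7b    =     75 * 0.20 / 1_000_000   # ~$0.000015 per request

print(f"gpt-4o-mini:    ${gpt4o_mini * 1_000_000:,.0f} per million requests")  # ~$5,400
print(f"qwen-2.5-vl-7b: ${qwen_7b * 1_000_000:,.0f} per million requests")     # ~$15
# Output tokens push both figures up. They barely move gpt-4o-mini's total,
# but they dominate qwen-7b's tiny input bill, which is how the measured
# numbers land near $5,551 and $50.
```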

Image Size Matters (A Lot)

Most models failed on small images (960x540). They'd hallucinate — confidently describing applications that weren't open, URLs that didn't exist.

Only one model scored 3/3 accuracy across all sizes: gemini-2.5-flash-lite. Everyone else needed at least 1080p to work reliably.
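
For reference, the three test images were a single 4K capture downscaled. A minimal sketch of how you might generate them with Pillow (file names are placeholders):

```python
from PIL import Image  # pip install pillow

SIZES = {"4k": (3840, 2160), "1080p": (1920, 1080), "540p": (960, 540)}

src = Image.open("screenshot_4k.png")
for label, size in SIZES.items():
    # LANCZOS keeps small UI text as legible as possible after downscaling
    src.resize(size, Image.LANCZOS).save(f"screen_{label}.png", optimize=True)
```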

Full Benchmark Results

| Model | Resolution | File size | Speed | Quality | Price/req |
|---|---|---|---|---|---|
| amazon/nova-2-lite-v1:free | 3840x2160 | 396 KB | 13.5s | 2/3 partial | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 13.5s | 2/3 partial | FREE |
| amazon/nova-lite-v1 | 3840x2160 | 396 KB | 4.7s | 1/3 bad | - |
| | 1920x1080 | 129 KB | 3.8s | 1/3 bad | - |
| | 960x540 | 36 KB | 3.6s | 1/3 bad | - |
| bytedance/ui-tars-1.5-7b | 3840x2160 | 396 KB | 5.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 3.1s | 2/3 partial | - |
| | 960x540 | 36 KB | 2.6s | 2/3 partial | - |
| google/gemini-2.0-flash-exp:free | 960x540 | 36 KB | 2.3s | 3/3 OK | FREE |
| google/gemini-2.5-flash-lite | 3840x2160 | 396 KB | 4.0s | 3/3 OK | $0.00026 |
| | 1920x1080 | 129 KB | 2.6s | 3/3 OK | $0.00021 |
| | 960x540 | 36 KB | 2.1s | 3/3 OK | $0.00021 |
| google/gemma-3-4b-it:free | 3840x2160 | 396 KB | 7.3s | 1/3 bad | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | FREE |
| meta-llama/llama-3.2-11b-vision | 3840x2160 | 396 KB | 38.8s | 2/3 partial | - |
| | 1920x1080 | 129 KB | 31.2s | 3/3 OK | - |
| | 960x540 | 36 KB | 24.0s | 2/3 partial | - |
| microsoft/phi-4-multimodal | 1920x1080 | 129 KB | 10.2s | 2/3 partial | $0.00002 |
| | 960x540 | 36 KB | 6.8s | 2/3 partial | $0.00002 |
| openai/gpt-4o-mini | 3840x2160 | 396 KB | 4.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 5.5s | 3/3 OK | - |
| | 960x540 | 36 KB | 3.5s | 2/3 partial | - |
| qwen/qwen-2.5-vl-7b-instruct | 3840x2160 | 396 KB | 6.8s | 3/3 OK | $0.00005 |
| | 1920x1080 | 129 KB | 3.4s | 3/3 OK | $0.00005 |
| | 960x540 | 36 KB | 2.0s | 1/3 bad | $0.00004 |
| qwen/qwen2.5-vl-72b-instruct | 3840x2160 | 396 KB | 7.8s | 3/3 OK | $0.00008 |
| | 1920x1080 | 129 KB | 7.4s | 3/3 OK | $0.00009 |
| | 960x540 | 36 KB | 2.9s | 2/3 partial | $0.00002 |
| qwen/qwen3-vl-8b-instruct | 3840x2160 | 396 KB | 7.6s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 9.1s | 3/3 OK | - |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | - |

Quality scoring: 3/3 OK = correctly identified URL, dashboard stats, and VS Code
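
Scoring was deliberately crude: one point per fact the model actually saw. A sketch of the check, with placeholder strings standing in for the real URL and dashboard contents:

```python
def score_answer(answer: str) -> int:
    """One point each for the URL, the dashboard stats, and VS Code. 3 = OK, 2 = partial, 0-1 = bad."""
    text = answer.lower()
    checks = [
        "localhost:3000" in text,  # placeholder URL -- substitute the one actually on screen
        any(s in text for s in ("dashboard", "stats", "chart")),
        any(s in text for s in ("vs code", "vscode", "visual studio code")),
    ]
    return sum(checks)
```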

The Winners

| Use case | Model | Speed | Price/req |
|---|---|---|---|
| Free | gemini-2.0-flash-exp:free | 2.3s | $0 |
| Cheap + accurate | qwen-2.5-vl-7b | 3.4s | $0.00005 |
| Any image size | gemini-2.5-flash-lite | 2.1s | $0.0002 |
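
If you want to wire the winners straight into an automation script, a trivial lookup is enough. This reuses the `ask_model` helper sketched earlier; the dictionary keys are mine:

```python
# Pick a model by the constraint you care about (data from the results table above).
WINNERS = {
    "free": "google/gemini-2.0-flash-exp:free",        # 2.3s, $0, tested at 540p only
    "cheap_accurate": "qwen/qwen-2.5-vl-7b-instruct",  # 3.4s, $0.00005/req, needs >= 1080p
    "any_size": "google/gemini-2.5-flash-lite",        # 2.1s, ~$0.0002/req, 3/3 at every size
}

model = WINNERS["any_size"]
# result = ask_model(model, "screen_540p.png")
```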

What I Learned

  1. Token counting varies wildly — same image, same prompt, 75 vs 36,000 tokens
  2. "Mini" doesn't mean cheap — GPT-4o-mini is more expensive than GPT-4o for images
  3. Resolution is quality — 1080p is the sweet spot, 540p breaks most models
  4. Free models work — gemini-2.0-flash-exp:free handles small images perfectly