Benchmarking Vision Models for Screen Automation

I spent a day testing which AI vision models can actually read what's on my screen. Not for fun — I'm building automation that needs to understand UI state.

The Setup

Three image sizes: 4K (396 KB), 1080p (129 KB), 540p (36 KB). Twelve models. One question: "What do you see?"
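
The harness itself is just one request per model per image. Here's a minimal sketch of a single call, assuming the models are reached through OpenRouter's OpenAI-compatible chat-completions endpoint (the `ask_model` helper and the API key variable are mine, not part of any published benchmark code):

```python
import base64
import time

import requests  # pip install requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = "sk-or-..."  # assumption: an OpenRouter key; any OpenAI-compatible endpoint works the same way

def ask_model(model: str, image_path: str, prompt: str = "What do you see?") -> dict:
    """Send one screenshot to one model, return the answer, timing, and token usage."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

    start = time.time()
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "model": model,
        "seconds": round(time.time() - start, 1),
        "answer": data["choices"][0]["message"]["content"],
        "usage": data.get("usage", {}),  # prompt/completion token counts, if the provider reports them
    }

# e.g. ask_model("qwen/qwen-2.5-vl-7b-instruct", "screen_1080p.png")
```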

The Surprise

GPT-4o-mini uses 36,000 tokens per image. That's not a typo. The same image costs Qwen-7b 75 tokens. A 480x difference.

Price per million requests:

  • Qwen-2.5-vl-7b: $50
  • GPT-4o-mini: $5,551

I expected OpenAI to be expensive. I didn't expect it to be over 100x more expensive.
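
The gap follows straight from the token counts. A back-of-the-envelope check, assuming roughly $0.15 per 1M input tokens for gpt-4o-mini and $0.20 per 1M for qwen-2.5-vl-7b (both rates are assumptions based on list prices at the time of writing, not quotes):

```python
# Input-token cost per request = image tokens x (price per 1M input tokens) / 1,000,000.
gpt4o_mini = 36_000 * 0.15 / 1_000_000   # ~$0.0054 per request
qwen_7b    =     75 * 0.20 / 1_000_000   # ~$0.000015 per request

print(f"gpt-4o-mini:    ${gpt4o_mini * 1_000_000:,.0f} per million requests")  # ~$5,400
print(f"qwen-2.5-vl-7b: ${qwen_7b * 1_000_000:,.0f} per million requests")     # ~$15
# Output tokens push both figures up. They barely move gpt-4o-mini's total,
# but they dominate qwen-7b's tiny input bill, which is how the measured
# numbers land near $5,551 and $50.
```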

Image Size Matters (A Lot)

Most models failed on small images (960x540). They'd hallucinate — confidently describing applications that weren't open, URLs that didn't exist.

Only one model scored 3/3 accuracy across all sizes: gemini-2.5-flash-lite. Everyone else needed at least 1080p to work reliably.
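
For reference, the three test images were a single 4K capture downscaled. A minimal sketch of how you might generate them with Pillow (file names are placeholders):

```python
from PIL import Image  # pip install pillow

SIZES = {"4k": (3840, 2160), "1080p": (1920, 1080), "540p": (960, 540)}

src = Image.open("screenshot_4k.png")
for label, size in SIZES.items():
    # LANCZOS keeps small UI text as legible as possible after downscaling
    src.resize(size, Image.LANCZOS).save(f"screen_{label}.png", optimize=True)
```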

Full Benchmark Results

| Model | Resolution | File size | Speed | Quality | Price/req |
|---|---|---|---|---|---|
| amazon/nova-2-lite-v1:free | 3840x2160 | 396 KB | 13.5s | 2/3 partial | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 13.5s | 2/3 partial | FREE |
| amazon/nova-lite-v1 | 3840x2160 | 396 KB | 4.7s | 1/3 bad | - |
| | 1920x1080 | 129 KB | 3.8s | 1/3 bad | - |
| | 960x540 | 36 KB | 3.6s | 1/3 bad | - |
| bytedance/ui-tars-1.5-7b | 3840x2160 | 396 KB | 5.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 3.1s | 2/3 partial | - |
| | 960x540 | 36 KB | 2.6s | 2/3 partial | - |
| google/gemini-2.0-flash-exp:free | 960x540 | 36 KB | 2.3s | 3/3 OK | FREE |
| google/gemini-2.5-flash-lite | 3840x2160 | 396 KB | 4.0s | 3/3 OK | $0.00026 |
| | 1920x1080 | 129 KB | 2.6s | 3/3 OK | $0.00021 |
| | 960x540 | 36 KB | 2.1s | 3/3 OK | $0.00021 |
| google/gemma-3-4b-it:free | 3840x2160 | 396 KB | 7.3s | 1/3 bad | FREE |
| | 1920x1080 | 129 KB | 6.5s | 2/3 partial | FREE |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | FREE |
| meta-llama/llama-3.2-11b-vision | 3840x2160 | 396 KB | 38.8s | 2/3 partial | - |
| | 1920x1080 | 129 KB | 31.2s | 3/3 OK | - |
| | 960x540 | 36 KB | 24.0s | 2/3 partial | - |
| microsoft/phi-4-multimodal | 1920x1080 | 129 KB | 10.2s | 2/3 partial | $0.00002 |
| | 960x540 | 36 KB | 6.8s | 2/3 partial | $0.00002 |
| openai/gpt-4o-mini | 3840x2160 | 396 KB | 4.8s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 5.5s | 3/3 OK | - |
| | 960x540 | 36 KB | 3.5s | 2/3 partial | - |
| qwen/qwen-2.5-vl-7b-instruct | 3840x2160 | 396 KB | 6.8s | 3/3 OK | $0.00005 |
| | 1920x1080 | 129 KB | 3.4s | 3/3 OK | $0.00005 |
| | 960x540 | 36 KB | 2.0s | 1/3 bad | $0.00004 |
| qwen/qwen2.5-vl-72b-instruct | 3840x2160 | 396 KB | 7.8s | 3/3 OK | $0.00008 |
| | 1920x1080 | 129 KB | 7.4s | 3/3 OK | $0.00009 |
| | 960x540 | 36 KB | 2.9s | 2/3 partial | $0.00002 |
| qwen/qwen3-vl-8b-instruct | 3840x2160 | 396 KB | 7.6s | 3/3 OK | - |
| | 1920x1080 | 129 KB | 9.1s | 3/3 OK | - |
| | 960x540 | 36 KB | 4.9s | 2/3 partial | - |

Quality scoring: 3/3 OK = correctly identified URL, dashboard stats, and VS Code
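
Scoring was deliberately crude: one point per fact the model actually saw. A sketch of the check, with placeholder strings standing in for the real URL and dashboard contents:

```python
def score_answer(answer: str) -> int:
    """One point each for the URL, the dashboard stats, and VS Code. 3 = OK, 2 = partial, 0-1 = bad."""
    text = answer.lower()
    checks = [
        "localhost:3000" in text,  # placeholder URL -- substitute the one actually on screen
        any(s in text for s in ("dashboard", "stats", "chart")),
        any(s in text for s in ("vs code", "vscode", "visual studio code")),
    ]
    return sum(checks)
```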

The Winners

| Use case | Model | Speed | Price/req |
|---|---|---|---|
| Free | gemini-2.0-flash-exp:free | 2.3s | $0 |
| Cheap + accurate | qwen-2.5-vl-7b | 3.4s | $0.00005 |
| Any image size | gemini-2.5-flash-lite | 2.1s | $0.0002 |
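
If you want to wire the winners straight into an automation script, a trivial lookup is enough. This reuses the `ask_model` helper sketched earlier; the dictionary keys are mine:

```python
# Pick a model by the constraint you care about (data from the results table above).
WINNERS = {
    "free": "google/gemini-2.0-flash-exp:free",        # 2.3s, $0, tested at 540p only
    "cheap_accurate": "qwen/qwen-2.5-vl-7b-instruct",  # 3.4s, $0.00005/req, needs >= 1080p
    "any_size": "google/gemini-2.5-flash-lite",        # 2.1s, ~$0.0002/req, 3/3 at every size
}

model = WINNERS["any_size"]
# result = ask_model(model, "screen_540p.png")
```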

What I Learned

  1. Token counting varies wildly — same image, same prompt, 75 vs 36,000 tokens
  2. "Mini" doesn't mean cheap — GPT-4o-mini is more expensive than GPT-4o for images
  3. Resolution is quality — 1080p is the sweet spot, 540p breaks most models
  4. Free models work — gemini-2.0-flash-exp:free handles small images perfectly