Semantic OCR: Block Detection Without Vision Models

I needed a way to detect UI blocks on screenshots without relying on expensive vision models. 50 experiments later, I found what works.

The Problem

Vision models charge per token. A single 4K screenshot can cost 36,000 tokens with GPT-4o-mini. I needed something cheaper for my automation pipeline.

The idea: use Tesseract OCR (free) to extract text, then algorithmically group words into meaningful blocks.
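A minimal sketch of the first half of that idea, assuming the OCR step is done with pytesseract's `image_to_data` (the post does not say which Tesseract binding was used). The `Word` type, the `words_from_tesseract` helper, and the `min_conf` cutoff are hypothetical names introduced here for illustration; the function only reshapes the dictionary that call returns into word boxes that a grouping step could consume.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x1: int
    y1: int
    x2: int
    y2: int


def words_from_tesseract(data: dict, min_conf: float = 0.0) -> list[Word]:
    """Turn pytesseract's image_to_data dict into a list of word boxes.

    `data` holds parallel lists keyed by 'text', 'conf', 'left', 'top',
    'width' and 'height'. Obtaining it would look roughly like:
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    """
    words = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in rows:
        text = text.strip()
        # Page, block and line rows carry empty text and a confidence of -1.
        if not text or float(conf) < min_conf:
            continue
        words.append(Word(text, left, top, left + width, top + height))
    return words
```

Everything after this point works on plain `(x1, y1, x2, y2)` boxes, so the grouping logic can be tested without running OCR at all.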

5 Algorithms Tested

| Method | Approach | Result |
|---|---|---|
| M1 Word Clustering | Group words by distance | Best for lists (Gmail) |
| M2 Whitespace | Find empty areas between content | Best for cards (BBC) |
| M3 Connected Components | Binary image + dilation | Merges everything |
| M4 Hierarchical | Regions first, then M1 | Same as M1, doesn't see images |
| M5 Image Detection | Variance analysis | Unreliable |

First Attempt: Column Detection

My first approach was simple: find vertical gaps in the text, split into columns.
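Roughly, column detection of this kind can be sketched as below. This is an illustration, not the original code: the `split_columns` name and the `min_gap` value are assumptions, and boxes are plain `(x1, y1, x2, y2)` tuples.

```python
def split_columns(boxes, min_gap=30):
    """Split word boxes (x1, y1, x2, y2) into columns at vertical gaps.

    A gap is a stretch along the x axis, at least min_gap pixels wide,
    that no box covers. Vertical position is ignored entirely.
    """
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[0])
    columns = [[ordered[0]]]
    right_edge = ordered[0][2]
    for box in ordered[1:]:
        if box[0] - right_edge >= min_gap:
            columns.append([box])      # empty stripe found: start a new column
        else:
            columns[-1].append(box)    # still inside the current column
        right_edge = max(right_edge, box[2])
    return columns
```

Because only x coordinates matter, a box near the top of the page and one hundreds of pixels below it land in the same column whenever they overlap horizontally.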

[Figure: Column Detection - Bad Result]

Result: useless. The algorithm sees vertical stripes, not content blocks. The TikTok card, the Epstein photo, and the sidebar all end up mixed together in arbitrary columns.

The Winner: Hierarchy (M2 + M1)

Neither M1 nor M2 alone was perfect. The solution: combine them.

  1. M2 finds blocks: it detects whitespace, and everything between two gaps is a block
  2. M1 finds elements inside: it clusters the text within each block
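The two steps can be sketched as follows. The post does not describe how M2 or M1 are implemented, so this stand-in uses a recursive X-Y cut for the whitespace step and single-link clustering for the element step; the function names and the `min_gap` and `max_dist` values are assumptions made for the example.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x1: int
    y1: int
    x2: int
    y2: int


def split_on_gaps(words, axis, min_gap):
    """Split words at uncovered stretches >= min_gap along an axis (0 = x, 1 = y)."""
    lo, hi = (1, 3) if axis == 0 else (2, 4)  # field indices of start/end in Word
    ordered = sorted(words, key=lambda w: w[lo])
    groups, edge = [[ordered[0]]], ordered[0][hi]
    for w in ordered[1:]:
        if w[lo] - edge >= min_gap:
            groups.append([w])
        else:
            groups[-1].append(w)
        edge = max(edge, w[hi])
    return groups


def find_blocks(words, min_gap=60):
    """M2-style step: keep cutting on whitespace until no cut is left."""
    if not words:
        return []
    for axis in (0, 1):
        groups = split_on_gaps(words, axis, min_gap)
        if len(groups) > 1:
            return [b for g in groups for b in find_blocks(g, min_gap)]
    return [words]


def cluster_words(words, max_dist=20):
    """M1-style step: join words whose boxes lie within max_dist of each other."""
    def gap(a, b):
        dx = max(a.x1 - b.x2, b.x1 - a.x2, 0)
        dy = max(a.y1 - b.y2, b.y1 - a.y2, 0)
        return max(dx, dy)

    clusters = []
    for w in words:
        near, far = [], []
        for c in clusters:
            (near if any(gap(w, o) <= max_dist for o in c) else far).append(c)
        # Merge every cluster the new word touches into one.
        clusters = far + [[o for c in near for o in c] + [w]]
    return clusters


def semantic_blocks(words):
    """Blocks first, then elements inside each block."""
    result = []
    for block in find_blocks(words):
        elements = [" ".join(o.text for o in sorted(c, key=lambda o: (o.y1, o.x1)))
                    for c in cluster_words(block)]
        result.append({
            "bounds": [min(w.x1 for w in block), min(w.y1 for w in block),
                       max(w.x2 for w in block), max(w.y2 for w in block)],
            "elements": elements,
        })
    return result
```

The split happens before any clustering, so words from two different cards can never be merged into one element, however close they sit.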

[Figure: Semantic OCR Result]

Result on BBC News: 11 blocks, 24 elements. Each news card is a separate block with clickable elements inside.

Key Parameters

y_th=40    # vertical merge threshold (lines -> paragraphs)
min_w=100  # minimum block width (removes narrow artifacts)
x_th=350   # horizontal merge (words -> lines)
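To make the roles of the three thresholds concrete, here is a rough sketch of how they could be applied to `(x1, y1, x2, y2)` boxes. Only the parameter names and values come from the list above; the merge rules and function names are assumptions, and the greedy single pass is deliberately simple.

```python
Y_TH, MIN_W, X_TH = 40, 100, 350


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_words_into_lines(boxes, x_th=X_TH):
    """Join boxes that overlap vertically and sit at most x_th apart horizontally."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(lines):
            same_row = box[1] < line[3] and line[1] < box[3]
            h_gap = max(box[0] - line[2], line[0] - box[2], 0)
            if same_row and h_gap <= x_th:
                lines[i] = union(line, box)
                break
        else:
            lines.append(box)
    return lines


def merge_lines_into_paragraphs(lines, y_th=Y_TH):
    """Join lines that overlap horizontally and sit at most y_th apart vertically."""
    paragraphs = []
    for line in sorted(lines, key=lambda b: b[1]):
        for i, para in enumerate(paragraphs):
            same_col = line[0] < para[2] and para[0] < line[2]
            v_gap = max(line[1] - para[3], 0)
            if same_col and v_gap <= y_th:
                paragraphs[i] = union(para, line)
                break
        else:
            paragraphs.append(line)
    return paragraphs


def group(boxes, x_th=X_TH, y_th=Y_TH, min_w=MIN_W):
    """Words -> lines -> paragraphs, then drop anything narrower than min_w."""
    paragraphs = merge_lines_into_paragraphs(merge_words_into_lines(boxes, x_th), y_th)
    return [p for p in paragraphs if p[2] - p[0] >= min_w]
```

With these values, two words 20 px apart on the same row merge into one line, a line 10 px below joins the same paragraph, and an isolated 40 px wide box is discarded as an artifact.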

Gmail Test

Same algorithm on Gmail: sidebar and content are separate blocks. Each email is a distinct element.

  • Block 1: Sidebar (Inbox, Starred, Sent...)
  • Block 2: Email list (42 separate emails)

The API

GET /semantic?window=chrome&visualize=true

Returns:

{
  "mode": "hierarchy",
  "blocks": [
    {
      "name": "CONTENT",
      "bounds": [x1, y1, x2, y2],
      "elements": [
        {"text": "TikTok owner signs deal...", "bounds": [...]}
      ]
    }
  ],
  "total_elements": 24
}
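A small sketch of how an automation script might consume this response: look up an element by its text and compute a click point. It assumes element `bounds` use the same `[x1, y1, x2, y2]` layout as block bounds; `find_element` is a hypothetical helper, and the coordinates in the sample body are made up for the example.

```python
import json


def find_element(response: dict, query: str):
    """Return (block name, center x, center y) of the first element
    whose text contains `query`, ignoring case, or None if nothing matches."""
    needle = query.lower()
    for block in response.get("blocks", []):
        for element in block.get("elements", []):
            if needle in element["text"].lower():
                x1, y1, x2, y2 = element["bounds"]
                return block["name"], (x1 + x2) // 2, (y1 + y2) // 2
    return None


# Illustrative response body with invented coordinates.
sample = json.loads("""
{
  "mode": "hierarchy",
  "blocks": [
    {
      "name": "CONTENT",
      "bounds": [100, 100, 500, 400],
      "elements": [
        {"text": "TikTok owner signs deal", "bounds": [120, 120, 380, 140]}
      ]
    }
  ],
  "total_elements": 1
}
""")

print(find_element(sample, "tiktok"))
```

The returned center point is what a click or hover action would target.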

Cost Comparison

| Approach | Cost per 1M requests |
|---|---|
| GPT-4o-mini vision | $5,551 |
| Tesseract + Algorithm | $0 |
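For a rough sense of where a number of this size comes from, the arithmetic is tokens per request times the price per token. The sketch below assumes an input price of $0.15 per 1M tokens, which is not stated in the post, together with the 36,000 tokens per screenshot quoted earlier. That lands in the same ballpark as the table's figure; the exact token count and price behind $5,551 are not given.

```python
def cost_per_million_requests(tokens_per_request: int, usd_per_million_tokens: float) -> float:
    """USD cost of 1M requests.

    One request costs tokens * price / 1e6, so 1M requests cost
    tokens * price: the two factors of one million cancel.
    """
    return tokens_per_request * usd_per_million_tokens


# Assumed price: $0.15 per 1M input tokens (not taken from the post).
print(round(cost_per_million_requests(36_000, 0.15), 2))
```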

The algorithm isn't as smart as GPT-4, but for structured UIs it's good enough — and infinitely cheaper.
