What it does
Screenbox is a virtual desktop for AI agents. Each agent gets its own Docker-isolated desktop with a real Chromium browser that it can see, click, and type in. Unlike headless MCP browser automation tools, Screenbox gives the agent a visible screen: a self-hosted AI sandbox where it interacts with software the same way a human would. If you need an agent to fill out a form, navigate a UI, or work with any application that expects someone at the screen, Screenbox handles it.
Screenbox gives your AI agent its own computer screen: a real Chromium browser running in a Docker container, with a visible display that the agent can see via screenshots, understand via OCR and semantic page maps, and interact with via clicks, keystrokes, and shell commands. The agent sees what a human would see and acts the way a human would act.
You watch it work. Connect via RDP or VNC and see the agent navigating pages, clicking buttons, typing into fields in real time. When it gets stuck, take the mouse, help it, then hand control back. Human-in-the-loop is built into the system, not bolted on.
Each desktop is fully isolated. No bind mounts, no shared filesystems. Files move only through explicit API calls. Multiple agents can have separate desktops, each with its own browser sessions, cookies, and state. Save a desktop's state as a snapshot and restore it later.
Results & Impact
Used daily in production for browser automation, web research, content management, and application testing. Multiple agents running concurrent desktops on a single server.
21 MCP tools covering the full interaction spectrum: screenshot, OCR, click, type, shell commands, Chrome DevTools, file upload/download, window management, and knowledge compilation. Connected to Claude Code via one config line.
Featured on Product Hunt. Open source, self-hosted, MIT licensed.
Knowledge compilation means agents improve between sessions. Action logs from one session are compiled into declarative facts that are auto-injected into future sessions. An agent that learned how to navigate a complex UI yesterday does not need to figure it out again today.
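The idea can be sketched in a few lines. This is an illustrative model only: the log schema, the `compile_facts` helper, and the fact format are assumptions, not Screenbox's actual internals.

```python
# Hypothetical sketch of knowledge compilation: successful actions from
# one session's log are distilled into short declarative facts that can
# be injected into the next session's context.

def compile_facts(action_log: list[dict]) -> list[str]:
    """Turn successful click actions into reusable location facts."""
    facts = []
    for entry in action_log:
        if entry.get("action") == "click" and entry.get("result") == "ok":
            facts.append(
                f'"{entry["target"]}" is at ({entry["x"]}, {entry["y"]})'
            )
    return facts

log = [
    {"action": "click", "target": "Submit", "x": 1180, "y": 690, "result": "ok"},
    {"action": "type", "text": "hello", "result": "ok"},
]
print(compile_facts(log))  # ['"Submit" is at (1180, 690)']
```

The next session starts with the fact already in context, so the agent skips the rediscovery step entirely.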
Key Features
- Real Chromium Browser. Not headless, not Playwright. A full Chromium instance with extensions, DevTools, cookies, and sessions. Everything a real browser can do.
- 21 MCP Tools. Screenshot, OCR, click, type, shell, Chrome DevTools, file I/O, window management, knowledge tools. Complete desktop control from any MCP client.
- Docker Isolation. Each desktop is a separate container. No bind mounts, no shared filesystem. Files move only through explicit API calls. Per-agent API keys enforce ownership.
- Human-in-the-Loop. Watch agents work via RDP or VNC in real time. Take mouse and keyboard control when they need help. Release control and let the agent continue.
- Knowledge Compilation. Action logs compiled into declarative facts between sessions. Auto-injected into future interactions. Agents learn from experience without retraining.
- Semantic Page Map. Chrome extension generates structured maps of interactive elements with coordinates. Agents understand page structure without OCR for every click.
- Snapshots. Save and restore desktop state on demand. Browser sessions, files, cookies, everything. Resume work exactly where you left off.
- Cross-Platform. Linux (native Docker), macOS (Docker Desktop), Windows (WSL2). ~2 GB RAM per desktop, no GPU needed. Lightweight enough for laptops.
How it works
Screenbox is an MCP server that manages Docker containers as virtual desktops. Each container runs Xvnc (TigerVNC) for the display, Chromium as the browser, and a set of CLI tools (xdotool, xclip, ImageMagick) for interaction.
When an agent takes a screenshot, the MCP server runs ImageMagick's import -window root inside the container, normalizes the image, optionally overlays a grid, and returns it. When the agent clicks, xdotool moves the cursor and sends a click event. When the agent types, characters are sent one by one through xdotool's key simulation.
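The loop above amounts to running CLI tools inside the container via docker exec. A minimal sketch, assuming a container named screenbox-desktop-1 and the display on :1 (both illustrative):

```python
import subprocess

CONTAINER = "screenbox-desktop-1"  # hypothetical container name

def desktop_cmd(*cmd: str) -> list[str]:
    """Build the docker exec invocation that runs a tool inside the desktop."""
    return ["docker", "exec", "-e", "DISPLAY=:1", CONTAINER, *cmd]

def run(*cmd: str) -> bytes:
    return subprocess.run(desktop_cmd(*cmd), check=True, capture_output=True).stdout

def screenshot() -> bytes:
    # ImageMagick's import captures the root window and writes PNG to stdout
    return run("import", "-window", "root", "png:-")

def click(x: int, y: int) -> None:
    run("xdotool", "mousemove", str(x), str(y), "click", "1")

def type_text(text: str) -> None:
    # xdotool sends the characters one by one as key events
    run("xdotool", "type", "--delay", "50", text)
```

Every interaction reduces to a short command plus streamed bytes, which is what makes the whole desktop controllable over a single API.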
Chrome operations go through a custom extension that provides semantic page maps (structured lists of interactive elements with coordinates), JavaScript evaluation, tab management, and CSS selectors. This is faster and more reliable than OCR for browser-specific interactions.
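A semantic page map is just structured data the agent can query directly. The schema below is illustrative, not the extension's actual format:

```python
# Example shape of a semantic page map: interactive elements with roles,
# labels, and click coordinates. Values are made up for illustration.
page_map = [
    {"role": "textbox", "label": "Search", "x": 640, "y": 120},
    {"role": "button", "label": "Sign in", "x": 1180, "y": 40},
    {"role": "link", "label": "Pricing", "x": 420, "y": 40},
]

def find(label: str) -> tuple[int, int]:
    """Resolve an element's click coordinates without any OCR pass."""
    for el in page_map:
        if el["label"] == label:
            return el["x"], el["y"]
    raise KeyError(label)

print(find("Sign in"))  # (1180, 40)
```

A lookup like this replaces a screenshot-plus-OCR round trip for every click inside the browser.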
A Docker API proxy sits between the MCP server and the Docker daemon, whitelisting allowed operations and handling binary data streaming for screenshots and file transfers.
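The whitelisting idea can be sketched as a small path filter in front of the Docker Engine API. The endpoint patterns here are assumptions for illustration, not the proxy's real allowlist:

```python
import re

# Only a fixed set of Docker Engine API paths is forwarded;
# everything else is rejected before it reaches the daemon.
ALLOWED = [
    r"^/containers/[\w.-]+/exec$",      # run a command in a desktop
    r"^/exec/[\w.-]+/start$",           # stream its output
    r"^/containers/[\w.-]+/archive$",   # file upload/download
]

def is_allowed(path: str) -> bool:
    return any(re.match(p, path) for p in ALLOWED)

print(is_allowed("/containers/desk1/exec"))  # True
print(is_allowed("/containers/desk1/kill"))  # False
```

Because the proxy is the only party talking to the daemon, a compromised agent cannot reach Docker operations outside the allowlist.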
Stack: Python (MCP server), Docker (desktops), TigerVNC (display), Chromium (browser), xdotool (input), ImageMagick (screenshots). Dashboard optional.
Quick Start
# Clone and set up
git clone https://github.com/dklymentiev/screenbox.git
cd screenbox
./setup.sh
docker compose up -d
# Add to Claude Code / Claude Desktop / Cursor:
{
  "mcpServers": {
    "screenbox": {
      "url": "http://localhost:8080/mcp"
    }
  }
}
# Tell your agent:
# "Create a desktop and go to github.com"
Use Cases
For AI agent developers. Your agent needs to fill out a web form, navigate a multi-step wizard, or extract data from a dashboard that has no API. Give it a Screenbox desktop and it handles the UI like a human would.
For QA and testing. Visual testing with an AI reviewer. An agent opens your web app, navigates through user flows, takes screenshots, and reports visual issues or broken interactions. Not a replacement for unit tests, but a complement that catches what automated tests miss.
For research and data collection. Navigate authenticated platforms and complex UIs. Some data lives behind logins, CAPTCHAs, and interactive interfaces. An agent with a Screenbox desktop can log in, navigate, and extract information from platforms that actively resist headless scrapers.
Lessons Learned
Real browsers beat headless for agent work. Headless browsers are great for scripted testing but poorly suited to agent workflows. Agents need to see what is on screen, understand visual layout, and handle the unexpected. A real, visible browser with a real display means the agent works with the same interface a human sees, including the dynamic content, popups, and rendering quirks that headless snapshots often miss.
Agents need to learn between sessions. Without knowledge compilation, an agent that figured out a complex UI navigation yesterday will waste time figuring it out again today. Compiling action logs into declarative facts and auto-injecting them into future sessions cut repeat-task time significantly. The agent remembers "the submit button is in the bottom-right corner" instead of re-discovering it every time.
Human oversight is a feature, not a limitation. The ability to watch an agent work via RDP and take control when needed is not a crutch. It is what makes agent desktop automation practical. Agents get stuck on CAPTCHAs, unexpected modals, and ambiguous UI states. A human can resolve these in seconds. Without oversight, the agent would fail silently or loop forever.
FAQ
What is Screenbox?
Screenbox gives AI agents their own virtual desktop with a real Chromium browser. Your agent can see the screen, click, type, and navigate, just like a human. Each desktop is an isolated Docker container.
How is this different from headless browser automation?
Headless browsers are invisible. Screenbox desktops are real visible screens you can watch via RDP. The browser has extensions, DevTools, and cookies. You can take control when the agent gets stuck.
What AI tools does it work with?
Any MCP-compatible client: Claude Desktop, Claude Code, Cursor, Copilot. 21 MCP tools. One line of config to connect.
Is it safe?
Each desktop is a fully isolated Docker container with no bind mounts. Per-agent API keys enforce ownership. Do not expose the API to the public internet.
What is knowledge compilation?
Action logs from sessions are distilled into declarative facts and auto-injected into future sessions. Agents learn from experience without retraining.