June 13, 2026

How We Test MCP Servers: The Full Methodology

When we started MCP Select, we couldn't find a single publicly available benchmark for MCP servers. Directories ranked by GitHub stars. Listicles copied README features. But nobody had actually run the servers and measured whether they worked. So we built a test harness. It's open-source, runs on a single command, and produces results we publish in full. This post explains exactly how it works.

The Problem We Were Trying to Solve

Most MCP server directories work like this: a developer finds a repository with a polished README, adds it to a list, and assigns a star rating based on... something. Maybe the README length. Maybe the author's Twitter following. Maybe a gut feeling. We tried that. It failed immediately. The first server we tested, a browser automation tool with 600+ stars, crashed on startup because it expected a global Chrome binary that wasn't mentioned in the docs. The second, a database connector with 400+ stars, silently swallowed connection errors and returned empty arrays. Both looked great on paper. Both were broken in practice. We needed a way to know, with certainty, whether a server actually works before recommending it.

The Harness in One Sentence

For every server, we install it in a clean environment, connect to it with the MCP SDK, run five production tasks, measure the latency, validate the output, and kill the process. Then we publish the raw numbers. Here's what that means in practice.

Step 1: Discovery

We find servers by scanning GitHub and npm for packages matching `mcp-server` or `@modelcontextprotocol`. We also watch the official Anthropic servers repository for new additions. When we find a candidate, we read its README and check three things:

- Does it expose tools via the MCP protocol (not just a REST wrapper)?
- Can it be installed via `npx` or `npm install` without privileged access?
- Does it have a license that permits evaluation?

If all three are true, it goes on the test queue.

Step 2: Installation in a Clean Environment

This is where most evaluations fail silently. We spin up a fresh Node.js environment with no global packages, no cached credentials, and no `.env` file. We run exactly what the README says, usually `npx -y package-name`. If the install fails, the server is rejected immediately. No partial credit.

The environment is stripped down. We only allow these system variables through:

- `PATH`, `HOME`, `USER`, `SHELL`: basic OS necessities
- `LANG`, `LC_ALL`, `TZ`, `TMPDIR`: locale and temp paths
- One declared API key, if the server needs it

Everything else is blocked. If a server tries to read `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` from the host, it gets an empty string. This prevents buggy or malicious servers from exfiltrating unrelated credentials.
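The allowlist is easiest to see as code. The following is a minimal sketch, not the harness's actual implementation: `buildCleanEnv`, `ALLOWED_VARS`, and the `TAVILY_API_KEY` example name are illustrative assumptions.

```javascript
// Allowlist copied from the variables named above; everything else is dropped.
const ALLOWED_VARS = ["PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL", "TZ", "TMPDIR"];

// Build the environment handed to the spawned server.
// `declaredKey` is the one API key name the server documents (optional).
function buildCleanEnv(hostEnv, declaredKey) {
  const env = {};
  for (const name of ALLOWED_VARS) {
    if (hostEnv[name] !== undefined) env[name] = hostEnv[name];
  }
  if (declaredKey && hostEnv[declaredKey] !== undefined) {
    env[declaredKey] = hostEnv[declaredKey];
  }
  return env;
}

// Example: only PATH and the declared key survive; the unrelated key does not.
const env = buildCleanEnv(
  { PATH: "/usr/bin", OPENAI_API_KEY: "sk-host", TAVILY_API_KEY: "tv-test" },
  "TAVILY_API_KEY"
);
console.log(Object.keys(env)); // [ 'PATH', 'TAVILY_API_KEY' ]
```

An object like this would be passed as the `env` option of `child_process.spawn`, which replaces the inherited environment instead of extending it. In this sketch a blocked variable is simply absent from the child's environment; substituting an empty string instead would be a small variation.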

Step 3: Two-Phase Connection

Before we run any tests, we try to connect and discover the server's tools. This is a separate step because many servers fail here, and the failure mode matters.

```javascript
const discovery = await harness.discover(10000);
```

The `discover()` method attempts to connect and call `listTools()` with a 10-second timeout. It classifies the result into one of five categories:

| Status | Meaning | Example |
|--------|---------|---------|
| `ready` | Connected, tools listed | Playwright MCP |
| `auth_required` | Needs an API key | Browserbase MCP |
| `auth_invalid` | Key provided but rejected | Exa MCP with wrong key |
| `timeout` | Hung for 10 seconds | Redis MCP |
| `error` | Crashed or protocol violation | Firecrawl MCP |

This classification determines what happens next. `ready` servers go straight to testing. `auth_required` servers get a key injected (if we have one) and are re-tested. `timeout` and `error` servers are documented as-is. We don't hide failures.
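To make the five categories concrete, here is a hedged sketch of how such a classifier could be written. It is not the body of `discover()`: the names `withTimeout`, `classifyDiscovery`, and `connectAndList` are hypothetical, and detecting auth failures by matching the error message is purely an assumption for illustration.

```javascript
// Race a promise against a timer; rejects with a tagged error on timeout.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(Object.assign(new Error("timeout"), { timedOut: true })),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `connectAndList` is a hypothetical function that connects to the server
// and resolves with its tool list.
async function classifyDiscovery(connectAndList, { timeoutMs = 10000, keyProvided = false } = {}) {
  try {
    const tools = await withTimeout(connectAndList(), timeoutMs);
    return { status: "ready", tools };
  } catch (err) {
    if (err && err.timedOut) return { status: "timeout", tools: [] };
    const message = String(err && err.message);
    // Assumed heuristic: treat these phrases as authentication failures.
    if (/unauthorized|api key|401|403/i.test(message)) {
      return { status: keyProvided ? "auth_invalid" : "auth_required", tools: [] };
    }
    return { status: "error", tools: [], message };
  }
}
```

Keeping the timeout as its own status, rather than folding it into `error`, is what lets a hung server be reported differently from one that crashed.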

Step 4: Production Tasks

This is the core of the benchmark. Every server runs five tasks that match its actual purpose. Not synthetic unit tests. Real user intents.

For browser automation servers, the tasks are:

1. Navigate to a local test page
2. Take a screenshot
3. Get a DOM snapshot
4. Click a button
5. Fill a form field

For database servers:

1. Connect to the database
2. Create a table
3. Insert a row
4. Query the row back
5. Delete the table

Each task is adapted to the server's actual tool names. Playwright uses `browser_navigate`, Puppeteer uses `puppeteer_navigate`. The harness looks up the correct tool name dynamically. We don't hardcode tool names.

### Latency Measurement

Every task is timed with `performance.now()` from the moment the tool call starts to the moment the result returns. We collect:

- **p50 latency**: median execution time
- **p90 latency**: 90th percentile (catches outliers)
- **Average latency**: mean across all tasks
- **Connect time**: time from process spawn to tool discovery

The per-task timeout is 30 seconds. If a task hangs longer than that, it's killed and marked as failed.

### Output Validation

This is where most benchmarks cheat. They check if the output contains a substring like "success" or "done". That's broken. Error messages often contain those words too.

We validate semantically. For a navigation task, we check that the returned URL matches the target. For a screenshot task, we verify the result contains image data. For a form fill, we query the DOM afterwards to confirm the value was actually entered.

```javascript
// Not this:
output.content[0].text.includes("success")

// This:
output.content[0].type === "image" && output.content[0].data.length > 100
```

If validation passes, the task is marked `passed`. If the tool throws an error, it's `error`. If validation fails (wrong output format, missing data), it's `failed`.

### Stateful Adapters

Some servers need setup before tasks and cleanup after. The harness supports four lifecycle hooks:

- **preConnect**: runs before any task (e.g., start a browser session)
- **preTask**: runs before each task (e.g., navigate to a fresh page)
- **postTask**: runs after each task (e.g., capture logs)
- **postConnect**: runs after all tasks (e.g., close the browser)

If `preConnect` fails, every subsequent task is skipped. This prevents cascading failures from a bad initial state.
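Putting this step together, the sketch below shows one way the per-task timeout, the three statuses, and the lifecycle hooks could interact. It is an illustration under assumptions, not the harness source: `runTask`, `runSuite`, `callTool`, and the per-task `validate` callback are hypothetical names.

```javascript
// Run one task with a wall-clock timeout and classify the outcome.
// `callTool(name, args)` and `task.validate(output)` are hypothetical callbacks.
async function runTask(task, callTool, timeoutMs = 30000) {
  const start = performance.now();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(Object.assign(new Error("task timeout"), { timedOut: true })),
      timeoutMs
    );
  });
  let status;
  try {
    const output = await Promise.race([callTool(task.tool, task.args), timeout]);
    // Semantic check on the result, not a substring match.
    status = task.validate(output) ? "passed" : "failed";
  } catch (err) {
    // A hang counts as failed; a thrown error counts as error.
    status = err?.timedOut ? "failed" : "error";
  } finally {
    clearTimeout(timer);
  }
  return { name: task.name, status, latency_ms: Math.round(performance.now() - start) };
}

// Run all tasks, wrapped in the four lifecycle hooks (each hook is optional).
async function runSuite(tasks, callTool, hooks = {}, timeoutMs = 30000) {
  try {
    await hooks.preConnect?.();
  } catch {
    // Bad initial state: skip everything instead of recording misleading failures.
    return tasks.map((t) => ({ name: t.name, status: "skipped", latency_ms: null }));
  }
  const results = [];
  try {
    for (const task of tasks) {
      await hooks.preTask?.(task);
      results.push(await runTask(task, callTool, timeoutMs));
      await hooks.postTask?.(task);
    }
  } finally {
    await hooks.postConnect?.(); // cleanup runs even if a hook throws mid-suite
  }
  return results;
}
```

In a real run, `callTool` would wrap the MCP client's tool call, and each task's `tool` field would be filled in from the tool list found during discovery.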

Step 5: Process Cleanup

MCP servers spawn child processes. Those child processes spawn more child processes. If we don't clean up aggressively, we get zombies. The harness uses a three-layer kill strategy:

1. **SIGTERM** the tracked child PID, wait 1.5 seconds
2. **SIGKILL** the process group (kills children first, then parent)
3. **`pkill`** by package name pattern as a final fallback

This is overkill by design. We'd rather kill too aggressively than leave a hung Chrome process consuming 2GB of RAM.
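The three layers can be sketched as follows. This is a POSIX-only illustration under assumptions, not the harness code: it assumes the server was spawned with `detached: true` so that its PID doubles as its process-group ID, and `killServer`, `defaultPkill`, and the injectable `deps` parameter are hypothetical names.

```javascript
// Three-layer shutdown mirroring the list above.
// `deps` exists only so the signalling can be stubbed out (an assumption of this sketch).
async function killServer(pid, packagePattern, deps = {}) {
  const kill = deps.kill ?? ((target, signal) => process.kill(target, signal));
  const sleep = deps.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const pkill = deps.pkill ?? defaultPkill;

  // Signalling a process that is already gone throws; that is fine here.
  const attempt = (target, signal) => {
    try {
      kill(target, signal);
    } catch {
      /* already exited */
    }
  };

  attempt(pid, "SIGTERM");      // 1. polite request to the tracked child
  await sleep(1500);            //    grace period of 1.5 seconds
  attempt(-pid, "SIGKILL");     // 2. negative PID targets the whole process group
  await pkill(packagePattern);  // 3. last resort: match by command line
}

// Default layer 3: `pkill -KILL -f <pattern>`.
// pkill exits non-zero when nothing matches, so its error is ignored.
async function defaultPkill(pattern) {
  // Dynamic import keeps the sketch usable from both CommonJS and ES modules.
  const { execFile } = await import("node:child_process");
  return new Promise((resolve) => {
    execFile("pkill", ["-KILL", "-f", pattern], () => resolve());
  });
}
```

Because `pkill -f` matches against full command lines, the pattern should be specific enough that it cannot match the harness process itself.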

What We Report

After all tasks complete, the harness produces a JSON file with this structure:

```json
{
  "date": "2026-06-08T13:27:13Z",
  "server": "Playwright MCP",
  "tool_count": 23,
  "tests": [
    { "name": "navigate", "status": "passed", "latency_ms": 7038 },
    { "name": "screenshot", "status": "passed", "latency_ms": 4120 },
    ...
  ],
  "overall": {
    "total_tests": 5,
    "passed": 5,
    "failed": 0,
    "skipped": 0,
    "pass_rate": 100,
    "avg_latency_ms": 4321,
    "p50_latency_ms": 4120,
    "p90_latency_ms": 6070
  }
}
```

We do not calculate an opaque "quality score." Different use cases care about different things. A CI pipeline cares about 100% pass rate. A prototyping session cares about sub-second latency. We show all the numbers so you can weight them yourself.
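The `overall` block can be derived from the `tests` array alone. Here is a minimal sketch, assuming linear interpolation between ranks for the percentiles. The interpolation rule, the choice to include every test with a numeric latency, and the function names `percentile` and `summarize` are assumptions of this example, not a statement of how the harness computes them.

```javascript
// Percentile with linear interpolation between the two nearest ranks.
// Expects an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

// Build the `overall` summary from per-test results.
function summarize(tests) {
  const count = (status) => tests.filter((t) => t.status === status).length;
  const latencies = tests
    .filter((t) => typeof t.latency_ms === "number")
    .map((t) => t.latency_ms)
    .sort((a, b) => a - b);
  const round = (x) => (x === null ? null : Math.round(x));
  const mean = latencies.length
    ? latencies.reduce((sum, x) => sum + x, 0) / latencies.length
    : null;
  return {
    total_tests: tests.length,
    passed: count("passed"),
    failed: count("failed"),
    skipped: count("skipped"),
    pass_rate: tests.length ? Math.round((count("passed") / tests.length) * 100) : 0,
    avg_latency_ms: round(mean),
    p50_latency_ms: round(percentile(latencies, 50)),
    p90_latency_ms: round(percentile(latencies, 90)),
  };
}
```

With only five samples per server, the p90 is dominated by the two slowest tasks, so it is best read alongside the per-task latencies, not as a single headline number.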

Auth Classification

After testing, we classify each server by how easy it is to evaluate:

- **Free to Test**: install via `npx`, no key needed. Examples: Playwright MCP, Puppeteer MCP, Filesystem MCP.
- **Requires Auth**: needs a free API key. We test with the free tier when we have one. Examples: Tavily MCP, Browserbase MCP.
- **Requires Paid**: no free tier available. We verify the package installs but cannot run full tasks. Examples: some Stripe integrations.

This classification appears on every server card so you know what you're getting into before you click "Install."

Reproducing Our Results

Every benchmark script is in our GitHub repository. To reproduce a result:

```bash
git clone https://github.com/AlexGn/mcpselect
cd mcpselect/tests/browser-automation
node run.mjs
```

The test page is a self-contained HTML file served on an ephemeral port. No network dependencies. Same DOM on every run.
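For readers unfamiliar with the ephemeral-port pattern, here is a hedged sketch of serving a fixed page that way with Node's built-in `http` module. It is not taken from `run.mjs`; `serveTestPage` and the sample markup are hypothetical.

```javascript
// Serve one fixed HTML string on an OS-assigned port, bound to loopback only.
async function serveTestPage(html) {
  // Dynamic import keeps the sketch usable from both CommonJS and ES modules.
  const { createServer } = await import("node:http");
  const server = createServer((req, res) => {
    // "Connection: close" avoids lingering keep-alive sockets at shutdown.
    res.writeHead(200, {
      "Content-Type": "text/html; charset=utf-8",
      Connection: "close",
    });
    res.end(html);
  });
  await new Promise((resolve, reject) => {
    server.once("error", reject);
    server.listen(0, "127.0.0.1", resolve); // port 0 asks the OS for a free port
  });
  const { port } = server.address();
  return {
    url: `http://127.0.0.1:${port}/`,
    close: () => new Promise((resolve) => server.close(resolve)),
  };
}
```

Letting the OS choose the port avoids collisions when several benchmark runs execute on the same machine, and binding to `127.0.0.1` keeps the page unreachable from other hosts.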

What We Don't Do

To be clear about our limits:

- We don't test every possible tool on a server. We test five representative tasks.
- We don't benchmark with real user traffic. We test on a local machine with clean state.
- We don't evaluate documentation quality, community support, or pricing fairness. We measure whether the server works when you install it.

These other factors matter. But they're not what we measure.

The Bottom Line

Our benchmark isn't perfect. But it's reproducible, transparent, and honest. Every number on this site came from running actual code against actual servers. If you disagree with a result, you can run the same test and see for yourself. That's the point. In a space where most rankings are based on README polish and Twitter hype, we wanted something you could verify.

Run it yourself. Clone the test harness, pick a server, and see what happens. Or browse our benchmarked servers to see the results.

*Last updated: June 13, 2026. Test harness version 2.0.0. MCP SDK 1.0.0.*