OpenHands Review 2026: Open-Source AI Coding Agent, 72% SWE-Bench, and the Self-Hosting Catch

OpenHands has 74,700 GitHub stars, beats Devin 2.0 on SWE-bench Verified by roughly 27 points, and the cloud tier starts at zero. Those three facts should have every developer evaluating Devin at least trying it. The reason most don’t, and the reason this review matters, is that OpenHands’ strengths and failure modes are both more extreme than a 30-second summary captures.

This is a post-v1.7.0 (May 1, 2026) look at whether OpenHands earns a permanent slot in a professional workflow.

What OpenHands actually is

OpenHands (formerly OpenDevin) is an autonomous software engineering agent from All Hands AI, a startup that raised an $18.8M Series A. Give it a task in plain English — fix this GitHub issue, implement this feature — and it spins up a sandboxed environment, writes code, runs tests, and iterates until the task resolves or the context runs out. You watch a live action stream rather than write the code yourself.

The project ships in four configurations:

  • Open Source (self-hosted): MIT-licensed core stack. Install via Docker, point at any LLM via API key, run the web GUI at localhost:3000. No seat fees.
  • Cloud Free (BYOK tier): Hosted on OpenHands infrastructure. Bring your own Anthropic, OpenAI, or Mistral key. Hard cap at 10 conversations per day. LLM billed at-cost through your own key — no markup added by OpenHands.
  • Cloud Pro ($20/month): Covers runtime compute so sandbox VMs don’t appear as a separate line item. Unlocks GitHub, GitLab, Bitbucket, and Slack integrations. Includes $20 in one-time cloud credits. LLM still billed at-cost.
  • Enterprise (custom pricing): Self-hosted Kubernetes deployment with PostgreSQL-backed multi-tenancy, RBAC, and the Agent Control Plane — a centralized system for managing agent fleets at scale, launched May 6, 2026.

One architectural note before going further: the V0→V1 SDK split happened in November 2025. A substantial portion of community tutorials and third-party guides describe the V0 architecture, which is a different codebase. If a setup article doesn’t reference V1 or the Software Agent SDK, treat it as outdated.

Pricing in practice

The cost model is genuinely different from every other tool in this space:

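| Plan              | Monthly fee | LLM cost                           | Runtime                     |
|-------------------|-------------|------------------------------------|-----------------------------|
| Self-hosted (OSS) | $0          | Your API key, market rates         | Your server                 |
| Cloud Free        | $0          | Your API key, market rates         | Covered (10 conv/day limit) |
| Cloud Pro         | $20         | At-cost via OpenHands LLM provider | Covered                     |
| Enterprise        | Custom      | Custom                             | Included                    |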

When you use the OpenHands LLM provider on Cloud Pro, you pay Anthropic/OpenAI rates directly with zero markup. Claude Sonnet 4.5 runs $3 per 1M input and $15 per 1M output tokens — the same rate as calling the API yourself.

Compare that to Devin 2.0: also $20/month Pro, but ACU overages (the compute units Devin burns per task) can add $100–$400/month for active users. OpenHands’ cost structure is substantially more predictable for teams running high task volume.
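To make the arithmetic concrete, here is a rough sketch of a monthly estimate for Cloud Pro at the Claude Sonnet 4.5 rates quoted above. The task count and per-task token figures are invented for illustration, not measured from OpenHands; plug in your own numbers.

```python
# Per-million-token rates quoted in the article for Claude Sonnet 4.5.
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0
PRO_FEE_USD = 20.0  # Cloud Pro monthly fee


def llm_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """LLM spend for a given token volume at the quoted at-cost rates."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_MTOK
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


def monthly_total_usd(tasks: int, input_per_task: int, output_per_task: int) -> float:
    """Cloud Pro fee plus LLM spend for a month of identical tasks."""
    return PRO_FEE_USD + tasks * llm_cost_usd(input_per_task, output_per_task)


if __name__ == "__main__":
    # Hypothetical workload: 100 tasks, 500k input and 20k output tokens each.
    print(f"${monthly_total_usd(100, 500_000, 20_000):.2f}")
```

Under those assumed volumes the total lands at about $200 for the month, almost all of it LLM spend. The point is less the exact figure than that the bill scales linearly with tokens you can measure.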

If you want to run the self-hosted version without idling a local workstation, cloud GPU instances (e.g., RunPod) are a reasonable option for on-demand compute. For hardware sizing context, the runaihome.com guide to local LLM servers covers memory requirements by model size.

What it gets right

GitHub issue → pull request

This is the workflow OpenHands was purpose-built for, and the one where it performs most reliably. Point it at a GitHub issue URL: it reads the issue, navigates the repository, identifies files to change, creates a branch, writes the fix, runs tests, and opens a PR. The loop is coherent end-to-end in a way that simpler agents aren’t.

For maintainers carrying a backlog of labeled good-first-issue bugs — the kind that require a real fix but don’t need senior judgment — this is not a toy. The code review step remains mandatory; OpenHands closes issues but doesn’t guarantee quality.

Planning Mode (beta since v1.6.0)

Before March 30, 2026, OpenHands jumped straight into execution on every task. Sometimes that produced a correct solution in two minutes. Sometimes it dug into a hole and burned tokens going in circles on a wrong assumption.

Planning Mode changes the loop: the agent writes an implementation plan and pauses for your approval before touching the codebase. For anything beyond a trivial single-file fix, this one feature meaningfully improves completion rates. It’s still labeled beta as of v1.7.0, but stable enough to use daily.
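The plan-then-approve loop is easy to picture as a gate in front of execution. The sketch below is a generic illustration of that pattern, not OpenHands code: `make_plan`, `approve`, and `execute_step` are hypothetical callables standing in for the agent, the human reviewer, and the sandbox.

```python
from typing import Callable, List


def run_with_plan_approval(
    task: str,
    make_plan: Callable[[str], List[str]],
    approve: Callable[[List[str]], bool],
    execute_step: Callable[[str], None],
) -> bool:
    """Draft a plan, wait for approval, and only then execute the steps.

    Returns True if the plan was approved and executed, False if rejected.
    """
    plan = make_plan(task)
    if not approve(plan):
        return False  # rejected: no step has run, so nothing was modified
    for step in plan:
        execute_step(step)
    return True


if __name__ == "__main__":
    executed: List[str] = []
    ok = run_with_plan_approval(
        "fix failing date parser test",
        make_plan=lambda t: [f"locate code for: {t}", "write fix", "run tests"],
        approve=lambda plan: len(plan) <= 5,  # stand-in for a human review
        execute_step=executed.append,
    )
    print(ok, executed)
```

The useful property is that a rejected plan costs one planning call and nothing else, which is where the token savings on wrong assumptions come from.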

Model flexibility

This is OpenHands’ structural advantage over every commercial autonomous agent. Claude Sonnet 4.5, GPT-5, Gemini 3.1, Devstral 24B, Qwen3-235B: you swap models by changing a config field. When Anthropic ships a better model, you configure it once. Devin and Kiro use locked or constrained model stacks; you’re dependent on the vendor’s release cycle.

SWE-bench scores vary significantly by model choice. Devstral 24B scores approximately 46.8% on SWE-bench Verified when used as the OpenHands backend. Smaller open-weight models perform lower. Claude Sonnet 4.5 reaches ~72% on the V1 SDK harness. The model you select determines what “OpenHands” actually delivers.
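As a rough illustration of what "changing a config field" means in practice, the sketch below models settings as an immutable object where only the model name changes. `AgentSettings`, its fields, and the model identifier strings are hypothetical placeholders, not the real OpenHands configuration schema; the scores are the ones quoted in this review.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentSettings:
    """Hypothetical settings object; field names are illustrative only."""

    model: str
    max_iterations: int = 50


# SWE-bench Verified scores quoted in this review, keyed by placeholder model IDs.
QUOTED_SCORES = {
    "claude-sonnet-4.5": 72.8,
    "devstral-24b": 46.8,
}


def swap_model(settings: AgentSettings, model: str) -> AgentSettings:
    """Return a copy of the settings with only the model changed."""
    return replace(settings, model=model)
```

Calling `swap_model(AgentSettings(model="devstral-24b"), "claude-sonnet-4.5")` leaves everything else untouched, which mirrors the claim above: the agent stays the same and the expected resolve rate moves with the backend.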

Software Agent SDK

For teams building on top of agentic infrastructure, OpenHands ships a composable Python SDK under Apache 2.0 (separate repository). You can define custom agent workflows, delegate specialized tasks to sub-agents via TaskToolSet, and integrate OpenHands programmatically into CI/CD pipelines without running the GUI. This is the layer most enterprise self-hosted deployments are built on.

The Agent Control Plane (Enterprise, launched May 6, 2026) extends this into fleet management: least-privilege access controls scoped at the workflow level, spend tracking per workflow for cost attribution, and full action logging for debugging and compliance. It directly addresses the concern that autonomous agents running in production don’t have granular permission boundaries.

Where it breaks

The Docker dependency

Every self-hosted OpenHands instance requires Docker. The agent spawns sandbox containers for each task — this isolates filesystem access and prevents runaway processes from affecting the host. On a developer workstation, manageable. In a CI/CD pipeline, the required socket mount (/var/run/docker.sock) creates port-mapping, permission, and resource allocation issues that take real engineering time to resolve.

The v1.7.0 release added a SANDBOX_KVM_ENABLED environment variable to pass KVM acceleration through to sandbox containers, which improves performance on supported hardware, but doesn’t remove the Docker requirement.
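Much of the CI pain comes down to whether the socket is actually present and usable inside the job. A minimal preflight check, assuming the standard socket path mentioned above, might look like the following; the function name and messages are my own and are not part of OpenHands.

```python
import os
import stat


def docker_socket_problem(path: str = "/var/run/docker.sock") -> str:
    """Return a description of the problem, or "" if the socket looks usable."""
    if not os.path.exists(path):
        return f"{path} does not exist (is Docker running, is the socket mounted?)"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return f"{path} exists but is not a Unix socket"
    if not os.access(path, os.R_OK | os.W_OK):
        return f"{path} is not readable and writable by the current user"
    return ""
```

Running this as the first pipeline step and failing fast on a non-empty result is cheaper than discovering the problem when the first sandbox container fails to start.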

Critical version mismatch: The SANDBOX_RUNTIME_CONTAINER_IMAGE tag must exactly match the openhands image version. Running openhands:1.7.0 against runtime:1.6.0-nikolaik fails immediately with a cryptic container start error. Always pull matching tags.
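A small guard can catch this before anything starts. The sketch below compares the version part of two image references; it assumes, based on the example above, that the runtime tag is the version followed by a hyphenated variant suffix such as `-nikolaik`. The helper names are hypothetical.

```python
def image_version(image_ref: str) -> str:
    """Extract the version from a reference like 'runtime:1.7.0-nikolaik'."""
    _, sep, tag = image_ref.rpartition(":")
    # A '/' after the last ':' means the colon was a registry port, not a tag.
    if not sep or not tag or "/" in tag:
        raise ValueError(f"no tag in image reference: {image_ref!r}")
    return tag.split("-", 1)[0]  # drop a variant suffix such as '-nikolaik'


def tags_match(app_image: str, runtime_image: str) -> bool:
    """True when both image references carry the same version."""
    return image_version(app_image) == image_version(runtime_image)
```

With the pair from the example, `tags_match("openhands:1.7.0", "runtime:1.6.0-nikolaik")` returns `False`, which is a clearer signal than the container start error.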

Git credential handling

Multiple independent users have documented the same failure patterns: OpenHands sometimes attempts to push to the default branch directly, handles credentials incorrectly, and can’t reliably retrieve PR comments or status checks via CLI tools. Automated PR workflows built on OpenHands need additional safeguards — branch protection rules that prevent direct pushes to main, and explicit credential injection into the container environment.
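Server-side branch protection is the real safeguard, but a client-side check in whatever wrapper launches the agent adds a second layer. This is a minimal sketch of such a check; the protected patterns are example values and the function is not an OpenHands feature.

```python
import fnmatch
from typing import Iterable

# Example patterns only; adjust to your repository's conventions.
PROTECTED_PATTERNS = ("main", "master", "release/*")


def push_allowed(branch: str, protected: Iterable[str] = PROTECTED_PATTERNS) -> bool:
    """Return False when the target branch matches a protected pattern."""
    name = branch.removeprefix("refs/heads/")  # requires Python 3.9+
    return not any(fnmatch.fnmatchcase(name, pattern) for pattern in protected)
```

A wrapper could call `push_allowed` on the branch the agent proposes and refuse to forward the push when it returns `False`.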

The platform also lacks native secrets management. Passing an API key or database credential to an agent task requires environment variable injection, with no built-in secret store or masking.
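Until there is built-in masking, one stopgap is to scrub known secret values from any agent output before it is stored or displayed. The sketch below does plain string replacement over the values of named environment variables; it is an illustration of the idea, not a substitute for a proper secret store, and it will not catch encoded or truncated copies of a secret.

```python
import os
from typing import Iterable, Mapping


def mask_secrets(
    text: str,
    names: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Replace the values of the named environment variables with '***'."""
    # Longest values first so a secret that contains another is masked whole.
    values = sorted({env[n] for n in names if env.get(n)}, key=len, reverse=True)
    for value in values:
        text = text.replace(value, "***")
    return text
```

For example, passing the log text and `["ANTHROPIC_API_KEY"]` (an example variable name) returns the log with the key's value replaced wherever it appears verbatim.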

Browser tool reliability

When OpenHands tasks require checking a live URL — reading API docs, verifying deployment state, scraping a configuration value — the browsing tool is the flakiest part of the stack. JavaScript-heavy pages, bot detection, and site structure changes cause silent failures that derail otherwise correct task flows. The documented workaround: reach for curl or library-level HTTP calls whenever the agent needs to interact with a web resource. Don’t build production workflows that depend on browser tool success.
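For a sense of what "library-level HTTP calls" can look like, here is a small standard-library fetch helper with a timeout and retries. It is a sketch of the workaround, not something OpenHands ships; the retry count and backoff values are arbitrary defaults.

```python
import time
import urllib.request
from typing import Optional


def fetch_text(url: str, retries: int = 3, timeout: float = 10.0, backoff: float = 1.0) -> str:
    """Fetch a URL as text, retrying on I/O errors and raising on final failure."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                charset = resp.headers.get_content_charset() or "utf-8"
                return resp.read().decode(charset)
        except OSError as exc:  # covers urllib.error.URLError and socket timeouts
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"failed to fetch {url} after {retries} attempts") from last_error
```

The key difference from the browser tool is that failure is loud: the helper either returns the body or raises, so a task can't continue on a page that silently failed to load.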

The 10-conversation free ceiling

The Cloud Free tier provides genuine evaluation access, but 10 conversations per day is a hard ceiling that real development work exceeds quickly. Solo developers running focused testing may stay under it. Any team scenario hits the limit on day one. The upgrade to Cloud Pro ($20/month) is straightforward, but the free tier isn’t a sustainable daily workflow for professionals.

SWE-bench in context

OpenHands publishes two commonly cited scores:

  • 77.6% on SWE-bench Verified — V0 harness, using inference-time scaling with Claude 3.5 Sonnet Thinking. This is the badge visible on the GitHub repository.
  • ~72.8% on SWE-bench Verified — V1 SDK harness, with Claude Sonnet 4.5. Newer infrastructure, lower score, more representative of typical production behavior.

Both numbers are legitimate for what they measure. Neither is directly comparable to Claude Code’s 87.6% or Devin’s 45.8% without noting that all three use different evaluation harnesses and setups.

An important benchmark context: OpenAI’s independent audit found that some frontier models have likely encountered SWE-bench Verified tasks during pretraining. Scores in the high 70s–80s should be read as “capable on representative real-world issues,” not “will resolve this fraction of your actual bug backlog.” The SWE-bench Pro leaderboard — which uses a harder, less-contaminated task set — shows frontier models scoring 30–50 percentage points lower than their Verified numbers.

What is worth noting about the comparison to Devin: a gap of 72–77% versus 45.8% is large enough to be real even accounting for harness differences. OpenHands with a frontier model resolves real GitHub issues at a substantially higher rate than Devin 2.0.
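The size of that gap depends on which OpenHands number you pick. A quick calculation using only the scores quoted in this section:

```python
# SWE-bench Verified scores as quoted in this review (percent resolved).
SCORES = {
    "OpenHands V0 harness": 77.6,
    "OpenHands V1 SDK harness": 72.8,
    "Devin 2.0": 45.8,
}


def gap_points(a: str, b: str) -> float:
    """Percentage-point difference between two quoted scores, to one decimal."""
    return round(SCORES[a] - SCORES[b], 1)
```

`gap_points("OpenHands V1 SDK harness", "Devin 2.0")` gives 27.0 points and the V0 harness gives 31.8, so the headline gap is the conservative one. Keep in mind these are differences between scores from different harnesses, so treat them as indicative rather than exact.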

Comparison table

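| Tool        | SWE-bench Verified | Price                 | Model lock     | Self-host    | Best for                                              |
|-------------|--------------------|-----------------------|----------------|--------------|-------------------------------------------------------|
| OpenHands   | ~72–77% (V1/V0)    | $0–$20/mo + LLM       | None           | Yes (Docker) | GitHub issue automation, privacy-first BYOK workflows |
| Devin 2.0   | 45.8%              | $20/mo + ACU overages | Yes            | No           | Defined async tasks, enterprise Cognizant workflows   |
| Claude Code | 87.6%              | $20–$200/mo           | Anthropic only | Terminal     | Best agentic quality, Anthropic-native                |
| Cline       | Not benchmarked    | $0 + API cost         | None           | Via VS Code  | IDE-integrated Plan/Act, real-time coding             |
| Aider       | Not benchmarked    | $0 + API cost         | None           | Terminal     | Lightweight, version-controlled, minimal setup        |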

Honest take

Use OpenHands if you’re automating GitHub issue resolution at scale and need a cost-predictable, model-agnostic tool that you can fully self-host. The Pro tier at $20/month plus at-cost LLM is cheaper than Devin for high-volume task queues, and the 72–77% SWE-bench score is a real advantage.

Use the self-hosted path if your team handles pre-launch IP, NDA-bound code, or regulated-industry data that can’t leave your network. Pair it with a local Ollama model (Qwen3-235B or Devstral 24B) and you get an autonomous coding agent with zero external data exposure — no other tool in this roundup matches that combination.

Don’t use OpenHands as your daily IDE companion. It has no real-time autocomplete, no editor integration, and the conversation-based interaction is slower than Cursor or Cline for iterative sessions. OpenHands handles batched, well-defined tasks; it’s not a replacement for an IDE-integrated assistant.

Budget setup time. The Docker socket permissions, version pinning, and Git credential injection each have documented failure modes. Expect at least a day of configuration before the self-hosted setup runs reliably in production. The Cloud Pro tier eliminates most of this at the cost of $20/month.

OpenHands is the best open-source autonomous coding agent available as of May 2026. The model flexibility and benchmark performance are genuine advantages over Devin. The trade-off is infrastructure friction and rougher real-world Git behavior. Neither is fundamental; both are the expected cost of running open-source infrastructure at the frontier.

Last updated May 24, 2026. Pricing and features change frequently; verify current state before purchasing.
