Evaluate your agent in a full simulation. Improve it for the real world.

For teams shipping AI agents: a private, ever-evolving benchmark grounded in your agent's real environment. SailFar runs the evaluations, explains what failed, and opens a PR with the fix.

Request early access

Evaluation is the bottleneck.

Agent teams ship blind and find out what broke from real users.

A faster and more reliable feedback loop.

01

Understand

Agent environment: Mapped
.sailfar/env.md: environment map
tool_sources.json: 20 sources
tool_surface_spec.json: 31 tools
Environment: Union Airlines support agent
FastAPI · LangChain · Postgres · 31 tools · 6-step call flow
02

Simulate

Scenario set: 237 scenarios
S1.1 Basic booking lookup (happy path)
S2.1 Economy change denied (prod replica)
S3.3 Past flown booking (edge case)
S4.4 Refund with no ID (adversarial)
+ 233 more
03

Evaluate

Evaluation report
Base · Test
Task accuracy: 97%
Hallucination rate: 4%
Tool efficiency: 94%
Tone alignment: 91%
S4.4: Identity verification still failing
04

Improve

SailFar agent
Found the gap on S4.4: the agent refunded without verifying identity. Opened a PR with the fix.
Require ID before refund (PR #248)
- refund(order)
+ verify_identity(order)
+ refund(order)
checks passed · re-ran S4.4 → PASS
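
The change in that diff can be sketched in Python roughly as follows. Only the names `verify_identity` and `refund` come from the diff above; the `Order` type, its `identity_verified` flag, the `IdentityNotVerified` exception, and the `handle_refund` wrapper are hypothetical, added just to make the example runnable.

```python
from dataclasses import dataclass


class IdentityNotVerified(Exception):
    """Raised when a refund is requested before identity is confirmed."""


@dataclass
class Order:
    # Hypothetical order record; field names are assumptions, not SailFar's schema.
    order_id: str
    amount: float
    identity_verified: bool = False


def verify_identity(order: Order) -> None:
    # Stand-in check: a real agent would call an identity-verification tool here.
    if not order.identity_verified:
        raise IdentityNotVerified(f"identity not verified for order {order.order_id}")


def refund(order: Order) -> str:
    # Stand-in for the refund tool call.
    return f"refunded {order.amount:.2f} for order {order.order_id}"


def handle_refund(order: Order) -> str:
    # The fix from the diff: verify identity first, then refund.
    verify_identity(order)
    return refund(order)
```

With this guard in place, a scenario like S4.4 (refund with no ID) would stop at `verify_identity` instead of reaching `refund`.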

Ship faster. Earn the trust.

Evaluation becomes your team's moat.

A self-evolving, multimodal evaluator

Stays consistent and aligned, evolving automatically from minimal builder feedback, and grades text, voice, image, and video alike.

Tailored tests in minutes

Every new feature gets its own scenarios, in minutes, not sprints.

Comprehensive + adversarial

Covers the full behavior space, from common paths to adversarial edge cases, not just the few you'd think to write.

Managed cloud sandbox

A full, clean simulation of your agent's world, with no dependency on or interference with your production data or environment.

FAQ

How much work does SailFar take from my team?

No code changes or SDK are required to start. SailFar remembers your feedback, asks for input on the cases that matter most, and uses a small calibration set to align the evaluator. If the simulation needs code changes, SailFar can drive a coding agent to open a PR with them.

How can I trust the simulations, scenarios, and evaluator?

SailFar first understands your agent and environment. Scenarios are grounded in that world, optionally seeded from production traces, lightly guided by your input, and expanded with long-horizon reasoning plus adversarial play to mine edge cases. The evaluator is calibrated against your feedback and held-out examples, so misses become new rubrics and regression scenarios.
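
As a rough illustration of calibrating against held-out examples, the sketch below measures how often an evaluator agrees with human verdicts and collects the disagreements as candidates for regression scenarios. This is not SailFar's implementation: `LabeledExample`, `calibrate`, and the toy keyword evaluator are all hypothetical names chosen for the example.

```python
from typing import Callable, List, NamedTuple, Tuple


class LabeledExample(NamedTuple):
    # Hypothetical held-out example: a transcript plus a human pass/fail verdict.
    scenario_id: str
    transcript: str
    human_verdict: bool


def calibrate(
    evaluator: Callable[[str], bool],
    held_out: List[LabeledExample],
) -> Tuple[float, List[str]]:
    """Return (agreement rate, scenario ids where the evaluator disagreed)."""
    misses = [
        ex.scenario_id
        for ex in held_out
        if evaluator(ex.transcript) != ex.human_verdict
    ]
    if not held_out:
        return 0.0, misses
    agreement = (len(held_out) - len(misses)) / len(held_out)
    return agreement, misses


def toy_evaluator(transcript: str) -> bool:
    # Toy rule: pass only if the transcript mentions identity verification.
    return "verified identity" in transcript


examples = [
    LabeledExample("S1.1", "agent verified identity, then looked up booking", True),
    LabeledExample("S4.4", "agent refunded without checking ID", False),
    LabeledExample("S3.3", "agent explained the booking was already flown", True),
]

agreement, misses = calibrate(toy_evaluator, examples)
```

Here the toy evaluator wrongly fails S3.3, so `misses` contains that scenario id; in the workflow described above, a miss like this is what would turn into a new rubric or regression scenario.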

Can SailFar evaluate voice, image, and video agents?

Yes. SailFar evaluates text, voice, image, and video agents end-to-end: trace, output, and outcome. For voice and video, it can test tone, interruption handling, visual fidelity, physical realism, and prompt adherence.

Is my data used to train models?

No. Your data stays yours. SailFar can work with approved samples, traces, and environment context to build evaluations, but your data is not used to train anyone's models.

A bespoke, evolving benchmark for every agent.

We're working with a small group of early design partners.

Request early access