Evaluate your agent in a full simulation. Improve it for the real world.

For teams shipping AI agents: a private, ever-evolving benchmark grounded in your agent's real environment. SailFar runs the evaluations, explains what failed, and opens the fix.

Request early access

Evaluation is the bottleneck.

Agent teams ship blind and find out what broke from real users.

Production data is limited, stale, and off-limits.
Public benchmarks are too generic to reflect your agent's reality.
Synthetic data requires building a full simulation from scratch.
Homegrown evaluators drift and need constant realignment.
Multimodal outputs are hard to grade reliably.

A faster and more reliable feedback loop.

Understand

Agent environment

Mapped

.sailfar/env.md · environment map

tool_sources.json · 20 sources

tool_surface_spec.json · 31 tools

Environment

Union Airlines support agent

FastAPI · LangChain · Postgres · 31 tools · 6-step call flow

Simulate

Scenario set

237 scenarios

S1.1 · Basic booking lookup · happy path

S2.1 · Economy change denied · prod replica

S3.3 · Past flown booking · edge case

S4.4 · Refund with no ID · adversarial

+ 233 more

Evaluate

Evaluation report

Base · Test

Task accuracy: 97%

Hallucination rate: 4%

Tool efficiency: 94%

Tone alignment: 91%

S4.4 Identity verification still failing

Improve

SailFar agent

Found the gap on S4.4: the agent refunded without verifying identity. Opened a PR with the fix.

Require ID before refund · PR #248

- refund(order)
+ verify_identity(order)
+ refund(order)

✓ checks passed · re-ran S4.4 → PASS
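As a rough illustration of what a fix like the one above might look like in an agent's tool layer, here is a minimal Python sketch. The `Order` fields, the `IdentityNotVerified` exception, and the `handle_refund` wrapper are hypothetical names made up for this example; only `verify_identity` and `refund` come from the diff, and their real signatures are not shown on this page.

```python
from dataclasses import dataclass


class IdentityNotVerified(Exception):
    """Raised when a refund is requested before identity is confirmed."""


@dataclass
class Order:
    # Hypothetical order record; field names are assumptions for this sketch.
    order_id: str
    identity_verified: bool = False
    refunded: bool = False


def verify_identity(order: Order) -> None:
    # Stand-in check: a real agent would call its identity tool here.
    if not order.identity_verified:
        raise IdentityNotVerified(order.order_id)


def refund(order: Order) -> None:
    # Stand-in for the real refund tool call.
    order.refunded = True


def handle_refund(order: Order) -> str:
    # Mirrors the diff: verify identity first, refund only if that succeeds.
    try:
        verify_identity(order)
    except IdentityNotVerified:
        return "denied: identity not verified"
    refund(order)
    return "refunded"
```

A scenario such as S4.4 (refund with no ID) would then expect `handle_refund(Order("A1"))` to deny the request and leave the order unrefunded.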

Ship faster. Earn the trust.

Evaluation becomes your team's moat.

A self-evolving, multimodal evaluator

Stays consistent and aligned, evolving automatically from minimal builder feedback, and grades text, voice, image, and video alike.

Tailored tests in minutes

Every new feature gets its own scenarios, in minutes, not sprints.

Comprehensive + adversarial

Covers the full behavior space, from common paths to adversarial edge cases, not just the few you'd think to write.

Managed cloud sandbox

A full, clean simulation of your agent's world, with no dependency on or interference with your production data or environment.

FAQ

How much work does SailFar take from my team?

No code changes or SDK required to start. SailFar remembers your feedback, asks for input on the cases that matter most, and uses a small calibration set to align the evaluator. If the simulation needs code changes, SailFar can drive a coding agent to open a pull request with them.

How can I trust the simulations, scenarios, and evaluator?

SailFar first understands your agent and environment. Scenarios are grounded in that world, optionally seeded from production traces, lightly guided by your input, and expanded with long-horizon reasoning plus adversarial play to mine edge cases. The evaluator is calibrated against your feedback and held-out examples, so misses become new rubrics and regression scenarios.
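One generic way to check an evaluator against held-out examples is to compare its verdicts with human labels and list the scenarios where they disagree. The sketch below is illustrative only and is not SailFar's actual calibration method; the function names and the pass/fail verdict format are assumptions.

```python
from typing import Dict, List


def agreement_rate(evaluator: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Fraction of held-out scenarios where evaluator and human verdicts match."""
    shared = evaluator.keys() & human.keys()
    if not shared:
        raise ValueError("no scenarios in common")
    matches = sum(evaluator[s] == human[s] for s in shared)
    return matches / len(shared)


def disagreements(evaluator: Dict[str, bool], human: Dict[str, bool]) -> List[str]:
    """Scenario IDs where the verdicts differ, sorted for stable output."""
    shared = evaluator.keys() & human.keys()
    return sorted(s for s in shared if evaluator[s] != human[s])


# Hypothetical held-out verdicts (True = pass), keyed by scenario ID.
evaluator_verdicts = {"S1.1": True, "S2.1": True, "S3.3": True, "S4.4": True}
human_verdicts = {"S1.1": True, "S2.1": True, "S3.3": True, "S4.4": False}

rate = agreement_rate(evaluator_verdicts, human_verdicts)
missed = disagreements(evaluator_verdicts, human_verdicts)
```

Scenarios that land in `missed` are the natural candidates for new rubrics or regression scenarios.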

Can SailFar evaluate voice, image, and video agents?

Yes. SailFar evaluates text, voice, image, and video agents end-to-end: trace, output, and outcome. For voice and video, it can test tone, interruption handling, visual fidelity, physical realism, and prompt adherence.

Is my data used to train models?

No. Your data stays yours. SailFar can work with approved samples, traces, and environment context to build evaluations, but your data is not used to train anyone's models.

A bespoke, evolving benchmark for every agent.

We're working with a small group of early design partners.
