Documentation.

Evaluate any model end-to-end — driven by agents, or with a single pip package.

Overview

Spherics Bench is the evaluation layer for AI models. You bring a model; we provide the benchmark suites, sample orchestration, scoring, live telemetry, and failure analysis — end-to-end, with nothing to wire up yourself.

There are two ways to evaluate, and an agent can drive either of them for you. Everything is reachable from the benchmark catalog and your experiments.

Agents

This is the heart of Spherics Bench. We natively integrate evaluation agents that run the whole loop for you — picking a benchmark, launching the run, streaming telemetry, and triaging failures. You can also connect your own agents, so they offload evaluation work and minimise your team's effort. Either way, you mostly just describe what you want validated.

Find the right benchmark: Describe an objective; an agent picks suitable benchmarks
Launch evaluations: Start a run with the right mode and sample budget
Stream & steer: Watch the live score and step in at any stage
Triage failures: Cluster failure modes and surface the worst samples
Summarise results: Generate a shareable report of what was tested
Offload your workload: Connect your own agents to do all of the above for you

API mode

In API mode we stream evaluation samples to your model and score the responses in real time. Your model stays wherever it runs — we just send it the inputs and read back the outputs, sample by sample. Ideal for hosted or API-served models.

You expose your model behind one HTTP endpoint. It receives a sample's input and returns the answer in the benchmark's output format (shown on every benchmark page). Any framework works — here's the whole contract:

python
# Minimal model endpoint (FastAPI)
from fastapi import FastAPI

app = FastAPI()


def my_model(sample: dict) -> str:
    # Placeholder: replace with your model's real inference call
    return "model answer"


@app.post("/infer")
def infer(sample: dict):
    # sample = the benchmark input (question, images, choices, ...)
    answer = my_model(sample)
    # Return JSON matching the benchmark's output_format
    return {"answer": answer, "reasoning": "optional"}

Deploy it anywhere reachable over HTTPS, then give Spherics Bench the URL when you launch (the Model endpoint field) — or just hand it to an agent. We POST each sample to your endpoint and score what comes back.
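
Before launching, it can help to smoke-test the endpoint yourself. The sketch below posts one sample to a URL and checks that the reply is a JSON object containing an `answer` key, mirroring the example above. The helper names `post_sample` and `check_reply` are hypothetical, and the required keys are an assumption: the real ones come from the benchmark's output format. This is an illustration using only the Python standard library, not Spherics Bench's own client.

```python
import json
import urllib.request


def post_sample(url: str, sample: dict, timeout: float = 30.0) -> dict:
    """POST one sample as JSON and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def check_reply(reply: object, required_keys=("answer",)) -> list:
    """Return a list of problems found in the reply (empty means it looks fine).

    required_keys is an assumption for illustration; use the keys listed in
    your benchmark's output format.
    """
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]
    return [f"missing key: {key}" for key in required_keys if key not in reply]
```

With the endpoint running locally, something like `check_reply(post_sample("http://localhost:8000/infer", {"question": "What is 2 + 2?"}))` should return an empty list if the reply is well-formed. The URL and sample fields here are placeholders.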

Container mode

In container mode we launch a dedicated, isolated job that runs the full benchmark against your model end-to-end — no machine of your own to keep online. Ideal for self-contained model images and longer evaluations.

Live telemetry

Every run streams live to the Experiments dashboard: the mean score as it converges, per-category breakdowns, processed samples, latency and efficiency, and a failure report.
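
To make "the mean score as it converges" concrete, here is a minimal sketch of how a running mean and a per-category breakdown can be updated one sample at a time. The class name `RunningScore` and the category labels are hypothetical; this shows the general incremental-mean technique, not how the dashboard is actually implemented.

```python
from collections import defaultdict


class RunningScore:
    """Incrementally tracks an overall mean and one mean per category."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        # category -> [count, mean]
        self._by_category = defaultdict(lambda: [0, 0.0])

    def update(self, score: float, category: str) -> None:
        # Incremental mean: new_mean = mean + (x - mean) / n
        self.count += 1
        self.mean += (score - self.mean) / self.count
        entry = self._by_category[category]
        entry[0] += 1
        entry[1] += (score - entry[1]) / entry[0]

    def breakdown(self) -> dict:
        """Return the current mean score for each category seen so far."""
        return {name: mean for name, (_, mean) in self._by_category.items()}
```

Each call to `update` costs constant time and memory, which is why this style of bookkeeping suits a live view: the mean can be redrawn after every scored sample without re-reading earlier ones.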

The dashboard adapts to each run. Its views are assembled on the fly from the model you are benchmarking (a transformer surfaces different internals than a diffusion model) and the benchmark that is running (audio, video, and image benchmarks expose fundamentally different outputs and metrics). There is nothing to configure and no code to write: the right visualizations appear for your run.

Quickstart

Prefer to do it yourself? It's one package and one call. Install it, then point it at a benchmark and your model:

bash
pip install benchcloud

python
import benchcloud

benchcloud.run(
    benchmark="<benchmark_id>",
    model="<your_model_id>",
)

Is that really all? Yes. And an agent can handle even this step for you: just describe what you want evaluated.

Error codes

The API uses standard HTTP status codes. Authenticated endpoints expect the bearer token you receive at sign-in.

200 OK: request succeeded
201 Created: experiment / resource created
204 No content: succeeded with nothing to return
400 Bad request: check your parameters
401 Unauthorized: missing or invalid bearer token
403 Forbidden: no access to this resource or team
404 Not found: unknown benchmark, experiment, or run
422 Unprocessable: request body failed validation
429 Too many requests: slow down and retry
5xx Server error: retry with backoff
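
The retry guidance for 429 and 5xx can be sketched as a small helper. This is a minimal illustration with assumed defaults (five attempts, a 1-second base delay doubling up to a 30-second cap, up to 10% jitter); none of these values come from Spherics Bench. The names `is_retryable`, `backoff_delay`, and `call_with_retry` are hypothetical, and `send` stands in for whatever function performs your actual request.

```python
import random
import time


def is_retryable(status: int) -> bool:
    """True for 429 and 5xx, the codes the list above says to retry."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay for a 0-based attempt number, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning (status, body).
    Returns the last (status, body) pair seen.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status) or attempt == max_attempts - 1:
            return status, body
        # Add up to 10% jitter so many clients do not retry in lockstep.
        delay = backoff_delay(attempt)
        sleep(delay + random.uniform(0, 0.1 * delay))
```

Client errors such as 400, 401, 403, 404, and 422 are returned immediately, since repeating the same request would not change the outcome.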

Ready to evaluate?

Pick a benchmark and let an agent take it from there.

Browse benchmarks