Development only

Visualization lab

Preview every curated dashboard visualization with mock or real benchmark data. Use the tile names below when prompting (e.g. “confusion-matrix medium, score-convergence small for the ABC benchmark”).

VIZ_LAB

Tile names for prompting

Sizes: width = small (1/3) · medium (2/3) · large (3/3); height = short · medium · tall.

kpi-metric · progress-metric · insight-card · score-convergence · line-chart · bar-chart · histogram · radar-chart · confusion-matrix · heatmap · failure-report · failure-explorer · latency-explorer · image-gallery · image-comparison · video-viewer · point-cloud-viewer · trajectory-viewer · throughput-metric · gauge · training-curves · scatter-plot · calibration-curve · code-block · text-report · stat-grid · image-overlay · embedding-3d · sample-panel · box-plot · single-image · score-cdf · sample-heatstrip · distribution-comparison · per-class-metrics · token-usage · generation-grid · multi-channel-timeseries · reward-curve · ranked-results · tradeoff-scatter · grouped-bar · multi-radar · data-table · pr-curve · roc-curve · token-logprob · agent-trace · filmstrip · action-distribution · trajectory-overlay · recall-at-k · waveform-spectrogram
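A prompt fragment such as "bar-chart small short" can be resolved into a tile spec along the following lines. This is a minimal sketch, not the lab's actual parser: the function name `parse_tile_prompt`, the returned dictionary shape, the "width word first, height word second" ordering, and the `medium` defaults are all assumptions made for illustration.

```python
from fractions import Fraction

# Width fractions as listed in the sizes line; heights are named only.
WIDTHS = {"small": Fraction(1, 3), "medium": Fraction(2, 3), "large": Fraction(1)}
HEIGHTS = {"short", "medium", "tall"}


def parse_tile_prompt(fragment, default_width="medium", default_height="medium"):
    """Parse '<tile-name> [width] [height]' into a dict (hypothetical format)."""
    words = fragment.split()
    if not words:
        raise ValueError("empty tile prompt")
    name, sizes = words[0], words[1:]
    # Assumption: the first size word is the width, the second is the height.
    width = sizes[0] if len(sizes) >= 1 else default_width
    height = sizes[1] if len(sizes) >= 2 else default_height
    if width not in WIDTHS:
        raise ValueError(f"unknown width: {width}")
    if height not in HEIGHTS:
        raise ValueError(f"unknown height: {height}")
    return {
        "tile": name,
        "width": width,
        "width_fraction": WIDTHS[width],
        "height": height,
    }


spec = parse_tile_prompt("bar-chart small short")
print(spec["tile"], spec["width_fraction"], spec["height"])
```

Because "medium" is both a width and a height, a positional rule like this one is needed to disambiguate a single size word.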

Benchmark templates

Production layout rules shared by RoboSpatial and LIBERO dashboards.

RoboSpatial

Compact flat vectors · Reasoning beside failures

LIBERO

Compact flat vectors · Reasoning beside failures

Row 2 · Component latency · bar-chart small short

metric

← analysis.aggregate.score

KPI metric

Reported metric for this run

82%

▲ 3.40

across 385 scored samples

metric

← analysis.aggregate.completed_count · analysis.aggregate.sample_count

Progress

100%

metric

← analysis.diagnosis

Insight card

Agent-surfaced findings

Accuracy drops 18% on multi-object scenes versus single-object.
Most failures cluster in the rotational-motion category.
Latency P95 is 2.3× the median — a few slow samples dominate.

chart

← results[].score · results[].correct

Score convergence

Running mean score over samples
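The running mean plotted by this tile can be computed as follows. This is a sketch under the assumption that `results[].score` has already been extracted into a plain list of numbers in sample order; `running_mean` is an illustrative name, not a function from the dashboard.

```python
from itertools import accumulate


def running_mean(scores):
    """Return the mean of scores[0..i] for every position i."""
    return [total / n for n, total in enumerate(accumulate(scores), start=1)]


# Each point is the average of all samples seen so far.
print(running_mean([1, 0, 1, 1]))
```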

chart

← results[].score

Line chart

Metric trend over samples

chart

← analysis.by_question_type

Bar chart

Accuracy by category

Spatial 91% · Counting 78% · Rotation 62% · Occlusion 55% · Temporal 71%

distribution

← results[].score

Histogram

Per-sample score distribution

chart

← analysis.by_modality · analysis.by_question_type

Radar chart

Accuracy across modalities

matrix

← analysis.answer_confusion

Confusion matrix

Predicted vs expected

rows = predicted · cols = expected

matrix

← analysis.answer_confusion

Heatmap

Grid intensity

Spatial

Counting

Rotation

Temporal

Easy

Medium

Hard

diagnostic

← failure-reports/summary

Failure report

Structured failure summary and recommendations

Failure summary

Failure summary for this benchmark run

critical

Recommendations · 2

Prioritize the largest low-score cluster.

Investigate

Inspect repeated grounding errors.

diagnostic

← analysis.by_question_type · analysis.failure_diagnosis

Failure explorer

Failure categories by frequency

Rotation

Occlusion

Counting

Temporal

Other

diagnostic

← results[].latency_ms

Latency explorer

Per-sample latency distribution

Mean

263.4ms

P50

245.0ms

P95

400.0ms

Max

441.0ms
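The four summary figures above can be derived from `results[].latency_ms` roughly as shown below. The percentile method used by the dashboard is not stated, so this sketch assumes the nearest-rank definition; `percentile` and `latency_summary` are hypothetical helper names.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile (assumed method): smallest value with >= p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Mean, P50, P95, and max of a non-empty list of latencies in milliseconds."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "max": max(latencies_ms),
    }


print(latency_summary([100, 200, 300, 400]))
```

Other percentile definitions (for example linear interpolation) give slightly different P50 and P95 values on small samples.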

media

← results[].sample assets · foxglove: image topics

Image gallery

Sample visual inputs

media

← results[].sample assets

Image comparison

Input vs prediction

media

← results[].sample assets (video)

Video viewer

Sample video stream

sample_204 · driving sequence

spatial

← foxglove: point-cloud topics

Point cloud viewer

Projected 3D points

drag to rotate

spatial

← foxglove: path/pose topics

Trajectory viewer

Ground truth vs predicted path

Ground truth · Predicted

metric

← analysis.performance.tokens_per_second

Throughput

142 tok/s

metric

← analysis.aggregate.score

Gauge

Single rate, 0–100%

chart

← benchmark-specific series (train/val/…)

Training curves

Multiple series with legend

chart

← benchmark-specific (e.g. embeddings)

Scatter plot

2D points, grouped

chart

← results[].confidence + correctness

Calibration curve

Confidence vs accuracy
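A confidence-versus-accuracy curve is typically built by binning samples on confidence and comparing each bin's mean confidence with its observed accuracy. The sketch below assumes equal-width bins over [0, 1] and parallel lists of confidences and correctness flags; the bin count and the function name `calibration_curve` are illustrative choices, not taken from the dashboard.

```python
def calibration_curve(confidences, correct, n_bins=5):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    points = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            points.append((mean_conf, accuracy, len(members)))
    return points


for point in calibration_curve([0.1, 0.15, 0.9, 0.95], [False, False, True, False], n_bins=2):
    print(point)
```

A well-calibrated model produces points close to the diagonal, where mean confidence equals accuracy.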

text

← results[].model_output (JSON)

Code / JSON

VS Code-coloured code or JSON

{
  "answer": "B",
  "confidence": 0.82,
  "reasoning": "Object rotates clockwise across frames.",
  "tokens": {
    "input": 1432,
    "output": 96
  }
}

text

← analysis.diagnosis

Text report

Written failure narrative

37% of failures involve rotational motion that reverses mid-sequence.
The model confuses left/right turns under heavy occlusion.
Counting errors grow sharply beyond 6 objects in frame.
Recommend re-running the rotation slice with a higher frame budget.

diagnostic

← analysis.failure_diagnosis

Stat grid

Key failure statistics

Failures

117

Zero-score

Low-score

Recovered

Threshold

0.50

Worst slice

Rotation

media

← image + foxglove annotations

Image overlay

Masks / boxes / keypoints on an image

sample_204 · detections

spatial

← benchmark-specific embeddings

Embedding (3D)

Rotatable 3D scatter

drag to rotate

media

← configured per channel (image + text)

Sample panel

Image (+ point overlays) over text blocks

Question

Point to the free space where the mug can be placed.

distribution

← results[].<numeric field>

Box plot

Min · 25th · mean · 75th · max
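The five values named above can be computed with the standard library as sketched here. The quartile method is an assumption (the "inclusive" method of `statistics.quantiles`), and `box_stats` is a hypothetical name; note that the tile description lists the mean rather than the median as the central value, and the sketch follows that.

```python
import statistics


def box_stats(values):
    """Min, 25th percentile, mean, 75th percentile, and max of a numeric field."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q25": q1,
        "mean": statistics.fmean(values),
        "q75": q3,
        "max": max(values),
    }


print(box_stats([1, 2, 3, 4, 5]))
```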

media

← configured image (e.g. saliency / uncertainty map)

Single image

One image (with optional overlays)

Saliency map

chart

← results[].score

Score CDF

Cumulative fraction scoring ≤ x
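The cumulative fraction scoring ≤ x is the empirical CDF of the scores. A minimal sketch, assuming `results[].score` is available as a non-empty list of numbers; `score_cdf` is an illustrative name.

```python
from bisect import bisect_right


def score_cdf(scores, thresholds):
    """For each threshold x, return the fraction of scores that are <= x."""
    ordered = sorted(scores)
    return [bisect_right(ordered, x) / len(ordered) for x in thresholds]


print(score_cdf([0.2, 0.5, 0.5, 0.9], [0.0, 0.5, 1.0]))
```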

distribution

← results[].score

Sample heat-strip

Every sample, colored by score

180 samples · mean 73%

distribution

← results[].score + correctness

Distribution comparison

Overlaid histograms (e.g. correct vs incorrect)

matrix

← analysis.by_question_type

Per-class metrics

Class × metric grid

Precisi…

Recall

Spatial

0.9

Counting

0.8

0.7

0.8

Rotation

0.7

0.6

Occlusi…

0.6

Temporal

0.8

0.7

distribution

← results[].usage.output_tokens

Token usage

Output-token distribution

media

← results[].generated assets + quality score

Generation grid

Generated images with per-item scores

chart

← foxglove: numeric topics (sensors / joints / reward)

Multi-channel time-series

Several signals on a shared time axis

chart

← results[].reward / return per step

Reward curve

Return per episode with ± band

diagnostic

← results[].retrieved / ranked items

Ranked results

Query → top-k with relevance

Query · robot arm grasping a transparent cup

1. doc_1000 · manipulation · 0.94
2. doc_1001 · grasping · 0.85
3. doc_1002 · perception · 0.77
4. doc_1003 · control · 0.67
5. doc_1004 · manipulation · 0.56
6. doc_1005 · grasping · 0.49
7. doc_1006 · perception · 0.38
8. doc_1007 · control · 0.31
9. doc_1008 · manipulation · 0.22

chart

← leaderboard entries (accuracy vs latency / cost)

Trade-off scatter

Two metrics per run · Pareto frontier
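A Pareto frontier over two metrics keeps only the runs that no other run beats on both. The sketch below assumes the two metrics are accuracy (higher is better) and latency (lower is better), with each run given as a `(name, accuracy, latency_ms)` tuple; that tuple layout and the name `pareto_frontier` are assumptions for illustration.

```python
def pareto_frontier(runs):
    """Return runs not dominated by any faster-or-equal run with higher accuracy."""
    # Sort by latency ascending; on ties, put the more accurate run first.
    ordered = sorted(runs, key=lambda run: (run[2], -run[1]))
    frontier = []
    best_accuracy = float("-inf")
    for run in ordered:
        # A run joins the frontier only if it beats every faster run on accuracy.
        if run[1] > best_accuracy:
            frontier.append(run)
            best_accuracy = run[1]
    return frontier


runs = [("a", 0.80, 200), ("b", 0.85, 300), ("c", 0.70, 250), ("d", 0.90, 500)]
print([name for name, _, _ in pareto_frontier(runs)])
```

Run "c" is dropped because run "a" is both faster and more accurate.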

chart

← analysis.by_question_type across runs

Grouped bar

Compare series per category

chart

← analysis.by_modality across runs

Multi-series radar

Compare models across axes

text

← results[] (any tabular fields)

Data table

Paginated raw sample rows

Sample	Category	Predicted	Expected	Score	Latency
s_1000	context	No	Yes	0	553ms
s_1001	compat	Yes	Yes	1	468ms
s_1002	config	No	Yes	0	562ms
s_1003	context	Yes	Yes	1	182ms
s_1004	compat	Yes	Yes	1	479ms
s_1005	config	Yes	Yes	1	344ms
s_1006	context	No	Yes	0	733ms
s_1007	compat	Yes	Yes	1	494ms
s_1008	config	Yes	Yes	1	706ms
s_1009	context	No	Yes	0	770ms
s_1010	compat	Yes	Yes	1	488ms
s_1011	config	Yes	Yes	1	560ms
s_1012	context	No	Yes	0	643ms
s_1013	compat	Yes	Yes	1	553ms
s_1014	config	Yes	Yes	1	610ms
s_1015	context	Yes	Yes	1	780ms
s_1016	compat	Yes	Yes	1	681ms
s_1017	config	Yes	Yes	1	581ms
s_1018	context	Yes	Yes	1	683ms
s_1019	compat	Yes	Yes	1	574ms
s_1020	config	Yes	Yes	1	610ms
s_1021	context	No	Yes	0	265ms
s_1022	compat	No	Yes	0	224ms
s_1023	config	Yes	Yes	1	691ms
s_1024	context	Yes	Yes	1	773ms

60 rows · 1 / 3

chart

← results[].confidence + correctness

PR curve

Precision vs recall
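Precision-recall points can be obtained by ranking samples on confidence and sweeping a cutoff down the ranking. This sketch assumes "correct" is treated as the positive class and that ties in confidence are broken by input order; `pr_curve` is an illustrative name.

```python
def pr_curve(confidences, correct):
    """Return (recall, precision) after each sample, ranked by descending confidence."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, ok in ranked if ok)
    if total_positive == 0:
        raise ValueError("need at least one correct sample")
    points = []
    true_positive = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        if ok:
            true_positive += 1
        points.append((true_positive / total_positive, true_positive / k))
    return points


print(pr_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
```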

chart

← results[].confidence + correctness

ROC curve

TPR vs FPR
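The TPR-versus-FPR curve follows from the same confidence ranking. The sketch below again assumes "correct" is the positive class and that each sample is its own threshold step (ties are not merged); `roc_curve` and `auc` are hypothetical helper names.

```python
def roc_curve(confidences, correct):
    """Return (FPR, TPR) points, starting at (0, 0), for descending confidence cutoffs."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, ok in ranked if ok)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both correct and incorrect samples")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, ok in ranked:
        if ok:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


curve = roc_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(curve, auc(curve))
```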

text

← results[].logprobs / token probabilities

Token logprobs

Output tokens colored by confidence

The gripper should approach the transparent cup from the left side to avoid occluding the handle.

low to high confidence

diagnostic

← results[].trace / steps / tool calls

Agent trace

Ordered tool-call / reasoning steps

plan · Decompose task
Identify cup, plan collision-free grasp approach.
tool · detect_objects(scene)
Returned 3 objects: cup (0.94), plate (0.88), table.
tool · estimate_pose(cup)
6-DoF pose estimated; confidence 0.71.
tool · plan_grasp(pose)
Top-down grasp rejected (occlusion); retry side grasp.
tool · plan_grasp(pose, side)
Feasible grasp found; width 6.2 cm.
act · execute(trajectory)
Grasp executed; object lifted 12 cm.

media

← frame sequence (diffusion steps / video / rollout)

Filmstrip

Ordered image sequence

8 frames · scroll to scrub · click to enlarge

distribution

← results[].action vectors

Action distribution

Per-dimension value spread

-0.20…0.20

0.03…0.48

0.02…0.52

roll

-0.23…0.31

pitch

-0.52…0.06

yaw

-0.61…0.04

grip

-0.43…0.26

spatial

← foxglove: predicted + ground-truth paths

Trajectory overlay

Predicted vs ground-truth path + error

ground truth · predicted

mean error 0.12

chart

← results[].retrieval hits by rank

Recall@k

Retrieval metric vs k
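One common way to compute a retrieval metric against k is shown below. The exact definition used by the tile is not given, so this sketch assumes each query is reduced to the rank of its first relevant hit (or `None` when nothing relevant was retrieved) and reports the fraction of queries with a hit in the top k; `recall_at_k` is an illustrative name.

```python
def recall_at_k(first_hit_ranks, ks):
    """Fraction of queries whose first relevant hit has rank <= k, for each k."""
    total = len(first_hit_ranks)
    return {
        k: sum(1 for rank in first_hit_ranks if rank is not None and rank <= k) / total
        for k in ks
    }


# Four queries: hits at ranks 1, 3, and 2, plus one query with no relevant hit.
print(recall_at_k([1, 3, None, 2], ks=(1, 3, 5)))
```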

media

← audio asset + transcript

Waveform & spectrogram

Audio waveform + spectrogram + transcript

“turn left at the second intersection and stop near the blue container”

Sample explorer

Configurable full-width card: a narrow score and sample selector (live samples listed first), followed by a modality-agnostic input channel and output channel that you choose per benchmark.

Sample explorer

Score and per-sample channels

Sample #100

Input · image + question

Question

Point to the free space where the mug can be placed.

Output · answer + reasoning

Question

Point to the free space where the mug can be placed.

Foxglove topics → native viz

Multimodal payloads arrive as Foxglove-style topics, and the curated visualizations populate from them automatically (point clouds, paths, camera images).

/lidar/points · foxglove.PointCloud
/camera/front/image · foxglove.CompressedImage
/planning/path · nav_msgs/Path

Point cloud

/lidar/points · foxglove.PointCloud

drag to rotate

Trajectory

/planning/path · nav_msgs/Path

Camera

/camera/front/image · CompressedImage
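The topic-to-visualization routing described in this section could look like the following. This is only a sketch: the lookup table, the choice of tile name for each schema, and the function `assign_tiles` are assumptions based on the three example topics listed here, not the lab's actual dispatch logic.

```python
# Assumed mapping from message schema to a tile name from the list above.
SCHEMA_TO_TILE = {
    "foxglove.PointCloud": "point-cloud-viewer",
    "foxglove.CompressedImage": "image-gallery",
    "nav_msgs/Path": "trajectory-viewer",
}


def assign_tiles(topics):
    """Map each topic to a tile, skipping topics whose schema has no known tile."""
    return {
        topic: SCHEMA_TO_TILE[schema]
        for topic, schema in topics.items()
        if schema in SCHEMA_TO_TILE
    }


topics = {
    "/lidar/points": "foxglove.PointCloud",
    "/camera/front/image": "foxglove.CompressedImage",
    "/planning/path": "nav_msgs/Path",
}
print(assign_tiles(topics))
```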

Agent Artefacts section

Agent Artefacts