Experiments
Development only

Visualization lab

Preview every curated dashboard visualization with mock or real benchmark data. Use the tile names below when prompting (e.g. “confusion-matrix medium, score-convergence small for the ABC benchmark”).

VIZ_LAB

Tile names for prompting

Sizes: width = small (1/3) · medium (2/3) · large (3/3); height = short · medium · tall.

kpi-metric · progress-metric · insight-card · score-convergence · line-chart · bar-chart · histogram · radar-chart · confusion-matrix · heatmap · failure-report · failure-explorer · latency-explorer · image-gallery · image-comparison · video-viewer · point-cloud-viewer · trajectory-viewer · throughput-metric · gauge · training-curves · scatter-plot · calibration-curve · code-block · text-report · stat-grid · image-overlay · embedding-3d · sample-panel · box-plot · single-image · score-cdf · sample-heatstrip · distribution-comparison · per-class-metrics · token-usage · generation-grid · multi-channel-timeseries · reward-curve · ranked-results · tradeoff-scatter · grouped-bar · multi-radar · data-table · pr-curve · roc-curve · token-logprob · agent-trace · filmstrip · action-distribution · trajectory-overlay · recall-at-k · waveform-spectrogram
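
The prompt format above (a tile name followed by optional size words) could be parsed roughly as follows. This is a minimal sketch, not the lab's actual parser: the names `parse_tile_spec` and `parse_prompt`, the default sizes, and the rule for the ambiguous word "medium" are all assumptions made for illustration.

```python
from fractions import Fraction

# Width words map to the fraction of a row they occupy (from the size legend).
WIDTHS = {"small": Fraction(1, 3), "medium": Fraction(2, 3), "large": Fraction(1, 1)}
HEIGHTS = {"short", "medium", "tall"}


def parse_tile_spec(spec, default_width="medium", default_height="medium"):
    """Parse e.g. 'bar-chart small short' into (name, width, height).

    Assumption: "medium" is ambiguous, so it is read as a width when no
    width has been seen yet and as a height otherwise.
    """
    name, *words = spec.split()
    width = height = None
    for word in words:
        if word in WIDTHS and width is None:
            width = word
        elif word in HEIGHTS and height is None:
            height = word
        else:
            raise ValueError(f"unexpected size word {word!r} in {spec!r}")
    return name, width or default_width, height or default_height


def parse_prompt(prompt):
    """Split a comma-separated prompt fragment into tile specs."""
    return [parse_tile_spec(part.strip()) for part in prompt.split(",") if part.strip()]
```

Trailing free text such as "for the ABC benchmark" would have to be stripped before calling `parse_prompt`, since this sketch rejects any word that is not a size.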

Benchmark templates

Production layout rules shared by RoboSpatial and LIBERO dashboards.

RoboSpatial
Compact flat vectors
Reasoning beside failures
LIBERO
Compact flat vectors
Reasoning beside failures
Row 2 · Component latency · bar-chart small short
metric
W
H
analysis.aggregate.score

KPI metric

Reported metric for this run

82%
3.40

across 385 scored samples

metric
W
H
analysis.aggregate.completed_count · analysis.aggregate.sample_count

Progress

100%
metric
W
H
analysis.diagnosis

Insight card

Agent-surfaced findings

  • Accuracy drops 18% on multi-object scenes versus single-object.
  • Most failures cluster in the rotational-motion category.
  • Latency P95 is 2.3× the median — a few slow samples dominate.
chart
W
H
results[].score · results[].correct

Score convergence

Running mean score over samples
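
A running mean over samples can be computed in a single pass. The sketch below assumes each entry of `results` carries a numeric `score`; the function name `running_mean` is hypothetical and not taken from the dashboard code.

```python
from itertools import accumulate


def running_mean(scores):
    """Return the mean of scores[0..i] for every index i."""
    return [total / (i + 1) for i, total in enumerate(accumulate(scores))]


# Assumed shape of results[]: a list of dicts with a numeric "score" field.
results = [{"score": 1}, {"score": 0}, {"score": 1}, {"score": 1}]
curve = running_mean([r["score"] for r in results])
```

Plotting `curve` against the sample index gives the convergence line; early values swing widely and settle as more samples are included.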

chart
W
H
results[].score

Line chart

Metric trend over samples

chart
W
H
analysis.by_question_type

Bar chart

Accuracy by category

Spatial    91%
Counting   78%
Rotation   62%
Occlusion  55%
Temporal   71%
distribution
W
H
results[].score

Histogram

Per-sample score distribution

chart
W
H
analysis.by_modality · analysis.by_question_type

Radar chart

Accuracy across modalities

matrix
W
H
analysis.answer_confusion

Confusion matrix

Predicted vs expected

rows = predicted · cols = expected

     A   B   C   D
A   72   6   3   2
B    5  64   8   4
C    4   7  58   9
D    2   5   6  66
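
With rows as predictions and columns as expected answers, accuracy and per-class precision and recall follow from the diagonal, the row sums and the column sums. The sketch below uses the mock values shown in this tile; the function names are hypothetical.

```python
LABELS = ["A", "B", "C", "D"]

# MATRIX[i][j] = samples predicted as LABELS[i] whose expected answer is LABELS[j]
# (mock values copied from the tile above).
MATRIX = [
    [72, 6, 3, 2],
    [5, 64, 8, 4],
    [4, 7, 58, 9],
    [2, 5, 6, 66],
]


def accuracy(matrix):
    """Fraction of samples on the diagonal (predicted == expected)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, i):
    """Precision divides by the row sum (predicted), recall by the column sum (expected)."""
    predicted_total = sum(matrix[i])
    expected_total = sum(row[i] for row in matrix)
    return matrix[i][i] / predicted_total, matrix[i][i] / expected_total
```

For the mock matrix, 260 of 321 samples lie on the diagonal, so `accuracy(MATRIX)` is about 0.81.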
matrix
W
H
analysis.answer_confusion

Heatmap

Grid intensity

        Spatial  Counting  Rotation  Temporal
Easy    24       36        57        52
Medium  71       82        26        71
Hard    21       90        23        49
diagnostic
W
H
failure-reports/summary

Failure report

Structured failure summary and recommendations

Failure summary

Failure summary for this benchmark run

critical
Recommendations · 2

Prioritize the largest low-score cluster.

Investigate

Inspect repeated grounding errors.

diagnostic
W
H
analysis.by_question_type · analysis.failure_diagnosis

Failure explorer

Failure categories by frequency

Rotation   38
Occlusion  31
Counting   22
Temporal   17
Other      9
diagnostic
W
H
results[].latency_ms

Latency explorer

Per-sample latency distribution

Mean  263.4 ms
P50   245.0 ms
P95   400.0 ms
Max   441.0 ms
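
The four summary figures can be derived from `results[].latency_ms` with the standard library alone. This is a sketch assuming a nearest-rank percentile; the dashboard may use a different interpolation rule, and the function names are hypothetical.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Mean, median (P50), tail (P95) and maximum latency in milliseconds."""
    return {
        "mean": mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "max": max(latencies_ms),
    }
```

A large gap between P95 and P50 is the signal that a few slow samples dominate the tail.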
media
W
H
results[].sample assets · foxglove: image topics

Image gallery

Sample visual inputs

sample_100 · sample_101 · sample_102 · sample_103 · sample_104 · sample_105 · sample_106 · sample_107 · sample_108
media
W
H
results[].sample assets

Image comparison

Input vs prediction

Model input · Prediction overlay
media
W
H
results[].sample assets (video)

Video viewer

Sample video stream

sample_204 · driving sequence

spatial
W
H
foxglove: point-cloud topics

Point cloud viewer

Projected 3D points

drag to rotate
spatial
W
H
foxglove: path/pose topics

Trajectory viewer

Ground truth vs predicted path

Ground truth · Predicted
metric
W
H
analysis.performance.tokens_per_second

Throughput

142 tok/s
metric
W
H
analysis.aggregate.score

Gauge

Single rate, 0–100%

chart
W
H
benchmark-specific series (train/val/…)

Training curves

Multiple series with legend

chart
W
H
benchmark-specific (e.g. embeddings)

Scatter plot

2D points, grouped

chart
W
H
results[].confidence + correctness

Calibration curve

Confidence vs accuracy
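
A calibration curve is typically built by binning samples by confidence and comparing each bin's mean confidence with its accuracy. The sketch below assumes equal-width bins over [0, 1] and a boolean correctness flag per sample; the function name and bin count are hypothetical choices.

```python
def calibration_curve(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            acc = sum(1 for _, ok in members if ok) / len(members)
            curve.append((mean_conf, acc, len(members)))
    return curve
```

Points below the diagonal (accuracy lower than confidence) indicate overconfidence in that bin.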

text
W
H
results[].model_output (JSON)

Code / JSON

VS Code-coloured code or JSON

{
  "answer": "B",
  "confidence": 0.82,
  "reasoning": "Object rotates clockwise across frames.",
  "tokens": {
    "input": 1432,
    "output": 96
  }
}
text
W
H
analysis.diagnosis

Text report

Written failure narrative

  • 37% of failures involve rotational motion that reverses mid-sequence.
  • The model confuses left/right turns under heavy occlusion.
  • Counting errors grow sharply beyond 6 objects in frame.
  • Recommend re-running the rotation slice with a higher frame budget.
diagnostic
W
H
analysis.failure_diagnosis

Stat grid

Key failure statistics

Failures     117
Zero-score   41
Low-score    76
Recovered    9
Threshold    0.50
Worst slice  Rotation
media
W
H
image + foxglove annotations

Image overlay

Masks / boxes / keypoints on an image

sample_204 · detections


spatial
W
H
benchmark-specific embeddings

Embedding (3D)

Rotatable 3D scatter

drag to rotate
media
W
H
configured per channel (image + text)

Sample panel

Image (+ point overlays) over text blocks

Question

Point to the free space where the mug can be placed.

distribution
W
H
results[].<numeric field>

Box plot

Min · 25th · mean · 75th · max

media
W
H
configured image (e.g. saliency / uncertainty map)

Single image

One image (with optional overlays)

Saliency map

chart
W
H
results[].score

Score CDF

Cumulative fraction scoring ≤ x
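
The cumulative fraction at each threshold x is the share of samples whose score is at most x. A minimal sketch, assuming a non-empty list of numeric scores; `score_cdf` is a hypothetical name.

```python
from bisect import bisect_right


def score_cdf(scores, thresholds):
    """Fraction of scores that are <= each threshold (scores must be non-empty)."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts the elements <= x in a sorted list.
    return [bisect_right(ordered, x) / n for x in thresholds]
```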

distribution
W
H
results[].score

Sample heat-strip

Every sample, colored by score

180 samples · mean 73%
distribution
W
H
results[].score + correctness

Distribution comparison

Overlaid histograms (e.g. correct vs incorrect)

matrix
W
H
analysis.by_question_type

Per-class metrics

Class × metric grid

          Precisi…  Recall  F1
Spatial   0.9       0.9     0.9
Counting  0.8       0.7     0.8
Rotation  0.7       0.6     0.6
Occlusi…  0.6       0.6     0.6
Temporal  0.8       0.7     0.7
distribution
W
H
results[].usage.output_tokens

Token usage

Output-token distribution

media
W
H
results[].generated assets + quality score

Generation grid

Generated images with per-item scores

CLIP 0.70 · CLIP 0.87 · CLIP 0.76 · CLIP 0.71 · CLIP 0.87 · CLIP 0.87
chart
W
H
foxglove: numeric topics (sensors / joints / reward)

Multi-channel time-series

Several signals on a shared time axis

chart
W
H
results[].reward / return per step

Reward curve

Return per episode with ± band

diagnostic
W
H
results[].retrieved / ranked items

Ranked results

Query → top-k with relevance

Query · robot arm grasping a transparent cup
  1. doc_1000 · manipulation · 0.94
  2. doc_1001 · grasping · 0.85
  3. doc_1002 · perception · 0.77
  4. doc_1003 · control · 0.67
  5. doc_1004 · manipulation · 0.56
  6. doc_1005 · grasping · 0.49
  7. doc_1006 · perception · 0.38
  8. doc_1007 · control · 0.31
  9. doc_1008 · manipulation · 0.22
chart
W
H
leaderboard entries (accuracy vs latency / cost)

Trade-off scatter

Two metrics per run · Pareto frontier
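
A Pareto frontier keeps only the runs that no other run beats on both metrics at once. The sketch below assumes the two metrics are accuracy (higher is better) and latency (lower is better), as in the data-source hint; the tuple layout and function name are hypothetical.

```python
def pareto_frontier(runs):
    """Keep runs not dominated by any other run.

    Each run is (name, accuracy, latency_ms); higher accuracy and lower
    latency are assumed to be better.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; a slower run only survives if it is more accurate.
    for name, acc, latency in sorted(runs, key=lambda r: (r[2], -r[1])):
        if acc > best_accuracy:
            frontier.append((name, acc, latency))
            best_accuracy = acc
    return frontier
```

Sorting first keeps the scan linear after the sort, which is enough for leaderboard-sized inputs.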

chart
W
H
analysis.by_question_type across runs

Grouped bar

Compare series per category

chart
W
H
analysis.by_modality across runs

Multi-series radar

Compare models across axes

text
W
H
results[] (any tabular fields)

Data table

Paginated raw sample rows

Sample  Category  Predicted  Expected  Score  Latency
s_1000  context   No         Yes       0      553ms
s_1001  compat    Yes        Yes       1      468ms
s_1002  config    No         Yes       0      562ms
s_1003  context   Yes        Yes       1      182ms
s_1004  compat    Yes        Yes       1      479ms
s_1005  config    Yes        Yes       1      344ms
s_1006  context   No         Yes       0      733ms
s_1007  compat    Yes        Yes       1      494ms
s_1008  config    Yes        Yes       1      706ms
s_1009  context   No         Yes       0      770ms
s_1010  compat    Yes        Yes       1      488ms
s_1011  config    Yes        Yes       1      560ms
s_1012  context   No         Yes       0      643ms
s_1013  compat    Yes        Yes       1      553ms
s_1014  config    Yes        Yes       1      610ms
s_1015  context   Yes        Yes       1      780ms
s_1016  compat    Yes        Yes       1      681ms
s_1017  config    Yes        Yes       1      581ms
s_1018  context   Yes        Yes       1      683ms
s_1019  compat    Yes        Yes       1      574ms
s_1020  config    Yes        Yes       1      610ms
s_1021  context   No         Yes       0      265ms
s_1022  compat    No         Yes       0      224ms
s_1023  config    Yes        Yes       1      691ms
s_1024  context   Yes        Yes       1      773ms
60 rows · page 1 / 3
chart
W
H
results[].confidence + correctness

PR curve

Precision vs recall

chart
W
H
results[].confidence + correctness

ROC curve

TPR vs FPR

text
W
H
results[].logprobs / token probabilities

Token logprobs

Output tokens colored by confidence

The gripper should approach the transparent cup from the left side to avoid occluding the handle 
low to high confidence
diagnostic
W
H
results[].trace / steps / tool calls

Agent trace

Ordered tool-call / reasoning steps

  1. plan · Decompose task

    Identify cup, plan collision-free grasp approach.

  2. tool · detect_objects(scene)

    Returned 3 objects: cup (0.94), plate (0.88), table.

  3. tool · estimate_pose(cup)

    6-DoF pose estimated; confidence 0.71.

  4. tool · plan_grasp(pose)

    Top-down grasp rejected (occlusion); retry side grasp.

  5. tool · plan_grasp(pose, side)

    Feasible grasp found; width 6.2 cm.

  6. act · execute(trajectory)

    Grasp executed; object lifted 12 cm.

media
W
H
frame sequence (diffusion steps / video / rollout)

Filmstrip

Ordered image sequence

step 1 · step 2 · step 3 · step 4 · step 5 · step 6 · step 7 · step 8
8 frames · scroll to scrub · click to enlarge
distribution
W
H
results[].action vectors

Action distribution

Per-dimension value spread

x      -0.20  0.20
y       0.03  0.48
z       0.02  0.52
roll   -0.23  0.31
pitch  -0.52  0.06
yaw    -0.61  0.04
grip   -0.43  0.26
spatial
W
H
foxglove: predicted + ground-truth paths

Trajectory overlay

Predicted vs ground-truth path + error

ground truth · predicted
mean error 0.12
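
One plausible way to obtain a mean error like the one shown is to average the Euclidean distance between corresponding path points. This is a sketch under that assumption (index-aligned paths of equal length); the tile's actual error metric is not stated here, and `mean_path_error` is a hypothetical name.

```python
import math


def mean_path_error(ground_truth, predicted):
    """Mean Euclidean distance between index-aligned points of two paths."""
    if len(ground_truth) != len(predicted):
        raise ValueError("paths must have the same number of points")
    if not ground_truth:
        raise ValueError("paths must not be empty")
    distances = (math.dist(g, p) for g, p in zip(ground_truth, predicted))
    return sum(distances) / len(ground_truth)
```

Paths sampled at different rates would need resampling or time alignment before this comparison is meaningful.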
chart
W
H
results[].retrieval hits by rank

Recall@k

Retrieval metric vs k
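
Recall@k can be tabulated for several values of k from the rank at which each query's relevant item was retrieved. The sketch below assumes one relevant item per query, recorded as a 1-based rank or `None` when it was not retrieved; this input shape and the function name are assumptions.

```python
def recall_at_k(hit_ranks, ks):
    """Fraction of queries whose relevant item appears within the top k, for each k.

    hit_ranks: 1-based rank of the relevant item per query, or None if missing.
    """
    n = len(hit_ranks)
    return {
        k: sum(1 for rank in hit_ranks if rank is not None and rank <= k) / n
        for k in ks
    }
```

The resulting values are non-decreasing in k, which is what the curve in this tile plots.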

media
W
H
audio asset + transcript

Waveform & spectrogram

Audio waveform + spectrogram + transcript

spectrogram

turn left at the second intersection and stop near the blue container

Sample explorer

Configurable full-width card: narrow score + sample selector (live samples on top), then a modality-agnostic input channel and output channel you pick per benchmark.

Sample explorer

Score and per-sample channels

0%
Sample #100
Input · image + question
Question

Point to the free space where the mug can be placed.

Output · answer + reasoning
Question

Point to the free space where the mug can be placed.

Foxglove topics → native viz

Multimodal payloads arrive as Foxglove-style topics; the curated visualizations populate from them automatically (point clouds, paths, camera images).

/lidar/points · foxglove.PointCloud
/camera/front/image · foxglove.CompressedImage
/planning/path · nav_msgs/Path
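
The automatic fill could be driven by a lookup from message schema to tile name. The mapping below is a hypothetical sketch built only from the pairings listed in this section (point-cloud topics, image topics, path topics); it is not the lab's actual routing table.

```python
# Hypothetical schema-to-tile mapping, inferred from the pairings in this section.
SCHEMA_TO_TILE = {
    "foxglove.PointCloud": "point-cloud-viewer",
    "foxglove.CompressedImage": "image-gallery",
    "nav_msgs/Path": "trajectory-viewer",
}


def tiles_for_topics(topics):
    """Map topic name -> tile name; topics with an unknown schema are skipped."""
    return {
        topic: SCHEMA_TO_TILE[schema]
        for topic, schema in topics.items()
        if schema in SCHEMA_TO_TILE
    }


topics = {
    "/lidar/points": "foxglove.PointCloud",
    "/camera/front/image": "foxglove.CompressedImage",
    "/planning/path": "nav_msgs/Path",
}
layout = tiles_for_topics(topics)
```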

Point cloud

/lidar/points · foxglove.PointCloud

drag to rotate

Trajectory

/planning/path · nav_msgs/Path

Camera

/camera/front/image · CompressedImage

/camera/front/image

Agent Artefacts section