Visualization lab
Preview every curated dashboard visualization with mock or real benchmark data. Use the tile names below when prompting (e.g. “confusion-matrix medium, score-convergence small for the ABC benchmark”).
Tile names for prompting
Sizes: width = small (1/3) · medium (2/3) · large (3/3); height = short · medium · tall.
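As a rough illustration of the naming scheme above, a prompt fragment such as "confusion-matrix medium" could be split into a tile name and a size. This is a minimal sketch only: `parse_tiles`, `WIDTH_COLUMNS`, and the 3-column grid are assumptions read off the 1/3, 2/3, 3/3 fractions, not the dashboard's actual parser.

```python
# Hypothetical helper, not the dashboard's real code.
# Assumes a 3-column grid, inferred from the 1/3, 2/3, 3/3 width fractions.
WIDTH_COLUMNS = {"small": 1, "medium": 2, "large": 3}
HEIGHTS = {"short", "medium", "tall"}


def parse_tiles(prompt):
    """Parse 'tile width [height], tile width [height]' into a list of dicts."""
    tiles = []
    for part in prompt.split(","):
        words = part.split()
        if len(words) not in (2, 3):
            raise ValueError(f"expected 'tile width [height]', got {part!r}")
        name, width = words[0], words[1]
        height = words[2] if len(words) == 3 else None
        if width not in WIDTH_COLUMNS:
            raise ValueError(f"unknown width {width!r}")
        if height is not None and height not in HEIGHTS:
            raise ValueError(f"unknown height {height!r}")
        tiles.append({"tile": name, "columns": WIDTH_COLUMNS[width], "height": height})
    return tiles


layout = parse_tiles("confusion-matrix medium, score-convergence small")
```

The height is optional here because the example prompt only names a width; whether the real prompt handler accepts a height in the same position is not stated.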
kpi-metric · progress-metric · insight-card · score-convergence · line-chart · bar-chart · histogram · radar-chart · confusion-matrix · heatmap · failure-report · failure-explorer · latency-explorer · image-gallery · image-comparison · video-viewer · point-cloud-viewer · trajectory-viewer · throughput-metric · gauge · training-curves · scatter-plot · calibration-curve · code-block · text-report · stat-grid · image-overlay · embedding-3d · sample-panel · box-plot · single-image · score-cdf · sample-heatstrip · distribution-comparison · per-class-metrics · token-usage · generation-grid · multi-channel-timeseries · reward-curve · ranked-results · tradeoff-scatter · grouped-bar · multi-radar · data-table · pr-curve · roc-curve · token-logprob · agent-trace · filmstrip · action-distribution · trajectory-overlay · recall-at-k · waveform-spectrogram
Benchmark templates
Production layout rules shared by RoboSpatial and LIBERO dashboards.
KPI metric
Reported metric for this run
across 385 scored samples
Progress
Insight card
Agent-surfaced findings
- Accuracy drops 18% on multi-object scenes versus single-object.
- Most failures cluster in the rotational-motion category.
- Latency P95 is 2.3× the median; a few slow samples dominate.
Score convergence
Running mean score over samples
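A running mean over samples can be computed in one pass. The sketch below is illustrative; `running_mean` is a hypothetical name, and the dashboard may compute the series differently (for example with smoothing).

```python
from itertools import accumulate


def running_mean(scores):
    """Return the mean of scores[0..i] for every index i."""
    # accumulate() yields cumulative sums; divide each by the sample count so far.
    return [total / n for n, total in enumerate(accumulate(scores), start=1)]


curve = running_mean([1, 0, 1, 1])
```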
Line chart
Metric trend over samples
Bar chart
Accuracy by category
Histogram
Per-sample score distribution
Radar chart
Accuracy across modalities
Confusion matrix
Predicted vs expected
rows = predicted · cols = expected
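To make the orientation concrete, here is a minimal sketch that builds the matrix with rows indexed by the predicted label and columns by the expected label, as stated above. The function name and the Yes/No labels are illustrative assumptions.

```python
from collections import Counter


def confusion_matrix(predicted, expected, labels):
    """Rows follow the predicted label, columns follow the expected label."""
    counts = Counter(zip(predicted, expected))
    # A Counter returns 0 for unseen (predicted, expected) pairs.
    return [[counts[(p, e)] for e in labels] for p in labels]


matrix = confusion_matrix(
    predicted=["No", "Yes", "No", "Yes"],
    expected=["Yes", "Yes", "Yes", "Yes"],
    labels=["Yes", "No"],
)
```

Note that many libraries use the transposed convention (rows = expected), so the orientation is worth checking before comparing against an external tool.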
Heatmap
Grid intensity
Failure report
Structured failure summary and recommendations
Failure summary for this benchmark run
Prioritize the largest low-score cluster.
Inspect repeated grounding errors.
Failure explorer
Failure categories by frequency
Latency explorer
Per-sample latency distribution
Image gallery
Sample visual inputs
Image comparison
Input vs prediction
Video viewer
Sample video stream
sample_204 · driving sequence
Point cloud viewer
Projected 3D points
Trajectory viewer
Ground truth vs predicted path
Throughput
Gauge
Single rate, 0–100%
Training curves
Multiple series with legend
Scatter plot
2D points, grouped
Calibration curve
Confidence vs accuracy
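One common way to get confidence-versus-accuracy points is to bucket samples by confidence and compare each bucket's mean confidence with its accuracy. The sketch below assumes equal-width bins over [0, 1]; the bin count and the name `calibration_curve` are assumptions, not the tile's documented behavior.

```python
def calibration_curve(confidences, correct, n_bins=5):
    """Return (mean confidence, accuracy) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    points = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            points.append((mean_conf, accuracy))
    return points


points = calibration_curve([0.1, 0.9, 0.9], [False, True, False], n_bins=2)
```

A perfectly calibrated model would put every point on the diagonal, where mean confidence equals accuracy.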
Code / JSON
VS Code-coloured code or JSON
{
"answer": "B",
"confidence": 0.82,
"reasoning": "Object rotates clockwise across frames.",
"tokens": {
"input": 1432,
"output": 96
}
}
Text report
Written failure narrative
- 37% of failures involve rotational motion that reverses mid-sequence.
- The model confuses left/right turns under heavy occlusion.
- Counting errors grow sharply beyond 6 objects in frame.
- Recommend re-running the rotation slice with a higher frame budget.
Stat grid
Key failure statistics
Image overlay
Masks / boxes / keypoints on an image
sample_204 · detections
Embedding (3D)
Rotatable 3D scatter
Sample panel
Image (+ point overlays) over text blocks
Point to the free space where the mug can be placed.
Box plot
Min · 25th · mean · 75th · max
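The five values listed above can be derived with the standard library. This sketch follows the label as written, reporting the mean rather than the median as the center value; `box_stats` is a hypothetical name, and the inclusive quantile method is an assumption.

```python
from statistics import mean, quantiles


def box_stats(values):
    """Return min, 25th percentile, mean, 75th percentile, and max."""
    # n=4 yields the three quartile cut points; only the outer two are used.
    q25, _median, q75 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q25": q25,
        "mean": mean(values),
        "q75": q75,
        "max": max(values),
    }


stats = box_stats([1, 2, 3, 4, 5])
```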
Single image
One image (with optional overlays)
Saliency map
Score CDF
Cumulative fraction scoring ≤ x
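The quantity plotted here is the empirical CDF: for each threshold x, the share of samples whose score is at most x. A minimal sketch, with `score_cdf` as an illustrative name:

```python
def score_cdf(scores, x):
    """Fraction of samples with score <= x."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s <= x) / len(scores)


scores = [0.0, 0.5, 1.0, 1.0]
# Evaluate at every distinct score to get the step points of the curve.
curve = [(x, score_cdf(scores, x)) for x in sorted(set(scores))]
```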
Sample heat-strip
Every sample, colored by score
Distribution comparison
Overlaid histograms (e.g. correct vs incorrect)
Per-class metrics
Class × metric grid
Token usage
Output-token distribution
Generation grid
Generated images with per-item scores
Multi-channel time-series
Several signals on a shared time axis
Reward curve
Return per episode with ± band
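The width of the ± band is not defined above. One plausible reading, sketched here, is the mean return per episode across several runs, plus and minus one population standard deviation. Both that choice and the name `reward_band` are assumptions.

```python
from statistics import mean, pstdev


def reward_band(runs):
    """runs: one list of per-episode returns per run, all the same length.

    Returns (lower, center, upper) per episode, where the band is
    mean +/- one population standard deviation across runs (assumed).
    """
    band = []
    for returns in zip(*runs):  # regroup the values by episode
        center = mean(returns)
        spread = pstdev(returns)
        band.append((center - spread, center, center + spread))
    return band


band = reward_band([[1.0, 2.0], [3.0, 2.0]])
```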
Ranked results
Query → top-k with relevance
- 1 · doc_1000 · manipulation · 0.94
- 2 · doc_1001 · grasping · 0.85
- 3 · doc_1002 · perception · 0.77
- 4 · doc_1003 · control · 0.67
- 5 · doc_1004 · manipulation · 0.56
- 6 · doc_1005 · grasping · 0.49
- 7 · doc_1006 · perception · 0.38
- 8 · doc_1007 · control · 0.31
- 9 · doc_1008 · manipulation · 0.22
Trade-off scatter
Two metrics per run · Pareto frontier
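A run is on the Pareto frontier when no other run is at least as good on both metrics and strictly better on one. The sketch below assumes the two metrics are a score to maximize and a latency to minimize; the tile itself does not say which metrics or directions it uses, and `pareto_frontier` is a hypothetical name.

```python
def pareto_frontier(runs):
    """runs: list of (name, score, latency).

    Assumes higher score is better and lower latency is better.
    Returns the names of the non-dominated runs, in input order.
    """
    frontier = []
    for name, score, latency in runs:
        dominated = any(
            s >= score and l <= latency and (s > score or l < latency)
            for _, s, l in runs
        )
        if not dominated:
            frontier.append(name)
    return frontier


best = pareto_frontier(
    [("a", 0.9, 500), ("b", 0.8, 300), ("c", 0.7, 400), ("d", 0.9, 600)]
)
```

Run "c" is dominated by "b" (better on both metrics) and "d" by "a" (same score, lower latency), so only "a" and "b" remain.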
Grouped bar
Compare series per category
Multi-series radar
Compare models across axes
Data table
Paginated raw sample rows
| Sample | Category | Predicted | Expected | Score | Latency |
|---|---|---|---|---|---|
| s_1000 | context | No | Yes | 0 | 553ms |
| s_1001 | compat | Yes | Yes | 1 | 468ms |
| s_1002 | config | No | Yes | 0 | 562ms |
| s_1003 | context | Yes | Yes | 1 | 182ms |
| s_1004 | compat | Yes | Yes | 1 | 479ms |
| s_1005 | config | Yes | Yes | 1 | 344ms |
| s_1006 | context | No | Yes | 0 | 733ms |
| s_1007 | compat | Yes | Yes | 1 | 494ms |
| s_1008 | config | Yes | Yes | 1 | 706ms |
| s_1009 | context | No | Yes | 0 | 770ms |
| s_1010 | compat | Yes | Yes | 1 | 488ms |
| s_1011 | config | Yes | Yes | 1 | 560ms |
| s_1012 | context | No | Yes | 0 | 643ms |
| s_1013 | compat | Yes | Yes | 1 | 553ms |
| s_1014 | config | Yes | Yes | 1 | 610ms |
| s_1015 | context | Yes | Yes | 1 | 780ms |
| s_1016 | compat | Yes | Yes | 1 | 681ms |
| s_1017 | config | Yes | Yes | 1 | 581ms |
| s_1018 | context | Yes | Yes | 1 | 683ms |
| s_1019 | compat | Yes | Yes | 1 | 574ms |
| s_1020 | config | Yes | Yes | 1 | 610ms |
| s_1021 | context | No | Yes | 0 | 265ms |
| s_1022 | compat | No | Yes | 0 | 224ms |
| s_1023 | config | Yes | Yes | 1 | 691ms |
| s_1024 | context | Yes | Yes | 1 | 773ms |
PR curve
Precision vs recall
ROC curve
TPR vs FPR
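The points of an ROC curve come from sweeping a decision threshold over the scores and recording the false-positive and true-positive rates at each step. A minimal sketch, assuming binary labels and "score >= threshold" as the positive prediction; `roc_points` is an illustrative name.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, starting at (0, 0), for descending thresholds."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / neg, tp / pos))
    return points


points = roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```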
Token logprobs
Output tokens colored by confidence
Agent trace
Ordered tool-call / reasoning steps
- plan · Decompose task
Identify cup, plan collision-free grasp approach.
- tool · detect_objects(scene)
Returned 3 objects: cup (0.94), plate (0.88), table.
- tool · estimate_pose(cup)
6-DoF pose estimated; confidence 0.71.
- tool · plan_grasp(pose)
Top-down grasp rejected (occlusion); retry side grasp.
- tool · plan_grasp(pose, side)
Feasible grasp found; width 6.2cm.
- act · execute(trajectory)
Grasp executed; object lifted 12cm.
Filmstrip
Ordered image sequence
Action distribution
Per-dimension value spread
Trajectory overlay
Predicted vs ground-truth path + error
Recall@k
Retrieval metric vs k
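Recall@k is the share of relevant items that appear in the top k results. The sketch below uses document IDs in the style of the ranked-results mock data; the relevance set is made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found among the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


ranked = ["doc_1000", "doc_1001", "doc_1002", "doc_1003"]
relevant = {"doc_1001", "doc_1003"}  # illustrative relevance labels
curve = [(k, recall_at_k(ranked, relevant, k)) for k in (1, 2, 4)]
```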
Waveform & spectrogram
Audio waveform + spectrogram + transcript
“turn left at the second intersection and stop near the blue container”
Sample explorer
Configurable full-width card: narrow score + sample selector (live samples on top), then a modality-agnostic input channel and output channel you pick per benchmark.
Score and per-sample channels
Point to the free space where the mug can be placed.
Foxglove topics → native viz
Multimodal payloads arrive as Foxglove-style topics; the curated visualizations are populated from them automatically (point clouds, paths, camera images).
/lidar/points · foxglove.PointCloud
/camera/front/image · foxglove.CompressedImage
/planning/path · nav_msgs/Path
Point cloud
/lidar/points · foxglove.PointCloud
Trajectory
/planning/path · nav_msgs/Path
Camera
/camera/front/image · foxglove.CompressedImage
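The topic-to-visualization routing could be pictured as a lookup from message schema to tile name. This is a sketch under stated assumptions: the table `SCHEMA_TO_TILE` and the function `tiles_for_topics` are hypothetical, and pairing the camera image with the single-image tile is a guess, since the page only labels that card "Camera".

```python
# Hypothetical routing table; the real dashboard's mapping is not documented here.
SCHEMA_TO_TILE = {
    "foxglove.PointCloud": "point-cloud-viewer",
    "nav_msgs/Path": "trajectory-viewer",
    "foxglove.CompressedImage": "single-image",  # assumed tile for the Camera card
}


def tiles_for_topics(topics):
    """topics: {topic name: schema name}. Returns {topic name: tile name}.

    Topics whose schema has no known tile are skipped.
    """
    return {
        topic: SCHEMA_TO_TILE[schema]
        for topic, schema in topics.items()
        if schema in SCHEMA_TO_TILE
    }


routing = tiles_for_topics(
    {
        "/lidar/points": "foxglove.PointCloud",
        "/camera/front/image": "foxglove.CompressedImage",
        "/planning/path": "nav_msgs/Path",
    }
)
```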