A 6–12 month research program,
not a 3-year bet.
Four parallelizable research gaps stand between the demo you just saw and a shipping product.
Architecture: The Tool Surface Map
Claude Spatial is a five-layer system: Claude's cross-domain reasoning sits at the top, making tool calls through a Spatial MCP schema into the data sources that hardware engineers already use.
Phase 1: Prove Latent Capability
Weeks 1–6. Does the latent capability exist at all, and how much does tool access recover? Establish the zero-shot baseline, then measure the delta that Spatial MCP provides.
- Run Claude on GeoGramBench subset — procedural geometry code (OpenCascade/CadQuery), describe shapes, identify features, flag tolerancing issues. Target: >60% on local primitive recognition [3].
- Spatial relationship questions from 3DSRBench — multi-view renders of mechanical assemblies. Target: >70% on common viewpoints [4].
- CAD-as-code eval — CadQuery scripts, tolerance-critical features, interference fit, DFM issues. No fine-tuning.
- Build minimal Spatial MCP schema: select_region, measure_distance, query_feature_tree.
- Re-run identical evals with tool access enabled. Measure the delta.
- Hypothesis: tool access closes 40-60% of the gap without fine-tuning — analogous to how Claude Code's bash tool unlocks capabilities already in the model.
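The minimal Spatial MCP schema above can be sketched as tool declarations. This is a hypothetical shape, not a finalized schema: the three tool names come from the plan, but every parameter name and type here is an assumption for illustration.

```python
# Hypothetical tool surface for the Phase 1 Spatial MCP server.
# Tool names (select_region, measure_distance, query_feature_tree) are from
# the plan; the parameter schemas below are illustrative assumptions.
SPATIAL_MCP_TOOLS = [
    {
        "name": "select_region",
        "description": "Select a bounding-box region of the model for follow-up queries.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "min_xyz": {"type": "array", "items": {"type": "number"}},
                "max_xyz": {"type": "array", "items": {"type": "number"}},
            },
            "required": ["min_xyz", "max_xyz"],
        },
    },
    {
        "name": "measure_distance",
        "description": "Minimum distance between two named features.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "feature_a": {"type": "string"},
                "feature_b": {"type": "string"},
            },
            "required": ["feature_a", "feature_b"],
        },
    },
    {
        "name": "query_feature_tree",
        "description": "Return the parametric feature tree rooted at a feature (default: whole part).",
        "inputSchema": {
            "type": "object",
            "properties": {"root": {"type": "string"}},
            "required": [],
        },
    },
]

def tool_names(tools):
    """Convenience accessor: the names the model sees when listing tools."""
    return [t["name"] for t in tools]
```

Keeping the surface this small is deliberate for Phase 1: the delta being measured is "same model, three extra tools," so any eval gain is attributable to tool access rather than schema richness.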
Phase 2: Close the Gap
Weeks 7–10. SpatialVLM proved that synthetic data generation unlocks metric spatial reasoning: 2B VQA examples from 10M images [5]. CAD-Llama proved that CAD-as-code fine-tuning achieves 0.966 command accuracy [2]. This phase follows both playbooks.
- Generate synthetic CAD training data following the SpatialVLM pipeline: CAD viewport screenshots paired with scene-graph ground truth [5].
- Synthetic GD&T annotation pairs — tolerance callouts matched to geometry features.
- Multi-view renders of assemblies with feature tree labels for visual grounding.
- Tolerance chain VQA pairs — "what is the stack-up from datum A to surface B?"
- Fine-tune using the SpatialLLM recipe: 3D-informed multimodal alignment plus visual instruction tuning. Avoid fine-tuning the visual encoder directly, which hurts generalization [6].
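The data-generation step above can be sketched concretely: given scene-graph ground truth from a CAD viewport, emit metric VQA pairs programmatically. The scene-graph format (feature name → centroid in mm) and the question template are illustrative assumptions, not the SpatialVLM paper's exact pipeline.

```python
import itertools

# Sketch of SpatialVLM-style synthetic VQA generation applied to CAD.
# Assumed scene-graph format: {feature_name: (x, y, z) centroid in mm}.

def distance_mm(a, b):
    """Euclidean distance between two 3-D points, in mm."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def make_vqa_pairs(scene_graph):
    """Emit one metric question per feature pair; answers come from geometry,
    so labels are exact and the pipeline scales with render count."""
    pairs = []
    for (na, ca), (nb, cb) in itertools.combinations(sorted(scene_graph.items()), 2):
        d = distance_mm(ca, cb)
        pairs.append({
            "question": f"What is the center-to-center distance between {na} and {nb}?",
            "answer": f"{d:.1f} mm",
        })
    return pairs

graph = {"boss_1": (0.0, 0.0, 0.0), "hole_A": (30.0, 40.0, 0.0)}
print(make_vqa_pairs(graph)[0]["answer"])  # → 50.0 mm
```

The same loop extends to the other bullet items: swap the distance template for a tolerance-callout or stack-up template and the ground truth still comes for free from the CAD kernel.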
Phase 3: Ship and Dogfood
Weeks 11–12.
- Run full eval suite across all three tiers (zero-shot, tool-augmented, fine-tuned). Publish the internal benchmark.
- Ship dogfood version to core partners for real-world validation across manufacturing, product design, and architecture workflows.
- Novel eval: tolerance chain reasoning — no existing baseline. Becomes Anthropic's proprietary benchmark.
- Document the honest gap: where tool augmentation alone is sufficient vs where fine-tuning is required.
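Since tolerance chain reasoning has no existing baseline, the eval needs its own ground-truth generator. A minimal sketch, assuming a 1-D stack-up model (each dimension contributes a nominal value with a symmetric ± tolerance and a sign along the chain); the dimension values are illustrative.

```python
import math

# Ground-truth generator sketch for the tolerance-chain eval: worst-case and
# RSS (root-sum-square) stack-up along a 1-D dimension chain from datum A to
# surface B. Chain entries: (nominal_mm, plus_minus_tol_mm, direction +1/-1).

def stack_up(chain):
    nominal = sum(d * n for n, t, d in chain)          # signed nominal length
    worst_case = sum(t for n, t, d in chain)           # tolerances add directly
    rss = math.sqrt(sum(t ** 2 for n, t, d in chain))  # statistical stack-up
    return nominal, worst_case, rss

# Illustrative chain: two dimensions forward, one back toward surface B.
chain = [(25.0, 0.10, +1), (10.0, 0.05, +1), (5.0, 0.02, -1)]
nom, wc, rss = stack_up(chain)
print(f"{nom:.2f} mm, +/-{wc:.2f} worst case, +/-{rss:.3f} RSS")
```

Because the answer key is computed rather than annotated, the benchmark can scale to arbitrarily deep chains, which is exactly where models are expected to fail first.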
Four Research Gaps — Parallelizable
These workstreams are independent and can run concurrently; none requires the others to be complete before starting.
Full pipeline: Raw 3D (any format) → CAD Encoder → Canonical JSON/code → Token compression → Context window
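The "Canonical JSON/code → Token compression" stage of that pipeline can be illustrated with a toy encoder output. The op/params/children schema and the key-shortening scheme here are assumptions for illustration, not a finalized format.

```python
import json

# Sketch of the encoder's output stage: a verbose feature-tree node is
# canonicalized into compact JSON so more geometry fits per context window.
# Schema (operation/parameters/children) and key map are illustrative.

VERBOSE = {
    "operation": "extrude",
    "parameters": {"profile": "sketch_1", "distance_mm": 12.5},
    "children": [
        {"operation": "fillet",
         "parameters": {"edges": "top_loop", "radius_mm": 1.0},
         "children": []},
    ],
}

KEY_MAP = {"operation": "op", "parameters": "p", "children": "c"}

def compress(node):
    """Recursively shorten keys; values (and child order) are unchanged."""
    return {KEY_MAP[k]: ([compress(ch) for ch in v] if k == "children" else v)
            for k, v in node.items()}

# Crude proxy for token count: serialized length with no whitespace.
verbose_len = len(json.dumps(VERBOSE, indent=2))
compact_len = len(json.dumps(compress(VERBOSE), separators=(",", ":")))
print(compact_len < verbose_len)  # → True
```

Character count is only a proxy for tokens, but the direction holds: a canonical, whitespace-free, short-key form packs strictly more feature-tree depth into the same context budget.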
Deployment Tiers — RSP Constraint
Three tiers. Reading ITAR defense CAD or weapons-manufacturing geometry may trigger elevated capability thresholds under Anthropic's Responsible Scaling Policy. The Spatial MCP schema must account for deployment tier from day one.
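One way to bake the tier constraint into the schema from day one is a gate that every geometry-loading tool call passes through. The tier names and the ordering rule below are assumptions illustrating the constraint, not stated policy.

```python
from enum import Enum

# Hypothetical tier gate for the Spatial MCP server. Tier names and the
# "at-or-below" rule are illustrative assumptions, not Anthropic policy.

class Tier(Enum):
    OPEN = 1         # ordinary commercial CAD, no export-control flags
    CONTROLLED = 2   # customer-flagged proprietary geometry
    RESTRICTED = 3   # ITAR / weapons-related geometry

def can_load_geometry(source_tier: Tier, deployment_tier: Tier) -> bool:
    """A deployment may only read geometry at or below its cleared tier,
    so the check runs before any bytes reach the model's context."""
    return source_tier.value <= deployment_tier.value

print(can_load_geometry(Tier.RESTRICTED, Tier.OPEN))  # → False
```

Putting the check in the MCP layer rather than in prompts means tier enforcement does not depend on model behavior at all, which is the property an RSP-facing design needs.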
References
1. CADmium — Mila (Dec 2025). Fine-tuned Qwen2.5-Coder on minimal-JSON CAD histories. Code LLMs naturally handle structured CAD formats.
2. Li et al. — CAD-Llama (CVPR 2025). 99.9% unconditional generation success; 0.966 command-type accuracy on structured CAD code.
3. GeoGramBench (NeurIPS 2025 submission). 500 problems testing program-to-geometry translation. <50% at the highest abstraction level, >80% on local primitives.
4. Ma et al. — 3DSRBench (2024). 2,762 3D spatial reasoning questions across 12 subtypes. SOTA LMMs show degraded performance on uncommon viewpoints.
5. SpatialVLM — Google (2024). 2B VQA examples on 10M images. Showed that synthetic data generation unlocks metric-scale spatial reasoning.
6. SpatialLLM (CVPR 2025). 3D-informed multimodal alignment; SOTA on 3DSRBench. Key finding: fine-tuning CLIP with 3D data hurts generalization.
7. cadrille (ICLR 2026). First unified multimodal CAD reconstruction. RL fine-tuning on procedural data outperforms SFT on handcrafted datasets.
8. IJCAI 2025 survey. Comprehensive taxonomy of spatial-reasoning approaches (image-based, point-cloud-based, hybrid).
9. Point2CAD / Point2Sequence. Reconstructs B-rep CAD from point clouds by predicting sketch+extrude sequences; mesh → point cloud → parametric CAD pipeline.
10. BrepGen (2024). Generates B-rep CAD directly, including topology (faces, edges, vertices with NURBS parameters). Candidate target representation for the CAD encoder's output.
11. Xu et al. — CAD-MLLM (2024). Unified encoder for text, multi-view images, and 3D point clouds; frozen vision/point encoders with trainable linear projections into the LLM.