A 6–12 month research program,
not a 3-year bet.
Four parallelizable research gaps stand between the demo you just saw and a shipping product.
Architecture: The Tool Surface Map
Claude Spatial is a five-layer system: Claude's cross-domain reasoning sits at the top, making tool calls through a Spatial MCP schema into the data sources that hardware engineers already use.
Phase 1: Prove Latent Capability
Weeks 1–6. Does the latent capability exist at all, and how much does tool access recover? Establish the zero-shot baseline, then measure the delta that Spatial MCP provides.
- Run Claude on GeoGramBench subset — procedural geometry code (OpenCascade/CadQuery), describe shapes, identify features, flag tolerancing issues. Target: >60% on local primitive recognition [3].
- Spatial relationship questions from 3DSRBench — multi-view renders of mechanical assemblies. Target: >70% on common viewpoints [4].
- CAD-as-code eval — CadQuery scripts, tolerance-critical features, interference fit, DFM issues. No fine-tuning.
- Build minimal Spatial MCP schema: select_region, measure_distance, query_feature_tree.
- Re-run identical evals with tool access enabled. Measure the delta.
- Hypothesis: tool access closes 40-60% of the gap without fine-tuning — analogous to how Claude Code's bash tool unlocks capabilities already in the model.
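The minimal Spatial MCP schema above can be sketched as tool declarations. This is a hypothetical shape, not a finalized schema: the three tool names come from the plan, but every parameter name and type here is an assumption for illustration.

```python
# Hypothetical tool surface for the Phase 1 Spatial MCP server.
# Tool names (select_region, measure_distance, query_feature_tree) are from
# the plan; the parameter schemas below are illustrative assumptions.
SPATIAL_MCP_TOOLS = [
    {
        "name": "select_region",
        "description": "Select a bounding-box region of the model for follow-up queries.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "min_xyz": {"type": "array", "items": {"type": "number"}},
                "max_xyz": {"type": "array", "items": {"type": "number"}},
            },
            "required": ["min_xyz", "max_xyz"],
        },
    },
    {
        "name": "measure_distance",
        "description": "Minimum distance between two named features.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "feature_a": {"type": "string"},
                "feature_b": {"type": "string"},
            },
            "required": ["feature_a", "feature_b"],
        },
    },
    {
        "name": "query_feature_tree",
        "description": "Return the parametric feature tree rooted at a feature (default: whole part).",
        "inputSchema": {
            "type": "object",
            "properties": {"root": {"type": "string"}},
            "required": [],
        },
    },
]

def tool_names(tools):
    """Convenience accessor: the names the model sees when listing tools."""
    return [t["name"] for t in tools]
```

Keeping the surface this small is deliberate for Phase 1: the delta being measured is "same model, three extra tools," so any eval gain is attributable to tool access rather than schema richness.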
Phase 2: Close the Gap
Weeks 7–10. SpatialVLM proved that synthetic data generation unlocks metric spatial reasoning: 2B VQA examples from 10M images [5]. CAD-Llama proved that CAD-as-code fine-tuning achieves 0.966 command accuracy [2]. This phase follows both playbooks.
- Generate synthetic CAD training data following the SpatialVLM pipeline: CAD viewport screenshots paired with scene-graph ground truth [5].
- Synthetic GD&T annotation pairs — tolerance callouts matched to geometry features.
- Multi-view renders of assemblies with feature tree labels for visual grounding.
- Tolerance chain VQA pairs — "what is the stack-up from datum A to surface B?"
- Fine-tune using the SpatialLLM recipe: 3D-informed multimodal alignment plus visual instruction tuning. Avoid fine-tuning the visual encoder directly, which hurts generalization [6].
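The data-generation step above can be sketched concretely: given scene-graph ground truth from a CAD viewport, emit metric VQA pairs programmatically. The scene-graph format (feature name → centroid in mm) and the question template are illustrative assumptions, not the SpatialVLM paper's exact pipeline.

```python
import itertools

# Sketch of SpatialVLM-style synthetic VQA generation applied to CAD.
# Assumed scene-graph format: {feature_name: (x, y, z) centroid in mm}.

def distance_mm(a, b):
    """Euclidean distance between two 3-D points, in mm."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def make_vqa_pairs(scene_graph):
    """Emit one metric question per feature pair; answers come from geometry,
    so labels are exact and the pipeline scales with render count."""
    pairs = []
    for (na, ca), (nb, cb) in itertools.combinations(sorted(scene_graph.items()), 2):
        d = distance_mm(ca, cb)
        pairs.append({
            "question": f"What is the center-to-center distance between {na} and {nb}?",
            "answer": f"{d:.1f} mm",
        })
    return pairs

graph = {"boss_1": (0.0, 0.0, 0.0), "hole_A": (30.0, 40.0, 0.0)}
print(make_vqa_pairs(graph)[0]["answer"])  # → 50.0 mm
```

The same loop extends to the other bullet items: swap the distance template for a tolerance-callout or stack-up template and the ground truth still comes for free from the CAD kernel.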
Phase 3: Ship and Dogfood
Weeks 11–12.
- Run full eval suite across all three tiers (zero-shot, tool-augmented, fine-tuned). Publish the internal benchmark.
- Ship dogfood version to core partners for real-world validation across manufacturing, product design, and architecture workflows.
- Novel eval: tolerance chain reasoning — no existing baseline. Becomes Anthropic's proprietary benchmark.
- Document the honest gap: where tool augmentation alone is sufficient vs where fine-tuning is required.
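Since tolerance chain reasoning has no existing baseline, the eval needs its own ground-truth generator. A minimal sketch, assuming a 1-D stack-up model (each dimension contributes a nominal value with a symmetric ± tolerance and a sign along the chain); the dimension values are illustrative.

```python
import math

# Ground-truth generator sketch for the tolerance-chain eval: worst-case and
# RSS (root-sum-square) stack-up along a 1-D dimension chain from datum A to
# surface B. Chain entries: (nominal_mm, plus_minus_tol_mm, direction +1/-1).

def stack_up(chain):
    nominal = sum(d * n for n, t, d in chain)          # signed nominal length
    worst_case = sum(t for n, t, d in chain)           # tolerances add directly
    rss = math.sqrt(sum(t ** 2 for n, t, d in chain))  # statistical stack-up
    return nominal, worst_case, rss

# Illustrative chain: two dimensions forward, one back toward surface B.
chain = [(25.0, 0.10, +1), (10.0, 0.05, +1), (5.0, 0.02, -1)]
nom, wc, rss = stack_up(chain)
print(f"{nom:.2f} mm, +/-{wc:.2f} worst case, +/-{rss:.3f} RSS")
```

Because the answer key is computed rather than annotated, the benchmark can scale to arbitrarily deep chains, which is exactly where models are expected to fail first.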
Four Research Gaps — Parallelizable
These workstreams are independent and can run concurrently; none requires the others to be complete before starting.
Full pipeline: Raw 3D (any format) → CAD Encoder → Canonical JSON/code → Token compression → Context window
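The "Canonical JSON/code → Token compression" stage of that pipeline can be illustrated with a toy encoder output. The op/params/children schema and the key-shortening scheme here are assumptions for illustration, not a finalized format.

```python
import json

# Sketch of the encoder's output stage: a verbose feature-tree node is
# canonicalized into compact JSON so more geometry fits per context window.
# Schema (operation/parameters/children) and key map are illustrative.

VERBOSE = {
    "operation": "extrude",
    "parameters": {"profile": "sketch_1", "distance_mm": 12.5},
    "children": [
        {"operation": "fillet",
         "parameters": {"edges": "top_loop", "radius_mm": 1.0},
         "children": []},
    ],
}

KEY_MAP = {"operation": "op", "parameters": "p", "children": "c"}

def compress(node):
    """Recursively shorten keys; values (and child order) are unchanged."""
    return {KEY_MAP[k]: ([compress(ch) for ch in v] if k == "children" else v)
            for k, v in node.items()}

# Crude proxy for token count: serialized length with no whitespace.
verbose_len = len(json.dumps(VERBOSE, indent=2))
compact_len = len(json.dumps(compress(VERBOSE), separators=(",", ":")))
print(compact_len < verbose_len)  # → True
```

Character count is only a proxy for tokens, but the direction holds: a canonical, whitespace-free, short-key form packs strictly more feature-tree depth into the same context budget.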
Deployment Tiers — RSP Constraint
Three tiers. Reading ITAR defense CAD or weapons-manufacturing geometry may trigger elevated capability thresholds under Anthropic's Responsible Scaling Policy. The Spatial MCP schema must account for deployment tier from day one.
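One way to bake the tier constraint into the schema from day one is a gate that every geometry-loading tool call passes through. The tier names and the ordering rule below are assumptions illustrating the constraint, not stated policy.

```python
from enum import Enum

# Hypothetical tier gate for the Spatial MCP server. Tier names and the
# "at-or-below" rule are illustrative assumptions, not Anthropic policy.

class Tier(Enum):
    OPEN = 1         # ordinary commercial CAD, no export-control flags
    CONTROLLED = 2   # customer-flagged proprietary geometry
    RESTRICTED = 3   # ITAR / weapons-related geometry

def can_load_geometry(source_tier: Tier, deployment_tier: Tier) -> bool:
    """A deployment may only read geometry at or below its cleared tier,
    so the check runs before any bytes reach the model's context."""
    return source_tier.value <= deployment_tier.value

print(can_load_geometry(Tier.RESTRICTED, Tier.OPEN))  # → False
```

Putting the check in the MCP layer rather than in prompts means tier enforcement does not depend on model behavior at all, which is the property an RSP-facing design needs.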
References
1. CADmium — Mila (Dec 2025). Fine-tuned Qwen2.5-Coder on minimal-JSON CAD histories. Code LLMs naturally handle structured CAD formats.
2. Li et al. — CAD-Llama (CVPR 2025). 99.9% unconditional generation success; 0.966 command-type accuracy on structured CAD code.
3. GeoGramBench (NeurIPS 2025 submission). 500 problems testing program-to-geometry translation. <50% at the highest abstraction level, >80% on local primitives.
4. Ma et al. — 3DSRBench (2024). 2,762 3D spatial reasoning questions across 12 subtypes. SOTA LMMs show degraded performance on uncommon viewpoints.
5. SpatialVLM — Google (2024). 2B VQA examples on 10M images. Showed that synthetic data generation unlocks metric-scale spatial reasoning.
6. SpatialLLM (CVPR 2025). 3D-informed multimodal alignment; SOTA on 3DSRBench. Key finding: fine-tuning CLIP with 3D data hurts generalization.
7. cadrille (ICLR 2026). First unified multimodal CAD reconstruction. RL fine-tuning on procedural data outperforms SFT on handcrafted datasets.
8. IJCAI 2025 survey. Comprehensive taxonomy of spatial-reasoning approaches (image-based, point-cloud-based, hybrid).
9. Point2CAD / Point2Sequence. Reconstructs B-rep CAD from point clouds by predicting sketch+extrude sequences; mesh → point cloud → parametric CAD pipeline.
10. BrepGen (2024). Generates B-rep CAD directly, including topology (faces, edges, vertices with NURBS parameters). Candidate target representation for the CAD encoder's output.
11. Xu et al. — CAD-MLLM (2024). Unified encoder for text, multi-view images, and 3D point clouds; frozen vision/point encoders with trainable linear projections into the LLM.