# Agent Workflow Instruction: Unity-Based Laparoscopic Stereo Benchmark Simulator

## 1. Mission

Prepare a workflow for building a **Unity-based 3D laparoscopic scene simulator** for **on-the-fly benchmarking of stereo-matching and 3D reconstruction algorithms**.

The simulator must generate controllable synthetic laparoscopic stereo scenes with exact ground truth, allowing researchers to test how different algorithms behave under controlled surgical imaging conditions such as specularity, low texture, smoke, blood, tissue deformation, lighting variation, camera motion, tool occlusion, and stereo-baseline changes.

The main research contribution should not be “a synthetic dataset” alone. The contribution should be:

> A controllable, reproducible, parameterized benchmarking environment for failure-mode analysis of stereo depth estimation and 3D reconstruction in laparoscopic scenes.

---

## 2. Core Research Question

Design the workflow around this question:

> How do different stereo-matching and 3D reconstruction algorithms fail under specific laparoscopic imaging conditions, and can a controllable simulator expose these failure modes more systematically than fixed real-world datasets?

The system should support:

* real-time or batch generation of stereo laparoscopic image pairs;
* exact metric ground truth;
* automatic algorithm evaluation;
* controlled parameter sweeps;
* comparison with real laparoscopic stereo benchmarks.

---

## 3. Target Users

The workflow should assume the simulator will be used by:

* computer vision researchers;
* medical image analysis researchers;
* surgical robotics researchers;
* PhD/MSc students working on laparoscopic 3D reconstruction;
* developers benchmarking stereo algorithms before testing on real datasets.

---

## 4. Required Simulator Outputs

For every generated frame or sequence, the simulator must export the following:

```text
left_rgb.png
right_rgb.png
depth_left.exr / depth_left.npy
depth_right.exr / depth_right.npy
disparity_left.npy
disparity_right.npy
surface_normals.npy
occlusion_mask.png
specularity_mask.png
instrument_mask.png
tissue_mask.png
semantic_labels.png
camera_intrinsics.json
camera_extrinsics.json
stereo_calibration.json
scene_parameters.json
```

The `scene_parameters.json` file must record all randomized and manually set parameters, including:

```json
{
  "baseline_mm": 5.0,
  "focal_length_px": 800,
  "working_distance_mm": 80,
  "camera_pitch_deg": 0,
  "camera_yaw_deg": 0,
  "light_intensity": 1.0,
  "light_angle_deg": 30,
  "tissue_specularity": 0.6,
  "tissue_roughness": 0.35,
  "smoke_density": 0.2,
  "blood_coverage": 0.1,
  "deformation_amplitude_mm": 3.0,
  "deformation_frequency_hz": 1.2,
  "tool_occlusion_ratio": 0.25,
  "motion_blur": false,
  "noise_level": 0.01,
  "random_seed": 12345
}
```

---

## 5. Core Scene Components

The simulator should contain the following modular scene components.

### 5.1 Stereo Laparoscope Camera

Implement a virtual stereo laparoscope with configurable:

* stereo baseline;
* focal length;
* sensor size;
* image resolution;
* lens distortion;
* convergence angle;
* near/far clipping planes;
* working distance;
* camera pose;
* camera trajectory.

The camera should support:

* fixed stereo capture;
* moving stereo camera;
* rectified output;
* non-rectified raw output;
* exportable calibration parameters.

### 5.2 Tissue and Organ Scene

Create at least one deformable laparoscopic tissue scene.

Minimum viable scene:

* one abdominal organ or tissue surface;
* non-planar geometry;
* wet/specular material;
* subtle tissue texture;
* deformable surface;
* local folds, ridges, and valleys.

The tissue shader should support:

* diffuse color variation;
* roughness variation;
* specular highlights;
* wetness;
* subsurface-like appearance;
* procedural vessels or texture;
* optional blood patches.

### 5.3 Surgical Instruments

Add laparoscopic instruments that can:

* enter and leave the field of view;
* occlude tissue;
* touch or deform tissue;
* create difficult object boundaries;
* generate metallic specular highlights.

At minimum, include:

* grasper;
* forceps or generic tool shaft;
* optional needle holder or scissors.

### 5.4 Lighting System

Implement controllable endoscopic lighting:

* point or spot lights attached near the camera;
* adjustable light intensity;
* adjustable falloff;
* adjustable direction;
* asymmetric illumination;
* overexposure;
* shadow regions.

Lighting must be logged in the scene parameter file.

### 5.5 Surgical Artifacts

Add optional artifacts:

* smoke or haze;
* blood;
* bubbles;
* blur;
* image noise;
* compression artifacts;
* vignetting;
* lens distortion;
* saturation;
* specular bloom.

Each artifact must be controllable independently.

---

## 6. Parameter Sweep Design

The simulator must support automatic benchmark generation through parameter sweeps.

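
A sweep of this kind can be expressed as a one-factor-at-a-time expansion around a default configuration. The following is only a minimal sketch under stated assumptions: the `DEFAULTS` dictionary, the `one_factor_sweep` helper, and the `sweep` bookkeeping key are hypothetical names, and the random seed is held fixed so that only the swept parameter differs between configurations.

```python
import json

# Hypothetical defaults; the values echo the example scene_parameters.json.
DEFAULTS = {
    "baseline_mm": 5.0,
    "working_distance_mm": 80,
    "tissue_specularity": 0.6,
    "smoke_density": 0.2,
    "random_seed": 12345,
}


def one_factor_sweep(defaults, name, values):
    """Yield one config per value of `name`, holding every other parameter fixed."""
    if name not in defaults:
        raise KeyError(f"unknown parameter: {name}")
    for index, value in enumerate(values):
        config = dict(defaults)  # shallow copy so the defaults stay untouched
        config[name] = value
        # Record which factor was varied so reports can group results by sweep.
        config["sweep"] = {"parameter": name, "index": index}
        yield config


configs = list(one_factor_sweep(DEFAULTS, "smoke_density", [0, 0.1, 0.3, 0.5]))
print(json.dumps(configs[2], indent=2))
```

Each resulting dictionary could be serialized to a per-run config file and passed to the simulator; a combined hard-case sweep would need a Cartesian product instead, which this sketch does not cover.
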
Define parameter groups:

### 6.1 Camera Parameters

```text
baseline_mm: [2, 4, 6, 8, 10]
working_distance_mm: [40, 60, 80, 100, 120]
focal_length_px: [600, 800, 1000, 1200]
camera_motion_speed: [static, slow, medium, fast]
```

### 6.2 Tissue Parameters

```text
texture_level: [low, medium, high]
specularity: [none, low, medium, high]
roughness: [low, medium, high]
deformation_amplitude: [0, 1, 3, 5, 10] mm
deformation_frequency: [0, 0.5, 1.0, 2.0] Hz
```

### 6.3 Scene Difficulty Parameters

```text
smoke_density: [0, 0.1, 0.3, 0.5]
blood_coverage: [0, 0.05, 0.15, 0.3]
tool_occlusion_ratio: [0, 0.1, 0.25, 0.5]
light_angle: [0, 15, 30, 45, 60] degrees
image_noise: [none, low, medium, high]
motion_blur: [false, true]
```

The workflow must include scripts for generating benchmark subsets such as:

```text
baseline_sweep/
specularity_sweep/
smoke_sweep/
deformation_sweep/
tool_occlusion_sweep/
lighting_sweep/
combined_hard_cases/
```

---

## 7. Algorithm Benchmarking Interface

Design a plug-in interface so different stereo-matching algorithms can be benchmarked automatically.

Each algorithm adapter should accept:

```text
left_rgb
right_rgb
camera_intrinsics
stereo_calibration
```

Each algorithm should output:

```text
predicted_disparity
predicted_depth
optional_confidence_map
runtime_ms
```

The benchmark runner should support both:

* classical stereo methods;
* deep-learning stereo methods.

Example algorithm categories:

```text
Block Matching
Semi-Global Matching
ELAS / LIBELAS-style methods
RAFT-Stereo
PSMNet
GC-Net
StereoNet
HITNet
laparoscopic-specific stereo methods
custom user algorithms
```

The agent should design a wrapper format such as:

```bash
python run_algorithm.py \
  --algorithm raft_stereo \
  --left path/to/left.png \
  --right path/to/right.png \
  --calib path/to/stereo_calibration.json \
  --output path/to/prediction/
```

---

## 8. Evaluation Metrics

The benchmark must compute metrics at several levels.

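
As an illustration of the first level, the disparity metrics in 8.1 can be computed over a validity mask. This is a minimal sketch, not a reference implementation: the function name `disparity_metrics` and the dictionary keys are hypothetical, predictions are assumed to be dense and finite, and "Bad-N px" is taken to mean the fraction of valid pixels whose absolute error exceeds N pixels.

```python
import numpy as np


def disparity_metrics(pred, gt, valid, thresholds=(1, 2, 3, 5)):
    """Mean and median absolute disparity error plus Bad-N rates over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Ignore pixels flagged invalid or lacking finite ground truth.
    mask = np.asarray(valid, dtype=bool) & np.isfinite(gt)
    if not mask.any():
        raise ValueError("no valid pixels to evaluate")
    err = np.abs(pred[mask] - gt[mask])
    metrics = {
        "epe": float(err.mean()),
        "median": float(np.median(err)),
    }
    for t in thresholds:
        # Fraction of valid pixels with error strictly above t pixels.
        metrics[f"bad_{t}px"] = float((err > t).mean())
    return metrics


gt = np.full((2, 2), 10.0)
pred = np.array([[10.5, 12.5], [10.0, 14.0]])
valid = np.array([[True, True], [True, False]])
result = disparity_metrics(pred, gt, valid)
print(result)
```

Region-stratified variants follow from the same function by combining `valid` with a region mask (for example the specularity or instrument mask) before the call.
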
### 8.1 Disparity Metrics

```text
End-Point Error
Mean Absolute Disparity Error
Median Disparity Error
Bad-1px
Bad-2px
Bad-3px
Bad-5px
```

### 8.2 Depth Metrics

```text
Depth RMSE in mm
Depth MAE in mm
Absolute Relative Error
Squared Relative Error
Threshold Accuracy
Scale Drift
```

### 8.3 3D Reconstruction Metrics

```text
Point-cloud Chamfer Distance
Surface-to-surface distance
Normal consistency
Completeness
Accuracy
F-score at distance thresholds
```

### 8.4 Temporal Metrics

For sequences:

```text
Temporal depth consistency
Frame-to-frame disparity jitter
Optical-flow-aware depth stability
Runtime stability
```

### 8.5 Region-Stratified Metrics

Metrics must also be computed separately for:

```text
tissue regions
instrument regions
specular regions
blood regions
smoke-affected regions
shadow regions
occlusion boundaries
depth discontinuities
low-texture tissue
high-curvature tissue
```

This region-stratified evaluation is essential. The benchmark should reveal not only which algorithm is best on average, but also which visual or geometric condition causes failure.

---

## 9. Benchmark Reports

The workflow must generate automatic reports containing:

```text
summary table per algorithm
metric curves across parameter sweeps
failure-case visualizations
error heatmaps
depth/disparity overlays
runtime comparison
ranking by scenario
ranking by robustness
per-region error breakdown
```

Each benchmark report should include plots such as:

```text
Depth RMSE vs specularity
Bad-3px vs smoke density
Chamfer distance vs deformation amplitude
Runtime vs image resolution
Error near tool boundaries
Error in low-texture regions
```

The report should clearly identify:

```text
best average algorithm
most robust algorithm
fastest algorithm
best algorithm under smoke
best algorithm under specular highlights
best algorithm near instruments
worst failure modes per algorithm
```

---

## 10. Validation Against Real Datasets

The simulator workflow must include a validation stage against real laparoscopic stereo datasets.

Use these real benchmarks as external references:

```text
SCARED
SERV-CT
EndoAbS
Hamlyn laparoscopic datasets
StereoMIS
```

The agent should prepare a protocol for checking whether synthetic benchmark results correlate with real-data performance.

Important validation questions:

```text
Do algorithms that perform well in simulation also perform well on SCARED?
Do specularity failures in simulation match failures on real laparoscopic tissue?
Do smoke and tool occlusion scenarios produce realistic ranking changes?
Does the simulator overestimate performance?
Which synthetic parameters best predict real-world errors?
```

The goal is not to prove the simulator fully replaces real data. The goal is to show that it provides controlled failure-mode analysis that complements real datasets.

---

## 11. Repository Structure

Prepare the project with the following structure:

```text
laparo-stereo-sim/
│
├── unity_project/
│   ├── Assets/
│   ├── Packages/
│   ├── ProjectSettings/
│   └── README.md
│
├── benchmark_runner/
│   ├── algorithms/
│   │   ├── sgm/
│   │   ├── raft_stereo/
│   │   ├── psmnet/
│   │   └── custom_template/
│   │
│   ├── evaluation/
│   │   ├── disparity_metrics.py
│   │   ├── depth_metrics.py
│   │   ├── pointcloud_metrics.py
│   │   ├── temporal_metrics.py
│   │   └── region_metrics.py
│   │
│   ├── visualization/
│   │   ├── plot_metrics.py
│   │   ├── render_error_maps.py
│   │   └── generate_report.py
│   │
│   ├── configs/
│   │   ├── baseline_sweep.yaml
│   │   ├── specularity_sweep.yaml
│   │   ├── smoke_sweep.yaml
│   │   ├── deformation_sweep.yaml
│   │   └── combined_hard_cases.yaml
│   │
│   └── run_benchmark.py
│
├── datasets/
│   ├── synthetic/
│   ├── real/
│   │   ├── scared/
│   │   ├── serv_ct/
│   │   ├── endoabs/
│   │   └── hamlyn/
│   └── README.md
│
├── reports/
│   ├── figures/
│   ├── tables/
│   └── benchmark_summary.md
│
├── docs/
│   ├── simulator_design.md
│   ├── benchmark_protocol.md
│   ├── algorithm_interface.md
│   └── validation_protocol.md
│
├── scripts/
│   ├── export_unity_sequence.py
│   ├── convert_depth_to_disparity.py
│   ├── rectify_stereo_pair.py
│   └── prepare_real_datasets.py
│
├── environment.yml
├── requirements.txt
└── README.md
```

---

## 12. Development Phases

### Phase 1: Literature and Requirement Review

Prepare a compact review of:

```text
existing laparoscopic stereo datasets
existing surgical simulators
existing stereo depth algorithms
evaluation metrics for stereo and reconstruction
known laparoscopic stereo failure modes
```

Deliverables:

```text
docs/literature_summary.md
docs/design_requirements.md
docs/benchmark_gap_analysis.md
```

### Phase 2: Minimal Unity Prototype

Build a simple Unity scene with:

```text
one deformable tissue surface
one stereo laparoscope
one controllable light source
left/right RGB export
depth export
camera calibration export
```

Deliverables:

```text
unity_project/
sample_output/
docs/prototype_notes.md
```

### Phase 3: Ground Truth Export

Add export of:

```text
depth maps
disparity maps
surface normals
segmentation masks
occlusion masks
camera parameters
scene parameter logs
```

Deliverables:

```text
sample_output/ground_truth/
scripts/convert_depth_to_disparity.py
docs/ground_truth_specification.md
```

### Phase 4: Scene Realism and Artifacts

Add:

```text
specular tissue material
low-texture tissue
blood patches
smoke
instrument occlusion
motion blur
noise
lighting variation
deformation
```

Deliverables:

```text
unity_project/realistic_scene/
docs/artifact_controls.md
sample_sequences/
```

### Phase 5: Benchmark Runner

Implement the algorithm interface and evaluation system.

Deliverables:

```text
benchmark_runner/run_benchmark.py
benchmark_runner/algorithms/custom_template/
benchmark_runner/evaluation/
benchmark_runner/visualization/
```

### Phase 6: Algorithm Integration

Integrate several baseline algorithms.

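
To keep integration uniform, each algorithm can be wrapped behind the same call signature and registered by name. The sketch below is one possible shape for that, not a prescribed design: `StereoResult`, `ADAPTERS`, `register`, and the placeholder algorithm are hypothetical names, images are plain nested lists for the sake of a dependency-free example, and depth conversion is assumed to be left to the benchmark runner.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class StereoResult:
    """Outputs named after the adapter contract in Section 7."""
    predicted_disparity: Any
    predicted_depth: Optional[Any]
    runtime_ms: float
    optional_confidence_map: Optional[Any] = None


# Hypothetical registry mapping an algorithm name to its timed adapter.
ADAPTERS: Dict[str, Callable[..., StereoResult]] = {}


def register(name):
    """Wrap a disparity function so it is timed and stored under `name`."""
    def decorator(fn):
        def adapter(left_rgb, right_rgb, camera_intrinsics, stereo_calibration):
            start = time.perf_counter()
            disparity = fn(left_rgb, right_rgb, camera_intrinsics, stereo_calibration)
            runtime_ms = (time.perf_counter() - start) * 1000.0
            # Depth is left as None here; the runner would derive it from calibration.
            return StereoResult(disparity, None, runtime_ms)
        ADAPTERS[name] = adapter
        return adapter
    return decorator


@register("custom_placeholder")
def zero_disparity(left_rgb, right_rgb, camera_intrinsics, stereo_calibration):
    # Placeholder algorithm: zero disparity for every pixel of the left image.
    return [[0.0 for _ in row] for row in left_rgb]


image = [[0, 0, 0], [0, 0, 0]]
result = ADAPTERS["custom_placeholder"](image, image, {}, {})
print(result.predicted_disparity, result.runtime_ms >= 0.0)
```

A real adapter would replace the placeholder body with a call into the wrapped method while keeping the same four inputs.
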
Minimum recommended set:

```text
OpenCV StereoBM
OpenCV StereoSGBM
RAFT-Stereo or equivalent deep model
one laparoscopic-specific method if available
one custom placeholder adapter
```

Deliverables:

```text
benchmark_runner/algorithms/
docs/algorithm_interface.md
reports/baseline_algorithm_results.md
```

### Phase 7: Controlled Experiments

Run parameter sweeps.

Minimum experiments:

```text
specularity sweep
smoke sweep
baseline sweep
deformation sweep
tool occlusion sweep
lighting sweep
combined hard-case sweep
```

Deliverables:

```text
reports/parameter_sweep_results.md
reports/figures/
reports/tables/
```

### Phase 8: Real Dataset Validation

Evaluate selected algorithms on real datasets and compare trends.

Deliverables:

```text
datasets/real/
scripts/prepare_real_datasets.py
reports/real_dataset_validation.md
reports/sim_to_real_correlation.md
```

### Phase 9: Final Research Packaging

Prepare final outputs:

```text
README.md
docs/methodology.md
docs/benchmark_protocol.md
reports/final_benchmark_report.md
paper_outline.md
demo_video_plan.md
```

---

## 13. Technical Requirements

### Unity Requirements

Use:

```text
Unity 2022 LTS or newer
HDRP if photorealistic rendering is needed
Unity Perception or custom ground-truth export tools
C# scripts for camera control and scene randomization
deterministic random seeds
headless or batch rendering if possible
```

The Unity project must support:

```text
manual scene preview
scripted batch generation
reproducible randomization
per-frame metadata export
config-driven experiments
```

### Python Requirements

Use Python for:

```text
benchmark orchestration
algorithm adapters
metric computation
plot generation
report generation
real dataset preparation
```

Recommended libraries:

```text
numpy
opencv-python
scipy
matplotlib
pandas
open3d
torch
torchvision
tqdm
pyyaml
scikit-image
```

---

## 14. Reproducibility Requirements

Every experiment must be reproducible.

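
One way to make this concrete is to write a small provenance manifest alongside every run. The following sketch uses only the Python standard library; the function names and manifest keys are hypothetical, the `sgbm` parameters are illustrative, and fields such as the Unity version would have to be reported by the simulator itself.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def git_commit_hash():
    """Return the current commit hash, or None when not inside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_manifest(random_seed, algorithm, algorithm_parameters):
    """Collect run provenance that Python can observe directly."""
    return {
        "git_commit": git_commit_hash(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "random_seed": random_seed,
        "algorithm": algorithm,
        "algorithm_parameters": algorithm_parameters,
    }


manifest = build_manifest(12345, "sgbm", {"num_disparities": 64})
print(json.dumps(manifest, indent=2))
```

The resulting dictionary could be merged into the run's summary file so that every metrics table can be traced back to the code and configuration that produced it.
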
Each benchmark run must save:

```text
git commit hash
Unity version
Python environment
algorithm version
algorithm parameters
simulator configuration
random seed
date and time
hardware information
runtime logs
```

Each output folder should contain:

```text
config.yaml
scene_parameters.json
algorithm_parameters.json
metrics.csv
summary.json
visualizations/
```

---

## 15. Dataset Format

Use a clear dataset format:

```text
sequence_0001/
│
├── left/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
│
├── right/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
│
├── depth_left/
│   ├── 000000.npy
│   ├── 000001.npy
│   └── ...
│
├── disparity_left/
│   ├── 000000.npy
│   ├── 000001.npy
│   └── ...
│
├── masks/
│   ├── tissue/
│   ├── instrument/
│   ├── specularity/
│   ├── occlusion/
│   └── smoke/
│
├── normals/
│   ├── 000000.npy
│   └── ...
│
├── calibration/
│   ├── intrinsics_left.json
│   ├── intrinsics_right.json
│   ├── extrinsics.json
│   └── stereo_calibration.json
│
└── metadata/
    ├── scene_parameters.json
    ├── random_seed.txt
    └── config.yaml
```

---

## 16. Ground Truth Rules

The ground truth must be geometrically consistent.

Depth and disparity must satisfy:

```text
disparity = focal_length_px * baseline_m / depth_m
```

Baseline and depth must share one unit (metres in the formula above, millimetres in `scene_parameters.json`), the resulting disparity is in pixels, and the relation assumes a rectified stereo pair.

The agent must verify:

```text
depth values are metric
invalid pixels are masked
occluded regions are labeled
left/right consistency is documented
camera intrinsics are correct
baseline units are correct
coordinate systems are documented
```

Do not apply image-to-image realism translation unless the workflow includes a validation step proving that depth/disparity labels remain geometrically valid.

---

## 17. Failure-Mode Taxonomy

The benchmark should classify failures into categories:

```text
specular highlight failure
low-texture failure
smoke or haze failure
blood contamination failure
tool-boundary failure
occlusion-boundary failure
deformation failure
motion-blur failure
overexposure failure
shadow failure
narrow-baseline failure
long-working-distance failure
```

For each algorithm, the report should answer:

```text
Where does the algorithm fail?
Why does it fail?
Is the failure local or global?
Is the failure stable across frames?
Does confidence estimation detect the failure?
Does the failure also appear on real data?
```

---

## 18. Minimum Viable Prototype

The minimum useful prototype should include:

```text
one synthetic organ surface
one stereo laparoscope
one light source
one specular tissue material
one moving tool
one deformation mode
RGB stereo export
depth export
disparity export
calibration export
OpenCV SGBM benchmark
one deep stereo model benchmark
metric report
error visualization
```

The minimum demonstration should show:

```text
Algorithm A vs Algorithm B under increasing specularity
Algorithm A vs Algorithm B under increasing smoke
Algorithm A vs Algorithm B near tool occlusions
Algorithm A vs Algorithm B on deforming tissue
```

---

## 19. Expected Research Output

The final project should support a paper or thesis with the following structure:

```text
Title
Abstract
Introduction
Related Work
Simulator Design
Benchmark Protocol
Ground Truth Generation
Algorithm Interface
Controlled Experiments
Real Dataset Validation
Results
Failure-Mode Analysis
Limitations
Conclusion
```

Possible title:

```text
A Unity-Based Controllable Benchmark for Failure-Mode Analysis of Stereo Depth Estimation in Laparoscopic Scenes
```

Alternative title:

```text
On-the-Fly Synthetic Benchmarking of Stereo Reconstruction Algorithms under Controlled Laparoscopic Imaging Conditions
```

---

## 20. Key Success Criteria

The workflow is successful if it produces:

```text
a working stereo laparoscopic simulator
accurate ground-truth depth and disparity
reproducible parameter sweeps
automatic stereo algorithm evaluation
clear failure-mode analysis
comparison with real laparoscopic datasets
useful plots and benchmark reports
documentation sufficient for other researchers to reproduce results
```

The project should be considered weak if it only produces visually appealing synthetic images without rigorous ground truth, controlled experiments, or validation against real datasets.

---

## 21. Agent Deliverables Checklist

The agent must prepare:

```text
[ ] literature summary
[ ] benchmark gap analysis
[ ] simulator architecture
[ ] Unity scene design
[ ] camera model specification
[ ] ground-truth export specification
[ ] dataset format specification
[ ] parameter sweep plan
[ ] algorithm plug-in interface
[ ] evaluation metric definitions
[ ] benchmark report template
[ ] real dataset validation plan
[ ] repository structure
[ ] development milestones
[ ] risk analysis
[ ] minimum viable prototype plan
[ ] final paper/thesis outline
```

---

## 22. Major Risks

The workflow must explicitly address these risks:

```text
synthetic-to-real gap
unrealistic tissue appearance
incorrect ground-truth disparity
invalid labels after post-processing
lack of real-data validation
too much focus on rendering instead of benchmarking
too few baseline algorithms
weak novelty compared with existing synthetic datasets
poor documentation
non-reproducible experiments
```

For each risk, define a mitigation strategy.

---

## 23. Recommended Initial Milestone

The first milestone should be:

> Build a minimal Unity stereo laparoscope scene that exports rectified left/right RGB images, metric depth, disparity, masks, and calibration, then benchmark OpenCV SGBM and one deep stereo model under a controlled specularity sweep.

This milestone should produce:

```text
one synthetic sequence
one parameter sweep
two algorithm outputs
one metrics table
one error heatmap
one short technical report
```

Only after this milestone is complete should the project expand to smoke, blood, deformation, tools, and real-dataset validation.
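
Because the milestone depends on exported depth and disparity agreeing with each other, a quick check against the relation in Section 16 is worth running before any algorithm is benchmarked. The sketch below is an assumption-laden illustration: the function names are hypothetical, depth and baseline are both taken in millimetres, and the stereo pair is assumed to be rectified.

```python
import numpy as np


def depth_to_disparity(depth_mm, focal_length_px, baseline_mm):
    """Convert metric depth to disparity in pixels; non-positive depth becomes NaN."""
    depth = np.asarray(depth_mm, dtype=np.float64)
    disparity = np.full(depth.shape, np.nan)
    valid = np.isfinite(depth) & (depth > 0)
    # disparity = focal_length * baseline / depth, with baseline and depth in the same unit.
    disparity[valid] = focal_length_px * baseline_mm / depth[valid]
    return disparity


def max_consistency_error(depth_mm, disparity_px, focal_length_px, baseline_mm):
    """Largest absolute gap between exported disparity and disparity derived from depth."""
    exported = np.asarray(disparity_px, dtype=np.float64)
    expected = depth_to_disparity(depth_mm, focal_length_px, baseline_mm)
    valid = np.isfinite(expected) & np.isfinite(exported)
    if not valid.any():
        raise ValueError("no pixels with both valid depth and valid disparity")
    return float(np.max(np.abs(expected[valid] - exported[valid])))


# Example with the values from scene_parameters.json: 800 px focal length, 5 mm baseline.
depth = np.array([[80.0, 40.0], [0.0, 160.0]])
disparity = depth_to_disparity(depth, focal_length_px=800, baseline_mm=5.0)
print(disparity)  # 80 mm depth maps to 800 * 5 / 80 = 50 px
print(max_consistency_error(depth, disparity, 800, 5.0))
```

A non-trivial error from this check on simulator output would point to a unit mismatch, a wrong focal length, or unrectified exports, all of which should be resolved before metrics are reported.
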