# Agent Workflow Instruction: Unity-Based Laparoscopic Stereo Benchmark Simulator

## 1. Mission

Prepare a workflow for building a **Unity-based 3D laparoscopic scene simulator** for **on-the-fly benchmarking of stereo-matching and 3D reconstruction algorithms**.

The simulator must generate controllable synthetic laparoscopic stereo scenes with exact ground truth, allowing researchers to test how different algorithms behave under controlled surgical imaging conditions such as specularity, low texture, smoke, blood, tissue deformation, lighting variation, camera motion, tool occlusion, and stereo-baseline changes.

The main research contribution should not be “a synthetic dataset” alone. The contribution should be:

> A controllable, reproducible, parameterized benchmarking environment for failure-mode analysis of stereo depth estimation and 3D reconstruction in laparoscopic scenes.

---

## 2. Core Research Question

Design the workflow around this question:

> How do different stereo-matching and 3D reconstruction algorithms fail under specific laparoscopic imaging conditions, and can a controllable simulator expose these failure modes more systematically than fixed real-world datasets?

The system should support:

* real-time or batch generation of stereo laparoscopic image pairs;
* exact metric ground truth;
* automatic algorithm evaluation;
* controlled parameter sweeps;
* comparison with real laparoscopic stereo benchmarks.

---

## 3. Target Users

The workflow should assume the simulator will be used by:

* computer vision researchers;
* medical image analysis researchers;
* surgical robotics researchers;
* PhD/MSc students working on laparoscopic 3D reconstruction;
* developers benchmarking stereo algorithms before testing on real datasets.

---

## 4. Required Simulator Outputs

For every generated frame or sequence, the simulator must export the following:

```text
left_rgb.png
right_rgb.png
depth_left.exr / depth_left.npy
depth_right.exr / depth_right.npy
disparity_left.npy
disparity_right.npy
surface_normals.npy
occlusion_mask.png
specularity_mask.png
instrument_mask.png
tissue_mask.png
semantic_labels.png
camera_intrinsics.json
camera_extrinsics.json
stereo_calibration.json
scene_parameters.json
```

The `scene_parameters.json` file must record all randomized and manually set parameters, including:

```json
{
  "baseline_mm": 5.0,
  "focal_length_px": 800,
  "working_distance_mm": 80,
  "camera_pitch_deg": 0,
  "camera_yaw_deg": 0,
  "light_intensity": 1.0,
  "light_angle_deg": 30,
  "tissue_specularity": 0.6,
  "tissue_roughness": 0.35,
  "smoke_density": 0.2,
  "blood_coverage": 0.1,
  "deformation_amplitude_mm": 3.0,
  "deformation_frequency_hz": 1.2,
  "tool_occlusion_ratio": 0.25,
  "motion_blur": false,
  "noise_level": 0.01,
  "random_seed": 12345
}
```
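
A typed loader can catch missing or misspelled keys before a benchmark run starts. The following is a minimal sketch, not a fixed schema: the field names are copied from the example above, while the `SceneParameters` class, the `load_scene_parameters` helper, and the assumption that the ratio-like parameters lie in `[0, 1]` are illustrative.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SceneParameters:
    """Typed mirror of scene_parameters.json (field names from the example above)."""
    baseline_mm: float
    focal_length_px: float
    working_distance_mm: float
    camera_pitch_deg: float
    camera_yaw_deg: float
    light_intensity: float
    light_angle_deg: float
    tissue_specularity: float
    tissue_roughness: float
    smoke_density: float
    blood_coverage: float
    deformation_amplitude_mm: float
    deformation_frequency_hz: float
    tool_occlusion_ratio: float
    motion_blur: bool
    noise_level: float
    random_seed: int

    def __post_init__(self) -> None:
        if min(self.baseline_mm, self.focal_length_px, self.working_distance_mm) <= 0:
            raise ValueError("baseline, focal length and working distance must be positive")
        # Assumption: ratio-like parameters are normalized to [0, 1].
        for name in ("smoke_density", "blood_coverage", "tool_occlusion_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [0, 1]")


def load_scene_parameters(text: str) -> SceneParameters:
    """Parse a JSON string and reject missing or unknown keys."""
    raw = json.loads(text)
    expected = {f.name for f in fields(SceneParameters)}
    missing = sorted(expected - set(raw))
    unknown = sorted(set(raw) - expected)
    if missing or unknown:
        raise ValueError(f"missing keys: {missing}; unknown keys: {unknown}")
    return SceneParameters(**raw)
```

Rejecting unknown keys is a deliberate choice in this sketch: a silently ignored typo in a parameter name would make a sweep look reproducible when it is not.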

---

## 5. Core Scene Components

The simulator should contain the following modular scene components.

### 5.1 Stereo Laparoscope Camera

Implement a virtual stereo laparoscope with configurable:

* stereo baseline;
* focal length;
* sensor size;
* image resolution;
* lens distortion;
* convergence angle;
* near/far clipping planes;
* working distance;
* camera pose;
* camera trajectory.

The camera should support:

* fixed stereo capture;
* moving stereo camera;
* rectified output;
* non-rectified raw output;
* exportable calibration parameters.

### 5.2 Tissue and Organ Scene

Create at least one deformable laparoscopic tissue scene.

Minimum viable scene:

* one abdominal organ or tissue surface;
* non-planar geometry;
* wet/specular material;
* subtle tissue texture;
* deformable surface;
* local folds, ridges, and valleys.

The tissue shader should support:

* diffuse color variation;
* roughness variation;
* specular highlights;
* wetness;
* subsurface-like appearance;
* procedural vessels or texture;
* optional blood patches.

### 5.3 Surgical Instruments

Add laparoscopic instruments that can:

* enter and leave the field of view;
* occlude tissue;
* touch or deform tissue;
* create difficult object boundaries;
* generate metallic specular highlights.

At minimum, include:

* grasper;
* forceps or generic tool shaft;
* optional needle holder or scissors.

### 5.4 Lighting System

Implement controllable endoscopic lighting:

* point or spot lights attached near the camera;
* adjustable light intensity;
* adjustable falloff;
* adjustable direction;
* asymmetric illumination;
* overexposure;
* shadow regions.

Lighting must be logged in the scene parameter file.

### 5.5 Surgical Artifacts

Add optional artifacts:

* smoke or haze;
* blood;
* bubbles;
* blur;
* image noise;
* compression artifacts;
* vignetting;
* lens distortion;
* saturation;
* specular bloom.

Each artifact must be controllable independently.

---

## 6. Parameter Sweep Design

The simulator must support automatic benchmark generation through parameter sweeps.

Define parameter groups:

### 6.1 Camera Parameters

```text
baseline_mm: [2, 4, 6, 8, 10]
working_distance_mm: [40, 60, 80, 100, 120]
focal_length_px: [600, 800, 1000, 1200]
camera_motion_speed: [static, slow, medium, fast]
```

### 6.2 Tissue Parameters

```text
texture_level: [low, medium, high]
tissue_specularity: [none, low, medium, high]
tissue_roughness: [low, medium, high]
deformation_amplitude_mm: [0, 1, 3, 5, 10]
deformation_frequency_hz: [0, 0.5, 1.0, 2.0]
```

### 6.3 Scene Difficulty Parameters

```text
smoke_density: [0, 0.1, 0.3, 0.5]
blood_coverage: [0, 0.05, 0.15, 0.3]
tool_occlusion_ratio: [0, 0.1, 0.25, 0.5]
light_angle_deg: [0, 15, 30, 45, 60]
image_noise: [none, low, medium, high]
motion_blur: [false, true]
```

The workflow must include scripts for generating benchmark subsets such as:

```text
baseline_sweep/
specularity_sweep/
smoke_sweep/
deformation_sweep/
tool_occlusion_sweep/
lighting_sweep/
combined_hard_cases/
```
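
The sweep subsets above can be generated from a single default configuration. The sketch below shows two patterns: a one-factor sweep that varies one parameter while holding the rest fixed, and a small grid for combined hard cases. The `DEFAULTS` values, the `sweep_name` key, and both function names are hypothetical; only the parameter names and sweep values come from this document.

```python
import itertools

# Assumption: illustrative defaults picked from the sweep grids in this section.
DEFAULTS = {
    "baseline_mm": 6,
    "working_distance_mm": 80,
    "focal_length_px": 800,
    "smoke_density": 0.0,
    "blood_coverage": 0.0,
    "tool_occlusion_ratio": 0.0,
    "random_seed": 12345,
}


def one_factor_sweep(defaults, name, values):
    """Vary a single parameter; every other parameter keeps its default."""
    if name not in defaults:
        raise KeyError(name)
    return [{**defaults, name: value, "sweep_name": f"{name}_sweep"} for value in values]


def grid_sweep(defaults, grid, sweep_name):
    """Full Cartesian product over the parameters listed in `grid`."""
    names = sorted(grid)
    configs = []
    for combo in itertools.product(*(grid[n] for n in names)):
        configs.append({**defaults, **dict(zip(names, combo)), "sweep_name": sweep_name})
    return configs


baseline_sweep = one_factor_sweep(DEFAULTS, "baseline_mm", [2, 4, 6, 8, 10])
combined_hard_cases = grid_sweep(
    DEFAULTS,
    {"smoke_density": [0.3, 0.5], "tool_occlusion_ratio": [0.25, 0.5]},
    "combined_hard_cases",
)
```

One-factor sweeps keep failure attribution clean because only one condition changes per curve. Full grids grow multiplicatively, so they are best reserved for a few deliberately chosen hard-case combinations.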

---

## 7. Algorithm Benchmarking Interface

Design a plug-in interface so different stereo-matching algorithms can be benchmarked automatically.

Each algorithm adapter should accept:

```text
left_rgb
right_rgb
camera_intrinsics
stereo_calibration
```

Each algorithm should output:

```text
predicted_disparity
predicted_depth
optional_confidence_map
runtime_ms
```

The benchmark runner should support both:

* classical stereo methods;
* deep-learning stereo methods.

Example algorithm categories:

```text
Block Matching
Semi-Global Matching
ELAS / LIBELAS-style methods
RAFT-Stereo
PSMNet
GC-Net
StereoNet
HITNet
laparoscopic-specific stereo methods
custom user algorithms
```

The agent should design a wrapper format such as:

```bash
python run_algorithm.py \
  --algorithm raft_stereo \
  --left path/to/left.png \
  --right path/to/right.png \
  --calib path/to/stereo_calibration.json \
  --output path/to/prediction/
```
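
Behind such a command-line wrapper, each algorithm can implement one small adapter class. The sketch below is one possible shape for that interface, not a prescribed design: `StereoAdapter`, `StereoResult`, `register`, and the calibration keys `focal_length_px` and `baseline_mm` are assumptions, and the registered adapter is a placeholder that returns a constant disparity rather than a real matcher.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Type

import numpy as np


@dataclass
class StereoResult:
    predicted_disparity: np.ndarray          # pixels, left view
    predicted_depth: np.ndarray              # same length unit as the baseline
    runtime_ms: float
    confidence_map: Optional[np.ndarray] = None


class StereoAdapter(ABC):
    name = "unnamed"

    @abstractmethod
    def compute_disparity(self, left_rgb: np.ndarray, right_rgb: np.ndarray) -> np.ndarray:
        """Return a float disparity map in pixels for the left view."""

    def run(self, left_rgb, right_rgb, calib) -> StereoResult:
        # Only the matcher itself is timed, not the depth conversion.
        start = time.perf_counter()
        disparity = np.asarray(self.compute_disparity(left_rgb, right_rgb), dtype=np.float64)
        runtime_ms = (time.perf_counter() - start) * 1000.0
        # Rectified-stereo relation; non-positive disparities become NaN depth.
        depth = np.full(disparity.shape, np.nan)
        positive = disparity > 0
        depth[positive] = calib["focal_length_px"] * calib["baseline_mm"] / disparity[positive]
        return StereoResult(disparity, depth, runtime_ms)


ADAPTERS: Dict[str, Type[StereoAdapter]] = {}


def register(cls):
    """Class decorator that makes an adapter selectable by name."""
    ADAPTERS[cls.name] = cls
    return cls


@register
class ConstantDisparityAdapter(StereoAdapter):
    """Placeholder adapter: constant disparity everywhere (for wiring tests only)."""
    name = "custom_template"

    def __init__(self, value_px: float = 50.0):
        self.value_px = value_px

    def compute_disparity(self, left_rgb, right_rgb):
        return np.full(left_rgb.shape[:2], self.value_px)
```

With this layout, `ADAPTERS["custom_template"]().run(left, right, calib)` returns all four outputs listed above, and a real method only has to override `compute_disparity`.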

---

## 8. Evaluation Metrics

The benchmark must compute metrics at several levels.

### 8.1 Disparity Metrics

```text
End-Point Error
Mean Absolute Disparity Error
Median Disparity Error
Bad-1px
Bad-2px
Bad-3px
Bad-5px
```
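
These disparity metrics can be computed from one masked error array. The sketch below assumes Bad-k is the fraction of valid pixels whose absolute disparity error exceeds k pixels; some benchmarks add a relative-error condition, which is not modeled here. For a scalar disparity map the end-point error reduces to the absolute disparity difference, so the mean EPE and the mean absolute disparity error coincide in this sketch. The function name and the `valid` mask argument are illustrative.

```python
import numpy as np


def disparity_metrics(pred, gt, valid, thresholds=(1, 2, 3, 5)):
    """Disparity errors over valid pixels; `valid` is a boolean mask."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    err = np.abs(pred - gt)[np.asarray(valid, dtype=bool)]
    if err.size == 0:
        raise ValueError("no valid pixels to evaluate")
    metrics = {
        "epe": float(err.mean()),               # mean absolute disparity error
        "median_error": float(np.median(err)),
    }
    for t in thresholds:
        # Fraction of valid pixels with error strictly greater than t pixels.
        metrics[f"bad_{t}px"] = float((err > t).mean())
    return metrics
```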

### 8.2 Depth Metrics

```text
Depth RMSE in mm
Depth MAE in mm
Absolute Relative Error
Squared Relative Error
Threshold Accuracy
Scale Drift
```
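
The first five depth metrics have widely used definitions, sketched below under the assumption that threshold accuracy means the fraction of pixels with `max(pred/gt, gt/pred)` below 1.25. Scale drift is not covered because this document does not define it. The function name and dictionary keys are illustrative.

```python
import numpy as np


def depth_metrics(pred_mm, gt_mm, valid, delta=1.25):
    """Depth errors in millimetres over valid pixels with positive depth."""
    pred = np.asarray(pred_mm, dtype=np.float64)
    gt = np.asarray(gt_mm, dtype=np.float64)
    # Non-positive depths would break the ratio terms, so they are excluded.
    keep = np.asarray(valid, dtype=bool) & (pred > 0) & (gt > 0)
    if not keep.any():
        raise ValueError("no valid pixels to evaluate")
    p, g = pred[keep], gt[keep]
    diff = p - g
    ratio = np.maximum(p / g, g / p)
    return {
        "rmse_mm": float(np.sqrt(np.mean(diff ** 2))),
        "mae_mm": float(np.mean(np.abs(diff))),
        "abs_rel": float(np.mean(np.abs(diff) / g)),
        "sq_rel": float(np.mean(diff ** 2 / g)),
        "threshold_accuracy": float(np.mean(ratio < delta)),
    }
```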

### 8.3 3D Reconstruction Metrics

```text
Point-cloud Chamfer Distance
Surface-to-surface distance
Normal consistency
Completeness
Accuracy
F-score at distance thresholds
```
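
Accuracy, completeness, F-score, and Chamfer distance all derive from nearest-neighbour distances between the predicted and ground-truth point clouds. The sketch below uses a brute-force distance matrix, which is only practical for small clouds; a KD-tree would be the usual choice at full resolution. Chamfer distance has several conventions in the literature. The one assumed here is the sum of the two mean unsquared nearest-neighbour distances. Function names are illustrative.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from every point in `src` (N, 3) to its nearest point in `dst` (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def reconstruction_metrics(pred_points, gt_points, tau_mm):
    pred = np.asarray(pred_points, dtype=np.float64)
    gt = np.asarray(gt_points, dtype=np.float64)
    pred_to_gt = nearest_distances(pred, gt)   # how accurate the prediction is
    gt_to_pred = nearest_distances(gt, pred)   # how complete the prediction is
    precision = float((pred_to_gt < tau_mm).mean())
    recall = float((gt_to_pred < tau_mm).mean())
    total = precision + recall
    return {
        "accuracy_mm": float(pred_to_gt.mean()),
        "completeness_mm": float(gt_to_pred.mean()),
        "chamfer_mm": float(pred_to_gt.mean() + gt_to_pred.mean()),
        "precision": precision,
        "recall": recall,
        "f_score": 2.0 * precision * recall / total if total > 0 else 0.0,
    }
```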

### 8.4 Temporal Metrics

For sequences:

```text
Temporal depth consistency
Frame-to-frame disparity jitter
Optical-flow-aware depth stability
Runtime stability
```
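
This document does not fix a formula for frame-to-frame disparity jitter, so the sketch below makes one explicit assumption: jitter is the mean absolute change of the signed disparity error between consecutive frames at the same pixel. Using the error rather than the raw prediction keeps genuine scene motion and deformation from being counted as instability. It does not compensate for pixel motion, which the optical-flow-aware metric above would have to handle. The function name is illustrative.

```python
import numpy as np


def disparity_error_jitter(pred_seq, gt_seq, valid_seq):
    """Mean |e_t - e_(t-1)| with e = pred - gt, over pixels valid in both frames.

    All inputs have shape (T, H, W) with T >= 2.
    """
    pred = np.asarray(pred_seq, dtype=np.float64)
    gt = np.asarray(gt_seq, dtype=np.float64)
    valid = np.asarray(valid_seq, dtype=bool)
    if pred.shape != gt.shape or pred.shape != valid.shape or pred.shape[0] < 2:
        raise ValueError("expected matching (T, H, W) arrays with T >= 2")
    err = pred - gt
    both = valid[1:] & valid[:-1]
    if not both.any():
        raise ValueError("no pixel is valid in two consecutive frames")
    return float(np.abs(err[1:] - err[:-1])[both].mean())
```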

### 8.5 Region-Stratified Metrics

Metrics must also be computed separately for:

```text
tissue regions
instrument regions
specular regions
blood regions
smoke-affected regions
shadow regions
occlusion boundaries
depth discontinuities
low-texture tissue
high-curvature tissue
```

This region-stratified evaluation is essential. The benchmark should reveal not only which algorithm is best on average, but also which visual or geometric condition causes failure.
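
Region stratification amounts to intersecting the validity mask with each region mask before averaging. The sketch below reports mean absolute disparity error per region together with the pixel count, so that a low error on a nearly empty region is not over-interpreted. The function name and the region keys in the usage note are illustrative.

```python
import numpy as np


def region_stratified_epe(pred, gt, valid, regions):
    """Mean absolute disparity error per named region.

    `regions` maps a region name to a boolean mask with the image shape.
    Regions with no valid pixels report NaN instead of raising.
    """
    err = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64))
    valid = np.asarray(valid, dtype=bool)
    report = {}
    for name, mask in regions.items():
        selected = valid & np.asarray(mask, dtype=bool)
        count = int(selected.sum())
        report[name] = {
            "pixels": count,
            "epe": float(err[selected].mean()) if count else float("nan"),
        }
    return report
```

A call such as `region_stratified_epe(pred, gt, valid, {"tissue": tissue_mask, "specular": specular_mask})` yields one row per region for the per-region error breakdown.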

---

## 9. Benchmark Reports

The workflow must generate automatic reports containing:

```text
summary table per algorithm
metric curves across parameter sweeps
failure-case visualizations
error heatmaps
depth/disparity overlays
runtime comparison
ranking by scenario
ranking by robustness
per-region error breakdown
```

Each benchmark report should include plots such as:

```text
Depth RMSE vs specularity
Bad-3px vs smoke density
Chamfer distance vs deformation amplitude
Runtime vs image resolution
Error near tool boundaries
Error in low-texture regions
```

The report should clearly identify:

```text
best average algorithm
most robust algorithm
fastest algorithm
best algorithm under smoke
best algorithm under specular highlights
best algorithm near instruments
worst failure modes per algorithm
```

---

## 10. Validation Against Real Datasets

The simulator workflow must include a validation stage against real laparoscopic stereo datasets.

Use these real benchmarks as external references:

```text
SCARED
SERV-CT
EndoAbS
Hamlyn laparoscopic datasets
StereoMIS
```

The agent should prepare a protocol for checking whether synthetic benchmark results correlate with real-data performance.

Important validation questions:

```text
Do algorithms that perform well in simulation also perform well on SCARED?
Do specularity failures in simulation match failures on real laparoscopic tissue?
Do smoke and tool occlusion scenarios produce realistic ranking changes?
Does the simulator overestimate performance?
Which synthetic parameters best predict real-world errors?
```

The goal is not to prove the simulator fully replaces real data. The goal is to show that it provides controlled failure-mode analysis that complements real datasets.
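
One simple way to approach the first validation question is a rank correlation between per-algorithm errors in simulation and on a real dataset. The sketch below implements Spearman's coefficient with average ranks for ties using only the standard library; `scipy.stats.spearmanr` offers the same statistic. Treating rank agreement as the validation signal is an assumption of this sketch, and with only a handful of algorithms the coefficient is a coarse indicator, not a statistical proof.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(sim_errors, real_errors):
    """Spearman rank correlation between two equally long error lists."""
    if len(sim_errors) != len(real_errors) or len(sim_errors) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = average_ranks(sim_errors), average_ranks(real_errors)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (sx * sy)
```

Each list holds one entry per algorithm, in the same algorithm order. A coefficient near 1 indicates that the simulator ranks algorithms similarly to the real benchmark.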

---

## 11. Repository Structure

Prepare the project with the following structure:

```text
laparo-stereo-sim/
│
├── unity_project/
│   ├── Assets/
│   ├── Packages/
│   ├── ProjectSettings/
│   └── README.md
│
├── benchmark_runner/
│   ├── algorithms/
│   │   ├── sgm/
│   │   ├── raft_stereo/
│   │   ├── psmnet/
│   │   └── custom_template/
│   │
│   ├── evaluation/
│   │   ├── disparity_metrics.py
│   │   ├── depth_metrics.py
│   │   ├── pointcloud_metrics.py
│   │   ├── temporal_metrics.py
│   │   └── region_metrics.py
│   │
│   ├── visualization/
│   │   ├── plot_metrics.py
│   │   ├── render_error_maps.py
│   │   └── generate_report.py
│   │
│   ├── configs/
│   │   ├── baseline_sweep.yaml
│   │   ├── specularity_sweep.yaml
│   │   ├── smoke_sweep.yaml
│   │   ├── deformation_sweep.yaml
│   │   ├── tool_occlusion_sweep.yaml
│   │   ├── lighting_sweep.yaml
│   │   └── combined_hard_cases.yaml
│   │
│   └── run_benchmark.py
│
├── datasets/
│   ├── synthetic/
│   ├── real/
│   │   ├── scared/
│   │   ├── serv_ct/
│   │   ├── endoabs/
│   │   ├── hamlyn/
│   │   └── stereomis/
│   └── README.md
│
├── reports/
│   ├── figures/
│   ├── tables/
│   └── benchmark_summary.md
│
├── docs/
│   ├── simulator_design.md
│   ├── benchmark_protocol.md
│   ├── algorithm_interface.md
│   └── validation_protocol.md
│
├── scripts/
│   ├── export_unity_sequence.py
│   ├── convert_depth_to_disparity.py
│   ├── rectify_stereo_pair.py
│   └── prepare_real_datasets.py
│
├── environment.yml
├── requirements.txt
└── README.md
```

---

## 12. Development Phases

### Phase 1: Literature and Requirement Review

Prepare a compact review of:

```text
existing laparoscopic stereo datasets
existing surgical simulators
existing stereo depth algorithms
evaluation metrics for stereo and reconstruction
known laparoscopic stereo failure modes
```

Deliverables:

```text
docs/literature_summary.md
docs/design_requirements.md
docs/benchmark_gap_analysis.md
```

### Phase 2: Minimal Unity Prototype

Build a simple Unity scene with:

```text
one deformable tissue surface
one stereo laparoscope
one controllable light source
left/right RGB export
depth export
camera calibration export
```

Deliverables:

```text
unity_project/
sample_output/
docs/prototype_notes.md
```

### Phase 3: Ground Truth Export

Add export of:

```text
depth maps
disparity maps
surface normals
segmentation masks
occlusion masks
camera parameters
scene parameter logs
```

Deliverables:

```text
sample_output/ground_truth/
scripts/convert_depth_to_disparity.py
docs/ground_truth_specification.md
```

### Phase 4: Scene Realism and Artifacts

Add:

```text
specular tissue material
low-texture tissue
blood patches
smoke
instrument occlusion
motion blur
noise
lighting variation
deformation
```

Deliverables:

```text
unity_project/realistic_scene/
docs/artifact_controls.md
sample_sequences/
```

### Phase 5: Benchmark Runner

Implement the algorithm interface and evaluation system.

Deliverables:

```text
benchmark_runner/run_benchmark.py
benchmark_runner/algorithms/custom_template/
benchmark_runner/evaluation/
benchmark_runner/visualization/
```

### Phase 6: Algorithm Integration

Integrate several baseline algorithms.

Minimum recommended set:

```text
OpenCV StereoBM
OpenCV StereoSGBM
RAFT-Stereo or equivalent deep model
one laparoscopic-specific method if available
one custom placeholder adapter
```

Deliverables:

```text
benchmark_runner/algorithms/
docs/algorithm_interface.md
reports/baseline_algorithm_results.md
```

### Phase 7: Controlled Experiments

Run parameter sweeps.

Minimum experiments:

```text
specularity sweep
smoke sweep
baseline sweep
deformation sweep
tool occlusion sweep
lighting sweep
combined hard-case sweep
```

Deliverables:

```text
reports/parameter_sweep_results.md
reports/figures/
reports/tables/
```

### Phase 8: Real Dataset Validation

Evaluate selected algorithms on real datasets and compare trends.

Deliverables:

```text
datasets/real/
scripts/prepare_real_datasets.py
reports/real_dataset_validation.md
reports/sim_to_real_correlation.md
```

### Phase 9: Final Research Packaging

Prepare final outputs:

```text
README.md
docs/methodology.md
docs/benchmark_protocol.md
reports/final_benchmark_report.md
paper_outline.md
demo_video_plan.md
```

---

## 13. Technical Requirements

### Unity Requirements

Use:

```text
Unity 2022 LTS or newer
HDRP if photorealistic rendering is needed
Unity Perception or custom ground-truth export tools
C# scripts for camera control and scene randomization
deterministic random seeds
headless or batch rendering if possible
```

The Unity project must support:

```text
manual scene preview
scripted batch generation
reproducible randomization
per-frame metadata export
config-driven experiments
```

### Python Requirements

Use Python for:

```text
benchmark orchestration
algorithm adapters
metric computation
plot generation
report generation
real dataset preparation
```

Recommended libraries:

```text
numpy
opencv-python
scipy
matplotlib
pandas
open3d
torch
torchvision
tqdm
pyyaml
scikit-image
```

---

## 14. Reproducibility Requirements

Every experiment must be reproducible.

Each benchmark run must save:

```text
git commit hash
Unity version
Python environment
algorithm version
algorithm parameters
simulator configuration
random seed
date and time
hardware information
runtime logs
```

Each output folder should contain:

```text
config.yaml
scene_parameters.json
algorithm_parameters.json
metrics.csv
summary.json
visualizations/
```
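
Most of the run metadata listed above can be collected automatically at the start of a benchmark run. The sketch below gathers it into one JSON-serializable dictionary. The function names and dictionary keys are hypothetical, the Unity version is passed in by the caller, and `sys.version` stands in for a full environment export.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def git_commit_hash() -> str:
    """Current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_manifest(simulator_config, algorithm_name, algorithm_parameters,
                       unity_version, random_seed):
    """Collect reproducibility metadata for one benchmark run."""
    return {
        "git_commit": git_commit_hash(),
        "unity_version": unity_version,           # supplied by the caller
        "python_version": sys.version,
        "algorithm": algorithm_name,
        "algorithm_parameters": algorithm_parameters,
        "simulator_configuration": simulator_config,
        "random_seed": random_seed,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hardware": {
            "platform": platform.platform(),
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
    }


def manifest_to_json(manifest) -> str:
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Writing the result of `manifest_to_json(...)` into the output folder next to `summary.json` keeps the provenance of every metrics file in one place.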

---

## 15. Dataset Format

Use a clear dataset format:

```text
sequence_0001/
│
├── left/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
│
├── right/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
│
├── depth_left/
│   ├── 000000.npy
│   ├── 000001.npy
│   └── ...
│
├── disparity_left/
│   ├── 000000.npy
│   ├── 000001.npy
│   └── ...
│
├── masks/
│   ├── tissue/
│   ├── instrument/
│   ├── specularity/
│   ├── occlusion/
│   └── smoke/
│
├── normals/
│   ├── 000000.npy
│   └── ...
│
├── calibration/
│   ├── intrinsics_left.json
│   ├── intrinsics_right.json
│   ├── extrinsics.json
│   └── stereo_calibration.json
│
└── metadata/
    ├── scene_parameters.json
    ├── random_seed.txt
    └── config.yaml
```
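
A small structural check can confirm that a generated sequence follows this layout before it enters a benchmark. The sketch below verifies that the per-frame folders exist, that they contain the same frame identifiers, and that two key metadata files are present. The selection of folders and files to check, and the `check_sequence` name, are assumptions that can be extended to masks and normals.

```python
from pathlib import Path
from typing import List

# Per-frame folders and their file suffixes, following the tree above.
FRAME_DIRS = {
    "left": ".png",
    "right": ".png",
    "depth_left": ".npy",
    "disparity_left": ".npy",
}
REQUIRED_FILES = [
    "calibration/stereo_calibration.json",
    "metadata/scene_parameters.json",
]


def check_sequence(root) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    root = Path(root)
    problems = []
    frames = {}
    for name, suffix in FRAME_DIRS.items():
        folder = root / name
        if folder.is_dir():
            frames[name] = {p.stem for p in folder.glob(f"*{suffix}")}
        else:
            problems.append(f"missing folder: {name}/")
    # Every folder must contain every frame id seen in any folder.
    all_ids = set().union(*frames.values())
    for name, ids in sorted(frames.items()):
        for frame_id in sorted(all_ids - ids):
            problems.append(f"{name}/ lacks frame {frame_id}")
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    return problems
```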

---

## 16. Ground Truth Rules

The ground truth must be geometrically consistent.

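For rectified image pairs, depth and disparity must satisfy the following relation, with baseline and depth expressed in the same length unit so that disparity is in pixels: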

```text
disparity = focal_length_px * baseline_m / depth_m
```

The agent must verify:

```text
depth values are metric
invalid pixels are masked
occluded regions are labeled
left/right consistency is documented
camera intrinsics are correct
baseline units are correct
coordinate systems are documented
```
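
The depth-disparity relation and the masking of invalid pixels can be verified automatically on every exported frame. As a worked example with the values from the `scene_parameters.json` sample, a 5 mm baseline, an 800 px focal length, and a surface at 80 mm give a disparity of 800 × 5 / 80 = 50 px. The sketch below assumes depth in millimetres, treats non-finite or non-positive depth as invalid, and uses a hypothetical tolerance of 0.001 px. Function names are illustrative.

```python
import numpy as np


def depth_to_disparity(depth_mm, focal_length_px, baseline_mm):
    """Convert metric depth to disparity; invalid depth yields NaN disparity."""
    depth = np.asarray(depth_mm, dtype=np.float64)
    valid = np.isfinite(depth) & (depth > 0)
    disparity = np.full(depth.shape, np.nan)
    disparity[valid] = focal_length_px * baseline_mm / depth[valid]
    return disparity, valid


def is_geometrically_consistent(depth_mm, disparity_px, focal_length_px,
                                baseline_mm, tol_px=1e-3):
    """True if the exported disparity matches the value implied by the depth."""
    expected, valid = depth_to_disparity(depth_mm, focal_length_px, baseline_mm)
    exported = np.asarray(disparity_px, dtype=np.float64)
    if not valid.any():
        return False
    return bool(np.all(np.abs(expected[valid] - exported[valid]) <= tol_px))
```

A unit mix-up, for example a baseline stored in metres while depth is in millimetres, shows up immediately as a factor-of-1000 mismatch in this check.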

Do not apply image-to-image realism translation unless the workflow includes a validation step proving that depth/disparity labels remain geometrically valid.

---

## 17. Failure-Mode Taxonomy

The benchmark should classify failures into categories:

```text
specular highlight failure
low-texture failure
smoke or haze failure
blood contamination failure
tool-boundary failure
occlusion-boundary failure
deformation failure
motion-blur failure
overexposure failure
shadow failure
narrow-baseline failure
long-working-distance failure
```

For each algorithm, the report should answer:

```text
Where does the algorithm fail?
Why does it fail?
Is the failure local or global?
Is the failure stable across frames?
Does confidence estimation detect the failure?
Does the failure also appear on real data?
```

---

## 18. Minimum Viable Prototype

The minimum useful prototype should include:

```text
one synthetic organ surface
one stereo laparoscope
one light source
one specular tissue material
one moving tool
one deformation mode
RGB stereo export
depth export
disparity export
calibration export
OpenCV SGBM benchmark
one deep stereo model benchmark
metric report
error visualization
```

The minimum demonstration should show:

```text
Algorithm A vs Algorithm B under increasing specularity
Algorithm A vs Algorithm B under increasing smoke
Algorithm A vs Algorithm B near tool occlusions
Algorithm A vs Algorithm B on deforming tissue
```

---

## 19. Expected Research Output

The final project should support a paper or thesis with the following structure:

```text
Title
Abstract
Introduction
Related Work
Simulator Design
Benchmark Protocol
Ground Truth Generation
Algorithm Interface
Controlled Experiments
Real Dataset Validation
Results
Failure-Mode Analysis
Limitations
Conclusion
```

Possible title:

```text
A Unity-Based Controllable Benchmark for Failure-Mode Analysis of Stereo Depth Estimation in Laparoscopic Scenes
```

Alternative title:

```text
On-the-Fly Synthetic Benchmarking of Stereo Reconstruction Algorithms under Controlled Laparoscopic Imaging Conditions
```

---

## 20. Key Success Criteria

The workflow is successful if it produces:

```text
a working stereo laparoscopic simulator
accurate ground-truth depth and disparity
reproducible parameter sweeps
automatic stereo algorithm evaluation
clear failure-mode analysis
comparison with real laparoscopic datasets
useful plots and benchmark reports
documentation sufficient for other researchers to reproduce results
```

The project should be considered weak if it only produces visually appealing synthetic images without rigorous ground truth, controlled experiments, or validation against real datasets.

---

## 21. Agent Deliverables Checklist

The agent must prepare:

```text
[ ] literature summary
[ ] benchmark gap analysis
[ ] simulator architecture
[ ] Unity scene design
[ ] camera model specification
[ ] ground-truth export specification
[ ] dataset format specification
[ ] parameter sweep plan
[ ] algorithm plug-in interface
[ ] evaluation metric definitions
[ ] benchmark report template
[ ] real dataset validation plan
[ ] repository structure
[ ] development milestones
[ ] risk analysis
[ ] minimum viable prototype plan
[ ] final paper/thesis outline
```

---

## 22. Major Risks

The workflow must explicitly address these risks:

```text
synthetic-to-real gap
unrealistic tissue appearance
incorrect ground-truth disparity
invalid labels after post-processing
lack of real-data validation
too much focus on rendering instead of benchmarking
too few baseline algorithms
weak novelty compared with existing synthetic datasets
poor documentation
non-reproducible experiments
```

For each risk, define a mitigation strategy.

---

## 23. Recommended Initial Milestone

The first milestone should be:

> Build a minimal Unity stereo laparoscope scene that exports rectified left/right RGB images, metric depth, disparity, masks, and calibration, then benchmark OpenCV SGBM and one deep stereo model under a controlled specularity sweep.

This milestone should produce:

```text
one synthetic sequence
one parameter sweep
two algorithm outputs
one metrics table
one error heatmap
one short technical report
```

Only after this milestone is complete should the project expand to smoke, blood, deformation, tools, and real-dataset validation.