Curriculum Learning & Progressive Hardening

Beyond Pass/Fail

A naive validation pipeline simply asks: did the policy pass or fail? SimOps goes further: its closed-loop feedback generates structured training signals that automatically refine the policy through curriculum learning and progressive hardening.

```mermaid
flowchart LR
    subgraph Naive["❌ Naive Approach"]
        N1["Train"] --> N2["Validate"]
        N2 -->|Pass| N3["Deploy"]
        N2 -->|Fail| N4["Retrain<br/>from scratch"]
    end

    subgraph SimOps["✅ SimOps Approach"]
        S1["Train"] --> S2["Validate"]
        S2 -->|Pass| S3["Deploy"]
        S2 -->|Fail| S4["Analyze failure"]
        S4 --> S5["Auto-adjust<br/>curriculum"]
        S5 --> S1
    end
```
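
In code, the difference is a loop: a failed validation feeds an adjusted curriculum back into the same training run instead of triggering a restart. The sketch below is a toy illustration; the function names and dynamics are hypothetical, not SimOps APIs.

```python
import random

def train(policy, difficulty):
    # Toy stand-in: training at higher difficulty buys more robustness.
    return {"robustness": policy["robustness"] + 0.08 * difficulty}

def validate(policy):
    # Toy stand-in: pass rate tracks robustness, with scenario-draw noise.
    return min(1.0, policy["robustness"] + random.uniform(-0.03, 0.03))

def closed_loop(policy, target=0.95, max_cycles=12):
    difficulty = 1
    for cycle in range(max_cycles):
        policy = train(policy, difficulty)
        pass_rate = validate(policy)
        print(f"cycle {cycle}: pass rate {pass_rate:.2f}")
        if pass_rate >= target:
            return policy          # deploy candidate
        difficulty += 1            # auto-adjust curriculum, keep training
    return policy

closed_loop({"robustness": 0.55})
```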

What Is Curriculum Learning?

Curriculum learning structures the training process from easy to hard, mirroring how humans learn. Instead of throwing the full complexity of a task at the agent from the start, the training environment gradually increases difficulty.

In SimOps, this curriculum is not manually designed — it is automatically generated from validation feedback.

```mermaid
flowchart TD
    subgraph CL["Automated Curriculum Pipeline"]
        V["🔬 Validation Results"] --> FA["📋 Failure Analysis"]
        FA --> FC["🏷️ Failure Classification"]
        FC --> CG["📐 Curriculum Generator"]
        CG --> L1["Level 1: Nominal conditions"]
        CG --> L2["Level 2: Mild perturbations"]
        CG --> L3["Level 3: Edge cases"]
        CG --> L4["Level 4: Adversarial scenarios"]
        L1 --> TR["🏋️ Training Simulator"]
        L2 --> TR
        L3 --> TR
        L4 --> TR
    end
```
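
A curriculum generator in this pipeline might turn classified failures into sampling weights over the four levels, so training oversamples wherever the policy is weakest. The failure labels and the weighting rule below are illustrative assumptions, not the SimOps implementation:

```python
from collections import Counter

# Map each (hypothetical) failure class to the curriculum level that targets it.
LEVEL_FOR_FAILURE = {
    "parameter_sensitivity": 2,    # mild perturbations
    "rare_state": 3,               # edge cases
    "adversarial_disturbance": 4,  # adversarial scenarios
}

def level_weights(failures: list[str], n_levels: int = 4) -> list[float]:
    """Return sampling weights over curriculum levels 1..n_levels.

    Level 1 (nominal) always keeps some mass so the policy does not
    forget the base task; the rest follows the failure distribution.
    """
    counts = Counter(LEVEL_FOR_FAILURE.get(f, 1) for f in failures)
    total = sum(counts.values()) or 1
    weights = [0.25]  # floor for level 1
    for level in range(2, n_levels + 1):
        weights.append(0.75 * counts.get(level, 0) / total)
    norm = sum(weights)
    return [w / norm for w in weights]

print(level_weights(["rare_state", "rare_state", "parameter_sensitivity"]))
# -> [0.25, 0.25, 0.5, 0.0]: training shifts toward edge cases
```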

Curriculum Dimensions

The curriculum adjusts along multiple axes simultaneously:

| Dimension | Easy | Hard (example) |
|---|---|---|
| Physics perturbation | Nominal parameters | ±30% mass, friction, damping |
| Sensor noise | Clean signals | Realistic noise + dropout |
| Task complexity | Static targets | Moving, accelerating targets |
| Environmental variation | Single environment | Randomized lighting, obstacles |
| Timing pressure | Relaxed deadlines | Real-time constraints |
| Failure recovery | No disturbances | External pushes, sensor failure |
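
One concrete encoding of these axes is a single difficulty knob that interpolates every dimension between its easy and hard endpoints. The field names and ranges below are illustrative, not a SimOps schema:

```python
from dataclasses import dataclass

@dataclass
class CurriculumLevel:
    """Illustrative encoding of the curriculum axes at one difficulty."""
    difficulty: float             # 0.0 = easy, 1.0 = hard
    physics_perturbation: float   # fraction of nominal; 0.30 = ±30%
    sensor_noise_std: float       # noise scale; 0.0 = clean signals
    sensor_dropout_prob: float    # probability a reading is lost
    target_speed_max: float       # m/s; 0.0 = static targets
    randomize_environment: bool   # lighting, obstacles, etc.
    deadline_scale: float         # 1.0 = real-time, >1.0 = relaxed
    disturbance_force_max: float  # N; external pushes

def level_at(difficulty: float) -> CurriculumLevel:
    """Linearly interpolate every axis between easy (0) and hard (1)."""
    d = max(0.0, min(1.0, difficulty))
    return CurriculumLevel(
        difficulty=d,
        physics_perturbation=0.30 * d,
        sensor_noise_std=0.05 * d,
        sensor_dropout_prob=0.10 * d,
        target_speed_max=2.0 * d,
        randomize_environment=d > 0.5,
        deadline_scale=2.0 - d,  # relaxed (2x) down to real-time (1x)
        disturbance_force_max=50.0 * d,
    )

print(level_at(0.0))  # nominal conditions
print(level_at(1.0))  # full adversarial settings
```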

Progressive Hardening

Progressive hardening is the process by which a policy becomes increasingly robust through iterative validation-training cycles. Each cycle exposes new weaknesses, which are then addressed in the next training round.

```mermaid
flowchart TD
    subgraph Cycle1["Cycle 1: Basic Competence"]
        C1T["Train on nominal<br/>conditions"]
        C1V["Validate"]
        C1R["Result: 70% pass<br/>Fails on fast targets"]
        C1T --> C1V --> C1R
    end

    subgraph Cycle2["Cycle 2: Speed Robustness"]
        C2T["Train with expanded<br/>speed range"]
        C2V["Validate"]
        C2R["Result: 85% pass<br/>Fails on sensor noise"]
        C2T --> C2V --> C2R
    end

    subgraph Cycle3["Cycle 3: Sensor Robustness"]
        C3T["Train with realistic<br/>sensor artifacts"]
        C3V["Validate"]
        C3R["Result: 93% pass<br/>Fails on edge contacts"]
        C3T --> C3V --> C3R
    end

    subgraph Cycle4["Cycle 4: Contact Robustness"]
        C4T["Train with diverse<br/>contact scenarios"]
        C4V["Validate"]
        C4R["Result: 97% pass ✅<br/>Ready for hardware"]
        C4T --> C4V --> C4R
    end

    Cycle1 --> Cycle2 --> Cycle3 --> Cycle4
```
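
The glue between cycles is the selection step: the dominant failure mode in one cycle's report decides the next cycle's training emphasis. A minimal sketch, with made-up failure-mode labels that mirror the cycles above:

```python
from collections import Counter

# Hypothetical mapping from dominant failure mode to the next training focus.
NEXT_FOCUS = {
    "fast_target": "expand target speed range",
    "sensor_noise": "train with realistic sensor artifacts",
    "edge_contact": "train with diverse contact scenarios",
}

def next_cycle_focus(failure_modes: list[str]) -> str:
    """Pick the training emphasis that targets the most common failure."""
    if not failure_modes:
        return "hold curriculum; policy passed"
    dominant, _ = Counter(failure_modes).most_common(1)[0]
    return NEXT_FOCUS.get(dominant, "expand domain randomization")

print(next_cycle_focus(["fast_target", "fast_target", "sensor_noise"]))
# -> "expand target speed range"
```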

The Hardening Gradient

Each hardening cycle operates at a specific feedback level:

```mermaid
graph BT
    subgraph Levels["Feedback Levels"]
        L1["Level 1: Reward Adjustment<br/>━━━━━━━━━━━━━━━━━━━<br/>Modify reward weights based on<br/>which failure modes dominate"]
        L2["Level 2: Domain Randomization<br/>━━━━━━━━━━━━━━━━━━━<br/>Expand parameter ranges where<br/>validation found brittleness"]
        L3["Level 3: Scenario Injection<br/>━━━━━━━━━━━━━━━━━━━<br/>Add specific failure scenarios<br/>as training environments"]
        L4["Level 4: Architecture Adaptation<br/>━━━━━━━━━━━━━━━━━━━<br/>Suggest network architecture changes<br/>when policy capacity is insufficient"]
    end

    L1 --> L2 --> L3 --> L4
```

| Level | Trigger | Action | Impact |
|---|---|---|---|
| 1. Reward Adjustment | Consistent failure pattern | Re-weight reward components | Low (fastest iteration) |
| 2. Domain Randomization | Brittle at parameter boundaries | Expand randomization ranges | Medium (broader coverage) |
| 3. Scenario Injection | Specific failure modes | Add targeted training scenarios | High (addresses root causes) |
| 4. Architecture Adaptation | Policy capacity exhausted | Modify network size/structure | Highest (fundamental change) |
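
Read the Trigger column as an escalation rule: apply the cheapest level whose trigger fires. A sketch of that dispatch, with the triggers expressed as flags on a hypothetical failure report:

```python
from dataclasses import dataclass

@dataclass
class FailureReport:
    """Hypothetical summary flags produced by the failure analyzer."""
    consistent_pattern: bool     # same failure mode recurs across runs
    brittle_at_boundaries: bool  # fails near domain-randomization edges
    specific_scenarios: bool     # reproducible, nameable failure scenarios
    capacity_exhausted: bool     # training plateaued, failures stay diverse

def feedback_level(report: FailureReport) -> int:
    """Escalate only as far as the evidence requires (cheapest fix first)."""
    if report.consistent_pattern:
        return 1  # reward adjustment
    if report.brittle_at_boundaries:
        return 2  # expand domain randomization
    if report.specific_scenarios:
        return 3  # inject targeted scenarios
    if report.capacity_exhausted:
        return 4  # architecture adaptation
    return 0      # no change needed

print(feedback_level(FailureReport(False, True, False, False)))  # -> 2
```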

Automated Feedback Loop Detail

From Validation Failure to Training Signal

```mermaid
sequenceDiagram
    participant VS as 🔬 Validation Sim
    participant FA as 📋 Failure Analyzer
    participant CG as 📐 Curriculum Generator
    participant TS as 🏋️ Training Sim

    VS->>FA: Validation results + telemetry
    FA->>FA: Classify failure modes
    FA->>FA: Compute severity scores
    FA->>CG: Failure report + root causes

    CG->>CG: Determine feedback level
    CG->>CG: Generate curriculum update

    alt Level 1: Reward
        CG->>TS: Updated reward weights
    else Level 2: Randomization
        CG->>TS: Expanded DR parameters
    else Level 3: Scenario
        CG->>TS: New training scenarios
    else Level 4: Architecture
        CG->>TS: Architecture recommendations
    end

    TS->>TS: Resume training with updates
    TS->>VS: New policy candidate
```
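
The alt block at the end of the sequence is a dispatch on feedback level; each branch sends a differently shaped update to the training simulator. A sketch with hypothetical payload shapes:

```python
def curriculum_update(level: int, report: dict) -> dict:
    """Package the curriculum generator's output for the training sim.

    The payload keys are illustrative assumptions, not a SimOps protocol.
    """
    if level == 1:
        return {"kind": "reward_weights",
                "weights": report.get("suggested_weights", {})}
    if level == 2:
        return {"kind": "dr_ranges",
                "ranges": report.get("expanded_ranges", {})}
    if level == 3:
        return {"kind": "scenarios",
                "scenarios": report.get("failure_scenarios", [])}
    if level == 4:
        return {"kind": "architecture",
                "recommendation": report.get("arch_note", "widen policy net")}
    return {"kind": "noop"}

update = curriculum_update(3, {"failure_scenarios": ["fast_target_low_light"]})
print(update)  # {'kind': 'scenarios', 'scenarios': ['fast_target_low_light']}
```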

Failure Classification Taxonomy

The failure analyzer categorizes each failure into a structured taxonomy:

| Category | Subcategories | Typical Feedback Level |
|---|---|---|
| Kinematic failure | Joint limits, workspace boundary, singularity | Level 1–2 |
| Dynamic failure | Balance loss, excessive force, velocity limit | Level 2–3 |
| Contact failure | Missed contact, slip, unintended collision | Level 2–3 |
| Perception failure | Object misdetection, depth error, latency | Level 2–3 |
| Planning failure | Suboptimal trajectory, timing error | Level 1–3 |
| Generalization failure | Works in training distribution, fails outside | Level 3–4 |
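
In code, the taxonomy reduces to an enumeration plus a mapping to the feedback levels each category typically triggers; the names below are illustrative:

```python
from enum import Enum

class FailureCategory(Enum):
    KINEMATIC = "kinematic"            # joint limits, workspace, singularity
    DYNAMIC = "dynamic"                # balance loss, force/velocity limits
    CONTACT = "contact"                # missed contact, slip, collision
    PERCEPTION = "perception"          # misdetection, depth error, latency
    PLANNING = "planning"              # suboptimal trajectory, timing error
    GENERALIZATION = "generalization"  # out-of-distribution failures

# Typical feedback-level range per category (from the table above).
TYPICAL_LEVELS = {
    FailureCategory.KINEMATIC: (1, 2),
    FailureCategory.DYNAMIC: (2, 3),
    FailureCategory.CONTACT: (2, 3),
    FailureCategory.PERCEPTION: (2, 3),
    FailureCategory.PLANNING: (1, 3),
    FailureCategory.GENERALIZATION: (3, 4),
}

lo, hi = TYPICAL_LEVELS[FailureCategory.GENERALIZATION]
print(f"generalization failures escalate to levels {lo}-{hi}")
```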

Convergence Monitoring

SimOps tracks hardening progress across cycles to detect convergence and identify diminishing returns:

```mermaid
graph LR
    subgraph Monitoring["Convergence Dashboard"]
        direction TB
        M1["📈 Pass rate trend across cycles"]
        M2["📊 Failure mode distribution shift"]
        M3["🎯 Gap metrics convergence"]
        M4["⏱️ Training efficiency per cycle"]
        M5["🔄 New failure discovery rate"]
    end
```
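
Two of these signals fall directly out of per-cycle records: the pass-rate trend and the rate at which previously unseen failure modes appear. A sketch over hypothetical cycle history:

```python
# Illustrative per-cycle records: (pass_rate, failure modes seen that cycle).
history = [
    (0.70, {"fast_target"}),
    (0.85, {"sensor_noise"}),
    (0.93, {"edge_contact", "sensor_noise"}),
    (0.97, {"edge_contact"}),
]

def pass_rate_trend(history):
    """Per-cycle improvement; a shrinking trend signals diminishing returns."""
    rates = [r for r, _ in history]
    return [round(b - a, 3) for a, b in zip(rates, rates[1:])]

def new_failure_discovery(history):
    """How many never-before-seen failure modes each cycle surfaced."""
    seen, counts = set(), []
    for _, modes in history:
        counts.append(len(modes - seen))
        seen |= modes
    return counts

print(pass_rate_trend(history))        # [0.15, 0.08, 0.04] -- flattening
print(new_failure_discovery(history))  # [1, 1, 1, 0] -- discovery drying up
```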

Convergence Criteria

A policy is considered hardened when all of the following hold (a checking sketch follows the list):

  1. Pass rate exceeds target threshold (e.g., > 95%) across all scenario categories
  2. No new failure modes discovered in the last N validation runs
  3. Gap metrics are within bounded tolerances (see Sim-to-Real Gap Quantification)
  4. Marginal improvement per cycle drops below efficiency threshold
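
Putting the four criteria together, the convergence check is a conjunction of threshold tests. The thresholds and argument names below are illustrative defaults, not SimOps values:

```python
def is_hardened(category_pass_rates: dict, new_failures_last_n: int,
                gap_metrics: dict, marginal_gains: list,
                pass_target: float = 0.95, gap_tol: float = 0.1,
                min_gain: float = 0.01) -> bool:
    """Conjunction of the four convergence criteria (illustrative thresholds)."""
    return (
        all(r > pass_target for r in category_pass_rates.values())  # criterion 1
        and new_failures_last_n == 0                                 # criterion 2
        and all(abs(g) <= gap_tol for g in gap_metrics.values())     # criterion 3
        and max(marginal_gains[-2:]) < min_gain                      # criterion 4
    )

print(is_hardened(
    category_pass_rates={"nominal": 0.99, "edge": 0.97, "adversarial": 0.96},
    new_failures_last_n=0,
    gap_metrics={"contact_force": 0.04, "tracking_error": 0.07},
    marginal_gains=[0.15, 0.08, 0.009, 0.005],
))  # -> True: all four criteria satisfied
```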

Key Insight

Progressive hardening transforms validation from a binary gate into a continuous improvement engine. Each failure makes the next policy stronger, and the system knows when it has converged.

Comparison: Manual vs SimOps Hardening

| Aspect | Manual Approach | SimOps Automated |
|---|---|---|
| Failure analysis | Engineer inspects logs | Automated classification |
| Curriculum design | Hand-crafted difficulty levels | Auto-generated from failures |
| Feedback speed | Days (human in the loop) | Minutes (automated pipeline) |
| Coverage | Limited by engineer intuition | Systematic and exhaustive |
| Reproducibility | Low (depends on the individual) | High (deterministic pipeline) |
| Convergence detection | Subjective judgment | Quantitative criteria |