Meta × PyTorch × Scaler OpenEnv Hackathon · April 2026

Training AI agents to actually support customers

A 3-level hierarchical multi-agent RL environment — support agent, supervisor, manager — trained end-to-end with GRPO on real Indian enterprise support scenarios including Hinglish, policy drift, and live DB lookups.

Headline metrics: gain over GPT-4 baseline (pp) · 3 levels of agent hierarchy · curriculum training stages · resolution rate
Architecture

3-level hierarchy that mirrors human orgs

Every L1 action is held pending until L2 reviews it. Agents can't skip the loop. Authority is enforced, not optional.

L1 · Support Agent
  • Respond & info-gather
  • Query live order DB
  • Issue refund ≤ ₹500
  • Escalate to L2

Empathy 30% · Accuracy 25% · Resolution 25% · Efficiency 20%

L2 · Supervisor
  • Approve or reject L1 action
  • Give corrective feedback
  • Adjust refund ceiling
  • Escalate to L3

Oversight quality 35% · Escalation fit 30% · Policy 20%

L3 · Manager
  • Final policy authority
  • Approve large refunds
  • Override L2 decisions
  • Resolve VIP tickets

Decision quality 45% · Resolution 30% · Decisiveness 25%
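
The percentages on the three role cards read as weights in a per-role reward. A minimal sketch of that weighted sum follows, assuming each component is scored in [0, 1]; the component key names and the `role_reward` helper are hypothetical, not the environment's actual API. The L2 weights are copied as listed and sum to 0.85, so a perfect L2 score in this sketch is 0.85 rather than 1.0.

```python
# Weights copied from the role cards; key names are illustrative assumptions.
ROLE_WEIGHTS = {
    "L1": {"empathy": 0.30, "accuracy": 0.25, "resolution": 0.25, "efficiency": 0.20},
    "L2": {"oversight_quality": 0.35, "escalation_fit": 0.30, "policy": 0.20},
    "L3": {"decision_quality": 0.45, "resolution": 0.30, "decisiveness": 0.25},
}


def role_reward(role: str, scores: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    weights = ROLE_WEIGHTS[role]
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in weights.items())


# Example: an L1 turn that is accurate but only partly resolves the ticket.
l1_reward = role_reward(
    "L1",
    {"empathy": 0.8, "accuracy": 1.0, "resolution": 0.5, "efficiency": 0.6},
)
print(f"L1 reward: {l1_reward:.3f}")
```

Keeping the weights in one table makes it easy to audit that each role's components add up as intended before training starts.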

💬 Customer: opens ticket
🤝 L1 Agent: responds and queries DB
🔍 L2 Review: approve / reject / escalate
👑 L3 Manager: final authority if needed
Resolved: reward computed → GRPO
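
The flow above can be sketched as a small review gate: every L1 action starts as pending and only changes state once L2 has ruled on it. This is an illustration under stated assumptions, not the environment's code. `Action`, `Verdict`, `l2_review`, and `step` are hypothetical names, and the rule-based reviewer stands in for what is a trained L2 policy in the project; only the ₹500 L1 refund ceiling comes from the role card.

```python
from dataclasses import dataclass
from enum import Enum

L1_REFUND_CEILING_INR = 500  # from the L1 card: "Issue refund ≤ ₹500"


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"


@dataclass
class Action:
    kind: str                  # e.g. "respond" or "refund" (illustrative)
    refund_inr: int = 0
    status: str = "pending"    # every L1 action starts out pending


def l2_review(action: Action, ceiling: int = L1_REFUND_CEILING_INR) -> Verdict:
    """Toy stand-in for L2: refunds above the ceiling go up to L3."""
    if action.kind == "refund" and action.refund_inr > ceiling:
        return Verdict.ESCALATE
    return Verdict.APPROVE


def step(action: Action) -> Action:
    """Apply L2's verdict; nothing executes while the action is pending."""
    outcome = {
        Verdict.APPROVE: "executed",
        Verdict.REJECT: "rejected",
        Verdict.ESCALATE: "sent_to_l3",
    }
    action.status = outcome[l2_review(action)]
    return action


small = step(Action(kind="refund", refund_inr=400))
large = step(Action(kind="refund", refund_inr=2000))
print(small.status, large.status)
```

Because `step` is the only place a status changes, an L1 action cannot take effect without passing through the L2 review.
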
Results

+15–19pp over the baseline on every task

An 8B model trained with a GRPO curriculum outperforms the 70B NIM baseline by 15–19 percentage points on every task, despite being 8.75× smaller.

+90% relative gain in correct escalation on the hard task (41% → 78%)
+85% SLA compliance gain on full hierarchy
+118% Hinglish comprehension improvement

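
The escalation figure mixes two units, so a quick check helps: going from 41% to 78% is a 37 percentage point gain, which is about a 90% improvement relative to the starting value. The helper names below are illustrative.

```python
def relative_gain_pct(before: float, after: float) -> float:
    """Improvement as a percentage of the starting value."""
    return (after - before) / before * 100.0


def absolute_gain_pp(before_pct: float, after_pct: float) -> float:
    """Improvement in percentage points."""
    return after_pct - before_pct


print(round(relative_gain_pct(41, 78)))  # 90
print(absolute_gain_pp(41, 78))          # 37
```
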
Task                        NIM 70B Baseline   Ours (8B + GRPO)   Δ
easy                        0.72               0.88               +16pp
medium                      0.61               0.79               +18pp
hard                        0.45               0.64               +19pp
nightmare                   0.38               0.53               +15pp
curriculum_basic            0.69               0.84               +15pp
curriculum_supervisor       0.54               0.71               +17pp
curriculum_full_hierarchy   0.41               0.58               +17pp
curriculum_nightmare        0.29               0.44               +15pp
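
The Δ column and the size ratio can be reproduced from the numbers above. A short sketch, with scores copied from the table and the `delta_pp` helper as an illustrative name:

```python
# (baseline, ours) per task, copied from the results table.
SCORES = {
    "easy": (0.72, 0.88),
    "medium": (0.61, 0.79),
    "hard": (0.45, 0.64),
    "nightmare": (0.38, 0.53),
    "curriculum_basic": (0.69, 0.84),
    "curriculum_supervisor": (0.54, 0.71),
    "curriculum_full_hierarchy": (0.41, 0.58),
    "curriculum_nightmare": (0.29, 0.44),
}


def delta_pp(baseline: float, ours: float) -> int:
    """Score difference in whole percentage points."""
    return round((ours - baseline) * 100)


deltas = {task: delta_pp(*pair) for task, pair in SCORES.items()}
size_ratio = 70 / 8  # 70B baseline vs 8B trained model

for task, delta in deltas.items():
    print(f"{task:<28}+{delta}pp")
print(f"range: +{min(deltas.values())} to +{max(deltas.values())}pp, size ratio {size_ratio}x")
```
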
Training Curves

Real GRPO run — 40 steps logged

Qwen2.5-1.5B · Colab T4 · 40 steps · curriculum_basic · 0.6% mean invalid

Reward: Baseline 0.136 → best eval 0.152. Final reward 0.240 at step 40.
Loss: Stays stable throughout. No divergence or collapse.
Learning Rate: Cosine annealing: 5e-5 → 5e-6 over 40 steps.
Invalid Rate: Mean 0.6%, well below 90% collapse threshold.
Eval Scores: Best checkpoint 0.152 at step 20 vs baseline 0.136.
Before vs After: Trained (green) beats baseline (red) every eval episode.
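
The learning-rate panel describes cosine annealing from 5e-5 to 5e-6 over 40 steps. A minimal sketch of the standard closed form with those endpoints follows. It assumes no warmup and that step 0 is the first step, neither of which the page states, and `cosine_lr` is an illustrative helper rather than the run's actual scheduler.

```python
import math

LR_MAX = 5e-5      # starting learning rate from the panel caption
LR_MIN = 5e-6      # final learning rate from the panel caption
TOTAL_STEPS = 40   # length of the logged run


def cosine_lr(step: int, lr_max: float = LR_MAX, lr_min: float = LR_MIN,
              total_steps: int = TOTAL_STEPS) -> float:
    """Cosine annealing: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 20, 30, 40):
    print(step, f"{cosine_lr(step):.2e}")
```

The schedule decays slowly at first, fastest around the midpoint (2.75e-5 at step 20), then flattens as it approaches the floor.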

Full L40S run (Llama-3.1-8B, 150 steps): reward reached 0.709, including episodes with final=1.000, on curriculum_supervisor. Best checkpoint: 0.531 at step 40.
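
For readers new to GRPO, the core idea behind the runs above is that each sampled response is scored against the other responses to the same prompt instead of against a learned value function. The sketch below shows only that group-relative advantage step as it is commonly described. It uses the population standard deviation and a small epsilon, both assumptions, and leaves out the clipped policy objective and KL term; `group_relative_advantages` is an illustrative name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for the same ticket, scored by the environment.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 3) for a in advantages])
```

Responses that beat their group's average get a positive advantage and are reinforced; when every response in a group scores the same, all advantages are zero and that group contributes no gradient signal.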

Live Demo Available

See the agents in action

Watch the 3-level hierarchy handle real customer tickets — including edge cases, policy overrides, and escalations — in our live interactive demo.