Meta × PyTorch × Scaler OpenEnv Hackathon · April 2026

Training AI agents to actually support customers

A 3-level hierarchical multi-agent RL environment (support agent, supervisor, manager) trained end-to-end with GRPO on real Indian enterprise support scenarios, including Hinglish, policy drift, and live DB lookups.


3-level agent hierarchy · staged curriculum training

Architecture

3-level hierarchy that mirrors human orgs

Every L1 action is held pending until L2 reviews it. Agents can't skip the loop. Authority is enforced, not optional.

L1 · Support Agent

Respond & info-gather
Query live order DB
Issue refund ≤ ₹500
Escalate to L2

Empathy 30% · Accuracy 25% · Resolution 25% · Efficiency 20%

L2 · Supervisor

Approve or reject L1 action
Give corrective feedback
Adjust refund ceiling
Escalate to L3

Oversight quality 35% · Escalation fit 30% · Policy 20%

L3 · Manager

Final policy authority
Approve large refunds
Override L2 decisions
Resolve VIP tickets

Decision quality 45% · Resolution 30% · Decisiveness 25%
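
The percentages on each role card read as weights in a per-level reward. A minimal sketch of how such a weighted score could be computed is shown here. The component key names and the `level_reward` helper are hypothetical, and the project's actual reward code may differ. The L2 weights listed above sum to 85%, so this sketch divides by the total weight, which is an assumption and not something the page states.

```python
# Weights copied from the role cards above. The key names are hypothetical.
WEIGHTS = {
    "L1": {"empathy": 0.30, "accuracy": 0.25, "resolution": 0.25, "efficiency": 0.20},
    "L2": {"oversight_quality": 0.35, "escalation_fit": 0.30, "policy": 0.20},
    "L3": {"decision_quality": 0.45, "resolution": 0.30, "decisiveness": 0.25},
}


def level_reward(level, scores):
    """Weighted mean of component scores, each expected in [0, 1].

    Dividing by the total weight is an assumption made so that a level whose
    listed weights do not sum to 1 (L2 here) still yields a score in [0, 1].
    """
    weights = WEIGHTS[level]
    total = sum(weights.values())
    return sum(w * scores[name] for name, w in weights.items()) / total


if __name__ == "__main__":
    l1_scores = {"empathy": 1.0, "accuracy": 0.5, "resolution": 0.5, "efficiency": 0.0}
    print(f"L1 reward: {level_reward('L1', l1_scores):.2f}")
```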

💬 Customer: opens ticket

🤝 L1 Agent: responds and queries DB

🔍 L2 Review: approve / reject / escalate

👑 L3 Manager: final authority if needed

✅ Resolved: reward computed → GRPO
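
The rule that every L1 action stays pending until L2 reviews it can be sketched as a small gate. This is an illustration under assumptions, not the environment's real API: `Ticket`, `Action`, `Verdict`, and the error types are hypothetical names. Only the ₹500 L1 refund ceiling and the approve / reject / escalate verdicts come from the page.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Action:
    kind: str            # hypothetical action names, e.g. "respond" or "refund"
    amount_inr: int = 0


class Ticket:
    def __init__(self, refund_cap_inr=500):
        # L1 ceiling from the role card above; L2 is described as able to adjust it.
        self.refund_cap_inr = refund_cap_inr
        self.pending = None   # L1 action awaiting L2 review
        self.applied = []     # actions that passed review
        self.level = "L1"

    def propose(self, action):
        """L1 submits an action. It is held, not applied."""
        if self.pending is not None:
            raise RuntimeError("previous L1 action is still awaiting L2 review")
        if action.kind == "refund" and action.amount_inr > self.refund_cap_inr:
            raise PermissionError("refund exceeds the L1 ceiling")
        self.pending = action

    def review(self, verdict):
        """L2 clears the pending action with a verdict."""
        if self.pending is None:
            raise RuntimeError("nothing to review")
        action, self.pending = self.pending, None
        if verdict is Verdict.APPROVE:
            self.applied.append(action)
        elif verdict is Verdict.ESCALATE:
            self.level = "L3"
        # On REJECT the action is dropped and L1 has to propose again.
        return action
```

Because `propose` refuses a second action while one is pending, L1 cannot skip the review step, which is the property the page describes as "enforced, not optional".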

Results

+15–19pp over the baseline on every task

An 8B model trained with a GRPO curriculum outperforms the 70B NIM baseline by 15–19 percentage points on every task, with 8.75× fewer parameters.

+90% (relative)

correct escalation on hard task (41% → 78%, +37pp)

+85%

SLA compliance gain on full hierarchy

+118%

Hinglish comprehension improvement
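
The headline figures mix two units: percentage points (absolute difference) and percent (relative change). The short check below reproduces the two numbers whose inputs appear on the page, the escalation gain and the model size ratio. The helper name is hypothetical. The SLA and Hinglish figures are not recomputed because their underlying values are not given.

```python
def relative_gain_pct(before, after):
    """Relative change in percent, e.g. 41 -> 78 is about +90%."""
    return (after - before) / before * 100


if __name__ == "__main__":
    before, after = 41, 78  # correct-escalation rate on the hard task, in percent
    print(f"absolute gain: {after - before} pp")
    print(f"relative gain: {relative_gain_pct(before, after):.1f} %")
    print(f"size ratio 70B / 8B: {70 / 8}x")
```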

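| Task                      | NIM 70B Baseline | Ours (8B + GRPO) | Δ     |
|---------------------------|------------------|------------------|-------|
| easy                      | 0.72             | 0.88             | +16pp |
| medium                    | 0.61             | 0.79             | +18pp |
| hard                      | 0.45             | 0.64             | +19pp |
| nightmare                 | 0.38             | 0.53             | +15pp |
| curriculum_basic          | 0.69             | 0.84             | +15pp |
| curriculum_supervisor     | 0.54             | 0.71             | +17pp |
| curriculum_full_hierarchy | 0.41             | 0.58             | +17pp |
| curriculum_nightmare      | 0.29             | 0.44             | +15pp |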
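
The Δ column can be re-derived from the two score columns. The sketch below treats the scores as fractions in [0, 1] and reports the difference in percentage points. The dictionary layout and helper name are illustrative only.

```python
# (baseline, ours) per task, copied from the results above.
RESULTS = {
    "easy": (0.72, 0.88),
    "medium": (0.61, 0.79),
    "hard": (0.45, 0.64),
    "nightmare": (0.38, 0.53),
    "curriculum_basic": (0.69, 0.84),
    "curriculum_supervisor": (0.54, 0.71),
    "curriculum_full_hierarchy": (0.41, 0.58),
    "curriculum_nightmare": (0.29, 0.44),
}


def delta_pp(baseline, ours):
    """Difference in percentage points, rounded to the nearest integer."""
    return round((ours - baseline) * 100)


if __name__ == "__main__":
    for task, (baseline, ours) in RESULTS.items():
        print(f"{task:28s} {delta_pp(baseline, ours):+d}pp")
```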

Training Curves

Real GRPO run, 40 steps logged

Qwen2.5-1.5B · Colab T4 · 40 steps · curriculum_basic · 0.6% mean invalid rate
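
For readers new to GRPO: the method scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows. The function name and the `eps` value are assumptions, and using the population standard deviation is a choice made here for simplicity. This run's training code may normalize differently.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    avoids division by zero when every completion receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [0.10, 0.20, 0.30, 0.40]  # illustrative rewards, not logged values
    print([round(a, 3) for a in group_relative_advantages(group)])
```

Completions that score above their group mean get a positive advantage and are reinforced. A group with identical rewards contributes no gradient signal.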

Reward

Baseline 0.136 → best eval 0.152. Final reward 0.240 at step 40.

Loss

Stays stable throughout. No divergence or collapse.

Learning Rate

Cosine annealing: 5e-5 → 5e-6 over 40 steps.
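
The schedule above can be written as a closed-form function of the step. This sketch assumes no warmup and that step 0 starts at the peak rate. The function and parameter names are illustrative, and the defaults are the values quoted above.

```python
import math


def cosine_lr(step, total_steps=40, lr_max=5e-5, lr_min=5e-6):
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 10, 20, 30, 40):
        print(f"step {step:2d}: lr = {cosine_lr(step):.2e}")
```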

Invalid Rate

Mean 0.6%, well below the 90% collapse threshold.

Eval Scores

Best checkpoint 0.152 at step 20 vs baseline 0.136.

Before vs After

Trained (green) beats baseline (red) in every eval episode.

Full L40S run (Llama-3.1-8B, 150 steps): reward reached 0.709 on curriculum_supervisor, with episodes at final=1.000 · Best checkpoint: 0.531 @ step 40

Live Demo Available

See the agents in action

Watch the 3-level hierarchy handle real customer tickets, including edge cases, policy overrides, and escalations, in our live interactive demo.
