CarRacing RL: PPO Expert → DAgger → Diffusion Augmentation

A CMU Deep RL project exploring expert training, dataset generation, imitation learning, and diffusion-based augmentation on Gymnasium CarRacing-v2.

CarRacing policy rollout preview

Links


Objective & Metrics

This project studies how to train a strong continuous-control policy in CarRacing-v2 and then distill/improve a student policy using imitation learning (DAgger) and diffusion-based synthetic data. The goal is improved robustness when the student deviates from expert trajectories.

Primary objective: high-return driving with a low crash rate
Key metric: episode return (mean ± std over evaluation episodes)
Robustness metric: off-distribution recovery rate after DAgger
Data metric: dataset size and state coverage

Introduction

CarRacing-v2 is a continuous-control benchmark where an agent must drive a procedurally generated track from pixel observations. While a PPO expert can achieve strong performance, student policies trained via behavior cloning often fail under distribution shift: a small deviation from expert states can compound, sending the student into unseen states and causing crashes.

This project builds a full pipeline: (1) train an expert with PPO, (2) generate and validate an offline dataset, (3) train a student via imitation learning, (4) improve with DAgger, and (5) explore diffusion-based data/policy augmentation to increase coverage of critical recovery behaviors.

System Pipeline

Methods

PPO Expert Training
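In practice the expert can be trained with any standard PPO implementation (e.g. Stable-Baselines3's `PPO("CnnPolicy", ...)` on pixel observations). The core of PPO is the clipped surrogate objective; the sketch below computes it in plain NumPy, with the function name and batch layout chosen for illustration rather than taken from this repo:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (returned as a loss to minimize).

    logp_new / logp_old: per-sample action log-probs under the current
    and data-collecting policies; advantages: estimated advantages (e.g. GAE).
    """
    ratio = np.exp(logp_new - logp_old)            # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (min) combination prevents overly large policy updates.
    return -np.mean(np.minimum(unclipped, clipped))
```

With identical old and new log-probs the ratio is 1 everywhere, so the loss reduces to the negative mean advantage; large ratios are capped at 1 ± clip_eps, which is what keeps the expert's updates conservative.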

Offline Dataset Format
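One convenient way to pin down a format is a declarative schema plus a validator that every loader calls. The field names, dtypes, and the (96, 96, 3) frame shape below are assumptions matching raw CarRacing-v2 observations, not a documented schema from this project:

```python
import numpy as np

# Hypothetical per-step schema: dtype and trailing shape for each array.
SCHEMA = {
    "observations": (np.uint8, (96, 96, 3)),   # raw RGB frames
    "actions": (np.float32, (3,)),             # [steering, gas, brake]
    "rewards": (np.float32, ()),
    "terminals": (np.bool_, ()),
}

def validate_dataset(data):
    """Check that every array shares one length N and matches the schema.

    Catches the shape/dtype mismatches that otherwise surface late,
    deep inside a training loop.
    """
    n = len(data["observations"])
    for key, (dtype, shape) in SCHEMA.items():
        arr = data[key]
        assert arr.dtype == dtype, f"{key}: dtype {arr.dtype} != {dtype}"
        assert arr.shape == (n, *shape), f"{key}: shape {arr.shape} != {(n, *shape)}"
    return n
```

Storing the arrays with `np.savez_compressed` under these keys keeps the on-disk format a one-line round trip.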

DAgger (Dataset Aggregation)
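The DAgger loop itself is compact: roll out the current student, re-label every visited state with the expert's action, aggregate into the dataset, and refit. The toy sketch below uses a linear student and a hypothetical 1-D dynamics model (s' = s + 0.1a) purely to show the loop structure:

```python
import numpy as np

def rollout(policy, horizon, rng):
    """Roll out `policy` in a toy 1-D environment with dynamics s' = s + 0.1 * a."""
    s = rng.uniform(-1.0, 1.0)
    states = []
    for _ in range(horizon):
        states.append(s)
        s = s + 0.1 * policy(s)
    return np.array(states)

def dagger(expert, n_iters=5, horizon=20, seed=0):
    """DAgger: aggregate expert-labeled states from *student* rollouts."""
    rng = np.random.default_rng(seed)
    # Iteration 0: plain behavior cloning on an expert rollout.
    X = rollout(expert, horizon, rng)
    y = expert(X)
    w = np.linalg.lstsq(X[:, None], y, rcond=None)[0][0]
    for _ in range(n_iters):
        student = lambda s, w=w: w * s
        # Visit states under the student, but label them with expert actions.
        Xs = rollout(student, horizon, rng)
        X = np.concatenate([X, Xs])
        y = np.concatenate([y, expert(Xs)])
        w = np.linalg.lstsq(X[:, None], y, rcond=None)[0][0]
    return w
```

With the stabilizing expert `a = -s`, the fitted student converges to the same gain, including on the off-expert states its own rollouts produce; in the real pipeline the linear fit is replaced by the student network and the toy dynamics by CarRacing-v2.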

Diffusion-Based Augmentation
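A denoising diffusion model over expert actions (conditioned on observations) is built around the standard DDPM forward process. The sketch below shows that forward corruption step; the linear beta schedule and T = 100 are illustrative assumptions, not values from this project:

```python
import numpy as np

# Illustrative linear noise schedule (beta range and T are assumptions).
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention coefficients

def forward_diffuse(a0, t, rng):
    """DDPM forward process q(a_t | a_0): corrupt clean expert actions a0.

    A conditional denoiser eps_theta(a_t, t, obs) would be trained to
    predict eps; sampling then runs the learned reverse chain from noise.
    """
    eps = rng.standard_normal(a0.shape)
    a_t = np.sqrt(alphas_bar[t]) * a0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return a_t, eps
```

One augmentation route (a design assumption, not documented here) is to sample actions from the reverse process conditioned on student-visited observations, yielding synthetic labels for rare recovery situations.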

Results

CarRacing policy rollout preview


Discussion

A PPO expert provides high-quality supervision, but naive distillation can fail when the student drifts into unseen states. DAgger directly addresses this by re-labeling the student’s on-policy distribution with expert actions. Diffusion-based augmentation is explored as a complementary approach to increase dataset coverage and model complex action distributions for difficult recovery behaviors.

Key practical lessons include enforcing a single observation preprocessing pipeline shared by the expert, the student, and the diffusion model; checkpointing the best-performing policy regularly rather than only at the end of training; and building dataset validation tools that catch shape and dtype mismatches early.
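The single-preprocessing-pipeline lesson is easiest to enforce as one shared function that every component imports. A minimal sketch, where the 84-row crop (dropping the HUD strip at the bottom of the frame) and the grayscale weights are assumptions rather than values from this repo:

```python
import numpy as np

def preprocess(frame):
    """Single shared preprocessing step: crop the HUD, grayscale, scale to [0, 1].

    Used identically for expert rollouts, student training batches, and
    diffusion conditioning, so no component ever sees a different view.
    """
    assert frame.dtype == np.uint8 and frame.shape == (96, 96, 3), "raw CarRacing frame expected"
    cropped = frame[:84]  # assumed crop: drop the 12-pixel status bar
    # Standard luminance weights; any fixed choice works as long as it is shared.
    gray = cropped.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    return gray / 255.0
```

Keeping the assertion inside the function means a stale dataset or a mis-wired wrapper fails loudly at the first frame instead of silently degrading the student.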

My Contribution