Introduction
Developing an autonomous driving policy that integrates multiple perception tasks is a complex challenge. In this project, I built AutoMoE — a modular self-driving model based on a Mixture-of-Experts (MoE) architecture. The goal was to combine specialized neural networks (experts) for different vision tasks using a learned gating mechanism, and train the system to imitate a driving autopilot in the CARLA simulator.
Architecture
Multi-Task Experts with Gating and Trajectory Policy
The AutoMoE model is composed of several distinct modules working in tandem. Here's an overview of the key components:
Perception Experts
A set of pre-trained neural networks, each devoted to a specific perception task. My implementation includes experts for object detection, semantic segmentation, drivable area segmentation, and a nuScenes-style 3D detection expert.
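To make this concrete, here's a minimal sketch of how one such expert might be wrapped for the MoE, assuming a ResNet-18 backbone pooled to the 256-D feature size listed in the specs below (the class name and projection MLP are illustrative; the real experts load their own task-specific checkpoints):

```python
import torch.nn as nn
from torchvision.models import resnet18

class PerceptionExpert(nn.Module):
    """Hypothetical wrapper: ResNet-18 backbone pooled to a 256-D feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)  # in practice, load the task-specific checkpoint
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # conv stages + avgpool
        self.project = nn.Sequential(nn.Flatten(), nn.Linear(512, feat_dim), nn.ReLU())

    def forward(self, image):                      # image: (B, 3, H, W)
        return self.project(self.encoder(image))   # -> (B, 256)
```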
Context Encoder
A small network that encodes the ego-vehicle's state into a 64-dimensional context vector. This includes the car's speed, steering angle, throttle, brake, and potentially other situational data (time of day, weather). The context encoder transforms these signals into a learned embedding that represents the driving context for the gating network.
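A minimal sketch of such an encoder, assuming a simple two-layer MLP over the four control signals (the hidden size is my choice; only the inputs and the 64-D output come from the design above):

```python
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Sketch: embed speed/steer/throttle/brake into a 64-D driving-context vector."""
    def __init__(self, in_dim=4, ctx_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, ctx_dim),
        )

    def forward(self, vehicle_state):    # (B, 4): speed, steer, throttle, brake
        return self.mlp(vehicle_state)   # -> (B, 64) context embedding
```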
Gating Network (MoE)
The brain that fuses expert outputs. It takes as input the set of features from all perception experts and the context vector, producing a weighting for each expert as well as a fused feature vector for decision-making.
```python
# Gating mechanism (PyTorch-style sketch)
import torch
import torch.nn.functional as F

expert_features = [expert(image) for expert in experts]    # each: (B, 256)
context = context_encoder(vehicle_state)                   # (B, 64)

# Process through the gating MLP and apply softmax
gating_logits = gating_layer(torch.cat(expert_features + [context], dim=-1))
weights = F.softmax(gating_logits, dim=-1)                 # (B, num_experts)

# Weighted fusion of expert features
fused_features = sum(w.unsqueeze(-1) * f
                     for w, f in zip(weights.unbind(dim=-1), expert_features))
```
Trajectory Policy Head
A lightweight CNN plus MLP head that converts fused features into a driving plan. It uses a small EasyBackbone CNN (4 convolutional layers with downsampling) that processes the front-facing camera image to produce a 512-dim visual feature vector. The model predicts H=10 waypoints into the future, plus a speed for each.
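A rough sketch of this head, assuming the channel widths below and an (x, y, speed) output per future step (those details are not pinned down above, so treat them as illustrative):

```python
import torch
import torch.nn as nn

class TrajectoryPolicyHead(nn.Module):
    """Sketch: 4-layer CNN over the front camera + MLP over fused MoE features."""
    def __init__(self, fused_dim=256, horizon=10):
        super().__init__()
        # "EasyBackbone": 4 stride-2 conv layers, globally pooled to a 512-D vector
        chans = [3, 32, 64, 128, 512]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.backbone = nn.Sequential(*convs, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(512 + fused_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 3),   # (x, y, speed) for each of H=10 steps
        )
        self.horizon = horizon

    def forward(self, front_image, fused_features):
        visual = self.backbone(front_image)                           # (B, 512)
        out = self.head(torch.cat([visual, fused_features], dim=-1))
        return out.view(-1, self.horizon, 3)                          # (B, 10, 3)
```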
Architecture Specs
- ResNet-18 experts for detection, semantic segmentation, and drivable-area segmentation
- nuScenes expert: ResNet-18 image branch + PointNet LiDAR branch (concat or sum fusion)
- Expert features pooled to 256-D via lightweight MLPs
- Context: 64-D encoder over speed/steer/throttle/brake
- Gating MLP: 128-D hidden, temperature softmax + optional Gumbel top-k (sketched after this list)
- Policy: 4-layer CNN (EasyBackbone) → 512-D → 10 waypoints + speed
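The temperature softmax with optional Gumbel top-k routing called out in the gating spec could look roughly like this (the helper name and the exact noise formulation are my assumptions):

```python
import torch
import torch.nn.functional as F

def gumbel_topk_gate(logits, k=2, temperature=1.0, noisy=True):
    """Temperature-scaled softmax over the top-k experts, with optional Gumbel noise."""
    if noisy:  # Gumbel(0, 1) noise perturbs the logits to encourage exploration
        gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
        logits = logits + gumbel
    logits = logits / temperature
    topk_vals, topk_idx = logits.topk(k, dim=-1)   # route to the k best experts only
    weights = torch.zeros_like(logits)
    weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return weights                                  # (B, num_experts), sparse gate
```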
Data Collection & Dataset Generation
Training a data-hungry model like AutoMoE required a substantial and rich dataset. Since my focus was on imitation learning, I needed expert driving trajectories to mimic. For this, I leveraged the CARLA simulator's built-in autopilot to generate demonstration data.
CARLA Autopilot Images
- Multi-camera RGB images (4 cameras)
- Ego vehicle speed & control inputs
- Environment metadata
- ~68K frames, ~188 GB
CARLA Multimodal Dataset
- Everything in V1, plus:
- Semantic segmentation labels
- 32-beam LiDAR point clouds
- 2D bounding box annotations
- ~82K frames (67K train, 8.4K val, 7.2K test)
- Perception experts pre-trained on BDD100K/nuScenes, then fine-tuned on this CARLA multimodal set before gating/policy training to shrink the domain gap.
- Attempted to add Waymo Open Dataset (blocked by Linux setup) and NVIDIA Cosmos Drive Dreams (compute-limited), so stayed sim-only for this iteration.
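For reference, the autopilot demonstrations come from the standard CARLA client/sensor loop; here is a stripped-down sketch (blueprint choice, camera placement, and output paths are placeholders, and the real logger also records the remaining cameras, LiDAR, and labels):

```python
import carla

client = carla.Client("localhost", 2000)
world = client.get_world()
bp_lib = world.get_blueprint_library()

# Spawn an ego vehicle and hand control to CARLA's built-in autopilot
vehicle = world.spawn_actor(bp_lib.filter("vehicle.tesla.model3")[0],
                            world.get_map().get_spawn_points()[0])
vehicle.set_autopilot(True)

# Front-facing RGB camera attached to the ego vehicle
cam_bp = bp_lib.find("sensor.camera.rgb")
cam_tf = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(cam_bp, cam_tf, attach_to=vehicle)
camera.listen(lambda img: img.save_to_disk(f"out/rgb/{img.frame}.png"))

# Per-frame supervision: the autopilot's controls and the ego velocity
control = vehicle.get_control()    # control.throttle, control.steer, control.brake
velocity = vehicle.get_velocity()  # m/s components; magnitude gives the speed label
```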
Training via Imitation Learning
With the architecture assembled and data in hand, I trained AutoMoE using pure imitation learning (behavior cloning). The training objective was straightforward: have the model predict the same trajectory that the CARLA autopilot executed, given the current observations.
Loss Function
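Concretely, the imitation objective is a regression from predicted to autopilot trajectories. A minimal sketch, assuming an L1 penalty over the (x, y, speed) horizon to match the ADE/FDE (L1) metrics reported below:

```python
import torch.nn.functional as F

def imitation_loss(pred, target):
    """Behavior-cloning loss: L1 between predicted and autopilot waypoints + speeds.

    pred, target: (B, H, 3) tensors of (x, y, speed) over the H=10 horizon.
    """
    return F.l1_loss(pred, target)
```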
Freeze Experts
Keep pre-trained expert weights frozen to preserve learned perception capabilities.
Train Gating + Policy
Train the gating network and trajectory policy head on the imitation objective.
Optional: Unfreeze & Fine-tune
Briefly unfreeze select experts for joint training with smaller learning rate.
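The first two stages boil down to a few lines of PyTorch. A sketch, where `experts`, `gating_network`, and `policy_head` stand in for the modules above and the optimizer settings are placeholders:

```python
import torch

# Stage 1: freeze the pre-trained perception experts
for expert in experts:
    for p in expert.parameters():
        p.requires_grad = False

# Stage 2: optimize only the gating network and trajectory policy head
trainable = list(gating_network.parameters()) + list(policy_head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Stage 3 (optional): later unfreeze a select expert with a much smaller learning rate
# optimizer.add_param_group({"params": experts[0].parameters(), "lr": 1e-5})
```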
Results & Lessons Learned
Offline, the gating network improved modestly with noisy top-k routing, but closed-loop inference flopped: in CARLA the ego sometimes did not even leave the spawn point (likely a wiring bug or brittle policy). I didn’t push research-grade evals—this was a fast engineering build under tight time/compute.
Offline Metrics (val set)
| Run | ADE (L1) | FDE (L1) | Routing Entropy | Expert Usage |
|---|---|---|---|---|
| Baseline | 0.33 | 0.57 | 0.75 | [0.18, 0.12, 0.11, 0.59] |
| Noisy Top-k (k=2) | 0.30 | 0.51 | 0.54 | [0.35, 0.12, 0.39, 0.14] |
Val split only; no closed-loop route metrics yet (car failed to move reliably). ADE/FDE are displacement errors on predicted waypoints; entropy shows routing sharpness; expert usage is the average gate weight per expert across validation (order here: detection, segmentation, drivable, nuScenes image/LiDAR).
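For clarity, the ADE/FDE numbers above can be computed as follows, assuming L1 displacement over predicted (x, y) waypoints with `pred` and `gt` shaped (B, H, 2):

```python
import torch

def ade_fde_l1(pred, gt):
    """Average and final displacement error (L1) over a waypoint horizon."""
    per_step = (pred - gt).abs().sum(dim=-1)  # L1 distance at each future step
    ade = per_step.mean(dim=1)                # average over the horizon
    fde = per_step[:, -1]                     # final waypoint only
    return ade.mean().item(), fde.mean().item()
```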
Covariate Shift
Pure behavior cloning is brittle to compounding errors. When the model made a slight mistake, there was no mechanism to recover — it spiraled into worse decisions.
Limited Expert Utilization
Gating weights often didn't sharply differentiate between experts. The autopilot data may not have demonstrated enough varied behavior to force clear specialization (entropy ~0.75 even after training).
Domain Gap
Even with fine-tuning on CARLA data, differences between simulation and original datasets (BDD100K, nuScenes) remained and hurt performance.
Validation Metrics
Not setting up rigorous quantitative evaluation scenarios made it harder to pinpoint weaknesses and measure progress objectively.
"After seeing the model struggle, I became an imitation learning hater and got RL-pilled. This experience convinced me that reinforcement learning is needed to achieve truly reliable autonomy."
Future Directions
While AutoMoE in its current form has room for improvement, it provides a framework that I can iteratively refine. Here are the technical directions I'm excited about pursuing:
FastViT Backbones
Replace CNN backbones with Apple's FastViT — a hybrid vision transformer that's 8× smaller and 20× faster without sacrificing accuracy.
Self-Supervised Learning
Incorporate masked image modeling, contrastive learning, or cross-modal self-supervision to make better use of unlabeled data.
Reinforcement Learning
Use the trained model as an initial policy for RL fine-tuning in simulation, allowing it to learn recovery maneuvers and handle edge cases.
Real-World Data
Test on Waymo Open Dataset or use domain adaptation techniques to bridge the sim-to-real gap.
AutoMoE taught me what it actually takes to build autonomy for the real world: relentless experimentation, plenty of mistakes, and pruning a messy search space into a cleaner vision. The stack on paper looked tidy—experts, gating, policy—but the process forced deep work across data, training, and integration. Those lessons will travel with me into every future AI/ML build.