Introduction
Developing an autonomous driving policy that integrates multiple perception tasks is a complex challenge. In this project, I built AutoMoE — a modular self-driving model based on a Mixture-of-Experts (MoE) architecture. The goal was to combine specialized neural networks (experts) for different vision tasks using a learned gating mechanism, and train the system to imitate a driving autopilot in the CARLA simulator.
Architecture
Multi-Task Experts with Gating and Trajectory Policy
The AutoMoE model is composed of several distinct modules working in tandem. Here's an overview of the key components:
Perception Experts
A set of pre-trained neural networks, each devoted to a specific perception task. My implementation includes experts for object detection, semantic segmentation, drivable area segmentation, and a nuScenes-style 3D detection expert.
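To make this concrete, here's a minimal sketch of how one such expert might be wrapped for the MoE, assuming a ResNet-18 backbone pooled to the 256-D feature size listed in the specs below (the class name and projection MLP are illustrative; the real experts load their own task-specific checkpoints):

```python
import torch.nn as nn
from torchvision.models import resnet18

class PerceptionExpert(nn.Module):
    """Hypothetical wrapper: ResNet-18 backbone pooled to a 256-D feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)  # in practice, load the task-specific checkpoint
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # conv stages + avgpool
        self.project = nn.Sequential(nn.Flatten(), nn.Linear(512, feat_dim), nn.ReLU())

    def forward(self, image):                      # image: (B, 3, H, W)
        return self.project(self.encoder(image))   # -> (B, 256)
```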
Context Encoder
A small network that encodes the ego-vehicle's state into a 64-dimensional context vector. This includes the car's speed, steering angle, throttle, brake, and potentially other situational data (time of day, weather). The context encoder transforms these signals into a learned embedding that represents the driving context for the gating network.
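A minimal sketch of such an encoder, assuming a simple two-layer MLP over the four control signals (the hidden size is my choice; only the inputs and the 64-D output come from the design above):

```python
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Sketch: embed speed/steer/throttle/brake into a 64-D driving-context vector."""
    def __init__(self, in_dim=4, ctx_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, ctx_dim),
        )

    def forward(self, vehicle_state):    # (B, 4): speed, steer, throttle, brake
        return self.mlp(vehicle_state)   # -> (B, 64) context embedding
```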
Gating Network (MoE)
The brain that fuses expert outputs. It takes as input the set of features from all perception experts and the context vector, producing a weighting for each expert as well as a fused feature vector for decision-making.
```python
# Gating mechanism (PyTorch-style sketch)
import torch
import torch.nn.functional as F

expert_features = [expert(image) for expert in experts]    # each: (B, 256)
context = context_encoder(vehicle_state)                   # (B, 64)

# Process through the gating MLP and apply softmax
gating_logits = gating_layer(torch.cat(expert_features + [context], dim=-1))
weights = F.softmax(gating_logits, dim=-1)                 # (B, num_experts)

# Weighted fusion of expert features
fused_features = sum(w.unsqueeze(-1) * f
                     for w, f in zip(weights.unbind(dim=-1), expert_features))
```
Trajectory Policy Head
A lightweight CNN plus MLP head that converts fused features into a driving plan. It uses a small EasyBackbone CNN (4 convolutional layers with downsampling) that processes the front-facing camera image to produce a 512-dim visual feature vector. The model predicts H=10 waypoints into the future, plus a speed for each.
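A rough sketch of this head, assuming the channel widths below and an (x, y, speed) output per future step (those details are not pinned down above, so treat them as illustrative):

```python
import torch
import torch.nn as nn

class TrajectoryPolicyHead(nn.Module):
    """Sketch: 4-layer CNN over the front camera + MLP over fused MoE features."""
    def __init__(self, fused_dim=256, horizon=10):
        super().__init__()
        # "EasyBackbone": 4 stride-2 conv layers, globally pooled to a 512-D vector
        chans = [3, 32, 64, 128, 512]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.backbone = nn.Sequential(*convs, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(512 + fused_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 3),   # (x, y, speed) for each of H=10 steps
        )
        self.horizon = horizon

    def forward(self, front_image, fused_features):
        visual = self.backbone(front_image)                           # (B, 512)
        out = self.head(torch.cat([visual, fused_features], dim=-1))
        return out.view(-1, self.horizon, 3)                          # (B, 10, 3)
```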
Architecture Specs
- ResNet-18 experts for detection, semantic segmentation, and drivable-area segmentation
- nuScenes expert: ResNet-18 image branch + PointNet LiDAR branch (concat or sum fusion)
- Expert features pooled to 256-D via lightweight MLPs
- Context: 64-D encoder over speed/steer/throttle/brake
- Gating MLP: 128-D hidden, temperature softmax + optional Gumbel top-k (sketched after this list)
- Policy: 4-layer CNN (EasyBackbone) → 512-D → 10 waypoints + speed
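The temperature softmax with optional Gumbel top-k routing called out in the gating spec could look roughly like this (the helper name and the exact noise formulation are my assumptions):

```python
import torch
import torch.nn.functional as F

def gumbel_topk_gate(logits, k=2, temperature=1.0, noisy=True):
    """Temperature-scaled softmax over the top-k experts, with optional Gumbel noise."""
    if noisy:  # Gumbel(0, 1) noise perturbs the logits to encourage exploration
        gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
        logits = logits + gumbel
    logits = logits / temperature
    topk_vals, topk_idx = logits.topk(k, dim=-1)   # route to the k best experts only
    weights = torch.zeros_like(logits)
    weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return weights                                  # (B, num_experts), sparse gate
```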
Data Collection & Dataset Generation
Training a data-hungry model like AutoMoE required a substantial and rich dataset. Since my focus was on imitation learning, I needed expert driving trajectories to mimic. For this, I leveraged the CARLA simulator's built-in autopilot to generate demonstration data.
CARLA Autopilot Images
- Multi-camera RGB images (4 cameras)
- Ego vehicle speed & control inputs
- Environment metadata
- ~68K frames, ~188 GB
CARLA Multimodal Dataset
- Everything in V1, plus:
- Semantic segmentation labels
- 32-beam LiDAR point clouds
- 2D bounding box annotations
- ~82K frames (67K train, 8.4K val, 7.2K test)
- Perception experts pre-trained on BDD100K/nuScenes, then fine-tuned on this CARLA multimodal set before gating/policy training to shrink the domain gap.
- Attempted to add Waymo Open Dataset (blocked by Linux setup) and NVIDIA Cosmos Drive Dreams (compute-limited), so stayed sim-only for this iteration.
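For reference, the autopilot demonstrations come from the standard CARLA client/sensor loop; here is a stripped-down sketch (blueprint choice, camera placement, and output paths are placeholders, and the real logger also records the remaining cameras, LiDAR, and labels):

```python
import carla

client = carla.Client("localhost", 2000)
world = client.get_world()
bp_lib = world.get_blueprint_library()

# Spawn an ego vehicle and hand control to CARLA's built-in autopilot
vehicle = world.spawn_actor(bp_lib.filter("vehicle.tesla.model3")[0],
                            world.get_map().get_spawn_points()[0])
vehicle.set_autopilot(True)

# Front-facing RGB camera attached to the ego vehicle
cam_bp = bp_lib.find("sensor.camera.rgb")
cam_tf = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(cam_bp, cam_tf, attach_to=vehicle)
camera.listen(lambda img: img.save_to_disk(f"out/rgb/{img.frame}.png"))

# Per-frame supervision: the autopilot's controls and the ego velocity
control = vehicle.get_control()    # control.throttle, control.steer, control.brake
velocity = vehicle.get_velocity()  # m/s components; magnitude gives the speed label
```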
Training via Imitation Learning
With the architecture assembled and data in hand, I trained AutoMoE using pure imitation learning (behavior cloning). The training objective was straightforward: have the model predict the same trajectory that the CARLA autopilot executed, given the current observations.
Loss Function
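Concretely, the imitation objective is a regression from predicted to autopilot trajectories. A minimal sketch, assuming an L1 penalty over the (x, y, speed) horizon to match the ADE/FDE (L1) metrics reported below:

```python
import torch.nn.functional as F

def imitation_loss(pred, target):
    """Behavior-cloning loss: L1 between predicted and autopilot waypoints + speeds.

    pred, target: (B, H, 3) tensors of (x, y, speed) over the H=10 horizon.
    """
    return F.l1_loss(pred, target)
```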
Freeze Experts
Keep pre-trained expert weights frozen to preserve learned perception capabilities.
Train Gating + Policy
Train the gating network and trajectory policy head on the imitation objective.
Optional: Unfreeze & Fine-tune
Briefly unfreeze select experts for joint training with smaller learning rate.
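The first two stages boil down to a few lines of PyTorch. A sketch, where `experts`, `gating_network`, and `policy_head` stand in for the modules above and the optimizer settings are placeholders:

```python
import torch

# Stage 1: freeze the pre-trained perception experts
for expert in experts:
    for p in expert.parameters():
        p.requires_grad = False

# Stage 2: optimize only the gating network and trajectory policy head
trainable = list(gating_network.parameters()) + list(policy_head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Stage 3 (optional): later unfreeze a select expert with a much smaller learning rate
# optimizer.add_param_group({"params": experts[0].parameters(), "lr": 1e-5})
```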
Results & Lessons Learned
Offline, the gating network improved modestly with noisy top-k routing, but closed-loop inference flopped: in CARLA the ego sometimes did not even leave the spawn point (likely a wiring bug or brittle policy). I didn’t push research-grade evals—this was a fast engineering build under tight time/compute.
Offline Metrics (val set)
| Run | ADE (L1) | FDE (L1) | Routing Entropy | Expert Usage |
|---|---|---|---|---|
| Baseline | 0.33 | 0.57 | 0.75 | [0.18, 0.12, 0.11, 0.59] |
| Noisy Top-k (k=2) | 0.30 | 0.51 | 0.54 | [0.35, 0.12, 0.39, 0.14] |
Val split only; no closed-loop route metrics yet (car failed to move reliably). ADE/FDE are displacement errors on predicted waypoints; entropy shows routing sharpness; expert usage is the average gate weight per expert across validation (order here: detection, segmentation, drivable, nuScenes image/LiDAR).
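For clarity, the ADE/FDE numbers above can be computed as follows, assuming L1 displacement over predicted (x, y) waypoints with `pred` and `gt` shaped (B, H, 2):

```python
import torch

def ade_fde_l1(pred, gt):
    """Average and final displacement error (L1) over a waypoint horizon."""
    per_step = (pred - gt).abs().sum(dim=-1)  # L1 distance at each future step
    ade = per_step.mean(dim=1)                # average over the horizon
    fde = per_step[:, -1]                     # final waypoint only
    return ade.mean().item(), fde.mean().item()
```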
Covariate Shift
Pure behavior cloning is brittle to compounding errors. When the model made a slight mistake, there was no mechanism to recover — it spiraled into worse decisions.
Limited Expert Utilization
Gating weights often didn't sharply differentiate between experts. The autopilot data may not have demonstrated enough varied behavior to force clear specialization (entropy ~0.75 even after training).
Domain Gap
Even with fine-tuning on CARLA data, differences between simulation and original datasets (BDD100K, nuScenes) remained and hurt performance.
Validation Metrics
Not setting up rigorous quantitative evaluation scenarios made it harder to pinpoint weaknesses and measure progress objectively.
"After seeing the model struggle, I became an imitation learning hater and got RL-pilled. This experience convinced me that reinforcement learning is needed to achieve truly reliable autonomy."
Future Directions
While AutoMoE in its current form has room for improvement, it provides a framework that I can iteratively refine. Here are the technical directions I'm excited about pursuing:
FastViT Backbones
Replace CNN backbones with Apple's FastViT — a hybrid vision transformer that's 8× smaller and 20× faster without sacrificing accuracy.
Self-Supervised Learning
Incorporate masked image modeling, contrastive learning, or cross-modal self-supervision to make better use of unlabeled data.
Reinforcement Learning
Use the trained model as an initial policy for RL fine-tuning in simulation, allowing it to learn recovery maneuvers and handle edge cases.
Real-World Data
Test on Waymo Open Dataset or use domain adaptation techniques to bridge the sim-to-real gap.
AutoMoE taught me what it actually takes to build autonomy for the real world: relentless experimentation, plenty of mistakes, and pruning a messy search space into a cleaner vision. The stack on paper looked tidy—experts, gating, policy—but the process forced deep work across data, training, and integration. Those lessons will travel with me into every future AI/ML build.