
Limitations of CNN
Limited global scene understanding
CNN models mainly capture local spatial patterns such as lane markings and edges, but are less effective at modeling long-range dependencies and broader scene context.
Lack of temporal awareness
The baseline uses a single front-view RGB image as input. As a result, it cannot explicitly model recent motion history or anticipate future trajectory changes.
Weak long-term stability
In closed-loop driving, small prediction errors can accumulate over time, leading to lane drifting and unstable turning behavior, especially in curves and intersections.
Poor recovery capability
Once the vehicle deviates from the expert trajectory, the CNN policy often struggles to recover smoothly, and the deviation may grow into catastrophic divergence.
Balanced Data Collection

ViT Framework
For the input, the model uses four historical frames together with a normalized speed sequence. This gives the model short-term temporal context, so it can reason not only about the current scene, but also about recent motion.
For the output, the model predicts five future waypoints and one target speed. The future waypoints are represented in the ego coordinate frame, so they describe the short-term driving trajectory relative to the current vehicle position.
For the training target, we use expert waypoints collected from CARLA trajectories, together with speed supervision.
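To make the ego-frame waypoint targets concrete, the following is a minimal sketch of converting world-frame positions into the ego coordinate frame. It is not taken from the project code: the function name `world_to_ego`, the 2D (x, y) representation, and the convention that ego x points forward are assumptions for illustration. The sign of the lateral axis follows the handedness of the world frame, so it should be checked against the simulator's convention.

```python
import math

def world_to_ego(waypoints_xy, ego_xy, ego_yaw_rad):
    """Express world-frame (x, y) points relative to the ego vehicle.

    Hypothetical helper: translates by the ego position, then rotates
    by -yaw so that ego x points along the vehicle heading.
    """
    cos_y = math.cos(ego_yaw_rad)
    sin_y = math.sin(ego_yaw_rad)
    ego_points = []
    for wx, wy in waypoints_xy:
        dx = wx - ego_xy[0]
        dy = wy - ego_xy[1]
        # Rotation by -yaw: longitudinal (forward) and lateral components.
        forward = cos_y * dx + sin_y * dy
        lateral = -sin_y * dx + cos_y * dy
        ego_points.append((forward, lateral))
    return ego_points

# Vehicle at (1, 1) heading along world +y; a point 2 m further along +y
# should appear 2 m straight ahead in the ego frame.
print(world_to_ego([(1.0, 3.0)], (1.0, 1.0), math.pi / 2))
```

With targets expressed this way, the same five-waypoint label stays valid regardless of where on the map the sample was recorded.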

Closed-loop Deployment
Historical Observation Buffer
- A rolling buffer stores the most recent 4 historical frames together with the corresponding normalized speed values.
- At each new time step, the latest observation is appended and the oldest one is removed.
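The rolling buffer described above can be sketched with a fixed-length `collections.deque`, which drops the oldest entry automatically when a new one is appended. The class name `ObservationBuffer` and its methods are hypothetical, and how the buffer is filled at the start of an episode is not specified in these notes, so the sketch only exposes an `is_full` check.

```python
from collections import deque

HISTORY_LEN = 4  # number of historical frames, as stated above

class ObservationBuffer:
    """Hypothetical rolling buffer of (frame, normalized speed) pairs."""

    def __init__(self, history_len=HISTORY_LEN):
        # deque(maxlen=...) discards the oldest item on overflow.
        self.frames = deque(maxlen=history_len)
        self.speeds = deque(maxlen=history_len)

    def push(self, frame, normalized_speed):
        """Append the latest observation; the oldest one is removed if full."""
        self.frames.append(frame)
        self.speeds.append(normalized_speed)

    def is_full(self):
        """True once the buffer holds a complete history window."""
        return len(self.frames) == self.frames.maxlen

    def snapshot(self):
        """Return the buffered history, ordered oldest to newest."""
        return list(self.frames), list(self.speeds)
```

Keeping frames and speeds in two deques of the same length keeps each speed value aligned with its frame.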
Model Prediction
- The buffered observations are passed into the ViT-based driving framework.
- The model predicts 5 future waypoints in the ego coordinate frame.
- It also predicts a target speed for the next driving stage.
Waypoint Control
- The predicted waypoints are converted into executable vehicle commands by a waypoint controller.
- Instead of directly regressing steering at each frame, the controller follows the predicted short-term trajectory.
- This helps generate smoother motion and reduces oscillation during turns and lane corrections.
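The notes do not specify the controller design, so the following is only an illustrative sketch of one common approach: steer toward an aim point chosen from the predicted waypoints, and set throttle from the speed error. The function name `waypoint_control`, the gains, the aim-point index, and the braking threshold are all assumed values, not the project's settings. Waypoints are assumed to be (forward, lateral) pairs in the ego frame.

```python
import math

def clamp(value, low, high):
    return max(low, min(high, value))

def waypoint_control(waypoints, target_speed, current_speed,
                     steer_gain=1.0, speed_gain=0.5, aim_index=1,
                     brake_margin=1.0):
    """Hypothetical waypoint follower returning (steer, throttle, brake).

    steer is in [-1, 1]; throttle and brake are in [0, 1].
    """
    aim_forward, aim_lateral = waypoints[aim_index]
    # Angle between the vehicle heading and the direction to the aim point.
    heading_error = math.atan2(aim_lateral, aim_forward)
    # Normalize so that a 90 degree error maps to full steering lock.
    steer = clamp(steer_gain * heading_error / (math.pi / 2), -1.0, 1.0)

    # Proportional speed control on the predicted target speed.
    speed_error = target_speed - current_speed
    throttle = clamp(speed_gain * speed_error, 0.0, 1.0)
    # Brake only when the vehicle is clearly faster than the target.
    brake = 1.0 if speed_error < -brake_margin else 0.0
    if brake > 0.0:
        throttle = 0.0
    return steer, throttle, brake
```

Because the steering command is derived from a point further along the predicted trajectory rather than from a per-frame regression, frame-to-frame noise in the prediction has less direct effect on the steering output.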
Closed-Loop Execution
- The vehicle moves according to the controller output, and a new camera frame and speed measurement are collected.
- The new observation is fed back into the buffer, forming a closed-loop deployment cycle.
- This process continues until the vehicle reaches a failure condition or completes the test route.
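The full deployment cycle can be sketched as a single loop. The interfaces here are assumptions made for illustration: `env.reset()` returns a frame and a normalized speed, `env.step(command)` returns the next frame, speed, a failure flag, and a route-completed flag, `model` maps the buffered history to waypoints and a target speed, and `controller` maps those to a command. Padding the buffer with copies of the first observation is also an assumption, since the notes do not describe the warm-up phase. `ToyEnv` is a stand-in so the loop can run without a simulator.

```python
from collections import deque

def run_closed_loop(env, model, controller, history_len=4, max_steps=1000):
    """Hypothetical closed-loop driver; returns (status, steps_taken)."""
    frames = deque(maxlen=history_len)
    speeds = deque(maxlen=history_len)
    frame, speed = env.reset()
    # Assumption: fill the history with the first observation.
    for _ in range(history_len):
        frames.append(frame)
        speeds.append(speed)

    for step in range(max_steps):
        # 1. Predict waypoints and target speed from the buffered history.
        waypoints, target_speed = model(list(frames), list(speeds))
        # 2. Convert the prediction into a vehicle command.
        command = controller(waypoints, target_speed, speeds[-1])
        # 3. Apply the command and collect the new observation.
        frame, speed, failed, finished = env.step(command)
        # 4. Feed the observation back into the buffer.
        frames.append(frame)
        speeds.append(speed)
        if failed:
            return "failed", step + 1
        if finished:
            return "completed", step + 1
    return "timeout", max_steps

class ToyEnv:
    """Stand-in environment: the route completes after a fixed step count."""

    def __init__(self, route_steps=3):
        self.route_steps = route_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return "frame0", 0.0

    def step(self, command):
        self.t += 1
        finished = self.t >= self.route_steps
        return "frame%d" % self.t, 0.1 * self.t, False, finished
```

In a real run, the toy environment would be replaced by a wrapper around the simulator, and the failure flag would encode whatever failure conditions the evaluation uses.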
