Robust Reinforcement Learning-based Autonomous Driving Agent for Simulation and Real World

We asked Róbert Moni to tell us more about his recent work. Enjoy the read!

The author’s perspective

Most of us, proud members of the nerd community, first experience driving through the discrete actions we take on our keyboards. We believe that the harder we push the forward arrow (or the W key), the faster the car in the game will accelerate (sooo true 😊). Few of us believe that this task can be solved with machine learning. Even fewer of us believe that it can be done accurately and robustly with a basic Deep Reinforcement Learning (DRL) method known as Deep Q-Network (DQN).

It turned out to be true in the case of a Duckiebot. Moreover, with some added computer vision techniques, it performed well both in simulation (where the training was carried out) and in the real world.

The pipeline

The complete training pipeline carried out in the Duckietown-gym environment is visualized in the figure above and works as follows. First, the camera images go through several preprocessing steps:

  • resizing to a smaller resolution (60×80) for faster processing;
  • cropping the upper part of the image, which doesn’t contain useful information for the navigation;
  • segmenting important parts of the image based on their color (lane markings);
  • normalizing the image;
  • and finally, forming a sequence from the last 5 camera images, which serves as the input to the Convolutional Neural Network (CNN) policy network (the agent itself).
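The preprocessing steps listed above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the 480×640 input size, the nearest-neighbour resize, the 20-row crop (chosen so that 60 rows become 40), the color thresholds, and all function and class names are assumptions made for this example.

```python
from collections import deque

import numpy as np

STACK_SIZE = 5                # number of recent frames per observation (from the text)
TARGET_H, TARGET_W = 60, 80   # resized resolution (from the text)
CROP_TOP = 20                 # assumption: rows cut from the top, so 60 -> 40


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize by index lookup (stand-in for a real resizer)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows][:, cols]


def segment_lane_colors(img):
    """Keep white-ish and yellow-ish pixels only; thresholds are hypothetical."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    white = (r > 180) & (g > 180) & (b > 180)
    yellow = (r > 150) & (g > 150) & (b < 100)
    mask = white | yellow
    return img * mask[..., None].astype(img.dtype)


def preprocess(frame):
    """Resize, crop the top, segment by color, and scale to [0, 1]."""
    small = resize_nearest(frame, TARGET_H, TARGET_W)
    cropped = small[CROP_TOP:]
    segmented = segment_lane_colors(cropped)
    return segmented.astype(np.float32) / 255.0


class FrameStack:
    """Holds the last STACK_SIZE preprocessed frames, stacked along channels."""

    def __init__(self, size=STACK_SIZE):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        processed = preprocess(frame)
        if not self.frames:
            # Assumption: at episode start, repeat the first frame to fill the stack.
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return np.concatenate(list(self.frames), axis=-1)
```

Pushing a single 480×640 RGB frame into a fresh `FrameStack` yields an array of shape (40, 80, 15), which matches the network input described in the next section.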

The agent is trained in the simulator with the DQN algorithm based on a reward function that describes how accurately the robot follows the optimal curve. The output of the network is mapped to wheel speed commands.
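For readers unfamiliar with DQN, the sketch below shows the standard one-step Q-learning target that the algorithm regresses towards, together with epsilon-greedy action selection, a common exploration strategy with DQN. The source does not describe the reward function, the discount factor, or the exploration scheme in detail, so `GAMMA` and the function names here are assumptions for illustration only.

```python
import random

GAMMA = 0.99  # hypothetical discount factor; the source does not state one


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In a full DQN setup, `next_q_values` would typically come from a separate target network and the transitions from a replay buffer; both are omitted here to keep the sketch short.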

The workings

The CNN was trained on the preprocessed images. The network was designed so that inference can run in real time on a computer with limited resources (i.e., one with no dedicated GPU). The input of the network is a tensor of shape (40, 80, 15), which is the result of stacking five RGB images along the channel axis. The network consists of three convolutional layers, each followed by a ReLU nonlinearity and a MaxPool (dimension reduction) operation.

The convolutional layers use 32, 32, and 64 filters of size 3 × 3, and the MaxPool layers use 2 × 2 windows. The convolutional layers are followed by two fully connected layers with 128 and 3 outputs. The output of the last layer determines the selected action, one of three: turning left, turning right, or going straight. The selected action is then mapped to wheel speed commands.
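To get a feel for how small this network is, the following sketch traces the tensor shapes through the layers and counts the trainable parameters. The text does not specify strides or padding, so the example assumes unpadded convolutions with stride 1 and non-overlapping 2 × 2 pooling. The wheel speed values and the action index order in `ACTION_TO_WHEELS` are likewise hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution (assumed: no padding, stride 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of non-overlapping max pooling."""
    return size // window


def trace_network(height=40, width=80, channels=15,
                  filters=(32, 32, 64), hidden=128, n_actions=3):
    """Return per-block output shapes, flattened size, and parameter count."""
    params = 0
    shapes = []
    for n_filters in filters:
        params += 3 * 3 * channels * n_filters + n_filters  # weights + biases
        height, width = conv_out(height), conv_out(width)
        height, width = pool_out(height), pool_out(width)
        channels = n_filters
        shapes.append((height, width, channels))
    flat = height * width * channels
    params += flat * hidden + hidden            # first fully connected layer
    params += hidden * n_actions + n_actions    # output layer
    return shapes, flat, params


# Hypothetical (left wheel, right wheel) speeds per action index.
ACTION_TO_WHEELS = {
    0: (0.2, 0.6),  # turn left: left wheel slower
    1: (0.6, 0.2),  # turn right: right wheel slower
    2: (0.6, 0.6),  # go straight
}


def select_wheel_speeds(q_values):
    """Pick the action with the highest network output and map it to speeds."""
    action = max(range(len(q_values)), key=lambda a: q_values[a])
    return ACTION_TO_WHEELS[action]
```

Under these assumptions, `trace_network()` reports a flattened feature vector of 1,536 values and roughly 229 thousand parameters in total, which fits the stated goal of running inference without a dedicated GPU.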

Learn more

Our work was acknowledged and presented at the IEEE World Congress on Computational Intelligence 2020 conference. We plan to publish the source code after the AI-DO 5 competition. Our paper is available on, and

Check out our sim and real demo on YouTube, performed at our Duckietown Robotarium at the Budapest University of Technology and Economics.