Mastering Classic Reinforcement Learning Algorithms



Instructor: Ashutosh Trivedi


5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

1 week to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Formulate sequential decision-making problems as deterministic decision processes, Markov chains, and finite Markov decision processes.
Explain and apply core reinforcement-learning concepts, including discounting, value functions, policies, Bellman equations, and optimality.
Implement planning algorithms for finite Markov decision processes, including value iteration, policy iteration, and linear programming formulations.
Compare tabular reinforcement-learning algorithms, including bandits, Monte Carlo methods, temporal-difference learning, SARSA, and Q-learning.

Details to know

Shareable certificate

There are 5 modules in this course

How can an agent learn to make good decisions through repeated interaction with an uncertain environment? This course introduces the mathematical and algorithmic foundations of classical reinforcement learning, with an emphasis on finite Markov decision processes and tabular methods.

The course begins with the simplest settings in which the central ideas are clearest: deterministic decision processes, discounted rewards, and Bellman optimality equations. It then introduces stochasticity through Markov chains and Markov decision processes, where learners study policies, value functions, expected discounted reward, and dynamic programming. With this foundation in place, the course turns to planning methods for known models, including value iteration, policy iteration, and linear programming formulations.

The second half of the course studies reinforcement learning when the model is unknown and the agent must learn from sampled experience. Topics include multi-armed bandits, exploration and exploitation, Monte Carlo methods, temporal-difference learning, SARSA, Q-learning, and convergence principles. The course ends with a final assessment in which learners solve the same finite MDP from both model-based planning and model-free learning perspectives.

By the end of the course, learners will be able to formulate finite decision-making problems as Markov decision processes, solve them using classical planning algorithms, and implement tabular reinforcement-learning algorithms from sampled data. This course provides the foundation for later study of deep reinforcement learning, reward programming, and trustworthy AI systems.

This course can be taken for academic credit as part of CU Boulder’s Master of Science in Computer Science (MS-CS) and Master of Science in Artificial Intelligence (MS-AI) degrees offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals.

Learn more:
MS in Artificial Intelligence: https://www.coursera.org/degrees/ms-artificial-intelligence-boulder
MS in Computer Science: https://coursera.org/degrees/ms-computer-science-boulder

This module introduces the modeling and optimization foundations for sequential decision-making in their simplest form: deterministic decision processes with discounted rewards. We begin with states, actions, transitions, and rewards as a language for representing decision problems over time. We then develop value functions and Bellman equations as tools for optimizing long-term return. The goal is to build intuition for why dynamic programming is correct in the simpler setting of deterministic decision processes before introducing stochastic transitions, learning from sampled experience, and bootstrapping in later modules.
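
As a rough illustration of the ideas in this module, the sketch below runs value iteration on a deterministic decision process and then extracts a greedy policy from the resulting values. The two states, their rewards, and the discount factor are hypothetical and chosen only for illustration; they are not taken from the course materials.

```python
# Hypothetical two-state deterministic decision process (illustrative only).
# TRANSITIONS[state][action] = (next_state, reward)
TRANSITIONS = {
    "A": {"stay": ("A", 1.0), "move": ("B", 0.0)},
    "B": {"stay": ("B", 3.0), "move": ("A", 0.0)},
}
GAMMA = 0.5  # discount factor, assumed for this example


def value_iteration(transitions, gamma, tol=1e-10):
    """Repeat V(s) <- max_a [r(s, a) + gamma * V(next(s, a))] until stable."""
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {
            s: max(r + gamma * values[nxt] for nxt, r in actions.values())
            for s, actions in transitions.items()
        }
        if max(abs(new_values[s] - values[s]) for s in transitions) < tol:
            return new_values
        values = new_values


def greedy_policy(transitions, gamma, values):
    """In each state, pick an action that attains the Bellman maximum."""
    return {
        s: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for s, actions in transitions.items()
    }


values = value_iteration(TRANSITIONS, GAMMA)
policy = greedy_policy(TRANSITIONS, GAMMA, values)
print(values, policy)
```

In this toy problem the values settle near 3 for state A and 6 for state B: staying in B forever is worth 3 / (1 - 0.5) = 6, and moving from A to B is worth 0.5 * 6 = 3, which beats staying in A.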

What's included

11 videos, 12 readings, 2 assignments

11 videos (total 69 minutes)

Course Introduction (7 minutes)
Decision-Making over Time (3 minutes)
States, Actions, Transitions, and Rewards (2 minutes)
From Unfolded Decisions to State-Based Models (2 minutes)
Formal Definition of a Deterministic Decision Process (4 minutes)
Discounting Infinite Reward Streams (9 minutes)
Runs, Histories, Policies, and Values (9 minutes)
Discounted Optimality Equations (5 minutes)
Checking Values and Extracting Policies (5 minutes)
Why Bellman Equations Characterize Optimal Behavior (10 minutes)
Existence, Uniqueness, and Value Iteration (13 minutes)

12 readings (total 110 minutes)

Earn Academic Credit for your Work! (10 minutes)
Course Support (10 minutes)
Assessment Expectations (5 minutes)
Sequential Decision-Making as Optimization (10 minutes)
States, Actions, Transitions, and Rewards (10 minutes)
Deterministic Decision Processes (10 minutes)
Discounting Infinite Reward Streams (10 minutes)
Policies, Runs, and Values (10 minutes)
Bellman Equations and Dynamic Programming (10 minutes)
Why Bellman Equations Characterize Optimal Behavior (10 minutes)
Existence, Uniqueness, and Value Iteration (10 minutes)
Module Summary (5 minutes)

2 assignments (total 50 minutes)

AI Policy Quiz (5 minutes)
Deterministic Decision Processes (45 minutes)

This module adds stochasticity to the deterministic picture developed in the previous module. Learners continue with the surprise-quiz example, now with uncertain outcomes: studying usually helps but may not always help, and relaxing may reduce preparation but may not always do so. The module first introduces stochastic transitions as probability distributions over next states, then studies Markov chains as stochastic systems without choices and finally adds actions to obtain Markov decision processes. The goal is to make expected discounted reward, policies, and Bellman equations feel like natural extensions of the deterministic setting.
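
To make expected discounted reward in a Markov chain concrete, here is a minimal sketch that repeatedly applies V(s) = R(s) + gamma * sum_t P(s, t) * V(t) until the values stop changing. The chain, its state names, the rewards, and the discount factor are all hypothetical; the module's own surprise-quiz example is not reproduced here.

```python
# Hypothetical two-state Markov chain (state names and numbers are illustrative).
# P[s][t] is the probability of moving from state s to state t; rows sum to 1.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
REWARD = {"sunny": 1.0, "rainy": 0.0}  # reward collected in each state
GAMMA = 0.9  # discount factor, assumed for this example


def expected_discounted_reward(P, reward, gamma, tol=1e-10):
    """Iterate V(s) <- R(s) + gamma * sum_t P[s][t] * V(t) to a fixed point."""
    values = {s: 0.0 for s in P}
    while True:
        new_values = {
            s: reward[s] + gamma * sum(p * values[t] for t, p in P[s].items())
            for s in P
        }
        if max(abs(new_values[s] - values[s]) for s in P) < tol:
            return new_values
        values = new_values


chain_values = expected_discounted_reward(P, REWARD, GAMMA)
print(chain_values)
```

There are no actions in a Markov chain, so there is no maximization step: the update is a plain expectation over next states. Adding a choice of action in each state would turn the same update into the Bellman equation for an MDP.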

What's included

8 videos, 8 readings, 1 assignment

8 videos (total 70 minutes)

Module Introduction (2 minutes)
From Deterministic to Stochastic Transitions (10 minutes)
Markov Chains (23 minutes)
Markov Decision Processes (7 minutes)
Policies and Values (8 minutes)
Checking Values and Extracting Policies (3 minutes)
Bellman Optimality Equations (7 minutes)
Why Bellman Optimality Equations Are Correct (9 minutes)

8 readings (total 70 minutes)

From Deterministic to Stochastic Transitions (10 minutes)
Markov Chains (10 minutes)
Expected Discounted Reward in Markov Chains (10 minutes)
Markov Decision Processes (10 minutes)
Policies, Value Functions, and Expected Return (10 minutes)
Bellman Equations for MDPs (10 minutes)
Comparing DDPs, Markov Chains, and MDPs (5 minutes)
Module Summary (5 minutes)

1 assignment (total 45 minutes)

Markov Chains and Markov Decision Processes (45 minutes)

This module focuses on known-model optimization. Learners use Bellman equations as computational tools for policy evaluation, policy improvement, value iteration, policy iteration, and linear programming formulations of discounted MDPs.
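
The alternation between policy evaluation and policy improvement can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the two-state MDP, its action names, and its numbers are hypothetical, and policy evaluation is done by simple iteration instead of solving the linear system directly.

```python
# Hypothetical MDP (illustrative only).
# MDP[state][action] = list of (probability, next_state, reward)
MDP = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "work": [(0.7, "high", 0.0), (0.3, "low", 0.0)],
    },
    "high": {
        "wait": [(1.0, "high", 1.0)],
        "work": [(0.5, "high", 2.0), (0.5, "low", 2.0)],
    },
}
GAMMA = 0.9  # discount factor, assumed for this example


def q_value(mdp, gamma, values, state, action):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in mdp[state][action])


def evaluate_policy(mdp, gamma, policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman equation for a fixed policy."""
    values = {s: 0.0 for s in mdp}
    while True:
        new_values = {s: q_value(mdp, gamma, values, s, policy[s]) for s in mdp}
        if max(abs(new_values[s] - values[s]) for s in mdp) < tol:
            return new_values
        values = new_values


def policy_iteration(mdp, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(actions)) for s, actions in mdp.items()}  # arbitrary start
    while True:
        values = evaluate_policy(mdp, gamma, policy)
        improved = {
            s: max(mdp[s], key=lambda a: q_value(mdp, gamma, values, s, a))
            for s in mdp
        }
        if improved == policy:
            return policy, values
        policy = improved


best_policy, best_values = policy_iteration(MDP, GAMMA)
print(best_policy, best_values)
```

Value iteration would instead apply the maximization inside every update, without holding a policy fixed between improvements. Both approaches target the same fixed point of the Bellman optimality operator.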

What's included

9 videos, 8 readings, 1 assignment

9 videos (total 41 minutes)

Module Introduction (2 minutes)
Planning Setup (4 minutes)
Policy Evaluation (6 minutes)
From Values to Better Policies (8 minutes)
The Bellman Optimality Operator (5 minutes)
Value Iteration as Fixed-Point Computation (6 minutes)
Alternating Evaluation and Improvement (5 minutes)
The Linear Programming View of Optimality (3 minutes)
Module Summary (2 minutes)

8 readings (total 75 minutes)

Planning with a Known Model (10 minutes)
Policy Evaluation (10 minutes)
Policy Improvement (10 minutes)
The Bellman Optimality Operator (10 minutes)
Value Iteration (10 minutes)
Policy Iteration (10 minutes)
Linear Programming for Discounted MDPs (10 minutes)
Module Summary (5 minutes)

1 assignment (total 45 minutes)

Dynamic Programming in MDPs (45 minutes)

This module begins the transition from planning to reinforcement learning. In planning, the MDP model is known and Bellman backups compute expectations exactly. In reinforcement learning, the model is replaced by sampled experience. Learners first view reinforcement learning as sample-based dynamic programming, then study rewards, uncertainty, agent–environment interaction, bandit estimation, exploration versus exploitation, Monte Carlo policy evaluation, and Monte Carlo control.
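
One common way to illustrate bandit estimation and the exploration-exploitation trade-off is an epsilon-greedy agent with sample-average estimates. The sketch below is an assumption-laden example, not the course's code: the function name `run_epsilon_greedy`, the Bernoulli reward model, and the arm probabilities are all hypothetical.

```python
import random


def run_epsilon_greedy(success_probs, steps, epsilon, seed=0):
    """Play a Bernoulli multi-armed bandit with epsilon-greedy action selection.

    success_probs is hidden from the agent; it is used only to sample rewards.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    estimates = [0.0] * n_arms  # sample-average estimate of each arm's mean
    counts = [0] * n_arms       # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick an arm uniformly
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental form of the sample average: no reward history is stored.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_epsilon_greedy([0.2, 0.5, 0.8], steps=5000, epsilon=0.1)
print(estimates, counts)
```

With epsilon set to 0 the agent never explores and can lock onto whichever arm happened to pay off first. A small positive epsilon keeps every estimate improving, at the cost of occasionally pulling arms that look worse.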

What's included

9 videos, 11 readings, 1 assignment

9 videos (total 37 minutes)

Module Introduction (2 minutes)
From Planning to Reinforcement Learning (3 minutes)
Rewards, Uncertainty, and Exploration (3 minutes)
The Agent–Environment Interface (3 minutes)
One-Armed Bandits (5 minutes)
Multi-Armed Bandits (5 minutes)
Monte Carlo Policy Evaluation (8 minutes)
Monte Carlo Control (6 minutes)
Module Summary (3 minutes)

11 readings (total 74 minutes)

From Planning to Learning (10 minutes)
From Planning to Reinforcement Learning (10 minutes)
Rewards, Uncertainty, and Behavior (5 minutes)
The Agent–Environment Interaction Loop (5 minutes)
One-Armed Bandits (10 minutes)
Multi-Armed Bandits (10 minutes)
Monte Carlo Estimation (2 minutes)
Returns as Random Variables (5 minutes)
Monte Carlo Policy Evaluation (5 minutes)
Monte Carlo Control (10 minutes)
Module Summary (2 minutes)

1 assignment (total 45 minutes)

Learning from Sampled Experience (45 minutes)

This module completes the tabular reinforcement-learning part of Course 1. Module 4 introduced sample-based learning through bandits and Monte Carlo methods. Module 5 introduces temporal-difference learning: updating after one sampled transition by combining an observed reward with a bootstrapped value estimate. The module ends by summarizing tabular reinforcement learning and motivating the transition to function approximation and deep RL.
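
The one-step update described above can be sketched in a few lines. The helper names below (`td0_update`, `sarsa_target`, `q_learning_target`) and the toy tables are hypothetical; the formulas are the standard tabular TD(0), SARSA, and Q-learning targets.

```python
def td0_update(values, state, reward, next_state, alpha, gamma, terminal=False):
    """TD(0): move V(state) toward reward + gamma * V(next_state)."""
    bootstrap = 0.0 if terminal else gamma * values[next_state]
    target = reward + bootstrap
    values[state] += alpha * (target - values[state])
    return values[state]


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(q[next_state].values())


# Toy state-value table: one sampled transition s -> t with reward 1.
values = {"s": 0.0, "t": 4.0}
new_value = td0_update(values, "s", reward=1.0, next_state="t", alpha=0.5, gamma=0.5)

# Toy action-value table for the next state t.
q = {"t": {"left": 1.0, "right": 3.0}}
on_policy = sarsa_target(q, reward=1.0, next_state="t", next_action="left", gamma=0.5)
off_policy = q_learning_target(q, reward=1.0, next_state="t", gamma=0.5)
print(new_value, on_policy, off_policy)  # 1.5 1.5 2.5
```

The two control targets differ only in how they bootstrap. SARSA uses the value of the next action chosen by the behavior policy, while Q-learning uses the maximum over next actions regardless of what the agent does next.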

What's included

8 videos, 9 readings, 1 assignment

8 videos (total 33 minutes)

Learning before the Episode Ends (4 minutes)
One-Step Bootstrapped Prediction (5 minutes)
On-Policy Temporal-Difference Control (4 minutes)
Off-Policy Temporal-Difference Control (5 minutes)
What Policy Is Being Learned? (4 minutes)
Smoother Targets and Overestimation (3 minutes)
Reducing Maximization Bias (3 minutes)
Between Monte Carlo and One-Step TD (4 minutes)

9 readings (total 39 minutes)

Why Temporal-Difference Learning? (5 minutes)
TD(0) Policy Evaluation (5 minutes)
On-Policy TD Control (5 minutes)
Q-Learning: Off-Policy TD Control (5 minutes)
On-Policy and Off-Policy Learning (2 minutes)
Expected SARSA and Maximization Bias (5 minutes)
Double Q-Learning (5 minutes)
n-Step TD (2 minutes)
Why Move Beyond Tabular Methods? (5 minutes)

1 assignment (total 45 minutes)

Control, Exploration, and Tabular RL Algorithms (45 minutes)

Instructor

Ashutosh Trivedi

University of Colorado Boulder

2 courses, 47 learners

Offered by

University of Colorado Boulder
