Mastering Q-Learning: Hands-On Examples and Key Concepts
What is Q-Learning?
Q-learning is a model-free reinforcement learning algorithm used to determine the optimal action-selection policy for a given Markov Decision Process (MDP). The agent learns to make decisions by estimating the expected long-term reward of each action in each state, without requiring a model of the environment.
How Does Q-Learning Work?
Here are the steps involved in Q-learning:
1. Define the state and action space: The state space represents the possible states that the agent can be in, while the action space represents the possible actions that the agent can take.
2. Initialize the Q-table: The Q-table is a matrix that stores the expected reward for each action in each state. All entries are initially set to 0.
3. Observe the state: The agent observes the current state of the environment.
4. Choose an action: The agent selects an action based on an exploration-exploitation trade-off: a balance between choosing actions that have yielded high rewards in the past (exploitation) and trying out new actions to learn more about the environment (exploration).
5. Execute the action: The agent executes the chosen action in the environment and observes the reward received.
6. Update the Q-table: The Q-table is updated using the Q-learning update rule, which adjusts the expected reward for the chosen action in the observed state based on the observed reward and the expected reward for the next state-action pair.
7. Repeat steps 3–6 until convergence: Steps 3–6 are repeated until the Q-values converge to the optimal values, which represent the maximum expected long-term reward for each action in each state.
Once the Q-table has converged to its optimal values, the agent can use it to select the best action to take in any given state by choosing the action with the highest expected reward.
Q-learning is a popular algorithm for solving MDPs because it can learn optimal policies without requiring a model of the environment, and it can handle large state spaces.
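To make these steps concrete, here is a minimal Python sketch of the tabular Q-learning loop. It assumes a hypothetical environment object env whose reset() returns an integer state and whose step(action) returns (next_state, reward, done); the function name and hyperparameter values are illustrative, not a fixed API.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    # Step 2: initialize the Q-table to zeros
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()                             # Step 3: observe the current state
        done = False
        while not done:
            # Step 4: exploration-exploitation trade-off (epsilon-greedy)
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)   # explore
            else:
                action = int(np.argmax(Q[state]))       # exploit
            # Step 5: execute the action and observe the outcome
            next_state, reward, done = env.step(action)
            # Step 6: Q-learning update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
    return Q
Once training is done, the greedy policy for a state is simply np.argmax(Q[state]), i.e., the action with the highest expected reward.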
Example of Q-Learning
Here's a step-by-step example of Q-learning in machine learning for navigating a maze:
Let’s say we have a 3x3 maze, where the start position is in the top-left corner, and the goal position is in the bottom-right corner. The agent can move up, down, left, or right to navigate the maze. There are no obstacles in this maze, so the only reward is a positive reward of +10 for reaching the goal state.
The Q-table for this problem would be a 3×3×4 matrix (3×3 grid positions × 4 actions). The states correspond to positions in the maze, and the actions correspond to moving up, down, left, or right. The Q-value for a state-action pair represents the expected long-term reward for taking that action from that state.
Initially, the Q-table is filled with zeros:
Q(s, a) = 0 for all s, a
Here’s how the Q-learning algorithm would work in this example:
1. Initialize the Q-table:
Q = [
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
]
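In code, the same all-zero table can be created with NumPy; a small sketch, assuming the axis order rows × columns × actions:
import numpy as np

# Q[row, col, action], with actions ordered [up, down, left, right]
Q = np.zeros((3, 3, 4))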
2. Observe the state:
Let’s say the agent starts in position (0, 0).
3. Choose an action:
The agent chooses an action based on the current state and the values in the Q-table. Initially, the agent might choose a random action to encourage exploration. Let’s say the agent chooses to move right.
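A common way to implement this choice is an epsilon-greedy rule: with a small probability the agent explores with a random action, otherwise it exploits the best-known action. A minimal sketch, assuming the NumPy Q-table above (the helper name and epsilon value are illustrative):
import numpy as np

Q = np.zeros((3, 3, 4))           # Q-table from the initialization step

def choose_action(Q, state, epsilon=0.1):
    # With probability epsilon, explore with a random action
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[-1])
    # Otherwise, exploit the action with the highest Q-value for this state
    return int(np.argmax(Q[state]))

# Example: pick an action for the start state (0, 0)
action = choose_action(Q, (0, 0))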
4. Execute the action:
The agent moves right and ends up in position (0, 1). Since the agent is not at the goal state yet, there is no reward.
5. Update the Q-table:
The Q-value for the current state-action pair is updated using the Q-learning update rule:
Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))
where alpha is the learning rate, gamma is the discount factor, reward is the observed reward, s' is the new state, and a' is the action that maximizes the Q-value in the new state.
In this case, alpha and gamma might be set to 0.5 and 0.9, respectively. Since the move does not reach the goal, reward is 0. The new state is (0, 1), and suppose that (from updates in earlier episodes) the highest Q-value in that state belongs to moving down, with a value of 5. The Q-value for the (0, 0) right action is initially 0, so the new Q-value is:
Q((0, 0), right) = Q((0, 0), right) + 0.5 * (0 + 0.9 * 5 - 0) = 0 + 0.5 * 4.5 = 2.25
The updated Q-table (with actions ordered [up, down, left, right], so "right" is the last entry in each cell) is:
Q = [
[[0, 0, 0, 2.25], [0, 0, 0, 0], [0, 0, 0, 0]],
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
]
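The same numbers can be reproduced with the NumPy table: seed the assumed estimate of 5 for "down" in state (0, 1), then apply the update rule for taking "right" (index 3) in state (0, 0). This is an illustrative check, using the action ordering [up, down, left, right]:
import numpy as np

Q = np.zeros((3, 3, 4))            # Q[row, col, action]
alpha, gamma = 0.5, 0.9            # learning rate and discount factor

Q[0, 1, 1] = 5.0                   # assumed earlier estimate for "down" in state (0, 1)
reward = 0                         # moving right from (0, 0) gives no reward

# Q-learning update for taking "right" (index 3) in state (0, 0)
Q[0, 0, 3] += alpha * (reward + gamma * Q[0, 1].max() - Q[0, 0, 3])
print(Q[0, 0, 3])                  # prints 2.25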
6. Repeat steps 2-5 until the agent reaches the goal state:
The agent keeps choosing actions, updating the Q-table, and moving through the maze until it reaches the goal state. Once the agent reaches the goal state, it receives a reward of +10, and the episode ends.
7. Repeat steps 1-6 for multiple episodes:
To improve the Q-table further, the Q-learning algorithm can be run for multiple episodes, where each episode is a complete run through the maze from the start state to the goal state. Over time, the Q-table will converge to the optimal policy, which is the policy that maximizes the expected long-term reward.
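Putting the pieces together, here is a compact end-to-end sketch that trains a Q-table on this 3×3 maze over many episodes. The grid dynamics, hyperparameters, and helper names are illustrative assumptions, not a fixed implementation:
import numpy as np

GOAL = (2, 2)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
alpha, gamma, epsilon, episodes = 0.5, 0.9, 0.1, 500

Q = np.zeros((3, 3, 4))                                   # Q[row, col, action]

def step(state, action):
    # Apply the move; stepping off the grid leaves the position unchanged
    row = min(max(state[0] + MOVES[action][0], 0), 2)
    col = min(max(state[1] + MOVES[action][1], 0), 2)
    next_state = (row, col)
    reward = 10 if next_state == GOAL else 0
    return next_state, reward, next_state == GOAL

for _ in range(episodes):                                 # one episode = start to goal
    state, done = (0, 0), False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(4)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update rule
        Q[state + (action,)] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state + (action,)])
        state = next_state

print(np.argmax(Q, axis=2))        # greedy action index for each cell after training
After enough episodes, the greedy actions printed for the non-goal cells should point toward the bottom-right goal; the goal cell's own row stays at zero because episodes end there.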
That's a simple example of Q-learning in machine learning for navigating a maze using a Q-table and the Q-learning update rule.
Top Machine Learning Mastery: Elevate Your Skills with this Step-by-Step Tutorial
1. Need for Machine Learning, Basic Principles, Applications, Challenges
4. Logistic Regression (Binary Classification)
8. Gradient Boosting (XGBoost)
11. Neural Network Representation (Perceptron Learning)
15. Dimensionality Reduction (PCA, SVD)
16. Clustering (K-Means Clustering, Hierarchical Clustering)
19. Reinforcement Learning Fundamentals and Applications
20. Q-Learning
Dive into an insightful Machine Learning tutorial for exam success and knowledge expansion. More concepts and hands-on projects coming soon — follow my Medium profile for updates!