A Foundational Algorithm in Reinforcement Learning
Q-Learning is a model-free reinforcement learning algorithm that agents use to learn the quality of an action in a particular state. It aims to find an optimal policy, a mapping from states to actions that maximizes the expected cumulative reward. The "Q" in Q-Learning stands for "Quality," referring to the value function that the algorithm learns.
The core idea is to estimate a function, Q(s, a), which represents the expected cumulative reward of taking action a in state s and following the optimal policy thereafter. This estimate is iteratively updated as the agent interacts with its environment.
The algorithm maintains a Q-table, which is a lookup table where rows represent states and columns represent actions. Each cell Q(s, a) stores the estimated value of taking action a in state s.
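As a small illustrative sketch (the state and action counts below are assumptions, not part of the algorithm), such a Q-table can be stored as a 2-D array indexed by state and action:

```python
import numpy as np

N_STATES = 9    # e.g. the cells of a 3x3 grid (illustrative)
N_ACTIONS = 4   # e.g. up, down, left, right (illustrative)

# All estimates start at zero: the agent knows nothing about any action's value yet.
Q = np.zeros((N_STATES, N_ACTIONS))

# Q[s, a] is the current estimate of the value of taking action a in state s.
print(Q[0, 2])  # 0.0 before any learning has happened
```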
The agent interacts with the environment by performing actions and observing rewards and next states. For each transition (state s, action a, reward r, next state s'), the Q-value is updated with the following rule, derived from the Bellman optimality equation:
Q(s, a) ← Q(s, a) + α * [r + γ * max_a' Q(s', a') - Q(s, a)]
Where:
- α (alpha) is the learning rate (0 ≤ α ≤ 1). It determines how much new information overrides old information. A higher learning rate means the agent is more sensitive to new discoveries.
- γ (gamma) is the discount factor (0 ≤ γ ≤ 1). It determines the importance of future rewards. A value closer to 1 means the agent cares more about long-term rewards, while a value closer to 0 means it prioritizes immediate rewards.
- r is the immediate reward received after taking action a in state s.
- max_a' Q(s', a') is the maximum Q-value over all possible actions a' in the next state s'. This represents the estimated best future reward obtainable from state s'.

A crucial aspect of Q-Learning is balancing exploration (trying new actions to discover potentially better rewards) and exploitation (using the currently known best actions to maximize rewards). A common strategy for this is the epsilon-greedy policy:
- With probability ε (epsilon), the agent chooses a random action (exploration).
- With probability 1 - ε, the agent chooses the action with the highest Q-value for the current state (exploitation).

Typically, ε starts high and decays over time, encouraging exploration early on and more exploitation as the agent learns.
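As a minimal sketch of how these pieces fit together, the update rule and the epsilon-greedy policy can be written as two small functions. The Q-table shape and the default alpha, gamma, and epsilon values here are illustrative assumptions, not values fixed by the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the current best one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[state]))            # exploit: highest Q-value in this state

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-Learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = np.max(Q[s_next])              # estimated best future reward from s'
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
```

Each environment transition triggers one q_update call; decaying epsilon between episodes (for example, multiplying it by a constant slightly below 1) shifts the agent from exploration toward exploitation.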
Q-Learning has a wide range of applications, from game playing and robot navigation to resource management and routing problems.
Let's illustrate Q-Learning with a simple grid world. The agent navigates a 3x3 grid to reach a goal state. The agent receives a small negative reward for each step (to encourage efficiency) and a large positive reward upon reaching the goal. Walls or obstacles would result in a large negative reward or leave the agent in the same state.
A conceptual representation of a simple grid world environment.
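The sketch below trains an agent on a comparable 3x3 grid world. The exact reward values, episode count, and learning parameters are illustrative assumptions; only the overall setup (small step penalty, large goal reward) follows the description above.

```python
import numpy as np

# 3x3 grid: states 0..8 read left-to-right, top-to-bottom; state 8 is the goal.
N_ROWS, N_COLS = 3, 3
GOAL = 8
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

STEP_REWARD = -1.0   # small penalty per move to encourage short paths (assumed value)
GOAL_REWARD = 10.0   # large reward for reaching the goal (assumed value)

def step(state, action):
    """Apply an action; moves that would leave the grid keep the agent in place."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = row * N_COLS + col
    if next_state == GOAL:
        return next_state, GOAL_REWARD, True
    return next_state, STEP_REWARD, False

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.995):
    rng = np.random.default_rng(0)
    Q = np.zeros((N_ROWS * N_COLS, len(ACTIONS)))
    for _ in range(episodes):
        state, done = 0, False                      # each episode starts in the top-left cell
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = int(rng.integers(len(ACTIONS)))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Q-Learning update toward r + gamma * max_a' Q(s', a').
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
        epsilon *= epsilon_decay                    # explore less as training progresses
    return Q

Q = train()
print(np.round(Q, 2))   # learned Q-values: one row per state, one column per action
```

In the printed table, the highest value in each row indicates the action the greedy policy would take from that state.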