Q-Learning

A Foundational Algorithm in Reinforcement Learning

Introduction to Q-Learning

Q-Learning is a model-free reinforcement learning algorithm that agents use to learn the quality of an action in a particular state. It aims to find an optimal policy (a mapping from states to actions) that maximizes the expected cumulative reward. The "Q" in Q-Learning stands for "Quality," referring to the value function that the algorithm learns.

The core idea is to estimate a function, Q(s, a), representing the expected cumulative reward of taking action a in state s and following the optimal policy thereafter. This estimate is updated iteratively as the agent interacts with its environment.

The Q-Learning Algorithm

The algorithm maintains a Q-table, which is a lookup table where rows represent states and columns represent actions. Each cell Q(s, a) stores the estimated value of taking action a in state s.
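
For concreteness, here is one minimal way such a table might be represented (a sketch assuming integer state and action indices; the sizes and variable names are illustrative):

    import numpy as np

    n_states = 9     # e.g., the nine cells of a 3x3 grid
    n_actions = 4    # e.g., up, down, left, right

    # Rows are states, columns are actions; every estimate starts at zero
    q_table = np.zeros((n_states, n_actions))

    # Q(s, a) is then just an array lookup, e.g. state 2, action 1:
    print(q_table[2, 1])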

Learning Process

The agent interacts with the environment by performing actions and observing the resulting rewards and next states. For each transition (state s, action a, reward r, next state s'), the Q-value is updated with the following rule, derived from the Bellman optimality equation:

Q(s, a) ← Q(s, a) + α * [r + γ * max_{a'} Q(s', a') − Q(s, a)]

Where:

- α is the learning rate, controlling how much new information overrides the old estimate.
- γ is the discount factor, weighting future rewards relative to immediate ones (0 ≤ γ ≤ 1).
- r is the immediate reward received after taking action a in state s.
- max_{a'} Q(s', a') is the current estimate of the best value achievable from the next state s'.
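
As a minimal sketch of this update (assuming the NumPy Q-table representation from above; q_table, alpha, and gamma are illustrative names and values):

    import numpy as np

    q_table = np.zeros((9, 4))   # states x actions, as in the earlier sketch
    alpha, gamma = 0.1, 0.9      # learning rate and discount factor

    def q_update(s, a, r, s_next):
        # TD target: immediate reward plus discounted best value from s'
        td_target = r + gamma * np.max(q_table[s_next])
        # Move Q(s, a) a fraction alpha toward the target
        q_table[s, a] += alpha * (td_target - q_table[s, a])

    q_update(s=0, a=1, r=-1.0, s_next=3)   # one observed transition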

Exploration vs. Exploitation

A crucial aspect of Q-Learning is balancing exploration (trying new actions to discover potentially better rewards) and exploitation (using the currently known best actions to maximize rewards). A common strategy for this is the epsilon-greedy policy:

- With probability ε, the agent selects a random action (exploration).
- With probability 1 − ε, the agent selects the action with the highest Q-value in the current state (exploitation).

Typically, ε starts high and decays over time, encouraging exploration early on and more exploitation as the agent learns.
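
A sketch of epsilon-greedy selection with a simple multiplicative decay (the values of epsilon, epsilon_min, and decay below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    q_table = np.zeros((9, 4))                 # as in the earlier sketches
    epsilon, epsilon_min, decay = 1.0, 0.05, 0.995

    def select_action(s):
        # Explore with probability epsilon, otherwise exploit
        if rng.random() < epsilon:
            return int(rng.integers(q_table.shape[1]))   # random action
        return int(np.argmax(q_table[s]))                # best known action

    a = select_action(0)
    epsilon = max(epsilon_min, epsilon * decay)   # decay after each episode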

Key Components

- Q-table: the lookup table of Q(s, a) estimates described above.
- Learning rate (α): how quickly new experience overwrites old estimates.
- Discount factor (γ): how strongly future rewards are valued.
- Exploration rate (ε): the probability of taking a random action under the epsilon-greedy policy.
- Reward signal (r): the per-step feedback the environment provides.

Applications

Q-Learning has a wide range of applications, including:

- Game playing, where agents learn strategies from score- or win/loss-based rewards.
- Robotics, such as learning navigation and simple control policies.
- Resource management and scheduling, such as traffic signal control.
- Recommendation systems, where actions correspond to items shown to users.

Simplified Example

Let's illustrate Q-Learning with a simple grid world. The agent navigates a 3x3 grid to reach a goal state, receiving a small negative reward for each step (to encourage efficiency) and a large positive reward upon reaching the goal. Hitting a wall or obstacle would either incur a large negative reward or simply leave the agent in the same state. A code sketch of this setup follows the figure below.

[Figure: a conceptual representation of a simple 3x3 grid world environment.]
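
Putting the pieces together, the following end-to-end sketch trains a tabular Q-learning agent on such a grid. The layout, reward values, and hyperparameters are illustrative assumptions rather than a fixed specification:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    SIZE = 3                     # 3x3 grid; states are cell indices 0..8
    GOAL = 8                     # bottom-right cell
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def step(s, a):
        # Apply a move; walking off the grid leaves the agent in place
        row, col = divmod(s, SIZE)
        dr, dc = MOVES[a]
        nr, nc = row + dr, col + dc
        if not (0 <= nr < SIZE and 0 <= nc < SIZE):
            nr, nc = row, col                        # blocked by the boundary
        s_next = nr * SIZE + nc
        reward = 10.0 if s_next == GOAL else -1.0    # step cost vs. goal bonus
        return s_next, reward, s_next == GOAL

    q_table = np.zeros((SIZE * SIZE, len(MOVES)))
    alpha, gamma, epsilon = 0.1, 0.9, 1.0

    for episode in range(500):
        s, done = 0, False                           # start in the top-left cell
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(len(MOVES)))
            else:
                a = int(np.argmax(q_table[s]))
            s_next, r, done = step(s, a)
            # Q-learning update toward the TD target
            q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
            s = s_next
        epsilon = max(0.05, epsilon * 0.99)          # shift from exploring to exploiting

    print(np.round(q_table, 2))                      # learned Q-values per state-action pair

After enough episodes, taking the argmax over each row of q_table should trace a shortest path from the start cell to the goal.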
