Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make a sequence of decisions by interacting with an environment, aiming to maximize the cumulative reward it receives for its actions.
Core Components
- Agent: The learner or decision-maker.
- Environment: The external world the agent interacts with.
- State (S): A representation of the current situation of the environment.
- Action (A): A choice the agent can make in a given state.
- Reward (R): A scalar feedback signal from the environment indicating how good or bad the last action was.
- Policy (π): A strategy or mapping from states to actions, which the agent uses to decide what action to take.
- Value Function (V or Q): Predicts the expected future reward. The state-value function V(s) estimates the value of being in state s, and the action-value function Q(s, a) estimates the value of taking action a in state s.
- Model: The agent's representation of how the environment works. It can predict the next state and reward given a current state and action.
The RL Loop
The fundamental interaction between an agent and its environment is a cycle:
- The agent observes the current state (S) of the environment.
- Based on its policy (π), the agent selects an action (A).
- The agent performs the action in the environment.
- The environment transitions to a new state (S') and provides a reward (R) to the agent.
- The agent uses this reward and new state to update its policy and/or value function, learning from the experience.
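The loop above can be sketched in a few lines of Python. `SimpleEnv` and the random placeholder policy below are hypothetical stand-ins for illustration, not part of any specific RL library:

```python
import random

class SimpleEnv:
    """A toy two-state environment (hypothetical, for illustration)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Taking action 1 in state 0 earns a reward; otherwise nothing.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 - self.state        # deterministic transition
        return self.state, reward

env = SimpleEnv()
state = env.state
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])         # placeholder for the policy pi(s)
    next_state, reward = env.step(action)  # environment transitions, emits R
    total_reward += reward                 # a learning update would go here
    state = next_state                     # observe the new state S'
```

A real agent would replace the random choice with its policy and use `(state, action, reward, next_state)` to update that policy or a value function.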
Key Concepts & Algorithms
Exploration vs. Exploitation: A fundamental dilemma where the agent must balance trying new actions to discover potentially better rewards (exploration) with using its current knowledge to obtain known rewards (exploitation).
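A common way to manage this trade-off is an epsilon-greedy rule: explore with a small probability epsilon, otherwise exploit the current value estimates. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; in practice epsilon is often decayed over time so the agent explores early and exploits later.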
Value-Based Methods
These methods aim to learn the optimal value function, from which an optimal policy can be derived (e.g., by always choosing the action with the highest Q-value).
- Q-Learning: A model-free, off-policy temporal-difference learning algorithm that learns the action-value function Q(s, a).
- SARSA (State-Action-Reward-State-Action): A model-free, on-policy temporal-difference learning algorithm. Unlike Q-learning, it updates its estimates using the action the current policy actually selects next, rather than the greedy action.
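The two temporal-difference updates differ only in the bootstrap target. A minimal tabular sketch, where `Q` is a list of per-state action-value lists and `alpha` (learning rate) and `gamma` (discount factor) are standard hyperparameters:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best next action, regardless of
    which action the agent will actually take next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the policy actually chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Both move Q(s, a) a fraction `alpha` toward the target; the off-policy/on-policy distinction lies entirely in whether the target uses `max` or the chosen next action.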
Policy-Based Methods
These methods directly learn the policy function, mapping states to probabilities of taking actions.
- Policy Gradients: Algorithms that update the policy parameters in the direction that increases the expected reward.
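The simplest policy-gradient algorithm, REINFORCE, can be sketched on a hypothetical two-armed bandit: the policy is a softmax over action preferences, and each preference is nudged along the gradient of the log-probability, scaled by the received reward:

```python
import math
import random

random.seed(0)  # for reproducibility of this illustration

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def pull(arm):
    """Hypothetical bandit: arm 1 pays more than arm 0."""
    return 1.0 if arm == 1 else 0.2

prefs = [0.0, 0.0]   # policy parameters (action preferences)
alpha = 0.1          # learning rate
for _ in range(500):
    probs = softmax(prefs)
    a = random.choices([0, 1], weights=probs)[0]
    r = pull(a)
    # REINFORCE: grad of log pi(a) w.r.t. preference k is 1[k==a] - pi(k)
    for k in range(2):
        grad = (1.0 if k == a else 0.0) - probs[k]
        prefs[k] += alpha * r * grad
```

After training, the softmax probabilities should concentrate on the better-paying arm, since updates in its direction are scaled by the larger reward.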
Actor-Critic Methods
Combine aspects of both value-based and policy-based methods. An "actor" learns the policy, and a "critic" learns a value function to evaluate the actor's actions.
- A3C (Asynchronous Advantage Actor-Critic): A popular actor-critic algorithm.
- DDPG (Deep Deterministic Policy Gradient): An actor-critic method designed for continuous action spaces.
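The actor-critic interaction can be sketched in tabular form: the critic computes a TD error that evaluates the actor's last action, and both components update from it. This is a minimal illustration, not a full A3C or DDPG implementation; `V` holds state values, `prefs` holds per-state action preferences, and `probs` are the actor's action probabilities in state `s`:

```python
def actor_critic_step(V, prefs, s, a, r, s_next, probs,
                      alpha_v=0.1, alpha_pi=0.1, gamma=0.99):
    """One tabular actor-critic update (a minimal sketch)."""
    td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation signal
    V[s] += alpha_v * td_error                # critic update
    for k in range(len(prefs[s])):            # actor (policy) update
        grad = (1.0 if k == a else 0.0) - probs[k]
        prefs[s][k] += alpha_pi * td_error * grad
```

A positive TD error means the action turned out better than the critic expected, so the actor raises that action's preference; a negative error lowers it.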
Model-Based Methods
These methods learn a model of the environment and use it to plan or simulate future outcomes.
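Once a model is available, planning can be as simple as a one-step lookahead: simulate each action's predicted next state and reward, and pick the best. A sketch, assuming hypothetical learned tables `model[s][a]` (predicted next state) and `rewards[s][a]` (predicted reward):

```python
def plan_one_step(model, rewards, V, s, actions, gamma=0.99):
    """Choose the action whose simulated outcome looks best under the
    learned model and current value estimates."""
    return max(actions,
               key=lambda a: rewards[s][a] + gamma * V[model[s][a]])
```

Deeper planners (e.g., tree search) extend the same idea by simulating multiple steps ahead with the model instead of just one.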
Applications
Reinforcement learning is used in a wide variety of fields, including:
- Robotics (e.g., robot locomotion, manipulation)
- Game Playing (e.g., AlphaGo, Atari games)
- Autonomous Driving
- Recommender Systems
- Resource Management
- Personalized Medicine
Example: Imagine training a robot to walk. The agent (robot) is in a state (its current posture). It can choose actions (move leg forward, adjust balance). If it falls, it receives a negative reward. If it takes a step successfully, it receives a small positive reward. Over many trials, it learns a policy (a sequence of leg movements) that maximizes its total reward, leading to stable walking.