REINFORCEMENT LEARNING ELEMENTS

Beyond the agent and the environment, a reinforcement learning system has four main components: a policy, a reward signal, a value function, and, optionally, a model of the environment.

Policy

A policy defines the agent’s behaviour at a given point in time: it maps the states the agent perceives to the actions it takes. In the simplest cases the policy can be a plain function or a lookup table, but it may also involve extensive computation.
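
For instance, a very small lookup-table policy might look like the sketch below; the state and action names are made up purely for illustration:

```python
# A minimal sketch of a lookup-table policy: a mapping from perceived
# states to actions. State and action names are illustrative only.
policy = {
    "low_battery": "return_to_dock",
    "obstacle_ahead": "turn_left",
    "clear_path": "move_forward",
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("obstacle_ahead"))  # -> turn_left
```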

Reward signal

At each time step, the agent receives an immediate signal from the environment known as the reward signal, or simply the reward. As noted earlier, rewards can be positive or negative depending on the agent’s actions.

The value function

The value function indicates how good a state (or state–action pair) is in the long run: it estimates the total reward the agent can expect to accumulate from that point onward. It is shaped by the agent’s policy and the reward signal, and its purpose is to estimate these long-term values so the agent can act to maximize cumulative reward.
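
The difference between immediate reward and long-term value can be illustrated with a short sketch; the reward sequences and discount factor below are made-up numbers:

```python
# Illustrative sketch: a state's value reflects the expected discounted
# sum of future rewards, not just the immediate reward.
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A state with no immediate reward can still be valuable if it leads
# to large rewards later on.
print(discounted_return([0, 0, 10]))  # 8.1
print(discounted_return([1, 0, 0]))   # 1.0
```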

MOST COMMON REINFORCEMENT LEARNING ALGORITHMS


Reinforcement learning algorithms, most commonly seen in AI and gaming applications, fall into two broad types: model-free and model-based. Model-free RL algorithms include Q-learning and deep Q-learning.

Q-Learning

Q-learning is an off-policy, temporal-difference (TD) learning algorithm. TD methods learn by comparing successive predictions made over time. Q-learning learns the action-value function Q(s, a), which indicates how good it is to perform action “a” in state “s.”
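
A tabular Q-learning loop might look roughly like the sketch below. The environment interface (env.reset(), and env.step() returning the next state, reward, and a done flag) and the hyperparameter values are assumptions for illustration, not any specific library’s API:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal sketch of tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy: explore sometimes, exploit otherwise.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy update: bootstrap from the best next action,
            # regardless of which action will actually be taken next.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```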

The Monte Carlo method

The Monte Carlo (MC) method is one of the most straightforward ways for an agent to estimate the best policy for maximizing cumulative reward. It applies only to episodic tasks, i.e. tasks with a well-defined end.
Using this method, the agent learns directly from complete episodes of experience. At the outset the agent has no idea which actions will yield the highest reward, so actions are initially chosen at random.

After trying out a large number of random policies, the agent learns which ones yield the highest returns and becomes more skilled at policy selection.
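
A first-visit Monte Carlo estimate of state values could be sketched as follows. The generate_episode function is hypothetical: it is assumed to run the current policy to the end of an episode and return a list of (state, reward) pairs:

```python
from collections import defaultdict

def mc_value_estimation(generate_episode, num_episodes=1000, gamma=0.99):
    """Minimal sketch of first-visit Monte Carlo value estimation."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()  # [(state, reward), ...] for one full episode
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: only update on the earliest occurrence of the state.
            if state not in (s for s, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```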

SARSA

State-action-reward-state-action (SARSA) is an on-policy temporal-difference learning method. This means it learns the value function from the action actually selected by the current policy.

The acronym SARSA reflects the quantities used in the Q-value update: the agent’s current state (S), the action chosen (A), the reward received for that action (R), the state the agent enters after performing the action (S'), and the action taken in that new state (A').
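
Side by side with the Q-learning sketch above, an on-policy SARSA loop might look like this; again, the environment interface and hyperparameter values are assumptions for illustration:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal sketch of SARSA (on-policy TD control)."""
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy update: bootstrap from the action that will actually
            # be taken next -- the S, A, R, S', A' of the name.
            Q[(state, action)] += alpha * (
                reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
            )
            state, action = next_state, next_action
    return Q
```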

Deep Q Neural Network (DQN)

DQN is a Q-learning algorithm that uses a neural network, as the name suggests. In an environment with a large state space, defining and updating a Q-table becomes impractical. A DQN addresses this by having a neural network estimate the Q-value of each action for a given state, instead of storing them in a Q-table.
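
The core idea can be sketched with a small PyTorch-style network that maps a state to one Q-value per action. The layer sizes and state/action dimensions below are illustrative assumptions, and a complete DQN would also need pieces such as experience replay and a target network:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of a DQN-style network: state in, one Q-value per action out."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: pick the action with the highest predicted Q-value.
q_net = QNetwork(state_dim=4, num_actions=2)   # made-up dimensions
state = torch.rand(1, 4)                       # a made-up observation
action = q_net(state).argmax(dim=1)
```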
