Most are specifically designed for bandit problems and are therefore hard to apply to large-scale or high-dimensional problems (e.g., Atari games), where their computational demands can become impractical.
These methods rely heavily on specific assumptions about the reward distributions or system dynamics, which limits their adaptability when those assumptions do not hold.
They may also struggle in dynamic environments with complex reward structures or frequent changes, leading to suboptimal performance. However, their simplicity makes them well suited to the bandit setting.
# Code adapted from: https://github.com/WhatIThinkAbout/BabyRobot/tree/master/Multi_Armed_Bandits
import random
import numpy as np
# display all floating point numbers to 3 decimal places
np.set_printoptions(formatter={'float': lambda x: "{0:0.3f}".format(x)})
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm
from IPython.core.pylabtools import figsize
class BanditArm:
""" the base arm class """
def __init__(self, q):
self.q = q # the true reward value
self.initialize() # reset the arm
def initialize(self):
self.Q = 0 # the estimate of this arm's reward value
self.n = 0 # the number of times this arm has been tried
def pull(self):
""" return a random amount of reward """
        # the reward is drawn from a Gaussian distribution with unit variance around the true
# value 'q'
value = np.random.randn() + self.q
        # never allow a reward less than 0 to be returned
return 0 if value < 0 else value
def update(self,R):
""" update this arm statistics after it has returned reward value 'R' """
# increment the number of times this arm has been tried
self.n += 1
# the new estimate of the mean is calculated from the old estimate
self.Q = (1 - 1.0/self.n) * self.Q + (1.0/self.n) * R
def sample(self,t):
""" return an estimate of the arm's reward value """
return self.Q
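The update above is the incremental form of the sample mean, $Q_n = Q_{n-1} + \frac{1}{n}(R_n - Q_{n-1})$, so no reward history needs to be stored. A quick sanity check of that equivalence (a minimal, illustrative sketch using the BanditArm class above with an arbitrary true value):
# illustrative check: the incremental estimate matches the plain sample mean of the observed rewards
check_arm = BanditArm(q=2.0)  # arbitrary true value, for illustration only
observed_rewards = []
for _ in range(1000):
    r = check_arm.pull()
    check_arm.update(r)
    observed_rewards.append(r)
assert np.isclose(check_arm.Q, np.mean(observed_rewards))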
"""
Helper Functions
"""
# return the index of the largest value in the supplied list
# - arbitrarily select between the largest values in the case of a tie
# (the standard np.argmax just chooses the first value in the case of a tie)
def random_argmax(value_list):
""" a random tie-breaking argmax"""
values = np.asarray(value_list)
return np.argmax(np.random.random(values.shape) * (values==values.max()))
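For example (illustrative only), with tied values np.argmax always returns the first maximum, while random_argmax breaks the tie at random:
tied_values = [1.0, 3.0, 3.0, 2.0]
print(np.argmax(tied_values))                            # always 1, the first of the tied maxima
print([random_argmax(tied_values) for _ in range(10)])   # a random mix of 1 and 2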
class ArmTester():
""" create and test a set of arms over a single test run """
    def __init__(self, arm_order, arm=BanditArm, multiplier=2, **kwargs ):
        # by default an arm tester records 2 pieces of information per arm over a run
        # (pop this before creating the arms so it is not forwarded to the arm constructor)
        self.number_of_stats = kwargs.pop('number_of_stats', 2)
        # create arms with a mean value defined by the arm order
        self.arms = [arm((q*multiplier)+2, **kwargs) for q in arm_order]
        # set the number of arms equal to the number created
        self.number_of_arms = len(self.arms)
        self.optimal_arm_index = np.argmax(arm_order)
def initialize_run(self, number_of_steps):
""" reset counters at the start of a run """
# save the number of steps over which the run will take place
self.number_of_steps = number_of_steps
# reset the actual number of steps that the test ran for
self.total_steps = 0
# monitor the total reward obtained over the run
self.total_reward = 0
# the current total reward at each timestep of the run
self.total_reward_per_timestep = []
# the actual reward obtained at each timestep
self.reward_per_timestep = []
# stats for each time-step
# - by default records: estimate, number of trials
self.arm_stats = np.zeros(shape=(number_of_steps+1,
self.number_of_arms,
self.number_of_stats))
# ensure that all arms are re-initialized
for arm in self.arms: arm.initialize()
def pull_and_update(self,arm_index):
""" pull & update the specified arm and associated parameters """
# pull the chosen arm and update its mean reward value
reward = self.arms[arm_index].pull()
self.arms[arm_index].update(reward)
# update the total reward
self.total_reward += reward
# store the current total reward at this timestep
self.total_reward_per_timestep.append(self.total_reward)
# store the reward obtained at this timestep
self.reward_per_timestep.append(reward)
def get_arm_stats( self, t ):
""" get the current information from each arm """
arm_stats = [[arm.Q, arm.n] for arm in self.arms]
return arm_stats
def get_mean_reward( self ):
""" the total reward averaged over the number of time steps """
return (self.total_reward/self.total_steps)
def get_total_reward_per_timestep( self ):
""" the cumulative total reward at each timestep of the run """
return self.total_reward_per_timestep
def get_reward_per_timestep( self ):
""" the actual reward obtained at each timestep of the run """
return self.reward_per_timestep
def get_estimates(self):
""" get the estimate of each arm's reward at each timestep of the run """
return self.arm_stats[:,:,0]
def get_number_of_trials(self):
""" get the number of trials of each arm at each timestep of the run """
return self.arm_stats[:,:,1]
def get_arm_percentages( self ):
""" get the percentage of times each arm was tried over the run """
return (self.arm_stats[:,:,1][self.total_steps]/self.total_steps)
def get_optimal_arm_percentage( self ):
""" get the percentage of times the optimal arm was tried """
final_trials = self.arm_stats[:,:,1][self.total_steps]
return (final_trials[self.optimal_arm_index]/self.total_steps)
def get_time_steps( self ):
""" get the number of time steps that the test ran for """
return self.total_steps
def select_arm( self, t ):
""" Greedy Selection"""
# choose the arm with the current highest mean reward or arbitrarily
# select an arm in the case of a tie
arm_index = random_argmax([arm.sample(t+1) for arm in self.arms])
return arm_index
def run( self, number_of_steps, maximum_total_reward = float('inf')):
""" perform a single run, over the set of arms,
for the defined number of steps """
# reset the run counters
self.initialize_run(number_of_steps)
# loop for the specified number of time-steps
for t in range(number_of_steps):
# get information about all arms at the start of the time step
self.arm_stats[t] = self.get_arm_stats(t)
# select an arm
arm_index = self.select_arm(t)
# pull the chosen arm and update its mean reward value
self.pull_and_update(arm_index)
# test if the accumulated total reward is greater than the maximum
if self.total_reward > maximum_total_reward:
break
        # save the actual number of steps that have been run
        # (t is zero-based, so t+1 pulls have been completed)
        self.total_steps = t + 1
# get the stats for each arm at the end of the run
self.arm_stats[t+1] = self.get_arm_stats(t+1)
return self.total_steps, self.total_reward
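A single run of the greedy ArmTester can be sanity-checked directly (a minimal usage sketch; the arm order and multiplier below are arbitrary and chosen only for illustration):
demo_tester = ArmTester(arm_order=[2,1,3,5,4], arm=BanditArm, multiplier=0.1)
demo_steps, demo_reward = demo_tester.run(number_of_steps=100)
print(f'steps = {demo_steps}, total reward = {demo_reward:0.3f}')
print(f'times each arm was tried = {demo_tester.get_number_of_trials()[demo_steps]}')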
class BanditEnvironment():
""" setup and run repeated arm tests to get the average results """
def __init__(self,
arm_tester = ArmTester,
number_of_tests = 1000,
number_of_steps = 30,
maximum_total_reward = float('inf'),
**kwargs ):
self.arm_tester = arm_tester
self.number_of_tests = number_of_tests
self.number_of_steps = number_of_steps
self.maximum_total_reward = maximum_total_reward
self.number_of_arms = self.arm_tester.number_of_arms
def initialize_run(self):
# keep track of the average values over the run
self.mean_total_reward = 0.
self.optimal_selected = 0.
self.mean_time_steps = 0.
self.arm_percentages = np.zeros(self.number_of_arms)
self.estimates = np.zeros(shape=(self.number_of_steps+1,self.number_of_arms))
self.number_of_trials = np.zeros(shape=(self.number_of_steps+1,self.number_of_arms))
# the cumulative total reward per timestep
self.cumulative_reward_per_timestep = np.zeros(shape=(self.number_of_steps))
# the actual reward obtained at each timestep
self.reward_per_timestep = np.zeros(shape=(self.number_of_steps))
def get_mean_total_reward(self):
""" the final total reward averaged over the number of timesteps """
return self.mean_total_reward
def get_cumulative_reward_per_timestep(self):
""" the cumulative total reward per timestep """
return self.cumulative_reward_per_timestep
def get_reward_per_timestep(self):
""" the mean actual reward obtained at each timestep """
return self.reward_per_timestep
def get_optimal_selected(self):
""" the mean times the optimal arm was selected """
return self.optimal_selected
def get_arm_percentages(self):
""" the mean of the percentage times each arm was selected """
return self.arm_percentages
def get_estimates(self):
""" per arm reward estimates """
return self.estimates
def get_number_of_trials(self):
""" per arm number of trials """
return self.number_of_trials
def get_mean_time_steps(self):
""" the average number of trials of each test """
return self.mean_time_steps
def update_mean( self, current_mean, new_value, n ):
""" calculate the new mean from the previous mean and the new value """
return (1 - 1.0/n) * current_mean + (1.0/n) * new_value
def update_mean_array( self, current_mean, new_value, n ):
""" calculate the new mean from the previous mean and the new value for an array """
new_value = np.array(new_value)
# pad the new array with its last value to make sure its the same length as the original
pad_length = (current_mean.shape[0] - new_value.shape[0])
if pad_length > 0:
new_array = np.pad(new_value,(0,pad_length), mode='constant', constant_values=new_value[-1])
else:
new_array = new_value
return (1 - 1.0/n) * current_mean + (1.0/n) * new_array
def record_test_stats(self,n):
""" update the mean value for each statistic being tracked over a run """
# calculate the new means from the old means and the new value
tester = self.arm_tester
self.mean_total_reward = self.update_mean( self.mean_total_reward, tester.get_mean_reward(), n)
self.optimal_selected = self.update_mean( self.optimal_selected, tester.get_optimal_arm_percentage(), n)
self.arm_percentages = self.update_mean( self.arm_percentages, tester.get_arm_percentages(), n)
self.mean_time_steps = self.update_mean( self.mean_time_steps, tester.get_time_steps(), n)
self.cumulative_reward_per_timestep = self.update_mean_array( self.cumulative_reward_per_timestep,
tester.get_total_reward_per_timestep(), n)
# check if the tests are only running until a maximum reward value is reached
if self.maximum_total_reward == float('inf'):
self.estimates = self.update_mean_array( self.estimates, tester.get_estimates(), n)
self.reward_per_timestep = self.update_mean_array( self.reward_per_timestep, tester.get_reward_per_timestep(), n)
self.number_of_trials = self.update_mean_array( self.number_of_trials, tester.get_number_of_trials(), n)
def run(self):
""" repeat the test over a set of arms for the specified number of trials """
# do the specified number of runs for a single test
self.initialize_run()
for n in tqdm(range(1,self.number_of_tests+1)):
# do one run of the test
self.arm_tester.run( self.number_of_steps, self.maximum_total_reward )
self.record_test_stats(n)
def run_multiple_tests( tester, max_steps = 500, show_arm_percentages = True ):
mean_rewards = []
number_of_tests = 500
number_of_steps = max_steps
maximum_total_reward = 800
    experiment = BanditEnvironment(arm_tester = tester,
                                   number_of_tests = number_of_tests,
                                   number_of_steps = number_of_steps,
                                   maximum_total_reward = maximum_total_reward)
experiment.run()
print(f'Mean Reward per Time Step = {experiment.get_mean_total_reward():0.3f}')
print(f'Optimal Arm Selected = {experiment.get_optimal_selected():0.3f}')
print(f'Average Number of Trials Per Run = {experiment.get_mean_time_steps():0.3f}')
if show_arm_percentages:
print(f'Arm Percentages = {experiment.get_arm_percentages()}')
mean_rewards.append(f"{experiment.get_mean_total_reward():0.3f}")
return experiment.get_cumulative_reward_per_timestep()
# create 5 arms in a fixed order
ARM_ORDER = [2,1,3,5,4]
MULTIPLIER = 0.1
np.random.seed(0)
random.seed(0)
Given the order, the true reward value of each arm is defined by the fixed rule used inside ArmTester:
print([(q*MULTIPLIER)+2 for q in ARM_ORDER])
This means the optimal arm index is:
print(np.argmax([(q*MULTIPLIER)+2 for q in ARM_ORDER]))
# Greedy Selection
greedy_result = run_multiple_tests(ArmTester(ARM_ORDER, BanditArm, MULTIPLIER ) )
class EpsilonGreedySelectionTester( ArmTester ):
def __init__(self, arm_order, multiplier, epsilon = 0.2 ):
# create a standard arm tester
super().__init__(arm_order=arm_order, multiplier=multiplier)
# save the probability of selecting the non-greedy action
self.epsilon = epsilon
def select_arm( self, t ):
""" Epsilon-Greedy Arm Selection"""
# probability of selecting a random arm
p = np.random.random()
# if the probability is less than epsilon then a random arm is chosen from the complete set
if p < self.epsilon:
arm_index = np.random.choice(self.number_of_arms)
else:
            # choose the arm with the current highest mean reward, or arbitrarily select an arm in the case of a tie
arm_index = random_argmax([arm.sample(t) for arm in self.arms])
return arm_index
# Epsilon Greedy
_ = run_multiple_tests(EpsilonGreedySelectionTester(ARM_ORDER, MULTIPLIER, epsilon = 0.8 ) )
We can tune $\epsilon$ to improve performance further. Sweeping over a range of values:
for e in [0.0,0.2,0.4,0.6,0.8,1.0]:
print(f"---STATISTICS For Epsilon-Greedy (e={e})--")
run_multiple_tests(EpsilonGreedySelectionTester(ARM_ORDER, MULTIPLIER, epsilon = e ) )
As we can see, for this specific ARM_ORDER only some values of $\epsilon$ give clearly better results than Greedy: those runs identify the best arm more reliably, as indicated by a higher optimal-arm percentage.
epsilon_greedy_result = run_multiple_tests(EpsilonGreedySelectionTester(ARM_ORDER, MULTIPLIER, epsilon = 0.4 ) )
Now we start implementing some advanced baselines.
class UCBArm( BanditArm ):
def __init__( self, q, **kwargs ):
""" initialize the UCB arm """
# store the confidence level controlling exploration -- corresponding to hyperparameter c in the slides
self.c = kwargs.pop('confidence_level', 2.0)
# pass the true reward value to the base class
super().__init__(q)
    def uncertainty(self, t):
        """ calculate the uncertainty in the estimate of this arm's mean """
        # an untried arm has infinite uncertainty, so it will be selected at least once
        if self.n == 0:
            return float('inf')
        return self.c * np.sqrt(np.log(t) / self.n)
def sample(self,t):
""" the UCB reward is the estimate of the mean reward plus its uncertainty """
return self.Q + self.uncertainty(t)
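The UCB score $Q + c\sqrt{\ln t / n}$ gives a larger exploration bonus to arms that have been tried less often. A minimal illustration (all values below are arbitrary and only override the arm's internal statistics for demonstration):
arm_a = UCBArm(q=2.0, confidence_level=0.6)
arm_b = UCBArm(q=2.0, confidence_level=0.6)
arm_a.Q, arm_a.n = 2.0, 5     # same estimate, tried only 5 times
arm_b.Q, arm_b.n = 2.0, 50    # same estimate, tried 50 times
print(arm_a.sample(100), arm_b.sample(100))  # the rarely tried arm gets the larger bonus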
For Thompson Sampling we must assume a form for the reward distribution; here we assume it is Gaussian, which matches our experimental setting.
class GaussianThompsonArm( BanditArm ):
def __init__(self, q):
        self.τ_0 = 0.0001 # the precision of the posterior (initialized to a vague prior)
        self.μ_0 = 1      # the mean of the posterior (initialized to the prior mean)
# pass the true reward value to the base class
super().__init__(q)
def sample(self,t):
""" return a value from the the posterior normal distribution """
return (np.random.randn() / np.sqrt(self.τ_0)) + self.μ_0
def update(self,R):
""" update this arm after it has returned reward value 'R' """
# do a standard update of the estimated mean
super().update(R)
# update the mean and precision of the posterior
self.μ_0 = ((self.τ_0 * self.μ_0) + (self.n * self.Q))/(self.τ_0 + self.n)
self.τ_0 += 1
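Each call to sample draws from the current posterior $\mathcal{N}(\mu_0, 1/\tau_0)$, so an arm that has been updated many times produces samples tightly concentrated around its posterior mean, while a fresh arm still produces a very wide spread. A minimal illustration (arbitrary true value, for demonstration only):
ts_arm = GaussianThompsonArm(q=2.5)
print(np.std([ts_arm.sample(0) for _ in range(1000)]))   # wide: the prior precision is tiny
for _ in range(100):
    ts_arm.update(ts_arm.pull())
print(np.std([ts_arm.sample(0) for _ in range(1000)]))   # narrow: the posterior has concentrated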
# UCB
for c in [0.0,0.2,0.4,0.6]:
print(f"---STATISTICS For UCB (c={c})--")
run_multiple_tests( ArmTester(ARM_ORDER, UCBArm, MULTIPLIER, confidence_level = c ))
ucb_result = run_multiple_tests( ArmTester(ARM_ORDER, UCBArm, MULTIPLIER, confidence_level = 0.6 ))
# Thompson Sampling
print(f"---STATISTICS For Thompson Sampling--")
thompson_sampling_result = run_multiple_tests( ArmTester(ARM_ORDER, GaussianThompsonArm, MULTIPLIER ))
As we can see, UCB and Thompson Sampling can easily identify the optimal arm. Reducing MULTIPLIER (bringing the arms' true values closer together) will make the problem harder.
test_names = ['Greedy','Epsilon Greedy','UCB','Thompson Sampling']
plt.figure(figsize=(10,8))
plt.yticks(np.arange(0., 4200, 600))
rewards = [greedy_result, epsilon_greedy_result, ucb_result, thompson_sampling_result]
for test in range(len(rewards)):
plt.plot(rewards[test], label = f'{test_names[test]}')
plt.legend()
plt.title('Mean Total Reward vs Time', fontsize=15)
plt.xlabel('Time Step')
plt.ylabel('Mean Total Reward')