Neural Networks

Volume 143, November 2021, Pages 218-229

2021 Special Issue on AI and Brain Science: AI-powered Brain Science
The asymmetric learning rates of murine exploratory behavior in sparse reward environments

https://doi.org/10.1016/j.neunet.2021.05.030

Abstract

Goal-oriented behaviors of animals can be modeled by reinforcement learning algorithms. Such algorithms predict future outcomes of selected actions using action values and update those values in response to positive and negative outcomes. In many models of animal behavior, the action values are updated symmetrically based on a common learning rate, that is, in the same way for both positive and negative outcomes. However, animals in environments with scarce rewards may have uneven learning rates. To investigate the asymmetry in learning rates for reward and non-reward, we analyzed the exploration behavior of mice in five-armed bandit tasks using a Q-learning model with differential learning rates for positive and negative outcomes. The positive learning rate was significantly higher in a scarce reward environment than in a rich reward environment, and conversely, the negative learning rate was significantly lower in the scarce environment. The ratio of the positive to the negative learning rate was about 10 in the scarce environment and about 2 in the rich environment. This result suggests that when the reward probability was low, the mice tended to ignore failures and exploit the rare rewards. Computational modeling analysis revealed that the increased learning rate ratio could cause overestimation of, and perseveration on, rarely rewarding events, increasing total reward acquisition in the scarce environment but at the expense of impartial exploration.

Introduction

Learning from mistakes can contribute to making more appropriate decisions. Even animals, not just humans, learn from both positive and negative experiences, the former reinforcing successful actions and the latter inhibiting unsuccessful ones (Thorndike, 1911). Even though animals need to adapt to different environments with variable reward distributions (Barnett & Spencer, 1951), it is unclear whether they react to success and failure in the same manner in all situations. For example, when animals experience a low success rate in finding something to eat in an environment with scarce food, should they learn this as a failure because they fail so many times, or as a success because they beat the odds? The purpose of this study is to investigate how animals behave in such a harsh environment and to analyze their exploration–exploitation strategies from the viewpoint of reinforcement learning. The use of experimental animals, performing an operant task without a priori knowledge of the environment, will help to clarify their basic learning strategies.

To estimate rodent learning strategies, many reinforcement learning models have been examined (Cinotti et al., 2019, Ito and Doya, 2009). One of the most popular reinforcement learning models, Q-learning, updates the value of an action using the reward prediction error (RPE), which is the difference between the received reward and the expected reward (Sutton & Barto, 2018). Positive RPEs occur when more is received than expected, resulting in reinforcement of the reward-inducing action. Conversely, negative RPEs occur when less is received than expected, resulting in a reduced probability of selecting that action (Neftci and Averbeck, 2019, Sutton and Barto, 2018). As such, positive and negative RPEs have opposite effects on the future selection probability of the executed action, based on that action's outcome.
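
As a concrete illustration, the following minimal sketch shows this RPE-driven value update in Python; the function and variable names are ours, and the learning rate is an arbitrary illustrative value rather than a fitted estimate.

```python
def q_update(q_value, reward, learning_rate=0.1):
    """Update an action value from its reward prediction error (RPE)."""
    rpe = reward - q_value              # positive if the outcome exceeds expectation
    return q_value + learning_rate * rpe

# Example: an unexpected reward raises the value, an omitted reward lowers it
q = 0.5
q = q_update(q, reward=1.0)             # RPE = +0.5, value increases
q = q_update(q, reward=0.0)             # RPE is negative, value decreases
```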

Q-learning algorithms typically use the same learning rate for both positive and negative RPEs, where the learning rate is the efficacy of updating action values. The reason for adopting this symmetric update is its simplicity, and this simplicity is valuable for accurate estimation of expected rewards as well. However, there is no guarantee that symmetry will always be beneficial in a natural environment.
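
The asymmetric alternative to this symmetric update simply applies a different learning rate depending on the sign of the RPE. A minimal sketch, assuming separate rates for positive and negative RPEs (the parameter names and default values are illustrative, not the paper's fitted estimates):

```python
def q_update_asymmetric(q_value, reward, alpha_pos=0.4, alpha_neg=0.04):
    """Update an action value with separate learning rates for
    positive and negative reward prediction errors."""
    rpe = reward - q_value
    alpha = alpha_pos if rpe >= 0 else alpha_neg
    return q_value + alpha * rpe
```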

In the natural environment, food and other rewards are usually dispersed over very large spaces. The probability of securing a reward repeatedly in the same location can be small, and the outcome is essentially binary: a reward is either obtained or not obtained at all. In such an environment, starving animals have to focus on exploiting rare opportunities (Barnett & Spencer, 1951), even at the expense of accurate estimation. This issue is generally discussed as a trade-off between exploration and exploitation. In the context of Q-learning, this trade-off is a problem of parameter regulation of action choice policies such as ε-greedy and soft-max functions (Daw et al., 2006, Tokic and Palm, 2012). However, this problem could also be addressed by drawing on neuroscientific principles of how RPEs affect action values.
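
For reference, the two policies mentioned above can be sketched as follows; the variable names and default parameters are ours, chosen only for illustration.

```python
import numpy as np

def softmax_policy(q_values, beta=3.0):
    """Choice probabilities from action values; beta is the inverse temperature
    (higher beta favors exploitation, lower beta favors exploration)."""
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Choose the highest-valued action with probability 1 - epsilon,
    otherwise choose an action uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```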

Past research has revealed that dopamine neurons in the midbrain exhibit firing patterns similar to positive and negative RPEs (Dabney et al., 2020, Schultz, 2015). Dopaminergic neurons project to the striatum in the basal ganglia, which is considered to encode behavioral action values (Neftci and Averbeck, 2019, Samejima et al., 2005, Sutton and Barto, 2018). The basal ganglia have a direct pathway that promotes behavior and an indirect pathway that switches behavior (Nonomura et al., 2018, Ueda et al., 2017). Dopamine has different modes of plasticity in each pathway (Shen, Flajolet, Greengard, & Surmeier, 2008), suggesting that RPE-like signals have different effects on behavioral promotion and switching.

Frank et al. reported that the positive and negative learning rates of humans vary depending on polymorphisms in genes related to dopaminergic signaling pathways and on the administration of dopamine replacement medication in patients with Parkinson's disease (Frank et al., 2009, Frank et al., 2007, Frank et al., 2004).

The asymmetry between positive and negative learning rates has been considered a type of irrational cognitive bias, and the significance of this asymmetry has been unclear. In this regard, Cazé and Van Der Meer (2013) theoretically demonstrated that in environments with low average reward rates, lowering the negative learning rate relative to the positive learning rate can increase total reward acquisition. If the asymmetry has such a benefit, then humans may vary their learning rates in a reward probability-dependent manner. To test this hypothesis, Gershman (2015) conducted experiments on humans with 2-armed bandit tasks. However, they found that the negative learning rate was always higher than the positive learning rate, regardless of whether the reward probability was high or low; therefore, this experiment on humans failed to support the hypothesis. We think, however, that this failure can be attributed, as discussed below, to human brain-specific characteristics and to the binary choice task.

Animals, including humans, tend to keep choosing the same option, especially if it was rewarded in the previous trial; on the other hand, they tend to stop choosing that option and choose another if it was not rewarded. This short-term tendency is called the law of “Win-Stay Lose-Shift” (WSLS) and has been observed in non-human primates (Lee, Conroy, McGreevy, & Barraclough, 2004), rats (Skelin et al., 2014), and mice (Amodeo, Jones, Sweeney, & Ragozzino, 2012). In rats, WSLS behavior has been modeled by Q-learning with a forgetting term for the Q value (Cinotti et al., 2019, Ito and Doya, 2009), and Win-Stay has been found to be impaired by the dopamine D1/D2 receptor antagonist flupentixol (Cinotti et al., 2019). Inhibitory optogenetic analysis revealed that both Win-Stay and Lose-Shift in mice are regulated by direct pathways in the dorsolateral striatum (Bergstrom et al., 2018). These rodent studies clearly show that the basal ganglia dopaminergic system is involved in WSLS. In humans, the WSLS heuristic is most plainly seen in the rock–paper–scissors game. EEG studies of this game indicate that reaction time after a win tends to be longer than after a loss, and that feedback-related negativity after a win, but not after a loss, varies in a win-rate dependent manner (Forder & Dyson, 2016). This suggests that the complex and slow neural processes involved in the Win-Stay decision may reside in the anterior cingulate gyrus in humans. Win-Stay and Lose-Shift may thus depend on distinct brain regions that have developed differently in non-human animals and humans.
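
To make the WSLS measure concrete, the empirical Win-Stay and Lose-Shift probabilities can be computed from a trial-by-trial record of choices and rewards roughly as follows; this is a sketch of the standard definition, and the function name is ours.

```python
import numpy as np

def wsls_rates(choices, rewards):
    """Empirical Win-Stay and Lose-Shift probabilities.

    choices: sequence of chosen options (one entry per trial)
    rewards: sequence of 0/1 outcomes for the same trials
    """
    choices = np.asarray(choices)
    rewards = np.asarray(rewards).astype(bool)
    stay = choices[1:] == choices[:-1]       # did the animal repeat its previous choice?
    win = rewards[:-1]                       # was the previous trial rewarded?
    win_stay = stay[win].mean() if win.any() else np.nan
    lose_shift = (~stay[~win]).mean() if (~win).any() else np.nan
    return win_stay, lose_shift
```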

Analytical studies of behavioral data involving Q-learning are generally conducted using tasks with a small number of action options, and in many cases the number is only two. While Gershman (2015) used a 2-armed bandit task, the experiment that first reported asymmetric variability in human learning rates used a task of choosing between two of six alternatives (Frank et al., 2004). In order to capture reward probability-dependent changes in learning rates, a more difficult task with more than two choices is considered necessary. In a binary selection task, the destination of a “Lose-Shift” is limited to a single alternative; therefore, exploratory behavior, in the sense of choosing among several options, cannot be analyzed. Moreover, binary selection tasks do not provide information on the complexity of the exploration pattern or on how often a particular exploration is performed.

Foraging behavior is critical for animals, and they may adjust their exploration and exploitation patterns according to the distribution of food. From the perspective of reinforcement learning, there are two possible ways to adjust the exploration–exploitation pattern. One is to directly change factors of the action selection policy. The behavioral choices of humans and rodents fit well to a Q-learning model with a soft-max policy function that considers the values of multiple potential alternatives (Cinotti et al., 2019, Daw et al., 2006). One example of the direct adjustment of policy is a change of the inverse temperature parameter in the soft-max function (Humphries, Khamassi, & Gurney, 2012). Another way to modulate the exploration–exploitation pattern is to regulate the positive and negative learning rates individually (Cazé and Van Der Meer, 2013, Gershman, 2015). We hypothesized that both are possible, but that the dissimilarity of learning rates may become more evident in a low reward probability environment. Since asymmetric learning rates involve the dopamine-basal ganglia system (Frank et al., 2009, Frank et al., 2007, Frank et al., 2004), it is necessary to focus on this system. We expected that this hypothesis could be tested using experimental animals whose basal ganglia are relatively large compared to the cerebral cortex, as the latter could interfere with low-level learning. In the present study, we designed a five-armed bandit task (5-ABT) with Bernoulli rewards (the reward received is either zero or one) (Tamatsukuri & Takahashi, 2019) for mice. Since an extended trial period is required to perform the low reward probability task, the nest box was connected to the operant chamber so that the mice could perform the tasks continuously (Remmelink, Chau, Smit, Verhage, & Loos, 2017). Using this 5-ABT with varying reward distributions, we analyzed the long-term behavioral changes of mice in low- and high-reward probability environments.
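
To illustrate how the pieces described above fit together, the following sketch simulates an agent with asymmetric learning rates and a soft-max policy on a five-armed Bernoulli bandit. It is a toy model under our own assumptions (function name, parameter values, and reward schedule are illustrative), not a reproduction of the paper's fitted model or task schedule.

```python
import numpy as np

def simulate_5abt(reward_probs, n_trials=1000, alpha_pos=0.4, alpha_neg=0.04,
                  beta=3.0, seed=0):
    """Asymmetric Q-learning agent with a soft-max policy on a
    multi-armed bandit with Bernoulli (0/1) rewards."""
    rng = np.random.default_rng(seed)
    q = np.zeros(len(reward_probs))
    choices, rewards = [], []
    for _ in range(n_trials):
        probs = np.exp(beta * (q - q.max()))          # soft-max action selection
        probs /= probs.sum()
        a = rng.choice(len(q), p=probs)
        r = float(rng.random() < reward_probs[a])     # Bernoulli reward
        rpe = r - q[a]
        q[a] += (alpha_pos if rpe >= 0 else alpha_neg) * rpe
        choices.append(a)
        rewards.append(r)
    return np.array(choices), np.array(rewards), q

# Example: a "scarce" environment where every hole pays off 30% of the time
choices, rewards, q = simulate_5abt([0.3] * 5)
```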

Section snippets

Imbalance of Win-Stay Lose-Shift

We observed behavioral choice patterns in the equiprobability steady-state tasks ALL30 and ALL50, where the reward probability for all choices was set uniformly and constantly at either 30% or 50% (Fig. 1A). We evaluated the Shannon entropy of action selection to assess the variation in the mice's choices. We calculated entropies for 300 trials of nose-poke patterns over the five holes in the two groups. A statistically significant difference was found in the entropies between the two groups (ALL30 =1.71±0.31(
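
For clarity, the entropy measure referred to here can be computed from the empirical choice distribution roughly as follows; we assume a base-2 logarithm (entropy in bits), and the function name is ours. For five equiprobable options the maximum value is log2(5) ≈ 2.32 bits.

```python
import numpy as np

def choice_entropy(choices, n_options=5):
    """Shannon entropy (bits) of the empirical distribution of hole choices."""
    counts = np.bincount(np.asarray(choices), minlength=n_options)
    p = counts / counts.sum()
    p = p[p > 0]                     # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())
```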

Discussion

In the equiprobability steady-state tasks (Fig. 1A), we found different mean entropies in environments with different reward probabilities (Fig. 1B). The result clearly indicates that the reward probability has an effect on the mice's exploration pattern. Interestingly, the difference among individuals in choosing whether to explore or not was significantly greater when the reward probability was high (Fig. 1B, ALL50). This may suggest that satisficing criteria (such as aspiration level) (

Ethical statements

All animal procedures were conducted in accordance with the institutional ethical guidelines for animal experiments of the National Defense Medical College (Tokorozawa, Saitama, Japan). All experimental procedures were approved by the Animal Research Committee of the National Defense Medical College (18064).

Animals

This study included 28 male and 28 female in-bred C57BL/6J mice that were maintained on a 12-h light/dark cycle and at 22–25 °C ambient temperature. The mice were offspring of C57BL/6J mice

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We are grateful to T. Tobita, K. Isoda, S. Kanda, and M. Kawabata for their excellent assistance. We also thank K. Hasegawa and D.A. Tyurmin for language assistance and advice. This work was supported by a grant for Advanced Research on Defense Medicine from the Ministry of Defense of Japan and JSPS KAKENHI Grant Numbers: JP20H04259, JP20K05933, JP20K07958, JP20K11948, JP17H04696 and JP18H03539.

References (46)

  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control.
  • Barnett, S. A., et al. (1951). Feeding, social behaviour and interspecific competition in wild rats. Behaviour.
  • Cazé, R. D., et al. (2013). Adaptive properties of differential learning rates for positive and negative outcomes. Biological Cybernetics.
  • Cinotti, F., et al. (2019). Dopamine blockade impairs the exploration-exploitation trade-off in rats. Scientific Reports.
  • Dabney, W., et al. (2020). A distributional code for value in dopamine-based reinforcement learning. Nature.
  • Daw, N. D., et al. (2006). Cortical substrates for exploratory decisions in humans. Nature.
  • Forder, L., et al. (2016). Behavioural and neural modulation of win-stay but not lose-shift strategies as a function of outcome value in Rock, Paper, Scissors. Scientific Reports.
  • Frank, M. J., et al. (2009). Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience.
  • Frank, M. J., et al. (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences of the United States of America.
  • Frank, M. J., et al. (2004). By carrot or by stick: Cognitive reinforcement learning in Parkinsonism. Science.
  • Gershman, S. J. (2015). Do learning rates adapt to the distribution of rewards? Psychonomic Bulletin & Review.
  • Gershman, S. J., et al. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science.
  • Humphries, M. D., et al. (2012). Dopaminergic control of the exploration-exploitation trade-off via the basal ganglia. Frontiers in Neuroscience.