2021 Special Issue on AI and Brain Science: AI-powered Brain Science

The asymmetric learning rates of murine exploratory behavior in sparse reward environments
Introduction
Learning from mistakes can contribute to making more appropriate decisions. Animals, not just humans, learn from both positive and negative experiences, the former reinforcing successful actions and the latter inhibiting unsuccessful ones (Thorndike, 1911). Although animals need to adapt to environments with variable reward distributions (Barnett & Spencer, 1951), it is unclear whether they react to success and failure in the same manner in all situations. For example, when animals experience a low success rate in finding food in an environment where food is scarce, should they treat each outcome as a failure because they fail so often, or as a success because they beat the odds? The purpose of this study is to investigate how animals behave in such a harsh environment and to analyze their exploration–exploitation strategies from the viewpoint of reinforcement learning. Experimental animals placed in an operant task without a priori knowledge of the environment can help clarify their basic learning strategies.
To estimate rodent learning strategies, many reinforcement learning models have been examined (Cinotti et al., 2019, Ito and Doya, 2009). One of the most popular, Q-learning, updates the value of an action using the reward prediction error (RPE), the difference between the received reward and the expected reward (Sutton & Barto, 2018). Positive RPEs occur when more is received than expected, reinforcing the reward-inducing action. Conversely, negative RPEs occur when less is received than expected, reducing the probability of selecting that action (Neftci and Averbeck, 2019, Sutton and Barto, 2018). Thus, positive and negative RPEs have opposite effects on the future selection probability of the executed action, depending on that action's outcome.
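The RPE-driven value update described above can be sketched in a few lines; the learning rate value and variable names here are illustrative assumptions, not parameters from the study:

```python
def q_update(q, reward, alpha=0.1):
    """Update a single action value using the reward prediction error (RPE)."""
    rpe = reward - q          # positive when the outcome exceeds expectation
    return q + alpha * rpe    # move the value estimate toward the received reward

# A rewarded action (reward = 1) gains value; an unrewarded one (reward = 0) loses value.
q = 0.5
q_win = q_update(q, 1.0)   # 0.55
q_loss = q_update(q, 0.0)  # 0.45
```

With a single shared `alpha`, wins and losses of equal surprise move the estimate by equal amounts, which is the symmetric update discussed next.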
Q-learning algorithms typically use the same learning rate, the efficacy with which action values are updated, for both positive and negative RPEs. This symmetric update is adopted for its simplicity, which is also valuable for accurate estimation of expected rewards. However, there is no guarantee that symmetry is always beneficial in a natural environment.
In the natural environment, food and other rewards are usually dispersed over very large spaces. The probability of securing a reward repeatedly in the same location can be small, and outcomes are binary: a reward is either obtained or not. In such an environment, starving animals must focus on exploiting rare opportunities (Barnett & Spencer, 1951), even at the expense of accurate estimation. This issue is generally discussed as a trade-off between exploration and exploitation. In the context of Q-learning, this trade-off is a problem of regulating the parameters of action-choice policies such as the ε-greedy and soft-max functions (Daw et al., 2006, Tokic and Palm, 2012). However, the problem could also be addressed by drawing on neuroscientific principles concerning how RPEs affect action values.
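As a concrete illustration of the policy side of this trade-off, a soft-max choice rule with an inverse temperature parameter can be sketched as follows; the action values and beta values are illustrative only:

```python
import math

def softmax_policy(q_values, beta):
    """Soft-max choice probabilities over action values.
    A higher inverse temperature beta means more exploitation;
    a lower beta means more exploration."""
    exps = [math.exp(beta * q) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# With identical action values, beta alone shifts the exploration-exploitation balance.
exploit = softmax_policy([0.2, 0.8], beta=10.0)  # strongly prefers the better option
explore = softmax_policy([0.2, 0.8], beta=0.5)   # close to uniform choice
```

Tuning `beta` (or the ε of ε-greedy) is the "direct policy adjustment" route; the alternative route of asymmetric learning rates acts on the values being fed into this rule.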
Past research has revealed that dopamine neurons in the midbrain exhibit firing patterns resembling positive and negative RPEs (Dabney et al., 2020, Schultz, 2015). Dopaminergic neurons project to the striatum in the basal ganglia, which is considered to encode behavioral action values (Neftci and Averbeck, 2019, Samejima et al., 2005, Sutton and Barto, 2018). The basal ganglia contain a direct pathway that promotes behavior and an indirect pathway that switches behavior (Nonomura et al., 2018, Ueda et al., 2017). Dopamine induces a different mode of plasticity in each pathway (Shen, Flajolet, Greengard, & Surmeier, 2008), suggesting that RPE-like signals have different effects on promotion and switching.
Frank et al. reported that the positive and negative learning rates of humans vary with polymorphisms in genes related to dopaminergic signaling and with the administration of dopamine replacement medication in patients with Parkinson's disease (Frank et al., 2009, Frank et al., 2007, Frank et al., 2004).
The asymmetry between positive and negative learning rates has been considered a type of irrational cognitive bias, and its significance has been unclear. In this regard, Cazé and Van Der Meer (2013) theoretically demonstrated that in environments with low average reward rates, lowering the negative learning rate relative to the positive learning rate can increase total reward acquisition. If the asymmetry confers such a benefit, then humans may vary their learning rates in a reward probability-dependent manner. To test this hypothesis, Gershman (2015) conducted 2-armed bandit experiments on humans. However, the negative learning rate was always higher than the positive learning rate regardless of whether the reward probability was high or low; the experiment therefore failed to support the hypothesis. We think this failure can be attributed, as discussed below, to human brain-specific characteristics and to the binary choice task.
Animals, including humans, tend to keep choosing the same option if it was rewarded in the previous trial, and tend to switch to another option if it was not. This short-term tendency, called "Win-Stay Lose-Shift" (WSLS), has been observed in non-human primates (Lee, Conroy, McGreevy, & Barraclough, 2004), rats (Skelin et al., 2014), and mice (Amodeo, Jones, Sweeney, & Ragozzino, 2012). In rats, WSLS behavior has been modeled by Q-learning with a forgetting term on the Q value (Cinotti et al., 2019, Ito and Doya, 2009), and Win-Stay has been found to be impaired by the dopamine receptor antagonist flupentixol (Cinotti et al., 2019). Inhibitory optogenetic analysis revealed that both Win-Stay and Lose-Shift in mice are regulated by the direct pathway in the dorsolateral striatum (Bergstrom et al., 2018). These rodent studies clearly show that the basal ganglia dopaminergic system is involved in WSLS. In humans, the WSLS heuristic is most plainly seen in the rock–paper–scissors game. EEG studies of this game indicate that reaction time after a win tends to be longer than after a loss, and that feedback-related negativity after a win, but not after a loss, varies in a win-rate-dependent manner (Forder & Dyson, 2016). This suggests that in humans, Win-Stay decisions involve complex and slow neural processes, possibly in the anterior cingulate gyrus. Win-Stay and Lose-Shift may therefore depend on distinct brain regions that have developed differently in non-human animals and in humans.
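The WSLS tendencies discussed above can be quantified directly from trial sequences; a minimal sketch, where the function name and example data are hypothetical:

```python
def wsls_rates(choices, rewards):
    """Empirical Win-Stay and Lose-Shift probabilities from per-trial
    choices (e.g., hole indices) and binary rewards (0 or 1)."""
    win = win_stay = lose = lose_shift = 0
    for t in range(1, len(choices)):
        if rewards[t - 1]:                              # previous trial was a win
            win += 1
            win_stay += choices[t] == choices[t - 1]    # stayed on the same option
        else:                                           # previous trial was a loss
            lose += 1
            lose_shift += choices[t] != choices[t - 1]  # shifted to another option
    return (win_stay / win if win else float("nan"),
            lose_shift / lose if lose else float("nan"))

# Six hypothetical trials on a five-hole task:
ws, ls = wsls_rates([2, 2, 4, 4, 1, 1], [1, 0, 0, 1, 1, 0])  # ws = 2/3, ls = 1/2
```

Note that in a five-option task, Lose-Shift can land on any of four alternatives, which is why multi-armed tasks expose exploration structure that binary tasks cannot.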
Analytical studies of behavioral data using Q-learning are generally conducted with tasks that have a small number of action options, often only two. While Gershman (2015) used a 2-armed bandit task, the experiment that first reported asymmetric variability in human learning rates used a task of choosing between two of six alternatives (Frank et al., 2004). To capture reward probability-dependent changes in learning rates, a more difficult task with more than two choices seems necessary. In a binary selection task, the transition destination of "Lose-Shift" is limited to a single alternative; exploratory behavior, in the sense of choosing among several options, therefore cannot be analyzed. Moreover, binary selection tasks provide no information on the complexity of the exploration pattern or on how often a particular exploration is performed.
Foraging behavior is critical for animals, and they may adjust their exploration and exploitation patterns according to the distribution of food. From the perspective of reinforcement learning, there are two possible ways to adjust the exploration–exploitation pattern. One is to directly change factors of the action selection policy. The behavioral choices of humans and rodents fit well to a Q-learning model with a soft-max policy function that considers the values of multiple potential alternatives (Cinotti et al., 2019, Daw et al., 2006); one example of direct policy adjustment is changing the inverse temperature parameter of the soft-max function (Humphries, Khamassi, & Gurney, 2012). The other way is to regulate the positive and negative learning rates individually (Cazé and Van Der Meer, 2013, Gershman, 2015). We hypothesized that both are possible, but that the dissimilarity of learning rates becomes more evident in a low reward probability environment. Since asymmetric learning rates involve the dopamine–basal ganglia system (Frank et al., 2009, Frank et al., 2007, Frank et al., 2004), we focused on this system. We expected that the hypothesis could be tested using experimental animals with relatively large basal ganglia compared to the cerebral cortex, as the latter could interfere with low-level learning. In the present study, we designed a five-armed bandit task (5-ABT) with Bernoulli rewards (the reward received is either zero or one) (Tamatsukuri & Takahashi, 2019) for mice. Since an extended trial period is required to perform the low reward probability task, the nest box was connected to the operant chamber so that tasks could be performed continuously (Remmelink, Chau, Smit, Verhage, & Loos, 2017). Using this 5-ABT with varying reward distributions, we analyzed the long-term behavioral changes of mice in low- and high-reward probability environments.
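The second adjustment route, separate learning rates for positive and negative RPEs, can be sketched as an agent on a Bernoulli multi-armed bandit of the kind used in the 5-ABT; all parameter values below are illustrative assumptions, not the fitted values from this study:

```python
import math
import random

def run_bandit(p_reward, alpha_pos, alpha_neg, beta=5.0, n_trials=2000, seed=0):
    """Asymmetric Q-learning with a soft-max policy on a Bernoulli bandit.
    Returns the total reward collected over n_trials."""
    rng = random.Random(seed)
    q = [0.0] * len(p_reward)
    total = 0.0
    for _ in range(n_trials):
        weights = [math.exp(beta * v) for v in q]            # soft-max policy
        a = rng.choices(range(len(q)), weights=weights)[0]   # sample an arm
        reward = 1.0 if rng.random() < p_reward[a] else 0.0  # Bernoulli outcome
        total += reward
        rpe = reward - q[a]
        # asymmetric update: different learning rates for positive and negative RPEs
        q[a] += (alpha_pos if rpe > 0 else alpha_neg) * rpe
    return total

# e.g., a sparse environment with one slightly better arm:
total = run_bandit([0.1, 0.1, 0.1, 0.1, 0.3], alpha_pos=0.1, alpha_neg=0.02)
```

Setting `alpha_neg` below `alpha_pos`, as in Cazé and Van Der Meer (2013), makes rare wins decay more slowly from the value estimates, which favors persistence in low reward probability environments.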
Imbalance of Win-Stay Lose-Shift
We observed behavioral choice patterns in the equiprobability steady-state tasks ALL30 and ALL50, in which the reward probability for all choices was set uniformly and constantly at either 30% or 50% (Fig. 1A). We evaluated the Shannon entropy of action selection to assess the variation in the mice's choices, calculating entropies over 300 trials of nose-poke patterns across the five holes in the two groups. A statistically significant difference in entropy was found between the two groups (ALL30
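The Shannon entropy measure used here can be sketched as follows; the example sequences are hypothetical:

```python
import math
from collections import Counter

def choice_entropy(choices):
    """Shannon entropy (bits) of a sequence of hole choices.
    For five holes the maximum is log2(5) ≈ 2.32 bits (uniform exploration);
    pure exploitation of a single hole gives 0 bits."""
    n = len(choices)
    counts = Counter(choices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = choice_entropy([1, 2, 3, 4, 5] * 60)  # 300 trials, maximally exploratory
fixated = choice_entropy([3] * 300)             # 300 trials on one hole: 0 bits
```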
Discussion
In the equiprobability steady-state tasks (Fig. 1A), we found different mean entropies in environments with different reward probabilities (Fig. 1B). This result clearly indicates that reward probability affects the mice's exploration pattern. Interestingly, there was significantly greater variation among individuals in whether to explore when the reward probability was high (Fig. 1B, ALL50). This may suggest that satisficing criteria (such as aspiration level) (
Ethical statements
All animal procedures were conducted in accordance with the institutional ethical guidelines for animal experiments of the National Defense Medical College (Tokorozawa, Saitama, Japan). All experimental procedures were approved by the Animal Research Committee of the National Defense Medical College (18064).
Animals
This study included 28 male and 28 female inbred C57BL/6J mice that were maintained on a -h light/dark cycle at ambient temperature. The mice were offspring of C57BL/6J mice
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We are grateful to T. Tobita, K. Isoda, S. Kanda, and M. Kawabata for their excellent assistance. We also thank K. Hasegawa and D.A. Tyurmin for language assistance and advice. This work was supported by a grant for Advanced Research on Defense Medicine from the Ministry of Defense of Japan and JSPS KAKENHI Grant Numbers: JP20H04259, JP20K05933, JP20K07958, JP20K11948, JP17H04696 and JP18H03539.
References
- Differences in BTBR T+ tf/J and C57BL/6J mice on probabilistic reversal learning and stereotyped behaviors. Behavioural Brain Research (2012).
- Dorsolateral striatum engagement interferes with early discrimination learning. Cell Reports (2018).
- On the shape of the probability weighting function. Cognitive Psychology (1999).
- The statistical structures of reinforcement learning with asymmetric value updates. Journal of Mathematical Psychology (2018).
- Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research (2004).
- Distinct dopaminergic control of the direct and indirect pathways in reward-based and avoidance learning behaviors. Neuroscience (2014).
- Hierarchical Bayesian parameter estimation for cumulative prospect theory. Journal of Mathematical Psychology (2011).
- Adrenergic receptor-mediated modulation of striatal firing patterns. Neuroscience Research (2016).
- The importance of falsification in computational cognitive modeling. Trends in Cognitive Sciences (2017).
- Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function. BioSystems (2019).
- A new look at the statistical model identification. IEEE Transactions on Automatic Control (1974).
- Feeding, social behaviour and interspecific competition in wild rats. Behaviour (1951).
- Adaptive properties of differential learning rates for positive and negative outcomes. Biological Cybernetics (2013).
- Dopamine blockade impairs the exploration-exploitation trade-off in rats. Scientific Reports (2019).
- A distributional code for value in dopamine-based reinforcement learning. Nature (2020).
- Cortical substrates for exploratory decisions in humans. Nature (2006).
- Behavioural and neural modulation of win-stay but not lose-shift strategies as a function of outcome value in Rock, Paper, Scissors. Scientific Reports (2016).
- Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience (2009).
- Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences of the United States of America (2007).
- By carrot or by stick: Cognitive reinforcement learning in Parkinsonism. Science (2004).
- Do learning rates adapt to the distribution of rewards? Psychonomic Bulletin & Review (2015).
- Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science (2015).
- Dopaminergic control of the exploration-exploitation trade-off via the basal ganglia. Frontiers in Neuroscience (2012).
2022, Trends in Cognitive SciencesCitation Excerpt :While these findings challenge the idea that metacognition ensures that updating biases are normative, they might connect the asymmetric updating observed in RL to the original theoretical accounts of asymmetric belief updating, if overconfidence (i.e., the metacognitive illusion of accuracy) is considered self-serving per se, that is, carries an ego-relevant utility [15,69]. In conclusion, although this section reviewed the evidence that learning asymmetry may be normative in some contexts – and as such may provide justification for its selection in that context – its persistence in contexts where it is unfavorable along with its lack of modulation in many circumstances reinforce the idea that learning asymmetry constitutes a hardcoded learning bias [39,45,54,55]. A complementary perspective on the normativity of this bias could emerge from different modeling perspectives.