Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control
Introduction
Reinforcement learning, a learning framework based on interaction with an uncertain environment, has attracted much attention since its inception (Sutton and Barto, 1998, Kaelbling et al., 1996, Woergoetter and Porr, 2008). Together with mathematical study of the learning theory, applications to various fields of engineering (Abbeel et al., 2007, Branavan et al., 2009) and analyses of actual decision-making and brain activity within the theoretical framework (Lee et al., 2012, Murakoshi and Mizuno, 2005) have been actively pursued. The goal in reinforcement learning is to maximize the accumulated reward from the unknown environment while searching for better actions by trial and error. The act of maximization (called exploitation) is local in the sense that it is based on the limited knowledge of the environment accumulated so far. Therefore, expanding this local knowledge through search (called exploration) into a more global one is necessary for finding better actions. Although both exploitation and exploration are undoubtedly important, they are by definition mutually exclusive: exploitation chooses the action that appears best at the time, while exploration tries an action other than the seemingly best one. As long as the number of trials allowed is finite, it is difficult to balance the two kinds of actions. This situation is well known as the "exploration–exploitation dilemma" (Sutton and Barto, 1998, Woergoetter and Porr, 2008). The dilemma is omnipresent for living systems in the real environment. One form it takes for an animal is "whether to stay at this feeding area or go to another" (Cohen et al., 2007). How living systems adapt in the face of the dilemma, or even overcome it, is not only scientifically interesting but may also be inspiring for engineering.
In fact, several architectures for efficiently balancing exploitation and exploration, inspired by real living systems, have been proposed (Shinohara et al., 2007, Takahashi et al., 2010, Takahashi et al., 2011a, Niizato and Gunji, 2010, Niizato and Gunji, 2011, Kim et al., 2010, Tsuda et al., 2007). They are expected to be more effective when the uncertainty of the environment is greater. The architecture we propose in this study is another such bio-inspired reinforcement learning architecture.
Multi-armed bandit problems embody the most fundamental framework representing the exploration–exploitation dilemma (Sutton and Barto, 1998, Robbins, 1952, Bubeck and Cesa-Bianchi, 2012). In a multi-armed bandit problem, a slot machine has K levers with different probabilities of reward (win). The reward probabilities are initially unknown. The agent tries to maximize the total reward while looking for better levers. It is a reinforcement learning task with K actions and a single state. Niizato and Gunji constructed a category-theoretical model of the aspect of living systems, or internal observers, in which "known" and "unknown" are merged (Niizato and Gunji, 2010), and applied it to two-armed bandit problems (Niizato and Gunji, 2011). Kim et al. (2010) modeled slime molds that search for better food sources in parallel while utilizing limited resources.
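The bandit setting above can be sketched in a few lines. The following ε-greedy agent is a standard baseline, not the LS model proposed in this paper; the number of levers, the reward probabilities, and ε are illustrative. It shows how exploitation (pulling the lever with the best estimate so far) and exploration (pulling a random lever) are interleaved:

```python
import random

def run_bandit(probs, steps=10000, eps=0.1, seed=0):
    """Play a K-armed Bernoulli bandit with an epsilon-greedy agent.

    probs : reward probability of each lever (unknown to the agent)
    eps   : probability of exploring a random lever instead of
            exploiting the lever with the best current estimate
    """
    rng = random.Random(seed)
    K = len(probs)
    counts = [0] * K           # pulls per lever
    values = [0.0] * K         # running estimate of each lever's reward rate
    total = 0
    for _ in range(steps):
        if rng.random() < eps:                      # explore
            a = rng.randrange(K)
        else:                                       # exploit
            a = max(range(K), key=lambda i: values[i])
        r = 1 if rng.random() < probs[a] else 0     # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]    # incremental mean
        total += r
    return total, values

total, values = run_bandit([0.3, 0.7])
```

With a fixed ε the dilemma is only managed, not resolved: every exploratory pull after the estimates have converged wastes expected reward, which is the trade-off the cognitively inspired models cited above try to handle more gracefully.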
Most prominently, Shinohara et al. proposed the loosely symmetric (LS) model, which describes human causal decision-making with cognitive biases, and applied it to two-armed bandit problems (Shinohara et al., 2007, Takahashi et al., 2010, Takahashi et al., 2011a). The LS model not only shows extremely high correlation with experimental results on human causal induction, but also performs even better than UCB1-Tuned, one of the best algorithms proposed in Auer et al. (2002), in two-armed bandit problems (Oyo and Takahashi, 2013). Moreover, because of the model's compatibility with simple indices such as conditional probability and expected value, its straightforward use as a value function in a reinforcement learning system, and its easy implementation, further applications have been realized (Takahashi et al., 2011b, Ohmura et al., 2012). In this study, we propose a method to apply the LS model to more general reinforcement learning tasks with N states.
As stated above, multi-armed bandit problems are classified as reinforcement learning tasks with only one state. More general tasks have many states, and delayed rewards make the choice between exploration and exploitation much harder. We apply our method to robotic action acquisition, which exhibits the principal problems in applying reinforcement learning to real-world tasks. It is an appropriate task because there are well-established methods to compare against, and it is known that learning can be made efficient with some preparation in advance. We can therefore evaluate our proposal with respect to both performance and simplicity (whether it works well without any mechanism additional to reinforcement learning).
Robotic action acquisition tasks by reinforcement learning have been actively studied (e.g., Peters and Ng, 2009, Morimoto et al., 2010). The obstacles to efficient exploration and learning in such applications include huge degrees of freedom, continuous action-state spaces, and incomplete state observation. To overcome these obstacles, hybrid methods have been proposed. These methods apply reinforcement learning hierarchically and/or in combination with other techniques, such as learning an internal model of the environment, function approximation of the state-action space, and imitation learning of movement primitives (Schaal and Atkeson, 2010, Kober and Peters, 2010, Morimoto and Doya, 2001, Doya et al., 2002, Takahashi and Asada, 2003, Morimoto and Atkeson, 2007, Yamaguchi et al., 2009, Hester et al., 2010). Recently, some newly devised methods that treat reward differently have been proposed, including inverse reinforcement learning, which focuses on the design of the reward, and exploitation-oriented learning (XoL), which treats reward merely as a priority (Abbeel et al., 2010, Kuroda et al., 2012). Giant-swing robots, also known as acrobots, have long been studied as a task of motion control for nonlinear systems by reinforcement learning (Sutton and Barto, 1998, Hauser and Murray, 1990, Spong, 1995). Sutton showed that a swing-up motion for the acrobot can be realized by reinforcement learning with only function approximation of the action-state space, without an explicit model of the system (Sutton, 1996). Boone showed that the acrobot can learn action control more efficiently by reinforcement learning with explicit learning of an internal model of the system (Boone, 1997). Samejima et al. (2003) succeeded in swinging up and stabilizing a pendulum by implementing modular rewards in a reinforcement learning architecture with multiple internal models. Yoshimoto et al. (2005) realized swinging up and stabilizing the acrobot by reinforcement learning of when to switch among multiple controllers.
In contrast, Yabuta and colleagues succeeded in reinforcement learning of the giant-swing motion in a simpler way (Sakai et al., 2010, Hara et al., 2009, Toyoda et al., 2010, Hara et al., 2011). They applied Q-learning, a model-free reinforcement learning method (Watkins and Dayan, 1992), to giant-swing motion learning, addressing the problems that arise in the application (Sakai et al., 2010, Hara et al., 2009, Hara et al., 2011, Toyoda et al., 2010). Because they prepared neither controllers nor internal models in advance, the trial-and-error interaction with the environment in reinforcement learning is highlighted. Also, because the action and state spaces are uniformly discretized (without function approximation), the success of learning depends on how (in)complete the state observation is. Yabuta and colleagues point out that the Q values may fluctuate sharply, leading to slower or worse convergence, and that the acquired actions may form a stagnant loop (Hara et al., 2009, Toyoda et al., 2010). In this study, we show that our bio-inspired, or more specifically cognitively inspired, system solves these problems.
Method
We explain the loosely symmetric model, which quantitatively represents the symmetric cognitive biases, and then propose a way to apply it to Q-learning (Watkins and Dayan, 1992).
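For reference, plain tabular Q-learning, on which LS-Q builds, updates the action values as Q(s,a) ← Q(s,a) + α(r + γ max_a′ Q(s′,a′) − Q(s,a)). A minimal sketch follows; it shows standard Q-learning only, not the LS modification, and the learning rate α and discount factor γ are illustrative values, not the parameters used in our experiments:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update (Watkins and Dayan, 1992):
    Q[s][a] <- Q[s][a] + alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])

    Q is a table indexed as Q[state][action].
    """
    td_target = r + gamma * max(Q[s_next])     # bootstrapped return estimate
    Q[s][a] += alpha * (td_target - Q[s][a])   # move the value toward the target
    return Q

# Example: two states, two actions, all values initially zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=1, r=1.0, s_next=1)         # Q[0][1] becomes 0.1
```

In LS-Q, the quantity driving action selection is computed through the LS model rather than used as the raw Q value; the update mechanics above are what the LS model is layered onto.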
Result
Fig. 4 shows the learning curves in the simulations. The x-axis is the learning time [/1000 steps], and the y-axis is the total reward acquired in the 1000 steps. The left graph shows the curve of a typical trial, and the right graph shows the average over 100 trials. In the initial state, the robot hangs still from the bar with zero velocity. This initial state is restored every 1000 steps. At the beginning of a trial (the very first 1000 steps), the robot chooses actions completely at random. For every 1000
Discussion
The learning environment in this study has two salient features of the real environments surrounding our robots and us: it is unknown and uncertain. The unknown-ness comes from the fact that we have neither a model of the system nor controllers/motion primitives prepared in advance. The prior knowledge of the designer cannot be reflected. The uncertainty results from the state space being uniformly coarse-grained without function approximation, which leads to a greater influence of the incompleteness of state observation. In a real
Conclusion
We proposed LS-Q, a new reinforcement learning architecture that incorporates a model of human adaptive decision-making into Q-learning. We tested the efficacy of LS-Q applied to motion control of a giant-swing robot. For this task, a learning environment with large uncertainty, we did not prepare internal models or function approximation of the state space. In this task, LS-Q showed high performance, avoiding and/or escaping from stagnant loops. This suggests that the proposed algorithm can be
Acknowledgments
A part of this study was presented at the 2011 IEEE ICMA (International Conference on Mechatronics and Automation), August 7–10, Beijing, China. This work was carried out with the support of the Cooperative Research Project Program H25/A12 of the Research Institute of Electrical Communication, Tohoku University; Research Institute for Science and Technology of Tokyo Denki University Grant Numbers Q13K-03 and Q11K-02/Japan; and Grant-in-Aid for Scientific Research (KAKENHI) 25730150 from
References (57)
Global logic resulting from disequilibration process. Biosystems (1995).
Kim et al. Tug-of-war model for the two-bandit problem: nonlocally-correlated parallel exploration via resource conservation. BioSystems (2010).
Quantum mechanics in first, second and third person descriptions. BioSystems (2003).
Morimoto and Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems (2001).
Murakoshi and Mizuno. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals. BioSystems (2005).
Niizato and Gunji. Imperfect identity of autonomous living system. BioSystems (2010).
Oyo and Takahashi. A cognitively inspired heuristic for two-armed bandit problems: the loosely symmetric (LS) model. Procedia Computer Science (2013).
Peters and Ng. Guest editorial: special issue on robot learning. Autonomous Robots (2009).
Samejima et al. Inter-module credit assignment in modular reinforcement learning. Neural Networks (2003).
Symmetrizing object and meta levels organizes thinking. BioSystems (2012).
Incremental acquisition of multiple nonlinear forward models based on differentiation process of schema model. Neural Networks.
Reasoning with conditional sentences. Journal of Verbal Learning and Verbal Behavior.
Robot control with biological cells. BioSystems.
Abbeel et al. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research (2010).
Abbeel et al. An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems (2007).
Auer et al. Finite-time analysis of the multiarmed bandit problem. Machine Learning (2002).
Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience.
Bishop. Pattern Recognition and Machine Learning. Springer (2006).
Boone. Efficient reinforcement learning: model-based acrobot control. IEEE International Conference on Robotics and Automation (1997).
Is the base rate fallacy an instance of asserting the consequent?
Branavan et al. Reinforcement learning for mapping instructions to actions (2009).
Bubeck and Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning (2012).
Cohen et al. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences (2007).
Doya et al. Multiple model-based reinforcement learning. Neural Computation (2002).
Hara et al. Acquisition of a gymnast-like robotic giant-swing motion by Q-learning and improvement of the repeatability. Journal of Robotics and Mechatronics.
Consideration on robotic giant-swing motion generated by reinforcement learning.
Hauser and Murray. Nonlinear controllers for non-integrable systems: the Acrobot example (1990).
Hester et al. Generalized model learning for reinforcement learning on a humanoid robot (2010).