Biosystems

Volume 116, February 2014, Pages 1-9

Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control

https://doi.org/10.1016/j.biosystems.2013.11.002

Abstract

Many algorithms and methods in artificial intelligence or machine learning have been inspired by human cognition. As a mechanism to handle the exploration–exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function, which models human causal intuition, was proposed (Shinohara et al., 2007). The LS model not only shows the highest correlation with human causal induction, but has also been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems with K actions and only one state (K-armed bandit problems). This study proposes the LS-Q learning architecture, which can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknown-ness of the environment are large. In the test, no ready-made internal models or functional approximation of the state space were provided. The simulations showed that while the ordinary Q-learning agent does not reach the giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing. We confirmed that the smaller the number of states, in other words, the more coarse-grained the division of states and the more incomplete the state observation, the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments.

Introduction

Reinforcement learning, a learning framework based on interaction with an uncertain environment, has been given much attention since its inception (Sutton and Barto, 1998, Kaelbling et al., 1996, Woergoetter and Porr, 2008). Together with mathematical study of the learning theory, applications to various fields of engineering (Abbeel et al., 2007, Branavan et al., 2009) and analyses of actual decision-making and brain activity within the theoretical framework (Lee et al., 2012, Murakoshi and Mizuno, 2005) have been actively pursued. The goal in reinforcement learning is to maximize the accumulated reward from the unknown environment while searching for better actions by trial and error. The act of maximization (called exploitation) is local in the sense that it is based on the limited knowledge of the environment accumulated so far. Therefore, expanding this local knowledge into a more global one by search (called exploration) is necessary for finding better actions. Though exploitation and exploration are both undoubtedly important, they are by definition mutually exclusive, since exploitation is to choose the action that appears best at the time, while exploration is to try an action other than the seemingly best one. As long as the number of trials allowed is finite, it is difficult to balance the two kinds of actions. This situation is well known as the "exploration–exploitation dilemma" (Sutton and Barto, 1998, Woergoetter and Porr, 2008). The dilemma is omnipresent for living systems in the real environment. One of the forms it takes for an animal is "whether to stay at this feeding area or to go for another" (Cohen et al., 2007). How living systems adapt in the face of the dilemma, or even overcome it, is not only scientifically interesting but also potentially inspiring for engineering. In fact, several architectures for efficient balancing between exploitation and exploration, inspired by real living systems, have been proposed (Shinohara et al., 2007, Takahashi et al., 2010, Takahashi et al., 2011a, Niizato and Gunji, 2010, Niizato and Gunji, 2011, Kim et al., 2010, Tsuda et al., 2007). They are expected to be more effective when the uncertainty of the environment is greater. The one we propose in this study is another bio-inspired reinforcement learning architecture.

Multi-armed bandit problems embody the most fundamental framework representing the exploration–exploitation dilemma (Sutton and Barto, 1998, Robbins, 1952, Bubeck and Cesa-Bianchi, 2012). In a multi-armed bandit problem, a slot machine has K levers with different probabilities of reward (win), and these reward probabilities are initially unknown. The agent tries to maximize the total reward while looking for better levers. It is a reinforcement learning task with K actions and a single state. Niizato and Gunji constructed a category-theoretical model of the aspect of living systems, or internal observers, in which "known" and "unknown" are merged (Niizato and Gunji, 2010), and applied it to two-armed bandit problems (Niizato and Gunji, 2011). Kim et al. (2010) modeled slime molds that search for better food sources by utilizing limited resources in parallel.
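To make the dilemma concrete, the sketch below simulates a K-armed bandit played by a simple epsilon-greedy agent. It is only an illustration of the task structure described above; the win probabilities, epsilon value, and horizon are arbitrary assumptions, not settings from this paper.

```python
import random

def run_epsilon_greedy(probs, steps=1000, epsilon=0.1, seed=0):
    """Play a K-armed bandit with win probabilities `probs` using
    epsilon-greedy action selection (illustrative values only)."""
    rng = random.Random(seed)
    K = len(probs)
    wins = [0] * K      # rewards observed per lever
    pulls = [0] * K     # times each lever was pulled
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon or sum(pulls) == 0:
            a = rng.randrange(K)   # explore: try a random lever
        else:
            # exploit: pick the lever with the best empirical win rate
            a = max(range(K), key=lambda i: wins[i] / pulls[i] if pulls[i] else 0.0)
        r = 1 if rng.random() < probs[a] else 0   # Bernoulli reward
        wins[a] += r
        pulls[a] += 1
        total += r
    return total, pulls

# Example: two levers with win rates 0.4 and 0.6, unknown to the agent.
print(run_epsilon_greedy([0.4, 0.6]))
```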

Most prominently, Shinohara et al. proposed the loosely symmetric (LS) model, which describes human causal decision-making with cognitive biases, and applied it to two-armed bandit problems (Shinohara et al., 2007, Takahashi et al., 2010, Takahashi et al., 2011a). The LS model not only shows extremely high correlation with experimental results on human causal induction, but also performs better than UCB1-Tuned, one of the best algorithms proposed by Auer et al. (2002), in two-armed bandit problems (Oyo and Takahashi, 2013). Also, because the model is compatible with simple indices such as conditional probability and expected value, operates simply as a value function in a reinforcement learning system, and is easy to implement, further applications have been realized (Takahashi et al., 2011b, Ohmura et al., 2012). In this study, we propose a method to apply LS to more general reinforcement learning tasks with N states.
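As a rough illustration of how a count-based value such as LS can be computed from the win/loss statistics of a two-armed bandit, the sketch below implements the LS form as we recall it from the follow-up literature. This formula is an assumption on our part and should be checked against Shinohara et al. (2007); the only property relied on below is that it reduces to the ordinary conditional probability (win rate) when the other arm's counts are empty, which is the compatibility mentioned above.

```python
def ls_value(a, b, c, d):
    """Loosely symmetric (LS) evaluation of one arm from a 2x2 table of counts:
    a = wins of this arm, b = losses of this arm,
    c = wins of the other arm, d = losses of the other arm.
    NOTE: this is our reading of the LS model from related papers, included
    as an assumption; see Shinohara et al. (2007) for the exact definition."""
    eps = 1e-9                          # guard against empty cells
    num = a + b * d / (b + d + eps)
    den = num + b + a * c / (a + c + eps)
    return num / (den + eps)

# With c = d = 0 the value reduces to the plain win rate a / (a + b).
print(ls_value(3, 1, 0, 0))   # ~0.75, same as the conditional probability
print(ls_value(3, 1, 2, 2))   # estimate biased by the other arm's counts
```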

As stated above, multi-armed bandit problems are reinforcement learning tasks with only one state. More general tasks have many states, and delayed reward makes the choice between exploration and exploitation much harder. We apply our method to robotic action acquisition, which exhibits the principal problems in applying reinforcement learning to real-world tasks. It is an appropriate task because there are well-established methods to compare against, and it is known that learning can be made efficient with some preparation in advance. Therefore, we can evaluate our proposal with respect to both performance and simplicity (that is, whether it works well without adding extra mechanisms to reinforcement learning).

Robotic action acquisition by reinforcement learning has been actively studied (e.g., Peters and Ng, 2009, Morimoto et al., 2010). The obstacles to efficient exploration and learning in such applications include large degrees of freedom, continuous state-action spaces, and incomplete state observation. To overcome these obstacles, hybrid methods have been proposed that apply reinforcement learning hierarchically and/or in combination with other techniques such as learning an internal model of the environment, function approximation of the state-action space, and imitation learning of movement primitives (Schaal and Atkeson, 2010, Kober and Peters, 2010, Morimoto and Doya, 2001, Doya et al., 2002, Takahashi and Asada, 2003, Morimoto and Atkeson, 2007, Yamaguchi et al., 2009, Hester et al., 2010). More recently, methods that treat reward differently have been proposed, including inverse reinforcement learning, which focuses on the design of the reward, and exploitation-oriented learning (XoL), which treats reward merely as a priority (Abbeel et al., 2010, Kuroda et al., 2012). Giant-swing robots, also known as acrobots, have long been studied as a reinforcement learning task of motion control for non-linear systems (Sutton and Barto, 1998, Hauser and Murray, 1990, Spong, 1995). Sutton showed that a swing-up motion for the acrobot can be realized by reinforcement learning with only function approximation of the state-action space, without an explicit model of the system (Sutton, 1996). Boone showed that acrobot control can be learned more efficiently when reinforcement learning explicitly learns an internal model of the system (Boone, 1997). Samejima et al. (2003) succeeded in swinging up and stabilizing a pendulum by implementing a modular reward in a reinforcement learning architecture with multiple internal models. Yoshimoto et al. (2005) realized swing-up and stabilization of the acrobot by reinforcement learning of when to switch among multiple controllers.

In contrast, Yabuta and others succeeded in reinforcement learning of the giant-swing motion in a simpler way (Sakai et al., 2010, Hara et al., 2009, Toyoda et al., 2010, Hara et al., 2011). They applied Q-learning, a model-free reinforcement learning method (Watkins and Dayan, 1992), to giant-swing motion learning, addressing the problems that arise in this application (Sakai et al., 2010, Hara et al., 2009, Hara et al., 2011, Toyoda et al., 2010). Because they prepared neither controllers nor internal models in advance, the trial-and-error interaction with the environment in reinforcement learning is highlighted. Also, because the action and state spaces are uniformly discretized (without function approximation), the success of learning depends on how (in)complete the state observation is. Yabuta and others point out that the Q values may fluctuate sharply, leading to slower or poorer convergence, and that the acquired actions may form a stagnant loop (Hara et al., 2009, Toyoda et al., 2010). In this study, we show that our bio-inspired, or more specifically cognitively inspired, system solves these problems.
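For reference, the sketch below shows tabular Q-learning over a uniformly discretized state space of the kind described above. The state variables, bin counts, action set, and learning parameters are illustrative assumptions, not the settings of Yabuta and colleagues or of our experiments.

```python
import math
import random
from collections import defaultdict

# Hypothetical discretization: two joint angles and two angular velocities,
# each cut into a small number of uniform bins (assumed for illustration).
N_BINS = 6
ACTIONS = [-1.0, 0.0, 1.0]           # e.g., coarse torque commands (illustrative)

def discretize(x, lo, hi, n=N_BINS):
    """Map a continuous value to a uniform bin index in [0, n-1]."""
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * n), n - 1)

def to_state(obs):
    """obs = (theta1, theta2, dtheta1, dtheta2) -> coarse-grained state tuple."""
    th1, th2, w1, w2 = obs
    return (discretize(th1, -math.pi, math.pi),
            discretize(th2, -math.pi, math.pi),
            discretize(w1, -8.0, 8.0),
            discretize(w2, -8.0, 8.0))

Q = defaultdict(float)               # Q[(state, action_index)], zero-initialized
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def select_action(state, rng=random):
    """Epsilon-greedy selection over the discrete action set."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Standard Q-learning update (Watkins and Dayan, 1992)."""
    best_next = max(Q[(next_state, a)] for a in range(len(ACTIONS)))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

The coarser the bins, the more distinct physical configurations collapse into one table entry, which is exactly the incompleteness of state observation discussed above.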

Section snippets

Method

We explain the loosely symmetric model, which quantitatively represents the symmetric cognitive biases, and then propose a way to apply it to Q-learning (Watkins and Dayan, 1992).

Result

Fig. 4 shows the learning curves in the simulations. The x-axis is the learning time (in units of 1000 steps), and the y-axis is the total reward acquired during those 1000 steps. The left graph shows the curve of a typical trial and the right one the average of 100 trials. In the initial state, the robot hangs still from the bar with zero velocity. This initial state is restored every 1000 steps. At the beginning of a trial (the very first 1000 steps), the robot chooses its actions completely at random. For every 1000
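A minimal sketch of the evaluation protocol described above (1000-step episodes that restart from the hanging state, with the total reward per episode recorded as the learning curve). The environment and agent interfaces (env.reset, env.step, agent.act, agent.learn, agent.n_actions) are assumed names for illustration, and the fully random first episode follows the description above.

```python
import random

def learning_curve(env, agent, episodes=100, steps_per_episode=1000):
    """Return the total reward collected in each 1000-step episode.
    Assumed interfaces: env.reset() restores the hanging state,
    env.step(a) returns (next_obs, reward)."""
    curve = []
    for ep in range(episodes):
        obs = env.reset()                 # hang still from the bar, zero velocity
        total = 0.0
        for _ in range(steps_per_episode):
            if ep == 0:
                a = random.randrange(agent.n_actions)   # first 1000 steps: random
            else:
                a = agent.act(obs)
            next_obs, r = env.step(a)
            agent.learn(obs, a, r, next_obs)
            obs, total = next_obs, total + r
        curve.append(total)
    return curve
```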

Discussion

The learning environment in this study has two salient features of the real environments surrounding our robots and us: it is unknown and uncertain. The unknown-ness comes from the fact that no model of the system and no controllers or motion primitives are prepared in advance, so the prior knowledge of the designer cannot be reflected. The uncertainty results from the state space being uniformly coarse-grained without function approximation, which leads to a greater influence of the incompleteness of state observation. In a real

Conclusion

We proposed LS-Q, a new reinforcement learning architecture that incorporates a model of human adaptive decision-making into Q-learning. We tested the efficacy of LS-Q applied to motion control of a giant-swing robot. For this task, whose learning environment has large uncertainty, we did not prepare internal models or functional approximation of the state space. In this task, LS-Q showed high performance, avoiding and/or escaping from stagnant loops. This suggests that the proposed algorithm can be

Acknowledgments

A part of this study was presented at the 2011 IEEE ICMA (International Conference on Mechatronics and Automation), August 7–10, Beijing, China. This work was carried out with support from the Cooperative Research Project Program H25/A12 of the Research Institute of Electrical Communication, Tohoku University, the Research Institute for Science and Technology of Tokyo Denki University (Grant Numbers Q13K-03 and Q11K-02), Japan, and a Grant-in-Aid for Scientific Research (KAKENHI) 25730150 from

References (57)

  • T. Taniguchi et al., Incremental acquisition of multiple nonlinear forward models based on differentiation process of schema model, Neural Networks (2008)
  • J.E. Taplin, Reasoning with conditional sentences, Journal of Verbal Learning and Verbal Behavior (1971)
  • S. Tsuda et al., Robot control with biological cells, BioSystems (2007)
  • P. Abbeel et al., Autonomous helicopter aerobatics through apprenticeship learning, International Journal of Robotics Research (2010)
  • P. Abbeel et al., An application of reinforcement learning to aerobatic helicopter flight, Advances in Neural Information Processing Systems (2007)
  • P. Auer et al., Finite-time analysis of the multiarmed bandit problem, Machine Learning (2002)
  • D.J. Barraclough et al., Prefrontal cortex and decision making in a mixed-strategy game, Nature Neuroscience (2004)
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • G. Boone, Efficient reinforcement learning: model-based acrobot control, IEEE International Conference on Robotics and Automation (1997)
  • M.D.S. Braine et al., Is the base rate fallacy an instance of asserting the consequent?
  • S.R.K. Branavan et al., Reinforcement learning for mapping instructions to actions
  • S. Bubeck et al., Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning (2012)
  • J.D. Cohen et al., Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration, Philosophical Transactions of the Royal Society B: Biological Sciences (2007)
  • K. Doya et al., Multiple model-based reinforcement learning, Neural Computation (2002)
  • M. Hara et al., Acquisition of a gymnast-like robotic giant-swing motion by Q-learning and improvement of the repeatability, Journal of Robotics and Mechatronics (2011)
  • M. Hara et al., Consideration on robotic giant-swing motion generated by reinforcement learning
  • J. Hauser et al., Nonlinear controllers for non-integrable systems: the Acrobot example
  • T. Hester et al., Generalized model learning for reinforcement learning on a humanoid robot