Teaching robots to respond to natural-language commands

Natural language processing

Reinforcement learning


If general-purpose household robots ever become a reality, it would be nice to address them in natural language: to say to a robot, for instance, “Take the dirty dishes to the kitchen.”

Natural-language commands, however, introduce a new layer of complexity to the control of robotic systems, since the same sequence of actions can correspond to many different natural-language commands (“Can you clear the dishes from the dining room?”).

In a [paper](https://www.amazon.science/publications/inverse-reinforcement-learning-with-natural-language-goals) that my colleagues and I presented last week at the annual meeting of the Association for the Advancement of Artificial Intelligence ([AAAI](https://www.amazon.science/conferences-and-events/aaai-2021)), we apply some of what we’ve learned working on natural-language understanding to the problem of natural-language robotic control.

In particular, we consider the case of inverse reinforcement learning (IRL), in which an AI agent learns to perform a specified task by observing human demonstrations. We augment the standard IRL framework, though, by specifying the agent’s goals in natural language rather than explicitly as unique states.

![image.png](https://dev-media.amazoncloud.cn/b989bbbd5fe14f2ab9c403cf0d852bd2_image.png)

A diagram of the researchers’ training methodology, which alternates between updating an autonomous agent’s policy, a set of actions (a) to take in various states (s) in order to achieve a goal (G), and training a discriminator to recognize the reward function implicit in experts’ examples. The discriminator learns from both positive and negative examples. Some negative examples (sampled trajectories) are relabeled (relabeled trajectories) and used to augment the experts’ examples, both for updating the policy and for training the discriminator.

CREDIT: GLYNIS CONDON

In experiments involving a benchmark data set consisting of high-quality 3-D simulations of an indoor environment, we compared our method to four leading approaches to IRL.

In cases in which the agent is tested in an environment that it saw during training, our method improves its success rate in achieving goals specified in natural language by 14%, relative to the best-performing baseline. In novel test environments (environments unseen during training), our method improves the agent’s success rate by 17%.

In the paper, we also present a method whereby a trained AI agent deployed to an unfamiliar environment can generate its own training examples tailored to that environment. This additional self-supervised learning improves the agent’s success rate by an additional 38%.

#### **Inverse reinforcement learning**

[Reinforcement learning](https://www.amazon.science/tag/reinforcement-learning) is a paradigm in which an agent learns through trial and error. More specifically, the agent has a reward function, a measure of how successful it is at achieving some goal, and it learns a set of behaviors that maximize its reward.

In inverse reinforcement learning, by contrast, the agent is presented with a set of demonstrations (the examples of a human expert or other agent), and it must learn the reward function implicitly maximized by the experts.

Demonstrations are represented as trajectories, which consist of sequences of alternating states (of the environment and the agent’s place in it) and actions. With IRL, as with standard reinforcement learning, the agent’s ultimate aim is to learn a policy, which dictates which actions to take in which states. With IRL, however, the agent must learn the reward function and the policy simultaneously.

A common approach to IRL is to use a generative adversarial network, or GAN. The training data for the agent is a set of true trajectories, modeled by experts, which accomplish the goal to be learned.

But the training setup also includes an adversarial generator, which creates false trajectories, and the IRL discriminator must learn to distinguish the two. That is, it must learn a reward function that assigns a high value to true trajectories and a low value to false ones. Simultaneously, the adversarial generator tries to learn a policy that generates high-reward trajectories.

We vary this setup by combining each trajectory with an additional input: a natural-language specification of the goal. A single trajectory may have multiple natural-language goals, corresponding to multiple states and actions in the sequence: for example, “go down the hall”, “turn left”, “find the first doorway on your right”, and so on.

In this setting, the negative examples generated by the adversarial generator are trajectories with misaligned natural-language goals: the trajectory maps out a right turn, for instance, but the natural-language goal is “turn left”.

We alternate between using training examples to teach the agent the reward function and using them to update the agent’s policy. The reward function is trained on both trajectories and natural-language goals (NL goals), and its training data includes negative examples from the adversarial generator. For policy updates, the agent receives only the NL goals, and only from positive examples, and must predict the associated trajectories.

In our experiments, this basic model offered little improvement over existing IRL models; it required several additional features to improve its performance.

#### **Data augmentation**

First, using our expert-supplied trajectories, we trained a variational goal generator to predict NL goals on the basis of trajectories. That model includes a variational autoencoder, a neural network that produces a highly compressed vector representation of each NL goal. The compressed representation captures semantic information about the NL goal, but it loses information about the goal’s phrasing. Re-expanding such a representation produces a new NL goal that is phrased differently from the original but preserves the semantic content.

We use these trajectories with rephrased NL goals as new positive training examples. This augments our supply of expert training examples, which tend to be scarce, increasing robustness through lexical variance.

When a negative example from the adversarial generator, whose NL goal is inaccurate, passes through the label prediction model, the result is a reconstructed trajectory with a correct NL goal. These relabeled trajectories are added to our supply of positive examples as well.

We use our added positive examples both to train the reward function and to update the agent’s policy. Not only does this improve the accuracy of the reward function, but it also increases the agent’s ability to generalize to new settings, since it has more varied encounters with the environment to learn from than it would otherwise.

Finally, we explore an additional method for bootstrapping an agent that is asked to perform tasks in an unfamiliar environment. First, the agent learns a new, goal-agnostic policy from existing training data. This policy encodes general principles, such as not trying to move through closed doors.

Then we use that general policy to generate sample trajectories in the new environment; these pass through the variational goal generator, which assigns them NL goals. We treat these newly labeled trajectories as expert examples in the new setting, and we use them to update the reward function.

This added layer of training is what increased our agents’ success rates by 36% when they were deployed to new environments. We think this kind of adaptability will be crucial to household robots of the future, which will need to adjust to new environments (when a family moves or goes on vacation, for instance) without being retrained from scratch.

ABOUT THE AUTHOR

#### **[Li Zhou](https://www.amazon.science/author/li-zhou)**

Li Zhou is an applied scientist in the Alexa AI group.
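The relabeling step described under data augmentation, in which sampled trajectories with misaligned NL goals are given corrected goals and added to the pool of positive examples, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_goal_generator` is a hypothetical rule-based stand-in for the learned variational goal generator, trajectories are reduced to lists of action names, and every sampled trajectory is relabeled, whereas the article says only that some are.

```python
from typing import List, Tuple

Trajectory = List[str]            # toy stand-in: a sequence of action names
Example = Tuple[Trajectory, str]  # (trajectory, natural-language goal)


def toy_goal_generator(trajectory: Trajectory) -> str:
    """Hypothetical stand-in for the learned variational goal generator.

    Describes a trajectory by its first turn. The real model is a neural
    network trained on expert trajectories; this rule is an assumption.
    """
    for action in trajectory:
        if action in ("left", "right"):
            return f"turn {action}"
    return "go straight"


def relabel(sampled: List[Example]) -> List[Example]:
    """Replace each sampled trajectory's goal with the generator's description."""
    return [(traj, toy_goal_generator(traj)) for traj, _ in sampled]


def build_training_sets(
    expert: List[Example], sampled: List[Example]
) -> Tuple[List[Example], List[Example]]:
    """Return (positives, negatives) for one round of discriminator training.

    Positives: expert examples plus relabeled sampled trajectories.
    Negatives: the sampled trajectories with their original, misaligned goals.
    """
    positives = expert + relabel(sampled)
    negatives = list(sampled)
    return positives, negatives
```

With an expert example `(["forward", "left"], "turn left")` and a sampled example `(["forward", "right"], "turn left")`, the sampled trajectory stays in the negative set with its misaligned goal and also enters the positive set paired with the goal the generator assigns to it.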
