ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

¹Tsinghua Shenzhen International Graduate School, Tsinghua University, ²Carnegie Mellon University, ³Department of Automation, Tsinghua University

tl;dr

We propose ThinkBot, an agent that reasons about the thought chain in sparse human instructions to mine their coherence and successfully complete complex EIF goals.

We present an instruction completer based on large language models that generates the missing actions together with the objects to interact with, and an object localizer that predicts the positions of those objects for interaction.

We conduct extensive experiments on diverse EIF tasks in the ALFRED benchmark, and the results demonstrate that our method achieves a higher success rate and path-length-weighted success rate than state-of-the-art methods in unseen environments.

Abstract

Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complicated surrounding environments. Conventional methods generate action plans for agents directly from the sparse human instructions, and usually fail to achieve human goals because the action descriptions in those instructions are incoherent. In contrast, we propose ThinkBot, which reasons about the thought chain in human instructions to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instructions. Specifically, we first design an instruction completer based on large language models to recover the missing actions, together with their interacted objects, between consecutive human instructions, where the perceived surrounding environment and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we present an object localizer that infers the positions of interacted objects so that agents can achieve complex human goals. Extensive experiments in the simulated environment show that ThinkBot outperforms state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.

Pipeline

The overall pipeline of ThinkBot, which consists of an instruction completer and an object localizer. The instruction completer generates the coherent instruction with interacted objects based on sparse human instruction and the current visual perception results, and the object localizer predicts the position of the interacted object for manipulation and navigation.
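To make the two-stage flow concrete, the following is a toy sketch, not the authors' implementation. A hand-written lookup table (`STORED_IN`) stands in for the large language model, and a dictionary from object names to grid cells stands in for the partially observed semantic map. All names here (`Subgoal`, `STORED_IN`, `complete_instruction`, `localize`, `plan_episode`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) cell in a toy 2D semantic map


@dataclass(frozen=True)
class Subgoal:
    action: str  # e.g. "Open", "Pickup", "Put"
    target: str  # the interacted object, e.g. "Fridge"


# ASSUMPTION: a tiny commonsense table standing in for LLM knowledge.
STORED_IN: Dict[str, str] = {"Apple": "Fridge", "Mug": "Cabinet"}


def complete_instruction(done: List[Subgoal], nxt: Subgoal,
                         semantic_map: Dict[str, Cell]) -> List[Subgoal]:
    """Return the actions missing before `nxt` (toy rule, not an LLM)."""
    if nxt.target in semantic_map:
        return []  # target already observed, nothing to recover
    container = STORED_IN.get(nxt.target)
    if nxt.action == "Pickup" and container is not None:
        opening = Subgoal("Open", container)
        if opening not in done:  # respect completed sub-goals
            return [opening]
    return []


def localize(target: str, semantic_map: Dict[str, Cell]) -> Optional[Cell]:
    """Predict where to go for `target` from the partial map (toy rule)."""
    if target in semantic_map:
        return semantic_map[target]
    container = STORED_IN.get(target)
    if container is None:
        return None
    return semantic_map.get(container)  # guess: inside its usual container


def plan_episode(instructions: List[Subgoal],
                 semantic_map: Dict[str, Cell]
                 ) -> List[Tuple[Subgoal, Optional[Cell]]]:
    """Interleave recovered actions with the given ones and localize each."""
    done: List[Subgoal] = []
    plan: List[Tuple[Subgoal, Optional[Cell]]] = []
    for step in instructions:
        missing = complete_instruction(done, step, semantic_map)
        for subgoal in missing + [step]:
            plan.append((subgoal, localize(subgoal.target, semantic_map)))
            done.append(subgoal)
    return plan


sparse = [Subgoal("Pickup", "Apple"), Subgoal("Put", "CounterTop")]
observed = {"Fridge": (2, 5), "CounterTop": (4, 1)}
plan = plan_episode(sparse, observed)
```

In this toy run the sparse instruction never mentions the fridge, yet the plan starts with an `Open Fridge` step because the apple is not yet visible in the map. In ThinkBot itself both stages are learned or LLM-driven rather than rule-based.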

Results

To simulate EIF tasks, we use the well-recognized ALFRED benchmark (Shridhar et al., 2020). The benchmark is divided into five splits: train, test seen, test unseen, valid seen, and valid unseen. ALFRED poses significant challenges for EIF agents, as it requires them to ground incoherent natural-language instructions of varying granularity into diverse household tasks that involve long-horizon reasoning and planning.

Comparison with state-of-the-art methods in success rate (SR), goal-condition success (GC), and their path-length-weighted variants (PLWSR, PLWGC) on the test seen and test unseen splits.
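As a reminder of how the path-length-weighted metrics penalize inefficient execution, here is a minimal per-episode sketch. It follows the ALFRED definition, in which a score s is scaled by L* / max(L*, L̂), where L* is the expert demonstration length and L̂ is the agent's path length. The function and parameter names are illustrative assumptions, and the benchmark's official evaluation script may aggregate episodes differently from a simple average.

```python
def path_length_weighted(score: float, expert_len: int, agent_len: int) -> float:
    """Scale a per-episode score by expert_len / max(expert_len, agent_len).

    `score` is 1.0 or 0.0 for task success (SR), or the fraction of
    goal conditions satisfied (GC). Names are illustrative only.
    """
    if expert_len <= 0 or agent_len < 0:
        raise ValueError("expert_len must be positive, agent_len non-negative")
    return score * expert_len / max(expert_len, agent_len)


# A successful episode that takes twice as many steps as the expert
# earns half credit; a path shorter than the expert's is not rewarded.
slow_success = path_length_weighted(1.0, expert_len=20, agent_len=40)
fast_success = path_length_weighted(1.0, expert_len=20, agent_len=10)
failure = path_length_weighted(0.0, expert_len=20, agent_len=15)
```

Because the weight is capped at 1, an agent can only lose credit by taking a longer path than the expert, which is why PLWSR and PLWGC reflect execution efficiency in addition to success.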

Visualization

Visualization of the agent action sequences produced by Prompter+ (top) and our ThinkBot (bottom), where our method recovers the missing actions and their interacted instances, "Open Fridge" and "Open Cabinet", to successfully achieve the human goal.

We also provide a comprehensive trial selected from the valid unseen split.