For the simulation of EIF tasks, we use the widely adopted ALFRED benchmark \cite{shridhar2020alfred}. The benchmark is divided into five splits: train, test seen, test unseen, valid seen, and valid unseen. ALFRED poses significant challenges for EIF agents, as it requires them to ground incoherent natural language instructions of varying granularity into household tasks that require long-horizon reasoning and planning.
Comparison with state-of-the-art methods in terms of SR, GC, PLWSR, and PLWGC on the test seen and test unseen splits.
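
For reference, the path-length-weighted metrics can be written compactly. The sketch below assumes the standard ALFRED definitions: for a single episode, SR is $1$ if all goal conditions are satisfied and $0$ otherwise, and GC is the fraction of goal conditions satisfied. The symbols $s$, $L^{*}$, and $\hat{L}$ are notation introduced here for illustration and do not appear in the text above.

\[
  \mathrm{PLW}(s) \;=\; s \cdot \frac{L^{*}}{\max\bigl(L^{*}, \hat{L}\bigr)},
\]

where $s$ is the per-episode score (SR or GC), $L^{*}$ is the length of the expert demonstration, and $\hat{L}$ is the number of actions the agent took. Under this assumption, PLWSR and PLWGC are obtained by applying $\mathrm{PLW}$ to SR and GC respectively, so an agent that succeeds with a path longer than the expert's is penalized proportionally, while a path no longer than the expert's leaves the score unchanged.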