Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

¹Carnegie Mellon University, ²Amazon AWS AI
Introducing Synatra
  • A data synthesis approach relying on indirect knowledge
  • 100k next action demonstrations in the form of web trajectories
  • Synatra-CodeLlama-7B, a dedicated web navigation agent

Abstract

LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks Mind2Web, MiniWoB++ and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while synthetic demonstrations prove to be only 3% the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.

Method

We propose Synatra, a data generation approach that synthesizes high-quality trajectories for complex digital tasks at scale. The approach rests on the intuition that a rich trove of existing knowledge encodes indirect supervision about how to perform digital tasks. The key idea of Synatra is that, given this indirect knowledge, we can leverage an LLM to repurpose it into a more usable form that directly demonstrates the exact action to take given a concrete observation. We propose an iterative approach that first uses an LLM to rewrite an article into a hypothetical trajectory, then leverages a generative model to synthesize the intermediate observation between two consecutive actions. We also generate demonstrations from random observations: we take a sampled web page as the intermediate observation at time step t and synthesize the specific task i, the previous actions a_1, ..., a_{t-1}, and the subsequent action a_t, which consists of an action, a corresponding target element in the observation, and a natural language summary of the action.
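
The tutorial-based pathway described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the dictionary keys are all hypothetical, and the LLM is passed in as a plain callable so that no particular model API is implied.

```python
from typing import Callable, Dict, List

# Assumption: any text-in, text-out model can be wrapped as this callable.
LLM = Callable[[str], str]


def trajectory_from_tutorial(tutorial: str, llm: LLM) -> List[str]:
    """Step 1: ask the LLM to rewrite a tutorial as a hypothetical
    trajectory, here assumed to come back as one action per line."""
    raw = llm(
        "Rewrite this tutorial as a sequence of web actions, "
        "one per line:\n" + tutorial
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]


def synthesize_observation(prev_action: str, next_action: str, llm: LLM) -> str:
    """Step 2: ask a generative model for the intermediate observation
    that would sit between two consecutive actions."""
    return llm(
        f"Generate the web page seen after '{prev_action}' "
        f"and before '{next_action}'."
    )


def demonstrations_from_tutorial(tutorial: str, llm: LLM) -> List[Dict]:
    """Combine both steps into (history, observation, next action) records."""
    actions = trajectory_from_tutorial(tutorial, llm)
    demos = []
    for t in range(1, len(actions)):
        observation = synthesize_observation(actions[t - 1], actions[t], llm)
        demos.append(
            {
                "history": actions[:t],      # previous actions a_1 .. a_{t-1}
                "observation": observation,  # synthesized intermediate page
                "next_action": actions[t],   # action a_t to be predicted
            }
        )
    return demos


def fake_llm(prompt: str) -> str:
    """Stand-in model for trying the sketch without any API access."""
    if prompt.startswith("Rewrite"):
        return "click [1]\ntype [2] hello\nclick [3]"
    return "<synthetic page>"


demos = demonstrations_from_tutorial("How to post a comment ...", fake_llm)
```

Passing the model in as a callable keeps the sketch testable with a stub such as `fake_llm`; a real pipeline would substitute an actual model call and a richer action format.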

Data Statistics

We generated 99,924 action steps of web trajectory data using a combination of online tutorials and random observations. The trajectories cover websites from 21 domains and use an action space of 12 different actions.
Each entry consists of an observation in the form of an accessibility tree, an objective, a history of previous actions, and the corresponding next action. The next action has four components: a subtask for planning, chain-of-thought (CoT) reasoning, the action itself, and a brief summary of the current step for future history tracking.
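
As a rough illustration of the entry structure described above, one sample could be modeled as below. The class names, field names, example values, and the `click [42]` action syntax are assumptions made for this sketch; the released data may use a different schema and serialization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NextAction:
    """The four components of the next action (field names are hypothetical)."""
    subtask: str    # sub task used for planning
    reasoning: str  # chain-of-thought reasoning
    action: str     # the action itself
    summary: str    # brief summary kept for future history tracking


@dataclass
class TrainingSample:
    """One entry: observation, objective, history, and the next action."""
    observation: str         # accessibility tree, here as plain text
    objective: str           # the task the agent is trying to complete
    next_action: NextAction  # supervision target for this step
    history: List[str] = field(default_factory=list)  # previous actions

    def next_history(self) -> List[str]:
        """History for the following step: the current history plus
        this step's summary."""
        return self.history + [self.next_action.summary]


# Made-up example values, purely for illustration.
sample = TrainingSample(
    observation="[42] button 'Add to Cart'",
    objective="Add the blue mug to the cart",
    history=["Searched for 'blue mug'"],
    next_action=NextAction(
        subtask="Put the item in the cart",
        reasoning="The product page is open and shows an Add to Cart button.",
        action="click [42]",
        summary="Clicked 'Add to Cart'",
    ),
)
```

The `next_history` helper shows why each step carries a summary: it becomes part of the history that the following step sees.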


Format of each training sample.


Top intents in our trajectories on shopping websites.

Synatra-CodeLlama-7B Performance

BibTeX

@inproceedings{Ou2024SynatraTI,
      title={Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale},
      author={Tianyue Ou and Frank F. Xu and Aman Madaan and Jiarui Liu and Robert Lo and Abishek Sridhar and Sudipta Sengupta and Dan Roth and Graham Neubig and Shuyan Zhou},
      year={2024},
      url={https://api.semanticscholar.org/CorpusID:272831875}
    }