UMR CNRS 7253

Antoine Bordes

Introduction

We provide here the data used in the paper: Towards Understanding Situated Natural Language by A. Bordes, N. Usunier, R. Collobert and J. Weston, appearing in the proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), Volume 9 of JMLR: W&CP (2010).

This data is designed for training algorithms on the task of concept labeling. In this task, each word in a given sentence has to be tagged with the unique physical entity (e.g. person, object or location) or abstract concept it refers to. Please see the paper for more details.

Dataset

A concept labeling example consists of a triple {sentence, label, universe}. In our dataset, the training set contains 50,000 such triples and the test set 20,000.

The files “world_data_{train, test}_{sentences, labels, universes}.dat” contain the dataset. They have to be read line by line, in parallel: each line of “world_data_{train, test}_sentences.dat” provides a sentence, and the lines with the same number in “world_data_{train, test}_labels.dat” and “world_data_{train, test}_universes.dat” provide the corresponding label sequence and universe state, respectively. (Note: label sequences are provided with strong labeling.)
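The line-by-line reading described above can be sketched as follows. This is only an illustration: the function name `read_split`, the `data_dir` argument, the UTF-8 encoding and the `universes` file suffix are assumptions, and the lines are returned as raw strings because their internal format is not specified here.

```python
from contextlib import ExitStack
from pathlib import Path


def read_split(data_dir, split="train"):
    """Yield (sentence, labels, universe) raw lines for one split.

    Assumes the three files of a split are aligned line by line.
    """
    kinds = ("sentences", "labels", "universes")
    with ExitStack() as stack:
        # Open the three files together; ExitStack closes them all on exit.
        files = [
            stack.enter_context(
                open(Path(data_dir) / f"world_data_{split}_{kind}.dat",
                     encoding="utf-8")
            )
            for kind in kinds
        ]
        # zip() advances the three files in lockstep, one triple per line.
        for lines in zip(*files):
            yield tuple(line.rstrip("\n") for line in lines)
```

Note that `zip` stops at the shortest file, so a truncated file would go unnoticed; comparing the number of triples read with the expected 50,000 or 20,000 is a simple sanity check.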

Download the dataset: world_data.tar.gz

Universe

The universe of the simulation contains 58 concepts:

  • 15 actions (to eat/drink, to move, to sit,…)
  • 10 actors (humans and pets)
  • 6 locations (bedroom, kitchen,…)
  • 27 objects (sets, food, toys,…)

Each concept is named using a unique identifier; the list of identifiers is given in the file “world_data_concepts.dat”.

Our universe considers two kinds of relations between concepts: “location” and “containedby”. Each line of the files “world_data_{train, test}_universes.dat” gives the current state of these relations using the following template: <concept1>:relation_name:<concept2>. For example, “fridge:located:kitchen” means that the concept <fridge> is located inside the concept <kitchen>, and “milk:iscontainedby:father” means that the concept <father> owns the concept <milk>.
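A minimal sketch of parsing one universe line into relation triples is given below. It assumes that a line holds several colon-separated triples separated by whitespace, with no spaces inside a triple; the names `Relation` and `parse_universe_line` are hypothetical and are not part of the distributed files.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    concept1: str
    name: str
    concept2: str


def parse_universe_line(line: str) -> List[Relation]:
    """Split a universe line into (concept1, relation_name, concept2) triples.

    Assumption: triples are whitespace-separated and contain exactly two colons.
    """
    relations = []
    for token in line.split():
        parts = token.split(":")
        if len(parts) != 3:
            raise ValueError(f"unexpected relation token: {token!r}")
        relations.append(Relation(*parts))
    return relations


# The two examples quoted in the text above.
relations = parse_universe_line("fridge:located:kitchen milk:iscontainedby:father")
print(relations[0].concept2)  # kitchen
print(relations[1].name)      # iscontainedby
```

The relation name is kept as a plain string, so the parser does not depend on the exact spelling used in the files.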

Simulation (not provided)

The data has been generated using a simulation which mimics activity within a house interior. The simulation algorithm generates actions along with natural language sentences describing them, and keeps the world state up to date.

For example, a simulation step could proceed as follows:

  1. Pick the event <move>(<mother>, <hall>).
  2. Generate the training sample (“she goes from the bedroom to the hall”, {<hall>, <mother>, <bedroom>, <move>}, universe).
  3. Update the universe with location(<mother>) = <hall>.
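The three steps above can be sketched in a few lines of Python. This is not the original simulator, which is not provided: the function `simulation_step`, the dictionary representation of the universe and the choice to pass the sentence in as an argument (sentence generation is not described here) are all assumptions made for illustration.

```python
def simulation_step(universe, actor, destination, sentence):
    """Apply a <move> event and return the resulting training sample.

    `universe` is a hypothetical dict mapping each concept to its location.
    """
    origin = universe[actor]
    # Step 2: the sample pairs the sentence with the concepts it mentions
    # and a snapshot of the universe taken before the update of step 3.
    labels = {"<move>", actor, origin, destination}
    sample = (sentence, labels, dict(universe))
    # Step 3: update the universe with the new location of the actor.
    universe[actor] = destination
    return sample


universe = {"<mother>": "<bedroom>", "<father>": "<kitchen>"}
sentence, labels, snapshot = simulation_step(
    universe, "<mother>", "<hall>", "she goes from the bedroom to the hall"
)
print(snapshot["<mother>"])  # <bedroom>
print(universe["<mother>"])  # <hall>
```

Taking the snapshot before the update follows the ordering of the steps listed above: the universe stored with the sentence still places <mother> in <bedroom>.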

We define the set of describing words for each concept to contain at least two terms: an ambiguous one (using a pronoun) and a unique one. The dictionary used for generating sentences contains 75 words. 55% of the generated sentences contain lexical ambiguities that need world knowledge to be resolved. All ambiguities have been designed so that they can ultimately be resolved using all the available world knowledge.

The files “world_data_{train, test}_stats.txt” give the number of occurrences of each concept in the corresponding files.
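Such occurrence counts could be recomputed from the label files, for instance to check a download. The sketch below assumes that each label line is a whitespace-separated sequence of concept identifiers; the layout of the stats files themselves is not described here, and the function name `count_concepts` is hypothetical.

```python
from collections import Counter


def count_concepts(label_lines):
    """Count how often each concept identifier appears in the label lines.

    Assumption: identifiers on a line are separated by whitespace.
    """
    counts = Counter()
    for line in label_lines:
        counts.update(line.split())
    return counts
```

The function accepts any iterable of strings, so an open label file can be passed to it directly.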

Credits

This data has been created by Antoine Bordes, Nicolas Usunier and Jason Weston. If you use this data for your research or a publication, please cite the AISTATS'10 paper:

@conference{bordes-aistats10,
  author =    {Bordes, Antoine and Usunier, Nicolas and 
               Collobert, Ronan and Weston, Jason},
  title =     {Towards Understanding Situated Natural Language},
  booktitle = {Proceedings of the 13th International Conference 
               on Artificial Intelligence and Statistics (AISTATS)},
  year =      {2010},
  series =    {W\&CP},
  volume =    {9},
  publisher = {JMLR},
}
