
This erratum was written by **Antoine Bordes** (CNRS - Heudiasyc/UTC) with great help from **Xavier Glorot** (Université de Montréal), who wrote most of the code and conducted most of the new experiments.

The method of **Structured Embeddings** (**SE**) for encoding knowledge base (KB) data (formalized as a multi-relational graph) was introduced by Antoine Bordes, Jason Weston, Ronan Collobert and Yoshua Bengio in the paper “Learning Structured Embeddings of Knowledge Bases”, published in the proceedings of AAAI'11 (pdf).

However, after publication, two bugs in the experiments were detected (by Kevin Murphy's lab at Google and by Danqi Chen and Richard Socher at Stanford): there were duplicates between the training and test sets in the WordNet data, and some examples were excluded from the Freebase test set by the evaluation script. Hence, this erratum page was built to clarify the experiments of the original paper (basically by re-doing them properly). **The conclusions of these cleaned (and more detailed) experiments remain identical to those of the paper and still show a clear efficiency of SE.**

The data sets have been re-generated and are available for download below. Similarly, the code for **SE** has been released as part of Xavier Glorot's SME package, which is available from Github: (code). Hence, all experiments can be replicated.

This section details the experimental setting, following that of the paper: “Learning Structured Embeddings of Knowledge Bases”.

The data of interest is made of a collection of triplets {(*lhs*, *rel*, *rhs*)}, where *lhs* and *rhs* are termed entities (resp. left-hand and right-hand) and *rel* is the relation type. The experiments consider two datasets:

**WordNet** is designed to produce an intuitively usable dictionary and thesaurus, and to support automatic text analysis. It encompasses comprehensive lexical knowledge within its graph structure, whose entities correspond to word senses (synsets), and whose relation types define lexical relations between them. We filtered out the entities appearing in fewer than 15 triplets. We obtain a graph with 40,943 synsets and 18 relation types, with 151,442 expressed relations.

**Freebase** is a huge KB, formalized as a multi-relational graph. For simplicity of the experiments we did not consider the whole KB (several million entities). Instead, we restricted ourselves to entities of the Freebase type <deceased_people>, and considered the sub-graph defined by all relations involving at least one entity of this type. We only kept triplets regarding entities involved in at least 3 relations and relation types appearing at least 5,000 times. We obtain a graph with 81,065 entities and 13 relation types, with 360,517 expressed relations.

Both data sets were split into 3 sets for training, validation and testing, trying to enforce that each entity of the validation and test sets appears only once there (while, of course, also appearing in the training set). The table below details the statistics of both data sets and provides weblinks to download them. For the “Entities” column, “left” (resp. “right”) indicates the number of entities appearing as *lhs* (resp. *rhs*). Hence, almost all WordNet entities can appear on both sides, whereas Freebase entities are much more separated.
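The splitting constraint above can be approximated greedily: hold a triplet out only if both of its entities still occur in triplets that will remain in training, and have not already been held out. This is a minimal sketch under those assumptions (`split_triplets` is a hypothetical helper, not the script used to generate the released data sets):

```python
import random
from collections import Counter

def split_triplets(triplets, n_valid, n_test, seed=0):
    """Greedy split of (lhs, rel, rhs) triplets into train/valid/test.

    A triplet is held out for valid/test only if both of its entities
    (i) still occur in other, later triplets (so they stay in training) and
    (ii) have not already been placed in the validation or test set."""
    pool = list(triplets)
    random.Random(seed).shuffle(pool)
    # number of not-yet-processed triplets each entity appears in
    remaining = Counter()
    for lhs, _, rhs in pool:
        remaining[lhs] += 1
        remaining[rhs] += 1
    train, valid, test = [], [], []
    held_entities = set()
    for t in pool:
        lhs, _, rhs = t
        target = valid if len(valid) < n_valid else (test if len(test) < n_test else None)
        if (target is not None and remaining[lhs] > 1 and remaining[rhs] > 1
                and lhs not in held_entities and rhs not in held_entities):
            target.append(t)
            held_entities.update((lhs, rhs))
        else:
            train.append(t)
        remaining[lhs] -= 1
        remaining[rhs] -= 1
    return train, valid, test
```

Because a held-out entity is never held out twice, all its later occurrences fall into training, which enforces both constraints at once.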

Each evaluated model is designed to assign a score to any triplet (*lhs*, *rel*, *rhs*): the lower the score, the more likely the triplet is. We assess the quality of the models using the following ranking task.

For each test triplet, the left entity (*lhs*) is removed and replaced by each of the entities of the dictionary in turn. Scores of those corrupted triplets are computed by the model and sorted in ascending order, and the rank of the correct entity is stored. This whole procedure is then repeated when removing the right-hand argument (*rhs*) instead.
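The procedure above can be sketched as follows, assuming a generic `score(lhs, rel, rhs)` function that follows the paper's convention (lower score = more plausible triplet); `rank_of_correct` is a hypothetical helper name:

```python
def rank_of_correct(score, triplet, entities, side="lhs"):
    """Rank of the true entity among all candidates when one side of a
    test triplet is corrupted; lower `score` means more plausible."""
    lhs, rel, rhs = triplet
    if side == "lhs":
        # replace lhs by every candidate entity, sort by ascending score
        ranking = sorted(entities, key=lambda e: score(e, rel, rhs))
        return ranking.index(lhs) + 1
    # same procedure, corrupting the right-hand side instead
    ranking = sorted(entities, key=lambda e: score(lhs, rel, e))
    return ranking.index(rhs) + 1
```

A model that gives the true triplet the strictly lowest score yields rank 1 on both sides.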

We then measure the **median and mean of these predicted ranks** and the mean **top-10** accuracy (the fraction of triplets for which the correct entity was ranked in the top 10). To produce the final result, we propose two ways of averaging:

- **micro-averaging** considers all test triplets together, and computes the mean and median on the 5,000 test examples. This gives more weight to frequent relation types.
- **macro-averaging** first computes the metrics for each relation type independently and then averages those results. This gives equal weight to all relation types and hence more influence to rare types.

The combination of these two aggregation methods allows for a finer interpretation.
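The two aggregations can be sketched as follows, given one (relation type, rank) pair per ranking test (`micro_macro` is a hypothetical helper name):

```python
from statistics import mean, median

def micro_macro(results):
    """`results`: list of (relation_type, rank) pairs, one per ranking test.
    Returns (micro, macro), each a (median rank, mean rank, top-10 %) triple."""
    top10 = lambda rs: 100.0 * sum(r <= 10 for r in rs) / len(rs)
    # micro: pool every test prediction together
    ranks = [r for _, r in results]
    micro = (median(ranks), mean(ranks), top10(ranks))
    # macro: compute per relation type, then average the per-type metrics
    by_rel = {}
    for rel, r in results:
        by_rel.setdefault(rel, []).append(r)
    per_rel = [(median(rs), mean(rs), top10(rs)) for rs in by_rel.values()]
    macro = tuple(mean(col) for col in zip(*per_rel))
    return micro, macro
```

On skewed data the two can differ a lot: a single rare, badly-predicted relation type barely moves the micro numbers but pulls the macro average down.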

**Important:** there is a difference when evaluating on **Freebase**. In this case, we only predict and rank right-hand sides (*rhs*), and the ranking is only among the 16,094 entities appearing as *rhs* in the training set.

Here are the methods we evaluated:

- **Structured Embeddings** (**SE**) uses the model defined in the original AAAI paper, trained on the training set and whose hyperparameters have been chosen on the validation set.
- **Unstructured Embeddings** (**Emb**) is an unstructured version of **SE**. In this case, the score of a triplet (*lhs*, *rel*, *rhs*) is simply determined by the dot-product between embeddings learnt for *lhs* and *rhs*, and is independent of *rel*.
- **Counts** estimates the score of a triplet (*lhs*, *rel*, *rhs*) by summing the frequencies of the two bigrams (*lhs*, *rel*) and (*rel*, *rhs*) in the training set (without smoothing).
- **Random** predicts the score of a triplet at random.

We did not re-run the experiments with KDE (Kernel Density Estimation), since they are very costly and did not bring much improvement. For us, the core of the method lies in **SE** alone.

This section presents the results. In short, **SE is much more efficient than its counterparts**. Best results are in **bold**.

Results on **WordNet**:

| Methods | Micro rank (median / mean) | Micro top-10 (%) | Macro rank (median / mean) | Macro top-10 (%) |
|---|---|---|---|---|
| **SE** | **3** / 1011 | **68.46** | **30** / 1369 | **61.80** |
| Emb | 26 / **317** | 35.05 | 106 / **483** | 27.25 |
| Counts | 10613 / 16302 | 5.06 | 10168 / 12168 | 15.26 |
| Random | 20820 / 20540 | 0.04 | 21554 / 20770 | 0.03 |

**SE** outperforms all other methods by a wide margin, except on mean rank, for which **Emb** performs better. This is due to the fact that for word senses (as in WordNet) the lexical field is rather limited. Hence, predicting the same list of *rhs* given a *lhs* (or of *lhs* given a *rhs*), independently of *rel*, already gives fair results according to this metric. Still, **Emb** does not really model the data and is greatly outperformed on the other metrics. The very low median rank and the high top-10 accuracy indicate that **SE** achieves very good performance.

Results on **Freebase**:

| Methods | Micro rank (median / mean) | Micro top-10 (%) | Macro rank (median / mean) | Macro top-10 (%) |
|---|---|---|---|---|
| **SE** | **178** / 1544 | **15.10** | **202** / **1329** | **18.76** |
| Emb | 265 / **1485** | 11.88 | 549 / 1664 | 12.74 |
| Counts | 1097 / 2523 | 1.58 | 1404 / 1894 | 11.53 |
| Random | 8060 / 8010 | 0.06 | 7710 / 7710 | 0.03 |

**SE** also performs best, except for the micro-averaged mean rank, for which **Emb** is slightly better, for similar reasons as above. Overall, this data set is much harder than WordNet. It is hard for a model to generalize on this sub-graph, mostly because its connectivity is too low to allow information propagation.

For more details and information, please contact:

Antoine Bordes: Heudiasyc, UMR CNRS 7253, Université de Technologie de Compiègne, France.

For more recent work on the topic, see also the page of the **“Semantic Matching Energy”** project: (sme-project).