Antoine Bordes


Erratum on "Learning Structured Embeddings of Knowledge Bases"

This erratum was written by Antoine Bordes (CNRS - Heudiasyc/UTC) with great help from Xavier Glorot (Université de Montréal), who wrote most of the code and conducted most of the new experiments.


The method of Structured Embeddings (SE) for encoding knowledge base (KB) data (formalized as a multi-relational graph) was introduced by Antoine Bordes, Jason Weston, Ronan Collobert and Yoshua Bengio in the paper “Learning Structured Embeddings of Knowledge Bases”, published in the proceedings of AAAI'11 (pdf).

However, after publication, two bugs in the experiments were detected (by Kevin Murphy's lab at Google and by Danqi Chen and Richard Socher at Stanford): there were duplicates between the training and test sets of the WordNet data, and some examples were excluded from the Freebase test set by the evaluation script. Hence, this erratum page was built to clarify the experiments of the original paper (basically by re-doing them properly). The conclusions of these cleaned (and more detailed) experiments are identical to those of the paper and still show the clear efficiency of SE.

The data sets have been re-generated and are available for download below. Similarly, the code for SE has been released as part of the SME package of Xavier Glorot, which is available from Github: (code). Hence, all experiments can be replicated.

Experimental setup

This section details the experimental setting, following that of the paper: “Learning Structured Embeddings of Knowledge Bases”.


The data of interest consists of a collection of triplets {(lhs, rel, rhs)}, where lhs and rhs are termed entities (resp. left-hand and right-hand) and rel is the relation type. The experiments consider two datasets:

  • WordNet is designed to provide an intuitively usable dictionary and thesaurus, and to support automatic text analysis. Its graph encodes comprehensive lexical knowledge: entities correspond to word senses (synsets), and relation types define lexical relations between them. We filtered out the entities appearing in fewer than 15 triplets. We obtain a graph with 40,943 synsets and 18 relation types, with 151,442 expressed relations.
  • Freebase is a huge KB, formalized as a multi-relational graph. For simplicity of the experiments we did not consider the whole KB (several millions of entities). Instead, we restricted ourselves to entities of the Freebase type <deceased_people>, and considered the sub-graph defined by all relations involving at least one entity of this type. We only kept triplets whose entities are involved in at least 3 relations and whose relation types appear at least 5,000 times. We obtain a graph with 81,065 entities and 13 relation types, with 360,517 expressed relations.

Both data sets were split into 3 sets for training, validation and testing, trying to ensure that each entity of the validation and test sets appears there only once (while, of course, also appearing in the training set). The table below details the statistics of both data sets and provides links to download them. In the “Entities” column, “left” (resp. “right”) indicates the number of entities appearing as lhs (resp. rhs). Hence, almost all WordNet entities can appear on both sides, whereas Freebase entities are much more separated between sides.

Dataset Triples (Train / Valid. / Test) Entities (left / right) Relation types Download
WordNet 151,442 (141,442 / 5k / 5k) 40,943 (40,504 / 40,551) 18 (wn-data)
Freebase 360,517 (350,517 / 5k / 5k) 81,065 (72,756 / 16,094) 13 (fb-data)

Evaluation Protocol

Each evaluated model is designed to assign a score to any triplet (lhs, rel, rhs): the lower the score, the more likely the triplet. We assess the quality of the models using the following ranking task.

For each test triplet, the left entity (lhs) is removed and replaced in turn by each entity of the dictionary. The scores of these corrupted triplets are computed by the model and sorted in ascending order, and the rank of the correct entity is stored. The whole procedure is repeated when removing the right-hand argument (rhs) instead.
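This ranking protocol can be sketched as follows; `score` stands for any hypothetical scoring function `score(lhs, rel, rhs) -> float` (lower means more likely), not the actual code of the SME package:

```python
import numpy as np

def ranks_for_triplets(test_triplets, entities, score):
    """For each test triplet, rank the correct lhs (resp. rhs) among
    all candidate entities, by counting how many corrupted triplets
    receive a strictly lower score than the true one."""
    left_ranks, right_ranks = [], []
    for lhs, rel, rhs in test_triplets:
        true_score = score(lhs, rel, rhs)
        # Corrupt the left-hand side with every entity in turn.
        lhs_scores = np.array([score(e, rel, rhs) for e in entities])
        left_ranks.append(int((lhs_scores < true_score).sum()) + 1)
        # Same procedure, corrupting the right-hand side instead.
        rhs_scores = np.array([score(lhs, rel, e) for e in entities])
        right_ranks.append(int((rhs_scores < true_score).sum()) + 1)
    return left_ranks, right_ranks
```

Counting strictly lower scores (plus one) is one reasonable tie-breaking convention; the original evaluation script may handle ties differently.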

We then measure the median and mean of these predicted ranks, as well as the top-10 accuracy (the fraction of triplets for which the correct entity was ranked within the top 10). To produce the final result, we propose two ways of averaging:

  • micro-averaging considers all test triplets together, and computes the mean and median over the 5,000 test examples. This gives more weight to frequent relation types.
  • macro-averaging first computes the metrics for each relation type independently and then averages those results. This gives equal weight to all relation types and hence more influence to rare types.

The combination of these two aggregation methods allows for a finer interpretation.
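The difference between the two aggregations can be illustrated with a small sketch (shown here for the mean rank; the median and top-10 accuracy are aggregated the same way):

```python
import numpy as np
from collections import defaultdict

def micro_macro_mean_rank(ranks, rel_types):
    """Micro- and macro-averaged mean rank.
    ranks[i] is the predicted rank of test triplet i, whose relation
    type is rel_types[i]."""
    ranks = np.asarray(ranks, dtype=float)
    # Micro: every test triplet weighs the same.
    micro = ranks.mean()
    # Macro: average per relation type first, then average those means,
    # so every relation type weighs the same.
    per_rel = defaultdict(list)
    for rank, rel in zip(ranks, rel_types):
        per_rel[rel].append(rank)
    macro = float(np.mean([np.mean(v) for v in per_rel.values()]))
    return micro, macro
```

With ranks [1, 1, 1, 10] where the last triplet belongs to a rare relation type, the micro-average is 3.25 while the macro-average is 5.5: the rare type dominates the macro score.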

Important: there is a difference when evaluating on Freebase. In this case, we only predict and rank right-hand sides (rhs), and the ranking is restricted to the 16,094 entities appearing as rhs in the training set.


Here are the methods we evaluated:

  • Structured Embeddings (SE) is the model defined in the original AAAI paper, trained on the training set with hyperparameters chosen on the validation set.
  • Unstructured Embeddings (Emb) is an unstructured version of SE. In this case, the score of a triplet (lhs, rel, rhs) is simply determined by the dot-product between embeddings learnt for lhs and rhs and is independent of rel.
  • Counts estimates the score of a triplet (lhs, rel, rhs) by summing the frequencies of the two bigrams (lhs, rel) and (rel, rhs) in the training set (without smoothing).
  • Random predicts the score of a triplet at random.

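The Counts baseline is simple enough to sketch in a few lines. The sign convention is an assumption here: frequencies are negated so that, as everywhere on this page, lower scores mean more likely triplets:

```python
from collections import Counter

def counts_scorer(train_triplets):
    """Counts baseline: the score of (lhs, rel, rhs) is minus the sum
    of the training-set frequencies of the bigrams (lhs, rel) and
    (rel, rhs), without smoothing."""
    left_bigrams = Counter((l, r) for l, r, _ in train_triplets)
    right_bigrams = Counter((r, rh) for _, r, rh in train_triplets)
    return lambda l, r, rh: -(left_bigrams[(l, r)] + right_bigrams[(r, rh)])
```

Unseen bigrams simply contribute a count of zero, which is why this baseline cannot discriminate among triplets whose parts never co-occurred in training.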
We did not re-run the experiments with KDE (Kernel Density Estimation), since they are very costly and did not bring much improvement. For us, the core of the method lies in SE alone.
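As a reminder of how SE scores a triplet in the AAAI'11 paper: each entity has a d-dimensional embedding, each relation type has two d x d matrices (one per side), and the score is the L1 distance between the two projected embeddings. A minimal sketch (shapes and variable names are illustrative, not those of the SME package):

```python
import numpy as np

def se_scorer(E, R_left, R_right):
    """Structured Embeddings score.
    E        : (d, n_entities) matrix of entity embeddings (one column each)
    R_left   : (n_relations, d, d) left-projection matrices
    R_right  : (n_relations, d, d) right-projection matrices
    The score is ||R_left[rel] @ e_lhs - R_right[rel] @ e_rhs||_1;
    lower means more likely."""
    def score(lhs, rel, rhs):
        diff = R_left[rel] @ E[:, lhs] - R_right[rel] @ E[:, rhs]
        return float(np.abs(diff).sum())
    return score
```

During training, these parameters are learnt so that observed triplets score lower than corrupted ones; see the released SME code for the actual implementation.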

Experimental Results

This section presents the results. In short, SE is much more efficient than its counterparts. Best results are in bold.


WordNet results:

Methods Micro-averaging Macro-averaging
Rank (median / mean) top-10 (%) Rank (median / mean) top-10 (%)
SE 3 / 1011 68.46 30 / 1369 61.80
Emb 26 / 317 35.05 106 / 483 27.25
Counts 10613 / 16302 5.06 10168 / 12168 15.26
Random 20820 / 20540 0.04 21554 / 20770 0.03

SE outperforms all other methods by a wide margin, except on mean rank, for which Emb performs better. This is because the lexical field of word senses (as in WordNet) is rather limited: predicting the same list of rhs given a lhs (or of lhs given a rhs), independently of rel, already gives fair results according to this metric. Still, Emb does not really model the data and is greatly outperformed on the other metrics. The very low median rank and the high top-10 accuracy indicate that SE is capable of very good performance.


Freebase results:

Methods Micro-averaging Macro-averaging
Rank (median / mean) top-10 (%) Rank (median / mean) top-10 (%)
SE 178 / 1544 15.10 202 / 1329 18.76
Emb 265 / 1485 11.88 549 / 1664 12.74
Counts 1097 / 2523 1.58 1404 / 1894 11.53
Random 8060 / 8010 0.06 7710 / 7710 0.03

SE also performs best, except on the micro-averaged mean rank, for which Emb is slightly better, for reasons similar to the above. Overall, this data set is much harder than WordNet: it is hard for a model to generalize on this sub-graph, mostly because its connectivity is too low to allow for information propagation.


For more details and information, please contact:
Antoine Bordes: Heudiasyc, UMR CNRS 7253, Université de Technologie de Compiègne, France.

For more recent work on the topic, see also the page of the “Semantic Matching Energy” project: (sme-project).
