===== Introduction =====
  
The method of **Structured Embeddings** (**SE**) for encoding knowledge base (KB) data (formalized as a multi-relational graph) was introduced by Antoine Bordes, Jason Weston, Ronan Collobert and Yoshua Bengio in the paper "Learning Structured Embeddings of Knowledge Bases", published in the proceedings of AAAI'11 (see the {{en:bordes11aaai.pdf|(pdf)}}).
  
However, after publication, two bugs in the experiments were detected (by Kevin Murphy's lab at Google and by Danqi Chen and Richard Socher at Stanford): there were duplicates between the training and test sets in the WordNet data, and some examples were excluded from the Freebase test set by the evaluation script. Hence, this erratum page clarifies the experiments of the original paper (basically by re-doing them properly). **The conclusions of these cleaned (and more detailed) experiments remain identical to those of the paper and still show that SE is clearly effective.**
  
The data sets have been re-generated and are available for download below. Similarly, the code for **SE** has been released as part of the SME package of Xavier Glorot, which is available from GitHub: [[https://github.com/glorotxa/SME|(code)]]. Hence, all experiments can be replicated.
  
  
==== Datasets ====
  
The data of interest is a collection of triplets {(//lhs//, //rel//, //rhs//)}, where //lhs// and //rhs// are termed entities (left-hand and right-hand, respectively) and //rel// is the relation type. The experiments consider two datasets:
  
  * **WordNet** is designed to produce an intuitively usable dictionary and thesaurus, and to support automatic text analysis. Its graph structure encodes comprehensive lexical knowledge: entities correspond to word senses (synsets), and relation types define lexical relations between them. We filtered out the entities appearing in fewer than 15 triplets. We obtain a graph with 40,943 synsets and 18 relation types, expressing 151,442 relations.
  
  * **Freebase** is a huge KB, formalized as a multi-relational graph. For simplicity of the experiments, we did not consider the whole KB (several million entities). Instead, we restricted ourselves to entities of the Freebase type <deceased_people> and considered the sub-graph defined by all relations involving at least one entity of this type. We only kept triplets whose entities are involved in at least 3 relations and whose relation types appear at least 5,000 times. We obtain a graph with 81,065 entities and 13 relation types, expressing 360,517 relations.
  
Both data sets were split into three sets for training, validation and testing, trying to enforce that each entity of the validation and test sets appears only once there (while also appearing in the training set, of course). The table below details the statistics of both data sets and provides links to download them. In the "Entities" column, "left" (resp. "right") indicates the number of entities appearing as //lhs// (resp. //rhs//). Hence, almost all WordNet entities can appear on both sides, whereas Freebase entities are much more separated.
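The exact scripts that produced the released files are not reproduced here, but the following is a minimal sketch of this preprocessing, assuming triplets are plain (lhs, rel, rhs) tuples; the function names, the frequency threshold (the WordNet one) and the split sizes are illustrative, not necessarily the ones actually used.

<code python>
import random
from collections import Counter

def filter_triplets(triplets, min_entity_count=15):
    # Keep only triplets whose two entities each appear in at least
    # 'min_entity_count' triplets (the WordNet threshold; Freebase used
    # different criteria, see the description above).
    counts = Counter()
    for lhs, rel, rhs in triplets:
        counts[lhs] += 1
        counts[rhs] += 1
    return [(lhs, rel, rhs) for lhs, rel, rhs in triplets
            if counts[lhs] >= min_entity_count and counts[rhs] >= min_entity_count]

def split_triplets(triplets, n_valid=5000, n_test=5000, seed=0):
    # Greedy split: a triplet is moved to validation/test only if (a) neither of
    # its entities has already been placed there ("appears only once there") and
    # (b) both entities occur in at least one later triplet, which is then forced
    # into training, so every evaluation entity also appears in the training set.
    shuffled = list(triplets)
    random.Random(seed).shuffle(shuffled)
    remaining = Counter()
    for lhs, _, rhs in shuffled:
        remaining[lhs] += 1
        remaining[rhs] += 1
    train, valid, test, held_out = [], [], [], set()
    for lhs, rel, rhs in shuffled:
        eligible = (lhs not in held_out and rhs not in held_out
                    and remaining[lhs] > 1 and remaining[rhs] > 1)
        if eligible and len(valid) < n_valid:
            valid.append((lhs, rel, rhs))
            held_out.update((lhs, rhs))
        elif eligible and len(test) < n_test:
            test.append((lhs, rel, rhs))
            held_out.update((lhs, rhs))
        else:
            train.append((lhs, rel, rhs))
        remaining[lhs] -= 1
        remaining[rhs] -= 1
    return train, valid, test
</code>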
  
^ Dataset ^ Triples (Train / Valid. / Test) ^ Entities (left / right) ^ Relation types ^ Download ^
==== Evaluation Protocol ====
  
Each evaluated model is designed to assign a score to any triplet (//lhs//, //rel//, //rhs//): the lower the score, the more likely the triplet. We assess the quality of the models using the following ranking task.
  
For each test triplet, the left entity (//lhs//) is removed and replaced in turn by each entity of the dictionary. The scores of these corrupted triplets are computed by the model and sorted in ascending order, and the rank of the correct entity is stored. The whole procedure is then repeated with the right-hand argument (//rhs//) removed instead.
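To make the protocol concrete, here is a minimal, non-authoritative sketch (the evaluation scripts released with the SME package are the reference); the `score` callable and the flat list of candidate entities are assumptions.

<code python>
import numpy as np

def ranking_evaluation(test_triplets, entities, score):
    # 'score(lhs, rel, rhs)' is the model being evaluated; it returns a real
    # number, and lower means more plausible (as in the protocol above).
    ranks = []
    for lhs, rel, rhs in test_triplets:
        true_score = score(lhs, rel, rhs)
        # Corrupt the left-hand side with every entity of the dictionary and
        # record the rank of the correct entity (1 = best).
        left_scores = np.array([score(cand, rel, rhs) for cand in entities])
        ranks.append(int(np.sum(left_scores < true_score)) + 1)
        # Same thing with the right-hand side.
        right_scores = np.array([score(lhs, rel, cand) for cand in entities])
        ranks.append(int(np.sum(right_scores < true_score)) + 1)
    ranks = np.array(ranks)
    return {"mean rank": float(ranks.mean()),
            "median rank": float(np.median(ranks)),
            "top-10": float(np.mean(ranks <= 10))}
</code>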
  
We then measure the **median and mean of these predicted ranks** and the mean **top-10** accuracy (the fraction of triplets for which the correct entity was ranked in the top 10). To produce the final result, we propose two ways of averaging:
  * **micro-averaging** considers all test triplets together and computes the mean and median over the 5,000 test examples. This gives more weight to frequent relation types.
  * **macro-averaging** first computes the metrics for each relation type independently and then averages those results. This gives equal weight to all relation types and hence provides more influence to rare types.
  
The combination of these two aggregation methods allows for a finer interpretation.
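The two aggregation schemes can be written down in a few lines; the sketch below is only illustrative (the names and the data layout, flat lists of ranks and relation types, are assumptions).

<code python>
import numpy as np
from collections import defaultdict

def micro_macro_metrics(ranks, rel_types):
    # ranks[i] is the predicted rank of test example i; rel_types[i] is the
    # relation type of that example.
    def metrics(r):
        r = np.asarray(r, dtype=float)
        return {"mean": float(r.mean()),
                "median": float(np.median(r)),
                "top-10": float(np.mean(r <= 10))}

    # Micro-averaging: pool every test example together (frequent relation
    # types dominate).
    micro = metrics(ranks)

    # Macro-averaging: compute the metrics per relation type, then average
    # them, so that rare relation types weigh as much as frequent ones.
    per_rel = defaultdict(list)
    for r, t in zip(ranks, rel_types):
        per_rel[t].append(r)
    per_rel_metrics = [metrics(v) for v in per_rel.values()]
    macro = {k: float(np.mean([m[k] for m in per_rel_metrics]))
             for k in ("mean", "median", "top-10")}
    return micro, macro
</code>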
  
**Important:** there is a difference when evaluating on **Freebase**. In this case, we only predict and rank right-hand sides (//rhs//), and the ranking is restricted to the 16,094 entities appearing as //rhs// in the training set.
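Under the same assumptions as the ranking sketch above, this Freebase variant simply corrupts only the right-hand side and restricts the candidate set:

<code python>
import numpy as np

def freebase_ranking_evaluation(test_triplets, train_rhs_entities, score):
    # Freebase variant: only the right-hand side is corrupted, and the
    # candidates are restricted to entities seen as rhs in the training set.
    ranks = []
    for lhs, rel, rhs in test_triplets:
        true_score = score(lhs, rel, rhs)
        cand_scores = np.array([score(lhs, rel, cand) for cand in train_rhs_entities])
        ranks.append(int(np.sum(cand_scores < true_score)) + 1)
    return np.array(ranks)
</code>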
  
  
Here are the methods we evaluated:
  
  * **Structured Embeddings** (**SE**) uses the model defined in the original AAAI paper, trained on the training set, with hyperparameters chosen on the validation set.
  * **Unstructured Embeddings** (**Emb**) is an unstructured version of **SE**. In this case, the score of a triplet (//lhs//, //rel//, //rhs//) is simply determined by the dot-product between the embeddings learnt for //lhs// and //rhs//, and is independent of //rel//.
  * **Counts** estimates the score of a triplet (//lhs//, //rel//, //rhs//) by summing the frequencies of the two bigrams (//lhs//, //rel//) and (//rel//, //rhs//) in the training set (without smoothing); a sketch of **Emb** and **Counts** is given after this list.
  * **Random** predicts the score of a triplet at random.
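For clarity, here is a rough sketch of the two simplest baselines (**Emb** and **Counts**), written as scorers compatible with the ranking protocol above. This is not the released code; in particular, the negated scores (so that lower means more plausible) and the class and attribute names are assumptions.

<code python>
import numpy as np
from collections import Counter

class CountsBaseline:
    """Score a triplet from bigram frequencies observed in the training set."""
    def __init__(self, train_triplets):
        self.lhs_rel = Counter((lhs, rel) for lhs, rel, _ in train_triplets)
        self.rel_rhs = Counter((rel, rhs) for _, rel, rhs in train_triplets)

    def score(self, lhs, rel, rhs):
        # Frequencies are summed without smoothing; the sign is flipped here so
        # that, as in the evaluation protocol, lower scores mean more plausible
        # triplets (this sign convention is our assumption).
        return -(self.lhs_rel[(lhs, rel)] + self.rel_rhs[(rel, rhs)])

class EmbBaseline:
    """Unstructured embeddings: the score ignores the relation type."""
    def __init__(self, embeddings):
        # 'embeddings' maps each entity to its learnt vector; how the vectors
        # are trained is outside the scope of this sketch.
        self.emb = embeddings

    def score(self, lhs, rel, rhs):
        # Negative dot-product between the lhs and rhs embeddings; 'rel' is
        # deliberately unused.
        return -float(np.dot(self.emb[lhs], self.emb[rhs]))
</code>

Either object's `score` method can be passed directly to the ranking sketches given earlier.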
  
We did not re-run the experiments with KDE (Kernel Density Estimation), since they are very costly and did not bring much improvement. For us, the core of the method lies in **SE** alone.
  
===== Experimental Results =====
  
  
**SE** outperforms all other methods by a wide margin, except on mean rank, for which **Emb** performs well. This is due to the fact that for word senses (as in WordNet) the lexical field is rather limited. Hence, predicting the same list of //rhs// given a //lhs// (or of //lhs// given a //rhs//), independently of //rel//, already gives fair results according to this metric. Still, **Emb** does not really model the data and is greatly outperformed on the other metrics. The very low median rank and the high top-10 accuracy indicate that **SE** is capable of very good performance.
  
  
^ Random      |     8060 / 8010     |     0.06    |     7710 / 7710      |     0.03    ^
  
**SE** also performs best, except on the micro-averaged mean rank, for which **Emb** is slightly better, for similar reasons as above. Overall, this data set is much harder than WordNet: it is hard for a model to generalize on this sub-graph, mostly because the connectivity is too low to allow for information propagation.
  
  
  
For more recent work on the topic, see also the page of the **"Semantic Matching Energy"** project: [[https://www.hds.utc.fr/everest/doku.php|(sme-project)]].