Train User-Defined Knowledage Graphs

Users can use DGL-KE to train embeddings on their own knowledge graphs. In this case, users need to use --data_path to specify the path to the knowledge graph dataset, --data_files to specify the triplets of a knowledge graph as well as node/relation ID mapping, --format to specify the input format of the knowledge graph.

The input format of users’ knowledge graphs

Users need to store all the data associated with a knowledge graph in the same directory. DGL-KE supports two knowledge graph input formats:

  • Raw user-defined knowledge graphs: user only needs to provide triplets, both entities and relations in the triplets can be arbitrary strings. The dataloader will automatically generate the id mappings for entities and relations in the triplets. An example of triplets:
      train.tsv  
    Beijing is_capital_of China
    London is_capital_of UK
    UK located_at Europe
       
  • User-defined knowledge graphs: user need to provide the id mapping for entities and relations as well as the triplets of the knowledge graph. The triplets should only contains entities ids and relation ids. Here we assume the both the entities ids and relation ids start from 0 and should be contineous. An example of mapping and triplets files:
    entities.dict relation.dict train.tsv
    Beijing 0 is_capital_of 0 0 0 2
    London 1 located_at 1 1 0 3
    China 2   3 1 4
    UK 3    
    Europe 4    

Using raw user-defined knowledge graph format

Users need to store all the data associated with a knowledge graph in the same directory. DGL-KE supports two knowledge graph input formats:

raw_udd_[h|r|t]: In this format, users only need to provide triplets and the dataloader generates the id mappings for entities and relations in the triplets. The dataloader outputs two files: entities.tsv for entity id mapping and relations.tsv for relation id mapping while loading data. The order of head, relation and tail entities are described in [h|r|t], for example, raw_udd_trh means the triplets are stored in the order of tail, relation and head. The directory contains three files:

  • train stores the triplets in the training set. The format of a triplet, e.g., [src_name, rel_name, dst_name], should follow the order specified in [h|r|t]
  • valid stores the triplets in the validation set. The format of a triplet, e.g., [src_name, rel_name, dst_name], should follow the order specified in [h|r|t]. This is optional.
  • test stores the triplets in the test set. The format of a triplet, e.g., [src_name, rel_name, dst_name], should follow the order specified in [h|r|t]. This is optional.

Using user-defined knowledge graph format

udd_[h|r|t]: In this format, user should provide the id mapping for entities and relations. The order of head, relation and tail entities are described in [h|r|t], for example, raw_udd_trh means the triplets are stored in the order of tail, relation and head. The directory should contains five files:

  • entities stores the mapping between entity name and entity Id
  • relations stores the mapping between relation name relation Id
  • train stores the triplets in the training set. The format of a triplet, e.g., [src_id, rel_id, dst_id], should follow the order specified in [h|r|t]
  • valid stores the triplets in the validation set. The format of a triplet, e.g., [src_id, rel_id, dst_id], should follow the order specified in [h|r|t]
  • test stores the triplets in the test set. The format of a triplet, e.g., [src_id, rel_id, dst_id], should follow the order specified in [h|r|t]