Distributed Training on Large Data
------------------------------------

``dglke_dist_train`` trains knowledge graph embeddings on a cluster of machines. DGL-KE adopts the *parameter-server* architecture for distributed training.

.. image:: ../images/dist_train.png
    :width: 400

In this architecture, the entity embeddings and relation embeddings are stored in DGL KVStore. The trainer processes pull the latest model from the KVStore and push the calculated gradients back to the KVStore to update the model. All processes train the KG embeddings with asynchronous SGD.

Arguments
^^^^^^^^^

The command line provides the following arguments:

- ``--model_name {TransE, TransE_l1, TransE_l2, TransR, RESCAL, DistMult, ComplEx, RotatE}``
  The models provided by DGL-KE.

- ``--data_path DATA_PATH``
  The path of the directory where DGL-KE loads knowledge graph data.

- ``--dataset DATA_SET``
  The name of the knowledge graph stored under ``data_path``. The knowledge graph should be generated by the partition script.

- ``--format FORMAT``
  The format of the dataset. For built-in knowledge graphs, the format should be *built_in*. For users' own knowledge graphs, it needs to be *raw_udd_{htr}* or *udd_{htr}*.

- ``--save_path SAVE_PATH``
  The path of the directory where models and logs are saved.

- ``--no_save_emb``
  Disable saving the embeddings under ``save_path``.

- ``--max_step MAX_STEP``
  The maximal number of steps to train the model in a single process. A step trains the model with a batch of data. In the case of multiprocessing training, the total number of training steps is ``MAX_STEP`` * ``NUM_PROC``.

- ``--batch_size BATCH_SIZE``
  The batch size for training.

- ``--batch_size_eval BATCH_SIZE_EVAL``
  The batch size used for validation and test.

- ``--neg_sample_size NEG_SAMPLE_SIZE``
  The number of negative samples we use for each positive sample in the training.

- ``--neg_deg_sample``
  Construct negative samples proportional to vertex degree in the training. When this option is turned on, the number of negative samples per positive edge will be doubled. Half of the negative samples are generated uniformly, while the other half are generated proportional to vertex degree.

- ``--neg_deg_sample_eval``
  Construct negative samples proportional to vertex degree in the evaluation.

- ``--neg_sample_size_eval NEG_SAMPLE_SIZE_EVAL``
  The number of negative samples we use to evaluate a positive sample.

- ``--eval_percent EVAL_PERCENT``
  Randomly sample some percentage of edges for evaluation.

- ``--no_eval_filter``
  Disable filtering positive edges out of the randomly constructed negative edges for evaluation.

- ``-log LOG_INTERVAL``
  Print the runtime of different components every *x* steps.

- ``--test``
  Evaluate the model on the test set after the model is trained.

- ``--num_proc NUM_PROC``
  The number of processes to train the model in parallel.

- ``--num_thread NUM_THREAD``
  The number of CPU threads to train the model in each process. This argument is used for multiprocessing training.

- ``--force_sync_interval FORCE_SYNC_INTERVAL``
  Force a synchronization between processes every *x* steps for multiprocessing training. This potentially stabilizes the training process and leads to better performance. For multiprocessing training, it is set to 1000 by default.

- ``--hidden_dim HIDDEN_DIM``
  The embedding size of relations and entities.

- ``--lr LR``
  The learning rate. DGL-KE uses Adagrad to optimize the model parameters.

- ``-g GAMMA`` or ``--gamma GAMMA``
  The margin value in the score function. It is used by *TransX* and *RotatE*.
- ``-de`` or ``--double_ent``
  Double the entity dimension for complex numbers. It is used by *RotatE*.

- ``-dr`` or ``--double_rel``
  Double the relation dimension for complex numbers.

- ``-adv`` or ``--neg_adversarial_sampling``
  Indicate whether to use negative adversarial sampling. It weights negative samples with higher scores more.

- ``-a ADVERSARIAL_TEMPERATURE`` or ``--adversarial_temperature ADVERSARIAL_TEMPERATURE``
  The temperature used for negative adversarial sampling.

- ``-rc REGULARIZATION_COEF`` or ``--regularization_coef REGULARIZATION_COEF``
  The coefficient for regularization.

- ``-rn REGULARIZATION_NORM`` or ``--regularization_norm REGULARIZATION_NORM``
  The norm used in regularization.

- ``--path PATH``
  The path of the distributed workspace.

- ``--ssh_key SSH_KEY``
  The ssh private key.

- ``--ip_config IP_CONFIG``
  The path of the IP configuration file.

- ``--num_client_proc NUM_CLIENT_PROC``
  The number of worker processes on each machine.

The Steps for Distributed Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Distributed training on DGL-KE usually involves three steps:

1. Partition a knowledge graph.
2. Copy the partitioned data to the remote machines.
3. Invoke the distributed training job by ``dglke_dist_train``.

Here we demonstrate how to train KG embeddings on the ``FB15k`` dataset using 4 machines. Note that ``FB15k`` is just a small dataset used as a toy demo. Interested users can try ``Freebase``, which contains *86M* nodes and *338M* edges.

**Step 1: Prepare your machines**

Assume that we have four machines with the following IP addresses::

    machine_0: 172.31.24.245
    machine_1: 172.31.24.246
    machine_2: 172.31.24.247
    machine_3: 172.31.24.248

Make sure that *machine_0* has the permission to *ssh* to all the other machines.

**Step 2: Prepare your data**

Create a new directory called ``my_task`` on machine_0::

    mkdir my_task

We use the built-in ``FB15k`` as a demo and partition it into ``4`` parts::

    dglke_partition --dataset FB15k -k 4 --data_path ~/my_task

Note that we have 4 machines in this demo, so we set ``-k`` to 4. After this step, we can see 4 new directories called ``partition_0``, ``partition_1``, ``partition_2``, and ``partition_3`` in the ``FB15k`` dataset folder.

Create a new file called ``ip_config.txt`` in ``my_task``, and write the following contents into it::

    172.31.24.245 30050 8
    172.31.24.246 30050 8
    172.31.24.247 30050 8
    172.31.24.248 30050 8

Each line in ``ip_config.txt`` is the KVStore configuration of one machine. For example, ``172.31.24.245 30050 8`` means that, on ``machine_0``, the IP is ``172.31.24.245``, the base port is ``30050``, and we start ``8`` servers on this machine. Note that you can change the number of servers on each machine based on your machine's capabilities. In our environment, the instance has ``48`` cores, and we assign ``8`` cores to the KVStore and ``40`` cores to the worker processes.

After that, we can copy the ``my_task`` directory to all the remote machines::

    scp -r ~/my_task 172.31.24.246:~
    scp -r ~/my_task 172.31.24.247:~
    scp -r ~/my_task 172.31.24.248:~

**Step 3: Launch distributed jobs**

Run the following command on ``machine_0`` to start a distributed task::

    dglke_dist_train --path ~/my_task --ip_config ~/my_task/ip_config.txt \
    --num_client_proc 16 --model_name TransE_l2 --dataset FB15k --data_path ~/my_task --hidden_dim 400 \
    --gamma 19.9 --lr 0.25 --batch_size 1000 --neg_sample_size 200 --max_step 500 --log_interval 100 \
    --batch_size_eval 16 --test -adv --regularization_coef 1.00E-09 --num_thread 1
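If the job fails to launch, a quick sanity check is to confirm from ``machine_0`` that every machine listed in ``ip_config.txt`` is reachable over passwordless *ssh* and already has the copied workspace. The snippet below is only an illustrative sketch, not part of DGL-KE::

    # Read the IP column of ip_config.txt and probe each machine.
    for ip in $(awk '{print $1}' ~/my_task/ip_config.txt); do
        ssh -o BatchMode=yes "$ip" "ls ~/my_task/partition_*" > /dev/null \
            && echo "$ip: ready" || echo "$ip: check ssh access or data"
    done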
Most of these options have already been covered in the previous sections. Here are the new options we need to know.

``--path`` indicates the absolute path of our workspace. All the logs and the trained embeddings will be stored in this path.

``--ip_config`` is the absolute path of ``ip_config.txt``.

``--num_client_proc`` behaves the same as ``--num_proc`` in single-machine training.

All the other options are the same as in single-machine training. Some EC2 users may also need to set ``--ssh_key`` to get the right *ssh* permission.

If you do not set the ``--no_save_emb`` option, the trained KG embeddings will be stored under ``my_task/ckpts`` on ``machine_0`` by default.
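After training finishes, you can verify that the embeddings were saved by listing the checkpoint directory on ``machine_0``. The exact subdirectory and file names depend on the model, the dataset, and the DGL-KE version, so the listing below is only illustrative::

    ls ~/my_task/ckpts/
    # Expect a subdirectory such as TransE_l2_FB15k_0/ that contains the
    # entity and relation embeddings saved as .npy files.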