BioKEEN¶
BioKEEN (Biological KnowlEdge EmbeddiNgs) is a package for training and evaluating biological knowledge graph embeddings built on PyKEEN. Within BioKEEN several biomedical databases are directly accessible for training and evaluating biological knowledge graph embeddings (see Biological Databases).
Because BioKEEN uses PyKEEN as its underlying framework, implementations of 10 knowledge graph embedding models are currently available in BioKEEN. It can be run in training mode, in which users provide their own hyper-parameter values, or in hyper-parameter optimization mode, which searches for suitable values within a set of user-defined candidates. BioKEEN can also be used without programming experience through its interactive command line interface, which is started with the command “biokeen” from a terminal.
Installation is as easy as getting the code from PyPI with:
python3 -m pip install biokeen
Citation¶
If you use BioKEEN in your work, please cite [1]:
[1] Ali, M., et al. (2018). BioKEEN: A library for learning and evaluating biological knowledge graph embeddings.
Installation¶
There are several ways to download and install BioKEEN.
Warning
BioKEEN requires Python 3.6+
Train and Evaluate¶
Tutorial¶
Step 1: Start CLI¶
biokeen start
Step 2: Select data source¶

Step 3: Select database¶

Step 4: Specify execution mode¶

Step 5: Select KGE model¶

Step 6: Specify model dependent hyper-parameters¶
Step 7: Specify the batch-size¶

Step 8: Specify the number of training epochs¶

Step 9: Specify whether to evaluate the model¶

Step 10: Specify whether to evaluate the model¶

Step 11: Provide a random seed¶

Step 12: Specify preferred device¶

Step 13: Specify the path to the output directory¶

Reference¶
Perform Inference¶
Starting the Prediction Pipeline¶
biokeen predict -m /path/to/model/directory -d /path/to/data/directory
where the value of the -m argument is the directory containing the model. This directory must contain the following files:
- configuration.json
- entities_to_embeddings.json
- relations_to_embeddings.json
- trained_model.pkl
These files are created automatically when an experiment is configured through the CLI.
The value of the -d argument is the directory containing the data on which inference should be performed. It must contain the following files:
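Before starting a prediction run, it can help to confirm that the model directory is complete. The following is a minimal sketch and not part of BioKEEN: the helper name missing_model_files is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names listed above as required in the model directory.
REQUIRED_MODEL_FILES = (
    "configuration.json",
    "entities_to_embeddings.json",
    "relations_to_embeddings.json",
    "trained_model.pkl",
)


def missing_model_files(model_directory):
    """Return the required file names that are absent from model_directory."""
    directory = Path(model_directory)
    return [name for name in REQUIRED_MODEL_FILES if not (directory / name).is_file()]
```

An empty list means that all four files are present.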
- entities.tsv
- relations.tsv
where entities.tsv contains all entities of interest and relations.tsv contains all relations of interest. PyKEEN creates all possible triples from these entities and relations, computes predictions for them, and saves the results to predictions.tsv in the data directory.
Optionally, a set of triples that should be excluded from the prediction can be provided, e.g. all triples contained in the training set:
pykeen-predict -m /path/to/model/directory -d /path/to/data/directory -t /path/to/triples.tsv
This makes it easy to compute plausibility scores for all triples that are not contained in the training set.
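The candidate generation and exclusion described above can be illustrated in a few lines of Python. This is a sketch of the idea rather than the actual implementation: the function name candidate_triples and the in-memory toy lists are assumptions made for illustration.

```python
from itertools import product


def candidate_triples(entities, relations, excluded=()):
    """Yield every (subject, relation, object) combination not in excluded."""
    excluded = set(excluded)
    for subject, relation, obj in product(entities, relations, entities):
        triple = (subject, relation, obj)
        if triple not in excluded:
            yield triple


# Toy inputs standing in for entities.tsv, relations.tsv and the training triples.
entities = ["a", "b"]
relations = ["r"]
training_triples = [("a", "r", "b")]

candidates = list(candidate_triples(entities, relations, excluded=training_triples))
```

Note that the number of candidates grows with the square of the number of entities times the number of relations, so restricting entities.tsv and relations.tsv to the items of interest keeps the prediction run manageable.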
CLI Manual¶
Summarize all Experiments¶
Here, we describe how to summarize all experiments into a single CSV file. To get the summary, provide the path to the parent directory that contains all experiments as sub-directories, and the path to the output file:
biokeen summarize -d /path/to/experiments/directory -o /path/to/output/file.csv
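For readers who want to see roughly what such a summary involves, here is a minimal sketch that collects one row per experiment sub-directory. It is not BioKEEN's implementation: the function name summarize_experiments is hypothetical, and it assumes that each sub-directory holds a flat configuration.json, which may not match what biokeen summarize actually records.

```python
import csv
import json
from pathlib import Path


def summarize_experiments(experiments_directory, output_file):
    """Write one CSV row per sub-directory that contains a configuration.json."""
    rows = []
    for sub_directory in sorted(Path(experiments_directory).iterdir()):
        config_path = sub_directory / "configuration.json"
        if not config_path.is_file():
            continue  # skip anything that is not an experiment directory
        with config_path.open() as handle:
            config = json.load(handle)
        rows.append({"experiment": sub_directory.name, **config})

    # Use the union of all keys so experiments with different settings still fit.
    fieldnames = sorted({key for row in rows for key in row})
    with open(output_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The function returns the number of experiments it found, which gives a quick check that the parent directory was the intended one.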
Train and Evaluate¶
Here, we explain how to define and run experiments programmatically. This should be done using PyKEEN.
Configure your experiment¶
To programmatically train (and evaluate) a KGE model, create a Python dictionary that specifies the experiment:
config = dict(
    training_set_path='data/corpora/fb15k/compath.tsv',
    test_set_ratio=0.1,
    execution_mode='Training_mode',
    kg_embedding_model_name='TransE',
    embedding_dim=50,
    normalization_of_entities=2,  # corresponds to L2
    scoring_function=1,  # corresponds to L1
    margin_loss=1,
    learning_rate=0.01,
    batch_size=32,
    num_epochs=1000,
    filter_negative_triples=True,
    random_seed=2,
    preferred_device='cpu',
)
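To make the roles of scoring_function and margin_loss more concrete, the sketch below computes a TransE-style distance and a margin ranking loss in plain Python. It illustrates the general TransE formulation and is not PyKEEN's code: the function names and toy vectors are invented for illustration, and the mapping of scoring_function to the norm order follows the comment in the configuration above.

```python
def transe_distance(head, relation, tail, norm=1):
    """Distance ||head + relation - tail|| under the L1 (norm=1) or L2 (norm=2) norm."""
    differences = [h + r - t for h, r, t in zip(head, relation, tail)]
    if norm == 1:
        return sum(abs(d) for d in differences)
    return sum(d * d for d in differences) ** 0.5


def margin_ranking_loss(positive_distance, negative_distance, margin=1.0):
    """Zero once the negative triple is at least `margin` farther away than the positive one."""
    return max(0.0, margin + positive_distance - negative_distance)


# Toy two-dimensional embeddings (made up for this example).
head, relation = [0.5, 0.0], [0.5, 1.0]
true_tail, corrupted_tail = [1.0, 1.0], [0.0, 0.0]

positive = transe_distance(head, relation, true_tail)       # head + relation equals the tail
negative = transe_distance(head, relation, corrupted_tail)  # head + relation is far from the tail
loss = margin_ranking_loss(positive, negative, margin=1.0)
```

A smaller distance means a more plausible triple. Training pushes true triples to be at least the margin closer than corrupted ones, which is why the loss is already zero in this toy case.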
Run your experiment¶
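import pykeen

output_directory = '/path/to/output/directory'  # placeholder: where the exported results are saved
results = pykeen.run(
    config=config,
    output_directory=output_directory,
)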
Access your results¶
Show all keys contained in results:
print('Keys:', *sorted(results.results.keys()), sep='\n ')
Access trained KGE model¶
results.results['trained_model']
Access the losses¶
results.results['losses']
Access evaluation results¶
results.results['eval_summary']
Apply a Hyper-Parameter Optimization¶
Here, we describe how to define an experiment that performs a hyper-parameter optimization.
Configure your experiment¶
To run experiments programmatically, the core software library PyKEEN should be used. To run PyKEEN in hyper-parameter optimization (HPO) mode, set execution_mode to HPO_mode. In HPO mode, several candidate values can be provided for each hyper-parameter, and the hyper-parameter optimization algorithm tests different combinations of them. The candidate values for a single hyper-parameter must be provided as a list. The maximum_number_of_hpo_iters entry defines how many HPO iterations are performed.
config = dict(
    training_set_path='data/corpora/compath.tsv',
    test_set_ratio=0.1,
    execution_mode='HPO_mode',
    kg_embedding_model_name='TransE',
    embedding_dim=[50, 100, 150],
    normalization_of_entities=2,  # corresponds to L2
    scoring_function=[1, 2],  # corresponds to L1 and L2
    margin_loss=[1, 1.5, 2],
    learning_rate=[0.1, 0.01],
    batch_size=[32, 128],
    num_epochs=1000,
    maximum_number_of_hpo_iters=5,
    filter_negative_triples=True,
    random_seed=2,
    preferred_device='cpu',
)
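How list-valued hyper-parameters expand into concrete settings can be sketched as follows. This assumes a simple random search, which is one common HPO strategy and may differ from what PyKEEN does internally; sample_configuration is a hypothetical helper, not a PyKEEN function.

```python
import random


def sample_configuration(search_space, rng):
    """Pick one value for every list-valued entry and keep scalar entries unchanged."""
    return {
        key: rng.choice(value) if isinstance(value, list) else value
        for key, value in search_space.items()
    }


# Subset of the configuration above: lists are candidates, scalars are fixed.
search_space = dict(
    embedding_dim=[50, 100, 150],
    scoring_function=[1, 2],
    margin_loss=[1, 1.5, 2],
    learning_rate=[0.1, 0.01],
    batch_size=[32, 128],
    num_epochs=1000,
    maximum_number_of_hpo_iters=5,
)

rng = random.Random(2)  # seeded so the sampled settings are reproducible
trials = [
    sample_configuration(search_space, rng)
    for _ in range(search_space["maximum_number_of_hpo_iters"])
]
```

Each entry in trials is a complete, single-valued configuration. In an actual HPO run, each such setting would be trained and evaluated, and the best-performing one kept.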
Run your experiment¶
The experiment is started with the run function, and the exported results are saved in the output directory.
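import pykeen

output_directory = '/path/to/output/directory'  # placeholder: where the exported results are saved
results = pykeen.run(
    config=config,
    output_directory=output_directory,
)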
Access your results¶
Show all keys contained in results:
print('Keys:', *sorted(results.results.keys()), sep='\n ')
Access trained KGE model¶
results.results['trained_model']
Access the losses¶
results.results['losses']
Access evaluation results¶
results.results['eval_summary']
Handling BEL¶
biokeen.convert.to_pykeen_path(df, path)¶
Write the relationships in the BEL graph to a KEEN TSV file.
If you have a BEL graph, first do:
>>> from biokeen.convert import to_pykeen_df, to_pykeen_path
>>> graph = ...  # Something from PyBEL
>>> df = to_pykeen_df(graph)
>>> to_pykeen_path(df, 'graph.keen.tsv')
Return type: bool
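To give a feel for the kind of file this produces, the sketch below writes and reads triples as tab-separated lines using only the standard library. It assumes a three-column subject, predicate, object layout without a header. The helper names and the toy triples are made up for illustration and are not part of biokeen.convert.

```python
import csv


def write_triples_tsv(triples, path):
    """Write (subject, predicate, object) triples as tab-separated lines."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows(triples)


def read_triples_tsv(path):
    """Read the triples back as a list of 3-tuples."""
    with open(path, newline="") as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter="\t")]


# Toy triples standing in for relationships extracted from a BEL graph.
example_triples = [("A", "increases", "B"), ("B", "decreases", "C")]
```

Calling write_triples_tsv(example_triples, 'graph.keen.tsv') followed by read_triples_tsv('graph.keen.tsv') returns the same triples.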