BioKEEN

BioKEEN (Biological KnowlEdge EmbeddiNgs) is a package for training and evaluating biological knowledge graph embeddings, built on PyKEEN. Within BioKEEN, several biomedical databases are directly accessible for training and evaluation (see Biological Databases).

Because PyKEEN serves as the underlying framework, implementations of 10 knowledge graph embedding models are currently available in BioKEEN. Furthermore, it can be run in training mode, in which users provide their own set of hyper-parameter values, or in hyper-parameter optimization mode, which finds suitable hyper-parameter values from a set of user-defined values. BioKEEN can also be used without programming experience through its interactive command line interface, which can be started with the command “biokeen” from a terminal.

Installation is as easy as getting the code from PyPI with python3 -m pip install biokeen.

Citation

If you use BioKEEN in your work, please cite [1]:

[1] Ali, M., et al. (2018). BioKEEN: A library for learning and evaluating biological knowledge graph embeddings.

Installation

There are several ways to download and install BioKEEN.

Warning

BioKEEN requires Python 3.6+

Easiest

Download the latest stable code from PyPI with:

$ pip install biokeen

Get the Latest

Download the most recent code from GitHub with:

$ pip install git+https://github.com/SmartDataAnalytics/BioKEEN.git

Train and Evaluate

Tutorial

Step 1: Start the CLI

biokeen start

The remaining steps are completed through interactive prompts in the terminal:

Step 2: Select the data source
Step 3: Select the database
Step 4: Specify the execution mode
Step 5: Select the KGE model
Step 6: Specify the model-dependent hyper-parameters
Step 7: Specify the batch size
Step 8: Specify the number of training epochs
Step 9: Specify whether to evaluate the model
Step 10: Provide a random seed
Step 11: Specify the preferred device
Step 12: Specify the path to the output directory


Perform Inference

Starting the Prediction Pipeline

biokeen predict -m /path/to/model/directory -d /path/to/data/directory

where the value for the argument -m is the directory containing the model; specifically, the following files must be contained in the directory:

  • configuration.json
  • entities_to_embeddings.json
  • relations_to_embeddings.json
  • trained_model.pkl

These files are created automatically when an experiment is configured through the CLI.
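
Since configuration.json is plain JSON, the stored experiment settings can be inspected directly; a minimal sketch using only the standard library:

import json

# Load the configuration that was exported when the experiment was
# configured through the CLI.
with open('/path/to/model/directory/configuration.json') as file:
    configuration = json.load(file)

print(configuration)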

The value for the argument -d is the directory containing the data on which inference should be performed, and it needs to contain the following files:

  • entities.tsv
  • relations.tsv

where entities.tsv contains all entities of interest, and relations.tsv all relations. PyKEEN creates all possible triples from these entities and relations, computes predictions for them, and saves the predictions in the data directory as predictions.tsv.
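
predictions.tsv can be loaded like any tab-separated file; a minimal sketch with pandas (whether the file carries a header row, and which columns it contains, may depend on the PyKEEN version, so check the file itself):

import pandas as pd

# Load the scored triples written by the prediction pipeline. Check
# whether your version writes a header row before relying on column
# names.
predictions = pd.read_csv(
    '/path/to/data/directory/predictions.tsv',
    sep='\t',
)
print(predictions.head(10))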

Optionally, a set of triples can be provided that should be excluded from the prediction, e.g., all the triples contained in the training set:

biokeen predict -m /path/to/model/directory -d /path/to/data/directory -t /path/to/triples.tsv

Hence, it is easily possible to compute plausibility scores for all triples that are not contained in the training set.
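
For example, the training set itself can be exported as the exclusion file; a minimal sketch, assuming the training set is a headerless, tab-separated file of (subject, predicate, object) triples:

import pandas as pd

# Load the training triples (tab-separated subject, predicate, object;
# headerless is assumed here).
training = pd.read_csv('data/corpora/compath.tsv', sep='\t', header=None)

# Write them back out as the exclusion file for the prediction pipeline.
training.to_csv('/path/to/triples.tsv', sep='\t', header=False, index=False)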

CLI Manual

Summarize all Experiments

Here, we describe how to summarize all experiments into a single CSV file. To get the summary, provide the path to the parent directory containing all the experiments as sub-directories, and the path to the output file:

biokeen summarize -d /path/to/experiments/directory -o /path/to/output/file.csv
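
The resulting CSV file can then be analyzed like any tabular data, for example with pandas (the exported columns depend on the experiments, so they are not assumed here):

import pandas as pd

# Load the experiment summary produced by `biokeen summarize`.
summary = pd.read_csv('/path/to/output/file.csv')

# Show which fields were exported and a quick look at the rows.
print(summary.columns.tolist())
print(summary.head())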

Train and Evaluate

Here, we explain how to define and run experiments programmatically. This should be done using PyKEEN.

Configure your experiment

To programmatically train (and evaluate) a KGE model, a Python dictionary must be created that specifies the experiment:

config = dict(
    training_set_path           = 'data/corpora/fb15k/compath.tsv',
    test_set_ratio              = 0.1,
    execution_mode              = 'Training_mode',
    kg_embedding_model_name     = 'TransE',
    embedding_dim               = 50,
    normalization_of_entities   = 2,  # corresponds to L2
    scoring_function            = 1,  # corresponds to L1
    margin_loss                 = 1,
    learning_rate               = 0.01,
    batch_size                  = 32,
    num_epochs                  = 1000,
    filter_negative_triples     = True,
    random_seed                 = 2,
    preferred_device            = 'cpu',
)

Run your experiment

import pykeen

output_directory = '/path/to/output/directory'

results = pykeen.run(
    config=config,
    output_directory=output_directory,
)

Access your results

Show all keys contained in results:

print('Keys:', *sorted(results.results.keys()), sep='\n  ')

Access trained KGE model

results.results['trained_model']

Access the losses

results.results['losses']
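
Since the losses are recorded over the training run, plotting them is a quick convergence check; a minimal sketch, assuming results.results['losses'] is a sequence of per-epoch loss values:

import matplotlib.pyplot as plt

losses = results.results['losses']

# Plot the training loss per epoch to inspect convergence.
plt.plot(range(len(losses)), losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training loss')
plt.show()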

Access evaluation results

results.results['eval_summary']
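
To print the individual metrics, the summary can be iterated; a minimal sketch, assuming eval_summary behaves like a dictionary mapping metric names (e.g., mean rank, hits@k) to values:

# Print each evaluation metric on its own line.
for metric, value in results.results['eval_summary'].items():
    print(f'{metric}: {value}')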

Apply a Hyper-Parameter Optimization

Here, we describe how to define an experiment that performs a hyper-parameter optimization.

Configure your experiment

To run experiments programmatically, the core software library PyKEEN should be used. To run PyKEEN in hyper-parameter optimization (HPO) mode, set execution_mode to HPO_mode. In HPO mode, several values can be provided for each hyper-parameter, from which different settings will be tested based on the hyper-parameter optimization algorithm. The possible values for a single hyper-parameter need to be provided as a list. The parameter maximum_number_of_hpo_iters defines how many HPO iterations should be performed.

config = dict(
    training_set_path           = 'data/corpora/compath.tsv',
    test_set_ratio              = 0.1,
    execution_mode              = 'HPO_mode',
    kg_embedding_model_name     = 'TransE',
    embedding_dim               = [50, 100, 150],
    normalization_of_entities   = 2,  # corresponds to L2
    scoring_function            = [1, 2],  # 1 corresponds to L1, 2 to L2
    margin_loss                 = [1, 1.5, 2],
    learning_rate               = [0.1, 0.01],
    batch_size                  = [32, 128],
    num_epochs                  = 1000,
    maximum_number_of_hpo_iters = 5,
    filter_negative_triples     = True,
    random_seed                 = 2,
    preferred_device            = 'cpu',
)

Run your experiment

The experiment is started with the run function, and the exported results are saved in the output directory.

import pykeen

output_directory = '/path/to/output/directory'

results = pykeen.run(
    config=config,
    output_directory=output_directory,
)

Access your results

Show all keys contained in results:

print('Keys:', *sorted(results.results.keys()), sep='\n  ')

Access trained KGE model

results.results['trained_model']

Access the losses

results.results['losses']

Access evaluation results

results.results['eval_summary']

Handling BEL

biokeen.convert.to_pykeen_path(df, path)

Write the relationships in the BEL graph to a KEEN TSV file.

If you have a BEL graph, first do:

>>> from biokeen.convert import to_pykeen_df, to_pykeen_path
>>> graph = ...  # Something from PyBEL
>>> df = to_pykeen_df(graph)
>>> to_pykeen_path(df, 'graph.keen.tsv')
Return type: bool

biokeen.convert.to_pykeen_df(graph, use_tqdm=True)

Get a DataFrame representing the triples.

Return type: DataFrame
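
Putting the pieces together, a BEL graph can be converted and fed into the training pipeline described above. This is a sketch only: the pybel.from_path loader and the my_knowledge.bel file are assumptions for illustration, and the loader name may differ across PyBEL versions.

import pybel

from biokeen.convert import to_pykeen_df, to_pykeen_path

# Parse a BEL script into a BEL graph. The loader function is an
# assumption; consult the PyBEL documentation for your version.
graph = pybel.from_path('my_knowledge.bel')

# Convert the BEL graph to a triples DataFrame and write it as a
# KEEN-compatible TSV file.
df = to_pykeen_df(graph)
to_pykeen_path(df, 'graph.keen.tsv')

# 'graph.keen.tsv' can now be used as the training_set_path in the
# configuration dictionaries shown in the Train and Evaluate section.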

Biological Databases

The following biological databases can be used for training and evaluating knowledge graph embeddings. They are made accessible through the Bio2BEL universe.

Each of the following sources is archived on Zenodo under a citable DOI:

  • ADEPTUS
  • ComPath
  • DrugBank
  • ExPASy
  • HIPPIE
  • HSDN
  • KEGG
  • miRTarBase
  • MSigDB
  • Reactome
  • SIDER
  • InterPro
  • WikiPathways
