Embedding and downstream tasks tutorial¶

This tutorial builds an example co-occurrence graph and guides the user through graph representation learning and its downstream tasks, including node similarity queries, node classification, edge prediction and embedding pipeline building. The source notebook can be found here.

import pandas as pd
import numpy as np

from sklearn import model_selection
from sklearn import mixture
from sklearn.svm import LinearSVC

from bluegraph.core import PandasPGFrame
from bluegraph.preprocess.generators import CooccurrenceGenerator
from bluegraph.preprocess.encoders import ScikitLearnPGEncoder

from bluegraph.core.embed.embedders import GraphElementEmbedder
from bluegraph.backends.stellargraph import StellarGraphNodeEmbedder

from bluegraph.downstream import EmbeddingPipeline, transform_to_2d, plot_2d, get_classification_scores
from bluegraph.downstream.similarity import (FaissSimilarityIndex,
                                             SimilarityProcessor,
                                             NodeSimilarityProcessor)
from bluegraph.downstream.node_classification import NodeClassifier
from bluegraph.downstream.link_prediction import (generate_negative_edges,
                                                  EdgePredictor)

Data preparation¶

First, we read the source dataset with mentions of entities in different paragraphs.

mentions = pd.read_csv("../data/labeled_entity_occurrence.csv")

# Extract unique paper/section/paragraph identifiers
mentions = mentions.rename(columns={"occurrence": "paragraph"})
number_of_paragraphs = len(mentions["paragraph"].unique())

mentions

	entity	paragraph
0	lithostathine-1-alpha	1
1	pulmonary	1
2	host	1
3	lithostathine-1-alpha	2
4	surfactant protein d measurement	2
...	...	...
2281346	covid-19	227822
2281347	covid-19	227822
2281348	viral infection	227823
2281349	lipid	227823
2281350	inflammation	227823

2281351 rows × 2 columns

We will also load a dataset that contains definitions of entities and their types

entity_data = pd.read_csv("../data/entity_types_defs.csv")

entity_data

	entity	entity_type	definition
0	(e3-independent) e2 ubiquitin-conjugating enzyme	PROTEIN	(E3-independent) E2 ubiquitin-conjugating enzy...
1	(h115d)vhl35 peptide	CHEMICAL	A peptide vaccine derived from the von Hippel-...
2	1,1-dimethylhydrazine	DRUG	A clear, colorless, flammable, hygroscopic liq...
3	1,2-dimethylhydrazine	CHEMICAL	A compound used experimentally to induce tumor...
4	1,25-dihydroxyvitamin d(3) 24-hydroxylase, mit...	PROTEIN	1,25-dihydroxyvitamin D(3) 24-hydroxylase, mit...
...	...	...	...
28127	zygomycosis	DISEASE	Any infection due to a fungus of the Zygomycot...
28128	zygomycota	ORGANISM	A phylum of fungi that are characterized by ve...
28129	zygosity	ORGANISM	The genetic condition of a zygote, especially ...
28130	zygote	CELL_COMPARTMENT	The cell formed by the union of two gametes, e...
28131	zyxin	ORGANISM	Zyxin (572 aa, ~61 kDa) is encoded by the huma...

28132 rows × 3 columns

Generation of a co-occurrence graph¶

We first create a graph whose nodes are entities

graph = PandasPGFrame()
entity_nodes = mentions["entity"].unique()
graph.add_nodes(entity_nodes)
graph.add_node_types({n: "Entity" for n in entity_nodes})

entity_props = entity_data.rename(columns={"entity": "@id"}).set_index("@id")
graph.add_node_properties(entity_props["entity_type"], prop_type="category")
graph.add_node_properties(entity_props["definition"], prop_type="text")

paragraph_prop = pd.DataFrame({"paragraphs": mentions.groupby("entity").aggregate(set)["paragraph"]})
graph.add_node_properties(paragraph_prop, prop_type="category")

graph.nodes(raw_frame=True)

	@type	entity_type	definition	paragraphs
@id
lithostathine-1-alpha	Entity	PROTEIN	Lithostathine-1-alpha (166 aa, ~19 kDa) is enc...	{1, 2, 3, 195589, 104454, 104455, 104456, 5120...
pulmonary	Entity	ORGAN	Relating to the lungs as the intended site of ...	{1, 196612, 196613, 196614, 196621, 196623, 16...
host	Entity	ORGANISM	An organism that nourishes and supports anothe...	{1, 114689, 3, 221193, 180243, 180247, 28, 180...
surfactant protein d measurement	Entity	PROTEIN	The determination of the amount of surfactant ...	{145537, 2, 3, 4, 5, 6, 51202, 103939, 103940,...
communication response	Entity	PATHWAY	A statement (either spoken or written) that is...	{46592, 64000, 2, 28162, 166912, 226304, 88585...
...	...	...	...	...
drug binding site	Entity	PATHWAY	The reactive parts of a macromolecule that dir...	{225082, 225079}
carbaril	Entity	CHEMICAL	A synthetic carbamate acetylcholinesterase inh...	{225408, 225409, 225415, 225419, 225397}
ny-eso-1 positive tumor cells present	Entity	CELL_TYPE	An indication that Cancer/Testis Antigen 1 exp...	{225544, 226996}
mustelidae	Entity	ORGANISM	Taxonomic family which includes the Ferret.	{225901, 225903}
friulian language	Entity	ORGANISM	An Indo-European Romance language spoken in th...	{225901, 225903}

17989 rows × 4 columns

For each node we will add the frequency property that counts the total number of paragraphs where the entity was mentioned.

frequencies = graph._nodes["paragraphs"].apply(len)
frequencies.name = "frequency"
graph.add_node_properties(frequencies)

graph.nodes(raw_frame=True)

	@type	entity_type	definition	paragraphs	frequency
@id
lithostathine-1-alpha	Entity	PROTEIN	Lithostathine-1-alpha (166 aa, ~19 kDa) is enc...	{1, 2, 3, 195589, 104454, 104455, 104456, 5120...	80
pulmonary	Entity	ORGAN	Relating to the lungs as the intended site of ...	{1, 196612, 196613, 196614, 196621, 196623, 16...	8295
host	Entity	ORGANISM	An organism that nourishes and supports anothe...	{1, 114689, 3, 221193, 180243, 180247, 28, 180...	2660
surfactant protein d measurement	Entity	PROTEIN	The determination of the amount of surfactant ...	{145537, 2, 3, 4, 5, 6, 51202, 103939, 103940,...	268
communication response	Entity	PATHWAY	A statement (either spoken or written) that is...	{46592, 64000, 2, 28162, 166912, 226304, 88585...	160
...	...	...	...	...	...
drug binding site	Entity	PATHWAY	The reactive parts of a macromolecule that dir...	{225082, 225079}	2
carbaril	Entity	CHEMICAL	A synthetic carbamate acetylcholinesterase inh...	{225408, 225409, 225415, 225419, 225397}	5
ny-eso-1 positive tumor cells present	Entity	CELL_TYPE	An indication that Cancer/Testis Antigen 1 exp...	{225544, 226996}	2
mustelidae	Entity	ORGANISM	Taxonomic family which includes the Ferret.	{225901, 225903}	2
friulian language	Entity	ORGANISM	An Indo-European Romance language spoken in th...	{225901, 225903}	2

17989 rows × 5 columns

Now, to construct the co-occurrence network, we will select only the 1000 most frequent entities.

nodes_to_include = graph._nodes.nlargest(1000, "frequency").index

The CooccurrenceGenerator class allows us to generate co-occurrence edges from overlaps in node property values or from edges (or edge properties). In this case we consider the paragraphs node property and construct co-occurrence edges from overlapping sets of paragraphs. In addition, we compute two co-occurrence statistics: the total co-occurrence frequency and the normalized pointwise mutual information (NPMI).

%%time
generator = CooccurrenceGenerator(graph.subgraph(nodes=nodes_to_include))
paragraph_cooccurrence_edges = generator.generate_from_nodes(
    "paragraphs", total_factor_instances=number_of_paragraphs,
    compute_statistics=["frequency", "npmi"],
    parallelize=True, cores=8)

CPU times: user 13.9 s, sys: 3.65 s, total: 17.6 s
Wall time: 1min 44s
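To make the generated statistics concrete, here is a minimal pure-Python sketch of co-occurrence edges built from overlapping paragraph sets. It uses the standard NPMI definition, pmi(a, b) / -log p(a, b); that bluegraph computes its statistics in exactly this way is an assumption, and the function name and toy data are hypothetical.

```python
import math
from itertools import combinations


def cooccurrence_edges(node_paragraphs, total_paragraphs):
    """Build co-occurrence edges from overlapping paragraph sets.

    `node_paragraphs` maps each entity to the set of paragraphs mentioning it.
    Returns a dict keyed by (source, target) with the shared paragraphs,
    their count, and the NPMI score.
    """
    edges = {}
    for (a, pars_a), (b, pars_b) in combinations(node_paragraphs.items(), 2):
        common = pars_a & pars_b
        if not common:
            continue  # no shared paragraph, no edge
        p_a = len(pars_a) / total_paragraphs
        p_b = len(pars_b) / total_paragraphs
        p_ab = len(common) / total_paragraphs
        pmi = math.log(p_ab / (p_a * p_b))
        # Dividing by -log p(a, b) maps PMI into the range [-1, 1]
        npmi = pmi / -math.log(p_ab) if p_ab < 1 else 1.0
        edges[(a, b)] = {
            "common_factors": common,
            "frequency": len(common),
            "npmi": npmi,
        }
    return edges


# Hypothetical toy data: three entities mentioned in a corpus of 10 paragraphs
toy = {
    "lung": {1, 2, 3},
    "alveolar": {2, 3, 4},
    "zyxin": {9},
}
toy_edges = cooccurrence_edges(toy, total_paragraphs=10)
```

An NPMI close to 1 means that two entities almost always appear together, while values near 0 indicate that they co-occur about as often as expected by chance.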

# Keep only the edges whose NPMI exceeds the mean NPMI over all generated edges
cutoff = paragraph_cooccurrence_edges["npmi"].mean()

paragraph_cooccurrence_edges = paragraph_cooccurrence_edges[paragraph_cooccurrence_edges["npmi"] > cutoff]

We add generated edges to the original graph

graph._edges = paragraph_cooccurrence_edges
graph.edge_prop_as_numeric("frequency")
graph.edge_prop_as_numeric("npmi")

graph.edges(raw_frame=True)

		common_factors	frequency	npmi
@source_id	@target_id
surfactant protein d measurement	microorganism	{2, 3, 7810, 17, 19, 21, 100502, 26, 41, 7850,...	19	0.235263
	lung	{2, 103939, 51202, 5, 4, 103940, 15, 145438, 3...	93	0.221395
	alveolar	{223872, 2, 51202, 100502, 7831, 149657, 19522...	25	0.336175
	epithelial cell	{2, 4, 5, 222298, 7825, 7732, 7733, 169174, 7738}	9	0.175923
	molecule	{2, 7750, 49991, 134504, 206448, 49, 52, 20645...	10	0.113611
...	...	...	...	...
sars-cov-2	cardiac valve injury	{196614, 207366, 186391, 190497, 196641, 18947...	123	0.213579
	chloroquine	{168961, 202755, 203276, 202765, 217102, 19868...	195	0.290027
	severe acute respiratory syndrome	{215556, 182277, 221190, 221191, 200710, 22119...	211	0.241288
	caax prenyl protease 2	{226304, 208386, 215559, 209415, 208397, 21556...	150	0.343314
	transmembrane protease serine 2	{192518, 200748, 200756, 204855, 188475, 19873...	380	0.420739

161332 rows × 3 columns

Recall that we have generated edges only for the 1000 most frequent entities; the rest of the entities are isolated (they have no incident edges). Let us remove all the isolated nodes, together with the paragraphs and common_factors properties.

graph.remove_node_properties("paragraphs")
graph.remove_edge_properties("common_factors")

graph.remove_isolated_nodes()

graph.number_of_nodes()

Next, we save the generated co-occurrence graph.

graph.export_json("../data/cooccurrence_graph.json")

graph = PandasPGFrame.load_json("../data/cooccurrence_graph.json")

Node feature extraction¶

We extract node features from entity definitions using the TF-IDF model.

encoder = ScikitLearnPGEncoder(
    node_properties=["definition"],
    text_encoding_max_dimension=512)

%%time
transformed_graph = encoder.fit_transform(graph)

CPU times: user 959 ms, sys: 26.4 ms, total: 986 ms
Wall time: 1.02 s
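For readers unfamiliar with TF-IDF, the following is a minimal pure-Python sketch of how such feature vectors can be computed. It uses the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the default scheme of scikit-learn's TfidfVectorizer. That the encoder above wraps this exact configuration is an assumption, and the function name, tokenization and toy definitions are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents, max_dimension=512):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep at most `max_dimension` terms, preferring the most frequent ones
    term_counts = Counter(term for tokens in tokenized for term in tokens)
    vocabulary = sorted(term for term, _ in term_counts.most_common(max_dimension))
    n = len(documents)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocabulary, vectors


# Hypothetical, heavily shortened "definitions"
toy_definitions = ["lung tissue", "lung infection", "viral infection"]
toy_vocabulary, toy_vectors = tfidf(toy_definitions)
```

Terms that occur in fewer definitions receive a larger weight, so in the first toy vector the entry for "tissue" is larger than the entry for "lung".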

We can have a glance at the vocabulary that the encoder constructed for the ‘definition’ property

vocabulary = encoder._node_encoders["definition"].model.vocabulary_
list(vocabulary.keys())[:10]

['relating',
 'lungs',
 'site',
 'administration',
 'product',
 'usually',
 'action',
 'lower',
 'respiratory',
 'tract']

We will add additional properties to our transformed graph corresponding to the entity type labels. We will also add NPMI as an edge property to this transformed graph.

transformed_graph.add_node_properties(
    graph.get_node_property_values("entity_type"))
transformed_graph.add_edge_properties(
    graph.get_edge_property_values("npmi"), prop_type="numeric")

transformed_graph.nodes(raw_frame=True)

	features	@type	entity_type
@id
pulmonary	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGAN
host	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM
surfactant protein d measurement	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	PROTEIN
microorganism	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM
lung	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGAN
...	...	...	...
candida parapsilosis	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM
ciliated bronchial epithelial cell	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	CELL_TYPE
cystic fibrosis pulmonary exacerbation	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	DISEASE
caax prenyl protease 2	[0.0, 0.0, 0.3198444339599345, 0.0, 0.0, 0.0, ...	Entity	PROTEIN
transmembrane protease serine 2	[0.0, 0.0, 0.2853086240289885, 0.0, 0.0, 0.0, ...	Entity	PROTEIN

1000 rows × 3 columns

Node embedding and downstream tasks¶

Node embedding using StellarGraph¶

Using StellarGraphNodeEmbedder we construct three different embeddings of our transformed graph corresponding to different embedding techniques.

node2vec_embedder = StellarGraphNodeEmbedder(
    "node2vec", edge_weight="npmi", embedding_dimension=64, length=10, number_of_walks=20)
node2vec_embedding = node2vec_embedder.fit_model(transformed_graph)
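The length and number_of_walks parameters control the random walks on which node2vec is trained. The sketch below is a simplified illustration, not the StellarGraph implementation: it samples edge-weighted walks without node2vec's return and in-out biases, and it assumes that length counts the nodes in a walk. The function name and the toy graph are hypothetical.

```python
import random


def weighted_random_walks(adjacency, length, number_of_walks, seed=0):
    """Sample `number_of_walks` walks of up to `length` nodes from every node.

    `adjacency` maps a node to a dict {neighbor: edge weight}; the next node
    is drawn with probability proportional to the edge weight.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(number_of_walks):
            walk = [start]
            while len(walk) < length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                targets = list(neighbors)
                weights = [neighbors[t] for t in targets]
                walk.append(rng.choices(targets, weights=weights, k=1)[0])
            walks.append(walk)
    return walks


# Hypothetical toy graph with NPMI-like edge weights
toy_adjacency = {
    "glucose": {"insulin": 0.4, "obesity": 0.2},
    "insulin": {"glucose": 0.4},
    "obesity": {"glucose": 0.2},
}
toy_walks = weighted_random_walks(toy_adjacency, length=5, number_of_walks=2)
```

The resulting walks play the role of sentences: a skip-gram model is then trained so that nodes appearing close together in walks receive similar embedding vectors.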

attri2vec_embedder = StellarGraphNodeEmbedder(
    "attri2vec", feature_vector_prop="features",
    length=5, number_of_walks=10,
    epochs=10, embedding_dimension=128, edge_weight="npmi")
attri2vec_embedding = attri2vec_embedder.fit_model(transformed_graph)

link_classification: using 'ip' method to combine node embeddings into edge embeddings

gcn_dgi_embedder = StellarGraphNodeEmbedder(
    "gcn_dgi", feature_vector_prop="features", epochs=250, embedding_dimension=512)
gcn_dgi_embedding = gcn_dgi_embedder.fit_model(transformed_graph)

Using GCN (local pooling) filters...

The fit_model method produces a dataframe of the following shape

node2vec_embedding

	embedding
pulmonary	[0.13196799159049988, -0.23611457645893097, 0....
host	[-0.6323956847190857, 0.36397579312324524, -0....
surfactant protein d measurement	[-0.5495556592941284, 0.14938104152679443, 0.0...
microorganism	[-0.4700668454170227, 0.5236756801605225, 0.14...
lung	[-0.2819957435131073, 0.08759381622076035, 0.0...
...	...
candida parapsilosis	[-0.18134233355522156, 0.14365115761756897, 0....
ciliated bronchial epithelial cell	[-0.6209977865219116, 0.2375614047050476, 0.00...
cystic fibrosis pulmonary exacerbation	[-0.1944447010755539, 0.06318975239992142, 0.1...
caax prenyl protease 2	[-0.2207261174917221, -0.071625716984272, 0.11...
transmembrane protease serine 2	[-0.40691250562667847, 0.07031852006912231, 0....

1000 rows × 1 columns

Let us add the embedding vectors obtained using different models as node properties of our graph.

transformed_graph.add_node_properties(
    node2vec_embedding.rename(columns={"embedding": "node2vec"}))

transformed_graph.add_node_properties(
    attri2vec_embedding.rename(columns={"embedding": "attri2vec"}))

transformed_graph.add_node_properties(
    gcn_dgi_embedding.rename(columns={"embedding": "gcn_dgi"}))

transformed_graph.nodes(raw_frame=True)

	features	@type	entity_type	node2vec	attri2vec	gcn_dgi
@id
pulmonary	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGAN	[0.13196799159049988, -0.23611457645893097, 0....	[0.034921467304229736, 0.016040265560150146, 0...	[0.01300269179046154, 0.0, 0.03357855603098869...
host	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM	[-0.6323956847190857, 0.36397579312324524, -0....	[0.07983770966529846, 0.02787071466445923, 0.0...	[0.0, 0.0, 0.028662730008363724, 0.00578320631...
surfactant protein d measurement	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	PROTEIN	[-0.5495556592941284, 0.14938104152679443, 0.0...	[0.026128143072128296, 0.030555397272109985, 0...	[0.0, 0.0, 0.02776358649134636, 0.005184333305...
microorganism	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM	[-0.4700668454170227, 0.5236756801605225, 0.14...	[0.2282787561416626, 0.05689656734466553, 0.07...	[0.0, 0.0, 0.04060275852680206, 0.0, 0.0, 0.05...
lung	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGAN	[-0.2819957435131073, 0.08759381622076035, 0.0...	[0.01818174123764038, 0.014254063367843628, 0....	[0.0, 0.0, 0.03078138828277588, 0.008552972227...
...	...	...	...	...	...	...
candida parapsilosis	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	ORGANISM	[-0.18134233355522156, 0.14365115761756897, 0....	[0.373728483915329, 0.05336388945579529, 0.090...	[0.0, 0.0, 0.02676139771938324, 0.0, 0.0, 0.03...
ciliated bronchial epithelial cell	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	CELL_TYPE	[-0.6209977865219116, 0.2375614047050476, 0.00...	[0.03760749101638794, 0.00703778862953186, 0.0...	[0.0, 0.0, 0.032069120556116104, 0.00537745608...
cystic fibrosis pulmonary exacerbation	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	Entity	DISEASE	[-0.1944447010755539, 0.06318975239992142, 0.1...	[0.10799965262413025, 0.07695361971855164, 0.0...	[0.0, 0.0, 0.031117763370275497, 0.0, 0.0, 0.0...
caax prenyl protease 2	[0.0, 0.0, 0.3198444339599345, 0.0, 0.0, 0.0, ...	Entity	PROTEIN	[-0.2207261174917221, -0.071625716984272, 0.11...	[0.006837755441665649, 0.01296880841255188, 0....	[0.010648305527865887, 0.0, 0.0312722884118557...
transmembrane protease serine 2	[0.0, 0.0, 0.2853086240289885, 0.0, 0.0, 0.0, ...	Entity	PROTEIN	[-0.40691250562667847, 0.07031852006912231, 0....	[0.00615808367729187, 0.02638322114944458, 0.0...	[0.0, 0.0, 0.03197368606925011, 0.010241100564...

1000 rows × 6 columns

Plotting the embeddings¶

Having produced the embedding vectors, we can project them into a 2D space using dimensionality reduction techniques such as t-SNE (t-distributed Stochastic Neighbor Embedding).

node2vec_2d = transform_to_2d(transformed_graph._nodes["node2vec"].tolist())

attri2vec_2d = transform_to_2d(transformed_graph._nodes["attri2vec"].tolist())

gcn_dgi_2d = transform_to_2d(transformed_graph._nodes["gcn_dgi"].tolist())

We can now plot these 2D vectors using the plot_2d util provided by bluegraph.

plot_2d(transformed_graph, vectors=node2vec_2d, label_prop="entity_type")

plot_2d(transformed_graph, vectors=attri2vec_2d, label_prop="entity_type")

plot_2d(transformed_graph, vectors=gcn_dgi_2d, label_prop="entity_type")

Node similarity¶

We would like to be able to search for similar nodes using the computed vector embeddings. For this we can use the NodeSimilarityProcessor interface provided as a part of bluegraph.

We construct similarity processors for different embeddings and query the 10 most similar nodes to the terms glucose and covid-19.

node2vec_l2 = NodeSimilarityProcessor(transformed_graph, "node2vec", similarity="euclidean")
node2vec_cosine = NodeSimilarityProcessor(
    transformed_graph, "node2vec", similarity="cosine")

node2vec_l2.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {0.0: 'glucose',
  0.016042586: 'diabetic nephropathy',
  0.020855632: 'nonalcoholic fatty liver disease',
  0.020919867: 'hyperglycemia',
  0.027952814: 'metabolic syndrome',
  0.04255097: 'visceral',
  0.049424335: 'obesity',
  0.05932623: 'citrate',
  0.061201043: 'tissue factor',
  0.06682069: 'liver and intrahepatic bile duct disorder'},
 'covid-19': {0.0: 'covid-19',
  0.023866901: 'fatal',
  0.049039844: 'procalcitonin measurement',
  0.05976087: 'acute respiratory distress syndrome',
  0.08363058: 'neuromuscular',
  0.08448325: 'sterile',
  0.084664375: 'hydroxychloroquine',
  0.103314176: 'tidal volume',
  0.10976424: 'caspase-5',
  0.11111233: 'status epilepticus'}}

node2vec_cosine.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {0.99999994: 'glucose',
  0.99718344: 'diabetic nephropathy',
  0.9968226: 'hyperglycemia',
  0.9958539: 'nonalcoholic fatty liver disease',
  0.9947761: 'metabolic syndrome',
  0.99151814: 'visceral',
  0.991088: 'respiration',
  0.9901221: 'obesity',
  0.9887427: 'liver and intrahepatic bile duct disorder',
  0.9885775: 'citrate'},
 'covid-19': {1.0: 'covid-19',
  0.99730766: 'fatal',
  0.9942852: 'procalcitonin measurement',
  0.9897085: 'acute respiratory distress syndrome',
  0.98890024: 'chronic obstructive pulmonary disease',
  0.9888062: 'sterile',
  0.98763454: 'neuromuscular',
  0.98537326: 'hydroxychloroquine',
  0.98534656: 'lopinavir/ritonavir',
  0.98470575: 'pulmonary'}}
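The two result sets above differ because Euclidean distance and cosine similarity rank neighbors differently: the former is sensitive to vector length, the latter only to direction. The following brute-force sketch illustrates this on toy vectors. It is not how NodeSimilarityProcessor is implemented, and the function names and data are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity: larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(embeddings, query, k, similarity="euclidean"):
    """Return the k best (score, name) pairs for `query` by exhaustive search."""
    q = embeddings[query]
    if similarity == "cosine":
        scored = [(cosine(q, v), name) for name, v in embeddings.items()]
        scored.sort(reverse=True)  # highest similarity first
    else:
        scored = [(euclidean(q, v), name) for name, v in embeddings.items()]
        scored.sort()  # smallest distance first
    return scored[:k]


# Hypothetical 2D embeddings: "b" points in the same direction as "a" but is
# twice as long; "c" is close to "a" but points in a different direction.
toy_embeddings = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.9, 0.5]}
l2_neighbors = nearest(toy_embeddings, "a", k=3, similarity="euclidean")
cosine_neighbors = nearest(toy_embeddings, "a", k=3, similarity="cosine")
```

With Euclidean distance the closest neighbor of "a" (after itself) is "c", whereas with cosine similarity it is "b".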

attri2vec_l2 = NodeSimilarityProcessor(transformed_graph, "attri2vec")
attri2vec_cosine = NodeSimilarityProcessor(
    transformed_graph, "attri2vec", similarity="cosine")

attri2vec_l2.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {0.0: 'glucose',
  0.0071316347: 'digestion',
  0.00823471: 'hepatocellular',
  0.0091231465: 'adipose tissue',
  0.010375342: 'axon',
  0.010453261: 'hemoglobin',
  0.010671802: 'bile',
  0.0106950635: 'vitamin',
  0.011250288: 'tissue',
  0.011955512: 'small intestine'},
 'covid-19': {0.0: 'covid-19',
  0.00061282323: 'chronic obstructive pulmonary disease',
  0.0009526084: 'vasculitis',
  0.0009802075: 'pulmonary edema',
  0.0010977304: 'liver failure',
  0.0011182561: 'inflammatory disorder',
  0.0011229385: 'parenteral',
  0.0012357396: 'osteoporosis',
  0.001249002: 'h1n1 influenza',
  0.0012659363: 'morphine'}}

attri2vec_cosine.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {1.0: 'glucose',
  0.9778094: 'digestion',
  0.97610795: 'degradation',
  0.97395945: 'creatine',
  0.9727266: 'hepatocellular',
  0.9708393: 'adipose tissue',
  0.9704221: 'vitamin',
  0.9702778: 'astrocyte',
  0.9700098: 'hematopoietic stem cell',
  0.9698795: 'lymph node'},
 'covid-19': {1.0: 'covid-19',
  0.97816277: 'severe acute respiratory syndrome',
  0.9777578: 'middle east respiratory syndrome',
  0.9767103: 'respiratory failure',
  0.97613215: 'childhood-onset systemic lupus erythematosus',
  0.97379327: 'h1n1 influenza',
  0.9727: 'dengue fever',
  0.9719033: 'chronic obstructive pulmonary disease',
  0.97159684: 'arthritis',
  0.9704671: 'delirium'}}

gcn_l2 = NodeSimilarityProcessor(transformed_graph, "gcn_dgi")
gcn_cosine = NodeSimilarityProcessor(
    transformed_graph, "gcn_dgi", similarity="cosine")

gcn_l2.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {0.0: 'glucose',
  0.0030039286: 'glucose tolerance test',
  0.0034940867: 'triglycerides',
  0.003617311: 'insulin',
  0.0036187829: 'high density lipoprotein',
  0.004899253: 'cholesterol',
  0.0056207227: 'organic phosphate',
  0.0057664528: 'uric acid',
  0.0058270395: 'fetus',
  0.006129055: 'diabetic nephropathy'},
 'covid-19': {0.0: 'covid-19',
  0.0009082245: 'coronavirus',
  0.002618216: 'fatal',
  0.0026699416: 'acute respiratory distress syndrome',
  0.0042233844: 'sars-cov-2',
  0.004636312: 'severe acute respiratory syndrome',
  0.004916654: 'middle east respiratory syndrome',
  0.005095474: 'myocarditis',
  0.0056914845: 'angiotensin ii receptor antagonist',
  0.0057702293: 'cardiac valve injury'}}

gcn_cosine.get_neighbors(["glucose", "covid-19"], k=10)

{'glucose': {1.0000001: 'glucose',
  0.98359084: 'triglycerides',
  0.9822164: 'cholesterol',
  0.981979: 'insulin',
  0.98167336: 'glucose tolerance test',
  0.979028: 'high density lipoprotein',
  0.9727696: 'low density lipoprotein',
  0.9723866: 'plasma',
  0.97019887: 'skeletal muscle tissue',
  0.9700538: 'atherosclerosis'},
 'covid-19': {0.99999994: 'covid-19',
  0.99609506: 'coronavirus',
  0.9897146: 'fatal',
  0.98897403: 'acute respiratory distress syndrome',
  0.98260605: 'sars-cov-2',
  0.980789: 'severe acute respiratory syndrome',
  0.9791904: 'middle east respiratory syndrome',
  0.97802055: 'myocarditis',
  0.97669864: 'angiotensin ii receptor antagonist',
  0.9753277: 'sars coronavirus'}}

Node clustering¶

We can cluster nodes according to their node embeddings. Often such clustering helps to reveal the community structure encoded in the underlying networks.

In this example we will use the BayesianGaussianMixture model provided by scikit-learn to cluster the nodes into 5 clusters according to each of the embeddings.

N = 5

X = transformed_graph.get_node_property_values("node2vec").to_list()
gmm = mixture.BayesianGaussianMixture(n_components=N, covariance_type='full').fit(X)
node2vec_clusters = gmm.predict(X)

X = transformed_graph.get_node_property_values("attri2vec").to_list()
gmm = mixture.BayesianGaussianMixture(n_components=N, covariance_type='full').fit(X)
attri2vec_clusters = gmm.predict(X)

X = transformed_graph.get_node_property_values("gcn_dgi").to_list()
gmm = mixture.BayesianGaussianMixture(n_components=N, covariance_type='full').fit(X)
gcn_dgi_clusters = gmm.predict(X)

Below we inspect the most frequent cluster members.

def show_top_members(clusters, N):
    for i in range(N):
        # Copy the slice so that adding a column does not write into a view of the node frame
        df = transformed_graph._nodes.iloc[np.where(clusters == i)].copy()
        df["frequency"] = df.index.map(lambda x: graph._nodes.loc[x, "frequency"])
        print(f"#{i}: ", ", ".join(df.nlargest(10, columns=["frequency"]).index))

show_top_members(node2vec_clusters, N)

#0:  blood, heart, pulmonary, death, renal, hypertension, cardiovascular system, septicemia, oral cavity, fever
#1:  lung, survival, cancer, organ, plasma, angiotensin-converting enzyme 2, vascular, insulin, neutrophil, antibody
#2:  bacteria, antibiotic, pneumonia, escherichia coli, staphylococcus aureus, pathogen, klebsiella pneumoniae, microorganism, mucoid pseudomonas aeruginosa, organism
#3:  human, mouse, inflammation, animal, cytokine, interleukin-6, neoplasm, dna, tissue, proliferation
#4:  covid-19, infectious disorder, diabetes mellitus, sars-cov-2, liver, virus, brain, glucose, kidney, serum

show_top_members(attri2vec_clusters, N)

#0:  antibiotic, escherichia coli, staphylococcus aureus, klebsiella pneumoniae, mucoid pseudomonas aeruginosa, vancomycin, pseudomonas aeruginosa, ciprofloxacin, community-acquired pneumonia, staphylococcus
#1:  human, renal, survival, brain, hypertension, obesity, respiratory system, oral cavity, injury, oxygen
#2:  death, person, proliferation, molecule, lower, failure, intestinal, transfer, organism, sterile
#3:  dog, cat, water, depression, horse, anxiety, nasal, subarachnoid hemorrhage, proximal, brother
#4:  covid-19, blood, infectious disorder, heart, diabetes mellitus, lung, sars-cov-2, mouse, pulmonary, bacteria

show_top_members(gcn_dgi_clusters, N)

#0:  lung, sars-cov-2, liver, survival, virus, brain, glucose, kidney, cancer, serum
#1:  covid-19, blood, heart, diabetes mellitus, pulmonary, death, renal, hypertension, cardiovascular system, dog
#2:  bacteria, antibiotic, escherichia coli, staphylococcus aureus, pathogen, klebsiella pneumoniae, microorganism, mucoid pseudomonas aeruginosa, organism, sputum
#3:  infectious disorder, respiratory system, oral cavity, pneumonia, skin, fever, cystic fibrosis, urine, human immunodeficiency virus, influenza
#4:  human, mouse, inflammation, animal, cytokine, plasma, interleukin-6, insulin, neoplasm, dna

We can also use the previously introduced plot_2d util and color our 2D node representations according to the clusters they belong to.

plot_2d(transformed_graph, vectors=node2vec_2d, labels=node2vec_clusters)

plot_2d(transformed_graph, vectors=attri2vec_2d, labels=attri2vec_clusters)

plot_2d(transformed_graph, vectors=gcn_dgi_2d, labels=gcn_dgi_clusters)

Node classification¶

Another downstream task that we would like to perform is node classification. We would like to automatically assign entity types according to their node embeddings. For this we will build predictive models for entity type prediction based on:

Only node features
Node2vec embeddings (only structure)
Attri2vec embeddings (structure and node features)
GCN Deep Graph Infomax embeddings (structure and node features)

First of all, we split the graph nodes into the train and the test sets.

train_nodes, test_nodes = model_selection.train_test_split(
    transformed_graph.nodes(), train_size=0.8)

Now we use the NodeClassifier interface to create our classification models. As the base model we will use the linear SVM classifier (LinearSVC) provided by scikit-learn.

features_classifier = NodeClassifier(LinearSVC(), feature_vector_prop="features")
features_classifier.fit(transformed_graph, train_elements=train_nodes, label_prop="entity_type")
features_pred = features_classifier.predict(transformed_graph, predict_elements=test_nodes)

node2vec_classifier = NodeClassifier(LinearSVC(), feature_vector_prop="node2vec")
node2vec_classifier.fit(transformed_graph, train_elements=train_nodes, label_prop="entity_type")
node2vec_pred = node2vec_classifier.predict(transformed_graph, predict_elements=test_nodes)

attri2vec_classifier = NodeClassifier(LinearSVC(), feature_vector_prop="attri2vec")
attri2vec_classifier.fit(transformed_graph, train_elements=train_nodes, label_prop="entity_type")
attri2vec_pred = attri2vec_classifier.predict(transformed_graph, predict_elements=test_nodes)

/Users/oshurko/opt/anaconda3/envs/bg/lib/python3.7/site-packages/sklearn/svm/_base.py:986: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)

gcn_dgi_classifier = NodeClassifier(LinearSVC(), feature_vector_prop="gcn_dgi")
gcn_dgi_classifier.fit(transformed_graph, train_elements=train_nodes, label_prop="entity_type")
gcn_dgi_pred = gcn_dgi_classifier.predict(transformed_graph, predict_elements=test_nodes)

Let us have a look at the scores of different node classification models we have produced.

true_labels = transformed_graph._nodes.loc[test_nodes, "entity_type"]

get_classification_scores(true_labels, features_pred, multiclass=True)

{'accuracy': 0.59,
 'precision': 0.59,
 'recall': 0.59,
 'f1_score': 0.59,
 'roc_auc_score': 0.7847725250984877}

get_classification_scores(true_labels, node2vec_pred, multiclass=True)

{'accuracy': 0.36,
 'precision': 0.36,
 'recall': 0.36,
 'f1_score': 0.36,
 'roc_auc_score': 0.6786980556614562}

get_classification_scores(true_labels, attri2vec_pred, multiclass=True)

{'accuracy': 0.46,
 'precision': 0.46,
 'recall': 0.46,
 'f1_score': 0.46,
 'roc_auc_score': 0.7230397763375269}

get_classification_scores(true_labels, gcn_dgi_pred, multiclass=True)

{'accuracy': 0.33,
 'precision': 0.33,
 'recall': 0.33,
 'f1_score': 0.33,
 'roc_auc_score': 0.6585176007116533}
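In each result above, accuracy, precision, recall and F1 score coincide. This is what one obtains when precision and recall are micro-averaged over classes in a single-label multiclass problem; that get_classification_scores uses micro-averaging is an assumption based on the reported numbers. The sketch below, with a hypothetical function name and toy labels, shows why the four values are then equal: every misclassified node counts once as a false positive (for the predicted class) and once as a false negative (for the true class).

```python
def micro_averaged_scores(y_true, y_pred):
    """Accuracy and micro-averaged precision, recall and F1 for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this label, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}


# Hypothetical labels: 3 of the 5 predictions are correct
toy_true = ["PROTEIN", "DRUG", "DISEASE", "DRUG", "PROTEIN"]
toy_pred = ["PROTEIN", "DISEASE", "DISEASE", "DRUG", "DRUG"]
toy_scores = micro_averaged_scores(toy_true, toy_pred)
```

Macro-averaged or per-class scores would generally differ from accuracy and can be more informative when the entity types are imbalanced.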

Link prediction¶

Finally, we would like to use the produced node embeddings to predict the existence of edges. This downstream task is formulated as follows: given a pair of nodes and their embedding vectors, is there an edge between these nodes?

As the first step of the edge prediction task we will generate false edges for training (node pairs that don’t have edges between them).

false_edges = generate_negative_edges(transformed_graph)
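Conceptually, negative edges are node pairs sampled from outside the edge set. The sketch below shows one simple way to do this by rejection sampling on an undirected graph. It is a hypothetical illustration and not the implementation of generate_negative_edges, whose sampling strategy and default number of samples may differ.

```python
import random


def sample_negative_edges(nodes, edges, n_samples, seed=0):
    """Sample up to `n_samples` distinct unordered node pairs that are not edges."""
    rng = random.Random(seed)
    # Treat the graph as undirected: (a, b) and (b, a) are the same pair
    existing = {frozenset(edge) for edge in edges}
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    n_samples = min(n_samples, max_possible)
    negatives = set()
    while len(negatives) < n_samples:
        source, target = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((source, target))
        if pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


# Hypothetical toy graph
toy_nodes = ["glucose", "insulin", "lung", "virus"]
toy_edges = [("glucose", "insulin"), ("lung", "virus")]
toy_negatives = sample_negative_edges(toy_nodes, toy_edges, n_samples=3)
```

Rejection sampling is efficient on sparse graphs; for a dense graph it is cheaper to enumerate the missing pairs directly.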

We will now split both true and false edges into training and test sets.

true_train_edges, true_test_edges = model_selection.train_test_split(
    transformed_graph.edges(), train_size=0.8)

false_train_edges, false_test_edges = model_selection.train_test_split(
    false_edges, train_size=0.8)

And, finally, we will use the EdgePredictor interface to build our model, using LinearSVC as before and the Hadamard product as the binary operator between the embedding vectors of the source and the target nodes.

model = EdgePredictor(LinearSVC(), feature_vector_prop="node2vec",
                      operator="hadamard", directed=False)
model.fit(transformed_graph, true_train_edges, negative_samples=false_train_edges)

true_labels = np.hstack([
    np.ones(len(true_test_edges)),
    np.zeros(len(false_test_edges))])

y_pred = model.predict(transformed_graph, true_test_edges + false_test_edges)
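The Hadamard operator used above combines two node embeddings into a single edge feature vector by element-wise multiplication. The sketch below illustrates the operation on toy vectors; that EdgePredictor builds its classifier inputs in exactly this way is an assumption, and the helper names and embeddings are hypothetical.

```python
def hadamard(u, v):
    """Element-wise product of two equally sized vectors."""
    return [a * b for a, b in zip(u, v)]


def edge_features(embeddings, edges):
    """Turn (source, target) pairs into one feature vector per edge."""
    return [hadamard(embeddings[s], embeddings[t]) for s, t in edges]


# Hypothetical 3D embeddings
toy_embeddings = {
    "glucose": [0.5, -1.0, 2.0],
    "insulin": [2.0, 0.5, 0.25],
}
toy_features = edge_features(
    toy_embeddings, [("glucose", "insulin"), ("insulin", "glucose")])
```

Because the product is symmetric, both orientations of an edge yield the same feature vector, which fits the undirected setting selected with directed=False.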

Let us have a look at the obtained scores.

get_classification_scores(true_labels, y_pred)

{'accuracy': 0.7333526166814736,
 'precision': 0.7333526166814736,
 'recall': 0.7333526166814736,
 'f1_score': 0.7333526166814736,
 'roc_auc_score': 0.6407728790685658}

Creating and saving embedding pipelines¶

bluegraph allows us to create embedding pipelines (using the EmbeddingPipeline class), which wrap the sequence of steps necessary to produce embeddings and compute point similarities. In the example below we create a pipeline for producing attri2vec node embeddings and computing their cosine similarity.

We first create an encoder object that will be used in our pipeline as a preprocessing step.

definition_encoder = ScikitLearnPGEncoder(
    node_properties=["definition"], text_encoding_max_dimension=512)

We then create an embedder object.

D = 128
params = {
    "length": 5,
    "number_of_walks": 10,
    "epochs": 5,
    "embedding_dimension": D
}
attri2vec_embedder = StellarGraphNodeEmbedder(
    "attri2vec", feature_vector_prop="features", edge_weight="npmi", **params)

And finally we create a pipeline object. Note that in the code below we use the SimilarityProcessor interface and not NodeSimilarityProcessor, which we used previously. We use this lower-level interface because the EmbeddingPipeline is designed to work with any embedding model (not only node embedding models).

attri2vec_pipeline = EmbeddingPipeline(
    preprocessor=definition_encoder,
    embedder=attri2vec_embedder,
    similarity_processor=SimilarityProcessor(
        FaissSimilarityIndex(
            similarity="cosine", dimension=D)))

We run the fitting process, which, given the input data:

1. fits the encoder
2. transforms the data
3. fits the embedder
4. produces the embedding table
5. fits the similarity processor index

attri2vec_pipeline.run_fitting(graph)

link_classification: using 'ip' method to combine node embeddings into edge embeddings
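
The log line above is printed by StellarGraph during training. Here 'ip' stands for inner product: the two node embeddings of an edge are combined through their dot product. A minimal sketch of that idea follows, with hypothetical two-dimensional vectors; the sigmoid that squashes the score into (0, 1) is an assumption about the training setup, and inner_product_edge_score is not a library function.

```python
import math


def inner_product_edge_score(source_vec, target_vec):
    """Combine two node embeddings with a dot product, then apply a sigmoid."""
    dot = sum(s * t for s, t in zip(source_vec, target_vec))
    return 1.0 / (1.0 + math.exp(-dot))


# Aligned vectors give a score close to 1, opposed vectors a score close to 0.
score_aligned = inner_product_edge_score([1.0, 2.0], [1.0, 2.0])
score_opposed = inner_product_edge_score([1.0, 2.0], [-1.0, -2.0])
print(score_aligned, score_opposed)
```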

Now we can save our pipeline to the file system.

attri2vec_pipeline.save(
    "../data/attri2vec_test_model",
    compress=True)

And we can load the pipeline back into memory:

pipeline = EmbeddingPipeline.load(
    "../data/attri2vec_test_model.zip",
    embedder_interface=GraphElementEmbedder,
    embedder_ext="zip")
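
Saving with compress=True produces the .zip archive that is passed to load above. How bluegraph lays out the archive is not described in this tutorial, so the sketch below only illustrates the general pattern (write files to a directory, zip it, unpack it, read the files back) using the Python standard library and a hypothetical params.json file.

```python
import json
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # "Save": write the model files into a directory (hypothetical content).
    model_dir = Path(tmp) / "toy_model"
    model_dir.mkdir()
    (model_dir / "params.json").write_text(
        json.dumps({"embedding_dimension": 128}))

    # "Compress": make_archive appends the ".zip" extension to the base name.
    archive_path = shutil.make_archive(str(model_dir), "zip", root_dir=model_dir)
    archive_name = Path(archive_path).name

    # "Load": unpack the archive and read the files back.
    restored_dir = Path(tmp) / "restored"
    shutil.unpack_archive(archive_path, restored_dir)
    restored_params = json.loads((restored_dir / "params.json").read_text())

print(archive_name, restored_params)
```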

We can use the retrieve_embeddings and get_neighbors methods of the pipeline object to get, respectively, the embedding vectors and the most similar nodes for the input nodes.

pipeline.retrieve_embeddings(["covid-19", "glucose"])

[[0.07280001044273376,
  0.08163794130086899,
  0.08893375843763351,
  0.09304069727659225,
  0.11964225769042969,
  0.08136298507452011,
  0.0790518969297409,
  0.08503866195678711,
  0.08987397700548172,
  0.13234665989875793,
  0.06845631450414658,
  0.09433518350124359,
  0.057276081293821335,
  0.08183374255895615,
  0.0636567771434784,
  0.10424472391605377,
  0.06787201017141342,
  0.08923638612031937,
  0.07220311462879181,
  0.07509997487068176,
  0.09238457679748535,
  0.06531045585870743,
  0.0759056881070137,
  0.14457547664642334,
  0.08505883812904358,
  0.06661373376846313,
  0.07629712671041489,
  0.07443031668663025,
  0.07806529849767685,
  0.08416897058486938,
  0.12059333175420761,
  0.0758424922823906,
  0.10647209733724594,
  0.07496806234121323,
  0.09789688140153885,
  0.10009769350290298,
  0.09310337901115417,
  0.08175752311944962,
  0.08274300396442413,
  0.07131325453519821,
  0.12208940088748932,
  0.06224219128489494,
  0.09508002549409866,
  0.14279678463935852,
  0.057057347148656845,
  0.0588308647274971,
  0.08901730924844742,
  0.08926397562026978,
  0.0662379041314125,
  0.09682483226060867,
  0.07646792382001877,
  0.07486658543348312,
  0.070854052901268,
  0.054801177233457565,
  0.07894912362098694,
  0.060327619314193726,
  0.10469762980937958,
  0.07393162697553635,
  0.09346463531255722,
  0.09142538905143738,
  0.08995286375284195,
  0.057934362441301346,
  0.09345584362745285,
  0.09328961372375488,
  0.07854010164737701,
  0.07263723015785217,
  0.12583819031715393,
  0.06582190096378326,
  0.07038778066635132,
  0.06997384876012802,
  0.07740046083927155,
  0.0648268535733223,
  0.0915069580078125,
  0.1107659563422203,
  0.10443656146526337,
  0.06657622754573822,
  0.09377510845661163,
  0.06837121397256851,
  0.09725506603717804,
  0.060706377029418945,
  0.1157352551817894,
  0.0791042298078537,
  0.08426657319068909,
  0.06966130435466766,
  0.07881376147270203,
  0.06591648608446121,
  0.12842406332492828,
  0.09824175387620926,
  0.07571471482515335,
  0.0666264072060585,
  0.13996072113513947,
  0.10810025036334991,
  0.08261056989431381,
  0.062233999371528625,
  0.0959680825471878,
  0.0712309181690216,
  0.09311872720718384,
  0.08855060487985611,
  0.10211314260959625,
  0.0744297131896019,
  0.13628296554088593,
  0.07632824778556824,
  0.09952477365732193,
  0.09145186096429825,
  0.05990583822131157,
  0.08039164543151855,
  0.09073426574468613,
  0.0997760146856308,
  0.07251497358083725,
  0.06577309966087341,
  0.13079826533794403,
  0.08491260558366776,
  0.06395302712917328,
  0.04059096425771713,
  0.13386057317256927,
  0.07978139072656631,
  0.11739350110292435,
  0.05938231945037842,
  0.09113242477178574,
  0.04842013493180275,
  0.05951233580708504,
  0.0531817302107811,
  0.07620435208082199,
  0.0648634135723114,
  0.07864787429571152,
  0.16829492151737213,
  0.08553200215101242,
  0.10460848361253738],
 [0.10236917436122894,
  0.09674006700515747,
  0.07649692893028259,
  0.0845288410782814,
  0.0760805606842041,
  0.09261447936296463,
  0.09488159418106079,
  0.12473700195550919,
  0.0718981921672821,
  0.1021432876586914,
  0.09268027544021606,
  0.09814798831939697,
  0.09521770477294922,
  0.10098892450332642,
  0.09244446456432343,
  0.0635334774851799,
  0.09584149718284607,
  0.08556737005710602,
  0.0852125957608223,
  0.07645734399557114,
  0.08095100522041321,
  0.09593727439641953,
  0.08347492665052414,
  0.08885250240564346,
  0.08701310306787491,
  0.09694880247116089,
  0.11121281236410141,
  0.08294625580310822,
  0.08726843446493149,
  0.0701715424656868,
  0.09523919224739075,
  0.07785829901695251,
  0.09603790938854218,
  0.0824458971619606,
  0.08737047761678696,
  0.08853974938392639,
  0.06570149958133698,
  0.10123683512210846,
  0.07348940521478653,
  0.06943066418170929,
  0.1299903839826584,
  0.08817175030708313,
  0.06109187752008438,
  0.08437755703926086,
  0.08351798355579376,
  0.08457473665475845,
  0.07322832942008972,
  0.09192510694265366,
  0.08886606246232986,
  0.07747369259595871,
  0.07242843508720398,
  0.09057212620973587,
  0.10816606134176254,
  0.09043016284704208,
  0.09076884388923645,
  0.09677130728960037,
  0.08017739653587341,
  0.10074104368686676,
  0.07700169831514359,
  0.07268036901950836,
  0.07325926423072815,
  0.07274069637060165,
  0.06991708278656006,
  0.0845450609922409,
  0.06915223598480225,
  0.0702526643872261,
  0.09593337029218674,
  0.09438585489988327,
  0.08171636611223221,
  0.07945361733436584,
  0.0642147958278656,
  0.08085450530052185,
  0.0607246495783329,
  0.08492715656757355,
  0.07719805836677551,
  0.10578399896621704,
  0.10591499507427216,
  0.09201952069997787,
  0.0818672627210617,
  0.08240731060504913,
  0.06790471076965332,
  0.07807260751724243,
  0.0730040892958641,
  0.1071859821677208,
  0.11890396475791931,
  0.056871384382247925,
  0.09596915543079376,
  0.07900075614452362,
  0.09519974142313004,
  0.10644269734621048,
  0.08464374393224716,
  0.10578206926584244,
  0.10132604092359543,
  0.07531124353408813,
  0.09358139336109161,
  0.07341431826353073,
  0.09914236515760422,
  0.07994917780160904,
  0.06680438667535782,
  0.07904554903507233,
  0.09318091720342636,
  0.08036279678344727,
  0.07590607553720474,
  0.07815994322299957,
  0.10222751647233963,
  0.11459968239068985,
  0.0987963154911995,
  0.08063937723636627,
  0.10191671550273895,
  0.11327352374792099,
  0.08440998196601868,
  0.09114128351211548,
  0.0879993736743927,
  0.0869138091802597,
  0.1110539585351944,
  0.08841552585363388,
  0.08597182482481003,
  0.09037397056818008,
  0.07773328572511673,
  0.09250291436910629,
  0.09562606364488602,
  0.07948072999715805,
  0.08507171273231506,
  0.08046958595514297,
  0.08189624547958374,
  0.07476285845041275,
  0.10559207946062088,
  0.10403718799352646]]

a = pipeline.retrieve_embeddings(["covid-19", "glucose"])

pipeline.get_neighbors(existing_points=["covid-19", "glucose"], k=5)

([array([1.0000001 , 0.98876834, 0.9861363 , 0.9855296 , 0.98494315],
        dtype=float32),
  array([1.0000001 , 0.98885393, 0.98832536, 0.9882704 , 0.9882704 ],
        dtype=float32)],
 [Index(['covid-19', 'middle east respiratory syndrome',
         'severe acute respiratory syndrome',
         'childhood-onset systemic lupus erythematosus', 'h1n1 influenza'],
        dtype='object', name='@id'),
  Index(['glucose', 'fatigue', 'anorexia', 'congenital abnormality', 'proximal'], dtype='object', name='@id')])
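
In the output above, the first neighbor of each query is the query node itself, with a cosine similarity of about 1 (the value 1.0000001 is a float32 rounding effect). A brute-force sketch of a cosine top-k search reproduces this behaviour. The vectors below are made up and the helper names are hypothetical; the pipeline itself delegates the search to its Faiss-based index rather than to a loop like this.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(embeddings, query_key, k):
    """Return the k (score, key) pairs most similar to the query point."""
    query = embeddings[query_key]
    scored = [(cosine_similarity(query, vec), key)
              for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical three-dimensional embeddings.
toy_embeddings = {
    "covid-19": [0.9, 0.1, 0.2],
    "h1n1 influenza": [0.8, 0.2, 0.2],
    "glucose": [0.1, 0.9, 0.3],
    "lipid": [0.2, 0.8, 0.4],
}
neighbors = top_k_neighbors(toy_embeddings, "covid-19", k=2)
print(neighbors)
```

To exclude the query itself from the results, one can request k + 1 neighbors and drop the first entry.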