Literature exploration: in-memory analytics tutorial

In this example we illustrate how network analytics can be used for literature exploration. The source notebook can be found here.

The input dataset contains occurrences of different terms in paragraphs of scientific articles, previously extracted by means of a Named Entity Recognition (NER) model. This dataset is transformed into two co-occurrence networks representing paper- and paragraph-level co-occurrence relations between terms. The term relations in these networks are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).

The networks are further analysed using classical tools from complex networks: we compute various centrality measures characterizing the importance of extracted terms, we detect term communities representing densely connected clusters of terms, and finally we illustrate how algorithms for finding shortest paths and minimum spanning trees can be used to perform guided search in networks.

import networkx as nx
import pandas as pd
import numpy as np
from bluegraph.core import (PandasPGFrame,
                            pretty_print_paths,
                            pretty_print_tripaths,
                            graph_elements_from_paths)
from bluegraph.preprocess.generators import CooccurrenceGenerator

from bluegraph.backends.graph_tool import (GTMetricProcessor,
                                           GTPathFinder,
                                           GTGraphProcessor,
                                           GTCommunityDetector)
from bluegraph.backends.graph_tool import graph_tool_to_pgframe

from bluegraph.backends.networkx import NXCommunityDetector, NXPathFinder

from bluegraph.backends.stellargraph import StellarGraphNodeEmbedder

Entity-occurrence property graph

In this section we will create a property graph whose nodes are papers and extracted named entities, and whose edges connect entities to the papers they occur in.

The input data is given by occurrences of different entities in specific paragraphs of scientific articles.

mentions = pd.read_csv("../data/literature_NER_example.csv")
mentions.sample(5)
entity occurrence
1510 viral 214924:The Protective Role Of Angiotensin-Conv...
720 dipeptidyl peptidase 4 184360:Cardiovascular Effects Of Sdpp4 Upregul...
1019 insulin 214924:The Interplay Between Covid-19 And Ampk...
556 diabetes mellitus 184360:Mechanisms Of Sars-Cov-2 Entry Into Hos...
540 diabetes mellitus 214924:Introduction:7

Every paragraph is identified using the format <paper_id>:<section_id>:<paragraph_id>. From this data we will extract occurrences in distinct papers/paragraphs as follows:

# Extract the paper identifier (the part before the first colon)
mentions["paper"] = mentions["occurrence"].apply(
    lambda x: x.split(":")[0])

mentions = mentions.rename(columns={"occurrence": "paragraph"})
mentions.sample(5)
entity paragraph paper
154 blood 214728:Cap Community-Acquired Pneumonia Covid-... 214728
833 glycosylated hemoglobin measurement 184360:Gliptins ::: Therapeutic Potential Of T... 184360
1506 viral 214924:The Interplay Between Covid-19 And Ampk... 214924
936 hypertension 211125:Introduction:5 211125
828 glyburide 160564:Data Extraction And Study Quality ::: M... 160564
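
The identifier format described above can also be split into all three parts. The following is a minimal sketch, assuming that the paper id never contains a colon and that the paragraph id is the last colon-separated field, so that section titles containing colons (for example ' ::: ') stay intact. The helper name parse_paragraph_id and the second identifier are hypothetical and used only for illustration.

```python
def parse_paragraph_id(identifier):
    """Split '<paper_id>:<section_id>:<paragraph_id>' into its three parts."""
    # The paper id is everything before the first colon.
    paper_id, rest = identifier.split(":", 1)
    # The paragraph id is everything after the last colon; the section
    # title in between may itself contain colons (e.g. ' ::: ').
    section_id, paragraph_id = rest.rsplit(":", 1)
    return paper_id, section_id, paragraph_id


# Identifier taken from the sample above
print(parse_paragraph_id("214924:Introduction:7"))
# Hypothetical identifier with a nested section title
print(parse_paragraph_id("184360:Section A ::: Subsection B:12"))
```

Splitting from both ends avoids relying on the number of colons inside the section title.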

We first create an empty property graph object.

graph = PandasPGFrame()

Then we add nodes for unique entities and papers.

entity_nodes = mentions["entity"].unique()
graph.add_nodes(entity_nodes)
graph.add_node_types({n: "Entity" for n in entity_nodes})

paper_nodes = mentions["paper"].unique()
graph.add_nodes(paper_nodes)
graph.add_node_types({n: "Paper" for n in paper_nodes})
graph.nodes(raw_frame=True)
@type
@id
ace inhibitor Entity
acetaminophen Entity
acute lung injury Entity
acute respiratory distress syndrome Entity
adenosine Entity
... ...
78884 Paper
35198 Paper
139943 Paper
172581 Paper
102473 Paper

177 rows × 1 columns

We now add edges from entities to the papers they occur in, storing the paragraphs as edge properties.

occurrence_edges = mentions.groupby(by=["entity", "paper"]).aggregate(set)
occurrence_edges
paragraph
entity paper
ace inhibitor 184360 {184360:Conclusion:62, 184360:Combined Therape...
197804 {197804:Caption:71, 197804:Caption:72, 197804:...
acetaminophen 179426 {179426:Blood Glucose Monitoring ::: Special A...
197804 {197804:Discussion:51, 197804:Discussion:52, 1...
acute lung injury 179426 {179426:Role Of Ace/Arbs ::: Special Aspects O...
... ... ...
virus 184360 {184360:Gliptins ::: Therapeutic Potential Of ...
197804 {197804:Discussion:44, 197804:Introduction:2}
211125 {211125:Discussion:25}
211373 {211373:Introduction:5, 211373:Introduction:6,...
214924 {214924:Angiotensin-Converting Enzyme 2 Expres...

551 rows × 1 columns

graph.add_edges(occurrence_edges.index)
graph.add_edge_types({e: "OccursIn" for e in occurrence_edges.index})
occurrence_edges.index = occurrence_edges.index.rename(["@source_id", "@target_id"])
graph.add_edge_properties(occurrence_edges["paragraph"])
graph.edges(raw_frame=True)
@type paragraph
@source_id @target_id
ace inhibitor 184360 OccursIn {184360:Conclusion:62, 184360:Combined Therape...
197804 OccursIn {197804:Caption:71, 197804:Caption:72, 197804:...
acetaminophen 179426 OccursIn {179426:Blood Glucose Monitoring ::: Special A...
197804 OccursIn {197804:Discussion:51, 197804:Discussion:52, 1...
acute lung injury 179426 OccursIn {179426:Role Of Ace/Arbs ::: Special Aspects O...
... ... ... ...
virus 184360 OccursIn {184360:Gliptins ::: Therapeutic Potential Of ...
197804 OccursIn {197804:Discussion:44, 197804:Introduction:2}
211125 OccursIn {211125:Discussion:25}
211373 OccursIn {211373:Introduction:5, 211373:Introduction:6,...
214924 OccursIn {214924:Angiotensin-Converting Enzyme 2 Expres...

551 rows × 2 columns

Entity co-occurrence graphs

We will generate co-occurrence graphs for different occurrence factors (paper/paragraph), i.e. an edge between a pair of entities is added if they co-occur in the same paper or paragraph.
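
The idea can be sketched in plain Python, independently of BlueGraph: every pair of entities is examined, and an edge is kept when the two entities share at least one factor (paper or paragraph). The mapping below is toy data with hypothetical values, and cooccurrence_edges is an illustrative helper, not part of the BlueGraph API.

```python
from itertools import combinations

# Toy entity -> papers mapping (hypothetical values, not taken from the dataset)
occurrences = {
    "glucose": {"p1", "p2", "p3"},
    "insulin": {"p2", "p3"},
    "lung": {"p4"},
}


def cooccurrence_edges(occurrences):
    """Return {(source, target): common_factors} for pairs sharing a factor."""
    edges = {}
    # Examine every unordered pair of entities exactly once
    for source, target in combinations(sorted(occurrences), 2):
        common = occurrences[source] & occurrences[target]
        if common:
            edges[(source, target)] = common
    return edges


edges = cooccurrence_edges(occurrences)
for (source, target), common in edges.items():
    # The co-occurrence frequency is the number of common factors
    print(source, target, sorted(common), len(common))
```

With n entities there are n(n-1)/2 pairs to examine; for 157 entities this gives 12246 pairs.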

NB: Read more about statistics computed during the co-occurrence analysis (positive pointwise mutual information (PPMI) and normalized pointwise mutual information (NPMI)) here.
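
As a worked illustration, the sketch below computes PPMI and NPMI from raw counts. It assumes the standard definitions with base-2 logarithms, PMI = log2(p(x, y) / (p(x) p(y))) and NPMI = PMI / (-log2 p(x, y)), and a total of 20 papers; both are assumptions of this sketch rather than statements about BlueGraph's internals. The function pmi_scores is a hypothetical helper.

```python
import math


def pmi_scores(n_xy, n_x, n_y, n_total):
    """Return (ppmi, npmi) for co-occurrence counts, using base-2 logarithms."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log2(p_xy / (p_x * p_y))
    # PPMI clips negative PMI values to zero
    ppmi = max(pmi, 0.0)
    # NPMI rescales PMI into [-1, 1]; when p_xy == 1 the ratio is 0 / 0,
    # and this sketch returns 1.0 by convention (an assumption)
    npmi = pmi / -math.log2(p_xy) if p_xy < 1 else 1.0
    return ppmi, npmi


# Two terms that each occur in 2 papers and co-occur in 1 paper out of 20
ppmi, npmi = pmi_scores(n_xy=1, n_x=2, n_y=2, n_total=20)
print(round(ppmi, 6), round(npmi, 6))
```

Under these assumptions, two terms that each occur in 2 of 20 papers and share 1 paper get a PPMI of about 2.32 and an NPMI of about 0.54.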

Paper-based co-occurrence

We first generate a co-occurrence network from edges of type OccursIn linking entities and papers.

gen = CooccurrenceGenerator(graph)
paper_cooccurrence_edges = gen.generate_from_edges(
     "OccursIn", compute_statistics=["frequency", "ppmi", "npmi"])
Examining 12246 pairs of terms for co-occurrence...
paper_cooccurrence_edges["@type"] = "CoOccursWith"
paper_cooccurrence_edges
common_factors frequency ppmi npmi @type
@source_id @target_id
ace inhibitor acetaminophen {197804} 1 2.321928 0.537244 CoOccursWith
acute lung injury {197804, 184360} 2 2.321928 0.698970 CoOccursWith
acute respiratory distress syndrome {197804, 184360} 2 1.736966 0.522879 CoOccursWith
adenosine {184360} 1 2.321928 0.537244 CoOccursWith
adipose tissue {197804, 184360} 2 2.736966 0.823909 CoOccursWith
... ... ... ... ... ... ...
viral viral infection {184360, 211125, 214924, 211373} 4 2.000000 0.861353 CoOccursWith
virus {214924, 184360, 211373, 211125, 179426} 5 1.514573 0.757287 CoOccursWith
viral entry viral infection {214924} 1 1.321928 0.305865 CoOccursWith
virus {179426, 214924} 2 1.514573 0.455932 CoOccursWith
viral infection virus {184360, 211125, 214924, 211373} 4 1.514573 0.652291 CoOccursWith

9748 rows × 5 columns

From the generated edges we remove the ones with zero NPMI scores.

paper_cooccurrence_edges = paper_cooccurrence_edges[paper_cooccurrence_edges["npmi"] != 0]
entity_nodes = graph.nodes_of_type("Entity").copy()
paper_frequency = mentions.groupby("entity").aggregate(set)["paper"].apply(len)
paper_frequency.name = "paper_frequency"

entity_nodes["paper_frequency"] = paper_frequency

We create a new property graph object from generated edges and entity nodes as follows:

paper_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paper_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })
paper_network.edges(raw_frame=True).sample(5)
common_factors frequency ppmi npmi @type
@source_id @target_id
cough pneumonia {211125, 197804, 214924, 184360} 4 1.321928 0.569323 CoOccursWith
cytokine oxygen {211125} 1 1.736966 0.401896 CoOccursWith
chronic disease vildagliptin {179426} 1 2.321928 0.537244 CoOccursWith
insulin receptor binding {197804} 1 0.514573 0.119061 CoOccursWith
hyperglycemia tumor necrosis factor {211125, 214924, 184360} 3 1.321928 0.482990 CoOccursWith
paper_network.nodes(raw_frame=True).sample(5)
@type paper_frequency
@id
myalgia Entity 2
prognosis Entity 2
apoptosis Entity 2
thrombophilia Entity 3
septicemia Entity 2

Paragraph-based co-occurrence

We perform a similar operation for paragraph-level co-occurrence. In order to use another co-occurrence factor, we define the following ‘factor_aggregator’ function (aggregate_paragraphs) that takes a collection of sets of paragraphs and merges them into a single set. This aggregator will be used to collect sets of common paragraphs of OccursIn edges pointing from a pair of entities to the same paper.

def aggregate_paragraphs(data):
    return set(sum(data["paragraph"].apply(list), []))
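
To make the merge step concrete, here is a small plain-Python sketch of the same union on toy data. The paragraph identifiers are hypothetical, and set().union(...) is shown only as an equivalent alternative that avoids repeated list concatenation; it is not taken from the original notebook.

```python
# Hypothetical paragraph sets attached to two OccursIn edges
paragraph_sets = [
    {"p1:Introduction:1", "p1:Introduction:2"},
    {"p1:Introduction:2", "p1:Discussion:9"},
]

# Merge as in the aggregator above: concatenate the lists, then deduplicate
merged_via_sum = set(sum((list(s) for s in paragraph_sets), []))

# Equivalent union that does not build intermediate concatenated lists
merged = set().union(*paragraph_sets)

print(sorted(merged))
print(merged == merged_via_sum)
```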
%%time
paragraph_cooccurrence_edges = gen.generate_from_edges(
     "OccursIn",
    factor_aggregator=aggregate_paragraphs,
    compute_statistics=["frequency", "ppmi", "npmi"],
    parallelize=True, cores=8)
Computing total factor instances...
Examining 12246 pairs of terms for co-occurrence...
CPU times: user 78.8 ms, sys: 62.1 ms, total: 141 ms
Wall time: 5.39 s
paragraph_cooccurrence_edges["@type"] = "CoOccursWith"
paragraph_cooccurrence_edges.sample(5)
common_factors frequency ppmi npmi @type
@source_id @target_id
dpp4i glycosylated hemoglobin measurement {184360:Gliptins ::: Therapeutic Potential Of ... 1 2.736966 0.336680 CoOccursWith
h1n1 virus {214924:The Interplay Between Covid-19 And Amp... 1 2.669851 0.328424 CoOccursWith
interleukin-6 middle east respiratory syndrome coronavirus {214924:Diabetes Mellitus And Covid-19: In The... 2 0.415037 0.058216 CoOccursWith
insulin resistance t-lymphocyte {184360:Cardiovascular Effects Of Sdpp4 Upregu... 1 3.000000 0.369036 CoOccursWith
dna replication middle east respiratory syndrome coronavirus {214924:The Interplay Between Covid-19 And Amp... 2 2.152003 0.301854 CoOccursWith

From the generated edges we remove the ones with zero NPMI scores.

paragraph_cooccurrence_edges = paragraph_cooccurrence_edges[paragraph_cooccurrence_edges["npmi"] != 0]
paragraph_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paragraph_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })

Faster paragraph-based co-occurrence

Alternatively, to generate a paragraph-level co-occurrence network, we can assign the sets of paragraphs where entities occur as properties of their respective nodes (as follows).

paragraph_prop = pd.DataFrame({"paragraphs": mentions.groupby("entity").aggregate(set)["paragraph"]})
graph.add_node_properties(paragraph_prop, prop_type="category")
graph.nodes(raw_frame=True).sample(5)
@type paragraphs
@id
islet of langerhans Entity {179426:Effect Of Sars Cov-2 On Blood Glucose ...
multi-organ dysfunction Entity {179426:Morbidity And Mortality In Diabetic Co...
angiotensin ii receptor antagonist Entity {214924:Angiotensin-Converting Enzyme 2 Expres...
m protein Entity {184360:Cardiovascular Effects Of Sdpp4 Upregu...
insulin infusion Entity {179426:Glycemic Control ::: Special Aspects O...

We then use the generate_from_nodes method of CooccurrenceGenerator to generate co-occurrence edges for nodes whose paragraphs properties have a non-empty intersection.

%%time
generator = CooccurrenceGenerator(graph)
paragraph_cooccurrence_edges = generator.generate_from_nodes(
    "paragraphs", total_factor_instances=len(mentions.paragraph.unique()),
    compute_statistics=["frequency", "npmi"],
    parallelize=True, cores=8)
Examining 15576 pairs of terms for co-occurrence...
CPU times: user 101 ms, sys: 69.7 ms, total: 170 ms
Wall time: 1.37 s

Additional co-occurrence measures: NPMI-based distance

For both paper- and paragraph-based networks we will compute a mutual-information-based distance as follows: \(D = \frac{1}{NPMI}\).

import math

def compute_distance(x):
    return 1 / x if x > 0 else math.inf
npmi_distance = paper_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paper_network.add_edge_properties(npmi_distance, "numeric")
paper_network.edges(raw_frame=True).sample(5)
common_factors frequency ppmi npmi @type distance_npmi
@source_id @target_id
inflammation tumor necrosis factor {211125, 214924, 184360} 3 1.736966 0.634632 CoOccursWith 1.575717
fever islet of langerhans {211125} 1 1.736966 0.401896 CoOccursWith 2.488206
acute respiratory distress syndrome immune response process {214924, 184360} 2 1.152003 0.346787 CoOccursWith 2.883610
angiotensin ii receptor antagonist influenza {179426, 214924, 184360} 3 2.736966 1.000000 CoOccursWith 1.000000
lower respiratory tract infection lymphopenia {211125, 214924} 2 2.152003 0.647817 CoOccursWith 1.543645
npmi_distance = paragraph_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paragraph_network.add_edge_properties(npmi_distance, "numeric")
paper_network.edges(raw_frame=True).sample(5)
common_factors frequency ppmi npmi @type distance_npmi
@source_id @target_id
influenza viral entry {179426, 214924} 2 2.736966 0.823909 CoOccursWith 1.213727
blood vessel viral entry {214924} 1 2.321928 0.537244 CoOccursWith 1.861353
hyperglycemia pneumonia {214924, 184360, 214728, 211125, 179426} 5 0.643856 0.321928 CoOccursWith 3.106284
death sitagliptin {179426, 184360} 2 0.567041 0.170696 CoOccursWith 5.858360
adipose tissue fatigue {214924} 1 1.736966 0.401896 CoOccursWith 2.488206

Nearest neighbors by co-occurrence scores

To illustrate the importance of computing mutual-information-based scores over raw frequencies, consider the following example, where we would like to find the closest (most related) neighbors of a specific term.

To do so, we will use the paragraph-based network and the raw co-occurrence frequency as the weight of our co-occurrence relation. The top_neighbors method of the PathFinder interface provided by BlueGraph allows us to search for the neighbors with the highest edge weight. In this example, we use the graph_tool-based GTPathFinder interface.

paragraph_path_finder = GTPathFinder(paragraph_network, directed=False)

Observe in the following cell that the path finder interface generated a backend-specific graph object.

paragraph_path_finder.graph
<Graph object, undirected, with 157 vertices and 2479 edges at 0x7fc6ae5b76d8>
paragraph_path_finder.top_neighbors("glucose", 10, weight="frequency")
{'diabetes mellitus': 29.0,
 'blood': 18.0,
 'insulin': 11.0,
 'death': 9.0,
 'hyperglycemia': 8.0,
 'coronavirus': 8.0,
 'infectious disorder': 6.0,
 'inflammation': 6.0,
 'sars coronavirus': 6.0,
 'interleukin-6': 5.0}
paragraph_path_finder.top_neighbors("lung", 10, weight="frequency")
{'covid-19': 24.0,
 'angiotensin-converting enzyme 2': 16.0,
 'sars-cov-2': 13.0,
 'acute lung injury': 12.0,
 'pulmonary': 10.0,
 'sars coronavirus': 10.0,
 'viral': 9.0,
 'human': 9.0,
 'mouse': 7.0,
 'inflammation': 6.0}

We observe that ‘glucose’ and ‘lung’ share several of their closest neighbors by raw frequency. If we look at the list of the top 10 entities by paper frequency in the entire corpus (computed below), we notice that ‘glucose’ and ‘lung’ co-occur the most with terms that are simply the most frequent in our corpus, such as ‘covid-19’ and ‘diabetes mellitus’.

(Closer inspection of the distribution of weighted node degrees suggests that the network contains hubs, i.e. nodes with a significantly higher degree of connectivity than other nodes.)

paragraph_network._nodes
@type paper_frequency
@id
ace inhibitor Entity 2
acetaminophen Entity 2
acute lung injury Entity 4
acute respiratory distress syndrome Entity 6
adenosine Entity 2
... ... ...
vildagliptin Entity 2
viral Entity 5
viral entry Entity 2
viral infection Entity 4
virus Entity 7

157 rows × 2 columns

paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["paper_frequency"])
@type paper_frequency
@id
covid-19 Entity 20
diabetes mellitus Entity 19
coronavirus Entity 16
glucose Entity 13
death Entity 9
glyburide Entity 9
infectious disorder Entity 9
blood Entity 8
hyperglycemia Entity 8
pneumonia Entity 8

To account for the presence of such hubs, we use the mutual-information-based scores presented above. They ‘balance’ the influence of the highly connected hub nodes such as ‘covid-19’ and ‘diabetes mellitus’ in our example.

paragraph_path_finder.top_neighbors("glucose", 10, weight="npmi")
{'blood': 0.5133209650995287,
 'glucose metabolism disorder': 0.43558951200762297,
 'insulin': 0.41957609533629175,
 'thrombophilia': 0.4079646453270325,
 'insulin infusion': 0.3744908698338857,
 'leukopenia': 0.3744908698338857,
 'millimole per liter': 0.3744908698338857,
 'troponin t, cardiac muscle': 0.3744908698338857,
 'bals r': 0.3744908698338857,
 'hyperglycemia': 0.3255525953220345}
paragraph_path_finder.top_neighbors("lung", 10, weight="npmi")
{'acute lung injury': 0.731006557092012,
 'pulmonary': 0.6362945400636919,
 'angiotensin-converting enzyme 2': 0.4757184436640079,
 'receptor binding': 0.46595542454855043,
 'viral infection': 0.4465392749200213,
 'animal': 0.4465392749200213,
 'mouse': 0.4362945258726772,
 'angiotensin-1': 0.4259996541516483,
 'viral': 0.4233482775367079,
 'human': 0.4098154380746763}
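
The effect of the choice of weight can be reproduced in miniature with plain Python. The sketch below is not BlueGraph's implementation of top_neighbors; it ranks the neighbors of a single term by a chosen edge property, and all names and values in it are hypothetical.

```python
import heapq

# Hypothetical edge properties around one term; values are illustrative only
neighbors = {
    "hub term": {"frequency": 30, "npmi": 0.10},
    "specific term": {"frequency": 5, "npmi": 0.60},
    "related term": {"frequency": 12, "npmi": 0.35},
}


def top_neighbors(neighbors, n, weight):
    """Return the n neighbors with the largest value of the given weight."""
    best = heapq.nlargest(
        n, neighbors.items(), key=lambda item: item[1][weight])
    return {name: props[weight] for name, props in best}


# Raw frequency favours the hub; NPMI favours the specific term
print(top_neighbors(neighbors, 2, weight="frequency"))
print(top_neighbors(neighbors, 2, weight="npmi"))
```

In this toy setting the frequent but unspecific ‘hub term’ leads the frequency ranking and drops out of the NPMI ranking, which mirrors the behaviour observed for ‘covid-19’ and ‘diabetes mellitus’.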

Graph metrics and centrality measures

BlueGraph provides the MetricProcessor interface for computing various graph statistics. As in the previous example, we will use the graph_tool-based GTMetricProcessor interface.

paper_metrics = GTMetricProcessor(paper_network, directed=False)
paragraph_metrics = GTMetricProcessor(paragraph_network, directed=False)

Graph density

Density of a graph is quantified by the proportion of all possible edges (\(n(n-1) / 2\) for the undirected graph with \(n\) nodes) that are realized.

print("Density of the paper-based network: ", paper_metrics.density())
print("Density of the paragraph-based network: ", paragraph_metrics.density())
Density of the paper-based network:  0.7769884043769394
Density of the paragraph-based network:  0.20243344765637758

The results above show that in the paper- and paragraph-based networks, respectively, about 78% and 20% of all possible term pairs co-occur at least once.
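
The density formula can be checked by hand. The sketch below applies it to the size reported earlier for the paragraph-based graph_tool graph (157 vertices and 2479 edges); the helper name density is hypothetical and the computation is independent of BlueGraph.

```python
def density(n_nodes, n_edges):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges


# 157 vertices and 2479 edges: the reported size of the paragraph-based graph
paragraph_density = density(157, 2479)
print(paragraph_density)
```

The result, roughly 0.2024, agrees with the density printed for the paragraph-based network.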

Node centrality (importance) measures

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the MetricProcessor interface in the write mode, i.e. computed metrics will be written as node properties of the underlying graph object.

Degree centrality is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).

paragraph_metrics.degree_centrality("frequency", write=True, write_property="degree")
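
A minimal plain-Python sketch of the weighted degree described above: each undirected edge contributes its weight to both of its endpoints. The edge list is toy data with hypothetical term names and weights, and weighted_degree is an illustrative helper rather than the graph_tool routine used by GTMetricProcessor.

```python
from collections import defaultdict

# Hypothetical undirected edges: (source, target, co-occurrence frequency)
toy_edges = [
    ("term_a", "term_b", 24),
    ("term_a", "term_c", 4),
    ("term_c", "term_d", 18),
]


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = defaultdict(float)
    for source, target, weight in edges:
        # An undirected edge counts towards both endpoints
        degree[source] += weight
        degree[target] += weight
    return dict(degree)


degree = weighted_degree(toy_edges)
print(degree)
```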

PageRank centrality is another measure that estimates the importance of a given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account).

https://en.wikipedia.org/wiki/PageRank

paragraph_metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")
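
For intuition, weighted PageRank can be sketched as a power iteration in plain Python. This is a simplified illustration, not the graph_tool implementation: it assumes an undirected graph in which every node has at least one edge, a damping factor of 0.85, and a fixed number of iterations. The toy edges and the helper weighted_pagerank are hypothetical.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, weighted edge list."""
    # Build a symmetric adjacency map: node -> {neighbor: weight}
    adjacency = {}
    for source, target, weight in edges:
        adjacency.setdefault(source, {})[target] = weight
        adjacency.setdefault(target, {})[source] = weight
    nodes = sorted(adjacency)
    n = len(nodes)
    # Total edge weight leaving each node, used to normalize transitions
    strength = {node: sum(adjacency[node].values()) for node in nodes}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on a share of its rank proportional
            # to the weight of the connecting edge
            incoming = sum(
                rank[neighbor] * weight / strength[neighbor]
                for neighbor, weight in adjacency[node].items())
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical co-occurrence frequencies around a hub node
toy_edges = [("hub", "a", 5), ("hub", "b", 3), ("hub", "c", 1), ("a", "b", 1)]
pagerank = weighted_pagerank(toy_edges)
for node, score in sorted(pagerank.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

The scores sum to one, and the node with the largest total edge weight receives the highest score in this toy graph.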

We then compute the betweenness centrality based on the NPMI distances.

Betweenness centrality is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.

paragraph_metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")
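
Betweenness relies on shortest paths, and with the distance 1 / NPMI a path through strongly associated terms can be shorter than a direct but weak link. The sketch below illustrates this with Dijkstra's algorithm on toy data; the NPMI scores, node names, and the helper shortest_path are hypothetical, and the code is independent of the graph_tool backend.

```python
import heapq

# Hypothetical NPMI scores on undirected edges
npmi_scores = {("a", "b"): 0.8, ("b", "c"): 0.8, ("a", "c"): 0.2, ("c", "d"): 0.5}

# Build an adjacency map weighted by the distance 1 / NPMI
graph = {}
for (u, v), score in npmi_scores.items():
    graph.setdefault(u, {})[v] = 1 / score
    graph.setdefault(v, {})[u] = 1 / score


def shortest_path(graph, source, target):
    """Dijkstra's algorithm returning (distance, path) between two nodes."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (dist + weight, neighbor, path + [neighbor]))
    # The target is not reachable from the source
    return float("inf"), []


dist, path = shortest_path(graph, "a", "c")
print(dist, path)
```

Here the direct edge between ‘a’ and ‘c’ has distance 5.0, while the detour through ‘b’ has distance 2.5, so ‘b’ lies on the shortest path; betweenness centrality counts how often a node plays this intermediate role.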

We can inspect the underlying graph object and observe the newly added properties:

paragraph_metrics.graph.vp.keys()
['@id', '@type', 'paper_frequency', 'degree', 'pagerank', 'betweenness']

Now, we will export this backend-specific graph object into a PGFrame.

new_paragraph_network = paragraph_metrics.get_pgframe()
new_paragraph_network.nodes(raw_frame=True).sample(5)
@type paper_frequency degree pagerank betweenness
@id
iv Entity 2.0 4.0 0.001286 0.000000
chemokine Entity 3.0 43.0 0.004657 0.026730
death Entity 9.0 194.0 0.015881 0.004632
lymphopenia Entity 3.0 55.0 0.005415 0.038916
bradykinin Entity 2.0 18.0 0.002771 0.004274
print("Top 10 nodes by degree")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)
Top 10 nodes by degree
     covid-19
     diabetes mellitus
     sars-cov-2
     angiotensin-converting enzyme 2
     lung
     coronavirus
     dipeptidyl peptidase 4
     glucose
     sars coronavirus
     interleukin-6
print("Top 10 nodes by PageRank")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)
Top 10 nodes by PageRank
     covid-19
     diabetes mellitus
     sars-cov-2
     angiotensin-converting enzyme 2
     lung
     dipeptidyl peptidase 4
     glucose
     coronavirus
     sars coronavirus
     interleukin-6
print("Top 10 nodes by betweenness")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)
Top 10 nodes by betweenness
     lymphopenia
     pulmonary
     glucose metabolism disorder
     t-lymphocyte
     cough
     chemokine
     d-dimer measurement
     kidney
     interleukin-19
     ibuprofen

Compute multiple metrics in one go

Alternatively, we can compute all the metrics in one go. To do so, we need to specify edge attributes used for computing different metrics (if an empty list is specified as a weight list for a metric, computation of this metric is not performed).

We select the paragraph-based network and re-compute some of the previously illustrated metrics as follows:

result_metrics = paragraph_metrics.compute_all_node_metrics(
    degree_weights=["frequency"],
    pagerank_weights=["frequency"],
    betweenness_weights=["distance_npmi"])
result_metrics
{'degree': {'frequency': {'ace inhibitor': 39.0,
   'acetaminophen': 5.0,
   'acute lung injury': 128.0,
   'acute respiratory distress syndrome': 139.0,
   'adenosine': 22.0,
   'adipose tissue': 27.0,
   'angioedema': 20.0,
   'angiotensin ii receptor antagonist': 85.0,
   'angiotensin-1': 70.0,
   'angiotensin-2': 18.0,
   'angiotensin-converting enzyme': 13.0,
   'angiotensin-converting enzyme 2': 329.0,
   'animal': 47.0,
   'apoptosis': 18.0,
   'bals r': 6.0,
   'basal': 34.0,
   'blood': 115.0,
   'blood vessel': 28.0,
   'bradykinin': 18.0,
   'c-c motif chemokine 1': 47.0,
   'c-reactive protein': 31.0,
   'cardiac failure': 17.0,
   'cardiovascular disorder': 107.0,
   'cardiovascular system': 115.0,
   'cd44 antigen': 46.0,
   'cellular secretion': 47.0,
   'cerebrovascular': 18.0,
   'chemokine': 43.0,
   'chest pain': 21.0,
   'chloroquine': 7.0,
   'chronic disease': 12.0,
   'chronic kidney disease': 18.0,
   'comorbidity': 19.0,
   'confounding factors': 26.0,
   'coronaviridae': 54.0,
   'coronavirus': 231.0,
   'cough': 51.0,
   'covid-19': 701.0,
   'cytokine': 141.0,
   'd-dimer measurement': 87.0,
   'death': 194.0,
   'degradation': 24.0,
   'diabetes mellitus': 489.0,
   'diabetic ketoacidosis': 46.0,
   'diarrhea, ctcae': 35.0,
   'dipeptidyl peptidase 4': 217.0,
   'dna replication': 67.0,
   'dpp4i': 92.0,
   'dyspnea': 23.0,
   'extracellular matrix': 14.0,
   'fatigue': 23.0,
   'fever': 40.0,
   'glucose': 215.0,
   'glucose metabolism disorder': 56.0,
   'glyburide': 151.0,
   'glycosylated hemoglobin measurement': 27.0,
   'growth factor': 22.0,
   'h1n1': 16.0,
   'hcp': 25.0,
   'headache': 24.0,
   'heart': 107.0,
   'heart failure': 17.0,
   'high sensitivity c-reactive protein measurement': 52.0,
   'hiv entry inhibitor': 59.0,
   'hmg-coa reductase inhibitor': 18.0,
   'host cell': 57.0,
   'human': 169.0,
   'human dpp4': 22.0,
   'human immunodeficiency virus': 39.0,
   'human kidney organoids': 17.0,
   'humoral immunity': 19.0,
   'hyperglycemia': 110.0,
   'hypertension': 159.0,
   'hypoxia': 15.0,
   'ibuprofen': 23.0,
   'immune cell': 27.0,
   'immune response process': 34.0,
   'infectious disorder': 194.0,
   'inflammation': 154.0,
   'influenza': 27.0,
   'insulin': 100.0,
   'insulin infusion': 13.0,
   'insulin resistance': 65.0,
   'interleukin': 18.0,
   'interleukin 1 beta measurement': 78.0,
   'interleukin-19': 56.0,
   'interleukin-6': 209.0,
   'interleukin-8': 35.0,
   'islet of langerhans': 18.0,
   'iv': 4.0,
   'janus bifrons': 4.0,
   'kidney': 89.0,
   'leucopenia': 18.0,
   'leukopenia': 26.0,
   'liver': 42.0,
   'lower respiratory tract infection': 43.0,
   'lung': 259.0,
   'lymphocyte': 50.0,
   'lymphopenia': 55.0,
   'm protein': 11.0,
   'macrophage': 40.0,
   'mellitus': 2.0,
   'metformin': 39.0,
   'middle east respiratory syndrome': 67.0,
   'middle east respiratory syndrome coronavirus': 171.0,
   'millimole per liter': 16.0,
   'molecule': 20.0,
   'mouse': 140.0,
   'multi-organ dysfunction': 23.0,
   'muscle': 13.0,
   'myalgia': 9.0,
   'myocardium': 17.0,
   'neoplasm': 26.0,
   'nephropathy': 12.0,
   'neutrophil': 66.0,
   'obesity': 92.0,
   'oral cavity': 57.0,
   'organ': 24.0,
   'oxygen': 13.0,
   'person': 48.0,
   'plasma': 57.0,
   'plasmid': 21.0,
   'pneumonia': 116.0,
   'prognosis': 4.0,
   'proliferation': 75.0,
   'pulmonary': 137.0,
   'rbd': 20.0,
   'receptor binding': 12.0,
   'renal': 35.0,
   'respiratory failure': 23.0,
   'respiratory system': 39.0,
   'sars coronavirus': 213.0,
   'sars-cov-2': 406.0,
   'saxagliptin': 48.0,
   'septicemia': 6.0,
   'serum': 29.0,
   'serum ferritin': 59.0,
   'severe acute respiratory syndrome': 30.0,
   'shortness of breath visual analogue scale': 18.0,
   'sitagliptin': 103.0,
   'sulfonylurea antidiabetic agent': 27.0,
   'survival': 34.0,
   't-lymphocyte': 56.0,
   'therapeutic corticosteroid': 28.0,
   'thrombophilia': 39.0,
   'tissue': 32.0,
   'transmembrane protein': 22.0,
   'troponin t, cardiac muscle': 26.0,
   'tumor necrosis factor': 56.0,
   'tzd': 26.0,
   'vaccine': 33.0,
   'vascular': 110.0,
   'vildagliptin': 55.0,
   'viral': 203.0,
   'viral entry': 58.0,
   'viral infection': 54.0,
   'virus': 199.0}},
 'pagerank': {'frequency': {'ace inhibitor': 0.004042530805234559,
   'acetaminophen': 0.001438112847217684,
   'acute lung injury': 0.010586985714212894,
   'acute respiratory distress syndrome': 0.011867704712126654,
   'adenosine': 0.0026917760651910213,
   'adipose tissue': 0.003050180074112491,
   'angioedema': 0.002756829710592291,
   'angiotensin ii receptor antagonist': 0.00746809358002035,
   'angiotensin-1': 0.006245821418420198,
   'angiotensin-2': 0.002470853065990793,
   'angiotensin-converting enzyme': 0.0021517440285987646,
   'angiotensin-converting enzyme 2': 0.026787609075385067,
   'animal': 0.004592419992089224,
   'apoptosis': 0.0023524301833432923,
   'bals r': 0.001397199827976852,
   'basal': 0.0034764575410275544,
   'blood': 0.010853621341137631,
   'blood vessel': 0.0031260667851256397,
   'bradykinin': 0.0027708076837687666,
   'c-c motif chemokine 1': 0.004515416323614919,
   'c-reactive protein': 0.0034587853616818063,
   'cardiac failure': 0.002350378641575196,
   'cardiovascular disorder': 0.00919888846066653,
   'cardiovascular system': 0.009866588757093796,
   'cd44 antigen': 0.004534812732391918,
   'cellular secretion': 0.004739588233023713,
   'cerebrovascular': 0.002327955010459226,
   'chemokine': 0.004657039692111883,
   'chest pain': 0.00285352889029586,
   'chloroquine': 0.0017516584191543307,
   'chronic disease': 0.001828074100556019,
   'chronic kidney disease': 0.0024017709671377346,
   'comorbidity': 0.0022953068575465477,
   'confounding factors': 0.0030216706672650372,
   'coronaviridae': 0.0051371282608085565,
   'coronavirus': 0.018866668159786992,
   'cough': 0.005549716153725856,
   'covid-19': 0.05433610118159648,
   'cytokine': 0.011999810473555757,
   'd-dimer measurement': 0.007904179723956701,
   'death': 0.01588115762541802,
   'degradation': 0.003379550619191846,
   'diabetes mellitus': 0.03872932193775662,
   'diabetic ketoacidosis': 0.004489379965610478,
   'diarrhea, ctcae': 0.00437488923081928,
   'dipeptidyl peptidase 4': 0.019292255378860798,
   'dna replication': 0.006114767662303313,
   'dpp4i': 0.008544463701272936,
   'dyspnea': 0.0030876590553554186,
   'extracellular matrix': 0.0023364487459284124,
   'fatigue': 0.0030876590553554186,
   'fever': 0.004607872966016623,
   'glucose': 0.01891629565971577,
   'glucose metabolism disorder': 0.005398096884190623,
   'glyburide': 0.013288960222700855,
   'glycosylated hemoglobin measurement': 0.00316352693129643,
   'growth factor': 0.0026633183979991887,
   'h1n1': 0.002264938539929582,
   'hcp': 0.0027364236231450087,
   'headache': 0.0033578693582865097,
   'heart': 0.009421332811370097,
   'heart failure': 0.0023503786415751955,
   'high sensitivity c-reactive protein measurement': 0.005099882641154376,
   'hiv entry inhibitor': 0.005365135800387551,
   'hmg-coa reductase inhibitor': 0.002289513092451817,
   'host cell': 0.005223029773412262,
   'human': 0.014092801528073864,
   'human dpp4': 0.002556211710519165,
   'human immunodeficiency virus': 0.0037946404784082693,
   'human kidney organoids': 0.0022050307810595787,
   'humoral immunity': 0.002447806997101164,
   'hyperglycemia': 0.009797828654832542,
   'hypertension': 0.013430293464003852,
   'hypoxia': 0.0021438186717728353,
   'ibuprofen': 0.0033415396684936464,
   'immune cell': 0.0034992608747177354,
   'immune response process': 0.0035897227204900926,
   'infectious disorder': 0.01591287393608746,
   'inflammation': 0.013196747867092438,
   'influenza': 0.003084152937829748,
   'insulin': 0.00949502983710845,
   'insulin infusion': 0.001944578655114113,
   'insulin resistance': 0.005983411850864118,
   'interleukin': 0.0024108668083832564,
   'interleukin 1 beta measurement': 0.0070970118624575345,
   'interleukin-19': 0.005432616905573494,
   'interleukin-6': 0.017315393375679777,
   'interleukin-8': 0.00383178730610752,
   'islet of langerhans': 0.002251140818756654,
   'iv': 0.0012861290938934351,
   'janus bifrons': 0.0012449367295877894,
   'kidney': 0.008050845332539039,
   'leucopenia': 0.0027088842460303865,
   'leukopenia': 0.002932949994588701,
   'liver': 0.004381443381638265,
   'lower respiratory tract infection': 0.004531576500548021,
   'lung': 0.020767410879585207,
   'lymphocyte': 0.00487033347133493,
   'lymphopenia': 0.005415097958458006,
   'm protein': 0.0018646822018238388,
   'macrophage': 0.004076946700022219,
   'mellitus': 0.0010900555777919516,
   'metformin': 0.004052023030357942,
   'middle east respiratory syndrome': 0.006036974208200138,
   'middle east respiratory syndrome coronavirus': 0.013925679997029811,
   'millimole per liter': 0.0021373699233968708,
   'molecule': 0.002838505562387744,
   'mouse': 0.011677381179374615,
   'multi-organ dysfunction': 0.0029372153143388878,
   'muscle': 0.002122672174800488,
   'myalgia': 0.001977521539326412,
   'myocardium': 0.0022345724148010787,
   'neoplasm': 0.003037911877765503,
   'nephropathy': 0.0019676543688425746,
   'neutrophil': 0.006043543871225477,
   'obesity': 0.008267711638693713,
   'oral cavity': 0.0055659857338091045,
   'organ': 0.002973741879099519,
   'oxygen': 0.0020351103352963485,
   'person': 0.004525680283686401,
   'plasma': 0.0053197061181747204,
   'plasmid': 0.0027226509338995376,
   'pneumonia': 0.010512095493427492,
   'prognosis': 0.0012386054122404233,
   'proliferation': 0.006890696441541255,
   'pulmonary': 0.01163903389278049,
   'rbd': 0.0025247523434533924,
   'receptor binding': 0.0018464368550616746,
   'renal': 0.0038736869773938884,
   'respiratory failure': 0.002986319777792431,
   'respiratory system': 0.004329495522768546,
   'sars coronavirus': 0.01735548019022084,
   'sars-cov-2': 0.033049550473799,
   'saxagliptin': 0.004721702323809725,
   'septicemia': 0.0014192915757123858,
   'serum': 0.0031395422204015004,
   'serum ferritin': 0.005597172953134468,
   'severe acute respiratory syndrome': 0.003097702126449046,
   'shortness of breath visual analogue scale': 0.0026129361572218095,
   'sitagliptin': 0.009855673950393119,
   'sulfonylurea antidiabetic agent': 0.003096324833939332,
   'survival': 0.0035787752943805084,
   't-lymphocyte': 0.005867754495528255,
   'therapeutic corticosteroid': 0.0031845265341031753,
   'thrombophilia': 0.004018275751713626,
   'tissue': 0.0035433048999557516,
   'transmembrane protein': 0.0027410790983396745,
   'troponin t, cardiac muscle': 0.0029329499945887007,
   'tumor necrosis factor': 0.005503860224580636,
   'tzd': 0.0030216706672650372,
   'vaccine': 0.003480062769988538,
   'vascular': 0.009565371755902547,
   'vildagliptin': 0.005452913076992331,
   'viral': 0.016857758695857906,
   'viral entry': 0.005437985316728421,
   'viral infection': 0.005080823684964583,
   'virus': 0.01628447621361242}},
 'betweenness': {'distance_npmi': {'ace inhibitor': 0.00020678246484698098,
   'acetaminophen': 0.0,
   'acute lung injury': 0.015260545905707195,
   'acute respiratory distress syndrome': 0.014640198511166254,
   'adenosine': 0.009376895505927763,
   'adipose tissue': 0.006782464846980976,
   'angioedema': 0.006685966363385718,
   'angiotensin ii receptor antagonist': 0.009015715467328371,
   'angiotensin-1': 0.003143093465674111,
   'angiotensin-2': 0.014006065618968845,
   'angiotensin-converting enzyme': 0.0045905707196029774,
   'angiotensin-converting enzyme 2': 0.0030603804797353184,
   'animal': 0.006617038875103391,
   'apoptosis': 0.0027564102564102567,
   'bals r': 0.0,
   'basal': 0.0011993382961124897,
   'blood': 0.01033912324234905,
   'blood vessel': 0.005872622001654259,
   'bradykinin': 0.004273504273504273,
   'c-c motif chemokine 1': 0.0071960297766749375,
   'c-reactive protein': 0.0070168183071408876,
   'cardiac failure': 0.0015536255858836503,
   'cardiovascular disorder': 0.014061207609594707,
   'cardiovascular system': 0.004962779156327543,
   'cd44 antigen': 0.013234077750206782,
   'cellular secretion': 0.01368899917287014,
   'cerebrovascular': 0.0012682657843948167,
   'chemokine': 0.026730079955886405,
   'chest pain': 0.0059449958643507045,
   'chloroquine': 0.0019713261648745518,
   'chronic disease': 0.0009098428453267163,
   'chronic kidney disease': 0.011248966087675765,
   'comorbidity': 8.271298593879239e-05,
   'confounding factors': 0.007154673283705543,
   'coronaviridae': 0.010173697270471464,
   'coronavirus': 0.006947890818858561,
   'cough': 0.02803970223325062,
   'covid-19': 0.0,
   'cytokine': 0.004425144747725393,
   'd-dimer measurement': 0.023738626964433417,
   'death': 0.004631927212572374,
   'degradation': 0.01578439481665288,
   'diabetes mellitus': 0.01282051282051282,
   'diabetic ketoacidosis': 0.003970223325062035,
   'diarrhea, ctcae': 0.005707196029776675,
   'dipeptidyl peptidase 4': 0.01108354011579818,
   'dna replication': 0.010835401157981803,
   'dpp4i': 0.003515301902398677,
   'dyspnea': 0.002257375241246211,
   'extracellular matrix': 0.007740556934105321,
   'fatigue': 0.002257375241246211,
   'fever': 0.004962779156327543,
   'glucose': 0.007444168734491315,
   'glucose metabolism disorder': 0.03432588916459884,
   'glyburide': 0.0032258064516129032,
   'glycosylated hemoglobin measurement': 0.01621174524400331,
   'growth factor': 0.004921422663358147,
   'h1n1': 0.007816377171215881,
   'hcp': 0.0008271298593879239,
   'headache': 0.0162979046043562,
   'heart': 0.016604631927212572,
   'heart failure': 0.0015536255858836503,
   'high sensitivity c-reactive protein measurement': 0.013289219740832645,
   'hiv entry inhibitor': 0.010794044665012407,
   'hmg-coa reductase inhibitor': 0.002267714364488558,
   'host cell': 0.002522746071133168,
   'human': 0.004549214226633581,
   'human dpp4': 0.004466501240694789,
   'human immunodeficiency virus': 0.004880066170388751,
   'human kidney organoids': 0.0007857733664185277,
   'humoral immunity': 0.00722360077198787,
   'hyperglycemia': 0.01588089330024814,
   'hypertension': 0.004549214226633581,
   'hypoxia': 0.0030603804797353184,
   'ibuprofen': 0.020168183071408875,
   'immune cell': 0.01000827129859388,
   'immune response process': 0.0013234077750206782,
   'infectious disorder': 0.0012406947890818859,
   'inflammation': 0.0037220843672456576,
   'influenza': 0.006699751861042184,
   'insulin': 0.007113316790736146,
   'insulin infusion': 0.004328646264130135,
   'insulin resistance': 0.014185277088502896,
   'interleukin': 0.002522746071133168,
   'interleukin 1 beta measurement': 0.007899090157154674,
   'interleukin-19': 0.022125723738626965,
   'interleukin-6': 0.013316790736145575,
   'interleukin-8': 0.005872622001654259,
   'islet of langerhans': 0.004797353184449959,
   'iv': 0.0,
   'janus bifrons': 0.0,
   'kidney': 0.022539288668320927,
   'leucopenia': 0.01262062310449407,
   'leukopenia': 0.0029914529914529912,
   'liver': 0.015232974910394265,
   'lower respiratory tract infection': 0.014846980976013235,
   'lung': 0.004466501240694789,
   'lymphocyte': 0.007954232147780536,
   'lymphopenia': 0.038916459884201816,
   'm protein': 0.0008684863523573201,
   'macrophage': 0.008395368072787427,
   'mellitus': 0.0,
   'metformin': 0.004425144747725393,
   'middle east respiratory syndrome': 0.005500413564929694,
   'middle east respiratory syndrome coronavirus': 0.005045492142266336,
   'millimole per liter': 0.0016266887234629168,
   'molecule': 0.01851392335263303,
   'mouse': 0.008064516129032258,
   'multi-organ dysfunction': 0.006792803970223326,
   'muscle': 0.002564102564102564,
   'myalgia': 0.0007926661152467603,
   'myocardium': 0.0058519437551695615,
   'neoplasm': 0.008353322304935207,
   'nephropathy': 0.004962779156327543,
   'neutrophil': 0.008836503997794318,
   'obesity': 0.008023159636062862,
   'oral cavity': 0.010752688172043012,
   'organ': 0.013454645712710229,
   'oxygen': 0.007154673283705542,
   'person': 0.006038047973531845,
   'plasma': 0.009966914805624482,
   'plasmid': 0.016191066997518613,
   'pneumonia': 0.012572373862696443,
   'prognosis': 0.0,
   'proliferation': 0.00380479735318445,
   'pulmonary': 0.0347808105872622,
   'rbd': 0.00380479735318445,
   'receptor binding': 0.00260545905707196,
   'renal': 0.012985938792390406,
   'respiratory failure': 0.006406810035842295,
   'respiratory system': 0.014846980976013235,
   'sars coronavirus': 0.0037220843672456576,
   'sars-cov-2': 0.003143093465674111,
   'saxagliptin': 0.00467328370554177,
   'septicemia': 0.00041356492969396195,
   'serum': 0.014502343534601598,
   'serum ferritin': 0.016101461262751585,
   'severe acute respiratory syndrome': 0.004383788254755997,
   'shortness of breath visual analogue scale': 0.00041356492969396195,
   'sitagliptin': 0.006472291149710504,
   'sulfonylurea antidiabetic agent': 0.00641025641025641,
   'survival': 0.004466501240694789,
   't-lymphocyte': 0.029404466501240695,
   'therapeutic corticosteroid': 0.011248966087675765,
   'thrombophilia': 0.018458781362007164,
   'tissue': 0.011993382961124897,
   'transmembrane protein': 0.010918114143920596,
   'troponin t, cardiac muscle': 0.0029914529914529912,
   'tumor necrosis factor': 0.016804521643231318,
   'tzd': 0.007154673283705543,
   'vaccine': 0.003349875930521092,
   'vascular': 0.0028949545078577337,
   'vildagliptin': 0.0030603804797353184,
   'viral': 0.012241521918941274,
   'viral entry': 0.015508684863523574,
   'viral infection': 0.006286186931348222,
   'virus': 0.005789909015715467}},
 'closeness': {}}

Community detection

Community detection methods partition the network into clusters of densely connected nodes, such that nodes in the same community are more densely connected to each other than to nodes in other communities. In this section we illustrate the use of the CommunityDetector interface provided by BlueGraph for community detection and for estimating the quality of the resulting partition using the modularity, performance and coverage metrics. The unified interface allows us to use various community detection methods available in different graph backends.
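To make the three quality metrics concrete, here is a minimal pure-Python sketch that computes them for a toy graph. It assumes the standard textbook definitions for an unweighted, undirected graph without self-loops; the function name partition_quality and the toy data are hypothetical, and BlueGraph's own implementation (in particular how it applies the weight argument) may differ.

```python
def partition_quality(nodes, edges, communities):
    """Return (modularity, performance, coverage) of a partition.

    Assumes an unweighted, undirected graph without self-loops;
    `communities` is a list of disjoint node sets covering all nodes.
    """
    community_of = {n: i for i, members in enumerate(communities) for n in members}
    edge_set = {frozenset(e) for e in edges}
    m = len(edge_set)

    def is_intra(edge):
        u, v = tuple(edge)
        return community_of[u] == community_of[v]

    # Coverage: fraction of edges that fall inside a community.
    intra_edges = sum(1 for e in edge_set if is_intra(e))
    coverage = intra_edges / m

    # Performance: fraction of node pairs that are "correctly classified",
    # i.e. linked pairs inside a community plus unlinked pairs across communities.
    n = len(nodes)
    total_pairs = n * (n - 1) // 2
    intra_pairs = sum(len(c) * (len(c) - 1) // 2 for c in communities)
    inter_pairs = total_pairs - intra_pairs
    inter_non_edges = inter_pairs - (m - intra_edges)
    performance = (intra_edges + inter_non_edges) / total_pairs

    # Modularity: intra-community edge fraction minus its expected value
    # under random rewiring that preserves node degrees.
    degree = {node: 0 for node in nodes}
    for e in edge_set:
        for node in e:
            degree[node] += 1
    modularity = 0.0
    for i, members in enumerate(communities):
        l_c = sum(1 for e in edge_set if all(community_of[x] == i for x in e))
        k_c = sum(degree[x] for x in members)
        modularity += l_c / m - (k_c / (2 * m)) ** 2
    return modularity, performance, coverage


# Toy graph: two triangles joined by a single bridge edge.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
communities = [{"a", "b", "c"}, {"d", "e", "f"}]

modularity, performance, coverage = partition_quality(nodes, edges, communities)
print("Modularity: ", modularity)
print("Performance: ", performance)
print("Coverage: ", coverage)
```

Under these definitions, a partition that puts almost all nodes into a single community scores a coverage close to 1 but a low performance, which is why it is useful to look at the three metrics together rather than at any one of them alone.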

First, we create a NetworkX-based instance and use several different community detection strategies provided by this library.

nx_detector = NXCommunityDetector(new_paragraph_network, directed=False)
nx_detector.graph
<networkx.classes.graph.Graph at 0x7fc6ae6035c0>

Louvain algorithm

partition = nx_detector.detect_communities(
    strategy="louvain", weight="npmi")
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))
Modularity:  0.34260090074583094
Performance:  0.7893189612934836
Coverage:  0.3929003630496168

Label propagation

partition = nx_detector.detect_communities(
    strategy="lpa", weight="npmi", intermediate=False)
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))
Modularity:  0.07719091705395371
Performance:  0.3316184876694431
Coverage:  0.9415086728519564

Stochastic block model

gt_detector = GTCommunityDetector(new_paragraph_network, directed=False)
partition = gt_detector.detect_communities(strategy="sbm", weight="npmi")
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))
Modularity:  0.21771245914317255
Performance:  0.7700473624040503
Coverage:  0.2408229124647035

Writing community partition as node properties

nx_detector.detect_communities(
    strategy="louvain", weight="npmi",
    write=True, write_property="louvain_community")
new_paragraph_network = nx_detector.get_pgframe(
    node_prop_types=new_paragraph_network._node_prop_types,
    edge_prop_types=new_paragraph_network._edge_prop_types)
new_paragraph_network.nodes(raw_frame=True).sample(5)
@type paper_frequency degree pagerank betweenness louvain_community
@id
multi-organ dysfunction Entity 2.0 23.0 0.002937 0.006793 4
adenosine Entity 2.0 22.0 0.002692 0.009377 1
millimole per liter Entity 2.0 16.0 0.002137 0.001627 5
respiratory failure Entity 2.0 23.0 0.002986 0.006407 4
interleukin-6 Entity 5.0 209.0 0.017315 0.013317 1

Export network and the computed metrics

# Save graph as JSON
new_paragraph_network.export_json("../data/literature_comention.json")
# Save the graph for Gephi import.
new_paragraph_network.export_to_gephi(
    "../data/gephi_literature_comention",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the Louvain algorithm (with NPMI edge weights), node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

Full network

[Figure: Full network]

Community “Symptoms and comorbidities”

[Figure: Symptoms and comorbidities]

Community “Viral biology”

[Figure: Viral biology]

Community “Immunity”

[Figure: Immunity]

Minimum spanning trees

A spanning tree of a connected network with \(n\) nodes is a subset of \(n - 1\) edges that connects all the nodes without forming cycles. A minimum spanning tree is a spanning tree with the smallest possible total edge weight.
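As an illustration of this definition, the following minimal sketch builds a minimum spanning tree with Kruskal's algorithm in pure Python. It is not necessarily the algorithm used by the graph_tool backend; the function name minimum_spanning_tree, the toy terms and the distance values are hypothetical and chosen only for illustration.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm for a connected, undirected graph.

    `weighted_edges` is an iterable of (source, target, distance) tuples.
    Returns the list of edges that form a minimum spanning tree.
    """
    # Union-find structure: every node starts as the root of its own component.
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    # Visit edges from the smallest to the largest distance.
    for source, target, distance in sorted(weighted_edges, key=lambda e: e[2]):
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            # The edge links two separate components, so it cannot close a cycle.
            parent[root_s] = root_t
            tree.append((source, target, distance))
    return tree


# Toy network with made-up distances (smaller distance = stronger association).
toy_nodes = ["covid-19", "fever", "cough", "lung"]
toy_edges = [
    ("covid-19", "fever", 1.2),
    ("covid-19", "cough", 1.1),
    ("fever", "cough", 1.5),
    ("cough", "lung", 1.3),
    ("covid-19", "lung", 2.0),
]

toy_tree = minimum_spanning_tree(toy_nodes, toy_edges)
for source, target, distance in toy_tree:
    print(source, "<->", target, distance)
```

The resulting tree keeps \(n - 1 = 3\) of the 5 toy edges: the two edges that would close a cycle among already connected nodes are skipped.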

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges. We use the graph_tool-based implementation of the PathFinder interface.

gt_paragraph_path_finder = GTPathFinder(new_paragraph_network, directed=False)
gt_paragraph_path_finder.graph
<Graph object, undirected, with 157 vertices and 2479 edges at 0x7fc6ae3e8438>
tree = graph_tool_to_pgframe(gt_paragraph_path_finder.minimum_spanning_tree(distance="distance_npmi"))
tree.export_to_gephi(
    "../data/gephi_literature_spanning_tree",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the spanning tree saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the Louvain algorithm (with NPMI edge weights), node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

Full network

[Figure: Full network]

Zoom into “covid-19”

[Figure: covid-19]