NASA dataset keywords analysis: Neo4j analytics tutorial¶

In this notebook we use graph-based co-occurrence analysis on the publicly available Data catalog of NASA (https://data.nasa.gov/browse, and the API endpoint https://data.nasa.gov/data.json). This dataset consists of the meta-data for different NASA datasets. The source notebook can be found here.

We will work on the sets of keywords attached to each dataset and build a keyword co-occurrence graph describing relations between different dataset keywords. The keyword relations in the above-mentioned graph are quantified using mutual-information-based scores (normalized pointwise mutual information).

See the related tutorial here: https://www.tidytextmining.com/nasa.html

In this tutorial we will use the Neo4j-based implementation of different analytics interfaces provided by BlueGraph. Therefore, in order to use it, you need a running instance of the Neo4j database (see installation instructions).

import os
import json
import pandas as pd
import requests
import getpass

from bluegraph.core import (PandasPGFrame,
                            pretty_print_paths,
                            pretty_print_tripaths)
from bluegraph.preprocess.generators import CooccurrenceGenerator
from bluegraph.backends.neo4j import (pgframe_to_neo4j,
                                      Neo4jMetricProcessor,
                                      Neo4jPathFinder,
                                      Neo4jCommunityDetector,
                                      Neo4jGraphProcessor)
from bluegraph.backends.networkx import NXPathFinder, networkx_to_pgframe

Data preparation¶

Download and read the NASA dataset.

NASA_META_DATA_URL = "https://data.nasa.gov/data.json"
if not os.path.isfile("../data/nasa.json"):
    r = requests.get(NASA_META_DATA_URL)
    open("../data/nasa.json", "wb").write(r.content)

with open("../data/nasa.json", "r") as f:
    data = json.load(f)

print("Example dataset: ")
print("----------------")
print(json.dumps(data["dataset"][0], indent="   "))

print()
print("Keywords: ", data["dataset"][0]["keyword"])

Example dataset:
----------------
{
   "accessLevel": "public",
   "landingPage": "https://pds.nasa.gov/ds-view/pds/viewDataset.jsp?dsid=RO-E-RPCMAG-2-EAR2-RAW-V3.0",
   "bureauCode": [
      "026:00"
   ],
   "issued": "2018-06-26",
   "@type": "dcat:Dataset",
   "modified": "2020-03-04",
   "references": [
      "https://pds.nasa.gov"
   ],
   "keyword": [
      "earth",
      "unknown",
      "international rosetta mission"
   ],
   "contactPoint": {
      "@type": "vcard:Contact",
      "fn": "Thomas Morgan",
      "hasEmail": "mailto:thomas.h.morgan@nasa.gov"
   },
   "publisher": {
      "@type": "org:Organization",
      "name": "National Aeronautics and Space Administration"
   },
   "identifier": "urn:nasa:pds:context_pds3:data_set:data_set.ro-e-rpcmag-2-ear2-raw-v3.0",
   "description": "This dataset contains EDITED RAW DATA of the second Earth Flyby (EAR2). The closest approach (CA) took place on November 13, 2007 at 20:57",
   "title": "ROSETTA-ORBITER EARTH RPCMAG 2 EAR2 RAW V3.0",
   "programCode": [
      "026:005"
   ],
   "distribution": [
      {
         "@type": "dcat:Distribution",
         "downloadURL": "https://www.socrata.com",
         "mediaType": "text/html"
      }
   ],
   "accrualPeriodicity": "irregular",
   "theme": [
      "Earth Science"
   ]
}

Keywords:  ['earth', 'unknown', 'international rosetta mission']

Create a dataframe with keyword occurrence in different datasets

rows = []
for el in data['dataset']:
    row = [el["identifier"]]
    if "keyword" in el:
        for k in el["keyword"]:
            rows.append(row + [k])
keyword_data = pd.DataFrame(rows, columns=["dataset", "keyword"])

keyword_data

	dataset	keyword
0	urn:nasa:pds:context_pds3:data_set:data_set.ro...	earth
1	urn:nasa:pds:context_pds3:data_set:data_set.ro...	unknown
2	urn:nasa:pds:context_pds3:data_set:data_set.ro...	international rosetta mission
3	TECHPORT_9532	completed
4	TECHPORT_9532	jet propulsion laboratory
...	...	...
112731	NASA-877__2	lunar
112732	NASA-877__2	jsc
112733	NASA-877__2	sample
112734	TECHPORT_94299	active
112735	TECHPORT_94299	trustees of the colorado school of mines

112736 rows × 2 columns

Aggregate dataset ids for each keyword and select the 500 most frequently used keywords.

n = 500

aggregated_datasets = keyword_data.groupby("keyword").aggregate(set)["dataset"]
most_frequent_keywords = list(aggregated_datasets.apply(len).nlargest(n).index)
most_frequent_keywords[:5]

['completed',
 'earth science',
 'atmosphere',
 'national geospatial data asset',
 'ngda']

Create a property graph object whose nodes are unique keywords.

graph = PandasPGFrame()
graph.add_nodes(most_frequent_keywords)
graph.add_node_types({n: "Keyword" for n in most_frequent_keywords})

Add sets of dataset ids as properties of our keyword nodes.

aggregated_datasets.index.name = "@id"
graph.add_node_properties(aggregated_datasets, prop_type="category")

graph._nodes.sample(5)

	@type	dataset
@id
4 vesta	Keyword	{urn:nasa:pds:context_pds3:data_set:data_set.d...
jsc	Keyword	{NASA-872, NASA-877, NASA-873, NASA-871, NASA-...
paaliaq	Keyword	{urn:nasa:pds:context_pds3:data_set:data_set.c...
international halley watch	Keyword	{urn:nasa:pds:context_pds3:data_set:data_set.i...
active remote sensing	Keyword	{C1243149604-ASF, C1243162394-ASF, C1243197502...

n_datasets = len(keyword_data["dataset"].unique())
print("Total number of dataset: ", n_datasets)

Total number of dataset:  25722

Co-occurrence graph generation¶

We create a co-occurrence graph using the 500 most frequent keywords: nodes are keywords and a pair of nodes is connected with an undirected edge if two corresponding keywords co-occur in at lease one dataset. Moreover, the edges are equipped with weights corresponding to:

raw co-occurrence frequency
normalized pointwise mutual information (NPMI)
frequency- and mutual-information-based distances (1 / frequency, 1 / NPMI)

generator = CooccurrenceGenerator(graph)
comention_edges = generator.generate_from_nodes(
    "dataset", total_factor_instances=n_datasets,
    compute_statistics=["frequency", "npmi"])

Examining 124750 pairs of terms for co-occurrence...

Remove edges with zero NPMI

comention_edges = comention_edges[comention_edges["npmi"] > 0]

Compute the NPMI-based distance score

comention_edges.loc[:, "distance_npmi"] = comention_edges.loc[:, "npmi"].apply(lambda x: 1 / x)

Add generated edges to the property graph.

graph.remove_node_properties("dataset") # Remove datasets from node properties
graph._edges = comention_edges.drop(columns=["common_factors"])
graph._edge_prop_types = {
    "frequency": "numeric",
    "npmi": "numeric",
    "distance_npmi": "numeric"
}

graph.edges(raw_frame=True).sample(5)

		frequency	npmi	distance_npmi
@source_id	@target_id
surface radiative properties	natural hazards	3	0.022294	44.855423
cryosphere	radar	56	0.248405	4.025688
mars global surveyor	stardust	2	0.395268	2.529931
sample treatment protocol	temperature	1	0.446902	2.237629
radar	synthetic	1	0.056982	17.549306

Initializing Neo4j graph from a PGFrame¶

In this section we will populate a Neo4j database with the generated keyword co-occurrence property graph.

In the cells below provide the credentials for connecting to your instance of the Neo4j database.

NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"

NEO4J_PASSWORD = getpass.getpass()

········

Populate the Neo4j database with the nodes and edges of the generated property graph using pgframe_to_neo4j. We specify labels of nodes (Keyword) and edges (CoOccurs) to use for the new elements.

NODE_LABEL = "Keyword"
EDGE_LABEL = "CoOccurs"

# (!) If you run this cell multiple times, you may create nodes and edges of the graph
# multiple times, if you have already run the notebook, set the parameter `pgframe` to None
# this will prevent population of the Neo4j database with the generated graph, but will create
# the necessary `Neo4jGraphView` object.
graph_view = pgframe_to_neo4j(
    pgframe=graph,  # None, if no population is required
    uri=NEO4J_URI, username=NEO4J_USER, password=NEO4J_PASSWORD,
    node_label=NODE_LABEL, edge_label=EDGE_LABEL,
    directed=False)

# # If you want to clear the database from created elements, run
# graph_view._clear()

Nearest neighours by NPMI¶

In this section we will compute top 10 neighbors of the keywords ‘mars’ and ‘saturn’ by the highest NPMI.

To do so, we will use the top_neighbors method of the PathFinder interface provided by the BlueGraph. This interface allows us to search for top neighbors with the highest edge weight. In this example, we use Neo4j-based Neo4jPathFinder interface.

path_finder = Neo4jPathFinder.from_graph_object(graph_view)

path_finder.top_neighbors("mars", 10, weight="npmi")

{'mars exploration rover': 0.7734334910676389,
 'phoenix': 0.6468063979421724,
 'mars science laboratory': 0.6354738555723674,
 '2001 mars odyssey': 0.5902693119742288,
 'mars global surveyor': 0.5756873488729959,
 'mars reconnaissance orbiter': 0.555421194053889,
 'viking': 0.5490894121185264,
 'mars pathfinder': 0.5223639673427369,
 'mars express': 0.5153112375202485,
 'phobos': 0.49283414887183974}

path_finder.top_neighbors("saturn", 10, weight="npmi")

{'iapetus': 0.7512958866945076,
 'tethys': 0.750629376973449,
 'mimas': 0.7481024499128304,
 'phoebe': 0.7458314316016054,
 'rhea': 0.7453385116030462,
 'dione': 0.7425859139664013,
 'cassini-huygens': 0.74217432172955,
 'enceladus': 0.7347323196182364,
 'hyperion': 0.7346061630878281,
 'janus': 0.7144193057581066}

Graph metrics and node centrality measures¶

BlueGraph provides the MetricProcessor interface for computing various graph statistics. We will use Neo4j-based Neo4jMetricProcessor interface.

metrics = Neo4jMetricProcessor.from_graph_object(graph_view)

print("Density of the constructed network: ", metrics.density())

Density of the constructed network:  0.051334669338677356

Node centralities¶

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the MetricProcessor interface in the write mode, i.e. computed metrics will be written as node properties of the underlying graph object.

Degree centrality is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).

metrics.degree_centrality("frequency", write=True, write_property="degree")

PageRank centrality is another measure that estimated the importance of the given node in the network. Roughly speaking it can be interpreted as the probablity that having landed on a random node in the network we will jump to the given node (here the edge weights are taken into account”).

https://en.wikipedia.org/wiki/PageRank

metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

Betweenness centrality is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.

metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

/Users/oshurko/opt/anaconda3/envs/bluegraph/lib/python3.6/site-packages/bluegraph/backends/neo4j/analyse/metrics.py:111: MetricProcessingWarning: Weighted betweenness centrality for Neo4j graphs is not implemented: computing the unweighted version
  MetricProcessor.MetricProcessingWarning)

Now, we will export this backend-specific graph object into a PGFrame.

new_graph = metrics.get_pgframe(node_prop_types=graph._node_prop_types, edge_prop_types=graph._edge_prop_types)

new_graph.nodes(raw_frame=True).sample(5)

	degree	betweenness	pagerank
@id
delta	25.0	122.525847	0.948799
langley research center	2.0	2.832536	0.553508
radiation dosimetry	33.0	60.795710	1.255296
population	25.0	49.157125	0.729864
tarvos	32.0	0.000000	0.923662

print("Top 10 nodes by degree")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

Top 10 nodes by degree
     earth science
     jupiter
     earth
     land surface
     terrestrial hydrosphere
     imagery
     support archives
     sun
     atmosphere
     surface water

print("Top 10 nodes by PageRank")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

Top 10 nodes by PageRank
     active
     earth science
     completed
     project
     pds
     earth
     jupiter
     imagery
     data
     moon

print("Top 10 nodes by betweenness")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

Top 10 nodes by betweenness
     astronomy
     imagery
     goddard space flight center
     radar
     active
     topography
     safety
     time
     images
     temperature

Community detection¶

Community detection methods partition the graph into clusters of densely connected nodes in a way that nodes in the same community are more connected between themselves relatively to the nodes in different communities. In this section we will illustrate the use of the CommunityDetector interface provided by BlueGraph for community detection and estimation of its quality using modularity, performance and coverange methods.

First, we create a Neo4j-based instance and use several different community detection strategies provided by Neo4j.

com_detector = Neo4jCommunityDetector.from_graph_object(graph_view)

Louvain algorithm¶

partition = com_detector.detect_communities(
    strategy="louvain", weight="npmi")

print("Modularity: ", com_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", com_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", com_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

Modularity:  0.8055352122880087
Performance:  0.9050420841683366
Coverage:  0.9223953224304512

Label propagation¶

partition = com_detector.detect_communities(
    strategy="lpa", weight="npmi")

print("Modularity: ", com_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", com_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", com_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

Modularity:  0.6599097374293331
Performance:  0.6372184368737475
Coverage:  0.9699235341510681

Writing community partition as node properties¶

com_detector.detect_communities(
    strategy="louvain", weight="npmi",
    write=True, write_property="louvain_community")

new_graph = com_detector.get_pgframe(
    node_prop_types=new_graph._node_prop_types,
    edge_prop_types=new_graph._edge_prop_types)

new_graph.nodes(raw_frame=True).sample(5)

	degree	betweenness	louvain_community	pagerank
@id
sample collection	29.0	251.402898	330	1.130440
atmospheric chemistry	50.0	134.337209	319	1.304748
neptune	48.0	2524.723644	353	1.602773
coanda	8.0	0.000000	141	0.696748
phobos	49.0	868.444670	353	1.426603

Export network and the computed metrics¶

Save graph as JSON

new_graph.export_json("../data/nasa_comention.json")

Save the graph for Gephi import.

new_graph.export_to_gephi(
    "../data/gephi_nasa_comention",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the raw frequency of the co-occurrence edges, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

NASA dataset keywords co-occurrence network

We can zoom into some of the communities of keywords identified using the community detection method above

Celestial bodies

Earth science

Space programs and missions

Minimum spanning tree¶

A minimum spanning tree of a network is given by a subset of edges that make the network connected (\(n - 1\) edges connecting \(n\) nodes). Its weighted version minimizes not only the number of edges included in the tree, but the total edge weight.

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges. We use the Neo4j-based implementation of the PathFinder interface.

path_finder.minimum_spanning_tree(distance="distance_npmi", write=True, write_edge_label="MSTEdge")

nx_path_finder = NXPathFinder(new_graph, directed=False)
tree = nx_path_finder.minimum_spanning_tree(distance="distance_npmi")

tree_pgframe = networkx_to_pgframe(
    tree,
    node_prop_types=new_graph._node_prop_types,
    edge_prop_types=new_graph._edge_prop_types)

tree_pgframe.export_to_gephi(
    "../data/gephi_nasa_spanning_tree",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

Zoom Earth Science

Zoom Asteroids

Shortest path search¶

The shortest path search problem consisits in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated to the edges.

path = path_finder.shortest_path("ecosystems", "oceans")
pretty_print_paths([path])

ecosystems <->               <-> oceans
               earth science

The cell above illustrates that the single shortest path form ‘ecosystems’ and ‘oceans’ consists of the direct edge between them.

Now to explore related keywords we would like to find a set of \(n\) shortest paths between them. Moreover, we would like these paths to be indirect (not to include the direct edge from the source to the target). In the following examples we use mutual-information-based edge weights to perform our literature exploration.

In the following examples we use Yen’s algorithm for finding \(n\) loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).

paths = path_finder.n_shortest_paths(
    "ecosystems", "oceans", n=10,
    distance="distance_npmi",
    strategy="yen")

pretty_print_paths(paths)

ecosystems <->                                              <-> oceans
               biosphere <-> coastal processes
               biosphere <-> ocean waves
               biosphere <-> terrestrial ecosystems
               biosphere <-> erosion/sedimentation
               geomorphic landforms/processes
               biosphere <-> geomorphic landforms/processes
               land use/land cover <-> coastal processes
               biosphere <-> forest science
               earth science
               land use/land cover <-> ocean waves

paths = path_finder.n_shortest_paths(
    "mission", "mars", n=10,
    distance="distance_npmi",
    strategy="yen")

pretty_print_paths(paths)

mission <->                                                         <-> mars
            delta <-> mars reconnaissance orbiter
            earth's bridge to space <-> mars reconnaissance orbiter
            vehicle <-> mars reconnaissance orbiter
            mars reconnaissance orbiter
            history <-> mars reconnaissance orbiter
            support <-> mars reconnaissance orbiter
            landing <-> mars reconnaissance orbiter
            delta <-> phoenix
            earth's bridge to space <-> phoenix
            vehicle <-> phoenix

Nested path search¶

To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a nested fashion. For each edge \(e_1, e_2, ..., e_n\) encountered on a path from the source to the target from, we can further expand it into \(n\) shortest paths between each pair of successive entities (i.e. paths between \(e_1\) and \(e_2\), \(e_2\) and \(e_3\), etc.).

paths1 = path_finder.n_nested_shortest_paths(
    "ecosystems", "oceans",
    top_level_n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="yen")

paths2 = path_finder.n_nested_shortest_paths(
    "mission", "mars",
    top_level_n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="yen")

We can now build and visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

summary_graph_oceans = networkx_to_pgframe(nx_path_finder.get_subgraph_from_paths(paths1))
summary_graph_mars = networkx_to_pgframe(nx_path_finder.get_subgraph_from_paths(paths2))

# Save the graph for Gephi import.
summary_graph_oceans.export_to_gephi(
    "../data/gephi_nasa_path_graph_oceans",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })
# Save the graph for Gephi import.
summary_graph_mars.export_to_gephi(
    "../data/gephi_nasa_path_graph_mars",
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting graphs visualized with Gephi

Ecosystems <-> Oceans

Mission <-> Mars