vO.1.1 Release Notes

Introducing Blue Graph

This is an initial release of Blue Graph, a graph analytics framework for Python designed to unify different graph processing APIs. Its main goal is to provide a set of uniform APIs enabling users to peform graph analytics tasks abstracting away from the specificities of different graph processing tools, backends, and the graph data structures they provide.

Using the built-in PGFrame data structure for representing property graphs with node/edge types and property data types as a ‘data exchange layer’ between different analytics interfaces, it provides a backend-agnostic API supporting both in-memory (NetworkX, graph-tool and StellarGraph) and persistent graph backends (Neo4j).

It provides the following set of interfaces:

  • preprocessing and co-occurrence analysis API providing semantic property encoders and co-occurrence graph generators;

  • graph analytics API providing interfaces for computing graph metrics, performing path search and community detection;

  • unsupervised representation learning API for applying various graph embedding techniques;

  • representation learning downstream tasks API allowing the user to perform node classification, similarity queries, link prediction.

The main components of BlueGraph’s API are illustrated in the following diagram:

BlueGraph components

Blue Graph’s core

The bluegraph.core package provides the exchange graph data structure that serves as the input to graph processors based on different backends (PGFrame), as well as basic interfaces for different graph analytics and embedding classes (MetricProcessor, PathFinder, CommunityDetector, GraphElementEmbedder, etc).

PGFrame

This data structure represents property graphs with node/edge types and property data types. It is based on two dataframes: one for nodes, their types and properties as well as one for edges and their types/properties. Moreover, PGFrame supports setting data types to node/edge properties. Three high-level data types are currently supported: numeric for numerical properties (for example, age), category for properties representing categorical labels (for example, sex) and text for representing textual properties (for example, description).

This release includes a pandas-based implementation (PandasPGFrame) of the PGFrame data structure.

GraphProcessor

GraphProcessor is a basic interface for accessing and manipulating different graph elements (nodes/edges/properties) of backend-specific graph objects.

Graph preprocessing with BlueGraph

The bluegraph.preprocessing package provides a set of utils for preprocessing PGFrame graph objects.

Co-occurrence generation

The CooccurrenceGenerator interface allows the user to examine a provided property graph for co-occurrences (based on its nodes or edges), generate co-occurrence edges, and compute the following statistics:

  • co-occurrece factors,

  • co-occurrence frequency,

  • positive/normalized pointwise mutual information of the co-occurrence 4.

Semantic property encoding

The SemanticPGEncoder interface allows the user to convert properties of nodes and edges into numerical vectors. Currently, ScikitLearnPGEncoder is implemented in Blue Graph.

The encoder object can be instantiated as in the following example:

encoder = ScikitLearnPGEncoder(node_properties, edge_properties,
                               heterogeneous, drop_types,
                               encode_types, edge_features,
                               categorical_encoding,
                               text_encoding,
                               text_encoding_max_dimension,
                               missing_numeric,
                               imputation_strategy,
                               standardize_numeric)

Where the following parameters can be specified:

  • node_properties a list of node property names to encode;

  • edge_properties a list of edge property names to encode;

  • heterogeneous a flag indicating if the feature space is heterogeneous accross different node/edge types;

  • encode_types a flag indicating if node/edge types should be included in the generated features;

  • categorical_encoding defines a strategy for encoding categorical properties (currently, Multi Label Binarizer is available);

  • text_encoding defines a strategy for encoding text properties (currently, TfIdf and Doc2Vec encodings are available);

  • missing_numeric a strategy for treating missing numeric property values (dropping or imputing such values is possible);

  • imputation_strategy a strategy for imputing missing numeric property values (currently, only the mean imputation is available);

  • standardize_numeric a flag indicating if numerical property values should be standardized.

Backend support

Blue Graph supports the following in-memory and persistent graph analytics tools (see bluegraph.backends):

  • NetworkX (for the analytics API)

  • graph-tool (for the analytics API)

  • Neo4j (for the analytics and representation learning API);

  • StellarGraph (for the representation learning API).

NetworkX

The NetworkX-based graph analytics interfaces implemented in Blue Graph can be found in bluegraph.backends.networkx.

Conversion between PGFrame and NetworkX graph objects can be done using pgframe_to_networkx and networkx_to_pgframe. NXGraphProcessor allows accessing and manipulating graph elements of NetworkX graphs (bluegraph.backends.networkx.io).

Graph metrics

The NXMetricsProcessor interface supports the following graph metrics (all supporting unweighted and weighted graph edges):

  • graph density,

  • degree centrality,

  • PageRank centrality,

  • betweenness centrality,

  • closeness centrality.

Writing centrality measures as node properties is available.

Community detection

The NXCommunityDetector interface supports the following community detection methods (all supporting both weighted and unweighted graphs):

  • Louvain algorithm (strategy="louvain")

  • Girvan–Newman algorithm (strategy="girvan-newman")

  • Label propagation (strategy="lpa")

  • Hierarchical clustering (strategy="hierarchical", only for nodes with numerical features)

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

graph-tool

The graph-tool-based analytics interfaces implemented in Blue Graph can be found in bluegraph.backends.graph_tool.

Conversion between PGFrame and graph-tool objects can be done using pgframe_to_graph_tool and graph_tool_to_pgframe. GTGraphProcessor allows accessing and manipulating graph elements of graph-tool graphs (bluegraph.backends.graph_tool.io).

Graph metrics

The GTMetricsProcessor interface supports the following graph metrics (all supporting unweighted and weighted graph edges):

  • graph density,

  • degree centrality,

  • PageRank centrality,

  • betweenness centrality,

  • closeness centrality.

Writing centrality measures as node properties is available.

Path search

The GTPathFinder interface supports the following path search tasks:

  • minimum spanning tree (unweighted and weighted, in-place labeling of the tree edges is available),

  • top neighbors by edge weight,

  • single shortest path (unweighted and weighted, excluding direct edge is available),

  • all shortest paths (excluding direct edge is available),

  • top shortest paths (unweighted and weighted, excluding direct edge is available, only the Naive 1 strategy is available)

  • nested shortest paths 2 (unweighted and weighted, excluding direct edge is available, the Naive strategy is available).

Community detection

The GTCommunityDetector interface supports the following community detection methods (supporting both weighted and unweighted graphs):

  • Statistical inference with Stochastic Block Models (strategy="sbm") 5,

  • Hierarchical clustering (strategy="hierarchical", only for nodes with numerical features).

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

Neo4j

The Neo4j-based graph analytics interfaces implemented in Blue Graph can be found in bluegraph.backends.neo4j and are based on the Neo4j Graph Data Science Library (version >= 1.5).

A Neo4j database instance can be populated from a PGFrame object using pgframe_to_neo4j and a PGFrame object can be read from a Neo4j database using neo4j_to_pgframe. Neo4jGraphProcessor allows accessing and manipulating graph elements of Neo4j graphs (bluegraph.backends.neo4j.io).

Graph metrics

The Neo4jMetricsProcessor interface supports the following graph metrics:

  • graph density (unweighted and weighted),

  • degree centrality (unweighted and weighted),

  • PageRank centrality (unweighted and weighted),

  • betweenness centrality (only unweighted),

  • closeness centrality (only unweighted).

Writing centrality measures as node properties is available.

Path search

The Neo4jPathFinder interface supports the following path search tasks:

  • minimum spanning tree (unweighted and weighted, only in-place labeling of the tree edges is available),

  • top neighbors by edge weight,

  • single shortest path (unweighted and weighted, excluding direct edge is available),

  • all shortest paths (excluding direct edge is available),

  • top shortest paths (unweighted and weighted, excluding direct edge is available, only the Yen 3 strategy is available)

  • nested shortest paths 2 (unweighted and weighted, excluding direct edge is available, only the Yen 3 strategy is available).

Community detection

The Neo4jCommunityDetector interface supports the following community detection methods (all supporting both weighted and unweighted graphs):

  • Louvain algorithm (strategy="louvain")

  • Girvan–Newman algorithm (strategy="girvan-newman")

  • Label propagation (strategy="lpa")

  • Hierarchical clustering (strategy="hierarchical", only for nodes with numerical features)

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

Node representation learning

The Neo4jNodeEmbedder interface supports the following unsupervised node representation models:

  • node2vec (transductive, only unweighted version),

  • FastRP (transductive),

  • GraphSAGE (inductive, model is saved in a model catalog of the current Neo4j instance).

StellarGraph

The StellarGraph-based graph representation learning interfaces implemented in Blue Graph can be found in bluegraph.backends.stellargraph.

Conversion between PGFrame and StellarGraph objects can be done using pgframe_to_stellargraph and stellargraph_to_pgframe (bluegraph.backends.stellargraph.io).

Node representation learning

The StellarGraphNodeEmbedder interface supports the following unsupervised node representation models:

  • node2vec (transductive, based on this demo),

  • Watch Your Step (transductive, based on this demo),

  • Deep Graph Infomax GCN, GAT, GraphSAGE (transductive, based on this demo),

  • attri2vec (inductive, based on this demo),

  • attri2vec (inductive, based on this demo),

  • GraphSAGE (inductive, based on this demo),

  • Deep Graph Infomax GCN & GAT with Cluster-GCN training procedure (inductive, based on this demo).

Downstream tasks with BlueGraph

Node classification API

The NodeClassifier interface provides a wrapper allowing to build classification models of PGFrame nodes.

The classifier object can be instantiated as in the following example:

classifier = NodeClassifier(model, feature_vector_prop, feature_props, **kwargs)

Where the following parameters can be specified:

  • model a classification model object that supports two methods model.fit(data, labels) and model.fit(data) (for example, LinearSVC);

  • feature_vector_prop name of the feature vector property to use;

  • feature_props a list of property names (all must be numeric) to concatenate in order to obtain the node feature vectors.

Edge prediction API

The EdgePredictor interface provides a wrapper allowing for building edge prediction models for PGFrame graphs. Such models allow the user to discriminate between ‘true’ and ‘false’ edges based on embedding vectors of their source and target nodes.

The classifier object can be instantiated as in the following example:

classifier = EdgePredictor(model, feature_vector_prop, feature_props, operator, directed)

Where the following parameters can be specified:

  • model a classification model object that supports two methods model.fit(data, labels) and model.fit(data) (for example, LinearSVC);

  • feature_vector_prop name of the node feature vector property to use;

  • feature_props a list of property names (all must be numeric) to concatenate in order to obtain the node feature vectors;

  • operator binary operator to apply on the embedding vectors of source and target nodes (available operators, “hadamard”, “l1”, “l2” and “average”).

The generate_negative_edges util allows the user to generate ‘fake’ edges given a graph (that can be used for training an EdgePredictor).

Similarity API

The SimilarityProcessor interface allows building vector similarity indices using the Faiss Library. It wraps the indices (names or IDs) of the points, vector space and similarity measure configurations. It also allows segmenting the search space into Voronoi cells (see this example) allowing to speed up the search.

The NodeSimilarityProcessor is another wrapper that provides a higher-level abstraction to SimilarityProcessor and allows building and querying node similarity indices using Faiss. It wraps the underlying graph object and the vector similarity processor and provides an interface for querying similar nodes.

Building embedding pipelines

EmbeddingPipeline allows chaining the following steps and respective components:

  • property encoding (optional),

  • embedding model training (and prediction),

  • similarity index training (and querying).

Such embedding pipelines can also be saved and loaded.

Appendix

1(1,2)

Naive strategy for finding top shortest paths: the strategy first finds the set of all shortest paths from the source to the target node, it then ranks them by the cumulative distance score and returns n best paths. This naive strategy performs better for highly dense graphs (where every node is connected to almost every other node). Note, that if there are less than n unweighted shortest paths in the graph, the naive strategy may return less than n paths.

2(1,2,3)

Nested shortest paths: nested paths are found iteratively for each level of depth. For example, if e1 <-> e2 <-> … <-> eN is a path on the current level of depth, then the function searches for paths between each consecutive pair of nodes (e1 and e2, e2 and e3, etc.).

3(1,2,3)

Yen’s k-shortest paths search algorithm: see Yen, Jin Y. “Finding the k shortest loopless paths in a network”. Management Science 17.11 (1971): 712-716.

4

Pointwise mutual information: see Bouma, Gerlof. “Normalized (pointwise) mutual information in collocation extraction.” Proceedings of GSCL (2009): 31-40.

5

https://graph-tool.skewed.de/static/doc/demos/inference/inference.html