# vO.1.1 Release Notes¶

## Introducing Blue Graph¶

This is an initial release of Blue Graph, a graph analytics framework for Python designed to unify different graph processing APIs. Its main goal is to provide a set of uniform APIs enabling users to peform graph analytics tasks abstracting away from the specificities of different graph processing tools, **backends**, and the graph data structures they provide.

Using the built-in PGFrame data structure for representing **property graphs with node/edge types and property data types** as a ‘data exchange layer’ between different analytics interfaces, it provides a backend-agnostic API supporting both in-memory (NetworkX, graph-tool and StellarGraph) and persistent graph backends (Neo4j).

It provides the following set of interfaces:

preprocessing and co-occurrence analysis API providing semantic property encoders and co-occurrence graph generators;

graph analytics API providing interfaces for computing graph metrics, performing path search and community detection;

unsupervised representation learning API for applying various graph embedding techniques;

representation learning downstream tasks API allowing the user to perform node classification, similarity queries, link prediction.

The main components of BlueGraph’s API are illustrated in the following diagram:

## Blue Graph’s core¶

The `bluegraph.core`

package provides the exchange graph data structure that serves as the input to graph processors based on different backends (`PGFrame`

), as well as basic interfaces for different graph analytics and embedding classes (`MetricProcessor`

, `PathFinder`

, `CommunityDetector`

, `GraphElementEmbedder`

, etc).

### PGFrame¶

This data structure represents property graphs with node/edge types and property data types. It is based on two dataframes: one for nodes, their types and properties as well as one for edges and their types/properties. Moreover, PGFrame supports setting data types to node/edge properties. Three high-level data types are currently supported: `numeric`

for numerical properties (for example, age), `category`

for properties representing categorical labels (for example, sex) and `text`

for representing textual properties (for example, description).

This release includes a `pandas`

-based implementation (`PandasPGFrame`

) of the PGFrame data structure.

### GraphProcessor¶

`GraphProcessor`

is a basic interface for accessing and manipulating different graph elements (nodes/edges/properties) of backend-specific graph objects.

## Graph preprocessing with BlueGraph¶

The `bluegraph.preprocessing`

package provides a set of utils for preprocessing PGFrame graph objects.

### Co-occurrence generation¶

The `CooccurrenceGenerator`

interface allows the user to examine a provided property graph for co-occurrences (based on its nodes or edges), generate co-occurrence edges, and compute the following statistics:

co-occurrece factors,

co-occurrence frequency,

positive/normalized pointwise mutual information of the co-occurrence 4.

### Semantic property encoding¶

The `SemanticPGEncoder`

interface allows the user to convert properties of nodes and edges into numerical vectors. Currently, ScikitLearnPGEncoder is implemented in Blue Graph.

The encoder object can be instantiated as in the following example:

```
encoder = ScikitLearnPGEncoder(node_properties, edge_properties,
heterogeneous, drop_types,
encode_types, edge_features,
categorical_encoding,
text_encoding,
text_encoding_max_dimension,
missing_numeric,
imputation_strategy,
standardize_numeric)
```

Where the following parameters can be specified:

`node_properties`

a list of node property names to encode;`edge_properties`

a list of edge property names to encode;`heterogeneous`

a flag indicating if the feature space is heterogeneous accross different node/edge types;`encode_types`

a flag indicating if node/edge types should be included in the generated features;`categorical_encoding`

defines a strategy for encoding**categorical properties**(currently, Multi Label Binarizer is available);`text_encoding`

defines a strategy for encoding**text properties**(currently, TfIdf and Doc2Vec encodings are available);`missing_numeric`

a strategy for treating missing**numeric property**values (dropping or imputing such values is possible);`imputation_strategy`

a strategy for imputing missing**numeric property**values (currently, only the mean imputation is available);`standardize_numeric`

a flag indicating if**numerical property**values should be standardized.

## Backend support¶

Blue Graph supports the following in-memory and persistent graph analytics tools (see `bluegraph.backends`

):

NetworkX (for the analytics API)

graph-tool (for the analytics API)

Neo4j (for the analytics and representation learning API);

StellarGraph (for the representation learning API).

### NetworkX¶

The NetworkX-based graph analytics interfaces implemented in Blue Graph can be found in `bluegraph.backends.networkx`

.

Conversion between `PGFrame`

and NetworkX graph objects can be done using `pgframe_to_networkx`

and `networkx_to_pgframe`

. `NXGraphProcessor`

allows accessing and manipulating graph elements of NetworkX graphs (`bluegraph.backends.networkx.io`

).

#### Graph metrics¶

The `NXMetricsProcessor`

interface supports the following graph metrics (all supporting unweighted and weighted graph edges):

graph density,

degree centrality,

PageRank centrality,

betweenness centrality,

closeness centrality.

Writing centrality measures as node properties is available.

#### Path search¶

The `NXPathFinder`

interface supports the following path search tasks:

minimum spanning tree (unweighted and weighted, in-place labeling of the tree edges is available),

top neighbors by edge weight,

single shortest path (unweighted and weighted, excluding direct edge is available),

all shortest paths (excluding direct edge is available),

top shortest paths (unweighted and weighted, excluding direct edge is available, Naive 1 and Yen 3 strategies are available)

nested shortest paths 2 (unweighted and weighted, excluding direct edge is available, Naive and Yen strategies are available).

#### Community detection¶

The `NXCommunityDetector`

interface supports the following community detection methods (all supporting both weighted and unweighted graphs):

Louvain algorithm (

`strategy="louvain"`

)Girvan–Newman algorithm (

`strategy="girvan-newman"`

)Label propagation (

`strategy="lpa"`

)Hierarchical clustering (

`strategy="hierarchical"`

, only for nodes with numerical features)

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

### graph-tool¶

The graph-tool-based analytics interfaces implemented in Blue Graph can be found in `bluegraph.backends.graph_tool`

.

Conversion between `PGFrame`

and graph-tool objects can be done using `pgframe_to_graph_tool`

and `graph_tool_to_pgframe`

. `GTGraphProcessor`

allows accessing and manipulating graph elements of graph-tool graphs (`bluegraph.backends.graph_tool.io`

).

#### Graph metrics¶

The `GTMetricsProcessor`

interface supports the following graph metrics (all supporting unweighted and weighted graph edges):

graph density,

degree centrality,

PageRank centrality,

betweenness centrality,

closeness centrality.

Writing centrality measures as node properties is available.

#### Path search¶

The `GTPathFinder`

interface supports the following path search tasks:

minimum spanning tree (unweighted and weighted, in-place labeling of the tree edges is available),

top neighbors by edge weight,

single shortest path (unweighted and weighted, excluding direct edge is available),

all shortest paths (excluding direct edge is available),

top shortest paths (unweighted and weighted, excluding direct edge is available, only the Naive 1 strategy is available)

nested shortest paths 2 (unweighted and weighted, excluding direct edge is available, the Naive strategy is available).

#### Community detection¶

The `GTCommunityDetector`

interface supports the following community detection methods (supporting both weighted and unweighted graphs):

Statistical inference with Stochastic Block Models (

`strategy="sbm"`

) 5,Hierarchical clustering (

`strategy="hierarchical"`

, only for nodes with numerical features).

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

### Neo4j¶

The Neo4j-based graph analytics interfaces implemented in Blue Graph can be found in `bluegraph.backends.neo4j`

and are based on the Neo4j Graph Data Science Library (version `>= 1.5`

).

A Neo4j database instance can be populated from a `PGFrame`

object using `pgframe_to_neo4j`

and a `PGFrame`

object can be read from a Neo4j database using `neo4j_to_pgframe`

. `Neo4jGraphProcessor`

allows accessing and manipulating graph elements of Neo4j graphs (`bluegraph.backends.neo4j.io`

).

#### Graph metrics¶

The `Neo4jMetricsProcessor`

interface supports the following graph metrics:

graph density (unweighted and weighted),

degree centrality (unweighted and weighted),

PageRank centrality (unweighted and weighted),

betweenness centrality (only unweighted),

closeness centrality (only unweighted).

Writing centrality measures as node properties is available.

#### Path search¶

The `Neo4jPathFinder`

interface supports the following path search tasks:

minimum spanning tree (unweighted and weighted, only in-place labeling of the tree edges is available),

top neighbors by edge weight,

single shortest path (unweighted and weighted, excluding direct edge is available),

all shortest paths (excluding direct edge is available),

top shortest paths (unweighted and weighted, excluding direct edge is available, only the Yen 3 strategy is available)

nested shortest paths 2 (unweighted and weighted, excluding direct edge is available, only the Yen 3 strategy is available).

#### Community detection¶

The `Neo4jCommunityDetector`

interface supports the following community detection methods (all supporting both weighted and unweighted graphs):

Louvain algorithm (

`strategy="louvain"`

)Girvan–Newman algorithm (

`strategy="girvan-newman"`

)Label propagation (

`strategy="lpa"`

)Hierarchical clustering (

`strategy="hierarchical"`

, only for nodes with numerical features)

And supports the following partition quality metrics: modularity, performance, coverage.

Writing community labels as node properties is available.

#### Node representation learning¶

The `Neo4jNodeEmbedder`

interface supports the following unsupervised node representation models:

node2vec (transductive, only unweighted version),

FastRP (transductive),

GraphSAGE (inductive, model is saved in a model catalog of the current Neo4j instance).

### StellarGraph¶

The StellarGraph-based graph representation learning interfaces implemented in Blue Graph can be found in `bluegraph.backends.stellargraph`

.

Conversion between `PGFrame`

and StellarGraph objects can be done using `pgframe_to_stellargraph`

and `stellargraph_to_pgframe`

(`bluegraph.backends.stellargraph.io`

).

#### Node representation learning¶

The `StellarGraphNodeEmbedder`

interface supports the following unsupervised node representation models:

node2vec (transductive, based on this demo),

Watch Your Step (transductive, based on this demo),

Deep Graph Infomax GCN, GAT, GraphSAGE (transductive, based on this demo),

attri2vec (inductive, based on this demo),

attri2vec (inductive, based on this demo),

GraphSAGE (inductive, based on this demo),

Deep Graph Infomax GCN & GAT with Cluster-GCN training procedure (inductive, based on this demo).

## Downstream tasks with BlueGraph¶

### Node classification API¶

The `NodeClassifier`

interface provides a wrapper allowing to build classification models of PGFrame nodes.

The classifier object can be instantiated as in the following example:

```
classifier = NodeClassifier(model, feature_vector_prop, feature_props, **kwargs)
```

Where the following parameters can be specified:

`model`

a classification model object that supports two methods`model.fit(data, labels)`

and`model.fit(data)`

(for example, LinearSVC);`feature_vector_prop`

name of the feature vector property to use;`feature_props`

a list of property names (all must be numeric) to concatenate in order to obtain the node feature vectors.

### Edge prediction API¶

The `EdgePredictor`

interface provides a wrapper allowing for building edge prediction models for PGFrame graphs. Such models allow the user to discriminate between ‘true’ and ‘false’ edges based on embedding vectors of their source and target nodes.

The classifier object can be instantiated as in the following example:

```
classifier = EdgePredictor(model, feature_vector_prop, feature_props, operator, directed)
```

Where the following parameters can be specified:

`model`

a classification model object that supports two methods`model.fit(data, labels)`

and`model.fit(data)`

(for example, LinearSVC);`feature_vector_prop`

name of the node feature vector property to use;`feature_props`

a list of property names (all must be numeric) to concatenate in order to obtain the node feature vectors;`operator`

binary operator to apply on the embedding vectors of source and target nodes (available operators, “hadamard”, “l1”, “l2” and “average”).

The `generate_negative_edges`

util allows the user to generate ‘fake’ edges given a graph (that can be used for training an `EdgePredictor`

).

### Similarity API¶

The `SimilarityProcessor`

interface allows building vector similarity indices using the Faiss Library. It wraps the indices (names or IDs) of the points, vector space and similarity measure configurations. It also allows segmenting the search space into Voronoi cells (see this example) allowing to speed up the search.

The `NodeSimilarityProcessor`

is another wrapper that provides a higher-level abstraction to `SimilarityProcessor`

and allows building and querying **node** similarity indices using Faiss. It wraps the underlying graph object and the vector similarity processor and provides an interface for querying similar nodes.

### Building embedding pipelines¶

`EmbeddingPipeline`

allows chaining the following steps and respective components:

property encoding (optional),

embedding model training (and prediction),

similarity index training (and querying).

Such embedding pipelines can also be saved and loaded.

### Appendix¶

- 1(1,2)
**Naive strategy for finding top shortest paths**: the strategy first finds the set of all shortest paths from the source to the target node, it then ranks them by the cumulative distance score and returns n best paths. This naive strategy performs better for highly dense graphs (where every node is connected to almost every other node). Note, that if there are less than n unweighted shortest paths in the graph, the naive strategy may return less than n paths.- 2(1,2,3)
**Nested shortest paths**: nested paths are found iteratively for each level of depth. For example, if e1 <-> e2 <-> … <-> eN is a path on the current level of depth, then the function searches for paths between each consecutive pair of nodes (e1 and e2, e2 and e3, etc.).- 3(1,2,3)
**Yen’s k-shortest paths search algorithm**: see Yen, Jin Y. “Finding the k shortest loopless paths in a network”. Management Science 17.11 (1971): 712-716.- 4
**Pointwise mutual information**: see Bouma, Gerlof. “Normalized (pointwise) mutual information in collocation extraction.” Proceedings of GSCL (2009): 31-40.- 5
https://graph-tool.skewed.de/static/doc/demos/inference/inference.html