bluegraph.preprocess package

Semantic encoding

Collection of property graph encoders.

class bluegraph.preprocess.encoders.ScikitLearnPGEncoder(node_properties=None, edge_properties=None, heterogeneous=False, drop_types=False, encode_types=False, edge_features=False, categorical_encoding='multibin', text_encoding='tfidf', text_encoding_max_dimension=128, missing_numeric='drop', imputation_strategy='mean', standardize_numeric=True, reduce_node_dims=False, reduce_edge_dims=False, n_node_components=None, n_edge_components=None)

Scikit-learn-based in-memory property graph encoder.

The encoder provides a wrapper for multiple heterogeneous models for encoding various node/edge properties of different data types into numerical vectors. It supports the following encoders:

  • for categorical properties: MultiLabelBinarizer

  • for text properties: TF-IDF, word2vec

  • for numerical properties: standard scaler
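To make the property types above more concrete, the following pure-Python sketch shows what multi-label binarization and standard scaling compute on toy data. It is illustrative only: the helper names `multilabel_binarize` and `standard_scale` are hypothetical, and the encoder itself is described as relying on scikit-learn models rather than on code like this.

```python
import math


def multilabel_binarize(label_sets):
    """Encode each set of labels as a 0/1 vector over the sorted vocabulary.

    Hypothetical helper, not part of bluegraph.
    """
    classes = sorted(set().union(*label_sets))
    vectors = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, vectors


def standard_scale(values):
    """Shift values to zero mean and scale them to unit (population) variance.

    Hypothetical helper, not part of bluegraph.
    """
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy categorical property: each node carries a set of labels.
classes, vectors = multilabel_binarize(
    [{"gene", "protein"}, {"protein"}, {"disease"}]
)

# Toy numerical property: one value per node.
scaled = standard_scale([1.0, 2.0, 3.0])
```

Each node ends up with a fixed-length numerical vector per property, which is the general shape of output the encoder description refers to.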

exception EncodingException

load(path)

Load the encoder from the file.

save(path)

Save the encoder to the file.

class bluegraph.preprocess.encoders.SemanticPGEncoder(node_properties=None, edge_properties=None, heterogeneous=False, drop_types=False, encode_types=False, edge_features=False, categorical_encoding='multibin', text_encoding='tfidf', text_encoding_max_dimension=128, missing_numeric='drop', imputation_strategy='mean', standardize_numeric=True, reduce_node_dims=False, reduce_edge_dims=False, n_node_components=None, n_edge_components=None)

Abstract class for semantic property graph encoder.

The encoder provides a wrapper for multiple heterogeneous models for encoding various node/edge properties (of different data types) into numerical vectors. It supports three types of properties: categorical properties, text properties and numerical properties.

TODO: Make it concrete by allowing custom encoding models to be specified for different property types (?)

fit(pgframe)

Fit encoders for node and edge properties.

fit_transform(pgframe)

Fit the encoder and transform the input PGFrame.

info()

Get a dictionary with information about the encoder.

abstract load(path)

Load the encoder from the file.

abstract save(path)

Save the encoder to the file.

transform(pgframe, skip_reduction=False)

Transform the input PGFrame.

Co-occurrence generation

class bluegraph.preprocess.generators.CooccurrenceGenerator(pgframe)

Generator of co-occurrence edges from PGFrames.

This interface makes it possible to inspect nodes of the wrapped graph for their co-occurrence. The co-occurrence can be based on node properties: two nodes co-occur when they share some property values. For instance, two terms have common values in the sets of papers in which they occur, i.e. the two terms co-occur in the same papers. The co-occurrence can also be based on edge types: two nodes co-occur when they both have an edge of the same type pointing to the same target node. For example, two nodes representing terms each have an edge of the type ‘occursIn’ pointing to the same node representing a paper. The class generates edges between co-occurring nodes according to the input criteria and computes a set of statistics (frequency, PPMI, NPMI) quantifying their co-occurrence relationships.
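The property-based criterion described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the class's implementation: the function name `cooccurrence_edges` and the toy term/paper data are hypothetical, and only the co-occurrence frequency is computed.

```python
from itertools import combinations


def cooccurrence_edges(node_to_factors):
    """Return {(source, target): frequency} for node pairs sharing values.

    `node_to_factors` maps each node to the set of values of the property
    used as the co-occurrence factor (here: papers a term occurs in).
    Hypothetical helper, not part of bluegraph.
    """
    edges = {}
    for s, t in combinations(sorted(node_to_factors), 2):
        common = node_to_factors[s] & node_to_factors[t]
        if common:
            # Frequency = number of shared factor instances.
            edges[(s, t)] = len(common)
    return edges


# Toy data: terms and the papers in which they occur.
papers = {
    "glucose": {"paper1", "paper2", "paper3"},
    "insulin": {"paper2", "paper3"},
    "neuron": {"paper4"},
}
edges = cooccurrence_edges(papers)
```

The same sketch covers the edge-type criterion if each set holds the target nodes reached through edges of a given type (for example ‘occursIn’) instead of property values.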

bluegraph.preprocess.generators.mutual_information(co_freq, s_freq, t_freq, total_instances, mitype=None)

Compute mutual information on a pair of terms.

Parameters
  • co_freq (int) – Co-occurrence frequency of s & t

  • s_freq (int) – Occurrence frequency of s

  • t_freq (int) – Occurrence frequency of t

  • total_instances (int) – Total number of all unique instances of the occurrence factor (for example, the total number of all scientific articles in the dataset).

  • mitype (str, optional) – Mutual information score type. Possible types are ‘expected’, ‘normalized’, ‘pmi2’ and ‘pmi3’. By default, no normalization is applied (i.e. positive pointwise mutual information is computed).
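As a rough guide to how the frequency parameters above relate to the scores, the sketch below uses the standard textbook definitions of pointwise mutual information (PMI), its positive variant (PPMI) and its normalized variant (NPMI). The function names `pmi`, `ppmi` and `npmi` are hypothetical, and it is an assumption that the library's computation matches these definitions in every detail, including the logarithm base.

```python
import math


def pmi(co_freq, s_freq, t_freq, total_instances):
    """Pointwise mutual information: log2( p(s,t) / (p(s) * p(t)) )."""
    p_st = co_freq / total_instances
    p_s = s_freq / total_instances
    p_t = t_freq / total_instances
    return math.log2(p_st / (p_s * p_t))


def ppmi(co_freq, s_freq, t_freq, total_instances):
    """Positive PMI: negative scores are clipped to zero."""
    return max(0.0, pmi(co_freq, s_freq, t_freq, total_instances))


def npmi(co_freq, s_freq, t_freq, total_instances):
    """Normalized PMI: PMI divided by -log2 p(s,t), giving a score in [-1, 1]."""
    p_st = co_freq / total_instances
    return pmi(co_freq, s_freq, t_freq, total_instances) / -math.log2(p_st)


# Two terms occur in 20 papers each and together in 10, out of 100 papers.
score = pmi(10, 20, 20, 100)
normalized = npmi(10, 20, 20, 100)

# Terms that co-occur less often than chance get a PPMI of zero.
clipped = ppmi(1, 50, 50, 100)
```

All three functions assume a non-zero `co_freq`; pairs that never co-occur would need separate handling.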

bluegraph.preprocess.generators.schedule_scanning(task_queue, indices, n_workers)

Schedule scanning work.