Creating and running embedding pipelines

bluegraph lets you create embedding pipelines (using the EmbeddingPipeline class), which wrap the sequence of steps needed to produce embeddings and compute point similarities. In the examples below we create a pipeline that encodes text properties of nodes into feature vectors, produces attri2vec node embeddings and computes their similarity using two different similarity backends. The source notebook can be found here.

from bluegraph.core import PandasPGFrame

from bluegraph.preprocess.encoders import ScikitLearnPGEncoder
from bluegraph.backends.stellargraph import StellarGraphNodeEmbedder
from bluegraph.downstream.similarity import (SimilarityProcessor,
                                             FaissSimilarityIndex,
                                             ScikitLearnSimilarityIndex,
                                             SimilarityIndex)
from bluegraph.downstream import EmbeddingPipeline

Example 1: creating a pipeline trainable with run_fitting

We first create an encoder object that will be used in our pipeline to encode the node property definition using a TF-IDF encoder.

definition_encoder = ScikitLearnPGEncoder(
    node_properties=["definition"],
    text_encoding_max_dimension=512,
    text_encoding="tfidf")
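
To make the idea of TF-IDF encoding concrete, here is a minimal pure-Python sketch. It is not bluegraph's or scikit-learn's implementation: the function tfidf_encode, its whitespace tokenizer and the unsmoothed formula tf * log(N / df) are assumptions chosen for illustration, and the max_dimension argument only mimics what the name text_encoding_max_dimension suggests (a cap on the vector size).

```python
import math
from collections import Counter


def tfidf_encode(documents, max_dimension=512):
    """Encode texts as TF-IDF vectors over at most `max_dimension` terms.

    Hypothetical helper for illustration only.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: the number of texts in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep the most widespread terms (ties broken alphabetically).
    vocabulary = sorted(sorted(doc_freq), key=lambda t: -doc_freq[t])
    vocabulary = vocabulary[:max_dimension]
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            if tokens else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors


definitions = [
    "a protein that binds glucose",
    "a virus that infects cells",
    "a protein found in cells",
]
vocabulary, features = tfidf_encode(definitions, max_dimension=4)
```

Terms that occur in every text (here "a") get weight zero, while terms shared by only some texts get positive weights, which is what makes the resulting feature vectors discriminative.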

We then create an embedder object that can compute node embeddings for input graphs using the attri2vec node embedding technique.

D = 128
params = {
    "length": 5,
    "number_of_walks": 10,
    "epochs": 5,
    "embedding_dimension": D
}
attri2vec_embedder = StellarGraphNodeEmbedder(
    "attri2vec", feature_vector_prop="features", edge_weight="npmi", **params)

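The parameters length and number_of_walks above control random walks over the graph. As a rough sketch of such walk sampling (not StellarGraph's implementation), the hypothetical function sample_walks below starts number_of_walks uniform random walks from every node. It assumes that length counts the nodes in a walk and it ignores edge weights; both are simplifications made for this example.

```python
import random


def sample_walks(adjacency, length=5, number_of_walks=10, seed=0):
    """Sample uniform random walks from every node (illustrative only)."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(number_of_walks):
            walk = [start]
            while len(walk) < length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy undirected graph given as adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = sample_walks(toy_graph, length=5, number_of_walks=10)
```

Nodes that appear close together in such walks serve as training context for the embedding model.
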
Next, we create a similarity processor based on Faiss indices, which allows us to perform fast nearest-neighbor search over our embedding vectors. We set our similarity measure to cosine similarity.

Note: in the code below we use the SimilarityProcessor interface and not NodeSimilarityProcessor, which we used in previous tutorials. We use this lower-level interface because EmbeddingPipeline is designed to work with any embedding model (not only node embedding models).

similarity_processor = SimilarityProcessor(
    FaissSimilarityIndex(
        similarity="cosine", dimension=D, n_segments=5))
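
For intuition, the following brute-force sketch shows what a cosine-similarity nearest-neighbor query computes. It is not how Faiss works internally (Faiss uses optimized index structures), and the helper names cosine_similarity and nearest_neighbors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, k=2):
    """Return (index, similarity) pairs of the k most similar vectors."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: -pair[0])
    return [(i, score) for score, i in scored[:k]]


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest_neighbors([1.0, 0.0], points, k=2)
```

The scan above costs time linear in the number of stored vectors per query; a dedicated similarity index is used precisely to avoid this cost on large embedding tables.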

And finally we create a pipeline object that stacks all the above-mentioned elements.

attri2vec_pipeline = EmbeddingPipeline(
    preprocessor=definition_encoder,
    embedder=attri2vec_embedder,
    similarity_processor=similarity_processor)

Now, let us load the training graph from the provided example dataset.

graph = PandasPGFrame.load_json("../data/cooccurrence_graph.json")

We run the fitting process, which, given the input data, performs the following steps:

1. fits the encoder
2. transforms the data
3. fits the embedder
4. produces the embedding table
5. fits the similarity index
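
The order of these five steps can be sketched with stand-in components. Everything in this block is hypothetical: RecordingComponent and run_fitting_sketch are not part of bluegraph, and the method names fit, transform, embed and add are assumptions used only to show the flow of data.

```python
class RecordingComponent:
    """Hypothetical stand-in for an encoder, embedder or index.

    It does no real work and only logs the calls it receives.
    """

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def fit(self, data):
        self.log.append(f"fit {self.name}")

    def transform(self, data):
        self.log.append(f"transform with {self.name}")
        return data

    def embed(self, data):
        self.log.append(f"embed with {self.name}")
        # Toy one-dimensional "embedding": the length of each item.
        return [[float(len(item))] for item in data]

    def add(self, vectors):
        self.log.append(f"add {len(vectors)} vectors to {self.name}")


def run_fitting_sketch(preprocessor, embedder, index, data):
    preprocessor.fit(data)                      # 1. fit the encoder
    transformed = preprocessor.transform(data)  # 2. transform the data
    embedder.fit(transformed)                   # 3. fit the embedder
    vectors = embedder.embed(transformed)       # 4. produce the embedding table
    index.add(vectors)                          # 5. fit the similarity index
    return vectors


log = []
vectors = run_fitting_sketch(
    RecordingComponent("encoder", log),
    RecordingComponent("embedder", log),
    RecordingComponent("index", log),
    data=["a", "bb", "ccc"])
```

After the call, log lists the steps in the order they ran, and each step consumes the output of the previous one.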

attri2vec_pipeline.run_fitting(graph)

We can save our pipeline to the file system as follows:

attri2vec_pipeline.save(
    "../data/attri2vec_test_model",
    compress=True)

We can run prediction on unseen graph nodes with our pipeline as follows (in this case we reuse the same graph). As output, we obtain the embedding vectors produced by the model.

vectors = attri2vec_pipeline.run_prediction(graph)

Example 2: creating a manually trained pipeline

In the previous example we used FaissSimilarityIndex as the backend for our nearest-neighbor search. Faiss indices are updatable and allow us to add new points to the index at any time. Therefore, we were able to create an ‘untrained’ pipeline stacking a preprocessor, an embedder and an empty similarity index. We then ran all the training steps at once using run_fitting. As a result, the vectors output by the embedder were added to the index as soon as they were produced.

However, in some cases similarity indices are static, and the set of vectors on which they are built must be provided at creation time. Consider the following example.

We would like to use the BallTree index implemented in scikit-learn and provided by bluegraph’s ScikitLearnSimilarityIndex. In the cell below we try to initialize this index without the initial vectors on which it must be built.

try:
    sklearn_similarity_processor = SimilarityProcessor(
        ScikitLearnSimilarityIndex(
            similarity="poincare", dimension=D,
            index_type="ballktree", leaf_size=10)
    )
except SimilarityIndex.SimilarityException as e:
    print("Caught the following error: ")
    print(e)
Caught the following error:
Initial vectors must be provied (scikit learn indices are not updatable)
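
The difference between the two kinds of index can be illustrated with two toy classes. This is a sketch under stated assumptions, not bluegraph code: UpdatableIndex, StaticIndex and ToyIndexException are hypothetical names, and the static class simply imitates the behavior seen above, where an index built once from its data (as scikit-learn's BallTree is) refuses to start empty.

```python
class ToyIndexException(Exception):
    """Hypothetical error type for this example."""


class UpdatableIndex:
    """Can start empty and accept new vectors later."""

    def __init__(self, dimension, initial_vectors=None):
        self.dimension = dimension
        self.vectors = list(initial_vectors or [])

    def add(self, vectors):
        self.vectors.extend(vectors)


class StaticIndex:
    """Must be built from all of its vectors at creation time."""

    def __init__(self, dimension, initial_vectors=None):
        if initial_vectors is None:
            raise ToyIndexException(
                "Initial vectors must be provided (this index is not updatable)")
        self.dimension = dimension
        self.vectors = tuple(tuple(v) for v in initial_vectors)

    def add(self, vectors):
        raise ToyIndexException("This index is static and cannot be extended")


# An updatable index can be created empty and filled afterwards.
empty_index = UpdatableIndex(dimension=2)
empty_index.add([[0.0, 1.0]])

# A static index cannot be created without vectors.
try:
    StaticIndex(dimension=2)
    error_message = None
except ToyIndexException as error:
    error_message = str(error)

# With the vectors available up front, creation succeeds.
static_index = StaticIndex(
    dimension=2, initial_vectors=[[0.0, 1.0], [1.0, 0.0]])
```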

This means that we cannot create an initially empty similarity index and let our pipeline fill it with vectors once the embedder has output them. What we can do instead is run encoding and embedding manually, as follows:

transformed_graph = definition_encoder.fit_transform(graph)
embedding = attri2vec_embedder.fit_model(transformed_graph)
link_classification: using 'ip' method to combine node embeddings into edge embeddings

We can now create a similarity index on the produced embedding vectors.

sklearn_similarity_processor = SimilarityProcessor(
    ScikitLearnSimilarityIndex(
        similarity="poincare", dimension=D,
        initial_vectors=embedding["embedding"].tolist(),
        index_type="ballktree", leaf_size=10))

And, finally, we stack our steps into a pipeline that can be dumped and re-used as in the previous example.

attri2vec_sklearn_pipeline = EmbeddingPipeline(
    preprocessor=definition_encoder,
    embedder=attri2vec_embedder,
    similarity_processor=sklearn_similarity_processor)