CORD-19 data preparation helpers

Module containing utilities for COVID-19 network generation and analysis.

cord19kg.utils.aggregate_cord_entities(x, factors)

Aggregate a collection of entity mentions.

Entity types are aggregated as lists (to preserve the multiplicity, e.g. how many times a given entity was recognized as a particular type). The rest of the input occurrence factors are aggregated as sets (e.g. sets of unique papers/sections/paragraphs).
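
Example (a hedged sketch; the per-entity grouping shown here is an assumption about the calling convention, which higher-level helpers such as generate_curation_table normally handle):

    import pandas as pd
    from cord19kg.utils import aggregate_cord_entities

    # Hypothetical mentions table: one row per entity mention.
    mentions = pd.DataFrame({
        "entity": ["ace2", "ace2", "tnf"],
        "entity_type": ["PROTEIN", "GENE", "PROTEIN"],
        "paper": ["paper1", "paper2", "paper1"],
    })

    # Aggregate per entity: entity types as lists, factors as sets.
    aggregated = mentions.groupby("entity").apply(
        lambda x: aggregate_cord_entities(x, ["paper"]))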

cord19kg.utils.clean_up_entity(s)

Clean up an entity by removing common errors from NER.
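
Example (illustrative only; the exact clean-up rules are implementation-specific):

    from cord19kg.utils import clean_up_entity

    # Hypothetical raw NER output with whitespace and punctuation noise.
    cleaned = clean_up_entity("  the ACE2 receptor)\n")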

cord19kg.utils.generate_cooccurrence_analysis(occurrence_data, factor_counts, type_data=None, min_occurrences=1, n_most_frequent=None, keep=None, factors=None, cores=8, graph_dump_prefix=None, communities=True, remove_zero_mi=False, backend_configs=None, community_strategy='louvain')

Generate co-occurrence analysis.

This utility executes the entire co-occurrence analysis pipeline: it generates co-occurrence networks based on the input factors, computes various co-occurrence statistics (frequency and mutual-information-based scores) as edge attributes, and computes various node centrality measures and node communities (attaching them as node attributes of the generated networks). Finally, it computes minimum spanning trees using the mutual-information-based distance scores (1 / NPMI). The resulting graph objects can optionally be dumped using a pickle representation.

Parameters:
  • occurrence_data (pd.DataFrame) – Input occurrence data table. Rows represent unique entities (indexed by entity names); columns contain sets of aggregated occurrence factors (e.g. sets of papers/sections/paragraphs where the given term occurs).

  • factor_counts (dict) – Dictionary whose keys are factor column names (i.e. "paper"/"section"/"paragraph") and whose values are counts of unique factor instances (e.g. the total number of papers/sections/paragraphs in the dataset).

  • type_data (pd.DataFrame, optional) – Table containing node types (these types are saved as node attributes).

  • min_occurrences (int, optional) – Minimum co-occurrence frequency required to add an edge to the co-occurrence network. By default, every non-zero co-occurrence frequency yields an edge in the resulting network.

  • n_most_frequent (int, optional) – Number of the most frequent entities to include in the co-occurrence network. Not set by default, in which case all the terms from the occurrence table are included.

  • keep (iterable, optional) – Collection of entities to keep even if they are not among the n_most_frequent entities.

  • factors (iterable, optional) – Set of factors to use for constructing co-occurrence networks (a network per factor is produced).

  • cores (int, optional) – Number of cores to use during the parallel network generation.

  • graph_dump_prefix (str, optional) – Path prefix for dumping the generated networks (the edge list, edge attributes, node list and node attributes are saved).

  • communities (bool, optional) – Flag indicating whether the community detection should be included in the analysis. By default True.

  • remove_zero_mi (bool, optional) – Flag indicating whether edges with zero mutual-information scores (PPMI and NPMI) should be removed from the network (this helps to sparsify the network; however, it may result in isolated nodes of high occurrence frequency).

Returns:

  • graphs (dict of nx.DiGraph) – Dictionary whose keys are factor names and whose values are generated co-occurrence networks.

  • trees (dict of nx.DiGraph) – Dictionary whose keys are factor names and whose values are minimum spanning trees of generated co-occurrence networks.
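
Example (a sketch of a typical invocation; occurrence_data and counts are assumed to come from generate_curation_table, documented below, and all parameter values and paths are illustrative):

    from cord19kg.utils import generate_cooccurrence_analysis

    graphs, trees = generate_cooccurrence_analysis(
        occurrence_data,          # aggregated occurrence table
        counts,                   # counts of unique factor instances
        n_most_frequent=500,      # keep only the 500 most frequent terms
        keep=["glucose"],         # always keep these entities
        factors=["paper"],        # build one network per listed factor
        cores=4,
        graph_dump_prefix="output/paper",  # hypothetical dump prefix
        communities=True)

    paper_graph = graphs["paper"]  # co-occurrence network for the "paper" factor
    paper_tree = trees["paper"]    # its minimum spanning tree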

cord19kg.utils.generate_curation_table(data)

Generate the curation table from raw co-occurrence data.

This function converts CORD-19 entity mentions into a dataframe indexed by unique entities. Each row contains aggregated data for different occurrence factors (i.e. papers/sections/paragraphs where the given term was mentioned).

Parameters:

data (pd.DataFrame) – Dataframe containing occurrence data with the following columns: entity, entity_type, occurrence (occurrence in a paragraph identified by a string of the format <paper_id>:<section_id>:<paragraph_id>).

Returns:

  • result_data (pd.DataFrame) – Dataframe indexed by distinct terms whose columns contain the aggregated occurrences of terms (e.g. for each term, the sets of papers/sections/paragraphs where it occurs).

  • counts (dict) – Dictionary whose keys are factor column names (i.e. "paper"/"section"/"paragraph") and whose values are counts of unique factor instances (e.g. the total number of papers/sections/paragraphs in the dataset).
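
Example (a minimal sketch with a toy mentions dataframe; all values are hypothetical):

    import pandas as pd
    from cord19kg.utils import generate_curation_table

    data = pd.DataFrame({
        "entity": ["ace2", "ace2", "tnf"],
        "entity_type": ["PROTEIN", "GENE", "PROTEIN"],
        "occurrence": ["p1:intro:0", "p2:results:3", "p1:intro:0"],
    })

    result_data, counts = generate_curation_table(data)
    # result_data is indexed by the distinct terms "ace2" and "tnf";
    # counts contains, e.g., the number of unique papers ({"p1", "p2"} -> 2).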

cord19kg.utils.has_min_length(entities, length)

Check if a term has the minimum required length.

cord19kg.utils.is_experiment_related(title)

Check if the title is experiment-related.

cord19kg.utils.link_ontology(linking, type_mapping, curated_table)

Merge the input occurrence table with the ontology linking.

Parameters:
  • linking (pd.DataFrame) – Dataframe containing the linking data with the following columns: mention (raw entity given by the NER model), concept (linked ontology term), uid (ID of the linked term in NCIT), definition (definition of the linked term), taxonomy (a list containing the uid’s and names of the parent ontology classes of the term).

  • type_mapping (dict) – Mapping whose keys are the type names to be used and whose values are dictionaries with two keys, include and exclude, specifying the NCIT ontology classes to include and exclude, respectively, when assigning the given type.

  • curated_table (pd.DataFrame) – Input occurrence data table. Rows represent unique entities (indexed by entity names); columns contain sets of aggregated occurrence factors (e.g. sets of papers/sections/paragraphs where the given term occurs) and raw entity types (given by the NER model).

Returns:

linked_table – The resulting table after grouping synonymous entities according to the ontology linking. The table is indexed by unique linked entities and contains the following columns: paper, section, paragraph (aggregated factors where the term occurs), aggregated_entities (set of raw entities linked to the given term), uid (unique identifier in NCIT, if available), definition (definition of the term in NCIT), paper_frequency (number of unique papers where the term is mentioned), entity_type (a unique entity type per entity, resolved using the ontology linking data, i.e. the hierarchy, or taxonomy, of NCIT classes, according to the input type mapping).

Return type:

pd.DataFrame
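
Example (a hedged sketch; linking and curated_table are assumed to follow the layouts described above, and the NCIT class names in the type mapping are illustrative assumptions):

    from cord19kg.utils import link_ontology

    type_mapping = {
        "Protein": {"include": ["Protein"], "exclude": []},
        "Chemical": {
            "include": ["Drug, Food, Chemical or Biomedical Material"],
            "exclude": [],
        },
    }

    linked_table = link_ontology(linking, type_mapping, curated_table)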

cord19kg.utils.mentions_to_occurrence(raw_data, term_column='entity', factor_columns=None, term_cleanup=None, term_filter=None, mention_filter=None, aggregation_function=None, dump_prefix=None)

Convert raw mentions data into occurrence data.

This function converts entity mentions into a dataframe indexed by unique entities. Each row contains aggregated data for different occurrence factors (sets of factor instances where the given term occurs, e.g. sets of papers/sections/paragraphs where the given term was mentioned).

Parameters:
  • raw_data (pandas.DataFrame) – Dataframe containing occurrence data with the following columns: one column for terms and one or more columns for occurrence factors (e.g. the paper, section or paragraph of a term occurrence).

  • term_column (str) – Name of the column containing terms.

  • factor_columns (collection of str) – Set of column names containing occurrence factors (e.g. “paper”/”section”/”paragraph”).

  • term_cleanup (func, optional) – A clean-up function to be applied to every term.

  • term_filter (func, optional) – A filter function to apply to terms (e.g. include only terms with two or more symbols).

  • mention_filter (func, optional) – A filter function to apply to occurrence factors (e.g. filter out all the occurrences in sections called “Methods”).

  • aggregation_function (func, optional) – Function to be applied to aggregated occurrence factors. By default, the constructor of set.

  • dump_prefix (str, optional) – Prefix to use for dumping the resulting occurrence dataset.

Returns:

  • occurrence_data (pd.DataFrame) – Dataframe indexed by distinct terms whose columns contain the aggregated occurrences of terms (e.g. for each term, the sets of papers/sections/paragraphs where it occurs).

  • factor_counts (dict) – Dictionary whose keys are factor column names (e.g. "paper"/"section"/"paragraph") and whose values are counts of unique factor instances (e.g. the total number of papers/sections/paragraphs in the dataset).
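
Example (a possible configuration reusing the helpers documented in this module; raw_data is a mentions dataframe as described above and the dump prefix is hypothetical):

    from cord19kg.utils import (
        clean_up_entity, has_min_length, mentions_to_occurrence)

    occurrence_data, factor_counts = mentions_to_occurrence(
        raw_data,
        term_column="entity",
        factor_columns=["paper", "section", "paragraph"],
        term_cleanup=clean_up_entity,                      # normalize raw terms
        term_filter=lambda term: has_min_length(term, 3),  # drop very short terms
        dump_prefix="output/occurrence")                   # hypothetical prefix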

cord19kg.utils.merge_attrs(source_attrs, collection_of_attrs, attr_resolver, attrs_to_ignore=None)

Merge a collection of attribute dictionaries into the source dictionary using the input resolvers.

Parameters:
  • source_attrs (dict) – Source dictionary of attributes (the other attributes will be merged into it and a new object will be returned).

  • collection_of_attrs (iterable of dict) – Collection of dictionaries to merge into the source dictionary.

  • attr_resolver (dict) – Dictionary of attribute resolvers: its keys are attribute names and its values are functions applied to the set of values of an attribute in order to resolve it to a single value.

  • attrs_to_ignore (iterable, optional) – Set of attributes to ignore (they will not be included in the merged node or in the edges incident to it).
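
Example (a small illustration of the resolver mechanism; the attribute names and resolvers are made up for the example):

    from cord19kg.utils import merge_attrs

    source = {"paper_frequency": 10, "entity_type": "Protein"}
    others = [
        {"paper_frequency": 4, "entity_type": "Chemical"},
        {"paper_frequency": 2, "entity_type": "Protein"},
    ]

    merged = merge_attrs(
        source, others,
        attr_resolver={
            "paper_frequency": sum,                           # total the frequencies
            "entity_type": lambda values: sorted(values)[0],  # deterministic pick
        })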

cord19kg.utils.merge_nodes(graph_processor, nodes_to_merge, new_name=None, attr_resolver=None)

Merge the input set of nodes.

Parameters:
  • graph_processor (GraphProcessor) – Input graph processor object.

  • nodes_to_merge (iterable) – Collection of node IDs to merge.

  • new_name (str, optional) – New name to use for the merged node.

  • attr_resolver (dict, optional) – Dictionary of attribute resolvers: its keys are attribute names and its values are functions applied to the set of values of an attribute in order to resolve it to a single value.

Returns:

graph – Resulting graph (a reference to the input graph, if copy is False, or a new object, if copy is True).

Return type:

nx.Graph
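
Example (a hedged sketch; processor is assumed to be a GraphProcessor wrapping one of the generated co-occurrence networks, and the node names are hypothetical duplicates):

    from cord19kg.utils import merge_nodes

    graph = merge_nodes(
        processor,
        nodes_to_merge=["sars-cov-2", "2019-ncov"],  # hypothetical synonyms
        new_name="sars-cov-2",
        attr_resolver={
            "paper_frequency": sum,
            "entity_type": lambda values: sorted(values)[0],
        })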

cord19kg.utils.merge_with_ontology_linking(occurence_data, factor_columns, linking_df=None, linking_path=None, linked_occurrence_data_path=None)

Merge occurrence data with ontology linking data.
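
Example (a sketch assuming an in-memory linking table and that the merged table is returned; all paths are illustrative):

    from cord19kg.utils import merge_with_ontology_linking

    linked = merge_with_ontology_linking(
        occurrence_data,                     # aggregated occurrence table
        factor_columns=["paper", "section", "paragraph"],
        linking_df=linking,                  # ontology linking dataframe
        linked_occurrence_data_path="output/linked_occurrence.pkl")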

cord19kg.utils.prepare_occurrence_data(mentions_df=None, mentions_path=None, occurrence_data_path=None, factor_counts_path=None)

Prepare mentions data for the co-occurrence analysis and optionally dump the results.

This function converts CORD-19 entity mentions into a dataframe indexed by unique entities. Each row contains aggregated data for different occurrence factors (sets of factor instances where the given term occurs, e.g. sets of papers/sections/paragraphs where the given term was mentioned).

Parameters:
  • mentions_df (pd.DataFrame, optional) – Dataframe containing occurrence data with the following columns: entity, entity_type, occurrence (occurrence in a paragraph identified by a string of the format <paper_id>:<section_id>:<paragraph_id>). If not specified, the mentions_path argument is used to load the mentions file.

  • mentions_path (str, optional) – Path to a pickle file containing occurrence data of the shape described above.

  • occurrence_data_path (str, optional) – Path to write the resulting aggregated occurrence data.

  • factor_counts_path (str, optional) – Path to write the dictionary containing counts of different occurrence factors (papers, sections, paragraphs).

Returns:

  • occurrence_data (pd.DataFrame) – Dataframe indexed by distinct terms whose columns contain the aggregated occurrences of terms (e.g. for each term, the sets of papers/sections/paragraphs where it occurs).

  • counts (dict) – Dictionary whose keys are factor column names (i.e. "paper"/"section"/"paragraph") and whose values are counts of unique factor instances (e.g. the total number of papers/sections/paragraphs in the dataset).
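
Example (a sketch of a file-based invocation; all paths are hypothetical):

    from cord19kg.utils import prepare_occurrence_data

    occurrence_data, counts = prepare_occurrence_data(
        mentions_path="data/mentions.pkl",
        occurrence_data_path="output/occurrence_data.pkl",
        factor_counts_path="output/factor_counts.pkl")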

cord19kg.utils.resolve_taxonomy_to_types(occurrence_data, mapping)

Assign entity types from hierarchies of NCIT classes.

This function assigns a unique entity type to every entity using the ontology linking data (hierarchy, or taxonomy, of NCIT classes) according to the input type mapping. If a term was not linked, i.e. does not have such a taxonomy attached, the raw entity types from the NER model are used (a unique entity type is chosen by majority vote).

Parameters:
  • occurrence_data (pd.DataFrame) – Input occurrence data table. Rows represent unique entities (indexed by entity names); the table contains the following columns: taxonomy (a list containing a hierarchy of NCIT ontology classes of the given entity), raw_entity_types (a list of raw entity types provided by the NER model).

  • mapping (dict) – Mapping whose keys are the type names to be used and whose values are dictionaries with two keys, include and exclude, specifying the NCIT ontology classes to include and exclude, respectively, when assigning the given type.

Returns:

type_data – Dataframe indexed by unique entities, containing the column type specifying the assigned types.

Return type:

pd.DataFrame
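
Example (an illustrative mapping; the NCIT class names are assumptions made for the example):

    from cord19kg.utils import resolve_taxonomy_to_types

    mapping = {
        "Protein": {"include": ["Protein"], "exclude": []},
        "Disease": {"include": ["Disease or Disorder"], "exclude": []},
    }

    # occurrence_data must contain the "taxonomy" and "raw_entity_types"
    # columns described above.
    type_data = resolve_taxonomy_to_types(occurrence_data, mapping)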