Developer Interface

This part of the documentation covers the public interface of charnetto.

Book

charnetto.extract_flair_df(lines, tagger)

Apply Flair NER to the lines of a text and generate a dataframe of all entities.

Each line in the file should correspond to a paragraph for the NER to work optimally.

The tagger depends on the language of the text. For example, for an English text, use SequenceTagger.load('ner'). See the Flair documentation for the complete list of available taggers.

Columns of the dataframe:

  • name : text of the entity
  • start_pos : starting position of the entity in the text (in characters)
  • end_pos : ending position of the entity in the text (in characters)
  • tag : category attributed by Flair to the entity
  • score : confidence in the tag (attributed by Flair)
  • block : id of the line containing the entity

Parameters
  • lines (list) – List of lines from the text (can be extracted with get_lines(path))

  • tagger (object) – Flair tagger, depends on the language of the text.

charnetto.extract_spacy_df(lines, tagger)

Apply spaCy NER to the lines of a text and generate a dataframe of all entities.

Each line in the file should correspond to a paragraph for the NER to work optimally.

The tagger depends on the language of the text. For example, for an English text, use spacy.load("en_core_web_lg"). See the spaCy documentation for the complete list of available models.

Columns of the dataframe:

  • name : text of the entity
  • start_pos : starting position of the entity in the text (in characters)
  • end_pos : ending position of the entity in the text (in characters)
  • tag : category attributed by spaCy to the entity
  • score : 1 (Flair provides a confidence but spaCy does not, so a default value of 1 is used)
  • block : id of the line containing the entity

Parameters
  • lines (list) – List of lines from the text (can be extracted with get_lines(path))

  • tagger (object) – spaCy tagger, depends on the language of the text.

charnetto.unify_tags(spacy_df)

Unify the tags attributed by spaCy to match those of Flair.
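
As an illustration of what such a unification can look like, here is a minimal sketch. spaCy's English models use labels such as PERSON and GPE, while Flair's 'ner' model uses PER, LOC, ORG and MISC. The mapping table, the unify_tag helper and the MISC fallback are assumptions made for this example; the mapping charnetto actually applies may differ.

```python
# Hypothetical mapping from spaCy English labels to Flair-style tags.
# The mapping charnetto actually applies may differ.
SPACY_TO_FLAIR = {
    "PERSON": "PER",
    "GPE": "LOC",
    "LOC": "LOC",
    "ORG": "ORG",
}


def unify_tag(spacy_tag, default="MISC"):
    """Return a Flair-style tag for a spaCy label, falling back to `default`."""
    return SPACY_TO_FLAIR.get(spacy_tag, default)


rows = [
    {"name": "Mark", "tag": "PERSON"},
    {"name": "Paris", "tag": "GPE"},
    {"name": "Monday", "tag": "DATE"},
]
# Replace each spaCy label by its unified counterpart.
unified = [{**row, "tag": unify_tag(row["tag"])} for row in rows]
```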

charnetto.extract_manual_df(lines)

Extract manual annotations (written in Markdown syntax) and generate a dataframe of all entities.

The following syntax is expected in the file:

[Mark](PER) is outside.

The entity is put in square brackets [], and the tag is given in parentheses () just after the entity. This is the syntax used for URLs in Markdown, which allows you to use a Markdown editor and put the tags as hyperlinks on the entities.

Columns of the output dataframe:

  • name : text of the entity
  • start_pos : starting position of the entity in the text (in characters)
  • end_pos : ending position of the entity in the text (in characters)
  • tag : category attributed to the entity
  • score : 1 (only Flair provides a confidence, so a default value of 1 is used to unify the outputs)
  • block : id of the line containing the entity

Parameters

lines (list) – List of lines from the text (can be extracted with get_lines(path))
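
The annotation syntax described above can be parsed with a single regular expression. The following is a minimal sketch, not charnetto's implementation: the parse_annotations helper is hypothetical, it returns plain dictionaries instead of a dataframe, and it assumes that positions are character offsets within the raw annotated line.

```python
import re

# Matches "[entity](TAG)", the Markdown link syntax used for annotations.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotations(lines):
    """Collect one record per annotated entity found in `lines`."""
    records = []
    for block, line in enumerate(lines):
        for match in ANNOTATION.finditer(line):
            records.append({
                "name": match.group(1),
                # Assumption: offsets are measured in the raw annotated line.
                "start_pos": match.start(1),
                "end_pos": match.end(1),
                "tag": match.group(2),
                "score": 1,  # default score, as for the other manual outputs
                "block": block,
            })
    return records


records = parse_annotations(["[Mark](PER) is outside."])
```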

Movie

charnetto.extract_movie_df(text)

Extract entities in a movie script and load them in a dataframe.

The extraction is based on a regex and a blacklist for tokens which should not appear in a character name. Therefore, the script should look like those of IMSDB.
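
To make the regex-plus-blacklist idea concrete, here is a minimal sketch under stated assumptions. The CUE pattern, the BLACKLIST contents and the find_character_cues helper are all hypothetical and only illustrate the approach on a script where character cues are written in capitals on their own line; the pattern and blacklist used by charnetto may differ.

```python
import re

# Hypothetical pattern: a line made only of capitals, spaces, dots,
# apostrophes and hyphens, starting with a capital letter.
CUE = re.compile(r"^\s*([A-Z][A-Z .'\-]*[A-Z.])\s*$")
# Hypothetical blacklist of tokens that mark scene headings or transitions.
BLACKLIST = {"INT.", "EXT.", "CUT", "FADE", "CONTINUED"}


def find_character_cues(script):
    """Return the capitalised cue lines that contain no blacklisted token."""
    cues = []
    for line in script.splitlines():
        match = CUE.match(line)
        if not match:
            continue
        name = match.group(1)
        if any(token in BLACKLIST for token in name.split()):
            continue
        cues.append(name)
    return cues


script = """INT. KITCHEN - DAY

          MARK
     Where is she?

          ANNA
     Outside.

CUT TO:
"""
cues = find_character_cues(script)
```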

Columns of the output dataframe:

name : text of the entity start_pos : starting position of the entity in the text (in characters) end_pos : ending position of the entity in the text (in characters) tag : category attributed to the entity (here always ‘PER’) score : 1 (the confidence is given by Flair, so we choose a default value of 1 to unify the outputs) block : id of the scene containing the entity

Parameters

text (str) – Text of the film script (for example, the content of a .txt file).

Characters

charnetto.concatenate_parents(df, min_occ)

Unify the entities and map those related to the same character.

Tag unification: the function groups occurrences by name, counts the different tags associated with the same entity, and stores the most frequent tag for that entity in a new column "utag".
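
The majority vote described above can be sketched with the standard library. This is an illustration rather than charnetto's code: the unify_by_majority helper is hypothetical, it works on plain dictionaries instead of a dataframe, and the way ties are resolved (first tag encountered) is an assumption.

```python
from collections import Counter, defaultdict


def unify_by_majority(rows):
    """Add a "utag" key holding the most frequent tag seen for each name."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["name"]][row["tag"]] += 1
    # most_common(1) returns the tag with the highest count for that name.
    majority = {name: c.most_common(1)[0][0] for name, c in counts.items()}
    return [{**row, "utag": majority[row["name"]]} for row in rows]


rows = [
    {"name": "Paris", "tag": "LOC"},
    {"name": "Paris", "tag": "PER"},
    {"name": "Paris", "tag": "LOC"},
    {"name": "Mark", "tag": "PER"},
]
unified = unify_by_majority(rows)
```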

Graph creation: the function creates links between entities if one is embedded in another, then builds a graph to show the links between the entities. A list of aliases is then generated based on the hierarchy in the directed graph.
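
The linking step can be pictured as follows. In this sketch, a name is considered embedded in another when its tokens appear contiguously inside the longer name; this token-level test and the helper names is_embedded and link_embedded are assumptions, and the result is a plain dictionary of directed edges rather than the graph object charnetto builds.

```python
def is_embedded(short, long):
    """True when the tokens of `short` appear contiguously inside `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def link_embedded(names):
    """Map each name to the longer names that contain it (directed edges)."""
    return {
        name: [other for other in names if is_embedded(name, other)]
        for name in names
    }


names = ["Mark", "Mark Smith", "Smith", "Anna"]
links = link_embedded(names)
```

Here "Mark" and "Smith" both point to "Mark Smith", so they could be listed as aliases of that character, while "Anna" has no link and stays on her own.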

Parameters
  • df (pandas.DataFrame) – The dataframe of all named entity occurrences.

  • min_occ (int) – Minimum number of occurrences required for an entity to be kept.

charnetto.write_charlist(output_path, concat_char_list)

Save the concatenated list in a structured text file.

charnetto.highlight(path, df_path)

Network

charnetto.load_char_list(path)

Load hierarchical list of characters and aliases.

Parameters

path (str) – The path of the clean list of desired characters.

charnetto.find_pairs(char_df, char_list, blocks_size)

Find cooccurrences over the entire dataframe.

Parameters
  • char_df (pandas.DataFrame) – The dataframe of all named entity occurrences.

  • char_list (list) – The clean list of all desired characters.

  • blocks_size (int) – Size of each batch (of paragraphs or scenes) to find cooccurrences.
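
One way to picture the role of blocks_size is the sketch below. It is not charnetto's implementation: the cooccurrences helper is hypothetical, the input is a list of (block, character) tuples instead of a dataframe, and it assumes non-overlapping batches in which each pair is counted once per batch.

```python
from itertools import combinations


def cooccurrences(mentions, blocks_size):
    """Return sorted character pairs for each batch of `blocks_size` blocks.

    `mentions` is a list of (block, character) tuples.
    """
    batches = {}
    for block, character in mentions:
        # Blocks 0..blocks_size-1 form batch 0, the next ones batch 1, etc.
        batches.setdefault(block // blocks_size, set()).add(character)
    pairs = []
    for batch in sorted(batches):
        pairs.extend(combinations(sorted(batches[batch]), 2))
    return pairs


mentions = [(0, "Mark"), (1, "Anna"), (2, "Mark"), (3, "Paul"), (3, "Anna")]
pairs = cooccurrences(mentions, blocks_size=2)
```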

charnetto.create_charnet(pairs, min_pairs)

Create a character network based on a list of pairs.

The pairs represent cooccurrences within the text. A threshold can be set so that only pairs that cooccur at least min_pairs times are kept.

Parameters
  • pairs (list of lists) – The list of cooccurrences between two characters.

  • min_pairs (int) – Minimum number of occurrences of the same pair.
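
The thresholding can be sketched as a simple count over unordered pairs. The weighted_edges helper is hypothetical and returns a dictionary of edge weights; the type of network object that create_charnet returns is not specified here, so this only illustrates the filtering step.

```python
from collections import Counter


def weighted_edges(pairs, min_pairs):
    """Count each unordered pair and keep those seen at least `min_pairs` times."""
    counts = Counter(tuple(sorted(pair)) for pair in pairs)
    return {pair: n for pair, n in counts.items() if n >= min_pairs}


pairs = [["Anna", "Mark"], ["Mark", "Anna"], ["Anna", "Paul"]]
edges = weighted_edges(pairs, min_pairs=2)
```

With min_pairs=2, only the Anna and Mark edge survives, with a weight of 2; the single Anna and Paul cooccurrence is dropped.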