Extractive API Reference

Model/Module

class extractive.ExtractiveSummarizer(hparams, embedding_model_config=None, classifier_obj=None)[source]

A machine learning model that extractively summarizes an input text by scoring its sentences. This is the main class: it handles data loading, initial processing, and training/testing/validation setup, and it contains the actual model.

static add_model_specific_args(parent_parser)[source]

Arguments specific to this model

compute_loss(outputs, labels, mask)[source]

Compute the loss between model outputs and ground-truth labels.

Parameters
  • outputs (torch.Tensor) – Output sentence scores obtained from forward()

  • labels (torch.Tensor) – Ground-truth labels (1 for sentences that should be in the summary, 0 otherwise) from the batch generated during the data preprocessing stage.

  • mask (torch.Tensor) – Mask returned by forward(), either sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.

Returns

Losses: (total_loss, total_norm_batch_loss, sum_avg_seq_loss, mean_avg_seq_loss, average_loss)

Return type

tuple
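
The mask zeroes out padded sentence positions so they do not contribute to the loss. A minimal sketch of that idea (illustrative only, not the library's actual reduction code; shapes are assumed to be (batch_size, num_sentences)):

    import torch
    import torch.nn.functional as F

    # Assumed shapes: (batch_size, num_sentences)
    outputs = torch.rand(2, 4)                      # sentence scores in [0, 1]
    labels = torch.tensor([[1., 0., 1., 0.],
                           [0., 1., 0., 0.]])
    mask = torch.tensor([[1., 1., 1., 0.],          # 0 marks padded sentence slots
                         [1., 1., 0., 0.]])

    # Element-wise loss, then zero out padded positions before reducing.
    loss = F.binary_cross_entropy(outputs, labels, reduction="none") * mask
    total_loss = loss.sum()
    average_loss = loss.sum() / mask.sum()          # mean over real sentences only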

configure_optimizers()[source]

Configure the optimizers. Returns the optimizer and scheduler specified by the values in self.hparams.

forward(input_ids, attention_mask, sent_rep_mask=None, token_type_ids=None, sent_rep_token_ids=None, sent_lengths=None, sent_lengths_mask=None, **kwargs)[source]

Model forward function. See the PyTorch 60 Minute Blitz tutorial if you are unsure what a forward function is.

Parameters
  • input_ids (torch.Tensor) – Indices of input sequence tokens in the vocabulary. What are input IDs?

  • attention_mask (torch.Tensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens. What are attention masks?

  • sent_rep_mask (torch.Tensor, optional) – Indicates which numbers in sent_rep_token_ids are actually the locations of sentence representation ids and which are padding. Defaults to None.

  • token_type_ids (torch.Tensor, optional) – Usually, segment token indices to indicate first and second portions of the inputs. However, for summarization they are used to indicate different sentences. Depending on the size of the token type id vocabulary, these values may alternate between 0 and 1 or they may increase sequentially for each sentence in the input. Defaults to None.

  • sent_rep_token_ids (torch.Tensor, optional) – The locations of the sentence representation tokens. Defaults to None.

  • sent_lengths (torch.Tensor, optional) – A list of the lengths of each sentence in input_ids. See data.pad_batch_collate() for more info about the generation of this feature. Defaults to None.

  • sent_lengths_mask (torch.Tensor, optional) – Created on-the-fly by data.pad_batch_collate(). Similar to sent_rep_mask: 1 for value and 0 for padding. See data.pad_batch_collate() for more info about the generation of this feature. Defaults to None.

Returns

Contains the sentence scores and mask as torch.Tensors. The mask is either the sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.

Return type

tuple

freeze_web_model()[source]

Freezes the encoder word_embedding_model

json_to_dataset(tokenizer, hparams, inputs=None, num_files=0, processor=None)[source]

Convert json output from convert_to_extractive.py to a “.pt” file containing lists or tensors using a data.SentencesProcessor. This function is run by prepare_data() in parallel.

Parameters
  • tokenizer (transformers.PreTrainedTokenizer) – Tokenizer used to convert examples to input_ids. Usually is self.tokenizer.

  • hparams (argparse.Namespace) – Hyper-parameters used to create the model. Usually is self.hparams.

  • inputs (tuple, optional) – (idx, json_file) Current loop index and path to json file. Defaults to None.

  • num_files (int, optional) – The total number of files to process. Used to display a nice progress indicator. Defaults to 0.

  • processor (data.SentencesProcessor, optional) – The data.SentencesProcessor object to convert the json file to usable features. Defaults to None.

predict(input_text: str, raw_scores=False, num_summary_sentences=3)[source]

Summarizes input_text using the model.

Parameters
  • input_text (str) – The text to be summarized.

  • raw_scores (bool, optional) – Return a list containing each sentence and its corresponding score instead of the summary. Defaults to False.

  • num_summary_sentences (int, optional) – The number of sentences in the output summary. This value specifies the number of top sentences to select as the summary. Defaults to 3.

Returns

The summary text. If raw_scores is set then returns a list of input sentences and their corresponding scores.

Return type

str
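
A usage sketch (the checkpoint path is hypothetical; ExtractiveSummarizer implements the PyTorch Lightning hooks documented above, so Lightning's standard load_from_checkpoint() is assumed here to restore a trained model):

    from extractive import ExtractiveSummarizer

    # Hypothetical checkpoint path produced by a prior training run.
    model = ExtractiveSummarizer.load_from_checkpoint("path/to/checkpoint.ckpt")

    text = (
        "The first sentence of a long document. A second sentence with more detail. "
        "A third sentence. A fourth sentence that probably will not make the cut."
    )

    summary = model.predict(text, num_summary_sentences=2)
    print(summary)

    # With raw_scores=True, each input sentence and its score is returned instead.
    scored_sentences = model.predict(text, raw_scores=True)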

predict_sentences(input_sentences: Union[List[str], generator], raw_scores=False, num_summary_sentences=3, tokenized=False)[source]

Summarizes input_sentences using the model.

Parameters
  • input_sentences (list or generator) – The sentences to be summarized as a list or a generator of spacy Spans (spacy.tokens.span.Span), which can be obtained by running nlp("input document").sents where nlp is a spacy model with a sentencizer.

  • raw_scores (bool, optional) – Return a list containing each sentence and its corresponding score instead of the summary. Defaults to False.

  • num_summary_sentences (int, optional) – The number of sentences in the output summary. This value specifies the number of top sentences to select as the summary. Defaults to 3.

  • tokenized (bool, optional) – Whether the input sentences are already tokenized using spacy. If True, input_sentences should be a list of lists where the outer list contains sentences and the inner lists contain tokens. Defaults to False.

Returns

The summary text. If raw_scores is set then returns a list of input sentences and their corresponding scores.

Return type

str
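
A sentence-level usage sketch with spaCy (the pipeline name "en_core_web_sm" and the checkpoint path are assumptions; any spaCy model with a sentencizer works, as noted above):

    import spacy

    from extractive import ExtractiveSummarizer

    model = ExtractiveSummarizer.load_from_checkpoint("path/to/checkpoint.ckpt")

    nlp = spacy.load("en_core_web_sm")  # any pipeline with a sentencizer
    doc = nlp(
        "The first sentence. The second sentence. The third sentence. The fourth one."
    )

    # doc.sents is a generator of spacy Spans, which predict_sentences() accepts.
    summary = model.predict_sentences(doc.sents, num_summary_sentences=2)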

prepare_data()[source]

Runs json_to_dataset() in parallel. json_to_dataset() is the function that actually loads and processes the examples as described below.

Algorithm: For each json file output by the convert_to_extractive.py script:

  1. Load the json file.

  2. Add each document in the json file to the SentencesProcessor defined in self.processor, overwriting any previous data in the processor.

  3. Run data.SentencesProcessor.get_features() to save the extracted features to disk as a .pt file containing a pickled python list of dictionaries, where each dictionary contains the extracted features.

Memory Usage Note: If sharding was turned off during the convert_to_extractive process, this function will run once, loading the entire dataset into memory to process, just like the convert_to_extractive.py script.

setup(stage)[source]

Download the word_embedding_model if the model will be trained.

test_dataloader()[source]

Create dataloader for testing.

test_epoch_end(outputs)[source]

Called at the end of a testing epoch (see the PyTorch Lightning documentation). Finds the mean of all the metrics logged by test_step().

test_step(batch, batch_idx)[source]

Test step (see the PyTorch Lightning documentation). Similar to validation_step() in that it runs the inputs through the model. However, this method also calculates the ROUGE scores for each example-summary pair.

train_dataloader()[source]

Create dataloader for training if it has not already been created.

training: bool
training_step(batch, batch_idx)[source]

Training step (see the PyTorch Lightning documentation).

unfreeze_web_model()[source]

Un-freezes the word_embedding_model

val_dataloader()[source]

Create dataloader for validation.

validation_epoch_end(outputs)[source]

Called at the end of a validation epoch (see the PyTorch Lightning documentation). Finds the mean of all the metrics logged by validation_step().

validation_step(batch, batch_idx)[source]

Validation step (see the PyTorch Lightning documentation). Similar to training_step() in that it runs the inputs through the model. However, this method also calculates accuracy and F1 score by marking every sentence score above 0.5 as 1 (meaning it should be in the summary) and every score below 0.5 as 0 (meaning it should not be in the summary).

extractive.longformer_modifier(final_dictionary)[source]

Creates the global_attention_mask for the longformer. Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for task-specific finetuning because it makes the model more flexible at representing the task. For example, for classification, the <s> token should be given global attention. For QA, all question tokens should also have global attention. For summarization, global attention is given to all of the <s> (RoBERTa ‘CLS’ equivalent) tokens. Please refer to the Longformer paper for more details. Mask values selected in [0, 1]: 0 for local attention, 1 for global attention.

Data

class data.FSDataset(files_list, shuffle=True, verbose=False)[source]
get_files_lengths(files_list)[source]
class data.FSIterableDataset(files_list, shuffle=True, verbose=False)[source]

A dataset that yields examples from a list of files, where each file is a saved Python object that can be iterated over. These files could be other PyTorch datasets (tested with TensorDataset) or other Python objects such as lists. Each file is loaded one at a time until all of its examples have been yielded, at which point the next file is loaded and used to yield examples, and so on. This means a large dataset can be broken into smaller chunks, and this class can be used to load samples as if those files were one dataset, while only using the RAM required for one chunk.

Explanation about batch_size and __len__(): If len() needs to be accurate, then batch_size must be specified when constructing objects of this class. PyTorch DataLoader objects will report accurate lengths by dividing the number of examples in the dataset by the batch size only if the dataset is not an IterableDataset. If the dataset is an IterableDataset, then a DataLoader will simply ask the dataset for its length, without dividing by the batch size, because in some cases the length of an IterableDataset might be difficult or impossible to determine. However, in this case the number of examples (the length of the dataset) is known. The division by batch size must therefore happen in the dataset (for datasets of type IterableDataset) since the DataLoader will not calculate it.
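
A usage sketch under the behaviour described above (the shard file names are hypothetical; pad_batch_collate() is documented later in this section):

    from torch.utils.data import DataLoader

    from data import FSIterableDataset, pad_batch_collate

    # Hypothetical shards produced by ExtractiveSummarizer.prepare_data()
    shards = ["data/train.0.pt", "data/train.1.pt", "data/train.2.pt"]

    dataset = FSIterableDataset(shards, verbose=True)
    loader = DataLoader(dataset, batch_size=16, collate_fn=pad_batch_collate)

    for batch in loader:
        ...  # each file is loaded only when its examples are needed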

class data.InputExample(text, labels, guid=None, target=None)[source]
to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.
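
A small sketch of constructing and serializing an example (the tokenized sentences, labels, and target are illustrative):

    from data import InputExample

    example = InputExample(
        text=[["the", "first", "sentence"], ["the", "second", "sentence"]],
        labels=[1, 0],  # 1 = sentence belongs in the summary
        target="The first sentence.",
    )

    as_dict = example.to_dict()
    as_json = example.to_json_string()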

class data.InputFeatures(input_ids, attention_mask=None, token_type_ids=None, labels=None, sent_rep_token_ids=None, sent_lengths=None, source=None, target=None)[source]

A single set of features of data.

Parameters
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: Usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.

  • token_type_ids – Usually, segment token indices to indicate first and second portions of the inputs. However, for summarization they are used to indicate different sentences. Depending on the size of the token type id vocabulary, these values may alternate between 0 and 1 or they may increase sequentially for each sentence in the input.

  • labels – Labels corresponding to the input.

  • sent_rep_token_ids – The locations of the sentence representation tokens.

  • sent_lengths – A list of the lengths of each sentence in the source and input_ids.

  • source – The actual source document as a list of sentences.

  • target – The ground truth abstractive summary.

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.

class data.SentencesProcessor(name=None, labels=None, examples=None, verbose=False)[source]

Create a SentencesProcessor

Parameters
  • name (str, optional) – A label for the SentencesProcessor object, used internally for saving if a save name is not specified in data.SentencesProcessor.get_features(). Default is None.

  • labels (list, optional) – The labels that go with each sample; can be a list of lists where the inner lists are the labels for each sentence in the corresponding example. Default is None.

  • examples (list, optional) – List of InputExamples. Default is None.

  • verbose (bool, optional) – Log extra information (such as examples of processed data points). Default is False.

add_examples(texts, labels=None, ids=None, oracle_ids=None, targets=None, overwrite_labels=False, overwrite_examples=False)[source]

Primary method of adding example sets of texts, labels, ids, and targets to the SentencesProcessor

Parameters
  • texts (list) – A list of documents where each document is a list of sentences where each sentence is a list of tokens. This is the output of convert_to_extractive.py and is in the ‘src’ field for each doc. See extractive.ExtractiveSummarizer.prepare_data().

  • labels (list, optional) – A list of the labels for each document, where each document's labels are themselves a list and the index of each label corresponds to the index of the sentence in the respective entry in texts. As with texts, this is handled automatically by ExtractiveSummarizer.prepare_data(). Default is None.

  • ids (list, optional) – A list of ids for each document. Not used by ExtractiveSummarizer. Default is None.

  • oracle_ids (list, optional) – Similar to labels but is a list of indexes of the chosen sentences instead of a one-hot encoded vector. These will be converted to labels. Default is None.

  • targets (list, optional) – A list of the abstractive target for each document. Default is None.

  • overwrite_labels (bool, optional) – Replace any labels currently stored by the SentencesProcessor. Default is False.

  • overwrite_examples (bool, optional) – Replace any examples currently stored by the SentencesProcessor. Default is False.

Returns

The examples as InputExamples that have been added.

Return type

list
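
A sketch of adding one tokenized document (the tokens, oracle indices, and target are illustrative; the oracle_ids are converted to labels internally as described above):

    from data import SentencesProcessor

    processor = SentencesProcessor(name="example_processor")

    documents = [
        [
            ["the", "first", "sentence"],
            ["the", "second", "sentence"],
            ["the", "third", "sentence"],
        ]
    ]

    examples = processor.add_examples(
        documents,
        oracle_ids=[[0]],  # sentence 0 was chosen for the summary
        targets=["An abstractive reference summary."],
    )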

classmethod create_from_examples(texts, labels=None, **kwargs)[source]

Create a SentencesProcessor with **kwargs and add texts and labels through add_examples().

get_features(tokenizer, bert_compatible_cls=True, create_sent_rep_token_ids=True, sent_rep_token_id=None, create_sent_lengths=True, create_segment_ids='binary', segment_token_id=None, create_source=False, n_process=2, max_length=None, pad_on_left=False, pad_token=0, mask_padding_with_zero=True, create_attention_mask=True, pad_ids_and_attention=True, return_type=None, save_to_path=None, save_to_name=None, save_as_type='txt')[source]

Convert the examples stored by the SentencesProcessor to features that can be used by a model. The following processes can be performed: tokenization, token type ids (to separate sentences), sentence representation token ids (the locations of each sentence representation token), sentence lengths, and the attention mask. Padding can be applied to the tokenized examples and the attention masks, but it is recommended to instead use the data.pad_batch_collate() function so each batch is padded individually for efficiency (fewer zeros passed through the model).

Parameters
  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer used to tokenize the examples.

  • bert_compatible_cls (bool, optional) – Adds ‘[CLS]’ tokens in front of each sentence. This is useful so that the ‘[CLS]’ token can be used to obtain sentence embeddings. This only works if the chosen model has the ‘[CLS]’ token in its vocabulary. Default is True.

  • create_sent_rep_token_ids (bool, optional) – Option to create sentence representation token ids. This will store a list of the indexes of all the sent_rep_token_ids in the tokenized example. Default is True.

  • sent_rep_token_id ([type], optional) – The token id that should be captured for each sentence (should have one per sentence and each should represent that sentence). Default is '[CLS]' token if bert_compatible_cls else '[SEP]' token.

  • create_sent_lengths (bool, optional) – Option to create a list of sentence lengths where each index in the list corresponds to the respective sentence in the example. Default is True.

  • create_segment_ids (str, optional) –

    Option to create segment ids (aka token type ids). See https://huggingface.co/transformers/glossary.html#token-type-ids for more info. Set to either “binary”, “sequential”, or False.

    • binary alternates between 0 and 1 for each sentence.

    • sequential starts at 0 and increments by 1 for each sentence.

    • False does not create any segment ids.

    Note: Many pretrained models that accept token type ids use them for question answering and related tasks where the model receives two inputs. Therefore, most models have a token type id vocabulary size of 2, which means they only have learned 2 token type ids. The “binary” mode exists so that these pretrained models can easily be used. Default is “binary”.

  • segment_token_id (str, optional) – The token id to be used when creating segment ids. Can be set to ‘period’ to treat periods as sentence separation tokens, but this is a terrible idea for obvious reasons. Default is ‘[SEP]’ token id.

  • create_source (bool, optional) – Option to save the source text (non-tokenized) as a string. Default is False.

  • n_process (int, optional) – How many processes to use when running get_features_process() in parallel. Set higher to run faster and lower if you experience OOM issues. Default is 2.

  • max_length (int, optional) – If pad_ids_and_attention is True then pad to this amount. Default is tokenizer.max_len.

  • pad_on_left (bool, optional) – Optionally, pad on the left instead of right. Default is False.

  • pad_token (int, optional) – Which token to use for padding the input_ids. Default is 0.

  • mask_padding_with_zero (bool, optional) – Use zeros to pad the attention. Uses ones otherwise. Default is True.

  • create_attention_mask (bool, optional) – Option to create the attention mask. It is recommended to use the data.pad_batch_collate() function, which will automatically create attention masks and pad them on a per batch level. Default is False if return_type == "lists" else True.

  • pad_ids_and_attention (bool, optional) – Pad the input_ids with pad_token and the attention masks with 0s or 1s depending on mask_padding_with_zero. Pad both to max_length. Default is False if return_type == "lists" else True.

  • return_type (str, optional) – Either “tensors”, “lists”, or None. See “Returns” section below. Default is None.

  • save_to_path (str, optional) – The folder/directory to save the data to OR None to not save. Will save the data specified by return_type to disk. Default is None.

  • save_to_name (str, optional) – The name of the file to save. The extension ‘.pt’ is automatically appended. Default is 'dataset_' + self.name + '.pt'.

  • save_as_type (str, optional) – The file extension of saved file if save_to_path is set. Supports “pt” (PyTorch) and “txt” (Text). Saving as “txt” requires the return_type to be lists. If return_type is tensors the only save_as_type available is “pt”. Defaults to “txt”.

Returns

If return_type is None, return the list of calculated features. If return_type == "tensors", return the features converted to tensors and stacked such that features are grouped together into individual tensors. If return_type == "lists" (the recommended option), export each InputFeatures object in the features list as a dictionary, append each dictionary to a list, and return that list.

Return type

list or torch.TensorDataset
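
A sketch of converting stored examples to features (the pretrained tokenizer name and output directory are assumptions; `processor` is the populated SentencesProcessor from the add_examples() sketch above):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical choice

    # `processor` is the SentencesProcessor populated in the add_examples() sketch.
    features = processor.get_features(
        tokenizer,
        bert_compatible_cls=True,
        create_segment_ids="binary",
        return_type="lists",            # recommended: a list of feature dictionaries
        save_to_path="data/",           # hypothetical output directory
        save_to_name="dataset_example",
    )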

get_features_process(input_information, num_examples=0, tokenizer=None, bert_compatible_cls=True, sep_token=None, cls_token=None, create_sent_rep_token_ids=True, sent_rep_token_id=None, create_sent_lengths=True, create_segment_ids='binary', segment_token_id=None, create_source=False, max_length=None, pad_on_left=False, pad_token=0, mask_padding_with_zero=True, create_attention_mask=True, pad_ids_and_attention=True)[source]

The process that actually creates the features. get_features() is the driving function, look there for a description of how this function works. This function only exists so that processing can easily be done in parallel using Pool.map.

classmethod get_input_ids(tokenizer, src_txt, bert_compatible_cls=True, sep_token=None, cls_token=None, max_length=None)[source]

Get input_ids from src_txt using tokenizer. See get_features() for more info.

load(load_from_path, dataset_name=None)[source]

Attempts to load the dataset from storage. If that fails, will return None.

data.pad_batch_collate(batch, modifier=None)[source]

Collate function to be passed to DataLoaders. PyTorch Docs: https://pytorch.org/docs/stable/data.html#dataloader-collate-fn

Calculates padding (per batch for efficiency) of labels and token_type_ids if they exist within the batch from the Dataset. Also, pads sent_rep_token_ids and creates the sent_rep_mask to indicate which numbers in the sent_rep_token_ids list are actually the locations of sentence representation ids and which are padding. Finally, calculates the attention_mask for each set of input_ids and pads both the attention_mask and the input_ids. Converts all inputs to tensors.

If sent_lengths are found then they will also automatically be padded. However, the padding for sentence lengths is complicated. Each list of sentence lengths needs to be the length of the longest list of sentence lengths, and the sum of all the lengths in each list needs to equal the width of input_ids (the length of each input_id). The second requirement exists because torch.split() (which is used in the mean_tokens pooling algorithm to convert word vectors to sentence embeddings in pooling.py) will split a tensor into the lengths requested but will error instead of returning any extra. However, torch.split() can split a tensor into zero-length segments. Thus, to solve this, zeros are added to each sentence length list for each example until only one more padding value is needed to reach the maximum number of sentences. Then, the value needed to reach the width of the input_ids is added as the final entry.

source and target, if present, are simply passed on without any processing. Therefore, the standard collate_fn function for DataLoaders will not work if these are present since they cannot be converted to tensors without padding. This collate_fn must be used if source or target is present in the loaded dataset.

The modifier argument accepts a function that takes the final_dictionary and returns a modified final_dictionary. The modifier function will be called directly before final_dictionary is returned in pad_batch_collate(). This allows for easy extendability.
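
A sketch of the modifier hook with a trivial custom function (the "input_ids" key comes from the features documented above; this is only an illustration of the extension point):

    from functools import partial

    from data import pad_batch_collate


    def inspect_batch(final_dictionary):
        # Called by pad_batch_collate() right before the padded batch is returned.
        print(final_dictionary["input_ids"].shape)
        return final_dictionary


    # Pass this as collate_fn to a DataLoader (see FSIterableDataset above).
    collate_fn = partial(pad_batch_collate, modifier=inspect_batch)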

Pooling

class pooling.Pooling(sent_rep_tokens=True, mean_tokens=False, max_tokens=False)[source]

Methods to obtain sentence embeddings from word vectors. Multiple methods can be specified, and their results will be concatenated together.

Parameters
  • sent_rep_tokens (bool, optional) – Use the sentence representation token as sentence embeddings. Default is True.

  • mean_tokens (bool, optional) – Take the mean of all the token vectors in each sentence. Default is False.

forward(word_vectors=None, sent_rep_token_ids=None, sent_rep_mask=None, sent_lengths=None, sent_lengths_mask=None)[source]

Forward pass of the Pooling nn.Module.

Returns

(output_vector, output_mask) Contains the sentence embeddings and the mask as torch.Tensors. The mask is either the sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.

Return type

tuple

training: bool
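
A sketch with dummy tensors (all shapes are assumptions: word vectors of shape (batch, seq_len, hidden) and one sentence representation token index per sentence):

    import torch

    from pooling import Pooling

    pooling = Pooling(sent_rep_tokens=True, mean_tokens=False)

    batch_size, seq_len, hidden, num_sents = 2, 16, 768, 3
    word_vectors = torch.rand(batch_size, seq_len, hidden)
    sent_rep_token_ids = torch.tensor([[0, 5, 10], [0, 6, 12]])  # assumed [CLS] positions
    sent_rep_mask = torch.ones(batch_size, num_sents, dtype=torch.bool)

    sentence_embeddings, output_mask = pooling(
        word_vectors=word_vectors,
        sent_rep_token_ids=sent_rep_token_ids,
        sent_rep_mask=sent_rep_mask,
    )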

Classifier

class classifier.LinearClassifier(web_hidden_size, linear_hidden=1536, dropout=0.1, activation_string='gelu')[source]

nn.Module to classify sentences by reducing the hidden dimension to 1.

Parameters
  • web_hidden_size (int) – The output hidden size from the word embedding model. Used as the input to the first linear layer in this nn.Module.

  • linear_hidden (int, optional) – The number of hidden parameters for this Classifier. Default is 1536.

  • dropout (float, optional) – The value for dropout applied before the 2nd linear layer. Default is 0.1.

  • activation_string (str, optional) – A string representing an activation function in get_activation(). Default is “gelu”.

forward(x, mask)[source]

Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.

training: bool
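
A sketch with dummy sentence vectors (shapes are assumed to be (batch, num_sents, web_hidden_size)):

    import torch

    from classifier import LinearClassifier

    classifier = LinearClassifier(web_hidden_size=768)

    sent_vectors = torch.rand(2, 3, 768)          # (batch, num_sents, hidden)
    mask = torch.ones(2, 3, dtype=torch.bool)     # no padded sentence positions

    sent_scores = classifier(sent_vectors, mask)  # one score per sentence
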
class classifier.SimpleLinearClassifier(web_hidden_size)[source]

nn.Module to classify sentences by reducing the hidden dimension to 1. This module contains a single linear layer and a sigmoid.

Parameters

web_hidden_size (int) – The output hidden size from the word embedding model. Used as the input to the first linear layer in this nn.Module.

forward(x, mask)[source]

Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.

training: bool
class classifier.TransformerEncoderClassifier(d_model, nhead=8, dim_feedforward=2048, dropout=0.1, num_layers=2, custom_reduction=None)[source]

nn.Module to classify sentences by running the sentence vectors through some nn.TransformerEncoder layers and then reducing the hidden dimension to 1 with a linear layer.

Parameters
  • d_model (int) – The number of expected features in the input

  • nhead (int, optional) – The number of heads in the multiheadattention models. Default is 8.

  • dim_feedforward (int, optional) – The dimension of the feedforward network model. Default is 2048.

  • dropout (float, optional) – The dropout value. Default is 0.1.

  • num_layers (int, optional) – The number of TransformerEncoderLayers. Default is 2.

  • custom_reduction (nn.Module, optional) – An nn.Module that maps d_model inputs to 1 value; if not specified, then an nn.Sequential() module consisting of a linear layer and a sigmoid will automatically be created. Default is nn.Sequential(linear, sigmoid).

forward(x, mask)[source]

Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.

training: bool

Convert To Extractive

convert_to_extractive.cal_rouge(evaluated_ngrams, reference_ngrams)[source]
convert_to_extractive.check_resume_success(nlp, args, source_file, last_shard, output_path, split, compression)[source]
convert_to_extractive.combination_selection(doc_sent_list, abstract_sent_list, summary_size)[source]
convert_to_extractive.convert_to_extractive_driver(args)[source]

Driver function to convert an abstractive summarization dataset to an extractive dataset. The abstractive dataset must be formatted with two files for each split: a source and target file. Example file list for two splits: ["train.source", "train.target", "val.source", "val.target"]

convert_to_extractive.convert_to_extractive_process(args, nlp, source_docs, target_docs, name, piece_idx=None)[source]

Main process to convert an abstractive summarization dataset to extractive. Tokenizes, gets the oracle_ids, splits into source and labels, and saves processed data.

convert_to_extractive.example_processor(inputs, args, oracle_mode='greedy', no_preprocess=False)[source]

Create oracle_ids, convert them to labels and run preprocess().

convert_to_extractive.greedy_selection(doc_sent_list, abstract_sent_list, summary_size)[source]
convert_to_extractive.preprocess(example, labels, min_sentence_ntokens=5, max_sentence_ntokens=200, min_example_nsents=3, max_example_nsents=100)[source]

Removes sentences that are too long/short and examples that have too few/many sentences.

convert_to_extractive.read_in_chunks(file_object, chunk_size=5000)[source]

Read a file line by line but yield chunks of chunk_size number of lines at a time.

convert_to_extractive.resume(output_path, split, chunk_size)[source]

Find the last shard created and return the total number of lines read and last shard number.

convert_to_extractive.save(json_to_save, output_path, compression=False)[source]

Save json_to_save to output_path with optional gzip compression specified by compression.

convert_to_extractive.seek_files(files, line_num)[source]

Seek a set of files to line number line_num and return the files.

convert_to_extractive.tokenize(nlp, docs, n_process=5, batch_size=100, name='', tokenizer_log_interval=0.1, disable_progress_bar=False)[source]

Tokenize using spacy and split into sentences and tokens.