Extractive API Reference
Model/Module
- class extractive.ExtractiveSummarizer(hparams, embedding_model_config=None, classifier_obj=None)[source]
A machine learning model that extractively summarizes an input text by scoring the sentences. Main class that handles the data loading, initial processing, training/testing/validating setup, and contains the actual model.
- compute_loss(outputs, labels, mask)[source]
Compute the loss between model outputs and ground-truth labels.
- Parameters
outputs (torch.Tensor) – Output sentence scores obtained from forward().
labels (torch.Tensor) – Ground-truth labels (1 for sentences that should be in the summary, 0 otherwise) from the batch generated during the data preprocessing stage.
mask (torch.Tensor) – Mask returned by forward(), either sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.
- Returns
- Losses: (total_loss, total_norm_batch_loss, sum_avg_seq_loss, mean_avg_seq_loss, average_loss)
- Return type
tuple
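The exact reduction code is internal to the library, but a minimal sketch of a masked binary cross-entropy loss that produces the five values named above (assuming the outputs are already sigmoid probabilities) could look like this:

```python
# Illustrative sketch only, not the library's exact implementation.
import torch
import torch.nn.functional as F

def masked_loss_sketch(outputs, labels, mask):
    # Per-sentence BCE; padded positions are zeroed out by the mask.
    loss = F.binary_cross_entropy(outputs, labels.float(), reduction="none")
    loss = loss * mask.float()

    total_loss = loss.sum()
    # Normalize by the number of real (non-padding) sentences in the batch.
    total_norm_batch_loss = total_loss / mask.sum()
    # Average loss per sequence, then sum or average over the batch.
    avg_seq_loss = loss.sum(dim=1) / mask.sum(dim=1)
    sum_avg_seq_loss = avg_seq_loss.sum()
    mean_avg_seq_loss = avg_seq_loss.mean()
    # Simple mean over the batch dimension.
    average_loss = total_loss / loss.size(0)
    return (total_loss, total_norm_batch_loss, sum_avg_seq_loss,
            mean_avg_seq_loss, average_loss)
```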
- configure_optimizers()[source]
Configure the optimizers. Returns the optimizer and scheduler specified by the values in self.hparams.
- forward(input_ids, attention_mask, sent_rep_mask=None, token_type_ids=None, sent_rep_token_ids=None, sent_lengths=None, sent_lengths_mask=None, **kwargs)[source]
Model forward function. See the PyTorch 60 Minute Blitz tutorial if you are unsure what a forward function is.
- Parameters
input_ids (torch.Tensor) – Indices of input sequence tokens in the vocabulary. What are input IDs?
attention_mask (torch.Tensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens. What are attention masks?
sent_rep_mask (torch.Tensor, optional) – Indicates which numbers in sent_rep_token_ids are actually the locations of sentence representation ids and which are padding. Defaults to None.
token_type_ids (torch.Tensor, optional) – Usually, segment token indices to indicate first and second portions of the inputs. However, for summarization they are used to indicate different sentences. Depending on the size of the token type id vocabulary, these values may alternate between 0 and 1 or they may increase sequentially for each sentence in the input. Defaults to None.
sent_rep_token_ids (torch.Tensor, optional) – The locations of the sentence representation tokens. Defaults to None.
sent_lengths (torch.Tensor, optional) – A list of the lengths of each sentence in input_ids. See data.pad_batch_collate() for more info about the generation of this feature. Defaults to None.
sent_lengths_mask (torch.Tensor, optional) – Created on-the-fly by data.pad_batch_collate(). Similar to sent_rep_mask: 1 for value and 0 for padding. See data.pad_batch_collate() for more info about the generation of this feature. Defaults to None.
- Returns
Contains the sentence scores and mask as torch.Tensors. The mask is either the sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.
- Return type
tuple
- json_to_dataset(tokenizer, hparams, inputs=None, num_files=0, processor=None)[source]
Convert json output from convert_to_extractive.py to a “.pt” file containing lists or tensors using a data.SentencesProcessor. This function is run by prepare_data() in parallel.
- Parameters
tokenizer (transformers.PreTrainedTokenizer) – Tokenizer used to convert examples to input_ids. Usually is self.tokenizer.
hparams (argparse.Namespace) – Hyper-parameters used to create the model. Usually is self.hparams.
inputs (tuple, optional) – (idx, json_file) Current loop index and path to json file. Defaults to None.
num_files (int, optional) – The total number of files to process. Used to display a nice progress indicator. Defaults to 0.
processor (data.SentencesProcessor, optional) – The data.SentencesProcessor object to convert the json file to usable features. Defaults to None.
- predict(input_text: str, raw_scores=False, num_summary_sentences=3)[source]
Summarizes input_text using the model.
- Parameters
input_text (str) – The text to be summarized.
raw_scores (bool, optional) – Return a list containing each sentence and its corresponding score instead of the summary. Defaults to False.
num_summary_sentences (int, optional) – The number of sentences in the output summary. This value specifies the number of top sentences to select as the summary. Defaults to 3.
- Returns
The summary text. If raw_scores is set, returns a list of input sentences and their corresponding scores instead.
- Return type
str
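A usage sketch (the checkpoint path is a placeholder; load_from_checkpoint is inherited from PyTorch Lightning):

```python
from extractive import ExtractiveSummarizer

model = ExtractiveSummarizer.load_from_checkpoint("path/to/checkpoint.ckpt")

summary = model.predict("Some long document ...", num_summary_sentences=3)
print(summary)

# With raw_scores=True, each input sentence is returned with its score
# instead of the assembled summary.
scored_sentences = model.predict("Some long document ...", raw_scores=True)
```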
- predict_sentences(input_sentences: Union[List[str], generator], raw_scores=False, num_summary_sentences=3, tokenized=False)[source]
Summarizes input_sentences using the model.
- Parameters
input_sentences (list or generator) – The sentences to be summarized, as a list or a generator of spacy Spans (spacy.tokens.span.Span), which can be obtained by running nlp("input document").sents where nlp is a spacy model with a sentencizer.
raw_scores (bool, optional) – Return a list containing each sentence and its corresponding score instead of the summary. Defaults to False.
num_summary_sentences (int, optional) – The number of sentences in the output summary. This value specifies the number of top sentences to select as the summary. Defaults to 3.
tokenized (bool, optional) – If the input sentences are already tokenized using spacy. If True, input_sentences should be a list of lists where the outer list contains sentences and the inner lists contain tokens. Defaults to False.
- Returns
The summary text. If raw_scores is set, returns a list of input sentences and their corresponding scores instead.
- Return type
str
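A usage sketch with spacy (the pipeline name and checkpoint path are placeholders):

```python
import spacy
from extractive import ExtractiveSummarizer

model = ExtractiveSummarizer.load_from_checkpoint("path/to/checkpoint.ckpt")

nlp = spacy.load("en_core_web_sm")  # any spacy pipeline with a sentencizer works
doc = nlp("First sentence. Second sentence. Third sentence. Fourth sentence.")
summary = model.predict_sentences(doc.sents, num_summary_sentences=2)

# Pre-tokenized input: an outer list of sentences, inner lists of tokens.
tokens = [["First", "sentence", "."], ["Second", "sentence", "."]]
summary = model.predict_sentences(tokens, tokenized=True, num_summary_sentences=1)
```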
- prepare_data()[source]
Runs json_to_dataset() in parallel. json_to_dataset() is the function that actually loads and processes the examples, as described below.
Algorithm: For each json file outputted by the convert_to_extractive.py script:
1. Load the json file.
2. Add each document in the json file to the SentencesProcessor defined in self.processor, overwriting any previous data in the processor.
3. Run data.SentencesProcessor.get_features() to save the extracted features to disk as a .pt file containing a pickled python list of dictionaries, where each dictionary contains the extracted features.
Memory Usage Note: If sharding was turned off during the convert_to_extractive process then this function will run once, loading the entire dataset into memory to process, just like the convert_to_extractive.py script.
- test_epoch_end(outputs)[source]
Called at the end of a testing epoch (see the PyTorch Lightning documentation). Finds the mean of all the metrics logged by test_step().
- test_step(batch, batch_idx)[source]
Test step (see the PyTorch Lightning documentation). Similar to validation_step() in that it runs the inputs through the model. However, this method also calculates the ROUGE scores for each example-summary pair.
- training: bool
- training_step(batch, batch_idx)[source]
Training step (see the PyTorch Lightning documentation).
- validation_epoch_end(outputs)[source]
Called at the end of a validation epoch (see the PyTorch Lightning documentation). Finds the mean of all the metrics logged by validation_step().
- validation_step(batch, batch_idx)[source]
Validation step (see the PyTorch Lightning documentation). Similar to training_step() in that it runs the inputs through the model. However, this method also calculates accuracy and F1 score by marking every sentence score above 0.5 as 1 (meaning it should be in the summary) and every score below 0.5 as 0 (meaning it should not be in the summary).
- extractive.longformer_modifier(final_dictionary)[source]
Creates the global_attention_mask for the longformer. Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for task-specific finetuning because it makes the model more flexible at representing the task. For example, for classification, the <s> token should be given global attention. For QA, all question tokens should also have global attention. For summarization, global attention is given to all of the <s> (RoBERTa ‘CLS’ equivalent) tokens. Please refer to the Longformer paper for more details. Mask values selected in [0, 1]: 0 for local attention, 1 for global attention.
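An illustrative sketch of the idea (not the library's exact code): set the mask to 1 wherever a sentence representation token appears and 0 everywhere else.

```python
import torch

def make_global_attention_mask(input_ids, sent_rep_token_id):
    # 1 where input_ids equals the <s>/sentence-representation token id,
    # 0 (local attention) everywhere else.
    return (input_ids == sent_rep_token_id).long()

input_ids = torch.tensor([[0, 10, 11, 0, 12, 13, 2]])  # 0 is <s> in RoBERTa-style vocabs
make_global_attention_mask(input_ids, sent_rep_token_id=0)
# tensor([[1, 0, 0, 1, 0, 0, 0]])
```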
Data
- class data.FSIterableDataset(files_list, shuffle=True, verbose=False)[source]
A dataset to yield examples from a list of files, where each file is a saved python object that can be iterated over. These files could be other PyTorch datasets (tested with TensorDataset) or other python objects such as lists. Each file will be loaded one at a time until all of its examples have been yielded, at which point the next file will be loaded and used to yield examples, and so on. This means a large dataset can be broken into smaller chunks and this class can be used to load samples as if those files were one dataset, while only using the RAM required for one chunk.
Explanation about batch_size and __len__(): If the len() function needs to be accurate then the batch_size must be specified when constructing objects of this class. PyTorch DataLoader objects will report accurate lengths by dividing the number of examples in the dataset by the batch size only if the dataset is not an IterableDataset. If the dataset is an IterableDataset then a DataLoader will simply ask the dataset for its length, without dividing by the batch size, because in some cases the length of an IterableDataset might be difficult or impossible to determine. However, in this case the number of examples (the length of the dataset) is known. The division by batch size must happen in the dataset (for datasets of type IterableDataset) since the DataLoader will not calculate it.
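A usage sketch (the shard file names are placeholders for the “.pt” files produced during preprocessing):

```python
from torch.utils.data import DataLoader
from data import FSIterableDataset

dataset = FSIterableDataset(["train.0.pt", "train.1.pt", "train.2.pt"])
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    pass  # each shard is loaded into memory only when its examples are reached
```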
- class data.InputFeatures(input_ids, attention_mask=None, token_type_ids=None, labels=None, sent_rep_token_ids=None, sent_lengths=None, source=None, target=None)[source]
A single set of features of data.
- Parameters
input_ids – Indices of input sequence tokens in the vocabulary.
attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: Usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
token_type_ids – Usually, segment token indices to indicate first and second portions of the inputs. However, for summarization they are used to indicate different sentences. Depending on the size of the token type id vocabulary, these values may alternate between 0 and 1 or they may increase sequentially for each sentence in the input.
labels – Labels corresponding to the input.
sent_rep_token_ids – The locations of the sentence representation tokens.
sent_lengths – A list of the lengths of each sentence in the source and input_ids.
source – The actual source document as a list of sentences.
target – The ground truth abstractive summary.
- class data.SentencesProcessor(name=None, labels=None, examples=None, verbose=False)[source]
Create a SentencesProcessor
- Parameters
name (str, optional) – A label for the SentencesProcessor object, used internally for saving if a save name is not specified in data.SentencesProcessor.get_features(). Default is None.
labels (list, optional) – The label that goes with each sample. Can be a list of lists where the inner lists are the labels for each sentence in the corresponding example. Default is None.
examples (list, optional) – List of InputExamples. Default is None.
verbose (bool, optional) – Log extra information (such as examples of processed data points). Default is False.
- add_examples(texts, labels=None, ids=None, oracle_ids=None, targets=None, overwrite_labels=False, overwrite_examples=False)[source]
Primary method of adding example sets of texts, labels, ids, and targets to the SentencesProcessor.
- Parameters
texts (list) – A list of documents where each document is a list of sentences and each sentence is a list of tokens. This is the output of convert_to_extractive.py and is in the ‘src’ field for each doc. See extractive.ExtractiveSummarizer.prepare_data().
labels (list, optional) – A list of the labels for each document, where each label is a list of labels and the index of each label corresponds to the index of the sentence in the respective entry in texts. Like texts, this is handled automatically by ExtractiveSummarizer.prepare_data(). Default is None.
ids (list, optional) – A list of ids for each document. Not used by ExtractiveSummarizer. Default is None.
oracle_ids (list, optional) – Similar to labels but is a list of indexes of the chosen sentences instead of a one-hot encoded vector. These will be converted to labels. Default is None.
targets (list, optional) – A list of the abstractive target for each document. Default is None.
overwrite_labels (bool, optional) – Replace any labels currently stored by the SentencesProcessor. Default is False.
overwrite_examples (bool, optional) – Replace any examples currently stored by the SentencesProcessor. Default is False.
- Returns
The examples as InputExamples that have been added.
- Return type
list
- classmethod create_from_examples(texts, labels=None, **kwargs)[source]
Create a SentencesProcessor with **kwargs and add texts and labels through add_examples().
- get_features(tokenizer, bert_compatible_cls=True, create_sent_rep_token_ids=True, sent_rep_token_id=None, create_sent_lengths=True, create_segment_ids='binary', segment_token_id=None, create_source=False, n_process=2, max_length=None, pad_on_left=False, pad_token=0, mask_padding_with_zero=True, create_attention_mask=True, pad_ids_and_attention=True, return_type=None, save_to_path=None, save_to_name=None, save_as_type='txt')[source]
Convert the examples stored by the SentencesProcessor to features that can be used by a model. The following processes can be performed: tokenization, token type ids (to separate sentences), sentence representation token ids (the locations of each sentence representation token), sentence lengths, and the attention mask. Padding can be applied to the tokenized examples and the attention masks, but it is recommended to instead use the data.pad_batch_collate() function so each batch is padded individually for efficiency (fewer zeros passed through the model).
- Parameters
tokenizer (transformers.PreTrainedTokenizer) – The tokenizer used to tokenize the examples.
bert_compatible_cls (bool, optional) – Adds ‘[CLS]’ tokens in front of each sentence. This is useful so that the ‘[CLS]’ token can be used to obtain sentence embeddings. This only works if the chosen model has the ‘[CLS]’ token in its vocabulary. Default is True.
create_sent_rep_token_ids (bool, optional) – Option to create sentence representation token ids. This will store a list of the indexes of all the sent_rep_token_ids in the tokenized example. Default is True.
sent_rep_token_id ([type], optional) – The token id that should be captured for each sentence (should have one per sentence and each should represent that sentence). Default is '[CLS]' token if bert_compatible_cls else '[SEP]' token.
create_sent_lengths (bool, optional) – Option to create a list of sentence lengths where each index in the list corresponds to the respective sentence in the example. Default is True.
create_segment_ids (str, optional) – Option to create segment ids (aka token type ids). See https://huggingface.co/transformers/glossary.html#token-type-ids for more info. Set to either “binary”, “sequential”, or False. “binary” alternates between 0 and 1 for each sentence. “sequential” starts at 0 and increments by 1 for each sentence. False does not create any segment ids. Note: Many pretrained models that accept token type ids use them for question answering and related tasks where the model receives two inputs. Therefore, most models have a token type id vocabulary size of 2, which means they have only learned 2 token type ids. The “binary” mode exists so that these pretrained models can easily be used. Default is “binary”. (A short sketch of both modes appears after the “Returns” section below.)
segment_token_id (str, optional) – The token id to be used when creating segment ids. Can be set to ‘period’ to treat periods as sentence separation tokens, but this is a terrible idea for obvious reasons. Default is the ‘[SEP]’ token id.
create_source (bool, optional) – Option to save the source text (non-tokenized) as a string. Default is False.
n_process (int, optional) – How many processes to use for multithreading for running get_features_process(). Set higher to run faster and lower if you experience OOM issues. Default is 2.
max_length (int, optional) – If pad_ids_and_attention is True then pad to this amount. Default is tokenizer.max_len.
pad_on_left (bool, optional) – Optionally, pad on the left instead of the right. Default is False.
pad_token (int, optional) – Which token to use for padding the input_ids. Default is 0.
mask_padding_with_zero (bool, optional) – Use zeros to pad the attention mask; uses ones otherwise. Default is True.
create_attention_mask (bool, optional) – Option to create the attention mask. It is recommended to use the data.pad_batch_collate() function, which will automatically create attention masks and pad them on a per-batch level. Default is False if return_type == "lists" else True.
pad_ids_and_attention (bool, optional) – Pad the input_ids with pad_token and the attention masks with 0s or 1s depending on mask_padding_with_zero. Pad both to max_length. Default is False if return_type == "lists" else True.
return_type (str, optional) – Either “tensors”, “lists”, or None. See the “Returns” section below. Default is None.
save_to_path (str, optional) – The folder/directory to save the data to, or None to not save. Will save the data specified by return_type to disk. Default is None.
save_to_name (str, optional) – The name of the file to save. The extension ‘.pt’ is automatically appended. Default is 'dataset_' + self.name + '.pt'.
save_as_type (str, optional) – The file extension of the saved file if save_to_path is set. Supports “pt” (PyTorch) and “txt” (Text). Saving as “txt” requires the return_type to be lists. If return_type is tensors, the only save_as_type available is “pt”. Defaults to “txt”.
- Returns
If return_type is None, return the list of calculated features. If return_type == "tensors", return the features converted to tensors and stacked such that features are grouped together into individual tensors. If return_type == "lists" (the recommended option), export each InputFeatures object in the features list as a dictionary, append each dictionary to a list, and return that list.
- Return type
list or torch.TensorDataset
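To illustrate the create_segment_ids modes, here is a hedged sketch (plain python, not the library's implementation) for a document whose sentences contain 3, 2, and 4 tokens:

```python
def make_segment_ids(sent_lengths, mode="binary"):
    # Assign one token type id per sentence and repeat it for every token.
    segment_ids = []
    for i, length in enumerate(sent_lengths):
        token_type = i % 2 if mode == "binary" else i
        segment_ids.extend([token_type] * length)
    return segment_ids

make_segment_ids([3, 2, 4], mode="binary")      # [0, 0, 0, 1, 1, 0, 0, 0, 0]
make_segment_ids([3, 2, 4], mode="sequential")  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```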
- get_features_process(input_information, num_examples=0, tokenizer=None, bert_compatible_cls=True, sep_token=None, cls_token=None, create_sent_rep_token_ids=True, sent_rep_token_id=None, create_sent_lengths=True, create_segment_ids='binary', segment_token_id=None, create_source=False, max_length=None, pad_on_left=False, pad_token=0, mask_padding_with_zero=True, create_attention_mask=True, pad_ids_and_attention=True)[source]
The process that actually creates the features. get_features() is the driving function; look there for a description of how this function works. This function only exists so that processing can easily be done in parallel using Pool.map.
- classmethod get_input_ids(tokenizer, src_txt, bert_compatible_cls=True, sep_token=None, cls_token=None, max_length=None)[source]
Get input_ids from src_txt using tokenizer. See get_features() for more info.
- data.pad_batch_collate(batch, modifier=None)[source]
Collate function to be passed to DataLoaders. PyTorch Docs: https://pytorch.org/docs/stable/data.html#dataloader-collate-fn
Calculates padding (per batch for efficiency) of labels and token_type_ids if they exist within the batch from the Dataset. Also pads sent_rep_token_ids and creates the sent_rep_mask to indicate which numbers in the sent_rep_token_ids list are actually the locations of sentence representation ids and which are padding. Finally, calculates the attention_mask for each set of input_ids and pads both the attention_mask and the input_ids. Converts all inputs to tensors.
If sent_lengths are found then they will also automatically be padded. However, the padding for sentence lengths is complicated. Each list of sentence lengths needs to be the length of the longest list of sentence lengths, and the sum of all the lengths in each list needs to add up to the width of the input_ids (the length of each input_id). The second requirement exists because torch.split() (which is used in the mean_tokens pooling algorithm to convert word vectors to sentence embeddings in pooling.py) will split a tensor into the lengths requested but will error instead of returning any extra. However, torch.split() will split a tensor into zero-length segments. Thus, to solve this, zeros are added to each sentence length list for each example until only one more padding value is needed to reach the maximum number of sentences. Once only one more value is needed, the total value needed to reach the width of the input_ids is added.
source and target, if present, are simply passed on without any processing. Therefore, the standard collate_fn function for DataLoaders will not work if these are present, since they cannot be converted to tensors without padding. This collate_fn must be used if source or target is present in the loaded dataset.
The modifier argument accepts a function that takes the final_dictionary and returns a modified final_dictionary. The modifier function will be called directly before final_dictionary is returned in pad_batch_collate(). This allows for easy extendability.
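A usage sketch (the shard file name is a placeholder; any function with the modifier signature described above works, such as extractive.longformer_modifier):

```python
from functools import partial

import torch
from torch.utils.data import DataLoader

from data import pad_batch_collate
from extractive import longformer_modifier

# Placeholder: a preprocessed shard containing a list of feature dictionaries.
dataset = torch.load("train.0.pt")

loader = DataLoader(
    dataset,
    batch_size=32,
    collate_fn=partial(pad_batch_collate, modifier=longformer_modifier),
)
```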
Pooling
- class pooling.Pooling(sent_rep_tokens=True, mean_tokens=False, max_tokens=False)[source]
Methods to obtain sentence embeddings from word vectors. Multiple methods can be specified and their results will be concatenated together.
- Parameters
sent_rep_tokens (bool, optional) – Use the sentence representation token as sentence embeddings. Default is True.
mean_tokens (bool, optional) – Take the mean of all the token vectors in each sentence. Default is False.
- forward(word_vectors=None, sent_rep_token_ids=None, sent_rep_mask=None, sent_lengths=None, sent_lengths_mask=None)[source]
Forward pass of the Pooling nn.Module.
- Parameters
word_vectors (torch.Tensor, optional) – Vectors representing words created by a word_embedding_model. Defaults to None.
sent_rep_token_ids (torch.Tensor, optional) – See extractive.ExtractiveSummarizer.forward(). Defaults to None.
sent_rep_mask (torch.Tensor, optional) – See extractive.ExtractiveSummarizer.forward(). Defaults to None.
sent_lengths (torch.Tensor, optional) – See extractive.ExtractiveSummarizer.forward(). Defaults to None.
sent_lengths_mask (torch.Tensor, optional) – See extractive.ExtractiveSummarizer.forward(). Defaults to None.
- Returns
(output_vector, output_mask) Contains the sentence embeddings and mask as torch.Tensors. The mask is either the sent_rep_mask or sent_lengths_mask depending on the pooling mode used during model initialization.
- Return type
tuple
- training: bool
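As a worked illustration of the mean_tokens mode (see the torch.split() discussion under data.pad_batch_collate()), a minimal sketch that averages each sentence's token vectors (not the library's exact code):

```python
import torch

word_vectors = torch.randn(9, 4)  # 9 tokens with hidden size 4 (single example)
sent_lengths = [3, 2, 4]          # token counts per sentence; must sum to 9

# torch.split returns one chunk per sentence; averaging each chunk gives a
# fixed-size embedding per sentence.
sentence_embeddings = torch.stack(
    [chunk.mean(dim=0) for chunk in torch.split(word_vectors, sent_lengths, dim=0)]
)
sentence_embeddings.shape  # torch.Size([3, 4])
```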
Classifier
- class classifier.LinearClassifier(web_hidden_size, linear_hidden=1536, dropout=0.1, activation_string='gelu')[source]
nn.Module to classify sentences by reducing the hidden dimension to 1.
- Parameters
web_hidden_size (int) – The output hidden size from the word embedding model. Used as the input to the first linear layer in this nn.Module.
linear_hidden (int, optional) – The number of hidden parameters for this Classifier. Default is 1536.
dropout (float, optional) – The value for dropout applied before the 2nd linear layer. Default is 0.1.
activation_string (str, optional) – A string representing an activation function in get_activation(). Default is “gelu”.
- forward(x, mask)[source]
Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.
- training: bool
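A hedged sketch of what such a module could look like (illustrative only; the exact layer order and where the sigmoid is applied are implementation details):

```python
import torch.nn as nn

web_hidden_size, linear_hidden = 768, 1536  # example sizes
classifier_sketch = nn.Sequential(
    nn.Linear(web_hidden_size, linear_hidden),
    nn.GELU(),                    # the "gelu" activation_string
    nn.Dropout(0.1),
    nn.Linear(linear_hidden, 1),  # reduce the hidden dimension to 1
    nn.Sigmoid(),
)
```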
- class classifier.SimpleLinearClassifier(web_hidden_size)[source]
nn.Module to classify sentences by reducing the hidden dimension to 1. This module contains a single linear layer and a sigmoid.
- Parameters
web_hidden_size (int) – The output hidden size from the word embedding model. Used as the input to the first linear layer in this nn.Module.
- forward(x, mask)[source]
Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.
- training: bool
- class classifier.TransformerEncoderClassifier(d_model, nhead=8, dim_feedforward=2048, dropout=0.1, num_layers=2, custom_reduction=None)[source]
nn.Module to classify sentences by running the sentence vectors through some nn.TransformerEncoder layers and then reducing the hidden dimension to 1 with a linear layer.
- Parameters
d_model (int) – The number of expected features in the input.
nhead (int, optional) – The number of heads in the multiheadattention models. Default is 8.
dim_feedforward (int, optional) – The dimension of the feedforward network model. Default is 2048.
dropout (float, optional) – The dropout value. Default is 0.1.
num_layers (int, optional) – The number of TransformerEncoderLayers. Default is 2.
custom_reduction (nn.Module, optional) – An nn.Module that maps d_model inputs to 1 value; if not specified then an nn.Sequential() module consisting of a linear layer and a sigmoid will automatically be created. Default is nn.Sequential(linear, sigmoid).
- forward(x, mask)[source]
Forward function. x is the input sent_vector tensor and mask avoids computations on padded values. Returns sent_scores.
- training: bool
Convert To Extractive
- convert_to_extractive.check_resume_success(nlp, args, source_file, last_shard, output_path, split, compression)[source]
- convert_to_extractive.combination_selection(doc_sent_list, abstract_sent_list, summary_size)[source]
- convert_to_extractive.convert_to_extractive_driver(args)[source]
Driver function to convert an abstractive summarization dataset to an extractive dataset. The abstractive dataset must be formatted with two files for each split: a source and target file. Example file list for two splits:
["train.source", "train.target", "val.source", "val.target"]
- convert_to_extractive.convert_to_extractive_process(args, nlp, source_docs, target_docs, name, piece_idx=None)[source]
Main process to convert an abstractive summarization dataset to extractive. Tokenizes, gets the oracle_ids, splits into source and labels, and saves the processed data.
- convert_to_extractive.example_processor(inputs, args, oracle_mode='greedy', no_preprocess=False)[source]
Create oracle_ids, convert them to labels, and run preprocess().
- convert_to_extractive.preprocess(example, labels, min_sentence_ntokens=5, max_sentence_ntokens=200, min_example_nsents=3, max_example_nsents=100)[source]
Removes sentences that are too long/short and examples that have too few/many sentences.
- convert_to_extractive.read_in_chunks(file_object, chunk_size=5000)[source]
Read a file line by line but yield chunks of chunk_size lines at a time.
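A minimal sketch of such a chunked reader (illustrative, not the exact implementation):

```python
def read_in_chunks_sketch(file_object, chunk_size=5000):
    chunk = []
    for line in file_object:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:  # yield any remaining lines as a final, smaller chunk
        yield chunk
```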
- convert_to_extractive.resume(output_path, split, chunk_size)[source]
Find the last shard created and return the total number of lines read and last shard number.
- convert_to_extractive.save(json_to_save, output_path, compression=False)[source]
Save json_to_save to output_path with optional gzip compression specified by compression.