Differences from PreSumm/BertSum
This project accomplishes a task similar to BertSum/PreSumm (and is based on their research). However, `TransformerSum` improves various aspects and adds several features on top of BertSum/PreSumm. The most notable improvements are listed below.
Note: PreSumm (Yang Liu and Mirella Lapata) builds on top of BertSum (Yang Liu) by adding abstractive summarization. BertSum was extractive only.
General
- This project uses `pytorch_lightning`, which is a template to better organize PyTorch code. It automates most of the training loop. It also results in code that is easier to read than plain PyTorch because everything is organized.
- `TransformerSum` contains comments explaining design decisions and how the code works. Significant functions also have detailed docstrings. See `get_features()` for an example.
- Easy pre-trained model loading and modification thanks to `pytorch_lightning`'s `load_from_checkpoint()`, which is automatically inherited by every `pl.LightningModule`.
- There is a `prediction()` function included with the `extractive.ExtractiveSummarizer` and `abstractive.AbstractiveSummarizer` classes that summarizes a string. This makes it easy to initialize a model and quickly perform inference, as shown in the sketch below.
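For instance, the following is a minimal sketch of loading a trained extractive model and summarizing a string. The checkpoint path and input text are hypothetical, and the call name follows the `prediction()` function described above.

```python
from extractive import ExtractiveSummarizer

# load_from_checkpoint() is inherited from pl.LightningModule;
# the checkpoint path below is a hypothetical example.
model = ExtractiveSummarizer.load_from_checkpoint("path/to/epoch=3.ckpt")

text = (
    "The first sentence of a long document. Another sentence with "
    "more details. A concluding sentence that wraps everything up."
)
print(model.prediction(text))  # returns the extractive summary as a string
```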
Converting a Dataset to Extractive
- Separation between the dataset conversion from abstractive to extractive and the training process. This means the same converted dataset can be used by any transformer model with further automatic processing.
- The `convert_to_extractive.py` script uses the more up-to-date `spacy`, whereas BertSum utilizes `Stanford CoreNLP`. Spacy has the added benefit of being a Python library (while CoreNLP is a Java application), which makes the code easier to understand.
- Supports the same `greedy_selection()` and `combination_selection()` of `BertSum` (see the first sketch after this list).
- A more robust CLI that allows for various desired outputs.
- Built-in optional gzip compression.
- Integration with huggingface/nlp means any summarization dataset in the `nlp` library can be converted to extractive by only modifying 4 options (specifically `--dataset`, `--dataset_version`, `--data_example_column`, and `--data_summarized_column`); the second sketch below illustrates what these options control. The `nlp` library will handle downloading and pre-processing.
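The idea behind greedy selection is to repeatedly add the source sentence that most improves overlap with the reference summary. The sketch below is a simplified illustration, using unigram overlap as a stand-in for the full ROUGE computation; it is not the project's exact implementation.

```python
def greedy_selection(doc_sents, abstract, summary_size=3):
    """Greedily pick the sentence indices that most increase unigram
    overlap with the reference abstract (a simplified ROUGE stand-in)."""
    abstract_tokens = set(abstract.lower().split())

    def overlap(indices):
        selected_tokens = set()
        for i in indices:
            selected_tokens.update(doc_sents[i].lower().split())
        return len(selected_tokens & abstract_tokens)

    selected = []
    for _ in range(summary_size):
        best_gain, best_idx = 0, None
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            gain = overlap(selected + [i]) - overlap(selected)
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected)
```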
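As a rough illustration of what the four `nlp` options control, the underlying library call looks approximately like the following (the CNN/DM dataset name and column names are examples):

```python
import nlp  # the predecessor of today's `datasets` library

# --dataset and --dataset_version select the dataset to download;
# the nlp library handles caching and preprocessing automatically.
train = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")

# --data_example_column and --data_summarized_column name the input
# document and reference summary columns, respectively.
example = train[0]
document, summary = example["article"], example["highlights"]
```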
Pooling for Extractive Models
- Supports the `sent_rep_tokens` and `mean_tokens` pooling methods, whereas `BertSum` only supports `sent_rep_tokens`. See Pooling Modes for more info and the sketch after this list.
- Pooling is separated from the main model's `forward()` function for easy extendability.
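A minimal sketch of the two pooling modes, assuming `hidden_states` of shape `(batch, seq_len, hidden)` from the transformer encoder; tensor names and layouts are illustrative, not the project's exact code.

```python
import torch

def pool_sent_rep_tokens(hidden_states, sent_rep_token_ids):
    """Select the hidden state at each sentence-representation token
    (e.g. the [CLS]-like token inserted before every sentence).
    sent_rep_token_ids: (batch, num_sents) positions of those tokens."""
    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(1)
    # Advanced indexing gathers one vector per sentence:
    return hidden_states[batch_idx, sent_rep_token_ids]  # (batch, num_sents, hidden)

def pool_mean_tokens(hidden_states, sent_masks):
    """Average the token embeddings belonging to each sentence.
    sent_masks: (batch, num_sents, seq_len) with 1s marking the tokens
    of each sentence."""
    summed = torch.einsum("bnl,blh->bnh", sent_masks.float(), hidden_states)
    counts = sent_masks.sum(dim=-1, keepdim=True).clamp(min=1)
    return summed / counts  # (batch, num_sents, hidden)
```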
Data for Extractive Models
- The data processing and loading uses normal PyTorch `DataLoader`s and `Dataset`s instead of recreations of these classes as in `BertSum`.
- A `collate_fn` function converts the data from the dataset to tensors suitable for input to the model, as is standard PyTorch convention.
- The `collate_fn` function also performs "smart batching": it pads the necessary information for each batch individually, which is more efficient than padding for the entire dataset or each chunk. See the sketch after this list.
- The `data.FSIterableDataset` class loads a dataset in chunks and is a subclass of `torch.utils.data.IterableDataset`. While `BertSum` also supports chunked loading to lower RAM usage, `TransformerSum`'s technique is more robust and directly integrates with PyTorch.
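A minimal sketch of "smart batching": each batch is padded only to the longest sequence in that batch rather than to a global maximum. Field names and the pad token id are assumptions for illustration; the project's real `collate_fn` handles more fields.

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    """Pad each batch only to the length of its longest example."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        ids = ex["input_ids"]
        padding = [0] * (max_len - len(ids))  # 0 as pad id (assumed)
        input_ids.append(ids + padding)
        attention_mask.append([1] * len(ids) + [0] * len(padding))
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }

# Usage: the function is passed straight to a standard DataLoader.
# loader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)
```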
Extractive Model and Training
- Compatible with every huggingface/transformers transformer encoder model. `BertSum` can only use BERT, whereas this project supports all encoders by only changing two options when training.
- Easily extendable with new custom models that are saved in the `huggingface/transformers` format. In this way, integration with the `longformer` was easily accomplished (this integration has since been removed because `huggingface/transformers` implemented the `longformer` directly).
- For the extractive component, `BertSum` supports three classifiers: a linear layer, a transformer, and an LSTM network. This project supports four different classifiers: `linear`, a few linear layers with dropout and an activation function; `simple_linear`, a single linear layer; `transformer`, which runs the input through some `nn.TransformerEncoderLayer` layers; and `transformer_linear`, a linear layer combined with a transformer. The classifier is responsible for reducing the hidden features of each sentence embedding to a single number, its extractive score (see the first sketch after this list). The BertSum paper indicates that the difference between these classifiers is not major.
- The reduction method for the BCE loss function is different in `TransformerSum` than in BertSum. BertSum takes the sum of the losses for each sentence (ignoring padding) even though it looks like it uses the mean. Five different reduction methods were tested (see Loss Functions and the second sketch below). There did not appear to be a significant difference, but the best one was chosen.
- The batch size parameter of `BertSum` is not the real batch size (which is likely caused by the custom `DataLoader`). In this project, batch size is the number of documents processed on the GPU at once.
- Multiple optimizers are supported "out-of-the-box" in `TransformerSum` without any need to modify the code.
- The `OneCycle` and `linear_schedule_with_warmup` schedulers are supported in `TransformerSum` "out-of-the-box."
- Logging of all five loss functions (for both the train and validation sets), accuracy, and more is supported. Weights & Biases and Tensorboard are supported "out-of-the-box," but `pytorch_lightning` can integrate several other loggers.
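A minimal sketch of the `linear` classifier idea: a few linear layers with dropout and an activation that reduce each sentence embedding to a single score. The layer sizes and choice of activation are illustrative, not the project's exact configuration.

```python
import torch.nn as nn

class LinearClassifier(nn.Module):
    """Reduce each sentence embedding (hidden_size) to a single score."""

    def __init__(self, hidden_size, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),            # activation (assumed; could be ReLU etc.)
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, sent_embeddings):
        # sent_embeddings: (batch, num_sents, hidden_size)
        # -> (batch, num_sents): one raw score per sentence
        return self.net(sent_embeddings).squeeze(-1)
```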
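And a sketch of the loss-reduction point: once padded sentence positions are masked out, the per-sentence binary cross-entropy can be reduced by sum (BertSum's effective behavior) or by mean over the real sentences. Tensor names are illustrative.

```python
import torch.nn.functional as F

def masked_bce(scores, labels, mask, reduction="mean"):
    """BCE over sentence scores, ignoring padded positions.
    scores, labels, mask: (batch, num_sents); labels are float 0/1 and
    mask is 1.0 for real sentences, 0.0 for padding."""
    loss = F.binary_cross_entropy_with_logits(scores, labels, reduction="none")
    loss = loss * mask  # zero out the padded sentence positions
    if reduction == "sum":              # BertSum's effective reduction
        return loss.sum()
    return loss.sum() / mask.sum()      # mean over non-padded sentences
```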
Abstractive Model and Training
- Dataset preparation happens extremely quickly (minutes instead of hours; CNN/DM can be ready to train in about 10 minutes from the raw data).
- Integration with huggingface/nlp means any summarization dataset in the `nlp` library can be used for training by only modifying 4 options (specifically `--dataset`, `--dataset_version`, `--data_example_column`, and `--data_summarized_column`). The `nlp` library will handle downloading and pre-processing while the `abstractive.py` script will handle tokenization automatically.
- Compatible with every huggingface/transformers EncoderDecoder model. `PreSumm` only supports a BERT encoder and a standard transformer decoder, whereas this project supports all EncoderDecoder models by changing a single option (`--model_name_or_path`), as sketched below.
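In plain `huggingface/transformers` terms, swapping the model behind `--model_name_or_path` is roughly equivalent to the following; the model names are examples, and any seq2seq/EncoderDecoder checkpoint could be substituted.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Changing the checkpoint name is all that is needed to switch models.
name = "facebook/bart-base"  # or "t5-small", "google/pegasus-xsum", ...
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("A long article to summarize ...", return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```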
Where BertSum is Better
- `BertSum` has an LSTM classifier, which `TransformerSum` does not replicate.