Training an Extractive Summarization Model
Details
Once the dataset has been converted to the extractive task, it can be used as input to a data.SentencesProcessor, which has an add_examples() function to add sets of (example, labels) and a get_features() function that processes the data and prepares it to be input into the model (input_ids, attention_masks, labels, token_type_ids, sent_rep_token_ids, sent_rep_token_ids_masks). Feature extraction runs in parallel and tokenizes the text using the tokenizer appropriate for the model specified with --model_name_or_path. The tokenizer can be changed to another huggingface/transformers tokenizer with the --tokenizer_name option.
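The snippet below is a minimal sketch of this flow. The argument names and the layout of the example data are assumptions made for illustration, not the definitive API; check data.py for the real signatures of add_examples() and get_features().

from transformers import AutoTokenizer
from data import SentencesProcessor  # import path assumed from the project layout

# One document is a list of sentences; each label marks whether the matching
# sentence belongs in the summary (this data layout is illustrative).
documents = [["The first sentence.", "The second sentence.", "A third sentence."]]
labels = [[1, 0, 1]]

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

processor = SentencesProcessor()
processor.add_examples(documents, labels=labels)  # argument names assumed

# get_features() tokenizes the text and builds input_ids, attention_masks,
# labels, token_type_ids, sent_rep_token_ids, and sent_rep_token_ids_masks.
features = processor.get_features(tokenizer, max_length=512)  # argument names assumed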
Important
When loading a pre-trained model you may encounter this common error:
RuntimeError: Error(s) in loading state_dict for ExtractiveSummarizer:
Missing key(s) in state_dict: "word_embedding_model.embeddings.position_ids".
To solve this issue, set strict=False like so: model = ExtractiveSummarizer.load_from_checkpoint("distilroberta-base-ext-sum.ckpt", strict=False). If you are using the main.py script, then you can alternatively specify the --no_strict option.
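For instance (the import path is an assumption; use wherever ExtractiveSummarizer is defined in your copy of the project):

from extractive import ExtractiveSummarizer  # import path assumed

# strict=False lets the checkpoint load despite the missing
# "word_embedding_model.embeddings.position_ids" key.
model = ExtractiveSummarizer.load_from_checkpoint(
    "distilroberta-base-ext-sum.ckpt", strict=False
)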
To train a model for 50,000 steps on the CNN/DM dataset, run:
python main.py --data_path ./datasets/cnn_dailymail_processor/cnn_dm --weights_save_path ./trained_models --do_train --max_steps 50000
The --do_train argument runs the training process. Set --do_test to test after training (see the example after this list).
The --data_path argument specifies where the extractive dataset JSON files are located.
The --weights_save_path argument specifies where the model weights should be stored.
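For example, to run the test set immediately after the 50,000-step training run above (same paths as before):

python main.py --data_path ./datasets/cnn_dailymail_processor/cnn_dm --weights_save_path ./trained_models --do_train --do_test --max_steps 50000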
If you prefer to measure training progress by epochs instead of steps, you can use the --max_epochs and --min_epochs options.
The batch size can be changed with the --batch_size option. This changes the batch size for training, validation, and testing. You can set the --auto_scale_batch_size option to automatically determine this value. See "Auto scaling of batch size" in the pytorch_lightning documentation for more information about the algorithm and available options.
If the extractive dataset json files are compressed using gzip, then they will be automatically decompressed during the data preprocessing step of training.
By default, the model weights are saved after every epoch to the --default_root_dir. The logs are also saved to this folder. You can change the weight save path (so that weights and logs are stored in separate folders) with the --weights_save_path option.
The length of output summaries during testing is 3 sentences by default. You can change this by setting the --test_k option to the number of sentences desired in generated summaries. This assumes --test_id_method is set to top_k, which is the default. top_k selects the top k sentences, while the other option, greater_k, selects the sentences with a rank above k. k is specified by the --test_k argument.
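For example, to generate 5-sentence summaries during testing (the value of 5 is illustrative):

python main.py --data_path ./datasets/cnn_dailymail_processor/cnn_dm --weights_save_path ./trained_models --do_train --do_test --max_steps 50000 --test_id_method top_k --test_k 5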
Important
More example training commands can be found on the TransformerSum Weights & Biases page. Just click the name of a training run, go to the overview page by clicking the “i” icon in the top left, and look at the command value.
Pooling Modes
The pooling mode determines how word vectors should be converted to sentence embeddings. The implementation can be found in pooling.py. The --pooling_mode argument can be set to sent_rep_tokens, mean_tokens, or max_tokens. While the pooling nn.Module allows multiple methods to be used at once (it will concatenate and return the results), the training script does not.
sent_rep_tokens: Uses the sentence representation token (commonly called the classification token; [CLS] in BERT and <s> in RoBERTa) vectors as sentence embeddings.
mean_tokens: Uses the average of the token vectors for each sentence in the input as sentence embeddings.
max_tokens: Uses the maximum of the token vectors for each sentence in the input as sentence embeddings.
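For example, to train with mean pooling instead of the sentence representation tokens:

python main.py --data_path ./datasets/cnn_dailymail_processor/cnn_dm --weights_save_path ./trained_models --do_train --max_steps 50000 --pooling_mode mean_tokens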
Custom Models
You can use any autoencoding transformer model for the word embedding model (by setting the --model_name_or_path CLI argument) as long as it was saved in the huggingface/transformers format. Any model that is loaded with this option by specifying a path is considered "custom" in this project. Currently, there are no "custom" models that are "officially" supported. The longformer used to be a custom model, but it has since been added to the huggingface/transformers repository and can therefore be used in this project just like any other model.
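For example, to train with a locally saved huggingface/transformers model as the word embedding model (the path below is a placeholder):

python main.py --model_name_or_path ./path/to/custom/model --data_path ./datasets/cnn_dailymail_processor/cnn_dm --weights_save_path ./trained_models --do_train --max_steps 50000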
Script Help
Output of python main.py --mode extractive --help (generic options removed):
usage: main.py [-h]
[--model_name_or_path MODEL_NAME_OR_PATH] [--model_type MODEL_TYPE]
[--tokenizer_name TOKENIZER_NAME] [--tokenizer_no_use_fast]
[--max_seq_length MAX_SEQ_LENGTH] [--data_path DATA_PATH]
[--data_type {txt,pt,none}] [--num_threads NUM_THREADS]
[--processing_num_threads PROCESSING_NUM_THREADS]
[--pooling_mode {sent_rep_tokens,mean_tokens,max_tokens}]
[--num_frozen_steps NUM_FROZEN_STEPS] [--batch_size BATCH_SIZE]
[--dataloader_type {map,iterable}]
[--dataloader_num_workers DATALOADER_NUM_WORKERS]
[--processor_no_bert_compatible_cls] [--only_preprocess]
[--preprocess_resume] [--create_token_type_ids {binary,sequential}]
[--no_use_token_type_ids]
[--classifier {linear,simple_linear,transformer,transformer_linear}]
[--classifier_dropout CLASSIFIER_DROPOUT]
[--classifier_transformer_num_layers CLASSIFIER_TRANSFORMER_NUM_LAYERS]
[--train_name TRAIN_NAME] [--val_name VAL_NAME]
[--test_name TEST_NAME] [--test_id_method {greater_k,top_k}]
[--test_k TEST_K] [--no_test_block_trigrams] [--test_use_pyrouge]
[--loss_key {loss_total,loss_total_norm_batch,loss_avg_seq_sum,loss_avg_seq_mean,loss_avg}]
optional arguments:
-h, --help show this help message and exit
--model_name_or_path MODEL_NAME_OR_PATH
Path to pre-trained model or shortcut name. A list of
shortcut names can be found at https://huggingface.co/tran
sformers/pretrained_models.html. Community-uploaded models
are located at https://huggingface.co/models.
--model_type MODEL_TYPE
Model type selected in the list: retribert, t5,
distilbert, albert, camembert, xlm-roberta, bart,
longformer, roberta, bert, openai-gpt, gpt2, mobilebert,
transfo-xl, xlnet, flaubert, xlm, ctrl, electra, reformer
--tokenizer_name TOKENIZER_NAME
--tokenizer_no_use_fast
Don't use the fast version of the tokenizer for the
specified model. More info: https://huggingface.co/transfo
rmers/main_classes/tokenizer.html.
--max_seq_length MAX_SEQ_LENGTH
The maximum sequence length of processed documents.
--data_path DATA_PATH
Directory containing the dataset.
--data_type {txt,pt,none}
The file extension of the prepared data. The 'map'
`--dataloader_type` requires `txt` and the 'iterable'
`--dataloader_type` works with both. If the data is not
prepared yet (in JSON format) this value specifies the
output format after processing. If the data is prepared,
this value specifies the format to load. If it is `none`
then the type of data to be loaded will be inferred from
the `data_path`. If data needs to be prepared, this cannot
be set to `none`.
--num_threads NUM_THREADS
--processing_num_threads PROCESSING_NUM_THREADS
--pooling_mode {sent_rep_tokens,mean_tokens,max_tokens}
How word vectors should be converted to sentence
embeddings.
--num_frozen_steps NUM_FROZEN_STEPS
Freeze (don't train) the word embedding model for this
many steps.
--batch_size BATCH_SIZE
Batch size per GPU/CPU for training/evaluation/testing.
--dataloader_type {map,iterable}
The style of dataloader to use. `map` is faster and uses
less memory.
--dataloader_num_workers DATALOADER_NUM_WORKERS
The number of workers to use when loading data. A general
place to start is to set num_workers equal to the number
of CPU cores on your machine. If `--dataloader_type` is
'iterable' then this setting has no effect and num_workers
will be 1. More details here: https://pytorch-
lightning.readthedocs.io/en/latest/performance.html#num-
workers
--processor_no_bert_compatible_cls
If model uses bert compatible [CLS] tokens for sentence
representations.
--only_preprocess Only preprocess and write the data to disk. Don't train
model. This will force data to be preprocessed, even if it
was already computed and is detected on disk, and any
previous processed files will be overwritten.
--preprocess_resume Resume preprocessing. `--only_preprocess` must be set in
order to resume. Determines which files to process by
finding the shards that do not have a corresponding ".pt"
file in the data directory.
--create_token_type_ids {binary,sequential}
Create token type ids during preprocessing.
--no_use_token_type_ids
Set to not train with `token_type_ids` (don't pass them
into the model).
--classifier {linear,simple_linear,transformer,transformer_linear}
Which classifier/encoder to use to reduce the hidden
dimension of the sentence vectors. `linear` - a
`LinearClassifier` with two linear layers, dropout, and an
activation function. `simple_linear` - a
`LinearClassifier` with one linear layer and a sigmoid.
`transformer` - a `TransformerEncoderClassifier` which
runs the sentence vectors through some
`nn.TransformerEncoderLayer`s and then a simple
`nn.Linear` layer. `transformer_linear` - a
`TransformerEncoderClassifier` with a `LinearClassifier`
as the `reduction` parameter, which results in the same
thing as the `transformer` option but with a
`LinearClassifier` instead of a `nn.Linear` layer.
--classifier_dropout CLASSIFIER_DROPOUT
The value for the dropout layers in the classifier.
--classifier_transformer_num_layers CLASSIFIER_TRANSFORMER_NUM_LAYERS
The number of layers for the `transformer` classifier.
Only has an effect if `--classifier` contains
"transformer".
--train_name TRAIN_NAME
name for set of training files on disk (for loading and
saving)
--val_name VAL_NAME name for set of validation files on disk (for loading and
saving)
--test_name TEST_NAME
name for set of testing files on disk (for loading and
saving)
--test_id_method {greater_k,top_k}
How to choose the top predictions from the model for ROUGE
scores.
--test_k TEST_K The `k` parameter for the `--test_id_method`. Must be set
if using the `greater_k` option. (default: 3)
--no_test_block_trigrams
Disable trigram blocking when calculating ROUGE scores
during testing. This will increase repetition and thus
decrease accuracy.
--test_use_pyrouge Use `pyrouge`, which is an interface to the official ROUGE
software, instead of the pure-python implementation
provided by `rouge-score`. You must have the real ROUGE
package installed. More details about ROUGE 1.5.5 here: ht
tps://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-
1.5.5. It is recommended to use this option for official
scores. The `ROUGE-L` measurements from `pyrouge` are
equivalent to the `rougeLsum` measurements from the
default `rouge-score` package.
--loss_key {loss_total,loss_total_norm_batch,loss_avg_seq_sum,loss_avg_seq_mean,loss_avg}
Which reduction method to use with BCELoss. See the
`experiments/loss_functions/` folder for info on how the
default (`loss_avg_seq_mean`) was chosen.