Extractive Pre-trained Models & Results
The recommended model to use is `distilroberta-base-ext-sum` because of its speed, relatively small parameter count, and strong ROUGE scores.
Notes
The distil* models are of special significance. Distil* is a class of compressed models that started with DistilBERT, a distilled version of BERT. DistilBERT is a small, fast, cheap, and light Transformer model based on the BERT architecture: it has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 99% of BERT's performance as measured on the GLUE language understanding benchmark. Similarly, DistilRoBERTa reaches 95% of RoBERTa-base's performance on GLUE while being twice as fast and 35% smaller. More info at huggingface/transformers.
The remarkable performance-to-size ratio of the distil* models transfers to summarization. `distilroberta` is recommended over `distilbert` because of the architectural improvements that RoBERTa brought over the original BERT; essentially, `distilroberta` is the more modern of the two.
MobileBERT is similar to `distilbert` in that it is a smaller version of BERT that achieves impressive performance at a very small size. According to the authors, MobileBERT is 2.64x smaller and 2.45x faster than DistilBERT. DistilBERT halves the depth of the BERT model by knowledge distillation in the pre-training stage and an optional fine-tuning stage, whereas MobileBERT only uses knowledge transfer in the pre-training stage and does not require a fine-tuned teacher or data augmentation for downstream tasks. In short, DistilBERT compresses BERT by reducing its depth, while MobileBERT compresses BERT by reducing its width, which has been shown to be more effective. Note that MobileBERT usually needs a larger learning rate and more training epochs in fine-tuning than the original BERT.
Important
Interactive charts, graphs, raw data, run commands, hyperparameter choices, and more for all trained models are publicly available on the TransformerSum Weights & Biases page. You can download the raw data for each model on this site, or download an overview as a CSV. Please open an issue if you have questions about these models.
Additionally, all of the models on this page were trained completely for free using Tesla P100-PCIE-16GB GPUs on Google Colaboratory. Those that took over 12 hours to train were split into multiple training sessions, since `pytorch_lightning` enables easy resuming with the `--resume_from_checkpoint` argument.
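For illustration, resuming maps directly onto `pytorch_lightning`'s `Trainer`. A minimal sketch, assuming this repository's `ExtractiveSummarizer` `LightningModule` and a `pytorch_lightning` version from this project's era (the `resume_from_checkpoint` argument was removed in later 2.x releases):

```python
import pytorch_lightning as pl
from extractive import ExtractiveSummarizer  # module/class names assumed from this repository

# Rebuild the model (hyperparameters are stored in the checkpoint) and hand
# the checkpoint to the Trainer so optimizer state and the epoch/step
# counters are restored, letting a new Colaboratory session pick up where
# the previous one stopped.
model = ExtractiveSummarizer.load_from_checkpoint("checkpoints/last.ckpt")  # hypothetical path
trainer = pl.Trainer(gpus=1, resume_from_checkpoint="checkpoints/last.ckpt")
trainer.fit(model)
```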
CNN/DM
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
longformer-base-4096-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | Trained with lr=8e-5 | |
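Once a checkpoint is downloaded, it can be loaded for inference. A minimal sketch, assuming this repository's `ExtractiveSummarizer` class and its `predict` convenience method (names assumed from this project; check the repository for the exact interface):

```python
from extractive import ExtractiveSummarizer  # assumed module path in this repository

# load_from_checkpoint is the standard pytorch_lightning loader.
model = ExtractiveSummarizer.load_from_checkpoint("distilroberta-base-ext-sum.ckpt")

text = (
    "The first sentence of the article. The second sentence adds detail. "
    "The third sentence is probably less important."
)
# predict() is assumed to score each sentence and return the top-ranked
# sentences as the extractive summary.
print(model.predict(text))
```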
CNN/DM ROUGE Scores
Test set results on the CNN/DailyMail dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 42.71 | 19.91 | 27.52 | 39.18
distilroberta-base-ext-sum | 42.87 | 20.02 | 27.46 | 39.31
bert-base-uncased-ext-sum | 42.78 | 19.83 | 27.43 | 39.18
roberta-base-ext-sum | 43.24 | 20.36 | 27.64 | 39.65
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 42.01 | 19.31 | 26.89 | 38.53
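For reference, all four metrics can be computed with Google's `rouge-score` package, shown here as an assumed stand-in for the project's actual evaluation code:

```python
from rouge_score import rouge_scorer

# ROUGE-L treats the summary as one long sequence, while ROUGE-L-Sum
# ("rougeLsum") computes the LCS sentence-by-sentence and expects sentences
# separated by newlines; the latter is the variant most papers report.
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

reference = "The cat sat on the mat.\nIt was happy."
prediction = "A cat was sitting on the mat.\nIt seemed happy."

scores = scorer.score(reference, prediction)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```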
Note
Currently, `distilbert` beats `bert-base-uncased` by an average ratio of 1.0014, or about 0.14% ((42.71/42.78+19.91/19.83+27.52/27.43+39.18/39.18)/4=1.0014). Since `bert-base-uncased` has more parameters than `distilbert`, this is unusual and is likely a tuning issue, which suggests that tuning the hyperparameters of `bert-base-uncased` can improve its performance. `distilroberta` matches 99.0% of the performance of `roberta-base` ((42.87/43.24+20.02/20.36+27.46/27.64+39.31/39.65)/4=0.9899).
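The relative-performance figures in this note (and in the MobileBERT comparison below) are plain averages of per-metric ratios; a minimal sketch of that arithmetic:

```python
def avg_rouge_ratio(model, reference):
    """Average of per-metric ROUGE ratios between two models."""
    return sum(m / r for m, r in zip(model, reference)) / len(model)

# distilroberta-base-ext-sum vs. roberta-base-ext-sum (R1/R2/RL/RL-Sum)
ratio = avg_rouge_ratio([42.87, 20.02, 27.46, 39.31], [43.24, 20.36, 27.64, 39.65])
print(f"{ratio:.4f}")  # 0.9899, i.e. ~99.0% of roberta-base's performance
```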
Important
`mobilebert-uncased-ext-sum` achieves 96.59% ((42.01/43.25+19.31/20.24+38.53/39.63)/3) of the performance of BertSum while containing 4.45 times (109483009/24582401) fewer parameters. It achieves 94.06% ((42.01/44.41+19.31/20.86+38.53/40.55)/3) of the performance of MatchSum (Zhong et al., 2020), the current extractive state-of-the-art.
CNN/DM Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 6h 22m 32s | 796.4MB
distilroberta-base-ext-sum | 6h 21m 37s | 980.8MB
bert-base-uncased-ext-sum | 12h 51m 17s | 1.3GB
roberta-base-ext-sum | 13h 7m 3s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 8h 26m 32s | 295.6MB
Important
`distilroberta-base-ext-sum` trains in about 6.5 hours on one P100-PCIE-16GB GPU, while MatchSum, the current state-of-the-art in extractive summarization on CNN/DM, takes 30 hours on 8 Tesla V100-16GB GPUs to train. If a V100 is about 2x as powerful as a P100, then it would take 480 hours (`30*8*2`) to train MatchSum on a single P100. This simplistic approximation suggests that it takes about 74x (`480/6.5`) more time to train MatchSum than `distilroberta-base-ext-sum`.
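The back-of-the-envelope estimate as code (the 2x V100-to-P100 factor is an assumption, as stated above):

```python
# Normalize MatchSum's training cost to single-P100 hours.
matchsum_hours = 30 * 8 * 2   # 30h * 8 V100s * ~2 P100-equivalents each
distilroberta_hours = 6.5     # measured on one P100
print(matchsum_hours / distilroberta_hours)  # ~73.8, i.e. roughly 74x longer
```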
WikiHow
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | Trained with lr=8e-5 | |
WikiHow ROUGE Scores
Test set results on the WikiHow dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 30.69 | 8.65 | 19.13 | 28.58
distilroberta-base-ext-sum | 31.07 | 8.96 | 19.34 | 28.95
bert-base-uncased-ext-sum | 30.68 | 8.67 | 19.16 | 28.59
roberta-base-ext-sum | 31.26 | 9.09 | 19.47 | 29.14
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 30.72 | 8.78 | 19.18 | 28.59
Note
These results are fairly good for an extractive model because they come close to those of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 32.48/10.53/23.86, a net difference of +1.41/+1.57/-5.09 from `distilroberta-base-ext-sum`. Compared to the abstractive SOTA prior to PEGASUS, which was 28.53/9.23/26.54, `distilroberta-base-ext-sum` performs +2.54/-0.27/+2.41. However, the base PEGASUS model obtains scores of 36.58/15.64/30.01, which are much better than `distilroberta-base-ext-sum`, as one would expect.
WikiHow Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 3h 42m 12s | 796.4MB
distilroberta-base-ext-sum | 4h 27m 23s | 980.8MB
bert-base-uncased-ext-sum | 7h 29m 06s | 1.3GB
roberta-base-ext-sum | 7h 35m 59s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 4h 22m 19s | 295.6MB
arXiv-PubMed
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
longformer-base-4096-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | None | |
arXiv-PubMed ROUGE Scores
Test set results on the arXiv-PubMed dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 34.93 | 12.21 | 19.62 | 31.00
distilroberta-base-ext-sum | 34.70 | 12.16 | 19.52 | 30.82
bert-base-uncased-ext-sum | 34.80 | 12.26 | 19.67 | 30.92
roberta-base-ext-sum | 34.81 | 12.26 | 19.65 | 30.91
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 33.97 | 11.74 | 19.63 | 30.19
Note
These results are fairly good for an extractive model because they come close to those of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 34.79/7.69/19.51 (the average of 35.63/7.95/20.00 on arXiv and 33.94/7.43/19.02 on PubMed), a net difference of +0.09/-4.47/-11.31 from `distilroberta-base-ext-sum`. Compared to the abstractive SOTA prior to PEGASUS, which was 41.09/14.93/23.57 (the average of 41.59/14.26/23.55 on arXiv and 40.59/15.59/23.59 on PubMed), `distilroberta-base-ext-sum` performs -6.39/-2.77/+7.25. However, the base PEGASUS model obtains scores of 37.39/12.66/23.87 (the average of 34.81/10.16/22.50 on arXiv and 39.98/15.15/25.23 on PubMed), and the large model obtains 45.10/18.59/26.75 (the average of 44.70/17.27/25.80 on arXiv and 45.49/19.90/27.69 on PubMed), which are much better than `distilroberta-base-ext-sum`, as one would expect.
arXiv-PubMed Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 6h 46m 0s | 796.4MB
distilroberta-base-ext-sum | 6h 33m 58s | 980.8MB
bert-base-uncased-ext-sum | 14h 40m 10s | 1.3GB
roberta-base-ext-sum | 14h 39m 43s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 9h 5m 45s | 295.6MB