Extractive Pre-trained Models & Results

The recommended model is distilroberta-base-ext-sum because of its fast inference, relatively low parameter count, and strong ROUGE scores.
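
For reference, a downloaded checkpoint can be loaded for inference with the project's `ExtractiveSummarizer` class. A minimal sketch (the checkpoint filename is a placeholder for whatever you download from the tables below; see the repository README for the canonical snippet):

```python
# Minimal inference sketch. Assumes TransformerSum's `extractive` module
# is importable; the checkpoint path is a hypothetical local file.
from extractive import ExtractiveSummarizer

model = ExtractiveSummarizer.load_from_checkpoint(
    "distilroberta-base-ext-sum.ckpt"  # placeholder path
)
summary = model.predict("Document text to summarize goes here.")
print(summary)
```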

Notes

The distil* models are of special significance. Distil* is a class of compressed models that started with DistilBERT (distilled BERT), a small, fast, cheap, and light Transformer model based on the BERT architecture. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving 99% of BERT's performance as measured on the GLUE language understanding benchmark. Similarly, DistilRoBERTa reaches 95% of RoBERTa-base's performance on GLUE and is twice as fast as RoBERTa while being 35% smaller. More info at huggingface/transformers.

The remarkable performance-to-size ratio of the distil* models transfers to summarization. distilroberta is recommended over distilbert because of the pretraining improvements RoBERTa made over the original BERT; essentially, distilroberta is the more modern of the two.

MobileBERT is similar to distilbert in that it is a smaller version of BERT that achieves impressive performance at a very small size. According to the authors, MobileBERT is 2.64x smaller and 2.45x faster than DistilBERT. DistilBERT halves the depth of the BERT model via knowledge distillation in the pre-training stage and an optional fine-tuning stage; MobileBERT only uses knowledge transfer in the pre-training stage and does not require a fine-tuned teacher or data augmentation on downstream tasks. In other words, DistilBERT compresses BERT by reducing its depth, while MobileBERT compresses BERT by reducing its width, which has been shown to be more effective. Note that MobileBERT usually needs a larger learning rate and more fine-tuning epochs than the original BERT, as reflected in the lr=8e-5 runs below and the sketch that follows.
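
A hedged example of such a run (the script and flag names mirror the project's main.py conventions, but treat them as assumptions and confirm with `python main.py --help`; the data path is a placeholder):

```
python main.py \
    --mode extractive \
    --model_name_or_path google/mobilebert-uncased \
    --data_path ./cnn_dm_pt \
    --learning_rate 8e-5 \
    --do_train --do_test
```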

Important

Interactive charts, graphs, raw data, run commands, hyperparameter choices, and more for all trained models are publicly available on the TransformerSum Weights & Biases page. You can download the raw data for each model on this site, or download an overview as a CSV. Please open an issue if you have questions about these models.

Additionally, all of the models on this page were trained completely for free using Tesla P100-PCIE-16GB GPUs on Google Colaboratory. Those that took over 12 hours to train were split into multiple training sessions since pytorch_lightning enables easy resuming with the --resume_from_checkpoint argument.
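
For example, an interrupted run might be resumed with something like the following (a sketch only: aside from `--resume_from_checkpoint` itself, the flags and paths are placeholders that should match your original training command):

```
python main.py \
    --mode extractive \
    --model_name_or_path distilroberta-base \
    --data_path ./cnn_dm_pt \
    --do_train \
    --resume_from_checkpoint ./checkpoints/last.ckpt
```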

CNN/DM

| Name | Comments | Model Download | Data Download |
|------|----------|----------------|---------------|
| distilbert-base-uncased-ext-sum | None | Model & All Checkpoints | CNN/DM Bert Uncased |
| distilroberta-base-ext-sum | None | Model & All Checkpoints | CNN/DM Roberta |
| bert-base-uncased-ext-sum | None | Model & All Checkpoints | CNN/DM Bert Uncased |
| roberta-base-ext-sum | None | Model & All Checkpoints | CNN/DM Roberta |
| bert-large-uncased-ext-sum | None | Not yet… | CNN/DM Bert Uncased |
| roberta-large-ext-sum | None | Not yet… | CNN/DM Roberta |
| longformer-base-4096-ext-sum | None | Not yet… | CNN/DM Longformer |
| mobilebert-uncased-ext-sum | Trained with lr=8e-5 | Model & All Checkpoints | CNN/DM Bert Uncased |

CNN/DM ROUGE Scores

Test set results on the CNN/DailyMail dataset using ROUGE F1.

| Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum |
|------|---------|---------|---------|-------------|
| distilbert-base-uncased-ext-sum | 42.71 | 19.91 | 27.52 | 39.18 |
| distilroberta-base-ext-sum | 42.87 | 20.02 | 27.46 | 39.31 |
| bert-base-uncased-ext-sum | 42.78 | 19.83 | 27.43 | 39.18 |
| roberta-base-ext-sum | 43.24 | 20.36 | 27.64 | 39.65 |
| bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 42.01 | 19.31 | 26.89 | 38.53 |

Note

Currently, distilbert slightly beats bert-base-uncased, averaging about 0.14% higher across the four ROUGE metrics ((42.71/42.78 + 19.91/19.83 + 27.52/27.43 + 39.18/39.18)/4 ≈ 1.0014). Since bert-base-uncased has more parameters than distilbert, this is unusual and is likely a tuning issue; tuning the hyperparameters of bert-base-uncased should improve its performance. distilroberta matches about 99.0% of the performance of roberta-base ((42.87/43.24 + 20.02/20.36 + 27.46/27.64 + 39.31/39.65)/4 ≈ 0.9899).
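
These averages are easy to reproduce; the snippet below recomputes them from the ROUGE table above (scores hard-coded for illustration):

```python
# Average per-metric ROUGE ratio between two models.
def avg_ratio(a, b):
    return sum(x / y for x, y in zip(a, b)) / len(a)

# R1, R2, RL, RL-Sum from the CNN/DM table above.
print(avg_ratio([42.71, 19.91, 27.52, 39.18],   # distilbert
                [42.78, 19.83, 27.43, 39.18]))  # bert-base-uncased -> ~1.0014
print(avg_ratio([42.87, 20.02, 27.46, 39.31],   # distilroberta
                [43.24, 20.36, 27.64, 39.65]))  # roberta-base -> ~0.9899
```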

Important

mobilebert-uncased-ext-sum achieves 96.59% ((42.01/43.25+19.31/20.24+38.53/39.63)/3) of the performance of BertSum while containing 4.45 times (109483009/24582401) fewer parameters. It achieves 94.06% ((42.01/44.41+19.31/20.86+38.53/40.55)/3) of the performance of MatchSum (Zhong et al., 2020), the current extractive state-of-the-art.

CNN/DM Training Times and Model Sizes

| Name | Time | Model Size |
|------|------|------------|
| distilbert-base-uncased-ext-sum | 6h 22m 32s | 796.4MB |
| distilroberta-base-ext-sum | 6h 21m 37s | 980.8MB |
| bert-base-uncased-ext-sum | 12h 51m 17s | 1.3GB |
| roberta-base-ext-sum | 13h 7m 3s | 1.5GB |
| bert-large-uncased-ext-sum | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… |
| longformer-base-4096-ext-sum | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 8h 26m 32s | 295.6MB |

Important

distilroberta-base-ext-sum trains in about 6.5 hours on one P100-PCIE-16GB GPU, while MatchSum, the current state-of-the-art in extractive summarization on CNN/DM, takes 30 hours on 8 Tesla V100-16GB GPUs to train. If a V100 is about 2x as powerful as a P100, training MatchSum on a single P100 would take roughly 480 hours (30*8*2). This simplistic approximation suggests that MatchSum takes about 74x (480/6.5 ≈ 73.8) as long to train as distilroberta-base-ext-sum.
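
The same back-of-the-envelope arithmetic in Python (the 2x V100-to-P100 factor is an assumption, as noted above):

```python
# Rough single-P100 training-time comparison.
matchsum_p100_hours = 30 * 8 * 2   # 30h on 8 V100s, V100 ~ 2x a P100
distilroberta_hours = 6.5          # measured on 1 P100
print(matchsum_p100_hours / distilroberta_hours)  # ~73.8x
```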

WikiHow

| Name | Comments | Model Download | Data Download |
|------|----------|----------------|---------------|
| distilbert-base-uncased-ext-sum | None | Model & All Checkpoints | WikiHow Bert Uncased |
| distilroberta-base-ext-sum | None | Model & All Checkpoints | WikiHow Roberta |
| bert-base-uncased-ext-sum | None | Model & All Checkpoints | WikiHow Bert Uncased |
| roberta-base-ext-sum | None | Model & All Checkpoints | WikiHow Roberta |
| bert-large-uncased-ext-sum | None | Not yet… | WikiHow Bert Uncased |
| roberta-large-ext-sum | None | Not yet… | WikiHow Roberta |
| mobilebert-uncased-ext-sum | Trained with lr=8e-5 | Model & All Checkpoints | WikiHow Bert Uncased |

WikiHow ROUGE Scores

Test set results on the WikiHow dataset using ROUGE F1.

| Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum |
|------|---------|---------|---------|-------------|
| distilbert-base-uncased-ext-sum | 30.69 | 8.65 | 19.13 | 28.58 |
| distilroberta-base-ext-sum | 31.07 | 8.96 | 19.34 | 28.95 |
| bert-base-uncased-ext-sum | 30.68 | 8.67 | 19.16 | 28.59 |
| roberta-base-ext-sum | 31.26 | 9.09 | 19.47 | 29.14 |
| bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 30.72 | 8.78 | 19.18 | 28.59 |

Note

These results are fairly good for an extractive model, as they approach the scores of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 32.48/10.53/23.86, a net difference of +1.41/+1.57/-5.09 relative to distilroberta-base-ext-sum's 31.07/8.96/28.95. Compared to the abstractive SOTA prior to PEGASUS (28.53/9.23/26.54), distilroberta-base-ext-sum scores +2.54/-0.27/+2.41. However, the base PEGASUS model obtains scores of 36.58/15.64/30.01, which are much better than distilroberta-base-ext-sum, as one would expect.

WikiHow Training Times and Model Sizes

| Name | Time | Model Size |
|------|------|------------|
| distilbert-base-uncased-ext-sum | 3h 42m 12s | 796.4MB |
| distilroberta-base-ext-sum | 4h 27m 23s | 980.8MB |
| bert-base-uncased-ext-sum | 7h 29m 06s | 1.3GB |
| roberta-base-ext-sum | 7h 35m 59s | 1.5GB |
| bert-large-uncased-ext-sum | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 4h 22m 19s | 295.6MB |

arXiv-PubMed

| Name | Comments | Model Download | Data Download |
|------|----------|----------------|---------------|
| distilbert-base-uncased-ext-sum | None | Model & All Checkpoints | arXiv-PubMed Bert Uncased |
| distilroberta-base-ext-sum | None | Model & All Checkpoints | arXiv-PubMed Roberta |
| bert-base-uncased-ext-sum | None | Model & All Checkpoints | arXiv-PubMed Bert Uncased |
| roberta-base-ext-sum | None | Model & All Checkpoints | arXiv-PubMed Roberta |
| bert-large-uncased-ext-sum | None | Not yet… | arXiv-PubMed Bert Uncased |
| roberta-large-ext-sum | None | Not yet… | arXiv-PubMed Roberta |
| longformer-base-4096-ext-sum | None | Not yet… | arXiv-PubMed Longformer |
| mobilebert-uncased-ext-sum | None | Model & All Checkpoints | arXiv-PubMed Bert Uncased |

arXiv-PubMed ROUGE Scores

Test set results on the arXiv-PubMed dataset using ROUGE F1.

| Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum |
|------|---------|---------|---------|-------------|
| distilbert-base-uncased-ext-sum | 34.93 | 12.21 | 19.62 | 31.00 |
| distilroberta-base-ext-sum | 34.70 | 12.16 | 19.52 | 30.82 |
| bert-base-uncased-ext-sum | 34.80 | 12.26 | 19.67 | 30.92 |
| roberta-base-ext-sum | 34.81 | 12.26 | 19.65 | 30.91 |
| bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 33.97 | 11.74 | 19.63 | 30.19 |

Note

These results are fairly good for an extractive model, as they approach the scores of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 34.79/7.69/19.51 (average of 35.63/7.95/20.00 on arXiv and 33.94/7.43/19.02 on PubMed), a net difference of +0.09/-4.47/-11.31 relative to distilroberta-base-ext-sum's 34.70/12.16/30.82. Compared to the abstractive SOTA prior to PEGASUS, which was 41.09/14.93/23.57 (average of 41.59/14.26/23.55 on arXiv and 40.59/15.59/23.59 on PubMed), distilroberta-base-ext-sum scores -6.39/-2.77/+7.25. However, the base PEGASUS model obtains 37.39/12.66/23.87 (average of 34.81/10.16/22.50 on arXiv and 39.98/15.15/25.23 on PubMed), and the large model obtains 45.10/18.59/26.75 (average of 44.70/17.27/25.80 on arXiv and 45.49/19.90/27.69 on PubMed), which are much better than distilroberta-base-ext-sum, as one would expect.

arXiv-PubMed Training Times and Model Sizes

| Name | Time | Model Size |
|------|------|------------|
| distilbert-base-uncased-ext-sum | 6h 46m 0s | 796.4MB |
| distilroberta-base-ext-sum | 6h 33m 58s | 980.8MB |
| bert-base-uncased-ext-sum | 14h 40m 10s | 1.3GB |
| roberta-base-ext-sum | 14h 39m 43s | 1.5GB |
| bert-large-uncased-ext-sum | Not yet… | Not yet… |
| roberta-large-ext-sum | Not yet… | Not yet… |
| longformer-base-4096-ext-sum | Not yet… | Not yet… |
| mobilebert-uncased-ext-sum | 9h 5m 45s | 295.6MB |