Extractive Pre-trained Models & Results
The recommended model to use is `distilroberta-base-ext-sum` because of its speed, relatively small parameter count, and strong ROUGE scores.
Notes
The distil* models are of special significance. Distil* is a class of compressed models that started with DistilBERT, a distilled version of BERT. DistilBERT is a small, fast, cheap, and light Transformer model based on the BERT architecture: it has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 99% of BERT's performance as measured on the GLUE language understanding benchmark. Similarly, DistilRoBERTa reaches 95% of RoBERTa-base's performance on GLUE while being twice as fast and 35% smaller. More info at huggingface/transformers.
The remarkable performance-to-size ratio of the distil* models transfers to summarization. `distilroberta` is recommended over `distilbert` because of the architectural improvements that RoBERTa brought over the original BERT; essentially, `distilroberta` is the more modern of the two.
MobileBERT is similar to `distilbert` in that it is a smaller version of BERT that achieves impressive performance at a very small size. According to the authors, MobileBERT is 2.64x smaller and 2.45x faster than DistilBERT. DistilBERT halves the depth of the BERT model by knowledge distillation in the pre-training stage and an optional fine-tuning stage, whereas MobileBERT only uses knowledge transfer in the pre-training stage and does not require a fine-tuned teacher or data augmentation for downstream tasks. In short, DistilBERT compresses BERT by reducing its depth, while MobileBERT compresses BERT by reducing its width, which has been shown to be more effective. Note that MobileBERT usually needs a larger learning rate and more training epochs in fine-tuning than the original BERT.
Important
Interactive charts, graphs, raw data, run commands, hyperparameter choices, and more for all trained models are publicly available on the TransformerSum Weights & Biases page. You can download the raw data for each model on this site, or download an overview as a CSV. Please open an issue if you have questions about these models.
Additionally, all of the models on this page were trained completely for free using Tesla P100-PCIE-16GB GPUs on Google Colaboratory. Those that took over 12 hours to train were split into multiple training sessions, since `pytorch_lightning` enables easy resuming with the `--resume_from_checkpoint` argument.
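For illustration, resuming maps directly onto `pytorch_lightning`'s `Trainer`. A minimal sketch, assuming this repository's `ExtractiveSummarizer` `LightningModule` and a `pytorch_lightning` version from this project's era (the `resume_from_checkpoint` argument was removed in later 2.x releases):

```python
import pytorch_lightning as pl
from extractive import ExtractiveSummarizer  # module/class names assumed from this repository

# Rebuild the model (hyperparameters are stored in the checkpoint) and hand
# the checkpoint to the Trainer so optimizer state and the epoch/step
# counters are restored, letting a new Colaboratory session pick up where
# the previous one stopped.
model = ExtractiveSummarizer.load_from_checkpoint("checkpoints/last.ckpt")  # hypothetical path
trainer = pl.Trainer(gpus=1, resume_from_checkpoint="checkpoints/last.ckpt")
trainer.fit(model)
```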
CNN/DM
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
longformer-base-4096-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | Trained with lr=8e-5 | |
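Once a checkpoint is downloaded, it can be loaded for inference. A minimal sketch, assuming this repository's `ExtractiveSummarizer` class and its `predict` convenience method (names assumed from this project; check the repository for the exact interface):

```python
from extractive import ExtractiveSummarizer  # assumed module path in this repository

# load_from_checkpoint is the standard pytorch_lightning loader.
model = ExtractiveSummarizer.load_from_checkpoint("distilroberta-base-ext-sum.ckpt")

text = (
    "The first sentence of the article. The second sentence adds detail. "
    "The third sentence is probably less important."
)
# predict() is assumed to score each sentence and return the top-ranked
# sentences as the extractive summary.
print(model.predict(text))
```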
CNN/DM ROUGE Scores
Test set results on the CNN/DailyMail dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 42.71 | 19.91 | 27.52 | 39.18
distilroberta-base-ext-sum | 42.87 | 20.02 | 27.46 | 39.31
bert-base-uncased-ext-sum | 42.78 | 19.83 | 27.43 | 39.18
roberta-base-ext-sum | 43.24 | 20.36 | 27.64 | 39.65
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 42.01 | 19.31 | 26.89 | 38.53
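For reference, all four metrics can be computed with Google's `rouge-score` package, shown here as an assumed stand-in for the project's actual evaluation code:

```python
from rouge_score import rouge_scorer

# ROUGE-L treats the summary as one long sequence, while ROUGE-L-Sum
# ("rougeLsum") computes the LCS sentence-by-sentence and expects sentences
# separated by newlines; the latter is the variant most papers report.
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

reference = "The cat sat on the mat.\nIt was happy."
prediction = "A cat was sitting on the mat.\nIt seemed happy."

scores = scorer.score(reference, prediction)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```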
Note
Currently, `distilbert` beats `bert-base-uncased` by an average ratio of 1.0014, or about 0.14% ((42.71/42.78+19.91/19.83+27.52/27.43+39.18/39.18)/4=1.0014). Since `bert-base-uncased` has more parameters than `distilbert`, this is unusual and is likely a tuning issue, which suggests that tuning the hyperparameters of `bert-base-uncased` can improve its performance. `distilroberta` matches 99.0% of the performance of `roberta-base` ((42.87/43.24+20.02/20.36+27.46/27.64+39.31/39.65)/4=0.9899).
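The relative-performance figures in this note (and in the MobileBERT comparison below) are plain averages of per-metric ratios; a minimal sketch of that arithmetic:

```python
def avg_rouge_ratio(model, reference):
    """Average of per-metric ROUGE ratios between two models."""
    return sum(m / r for m, r in zip(model, reference)) / len(model)

# distilroberta-base-ext-sum vs. roberta-base-ext-sum (R1/R2/RL/RL-Sum)
ratio = avg_rouge_ratio([42.87, 20.02, 27.46, 39.31], [43.24, 20.36, 27.64, 39.65])
print(f"{ratio:.4f}")  # 0.9899, i.e. ~99.0% of roberta-base's performance
```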
Important
`mobilebert-uncased-ext-sum` achieves 96.59% ((42.01/43.25+19.31/20.24+38.53/39.63)/3) of the performance of BertSum while containing 4.45 times (109483009/24582401) fewer parameters. It achieves 94.06% ((42.01/44.41+19.31/20.86+38.53/40.55)/3) of the performance of MatchSum (Zhong et al., 2020), the current extractive state-of-the-art.
CNN/DM Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 6h 22m 32s | 796.4MB
distilroberta-base-ext-sum | 6h 21m 37s | 980.8MB
bert-base-uncased-ext-sum | 12h 51m 17s | 1.3GB
roberta-base-ext-sum | 13h 7m 3s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 8h 26m 32s | 295.6MB
Important
`distilroberta-base-ext-sum` trains in about 6.5 hours on one P100-PCIE-16GB GPU, while MatchSum, the current state-of-the-art in extractive summarization on CNN/DM, takes 30 hours on 8 Tesla V100-16GB GPUs to train. If a V100 is about 2x as powerful as a P100, then it would take 480 hours (`30*8*2`) to train MatchSum on a single P100. This simplistic approximation suggests that it takes about 74x (`480/6.5`) more time to train MatchSum than `distilroberta-base-ext-sum`.
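The back-of-the-envelope estimate as code (the 2x V100-to-P100 factor is an assumption, as stated above):

```python
# Normalize MatchSum's training cost to single-P100 hours.
matchsum_hours = 30 * 8 * 2   # 30h * 8 V100s * ~2 P100-equivalents each
distilroberta_hours = 6.5     # measured on one P100
print(matchsum_hours / distilroberta_hours)  # ~73.8, i.e. roughly 74x longer
```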
WikiHow
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | Trained with lr=8e-5 | |
WikiHow ROUGE Scores
Test set results on the WikiHow dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 30.69 | 8.65 | 19.13 | 28.58
distilroberta-base-ext-sum | 31.07 | 8.96 | 19.34 | 28.95
bert-base-uncased-ext-sum | 30.68 | 8.67 | 19.16 | 28.59
roberta-base-ext-sum | 31.26 | 9.09 | 19.47 | 29.14
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 30.72 | 8.78 | 19.18 | 28.59
Note
These results are fairly good for an extractive model because they come close to those of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 32.48/10.53/23.86, a net difference of +1.41/+1.57/-5.09 from `distilroberta-base-ext-sum`. Compared to the abstractive SOTA prior to PEGASUS, which was 28.53/9.23/26.54, `distilroberta-base-ext-sum` performs +2.54/-0.27/+2.41. However, the base PEGASUS model obtains scores of 36.58/15.64/30.01, which are much better than `distilroberta-base-ext-sum`, as one would expect.
WikiHow Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 3h 42m 12s | 796.4MB
distilroberta-base-ext-sum | 4h 27m 23s | 980.8MB
bert-base-uncased-ext-sum | 7h 29m 06s | 1.3GB
roberta-base-ext-sum | 7h 35m 59s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 4h 22m 19s | 295.6MB
arXiv-PubMed
Name | Comments | Model Download | Data Download
---|---|---|---
distilbert-base-uncased-ext-sum | None | |
distilroberta-base-ext-sum | None | |
bert-base-uncased-ext-sum | None | |
roberta-base-ext-sum | None | |
bert-large-uncased-ext-sum | None | Not yet… |
roberta-large-ext-sum | None | Not yet… |
longformer-base-4096-ext-sum | None | Not yet… |
mobilebert-uncased-ext-sum | None | |
arXiv-PubMed ROUGE Scores
Test set results on the arXiv-PubMed dataset using ROUGE F1.
Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum
---|---|---|---|---
distilbert-base-uncased-ext-sum | 34.93 | 12.21 | 19.62 | 31.00
distilroberta-base-ext-sum | 34.70 | 12.16 | 19.52 | 30.82
bert-base-uncased-ext-sum | 34.80 | 12.26 | 19.67 | 30.92
roberta-base-ext-sum | 34.81 | 12.26 | 19.65 | 30.91
bert-large-uncased-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet… | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 33.97 | 11.74 | 19.63 | 30.19
Note
These results are fairly good for an extractive model because they come close to those of abstractive models. The R1/R2/RL-Sum results of a base transformer model from the PEGASUS paper are 34.79/7.69/19.51 (the average of 35.63/7.95/20.00 on arXiv and 33.94/7.43/19.02 on PubMed), a net difference of +0.09/-4.47/-11.31 from `distilroberta-base-ext-sum`. Compared to the abstractive SOTA prior to PEGASUS, which was 41.09/14.93/23.57 (the average of 41.59/14.26/23.55 on arXiv and 40.59/15.59/23.59 on PubMed), `distilroberta-base-ext-sum` performs -6.39/-2.77/+7.25. However, the base PEGASUS model obtains scores of 37.39/12.66/23.87 (the average of 34.81/10.16/22.50 on arXiv and 39.98/15.15/25.23 on PubMed), and the large model obtains 45.10/18.59/26.75 (the average of 44.70/17.27/25.80 on arXiv and 45.49/19.90/27.69 on PubMed), which are much better than `distilroberta-base-ext-sum`, as one would expect.
arXiv-PubMed Training Times and Model Sizes
Name | Time | Model Size
---|---|---
distilbert-base-uncased-ext-sum | 6h 46m 0s | 796.4MB
distilroberta-base-ext-sum | 6h 33m 58s | 980.8MB
bert-base-uncased-ext-sum | 14h 40m 10s | 1.3GB
roberta-base-ext-sum | 14h 39m 43s | 1.5GB
bert-large-uncased-ext-sum | Not yet… | Not yet…
roberta-large-ext-sum | Not yet… | Not yet…
longformer-base-4096-ext-sum | Not yet… | Not yet…
mobilebert-uncased-ext-sum | 9h 5m 45s | 295.6MB