# Extractive Supported Datasets
**Note:** In addition to the datasets below, all of the abstractive datasets can be converted for extractive summarization and thus used to train models. See Option 2: Automatic pre-processing through `nlp` for more information.
There are several ways to obtain and process the datasets below:

1. Download the converted extractive version for use with the training script, which will preprocess the data automatically (tokenization, etc.). Note that all the provided extractive versions are split every 500 documents and are compressed; you will have to process the data manually if you want different chunk sizes (a minimal re-chunking sketch is shown below).
2. Download the processed abstractive version. This is the original data after being run through its respective processor located in the `datasets` folder.
3. Download the original data in its original form, which depends on how it was obtained in the original paper.

The table under each heading contains quick links to download the data. Beneath that are instructions to process the data manually.
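If you need a chunk size other than the provided 500-document shards, one approach is to read the shards back in and regroup the examples before training. The following is a minimal sketch, assuming each pre-converted extractive shard is a gzip-compressed JSON file containing a list of examples; the file-name pattern, output naming scheme, and `resplit` helper are all illustrative assumptions, so adjust them to match the files you actually downloaded.

```python
# resplit_shards.py -- regroup pre-converted extractive shards into a new chunk size.
# Assumes each shard is a gzip-compressed JSON file holding a list of examples
# (assumption; inspect one shard to confirm before relying on this).
import glob
import gzip
import json
import os


def resplit(input_glob, output_dir, chunk_size=1000, prefix="train"):
    """Read every shard matching `input_glob` and rewrite the examples in `chunk_size` pieces."""
    examples = []
    for path in sorted(glob.glob(input_glob)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            examples.extend(json.load(f))  # each shard is assumed to be a JSON list

    os.makedirs(output_dir, exist_ok=True)
    for start in range(0, len(examples), chunk_size):
        shard = examples[start:start + chunk_size]
        out_path = os.path.join(output_dir, f"{prefix}.{start // chunk_size}.json.gz")
        with gzip.open(out_path, "wt", encoding="utf-8") as f:
            json.dump(shard, f)


# Example (hypothetical paths): regroup the training shards into 1000-document chunks.
# resplit("./cnn_dm_extractive/train.*.json.gz", "./cnn_dm_resplit", chunk_size=1000)
```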
## CNN/DM
The CNN/DailyMail (Hermann et al., 2015) dataset contains 93k articles from CNN and 220k articles from the Daily Mail. Both publishers supplement their articles with bullet-point summaries. The non-anonymized variant introduced in See et al. (2017) is used.
| Type | Link |
|---|---|
| Processor Repository | |
| Data Download Link | |
| Processed Abstractive Dataset | |
| Extractive Version | |
Download and unzip the stories directories from here for both CNN and Daily Mail. The files can be downloaded from the terminal with `gdown`, which can be installed with `pip install gdown`.

```bash
pip install gdown
gdown https://drive.google.com/uc?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
gdown https://drive.google.com/uc?id=0BwmD_VLjROrfM1BxdkxVaTY2bWs
tar zxf cnn_stories.tgz
tar zxf dailymail_stories.tgz
```
**Note:** The above Google Drive links may be outdated depending on when you are reading this. Check the CNN/DM official website for the most up-to-date download links.
Next, run the processing code in the git submodule for artmatsak/cnn-dailymail located in `datasets/cnn_dailymail_processor`. Run `python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories`, replacing `/path/to/cnn/stories` with the path to where you saved the `cnn/stories` directory that you downloaded; similarly for `dailymail/stories`.
For each of the URL lists (`all_train.txt`, `all_val.txt`, and `all_test.txt`) in `cnn_dailymail_processor/url_lists`, the corresponding stories are read from file and written to the text files `train.source`, `train.target`, `val.source`, `val.target`, `test.source`, and `test.target`. These will be placed in the newly created `cnn_dm` directory.
The original processing code is available at abisee/cnn-dailymail, but this project uses the artmatsak/cnn-dailymail processing code instead since it does not tokenize and it writes the data to the text files `train.source`, `train.target`, `val.source`, `val.target`, `test.source`, and `test.target`, which is the format expected by `convert_to_extractive.py`.
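After processing, a quick sanity check that the `.source` and `.target` files are aligned can save debugging time later. This is a minimal sketch, assuming the format described above (plain-text files with one article per line in `.source` and the matching summary on the same line of `.target`); the `cnn_dm` path and the `check_split` helper are illustrative, so point it at wherever the directory was actually created.

```python
# check_split.py -- sanity-check that a .source/.target pair is aligned line by line.
import os


def check_split(data_dir, split="train"):
    """Print the example count and the first article/summary pair for one split."""
    with open(os.path.join(data_dir, f"{split}.source"), encoding="utf-8") as f:
        articles = [line.strip() for line in f]
    with open(os.path.join(data_dir, f"{split}.target"), encoding="utf-8") as f:
        summaries = [line.strip() for line in f]

    assert len(articles) == len(summaries), "source/target line counts do not match"
    print(f"{split}: {len(articles)} examples")
    print("first article:", articles[0][:200])
    print("first summary:", summaries[0][:200])


if __name__ == "__main__":
    # Hypothetical location of the newly created cnn_dm directory.
    check_split("./datasets/cnn_dailymail_processor/cnn_dm", split="train")
```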
## WikiHow
WikiHow (Koupaee and Wang, 2018) is a large-scale dataset of instructions from the WikiHow.com website. Each of the roughly 200k examples consists of multiple instruction-step paragraphs, each paired with a summarizing sentence. The task is to generate the concatenated summary sentences from the paragraphs.
| Statistic | Value |
|---|---|
| Dataset Size | 230,843 |
| Average Article Length (words) | 579.8 |
| Average Summary Length (words) | 62.1 |
| Vocabulary Size | 556,461 |
| Type | Link |
|---|---|
| Processor Repository | |
| Data Download Link | |
| Processed Abstractive Dataset | |
| Extractive Version | |
Processing Steps:

1. Download `wikihowAll.csv` (main repo for most up-to-date links) to `datasets/wikihow_processor`.
2. Run `python process.py` (runtime: 2m), which will create a new directory called `wikihow` containing the `train.source`, `train.target`, `val.source`, `val.target`, `test.source`, and `test.target` files necessary for `convert_to_extractive.py` (a simplified sketch of this step is shown below).
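For orientation only, the sketch below shows roughly what this kind of processing involves; it is not the actual `process.py` from the processor repository. It assumes `wikihowAll.csv` has `headline` (the summary sentences) and `text` (the article paragraphs) columns, and the 90/5/5 split and whitespace cleaning are simplified assumptions.

```python
# wikihow_sketch.py -- simplified illustration of turning wikihowAll.csv into
# .source/.target files. NOT the official process.py; column names, cleaning,
# and the 90/5/5 split are assumptions for illustration.
import os

import pandas as pd


def process_wikihow(csv_path="wikihowAll.csv", out_dir="wikihow"):
    df = pd.read_csv(csv_path).dropna(subset=["headline", "text"])

    # Flatten whitespace so each example fits on a single line of the output files.
    articles = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()
    summaries = df["headline"].str.replace(r"\s+", " ", regex=True).str.strip()

    n = len(df)
    splits = {
        "train": slice(0, int(0.9 * n)),
        "val": slice(int(0.9 * n), int(0.95 * n)),
        "test": slice(int(0.95 * n), n),
    }

    os.makedirs(out_dir, exist_ok=True)
    for name, sl in splits.items():
        with open(os.path.join(out_dir, f"{name}.source"), "w", encoding="utf-8") as f:
            f.write("\n".join(articles.iloc[sl]) + "\n")
        with open(os.path.join(out_dir, f"{name}.target"), "w", encoding="utf-8") as f:
            f.write("\n".join(summaries.iloc[sl]) + "\n")
```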
## PubMed/ArXiv
ArXiv and PubMed (Cohan et al., 2018) are two long-document datasets of scientific publications from [arXiv.org](http://arxiv.org/) (215k) and PubMed (133k). The task is to generate the abstract from the paper body.
| Datasets | # docs | avg. doc. length (words) | avg. summary length (words) |
|---|---|---|---|
| CNN | 92K | 656 | 43 |
| Daily Mail | 219K | 693 | 52 |
| NY Times | 655K | 530 | 38 |
| PubMed (this dataset) | 133K | 3016 | 203 |
| arXiv (this dataset) | 215K | 4938 | 220 |
| Type | Link |
|---|---|
| Processor Repository | |
| Data Download Link | |
| Processed Abstractive Dataset | |
| Extractive Version | |
Processing Steps:

1. Download PubMed and ArXiv (main repo for most up-to-date links) to `datasets/arxiv-pubmed_processor`.
2. Run the command `python process.py <arxiv_articles_dir> <pubmed_articles_dir>` (runtime: 5-10m), which will create a new directory called `arxiv-pubmed` containing the `train.source`, `train.target`, `val.source`, `val.target`, `test.source`, and `test.target` files necessary for `convert_to_extractive.py` (a simplified sketch of this step is shown below).
See the repository’s README.md.
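For orientation only, the released arXiv/PubMed data is distributed (to the best of my knowledge) as JSON-lines files with one paper per line; the sketch below shows one way such a file could be flattened into `.source`/`.target` lines. The `article_text` and `abstract_text` field names and the `<S>`/`</S>` sentence markers are assumptions about the public release, and this is not the actual `process.py`.

```python
# arxiv_pubmed_sketch.py -- simplified illustration of flattening one JSON-lines
# split (e.g. train.txt) into .source/.target files. NOT the official process.py;
# field names and sentence markers are assumptions about the public release.
import json


def flatten_split(jsonl_path, source_path, target_path):
    """Write one article per line to source_path and its abstract to target_path."""
    with open(jsonl_path, encoding="utf-8") as fin, \
         open(source_path, "w", encoding="utf-8") as fsrc, \
         open(target_path, "w", encoding="utf-8") as ftgt:
        for line in fin:
            paper = json.loads(line)
            # Both fields are assumed to be lists of sentence strings; abstract
            # sentences may be wrapped in <S> ... </S> markers, so strip them.
            article = " ".join(paper["article_text"]).strip()
            abstract = " ".join(paper["abstract_text"])
            abstract = abstract.replace("<S>", "").replace("</S>", "").strip()
            if article and abstract:
                fsrc.write(article.replace("\n", " ") + "\n")
                ftgt.write(abstract.replace("\n", " ") + "\n")


# Example (hypothetical paths):
# flatten_split("arxiv-dataset/train.txt", "train.source", "train.target")
```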
**Note:** To convert this dataset to extractive it is recommended to use the `--sentencizer` option due to the size of the dataset. Additionally, `--max_sentence_ntokens` should be set to 300 and `--max_example_nsents` should be set to 600. See the Convert Abstractive to Extractive Dataset section for more information. The full command should be similar to:
```bash
python convert_to_extractive.py ./datasets/arxiv-pubmed_processor/arxiv-pubmed \
    --shard_interval 5000 \
    --sentencizer \
    --max_sentence_ntokens 300 \
    --max_example_nsents 600
```
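As background on why `--sentencizer` is recommended here: a rule-based sentencizer splits text on punctuation without running a full statistical parser, which is dramatically cheaper over hundreds of thousands of long documents. The snippet below is a standalone comparison, assuming spaCy v3 with the `en_core_web_sm` model installed; it is not taken from `convert_to_extractive.py` itself.

```python
# sentencizer_vs_parser.py -- compare rule-based sentence splitting with a full
# spaCy pipeline. The rule-based route is far cheaper, which is why it is
# preferable for very large datasets like arXiv/PubMed.
import spacy

text = "The model is trained on arXiv papers. It learns to generate the abstract from the paper body."

# Rule-based: a blank English pipeline with only the sentencizer component.
fast_nlp = spacy.blank("en")
fast_nlp.add_pipe("sentencizer")
print([sent.text for sent in fast_nlp(text).sents])

# Full pipeline: sentence boundaries come from the dependency parse (slower).
full_nlp = spacy.load("en_core_web_sm")
print([sent.text for sent in full_nlp(text).sents])
```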