Convert Abstractive to Extractive Dataset

Overview

This script reformats an abstractive summarization dataset for extractive summarization by determining the extractive summary that maximizes ROUGE scores against the abstractive target. It can be used on a dataset composed of the following files: train.source, train.target, val.source, val.target, test.source, and test.target, where each file contains one example per line and the lines of every .source file correspond to the lines of the respective .target file. All of the datasets on the Extractive Supported Datasets page will be processed to this format. You can also process any dataset contained in the huggingface/nlp library; if you use a dataset this way, the downloading and pre-processing happen automatically.
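
For reference, a dataset directory in this format might look like the following (the path matches the CNN/DM example used below; the parenthetical notes are annotations, not part of the files):

datasets/cnn_dailymail_processor/cnn_dm/
    train.source    (one article per line)
    train.target    (one reference summary per line)
    val.source
    val.target
    test.source
    test.target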

Option 1: Manual Data Download

Simply run convert_to_extractive.py with the path to the data. For example, with the CNN/DM dataset: python convert_to_extractive.py ./datasets/cnn_dailymail_processor/cnn_dm. However, the recommended command is:

python convert_to_extractive.py ./datasets/cnn_dailymail_processor/cnn_dm --shard_interval 5000 --compression --add_target_to test
  • --shard_interval processes the dataset in chunks of 5000 examples and writes each chunk to disk as soon as it is complete (saves RAM)

  • --compression compresses each output chunk with gzip (depending on the dataset, this reduces disk usage to roughly 1/2 to 1/3 of the uncompressed size)

  • --add_target_to saves the abstractive target text to the specified splits (chosen from the names in --split_names).

The default output directory is the input directory that was specified; it can be changed with --base_output_path if desired.

If your splits are not named train, val, and test, the --split_names argument lets you specify the correct naming pattern. The --source_ext and --target_ext arguments let you specify the file extensions of the source and target files, respectively. The two extensions must differ so the script can tell the source and target files apart.
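
For example, if your splits were named training, validation, and testing, with .txt source files and .sum target files (all of these names are hypothetical), the command might look like:

python convert_to_extractive.py ./datasets/my_dataset \
    --split_names training validation testing \
    --source_ext txt \
    --target_ext sum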

Option 2: Automatic pre-processing through nlp

You will need to run the convert_to_extractive.py command with the --dataset, --dataset_version, --data_example_column, and --data_summarized_column options set. To use the CNN/DM dataset you would set these arguments as shown below:

--dataset cnn_dailymail \
--dataset_version 3.0.0 \
--data_example_column article \
--data_summarized_column highlights
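
Putting it together, a complete invocation might look like the following (the output directory is illustrative; the remaining values are the CNN/DM settings from above):

python convert_to_extractive.py ./datasets/cnn_dm \
    --shard_interval 5000 \
    --compression \
    --dataset cnn_dailymail \
    --dataset_version 3.0.0 \
    --data_example_column article \
    --data_summarized_column highlights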

View the help page (python convert_to_extractive.py --help) for more info about these options. The options are nearly identical to those of the abstractive script.

Important

All of the abstractive datasets can be converted for extractive summarization using this method.

The live nlp viewer visualizes the data and describes each dataset, which makes it easy to determine the correct values for the options above.

Convert To Extractive Tips

Large Dataset? Need to Resume?: The --resume option reads the output directory and determines which document the script left off on based on the shard file names. If --shard_interval was None, then resuming is not possible. Resuming is guaranteed to produce the same output as a run without --resume because of convert_to_extractive.check_resume_success(), which checks that the last line in the shard file is the same as the line directly before the line to resume with.
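
The idea behind this consistency check can be sketched as follows (a minimal illustration, not the actual implementation of convert_to_extractive.check_resume_success(); shard-name parsing and file handling are omitted):

# last_shard_line: the final example written to the most recent shard
# source_lines:    all source documents, one per entry
# resume_idx:      index of the first document that still needs processing
def check_resume_success(last_shard_line, source_lines, resume_idx):
    # The example written last must equal the source line directly before
    # the resume point; otherwise the resumed output would be misaligned.
    return last_shard_line == source_lines[resume_idx - 1]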

Speed: Running Slowly? The --sentencizer option detects sentence boundaries without parsing dependencies. Instead of loading a statistical model with spacy, this option initializes the English Language object and adds a sentencizer to the pipeline. This is much faster than a DependencyParser but also less accurate, since the sentencizer uses a simpler, rule-based strategy.
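
In spacy terms, the --sentencizer option corresponds to something like the following (a sketch assuming spaCy v3; the script's actual setup may differ slightly):

from spacy.lang.en import English

nlp = English()              # blank English pipeline; no statistical model to load
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])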

Custom Datasets

Any dataset in the format described in the Overview can be used with this script. Once converted, training should be the same as if using CNN/DM from Option 1 because the Convert To Extractive script outputs a consistent format.

Extractive Dataset Format

This section briefly discusses the format of datasets created by the convert_to_extractive script.

The training and validation sets only need the src and labels keys, saved as JSON. The src value should be a list of lists, where each inner list contains the tokens of one sentence (see below). The labels value is a list of 0s (sentence not in summary) and 1s (sentence should be in summary) that is the same length as the src value (the number of sentences); each value in this list corresponds to a sentence in src. The testing set is special because it additionally needs the tgt key, which holds the target summary as a single string with a <q> between each sentence.
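
Before the full CNN/DM examples below, here is a minimal hand-written illustration of the format (the text is invented; a training or validation example would omit the tgt key):

{'src': [['The', 'cat', 'sat', '.'], ['It', 'was', 'warm', '.'], ['Dogs', 'barked', 'outside', '.']],
 'labels': [1, 0, 0],
 'tgt': 'The cat sat .'}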

First document in CNN/DM extractive training set:

{'src': [['Editor', "'s", 'note', ':', 'In', 'our', 'Behind', 'the', 'Scenes', 'series', ',', 'CNN', 'correspondents', 'share', 'their', 'experiences', 'in', 'covering', 'news', 'and', 'analyze', 'the', 'stories', 'behind', 'the', 'events', '.'], ['Here', ',', 'Soledad', "O'Brien", 'takes', 'users', 'inside', 'a', 'jail', 'where', 'many', 'of', 'the', 'inmates', 'are', 'mentally', 'ill', '.'], ['An', 'inmate', 'housed', 'on', 'the', '"', 'forgotten', 'floor', ',', '"', 'where', 'many', 'mentally', 'ill', 'inmates', 'are', 'housed', 'in', 'Miami', 'before', 'trial', '.'], ['MIAMI', ',', 'Florida', '(', 'CNN', ')', '--', 'The', 'ninth', 'floor', 'of', 'the', 'Miami', '-', 'Dade', 'pretrial', 'detention', 'facility', 'is', 'dubbed', 'the', '"', 'forgotten', 'floor', '.', '"'], ['Here', ',', 'inmates', 'with', 'the', 'most', 'severe', 'mental', 'illnesses', 'are', 'incarcerated', 'until', 'they', "'re", 'ready', 'to', 'appear', 'in', 'court', '.'], ['Most', 'often', ',', 'they', 'face', 'drug', 'charges', 'or', 'charges', 'of', 'assaulting', 'an', 'officer', '--charges', 'that', 'Judge', 'Steven', 'Leifman', 'says', 'are', 'usually', '"', 'avoidable', 'felonies', '.', '"'], ['He', 'says', 'the', 'arrests', 'often', 'result', 'from', 'confrontations', 'with', 'police', '.'], ['Mentally', 'ill', 'people', 'often', 'wo', "n't", 'do', 'what', 'they', "'re", 'told', 'when', 'police', 'arrive', 'on', 'the', 'scene', '--', 'confrontation', 'seems', 'to', 'exacerbate', 'their', 'illness', 'and', 'they', 'become', 'more', 'paranoid', ',', 'delusional', ',', 'and', 'less', 'likely', 'to', 'follow', 'directions', ',', 'according', 'to', 'Leifman', '.'], ['So', ',', 'they', 'end', 'up', 'on', 'the', 'ninth', 'floor', 'severely', 'mentally', 'disturbed', ',', 'but', 'not', 'getting', 'any', 'real', 'help', 'because', 'they', "'re", 'in', 'jail', '.'], ['We', 'toured', 'the', 'jail', 'with', 'Leifman', '.'], ['He', 'is', 'well', 'known', 'in', 'Miami', 'as', 'an', 'advocate', 'for', 'justice', 'and', 'the', 'mentally', 'ill', '.'], ['Even', 'though', 'we', 'were', 'not', 'exactly', 'welcomed', 'with', 'open', 'arms', 'by', 'the', 'guards', ',', 'we', 'were', 'given', 'permission', 'to', 'shoot', 'videotape', 'and', 'tour', 'the', 'floor', '.', ' '], ['Go', 'inside', 'the', "'", 'forgotten', 'floor', "'", '»', '.'], ['At', 'first', ',', 'it', "'s", 'hard', 'to', 'determine', 'where', 'the', 'people', 'are', '.'], ['The', 'prisoners', 'are', 'wearing', 'sleeveless', 'robes', '.'], ['Imagine', 'cutting', 'holes', 'for', 'arms', 'and', 'feet', 'in', 'a', 'heavy', 'wool', 'sleeping', 'bag'], ['--', 'that', "'s", 'kind', 'of', 'what', 'they', 'look', 'like', '.'], ['They', "'re", 'designed', 'to', 'keep', 'the', 'mentally', 'ill', 'patients', 'from', 'injuring', 'themselves', '.'], ['That', "'s", 'also', 'why', 'they', 'have', 'no', 'shoes', ',', 'laces', 'or', 'mattresses', '.'], ['Leifman', 'says', 'about', 'one', '-', 'third', 'of', 'all', 'people', 'in', 'Miami', '-', 'Dade', 'county', 'jails', 'are', 'mentally', 'ill', '.'], ['So', ',', 'he', 'says', ',', 'the', 'sheer', 'volume', 'is', 'overwhelming', 'the', 'system', ',', 'and', 'the', 'result', 'is', 'what', 'we', 'see', 'on', 'the', 'ninth', 'floor', '.'], ['Of', 'course', ',', 'it', 'is', 'a', 'jail', ',', 'so', 'it', "'s", 'not', 'supposed', 'to', 'be', 'warm', 'and', 'comforting', ',', 'but'], ['the', 'lights', 'glare', ',', 'the', 'cells', 'are', 'tiny', 'and', 'it', "'s", 'loud', '.'], ['We', 'see', 'two', ',', 'sometimes', 'three', 'men', '--', 
'sometimes', 'in', 'the', 'robes', ',', 'sometimes', 'naked', ',', 'lying', 'or', 'sitting', 'in', 'their', 'cells', '.'], ['"', 'I', 'am', 'the', 'son', 'of', 'the', 'president', '.'], ['You', 'need', 'to', 'get', 'me', 'out', 'of', 'here', '!', '"'], ['one', 'man', 'shouts', 'at', 'me', '.'], ['He', 'is', 'absolutely', 'serious', ',', 'convinced', 'that', 'help', 'is', 'on', 'the', 'way', '--', 'if', 'only', 'he', 'could', 'reach', 'the', 'White', 'House', '.'], ['Leifman', 'tells', 'me', 'that', 'these', 'prisoner', '-', 'patients', 'will', 'often', 'circulate', 'through', 'the', 'system', ',', 'occasionally', 'stabilizing', 'in', 'a', 'mental', 'hospital', ',', 'only', 'to', 'return', 'to', 'jail', 'to', 'face', 'their', 'charges', '.'], ['It', "'s", 'brutally', 'unjust', ',', 'in', 'his', 'mind', ',', 'and', 'he', 'has', 'become', 'a', 'strong', 'advocate', 'for', 'changing', 'things', 'in', 'Miami', '.'], ['Over', 'a', 'meal', 'later', ',', 'we', 'talk', 'about', 'how', 'things', 'got', 'this', 'way', 'for', 'mental', 'patients', '.'], ['Leifman', 'says', '200', 'years', 'ago', 'people', 'were', 'considered', '"', 'lunatics', '"', 'and', 'they', 'were', 'locked', 'up', 'in', 'jails', 'even', 'if', 'they', 'had', 'no', 'charges', 'against', 'them', '.'], ['They', 'were', 'just', 'considered', 'unfit', 'to', 'be', 'in', 'society', '.'], ['Over', 'the', 'years', ',', 'he', 'says', ',', 'there', 'was', 'some', 'public', 'outcry', ',', 'and', 'the', 'mentally', 'ill', 'were', 'moved', 'out', 'of', 'jails', 'and', 'into', 'hospitals', '.'], ['But', 'Leifman', 'says', 'many', 'of', 'these', 'mental', 'hospitals', 'were', 'so', 'horrible', 'they', 'were', 'shut', 'down', '.'], ['Where', 'did', 'the', 'patients', 'go', '?'], ['They', 'became', ',', 'in', 'many', 'cases', ',', 'the', 'homeless', ',', 'he', 'says', '.'], ['Leifman', 'says', 'in', '1955', 'there', 'were', 'more', 'than', 'half', 'a', 'million', 'people', 'in', 'state', 'mental', 'hospitals', ',', 'and', 'today', 'that', 'number', 'has', 'been', 'reduced', '90', 'percent', ',', 'and', '40,000', 'to', '50,000', 'people', 'are', 'in', 'mental', 'hospitals', '.'], ['The', 'judge', 'says', 'he', "'s", 'working', 'to', 'change', 'this', '.'], ['Starting', 'in', '2008', ',', 'many', 'inmates', 'who', 'would', 'otherwise', 'have', 'been', 'brought', 'to', 'the', '"', 'forgotten', 'floor', '"', ' ', 'will', 'instead', 'be', 'sent', 'to', 'a', 'new', 'mental', 'health', 'facility', '--', 'the', 'first', 'step', 'on', 'a', 'journey', 'toward', 'long', '-', 'term', 'treatment', ',', 'not', 'just', 'punishment', '.'], ['Leifman', 'says', 'it', "'s", 'not', 'the', 'complete', 'answer', ',', 'but', 'it', "'s", 'a', 'start', '.'], ['Leifman', 'says', 'the', 'best', 'part', 'is', 'that', 'it', "'s", 'a', 'win', '-', 'win', 'solution', '.'], ['The', 'patients', 'win', ',', 'the', 'families', 'are', 'relieved', ',', 'and', 'the', 'state', 'saves', 'money', 'by', 'simply', 'not', 'cycling', 'these', 'prisoners', 'through', 'again', 'and', 'again', '.'], ['And', ',', 'for', 'Leifman', ',', 'justice', 'is', 'served', '.'], ['E', '-', 'mail', 'to', 'a', 'friend', '.']], 'labels': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}

First document in CNN/DM extractive testing set:

{'src': [['Marseille', ',', 'France', '(', 'CNN)The', 'French', 'prosecutor', 'leading', 'an', 'investigation', 'into', 'the', 'crash', 'of', 'Germanwings', 'Flight', '9525', 'insisted', 'Wednesday', 'that', 'he', 'was', 'not', 'aware', 'of', 'any', 'video', 'footage', 'from', 'on', 'board', 'the', 'plane', '.'], ['Marseille', 'prosecutor', 'Brice', 'Robin', 'told', 'CNN', 'that', '"', 'so', 'far', 'no', 'videos', 'were', 'used', 'in', 'the', 'crash', 'investigation', '.', '"'], ['He', 'added', ',', '"', 'A', 'person', 'who', 'has', 'such', 'a', 'video', 'needs', 'to', 'immediately', 'give', 'it', 'to', 'the', 'investigators', '.', '"'], ['Robin', "'s", 'comments', 'follow', 'claims', 'by', 'two', 'magazines', ',', 'German', 'daily', 'Bild', 'and', 'French', 'Paris', 'Match', ',', 'of', 'a', 'cell', 'phone', 'video', 'showing', 'the', 'harrowing', 'final', 'seconds', 'from', 'on', 'board', 'Germanwings', 'Flight', '9525', 'as', 'it', 'crashed', 'into', 'the', 'French', 'Alps', '.'], ['All', '150', 'on', 'board', 'were', 'killed', '.'], ['Paris', 'Match', 'and', 'Bild', 'reported', 'that', 'the', 'video', 'was', 'recovered', 'from', 'a', 'phone', 'at', 'the', 'wreckage', 'site', '.'], ['The', 'two', 'publications', 'described', 'the', 'supposed', 'video', ',', 'but', 'did', 'not', 'post', 'it', 'on', 'their', 'websites', '.'], ['The', 'publications', 'said', 'that', 'they', 'watched', 'the', 'video', ',', 'which', 'was', 'found', 'by', 'a', 'source', 'close', 'to', 'the', 'investigation', '.'], ['"', 'One', 'can', 'hear', 'cries', 'of', "'", 'My', 'God', "'", 'in', 'several', 'languages', ',', '"', 'Paris', 'Match', 'reported', '.'], ['"', 'Metallic', 'banging', 'can', 'also', 'be', 'heard', 'more', 'than', 'three', 'times', ',', 'perhaps', 'of', 'the', 'pilot', 'trying', 'to', 'open', 'the', 'cockpit', 'door', 'with', 'a', 'heavy', 'object', '.', ' '], ['Towards', 'the', 'end', ',', 'after', 'a', 'heavy', 'shake', ',', 'stronger', 'than', 'the', 'others', ',', 'the', 'screaming', 'intensifies', '.'], ['"', 'It', 'is', 'a', 'very', 'disturbing', 'scene', ',', '"', 'said', 'Julian', 'Reichelt', ',', 'editor', '-', 'in', '-', 'chief', 'of', 'Bild', 'online', '.'], ['An', 'official', 'with', 'France', "'s", 'accident', 'investigation', 'agency', ',', 'the', 'BEA', ',', 'said', 'the', 'agency', 'is', 'not', 'aware', 'of', 'any', 'such', 'video', '.'], ['Jean', '-', 'Marc', 'Menichini', ',', 'a', 'French', 'Gendarmerie', 'spokesman', 'in', 'charge', 'of', 'communications', 'on', 'rescue', 'efforts', 'around', 'the', 'Germanwings', 'crash', 'site', ',', 'told', 'CNN', 'that', 'the', 'reports', 'were', '"', 'completely', 'wrong', '"', 'and', '"', 'unwarranted', '.', '"'], ['Cell', 'phones', 'have', 'been', 'collected', 'at', 'the', 'site', ',', 'he', 'said', ',', 'but', 'that', 'they', '"', 'had', "n't", 'been', 'exploited', 'yet', '.'], ['Menichini', 'said', 'he', 'believed', 'the', 'cell', 'phones', 'would', 'need', 'to', 'be', 'sent', 'to', 'the', 'Criminal', 'Research', 'Institute', 'in', 'Rosny', 'sous', '-', 'Bois', ',', 'near', 'Paris', ',', 'in', 'order', 'to', 'be', 'analyzed', 'by', 'specialized', 'technicians', 'working', 'hand', '-', 'in', '-', 'hand', 'with', 'investigators', '.'], ['But', 'none', 'of', 'the', 'cell', 'phones', 'found', 'so', 'far', 'have', 'been', 'sent', 'to', 'the', 'institute', ',', 'Menichini', 'said', '.'], ['Asked', 'whether', 'staff', 'involved', 'in', 'the', 'search', 'could', 'have', 'leaked', 'a', 'memory', 'card', 'to', 'the', 'media', ',', 'Menichini', 
'answered', 'with', 'a', 'categorical', '"', 'no', '.', '"'], ['Reichelt', 'told', '"', 'Erin', 'Burnett', ':'], ['Outfront', '"', 'that', 'he', 'had', 'watched', 'the', 'video', 'and', 'stood', 'by', 'the', 'report', ',', 'saying', 'Bild', 'and', 'Paris', 'Match', 'are', '"', 'very', 'confident', '"', 'that', 'the', 'clip', 'is', 'real', '.'], ['He', 'noted', 'that', 'investigators', 'only', 'revealed', 'they', "'d", 'recovered', 'cell', 'phones', 'from', 'the', 'crash', 'site', 'after', 'Bild', 'and', 'Paris', 'Match', 'published', 'their', 'reports', '.'], ['"', 'That', 'is', 'something', 'we', 'did', 'not', 'know', 'before', '.', '...'], ['Overall', 'we', 'can', 'say', 'many', 'things', 'of', 'the', 'investigation', 'were', "n't", 'revealed', 'by', 'the', 'investigation', 'at', 'the', 'beginning', ',', '"', 'he', 'said', '.'], ['German', 'airline', 'Lufthansa', 'confirmed', 'Tuesday', 'that', 'co', '-', 'pilot', 'Andreas', 'Lubitz', 'had', 'battled', 'depression', 'years', 'before', 'he', 'took', 'the', 'controls', 'of', 'Germanwings', 'Flight', '9525', ',', 'which', 'he', "'s", 'accused', 'of', 'deliberately', 'crashing', 'last', 'week', 'in', 'the', 'French', 'Alps', '.'], ['Lubitz', 'told', 'his', 'Lufthansa', 'flight', 'training', 'school', 'in', '2009', 'that', 'he', 'had', 'a', '"', 'previous', 'episode', 'of', 'severe', 'depression', ',', '"', 'the', 'airline', 'said', 'Tuesday', '.'], ['Email', 'correspondence', 'between', 'Lubitz', 'and', 'the', 'school', 'discovered', 'in', 'an', 'internal', 'investigation', ',', 'Lufthansa', 'said', ',', 'included', 'medical', 'documents', 'he', 'submitted', 'in', 'connection', 'with', 'resuming', 'his', 'flight', 'training', '.'], ['The', 'announcement', 'indicates', 'that', 'Lufthansa', ',', 'the', 'parent', 'company', 'of', 'Germanwings', ',', 'knew', 'of', 'Lubitz', "'s", 'battle', 'with', 'depression', ',', 'allowed', 'him', 'to', 'continue', 'training', 'and', 'ultimately', 'put', 'him', 'in', 'the', 'cockpit', '.'], ['Lufthansa', ',', 'whose', 'CEO', 'Carsten', 'Spohr', 'previously', 'said', 'Lubitz', 'was', '100', '%', 'fit', 'to', 'fly', ',', 'described', 'its', 'statement', 'Tuesday', 'as', 'a', '"', 'swift', 'and', 'seamless', 'clarification', '"', 'and', 'said', 'it', 'was', 'sharing', 'the', 'information', 'and', 'documents', '--', 'including', 'training', 'and', 'medical', 'records', '--', 'with', 'public', 'prosecutors', '.'], ['Spohr', 'traveled', 'to', 'the', 'crash', 'site', 'Wednesday', ',', 'where', 'recovery', 'teams', 'have', 'been', 'working', 'for', 'the', 'past', 'week', 'to', 'recover', 'human', 'remains', 'and', 'plane', 'debris', 'scattered', 'across', 'a', 'steep', 'mountainside', '.'], ['He', 'saw', 'the', 'crisis', 'center', 'set', 'up', 'in', 'Seyne', '-', 'les', '-', 'Alpes', ',', 'laid', 'a', 'wreath', 'in', 'the', 'village', 'of', 'Le', 'Vernet', ',', 'closer', 'to', 'the', 'crash', 'site', ',', 'where', 'grieving', 'families', 'have', 'left', 'flowers', 'at', 'a', 'simple', 'stone', 'memorial', '.'], ['Menichini', 'told', 'CNN', 'late', 'Tuesday', 'that', 'no', 'visible', 'human', 'remains', 'were', 'left', 'at', 'the', 'site', 'but', 'recovery', 'teams', 'would', 'keep', 'searching', '.'], ['French', 'President', 'Francois', 'Hollande', ',', 'speaking', 'Tuesday', ',', 'said', 'that', 'it', 'should', 'be', 'possible', 'to', 'identify', 'all', 'the', 'victims', 'using', 'DNA', 'analysis', 'by', 'the', 'end', 'of', 'the', 'week', ',', 'sooner', 'than', 'authorities', 'had', 'previously', 'suggested', '.'], 
['In', 'the', 'meantime', ',', 'the', 'recovery', 'of', 'the', 'victims', "'", 'personal', 'belongings', 'will', 'start', 'Wednesday', ',', 'Menichini', 'said', '.'], ['Among', 'those', 'personal', 'belongings', 'could', 'be', 'more', 'cell', 'phones', 'belonging', 'to', 'the', '144', 'passengers', 'and', 'six', 'crew', 'on', 'board', '.'], ['Check', 'out', 'the', 'latest', 'from', 'our', 'correspondents', '.'], ['The', 'details', 'about', 'Lubitz', "'s", 'correspondence', 'with', 'the', 'flight', 'school', 'during', 'his', 'training', 'were', 'among', 'several', 'developments', 'as', 'investigators', 'continued', 'to', 'delve', 'into', 'what', 'caused', 'the', 'crash', 'and', 'Lubitz', "'s", 'possible', 'motive', 'for', 'downing', 'the', 'jet', '.'], ['A', 'Lufthansa', 'spokesperson', 'told', 'CNN', 'on', 'Tuesday', 'that', 'Lubitz', 'had', 'a', 'valid', 'medical', 'certificate', ',', 'had', 'passed', 'all', 'his', 'examinations', 'and', '"', 'held', 'all', 'the', 'licenses', 'required', '.', '"'], ['Earlier', ',', 'a', 'spokesman', 'for', 'the', 'prosecutor', "'s", 'office', 'in', 'Dusseldorf', ',', 'Christoph', 'Kumpa', ',', 'said', 'medical', 'records', 'reveal'], ['Lubitz', 'suffered', 'from', 'suicidal', 'tendencies', 'at', 'some', 'point', 'before', 'his', 'aviation', 'career', 'and', 'underwent', 'psychotherapy', 'before', 'he', 'got', 'his', 'pilot', "'s", 'license', '.'], ['Kumpa', 'emphasized', 'there', "'s", 'no', 'evidence', 'suggesting', 'Lubitz', 'was', 'suicidal', 'or', 'acting', 'aggressively', 'before', 'the', 'crash', '.'], ['Investigators', 'are', 'looking', 'into', 'whether', 'Lubitz', 'feared', 'his', 'medical', 'condition', 'would', 'cause', 'him', 'to', 'lose', 'his', 'pilot', "'s", 'license', ',', 'a', 'European', 'government', 'official', 'briefed', 'on', 'the', 'investigation', 'told', 'CNN', 'on', 'Tuesday', '.'], ['While', 'flying', 'was', '"', 'a', 'big', 'part', 'of', 'his', 'life'], [',', '"', 'the', 'source', 'said', ',', 'it', "'s", 'only', 'one', 'theory', 'being', 'considered', '.'], ['Another', 'source', ',', 'a', 'law', 'enforcement', 'official', 'briefed', 'on', 'the', 'investigation', ',', 'also', 'told', 'CNN', 'that', 'authorities', 'believe', 'the', 'primary', 'motive', 'for', 'Lubitz', 'to', 'bring', 'down', 'the', 'plane', 'was', 'that', 'he', 'feared', 'he', 'would', 'not', 'be', 'allowed', 'to', 'fly', 'because', 'of', 'his', 'medical', 'problems', '.'], ['Lubitz', "'s", 'girlfriend', 'told', 'investigators', 'he', 'had', 'seen', 'an', 'eye', 'doctor', 'and', 'a', 'neuropsychologist', ',', 'both', 'of', 'whom', 'deemed', 'him', 'unfit', 'to', 'work', 'recently', 'and', 'concluded', 'he', 'had', 'psychological', 'issues', ',', 'the', 'European', 'government', 'official', 'said', '.'], ['But', 'no', 'matter', 'what', 'details', 'emerge', 'about', 'his', 'previous', 'mental', 'health', 'struggles', ',', 'there', "'s", 'more', 'to', 'the', 'story', ',', 'said', 'Brian', 'Russell', ',', 'a', 'forensic', 'psychologist', '.'], ['"', 'Psychology', 'can', 'explain', 'why', 'somebody', 'would', 'turn', 'rage', 'inward', 'on', 'themselves', 'about', 'the', 'fact', 'that', 'maybe', 'they', 'were', "n't", 'going', 'to', 'keep', 'doing', 'their', 'job', 'and', 'they', "'re", 'upset', 'about', 'that'], ['and', 'so', 'they', "'re", 'suicidal', ',', '"', 'he', 'said', '.'], ['"', 'But', 'there', 'is', 'no', 'mental', 'illness', 'that', 'explains', 'why', 'somebody', 'then', 'feels', 'entitled', 'to', 'also', 'take', 'that', 'rage', 'and', 'turn', 'it', 
'outward', 'on', '149', 'other', 'people', 'who', 'had', 'nothing', 'to', 'do', 'with', 'the', 'person', "'s", 'problems', '.', '"'], ['Who', 'was', 'the', 'captain', 'of', 'Germanwings', 'Flight', '9525', '?'], ['CNN', "'s", 'Margot', 'Haddad', 'reported', 'from', 'Marseille', 'and', 'Pamela', 'Brown', 'from', 'Dusseldorf', ',', 'while', 'Laura', 'Smith', '-', 'Spark', 'wrote', 'from', 'London', '.'], ['CNN', "'s", 'Frederik', 'Pleitgen', ',', 'Pamela', 'Boykoff', ',', 'Antonia', 'Mortensen', ',', 'Sandrine', 'Amiel', 'and', 'Anna', '-', 'Maja', 'Rappard', 'contributed', 'to', 'this', 'report', '.']], 'labels': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tgt': 'Marseille prosecutor says " so far no videos were used in the crash investigation " despite media reports .<q>Journalists at Bild and Paris Match are " very confident " the video clip is real , an editor says .<q>Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression , airline says .'}
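
Once converted, a shard can be inspected with a few lines of Python (a sketch that assumes each shard stores a JSON list of example dictionaries; the shard file name shown is hypothetical and depends on your --shard_interval and --compression settings):

import gzip
import json

# Open a gzip-compressed shard and load the examples it contains.
with gzip.open("train.0.json.gz", "rt") as f:
    examples = json.load(f)

print(examples[0]['src'][0])   # tokens of the first sentence of the first document
print(examples[0]['labels'])   # extractive labels for that document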

Script Help

usage: convert_to_extractive.py [-h] [--base_output_path BASE_OUTPUT_PATH]
                                [--split_names {train,val,test} [{train,val,test} ...]]
                                [--add_target_to {train,val,test} [{train,val,test} ...]]
                                [--source_ext SOURCE_EXT] [--target_ext TARGET_EXT]
                                [--oracle_mode {greedy,combination}]
                                [--shard_interval SHARD_INTERVAL]
                                [--n_process N_PROCESS] [--batch_size BATCH_SIZE]
                                [--compression] [--resume]
                                [--tokenizer_log_interval TOKENIZER_LOG_INTERVAL]
                                [--sentencizer] [--no_preprocess]
                                [--min_sentence_ntokens MIN_SENTENCE_NTOKENS]
                                [--max_sentence_ntokens MAX_SENTENCE_NTOKENS]
                                [--min_example_nsents MIN_EXAMPLE_NSENTS]
                                [--max_example_nsents MAX_EXAMPLE_NSENTS]
                                [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                                DIR

Convert an Abstractive Summarization Dataset to the Extractive Task

positional arguments:
DIR                   path to data directory

optional arguments:
-h, --help            show this help message and exit
--base_output_path BASE_OUTPUT_PATH
                        path to output processed data (default is `base_path`)
--split_names {train,val,test} [{train,val,test} ...]
                        which splits of dataset to process
--add_target_to {train,val,test} [{train,val,test} ...]
                        add the abstractive target to these splits (useful for
                        calculating rouge scores)
--source_ext SOURCE_EXT
                        extension of source files
--target_ext TARGET_EXT
                        extension of target files
--oracle_mode {greedy,combination}
                        method to convert abstractive summaries to extractive
                        summaries
--shard_interval SHARD_INTERVAL
                        how many examples to include in each shard of the dataset
                        (default: no shards)
--n_process N_PROCESS
                        number of processes for multithreading
--batch_size BATCH_SIZE
                        number of batches for tokenization
--compression         use gzip compression when saving data
--resume              resume from last shard
--tokenizer_log_interval TOKENIZER_LOG_INTERVAL
                        minimum progress display update interval [default: 0.1]
                        seconds
--sentencizer         use a spacy sentencizer instead of a statistical model for
                        sentence detection (much faster but less accurate); see
                        https://spacy.io/api/sentencizer
--no_preprocess       do not run the preprocess function, which removes sentences
                        that are too long/short and examples that have too few/many
                        sentences
--min_sentence_ntokens MIN_SENTENCE_NTOKENS
                        minimum number of tokens per sentence
--max_sentence_ntokens MAX_SENTENCE_NTOKENS
                        maximum number of tokens per sentence
--min_example_nsents MIN_EXAMPLE_NSENTS
                        minimum number of sentences per example
--max_example_nsents MAX_EXAMPLE_NSENTS
                        maximum number of sentences per example
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: 'Info').