Instructions

This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset, as used in the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks.

This is a modification of the original scripts that outputs raw text instead of TensorFlow binaries.

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

Warning: These files contain a few examples (114, in a dataset of over 300,000) for which the article text is missing; see, for example, cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story.
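If you want to list the affected files before processing, a minimal sketch follows. It assumes the usual layout of a .story file, where the article text comes first and each summary sentence follows a line containing only @highlight. The helper names article_text and find_empty_stories are hypothetical and are not part of make_datafiles.py.

```python
import os

# Assumption: each summary sentence in a .story file is introduced by this marker line.
HIGHLIGHT_MARKER = "@highlight"


def article_text(story_text):
    """Return the non-empty lines that precede the first @highlight marker, joined by spaces."""
    lines = []
    for line in story_text.splitlines():
        stripped = line.strip()
        if stripped == HIGHLIGHT_MARKER:
            break
        if stripped:
            lines.append(stripped)
    return " ".join(lines)


def find_empty_stories(stories_dir):
    """Return the sorted names of .story files in stories_dir that have no article text."""
    empty = []
    for name in sorted(os.listdir(stories_dir)):
        if not name.endswith(".story"):
            continue
        with open(os.path.join(stories_dir, name), encoding="utf-8") as f:
            if not article_text(f.read()):
                empty.append(name)
    return empty


# Usage (the path is a placeholder):
# print(find_empty_stories("/path/to/cnn/stories"))
```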

2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following line to your .bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
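If the java command above fails with a class-not-found error, the CLASSPATH is the first thing to check. The sketch below only inspects the environment variable; it does not run Java. The helper name corenlp_jars is hypothetical, and matching on the stanford-corenlp file-name prefix is an assumption based on the jar name shown above.

```python
import os


def corenlp_jars(classpath):
    """Return the CLASSPATH entries whose file name looks like a Stanford CoreNLP jar."""
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [
        entry
        for entry in entries
        if os.path.basename(entry).startswith("stanford-corenlp")
        and entry.endswith(".jar")
    ]


# Usage: inspect the current environment and report what was found.
# jars = corenlp_jars(os.environ.get("CLASSPATH", ""))
# for jar in jars:
#     print(jar, "exists" if os.path.isfile(jar) else "is missing on disk")
# if not jars:
#     print("No CoreNLP jar found in CLASSPATH")
```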

3. Process into text files

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.
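Before starting a long run, it can help to confirm that both paths really point at unzipped stories directories. The following is a minimal pre-flight sketch; count_stories and build_command are hypothetical helpers, not part of this repository, and the check assumes the downloaded files carry the .story extension seen in the example above.

```python
import os


def count_stories(stories_dir):
    """Count the .story files directly inside stories_dir."""
    if not os.path.isdir(stories_dir):
        raise NotADirectoryError(stories_dir)
    return sum(1 for name in os.listdir(stories_dir) if name.endswith(".story"))


def build_command(cnn_dir, dailymail_dir):
    """Validate both directories and return the argument list for make_datafiles.py."""
    for path in (cnn_dir, dailymail_dir):
        if count_stories(path) == 0:
            raise ValueError("no .story files found in " + path)
    return ["python", "make_datafiles.py", cnn_dir, dailymail_dir]


# Usage (paths are placeholders):
# import subprocess
# subprocess.run(build_command("/path/to/cnn/stories", "/path/to/dailymail/stories"), check=True)
```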
