Incorrectly formatted line in vocabulary file #15

Open
bwang482 opened this issue Oct 5, 2017 · 1 comment


bwang482 commented Oct 5, 2017

Why is, for example, "0800 555 111 356" included in the generated vocab file? This entry is at line 23163. Or am I the only one who has this problem?

>>> with open('data/cnn-dailymail/vocab', 'r') as vocab_f:
...     for line in vocab_f:
...         pieces = line.split()
...         if len(pieces) != 2:
...             print(pieces)
...
['0800', '555', '111', '356']
['1800', '333', '000', '139']
['2', '1/2', '124']
['3', '1/2', '86']
['1', '1/2', '68']
['0800', '555111', '59']
['4', '1/2', '47']
['0844', '472', '4157', '41']
['5', '1/2', '39']
['7', '1/2', '25']
['6', '1/2', '24']
['9', '1/2', '21']
['020', '7629', '9161', '19']
['8', '1/2', '19']
['0300', '123', '8018', '19']
['0808', '800', '5000', '19']
['11', '1/2', '18']
['0844', '493', '0787', '14']
['1300', '659', '467', '13']
['16', '1/2', '12']
['13', '1/2', '12']
['1800', '273', '8255', '11']
['18', '1/2', '10']
['0300', '1234', '999', '10']
['0845', '790', '9090', '10']
['0845', '634', '1414', '9']
['14', '1/2', '8']
['0207', '938', '6364', '8']
['0207', '938', '6683', '8']
['310', '642', '2317', '7']
['at', 'uefa.com', '7']
['0207', '386', '0868', '7']
['0808', '800', '2222', '6']
['0800', '789', '321', '6']
['0800', '854', '440', '6']

f0k commented Oct 20, 2017

That's intended; see the normalizeSpace option at https://nlp.stanford.edu/software/tokenizer.html. The tokenizer emits phone numbers (such as 0800 555 111) and numbers with fractions (such as 2 1/2) as a single token, with non-breaking spaces in between. I'm not sure why "at uefa.com" is joined as well, but I get the same result as you.
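
One consequence worth spelling out: in Python 3, str.split() with no arguments splits on any Unicode whitespace, including the non-breaking space (U+00A0), which is why the check above reports more than two pieces for these lines. A minimal sketch of a parser that keeps such tokens intact, assuming each vocab line is "token count" with a single ordinary space before the count (parse_vocab_line is a hypothetical helper, not part of this repository):

```python
def parse_vocab_line(line):
    """Split a 'token count' line on its last ordinary space only.

    Assumption: the count is the final field and is preceded by a
    plain ASCII space; spaces inside the token are U+00A0.
    """
    token, count = line.rstrip("\n").rsplit(" ", 1)
    return token, int(count)


NBSP = "\u00a0"
sample = "0800" + NBSP + "555" + NBSP + "111 356\n"

# Whitespace splitting also breaks on U+00A0, giving four pieces.
print(sample.split())

# Splitting once from the right keeps the phone number as one token.
print(parse_vocab_line(sample))
```

Splitting from the right with a maximum of one split makes no assumption about how many internal spaces the token contains, so ordinary single-word entries parse the same way.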
