with open(fileName, 'r', encoding='iso-8859-1') as f: # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function
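This `TypeError` is a Python 2 symptom: the Python 2 built-in `open()` does not accept an `encoding` keyword, while `io.open()` does on both Python 2 and 3. A minimal, self-contained sketch of a portable replacement (the filename here is a placeholder, not a path from the project):

```python
import io

# Create a sample file containing a non-ASCII ISO-8859-1 byte (0xE9, "é").
with io.open("corpus.txt", "w", encoding="iso-8859-1") as f:
    f.write(u"caf\u00e9\n")

# io.open accepts the encoding keyword on Python 2 as well as Python 3,
# unlike the Python 2 built-in open().
with io.open("corpus.txt", "r", encoding="iso-8859-1") as f:
    text = f.read()
```

Deleting the `encoding` argument instead (as described below) makes the file open as raw bytes under Python 2, which pushes the encoding problem further down the pipeline.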
root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations: 3%|####2 | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
File "deepqa2/dataset/preprocesser.py", line 42, in <module>
main()
File "deepqa2/dataset/preprocesser.py", line 39, in main
'datasetTag': ''}))
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
self.loadCorpus(self.samplesDir)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
self.createCorpus(cornellData.getConversations())
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
self.extractConversation(conversation)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
targetWords = self.extractText(targetLine["text"], True)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
sentencesToken = nltk.sent_tokenize(line)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)
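The final `UnicodeDecodeError` happens inside NLTK, but the cause is upstream: under Python 2, handing a byte string containing a non-ASCII byte (here 0x97) to `nltk.sent_tokenize` forces an implicit `ascii` decode, which fails. Decoding each line explicitly before tokenizing avoids it; a sketch, with `line` standing in for a line read from the corpus file:

```python
# A byte string with the problematic 0x97 byte, as read from the corpus.
line = b"They did to me \x97 twice."

# Decode explicitly with the corpus encoding before any NLTK call,
# so NLTK never attempts an implicit ascii decode.
if isinstance(line, bytes):
    line = line.decode("iso-8859-1")

# line is now a unicode string and safe to pass to nltk.sent_tokenize(line).
```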
Hello, developers.
I want to learn about and try out this project. I downloaded the Docker image and installed the dependency packages, but I ran into some errors. First:
a call to the `open` method raised an error saying the `encoding` keyword argument is invalid. I couldn't see what was wrong, so I removed that argument and the script got past it.
Then, following the error messages, I downloaded and added the corpus, and downloaded nltk_data/tokenizers/punkt.zip. Running the following script failed:
python deepqa2/dataset/preprocesser.py
I tried both Python 2 and Python 3 and hit the same problem. I downloaded the corpus and unpacked it on Linux, and then got an ASCII decode error. I can't tell what is being decoded for whom. I tried reading the script; it seems to want to download a file, but I'm not sure whether that means a file or the corpus I supplied. The error output is above. If you see this, please advise. Many thanks!
loadCorpus
createCorpus
extractConversation
I'm not sure which of these three methods raised the error.