with open(fileName, 'r', encoding='iso-8859-1') as f: # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function
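This `TypeError` is a Python 2 symptom: the Python 2 built-in `open()` does not accept an `encoding` keyword, while `io.open()` does on both Python 2 and 3. A minimal, self-contained sketch of a portable replacement (the filename here is a placeholder, not a path from the project):

```python
import io

# Create a sample file containing a non-ASCII ISO-8859-1 byte (0xE9, "é").
with io.open("corpus.txt", "w", encoding="iso-8859-1") as f:
    f.write(u"caf\u00e9\n")

# io.open accepts the encoding keyword on Python 2 as well as Python 3,
# unlike the Python 2 built-in open().
with io.open("corpus.txt", "r", encoding="iso-8859-1") as f:
    text = f.read()
```

Deleting the `encoding` argument instead (as described below) makes the file open as raw bytes under Python 2, which pushes the encoding problem further down the pipeline.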
root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations: 3%|####2 | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
File "deepqa2/dataset/preprocesser.py", line 42, in <module>
main()
File "deepqa2/dataset/preprocesser.py", line 39, in main
'datasetTag': ''}))
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
self.loadCorpus(self.samplesDir)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
self.createCorpus(cornellData.getConversations())
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
self.extractConversation(conversation)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
targetWords = self.extractText(targetLine["text"], True)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
sentencesToken = nltk.sent_tokenize(line)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)
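The final `UnicodeDecodeError` happens inside NLTK, but the cause is upstream: under Python 2, handing a byte string containing a non-ASCII byte (here 0x97) to `nltk.sent_tokenize` forces an implicit `ascii` decode, which fails. Decoding each line explicitly before tokenizing avoids it; a sketch, with `line` standing in for a line read from the corpus file:

```python
# A byte string with the problematic 0x97 byte, as read from the corpus.
line = b"They did to me \x97 twice."

# Decode explicitly with the corpus encoding before any NLTK call,
# so NLTK never attempts an implicit ascii decode.
if isinstance(line, bytes):
    line = line.decode("iso-8859-1")

# line is now a unicode string and safe to pass to nltk.sent_tokenize(line).
```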
Hello, developers.
I want to learn about and try out this project. I downloaded the Docker image and installed the dependency packages, but I ran into some errors. First:
a call to the `open` method raised an error saying the `encoding` keyword argument is invalid. I couldn't see what was wrong, so I removed that argument and the script got past it.
Then, following the error messages, I downloaded and added the corpus, and downloaded nltk_data/tokenizers/punkt.zip. Running the following script failed:
python deepqa2/dataset/preprocesser.py
I tried both Python 2 and Python 3 and hit the same problem. I downloaded the corpus and unpacked it on Linux, and then got an ASCII decode error. I can't tell what is being decoded for whom. I tried reading the script; it seems to want to download a file, but I'm not sure whether that means a file or the corpus I supplied. The error output is above. If you see this, please advise. Many thanks!
loadCorpus
createCorpus
extractConversation
I'm not sure which of these three methods raised the error.