
Error when preprocessing data #1

Open
xilu0 opened this issue May 22, 2017 · 4 comments

Comments

@xilu0

xilu0 commented May 22, 2017

Hello developer,
I'd like to learn from and try out this project. I downloaded the Docker image and installed the dependency packages, but I ran into some errors. The first one is:

with open(fileName, 'r', encoding='iso-8859-1') as f:  # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function

The open() call raises an error saying the encoding keyword argument is invalid. I couldn't see what was wrong, so I removed that argument and the script ran past it.
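For context, a minimal sketch of that first error, using a placeholder file path (an assumption, not the project's real config): on Python 2 the built-in open() has no encoding parameter, whereas io.open() accepts one and behaves the same on Python 3.

import io

fileName = 'data/cornell/movie_lines.txt'  # placeholder path, for illustration only

# open(fileName, 'r', encoding='iso-8859-1')              # TypeError under Python 2
with io.open(fileName, 'r', encoding='iso-8859-1') as f:  # works under Python 2 and 3
    lines = f.readlines()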
Then, following the error messages, I downloaded and added the corpus data (nltk_data/tokenizers/punkt.zip), and got another error when running the following script:
python deepqa2/dataset/preprocesser.py
I tried both Python 2 and Python 3 and the problem is the same. I downloaded the corpus and extracted it on Linux, and now I get an ASCII decode error, but I have no idea what needs decoding. I looked at the script; it seems to download a file, and I'm not sure whether the problem is that file or the corpus I extracted. The error output is below. If you see this, please give me some guidance. Much appreciated!

root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations:   3%|####2                                                                                                                                                         | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
  File "deepqa2/dataset/preprocesser.py", line 42, in <module>
    main()
  File "deepqa2/dataset/preprocesser.py", line 39, in main
    'datasetTag': ''}))
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
    self.loadCorpus(self.samplesDir)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
    self.createCorpus(cornellData.getConversations())
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
    self.extractConversation(conversation)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
    targetWords = self.extractText(targetLine["text"], True)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
    sentencesToken = nltk.sent_tokenize(line)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)

loadCorpus
createCorpus
extractConversation
I'm not sure which of these three methods is raising the error.
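For reference, a minimal sketch of what the traceback suggests (an assumption, not a confirmed diagnosis): under Python 2 the corpus lines are byte strings, and nltk.sent_tokenize() triggers an implicit ASCII decode when it hits a non-ASCII byte such as 0x97; decoding the line to unicode before tokenizing avoids that.

import nltk

raw = 'hello \x97 world'           # byte string containing 0x97 (an em dash in Windows-1252)
# nltk.sent_tokenize(raw)          # likely raises UnicodeDecodeError on Python 2, as in the traceback
text = raw.decode('iso-8859-1')    # decode to unicode first
sentences = nltk.sent_tokenize(text)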

@zli2014

zli2014 commented May 23, 2017

Check whether nvidia-docker is installed. I'd suggest setting up the environment yourself; the author's deployment steps are documented clearly, so you should be able to get it running quickly.

@xilu0
Author

xilu0 commented May 24, 2017

The docs don't say nvidia-docker is required, and training on CPU should also work. My error is an encoding problem; I just don't know where to start debugging it.

@zli2014

zli2014 commented May 24, 2017

Try this:

sudo python3  # start a Python shell

import nltk  # import nltk

nltk.download()  # choose to download everything
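If downloading everything is more than needed, a non-interactive alternative is to fetch only the Punkt sentence tokenizer data, which is the model sent_tokenize() in the traceback above relies on:

import nltk

nltk.download('punkt')  # download only the Punkt sentence tokenizer data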

@zli2014

zli2014 commented May 24, 2017

I downloaded the corpus this way, which avoids the garbled-corpus problems.
