[Bug]: test_adapter compatibility #441

Open
3 tasks done
FailedNamed opened this issue Sep 29, 2024 · 2 comments
Labels
bug Something isn't working


FailedNamed commented Sep 29, 2024

Before Reporting

  • I have pulled the latest code from the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

source

Data-Juicer Version

v0.2.0

Python Version

3.9.19

Describe the bug

Running python -m tests.core.test_adapter raises an error.

To Reproduce

  1. Run python -m tests.core.test_adapter from the project root.
  2. An error occurs. I traced it to this code in Filter.run:
    dataset = dataset.map(add_same_content_to_new_column, fn_kwargs={ 'new_column_name': Fields.stats, 'initial_value': {} }, num_proc=self.runtime_np(), batch_size=self.batch_size, desc='Adding new column for stats')
    The initial_value here looks wrong. It is an empty dict, so in PerplexityFilter's compute_stats the loop for idx, stat in enumerate(samples_stats) is never entered and no stats are computed, which later raises KeyError: 'perplexity'. Following other test examples, I replaced 'initial_value': {} with 'initial_value': [{}] * dataset.num_rows (P.S. I am not sure whether it needs to be multiplied by the number of rows). After this change, PerplexityFilter no longer raises an error.
  3. Continuing the run, PerplexityFilter no longer fails, but DocumentDeduplicator does. The message is roughly: File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash
    return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()
    AttributeError: 'list' object has no attribute 'strip'. Looking at the code, the cause is the preceding FixUnicodeMapper operator. After it processes the data with
    samples[self.text_key] = list( map( lambda text: ftfy.fix_text(text, normalization=self.normalization), samples[self.text_key]))
    samples[self.text_key] is a list, which makes DocumentDeduplicator fail in _get_hash.
    Looking at the other mapper operators, the samples[self.text_key] they output seems to come in many forms (list, dict, and string), but strip only works on strings. Is compatibility between these operators not handled well enough, and do other operators have similar problems?
  4. Please help answer this when you have time. Thanks!
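The mismatch described in step 3 can be reproduced outside Data-Juicer. The following is only a minimal sketch under stated assumptions: TEXT_KEY, batched_mapper, get_hash, and hash_sample_or_batch are hypothetical names, unicodedata.normalize stands in for ftfy.fix_text, and the guard at the end is one possible workaround, not the project's actual fix.

```python
import hashlib
import unicodedata

TEXT_KEY = "text"  # hypothetical stand-in for self.text_key


def batched_mapper(samples):
    # Batched style: samples[TEXT_KEY] is a list of strings.
    # unicodedata.normalize stands in for ftfy.fix_text here (assumption).
    samples[TEXT_KEY] = list(
        map(lambda text: unicodedata.normalize("NFC", text), samples[TEXT_KEY]))
    return samples


def get_hash(txt):
    # Single-sample style, same shape as the _get_hash line quoted in step 3.
    return hashlib.md5(txt.strip().encode("utf-8")).hexdigest()


def hash_sample_or_batch(value):
    # Hypothetical guard: accept either one string or a list of strings.
    if isinstance(value, str):
        return get_hash(value)
    return [get_hash(v) for v in value]


batch = batched_mapper({TEXT_KEY: ["  hello ", "world"]})

# Passing the whole batched column to the single-sample hash fails.
try:
    get_hash(batch[TEXT_KEY])
    failure = None
except AttributeError as exc:
    failure = str(exc)

# The guarded version hashes each string in the list separately.
hashes = hash_sample_or_batch(batch[TEXT_KEY])
```

Here failure holds the same AttributeError message as in the report, while hashes contains one digest per row. Whether the real operators should normalize their inputs this way, or whether the runner should always call each operator in the mode it was written for, is a design question for the maintainers.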

Configs

No response

Logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
    if mdict['stop']:
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR: test_execute_and_probe (__main__.AdapterTest)

Traceback (most recent call last):
  File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
    resource_util_list = Adapter.execute_and_probe(ds, ops)
  File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
    dataset, resource_util_per_op = Monitor.monitor_func(
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
    ret = func()
  File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
    new_dataset = dataset.filter(self.process,
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
    indices = self.map(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
    mask.append(function(example, *additional_args, **fn_kwargs))
  File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
    return f(*args, **kargs)
  File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
    return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'
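
The KeyError above matches the mechanism described in step 2 of the reproduction: a stats loop written for a list of per-row dicts silently does nothing when handed a single empty dict. The sketch below illustrates only that mechanism and makes several assumptions: STATS and PPL are hypothetical stand-ins for Fields.stats and StatsKeys.perplexity, the score is a placeholder rather than a real perplexity, and the threshold 100.0 is arbitrary.

```python
STATS = "stats"      # hypothetical stand-in for Fields.stats
PPL = "perplexity"   # hypothetical stand-in for StatsKeys.perplexity


def compute_stats_batched(samples):
    # Batched style: expects samples[STATS] to be a list with one dict per row.
    samples_stats = samples[STATS]
    for idx, stat in enumerate(samples_stats):
        if PPL in stat:
            continue
        # Placeholder score (assumption); a real filter would use a language model.
        samples_stats[idx][PPL] = float(len(samples["text"][idx]))
    return samples


def process(sample):
    # Single-sample style, same lookup shape as the failing line in the log.
    return sample[STATS][PPL] <= 100.0


texts = ["short", "a bit longer"]

# Case 1: the stats value is an empty dict, so the loop body never runs.
bad = compute_stats_batched({"text": texts, STATS: {}})
try:
    process({"text": texts[0], STATS: bad[STATS]})
    error = None
except KeyError as exc:
    error = exc.args[0]

# Case 2: one separate dict per row, so every row receives a score.
good = compute_stats_batched({"text": texts, STATS: [{} for _ in texts]})
kept = [process({"text": t, STATS: s})
        for t, s in zip(good["text"], good[STATS])]
```

In case 1, error is the missing key name, as in the log. Case 2 builds the list with a comprehension rather than [{}] * n because, in plain Python, the multiplied form makes every row share one dict object, so only the first row would be scored.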

Screenshots

No response

Additional

No response

FailedNamed added the bug label Sep 29, 2024
drcege changed the title from "[Bug]:" to "[Bug]: test_adapter compatibility" Oct 14, 2024
drcege (Collaborator) commented Oct 14, 2024

@HYLcool

HYLcool (Collaborator) commented Oct 17, 2024

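@FailedNamed, thank you for using the project and for your feedback!

We were unable to reproduce the problem on our side. Please pull the latest version of the code and try again. If you still run into a similar problem, feel free to continue the discussion with us.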
