semantic_dedupe runs into IndexError: list index out of range #341

Open
ruchaa-apte opened this issue Oct 30, 2024 · 1 comment
Labels: bug (Something isn't working)

ruchaa-apte (Contributor):
Describe the bug

While running semantic deduplication on text files, the semantic dedupe pipeline starts but fails with IndexError: list index out of range.
Error Log

GPU: 0, Part: 20: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.57it/s]
2024-10-30 13:42:26,014 - distributed.utils_perf - WARNING - full garbage collections took 72% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,060 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,196 - distributed.worker - WARNING - Compute Failed
Key:       ('read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8', 2)
State:     executing
Function:  execute_task
args:      ((<function Fused._execute_task at 0x7f2e10b56d40>, {'read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8': ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2), ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2): (<dask.dataframe.io.parquet.core.ToParquetFunctionWrapper object at 0x7f2a9b3a6140>, ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2), (2,)), ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2): (<function apply at 0x7f2e663ec9d0>, <methodcaller: reset_index>, [('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2)], {'drop': True}), ('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2): (<function apply at 0x7f2e663ec9d0>, <function apply_and_enforce at 0x7f2e1139b910>, [('<crossfit.op.tokenize.tokenizer object at 0x7fdad0-ea35aded3a541ceda1ad391c99bb6e42', 2)], {'partition_info': {'number': 2, 'division': None}, '_func': <crossfit.backend.torch.op.base.Predictor object at 
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task\n    return dask.core.get(graph, name)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 157, in get\n    result = _execute_task(task, cache)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/utils.py", line 78, in apply\n    return func(*args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/dataframe/core.py", line 7164, in apply_and_enforce\n    df = func(*args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__\n    output = self.call(data, *args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call\n    input_ids, attention_mask = self.call_column(data[col])\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column\n    tokenized_data = self.tokenize_strings(text).copy()\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings\n    tokenized_data = tokenizer.batch_encode_plus(\n  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus\n    return self._batch_encode_plus(\n  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus\n    for key in tokens_and_encodings[0][0].keys():\n'

2024-10-30 13:42:26,203 - distributed.utils_perf - WARNING - full garbage collections took 73% CPU time recently (threshold: 10%)
GPU: 0, Part: 19:   0%|                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]2024-10-30 13:42:26,302 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
Traceback (most recent call last):
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 283, in <module>
    main()
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 265, in main
    run_curation_pipeline(args, text_files, code_files)
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 177, in run_curation_pipeline
    semantic_dataset_text = semantic_dedupe(dataset=gpu_dataset_text, sem_dedupe_config_yaml_path=sem_dedupe_config_yaml_path, type='text')
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 354, in semantic_dedupe
    duplicates = semdup(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 637, in __call__
    embeddings_dataset = self.embedding_creator(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 215, in __call__
    write_to_disk(
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 577, in write_to_disk
    df.to_parquet(output_file_dir, write_index=False)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 3281, in to_parquet
    return to_parquet(self, path, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/io/parquet.py", line 653, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task
    return dask.core.get(graph, name)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__
    output = self.call(data, *args, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call
    input_ids, attention_mask = self.call_column(data[col])
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column
    tokenized_data = self.tokenize_strings(text).copy()
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings
    tokenized_data = tokenizer.batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
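
The last frame suggests the tokenizer received an empty batch: with an empty list of input strings, tokens_and_encodings is empty, so tokens_and_encodings[0] raises IndexError. A minimal standalone sketch of that failure mode, assuming an empty partition reaches batch_encode_plus (editorial illustration, not part of the original run):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# An empty batch leaves tokens_and_encodings empty inside _batch_encode_plus,
# so tokens_and_encodings[0] raises IndexError: list index out of range
tokenizer.batch_encode_plus([])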

Steps/Code to reproduce bug
Config for semantic dedupe:

# Configuration file for semantic dedup
cache_dir: "workspace/text/semdedup_cache"
num_files: 16
id_col_name: "id"
id_col_type: "str"
input_column: "text"

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embedding_max_mem_gb: 20

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 20
seed: 1234
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
largest_cluster_size_to_process: 100000
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
  - 0.01
  - 0.001

# Which threshold to use for extracting deduped data
eps_to_extract: 0.01

Code calling semantic dedupe (from tutorials/dapt-curation/code/utils.py; imports added for context):

import os

from nemo_curator.modules.config import SemDedupConfig
from nemo_curator.modules.semantic_dedup import SemDedup
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir

# type, sem_dedupe_config_yaml_path, and dataset are parameters of the
# surrounding semantic_dedupe() helper
cache_dir = f"./workspace/semantic_dedupe/{type}"
if os.path.isdir(cache_dir):
    os.system(f"rm -rf {cache_dir}")  # clear stale cache from a previous run

semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path)
expand_outdir_and_mkdir(semdedup_config.cache_dir)
semdup = SemDedup(semdedup_config)
duplicates = semdup(dataset)
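
For reference, the helper is invoked from the pipeline like this (taken from the traceback above):

semantic_dataset_text = semantic_dedupe(
    dataset=gpu_dataset_text,
    sem_dedupe_config_yaml_path=sem_dedupe_config_yaml_path,
    type="text",
)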

Environment overview

  • Method of NeMo-Curator install: from source

Environment details

  • OS version - "Ubuntu 22.04.5 LTS"
  • Dask version - version 2024.7.1
  • Python version - Python 3.10.12
@ruchaa-apte ruchaa-apte added the bug Something isn't working label Oct 30, 2024
@VibhuJawa VibhuJawa self-assigned this Oct 30, 2024
VibhuJawa (Collaborator) commented Nov 4, 2024

This was due to an empty partition; it was fixed by filtering out the empty partitions before running the pipeline:

# Keep only partitions that contain at least one row, so the tokenizer
# never receives an empty batch
partition_lengths = ddf.map_partitions(len).compute()
non_empty_partitions = [i for i, length in enumerate(partition_lengths) if length > 0]
filtered_ddf = ddf.partitions[non_empty_partitions]

Long term, we should fix this in crossfit or in NeMo Curator, or at least fail loudly.
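
A minimal sketch of the "fail loudly" option, to run before embedding creation (the helper name and error message are illustrative, not an existing API):

def assert_no_empty_partitions(ddf):
    # Per-partition row counts; cheap relative to embedding creation
    lengths = ddf.map_partitions(len).compute()
    empty = [i for i, n in enumerate(lengths) if n == 0]
    if empty:
        raise ValueError(
            f"Partitions {empty} are empty; drop them via "
            "ddf.partitions[non_empty] before semantic dedup, otherwise the "
            "tokenizer fails with IndexError on an empty batch."
        )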
