Inconsistent IDs lead to distributed computing woes. #111

axelmagn · 2024-03-20T19:16:59Z

When trying to work with these data via Dataflow, I noticed a few things:

The ID field key is inconsistent between files: it is id in minhash and signals, but doc_id in duplicates.
IDs are not present as an explicit field in documents. They must be reconstructed from the file path and line number.

This creates a lot of unnecessary friction when working with big data pipelines, since the line number is not usually available there. I'm finding myself writing a custom reader (sort of a bummer if you've ever had to do it).
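
A minimal sketch of the kind of workaround described above, in plain Python rather than Dataflow. The function names, the zero-based line numbering, and the "path/line_number" ID format are assumptions made for illustration; the real ID format should be checked against the dataset before relying on this.

```python
import json
from typing import Iterable, Iterator

# Key names reported in this issue: "id" (minhash, signals) and "doc_id" (duplicates).
ID_KEYS = ("id", "doc_id")


def normalize_id(record: dict) -> dict:
    """Return a copy of the record with its ID stored under the single key "id"."""
    out = dict(record)
    for key in ID_KEYS:
        if key in out:
            out["id"] = out.pop(key)
            return out
    raise KeyError(f"record has none of the ID keys {ID_KEYS}")


def read_documents(path: str, lines: Iterable[str]) -> Iterator[dict]:
    """Attach a reconstructed ID to each JSON-lines document.

    ASSUMPTION: the ID is "<file path>/<zero-based line number>". This is a
    hypothetical format chosen for the example, not one confirmed here.
    """
    for line_number, line in enumerate(lines):
        doc = json.loads(line)
        doc["id"] = f"{path}/{line_number}"
        yield doc
```

Note that read_documents has to see a file's lines in order from the start, which is exactly the constraint that makes this awkward in a reader that splits files across workers.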

For future data releases, please consider embedding a consistent key between all file groups for easier joining at scale. Just a UUID would be fine.

mauriceweber · 2024-03-29T13:43:55Z

Hi @axelmagn, thanks for your feedback. These are very good points, and this is something we will definitely do in future releases.
