Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] extract keys #406

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

[DataPipe] extract keys #406

wants to merge 6 commits into from

Conversation

tmbdev
Copy link
Contributor

@tmbdev tmbdev commented May 13, 2022

This PR adds an ExtractKeys filter that turns samples represented as dictionaries into tuples. Tuples are constructed by selecting values from the dictionaries by matching the key against a given set of patterns.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2022
Copy link
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This DataPipe looks useful for other cases as well if me modify it a bit.

Could you please separate it by two pipes:

  • 1st one to filter dict keys based on pattern:
        stage1 = IterableWrapper([
            {"1.txt": "1", "1.bin": "1b", "3.jpg":"foo"},
            {"2.txt": "2", "2.bin": "2b"},
        ])
        stage2 = ExtractKeys(stage1, "*.txt", "*.bin")
        output = list(iter(stage2))
        self.assertEqual({"1.txt": "1", "1.bin": "1b"}, output[0])
  • Second is simple map, to drop keys:
dp = dp.map(lambda x: x.values())

@VitalyFedyunin VitalyFedyunin changed the title extract keys [DataPipe] extract keys May 19, 2022
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tmbdev,

Thanks for your commits on these PRs. Let us know if these are ready for review (but no rush at all!). @VitalyFedyunin and I will be happy to have a look.

Again, thanks for contributing to our library!

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. Feel free to not accept every requested change. Can you rebase as well?

Again, thank you so much for your contribution!



@functional_datapipe("extract_keys")
class ExtractKeysIterDataPipe(IterDataPipe[Dict]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to KeyExtractor to follow our naming convention? Thanks.

We can still keep "extract_keys" as the functional name.

@@ -951,6 +952,30 @@ def test_mux_longest_iterdatapipe(self):
with self.assertRaises(TypeError):
len(output_dp)

def test_extractor(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_extractor(self):
def test_key_extractor(self):

nit: We used to have a different extractor

Comment on lines +21 to +22
duplicate_is_error: it is an error if the same key is selected twice (True)
ignore_missing: skip any dictionaries where one or more patterns don't match (False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duplicate_is_error: it is an error if the same key is selected twice (True)
ignore_missing: skip any dictionaries where one or more patterns don't match (False)

Duplicate lines of descriptions

*args: list of glob patterns or list of glob patterns
duplicate_is_error: it is an error if the same key is selected twice (True)
ignore_missing: allow patterns not to match (i.e., incomplete outputs)
as_tuple: return a tuple instead of a dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
as_tuple: return a tuple instead of a dictionary
as_tuple: return a tuple instead of a dictionary (True or False here)

"""

def __init__(
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to default as_tuple=False? Based on the docstring I would've guessed you wanted True instead.

Suggested change
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error: bool = True, ignore_missing: bool = False, as_tuple: bool = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: allow_duplicate might be a better name than duplicate_is_error

Comment on lines +72 to +73
def __len__(self) -> int:
return len(self.source_datapipe)
Copy link
Contributor

@NivekT NivekT Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: A sample will always be yielded even if nothing matches right?

duplicate_is_error: it is an error if the same key is selected twice (True)
ignore_missing: skip any dictionaries where one or more patterns don't match (False)
*args: list of glob patterns or list of glob patterns
duplicate_is_error: it is an error if the same key is selected twice (True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duplicate_is_error: it is an error if the same key is selected twice (True)
duplicate_is_error: it is an error if the same key is selected twice (True), otherwise returns the first matched value

if len(matches) > 1 and self.duplicate_is_error:
raise ValueError(f"extract_keys: multiple sample keys {sample.keys()} match {pattern}.")
if matches[0] in used and self.duplicate_is_error:
raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")
raise ValueError(f"extract_keys: key {matches[0]} is selected twice by multiple patterns.")

nit

@functional_datapipe("extract_keys")
class ExtractKeysIterDataPipe(IterDataPipe[Dict]):
r"""
Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.
Given a stream of dictionaries, return a stream of dicts (or tuples) by selecting keys using glob patterns.

Comment on lines +32 to +33
>>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt")
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants