
implement new iter_by_chunks() in items #133

Merged: 11 commits from the add-featureitems-iter_by_chunks branch into scrapinghub:master on Oct 23, 2019

Conversation

@BurnzZ (Contributor, Author) commented on Oct 4, 2019:

Provides a convenient wrapper for #121.

@@ -59,3 +59,30 @@ def _modify_iter_params(self, params):
if offset:
params['start'] = '{}/{}'.format(self.key, offset)
return params

def iter_by_chunks(self, chunksize=10000, *args, **kwargs):
@BurnzZ (Contributor, Author) commented:

I have no idea what might be the best method name for this, so please suggest if you have something in mind.

@manycoding (Contributor) commented:

job.items.list_iter(n=10_000)
job.items.iter_by(n=10_000)
job.items.iter_by_chunks(chunksize=10_000)
I like the first two.

@BurnzZ (Contributor, Author) commented:

I like the first one, @manycoding, as it gives better context about what's being returned overall, which is a generator of lists.
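For context, a minimal usage sketch of the method being discussed; the chunk size, job key, API key, and the print() stand-in below are illustrative, while the client/job setup follows the library's documented API:

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient('APIKEY')   # placeholder API key
    job = client.get_job('123/1/56')       # hypothetical job key

    # The method yields lists ("chunks") of items rather than single items,
    # so at most `chunksize` items are held in memory at a time.
    for chunk in job.items.list_iter(chunksize=1000):
        for item in chunk:
            print(item)                    # stand-in for real processing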

@BurnzZ (Contributor, Author) commented:

I'll omit the PEP 515 underscores for now for backwards compatibility. They would be a great addition in 2020 nonetheless, once Python 2 is deprecated.
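For reference, a short sketch of what the PEP 515 remark refers to; the underscore grouping is Python 3.6+ syntax only, which is why it stays out of the library for now:

    # PEP 515 (Python 3.6+): underscores in numeric literals are ignored.
    assert 10_000 == 10000
    # On Python 2 (and Python < 3.6) the literal 10_000 is a SyntaxError,
    # so the library keeps the plain form 10000 for now.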

@manycoding (Contributor) left a comment:

The Items class needs examples too.


@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b43d3fd to 15d647d on October 5, 2019
@BurnzZ (Contributor, Author) commented on Oct 5, 2019:

Hi @vshlapakov, I was wondering if there is a guide on adding new VCR cassettes for this repo? The missing cassettes for this new method are what's causing the test failures. Thanks!

@manycoding (Contributor) left a comment:

I approve, keeping in mind that the tests need to be fixed before merge.

@vshlapakov (Contributor) left a comment:

Thanks for the PR, looks good!

Re VCR cassettes, it's a shame we haven't documented this yet. The updated test-generation logic is defined and described in PR #112: basically you need a test SC project and your credentials (they will be wiped from the cassettes in the end), that's it. The last comment in that PR describes how to update existing cassettes.
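Since the cassette workflow is only documented in PR #112, here is a rough, hedged sketch of recording a cassette with VCR.py against a test project; the cassette directory, fixture name, and record mode below are assumptions, not this repo's actual test setup:

    import vcr

    my_vcr = vcr.VCR(
        cassette_library_dir='tests/client/cassettes',  # assumed location
        record_mode='once',                # use 'all' to force re-recording
        filter_headers=['Authorization'],  # keep credentials out of cassettes
    )

    @my_vcr.use_cassette()
    def test_items_list_iter(job):  # `job` fixture assumed from conftest.py
        chunks = list(job.items.list_iter(chunksize=2))
        assert all(isinstance(chunk, list) for chunk in chunks)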

@@ -59,3 +73,30 @@ def _modify_iter_params(self, params):
if offset:
params['start'] = '{}/{}'.format(self.key, offset)
return params

def list_iter(self, chunksize=10000, *args, **kwargs):
@vshlapakov (Contributor) commented:

I perfectly understand that the logic is a nice fit for iterating through huge item collections, but I'd suggest having a lower default chunk size in the library, say 1k, for the general case. There are many parameters that should be taken into account when choosing a specific value, like the total number of items and the average item size. Does that make sense?

@manycoding (Contributor) commented:

From my experience, in most cases I would use a chunksize of 100_000 for a 5M+ items job (there are a few cases with heavily sized items), considering the average computer has at least 16 GB of memory. Thus, 100_000 makes a reasonable default to me.
Am I missing something?

@vshlapakov (Contributor) commented:

I'd say the average computer still has 8 GB nowadays, and many budget notebooks have only 4 GB. And it's not only about that: this is a thin client, it should be lightweight and fast, and I wouldn't expect it to eat even a few GBs of memory by default without my explicit approval. Also, a pretty large chunk size means you need a relatively good network connection, which often isn't the case, so the decision about the chunk size should be a conscious one.

My key point here is that the default settings should fit an average user's demands and expectations, not those of an experienced one processing millions of items in batches. It seems we have different opinions on "an average user", and I'm OK with that. But as an experienced user you can use whatever you need by setting a custom chunk size value, so I don't see a problem here.

@vshlapakov (Contributor) commented:

@manycoding @BurnzZ Guys, to keep it clear: I'm not insisting on decreasing the default size at any cost; it's just something I'd like to give a bit more attention and discuss rather than accepting as-is. From a practical point of view, a chunk of 100k items at 10 KB each gives approximately 1 GB of memory. Is that something you'd expect from the client library under default settings? Maybe I'm missing something here?
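To make the arithmetic in this exchange explicit, a tiny back-of-the-envelope estimate; the item sizes are assumptions used for illustration only:

    def chunk_memory_mb(chunksize, avg_item_size_kb):
        """Rough upper bound on the memory held by one materialised chunk."""
        return chunksize * avg_item_size_kb / 1024.0

    print(chunk_memory_mb(100000, 10))  # ~977 MB, the ~1 GB case mentioned above
    print(chunk_memory_mb(10000, 10))   # ~98 MB
    print(chunk_memory_mb(1000, 10))    # ~10 MB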

@manycoding (Contributor) commented:

Seems like a good time to send out a questionnaire.

As an average user, I would use .iter() or work in an environment with enough memory in the first place.

@BurnzZ (Contributor, Author) commented:

Hey gents! To chime in on the discussion here: the default chunksize of 10_000 was picked without much deliberation, since it was working for our particular project, which has pretty much the typical number of items one would see in a project. The size per item is pretty much average, I would say.

Now, when choosing the default chunksize value, I think we'll need to set it to whatever value upholds the objective of this brand new method, which is to prevent OOM issues faced by users when processing large jobs.

Speaking of large jobs, I have worked on projects with 1 MB per item, which was insane. Consuming the items on those types of projects would really benefit from this new method we're creating. This makes me lean towards choosing a lower chunksize altogether:

  • chunksize of 1_000 * 1 MB per item = ~1 GB
  • chunksize of 10_000 * 1 MB per item = ~10 GB

Lastly, I'm not sure physical computers or laptops should be the basis for our benchmarks when setting this value. Nonetheless, we'll need to set the bar low. With that in mind, I'm proposing that we use Scrapy Cloud as the basis, where 1 Scrapy Cloud unit has 1 GB of memory. The typical maximum at the moment would be 6 SC units, which gives us at most 6 GB of memory per job. (P.S. We're using this heavily in Scrapy Cloud jobs, so it would really make sense for us to use that as the basis.)

In the example above, using chunksize=10_000 would result in OOM issues, as 10 GB > the 6 GB SC maximum. With that, I'm leaning towards setting chunksize=1_000 as the default.

But before doing that, what do you guys think? @vshlapakov @manycoding
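To illustrate the reasoning above, a small hedged helper that derives a chunksize from a memory budget and an average item size; the Scrapy Cloud figures are the ones quoted in this comment, not values the library knows about, and the helper itself is hypothetical:

    def suggest_chunksize(memory_budget_mb, avg_item_size_mb, safety_factor=0.5):
        """Pick a chunksize that keeps a single chunk well under the budget."""
        return max(1, int(memory_budget_mb * safety_factor / avg_item_size_mb))

    # A 6-unit Scrapy Cloud job (~6 GB) with ~1 MB items:
    print(suggest_chunksize(6 * 1024, 1.0))  # 3072
    # A single 1 GB unit with ~1 MB items:
    print(suggest_chunksize(1024, 1.0))      # 512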

@@ -37,6 +37,20 @@ class Items(_DownloadableProxyMixin, _ItemsResourceProxy):
'size': 100000,
}]

- retrieve items via a generator of lists. This is most useful in cases
A reviewer (Contributor) commented:

Special thanks for the nice docstrings in the PR ❤️

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from 4909582 to 6ad4ee9 on October 10, 2019
@BurnzZ (Contributor, Author) commented on Oct 10, 2019:

Hi @vshlapakov, thanks for linking to #112. I've tried it out and it works well on my local machine. However, when it's run on CI it's going to fail, mostly because when we empty out the env vars HS_ENDPOINT and DASH_ENDPOINT, the tests use the live endpoints: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L68

When running in CI, these aren't emptied out, and thus the tests will use the default endpoints, like: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L15.

I'm not sure how to proceed with this, as it may entail updating how conftest.py works.
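For readers without the repo open, the pattern being referred to is roughly the following generic sketch; the fallback URLs here are placeholders, not the values actually used in tests/conftest.py:

    import os

    # Sketch: the test configuration reads the endpoints from the environment
    # and falls back to built-in defaults when the variables are unset/empty,
    # which determines whether requests hit live services during recording.
    HS_ENDPOINT = os.getenv('HS_ENDPOINT') or 'https://storage.example.com'
    DASH_ENDPOINT = os.getenv('DASH_ENDPOINT') or 'https://app.example.com'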

@manycoding self-requested a review on October 10, 2019
Comment on lines 92 to 129
processed = 0
while True:
next_key = self.key + '/' + str(processed)
items = [
item for item in self.iter(
count=chunksize, start=next_key, *args, **kwargs)
]
yield items
processed += len(items)
if len(items) < chunksize:
break
@manycoding (Contributor) commented on Oct 10, 2019:

I tried the code and realised it doesn't support count and start. The first one is the most important I guess.

import sys  # needed for sys.maxsize below

def list_iter(self, chunksize=10000, *args, **kwargs):
    start = kwargs.pop("start", 0)
    count = kwargs.pop("count", sys.maxsize)
    processed = 0
    while True:
        next_key = self.key + "/" + str(start)
        if processed + chunksize > count:
            chunksize = count - processed
        items = [
            item for item in self.iter(count=chunksize, start=next_key, *args, **kwargs)
        ]
        yield items
        processed += len(items)
        start += len(items)
        if processed >= count:
            break
        if len(items) < chunksize:
            break
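A brief usage sketch of the suggested count/start handling, assuming the `job` object from the earlier example; the numbers are illustrative:

    # Skip the first 2000 items, then read at most 5000 more in lists of 1000.
    total = 0
    for chunk in job.items.list_iter(chunksize=1000, start=2000, count=5000):
        total += len(chunk)
    print(total)  # <= 5000 with the trimming logic suggested above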

@BurnzZ (Contributor, Author) commented:

I hadn't considered this use case, since we pretty much consume all the items in a job. Much appreciated, @manycoding! I've added some docstrings and a test case as well.

The CI tests will still fail, though, because of the current project setup on vcr-py. @vshlapakov, is it okay to mock this for the moment, or should we standardize it to use cassettes like the current tests do?

@vshlapakov (Contributor) commented on Oct 11, 2019:

@BurnzZ Thanks for checking! I see what you mean; it seems it doesn't work reliably 😕 I'm short on time right now, but I'll check it in detail next week and update the thread. And Valeriy is right, we still need to support the count/offset logic, so please take a look when you have time.

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from 6ad4ee9 to ee40071 on October 16, 2019
@vshlapakov (Contributor) commented:
@BurnzZ Sorry that this issue took so long. I finally found some time to go through the problem quietly and have something to share.

The case is pretty tricky. You correctly created the cassettes, and the normalizing process works fine. The real problem is that the normalization covers only request data, not responses. So it correctly forms the first run-spider request, and VCR.py finds the related snapshot and provides the saved response, but that response contains a real job key with a real project ID as part of it. As a result, the subsequent requests made through the related job instance have different paths and can't be matched with the saved snapshots, which leads to the test failures.

I made an attempt on top of your changes to fix this specific case by normalizing the job instances in the tests; it works fine, so feel free to use it to finish your changes. I have some doubts whether this is the best approach to solve the puzzle, but that's clearly a completely different topic which can be addressed later.
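For anyone facing the same cassette-matching issue, the general shape of such a fix is a scrubbing hook. The before_record_response hook is a real vcrpy feature, but the IDs below and what they are replaced with are assumptions, not the actual normalization used in the linked attempt:

    import vcr

    TEST_PROJECT_ID = b'123456'        # assumed real test project id
    NORMALIZED_PROJECT_ID = b'111111'  # assumed placeholder id stored in cassettes

    def scrub_response(response):
        # Replace the real project id in recorded response bodies so that job
        # keys derived from them match the normalized request paths on replay.
        body = response['body']['string']
        if body:
            response['body']['string'] = body.replace(
                TEST_PROJECT_ID, NORMALIZED_PROJECT_ID)
        return response

    my_vcr = vcr.VCR(before_record_response=scrub_response)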

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b3d5c73 to c4ef768 on October 21, 2019
codecov bot commented on Oct 21, 2019:

Codecov Report

Merging #133 into master will increase coverage by 0.36%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   93.59%   93.96%   +0.36%     
==========================================
  Files          28       28              
  Lines        1921     1938      +17     
==========================================
+ Hits         1798     1821      +23     
+ Misses        123      117       -6
Impacted Files Coverage Δ
scrapinghub/client/items.py 100% <100%> (ø) ⬆️
scrapinghub/hubstorage/batchuploader.py 94.15% <0%> (+3.5%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@BurnzZ (Contributor, Author) commented on Oct 21, 2019:

Thanks @vshlapakov! I cherry-picked your commit and updated my tests. Everything passes now! 🎉

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b5ebdea to 6bc35ad on October 22, 2019
item for item in self.iter(
count=chunksize, start=next_key, *args, **kwargs)
]
yield items
@vshlapakov (Contributor) commented:

Just as a side note: when the total amount of items >= count and count isn't divisible by chunksize, with the current logic the function will return a number of items >= count. Say, list_iter(chunksize=2, count=3) will return 2 batches of 2 items each, so 4 items in total, not 3. This is probably fine, but if we keep it as-is, I'd ask to at least mention the behaviour in the function's docstring.
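A quick illustration of the edge case and of the trimming that avoids it (a standalone sketch with hypothetical numbers, not the exact code merged here):

    # Without trimming: count=3, chunksize=2 -> two chunks of 2 -> 4 items total.
    # With trimming, the last chunk is capped at whatever count still allows:
    count, chunksize, processed = 3, 2, 0
    while processed < count:
        take = min(chunksize, count - processed)  # trim the final chunk
        processed += take
        print(take)  # prints 2, then 1 -> exactly 3 items in total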

@BurnzZ (Contributor, Author) commented on Oct 23, 2019:

Thanks @vshlapakov! It turns out there was some logic that @manycoding suggested earlier that hadn't been added to this PR. This has been fixed, and I've added a new test case for it.

@vshlapakov vshlapakov merged commit 32b7c04 into scrapinghub:master Oct 23, 2019
@vshlapakov (Contributor) commented:
@BurnzZ Awesome, thanks, it's a pleasure to work with you! 🚀
@manycoding Thanks a lot for help in CR and your suggestions! 🙌

@BurnzZ (Contributor, Author) commented on Oct 23, 2019:

Thanks for all the help @vshlapakov @manycoding! ❤️ Can't wait to use this in most of our projects! 🎉
