implement new iter_by_chunks() in items #133
Conversation
scrapinghub/client/items.py (outdated)
@@ -59,3 +59,30 @@ def _modify_iter_params(self, params):
        if offset:
            params['start'] = '{}/{}'.format(self.key, offset)
        return params

    def iter_by_chunks(self, chunksize=10000, *args, **kwargs):
I have no idea what might be the best method name for this, so please suggest if you have something in mind.
job.items.list_iter(n=10_000)
job.items.iter_by(n=10_000)
job.items.iter_by_chunks(chunksize=10_000)
I like the first two.
I like the first one, @manycoding, as it gives better context on what's being returned overall, which is a generator of lists.
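For illustration, a minimal usage sketch of the proposed API (the method and parameter names follow the suggestions above and are not final):

# Hypothetical usage: the method yields lists (chunks) of items rather than
# individual items, so callers can process one chunk at a time.
for chunk in job.items.list_iter(chunksize=10000):
    process(chunk)  # `process` is a placeholder for user code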
I'll omit the PEP 515 underscores for now for backwards compatibility. They would be a great addition in 2020 nonetheless, when Python 2 is deprecated.
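For reference, PEP 515 underscore separators in numeric literals require Python 3.6+, hence the plain spelling kept in the PR:

chunksize = 10_000  # PEP 515 spelling, Python 3.6+ only
chunksize = 10000   # Python 2 compatible spelling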
Items class needs examples too.
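A hedged sketch of the kind of usage example that could be added to the Items class docstring (the job key and item fields below are purely illustrative):

>>> job = project.jobs.get('123/1/56')
>>> for item in job.items.iter(count=1):
...     print(item)
{'name': 'Example item', 'url': 'http://example.com'}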
Force-pushed from b43d3fd to 15d647d
Hi @vshlapakov, I was wondering if there is a guide on adding new VCR cassettes for this repo? The missing cassettes for this new method are what's causing the test failures. Thanks!
I approve, keeping in mind that the tests need to be fixed before merge.
Thanks for the PR, looks good!
Re VCR cassettes, it's a shame we haven't documented it yet. The updated test generation logic is defined and described in PR #112: basically you need a test SC project and your credentials (they will be wiped from the cassettes in the end), that's it. The last comment in that PR describes how to update existing cassettes.
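For readers unfamiliar with VCR.py, a minimal generic sketch of how cassette-based tests work (the cassette directory and test name are illustrative, not the repo's actual conftest setup):

import vcr

# Record HTTP interactions to a cassette on the first run and replay them
# afterwards; filter_headers strips credentials before they reach the file.
my_vcr = vcr.VCR(cassette_library_dir='tests/cassettes',
                 record_mode='once',
                 filter_headers=['Authorization'])

@my_vcr.use_cassette()
def test_items_list_iter():
    ...  # exercise the client against the recorded responses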
scrapinghub/client/items.py (outdated)
@@ -59,3 +73,30 @@ def _modify_iter_params(self, params):
        if offset:
            params['start'] = '{}/{}'.format(self.key, offset)
        return params

    def list_iter(self, chunksize=10000, *args, **kwargs):
I perfectly understand that the logic is a nice fit for iterating through huge item collections, but I'd suggest having a lower default chunk size in the library, say 1k, for the general case. There are many parameters that should be taken into account when picking a specific value, like the total amount of items and the average item size. Does it make sense?
From my experience, in most cases I would use a 100_000 chunksize for a 5m+ items job (there are a few cases with heavily sized items), considering that the average computer has at least 16 GB of memory. Thus, 100_000 makes a reasonable default to me.
Am I missing something?
I'd say the average computer still has 8 GB nowadays, and many budget notebooks have only 4 GB. And it's not only about this: this is a thin client, it should be lightweight and fast, and I wouldn't expect it to eat even a few GBs of memory by default without my explicit approval. Also, a pretty large chunk size means you need a relatively good network, which often isn't the case, so the decision regarding the chunk size should be a conscious one.
My key point here is that the default settings should fit an average user's demands and expectations, not those of an experienced user processing millions of items in batches. It seems we have different opinions on "an average user", and I'm OK with that. But as an experienced user you can use whatever you need by setting a custom chunk size value, so I don't see a problem here.
@manycoding @BurnzZ Guys, to keep it clear: I'm not pushing to decrease the default size at any cost, it's just something I'd like to give a bit more attention and discuss rather than accept as-is. From a practical point of view, a chunk of 100k items at 10 KB each gives approximately 1 GB of memory: is that something you'd expect as a user of the client library on default settings? Maybe I'm missing something here?
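The estimate above can be sanity-checked with quick arithmetic (the 10 KB average item size is an assumption used for illustration):

avg_item_size_bytes = 10 * 1024            # assumed 10 KB per item
chunksize = 100000
chunk_memory_gb = chunksize * avg_item_size_bytes / 1024 ** 3
print(round(chunk_memory_gb, 2))           # ~0.95 GB held in memory per chunk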
Seems like a good time to send a questionnaire. As an average user, I would use .iter() or work in an environment with enough memory first.
Hey gents! To chime in on the discussion here, the default chunksize of 10_000 was unconsciously used since it was working for our particular project, which has pretty much a typical number of items that one would see in a project. The size per item is pretty much average, I would say.
Now, when choosing the default chunksize value, I think we'll need to set it to whatever value upholds the objective of this brand new method, which is to prevent OOM issues faced by users when processing large jobs.
Speaking of large jobs, I have worked on projects which have 1 MB per item, which was insane. Consuming the jobs on those types of projects would really benefit from this new method we're creating. This makes me lean towards choosing a lower chunksize altogether:
- chunksize of 1_000 * 1 MB per item = ~1 GB
- chunksize of 10_000 * 1 MB per item = ~10 GB
Lastly, I'm not sure if physical computers or laptops should be the basis for our benchmarks when setting this value. Nonetheless, we'll need to set the bar low. With that in mind, I'm proposing that we use Scrapy Cloud as the basis, where 1 Scrapy Cloud unit has 1 GB of memory. The typical max SC units at the moment would be 6, which gives us a maximum of 6 GB of memory per job. (P.S. We're using this heavily in Scrapy Cloud jobs, so it would really make sense for us to use that as the basis.)
In the example above, using chunksize=10_000 would result in OOM issues, as 10 GB > the 6 GB SC maximum. With that, I'm leaning towards setting chunksize=1_000 as the default (see the rough sizing sketch below).
But before doing that, what do you guys think? @vshlapakov @manycoding
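A rough helper sketch of the sizing logic above: pick a chunk size that fits a given memory budget such as a 1 GB Scrapy Cloud unit. The function name, safety factor, and item sizes are illustrative assumptions, not part of the library:

def pick_chunksize(memory_budget_gb, avg_item_size_mb, safety_factor=0.5):
    # Reserve part of the budget as headroom for the interpreter, parsing
    # overhead and the user's own processing of each chunk.
    budget_mb = memory_budget_gb * 1024 * safety_factor
    return max(1, int(budget_mb // avg_item_size_mb))

print(pick_chunksize(1, 1))   # 512 items for 1 MB items on a single 1 GB SC unit
print(pick_chunksize(6, 1))   # 3072 items with 6 SC units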
@@ -37,6 +37,20 @@ class Items(_DownloadableProxyMixin, _ItemsResourceProxy):
            'size': 100000,
        }]

        - retrieve items via a generator of lists. This is most useful in cases
Special thanks for the nice docstrings in the PR ❤️
Force-pushed from 4909582 to 6ad4ee9
Hi @vshlapakov, thanks for linking #112. I've tried it out and it works well on my local machine. However, when it's run on CI it's going to fail, mostly because the env vars that we empty out locally aren't emptied out when running in CI, so the default endpoints are used, like: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L15. I'm not sure how to proceed with this, as it may entail updating how the …
scrapinghub/client/items.py (outdated)
        processed = 0
        while True:
            next_key = self.key + '/' + str(processed)
            items = [
                item for item in self.iter(
                    count=chunksize, start=next_key, *args, **kwargs)
            ]
            yield items
            processed += len(items)
            if len(items) < chunksize:
                break
I tried the code and realised it doesn't support count and start. The first one is the most important, I guess.
def list_iter(self, chunksize=10000, *args, **kwargs):
    # Requires `import sys` at module level for the sys.maxsize default.
    start = kwargs.pop("start", 0)
    count = kwargs.pop("count", sys.maxsize)
    processed = 0
    while True:
        next_key = self.key + "/" + str(start)
        if processed + chunksize > count:
            chunksize = count - processed
        items = [
            item for item in self.iter(
                count=chunksize, start=next_key, *args, **kwargs)
        ]
        yield items
        processed += len(items)
        start += len(items)
        if processed >= count:
            break
        if len(items) < chunksize:
            break
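With the count/start support suggested above, usage could look like this hypothetical sketch (the chunk sizes in the comment assume the capping behaviour of the code above):

# Yields chunks of 1000, 1000 and 500 items, stopping after `count` items total.
for chunk in job.items.list_iter(chunksize=1000, count=2500):
    process(chunk)  # `process` is a placeholder for user code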
I hadn't considered this use case since we pretty much consume all the items in a job. Much appreciated, @manycoding! I've added some docstrings and a test case as well.
Though the CI tests will fail because of the current project setup on vcr-py. @vshlapakov, is it okay to mock this for the moment, or should we standardize it to use cassettes like the current tests?
@BurnzZ Thanks for checking! I see what you mean, it seems it doesn't work reliably 😕 I'm short on time right now, I'll check it in detail next week and update the thread. And Valeriy is right, we still need to support the count/offset logic, please take a look when you have time.
Force-pushed from 6ad4ee9 to ee40071
@BurnzZ Sorry that the issue took so long, I finally found some time to go quietly through the problem and have something to share. The case is pretty tricky. You correctly created the cassettes, and the normalizing process works fine. The real problem is that the normalization covers only request data, not responses. So, it correctly forms the 1st run-spider request, VCR.py finds the related snapshot and provides the saved response, but the response has a real job key with a real project ID as part of its key. As a result, the subsequent requests made using the related job instance have different paths and can't be matched with the saved snapshots, which leads to the test failures. I made an attempt on top of your changes to fix the specific case by normalizing the job instances in tests; it works fine, feel free to use it to finish your changes. I have some doubts whether this is the best approach to solve the puzzle, but that's clearly a completely different topic which will be addressed later.
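For context, a hedged sketch of the general VCR.py mechanism for normalizing recorded responses (this is not the actual fix referenced above; the IDs and the hook body are illustrative):

import vcr

def normalize_job_keys(response):
    # Replace the real project ID embedded in the recorded response body with
    # a stable placeholder so replayed job keys stay deterministic.
    body = response['body']['string']
    response['body']['string'] = body.replace(b'123456', b'2222222')
    return response

my_vcr = vcr.VCR(before_record_response=normalize_job_keys)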
Force-pushed from b3d5c73 to c4ef768
Codecov Report
@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   93.59%   93.96%   +0.36%
==========================================
  Files          28       28
  Lines        1921     1938      +17
==========================================
+ Hits         1798     1821      +23
+ Misses        123      117       -6

Continue to review the full report at Codecov.
Thanks @vshlapakov! I …
Force-pushed from b5ebdea to 6bc35ad
            items = [
                item for item in self.iter(
                    count=chunksize, start=next_key, *args, **kwargs)
            ]
            yield items
Just as a side note: when the total amount of items >= count and count isn't divisible by chunksize, with the current logic the function will return an amount of items >= count. Say, list_iter(chunksize=2, count=3) will return 2 batches of 2 items, so 4 items in total, not 3. This is probably fine, but if we keep it as-is, I'd ask to at least mention the behaviour in the function docstring.
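A minimal standalone sketch of the capping behaviour being discussed (not the library's actual code): shrink the final request so the total never exceeds count.

def chunked(seq, chunksize, count):
    processed = 0
    while processed < count:
        size = min(chunksize, count - processed)   # cap the final chunk
        chunk = seq[processed:processed + size]
        if not chunk:
            break
        yield chunk
        processed += len(chunk)

print(list(chunked(list(range(10)), chunksize=2, count=3)))  # [[0, 1], [2]]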
Thanks @vshlapakov! It turns out there was a piece of logic that @manycoding suggested earlier that hadn't been added to this PR. This has been fixed and I've added a new test case for it.
@BurnzZ Awesome, thanks, it's a pleasure to work with you! 🚀
Thanks for all the help @vshlapakov @manycoding! ❤️ Can't wait to use this in most of our projects! 🎉
Provides a convenient wrapper for #121.