Google Crawler can only get around 100 images instead of 1000 #79
Comments
I think your problem might be related to #38.
@vogelbam hi, thanks for your reply. However, I find that the relevant code is in icrawler/icrawler/builtin/google.py, line 114 (commit 1acbb96).
Same issue for me.
I have just looked into it for a bit, and it seems Google is now updating the result page through a POST request.
Is this problem fixed now? I have the same issue and hope to download more pictures. |
It seems that Google's algorithm may cause fewer resources to be crawled than expected. I brute-forced a solution by setting disjoint date ranges:

```python
from icrawler.builtin import GoogleImageCrawler
import datetime

n_total_images = 10000
n_per_crawl = 100
delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)

def datetime2tuple(date):
    return (date.year, date.month, date.day)

for i in range(int(n_total_images / n_per_crawl)):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(downloader_threads=4,
                                        storage={'root_dir': '/path/to/image'})
    google_crawler.crawl(keyword='<YOUR_KEYWORDS>',
                         filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
                         file_idx_offset=i * n_per_crawl,
                         max_num=n_per_crawl)
    # Walk backwards: the next window ends the day before this one started,
    # so consecutive date ranges never overlap.
    end_day = start_day - datetime.timedelta(days=1)
```

Edit: Note that this method may cause image duplication, so you should postprocess the collected images. FYI, I use the imagededup Python library, which provides a CNN-based duplicate image detector.
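For the postprocessing step, a cheap first pass before running a CNN-based detector like imagededup is to drop byte-identical files by hashing. This is a minimal sketch of my own (the function name is not part of any library; it only catches exact copies, not near-duplicates):

```python
import hashlib
import os

def find_exact_duplicates(image_dir):
    """Group files by SHA-256 digest; files sharing a digest are byte-identical."""
    by_digest = {}
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_digest.setdefault(digest, []).append(path)
    # Keep only groups containing more than one file: those are exact duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists all paths sharing identical bytes; you would keep one file per group and delete the rest, then run a perceptual detector for near-duplicates.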
You may get 2000 perfectly.
Hi, when I used the search URLs generated by the `feed()` function in `GoogleFeeder`, I could only get around 100 images even though `max_num=1000`. I found that all the URLs return the same 100 results as the first URL. It seems that the `ijn` and `start` params no longer have any effect. I just want to get nearly 1000 images per keyword. Does anybody have a solution?