Google Crawler can only get around 100 images instead of 1000 #79

jianjieluo · 2020-08-16T14:30:47Z

Hi, when I use the search URLs generated by the feed() function in GoogleFeeder, I can only get around 100 images even though max_num=1000. I find that all the URLs return the same 100 results as the first URL. It seems that the ijn and start params no longer have any effect. I just want to get nearly 1000 images per keyword. Does anybody have a solution?

def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i+100})
        self.logger.debug('put url to url_queue: {}'.format(url))
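To inspect the paged URLs without running the crawler, the loop above can be sketched as a standalone function. This is only an illustration: build_search_urls and page_size are hypothetical names, not part of icrawler, and the sketch leaves out the tbs filter string and the lr language parameter.

```python
from urllib.parse import urlencode


def build_search_urls(keyword, offset=0, max_num=1000, page_size=100):
    """Return the paged search URLs that a loop like feed() would enqueue.

    Hypothetical helper for inspection only; it does not send any request.
    """
    base_url = 'https://www.google.com/search?'
    urls = []
    for i in range(offset, offset + max_num, page_size):
        # ijn is the page index and start is the result offset, as in feed()
        params = dict(q=keyword, ijn=i // page_size, start=i, tbm='isch')
        urls.append(base_url + urlencode(params))
    return urls


if __name__ == '__main__':
    for url in build_search_urls('car', max_num=300):
        print(url)
```

Opening each printed URL in a browser is one way to check whether the later pages really differ from the first one.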


vogelbam · 2020-08-17T17:59:43Z

I think your problem might be related to #38.

jianjieluo · 2020-08-17T18:13:17Z

@vogelbam Hi, thanks for your reply. However, I find that the date_min argument was removed from the docs after issue #38. What's worse, searching images by date doesn't work any more (#78). I have tried searching with different date ranges, but it failed. It seems that the URL param below doesn't work anymore.

icrawler/icrawler/builtin/google.py

Line 114 in 1acbb96

return 'cdr:1,cd_min:{},cd_max:{}'.format(*date_range)

r-y-zadeh · 2020-08-20T01:28:49Z

Same issue for me.
It seems that the paging method is not working correctly and only the first page is processed. For example, to crawl car images, the URL of the first page is:
https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch
This page is OK and the crawler can fetch around 100 images. For the next pages, the URLs are:
https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
...
Parsing these pages does not return any results. I've also checked these pages in my browser, and they all return the same results as the first page.

ZhiyuanChen · 2020-11-03T10:09:00Z

I have just looked into it for a bit, and it seems Google now updates the result page through a POST request like this:
https://www.google.com/imgevent?ei=vimhX4KlHOqYr7wPsP6YwAk&iact=ms&forward=1&ct=vfe_scroll&scroll=1400&page=1&start=24&ndsp=4&bih=1830&biw=389

ManiaaJia · 2022-01-25T10:49:51Z

Is this problem fixed now? I have the same issue and hope to download more pictures.

somisawa · 2022-09-29T05:00:35Z

It seems that Google's algorithm may cause the crawler to fetch fewer resources than expected. I solved this by brute force, iteratively setting disjoint date ranges like this:

import datetime

from icrawler.builtin import GoogleImageCrawler

n_total_images = 10000
n_per_crawl = 100

delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)


def datetime2tuple(date):
    # The 'date' filter takes (year, month, day) tuples
    return (date.year, date.month, date.day)


# Each iteration crawls one 30-day window, walking backwards in time
for i in range(n_total_images // n_per_crawl):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(
        downloader_threads=4, storage={'root_dir': '/path/to/image'})
    google_crawler.crawl(
        keyword='<YOUR_KEYWORDS>',
        filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
        file_idx_offset=i * n_per_crawl,
        max_num=n_per_crawl)
    # End the next window one day earlier so the ranges stay disjoint
    end_day = start_day - datetime.timedelta(days=1)

Edit: Note that this method may cause image duplication, so you should postprocess the collected images. FYI, I use the imagededup Python library, which provides a CNN-based duplicate image detector.
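For a quick first pass before a CNN-based tool, exact byte-for-byte duplicates can be found with the standard library alone. The sketch below is a simpler, swapped-in technique (hashing file contents), not what imagededup does: it will miss resized or re-encoded copies. The function name find_exact_duplicates is hypothetical.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(root_dir):
    """Group files under root_dir by SHA-256 digest of their bytes.

    Returns a list of groups (lists of paths) that share identical content.
    Only exact duplicates are detected, not near-duplicates.
    """
    by_digest = {}
    for path in sorted(Path(root_dir).rglob('*')):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest.setdefault(digest, []).append(path)
    # Keep only digests that map to more than one file
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == '__main__':
    # '/path/to/image' is the placeholder root_dir used in the script above
    for group in find_exact_duplicates('/path/to/image'):
        print('duplicates:', [str(p) for p in group])
```

Within each group, keeping the first path and deleting the rest would remove the exact copies; near-duplicates still need a perceptual or CNN-based detector.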

hasnatsakil · 2023-06-18T14:46:57Z

you may get 2000 perfectly.

ZhiyuanChen added bug help wanted labels Oct 26, 2020

somisawa mentioned this issue Sep 29, 2022

Unable to download more images in BingImageDownloader #92

Open

ZhiyuanChen mentioned this issue Nov 9, 2022

How do I get 1000 images correctly? #38

Closed
