
Google Crawler can only get around 100 images instead of 1000 #79

Open

jianjieluo opened this issue Aug 16, 2020 · 7 comments

@jianjieluo
Hi, when I use the search URLs generated by the feed() function in GoogleFeeder, I can only get around 100 images even though max_num=1000. All of the URLs return the same 100 results as the first URL, so it seems that the ijn and start params no longer have any effect. I just want to get nearly 1000 images per keyword. Does anybody have a solution?

def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
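For reference, the URL-generation loop above can be reproduced standalone to see exactly what the feeder emits (a minimal sketch; the real feeder also applies the tbs filter string and pushes each URL onto a queue):

```python
from urllib.parse import urlencode

def google_image_urls(keyword, offset=0, max_num=1000):
    """Reproduce the URLs the feeder emits: one page per 100 results."""
    base_url = 'https://www.google.com/search?'
    urls = []
    for i in range(offset, offset + max_num, 100):
        params = dict(q=keyword, ijn=i // 100, start=i, tbm='isch')
        urls.append(base_url + urlencode(params))
    return urls

urls = google_image_urls('car')
# Ten URLs are generated, differing only in ijn/start, but as reported
# below, Google now serves the same first page of results for all of them.
```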
@vogelbam

I think your problem might be related to #38.

@jianjieluo
Author

@vogelbam hi, thanks for your reply. However, I find that the date_min argument was removed from the docs after issue #38. What's worse, searching images by date doesn't work any more either (#78). I have tried searching with different date ranges, but it failed. It seems that the URL param built below no longer works.

return 'cdr:1,cd_min:{},cd_max:{}'.format(*date_range)
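The string above ends up as the tbs query parameter of the search URL. A minimal sketch of how a date range becomes part of the final URL (tbs/cdr/cd_min/cd_max are Google's undocumented filter syntax and can change at any time; here the dates are passed as pre-formatted M/D/YYYY strings for illustration):

```python
from urllib.parse import urlencode

def date_filter(date_range):
    # cdr:1 enables the custom date range; cd_min/cd_max bound it (M/D/YYYY)
    return 'cdr:1,cd_min:{},cd_max:{}'.format(*date_range)

params = dict(q='car', tbs=date_filter(('8/1/2020', '8/16/2020')), tbm='isch')
url = 'https://www.google.com/search?' + urlencode(params)
# urlencode percent-encodes the ':' , ',' and '/' inside the tbs value
```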

@r-y-zadeh

r-y-zadeh commented Aug 20, 2020

Same issue for me.
It seems that the paging method is not working correctly and only the first page is processed. For example, to crawl car images, the URL of the first page is:
https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch
This page is OK and the crawler can fetch around 100 images. For the next pages the URLs are:
https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
...
Parsing these pages does not return any results. I've also checked these pages in my browser, and they all return the same results as the first page.

@ZhiyuanChen
Collaborator

I have just looked into it a bit, and it seems Google now updates the result page through a POST request like this:
https://www.google.com/imgevent?ei=vimhX4KlHOqYr7wPsP6YwAk&iact=ms&forward=1&ct=vfe_scroll&scroll=1400&page=1&start=24&ndsp=4&bih=1830&biw=389

@ManiaaJia

Is this problem fixed now? I have the same issue and hope to download more pictures.

@somisawa

somisawa commented Sep 29, 2022

It seems that Google's algorithm may cause the crawler to fetch fewer resources than expected. I worked around this by brute force, iterating over disjoint date ranges like:

from icrawler.builtin import GoogleImageCrawler
import datetime

n_total_images = 10000
n_per_crawl = 100

delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)

def datetime2tuple(date):
    return (date.year, date.month, date.day)

for i in range(n_total_images // n_per_crawl):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(downloader_threads=4, storage={'root_dir': '/path/to/image'})
    # Restrict each crawl to its own 30-day window so every run returns fresh results
    google_crawler.crawl(keyword='<YOUR_KEYWORDS>',
                         filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
                         file_idx_offset=i * n_per_crawl,
                         max_num=n_per_crawl)
    # Move the window back so the next iteration covers the preceding month
    end_day = start_day - datetime.timedelta(days=1)

Edit: Note that this method may cause image duplication. You should postprocess the collected images. FYI, I use the imagededup Python library, which provides a CNN-based duplicate-image detector.
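For exact (byte-identical) duplicates, a stdlib-only pass is often enough as a first step before running a perceptual tool like imagededup. This is a sketch, not part of icrawler; dedup_exact and the directory layout are illustrative:

```python
import hashlib
from pathlib import Path

def dedup_exact(image_dir):
    """Delete byte-identical files, keeping the first one seen per hash."""
    seen = {}
    removed = []
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()              # drop the duplicate file
            removed.append(path.name)
        else:
            seen[digest] = path.name
    return removed
```

This only catches files that are exactly identical; near-duplicates (re-encoded, resized, or recompressed copies) still need a perceptual-hash or CNN-based approach such as imagededup.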

@hasnatsakil

With this approach you may get 2000 images perfectly.
