GoogleImageCrawler not working! #125

Open
Shashwat79802 opened this issue Apr 11, 2024 · 10 comments

@Shashwat79802

from icrawler.builtin import BingImageCrawler, GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': './downloads'})
google_crawler.crawl(keyword='gui based tool', max_num=50)

Error:
2024-04-11 21:58:34,449 - INFO - icrawler.crawler - start crawling...
2024-04-11 21:58:34,450 - INFO - icrawler.crawler - starting 1 feeder threads...
2024-04-11 21:58:34,450 - INFO - feeder - thread feeder-001 exit
2024-04-11 21:58:34,451 - INFO - icrawler.crawler - starting 1 parser threads...
2024-04-11 21:58:34,451 - INFO - icrawler.crawler - starting 1 downloader threads...
2024-04-11 21:58:39,452 - INFO - downloader - downloader-001 is waiting for new download tasks
2024-04-11 21:58:43,322 - INFO - parser - parsing result page https://www.google.com/search?q=gui+based+tool&ijn=0&start=0&tbs=&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shashwat/Django/SummarEaseMain/Test/.venv/lib/python3.10/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2024-04-11 21:58:44,453 - INFO - downloader - no more download task for thread downloader-001
2024-04-11 21:58:44,453 - INFO - downloader - thread downloader-001 exit
2024-04-11 21:58:44,462 - INFO - icrawler.crawler - Crawling task done!

@Patty-OFurniture commented Apr 16, 2024

I've made some fixes in my fork to address this, but since then:

  1. GIS has made changes to its results
  2. Even chromedriver/geckodriver solutions now fail on the first attempt; just retry

Just after I made my changes, GIS wouldn't work on the first attempt for icrawler.

@wjdghks950

@Patty-OFurniture @Shashwat79802 Any updates on this? I'm trying to make this work. As @Patty-OFurniture said, when I ran the following code a second time, it suddenly started working; but when I ran it a third time, it stopped working again. Any idea what is causing this issue?

google_crawler = GoogleImageCrawler(storage={'root_dir': './downloads'})
google_crawler.crawl(keyword='gui based tool', max_num=50)

@thiagorizzo

Hi! Is there any fix for this error yet?

@Patty-OFurniture

chromedriver crawlers seem to have problems as well. The first search doesn't work, but the second try does. It may be setting and looking for a cookie, like DuckDuckGo does. chromedriver works well enough for me, for now.

@wjdghks950 What's causing it is that Google changed things. I've been looking through various forks to see if anyone else has fixed this, and I haven't seen a fix yet.

ZhiyuanChen added a commit to ZhiyuanChen/icrawler that referenced this issue May 15, 2024
@ZhiyuanChen (Collaborator)

Please let me know if 0.6.8 fixes this issue~

@ed2050 commented May 16, 2024

I still have this issue, but it's intermittent. For every 10 queries, I get 1-2 failures. Running the failed query a second time usually succeeds. Not a showstopper; I can rerun the queries.

Could be an issue with the Google response: I'm not sure if they send the exact same data every time, or if the response data varies somehow. Could also be a race condition where the parser thread is started before the response object is finished. Just a guess; it could be something completely different.

Exception in thread parser-001:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/foobar/Library/Python/3.8/lib/python/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable

@ed2050 commented Jun 4, 2024

I know what's causing the exceptions, at least superficially. Sometimes the returned Google page doesn't have any <script> tags, or doesn't have any http urls. This makes the parse method hit the end without executing a return statement. Hence it returns None, triggering the exception above. See the code below.

This can be fixed by adding return [] at the end of the function, so it at least returns an empty list when no urls are found. That prevents the for task in parse() loop from trying to iterate over a None value.

import re

from bs4 import BeautifulSoup

from icrawler import Parser


class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(response.content.decode("utf-8", "ignore"), "lxml")
        # image_divs = soup.find_all('script')
        image_divs = soup.find_all(name="script")
        for div in image_divs:  # <--- only runs if a <script> tag was found
            txt = str(div)
            uris = re.findall(r"http[^\[]*?.(?:jpg|png|bmp)", txt)
            if not uris:
                uris = re.findall(r"http[^\[]*?\.(?:jpg|png|bmp)", txt)
            uris = [bytes(uri, "utf-8").decode("unicode-escape") for uri in uris]
            if uris:
                return [{"file_url": uri} for uri in uris]

        # <--- previously the function ended here, implicitly returning None
        return []  # <--- add this line

As for why the returned Google page sometimes doesn't have any detectable script tags or urls, I'm not sure. It could be a failure at the source, where Google returns a malformed page. It could also be a mishandled return code: does icrawler always check the HTTP response code and content-type before processing a response? I've seen similar errors in my own web crawlers when I forgot to check those things.

Also, wget and curl do some advanced checks that seamlessly handle HTTP redirects, temporary errors, javascript detection, etc. It could be something along those lines.

Hope this helps.

@ed2050 commented Jun 5, 2024

Response checks

As I suspected, the issue is that the HTTP response is never checked in Parser.worker_exec. The response status code is not validated before the body is parsed. In particular, here:

try:
    base_url = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    response = self.session.get(url, timeout=req_timeout, headers={"Referer": base_url})
except Exception as e:
    self.logger.error(
        "Exception caught when fetching page %s, error: %s, remaining retry times: %d",
        url, e, retry - 1,
    )
else:
    self.logger.info(f"parsing result page {url}")
    for task in self.parse(response, **kwargs):
        ...

Google sometimes returns status code 429 Too Many Requests with a captcha challenge HTML doc. That's why response.text contains a valid HTML doc for soup, but no <script> tags with urls.

I suggest handling 429 responses with an exponential backoff before retrying.
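
For illustration, here's a minimal sketch of what that backoff could look like; the helper name fetch_with_backoff and its parameters are mine, not icrawler's:

import random
import time

import requests


def fetch_with_backoff(session: requests.Session, url: str, max_retries: int = 5) -> requests.Response:
    # Retry on 429 with exponential backoff (illustrative sketch, not icrawler code).
    delay = 1.0
    response = None
    for _ in range(max_retries):
        response = session.get(url, timeout=5)
        if response.status_code != 429:
            return response
        # Honor Retry-After when the server sends it in seconds;
        # otherwise sleep for the current delay plus jitter, then double it.
        retry_after = response.headers.get("Retry-After", "")
        time.sleep(float(retry_after) if retry_after.isdigit() else delay + random.uniform(0, 0.5))
        delay *= 2
    return response  # still rate-limited after max_retries; caller decides what to do next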

Other return codes

FYI, the response check in Downloader.download is much too strict. The code stops on any HTTP code that isn't exactly 200:

elif response.status_code != 200:
    self.logger.error("Response status code %d, file %s", response.status_code, file_url)
    break

Any 2xx-level code indicates success and may return content (except 204 No Content). Better to check 200 <= status_code <= 299.

If you want to make it even more robust, you can easily handle certain return codes outside the 2xx range. For instance, 301 Moved Permanently, 302 Found, 303 See Other, and 307/308 are all simple redirects to another url returned by the server.
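
As a rough sketch (classify_response is a hypothetical helper of mine, not icrawler code):

import requests

REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify_response(response: requests.Response) -> str:
    # Coarse classification of an HTTP response for a download loop (illustrative).
    code = response.status_code
    if code == 204:
        return "empty"  # success, but no body to save
    if 200 <= code <= 299:
        return "ok"
    if code in REDIRECT_CODES:
        return "redirect"  # follow response.headers["Location"]
    if code == 429:
        return "rate_limited"  # back off and retry
    return "error"

Note that requests already follows redirects for GET by default, so a downloader only sees these 3xx codes if it passes allow_redirects=False.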

@ed2050
Copy link

ed2050 commented Jun 5, 2024

Also, sometimes Google returns a page with status 200 that has no search results. The string <title>Before you continue</title> seems to indicate further action is needed by the user. It would be good to detect and skip such pages instead of trying to parse them for images.
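
A guard like the following could be added before parsing; the helper and the exact marker string are my assumption based on the observation above:

def looks_like_consent_page(html: str) -> bool:
    # Heuristic: Google served its "Before you continue" interstitial instead of results.
    return "<title>Before you continue" in html


# Inside parse(), before extracting <script> tags:
html = response.content.decode("utf-8", "ignore")
if looks_like_consent_page(html):
    return []  # consent/captcha page; nothing to parse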

@ZhiyuanChen (Collaborator)

(quoting @ed2050's May 16 comment above)

This is very valuable information.
Would you mind submitting a patch to fix it?
