
Crawl only jpg files #98

Open
LostInDarkMath opened this issue Mar 12, 2021 · 2 comments
@LostInDarkMath

Hi there!
When I use GoogleImageCrawler, I sometimes get PNG files and sometimes JPG files, depending on what Google finds.
Is there a way to configure the crawler to download only JPG files and no other file types?

@ZhiyuanChen
Collaborator

ZhiyuanChen commented Mar 22, 2021

You could override the parse function for now, for example:

import re

from bs4 import BeautifulSoup
from icrawler import Parser


class GoogleParser(Parser):

    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            txt = str(div)
            # The image metadata is embedded in the AF_initDataCallback
            # script blocks; skip every other <script> tag.
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            # Instead of decoding the embedded JSON, pull out only URLs
            # ending in .jpg/.jpeg. [^"] keeps each match inside a single
            # quoted string so it cannot run across two URLs.
            uris = re.findall(r'http[^"]*?\.(?:jpg|jpeg)', txt)
            return [{'file_url': uri} for uri in uris]
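The regex is what does the actual filtering: only URLs ending in .jpg or .jpeg are returned, so PNGs are never queued for download. Here is a minimal self-contained check, using a made-up fragment in place of the real AF_initDataCallback payload, and with `[^"]` instead of `.` in the pattern so a lazy match cannot run across several quoted strings:

```python
import re

# Made-up stand-in for the quoted URLs inside an AF_initDataCallback block.
txt = (
    '["https://example.com/a.jpg",3],'
    '["https://example.com/b.png",3],'
    '["https://example.com/c.jpeg",3]'
)

# The .png URL yields no match, because [^"]*? cannot cross the closing
# quote to reach the later ".jpeg".
uris = re.findall(r'http[^"]*?\.(?:jpg|jpeg)', txt)
print(uris)  # ['https://example.com/a.jpg', 'https://example.com/c.jpeg']
```

Assuming icrawler's Crawler still accepts a `parser_cls` argument, the custom parser can then be plugged in with something like `GoogleImageCrawler(parser_cls=GoogleParser, storage={'root_dir': 'images'})`.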

@LostInDarkMath
Author

Okay, thanks, but I thought there would be an easier way.
