
Crawl only jpg files #98

Open
LostInDarkMath opened this issue Mar 12, 2021 · 2 comments
@LostInDarkMath

Hi there!
When I use GoogleImageCrawler, I sometimes get PNG files and sometimes JPG files, depending on what Google finds.
Is there a way to configure the crawler to download only JPG files and no other file types?

@ZhiyuanChen
Collaborator

ZhiyuanChen commented Mar 22, 2021

You could override the parse function for now, for example:

import re

from bs4 import BeautifulSoup
from icrawler import Parser


class GoogleParser(Parser):

    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            txt = str(div)
            # The image metadata is embedded in the AF_initDataCallback
            # script blocks; skip every other <script> tag.
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            # Instead of decoding the embedded JSON, pull out only URLs
            # ending in .jpg/.jpeg. [^"] keeps each match inside a single
            # quoted string so it cannot run across two URLs.
            uris = re.findall(r'http[^"]*?\.(?:jpg|jpeg)', txt)
            return [{'file_url': uri} for uri in uris]
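The regex is what does the actual filtering: only URLs ending in .jpg or .jpeg are returned, so PNGs are never queued for download. Here is a minimal self-contained check, using a made-up fragment in place of the real AF_initDataCallback payload, and with `[^"]` instead of `.` in the pattern so a lazy match cannot run across several quoted strings:

```python
import re

# Made-up stand-in for the quoted URLs inside an AF_initDataCallback block.
txt = (
    '["https://example.com/a.jpg",3],'
    '["https://example.com/b.png",3],'
    '["https://example.com/c.jpeg",3]'
)

# The .png URL yields no match, because [^"]*? cannot cross the closing
# quote to reach the later ".jpeg".
uris = re.findall(r'http[^"]*?\.(?:jpg|jpeg)', txt)
print(uris)  # ['https://example.com/a.jpg', 'https://example.com/c.jpeg']
```

Assuming icrawler's Crawler still accepts a `parser_cls` argument, the custom parser can then be plugged in with something like `GoogleImageCrawler(parser_cls=GoogleParser, storage={'root_dir': 'images'})`.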

@LostInDarkMath
Author

Okay, thanks, but I thought there would be an easier way.
