Requests

Modify User Agent

Some websites combat crawlers by inspecting the user agent, which can be thought of as the name of your browser. A website may reject your HTTP request if it detects that you are not using a normal browser. This happens because requests identifies itself to the website by default, sending a User-Agent header of the form python-requests/<version>. You can change this behaviour with the headers parameter of requests.get:

r = requests.get(url, headers={'user-agent': 'Put-User-Agent-String-Here'})

Use Open Rice as an example:

import requests
url = 'https://www.openrice.com/en/hongkong/restaurants?what=sushi'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'}
r = requests.get(url, headers=headers)  # send the request with a browser-like user agent

A more complete experiment can be found here
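
To see which user agent would actually be sent, without contacting the site, one option is to prepare a request and inspect its headers. The following is a minimal sketch: the helper name `build_session` is hypothetical, and the user agent string is simply the one from the Open Rice example above.

```python
import requests

# User agent string taken from the Open Rice example; any browser string works.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36"
)


def build_session(user_agent=BROWSER_UA):
    """Hypothetical helper: a Session that sends `user_agent` on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session()

# Prepare the request but do not send it, so no network access is needed.
request = requests.Request(
    "GET", "https://www.openrice.com/en/hongkong/restaurants?what=sushi"
)
prepared = session.prepare_request(request)

print("Default user agent: ", requests.utils.default_user_agent())
print("Overridden user agent:", prepared.headers["User-Agent"])
```

Setting the header on a Session rather than on each call means every later `session.get(...)` reuses the same user agent.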

HTTP status code

When you make a request to a website, the response carries a status code that tells you whether the request succeeded. Common examples:

  • 200 OK
  • 400 Bad Request
  • 401 Unauthorized
  • 403 Forbidden

For more examples, please refer to here.
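
In code, the status code is available as `r.status_code`, so it can be checked before parsing the response. The sketch below is one possible way to do this; the helper names `describe` and `fetch` are hypothetical, and the 10-second timeout is an arbitrary assumption.

```python
from http import HTTPStatus

import requests


def describe(status_code):
    """Return a readable label such as '403 Forbidden' for a numeric status code."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        # Not a status code known to the standard library.
        return f"{status_code} (unknown status)"


def fetch(url, **kwargs):
    """Hypothetical helper: return the response on 200 OK, otherwise None."""
    r = requests.get(url, timeout=10, **kwargs)  # timeout value is an assumption
    if r.status_code != requests.codes.ok:
        print("Request failed:", describe(r.status_code))
        return None
    return r


# Labels for the status codes listed above (no network access needed).
for code in (200, 400, 401, 403):
    print(describe(code))
```

A 403 response from `fetch` is often a hint that the site rejected the default user agent, in which case passing `headers=` as described earlier may help.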

Return empty results

Case: Airbnb

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.airbnb.com/s/all?adults=1&children=0&infants=0&guests=1&toddlers=0&refinement_paths%5B%5D=%2Ffor_you')
html_text = BeautifulSoup(r.text,"html.parser")
hotels = html_text.find_all('div')
hotels

You will find that the content you wanted is not there. If you save the response to an HTML file and reopen it in a browser, you get a blank page.

with open('mypage.html', 'w', encoding='utf-8') as f:  # closes the file when done
    f.write(r.text)

This indicates that the page is loaded dynamically. You may need to use selenium or splinter to scrape it instead.
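
Rather than opening the saved file by hand, a rough check can be automated. The sketch below is only a heuristic, not a definitive test: it assumes a page is probably rendered by JavaScript when it contains script tags but very little visible text. The function name `looks_dynamic` and the 200-character threshold are hypothetical choices.

```python
from bs4 import BeautifulSoup


def looks_dynamic(html, min_text_chars=200):
    """Heuristic: True if the HTML has scripts but almost no visible text.

    `min_text_chars` is an arbitrary threshold chosen for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    n_scripts = len(soup.find_all("script"))

    # Remove elements whose content is never shown as page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    visible_text = soup.get_text(separator=" ", strip=True)
    return n_scripts > 0 and len(visible_text) < min_text_chars


# Two small hand-written pages to illustrate the idea.
static_page = "<html><body><p>" + "Sushi restaurant listing. " * 20 + "</p></body></html>"
dynamic_page = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

print(looks_dynamic(static_page))
print(looks_dynamic(dynamic_page))
```

With a real response you would call `looks_dynamic(r.text)`; a True result suggests trying selenium or splinter, as noted above.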