- Week 08: Advanced scraping - anti-crawler, browser emulation and other nitty gritty
- Objective
- Anti-crawling
- Common issues
- Browser emulation
- Analyse Network Traces
- Bonus: Crawl mobile Apps
- Bonus: Other quick scraping/ crawling tricks
- Exercises and Challenges
- Related Readings
In this chapter, we will learn advanced scraping: scraping dynamically loaded pages, or pages that require interaction, where we need to emulate a browser to navigate and find elements.
As the title suggests, this chapter is more demanding than the previous ones. We need to spend more time learning two completely new libraries - selenium
and splinter
- which are similar but differ slightly in syntax. At the same time, we need to learn more about the frontend trio
: HTML, JS, and CSS, and how to locate and extract elements from them.
After this chapter, I believe we can apply what we learn to most common scraping cases
- most websites, social media, etc. - which paves the way for the further data analysis stage (interested students can talk to me or refer to chapter 7 to learn more).
- Bypass anti-crawler by modifying user-agent
- Handle glitches: encoding, pagination, ...
- Handle dynamic page with headless browser
- Handle login with headless browser
- Scrape social networks
- Case studies on different websites
- Further practise the list-of-dict data structure; organise multi-layer loops and item-based parsing logic.
The simplest way to prevent crawler access is to filter by user agent. "User agent" can be thought of as a synonym for "web browser". When you surf the Internet with a normal web browser, the server knows whether you use Chrome, Firefox, IE, or another browser. Your browser gives this information to the web server in a field called user-agent
in the HTTP request headers. Similarly, requests
is like a web browser for Python code rather than for a human. It also sends a user-agent
to the web server, and the default value looks like python-requests/*
. In this way, the server knows that the client is the Python requests module, not a regular human user. One can bypass this limit by modifying the user-agent string.
r = requests.get(url,
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
    }
)
Full code and demo can be found in this notebook.
https://nghttp2.org/httpbin is a useful service for testing HTTP requests. This service basically echoes the content or certain parameters of your HTTP request, so you can get a better idea of what your tools send to the server.
Example: Check the user-agent of shell command curl
:
%curl https://nghttp2.org/httpbin/user-agent
{"user-agent":"curl/7.54.0"}
Example: Check the default user-agent of requests
:
>>> r = requests.get('https://nghttp2.org/httpbin/user-agent')
>>> r.text
'{"user-agent":"python-requests/2.19.1"}\n'
>>> r = requests.get('https://nghttp2.org/httpbin/user-agent', headers={'user-agent': 'See, I modified the user agent!!'})
>>> r.text
'{"user-agent":"See, I modified the user agent!!"}\n'
- Limit by IP
- Limit by cookie/ access token
- Limit by API quota per unit of time, usually implemented with a leaky bucket algorithm
Scrapers usually return a list of objects. Sometimes the list can be enumerated given certain IDs. One common case is the page=xxx
parameter in the URL: you can increment the page number and assemble valid URLs. Some carefully designed web services try to hide this kind of incremental ID, in order to prevent other people's crawlers from accessing the information so easily. Nevertheless, you can analyse the page structure in depth and find a way. The principle is: as long as the user can see it, there is no way to ultimately hide it from a robot. The only thing a website builder can do is make the crawling less straightforward.
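For example, here is a minimal sketch of enumerating such pages with requests; the base URL and the page parameter name are hypothetical placeholders for whatever you observe in the address bar:
import requests

base_url = 'https://example.com/articles'   # hypothetical site that paginates with ?page=N

pages = []
for page in range(1, 6):                    # assemble and request page 1 to 5
    r = requests.get(base_url, params={'page': page})
    if r.status_code != 200:                # stop when the server no longer returns a valid page
        break
    pages.append(r.text)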
qidian.com hides key information using special fonts. When a normal user visits the webpage, those non-printable characters are rendered normally because the page loads a special font. However, when you check the Chrome Developer Console, or try to get the value of the string in Python, the number field appears as non-printable characters. The workaround is to analyse the font file and build the decoding logic yourself. Find the discussion in #85.
Most early websites are designed in a stateless manner. That is, the order in which you visit the pages does not make a difference. For example, you can visit article 1 and then jump to article 2.
On the contrary, some modern websites are stateful. One common example is the requirement to log in. You need to first visit the login page before sending the username/ password to the server, and you need to have successfully sent the username/ password in order to access certain restricted resources.
Imagine a social network for another stateful example. As a regular user, you must first visit your own homepage to access the friend list. Then you can access the profile page of a friend, after which you can visit a specific post.
Websites can record the user activity log and enforce stateful transitions. However, this design also consumes a lot of resources and is not commonly preferred. On some mission critical sites, like banks, you may find stateful enforcement implemented by a token injected from the server side into the current page.
Again, nothing can be completely hidden if a regular user can access it. It just takes more effort.
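A minimal sketch of handling such a stateful login with requests.Session; the URL and form field names are assumptions for a hypothetical site, so check the real login form in the developer console first:
import requests

session = requests.Session()                      # a Session keeps cookies across requests

# Step 1: visit the login page first, so the server can set initial cookies
session.get('https://example.com/login')

# Step 2: submit the credentials; the field names are hypothetical
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})

# Step 3: the same session can now access restricted resources
r = session.get('https://example.com/restricted/page')
print(r.status_code)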
Checking the user-agent is the first step in identifying whether a client is valid or not. This step is generally referred to as "client authentication": the server checks whether the request is sent from a legitimate client, for example via a client-side signature. However, the web is an open world. You can be assured that nothing can be hidden once the data is already in your browser. Even if the web developers apply complex computational logic to conduct client authentication, the client-side authenticator code is available in your browser as Javascript. With some cryptography knowledge, one may be able to reverse engineer it.
However, cracking the system is not the purpose. Our objective is to get data. Instead of cracking and translating the logic into a Python script, we had better re-use the Javascript as-is. Or, furthermore, we can run a browser emulator to trigger that logic naturally, which may be a more direct solution. The browser emulator is one major topic introduced in this chapter.
Example 1: See 51job.com example
Example 2:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.comm.hkbu.edu.hk/comd-www/english/people/m_facutly_dept.htm')
r.encoding = 'utf-8'
mypage = BeautifulSoup(r.text, 'html.parser')
mypage.find('td', {'class': 'personNameArea'}).text
- Delay refers to the time elapsed between sending a request and the browser receiving the complete HTTP response. Your code needs to account for potential delays, especially under different network conditions; otherwise the data you are trying to access may not be ready when your code tries to process it. This was not a major problem when we used requests in the last chapter, because requests is "blocking": the whole HTTP response is received before Python executes the next line of code. However, it can become a major problem in this chapter when we use a browser emulator.
- Jitter refers to unstable/ unsteady delay: sometimes the delay is large and sometimes it is small. You will find your usually working code going wrong in some rare scenarios.
To handle delay and jitter, the common strategy is to wait and test. You can use time.sleep
to pause for some time before proceeding. If there is a way to test the finishing condition, you had better test it before moving on. For example, use .find_element_by_xxx
to check whether the intended element is already loaded; if not, wait further.
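A minimal wait-and-test sketch with selenium; the URL and CSS selector are hypothetical placeholders:
import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://example.com')            # hypothetical page that loads content dynamically

element = None
for attempt in range(10):                     # test up to 10 times, waiting 1 second between tries
    try:
        element = browser.find_element_by_css_selector('.article-list')  # hypothetical selector
        break                                 # the element is loaded, stop waiting
    except Exception:
        time.sleep(1)                         # not ready yet, wait and test again

if element is None:
    print('Element still not loaded after 10 seconds')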
The network can be interrupted in many ways, such as a sudden loss of wifi signal. When the network is interrupted, you may get partial or corrupted data. Make sure to guard your parser code with a try...except
block, handle the errors, and print a detailed log for further troubleshooting. A minimal sketch follows.
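For example, a sketch of a guarded parsing loop; parse_item and the items list are hypothetical stand-ins for your own parser and raw data:
items = [
    {'title': 'Article 1', 'date': '2018-09-16'},
    {'title': 'Article 2'},                    # corrupted record: 'date' is missing
]

def parse_item(raw):
    # hypothetical parser; replace with your own parsing logic
    return {'title': raw['title'], 'date': raw['date']}

results = []
for i, raw in enumerate(items):
    try:
        results.append(parse_item(raw))
    except Exception as e:
        # keep going, but log enough detail to troubleshoot later
        print('Error on item %s: %s' % (i, e))

print(results)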
If you are behind a firewall, which is common on campus and enterprise networks, some automatic HTTP requests may be flagged and then stopped. There is no direct solution to this. When in doubt, try an alternative network (wifi, wired, 4G, ...) to see if things work.
When you use a browser emulator, you also need to know that it takes time to render the whole page. You do not usually feel this because computers are very fast: most of the loading and rendering happens at the millisecond level. However, when you use an automatic program to browse and click, things become different. Your program may be so fast that the dependent element/ data has not loaded yet when you try to access it.
For example, StaleElementReferenceException
and IndexError
are quite common when using selenium
. The error sometimes disappears when you execute the same script with the same parameters again. It is better to add some time.sleep
between critical operations, for example when you want the browser to load new content after triggering a click on a button.
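Besides time.sleep, selenium also ships an explicit wait helper. A minimal sketch, where the URL and selector are hypothetical:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://example.com')            # hypothetical page

# Wait up to 10 seconds until the element appears, instead of sleeping a fixed time
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.article-list'))  # hypothetical selector
)
print(element.text)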
Related issues:
- Booking.com #68
Primarily, browser emulation (or browser automation) is used to automate web applications for testing purposes. For example, when you build a web application, you may want to simulate how many users your server can handle, and how users behave when they look at your website: how they open pages, click, navigate, and read the page content.
But browser emulation is certainly not limited to that. In this course, we mainly use it to manipulate the browser, interact with the website, locate information, and get the data we want.
- Some complicated websites can't be scraped directly by the static method. For example, some elements, especially page-turning buttons/ links, have embedded javascript code that requires certain user actions to load further content.
- Browser emulation can handle complicated scraping work, such as pages that require you to log in.
- Some webpages have strict anti-scraping rules. With browser emulation we simulate a user's behaviour, which is better camouflaged and harder to detect, so the limits are looser than for static scraping with requests.
In our course, we mainly introduce two libraries for browser emulation and dynamic scraping - Selenium
and Splinter
. Those packages are widely used in this field, and their documentation is easy to read.
The drawback: the emulator needs to load all the content of the webpage every time, so crawling is slow and therefore not suitable for scraping a large amount of data.
Selenium is a set of different software tools, each with a different approach to supporting browser automation. These tools are highly flexible, allowing many options for locating and manipulating elements within a browser. The key tool we will use is the Selenium Python bindings.
The Selenium Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.
You can visit here to learn how to use those functions by yourself. In the following example, we will use a CNN article scraping case to illustrate its basic functions and show how to scrape a webpage that needs our interaction.
!pip install selenium #in Jupyter Notebook
from selenium import webdriver #import
Selenium requires a driver to interface with the chosen browser. Chrome, for example, requires Chromedriver, which needs to be installed before the examples below can be run.
You can download drivers for the supported browsers from the following links:
| Supported Browsers | Download Links |
|---|---|
| Chrome | https://sites.google.com/a/chromium.org/chromedriver/downloads |
| Firefox | https://github.com/mozilla/geckodriver/releases |
| Safari | https://webkit.org/blog/6900/webdriver-support-in-safari-10/ |
For Windows users: please refer to here for download instructions.
After you download the webdriver, you can use the following command to initiate it.
browser = webdriver.Chrome() #the default way to initiate the webdriver; you can name the variable driver, browser, or anything you like.
Note: Make sure the driver is in your PATH, e.g. place it in /usr/bin
or /usr/local/bin
. If it is not in the PATH, initiating Chromedriver will raise the error Message: 'chromedriver' executable needs to be in PATH.
If this error is raised, you can solve it by putting the chromedriver you downloaded into one of the directories listed in your PATH. You can check which directories these are with the following commands; the first path returned is usually the one we want.
!echo $PATH #you will get a set of paths and the first one is what we want. In the following example, it will be as:
#/Library/Frameworks/Python.framework/Versions/3.6/bin
!ls #add the first path returned from last step to list all files in the folder
!open #add the first path returned from first step to open the path and put the chromedriver into the path
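Alternatively, instead of touching PATH, you can point selenium at the driver directly. A minimal sketch; the driver path is just an example, adjust it to wherever you saved chromedriver:
from selenium import webdriver

# pass the driver location explicitly, so it does not have to be in PATH
browser = webdriver.Chrome(executable_path='/Users/me/Downloads/chromedriver')  # example path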
You can do a lot of interactive things with the webpage with the help of selenium, like navigating to a link, searching, scrolling, clicking, etc. In the following example, we will demo the basic usage of navigation.
from selenium import webdriver
browser = webdriver.Chrome() #initiate webdriver
browser.get('http://google.com/') #visit to google page
element = browser.find_element_by_name("q") #Find the search box
element.send_keys("github python for data and media communication gitbook") #search our openbook
element.submit() #submit search action
# you will find the webpage will automatically return the results you search
open_book = browser.find_element_by_css_selector('.g')
link = open_book.find_element_by_tag_name('a') #find our tutorial
# you can also find by the link text. link = browser.find_element_by_partial_link_text('GitHub - hupili')
link.click() #click the link, enter our tutorial
browser.execute_script("window.scrollTo(0,1200);") #scroll in the page, window.scrollTo(x,y), x means horizontal, y means vertical
notes_links = browser.find_element_by_link_text('notes-week-08.md') #find the link to this week's notes
notes_links.click() #click into the notes
#browser.close()
There are many ways to locate elements. The usage is similar to the find...
calls we used in the static (requests + BeautifulSoup) approach, but with more variations.
Selenium provides the following methods to locate elements in a page:
- find_element(s)_by_id
- find_element(s)_by_name
- find_element(s)_by_xpath
- find_element(s)_by_link_text
- find_element(s)_by_partial_link_text
- find_element(s)_by_tag_name
- find_element(s)_by_class_name
- find_element(s)_by_css_selector
For instructions on the syntax, you can refer to this documentation. In our notes, we mainly use the find_element(s)_by_css_selector
method, because its expressions are concise and it can match a rich set of patterns.
Eg:
<div id="summaryList_mixed" class="summaryList" style="display: block;"></div>
css = element_name[<attribute_name>='<value>']
- Select id. Use
#
notation to select the id:
css="div#summaryList_mixed" or "#summaryList_mixed"
- Select class. Use the
.
notation to select the class:
css="div.summaryList" or just css=".summaryList"
- Select multiple attributes:
css="div[class='summaryList'][style='display: block;']"
When using the .className
notation, every class needs the prefix .
: .className1.className2.className3
(no blanks between the class names if they all belong to one element)
For example, for the following case:
<i class='sr_item sr_item_new sr_item_default sr_property_block sr_flex_layout'>
</i>
The css will be like this:
css='.sr_item.sr_item_new.sr_item_default.sr_property_block.sr_flex_layout'
For more detailed cases, please refer here
Eg:
<div id="summaryList_mixed" class="summaryList" style="display: block;">
<div class="summaryBlock"></div>
<div class="summaryBlock"></div>
<div class="summaryBlock"></div>
<div class="summaryBlock"></div>
</div>
- Locate all children
css="div#summaryList_mixed .summaryBlock"
- Locate a specific one with "nth-of-type". The first one is "nth-of-type(1)", and the last one is "last-child".
css="div#summaryList_mixed .summaryBlock:nth-of-type(2)"
For more explanations and examples about css selector, here is a good documentation you can refer to.
In navigation or scraping, we may need to click a button to turn pages, and the buttons are located differently on different websites. How do we locate those buttons? Scrolling down to the element may help you accomplish this. We use the cnn example
to demo here, and test how to turn pages via browser emulation.
from selenium import webdriver
import time
browser = webdriver.Chrome()
url = 'https://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
browser.get(url)
next_button = browser.find_element_by_css_selector('#mixedpagination ul.pagingLinks li.ends.next span a') #get the element's location
next_button.location
loc = next_button.location
browser.execute_script("window.scrollTo({x}, {y});".format(**loc)) #scroll to the element
next_button.click()
Apart from directly scrolling to the element, there are a few other scrolling usages you may need to know.
#method 1
browser.execute_script('window.scrollBy(x,y)') # x is horizontal, y is vertical
#method 2
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);') #scroll to the page bottom
#method 3
browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);') #you can divide numbers after the page height
#method 4
element = browser.find_element_by_class_name("pn-next")#locate the element
browser.execute_script("return arguments[0].scrollIntoView();", element) #scroll to view the element
The following is the link of results returned by a keyword search for trade war
. We can scrape those articles' titles, dates and urls for further study. The reason why we need selenium
is that the page-turning links are embedded javascript code, which cannot be extracted and used directly in the requests
way. To solve that, we need to interact with the page via browser emulation.
!pip3 install selenium # if you installed before, just ignore
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war')
articles = []
for session in browser.find_elements_by_css_selector('#summaryList_mixed .summaryBlock'): #find all articles wrapped in the path of class='summaryBlock' under the id='summaryList_mixed'
    article = {}
    h = session.find_element_by_css_selector(".cnnHeadline a")
    article['headline'] = h.text #find headline block
    article['url'] = h.get_attribute('href') #get url attributes from headline block
    article['date'] = session.find_element_by_css_selector("span.cnnDateStamp").text #find date
    articles.append(article)
articles
Output:
from selenium import webdriver
import time #mainly use its time sleep function
def get_articles_from_browser(b):
    articles = []
    for session in b.find_elements_by_css_selector('#summaryList_mixed .summaryBlock'): #find all articles wrapped in the path of class='summaryBlock' under the id='summaryList_mixed'
        article = {}
        h = session.find_element_by_css_selector(".cnnHeadline a")
        article['headline'] = h.text #find headline block
        article['url'] = h.get_attribute('href') #get url attributes from headline block
        article['date'] = session.find_element_by_css_selector("span.cnnDateStamp").text #find date
        articles.append(article)
    return articles
url = 'http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(2) #sleep 2 seconds after loading the page; if requests are too frequent with no sleep time, there is a high chance of being banned by the website.
all_page_articles = []
for i in range(10):
    time.sleep(0.5)
    try:
        new_articles = get_articles_from_browser(browser)
        all_page_articles.extend(new_articles)
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);') #test several numbers to choose a suitable one
        next_page = browser.find_element_by_link_text('Next')
        next_page.click()
    except Exception as e:
        print(e)
        print('Error on page %s' % i)
import pandas as pd #spoiler: pandas is the key module of the next chapter (see notes-week-09 for further information).
df = pd.DataFrame(all_page_articles) #convert articles into dataframe
df
Output:
Splinter achieves pretty much the same results as Selenium, though there are small differences in syntax. In the following, we will also use splinter
to demo the cnn example; you can compare it with the selenium
method and choose the one you like to practise more.
!pip3 install splinter
from splinter import Browser
import time
url = 'http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
browser = Browser('chrome')
browser.visit(url)
time.sleep(2)
Splinter provides one finding method per selector type: css
, xpath
, tag
, name
, id
, value
, text
. Each of these methods returns a list of the found elements, and you can use an index to access each element in the list. This is different from selenium
, which provides separate methods for finding a single element and a list of elements. All in all, the two libraries are very alike. You can check out here for the Splinter doc about finding elements. In this case, we mainly use the find_by_css(css_selector)
method.
!pip3 install splinter
from splinter import Browser
import time
url = 'http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
browser = Browser('chrome')
browser.visit(url)
time.sleep(2)
articles = []
for block in browser.find_by_css('#summaryList_mixed .summaryBlock'):
    article = {}
    h = block.find_by_css('.cnnHeadline a')
    article['headline'] = h.text
    article['url'] = h['href']
    article['date'] = block.find_by_css('span.cnnDateStamp').text
    articles.append(article)
articles
Output:
How to find its css? When you open chrome devtools, you can find the css in the Elements panel (and its Styles pane) by matching the highlighted markup with the corresponding element on the webpage.
url = 'http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
def get_articles_from_browser(b):
    articles = []
    for block in b.find_by_css('#summaryList_mixed .summaryBlock'):
        article = {}
        h = block.find_by_css('.cnnHeadline a')
        article['headline'] = h.text
        article['url'] = h['href']
        article['date'] = block.find_by_css('span.cnnDateStamp').text
        articles.append(article)
    return articles
# Launch the initial page
browser = Browser('chrome')
browser.visit(url)
time.sleep(2)
all_page_articles = []
for i in range(50): #scrape 50 pages
    time.sleep(0.5)
    try:
        new_articles = get_articles_from_browser(browser)
        all_page_articles.extend(new_articles)
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);') #scroll down
        next_buttons = browser.find_by_css('.pagingLinks li.ends.next')
        next_buttons[0].click() #splinter's find_by returns a list, so we use index 0 to access the next button
    except Exception as e:
        print(e)
        print('Error on page %s' % i)
import pandas as pd
df = pd.DataFrame(all_page_articles)
df
Output: There will be 500 rows.
Once we can handle browser emulation to find and extract data from a dynamically loaded webpage, we can further apply this method to crawl data from social media platforms - for example, a hot topic currently discussed on Twitter - to analyse people's comments and opinions about certain events. The following are some pointers that may be useful when manipulating browser emulation on Twitter:
- First, simulate the login process
- Do some navigating, searching and scrolling actions to load more of the content and tweets you want
- Extract the tweets with the different element-finding methods
Here are the common issues when scraping those social media platforms:
- There is a strong limitation on the data you can get. For example, after a certain point, the browser window cannot be scrolled any further and there is only a back to top
button at the bottom of the page.
- The scraping results from the web end may be different from the mobile end, for example
https://twitter.com/
and https://mobile.twitter.com/home
, because Twitter applies different regulations to different platforms.
Here is the sample code we wrote with selenium browser emulation. We scrape tweets by searching the keyword Mangkhut
, the typhoon that struck Hong Kong and the nearby region on 2018-09-16.
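A minimal sketch of the scroll-and-extract part, assuming you have already logged in inside the opened browser window; the search URL pattern is real, but the .tweet and .tweet-text class names are assumptions that may change as Twitter updates its markup:
import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://twitter.com/search?q=Mangkhut')   # search by keyword

for _ in range(5):                                      # scroll 5 times to load more tweets
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)                                       # wait for new tweets to load

tweets = []
for t in browser.find_elements_by_css_selector('.tweet'):                  # assumed selector
    try:
        tweets.append(t.find_element_by_css_selector('.tweet-text').text)  # assumed selector
    except Exception as e:
        print(e)
print(len(tweets))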
Open "Google Chrome Developer Console" by command+option+i
. Check out the "Network" tab and use "XHR" filter to find potential packages. We are usually interested in json
files or xml
files sent over the line. Most dynamic web pages have predictable / enumerable internal API -- use HTTP request to get certain json
/ xml
files.
Some websites render HTML at the backend and send it to the frontend dynamically. You find the URL in the address bar stays the same but the content changes. You can also find those HTML files and their real URLs via the developer console. One such example is the xiachufang.com scraper. Another example scraped Centaline Property's buildings on sale and the schools around those buildings in the same way.
Another common case is the "infinite scroll" design. When a page adopts infinite scroll, asynchronous data loading is inevitable. When you encounter those cases, network trace analysis may give a more concise solution: there are usually XHR
interfaces, so you don't even need dynamic crawling (browser emulation). One example is mafengwo.com's user history.
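Once you have found such an internal endpoint in the Network/ XHR tab, you can usually call it directly with requests. A minimal sketch, where the endpoint URL, parameter name and response field are hypothetical placeholders for whatever you observe in the XHR request:
import requests

# hypothetical internal API discovered in the Network/XHR tab
api_url = 'https://example.com/api/list'

items = []
for page in range(1, 4):
    r = requests.get(api_url, params={'page': page},
                     headers={'user-agent': 'Mozilla/5.0'})  # some APIs also check the user-agent
    data = r.json()                                          # the response is json, not HTML
    items.extend(data.get('results', []))                    # hypothetical field name

print(len(items))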
With the explosion of mobile Apps, more and more data has shifted from the open web to mobile platforms. The design principles of the web and mobile are very different. When Tim Berners-Lee initially designed the WWW, it was intended to be an open standard that everyone can connect to. That is why, once a web server is up, you can use Chrome to access it while other users may use Firefox or even Python requests
. There are many tools to emulate browser activities, so you can programmatically do the same things as a regular user surfing the Internet. Compared with the open web, the mobile world is a closed ecosystem. It often requires heavy-duty packet analysis, App decompilation, or App emulation in order to get the data behind mobile Apps. "Packet analysis" is closest to our course and is elaborated below.
This section is very similar to the earlier Analyse Network Traces. The only difference is that we analyse mobile App packets here.
No matter how mysterious a mobile App seems to be, it has to talk to a server in order to get updated information. You can be assured that everything you see on your smartphone screen comes from one of two channels:
- Embedded in the phone, i.e. in the operating system, or in the App when you initially install it
- Loaded via the Internet upon certain user operations, e.g. App launch, swipe left, touch, ...
Channel 1 is the topic of the next section. Channel 2 is what we are going to tackle. The idea is to insert a sniffer between the App and the backend server. In this way, whatever conversation the App has with the server passes through the sniffer first. The sniffer is also called a "man-in-the-middle" (MITM), and a famous attack is named after this. You may have also heard the term "proxy", which intercepts your original network packet, modifies it somehow, and then sends the packet to the destination. One can use a proxy to bypass Internet censorship or to hide the original sender's address. Our key tool is a MITM proxy. Here are two common choices:
- "Charles Proxy" -- Its GUI is very convenient for further packet analysis. It also has iOS and MAC clients. The software is not free though.
- mitmproxy
-- You can install it via pip
. It is free and open source. It provides a command line interface to help intercept and dump packets. Recent versions also provide a web interface to browse the sniffed packets.
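For example, a minimal mitmproxy session might look like this; the output file name is an example, and the proxy port shown is mitmproxy's default:
!pip install mitmproxy      # install the proxy (better done in a terminal than in a notebook)

# then, in a terminal:
#   mitmdump -w kwai_traffic    # run the proxy and write all sniffed flows to a file
#   mitmweb                     # or: run the proxy with a web interface for browsing flows
# finally, set your phone's wifi proxy to this computer's IP, port 8080, and browse the App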
Kuaishou is a popular video sharing platform that originated in China. We analyse its international version, Kwai, and scrape the top players' data.
We will omit the configuration of Charles Proxy on iOS and MAC, because there are numerous resources online and the interfaces keep changing. Once you finish the configuration, do the following steps:
- Start sniffing in Charles Proxy on iOS.
- Open Kwai App.
- Browse like a normal user. Note that the packet sniffer can only intercept the conversations that happened. So you want to trigger more actions.
- Quit Kwai App.
- Send sniffed packet traces to MAC for further analysis.
There is no direct formula for packet analysis. We usually observe the request/ response sequence over time. For example, if you "pull down" to refresh the video list at the 10th second, then the relevant packets are very likely to be sent around the 10th second. You can find that the data is obtained from an endpoint called http://api.kwai.com/
. Specifically, the App sends HTTP requests to http://api.kwai.com/rest/n/feed/hot
in order to obtain the list of hot videos. In our previous scraper examples, the HTTP request was usually sent with the GET
method. In the case of Kwai, POST
is used. A complete POST
request is composed of three parts:
- headers -- sent in the HTTP protocol; users cannot see them
- params -- usually appear as ?a=3&b=5
in the browser's address bar; a
and b
here are called parameters
- data -- the POST
body; this is the main content to be consumed by the web server; based on this content, the server gives the corresponding response.
Charles Proxy's MAC software can help you convert one HTTP request into Python code with the above three parts filled in -- that is, it gives you the Python code that can replay one request. The variable configurations are as follows, with certain fields masked to preserve privacy:
headers = {
'Host': 'api.kwai.com',
...
'Accept': 'application/json',
'User-Agent': 'kwai-ios',
'Accept-Language': 'en-HK;q=1, zh-HK;q=0.9, zh-Hans-HK;q=0.8',
}
params = (
('appver', '5.7.3.494'),
...
('c', 'a'),
('ver', '5.7'),
('sys', 'ios11.4'),
('mod', 'iPhone10,3'),
...
)
data = [
...
('coldStart', 'true'),
('count', '20'),
('country_code', 'hk'),
('id', '13'),
('language', 'en-HK;q=1, zh-HK;q=0.9, zh-Hans-HK;q=0.8'),
('pv', 'false'),
('refreshTimes', '0'),
('sig', ...),
('source', '1'),
('type', '7'),
]
Here's the request operation and its outcome:
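A hedged sketch of replaying the request with the variables above (assuming headers, params and data are filled in completely; the 'feeds' field name in the json response is an assumption and should be checked against what you actually receive):
import requests
import pandas as pd

# replay the sniffed request; headers/params/data come from the Charles Proxy export above
r = requests.post('http://api.kwai.com/rest/n/feed/hot',
                  headers=headers, params=params, data=data)

result = r.json()                         # the endpoint returns json
feeds = result.get('feeds', [])           # hypothetical field name holding the video list
df = pd.DataFrame(feeds)                  # turn the list of dicts into a table for inspection
df.head()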
Note the pd.DataFrame
is a pandas
object, which will be explained in notes-week-09.md.
This means to "crack" the App. You need to first get the installation package of the App, analyse its structure, decompile it, and understand how this App talk with a server from its source code.
Sophisticated App will embed certain cryptography routine in the App and authenticate itself with the server. Even if you successfully analysed the network packet, it is hard for you to come up with the correct authentication parameters. In order to understand how this authentication process is conducted, you may want to reverse engineer the App.
Further discussion is omitted here because this part takes years of computer science background, especially in information security domain.
Appium is a frequently used automatic testing tool. You can use it to emulate user operations on mobile Apps and scrape the data from the screen. For Android users, Auto.js is a convenient library that relies on accessibility features and does not require root access.
Actually, selenium
, which we introduced earlier in this chapter, was also initially an automatic testing tool for the web frontend. It then became a bridge between the programmable user and the web browser driver, and is used in a lot of scraping work. When you find yourself stuck with data access because of non-human-behaviour checks (e.g. anti-crawling), you can try searching the keywords "emulation" or "auto testing", and you will usually get pointers to useful tools.
In our class, we show you the very basic steps of scraping so that you know how things happen in a sequential way. However, you don't have to write code for everything from scratch in real practice. People have already made numerous tools and libraries that help you do certain tasks quickly. Here are some examples related to the course (Shell/ Python) for those who are interested:
- Type wget -r {url}
, where {url}
is the URL of the website you want to crawl. After running this command, you will find all the web pages and their dependent resources on your computer. You can fine-tune the parameters to limit the crawling scope, like the number of hops or the types of files. Use man wget
to find out more.
- There are many shell commands which can be combined to perform efficient text processing. This article shows how one can combine a few Shell commands to quickly download the Shakespeare works.
- This repo, originally a workshop given at PyCon HK in 2015, shows you some handy tools and libraries in Python that allow one to scrape more with less code. For example, you can use readability
to extract the main body of an HTML page without bothering with its page structure. For readers with a frontend development background, pyquery
is a handy library that lets you write jQuery-like selectors to access HTML elements. scrapely
is a machine learning based library that can learn from labelled crawling targets and generate the corresponding rules; the user only needs to tell scrapely
what to crawl, instead of how to crawl.
- Data Science at the Command Line by Jeroen Janssens is a comprehensive and duly updated reference book for command line tools for data science. Its Obtaining data chapter is good further reading for those interested in more efficient data collection in the Linux shell environment.
- pyspider
is a convenient spider (crawler/ scraper) framework in Python.
Search Engine Optimization (SEO) is a common technique a digital marketer needs to master. Suppose you have led a team to conduct the optimization; now it is time to audit the result. One of the key functions is to build a scraper which can:
- Input 1 is a search query, i.e. some keywords
- Input 2 is a set of URLs from your own website
- Output the rank of each URL in the search result list (a minimal rank-computation sketch follows this list)
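A minimal sketch of the rank-computation step only, assuming you have already scraped the ordered list of result URLs for a query; the example URLs are made up:
def url_ranks(search_result_urls, own_urls):
    """Return the 1-based rank of each of our URLs in the search results (None if not found)."""
    ranks = {}
    for url in own_urls:
        ranks[url] = search_result_urls.index(url) + 1 if url in search_result_urls else None
    return ranks

# made-up example data
results = ['https://other.com/a', 'https://mysite.com/page1', 'https://mysite.com/page2']
print(url_ranks(results, ['https://mysite.com/page1', 'https://mysite.com/missing']))
# {'https://mysite.com/page1': 2, 'https://mysite.com/missing': None}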
http://wenshu.court.gov.cn collects the legal cases in China. It supports advanced search options. One can emulate a browser to download the relevant documents in a certain area. Please try:
- Give a keyword as input.
- Download the documents of the first page, e.g.
.docx
files, onto local disk. - Organise an index of those documents into a
CSV
which may include "title", "court", "date", "document-path", and other fields if you deem useful.
Key Opinion Leaders (KOLs) are the go-to people for targeted mass marketing. As a marketing specialist, you want to identify the KOLs in a certain area so that your team can reach out to them effectively. Before learning sophisticated graph mining algorithms, one can do the following challenge to get a preliminary result:
- Given an industry domain, identify keywords
- For every keyword in keywords
, scrape the search results of related microblogs.
- Every piece of microblog may have the following data structure:
microblog = {
    'username': 'DATA HERE',
    'datetime': 'DATA HERE',
    'text': 'DATA HERE',
    'num_like': 'DATA HERE',
    'num_comment': 'DATA HERE',
    'num_share': 'DATA HERE'
}
- A simple algorithm to find KOLs is to sum num_like
, num_comment
and num_share
for each username
(see the sketch below).
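A minimal sketch of that counting step with pandas, assuming you have collected a list of microblog dicts in the structure above; the sample records are made up:
import pandas as pd

# made-up sample data in the structure described above
microblogs = [
    {'username': 'alice', 'num_like': 10, 'num_comment': 2, 'num_share': 1},
    {'username': 'bob',   'num_like': 50, 'num_comment': 9, 'num_share': 30},
    {'username': 'alice', 'num_like': 5,  'num_comment': 1, 'num_share': 0},
]

df = pd.DataFrame(microblogs)
# total engagement per username; users with the largest totals are KOL candidates
totals = df.groupby('username')[['num_like', 'num_comment', 'num_share']].sum()
print(totals.sort_values('num_like', ascending=False))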
Some people host competitions online and calculate the leaderboard based on web traffic, like page visits, the number of clicks on "upvote", and so on. Such a system is very easy to cheat if it does not adopt a CAPTCHA. After this week, you can use Python to emulate user behaviour and cheat those systems. Here is an example for your reference:
- Increase Youtube page views by refreshing the browser page: code
Please find another system/ another parameter of a system which you can cheat using similar tricks.
Imagine you have a crush on someone. You follow her every post and click "like" as soon as you see it. You decide that you shall be the first one to like every post, even when she posts at midnight. Since you need adequate sleep as a human being, your buddy laptop agrees to help. However, the laptop needs to know what to do in an exact, step-by-step manner. Now you tell it in the computer's language, i.e. Python, in our exercise. Use browser emulation to (a minimal hedged sketch follows the list):
- Open Facebook
- Find the input boxes for username and password and key in the right information.
- Submit and log in
- Read the posts in the timeline
- Check if any post is from her
- If so, click "like"
- Repeat the above steps every 1 minute
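A minimal sketch of the login steps only; the element names 'email' and 'pass' are assumptions based on common markup and may differ, so inspect the real page in the developer console before relying on them:
import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.facebook.com/')

# the element names 'email' and 'pass' are assumptions; check them in the developer console
browser.find_element_by_name('email').send_keys('your_username')
browser.find_element_by_name('pass').send_keys('your_password')
browser.find_element_by_name('pass').submit()        # submit the login form

time.sleep(5)                                        # wait for the timeline to load
# next steps: read the posts in the timeline, check the author, and click "like" if it is from her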
- Dynamic loading and crawling example: libguides example
- Social media crawling example: Scrape a luxury brand with keyword in Weibo
- Dynamic page crawling, with a matter of parsing page content: Timeout
- Dynamic crawling a static page with a matter of pagination: Amazon Books
- Selenium documentation in Chinese, written by a high school student: SELENIUM 的中文文档
If you have any questions, or need help troubleshooting, please create an issue here