Introduction to Web Scraping

We use this book: Web Scraping with Python: Collecting More Data from the Modern Web, 2nd edition, by Ryan Mitchell (O'Reilly, 2018). We must use the 2nd edition, because it differs substantially from the first.

Python3 is used throughout this book.

Note: This document assumes you have already installed Python3, pip, and virtualenv. If not, refer to these instructions.

This document covers our second week in this section of the course. It's our second week with Python, and our first week with scraping.

Contents

See also, elsewhere in this repo:

  • mitchell-ch3 — Mitchell chapter 3: More web scraping. This covers our third week's assigned reading.
  • more-from-mitchell — More from Mitchell: Web scraping beyond the basics. This covers our fourth week's assigned reading.

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Setup for BeautifulSoup

BeautifulSoup is a scraping library for Python. We want to run all our scraping projects in a virtual environment. Students have already installed both Python3 and virtualenv.

Create a directory and change into it

The first step is to create a new folder (directory) for all your scraping projects. Mine is:

Documents/python/scraping

Do not use any spaces in your folder names. If you must use punctuation, do not use anything other than an underscore (_). It's easiest if you use only lowercase letters.

Change into that directory. For me, the command would be:

cd Documents/python/scraping

Create a new virtualenv in that directory and activate it

Create a new virtualenv there (this is done only once).

Mac OS/bash

$ virtualenv --python=/usr/local/bin/python3 env

Skip to Continue ... below.

Windows PowerShell

PS> virtualenv --python=C:\Python37\python.exe env

Note: On Windows, this might not be the location of your Python 3. To find the location, start the Python 3 interpreter and try this code to find your installed Python path:

>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)

The interpreter will print the path to your Python installation. Use the virtualenv command shown above, but replace C:\Python37 with the path you just got. Make sure to keep \python.exe env at the end.

Continue ...

Activate the virtualenv:

Mac OS/bash

$ source env/bin/activate

Windows PowerShell

PS> env\Scripts\activate.ps1

Note: That is the PowerShell command. In the classic Command Prompt (cmd.exe), use env\Scripts\activate.bat instead. Either way, the folder name is Scripts, with an uppercase S.

Important: You should now see (env) at the far left side of your prompt. This indicates that the virtualenv is active. Example (Mac OS/bash):

(env) mcadams scraping $

When you are finished working in a virtualenv, you should deactivate it. The command is the same in Mac OS or Windows (DO NOT DO THIS NOW):

deactivate

You'll know it worked because (env) will no longer be at the far left side of your prompt.

Install the BeautifulSoup library

In Mac OS or Windows, at the $ bash prompt (or Windows PS>), type:

pip3 install beautifulsoup4

This is how you install any Python library that exists in the Python Package Index. Pretty handy. pip3 is a tool for installing Python packages, which is what you just did.

Note: You installed BeautifulSoup in the Python3 virtualenv that is currently active. When that virtualenv is not active, BeautifulSoup will not be available to you. This is ideal, because you will create different virtual environments for different Python projects, and you won't need to worry about updated libraries in the future breaking your (past) code.

Test BeautifulSoup

Start Python. Because you are in a Python3 virtualenv, you need only type python. (NOT python3.)

You should now be at the >>> prompt — the Python prompt.

In Mac OS or Windows, type (or copy/paste) one line at a time:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://weimergeeks.com/examples/scraping/example1.html")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)
  1. You imported the urlopen function (from the urllib.request module) and the BeautifulSoup class (from the bs4 library) in the first two lines. This allows you to use them in your code.
  2. You used urlopen to copy the entire contents of the URL given into a new Python variable, html.
  3. You used the BeautifulSoup function to process the value of that variable (the contents of the file at that URL) through a built-in HTML parser. (html.parser is not the only option for this; html5lib is more robust, and a sketch of it appears below.)
  4. The result: All the HTML from the file is now in a BeautifulSoup object with the new Python variable name bsObj. (In Mitchell's first edition, she used bsObj. Now, in the second edition, she uses just bs. FYI, most other people use soup. It is just a variable name.)
  5. Using the syntax of the BeautifulSoup library, you printed the first H1 element (including its tags) from that parsed value. Option: Check out the page on the web to see what you scraped.

If it works, you'll see:

<h1>We Are Learning About Web Scraping!</h1>

If you got an error about SSL, quit Python (quit() or Command-D) and enter this at the bash prompt (Mac only):

/Applications/Python\ 3.7/Install\ Certificates.command

Then return to the Python prompt and retry the five lines above.

The example is based on the one in Mitchell's book; the code is in her GitHub repo (Chapter01) for the book, second edition.
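As noted in point 3 above, html.parser is not the only parser you can use. Here is a sketch of the html5lib alternative; it assumes you first install html5lib in the active virtualenv:

pip3 install html5lib

Then, back at the Python prompt:

html = urlopen("https://weimergeeks.com/examples/scraping/example1.html")
bsObj = BeautifulSoup(html, "html5lib")

Everything else works the same; html5lib is simply more forgiving of messy, real-world HTML.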

The command bsObj.h1 would work the same way for any HTML tag (if it exists in the file). Instead of printing it, you might stash it in a variable:

heading = bsObj.h1
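A few more examples of the same pattern, assuming those tags exist in the file:

print(bsObj.title)   # the <title> element, tags included
print(bsObj.p)       # the first <p> element in the file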

Understanding BeautifulSoup

BeautifulSoup is a Python library that enables us to extract information from web pages and even entire websites.

We use BeautifulSoup commands to create a well-structured data object (more about objects below) from which we can extract, for example, everything with an <li> tag, or everything with class="book-title".

After extracting the desired information, we can use other Python commands (and libraries) to write the data into a database, CSV file, or other usable format.
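To make that concrete, here is a minimal sketch of the whole pipeline. The URL and the class name book-title are made up for illustration; substitute a real page you want to scrape.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://example.com/books.html")   # hypothetical URL
bsObj = BeautifulSoup(html, "html.parser")

# collect every element with class="book-title"
titles = bsObj.find_all(class_="book-title")

# write the text of each one into a CSV file
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title.get_text()])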

What is the BeautifulSoup object?

It's very important to understand that many of the BeautifulSoup commands work on an object, which is not the same as a simple string. Throughout her book, Mitchell uses the variable name bs to remind us of that fact. (Note: In the first edition, Mitchell used bsObj, and you'll see that in many examples in this repo. Most people use soup for this variable name — because the library is BeautifulSoup.)

Many programming languages include objects as a data type. Python does, JavaScript does, etc. An object is an even more powerful and complex data type than an array (JavaScript) or a list (Python) and can contain many other data types in a structured format.

When you extract information from an object with a BeautifulSoup command, sometimes you get a simple string, and sometimes you get a Python list (which is very similar to an array in JavaScript). The way you treat that extracted information will be different depending on whether it is a string (one item) or a list (usually more than one item).

That last paragraph is REALLY IMPORTANT, so read it again.
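A quick way to see the difference for yourself, using the bsObj variable from the test earlier (the two commands used here, find() and find_all(), are explained below):

one_heading = bsObj.find("h1")    # a single Tag object, or None if there is no match
all_links = bsObj.find_all("a")   # a list-like ResultSet, even if it holds 0 or 1 items
print( type(one_heading) )
print( type(all_links) )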

How BeautifulSoup handles the object

In the previous code, when this line ran:

html = urlopen("https://weimergeeks.com/examples/scraping/example1.html")

... you copied the entire contents of a file into a new Python variable named html. The contents were stored as an HTTPResponse object. We can read the contents of that object like this:

html.read()

The result is the raw, unparsed HTML, returned as one long string of bytes. That's not going to be very usable, or useful, especially for a file with a lot more content in it.

When you transform that HTTPResponse object into a BeautifulSoup object — with the following line — you create a well-structured object from which you can extract any HTML element and the text within any HTML element.

bsObj = BeautifulSoup(html, "html.parser")

Let's look at a few examples of what BeautifulSoup can do.

Finding elements that have a particular class

Deciding the best way to extract what you want from a large HTML file requires you to dig around in the source before you write the Python/BeautifulSoup commands. In many cases, you'll see that everything you want has the same CSS class on it. After creating a BeautifulSoup object (here, as before, it is in the variable bsObj), this line will create a Python list (you can think of it as an array) containing all the <td> elements that have the class city.

city_list = bsObj.find_all( "td", {"class":"city"} )

Maybe there were 10 cities in <td> tags in that HTML file. Maybe there were 10,000. No matter how many, they are now in a list (in the variable city_list), and you can search them, print them, write them out to a database or a JSON file — whatever you like. Often you will want to perform the same actions on each item in the list, so you will use a normal Python for-loop:

for city in city_list:
    print( city.get_text() )

get_text() is a handy BeautifulSoup method that extracts the text, and only the text, from an item. If instead you wrote just print(city), you'd get the <td> tags, and any other tags inside them, as well.
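For example, if one item in the list were <td class="city">Gainesville</td> (a made-up value), you would see the difference like this:

print( city )             # <td class="city">Gainesville</td>
print( city.get_text() )  # Gainesville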

Finding all vs. finding one

The BeautifulSoup find_all() method you just saw always produces a list. (Note: findAll() will work too.) If you know there will be only one item of the kind you want in a file, you should use the find() method instead.

For example, maybe you are scraping the address and phone number from every page in a large website. There is only one phone number on the page, and it is enclosed in a pair of tags with the attribute id="call". One line of your code gets the phone number from the current page:

phone_number = bsObj.find(id="call")

Naturally, you don't need to loop through that result: the variable phone_number will contain a single item (not a list), including any HTML tags. To test what the text alone will look like, just print it using get_text() to strip out the tags.

print( phone_number.get_text() )
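One caution, which is my addition and not from the book: if no element on the page has id="call", find() returns None, and calling get_text() on None raises an AttributeError. A defensive version checks first:

phone_number = bsObj.find(id="call")
if phone_number is not None:
    print( phone_number.get_text() )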

Notice that you're always using bsObj. Review above if you've forgotten where that came from. (You may use bs instead. You may use soup. Pick ONE and stick with it.)

Finding the contents of a particular attribute

One last example: You've made a BeautifulSoup object from a page that has dozens of images on it. You want to capture the path to each image file on that page (perhaps so that you can download all the images). This requires two steps:

image_list = bsObj.find_all('img')
for image in image_list:
    print(image.attrs['src'])

First, you make a Python list containing all the img elements that exist in the object.

Second, you loop through that list and print the contents of the src attribute from each img tag in the list.

IMPORTANT: We do not need get_text() in this case, because the contents of the src attribute are nothing but text. There are never tags inside the src attribute. So think about exactly what you're trying to get, and what it looks like inside the HTML of the page.
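A related caution, again my addition: if an <img> tag happens to lack a src attribute, image.attrs['src'] raises a KeyError. The get() method returns None instead, which you can test for:

for image in image_list:
    src = image.get("src")   # None if the attribute is missing
    if src:
        print(src)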

There's a lot more to learn about BeautifulSoup, and we'll be using Mitchell's book for that. You can also read the docs.

A BeautifulSoup example

To demonstrate a whole process of thinking through a small scraping project, I made a Jupyter Notebook that — through the comments in the code — shows how I thought about the problem step by step and tested each step, one thing at a time, to reach the solution I wanted. Open the Notebook here on GitHub to follow along and see all the steps.

The code in the final cell of the Jupyter Notebook produces this 51-line CSV file by scraping 10 separate web pages.