Easy Web Scraping with Python
A little over a year ago I wrote an article on web scraping using Node.js. Today I'm revisiting the topic, but this time I'm going to use Python, so that the techniques offered by these two languages can be compared and contrasted.
The Problem
As I'm sure you know, I attended PyCon in Montréal earlier this month. The video recordings of all the talks and tutorials have already been released on YouTube, with an index available at pyvideo.org.
I thought it would be useful to know which videos of the conference are the most watched, so we are going to write a scraping script that obtains the list of available videos from pyvideo.org and then gets viewer statistics for each video directly from its YouTube page. Sounds interesting? Let's get started!
The Tools
There are two basic tasks that are used to scrape web sites:
- Load a web page to a string.
- Parse HTML from a web page to locate the interesting bits.
Python offers two excellent tools for the above tasks. I will use the awesome requests to load web pages, and BeautifulSoup to do the parsing.
We can put these two packages in a virtual environment:
$ mkdir pycon-scraper
$ cd pycon-scraper
$ virtualenv venv
$ source venv/bin/activate
(venv) $ pip install requests beautifulsoup4
If you are using Microsoft Windows, note that the virtual environment activation command is different: you should use venv\Scripts\activate.
Basic Scraping Technique
The first thing to do when writing a scraping script is to manually inspect the page(s) to scrape to determine how the data can be located.
To begin with, we are going to look at the list of PyCon videos at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page we find that the structure of the video list is more or less as follows:
<div id="video-summary-content">
    <div class="video-summary">    <!-- first video -->
        <div class="thumbnail-data">...</div>
        <div class="video-summary-data">
            <div>
                <strong><a href="#link to video page#">#title#</a></strong>
            </div>
        </div>
    </div>
    <div class="video-summary">    <!-- second video -->
        ...
    </div>
    ...
</div>
So the first task is to load this page, and extract the links to the individual pages, since the links to the YouTube videos are in these pages.
Loading a web page using requests is extremely simple:
import requests
response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')
That's it! After this function returns, the HTML of the page is available in response.text.
The next task is to extract the links to the individual video pages. With BeautifulSoup this can be done using CSS selector syntax, which you may be familiar with if you work on the client side.
To obtain the links we will use a selector that captures the <a> elements inside each <div> with class video-summary-data. Since there are several <a> elements for each video we will filter them to include only those that point to a URL that begins with /video, which is unique to the individual video pages. The CSS selector that implements the above criteria is div.video-summary-data a[href^="/video"]. The following snippet of code uses this selector with BeautifulSoup to obtain the <a> elements that point to video pages:
import bs4
# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup(response.text, 'html.parser')
# The attribute value is quoted because it contains a slash.
links = soup.select('div.video-summary-data a[href^="/video"]')
Since we are really interested in the link itself and not in the <a> element that contains it, we can improve the above with a list comprehension:
links = [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]
And now we have a list of all the links to the individual pages for each session!
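To see what the selector matches without touching the network, the sketch below runs it against a small hand-written HTML fragment modeled on the structure shown earlier. The fragment itself, the /video/0000/example-talk path and the speaker link are made up for illustration; they are not taken from the real pyvideo.org page.

```python
import bs4

# Hand-written sample markup (an assumption, not the real pyvideo.org page).
SAMPLE_HTML = """
<div id="video-summary-content">
  <div class="video-summary">
    <div class="video-summary-data">
      <div><strong><a href="/video/2668/writing-restful-web-services-with-flask">Talk A</a></strong></div>
      <div><a href="/speaker/0000/example-speaker">Example Speaker</a></div>
    </div>
  </div>
  <div class="video-summary">
    <div class="video-summary-data">
      <div><strong><a href="/video/0000/example-talk">Talk B</a></strong></div>
    </div>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# [href^="/video"] keeps only the links whose href starts with /video,
# so the /speaker link in the first block is filtered out.
links = [a.attrs.get("href")
         for a in soup.select('div.video-summary-data a[href^="/video"]')]

print(links)
```

Testing a selector on a saved or hand-written fragment like this is a quick way to iterate before pointing the script at the live site.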
The following script shows a cleaned up version of all the techniques we have learned so far:
import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    # Download the index page and return the relative URL of every video page.
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]

print(get_video_page_urls())
If you run the above script you will get a long list of URLs as a result. Now we need to parse each of these to get more information about each PyCon session.
Scraping Linked Pages
The next step is to load each of the pages in our URL list. If you want to see how these pages look, here is an example: http://pyvideo.org/video/2668/writing-restful-web-services-with-flask. Yes, that's me, that is one of my sessions!
From these pages we can scrape the session title, which appears at the top. We can also obtain the names of the speakers and the YouTube link from the sidebar that appears on the right side below the embedded video. The code that gets these elements is shown below:
def get_video_data(video_page_url):
    video_data = {}
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^="/speaker"]')]
    video_data['youtube_url'] = soup.select('div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()
A few things to note about this function:
- The URLs returned from the scraping of the index page are relative, so the root_url needs to be prepended.
- The session title is obtained from the <h3> element inside the <div> with id videobox. Note that [0] is needed because the select() call returns a list, even if there is only one match.
- The speaker names and YouTube links are obtained in a similar way to the links in the index page.
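Plain string concatenation works here because every link from the index page starts with a slash. As a slightly more defensive alternative (this is not what the script above does), the standard library's urllib.parse.urljoin() can build the absolute URL. A minimal sketch, written for Python 3:

```python
from urllib.parse import urljoin

root_url = 'http://pyvideo.org'

# A relative link of the kind returned by the index page.
relative = '/video/2668/writing-restful-web-services-with-flask'

# urljoin() resolves the relative path against the site root.
absolute = urljoin(root_url, relative)
print(absolute)

# A URL that is already absolute is returned unchanged.
external = urljoin(root_url, 'http://www.youtube.com/')
print(external)
```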
Now all that remains is to scrape the views count from the YouTube page for each video. This is actually very simple to write as a continuation of the above function. In fact, it is so simple that while we are at it, we can also scrape the likes and dislikes counts:
import re  # needed for the re.sub() calls below

def get_video_data(video_page_url):
    # ...
    response = requests.get(video_data['youtube_url'])
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    video_data['views'] = int(re.sub('[^0-9]', '',
        soup.select('.watch-view-count')[0].get_text().split()[0]))
    video_data['likes'] = int(re.sub('[^0-9]', '',
        soup.select('.likes-count')[0].get_text().split()[0]))
    video_data['dislikes'] = int(re.sub('[^0-9]', '',
        soup.select('.dislikes-count')[0].get_text().split()[0]))
    return video_data
The soup.select() calls above capture the stats for the video using selectors for the specific class names used in the YouTube page. But the text of the elements needs to be processed a bit before it can be converted to a number. Consider an example views count, which YouTube would show as "1,344 views". To remove the text after the number, the contents are split at whitespace and only the first part is used. This first part is then filtered with a regular expression that removes any characters that are not digits, since the numbers can have commas in them. The resulting string is finally converted to an integer and stored.
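The three steps can be tried in isolation. The helper name parse_count below is hypothetical; the script above inlines the same expression instead of using a function.

```python
import re


def parse_count(text):
    """Turn a string such as "1,344 views" into the integer 1344."""
    first_word = text.split()[0]               # "1,344"
    digits = re.sub('[^0-9]', '', first_word)  # "1344"
    return int(digits)


print(parse_count("1,344 views"))  # 1344
print(parse_count("27"))           # 27
```

Pulling the expression into a helper like this would also remove the repetition between the views, likes and dislikes lines.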
To complete the scraping the following function invokes all the previously shown code:
def show_video_stats():
    video_page_urls = get_video_page_urls()
    for video_page_url in video_page_urls:
        print(get_video_data(video_page_url))
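One practical caveat: an expression like select(...)[0] raises IndexError when a page lacks the expected element, and a single such page stops the whole loop. The sketch below shows one possible way to keep going. The names collect_video_data and fake_get_video_data are hypothetical; the fake function stands in for get_video_data() so that the example runs without network access.

```python
def collect_video_data(video_page_urls, get_data):
    """Call get_data() on each URL, skipping pages that cannot be parsed."""
    results, failed = [], []
    for url in video_page_urls:
        try:
            results.append(get_data(url))
        except (IndexError, ValueError) as exc:
            # IndexError: a selector matched nothing.
            # ValueError: int() was given a string with no digits.
            failed.append((url, repr(exc)))
    return results, failed


def fake_get_video_data(url):
    # Stand-in for get_video_data(): pretend one page has no YouTube link.
    if url.endswith('awards'):
        raise IndexError('list index out of range')
    return {'title': url}


results, failed = collect_video_data(['/video/1/talk', '/video/2/awards'],
                                     fake_get_video_data)
print(results)
print(failed)
```

In the real script you would pass get_video_data as the second argument and decide whether the failed pages should be reported or silently dropped.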
Parallel Processing
The script up to this point works great, but with over a hundred videos it can take a while to run. In reality we aren't doing much work; most of the time goes into downloading all those pages, and during that time the script is blocked. It would be much more efficient if the script could run several of these download operations simultaneously, right?
Back when I wrote the scraping article using Node.js the parallelism came for free with the asynchronous nature of JavaScript. With Python this can be done as well, but it needs to be specified explicitly. For this example I'm going to start a pool of eight worker processes that can work concurrently. This is surprisingly simple:
from multiprocessing import Pool

def show_video_stats(options):
    pool = Pool(8)
    video_page_urls = get_video_page_urls()
    results = pool.map(get_video_data, video_page_urls)
The multiprocessing.Pool class starts eight worker processes that wait to be given jobs to run. Why eight? It's twice the number of cores I have on my computer. While experimenting with different sizes for the pool I've found this to be the sweet spot. Fewer than eight workers make the script run slower, and more than eight do not make it go faster.
The pool.map() call is similar to the regular map() call in that it invokes the function given as the first argument once for each of the elements in the iterable given as the second argument. The big difference is that it hands these calls to the processes owned by the pool, so in this example eight tasks will run concurrently.
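To make the behavior of map() on a pool concrete, here is a small self-contained sketch. It is not the article's script: it swaps in multiprocessing.pool.ThreadPool, a thread-based pool with the same map() interface, so that it runs as-is in any environment, and it replaces the real download with a hypothetical fake_download() that just sleeps.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(url):
    # Stand-in for get_video_data(): wait briefly, as a network request would.
    time.sleep(0.1)
    return {'url': url}


urls = ['/video/%d' % n for n in range(8)]

start = time.time()
pool = ThreadPool(8)
try:
    # map() blocks until every call has finished and keeps the input order.
    results = pool.map(fake_download, urls)
finally:
    pool.close()
    pool.join()
elapsed = time.time() - start

print(len(results), 'results in %.2f seconds' % elapsed)
```

With the process-based Pool the map() call looks the same. On platforms that start worker processes by spawning a fresh interpreter, such as Windows, the code that creates the pool should sit under an if __name__ == '__main__': guard.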
The time savings are considerable. On my computer the first version of the script completes in 75 seconds, while the pool version does the same work in 16 seconds!
The Complete Scraping Script
The final version of my scraping script does a few more things after the data has been obtained.
I've added a --sort command line option to specify a sorting criterion, which can be views, likes or dislikes. The script will sort the list of results in descending order by the specified field. Another option, --max, takes the number of results to show, in case you just want to see a few entries from the top. Finally, I have added a --csv option, which prints the data in CSV format instead of as an aligned table, to make it easy to export the data to a spreadsheet.
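The actual implementation lives in the gist. The sketch below is only one plausible way to build these three options with argparse, sorted() and the csv module; the function names are hypothetical, and the three sample rows are copied from the article's example output.

```python
import argparse
import csv
import io


def parse_args(argv):
    parser = argparse.ArgumentParser(description='Show PyCon video statistics.')
    parser.add_argument('--sort', choices=['views', 'likes', 'dislikes'],
                        default='views', help='field to sort by')
    parser.add_argument('--max', type=int, default=None,
                        help='maximum number of results to show')
    parser.add_argument('--csv', action='store_true',
                        help='print CSV instead of an aligned table')
    return parser.parse_args(argv)


def select_results(results, sort_field, max_results=None):
    # Highest value first, then keep only the requested number of rows.
    ordered = sorted(results, key=lambda r: r[sort_field], reverse=True)
    return ordered[:max_results]


def to_csv(results):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['views', 'likes', 'dislikes', 'title'])
    for r in results:
        writer.writerow([r['views'], r['likes'], r['dislikes'], r['title']])
    return out.getvalue()


sample = [
    {'title': 'Flask by Example', 'views': 1402, 'likes': 12, 'dislikes': 0},
    {'title': 'Keynote - Guido Van Rossum', 'views': 3002, 'likes': 27, 'dislikes': 0},
    {'title': 'Analyzing Rap Lyrics with Python', 'views': 2165, 'likes': 27, 'dislikes': 6},
]

options = parse_args(['--sort', 'views', '--max', '2', '--csv'])
top = select_results(sample, options.sort, options.max)
print(to_csv(top) if options.csv else top)
```

Using the csv module rather than joining fields by hand means titles that contain commas are quoted correctly.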
The complete script is available for download at this location: https://gist.github.com/miguelgrinberg/5f52ceb565264b1e969a.
Below is an example output with the 25 most viewed sessions at the time I'm writing this:
(venv) $ python pycon-scraper.py --sort views --max 25 --workers 8
Views  +1  -1 Title (Speakers)
 3002  27   0 Keynote - Guido Van Rossum (Guido Van Rossum)
 2564  21   0 Computer science fundamentals for self-taught programmers (Justin Abrahms)
 2369  17   0 Ansible - Python-Powered Radically Simple IT Automation (Michael Dehaan)
 2165  27   6 Analyzing Rap Lyrics with Python (Julie Lavoie)
 2158  24   3 Exploring Machine Learning with Scikit-learn (Jake Vanderplas, Olivier Grisel)
 2065  13   0 Fast Python, Slow Python (Alex Gaynor)
 2024  24   0 Getting Started with Django, a crash course (Kenneth Love)
 1986  47   0 It's Dangerous to Go Alone: Battling the Invisible Monsters in Tech (Julie Pagano)
 1843  24   0 Discovering Python (David Beazley)
 1672  22   0 All Your Ducks In A Row: Data Structures in the Standard Library and Beyond
 1558  17   1 Keynote - Fernando Pérez (Fernando Pérez)
 1449   6   0 Descriptors and Metaclasses - Understanding and Using Python's More Advanced Features
 1402  12   0 Flask by Example (Miguel Grinberg)
 1342   6   0 Python Epiphanies (Stuart Williams)
 1219   5   0 0 to 00111100 with web2py (G. Clifford Williams)
 1169  18   0 Cheap Helicopters In My Living Room (Ned Jackson Lovely)
 1146  11   0 IPython in depth: high productivity interactive and parallel python (Fernando Perez)
 1127   5   0 2D/3D graphics with Python on mobile platforms (Niko Skrypnik)
 1081   8   0 Generators: The Final Frontier (David Beazley)
 1067  12   0 Designing Poetic APIs (Erik Rose)
 1064   6   0 Keynote - John Perry Barlow (John Perry Barlow)
 1029  10   0 What Is Async, How Does It Work, And When Should I Use It? (A. Jesse Jiryu Davis)
  981  11   0 The Sorry State of SSL (Hynek Schlawack)
  961  12   2 Farewell and Welcome Home: Python in Two Genders (Naomi Ceder)
  958   6   0 Getting Started Testing (Ned Batchelder)
Conclusion
I hope you have found this article useful as an introduction to web scraping with Python. I have been pleasantly surprised with the use of Python: the tools are robust and powerful, and the fact that the asynchronous optimizations can be left for the end is great compared to JavaScript, where there is no way to avoid working asynchronously from the start.
Miguel
-
#26 Miguel Grinberg said
@Anubhav: the select() call uses a subset of the CSS selector syntax, see http://www.w3schools.com/cssref/css_selectors.asp. For example, to look for a link with a specific class you can say "a.dot-company".
-
#27 Aswathy said
Can we load https webpages using requests? I need to do this on a secure page.
-
#28 Miguel Grinberg said
@Aswathy: yes, you can read https pages just fine.
-
#29 Fernando A. said
Typo:
"To obtain the links we will use a selector that captures the elements inside each <div> with id video-summary-data."
Should be:
"... with class video-summary-data." -
#30 Jabba Laci said
If you want to extract data from youtube, you can also try Pafy: https://pypi.python.org/pypi/Pafy that allows you to retrieve YouTube content and metadata.
-
#31 Jake Austwick said
Hey,
Nice article, very well written. I wrote an article on scraping the web with Python last year too, might be of use to some of your readers: http://jakeaustwick.me/python-web-scraping-resource/
Thanks,
Jake
-
#32 Holl said
This is very helpful. Also if you wish to download the list of videos, "lansla pvle" is a good option.... its simple.
-
#33 villancikos said
Just found that the likes and dislikes should be changed. Now YouTube uses ID = #watch-like and #watch-dislike.
Also, you need to add an exception because in this url :http://pyvideo.org/video/2702/pycon-2014-awards the "youtube_link" or "Video Origin" says UNKNOWN making the script to break. And lastly, you missed the "import re".
I would like to thank you a lot for this easy to follow tutorial and so helpful. Especially the part where you enter YouTube for each link to get the likes. :D it's easy but to read it from someone with expertise made it all more insightful.
-
#34 Miguel Grinberg said
@villancikos: I keep the gist referenced above updated, and I have already accounted for the change in the format of likes and dislikes. Get the updated script here: https://gist.github.com/miguelgrinberg/5f52ceb565264b1e969a.
-
#35 Safar Houssem said
i would like to thank you for this great tutorial, but i have a question:
i need to scrap data from multiple website and then i would like to store it into database that have 2 tables (articles; articles_informations). the problem is an article can have different names from site to another and this can causes a redundancy in the article table. please how to solve this problem
-
#36 Miguel Grinberg said
@Safar: I can't really tell you how to do this, you will need to find a way to compare these articles and determine which are duplicates so that they don't get added to the database. Maybe taking an MD5 checksum of all articles can help with this.
-
#37 Maira said
Hi.. Is it possible to copy the contents of html from facebook and copy its DOM onto a text file and then parse the text file and accessing the tag elements and scarping using the DOM???
-
#38 Miguel Grinberg said
@Maira: yes, you should be able to do that, though I'm not sure you need to use a text file, you can just use the requests library to read the HTML directly from the web. After that use beautiful soup to parse the DOM.
-
#39 Harry Tran said
Hi Miguel,
Thanks for this tutorial, I am new to Python and only somewhat familiar with the terminal. That's a little about me and my background in this.
I'm trying to follow along, I'm able to pull the results you have here, but I'm confused what I would type into the terminal to get the result outputted as a csv excel file?
Thanks for writing this tutorial, it helps newbies like me get started!
Harry Tran
-
#40 Miguel Grinberg said
@Harry: if you use the complete version of the script (from the gist I referenced in the article) you can add --csv to the command to request that the output is presented in that format. You can redirect the output to a file with .csv extension and then import that into Excel.
-
#41 Tommy Carstensen said
I'm new to beautifulsoup. Can it also be used to parse the rows of a table?
-
#42 Miguel Grinberg said
@Tommy: sure, you can select any element from the page.
-
#43 Ajeet Khan said
Nice Tutorial, I am trying to scrap youtube video data and want to fetch how many days ago the video was posted and the views of the video related to a particular keyword. I am not able to fetch these individually as they are in two li tag. Help me out please.
video_data['views'] = [li.get_text().split("- ") for li in soup.select('ul.yt-lockup-meta-info')]
-
#44 Tony Hajdari said
Great article Miguel. It would be awesome if you could show how a JavaScript controlled page could be scraped--i.e. when scrolling down the page more data is fetched dynamically. Perhaps with spidermonkey?
-
#45 Miguel Grinberg said
@Tony: for a page that loads data dynamically via Javascript it is likely there is an API on the server, so you can just call the API directly and get the data without the need to scrape it. If you must use scraping on a site that uses Javascript, then Selenium might work.
-
#46 Miguel Grinberg said
@Ajeet: it may make more sense to use the YouTube API to get the data that you need.
-
#47 Piter said
Hi,
anyone getting the "UserWarning " below when running the script to scrap ??
I changed script as suggested in posts in other sites, but warning remain.
thank you.
bs4__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))
-
#48 Piter said
C:\etc\Buffer\python\python_Examples\scrape_pages\scrape_venv\lib\site-packages\bs4__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))
-
#49 bruce said
When I look at the page source, the table is blank. But, with view generated source, I see the data. Requests gives me a blank table. How do you read the generated page?
-
#50 Miguel Grinberg said
@bruce: some pages load data dynamically through Ajax requests. For those pages you need a more elaborated parser than beautiful soup, since the Javascript code in the page needs to be executed so that the data is loaded. For this type of page I would use Selenium.