Web Scraping with Python
COLLECTING DATA FROM THE MODERN WEB
Ryan Mitchell
Learn web scraping and crawling techniques to access unlimited data
from any web source in any format. With this practical guide, you’ll learn
how to use Python scripts and web APIs to gather and process data from
thousands—or even millions—of web pages at once.
Ideal for programmers, security professionals, and web administrators
familiar with Python, this book not only teaches basic web scraping
mechanics, but also delves into more advanced topics, such as analyzing
raw data or using scrapers for frontend website testing. Code samples
are available to help you understand the concepts in practice.
Learn how to parse complicated HTML pages
Traverse multiple pages and sites
Get a general overview of APIs and how they work
Learn several methods for storing the data you scrape
Download, read, and extract data from documents
Use tools and techniques to clean badly formatted data
Read and write natural languages
Crawl through forms and logins
Understand how to scrape JavaScript
Learn image processing and text recognition
“The tools and examples included in the book allowed me to easily automate several repetitive tasks, freeing that time to solve more interesting problems. It is a results-oriented quick read that is well rooted in real-world problems and solutions.”
Eric VanWyk, Electrical Computer Engineer, Olin College of Engineering
Ryan Mitchell is a software engineer at LinkeDrive in Boston, where she
develops the company’s API and data analysis tools. She regularly consults on
web scraping projects, primarily in the financial and retail industries.
PYTHON
Twitter: @oreillymedia
facebook.com/oreilly
CAN $36.99
US $31.99
ISBN: 978-1-491-91029-0
Web Scraping with Python
by Ryan Mitchell
Copyright © 2015 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2015: First Edition
Revision History for the First Edition
2015-06-10: First Release
2015-07-22: Second Release
2015-10-30: Third Release
2016-03-18: Fourth Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491910290 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Web Scraping with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents
Preface
Part I. Building Scrapers
1. Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably
2. Advanced HTML Parsing
    You Don’t Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and findAll() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions
    Beyond BeautifulSoup
3. Starting to Crawl
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet
    Crawling with Scrapy
4. Using APIs
    How APIs Work