Web Scraping with Python: Everything you need to know (2022)
Kevin Sahin | 27 April 2022 (updated) | 26 min read
Introduction:
In this post, which can be read as a follow-up to our guide about web
scraping without getting blocked, we will cover almost all of the
tools Python offers to scrape the web. We will go from the basic to
advanced ones, covering the pros and cons of each. Of course, we
won't be able to cover every aspect of every tool we discuss, but
this post should give you a good idea of what each tool does, and
when to use one.
Note: When I talk about Python in this blog post, you should assume that I am talking about Python 3.
0. Web Fundamentals
The Internet is complex: there are many underlying technologies and concepts involved in viewing a simple web page in your browser. The goal of this article is not to go into excruciating detail on every single one of those aspects, but to provide you with the most important parts for extracting data from the web with Python.
HyperText Transfer Protocol
HyperText Transfer Protocol (HTTP) uses a client/server model. An
HTTP client (a browser, your Python program, cURL, libraries such as
Requests...) opens a connection and sends a message ("I want to see
that page : /product") to an HTTP server (Nginx, Apache...). Then the
server answers with a response (the HTML code for example) and closes
the connection.
HTTP is called a stateless protocol because each transaction
(request/response) is independent. FTP, for example, is stateful
because it maintains the connection.
Basically, when you type a website address in your browser, the HTTP
request looks like this:
GET /product/ HTTP/1.1
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch, br
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
In the first line of this request, you can see the following:
* The HTTP method or verb. In our case GET, indicating that we would like to fetch data. There are quite a few other HTTP methods available as well (e.g. POST for uploading data), and a full list is available here.
* The path of the file, directory, or object we would like to interact with. In this case, the directory product right beneath the root directory.
* The version of the HTTP protocol. In this tutorial we will focus on HTTP 1.
* Multiple header fields: Connection, User-Agent... Here is an exhaustive list of HTTP headers.
Here are the most important header fields:
* Host: This header indicates the hostname for which you are
sending the request. This header is particularly important for
name-based virtual hosting, which is the standard in today's
hosting world.
* User-Agent: This contains information about the client originating the request, including the OS. In this case, it is my web browser (Chrome) on macOS. This header is important because it is used either for statistics (how many users visit my website on mobile vs desktop) or to detect and block bots. Because these headers are sent by the clients, they can be modified ("Header Spoofing"). This is exactly what we will do with our scrapers - make them look like a regular web browser.
* Accept: This is a list of MIME types, which the client will
accept as response from the server. There are lots of different
content types and sub-types: text/plain, text/html, image/jpeg,
application/json ...
* Cookie: This header field contains a list of name-value pairs (name1=value1;name2=value2). Cookies are one way for websites to store data on your machine, either until a certain expiration date (standard cookies) or only temporarily until you close your browser (session cookies). Cookies are used for a
number of different purposes, ranging from authentication
information, to user preferences, to more nefarious things such
as user-tracking with personalised, unique user identifiers.
However, they are also a vital browser feature for the authentication just mentioned. When you submit a login form, the server will
verify your credentials and, if you provided a valid login, issue
a session cookie, which clearly identifies the user session for
your particular user account. Your browser will receive that
cookie and will pass it along with all subsequent requests.
* Referer: The referrer header (please note the typo) contains the
URL from which the actual URL has been requested. This header is
important because websites use this header to change their
behavior based on where the user came from. For example, lots of
news websites have a paying subscription and let you view only
10% of a post, but if the user comes from a news aggregator like
Reddit, they let you view the full content. They use the referrer
to check this. Sometimes we will have to spoof this header to get
to the content we want to extract.
And the list goes on...you can find the full header list here.
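To make the header discussion a bit more tangible, here is a minimal sketch using Python's built-in http.client module to send a request with a few of those header fields set by hand (the Referer and cookie values are made-up examples):
import http.client

conn = http.client.HTTPSConnection('www.example.com')
conn.request(
    'GET',
    '/product/',  # same path as in the raw request above
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
        'Accept': 'text/html',
        'Referer': 'https://www.google.com/',  # made-up referrer
        'Cookie': 'session_id=abc123',         # made-up cookie, just to show the format
    }
)
response = conn.getresponse()
print(response.status, response.reason)  # likely 404, since example.com has no /product/ page
conn.close()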
A server will respond with something like this:
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8
Content-Length: 3352
...[HTML CODE]
On the first line, we have a new piece of information, the HTTP code
200 OK. A code of 200 means the request was properly handled. You can
find a full list of all available codes on Wikipedia. Following the
status line, you have the response headers, which serve the same
purpose as the request headers we just discussed. After the response
headers, you will have a blank line, followed by the actual data sent
with this response.
Once your browser has received that response, it will parse the HTML code, fetch all embedded assets (JavaScript and CSS files, images, videos), and render the result in the main window.
We will go through the different ways of performing HTTP requests
with Python and extract the data we want from the responses.
1. Manually Opening a Socket and Sending the HTTP Request
Socket
The most basic way to perform an HTTP request in Python is to open a
TCP socket and manually send the HTTP request.
import socket

HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80                # The standard port for HTTP is 80, for HTTPS it is 443

client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)

request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)

# Read the raw bytes of the response until the server closes the connection
response = b''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        break
    response += recv

print(response.decode('utf-8', errors='ignore'))
client_socket.close()
Now that we have the HTTP response, the most basic way to extract
data from it is to use regular expressions.
Regular Expressions
Regular expressions (regexes for short) are an extremely versatile tool
for handling, parsing, and validating arbitrary text. A regular
expression is essentially a string which defines a search pattern
using a standard syntax. For example, you could quickly identify all
phone numbers in a web page.
Combined with classic search and replace, regular expressions also
allow you to perform string substitution on dynamic strings in a
relatively straightforward fashion. The easiest example, in a web
scraping context, may be to replace uppercase tags in a poorly
formatted HTML document with the proper lowercase counterparts.
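For instance, here is a quick sketch of both ideas: pulling phone numbers out of a page's text with re.findall (using a deliberately naive, US-style pattern) and lowercasing uppercase tags with re.sub. The sample strings are made up for illustration:
import re

text = 'Support: 555-123-4567, Sales: 555-987-6543'
# Naive pattern for US-style numbers; real-world phone formats vary a lot
print(re.findall(r'\b\d{3}-\d{3}-\d{4}\b', text))
# ['555-123-4567', '555-987-6543']

messy_html = '<P>Price : 19.99$</P>'
# Lowercase any uppercase tag name (opening and closing tags alike)
print(re.sub(r'</?[A-Z][A-Z0-9]*', lambda m: m.group(0).lower(), messy_html))
# <p>Price : 19.99$</p>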
You may now be wondering why it is important to understand regular expressions when doing web scraping. That's a fair question; after all, there are many different Python modules for parsing HTML, with XPath and CSS selectors.
In an ideal semantic world, data is easily machine-readable, and the
information is embedded inside relevant HTML elements, with
meaningful attributes. But the real world is messy. You will often
find huge amounts of text inside a <p> element. For example, if you want to extract specific data inside a large text (a price, a date, a name...), you will have to use regular expressions.
Note: Here is a great website to test your regex: https://regex101.com/. Also, here is an awesome blog to learn more about them. This post will only cover a small fraction of what you can do with regex.
Regular expressions can be useful when you have this kind of data:
Price : 19.99$
We could select this text node with an XPath expression and then use
this kind of regex to extract the price:
^Price\s*:\s*(\d+\.\d{2})\$
If you only have the HTML, it is a bit trickier, but not all that
much more after all. You can simply specify in your expression the
tag as well and then use a capturing group for the text.
import re

html_content = '<p>Price : 19.99$</p>'

m = re.match('<p>(.+)</p>', html_content)
if m:
    print(m.group(1))
As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that can make this task easier.
2. urllib3 & LXML
Disclaimer: It is easy to get lost in the urllib universe in Python. Python 2 shipped with both urllib and urllib2 in the standard library; in Python 3, their functionality was reorganized into the urllib package, while urllib3 is a separate, third-party library that won't be part of the standard library anytime soon. This confusing situation will be the subject of another blog post. In this section, I've decided to only talk about urllib3 because it is widely used in the Python world, including by Pip and Requests.
Urllib3 is a high-level package that allows you to do pretty much
whatever you want with an HTTP request. With urllib3, we could do
what we did in the previous section with way fewer lines of code.
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
print(r.data)
As you can see, this is much more concise than the socket version.
Not only that, the API is straightforward. Also, you can easily do
many other things, like adding HTTP headers, using a proxy, POSTing
forms ...
For example, had we decided to set some headers and use a proxy, we
would only have to do the following (you can learn more about proxy
servers at bestproxyreviews.com):
import urllib3

user_agent_header = urllib3.make_headers(user_agent="<YOUR USER AGENT>")
pool = urllib3.ProxyManager('<PROXY PROTOCOL>://<PROXY HOST>:<PROXY PORT>', headers=user_agent_header)
r = pool.request('GET', 'https://www.google.com/')
See? There are exactly the same number of lines. However, there are
some things that urllib3 does not handle very easily. For example, if
we want to add a cookie, we have to manually create the corresponding
headers and add it to the request.
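For instance, here is a minimal sketch of passing a cookie by hand (the cookie values are made up, and httpbin.org is just a public echo service used for illustration):
import urllib3

http = urllib3.PoolManager()
# urllib3 has no cookie jar, so we craft the Cookie header ourselves
r = http.request(
    'GET',
    'https://httpbin.org/cookies',
    headers={'Cookie': 'session_id=abc123; theme=dark'}
)
print(r.data.decode('utf-8'))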
There are also things that urllib3 can do that Requests can't:
creation and management of a pool and proxy pool, as well as managing
the retry strategy, for example.
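As a quick illustration of that last point, here is a minimal sketch of urllib3's retry handling (the exact numbers are arbitrary choices):
import urllib3
from urllib3.util import Retry

# Retry failed requests up to three times, backing off between attempts
retry_strategy = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
http = urllib3.PoolManager(retries=retry_strategy)
r = http.request('GET', 'https://www.google.com/')
print(r.status)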
To put it simply, urllib3 is between Requests and Socket in terms of
abstraction, although it's way closer to Requests than Socket.
Next, to parse the response, we are going to use the LXML package and
XPath expressions.
XPath
XPath is a technology that uses path expressions to select nodes or
node-sets in an XML document (or HTML document). If you are familiar
with the concept of CSS selectors, then you can imagine it as
something relatively similar.
As with the Document Object Model, XPath has been a W3C standard
since 1999. Although XPath is not a programming language in itself,
it allows you to write expressions that can directly access a
specific node, or a specific node-set, without having to go through
the entire HTML tree (or XML tree).
To extract data from an HTML document with XPath we need three
things:
* an HTML document
* some XPath expressions
* an XPath engine that will run those expressions
To begin, we will use the HTML we got from urllib3. And now we would
like to extract all of the links from the Google homepage. So, we
will use one simple XPath expression, //a, and we will use LXML to
run it. LXML is a fast and easy to use XML and HTML processing
library that supports XPath.
Installation:
pip install lxml
Below is the code that comes just after the previous snippet:
from lxml import html

# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')

# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)

# We run the XPath against this HTML
# This returns an array of elements
links = tree.xpath('//a')

for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))
And the output should look like this:
https://books.google.fr/bkshp?hl=fr&tab=wp
https://www.google.fr/shopping?hl=fr&source=og&tab=wf
https://www.blogger.com/?tab=wj
https://photos.google.com/?tab=wq&pageId=none
http://video.google.fr/?hl=fr&tab=wv
https://docs.google.com/document/?usp=docs_alc
...
https://www.google.fr/intl/fr/about/products?tab=wh
Keep in mind that this example is really really simple and doesn't
show you how powerful XPath can be (Note: we could have also used //a
/@href, to point straight to the href attribute). If you want to
learn more about XPath, you can read this helpful introduction. The
LXML documentation is also well-written and is a good starting point.
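For what it's worth, here is a tiny, self-contained sketch of that attribute-based variant (the HTML snippet is made up):
from lxml import html

snippet = '<div><a href="/about">About</a><a href="/contact">Contact</a></div>'
tree = html.fromstring(snippet)

# //a/@href returns the attribute values directly, no .get('href') needed
print(tree.xpath('//a/@href'))  # ['/about', '/contact']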
XPath expressions, like regular expressions, are powerful and one of the fastest ways to extract information from HTML. And like regular expressions, XPath can quickly become messy, hard to read, and hard to maintain.
If you'd like to learn more about XPath, do not hesitate to read my
dedicated blog post about XPath applied to web scraping.
3. Requests & BeautifulSoup
Requests
Requests is the king of Python packages. With more than 11,000,000
downloads, it is the most widely used package for Python.
Installation:
pip install requests
Making a request with - pun intended - Requests is easy:
import requests
r = requests.get('https://www.scrapingninja.co')
print(r.text)
With Requests, it is easy to perform POST requests, handle cookies,
query parameters... You can also download images with Requests.
On the following page, you will learn to use Requests with proxies.
This is almost mandatory for scraping the web at scale.
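Here is a small sketch of query parameters and proxies in action (the proxy address below is a placeholder, so that line is left commented out):
import requests

# Query parameters are passed as a dict and encoded for you
r = requests.get('https://news.ycombinator.com/news', params={'p': 2})
print(r.url)  # https://news.ycombinator.com/news?p=2

# Routing requests through a proxy works the same way (placeholder address)
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
# r = requests.get('https://www.google.com', proxies=proxies)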
Authentication to Hacker News
Let's say we want to create a tool to automatically submit our blog
post to Hacker news or any other forum, like Buffer. We would need to
authenticate on those websites before posting our link. That's what
we are going to do with Requests and BeautifulSoup!
Here is the Hacker News login form and the associated DOM:
Hacker News login form
There are three <input> tags with a name attribute on this form (other input elements are not sent). The first one has the type hidden with the name "goto", and the two others are the username and password.
If you submit the form inside your Chrome browser, you will see that
there is a lot going on: a redirect and a cookie is being set. This
cookie will be sent by Chrome on each subsequent request in order for
the server to know that you are authenticated.
Doing this with Requests is easy. It will handle redirects
automatically for us, and handling cookies can be done with the
Session object.
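Here is a quick sketch of how a Session keeps cookies across requests, using httpbin.org purely as a demo endpoint:
import requests

s = requests.Session()
# The first response sets a cookie; the Session stores it automatically...
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
# ...and sends it along with every subsequent request
r = s.get('https://httpbin.org/cookies')
print(r.text)  # {"cookies": {"sessioncookie": "123456789"}}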
BeautifulSoup
The next thing we will need is BeautifulSoup, which is a Python
library that will help us parse the HTML returned by the server, to
find out if we are logged in or not.
Installation:
pip install beautifulsoup4
So, all we have to do is POST these three inputs with our credentials
to the /login endpoint and check for the presence of an element that
is only displayed once logged in:
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
USERNAME = ""
PASSWORD = ""

s = requests.Session()
data = {"goto": "news", "acct": USERNAME, "pw": PASSWORD}
r = s.post(f'{BASE_URL}/login', data=data)

soup = BeautifulSoup(r.text, 'html.parser')
if soup.find(id='logout') is not None:
    print('Successfully logged in')
else:
    print('Authentication Error')
Fantastic, with only a couple of lines of Python code, we have
managed to log in to a site and to check if the login was successful.
Now, on to the next challenge: getting all the links on the homepage.
By the way, Hacker News offers a powerful API, so we're doing
this as an example, but you should use the API instead of
scraping it!
The first thing we need to do is inspect Hacker News's home-page to
understand the structure and the different CSS classes that we will
have to select:
Hacker news's HTML
As evident from the screenshot, all postings are part of a <tr> tag with the class athing. So, let's simply find all these tags. Yet again, we can do that with one line of code.
links = soup.findAll('tr', class_='athing')
Then, for each link, we will extract its ID, title, URL, and rank:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        "rank": int(link.find_all('td')[0].span.text.replace('.', ''))
    }
    formatted_links.append(data)

print(formatted_links)
Great, with only a couple of lines of Python code we have managed to load the Hacker News site and get the details of all the postings.
But on our journey to big data, we do not only want to print data, we
actually want to persist it. Let's try that now.
Storing our data in PostgreSQL
We chose a good ol' relational database for our example here -
PostgreSQL!
For starters, we will need a functioning database instance. Check out www.postgresql.org/download for that, pick the appropriate package for your operating system, and follow its installation instructions.
Once you have PostgreSQL installed, you'll need to set up a database
(let's name it scrape_demo), and add a table for our Hacker News
links to it (let's name that one hn_links) with the following schema.
CREATE TABLE "hn_links" (
"id" INTEGER NOT NULL,
"title" VARCHAR NOT NULL,
"url" VARCHAR NOT NULL,
"rank" INTEGER NOT NULL
);
For managing the database, you can either use PostgreSQL's own command line client or one of the available graphical clients.
All right, the database should be ready and we can turn to our code
again.
First thing, we need something that lets us talk to PostgreSQL and
Psycopg is a truly great library for that. As always, you can quickly
install it with pip.
pip install psycopg2
The rest is relatively easy and straightforward. We just need to get the connection:
con = psycopg2.connect(host="127.0.0.1", port="5432", user="postgres", password="", database="scrape_demo")
That connection will allow us to get a database cursor:
cur = con.cursor()
And once we have the cursor, we can use the execute method to actually run our SQL command:
cur.execute("INSERT INTO table [HERE-GOES-OUR-DATA]")
Perfect, we have stored everything in our database!
Hold your horses, please. Don't forget to commit your (implicit) database transaction. One more con.commit() (and a couple of closes) and we are really good to go.
hn_links table
And for the grand finale, here is the complete code with the scraping logic from before, this time storing everything in the database.
import psycopg2
import requests
from bs4 import BeautifulSoup

# Establish database connection
con = psycopg2.connect(host="127.0.0.1",
                       port="5432",
                       user="postgres",
                       password="",
                       database="scrape_demo")

# Get a database cursor
cur = con.cursor()

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

for link in links:
    cur.execute("""
        INSERT INTO hn_links (id, title, url, rank)
        VALUES (%s, %s, %s, %s)
        """,
        (
            link['id'],
            link.find_all('td')[2].a.text,
            link.find_all('td')[2].a['href'],
            int(link.find_all('td')[0].span.text.replace('.', ''))
        )
    )

# Commit the data
con.commit()

# Close our database connections
cur.close()
con.close()
Summary
As you can see, Requests and BeautifulSoup are great libraries for
extracting data and automating different actions, such as posting
forms. If you want to run large-scale web scraping projects, you
could still use Requests, but you would need to handle lots of parts
yourself.
Did you know about ScrapingBee's Data Extraction tools? Not only do they provide a complete no-code environment for your project, but they also scale with ease and handle all advanced features, such as JavaScript and proxy round-robin, out of the box. Check it out: the first 1,000 requests are always on us.
If you'd like to learn more about Python, BeautifulSoup, POST requests, and particularly CSS selectors, I'd highly recommend the following articles:
* BeautifulSoup tutorial: Scraping web pages with Python
* How to send a POST with Python Requests?
As so often, there are, of course, plenty of opportunities to improve upon:
* Finding a way to parallelize your code to make it faster
* Handling errors
* Filtering results
* Throttling your requests so you don't overload the server (a rough, hand-rolled sketch of error handling and throttling follows below)
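To give a flavor of what handling errors and throttling by hand can look like, here is a minimal sketch with Requests; the timeout, status check, and one-second sleep are arbitrary choices, not recommendations:
import time
import requests

urls = [f'https://news.ycombinator.com/news?p={i}' for i in range(1, 4)]
results = []

for url in urls:
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        results.append(r.text)
    except requests.RequestException as e:
        print(f'Failed to fetch {url}: {e}')
    time.sleep(1)  # crude throttling so we don't overload the server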
Fortunately for us, tools exist that can handle those for us.
GRequests
While the Requests package is easy-to-use, you might find it a bit
slow if you have hundreds of pages to scrape. Out of the box, it will
only allow you to send synchronous requests, meaning that if you have
25 URLs to scrape, you will have to do it one by one.
So, if one page takes ten seconds to fetch, it will take more than four minutes to fetch those 25 pages.
import requests

# An array with 25 urls
urls = [...]

for url in urls:
    result = requests.get(url)
The easiest way to speed up this process is to make several calls at
the same time. This means that instead of sending every request
sequentially, you can send requests in batches of five.
In that case, each batch will handle five URLs simultaneously, which
means you'll scrape five URLs in 10 seconds, instead of 50, or the
entire set of 25 URLs in 50 seconds instead of 250. Not bad for a time-saver.
Usually, this is implemented using thread-based parallelism. Though,
as always, threading can be tricky, especially for beginners.
Fortunately, there is a version of the Requests package that does all the hard work for us: GRequests. It's based on Requests, but also incorporates gevent, an asynchronous Python API widely used for web applications. This library allows us to send multiple requests at the same time, in an easy and elegant way.
For starters, let's install GRequests.
pip install grequests
Now, here is how to send our 25 initial URLs in batches of 5:
import grequests

BATCH_LENGTH = 5

# An array with 25 urls
urls = [...]
# Our empty results array
results = []

while urls:
    # get our first batch of 5 URLs
    batch = urls[:BATCH_LENGTH]
    # create a set of unsent Requests
    rs = (grequests.get(url) for url in batch)
    # send them all at the same time
    batch_results = grequests.map(rs)
    # appending results to our main results array
    results += batch_results
    # removing fetched URLs from urls
    urls = urls[BATCH_LENGTH:]

print(results)
# [<Response [200]>, <Response [200]>, ..., <Response [200]>, <Response [200]>]
And that's it. GRequests is perfect for small scripts but less ideal for production code or high-scale web scraping. For that, we have Scrapy.
4. Web Crawling Frameworks
Scrapy
Scrapy Logo
Scrapy is a powerful Python web scraping and web crawling framework.
It provides lots of features to download web pages asynchronously and
handle and persist their content in various ways. It provides support
for multithreading, crawling (the process of going from link to link
to find every URL in a website), sitemaps, and more.
Scrapy also has an interactive mode called the Scrapy Shell. With
Scrapy Shell, you can test your scraping code quickly and make sure
all your XPath expressions or CSS selectors work without a glitch.
The downside of Scrapy is that the learning curve is steep. There is
a lot to learn.
To follow up on our example about Hacker News, we are going to write
a Scrapy Spider that scrapes the first 15 pages of results, and saves
everything in a CSV file.
You can easily install Scrapy with pip:
pip install Scrapy
Then you can use the Scrapy CLI to generate the boilerplate code for
our project:
scrapy startproject hacker_news_scraper
Inside hacker_news_scraper/spiders we will create a new Python file with our spider's code:
from bs4 import BeautifulSoup
import scrapy


class HnSpider(scrapy.Spider):
    name = "hacker-news"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = [f'https://news.ycombinator.com/news?p={i}' for i in range(1, 16)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.findAll('tr', class_='athing')
        for link in links:
            yield {
                'id': link['id'],
                'title': link.find_all('td')[2].a.text,
                "url": link.find_all('td')[2].a['href'],
                "rank": int(link.td.span.text.replace('.', ''))
            }
There are a lot of conventions in Scrapy. We first provide all the desired URLs in start_urls. Scrapy will then fetch each URL and call parse for each of them, and that is where we use our custom code to parse the response.
We then need to fine-tune Scrapy a bit in order for our spider to
behave nicely with the target website.
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
You should always turn this on. Based on the response times, this
feature automatically adjusts the request rate and the number of
concurrent threads and makes sure your spider is not flooding the
website with requests. We wouldn't want that, would we?
You can run this code with the Scrapy CLI and with different output
formats (CSV, JSON, XML...):
scrapy crawl hacker-news -o links.json
And that's it! You now have all your links in a nicely formatted JSON
file.
There is a lot more to say about Scrapy. So, if you wish to learn more, please don't hesitate to check out our dedicated blog post about web scraping with Scrapy.
PySpider
PySpider is an alternative to Scrapy, albeit a bit outdated. Its last release is from 2018. However, it is still relevant because it does many things that Scrapy does not handle out of the box.
First, PySpider works well with JavaScript pages (SPA and Ajax call)
because it comes with PhantomJS, a headless browsing library. In
Scrapy, you would need to install middlewares to do this. On top of
that, PySpider comes with a nice UI that makes it easy to monitor all
of your crawling jobs.
PySpider interface
However, you might still prefer to use Scrapy for a number of
reasons:
* Much better documentation than PySpider with easy-to-understand
guides
* A built-in HTTP cache system that can speed up your crawler (see the settings sketch after this list)
* Automatic HTTP authentication
* Support for 3XX redirections, as well as the HTML meta refresh
tag
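To illustrate the cache point from the list above, the built-in HTTP cache is just a couple of toggles in your project's settings.py; here is a minimal sketch (the expiration value is arbitrary):
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # cache entries expire after one hour
HTTPCACHE_DIR = 'httpcache'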
5. Headless browsing
Selenium & Chrome
Scrapy is great for large-scale web scraping tasks. However, it is difficult to handle sites that heavily rely on JavaScript or are implemented as SPAs (Single Page Applications). Scrapy does not handle JavaScript on its own and will only get you the static HTML code.
Generally, it can be challenging to scrape SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, always check what exactly the JavaScript code is doing. This means manually inspecting all of the network calls with your browser inspector and replicating the AJAX calls containing the interesting data.
Often, though, there are too many HTTP calls involved to get the data
you want and it can be easier to render the page in a headless
browser. Another great use case for that, would be to take a
screenshot of a page, and this is what we are going to do with the
Hacker News homepage (we do like Hacker News, don't we?) and the help
of Selenium.
Hey, I don't get it, when should I use Selenium or not?
Here are the three most common cases when you need Selenium:
1. You're looking for information that appears a few seconds after the webpage has loaded in the browser.
2. The website you're trying to scrape is using a lot of JavaScript.
3. The website you're trying to scrape has some JavaScript checks in place to block "classic" HTTP clients.
You can install the Selenium package with pip:
pip install selenium
You will also need ChromeDriver. On macOS you can use brew for that.
brew install chromedriver
Then, we just have to import the Webdriver from the Selenium package, configure Chrome with headless=True, set a window size (otherwise it is really small), start Chrome, load the page, and finally get our beautiful screenshot:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=r'/usr/local/bin/chromedriver')
driver.get("https://news.ycombinator.com/")
driver.save_screenshot('hn_homepage.png')
driver.quit()
True, being good netizens, we also quit() the WebDriver instance of
course. Now, you should get a nice screenshot of the homepage:
Hacker News's front page
Naturally, there's a lot more you can do with the Selenium API and
Chrome. After all, it's a full-blown browser instance.
* Running JavaScript
* Filling forms
* Clicking on elements
* Extracting elements with CSS selectors / XPath expressions
Selenium and Chrome in headless mode are the ultimate combination to
scrape anything you want. You can automate everything that you could
do with your regular Chrome browser.
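Here is a small sketch of a couple of those interactions; it assumes chromedriver is on your PATH (otherwise pass executable_path as above), and the tr.athing selector is the same one we used earlier for Hacker News:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get('https://news.ycombinator.com/')

# Run arbitrary JavaScript in the page
print(driver.execute_script('return document.title'))

# Extract elements with a CSS selector
rows = driver.find_elements(By.CSS_SELECTOR, 'tr.athing')
print(len(rows))

driver.quit()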
The big drawback is that Chrome needs lots of memory / CPU power.
With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need 1 CPU core per instance.
Don't hesitate to check out our in-depth article about Selenium and
Python.
If you need to run several instances concurrently, this will require
a machine with an adequate hardware setup and enough memory to serve
all your browser instances. If you'd like a more lightweight and
carefree solution, check out ScrapingBee's site crawler SaaS
platform, which does a lot of the heavy lifting for you.
RoboBrowser
RoboBrowser is a Python library which wraps Requests and
BeautifulSoup into a single and easy-to-use package and allows you to
compile your own custom scripts to control the browsing workflow of
RoboBrowser. It is a lightweight library, but it is not a headless browser and still has the same restrictions as Requests and BeautifulSoup, which we discussed earlier.
For example, if you want to log in to Hacker News, instead of manually crafting a request with Requests, you can write a script that will populate the form and click the login button:
# pip install RoboBrowser
from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('https://news.ycombinator.com/login')

# Get the login form
signin_form = browser.get_form(action='login')

# Fill it out (the Hacker News fields are named "acct" and "pw", as we saw earlier)
signin_form['acct'].value = 'account'
signin_form['pw'].value = 'secret'

# Submit the form
browser.submit_form(signin_form)
As you can see, the code is written as if you were manually doing the
task in a real browser, even though it is not a real headless
browsing library.
RoboBrowser is cool because its lightweight approach allows you to
easily parallelize it on your computer. However, because it's not
using a real browser, it won't be able to deal with JavaScript like
AJAX calls or Single Page Applications.
Unfortunately, its documentation is also lightweight, and I would not recommend it for newcomers or for people not already used to the BeautifulSoup or Requests API.
6. Scraping Reddit data
Sometimes you don't even have to scrape the data using an HTTP client
or a headless browser. You can directly use the API exposed by the
target website. That's what we are going to try now with the Reddit
API.
To access the API, we're going to use Praw, a great Python package
that wraps the Reddit API.
To install it:
pip install praw
Then, you will need to get an API key. Go to https://www.reddit.com/prefs/apps.
Scroll to the bottom to create an application:
Scraping Reddit data with API
As outlined in the documentation of Praw, make sure to provide http://localhost:8080 as the "redirect URL".
After clicking create app, the screen with the API details and
credentials will load. You'll need the client ID, the secret, and the
user agent for our example.
Scraping Reddit data with API
Now we are going to get the top 1,000 posts from /r/Entrepreneur and export them to a CSV file.
import praw
import csv

reddit = praw.Reddit(client_id='your_client_id',
                     client_secret='this_is_a_secret',
                     user_agent='top-1000-posts')

top_posts = reddit.subreddit('Entrepreneur').top(limit=1000)

with open('top_1000.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'score', 'num_comments', 'author']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for post in top_posts:
        writer.writerow({
            'title': post.title,
            'score': post.score,
            'num_comments': post.num_comments,
            'author': post.author
        })
As you can see, the actual extraction part is only a single line of Python code: running top() on the subreddit and storing the posts in top_posts.
There are many other use cases for Praw. You can do all kinds of crazy things, like analyzing subreddits in real time with sentiment analysis libraries, predicting the next $GME...
Conclusion
Here is a quick recap table of every technology we discussed in this
blog post. Please, do not hesitate to let us know if you know some
resources that you feel belong here.
| Name | socket | urllib3 | requests | scrapy | selenium |
|---|---|---|---|---|---|
| Ease of use | - - - | + + | + + + | + + | + |
| Flexibility | + + + | + + + | + + | + + + | + + + |
| Speed of execution | + + + | + + | + + | + + + | + |
| Common use case | Writing low-level programming interfaces | High-level applications that need fine control over HTTP (pip, aws client, requests, streaming) | Calling an API; simple applications (in terms of HTTP needs) | Crawling an important list of websites; filtering, extracting, and loading scraped data | JS rendering; scraping SPAs; automated testing; programmatic screenshots |
| Learn more | Official documentation; great tutorial | Official documentation; Pip's usage of urllib3 | Official documentation; Requests' usage of urllib3 | Official documentation; Scrapy overview | Official documentation; Scraping SPAs |
I hope you enjoyed this blog post! This was a quick introduction to the most used Python tools for web scraping. In the next posts, we're going to go more in-depth on individual tools and topics, like XPath and CSS selectors.
If you want to learn more about HTTP clients in Python, we just
released this guide about the best Python HTTP clients.
Happy Scraping!
Kevin Sahin
Kevin worked in the web scraping industry for 10 years before
co-founding ScrapingBee. He is also the author of the Java Web
Scraping Handbook.