https://medium.com/brick-by-brick/7-bite-sized-tips-for-reliable-web-automation-and-scraping-selectors-2612bc4de2a1
Brick by Brick
7 bite-sized tips for reliable web automation and scraping selectors
Todd Schiller
Dec 3, 2020 * 6 min read
If you're like most developers, you've probably encountered Cascading
Style Sheets (CSS) selectors for styling webpages.
For example, the following CSS rule combines a paragraph element
selector p with a class name selector .lead to set the font size:
p.lead {
  font-size: 32px;
}
We can mix-and-match CSS selectors to describe any subset of elements
on a page. There are CSS selectors for HTML tag types, ids, classes,
attributes, page structure, and even UX interactions.
Because of their expressiveness, CSS selectors are used everywhere in
the web ecosystem:
* Web Applications (Web API, JQuery): modifying the DOM, attaching
event handlers, etc.
* Automated Testing: finding elements to interact with, checking
assertions, etc.
* Web Scraping: selecting data to extract, finding links to
traverse, etc.
Over at PixieBrix, selectors are an integral part of our web
customization engine.
Unfortunately, if we don't control a web page, writing reliable
selectors can be an unpleasant experience. Some old-school
WYSIWYG-built sites (I'm looking at you, Microsoft FrontPage) and
modern Single Page Applications (SPAs) can be downright hostile to
work with.
To help folks out, I've compiled a set of tips I find useful when
writing selectors for enhancing, automating, and scraping 3rd-party
sites.
Tip #0: Use JQuery extensions
When working in a browser context (normal or headless), use JQuery's
selector extensions. Selectors such as :eq, :header, and :contains
make many selections significantly more straightforward.
A downside is that JQuery extensions are slower because they can't
leverage the browser's native CSS evaluation. However, for automation
and scraping, you probably won't ever notice. If it does become a
problem, there are simple techniques to optimize JQuery selectors.
In addition to CSS+JQuery, it's also helpful to learn a bit of XPath,
an XML/HTML traversal language that browsers support natively.
Tip #1: Avoid structure-based selectors
Browsers and some libraries are able to automatically generate unique
CSS selectors for elements.
Generating a unique CSS selector for an element in Chrome
Automatic generators are great because they can take advantage of the
structure of the Document Object Model (DOM). Automatic generators
are also terrible because they sometimes rely too much on the
structure of the DOM.
Recently, a tool gave me the following selector:
:nth-child(5) > :nth-child(2) > :nth-child(2) > :nth-child(1) > :nth-child(3)
Structural selectors like these are extremely sensitive to small
changes on the page, including non-visible changes. (For example: an
element with display:none could be inserted before an element.)
When that happens, these selectors can silently return the wrong
element, and we'll wind up debugging a failed automated test or
ingesting bad extracted data.
Therefore, whenever possible, I eschew overly-specific structural
selectors in favor of ids, class names, attributes, and textual
content.
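One way to catch overly structural selectors before they reach a test suite or scraper is a small lint-style check. The sketch below is a rough heuristic rather than a CSS parser; the function names and the default threshold of one positional step are illustrative assumptions.

```typescript
// Count how many steps of a selector are purely positional, e.g. ":nth-child(2)".
// The selector is split on combinators (>, +, ~) and whitespace. This is a rough
// heuristic: attribute values containing those characters are split imprecisely,
// but the resulting fragments never match the positional pattern below.
function countPositionalSteps(selector: string): number {
  const steps: string[] = selector
    .split(/\s*[>+~]\s*|\s+/)
    .filter((step) => step.length > 0);
  // An optional tag (or *) followed by exactly one positional pseudo-class.
  const positional = /^(\*|[a-z][a-z0-9]*)?:(nth-child|nth-of-type|eq)\(\d+\)$/i;
  return steps.filter((step) => positional.test(step)).length;
}

// Flag selectors that lean on more positional steps than the (assumed) threshold.
function isFragile(selector: string, maxPositionalSteps: number = 1): boolean {
  return countPositionalSteps(selector) > maxPositionalSteps;
}
```

For the generated selector shown earlier, countPositionalSteps returns 5, so isFragile flags it; a selector such as button[aria-label="Close"] has no positional steps and passes.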
Tip #2: Avoid dynamically generated attributes
In modern Single Page Applications (SPAs), element ids and class
names are commonly dynamically generated.
Dynamic class names are computed when the code is compiled/bundled,
because the class names in the HTML must match the names in the
separate CSS stylesheets. As a result, dynamic class names can change
whenever the site is updated. (If they are computed deterministically,
e.g., using a hash, they'll change only when the style is changed.)
For example, take the "Google Search" button on the Google homepage.
It's built with Google's Closure Framework:
At first glance, the name="btnK" also looks random, but it's not. The
"I'm Feeling Lucky" button on the page has the name btnI, suggesting
they're human-picked. As a rule of thumb, input name attributes can't
be dynamic because other parts of the application (either front-end
or back-end) depend on them.
A common source of dynamic class names in applications is the use of
CSS Modules. CSS Modules automatically encapsulate styles to avoid
accidental styling collisions. Here's an example header:
<h1 class="_styles__title_...">An example heading</h1>
While we can't match the exact class name, it will follow a
consistent naming pattern across updates of the site. Therefore, we
can match on the pattern using CSS's starts-with attribute selector:
h1[class^="_styles__title_"]
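One caveat with the starts-with form: [class^="..."] compares against the whole class attribute, so it only matches when the generated class is listed first. A minimal sketch of a helper that also covers the case where another class precedes it follows; the function names are hypothetical, and hasClassWithPrefix only mirrors the matching logic for illustration.

```typescript
// Build a selector that matches a generated class by its stable prefix.
// Branch 1: the class attribute starts with the prefix.
// Branch 2: the prefix appears after a space, i.e. another class comes first.
function classPrefixSelector(tag: string, prefix: string): string {
  return `${tag}[class^="${prefix}"], ${tag}[class*=" ${prefix}"]`;
}

// Pure-string mirror of what the selector above checks, handy for unit tests.
function hasClassWithPrefix(classAttribute: string, prefix: string): boolean {
  return classAttribute
    .split(/\s+/)
    .some((name) => name.length > 0 && name.indexOf(prefix) === 0);
}
```

Calling classPrefixSelector("h1", "_styles__title_") yields h1[class^="_styles__title_"], h1[class*=" _styles__title_"], which can be passed to JQuery or document.querySelectorAll like any other selector list.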
Dynamic ids, on the other hand, are managed by the SPA framework at
runtime. Therefore, they'll change for each run of the application.
Take, for example, the profile header from LinkedIn, which uses the
Ember.js framework:
Here, the ember1398 id isn't random -- the framework sequentially
generates a new id for each element it renders. In practice, the id
of an element will differ between page loads because elements don't
always load in the exact same order.
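When reviewing selectors in bulk, a simple check can flag ids that look framework-generated. This is a sketch under stated assumptions: the pattern list is illustrative (only the ember form comes from the example above), and the generic "prefix plus long counter" pattern is a guess that will need tuning per site.

```typescript
// Patterns for ids that are probably generated at runtime.
// Both entries are heuristics, not rules documented by any framework.
const GENERATED_ID_PATTERNS: RegExp[] = [
  /^ember\d+$/,            // sequential Ember.js ids such as "ember1398"
  /^[a-z]+[-_]?\d{3,}$/i,  // assumed generic shape: letters, then 3+ digits
];

// Return true if the id matches any of the patterns above.
function looksGenerated(id: string): boolean {
  return GENERATED_ID_PATTERNS.some((pattern) => pattern.test(id));
}
```

With these patterns, looksGenerated("ember1398") is true, while hand-picked names such as "btnK" or "profile-header" are left alone.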
Tip #3: Search text with JQuery's :contains selector
JQuery's :contains selector selects elements that contain the given
text anywhere within them. The selector is not available in plain CSS,
but XPath offers an equivalent contains() function.
For example, consider the following HTML:
Lorem ipsum dolor sit amet, consectetur ...
We could select this element with:
:contains("Lorem ipsum")
If there are other tags, we need to add additional selectors to
clarify which element we want. For example, the above selector would
select the div, p, and b elements in this document:
<div><p><b>Lorem ipsum</b> dolor sit amet, consectetur...</p></div>
To select just the div, we'd need to specify the tag in the selector:
div:contains("Lorem ipsum")
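For environments without JQuery, the same text search can be written with XPath's contains() function. The sketch below builds such an expression as a string; the helper names are hypothetical, and text containing both quote characters is deliberately left unhandled.

```typescript
// Quote a string as an XPath 1.0 literal. XPath 1.0 has no escape sequences,
// so the quote style is chosen based on the content.
function xpathLiteral(text: string): string {
  if (text.indexOf('"') === -1) {
    return `"${text}"`;
  }
  if (text.indexOf("'") === -1) {
    return `'${text}'`;
  }
  // Text with both quote types would need concat(); out of scope for this sketch.
  throw new Error("text containing both quote types is not handled");
}

// "." is the element's string value, which includes the text of its descendants,
// so this matches in the same spirit as :contains.
function xpathContaining(tag: string, text: string): string {
  return `//${tag}[contains(., ${xpathLiteral(text)})]`;
}
```

In a browser, the resulting expression, e.g. //div[contains(., "Lorem ipsum")], can be passed to document.evaluate.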
In some cases, the text might contain HTML entities, e.g., a
non-breaking space:
Lorem&nbsp;ipsum
Trying to select "Lorem ipsum" won't match here, because the
non-breaking space is a different character from the regular space in
the selector. In these cases, we can chain the contains selector to
match only elements that contain both words:
div:contains("Lorem"):contains("ipsum")
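Chaining by hand gets tedious for longer phrases, so the selector can be generated from the phrase itself. A minimal sketch, assuming the words contain no double quotes (no escaping is attempted) and using a hypothetical helper name:

```typescript
// Split a phrase on any whitespace and emit one :contains(...) per word.
// In JavaScript regular expressions, \s also matches the non-breaking
// space (U+00A0), so "Lorem\u00a0ipsum" is split the same way as "Lorem ipsum".
function containsAll(tag: string, phrase: string): string {
  const words: string[] = phrase.split(/\s+/).filter((word) => word.length > 0);
  const parts: string[] = words.map((word) => `:contains("${word}")`);
  return tag + parts.join("");
}
```

Both containsAll("div", "Lorem ipsum") and containsAll("div", "Lorem\u00a0ipsum") produce div:contains("Lorem"):contains("ipsum"). Note that the chained form no longer requires the words to be adjacent or in order.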
Finally, a caveat/warning: when writing text selectors, be aware of
which text is translated when the page is viewed in a different
language.
Tip #4: Target selection with :has
Sometimes we need more precision in targeting a text search. In these
cases, we can nest a :contains selector inside a CSS :has selector.
Suppose we wanted to get the year a property was built from
Apartments.com:
Property Information
* Built in 2018
* 1016 Units/44 Stories
We can find the div with the Property Information header, and then
grab the list item corresponding to the year the property was built:
.specList:has(h3:contains('Property Information')) ul > li:contains('Built')
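The selector returns the whole list item, so the year still has to be pulled out of its text. A minimal sketch of that last step, assuming the text has the "Built in YYYY" shape shown above (the function name is hypothetical):

```typescript
// Extract the four-digit year from text such as "Built in 2018".
// Returns null when the text does not have the expected shape, so callers
// can tell "no match" apart from a real value.
function parseBuiltYear(itemText: string): number | null {
  const match = /Built in\s+(\d{4})/.exec(itemText);
  return match ? parseInt(match[1], 10) : null;
}
```

Here parseBuiltYear("Built in 2018") returns 2018, and an unrelated item such as "1016 Units/44 Stories" returns null instead of a wrong number.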
The :has selector can also be used to find the containing element.
The following selector matches a list containing an item with
"Built", rather than the individual list item:
ul:has(> li:contains("Built"))
Tip #5: Use ARIA attributes for more readable selectors
Accessible Rich Internet Applications (ARIA) attributes support the
accessibility of a site, e.g., for providing context to screen
readers. Using them for selectors also makes selectors more readable.
For example, the aria-label attribute labels elements that otherwise
would depend on visual cues:
These can be selected against using standard CSS attribute selectors,
e.g.:
button[aria-label="Close"]
One thing to watch out for with the aria-label attribute is that, as
user-facing text, it's typically internationalized/translated
alongside other UX text.
ARIA attributes also provide information about the role of elements
on the page (e.g., the role attribute) and even the relationship
between elements on the page (e.g., aria-labelledby).
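Because aria-label values are translated, one option is to build a selector list that accepts every known translation. The sketch below assumes we already have the labels for the locales we care about; the helper names and the French label "Fermer" are illustrative, not taken from any particular site.

```typescript
// Build tag[attr="value"], escaping backslashes and double quotes in the value
// so the result stays a valid CSS string.
function attrEquals(tag: string, attr: string, value: string): string {
  const escaped = value.replace(/(["\\])/g, "\\$1");
  return `${tag}[${attr}="${escaped}"]`;
}

// Join one aria-label selector per translation into a comma-separated list,
// which matches an element carrying any of the labels.
function anyAriaLabel(tag: string, labels: string[]): string {
  return labels.map((label) => attrEquals(tag, "aria-label", label)).join(", ");
}
```

For example, anyAriaLabel("button", ["Close", "Fermer"]) yields button[aria-label="Close"], button[aria-label="Fermer"].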
Tip #6: Be on the lookout for automated testing attributes
Developers using automated testing frameworks, e.g., Cypress, add data
attributes to make their lives easier when writing automated tests.
Sometimes these attributes find their way to the production site:
Testing attributes can be selected just like any other attribute, and
are often unique (just like an id):
[data-cy="content"]
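Teams use different attribute names for this purpose, so when probing an unfamiliar site it can help to try several at once. The list below is an assumption based on commonly seen conventions (data-cy, data-test, data-testid, data-qa), not an exhaustive or authoritative set, and the function name is hypothetical.

```typescript
// Attribute names commonly used as test hooks. Adjust for the site at hand.
const TEST_ATTRIBUTES: string[] = ["data-cy", "data-test", "data-testid", "data-qa"];

// Build a selector list that matches the value under any of the attribute names.
// The value is assumed to contain no double quotes.
function testHookSelector(value: string): string {
  return TEST_ATTRIBUTES.map((attr) => `[${attr}="${value}"]`).join(", ");
}
```

Once we know which attribute a site actually uses, the single-attribute form, such as [data-cy="content"], is easier to read and should be preferred.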
Epic Handshake meme of Quality Assurance and Web Scrapers shaking
hands over data attributes
We're all in this together
Bonus Tip: handle flat key/value pairs with the adjacent sibling
combinator
The adjacent sibling combinator (+) matches the element immediately
following another element.
Suppose we have the following HTML, structured as multiple key-value
pairs per row/list item:
<li><span class="k">Name:</span><span>Laura Smith</span><span class="k">Year:</span><span>2013</span></li>
We can look up the value for the year by finding the "Year:" label and
then selecting the next element, which will be the value:
span.k:contains('Year:') + span
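When every pair in a row is needed rather than a single value, the same label-then-value ordering can be exploited after reading the span texts out in document order. A minimal sketch, assuming the texts strictly alternate between label and value (the function name is hypothetical):

```typescript
// Turn ["Name:", "Laura Smith", "Year:", "2013"] into { Name: "Laura Smith", Year: "2013" }.
// Each even index is treated as a label and the element right after it as its value,
// mirroring what the adjacent sibling combinator does in the selector.
function pairKeyValues(texts: string[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (let i = 0; i + 1 < texts.length; i += 2) {
    const key = texts[i].trim().replace(/:$/, ""); // drop the trailing colon
    result[key] = texts[i + 1].trim();
  }
  return result;
}
```

A trailing label without a value is skipped rather than paired with undefined.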
Want more?
If you want more tips like these, don't forget to follow @pixiebrix on
Twitter, sign up for our newsletter, and star our repository on
GitHub.
In the next post, I'll cover how to pull information directly from web
application frameworks (like React and Angular) instead of parsing
HTML.