[HN Gopher] Beautiful Soup
       ___________________________________________________________________
        
       Beautiful Soup
        
       Author : memorable
       Score  : 177 points
       Date   : 2022-06-08 13:32 UTC (9 hours ago)
        
 (HTM) web link (www.crummy.com)
 (TXT) w3m dump (www.crummy.com)
        
       | js2 wrote:
       | I used BS to scrape rogerebert.com and post his reviews to
       | letterboxd:
       | 
       | https://letterboxd.com/re2/
       | 
       | I copied over only the first two paragraphs of each of his
       | reviews with a link back to the original.
       | 
        | The HTML is a total mess, having obviously been moved around the
        | web a couple of times. So it required a bunch of cleanup. But that
        | wasn't
       | even the hard part. The hard part was getting the correct TMDB ID
       | for the movies because his reviews also have either no useful
       | metadata or metadata that's wrong, like incorrect movie years,
       | misspelled actor names, etc.
       | 
        | I was never able to get API access to letterboxd, but they have a
        | CSV import feature which worked out well enough.
        
         | ravishi wrote:
         | I had my share of gigs where we just decided to scrape the old
         | site with BS and extract structured data from there to render a
         | new site. It was sometimes cheaper than dealing with their
          | ancient ad-hoc CMS monstrosities.
        
           | js2 wrote:
            | After I managed to wrangle the review text from the HTML, it
            | still needed this sort of cleanup:
            | 
            |     import re
            | 
            |     def clean_text(text):
            |         # remove control chars
            |         text = re.sub(r"[\x7f-\x9f]", "", text)
            |         # replace non-breaking spaces, CRs and tabs with spaces
            |         text = re.sub(r"[\xa0\r\t]+", " ", text)
            |         # squash runs of newlines
            |         text = re.sub(r"\n+", "\n", text)
            |         # squash runs of spaces (but keep newlines, since the
            |         # rules below still need them)
            |         text = re.sub(r"[^\S\n]+", " ", text)
            |         # Remove newlines unless they appear to be at the end of
            |         # a sentence, or if the line is shorter than 80 chars.
            |         text = re.sub(r"([^.?!\"\)])\n", r"\1 ", text)
            |         text = re.sub(r"\n([^\n]{,80})\n", r"\1 ", text)
            |         return text.strip()
        
       | benibela wrote:
        | I wrote my own HTML parser in Pascal 15 years ago. Pascal is much
        | faster than Python.
        
       | rawbot wrote:
        | Just used this to scrape game guides from GameFAQs. Pretty easy to
        | use; would recommend for quick projects.
        
       | latchkey wrote:
       | Years ago, I got the privilege of working at the same company
       | with the author, Leonard Richardson. Really nice guy, super nerd
       | and hilariously funny.
        
       | jng wrote:
        | We used this in a project many suns ago and we ended up switching
        | to libxml2: less pretty presentation, but more functional. YMMV.
        
         | brodouevencode wrote:
         | bs4 introduced some very nice features over bs3, if that's what
         | you were using, and includes the ability to use libxml2 as a
         | parser. For very simple things though libxml2 would be a better
         | fit.
        
           | jamessb wrote:
           | bs4 is able to parse some malformed documents that libxml2
           | chokes on.
           | 
           | For these cases it can be useful to do the reverse, and use
           | the BeautifulSoup HTML parser as an alternative parser
           | backend for the lxml package:
           | https://lxml.de/elementsoup.html
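            | 
            | A minimal example of that reversed arrangement (soupparser is
            | the module documented at the link above):
            | 
            |     from lxml.html import soupparser
            | 
            |     # BeautifulSoup handles the tag soup; the result is an
            |     # lxml tree, so XPath and friends work on it as usual.
            |     root = soupparser.fromstring("<p>Hello<p>World")
            |     print(root.xpath("//p/text()"))  # ['Hello', 'World']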
        
       | frereubu wrote:
       | I used this recently to scrape a bunch of online restaurant
       | reviews by a guy who I really like, use regex to get the postcode
       | from the markup, do a geocode using postcodes.io, then plot the
        | reviews on a Google map. It took about two or three hours and felt
       | kinda dirty in a good way. Beautiful Soup made the first part
       | really easy.
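        | 
        | The middle two steps are only a few lines; a rough sketch,
        | assuming the review HTML is in markup (the postcode regex here is
        | deliberately simplified, and postcodes.io lookups are plain GET
        | requests):
        | 
        |     import re
        |     import requests
        | 
        |     # Simplified UK postcode pattern; real ones have more edge cases.
        |     match = re.search(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", markup)
        |     resp = requests.get(
        |         "https://api.postcodes.io/postcodes/"
        |         + match.group().replace(" ", "")
        |     )
        |     result = resp.json()["result"]
        |     lat, lng = result["latitude"], result["longitude"]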
        
         | cyberlurker wrote:
         | Hmm, cool project idea but if the subject doesn't know you're
         | doing this, it is a bit weird to stalk their online footprint.
         | 
         | Unless you mean the person is a professional restaurant
         | reviewer and you like their opinion.
         | 
         | Oh well, good luck either way.
        
           | tlavoie wrote:
           | "Hmm, cool project idea but if the subject doesn't know
           | you're doing this, it is a bit weird to stalk their online
           | footprint."
           | 
           | Wait, which one is the real cyberlurker? ;)
        
           | frereubu wrote:
           | Yes, they're a professional reviewer and I enjoy their
           | reviews.
        
       | oh_sigh wrote:
        | Boy... 15 years ago I was reaching for this and hpricot almost
       | every week to do some cool scraping/parsing of some kind. I
       | always loved the bs API
        
       | dj_gitmo wrote:
       | Beautiful Soup got me my first job.
        
       | stereocodes wrote:
        | What's new about it? Why is this post here? Beautiful Soup has
        | been around for a long, long time.
        
         | sophacles wrote:
         | See: https://news.ycombinator.com/newsguidelines.html
         | 
         | Particularly: On-Topic: Anything that good hackers would find
         | interesting. That includes more than hacking and startups. If
         | you had to reduce it to a sentence, the answer might be:
         | anything that gratifies one's intellectual curiosity.
        
       | emehex wrote:
        | Beautiful Soup is an incredibly robust and powerful tool. However,
        | it can sometimes be intimidating for beginners (which ".find_x_y_z"
        | method should I use again?). To that end, you should check out
        | gazpacho (with just a single "find" method):
        | https://github.com/maxhumber/gazpacho
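        | 
        | A minimal sketch of the gazpacho flow, based on its README (the
        | URL is just an example):
        | 
        |     from gazpacho import get, Soup
        | 
        |     html = get("https://www.crummy.com/software/BeautifulSoup/")
        |     soup = Soup(html)
        |     links = soup.find("a")  # the one find() handles every lookup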
        
       | agentdrtran wrote:
       | I used Beautiful Soup on one of my first successful programming
       | projects, I still remember how easy it was, and it taught me a
       | lot about python.
        
       | durpleDrank wrote:
       | WWW::Mechanize fam where you at?
        
         | gaws wrote:
         | The downside is using Perl.
        
       | dontbenebby wrote:
        | I used Beautiful Soup for a project recently (grabbing a series
        | of page titles for export to CSV). It's super useful; thanks for
        | the pointer, OP.
        
       | whiskey14 wrote:
       | I find this to be a better version of the docs:
       | https://beautiful-soup-4.readthedocs.io/en/latest/#
       | 
        | Just in case someone wants a quick overview of what this superbly
        | named library is: web scraping (HTML parsing) in Python.
        
         | wrycoder wrote:
         | The crummy.com page includes several suggestions to subscribe
         | to Tidelift.
        
       | scanr wrote:
       | The author also wrote a novel called Constellation Games which I
       | enjoyed a lot https://constellation.crummy.com/
        
       | rnx wrote:
       | I fondly remember being introduced to this library as part of the
       | first project I worked on at my first development job. I was
       | lucky there it was the right challenge at the right time.
        
       | aedocw wrote:
       | https://web.archive.org/web/20220608133224/https://www.crumm...
        
       | phone8675309 wrote:
       | How helpful is this when you're dealing with a website that does
        | not degrade gracefully and insists on using JavaScript to shove
       | things in where a static webpage would work? (For example,
       | scraping football scores from NFL.com)
        
         | marginalia_nu wrote:
          | The easiest approach in that case is probably something like
          | headless Chrome. But that is also significantly more demanding
          | in terms of resources.
        
         | werds wrote:
          | It's not useful in those cases, but for those JS-rendered sites
          | you can usually replicate the AJAX requests that happen and get
          | nicely formed JSON documents to parse through instead.
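          | 
          | A sketch of that approach (the endpoint and the response shape
          | here are hypothetical; you would find the real ones by watching
          | the Network tab in the browser's dev tools while the page
          | loads):
          | 
          |     import requests
          | 
          |     # Hypothetical JSON endpoint discovered via dev tools.
          |     resp = requests.get("https://example.com/api/scores")
          |     for game in resp.json()["games"]:  # assumed response shape
          |         print(game["home"], game["away"], game["score"])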
        
           | squaresmile wrote:
            | Or the data is stored in JS objects within script tags in the
            | HTML and can be extracted programmatically. It's getting
            | common with SSG sites using SPA frameworks.
            | 
            | For example, the new Google Play Store website stores the data
            | in AF_initDataCallback calls, which can be extracted with:
            | 
            |     re.findall(r"<script nonce=\"\S+\">AF_initDataCallback\((.*?)\);",
            |                html_string)
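            | 
            | With frameworks that embed plain JSON rather than JS calls
            | (Next.js's __NEXT_DATA__ script tag is a well-known case),
            | the same idea needs no regex; a sketch, assuming the page
            | HTML is already in html_string:
            | 
            |     import json
            | 
            |     from bs4 import BeautifulSoup
            | 
            |     soup = BeautifulSoup(html_string, "html.parser")
            |     tag = soup.find("script", id="__NEXT_DATA__")
            |     data = json.loads(tag.string)  # the page's full data tree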
        
           | phone8675309 wrote:
           | I used to do that when I was responsible for a set of web
           | crawlers to extract public records data, but the problem is
           | that changes happen and these sorts of things become out of
           | date fairly quickly.
           | 
           | Getting this working in a headless browser driven by Selenium
           | would probably be easier for maintainability.
        
           | 867-5309 wrote:
            | Nowadays you usually have to submit HTTP headers and cookies
            | too; that's always a fun process of elimination.
        
         | radus wrote:
         | In those cases you might want to check out SeleniumBase:
         | https://seleniumbase.io/
        
         | [deleted]
        
         | sergiotapia wrote:
          | If you're lucky, those sites have the raw data as a server-side
          | generated JSON payload right in the page source markup.
          | 
          | For example, Target is client-side, but has all the data in a
          | `window.FOOBAR = json` variable you can fetch and parse with
          | some substring magic. Much easier than spinning up chromedriver
          | and some package.
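          | 
          | A sketch of that substring magic, assuming the fetched HTML is
          | in page_source (window.FOOBAR stands in for whatever variable
          | the site actually uses, and the payload is assumed to sit on
          | one line, as it usually does):
          | 
          |     import json
          | 
          |     marker = "window.FOOBAR = "
          |     start = page_source.index(marker) + len(marker)
          |     end = page_source.index("\n", start)
          |     data = json.loads(page_source[start:end].rstrip("; "))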
        
       | pmoriarty wrote:
       | Is Beautiful Soup still the best way to scrape the web with
       | python?
       | 
       | IIRC, Beautiful Soup doesn't handle javascript, so at least for
       | JS you're forced to use something else.
       | 
       | I'm also looking forward to seeing how people scrape the web once
       | Web Assembly becomes prevalent.
        
         | move-on-by wrote:
          | I've found Playwright to be a really great tool for scraping.
        
           | malshe wrote:
           | This sounds interesting. Any resources for a beginner? I use
           | Selenium regularly.
        
             | simonw wrote:
             | Playwright for Python has really good documentation:
             | https://playwright.dev/python/
             | 
             | I used it for my https://shot-scraper.datasette.io/ tool,
              | and wrote a bit about CLI-driven scraping using that tool
              | here:
              | https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
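              | 
              | Playwright's sync API keeps a first script very short; a
              | minimal fetch-the-rendered-DOM sketch:
              | 
              |     from playwright.sync_api import sync_playwright
              | 
              |     with sync_playwright() as p:
              |         browser = p.chromium.launch()
              |         page = browser.new_page()
              |         page.goto("https://example.com/")
              |         html = page.content()  # DOM after JS has run
              |         browser.close()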
        
           | samwillis wrote:
           | Headless scraping is in the region of 10x slower and more
           | resource intensive, even when carefully blocking requests
           | such as images. It should always be a second choice.
           | 
            | Other than that, Playwright is incredible: by far the best
            | browser automation API.
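            | 
            | For what it's worth, that careful request blocking looks
            | something like this in Playwright's Python sync API:
            | 
            |     from playwright.sync_api import sync_playwright
            | 
            |     with sync_playwright() as p:
            |         browser = p.chromium.launch()
            |         page = browser.new_page()
            |         # Abort image requests before they hit the network.
            |         page.route("**/*.{png,jpg,jpeg,gif,webp}",
            |                    lambda route: route.abort())
            |         page.goto("https://example.com/")
            |         browser.close()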
        
             | move-on-by wrote:
              | For sure it's a heavy approach, but if you need a
              | full-blown browser with JS, then that's just what you'll
              | have to do. Use the right tool for the job.
        
           | derac wrote:
           | Playwright is easily the best for browser automation. I still
           | use requests + beautiful soup often as well.
        
         | diarrhea wrote:
         | For JS, I've used Selenium with a Chrome driver, then parsed
         | the HTML with BeautifulSoup. I know nothing about web
         | development, so this might be an outdated way, but it worked.
         | BS was nice to deal with.
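          | 
          | That combination stays pleasantly short; a minimal sketch with
          | Selenium 4 (assuming a working Chrome/chromedriver setup):
          | 
          |     from bs4 import BeautifulSoup
          |     from selenium import webdriver
          | 
          |     driver = webdriver.Chrome()
          |     driver.get("https://example.com/")
          |     # page_source is the DOM after JavaScript has run.
          |     soup = BeautifulSoup(driver.page_source, "html.parser")
          |     driver.quit()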
        
           | wildrhythms wrote:
           | I used Puppeteer for this to great success. Very easy to set
           | up, and you control the whole browser + access to everything
           | you would have access to normally through Chrome dev tools.
        
         | NegativeLatency wrote:
          | It can be useful if it fits your case. The more recent scrapers
          | run a whole browser with automation, which can make scraping
          | stuff a lot easier, since JS will run, etc.
        
         | driscoll42 wrote:
         | If you don't need JS, I think it's still the best. Sure, that
         | doesn't always work for you, but once you start needing to
         | render JS the scraping slows down tremendously.
        
       | zasdffaa wrote:
       | For web scraping I used htmltidy
       | (https://en.wikipedia.org/wiki/HTML_Tidy) which cleaned it
       | sufficiently that I could run XSLT over it (gags at the memory)
        
       | iooi wrote:
       | I found lxml.html a lot easier to work with than bs4, in case
       | that helps anyone else.
       | 
       | https://lxml.de/lxmlhtml.html
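        | 
        | For comparison, a minimal lxml.html sketch (cssselect is a
        | separate package lxml uses for CSS selectors):
        | 
        |     from lxml import html
        | 
        |     doc = html.fromstring("<div><p>Hi</p><p>there</p></div>")
        |     print(doc.xpath("//p/text()"))   # ['Hi', 'there']
        |     print(doc.cssselect("div > p"))  # the two <p> elements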
        
         | mdaniel wrote:
         | On the off chance you were not aware, bs4 also supports[0]
         | getting parse events from html5lib[1] which (as its name
         | implies) is far more likely to parse the text the same way the
         | browser would
         | 
         | 0:
         | https://www.crummy.com/software/BeautifulSoup/bs4/doc/index....
         | 
         | 1: https://pypi.org/project/html5lib/
        
           | westurner wrote:
            | BeautifulSoup is an API for multiple parsers:
            | https://beautiful-soup-4.readthedocs.io/en/latest/#installin...
            | 
            |     BeautifulSoup(markup, "html.parser")
            |     BeautifulSoup(markup, "lxml")
            |     BeautifulSoup(markup, "lxml-xml")
            |     BeautifulSoup(markup, "xml")
            |     BeautifulSoup(markup, "html5lib")
            | 
            | Looks like lxml w/ xpath is still the fastest with Python
            | 3.10.4, from "Pyquery, lxml, BeautifulSoup comparison"
            | https://gist.github.com/MercuryRising/4061368; which is fine
            | for parsing (X)HTML(5) that validates.
            | 
            | (EDIT: Is xml/html5 a good format for data serialization?
            | defusedxml ... Simdjson, Apache arrow.js)
        
             | driscoll42 wrote:
             | I was curious, so I tried that performance test you linked
             | to on my machine with the various parsers:
              | 
              |     ==== Total trials: 100000 =====
              |     bs4 lxml total time: 110.9
              |     bs4 html.parser total time: 87.6
              |     bs4 lxml-xml total time: 0.5
              |     bs4 xml total time: 0.5
              |     bs4 html5lib total time: 103.6
              |     pq total time: 8.7
              |     lxml (cssselect) total time: 8.8
              |     lxml (xpath) total time: 5.6
              |     regex total time: 13.8 (doesn't find all p)
             | 
             | bs4 is damn fast with the lxml-xml or xml parsers
        
         | somat wrote:
          | Same here. I am unable to properly quantify it, but there was
          | something about the soup API I did not really like.
          | 
          | It may have been because I learned on Python's built-in
          | xml.etree library (I moved to lxml because it has the same API
          | but is faster and knows about parent nodes) and had a hard time
          | with the soup API.
          | 
          | But I think it was the way it overloaded the selectors: I did
          | not like the way you could magically find elements. I may have
          | to revisit it and try to figure out why, and whether I still do
          | not like it.
        
       | djtriptych wrote:
       | Great memories with this library, one of my all time favs.
       | 
        | Is it fast? No.
       | 
       | But it had a fantastic mission: extracting data from malformed
       | HTML.
       | 
       | Might be less common now but back then (~10+ years ago) it was
       | still rampant. Many if not most parsers would barf on any
       | deviation from the standard, leaving you to hand-roll regex
       | solutions and ugly corner cases.
       | 
       | BS covered a LOT of these cases without forcing you to write
       | terrible code. It mostly just worked, with a reasonable API, and
       | stellar, well-written, example-laden docs.
        
         | Der_Einzige wrote:
         | I remember that I needed to do something involving performance
         | with beautiful soup. Switching the HTML parser backend (as they
         | mention in the docs in BS4) gave me an order of magnitude
         | speedup...
        
         | anigbrowl wrote:
         | I'm not sure why you're using the past tense, the current
         | version is only a year old, and I will never stop using it
         | because panning messy datasets for hidden gold is my thing.
        
         | ssimpson wrote:
          | Absolutely. I wrote this mobile allergy data thing, and getting
          | the data was mostly scraping from news websites that did
         | terrible things with javascript to keep scrapers and ad
         | blockers out. Beautiful Soup worked past all of that easily.
         | Probably my favorite Python library.
        
         | jerf wrote:
         | "Might be less common now"
         | 
         | It is less necessary now. One of the most important parts of
         | the HTML5 standards, IMHO, is that it specifies how to parse
         | HTML that doesn't conform to the standards in a standard way.
         | In principle, every bag of bytes now has a standard-compliant
          | way to parse it that every HTML5 parser should agree on. I
          | don't use this enough to know how many edge cases the standard
          | and/or implementations have, but it's a lot better than it used
          | to be,
         | and means that every HTML5 parser has many of the capabilities
         | that Beautiful Soup used to (nearly-)uniquely have for parsing
         | messy HTML.
         | 
         | I suspect Beautiful Soup was a non-trivial aspect of how the
         | decision to implement such a spec was decided upon. It proved
         | the idea to be a very valuable one at a time when most
         | languages lacked such a library. Basically, BS won so hard that
         | while it wasn't necessarily _directly_ adopted as a standard,
         | the essence of it certainly was.
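          | 
          | A quick way to see that standardized recovery is bs4's html5lib
          | backend (html5lib implements the HTML5 parsing algorithm):
          | 
          |     from bs4 import BeautifulSoup
          | 
          |     soup = BeautifulSoup("<p>unclosed <b>tag<p>next", "html5lib")
          |     print(soup.body)
          |     # <body><p>unclosed <b>tag</b></p><p><b>next</b></p></body>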
        
           | ThatPlayer wrote:
           | Not just HTML standards, but I've used their detwingle
           | function because one of the sites I'm scraping has a mixture
           | of Windows-1252 and Unicode. It was clearly stored correctly,
           | but encoded differently depending on what page view you were
           | looking at. For example titles in an outline were broken, but
           | on the actual individual pages fine. Their rendering also
           | treated multibyte characters incorrectly during truncation.
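            | 
            | For reference, detwingle operates on the raw bytes before
            | parsing; this is essentially the example from the bs4 docs:
            | 
            |     from bs4 import UnicodeDammit
            | 
            |     snowmen = ("\N{SNOWMAN}" * 3).encode("utf8")
            |     quote = "\N{LEFT DOUBLE QUOTATION MARK}Hi!".encode("windows-1252")
            |     doc = snowmen + quote  # one document, two encodings
            |     fixed = UnicodeDammit.detwingle(doc)
            |     print(fixed.decode("utf8"))  # clean UTF-8 throughout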
        
         | cogman10 wrote:
         | I work in finance and this thing is still indispensable for us.
         | A LOT of financial info is only presented on a dynamically
         | rendered HTML page that was last written in the bad old days.
        
           | gjvc wrote:
           | Absolutely. Until you have had to deal with this kind of
            | problem _first-hand_, you have no idea how much of a relief
           | it is that it exists.
           | 
           | Sometimes, finding and using the right library can completely
           | turn around a f'd project.
        
             | xbar wrote:
             | True.
             | 
             | And BS has been that library for me on at least 2 such
             | projects.
        
         | toyg wrote:
         | _> Might be less common now_
         | 
         | That's because it's often deeply buried under more fashionable
         | abstractions.
        
         | dmayle wrote:
          | It's actually pretty trivial to speed up: if you have multiple
          | documents to parse, you can use multiprocessing.
          | 
          |     from multiprocessing import Pool
          | 
          |     from bs4 import BeautifulSoup
          | 
          |     def parse(html):
          |         result = []
          |         soup = BeautifulSoup(html, 'html.parser')
          |         for p in soup.select('div > p'):
          |             result.append(p.text)
          |         return result
          | 
          |     with Pool(processes=16) as pool:
          |         for texts in pool.imap_unordered(parse, my_html_texts):
          |             for text in texts:
          |                 print(text)
        
         | agumonkey wrote:
          | the jQuery of the backend
        
         | samwillis wrote:
         | Have literally been building something with BS today. It's very
        | much still a current library. I imagine in some areas people have
        | moved on, but I will continue to reach for it.
        
       | cpach wrote:
        | Does anyone know if there is a good equivalent for Go?
        
         | gaws wrote:
          | > Does anyone know if there is a good equivalent for Go
         | 
         | Yes: https://github.com/anaskhan96/soup
         | 
         | It works well.
        
         | mdaniel wrote:
         | I've heard https://github.com/gocolly/colly#readme mentioned
         | fondly, but I've never used it
        
           | kristianp wrote:
            |     c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            |         e.Request.Visit(e.Attr("href"))
            |     })
            | 
            | Sometimes I wish Go idioms included an iterator abstraction;
            | it's easier to understand and less hideous than that
            | functional callback style.
        
       | tills13 wrote:
        | Why is this kind of post allowed on HN? It's neither recent nor
        | relevant, and not specific in any meaningful way (literally linked
       | to the homepage). I occasionally see posts just linking to
       | Wikipedia articles as well, same sort of feel as this. At the
       | least, OP should have to offer some sort of discussion point or
       | tidbit from the linked content.
        
         | pauloday wrote:
         | Personally these and the Wikipedia posts are my favorite posts
          | on here. News is cool, but there are a lot of cool things that
         | don't change very often, and I love seeing those things too.
         | 
         | Also, in response to your "low effort posts do not invite
         | meaningful discussion" from a different comment, I don't see
          | how this is any lower effort than every other link-only post
         | (i.e. the vast majority)? And there's over 40 comments on this
         | thread now talking about other scrapers, projects you can do
         | with scrapers, better docs, tangential use cases and how to
         | handle them, etc. Seems like a lot of people have a variety of
         | things to say about this, I don't see how that's not
         | "meaningful discussion".
         | 
         | EDIT: I also disagree with requiring a couple of sentences from
         | the submitter. If they have something to say they can say it,
         | otherwise it's fine if they don't try to influence the
          | discussion - it's more interesting to see where the random
          | commenters take something than trying to chart a course.
        
         | sealeck wrote:
         | Because it's nice to exist outside the news cycle of what's
         | "current" every once in a while :)
        
           | omegalulw wrote:
            | You have other social media for that. Low-effort posts do not
           | invite meaningful discussion and add moderation load.
        
             | BenjiWiebe wrote:
             | However, I don't recall hearing the moderators complain
             | about it.
             | 
              | I'm guessing most of the difficult moderation would be on
              | the newsy, more controversial posts anyway.
        
             | sophacles wrote:
             | Did you know that you don't have to read every article
             | posted here, nor read every comment that is added?
             | 
             | If you don't want to participate in this post, it's OK to
             | skip it. I skip dozens of posts a day - the best part is
              | it's more efficient than going to them and putting in the
              | effort to whine!
             | 
             | As for not inviting meaningful discussion: there's some
             | good discussion on this post - the very article you claim
             | isn't capable of generating such.
        
       ___________________________________________________________________
       (page generated 2022-06-08 23:01 UTC)