[HN Gopher] Beautiful Soup
___________________________________________________________________
Beautiful Soup
Author : memorable
Score : 177 points
Date : 2022-06-08 13:32 UTC (9 hours ago)
(HTM) web link (www.crummy.com)
(TXT) w3m dump (www.crummy.com)
| js2 wrote:
| I used BS to scrape rogerebert.com and post his reviews to
| letterboxd:
|
| https://letterboxd.com/re2/
|
| I copied over only the first two paragraphs of each of his
| reviews with a link back to the original.
|
| The HTML is a total mess, having obviously moved around the web a
| couple of times. So it required a bunch of cleanup. But that wasn't
| even the hard part. The hard part was getting the correct TMDB ID
| for the movies because his reviews also have either no useful
| metadata or metadata that's wrong, like incorrect movie years,
| misspelled actor names, etc.
|
| I never was able to get API access to letterboxd, but they have a
| CSV import feature, which worked out well enough.
| ravishi wrote:
| I had my share of gigs where we just decided to scrape the old
| site with BS and extract structured data from there to render a
| new site. It was sometimes cheaper than dealing with their
| ancient ad-hoc CMS monstrosities.
| js2 wrote:
| After I managed to wrangle the review text from the HTML it
| still needed this sort of cleanup:
|
|     import re
|
|     def clean_text(text):
|         text = re.sub(r"[\x7f-\x9f]", "", text)   # remove control chars
|         text = re.sub(r"[\xa0\r\t]+", " ", text)  # replace with spaces
|         text = re.sub(r"\n+", "\n", text)         # squash runs of newlines
|         text = re.sub(r" +", " ", text)           # squash runs of spaces
|         # Remove newlines unless they appear to be at the end of a
|         # sentence or if the sentence is shorter than 80 characters.
|         text = re.sub(r"([^.?!\"\)])\n", r"\1 ", text)
|         text = re.sub(r"\n([^\n]{,80})\n", r"\1 ", text)
|         return text.strip()
| benibela wrote:
| I wrote my own HTML parser in Pascal 15 years ago. Pascal is much
| faster than Python.
| rawbot wrote:
| Just used this to scrape game guides from GameFAQs. Pretty easy to
| use, would recommend for quick projects.
| latchkey wrote:
| Years ago, I got the privilege of working at the same company
| with the author, Leonard Richardson. Really nice guy, super nerd
| and hilariously funny.
| jng wrote:
| We used this in a project many suns ago and we ended up switching
| to libxml2: less pretty presentation, but more functional. YMMV.
| brodouevencode wrote:
| bs4 introduced some very nice features over bs3, if that's what
| you were using, and includes the ability to use libxml2 as a
| parser. For very simple things though libxml2 would be a better
| fit.
| jamessb wrote:
| bs4 is able to parse some malformed documents that libxml2
| chokes on.
|
| For these cases it can be useful to do the reverse, and use
| the BeautifulSoup HTML parser as an alternative parser
| backend for the lxml package:
| https://lxml.de/elementsoup.html
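|
| For example, a minimal sketch of that reverse setup
| (illustrative markup):
|
|     from lxml.html import soupparser
|
|     # BeautifulSoup copes with the broken markup; lxml gets a tree
|     root = soupparser.fromstring("<p>Hello<p>world")
|     for p in root.findall(".//p"):
|         print(p.text)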
| frereubu wrote:
| I used this recently to scrape a bunch of online restaurant
| reviews by a guy who I really like, use regex to get the postcode
| from the markup, do a geocode using postcodes.io, then plot the
| reviews on a Google map. It took about two or three hours and felt
| kinda dirty in a good way. Beautiful Soup made the first part
| really easy.
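|
| The geocoding step is a one-liner against the free postcodes.io
| API, something like this (untested sketch; the postcode is just
| an example):
|
|     import requests
|
|     def geocode(postcode):
|         # postcodes.io returns lat/long for a UK postcode
|         r = requests.get(f"https://api.postcodes.io/postcodes/{postcode}")
|         result = r.json()["result"]
|         return result["latitude"], result["longitude"]
|
|     print(geocode("SW1A 1AA"))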
| cyberlurker wrote:
| Hmm, cool project idea but if the subject doesn't know you're
| doing this, it is a bit weird to stalk their online footprint.
|
| Unless you mean the person is a professional restaurant
| reviewer and you like their opinion.
|
| Oh well, good luck either way.
| tlavoie wrote:
| "Hmm, cool project idea but if the subject doesn't know
| you're doing this, it is a bit weird to stalk their online
| footprint."
|
| Wait, which one is the real cyberlurker? ;)
| frereubu wrote:
| Yes, they're a professional reviewer and I enjoy their
| reviews.
| oh_sigh wrote:
| Boy...15 years ago I was reaching for this and hpricot almost
| every week to do some cool scraping/parsing of some kind. I
| always loved the bs API.
| dj_gitmo wrote:
| Beautiful Soup got me my first job.
| stereocodes wrote:
| What's new about it? Why is this post here? Beautiful Soup has
| been around for a long, long time.
| sophacles wrote:
| See: https://news.ycombinator.com/newsguidelines.html
|
| Particularly: On-Topic: Anything that good hackers would find
| interesting. That includes more than hacking and startups. If
| you had to reduce it to a sentence, the answer might be:
| anything that gratifies one's intellectual curiosity.
| emehex wrote:
| Beautiful Soup is an incredibly robust and powerful tool.
| However, it can sometimes be an intimidating tool for
| beginners (which ".find_x_y_z" method should I use again?). To
| that end you should check out gazpacho (with just a single "find"
| method): https://github.com/maxhumber/gazpacho
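|
| Roughly, going from the project README (the exact signatures
| may have drifted since):
|
|     from gazpacho import get, Soup
|
|     html = get("https://www.crummy.com/software/BeautifulSoup/")
|     soup = Soup(html)
|     links = soup.find("a")  # the one and only find method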
| agentdrtran wrote:
| I used Beautiful Soup on one of my first successful programming
| projects. I still remember how easy it was, and it taught me a
| lot about Python.
| durpleDrank wrote:
| WWW::Mechanize fam where you at?
| gaws wrote:
| The downside is using Perl.
| dontbenebby wrote:
| I used Beautiful Soup for a project recently (grabbing a series
| of page titles for export to CSV), it's super useful, thanks for
| the pointer OP.
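|
| That whole pipeline fits in a few lines; a rough sketch (the
| URL list is hypothetical):
|
|     import csv
|     import requests
|     from bs4 import BeautifulSoup
|
|     urls = ["https://example.com/a", "https://example.com/b"]
|     with open("titles.csv", "w", newline="") as f:
|         writer = csv.writer(f)
|         writer.writerow(["url", "title"])
|         for url in urls:
|             soup = BeautifulSoup(requests.get(url).text, "html.parser")
|             writer.writerow([url, soup.title.string if soup.title else ""])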
| whiskey14 wrote:
| I find this to be a better version of the docs:
| https://beautiful-soup-4.readthedocs.io/en/latest/#
|
| Just in case someone wants a comment overview of what this
| superbly named library is: web scraping (HTML parsing) in Python.
| wrycoder wrote:
| The crummy.com page includes several suggestions to subscribe
| to Tidelift.
| scanr wrote:
| The author also wrote a novel called Constellation Games which I
| enjoyed a lot https://constellation.crummy.com/
| rnx wrote:
| I fondly remember being introduced to this library as part of the
| first project I worked on at my first development job. I was
| lucky there; it was the right challenge at the right time.
| aedocw wrote:
| https://web.archive.org/web/20220608133224/https://www.crumm...
| phone8675309 wrote:
| How helpful is this when you're dealing with a website that does
| not degrade gracefully and insists on using JavaScript to shove
| things in where a static webpage would work? (For example,
| scraping football scores from NFL.com)
| marginalia_nu wrote:
| Easiest in that case is probably to use something like headless
| Chrome. But that is also significantly more demanding in terms
| of resources.
| werds wrote:
| It's not useful in those cases, but usually for those JS-
| rendered sites you can replicate the AJAX requests which happen
| and get nicely formed JSON documents to parse through instead.
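|
| i.e. find the XHR call in the browser's network tab and hit it
| directly. Something like this (the endpoint and response shape
| here are made up):
|
|     import requests
|
|     # the same call the page's JS makes, found via devtools
|     data = requests.get("https://example.com/api/scores.json").json()
|     for game in data["games"]:
|         print(game["home"], game["away"], game["score"])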
| squaresmile wrote:
| Or the data is stored in js objects within script tags in the
| html and can be extracted programmatically. It's getting
| common with SSG sites using SPA frameworks.
|
| For example, the new Google Play Store website stores the
| data in AF_initDataCallback calls, which can be extracted with:
|
|     re.findall(
|         r'<script nonce="\S+">AF_initDataCallback\((.*?)\);',
|         html_string)
| phone8675309 wrote:
| I used to do that when I was responsible for a set of web
| crawlers to extract public records data, but the problem is
| that changes happen and these sorts of things become out of
| date fairly quickly.
|
| Getting this working in a headless browser driven by Selenium
| would probably be easier for maintainability.
| 867-5309 wrote:
| Nowadays you usually have to submit HTTP headers and cookies
| too; that's always a fun process of elimination.
| radus wrote:
| In those cases you might want to check out SeleniumBase:
| https://seleniumbase.io/
| [deleted]
| sergiotapia wrote:
| If you're lucky, those sites have the raw data as a server-side
| generated JSON payload right in the page's source markup.
|
| For example Target is clientside, but has all the data in a
| `window.FOOBAR = json` variable you can fetch and parse with
| some substring magic. Much easier than spinning up chromedriver
| and some package.
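|
| The "substring magic" can be as simple as this (window.FOOBAR
| is a stand-in name, as above):
|
|     import json
|     import requests
|
|     html = requests.get("https://example.com/product").text
|     start = html.index("window.FOOBAR = ") + len("window.FOOBAR = ")
|     end = html.index(";</script>", start)  # assumes this terminator
|     data = json.loads(html[start:end])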
| pmoriarty wrote:
| Is Beautiful Soup still the best way to scrape the web with
| python?
|
| IIRC, Beautiful Soup doesn't handle JavaScript, so at least for
| JS you're forced to use something else.
|
| I'm also looking forward to seeing how people scrape the web once
| WebAssembly becomes prevalent.
| move-on-by wrote:
| I've found Playwright to be a really great tool for scraping.
| malshe wrote:
| This sounds interesting. Any resources for a beginner? I use
| Selenium regularly.
| simonw wrote:
| Playwright for Python has really good documentation:
| https://playwright.dev/python/
|
| I used it for my https://shot-scraper.datasette.io/ tool,
| and wrote a bit about CLI-driven scraping using that tool
| here: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
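|
| The basic render-then-parse loop is short with the sync API; a
| sketch that hands the rendered HTML to Beautiful Soup:
|
|     from bs4 import BeautifulSoup
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch()
|         page = browser.new_page()
|         page.goto("https://example.com")
|         html = page.content()  # HTML after JS has run
|         browser.close()
|
|     soup = BeautifulSoup(html, "html.parser")
|     print(soup.title.string)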
| samwillis wrote:
| Headless scraping is in the region of 10x slower and more
| resource intensive, even when carefully blocking requests
| such as images. It should always be a second choice.
|
| Other than that Playwright is incredible, by far the best
| browser automation API.
| move-on-by wrote:
| For sure it's a heavy approach, but if you need a full
| blown browser with JS, then that's just what you'll have to
| do. Use the right tool for the job.
| derac wrote:
| Playwright is easily the best for browser automation. I still
| use requests + beautiful soup often as well.
| diarrhea wrote:
| For JS, I've used Selenium with a Chrome driver, then parsed
| the HTML with BeautifulSoup. I know nothing about web
| development, so this might be an outdated way, but it worked.
| BS was nice to deal with.
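|
| For anyone curious, that combination is only a few lines
| (sketch; assumes a working chromedriver):
|
|     from bs4 import BeautifulSoup
|     from selenium import webdriver
|
|     driver = webdriver.Chrome()
|     driver.get("https://example.com")
|     soup = BeautifulSoup(driver.page_source, "html.parser")
|     driver.quit()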
| wildrhythms wrote:
| I used Puppeteer for this to great success. Very easy to set
| up, and you control the whole browser + access to everything
| you would have access to normally through Chrome dev tools.
| NegativeLatency wrote:
| It can be useful if it fits your case. The more recent scrapers
| run a whole browser with automation, which can make scraping
| stuff a lot easier since JS will run, etc.
| driscoll42 wrote:
| If you don't need JS, I think it's still the best. Sure, that
| doesn't always work for you, but once you start needing to
| render JS the scraping slows down tremendously.
| zasdffaa wrote:
| For web scraping I used htmltidy
| (https://en.wikipedia.org/wiki/HTML_Tidy) which cleaned it
| sufficiently that I could run XSLT over it (gags at the memory)
| iooi wrote:
| I found lxml.html a lot easier to work with than bs4, in case
| that helps anyone else.
|
| https://lxml.de/lxmlhtml.html
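|
| For comparison, the lxml.html flavor looks roughly like this
| (illustrative URL):
|
|     import lxml.html
|     import requests
|
|     doc = lxml.html.fromstring(requests.get("https://example.com").text)
|     for href in doc.xpath("//a/@href"):
|         print(href)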
| mdaniel wrote:
| On the off chance you were not aware, bs4 also supports[0]
| getting parse events from html5lib[1] which (as its name
| implies) is far more likely to parse the text the same way the
| browser would
|
| 0:
| https://www.crummy.com/software/BeautifulSoup/bs4/doc/index....
|
| 1: https://pypi.org/project/html5lib/
| westurner wrote:
| BeautifulSoup is an API for multiple parsers
| https://beautiful-soup-4.readthedocs.io/en/latest/#installin... :
|
|     BeautifulSoup(markup, "html.parser")
|     BeautifulSoup(markup, "lxml")
|     BeautifulSoup(markup, "lxml-xml")
|     BeautifulSoup(markup, "xml")
|     BeautifulSoup(markup, "html5lib")
|
| Looks like lxml w/ xpath is still the fastest with Python
| 3.10.4 from "Pyquery, lxml, BeautifulSoup comparison"
| https://gist.github.com/MercuryRising/4061368 ; which is fine
| for parsing (X)HTML(5) that validates.
|
| (EDIT: Is xml/html5 a good format for data serialization?
| defusedxml ... Simdjson, Apache arrow.js)
| driscoll42 wrote:
| I was curious, so I tried that performance test you linked
| to on my machine with the various parsers:
|
|     ==== Total trials: 100000 =====
|     bs4 lxml          total time: 110.9
|     bs4 html.parser   total time: 87.6
|     bs4 lxml-xml      total time: 0.5
|     bs4 xml           total time: 0.5
|     bs4 html5lib      total time: 103.6
|     pq                total time: 8.7
|     lxml (cssselect)  total time: 8.8
|     lxml (xpath)      total time: 5.6
|     regex             total time: 13.8 (doesn't find all p)
|
| bs4 is damn fast with the lxml-xml or xml parsers.
| somat wrote:
| Same here. I am unable to properly quantify it, but there was
| something about the soup API I did not really like.
|
| It may have been because I learned on Python's standard-library
| xml.etree (I moved to lxml because it has the same API but is
| faster and knows about parent nodes) and had a hard time with
| the soup API.
|
| But I think it was the way it overloaded the selectors. I did
| not like the way you could magically find elements. I may have
| to revisit it and try and figure out why and if I still do not
| like it.
| djtriptych wrote:
| Great memories with this library, one of my all time favs.
|
| Is it fast? No.
|
| But it had a fantastic mission: extracting data from malformed
| HTML.
|
| Might be less common now but back then (~10+ years ago) it was
| still rampant. Many if not most parsers would barf on any
| deviation from the standard, leaving you to hand-roll regex
| solutions and ugly corner cases.
|
| BS covered a LOT of these cases without forcing you to write
| terrible code. It mostly just worked, with a reasonable API, and
| stellar, well-written, example-laden docs.
| Der_Einzige wrote:
| I remember that I needed to do something involving performance
| with beautiful soup. Switching the HTML parser backend (as they
| mention in the docs in BS4) gave me an order of magnitude
| speedup...
| anigbrowl wrote:
| I'm not sure why you're using the past tense, the current
| version is only a year old, and I will never stop using it
| because panning messy datasets for hidden gold is my thing.
| ssimpson wrote:
| Absolutely. I wrote this mobile allergy data thing, and getting
| the data was mostly scraping from news websites that did
| terrible things with javascript to keep scrapers and ad
| blockers out. Beautiful Soup worked past all of that easily.
| Probably my favorite Python library.
| jerf wrote:
| "Might be less common now"
|
| It is less necessary now. One of the most important parts of
| the HTML5 standards, IMHO, is that it specifies how to parse
| HTML that doesn't conform to the standards in a standard way.
| In principle, every bag of bytes now has a standard-compliant
| way to parse it that every HTML5 parser should agree on. I
| don't use this enough to know how many edge cases the standard
| and/or implementations have, but it's a lot better than it used
| to be, and means that every HTML5 parser has many of the
| capabilities that Beautiful Soup used to (nearly-)uniquely have
| for parsing messy HTML.
|
| I suspect Beautiful Soup played a non-trivial part in how the
| decision to implement such a spec came about. It proved
| the idea to be a very valuable one at a time when most
| languages lacked such a library. Basically, BS won so hard that
| while it wasn't necessarily _directly_ adopted as a standard,
| the essence of it certainly was.
| ThatPlayer wrote:
| Not just HTML standards, but I've used their detwingle
| function because one of the sites I'm scraping has a mixture
| of Windows-1252 and Unicode. It was clearly stored correctly,
| but encoded differently depending on what page view you were
| looking at. For example titles in an outline were broken, but
| on the actual individual pages fine. Their rendering also
| treated multibyte characters incorrectly during truncation.
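|
| For reference, detwingle works on the raw bytes before parsing;
| something like:
|
|     from bs4 import BeautifulSoup, UnicodeDammit
|
|     raw = open("page.html", "rb").read()  # mixed UTF-8/Windows-1252
|     fixed = UnicodeDammit.detwingle(raw)  # now consistently UTF-8
|     soup = BeautifulSoup(fixed, "html.parser")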
| cogman10 wrote:
| I work in finance and this thing is still indispensable for us.
| A LOT of financial info is only presented on a dynamically
| rendered HTML page that was last written in the bad old days.
| gjvc wrote:
| Absolutely. Until you have had to deal with this kind of
| problem _first-hand_, you have no idea how much of a relief
| it is that it exists.
|
| Sometimes, finding and using the right library can completely
| turn around a f'd project.
| xbar wrote:
| True.
|
| And BS has been that library for me on at least 2 such
| projects.
| toyg wrote:
| _> Might be less common now_
|
| That's because it's often deeply buried under more fashionable
| abstractions.
| dmayle wrote:
| It's actually pretty trivial to speed up: if you have multiple
| documents to parse, you can use multiprocessing.
|
|     from multiprocessing import Pool
|     from bs4 import BeautifulSoup
|
|     def parse(html):
|         result = []
|         soup = BeautifulSoup(html, 'html.parser')
|         for p in soup.select('div > p'):
|             result.append(p.text)
|         return result
|
|     # my_html_texts: an iterable of HTML strings
|     with Pool(processes=16) as pool:
|         for texts in pool.imap_unordered(parse, my_html_texts):
|             for text in texts:
|                 print(text)
| agumonkey wrote:
| the jQuery of the backend
| samwillis wrote:
| Have literally been building something with BS today. It's very
| much still a current library. I imagine in some areas people
| have moved on, but I will continue to reach for it.
| cpach wrote:
| Does anyone know if there is a good equivalent for Go?
| gaws wrote:
| > Does anyone know if there is a good equivalent for Go
|
| Yes: https://github.com/anaskhan96/soup
|
| It works well.
| mdaniel wrote:
| I've heard https://github.com/gocolly/colly#readme mentioned
| fondly, but I've never used it
| kristianp wrote:
|     c.OnHTML("a[href]", func(e *colly.HTMLElement) {
|         e.Request.Visit(e.Attr("href"))
|     })
|
| Sometimes I wish Go idioms included an iterator abstraction;
| it's easier to understand and less hideous than that
| functional callback style.
| tills13 wrote:
| Why is this kind of post allowed on HN? It's neither recent nor
| relevant, and not specific in any meaningful way (literally linked
| to the homepage). I occasionally see posts just linking to
| Wikipedia articles as well, same sort of feel as this. At the
| least, OP should have to offer some sort of discussion point or
| tidbit from the linked content.
| pauloday wrote:
| Personally these and the Wikipedia posts are my favorite posts
| on here. News is cool, but there's a lot of cool things that
| don't change very often, and I love seeing those things too.
|
| Also, in response to your "low effort posts do not invite
| meaningful discussion" from a different comment, I don't see
| how this is any lower effort than every other link only post
| (i.e. the vast majority)? And there's over 40 comments on this
| thread now talking about other scrapers, projects you can do
| with scrapers, better docs, tangential use cases and how to
| handle them, etc. Seems like a lot of people have a variety of
| things to say about this, I don't see how that's not
| "meaningful discussion".
|
| EDIT: I also disagree with requiring a couple of sentences from
| the submitter. If they have something to say they can say it,
| otherwise it's fine if they don't try to influence the
| discussion - it's more interesting to see where the random
| commenters take something than trying to chart a course.
| sealeck wrote:
| Because it's nice to exist outside the news cycle of what's
| "current" every once in a while :)
| omegalulw wrote:
| You have other social media for that. Low effort posts do not
| invite meaningful discussion and add moderation load.
| BenjiWiebe wrote:
| However, I don't recall hearing the moderators complain
| about it.
|
| I'm guessing most of the difficult moderation would be on
| the newsy more-controversial posts anyways.
| sophacles wrote:
| Did you know that you don't have to read every article
| posted here, nor read every comment that is added?
|
| If you don't want to participate in this post, it's OK to
| skip it. I skip dozens of posts a day - the best part is
| it's more efficient than going to them and putting in the
| effort to whine!
|
| As for not inviting meaningful discussion: there's some
| good discussion on this post - the very article you claim
| isn't capable of generating such.
___________________________________________________________________
(page generated 2022-06-08 23:01 UTC)