[HN Gopher] Show HN: Lightpanda, an open-source headless browser...
___________________________________________________________________
Show HN: Lightpanda, an open-source headless browser in Zig
We're Francis and Pierre, and we're excited to share Lightpanda
(https://lightpanda.io), an open-source headless browser we've been
building from scratch in Zig for the past 2 years (not dependent on
Chromium or Firefox). It's a faster and lighter alternative for
headless operations without any graphical rendering.

Why start over? We worked a lot with Chrome headless at our previous
company, scraping millions of web pages per day. While it's
powerful, it's also heavy on CPU and memory usage. For scraping at
scale, building AI agents, or automating websites, the overhead is
high. So we asked ourselves: what if we built a browser that only
did what's absolutely necessary for headless automation?

Our browser is made of the following main components:

- an HTTP loader
- an HTML parser and DOM tree (based on the Netsurf libs)
- a JavaScript runtime (v8)
- partial Web APIs support (currently DOM and XHR/Fetch)
- a CDP (Chrome DevTools Protocol) server to allow plug-and-play
connection with existing scripts (Puppeteer, Playwright, etc.)

The main idea is to avoid any graphical rendering and just work with
data manipulation, which in our experience covers a wide range of
headless use cases (excluding some, like screenshot generation). In
our current test case, Lightpanda is roughly 10x faster than Chrome
headless while using 10x less memory.

It's a work in progress: there are hundreds of Web APIs, and for now
we support only some of them. It's a beta version, so expect most
websites to fail or crash. The plan is to increase coverage over
time.

We chose Zig for its seamless integration with C libs and its
_comptime_ feature, which allows us to generate bi-directional
native-to-JS APIs (see our zig-js-runtime lib
https://github.com/lightpanda-io/zig-js-runtime). And of course for
its performance :)

As a company, our business model is based on a managed cloud,
browser as a service. Currently, this is primarily powered by
Chrome, but as we integrate more Web APIs it will gradually
transition to Lightpanda.

We would love to hear your thoughts and feedback. Where should we
focus our efforts next to support your use cases?
Author : fbouvier
Score : 33 points
Date : 2025-01-24 22:15 UTC (44 minutes ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fbouvier wrote:
| Author here. The browser is made from scratch (not based on
| Chromium/Webkit), in Zig, using v8 as a JS engine.
|
| Our idea is to build a lightweight browser optimized for AI use
| cases like LLM training and agent workflows. And more generally
| any type of web automation.
|
| It's a work in progress, there are hundreds of Web APIs, and for
| now we just support some of them (DOM, XHR, Fetch). So expect
| most websites to fail or crash. The plan is to increase coverage
| over time.
|
| Happy to answer any questions.
| toobulkeh wrote:
| I'd love to see better-optimized WebSocket support, and "save"
| features that cache LLM queries to optimize fallback.
| JoelEinbinder wrote:
| When I've talked to people running this kind of AI
| scraping/agent workflow, the costs of the AI parts dwarf those
| of the web browser parts, which makes the computational cost of
| the browser irrelevant. I'm curious what situation you got
| yourself into where optimizing the browser results in
| meaningful savings. I'd also like to be in that place!
|
| I think your RAM usage benchmark is deceptive. I'd expect a
| minimal browser to have much lower peak memory usage than
| Chrome on a minimal website, but it should even out or get
| worse as the websites get richer. The nature of web scraping is
| that the worst sites take up the vast majority of your CPU
| cycles. I don't think lowering the RAM usage of the browser
| process will have much real-world impact.
| refulgentis wrote:
| Generally, for consumer use cases, it's best to A) do it
| locally, preserving some of the original web contract B) run
| JS to get actual content C) post-process to reduce inference
| cost D) get latency as low as possible
|
| Then, as the article points out, the Big Guns making the LLMs
| are a big use case for this because they get a 10x speedup
| and can begin contemplating running JS.
|
| It sounds like the people you've talked to are in a messy
| middle: no incentive to improve efficiency of loading pages,
| simply because there's something else in the system that has
| a fixed cost to it.
|
| I'm not sure why that would rule out improving anything else;
| it doesn't seem they should be stuck doing nothing but
| flailing around for cheaper LLM inference.
|
| > I think your RAM usage benchmark is deceptive. I'd expect
| a minimal browser to have much lower peak memory usage than
| chrome on a minimal website.
|
| I'm a bit lost: the RAM usage benchmark says it's ~10x less,
| and you feel it's deceptive because you'd expect RAM usage
| to be less? Steelmanning: 10% of Chrome's usage is _still_
| too high?
| JoelEinbinder wrote:
| The benchmark shows lower RAM usage on a very simple demo
| website. I expect that if the benchmark ran on a random set
| of real websites, RAM usage would not be meaningfully lower
| than Chrome's. Happy to be impressed and wrong if it remains
| lower.
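For anyone who wants to test that expectation on their own site
list, peak memory of a browser run can be compared with a tiny
harness. A rough Unix-only sketch using the OS's peak-RSS accounting
for child processes (the browser command lines in the comments are
illustrative, not exact flags, and this is not the benchmark the
authors used):

```python
import resource
import subprocess
import sys

def peak_child_rss_kb(cmd: list[str]) -> int:
    """Run cmd to completion and return the peak RSS of child processes, in KiB.

    Note: RUSAGE_CHILDREN accumulates over all waited-for children, so
    run one measurement per harness process for clean numbers.
    """
    subprocess.run(cmd, check=True)
    kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    if sys.platform == "darwin":
        kb //= 1024  # macOS reports bytes, Linux reports kilobytes
    return kb

# Hypothetical usage: compare the same URL across browsers, e.g.
#   peak_child_rss_kb(["chrome", "--headless", "--dump-dom", url])
#   peak_child_rss_kb(["lightpanda", "fetch", "--dump", url])
print(peak_child_rss_kb([sys.executable, "-c", "x = 'a' * 10_000_000"]))
```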
| fbouvier wrote:
| I believe it will still be significantly lower, as we skip
| the graphical rendering.
|
| But to validate that, we need to increase our Web APIs
| coverage.
| fbouvier wrote:
| The cost of the browser part is still a problem. In our
| previous startup, we were scraping >20 million webpages per
| day, with thousands of instances of Chrome headless running
| in parallel.
|
| Regarding the RAM usage, it's still ~10x better than Chrome
| :) It seems to come mostly from v8; I guess that we could do
| better with a lightweight JS engine alternative.
| Tostino wrote:
| You may reduce RAM, but also performance. A good JIT costs
| RAM.
| fbouvier wrote:
| Yes, that's true. It's a balance between RAM and speed.
|
| I was thinking more of use cases that require disabling the
| JIT anyway (WASM, iOS integration, security).
| Tostino wrote:
| Yeah, could be nice to allow the user to select the type
| of ECMAScript engine that fits their use-case /
| performance requirements (balancing the resources
| available).
| cush wrote:
| > there are hundreds of Web APIs, and for now we just
| support some of them (DOM, XHR, Fetch)
|
| > it's still ~10x better than Chrome
|
| Do you expect it to stay that way once you've reached
| parity?
| nwienert wrote:
| Playwright can run WebKit very easily, and it's dramatically
| less resource-intensive than Chrome.
| dtj1123 wrote:
| Very nice. Does this / will this support the puppeteer-extra
| stealth plugin?
| katiehallett wrote:
| Thanks! Right now no, but since we use CDP (Playwright,
| Puppeteer), I guess it would be possible to support it.
| sesm wrote:
| Great job! And good luck on your journey!
|
| One question: which JS engines did you consider, and why did
| you choose V8 in the end?
| fbouvier wrote:
| We also considered JavaScriptCore (used by Bun) and
| QuickJS. We chose v8 because it's state of the art, quite
| well documented, and easy to embed.
|
| The code is made to support other JS engines in the future.
| We do want to add a lightweight alternative like QuickJS or
| Kiesel https://kiesel.dev/
| bityard wrote:
| Please put a priority on making it hard to abuse the web with
| your tool.
|
| At a _bare_ minimum, that means obeying robots.txt and NOT
| crawling a site that doesn't want to be crawled. And there
| should not be an option to override that. It goes without
| saying that you should not allow users to make hundreds or
| thousands of "blind" parallel requests as these tend to have
| the effect of DoSing sites that are being hosted on modest
| hardware. You should also be measuring response times and
| throttling your requests accordingly. If a website issues a
| response code or other signal that you are hitting it too fast
| or too often, slow down.
|
| I say this because since around the start of the new year, AI
| bots have been ravaging what's left of the open web and causing
| REAL stress and problems for admins of small and mid-sized
| websites and their human visitors:
| https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
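For what it's worth, the robots.txt part of this is cheap to honor.
A minimal sketch with Python's stdlib parser (illustrative only, not
Lightpanda code; a real crawler would fetch the live file via
`set_url()`/`read()` instead of inlining it):

```python
from urllib.robotparser import RobotFileParser

# Parse an inlined robots.txt for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

# Check a URL against the rules before fetching it.
print(rp.can_fetch("MyScraperBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyScraperBot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("MyScraperBot"))  # 5
```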
| gkbrk wrote:
| Please don't.
|
| Software I installed on my computer needs to do what I want
| as the user. I don't want every random thing I install to
| come with DRM.
|
| The project looks useful, and if it ends up getting popular I
| imagine someone would make a DRM-free version anyway.
| tossandthrow wrote:
| Where do you read DRM?
|
| Parent commenter merely and humbly asks the author of the
| library to make sure that it has sane defaults and support
| for ethical crawling.
|
| I find it disturbing that you would recommend against that.
| gkbrk wrote:
| Here's what the parent comment wrote.
|
| > And there should not be an option to override that.
|
| This is not just a sane default. This is software telling
| you what you are allowed to do based on what the rights
| owner wants, literally DRM.
|
| This is exactly like Android not allowing screenshots to
| be taken in certain apps because the rights owner didn't
| allow it.
| blacksmith_tb wrote:
| Not sure what "digital rights" that "manages"? I don't
| see it as an unreasonable suggestion that the tool
| shouldn't be set up out of the box to DoS the sites it's
| scraping. That doesn't prevent anyone technical enough to
| know what they're doing from forking it and removing
| whatever limits are there by default. I can't see it as a
| "my computer should do what I want!" issue; if you don't
| like how this package works, change it or use another.
| benatkin wrote:
| Digital Restrictions Management, then. Have it your way.
| JambalayaJimbo wrote:
| This is software telling you what you are allowed to do
| based on what the software developer wants* (assuming the
| developer cares, of course...), which is how all software
| works. I would not want users of my software doing
| anything malicious with it, so I would not give them the
| option.
|
| If I create an open-source messaging app, I am also not
| going to give users the option of clicking a button to
| spam recipients with dick pics, even if it was dead-simple
| for a determined user to add code for this button
| themselves.
| bityard wrote:
| I feel like you may have a misunderstanding of what DRM is.
| Talking about DRM outside the context of media distribution
| doesn't really make any sense.
|
| Yes, someone can fork this and modify it however they want.
| They can already do the same with curl, Firefox, Chromium,
| etc. The point is that this project is deliberately
| advertising itself as an AI-friendly web scraper. If
| successful, lots of people who don't know any better are
| going to download it and deploy it without fully
| understanding (and possibly without caring about) the
| consequences for the open web. And as I already pointed
| out, this is not hypothetical; it is already happening.
| Right now. As we speak.
|
| Do you want cloudflare everywhere? This is how you get
| cloudflare everywhere.
|
| My plea for the dev is that they choose to take the high
| road and put web-server-friendly SANE DEFAULTS in place to
| curb the bulk of abusive web scraping behavior to lessen
| the number of gray hairs it causes web admins like myself.
| That is all.
| randunel wrote:
| It's exactly DRM, management of legal access to digital
| content. The "media" part has been optional for decades.
|
| The comment they replied to didn't suggest sane defaults,
| but DRM. Here's the quote, no defaults work that way
| (inability to override):
|
| > At a _bare_ minimum, that means obeying robots.txt and
| NOT crawling a site that doesn't want to be crawled. And
| there should not be an option to override that.
| samatman wrote:
| I'll also add something that I expect to be somewhat
| controversial, given earlier conversations on HN[0]: I
| see contexts in which it would be perfectly valid to use
| this and ignore robots.txt.
|
| If I were directing some LLM agent to specifically access
| a site on my behalf, and get a usable digest of that
| information to answer questions, or whatever, that use of
| the headless browser is not a spider; it's a user agent.
| Just an unusual one.
|
| The amount of traffic generated is consistent with
| browsing, not scraping. So no, I don't think building in
| a mandatory robots.txt respecter is a reasonable ask.
| Someone who wants to deploy it at scale while ignoring
| robots.txt is just going to disable that, and it causes
| problems for legitimate use cases where the headless
| browser is not a robot in any reasonable or normal
| interpretation of the term.
|
| [0]: I don't entirely understand why this is
| controversial, but it was.
| benatkin wrote:
| > Talking about DRM outside the context of media
| distribution doesn't really make any sense.
|
| It's a cultural thing, and it makes a lot of sense. This
| fits with DRM culture that has walled gardens in iOS and
| Android.
| benatkin wrote:
| Make it faster and furiouser.
| afk1914 wrote:
| I am curious how Lightpanda compares to chrome-headless-shell
| ({headless: 'shell'} in Puppeteer) in benchmarks.
| fbouvier wrote:
| We did not run benchmarks against chrome-headless-shell (aka
| the old headless mode), but I'd guess that performance-wise
| it's on the same scale as the new headless mode.
| xena wrote:
| How do I make sure that people can't use lightpanda to bypass
| bot protection tools?
| danielsht wrote:
| Very impressive! At Airtop.ai we looked into lightweight
| browsers like this one since we run a huge fleet of cloud
| browsers but found that anything other than a non-headless
| Chromium based browser would trigger bot detection pretty
| quickly. Even spoofing user agents triggers bot detection
| because fingerprinting tools like FingerprintJS will use things
| like JS features, canvas fingerprinting, WebGL fingerprinting,
| font enumeration, etc.
|
| Can you share if you've looked into how your browser fares
| against bot detection tools like these?
| fbouvier wrote:
| Thanks! No we haven't worked on bot detection.
| cropcirclbureau wrote:
| Pretty cool. Do you have a list of features you plan to support
| and plan to cut? Also, how much does this differ from the DOM
| impls that test frameworks use? I recall Jest or someone sporting
| such a feature.
| fbouvier wrote:
| The most important "feature" is to increase our Web APIs
| coverage :)
|
| But of course we plan to add other features, including:
|
| - tight integration with LLMs
|
| - an embed mode (as a C library and as a WASM module) so you
| can add a real browser to your project the same way you add
| libcurl
| andrethegiant wrote:
| Could it potentially fit in a Cloudflare worker? Workers are
| also V8 and can run wasm, but are constrained to 128MB RAM
| and 10MB zipped bundle size
| fbouvier wrote:
| WASM support is not there yet, but it's on the roadmap; we've
| had it in mind since the beginning of the project and have
| made our dev choices accordingly.
|
| So yes, it could be used in a serverless platform like
| Cloudflare Workers. Our startup time is a huge advantage
| here (20ms vs 600ms for Chrome headless in our local
| tests).
|
| Regarding v8 in Cloudflare Workers, I think it can't be used
| directly, i.e. we would still need to embed a JS engine in
| the wasm module.
| m3kw9 wrote:
| How does this work? The browser needs to render a page, and
| the vision model needs to know where a button is, so it still
| needs to see an image. How does headless make it easier?
| katiehallett wrote:
| Headless mode skips the visual rendering meant for humans, but
| the DOM structure and layout still exist, allowing the model to
| parse elements programmatically (e.g. button locations).
| Instead of 'seeing' an image, the model interacts with the
| page's underlying structure, which is faster and more
| efficient. Our browser removes the rendering engine as well, so
| it won't handle 100% of automation use cases, but it's also
| what allows us to be faster and lighter than Chrome in headless
| mode.
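As a toy illustration of "parse elements programmatically": the
sketch below walks HTML for buttons with Python's stdlib parser.
Real automation would query the live DOM through CDP or Puppeteer
selectors rather than re-parsing markup, so treat this purely as a
picture of the idea.

```python
from html.parser import HTMLParser

class ButtonFinder(HTMLParser):
    """Collect <button> elements (attributes and text) from an HTML stream."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._in_button = False

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self.buttons.append({"attrs": dict(attrs), "text": ""})

    def handle_data(self, data):
        if self._in_button and self.buttons:
            self.buttons[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False

finder = ButtonFinder()
finder.feed('<form><button id="submit" type="submit">Buy now</button></form>')
print(finder.buttons)  # [{'attrs': {'id': 'submit', 'type': 'submit'}, 'text': 'Buy now'}]
```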
| wiradikusuma wrote:
| But what if the human programmer needs to visually verify
| that their code works by eyeballing which element got
| selected, etc?
| fbouvier wrote:
| You're right, the debugging part is a good use case for
| graphical rendering in a headless environment.
|
| I see it as a build time/runtime question. At build (dev)
| time I want to have a graphical response (debugging,
| computer vision, etc.). And then, when the script is ready,
| I can use Lightpanda at runtime as a lightweight
| alternative.
| chrisweekly wrote:
| If you want a human to eyeball it, you don't use a
| "headless" browser.
| dolmen wrote:
| The human programmer can save the DOM as HTML in a file and
| open it in a headful browser.
|
| But the use case for Lightpanda is for machine agents, not
| humans.
| 10000truths wrote:
| The issue is that DOM structure does not correspond one-to-
| one with perceived structure. I could render things in the
| DOM that aren't visible to people (e.g. a transparent 5px x
| 5px button), or render things to people that aren't visible
| in the DOM (e.g. Facebook's DOM obfuscation shenanigans to
| evade ad-blocking, or rendering custom text to a WebGL
| canvas). Sure, most websites don't go that far, but most
| websites also aren't valuable targets for automated
| crawling/scraping. These kinds of disparities _will_ be
| exploited to detect and block automated agents if browser
| automation becomes sufficiently popular, and then we're back
| to needing to render the whole browser and operate on the
| rendered image to keep ahead of the arms race.
| weinzierl wrote:
| If I don't need JavaScript or any interactivity, just modern HTML
| + modern CSS, is there any modern lightweight renderer to png or
| svg?
|
| Something in the spirit of wkhtmltoimage or WeasyPrint that
| does not require a full-blown browser, but is more modern,
| with support for recent HTML and CSS?
|
| In a sense this is Lightpanda's complement to a "full panda".
| Just the fully rendered DOM to pixels.
| nicoburns wrote:
| We're working on this here: https://github.com/DioxusLabs/blitz
| See the "screenshot" example for rendering to png. There's no
| SVG backend currently, but one could be added.
|
| (proper announcement of project coming soon)
| Kathc wrote:
| An open-source browser built from scratch is bold. What inspired
| the development of Lightpanda?
| katiehallett wrote:
| Thanks! The three of us worked together at our former
| company, an ecommerce SaaS startup, where we spent a ton of
| $ on scraping infrastructure spinning up headless Chrome
| instances.
|
| It started out as more of an R&D thesis: is it possible to
| strip the graphical rendering out of Chrome headless? It
| turns out the answer is no, so we tried building it from
| scratch instead. And the beta results validated the thesis.
|
| I wrote a whole thing about it here if you're interested in
| delving deeper
| https://substack.thewebscraping.club/p/rethinking-the-web-br...
| corford wrote:
| Not sure what category of ecomm sites you were scraping but I
| scrape >10million ecomm URLs daily and, honestly, in my
| experience the compute is not a major issue (8 times out of
| 10 you can either use API endpoints and/or session stuffing
| to avoid needing a browser for every request; and in the 2
| out of 10 sites where you really need a browser for all
| requests it's usually to circumvent aggressive anti-bot which
| means you're very likely going to need full chrome or FF
| anyway - and you can parallelise quite effectively across
| tabs).
|
| One niche where I could definitely see a use for this though
| is scraping terribly coded sites that need some JS execution
| to safely get the data you want (e.g. they do some bonkers
| client side calculations that you don't want to reverse
| engineer). It would be nice to not pay the perf tax of chrome
| in these cases.
|
| Having said all of that, I have to say from a geek
| perspective it's super neat what you guys are hacking on!
| Zig+V8+CDP bindings is very cool.
| dolmen wrote:
| Scraping modern web pages is hard without full support for JS
| frameworks and dynamic loading. But a full browser, even
| headless, has huge resource consumption. This has a huge cost
| when scraping at scale.
| monkmartinez wrote:
| This is pretty neat, but I have to ask; Why does everyone want to
| build and/or use a headless browser?
|
| When I use pyautogui and my desktop chrome app I never have
| problems with captchas or trigger bot detectors. When I use a
| "headless" playwright, selenium, or puppeteer, I almost always
| run into problems. My conclusion is that "headless" scraping
| creates more problems than it solves. Why don't we use the
| chrome, firefox, safari, or edge that we are using on a day to
| day basis?
| fbouvier wrote:
| I guess it depends on the scale of your requests.
|
| When you want to browse a few websites from time to time, a
| local headful browser might be a solution. But when you have
| thousands or millions of webpages, you need a server
| environment and a headless browser.
| fbouvier wrote:
| In the past I've run hundreds of headful instances of Chrome
| in a server environment using Xvfb. It was not a pleasant
| experience :)
| kavalg wrote:
| Why AGPL? I am not blaming you. I am just curious about the
| reasoning behind your choice.
| fbouvier wrote:
| We had some discussions about it. It seems to us that AGPL
| will ensure that a company running our browser in a managed
| cloud offering will have to keep its modifications open for
| the community.
|
| We might be wrong; maybe AGPL will damage the project more
| than e.g. Apache2. In that case we will reconsider our
| choice. It's always easier this way :)
|
| Our underlying library
| https://github.com/lightpanda-io/zig-js-runtime is licensed
| under Apache2.
| cratermoon wrote:
| So is this the scraper we need to block?
| https://news.ycombinator.com/item?id=42750420
| fbouvier wrote:
| I fully understand your concern and agree that scrapers
| shouldn't be hurting web servers.
|
| I don't think they are using our browser :)
|
| But in my opinion, blocking a browser as such is not the right
| solution. In this case, it's the user who should be blocked,
| not the browser.
| jjcoffman wrote:
| If your browser doesn't play nicely and obey robots.txt when
| it's headless, I don't think it's that crazy to block the
| browser and not the user.
| fbouvier wrote:
| Every tool can be used in a good or bad way: Chrome,
| Firefox, cURL, etc. It's not the browser that doesn't play
| nicely; it's the user.
|
| It's the user's responsibility to behave well, like in life
| :)
| slt2021 wrote:
| It is trivial to spoof a user-agent. If you want to stop a
| motivated scraper, you need a different solution, one that
| exploits the fact that robots use a headless browser.
| sangnoir wrote:
| > it is trivial to spoof user-agent
|
| It's also trivial to detect spoofed user agents via
| fingerprinting. The best defense against scrapers is done
| in layers, with user-agent name block as the bare
| minimum.
| surfmike wrote:
| Another browser in this space is https://ultralig.ht/; it's
| geared toward in-game UI, but I wonder how easy it would be
| to retool it for a similar use case.
| gwittel wrote:
| Interesting. Looks really neat! How do you deal with anti bot
| stuff like Fingerprintjs, Cloudflare turnstile, etc? Maybe you're
| new enough to not get flagged but I find this (and CDP) a
| challenge at times with these anti-bot systems.
| frankgrecojr wrote:
| The hello world example does not work. In fact, no website
| I've tried works; it usually just panics. For the example in
| the readme, the errors are:
|
| ```
| ./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
| info(websocket): starting blocking worker to listen on 127.0.0.1:9222
| info(server): accepting new conn...
| info(server): client connected
| info(browser): GET https://wikipedia.com/ 200
| info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/index-2...: http.Status.ok
| info(browser): eval script portal/wikipedia.org/assets/js/index-24c3e2ca18.js: ReferenceError: location is not defined
| info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-...: http.Status.ok
| error(events): event handler error: error.JSExecCallback
| info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')
| info(server): close cmd, closing conn...
| info(server): accepting new conn...
| thread 5274880 panic: attempt to use null value
| zsh: abort ./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
| ```
| dang wrote:
| (This was on the frontpage as
| https://news.ycombinator.com/item?id=42812859 but someone pointed
| out to me that it had been a Show HN a few weeks ago:
| https://news.ycombinator.com/item?id=42430629, so I've made a
| fresh copy of that submission and moved the comments hither. I
| hope that's ok with everyone!)
| zlagen wrote:
| What do you think the use cases for this project would be?
| Being lightweight is awesome, but you usually need a real
| browser for most use cases: testing sites and scraping, for
| example. It may work for some scraping use cases, but I think
| that if the site uses any kind of bot blocking, this is not
| going to cut it.
___________________________________________________________________
(page generated 2025-01-24 23:00 UTC)