[HN Gopher] Show HN: Finic - Open source platform for building b...
       ___________________________________________________________________
        
       Show HN: Finic - Open source platform for building browser
       automations
        
       Last year we launched a project called Psychic that did moderately
       well on Hacker News, but was a commercial failure. We were able to
       find customers, but none with compelling and overlapping use cases.
       Everyone who was interested was too early to be a real customer.
       This was our launch: https://news.ycombinator.com/item?id=36032081
       We recently decided to revive and rebrand the project after seeing
       a sudden spike in interest from people who wanted to connect LLMs
       to data - but specifically through browsers. It's also a problem
       we've experienced firsthand, having built scraping features into
       Psychic and previously working on bot detection at Robinhood.  If
       you haven't built a web scraper or browser automation before, you
       might assume it's very straightforward. People have been building
       scrapers for as long as the internet has existed, so there must be
       many tools for the job.  The truth is that web scraping strategies
       need to constantly adapt as web standards change, and as companies
       that don't want to be scraped adopt new technologies to try and
       block it. The old standards never completely go away, so the longer
       the internet exists, the more edge cases you'll need to account
       for. This adds up to a LOT of infrastructure that needs to be set
       up and a lot of schlep developers have to go through to get up and
       running.  Scraping is no easier today than it was 10 years ago -
       the problems are just different.  Finic is an open source platform
       for building and deploying browser agents. Browser agents are bots
       deployed to the cloud that mimic the behaviour of humans, like web
       scrapers or remote process automation (RPA) jobs. Simple examples
       include scripts that scrape static websites like the SEC's EDGAR
       database. More complex use cases include integrating with legacy
       applications that don't have public APIs, where the best way to
       automate data entry is to just manipulate HTML selectors (EHRs for
       example).  Our goal is to make Finic the easiest way to deploy a
       Playwright-based browser automation. With this launch, you can
       already do so in just 4 steps. Check out our docs for more info:
       https://docs.finic.io/quickstart
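       The scripts themselves are ordinary Playwright. As a rough
       illustration, a minimal stock Playwright script looks like this
       (not the Finic quickstart itself; the URL is a placeholder):
        
           from playwright.sync_api import sync_playwright
        
           with sync_playwright() as p:
               browser = p.chromium.launch(headless=True)
               page = browser.new_page()
               page.goto("https://example.com")
               # Trivial "scrape": print the page title
               print(page.title())
               browser.close()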
        
       Author : jasonwcfan
       Score  : 85 points
       Date   : 2024-09-17 13:26 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | whatnotests2 wrote:
       | With agents like Finic, soon the web will be built for agents,
       | rather than humans.
       | 
        | I can see that, a few years from now, almost all web traffic
        | will be agents.
        
         | jasonwcfan wrote:
         | Yep. I used to be the guy responsible for bot detection at
         | Robinhood so I can tell you firsthand it's impossible to
         | reliably differentiate between humans and machines over a
         | network. So either you accept being automated, or you
         | overcorrect and block legitimate users.
         | 
         | I don't think the dead internet theory is true today, but I
         | think it will be true soon. IMO that's actually a good thing,
         | more agents representing us online = more time spent in the
         | real world.
        
           | candiddevmike wrote:
           | That is some bizarre mental gymnastics to justify the work
           | you've done. What about the rest of us who don't want agents
           | representing us?
        
             | ayanb9440 wrote:
             | If you want to use an agent for scraping/automation, you
             | would need to supply it with auth credentials. So
             | permission is required by default.
        
       | Oras wrote:
        | Don't take this as a negative thing, but I'm confused. Is it
        | Playwright? Is it a residential proxy? It's not clear from your
        | video.
        
         | jasonwcfan wrote:
         | Proxies are definitely on our roadmap, but for now it just
         | supports stock Playwright.
         | 
         | Thanks for the feedback! I just updated the repo to make it
         | more clear that it's Playwright based. Once my cofounder wakes
         | up I'll see if he can re-record the video as well.
        
           | ghxst wrote:
           | What kind of proxies are on your road map, do you have any
           | experience with in-house proxy networks?
        
       | j0r0b0 wrote:
       | Thank you for sharing!
       | 
       | Your sign up flow might be broken. I tried creating an account
       | (with my own email), received the confirmation email, but
       | couldn't get my account to be verified. I get "Email not
       | confirmed" when I try to log in.
       | 
       | Also, the verification email was sent from
       | accounts@godealwise.com, which is a bit confusing.
        
         | jasonwcfan wrote:
          | Oops! We tested the OAuth flow but forgot to update the email
         | one. Thanks for the heads up, fixing this now.
        
       | ilrwbwrkhv wrote:
       | Backed by YC = Not open source. Eventually pressure to exit and
       | hyper scale will take over.
        
         | yard2010 wrote:
         | I'm curious, can't do both?
        
         | ayanb9440 wrote:
         | There are quite a few open source YC startups at this point.
         | Our understanding is that:
         | 
          | 1. Developer tooling should be open source by default
          | 
          | 2. Open source doesn't meaningfully affect revenue/scaling
          | because developers that would use your self-hosted version
          | would build in-house anyway.
        
           | ilrwbwrkhv wrote:
           | I know there are quite a few open source by default
           | companies. But the ethos of open source is sharing / building
           | something by the community and getting paid in a way which
           | does not scale the way VC funding expectations work.
           | 
            | So, to show some respect for the open source way on top of
            | which you are building all this, please stop advertising it
            | as "open source infrastructure" in bold and sell it like a
            | normal software product with "source available" in the
            | footer.
           | 
           | If you do plan to go open source and actually follow its
           | ethos, remove the funded by VC label and have self hosting
           | front and center in the docs with the hosted bit somewhere in
           | the footer.
        
             | ilrwbwrkhv wrote:
              | Like, again, if you are not sure what open source means,
              | this is open source: https://appimage.org/
              | 
              | Hope it is abundantly clear with this example. Docker tried
              | its best to do the whole open-source-but-business-first
              | thing and it led to disastrous results.
             | 
             | At best this will make your company suffer and second guess
             | itself and at worst this is moral fraud.
             | 
             | Talk to your group partner about this and explain to them
             | as well.
        
       | computershit wrote:
       | First, nice work. I'm certainly glad to see such a tool in this
       | space right now. Besides a UI, what does this provide that
       | something like Browserless doesn't?
        
         | jasonwcfan wrote:
         | Thanks! Wasn't familiar with Browserless but took a quick look.
         | It seems they're very focused on the scraping use case. We're
         | more focused on the agent use case. One of our first customers
         | turned us on to this - they wanted to build an RPA automation
         | to push data to a cloud EHR. The problem was it ran as a single
         | page application with no URL routing, and had an extremely
         | complex API for their backend that was difficult to reverse
         | engineer. So automating the browser was the best way to
         | integrate.
         | 
          | If you're trying to build an agent for a long-running job like
          | that, you run into different problems:
          | 
          | - Failures are magnified, as a workflow has multiple upstream
          | dependencies and most scraping jobs don't.
          | 
          | - You have to account for different auth schemes (OAuth,
          | password, magic link, etc.)
          | 
          | - You have to implement token refresh logic for when sessions
          | expire, unless you want to manually log in several times per
          | day.
         | 
         | We don't have most of these features yet, but it's where we
         | plan to focus.
         | 
         | And finally, we've licensed Finic under Apache 2.0 whereas
         | Browserless is only available under a commercial license.
        
           | sahmeepee wrote:
            | Sounds like a problem that can be solved with a Playwright
           | script with a bit of error checking in it.
           | 
           | I think this needs more elaboration on what the Finic wrapper
           | is adding to stock Playwright that can't just be achieved
           | through more effective use of stock Playwright.
        
       | ushakov wrote:
       | I do not understand what this actually is. Any difference between
       | Browserbase and what you're building?
       | 
       | Also, curious why your unstructured idea did not pan out?
        
         | ayanb9440 wrote:
         | Looking at their docs, it seems that with Browserbase you would
         | still have to deploy your Playwright script to a long-running
         | job and manage the infra around that yourself.
         | 
          | Our approach is a bit different. With Finic you just write the
         | script. We handle the entire job deployment and scaling on our
         | end.
        
       | ghxst wrote:
       | Cool service but how will you deal / how do you plan to deal with
       | anti scraping and anti bot services like Akamai, Arkose,
        | Cloudflare, DataDome, etc.? Automation of the web isn't solved by
        | another Playwright or Puppeteer abstraction, you need to solve
        | more fundamental problems in order to mitigate the issues you run
        | into at scale.
        
         | jasonwcfan wrote:
         | I mentioned this in another comment, but I know from experience
         | that it's impossible to reliably differentiate bots from humans
          | over a network. And since the right to automate browsers has
          | survived repeated legal challenges, all vendors can do is make
          | it incrementally harder, which only weeds out the low-
          | sophistication actors.
         | 
         | This actually creates an evergreen problem that companies need
         | to overcome, and our paid version will probably involve helping
         | companies overcome these barriers.
         | 
         | Also I should clarify that we're explicitly not trying to build
          | a Playwright abstraction - we're trying to remain as
         | unopinionated as possible about how developers code the bot,
         | and just help with the network-level infrastructure they'll
         | need to make it reliable and make it scale.
         | 
         | It's good feedback for us, we'll make that point more clear!
        
           | candiddevmike wrote:
           | Don't take this the wrong way, but this is the kind of
           | unethical behavior that our industry should frown upon IMO. I
           | view this kind of thing on the same level as DDoS-as-a-
           | Service companies.
           | 
           | I wish your company the kind of success it deserves.
        
             | jasonwcfan wrote:
             | Why is it unethical when courts have repeatedly affirmed
             | browser automation to be legal and permitted?
             | 
             | If anything, it's unethical for companies to dictate how
             | their customers can access services they've already paid
             | for. If I'm paying hundreds of thousands per year for
             | software, shouldn't I be allowed to build automations over
             | it? Instead, many enterprise products go to great lengths
             | to restrict this kind of usage.
             | 
             | I led the team that dealt with DDoS and other network level
             | attacks at Robinhood so I know how harmful they are. But I
             | also got to see many developers using our services in
             | creative ways that could have been a whole new product
             | (example: https://github.com/sanko/Robinhood).
             | 
             | Instead we had to go after these people and shut them down
             | because it wasn't aligned with the company's long term risk
             | profile. It sucked.
             | 
             | That's why we're focused on authenticated agents for B2B
             | use cases, not the kind of malicious bots you might be
             | thinking of.
        
               | tempest_ wrote:
               | > they've already paid for.
               | 
               | That is the crux, rarely is it a service being scraped
               | that they paid for
        
               | ayanb9440 wrote:
               | Depends on the use case. Lots of hospitals and banks use
               | RPA to automate routine processes on their EHRs and
               | systems of record, because these kinds of software
               | typically don't have APIs available. Or if they do,
               | they're very limited.
               | 
               | Playwright and other browser automation scripts are a
               | much more powerful version of RPA but they do require
               | some knowledge of code. But there are more and more
               | developers every year and code just gets more powerful
               | every year. So I think it's a good bet to make that
               | browser automation in code will replace RPA altogether
               | some day.
        
               | rgrieselhuber wrote:
               | Many times it is scraping aggregators of data that those
               | aggregators also did not pay for.
        
           | ghxst wrote:
           | > but I know from experience that it's impossible to reliably
           | differentiate bots from humans over a network
           | 
           | While this might be true in theory, it doesn't stop them from
           | trying! And believe me, it's getting to a point where the WAF
           | settings on some websites are even annoying the majority of
           | the real users! Some of the issues I am hinting at however
            | are fundamental issues you run into when automating the web
           | using any mainstream browser that hasn't had some source code
           | patches, I'm curious to see if a solution to that will be
           | part of your service if you decide to tackle it.
        
       | slewis wrote:
       | Is it stateful? Like can I do a run, read the results, and then
       | do another run from that point?
        
         | ayanb9440 wrote:
         | We currently don't save the browser state after the run has
         | completed but that's something we can definitely add as a
         | feature. Could you elaborate on your use case? In which
         | scenarios would it be better to split a run into multiple
         | steps?
        
           | mdaniel wrote:
           | Almost any process that involves the word "workflow" (my
           | mental model is one where the user would press alt-tab to
           | look up something else in another window). The very, very
           | common case would be one where they have a stupid SMS-based
           | or "click email link" login flow: one would not wish to do
           | that a ton, versus just leaving the session authenticated for
           | reuse later in the day
           | 
           | Also, if my mental model is correct, the more browsing and
           | mouse-movement telemetry those cloudflare/akamai/etc gizmos
           | encounter, the more likely they are to think the browser is
           | for real, versus encountering a "fresh" one is almost
           | certainly red-alert. Not a panacea, for sure, but I'd guess
           | every little bit helps
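            | 
            | (Stock Playwright can already persist a session between runs
            | via storage_state, for what it's worth; a rough sketch, with
            | hypothetical paths and URLs:)
            | 
            |     from playwright.sync_api import sync_playwright
            | 
            |     with sync_playwright() as p:
            |         browser = p.chromium.launch()
            |         # First run: log in once, then save cookies/localStorage
            |         context = browser.new_context()
            |         page = context.new_page()
            |         page.goto("https://example.com/login")
            |         # ... perform the painful SMS/email login here ...
            |         context.storage_state(path="auth_state.json")
            | 
            |         # Later runs: reuse the saved state instead of re-authenticating
            |         context2 = browser.new_context(storage_state="auth_state.json")
            |         page2 = context2.new_page()
            |         page2.goto("https://example.com/dashboard")
            |         browser.close()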
        
             | jasonwcfan wrote:
             | The way we plan to handle authenticated sessions is through
             | a secret management service with the ability to ping an
             | endpoint to check if the session is still valid, and if
             | not, run a separate automation that re-authenticates and
             | updates the secret manager with the new token. In that
             | case, it wouldn't need to be stateful, but I can certainly
             | see a case for statefulness being useful as workflows get
             | even more complex.
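              | 
              | The validity check itself would be roughly something like
              | this (a sketch only; the endpoint and helper names are
              | hypothetical, not an actual Finic API):
              | 
              |     import os
              |     import requests
              | 
              |     def get_valid_token() -> str:
              |         # Stand-in for a real secret manager read
              |         token = os.environ.get("APP_SESSION_TOKEN", "")
              |         # Ping any cheap authenticated endpoint to check validity
              |         check = requests.get(
              |             "https://app.example.com/api/me",
              |             headers={"Authorization": f"Bearer {token}"},
              |         )
              |         if check.status_code == 401:
              |             # Session expired: run a separate Playwright login
              |             # flow (hypothetical helper), then write the fresh
              |             # token back to the secret manager
              |             token = reauthenticate()
              |             os.environ["APP_SESSION_TOKEN"] = token
              |         return token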
             | 
             | As for device telemetry, my experience has been that most
             | companies don't rely too much on it. Any heuristic used to
             | identify bots is likely to have a high false positive rate
             | and include many legitimate users, who then complain about
              | it. Captchas are much more common and effective, though if
              | you've seen some of the newer puzzles that vendors like
              | Arkose Labs offer, it's a tossup whether the median human
              | can even solve them.
        
       | skeptrune wrote:
        | I wonder if there are hidden observability problems with
        | scraping whose ideal solutions are a different shape than a
        | dashboard. Feels like a Sentry connection or other common alert
        | monitoring solutions would combine well with the LLM-proposed
        | changes and help teams react more quickly to pipeline problems.
        
         | ayanb9440 wrote:
          | We do support Sentry. Finic projects are Poetry scripts, so you
          | can `poetry add` any observability library you need.
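          | 
          | For example, after a `poetry add sentry-sdk` (the DSN below is
          | a placeholder):
          | 
          |     import sentry_sdk
          | 
          |     sentry_sdk.init(dsn="https://<key>@o0.ingest.sentry.io/0")
          | 
          |     try:
          |         run_automation()  # your Playwright entrypoint (hypothetical)
          |     except Exception:
          |         sentry_sdk.capture_exception()
          |         raise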
        
       | sebmellen wrote:
       | We use https://windmill.dev which is great for this!
        
       | dataviz1000 wrote:
       | I build browser automation systems with either Playwright or
       | Chrome Extensions. The biggest issue with automating 3rd party
       | websites is knowing when the 3rd party developer pushes changes
        | which break the automation. The way I dealt with that is to run
        | a headless browser in the cloud which periodically checks the
        | behavior of the automated site, sending emails and SMS messages
        | when it breaks.
       | 
       | If you don't already have this feature for your system, I would
       | recommend it.
        
         | ayanb9440 wrote:
         | That's a great suggestion! Essentially a cron job to check for
         | website changes before your automation runs and possibly
         | breaks.
         | 
         | What does this check look like for you? Do you just diff the
          | HTML to see if there are any changes?
        
           | dataviz1000 wrote:
            | The issue with diffing HTML is that selectors are
            | autogenerated with any update to a website's code. Often,
            | websites which combat scraping will autogenerate different
            | HTML. First thing is to screen capture a website for
            | comparison. Second, it is possible to determine all the
            | visible elements on a page.
           | With Playwright, inject event listeners to all elements on a
           | page and start automated clicking. If the agent fills out
           | forms, then make sure that all fields are available to
           | populate. There are a lot of heuristics.
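            | 
            | For the form-field heuristic, something like this works (a
            | rough sketch; `page` is a Playwright page and EXPECTED_FIELDS
            | is your known-good list, both hypothetical here):
            | 
            |     # Collect names/ids of the currently visible form fields
            |     fields = page.evaluate("""() =>
            |         [...document.querySelectorAll('input, select, textarea')]
            |             .filter(el => el.offsetParent !== null)
            |             .map(el => el.name || el.id)
            |     """)
            |     missing = set(EXPECTED_FIELDS) - set(fields)
            |     if missing:
            |         alert_team(f"form fields disappeared: {missing}")  # hypothetical notifier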
        
             | thestepafter wrote:
             | Are you doing screenshot comparison with Playwright? If so,
             | how? Based on my research this looks to be a missing
             | feature but I could be incorrect.
        
               | sahmeepee wrote:
               | Playwright has screenshot comparison built in, including
               | screenshotting a single element, blanking specific
               | elements, and comparing the textual aspects of elements
               | without a visual comparison. You can even apply a
               | specific stylesheet for comparisons.
               | 
               | Everything I can see in this demo can be done with
               | Playwright on its own or with some very basic
               | infrastructure e.g. from Azure to run the tests
               | (automations). I can't see what it is adding. Is it doing
               | some bot-detection countermeasures?
               | 
               | Checking if the page behaviour has changed is pretty easy
               | in Playwright because its primary purpose is testing, so
               | just write some tests to assert the behaviour you expect
               | before you use it.
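                | 
                | For example (a sketch; the URL and selectors are
                | placeholders):
                | 
                |     from playwright.sync_api import sync_playwright, expect
                | 
                |     def check_page_still_looks_right():
                |         with sync_playwright() as p:
                |             page = p.chromium.launch().new_page()
                |             page.goto("https://example.com/form")
                |             # Fail fast if the elements the automation relies on are gone
                |             expect(page.locator("#patient-name")).to_be_visible()
                |             expect(page.locator("#submit")).to_be_enabled()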
               | 
               | We use Playwright to both automate and scrape the site of
               | a public organisation we are obliged to use, as another
               | public body. They do have some bot detection because we
               | get an email when we run the scripts, asking us to
               | confirm our account hasn't been compromised, but so far
               | we have not been blocked. If they ever do block us we
               | will need to hire someone to do manual data entry, but
               | the automation has already paid for itself many times
               | over in a couple of years.
        
         | ghxst wrote:
         | IO between humans and websites can be broken down to only a few
         | fundamental pieces (or elements I should say). This is actually
         | where AI has a lot of opportunity to add value as it has the
          | capability of significantly reducing the possibility of breakage
         | between changes.
        
       | mdaniel wrote:
       | > Finic uses Playwright to interact with DOM elements, and
       | recommends BeautifulSoup for HTML parsing.
       | 
       | I have _never, ever_ understood anyone who goes to the trouble of
       | booting up a browser, and then uses a python library to do
       | _static_ HTML parsing
       | 
       | Anyway, I was surfing around the repo trying to find what,
       | exactly "Safely store and access credentials using Finic's built-
       | in secret manager" means
        
         | ayanb9440 wrote:
         | We're in the middle of putting this together right now but it's
         | going to be a wrapper around Google Secret Manager for those
         | that don't want to set up a secrets manager themselves.
        
         | 0x3444ac53 wrote:
          | Oftentimes websites won't load the HTML without executing the
          | JavaScript, or use JavaScript running client-side to generate
          | the entire page.
        
           | mdaniel wrote:
           | I feel that we are in agreement for the cases where one would
           | use Playwright, and for damn sure would not involve BS4 for
           | anything in that case
        
         | msp26 wrote:
         | What would you recommend for parsing instead?
        
           | mdaniel wrote:
           | In this specific scenario, where the project is using
           | *automated Chrome* to even bother with the connection,
           | redirects, and bazillions of other "browser-y" things to
           | arrive at HTML to be parsed, the very idea that one would
            | `soup = BeautifulSoup(playwright.content())` is crazypants to
           | me
           | 
           | I am open to the fact that html5lib strives to parse
           | correctly, and good for them, but that would be the case
           | where one wished to use python for parsing _to avoid_ the
           | pitfalls of dragging a native binary around with you
        
           | ghxst wrote:
           | In python specifically I like lxml (pretty sure that's what
            | BS uses under the hood?), parse5 if you're using Node is
            | usually my go-to. Ideally though you shouldn't really have to
           | parse anything (or not much at all) when doing browser
           | automation as you have access to the DOM which gives you an
           | interface that accepts query selectors directly (you don't
           | even need the Runtime domain for most of your needs).
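            | 
            | i.e. instead of round-tripping through a parser:
            | 
            |     # soup = BeautifulSoup(page.content(), "lxml")
            |     # rows = soup.select("table.results tr")
            | 
            | you can usually just query the live DOM (the selector here is
            | a placeholder):
            | 
            |     rows = page.locator("table.results tr")
            |     for i in range(rows.count()):
            |         print(rows.nth(i).inner_text())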
        
             | mdaniel wrote:
             | > pretty sure that's what BS uses under the hood?
             | 
             | it's an option[1], and my strong advice is to not use lxml
             | for html since html5lib[2] has the explicitly stated goal
             | of being WHATWG compliant:
             | https://github.com/html5lib/html5lib-python#html5lib
             | 
             | 1: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#i
             | nsta...
             | 
             | 2: https://pypi.org/project/html5lib/
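              | 
              | Concretely (a minimal illustration; assumes both parsers
              | are installed):
              | 
              |     from bs4 import BeautifulSoup
              | 
              |     html = "<p>unclosed <b>tags"
              |     # lxml: fast, recovers from broken markup in its own way
              |     print(BeautifulSoup(html, "lxml"))
              |     # html5lib: slower, but parses the way a WHATWG browser would
              |     print(BeautifulSoup(html, "html5lib"))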
        
               | ghxst wrote:
               | That's good to know, will try it out. I haven't had many
               | cases of "broken" html in projects where I use lxml but
               | when they do happen it can definitely be a pain.
        
       | krick wrote:
        | Does anyone know of a solid (not SaaS, obviously) solution for
       | scraping these days? It's getting pretty hard to get around some
       | pretty harmless cases (like bulk-downloading MY OWN gpx tracks
       | from some fucking fitness-watch servers), with all these js
       | tricks, countless redirects, cloudflare and so on. Even if you
       | already have the cookies, getting non-403 response to any request
       | is very much not trivial. I feel like it's time to upgrade my
       | usual approach of python requests+libxml, but I don't know if
       | there is a library/tool that solves some of the problems for you.
        
         | whilenot-dev wrote:
         | If requests solves any 403 headaches for you, just pass the
         | session cookies to a playwright instance, and you should be
         | good to go. Just did that for scraping the SAP Software
         | Download Center.
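          | 
          | Something along these lines (a sketch; the domain and URL are
          | placeholders):
          | 
          |     import requests
          |     from playwright.sync_api import sync_playwright
          | 
          |     s = requests.Session()
          |     # ... authenticate with requests here ...
          | 
          |     with sync_playwright() as p:
          |         browser = p.chromium.launch()
          |         context = browser.new_context()
          |         # Copy the requests session cookies into the browser context
          |         context.add_cookies([
          |             {"name": c.name, "value": c.value,
          |              "domain": c.domain, "path": c.path or "/"}
          |             for c in s.cookies
          |         ])
          |         page = context.new_page()
          |         page.goto("https://example.com/downloads")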
        
         | lambdaba wrote:
         | I've found selenium with undetected-chromedriver to work best.
        
           | unsupp0rted wrote:
           | Doesn't get around Cloudflare's anti-bot
        
             | lambdaba wrote:
             | Ah, ok, I found it worked with YouTube unlike regular
             | chromedriver, didn't encounter Cloudflare when I used it
        
         | bobbylarrybobby wrote:
         | On a Mac, I use keyboard maestro, which can interact with the
         | UI (which is usually stable enough to form an interface of
         | sorts) -- wait for an graphic to appear on screen, then click
         | it, then simulate keystrokes, run JavaScript on the current
         | page and get a result back... looks very human to a website in
         | a browser, and is nearly as easy to write as Python.
        
       | suriya-ganesh wrote:
        | I've been working on a browser agent for the last week[1], so
        | this is very exciting. There are also browser agent
        | implementations like Skyvern[2] (also YC backed) or Tarsier[3].
        | It seems like Finic is providing a way to scale/schedule these
        | agents? If that's the case, what's the advantage over something
        | like Airflow or Windmill?
       | 
       | If I remember correctly, Skyvern also has an implementation of
       | scaling these browser tasks built in.
       | 
       | ps. Is it not called Robotic Process Automation? First time I'm
       | hearing it as Remote process Automation.
       | 
       | [1]https://github.com/ProductLoft/arachne
       | 
       | [2]https://www.skyvern.com/
       | 
       | [3]https://github.com/reworkd/tarsier
        
       ___________________________________________________________________
       (page generated 2024-09-17 23:00 UTC)