[HN Gopher] Show HN: Crul - Query Any Webpage or API
       ___________________________________________________________________
        
       Show HN: Crul - Query Any Webpage or API
        
        Hi HN, we're Carl and Nic, the creators of crul
        (https://www.crul.com), and we've been hard at work for the last
        year and a half building our dream of turning the web into a
        dataset. In a nutshell, crul is a tool for querying and building
        web and API data feeds from anywhere to anywhere.

        With crul you can crawl and transform web pages into CSV tables,
        explore and dynamically query APIs, filter and organize data,
        and push data sets to third-party data lakes and analytics
        tools. Here's a demo video; we've been told Nic sounds like John
        Mayer (lol) (https://www.crul.com/demo-video)

        We've personally struggled wrangling data from the web using
        puppeteer/playwright/selenium, jq, or cobbling together Python
        scripts, client libraries, and schedulers to consume APIs. The
        reality is that shit is hard, doesn't scale (classic blocking
        for-loop or async saturation - see the sketch at the end of this
        post), and comes with thorny maintenance/security issues. The
        tools we love to hate.

        Crul's value prop is simple: Query any Webpage or API for free.

        At its core, crul is based on the foundational linked nature of
        web/API content. It consists of a purpose-built map/expand/reduce
        engine for hierarchical web/API content (kind of like Postman
        but with a membership to Gold's Gym) with a familiar parser
        expression grammar that naturally gets the job done (and layered
        caching to make it quick to fix when it doesn't on the first
        try). There's a boatload of other features like domain policies,
        a scheduler, checkpoints, templates, a REST API, a web UI, a
        vault, OAuth for third parties, and 20+ stores to send your data
        to.

        Our goal is to open source crul as time and resources permit. At
        the end of the day it's just the two of us trying to figure
        things out as we go! We're just getting started.

        Crul is one bad mother#^@%*& and the web is finally yours!

        Download crul for free as a macOS desktop application or as a
        Docker image (https://www.crul.com) and let us know if you love
        it or hate it (https://forms.gle/5BXb5bLC1D5QG7i99). And come say
        hello to us on our slack channel - we're a friendly bunch!
        (https://crulinc.slack.com/)

        Nic and Carl (https://www.crul.com/early-days)
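
        About that "blocking for-loop or async saturation" aside: the
        usual DIY middle ground is a bounded-concurrency fetch loop. A
        minimal Python sketch using the aiohttp library (the URLs are
        hypothetical, and this is illustrative, not crul's code):

            import asyncio
            import aiohttp

            URLS = [f"https://example.com/page/{i}" for i in range(100)]

            async def fetch(session, sem, url):
                # the semaphore caps in-flight requests: the middle
                # ground between a blocking loop and async saturation
                async with sem:
                    async with session.get(url) as resp:
                        return await resp.text()

            async def main():
                sem = asyncio.Semaphore(10)  # at most 10 concurrent
                async with aiohttp.ClientSession() as session:
                    return await asyncio.gather(
                        *(fetch(session, sem, u) for u in URLS))

            pages = asyncio.run(main())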
        
       Author : portInit
       Score  : 103 points
       Date   : 2023-02-28 16:12 UTC (6 hours ago)
        
 (HTM) web link (www.crul.com)
 (TXT) w3m dump (www.crul.com)
        
       | 1vuio0pswjnm7 wrote:
       | Please share some data wrangling challenges that support this
       | statement:
       | 
       | "The reality is that shit is hard, doesn't scale (classic
       | blocking for-loop or async saturation), and comes with thorny
       | maintenance/security issues."
       | 
       | Every web user's needs are different. One person might have a
       | task that they struggle to accomplish while another might have
       | one that presents no major challenges. As a web user, I transform
       | web pages to CSV or SQL. I log HTTP and network requests. I do
       | this for free using open source software. No web browser needed.
       | No docker image needed. Works on both Linux and BSD.
       | 
       | For me, the web is a dataset from which I retrieve
        | data/information. "Tech" companies want the web to be more
       | like a video game, with visuals and constant interactivity.
        
       | [deleted]
        
       | jwx48 wrote:
       | Ah, I see it's pronounced more like "Krull" than "cruel".
        
         | portInit wrote:
         | lol - yeah. https://www.imdb.com/title/tt0085811/?ref_=tt_urv
         | 
         | We've been thinking of ways to make the pronunciation a bit
         | clearer, maybe a mascot or something. Open to ideas!
        
       | asdadsdad wrote:
       | How do you plan on tackling anti-bot blocking?
        
         | is_true wrote:
         | You shouldn't
        
         | portInit wrote:
         | It's a tricky question. Part of it is looking at APIs as the
         | main source of data for scheduled queries/data feeds.
         | 
          | Crul sort of operates as a text-only browser when interacting
         | with a single page at a time, but when you expand and open up
         | multiple tabs it becomes a little more challenging. We have the
         | concept of domain policies which allow you to control how
         | quickly/slowly you access something. There are also some
         | puppeteer level options that could be relevant, even a headful
         | toggle.
         | 
         | We have not invested too much time into this yet as we focused
         | on getting the core functionality working. We think there are
         | use cases (particularly with APIs) that don't run into this
         | problem, but if it comes up more often we'll come up with some
         | options.
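          | 
          | For illustration, the headful toggle at the browser
          | automation level looks roughly like this (sketched with the
          | Python playwright package rather than crul's internals; the
          | target URL is hypothetical):
          | 
          |   from playwright.sync_api import sync_playwright
          | 
          |   with sync_playwright() as p:
          |       # headless=False is the "headful" toggle: a real
          |       # window renders, which defeats some headless checks
          |       browser = p.chromium.launch(headless=False)
          |       page = browser.new_page()
          |       page.goto("https://example.com")
          |       print(page.title())
          |       browser.close()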
        
           | wefarrell wrote:
           | In my experience bot detection is moving more towards looking
           | at network activity and IP reputations. Using a proxy will go
            | a long way, it's easy to implement, and the cost can easily
            | be passed on to the customer.
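            | 
            | At its simplest, proxy routing is a couple of lines with
            | the Python requests library (the endpoint and credentials
            | below are hypothetical):
            | 
            |   import requests
            | 
            |   proxies = {
            |       "http":  "http://user:pass@proxy.example.net:8080",
            |       "https": "http://user:pass@proxy.example.net:8080",
            |   }
            |   resp = requests.get("https://example.com",
            |                       proxies=proxies, timeout=30)
            |   print(resp.status_code)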
        
             | asdadsdad wrote:
              | In my experience it's starting to move towards "use a
              | headless browser with patched attributes" or nothing else
              | will work.
             | 
             | edit: I have quite a bit of experience with Akamai and
             | other vendors =)
        
             | pocket_cheese wrote:
             | In my experience, the thing that makes me actually have to
             | lug out my ol' headless browser is that a lot of websites
             | are starting to implement obfuscated cryptographic puzzles
             | in their JS, making it really difficult to emulate without
             | just running it in a browser.
        
               | wefarrell wrote:
               | Starting to? That's been going on for at least the last
               | 3-4 years. Akamai tends to rely on that more than
               | Cloudflare and in my recent experience Akamai is winning
               | that game and browser emulation alone, headless or not,
               | isn't going to bypass Akamai. Recently I have seen
               | browser emulation not be effective at all for bypassing
               | bot detection.
               | 
               | The only kind of emulation where I have seen success is
               | mobile and in that case you need to run a device
               | emulator.
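                | 
                | For reference, browser-level mobile emulation (a step
                | short of the device emulator suggested above) looks
                | like this with the Python playwright package; the
                | target URL is hypothetical:
                | 
                |   from playwright.sync_api import sync_playwright
                | 
                |   with sync_playwright() as p:
                |       # built-in descriptor: UA, viewport, touch
                |       iphone = p.devices["iPhone 13"]
                |       browser = p.chromium.launch()
                |       context = browser.new_context(**iphone)
                |       page = context.new_page()
                |       page.goto("https://example.com")
                |       browser.close()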
        
       | jensneuse wrote:
       | Hey, just watched the video. This looks super useful! I'm the
       | founder of WunderGraph (https://wundergraph.com) and we allow our
       | users to easily integrate multiple data sources into a virtual
       | graph, which they can then access using GraphQL. You can add
       | various data sources, like GraphQL, Federation, OpenAPI,
       | Databases, etc... I was just thinking, wouldn't it be cool if we
       | could find an easy way to add a "Crul" datasource? If you're
       | interested, please DM me in our discord
       | (https://wundergraph.com/discord). I'd love to have a
       | conversation!
        
         | portInit wrote:
          | Will absolutely reach out! Our experience has been that just
          | getting data is often really challenging, so we've focused on
          | that piece, and on making it easy to share with destinations
          | that are purpose-built for analytics, viz, etc.
         | 
         | Thanks for checking it out!
        
           | chatmasta wrote:
           | I'll make this same offer for Splitgraph :) If you feel like
           | writing a Postgres FDW then we can add it to the engine on
           | the backend, so that anyone with a Postgres client could
           | connect to postgres://data.splitgraph.com:5432 and SELECT
           | from a table backed by crul (either "mounted" for live
           | querying, and/or ingested once/periodically for subsequent
           | querying). The user just needs to provide parameters for the
           | table; it's up to the FDW how to interpret those parameters.
           | 
           | It would take some thinking and planning, and it's possibly
           | not even a good idea ;) But generally any "data source" is
           | packageable as an FDW as long as you can model it in such a
           | way that you can reasonably implement certain functions for
           | operations like table scans. For most FDWs, this is easy and
           | the tradeoff of a large query is usually limited to excess
           | bandwidth and latency while the query executor reads the
           | result from the FDW. But with a live source pointing to a
           | crawler instance, a table scan could in the worst case mean
           | waiting for the crawler to parse the responses to hundreds of
           | rate-limited network requests. So it's probably better to
           | ingest the data once (and/or periodically) for a particular
           | crul "table" (whatever you decide that means) rather than to
           | query it live.
           | 
           | Fortunately, you can still write an FDW as the adapter layer,
           | because Splitgraph ingests data on a schedule by querying the
           | FDW of the live data source (while tolerating a long-running
           | query).
           | 
           | We've been interested in adding something like this (think
           | Apify + Postgres) for a while. If done well it could be
           | really cool. Let me know if you want to talk about it:
           | miles@splitgraph.com
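            | 
            | For a sense of the shape: a Python FDW via the Multicorn
            | framework reduces to a class with an execute() generator.
            | Everything crul-specific below (the crul_query option, the
            | run_crul_query helper) is hypothetical:
            | 
            |   from multicorn import ForeignDataWrapper
            | 
            |   class CrulFDW(ForeignDataWrapper):
            |       def __init__(self, options, columns):
            |           super().__init__(options, columns)
            |           # hypothetical option naming a saved crul query
            |           self.query = options["crul_query"]
            | 
            |       def execute(self, quals, columns):
            |           # a real adapter would call crul's REST API;
            |           # run_crul_query is a hypothetical helper
            |           for row in run_crul_query(self.query):
            |               yield {c: row.get(c) for c in columns}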
        
       | jawns wrote:
       | How does crul handle the dynamic nature of the web?
       | 
       | Yes, content changes, but so does structure. If I'm interested in
       | content that shows up in a news feed div, and that div is renamed
       | or moved as part of a site redesign, what happens?
       | 
       | I've worked on a bunch of tools in the past that do similar
       | things, and structural changes were the kryptonite for all of
       | them.
       | 
       | A secondary problem is when you use particular content as a
       | reference point, and that content is later updated. Now your
       | reference point is gone!
        
         | portInit wrote:
         | At this point we're considering it a foundational concept to
         | build around - web content changes, so our best option
         | currently is to make the query as easy as possible to change,
         | and alert when things break.
         | 
          | We have done some preliminary work on AI/pattern recognition
          | to handle structural changes more gracefully, but there's
          | still lots of work to do.
         | 
         | But the expanding and querying concepts also make a lot of
         | sense with APIs, which tend to be a little more stable.
        
           | robbs wrote:
           | IMO, this is the hardest part of maintaining a web scraper.
           | We had ~100 scripts to scrape ~1000 clients' sites and it
           | was, at minimum, 50 hours a week to keep up with changes.
           | 
              | The second hardest part was that 30% of our clients all
              | used the same hosting provider, which would start to fail
              | at 10-20 req/s. We had to throttle the sites by IP,
              | cluster-wide.
        
             | portInit wrote:
             | This makes sense and I am curious about this. Was there
             | consistency between those 1k client sites or were they all
             | rather different? Mind if I reach out?
        
       | throwaway_e9463 wrote:
       | Do you have an example of how to turn an html table element into
       | a CSV? I saw the open and the scrape commands, but wasn't sure
       | where to go from there.
        
         | portInit wrote:
         | Ah! We didn't quite get an html table command in this release
         | but it will be in the next one.
         | 
         | Here's a query that shows an option, but the table command will
         | be far more straightforward.
         | 
          | open https://www.w3schools.com/html/html_tables.asp --dimension
          |   || filter "(nodeName == 'TD')"
          |   || groupBy boundingClientRect.top
          |   || table _group.0.innerText _group.1.innerText _group.2.innerText
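          | 
          | (For comparison, outside crul the same page's tables can be
          | flattened to CSV in Python with pandas; read_html needs lxml
          | or html5lib installed:)
          | 
          |   import pandas as pd
          | 
          |   # read_html returns one DataFrame per <table> on the page
          |   tables = pd.read_html(
          |       "https://www.w3schools.com/html/html_tables.asp")
          |   tables[0].to_csv("table.csv", index=False)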
        
       | hermitcrab wrote:
       | Looks interesting. Customers of our data wrangling tool (Easy
       | Data Transform) are asking to be able to pull data out of REST
       | APIs, but we don't currently support this. So it could be an
       | interesting way to bridge the gap.
        
         | portInit wrote:
         | Would love to chat and try a few use cases together! At first
         | glance I see you can drag in csv files, which could be
         | generated by crul and either manually downloaded or scheduled
         | and written to the filesystem.
         | 
         | Crul is a really easy way of populating tools that need data,
         | whether it's just a one time thing for a demo/static data set
         | or a scheduled data feed, so this kind of usage makes sense to
         | us.
        
           | hermitcrab wrote:
           | That might be interesting. I will try to find some time to
           | have a play with Crul. Do you support pulling data from an
           | API on a schedule?
        
             | portInit wrote:
             | Yes that is possible, although we currently have the
             | scheduler set as an enterprise feature. We should look into
             | a free trial, I enjoyed the flow of the EasyDataTransform
             | installation with the free trial option.
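              | 
              | In the meantime the DIY fallback for a scheduled pull is
              | cron or a bare loop; a stdlib-only Python sketch with a
              | hypothetical endpoint:
              | 
              |   import json, time, urllib.request
              | 
              |   while True:
              |       url = "https://api.example.com/metrics"
              |       with urllib.request.urlopen(url) as r:
              |           print(json.load(r))  # or append to a CSV
              |       time.sleep(3600)         # pull hourly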
        
               | hermitcrab wrote:
               | >I enjoyed the flow of the EasyDataTransform installation
               | with the free trial option
               | 
                | We're taking a different approach to our enterprisey
                | competitors. ;0)
               | 
               | >we currently have the scheduler set as an enterprise
               | feature.
               | 
               | Understandable.
        
       | Uptrenda wrote:
       | Really elegant work guys. You've identified the pain points well.
       | I'd be curious how you've structured your infrastructure to scale
       | though. But I imagine that's secret sauce.
       | 
       | What are some of your experiences with async networking, btw? Did
       | you find that it didn't really handle concurrency well in
       | practice? Or were there other issues? I've written a lot of async
       | networking code but always found it horrible to profile.
        
         | portInit wrote:
         | Thank you and your comment really warms our hearts. It's been
         | fun building in the "cave" but comes with self doubt.
         | 
         | We've built using a microservice architecture to allow us to
         | scale out the parts that need to scale, mainly the workers,
          | which interact with a queue, although we'll need to move away
          | from the parts that are currently Node.js for some perf gains
          | and a smaller footprint. All those microservices are
          | consolidated for the desktop variants.
         | 
         | Network concurrency is mostly throttled by our domain policy
          | manager (named "gonogo" - lol) at 1 req per outbound domain
          | per second. It's a little slow for a default, but it's
          | configurable and provides a nice guardrail for API request
          | limits, etc.
         | Overall async networking has been quite tricky, esp with
         | retries, etc., and we're still iterating on it. Agreed on the
         | profiling difficulties.
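          | 
          | The throttle itself is conceptually simple; an illustrative
          | reimplementation of the 1 req/domain/sec idea (not crul's
          | actual "gonogo" code):
          | 
          |   import time
          |   from urllib.parse import urlparse
          | 
          |   class DomainThrottle:
          |       def __init__(self, min_interval=1.0):
          |           self.min_interval = min_interval
          |           self.last_hit = {}  # domain -> last request time
          | 
          |       def wait(self, url):
          |           # sleep just long enough to keep this domain at or
          |           # below one request per min_interval seconds
          |           domain = urlparse(url).netloc
          |           last = self.last_hit.get(domain, 0.0)
          |           elapsed = time.monotonic() - last
          |           if elapsed < self.min_interval:
          |               time.sleep(self.min_interval - elapsed)
          |           self.last_hit[domain] = time.monotonic()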
        
       | dishwishy wrote:
       | I just started playing with CRUL recently to try to map out
       | different media download links on various websites, in an effort
       | to avoid clicking around looking for hidden content. The language
        | is very robust, and the `filter` command alone is crazy
        | powerful for exploring things quickly and intuitively.
        
         | portInit wrote:
         | Thanks! The find (https://www.crul.com/docs/commands/find)
         | command works really well if you are trying to construct a
         | filter expression and just want to quickly look for results
         | containing a particular string so you can see the defining
         | attributes/column+row values.
         | 
         | There's a short writeup of this pattern here:
         | https://www.crul.com/docs/examples/how-to-find-filters
        
       | KomoD wrote:
       | > status: complete (28.284 seconds / 14 results)
       | 
       | Why was it so slow? Is there a default delay between requests or
       | something?
        
         | portInit wrote:
         | Yeah the default domain throttle policy is 1 req per second per
         | domain. Configurable through domain policies
         | https://www.crul.com/docs/features/domain-policies - although
         | currently an enterprise feature.
         | 
         | We found that it becomes too easy to break API request limits
         | or spam a website otherwise.
         | 
          | However, if you rerun that query it should load almost
          | instantly thanks to the caching layers, so the actual
          | querying/filtering of the data is smoother/faster.
        
           | KomoD wrote:
           | > although currently an enterprise feature.
           | 
           | Wait, so we literally can't go faster than 1 req/s unless we
           | pay?
           | 
           | I have to say I'm pretty disappointed :/
        
           | mdaniel wrote:
           | > due to the caching layers
           | 
           | Every time I see that, the "2 hardest things" springs to
           | mind. Is there a clear-caches option, or I guess the opposite
           | question: does that process honor the HTTP caching semantics?
           | Scrapy actually has a bunch of configurable knobs for that
           | (use RFC2616 Policy (
           | https://docs.scrapy.org/en/2.8/topics/downloader-
           | middleware.... ), write your own policy, or a ton of other
           | stuff: https://docs.scrapy.org/en/2.8/topics/downloader-
           | middleware.... )
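            | 
            | (For reference, those Scrapy knobs boil down to a few
            | settings; this is stock Scrapy configuration, not crul:)
            | 
            |   # settings.py
            |   HTTPCACHE_ENABLED = True
            |   HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
            |   HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
            |   HTTPCACHE_EXPIRATION_SECS = 0  # 0 = entries never expire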
        
             | portInit wrote:
              | Agreed, caching does come with its own set of quirks and
              | mind-numbing bugs. Crul does have a caching override flag
              | at the command/stage level which alleviates some of this:
              | https://www.crul.com/docs/queryconcepts/common-flags#--cache
              | 
              | The links you provided are interesting and something for
              | us to think about some more. Honestly, I would be quite
              | interested in hearing more about your experiences.
        
       | the_giver wrote:
       | That's a pretty slick tool you guys built!
        
         | portInit wrote:
         | Thank you! Really means a lot to us
        
       ___________________________________________________________________
       (page generated 2023-02-28 23:01 UTC)