[HN Gopher] Show HN: Till - Unblock and scale your web scrapers,...
       ___________________________________________________________________
        
       Show HN: Till - Unblock and scale your web scrapers, with minimal
       code changes
        
       Author : paramaw
       Score  : 176 points
       Date   : 2021-08-04 10:16 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | JeremyBanks wrote:
       | > Till was architected to follow best practices that DataHen has
       | accumulated over the years of scraping at a massive scale.
       | 
        | The best practice is not to build a business on evading the
        | security of the people you're depending on.
       | 
       | Just get into crime if you want to do that.
        
       | aftbit wrote:
       | What's the difference between the linked repo and the hosted
       | service at https://till.datahen.com/api/v1 ? Why do I need to
       | sign up for a token if I am hosting my own server? Or otherwise,
       | why do I need to host a server if I'm just calling your APIs?
       | What sort of scraped data is exposed to the till hosted service?
        
         | paramaw wrote:
          | We don't have a hosted service at the moment. It's all run
          | on premises. The API link you mentioned is what a Till
          | instance uses to validate the token and to report usage
          | stats for each user. The auth token validates each user's
          | free and premium features and limits based on those usage
          | stats, specifically the total request count and cache-hit
          | counter. We don't record or track anything related to the
          | requests made through Till.
        
       | cube00 wrote:
        | > Till helps you circumvent being detected as a web scraper
        | by identifying your scraper as a real web browser. It does
        | this by generating random user-agent headers and randomizing
        | proxy IPs (that you supply) on every HTTP request.
       | 
       | How did we start with consent using robots.txt and end up here?
       | 
       | A good neighbor doesn't use circumvention.
        
         | Nextgrid wrote:
         | If a real user can access the content then that user should be
         | able to delegate the work to a machine.
        
           | Doctor_Fegg wrote:
            | But what about when there are separate interfaces
            | specifically designed for real users and machines?
           | 
           | Case in point: OpenStreetMap (run by volunteer sysadmins)
           | provides completely open, machine-readable data dumps. You
           | can use these to set up your own geocoding, rendering or
           | similar service. There is copious documentation, several
           | post-processed dumps for particular purposes, etc. etc.
           | 
           | OSM also provides a friendly, human-scale, user-facing
           | interface for map browsing and search. There are clearly
           | documented limitations/TOUs for automated use of this
           | interface.
           | 
           | Does that prevent people from setting up massive scrapers to
           | scrape the human-facing interface, rather than using the
           | machine-facing data dumps? No, it does not; and the volunteer
           | sysadmins have to spend an inordinate amount of time
           | protecting the service against these people.
           | 
           | DataHen's proudly admitted practices ("No need to worry about
           | IP bans, we auto rotate IPs on any requests that are made.";
           | "our massive pool of auto-rotating proxies, user agents, and
            | 'secret-sauce' helps get around them") are directly
           | antithetical to this sort of scenario. I find this incredibly
           | irresponsible and unethical.
        
             | PaulHoule wrote:
             | Almost always the API is nerfed relative to the web site.
             | 
              | Almost all web sites that authenticate use "submit a form
              | with username and password and respect cookies"; often
              | sites that don't require authentication to use the web
              | site still require it for the API. Every API uses a
              | different authentication scheme and requires custom
              | programming; for web sites you just need the URL of the
              | form and the names of the username and password fields,
              | and you are done.
             | 
             | If you feed most HTML pages through a DOM parser like
             | BeautifulSoup you can extract the links and interpret them
             | through regexes. You might be done right then and there. If
             | you need more usually you can use CSS classes and id(s)
             | and... done!
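              | 
              | For example, a rough sketch of both steps in Python (the
              | URLs, field names and selectors are made up):
              | 
              |     # pip install requests beautifulsoup4
              |     import re
              |     import requests
              |     from bs4 import BeautifulSoup
              | 
              |     s = requests.Session()
              |     # log in once; the session keeps the cookie
              |     s.post("https://example.com/login",
              |            data={"username": "me", "password": "pw"})
              | 
              |     html = s.get("https://example.com/articles").text
              |     soup = BeautifulSoup(html, "html.parser")
              | 
              |     # grab every link, keep the ones we care about
              |     links = [a["href"]
              |              for a in soup.find_all("a", href=True)]
              |     wanted = [u for u in links
              |               if re.search(r"/article/\d+", u)]
              | 
              |     # or pull fields out by CSS class / id
              |     titles = [el.get_text(strip=True)
              |               for el in soup.select(".title")]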
             | 
             | I wrote a metadata extractor for Wikimedia HTML that I had
             | working provisionally on Flickr HTML immediately and had
             | working at 100% in 20 minutes. No way I could have gotten
             | the authentication working for an API in 20 minutes.
        
             | Nextgrid wrote:
             | Surely you must have a way to throttle human-originated
             | abuse (such as someone spamming F5)? If so then this would
             | work equally well for the machine.
             | 
             | Determine a reasonable rate limit and apply it to the
             | human-facing version, with maybe a link to the machine-
             | readable version in the error message?
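              | 
              | A minimal sketch of that idea (the limits and the dump
              | URL are made up):
              | 
              |     import time
              |     from collections import defaultdict
              | 
              |     WINDOW, LIMIT = 60, 30   # seconds, requests
              |     hits = defaultdict(list)
              | 
              |     def allow(ip):
              |         now = time.time()
              |         recent = [t for t in hits[ip]
              |                   if now - t < WINDOW]
              |         hits[ip] = recent
              |         if len(recent) >= LIMIT:
              |             # point bots at the bulk data instead
              |             msg = ("429: see the dump at "
              |                    "https://example.org/dumps")
              |             return False, msg
              |         recent.append(now)
              |         return True, None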
        
             | ethbr0 wrote:
             | Technically, this is the same rabbit hole that leads to DRM
              | and device intent superseding user intent.
             | 
             | Because that's the only way to be sure.
             | 
             | It's unfortunate that it also allows havoc and burdens good
             | services. But device-over-user is not a future I want to
             | live in.
        
             | capableweb wrote:
             | Of course if there is another, unlimited way of getting the
             | data, then it's perfectly fine to "redirect" people to
             | another avenue.
             | 
              | What this conversation is about (since we're talking about
              | web scrapers here, not "data downloaders" or whatever you
              | would call it) is that when there is no other avenue to
              | get the data, you should be able to access the same data
              | via a machine just as you can as a person.
        
               | Doctor_Fegg wrote:
               | Sure, but Till/DataHen appears to have a host of measures
               | in place to ignore that "redirect".
        
               | capableweb wrote:
               | Yes, how are they supposed to know if the "redirect" is
               | good or not? Facebook would helpfully redirect you to
               | their Terms and Conditions, something you definitely
                | should be able to ignore and scrape your data to your
                | heart's content anyway.
               | 
                | As with HTTP, the tool does not decide whether usage is
                | "nice" or "evil"; only the user with their use case
               | decides that.
        
               | tyingq wrote:
               | >you should be able to access the same data via a machine
               | as you could when you're a person
               | 
               | I generally agree with this, but I do see the problem for
               | certain spaces.
               | 
               | Concert tickets, for example. People write scrapers to
               | get the best seats and sit on them so they can scalp them
               | later at inflated prices. Or other first-come/first-serve
               | situations, online auction "sniping", etc.
        
               | osmarks wrote:
               | Perhaps it would be better to not gate by whether you
               | happen to have visited a website fast enough relative to
               | other people. For auctions, at least, there is a simple
               | solution in the form of extending the auction timer
               | whenever someone bids.
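                | 
                | A toy sketch of that rule (the two-minute numbers are
                | arbitrary):
                | 
                |     from datetime import datetime, timedelta
                | 
                |     EXTEND_WINDOW = timedelta(minutes=2)
                |     EXTENSION = timedelta(minutes=2)
                | 
                |     def place_bid(auction, amount):
                |         now = datetime.utcnow()
                |         if now >= auction["closes_at"]:
                |             raise ValueError("auction closed")
                |         if amount <= auction["high_bid"]:
                |             raise ValueError("bid too low")
                |         auction["high_bid"] = amount
                |         # a late bid pushes the close back so
                |         # others get a chance to respond
                |         if (auction["closes_at"] - now
                |                 < EXTEND_WINDOW):
                |             auction["closes_at"] = (
                |                 now + EXTENSION)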
        
               | capableweb wrote:
                | The same thing applies here as I said in another comment
                | (https://news.ycombinator.com/item?id=28060567): some
                | people (me included) also write scrapers to get the
                | best seats at concerts/shows because I'm a big fan of
                | the artist and really want to go with good seats.
        
           | Kiro wrote:
           | Says who?
        
         | danpalmer wrote:
         | I work on a lot of web scraping and we have business agreements
         | with every site that we scrape explicitly allowing us to do so
         | and with pre-approval for the scraping rate (which we carefully
         | control).
         | 
         | None of this gets around over-eager Cloudflare or Akamai rules
         | set up years ago by some contractor that the businesses have no
         | real ability to change.
        
           | Havoc wrote:
            | Why scrape at all if there are agreements in place? Seems
            | like an API task.
        
         | JulianWasTaken wrote:
         | And only a few paragraphs down from a heading that says
         | 
         | > Till was architected to follow best practices that DataHen
         | has accumulated
         | 
          | I know there are plenty of sites out there out to prevent any
          | scraping whatsoever, or to improperly block behavior most of
          | us would agree is reasonable, but this appears to be blatantly
          | hostile to the web admins out there.
        
           | cube00 wrote:
           | I agree there are cases where entities don't want things
           | scraped that really should be scraped to preserve the history
           | they'd like to rewrite.
           | 
           | My concern is that tools like these that use circumvention by
           | default become the go-to when someone needs scraping and
            | make life hell for us running sites on hardware that's
           | enough to service our small customer base but not an army of
           | bots.
        
           | paco3346 wrote:
           | Admin of a 5000+ website platform (with customer inventory)
            | here. This is the exact kind of thing I've been working so
            | hard to block lately.
           | 
           | Since Feb 2021 we've seen a substantial increase in scraping
           | of our customers' inventory to the point we now have more bot
           | traffic than human traffic. Annoyingly, only 5-10 requests
           | come from a single IP so we've had to resort to always
           | challenging requests that come from certain ASNs (typically
           | those owned by datacenters).
           | 
            | This type of project frustrates me because it knowingly goes
           | against a site's desired bot action (via robots.txt).
        
         | realusername wrote:
         | We ended up here since some people wrongly decided that only
         | Google has a right to scrape.
        
           | 1vuio0pswjnm7 wrote:
           | Then Google decided that no one has a right to scrape Google.
           | 
           | The web needs a more efficient system for distributing an
           | index of its content. Having web developers design websites
           | in a gazillion different permutations, when they are all
            | basically doing the same thing, and then having a handful of
           | companies "scrape" them is neither an efficient nor sensibly-
           | designed system.
           | 
           | The web (more generally, the internet) is a giant copy
           | machine. Google, NSA and others have copies. Yet if we were
            | to allow everyone to have copies by facilitating this through
           | technical means (e.g., Wikipedia-style data dumps), many
           | folks would panic. When Google first started indexing, it was
           | not a business, and many folks did panic and there were many
           | complaints. It's 2021; folks are still spooked by others
           | being able to easily copy what they publish to the web.
           | However it's OK to them if it's Google doing the copying. If
           | there were healthy competition and many search engines to
           | choose from, if one search engine did not have the lion's
           | share of web traffic, it's doubtful Google would be given
           | "exclusive" permission in robots.txt.
        
           | PaulHoule wrote:
           | I've run sites that have a lot of pages where 80%+ of the
           | traffic is web crawlers.
           | 
           | Google sends some traffic so I can afford to let them scrape.
           | Bing crawls almost as much as Google but sends 5% as much
           | traffic. Baidu crawls more than Google and never sends a
           | single visitor.
           | 
           | I hate reinforcing the Google monopoly, but a crawler that
           | doesn't send any traffic is expensive to serve.
        
             | maddyboo wrote:
             | Bing powers other search engines like DuckDuckGo, Ecosia,
             | and Yahoo!. But I'm sure that even cumulatively the numbers
             | are still small.
        
             | dredmorbius wrote:
             | You might want to ask yourself, or your readers, what it is
             | people are trying to access on your site that they cannot
             | by other means.
             | 
             | The interfaces for many sites actively and with brutal
             | effectiveness deny ready access to any content not
             | currently featured on the homepage or stream feed. Search
             | features are frequently nonexistent, crippled, or
             | dysfunctional.
             | 
             | Last week I found myself stumbling across a now-archived
             | radio programme on a website which afforded exceedingly
             | poor access to the content. The show ran weekly for over a
             | decade, with 531 episodes. Those are visible ... ten at a
             | time ... through either the Archive or Search features.
             | 
             | Scraping the site gives me the full archive listing, all 11
             | years, in a single webpage, that loads in under a second. I
             | can search that by date, title, or guest to find episodes
             | of interest.
             | 
             | The utility of this (a few hours work on my part) is much
             | higher than that of the site itself.
             | 
             | Often current web sites / apps are little more than
             | wrappers around a JSON delivery-and-parsing engine. Dumping
             | the raw JSON can be much more useful for a technical user.
             | (Reddit, Diaspora, Mastodon, and Ello are several sites in
             | which this is at least somewhat possible.)
             | 
             | Much of the suck is imposed by monetisation schemes. One
             | project of mine, decrufting the Washington Post's website,
             | resulted in article pages with _two percent_ the weight of
             | the originally-delivered payload. The de-cruftified version
             | strips not only extraneous JS and CSS, but nags and teasers
              | which are really nothing but distraction to me. Again,
              | that's typical.
             | 
             | I'm aware that many scrapers are not benign. More than you
              | might think _are_, and the fact that casual scraping _is_
             | a problem for your delivery system reflects more poorly on
             | it than them.
        
               | PaulHoule wrote:
               | Adjunk, Sidebarjunk, Javascriptjunk, Popupwindowjunk, and
               | the outlook that the most precious resource in the world
                | is a few seconds when you are distracted are what
               | motivates the Washington Post and most of the commercial
               | web.
               | 
               | What you are doing stripping out the junk threatens those
               | organizations at the core.
        
               | dredmorbius wrote:
               | Good.
               | 
               | https://news.ycombinator.com/item?id=26893033
               | 
               | https://news.ycombinator.com/item?id=27803591
        
               | PaulHoule wrote:
                | On mobile, ads, trackers and all that crap cost the
                | consumer more than the ads make.
               | 
               | If mobile phone companies kicked back a fraction of the
               | revenue they get to content creators they'd be better
               | paid than they are now and Verizon would get the love
               | that it has sought in vain. (e.g. who would say a bad
               | word about the phone company?)
        
               | dredmorbius wrote:
               | That's my argument.
               | 
                | Global ad spend, which mostly accrues to the wealthiest 1
               | billion or so, is about $600 billion. Some complex maths
               | tells us that's $600 per person in the industrialised
               | countries (G-20 / OECD, close enough). Global content
               | spend is somewhere around $100 -- 200/year per capita.
               | That's roughly the annual online ad spend.
               | 
               | Bundled into network provisioning, that's about $30--40
               | per household per month, all-you-can-eat. Information as
               | a public good.
               | 
               | (My preference is for higher rates in more affluent
               | areas, ideally by income.)
               | 
               | Trying to figure out WCPGW.
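                | 
                | A quick back-of-envelope of those figures (household
                | size is my own assumption, not from the comment):
                | 
                |     ad_spend = 600e9  # global ad spend, $/yr
                |     people = 1e9      # wealthiest ~1B people
                |     content = 150     # content $/person/yr
                |     household = 2.5   # people per household
                | 
                |     print(ad_spend / people)    # 600.0 $/yr
                |     print(content*household/12) # ~31 $/month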
        
               | PaulHoule wrote:
               | My personal model (emerging, there is a manifesto but I
               | am rewriting it as we speak) is to rigorously control
               | costs, focus on quality, stay small.
               | 
               | Think of the old phone company slogan "reach out and
               | touch someone." If I can accomplish that and spend less
               | than I do on food or clothes or my car then I win.
        
               | dredmorbius wrote:
               | I'd be interested in seeing what you're developing.
               | 
                | The challenge, as I see it, is that information is a
               | public good (in the economic sense: nonrivalrous,
               | nonexcludable, zero marginal cost, high fixed costs), and
               | provision _at scale_ requires either a complementary
               | rents service (advertising, patronage, propaganda, fancy
                | professional-services "shingle") or a tax. Busking or
                | its public-broadcasting analogue is another option,
                | though that's highly lossy.
               | 
               | Any truthful publishing also requires a strong self-
               | defence mechanism (protection against lawsuits, coercion,
               | intimidation, protection rackets, etc.), a frequently
               | underappreciated role played by publishers.
               | 
               | Charles Perrow's descriptions of the music industry
                | (recorded and broadcast) circa 1945 -- 1985 are
               | informative here (see his _Complex Organizations_
               | https://www.worldcat.org/title/complex-organizations-a-
               | criti...), notably the roles of publishers vs. front-line
               | and studio musicians.
        
               | Kiro wrote:
               | Not sure why you are attacking a poster specifically
               | talking about Google, Bing and Baidu doing massive
               | scraping. What you are talking about is something
               | entirely different.
        
               | PaulHoule wrote:
               | I don't feel attacked. I also don't blame him for being
               | inflamed about the problem he's been inflamed at because
               | I am inflamed about it too!
        
               | dredmorbius wrote:
               | Fortunately I think we both managed to realise that
               | before too many rounds of this ;-)
        
         | CiaTransMatters wrote:
         | In order for the CIA to continue its legendary fight against
         | transphobia, it is important for them to use these technologies
         | to gather OSINT on regions that house millions of backwards
         | non-progressives and all of their content.
         | 
         | I can't believe HN is such a hotbed for transphobics like this
         | to make comments without moderation or punishment.
        
         | ok2938 wrote:
         | We ended up here, because a) the promise of easy to use
         | structured data - that is inherent in data processing systems -
         | has not been fulfilled: you have to scrape HTML to reconstruct
         | some relational table, and b) information, despite being almost
         | free to transmit in large amounts today, is treated as if it
         | were a physical good, and each copy costs money. There is just
         | lots of money in there, because our economic system looks more
         | backwards than forwards.
        
       | ChrisArchitect wrote:
        | What kinds of scraping are being talked about here? As a
        | maintainer of many a website I hate seeing so, so much bot
        | traffic these days. So many sketchy foreign things. And then
        | throw in the spam/vulnerability-testing bots, random URLs, etc.
        | Sigh. Filling up the logs... filling up the logs.
       | 
        | Almost totally dependent on Cloudflare as an in-between to
        | mitigate these days.
        
         | DaveExeter wrote:
          | I have simple PHP code that adds misbehaving IP addresses to
         | iptables. Keeps the logs clean, because they can't even connect
         | over port 80.
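          | 
          | The same idea sketched in Python rather than PHP (the log
          | path and cutoff are assumptions; it needs root):
          | 
          |     import re
          |     import subprocess
          |     from collections import Counter
          | 
          |     # count requests per client IP from the access log
          |     hits = Counter()
          |     with open("/var/log/nginx/access.log") as f:
          |         for line in f:
          |             m = re.match(r"(\S+)", line)
          |             if m:
          |                 hits[m.group(1)] += 1
          | 
          |     for ip, n in hits.items():
          |         if n > 1000:  # arbitrary "misbehaving" cutoff
          |             subprocess.run(
          |                 ["iptables", "-I", "INPUT",
          |                  "-s", ip, "-j", "DROP"],
          |                 check=True)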
        
       | dannyw wrote:
       | Can this bypass cloudflare?
        
         | new_guy wrote:
          | These guys bypass Cloudflare: https://microlink.io. It's
          | open source, but it's just more convenient to pay and forget
          | about it.
        
         | jonatron wrote:
         | Cloudflare have a few configuration options for the level of
         | bot protection, so it's probably more a question of what
         | percentage would be blocked for a particular site.
        
           | yourad_io wrote:
           | This is entirely correct.
           | 
           | It will also depend on the proxy's ASN, the rate of requests,
           | the ability to solve captchas and the presence of an actual
           | js environment in the browser (as opposed to, say, curl -
           | Cloudflare uses a javascript environment check script to
           | detect puppeteer/selenium etc.)
        
             | yourad_io wrote:
             | > repository contains research from CloudFlare's AntiDDoS,
             | JS Challenge, Captcha Challenges, and CloudFlare WAF.
             | 
             | https://github.com/scaredos/cfresearch
        
         | tintt wrote:
         | Sometimes I, a human being, can't bypass Cloudflare
        
       | awestroke wrote:
        | Looks extremely useful. Starred in case I ever need to do
        | scraping.
        
       | kordlessagain wrote:
       | If anyone needs a screenshot bot:
       | https://github.com/kordless/grub-2.0
       | 
       | Still need to do some work on it, but it does do the job!
        
       | captainmuon wrote:
       | This looks really interesting, I'm going to bookmark it!
       | 
       | > Proxy IP address rotation
       | 
       | I wonder where you would get a decent proxy list nowadays. In the
        | late 90s, you could find lists of accidental open proxies that
       | "hackers" collected. But nowadays I've only seen really shady
       | "residential proxies" that are basically private PCs with
       | malware. Is there a decent source for proxies that are not widely
       | blocked and not criminal and not too expensive?
       | 
       | And by the way, while many people here question the morality of
       | scraping, I have at least two legitimate use cases:
       | 
       | 1) I built a little "vertical" search engine for a niche, using a
       | combination of scraping and RSS. It doesn't consume much traffic
       | and does really help people discover content in this niche.
       | 
       | 2) A friend does research on urban development and rent prices
       | and they scrape rent ads to get their data (or let students type
       | it off manually, I'm not sure...)
        
         | showerst wrote:
         | I run a number of proxies for (legally) scraping government
         | data sites with overly restrictive views/hour policies.
         | 
         | They're just squid boxes spun up on various cheap VPS
         | providers. A $5 digitalocean instance will power a lot of squid
         | traffic, and as long as you stay away from AWS, the IP ranges
         | tend not to be banned.
         | 
         | Just make sure to lock down your allowed source list, so you
         | don't become one of those hacked open proxies!
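          | 
          | For the scraper side, a small sketch of rotating requests
          | through such a pool (the addresses are placeholders; 3128
          | is squid's default port):
          | 
          |     import itertools
          |     import requests
          | 
          |     POOL = itertools.cycle([
          |         "http://203.0.113.10:3128",
          |         "http://203.0.113.11:3128",
          |     ])
          | 
          |     def fetch(url):
          |         p = next(POOL)
          |         return requests.get(
          |             url, proxies={"http": p, "https": p},
          |             timeout=30)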
        
       | k1rcher wrote:
       | Does this have any sort of JavaScript emulation/support? Or is it
       | purely HTTP requests?
       | 
        | The GID signature piece is especially interesting; I have run
        | into blockers in scaling scrapers in the past and this sort of
       | organizational/tracking platform sounds awesome.
        
         | paramaw wrote:
          | You can connect your local browser to Till, but you need to
          | have your browser accept a custom CA (Certificate Authority).
         | Instructions here:
         | https://till.datahen.com/docs/installation#ca
        
       | mellosouls wrote:
       | This sounds technically interesting but ethically dubious; also
       | requires sign-up - even though free - which (I think?) is
       | discouraged for Show HN.
       | 
       | If you are able to show how this service is administered
       | responsibly with due regard for those running websites and
       | blocking for good reason, best of luck with it.
        
         | Kiro wrote:
         | > also requires sign-up - even though free - which (I think?)
         | is discouraged for Show HN.
         | 
         | Look at past Show HNs and tell me how many respect this. It's
         | such an obsolete rule that it's irrelevant and almost OT to
         | even comment on.
        
         | OJFord wrote:
         | > also requires sign-up - even though free - which (I think?)
         | is discouraged for Show HN.
         | 
         | I don't doubt it's 'community discouraged' because people don't
         | want to have to, but I think the only rule is that there must
         | be something to try out - no announcement-only, sign up for
         | wait-list, hype-building type thing.
        
         | capableweb wrote:
         | > This sounds technically interesting but ethically dubious
         | 
         | > If you are able to show how this service is administered
         | responsibly
         | 
          | Imagine if these were the comments about HTTP when it was
          | first launched (honestly, surely there were some, but not as
          | many as the praise).
         | 
         | Protocols and tools are not ethically dubious or responsible
         | for anything. The users who use those protocols and tools are.
         | Anything that can be used for good/bad can also be used for
         | bad/good.
        
       | axiosgunnar wrote:
        | Why is this on GitHub if I still need an API and cannot run it
        | locally, since the actual code is not public?
       | 
        | Or is this some sort of "source available" content marketing
       | nonsense?
        
       ___________________________________________________________________
       (page generated 2021-08-04 23:01 UTC)