[HN Gopher] Show HN: Till - Unblock and scale your web scrapers,...
___________________________________________________________________
Show HN: Till - Unblock and scale your web scrapers, with minimal
code changes
Author : paramaw
Score : 176 points
Date : 2021-08-04 10:16 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| JeremyBanks wrote:
| > Till was architected to follow best practices that DataHen has
| accumulated over the years of scraping at a massive scale.
|
| The best practice is not to build a business that's built on
| evading the security of the people you're depending on.
|
| Just get into crime if you want to do that.
| aftbit wrote:
| What's the difference between the linked repo and the hosted
| service at https://till.datahen.com/api/v1 ? Why do I need to
| sign up for a token if I am hosting my own server? Or otherwise,
| why do I need to host a server if I'm just calling your APIs?
| What sort of scraped data is exposed to the till hosted service?
| paramaw wrote:
 | We don't have a hosted service at the moment. It's all run
 | on-premises. The API link you mentioned is for a Till instance
 | to validate the token and to report usage stats for each user.
 | The auth token validates each user's free and premium features
 | and limits based on usage stats -- more specifically, total
 | request counts and a cache-hit counter. We don't record or
 | track anything related to the requests made through Till.
| cube00 wrote:
 | > Till helps you circumvent being detected as a web scraper by
 | identifying your scraper as a real web browser. It does this by
| generating random user-agent headers and randomizing proxy IPs
| (that you supply) on every HTTP request.
|
| How did we start with consent using robots.txt and end up here?
|
| A good neighbor doesn't use circumvention.
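 |
 | (For reference, what the quoted text describes is mechanically
 | simple -- a minimal Python sketch, where the user-agent and
 | proxy lists are placeholders the operator supplies:)
 |
 |   import random
 |   import requests
 |
 |   USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0; ...)",
 |                  "Mozilla/5.0 (Macintosh; ...)"]
 |   PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]
 |
 |   def fetch(url):
 |       # New random browser identity and exit IP on every request
 |       proxy = random.choice(PROXIES)
 |       return requests.get(
 |           url,
 |           headers={"User-Agent": random.choice(USER_AGENTS)},
 |           proxies={"http": proxy, "https": proxy},
 |       )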
| Nextgrid wrote:
| If a real user can access the content then that user should be
| able to delegate the work to a machine.
| Doctor_Fegg wrote:
 | But what about when there are separate interfaces,
 | specifically designed for real users and for machines?
|
| Case in point: OpenStreetMap (run by volunteer sysadmins)
| provides completely open, machine-readable data dumps. You
| can use these to set up your own geocoding, rendering or
| similar service. There is copious documentation, several
| post-processed dumps for particular purposes, etc. etc.
|
| OSM also provides a friendly, human-scale, user-facing
| interface for map browsing and search. There are clearly
| documented limitations/TOUs for automated use of this
| interface.
|
| Does that prevent people from setting up massive scrapers to
| scrape the human-facing interface, rather than using the
| machine-facing data dumps? No, it does not; and the volunteer
| sysadmins have to spend an inordinate amount of time
| protecting the service against these people.
|
| DataHen's proudly admitted practices ("No need to worry about
| IP bans, we auto rotate IPs on any requests that are made.";
| "our massive pool of auto-rotating proxies, user agents, and
 | 'secret-sauce' helps get around them") are directly
| antithetical to this sort of scenario. I find this incredibly
| irresponsible and unethical.
| PaulHoule wrote:
| Almost always the API is nerfed relative to the web site.
|
 | Almost all web sites that authenticate use "submit a form
 | with username and password and respect cookies"; often,
 | sites that don't require authentication to use the web site
 | still require it for the API. Every API uses a different
 | authentication scheme and requires custom programming; for
 | web sites you have the URL of the form and the names of the
 | username and password fields, and you are done.
|
| If you feed most HTML pages through a DOM parser like
| BeautifulSoup you can extract the links and interpret them
| through regexes. You might be done right then and there. If
 | you need more, usually you can use CSS classes and ids
| and... done!
|
| I wrote a metadata extractor for Wikimedia HTML that I had
| working provisionally on Flickr HTML immediately and had
| working at 100% in 20 minutes. No way I could have gotten
| the authentication working for an API in 20 minutes.
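 |
 | (A sketch of the pattern in Python -- the URLs, field names
 | and link regex are whatever the site in question uses:)
 |
 |   import re
 |   import requests
 |   from bs4 import BeautifulSoup
 |
 |   session = requests.Session()  # respects cookies
 |
 |   # "Submit a form with username and password":
 |   session.post("https://example.com/login",
 |                data={"username": "alice", "password": "hunter2"})
 |
 |   # Feed the HTML through a DOM parser, extract the links,
 |   # interpret them through a regex:
 |   soup = BeautifulSoup(session.get("https://example.com/").text,
 |                        "html.parser")
 |   links = [a["href"] for a in soup.find_all("a", href=True)
 |            if re.match(r"/articles/\d+", a["href"])]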
| Nextgrid wrote:
| Surely you must have a way to throttle human-originated
 | abuse (such as someone spamming F5)? If so, then this would
| work equally well for the machine.
|
| Determine a reasonable rate limit and apply it to the
| human-facing version, with maybe a link to the machine-
| readable version in the error message?
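 |
 | (A minimal sketch of that idea in Python -- the numbers are
 | placeholders to tune to your hardware:)
 |
 |   import time
 |   from collections import defaultdict
 |
 |   WINDOW = 60   # seconds
 |   LIMIT = 120   # requests per client per window
 |   hits = defaultdict(list)
 |
 |   def allow(client_ip):
 |       now = time.time()
 |       hits[client_ip] = [t for t in hits[client_ip]
 |                          if now - t < WINDOW]
 |       if len(hits[client_ip]) >= LIMIT:
 |           return False  # serve a 429 linking to the data dump
 |       hits[client_ip].append(now)
 |       return True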
| ethbr0 wrote:
 | Technically, this is the same rabbit hole that leads to DRM
 | and device intent superseding user intent.
|
| Because that's the only way to be sure.
|
| It's unfortunate that it also allows havoc and burdens good
| services. But device-over-user is not a future I want to
| live in.
| capableweb wrote:
| Of course if there is another, unlimited way of getting the
| data, then it's perfectly fine to "redirect" people to
| another avenue.
|
 | What this conversation is about (since we're talking about
 | web scrapers here, not "data downloaders" or whatever you
 | would call it) is that when there is no other avenue to get
 | the data, you should be able to access the same data via a
 | machine as you could as a person.
| Doctor_Fegg wrote:
| Sure, but Till/DataHen appears to have a host of measures
| in place to ignore that "redirect".
| capableweb wrote:
 | Yes, how are they supposed to know if the "redirect" is
 | good or not? Facebook would helpfully redirect you to
 | their Terms and Conditions, something you definitely
 | should be able to ignore so you can scrape your data to
 | your heart's content anyway.
|
 | As with HTTP, the tool does not decide whether usage is
 | "nice" or "evil"; only the user, with their use case,
 | decides that.
| tyingq wrote:
| >you should be able to access the same data via a machine
| as you could when you're a person
|
| I generally agree with this, but I do see the problem for
| certain spaces.
|
| Concert tickets, for example. People write scrapers to
| get the best seats and sit on them so they can scalp them
| later at inflated prices. Or other first-come/first-serve
| situations, online auction "sniping", etc.
| osmarks wrote:
| Perhaps it would be better to not gate by whether you
| happen to have visited a website fast enough relative to
| other people. For auctions, at least, there is a simple
| solution in the form of extending the auction timer
| whenever someone bids.
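 |
 | (Sketched in Python -- the Auction shape and the five-minute
 | window are illustrative:)
 |
 |   from dataclasses import dataclass, field
 |   from datetime import datetime, timedelta
 |
 |   WINDOW = timedelta(minutes=5)
 |
 |   @dataclass
 |   class Auction:
 |       ends_at: datetime
 |       bids: list = field(default_factory=list)
 |
 |   def place_bid(auction, amount):
 |       now = datetime.utcnow()
 |       if now >= auction.ends_at:
 |           raise ValueError("auction closed")
 |       auction.bids.append(amount)
 |       # A late bid pushes the deadline out, so sniping
 |       # gains nothing over an earlier, higher bid
 |       if auction.ends_at - now < WINDOW:
 |           auction.ends_at = now + WINDOW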
| capableweb wrote:
 | The same thing applies here as I said in another comment
 | (https://news.ycombinator.com/item?id=28060567): some
 | people (me included) also write scrapers to get the best
 | seats at concerts/shows, because I'm a big fan of the
 | artist and really want to attend with good seats.
| Kiro wrote:
| Says who?
| danpalmer wrote:
| I work on a lot of web scraping and we have business agreements
| with every site that we scrape explicitly allowing us to do so
| and with pre-approval for the scraping rate (which we carefully
| control).
|
| None of this gets around over-eager Cloudflare or Akamai rules
| set up years ago by some contractor that the businesses have no
| real ability to change.
| Havoc wrote:
 | Why scrape at all if there are agreements in place? Seems
 | like an API task.
| JulianWasTaken wrote:
| And only a few paragraphs down from a heading that says
|
| > Till was architected to follow best practices that DataHen
| has accumulated
|
 | I know there are plenty of sites out there out to prevent any
 | scraping whatsoever, or to improperly block behavior most of
 | us would agree is reasonable, but this appears to be blatantly
 | hostile to the web admins out there.
| cube00 wrote:
| I agree there are cases where entities don't want things
| scraped that really should be scraped to preserve the history
| they'd like to rewrite.
|
 | My concern is that tools like these, which use circumvention
 | by default, become the go-to when someone needs scraping and
 | make life hell for those of us running sites on hardware
 | that's enough to serve our small customer base but not an
 | army of bots.
| paco3346 wrote:
| Admin of a 5000+ website platform (with customer inventory)
 | here. This is the exact kind of thing I've been working so hard
| to block lately.
|
| Since Feb 2021 we've seen a substantial increase in scraping
| of our customers' inventory to the point we now have more bot
| traffic than human traffic. Annoyingly, only 5-10 requests
| come from a single IP so we've had to resort to always
| challenging requests that come from certain ASNs (typically
| those owned by datacenters).
|
 | This type of project frustrates me because it knowingly goes
 | against a site's stated bot policy (via robots.txt).
| realusername wrote:
| We ended up here since some people wrongly decided that only
| Google has a right to scrape.
| 1vuio0pswjnm7 wrote:
| Then Google decided that no one has a right to scrape Google.
|
| The web needs a more efficient system for distributing an
| index of its content. Having web developers design websites
| in a gazillion different permutations, when they are all
 | basically doing the same thing, and then having a handful of
 | companies "scrape" them, is neither an efficient nor a
 | sensibly-designed system.
|
| The web (more generally, the internet) is a giant copy
| machine. Google, NSA and others have copies. Yet if we were
 | to allow everyone to have copies by facilitating this through
| technical means (e.g., Wikipedia-style data dumps), many
| folks would panic. When Google first started indexing, it was
| not a business, and many folks did panic and there were many
| complaints. It's 2021; folks are still spooked by others
| being able to easily copy what they publish to the web.
 | However, it's OK to them if it's Google doing the copying. If
| there were healthy competition and many search engines to
| choose from, if one search engine did not have the lion's
| share of web traffic, it's doubtful Google would be given
| "exclusive" permission in robots.txt.
| PaulHoule wrote:
| I've run sites that have a lot of pages where 80%+ of the
| traffic is web crawlers.
|
| Google sends some traffic so I can afford to let them scrape.
| Bing crawls almost as much as Google but sends 5% as much
| traffic. Baidu crawls more than Google and never sends a
| single visitor.
|
| I hate reinforcing the Google monopoly, but a crawler that
| doesn't send any traffic is expensive to serve.
| maddyboo wrote:
| Bing powers other search engines like DuckDuckGo, Ecosia,
| and Yahoo!. But I'm sure that even cumulatively the numbers
| are still small.
| dredmorbius wrote:
| You might want to ask yourself, or your readers, what it is
| people are trying to access on your site that they cannot
| by other means.
|
| The interfaces for many sites actively and with brutal
| effectiveness deny ready access to any content not
| currently featured on the homepage or stream feed. Search
| features are frequently nonexistent, crippled, or
| dysfunctional.
|
| Last week I found myself stumbling across a now-archived
| radio programme on a website which afforded exceedingly
| poor access to the content. The show ran weekly for over a
| decade, with 531 episodes. Those are visible ... ten at a
| time ... through either the Archive or Search features.
|
| Scraping the site gives me the full archive listing, all 11
| years, in a single webpage, that loads in under a second. I
| can search that by date, title, or guest to find episodes
| of interest.
|
| The utility of this (a few hours work on my part) is much
| higher than that of the site itself.
|
| Often current web sites / apps are little more than
| wrappers around a JSON delivery-and-parsing engine. Dumping
| the raw JSON can be much more useful for a technical user.
| (Reddit, Diaspora, Mastodon, and Ello are several sites in
| which this is at least somewhat possible.)
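 |
 | (Reddit, for instance, exposes the JSON behind most pages by
 | appending .json to the URL -- in Python:)
 |
 |   import requests
 |   data = requests.get(
 |       "https://www.reddit.com/r/programming.json",
 |       headers={"User-Agent": "my-reader/0.1"}).json()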
|
| Much of the suck is imposed by monetisation schemes. One
| project of mine, decrufting the Washington Post's website,
| resulted in article pages with _two percent_ the weight of
| the originally-delivered payload. The de-cruftified version
| strips not only extraneous JS and CSS, but nags and teasers
 | which are really nothing but distraction to me. Again,
 | that's typical.
|
| I'm aware that many scrapers are not benign. More than you
 | might think _are_, and the fact that casual scraping _is_
| a problem for your delivery system reflects more poorly on
| it than them.
| PaulHoule wrote:
 | Adjunk, Sidebarjunk, Javascriptjunk, Popupwindowjunk, and
 | the outlook that the most precious resource in the world
 | is a few seconds of your distracted attention are what
 | motivate the Washington Post and most of the commercial
 | web.
|
| What you are doing stripping out the junk threatens those
| organizations at the core.
| dredmorbius wrote:
| Good.
|
| https://news.ycombinator.com/item?id=26893033
|
| https://news.ycombinator.com/item?id=27803591
| PaulHoule wrote:
 | On mobile, ads, trackers and all that crap cost the
 | consumer more than the ads make.
|
 | If mobile phone companies kicked back a fraction of the
 | revenue they get to content creators, the creators would
 | be better paid than they are now and Verizon would get
 | the love it has sought in vain. (e.g. who would say a bad
 | word about the phone company?)
| dredmorbius wrote:
| That's my argument.
|
 | Global ad spend, which mostly accrues to the wealthiest 1
| billion or so, is about $600 billion. Some complex maths
| tells us that's $600 per person in the industrialised
| countries (G-20 / OECD, close enough). Global content
| spend is somewhere around $100 -- 200/year per capita.
| That's roughly the annual online ad spend.
|
| Bundled into network provisioning, that's about $30--40
| per household per month, all-you-can-eat. Information as
| a public good.
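 |
 | (The arithmetic, spelled out -- the 2.5 persons per
 | household figure is my assumption:)
 |
 |   ad_spend_per_person = 600e9 / 1e9   # ~$600/person/year
 |   low, high = 100, 200    # content spend, $/person/year
 |   per_household_month = [x * 2.5 / 12 for x in (low, high)]
 |   # -> roughly $21 to $42 per household per month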
|
| (My preference is for higher rates in more affluent
| areas, ideally by income.)
|
| Trying to figure out WCPGW.
| PaulHoule wrote:
| My personal model (emerging, there is a manifesto but I
| am rewriting it as we speak) is to rigorously control
| costs, focus on quality, stay small.
|
| Think of the old phone company slogan "reach out and
| touch someone." If I can accomplish that and spend less
| than I do on food or clothes or my car then I win.
| dredmorbius wrote:
| I'd be interested in seeing what you're developing.
|
 | The challenge, as I see it, is that information is a
| public good (in the economic sense: nonrivalrous,
| nonexcludable, zero marginal cost, high fixed costs), and
| provision _at scale_ requires either a complementary
| rents service (advertising, patronage, propaganda, fancy
 | professional-services "shingle") or a tax. Busking or its
 | public-broadcasting equivalent is another option, though
 | that's highly lossy.
|
| Any truthful publishing also requires a strong self-
| defence mechanism (protection against lawsuits, coercion,
| intimidation, protection rackets, etc.), a frequently
| underappreciated role played by publishers.
|
| Charles Perrow's descriptions of the music industry
 | (recorded and broadcast) circa 1945 -- 1985 are
| informative here (see his _Complex Organizations_
| https://www.worldcat.org/title/complex-organizations-a-
| criti...), notably the roles of publishers vs. front-line
| and studio musicians.
| Kiro wrote:
| Not sure why you are attacking a poster specifically
| talking about Google, Bing and Baidu doing massive
| scraping. What you are talking about is something
| entirely different.
| PaulHoule wrote:
| I don't feel attacked. I also don't blame him for being
| inflamed about the problem he's been inflamed at because
| I am inflamed about it too!
| dredmorbius wrote:
| Fortunately I think we both managed to realise that
| before too many rounds of this ;-)
| CiaTransMatters wrote:
| In order for the CIA to continue its legendary fight against
| transphobia, it is important for them to use these technologies
| to gather OSINT on regions that house millions of backwards
| non-progressives and all of their content.
|
| I can't believe HN is such a hotbed for transphobics like this
| to make comments without moderation or punishment.
| ok2938 wrote:
 | We ended up here because a) the promise of easy-to-use
 | structured data - which is inherent in data processing systems -
| has not been fulfilled: you have to scrape HTML to reconstruct
| some relational table, and b) information, despite being almost
| free to transmit in large amounts today, is treated as if it
| were a physical good, and each copy costs money. There is just
| lots of money in there, because our economic system looks more
| backwards than forwards.
| ChrisArchitect wrote:
What kinds of scraping are being talked about here? As a
| maintainer of many a website I hate seeing so so much bot traffic
| these days. So many sketchy foreign things. And then throw in the
| spam/vulnerability testing bots/random URLs etc. sigh. Filling up
the logs... Filling up the logs.
|
| Almost totally dependent on Cloudflare to mitigate these days/as
| an in-between.
| DaveExeter wrote:
 | I have simple PHP code that adds misbehaving IP addresses to
 | iptables. It keeps the logs clean, because they can't even
 | connect over port 80.
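 |
 | (The same idea sketched in Python rather than PHP -- assumes
 | root and that your web layer hands it the offending IPs:)
 |
 |   import subprocess
 |
 |   def ban(ip):
 |       # Drop all further packets from this address
 |       subprocess.run(["iptables", "-I", "INPUT",
 |                       "-s", ip, "-j", "DROP"], check=True)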
| dannyw wrote:
 | Can this bypass Cloudflare?
| new_guy wrote:
 | These guys bypass Cloudflare: https://microlink.io. It's open
 | source, but it's just more convenient to pay and forget about
 | it.
| jonatron wrote:
| Cloudflare have a few configuration options for the level of
| bot protection, so it's probably more a question of what
| percentage would be blocked for a particular site.
| yourad_io wrote:
| This is entirely correct.
|
| It will also depend on the proxy's ASN, the rate of requests,
 | the ability to solve captchas, and the presence of an actual
 | JS environment in the browser (as opposed to, say, curl --
 | Cloudflare uses a JavaScript environment-check script to
 | detect puppeteer/selenium etc.)
| yourad_io wrote:
| > repository contains research from CloudFlare's AntiDDoS,
| JS Challenge, Captcha Challenges, and CloudFlare WAF.
|
| https://github.com/scaredos/cfresearch
| tintt wrote:
| Sometimes I, a human being, can't bypass Cloudflare
| awestroke wrote:
 | Looks extremely useful. Starred in case I ever need to do scraping.
| kordlessagain wrote:
| If anyone needs a screenshot bot:
| https://github.com/kordless/grub-2.0
|
| Still need to do some work on it, but it does do the job!
| captainmuon wrote:
| This looks really interesting, I'm going to bookmark it!
|
| > Proxy IP address rotation
|
| I wonder where you would get a decent proxy list nowadays. In the
 | late 90s, you could find lists of accidental open proxies that
| "hackers" collected. But nowadays I've only seen really shady
| "residential proxies" that are basically private PCs with
| malware. Is there a decent source for proxies that are not widely
| blocked and not criminal and not too expensive?
|
| And by the way, while many people here question the morality of
| scraping, I have at least two legitimate use cases:
|
| 1) I built a little "vertical" search engine for a niche, using a
| combination of scraping and RSS. It doesn't consume much traffic
| and does really help people discover content in this niche.
|
| 2) A friend does research on urban development and rent prices
| and they scrape rent ads to get their data (or let students type
| it off manually, I'm not sure...)
| showerst wrote:
| I run a number of proxies for (legally) scraping government
| data sites with overly restrictive views/hour policies.
|
| They're just squid boxes spun up on various cheap VPS
| providers. A $5 digitalocean instance will power a lot of squid
| traffic, and as long as you stay away from AWS, the IP ranges
| tend not to be banned.
|
| Just make sure to lock down your allowed source list, so you
| don't become one of those hacked open proxies!
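 |
 | (A minimal squid.conf sketch of that lockdown -- the address
 | is a placeholder for your scraper's own IP:)
 |
 |   acl scraper src 203.0.113.10/32
 |   http_access allow scraper
 |   http_access deny all
 |   http_port 3128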
| k1rcher wrote:
| Does this have any sort of JavaScript emulation/support? Or is it
| purely HTTP requests?
|
 | The GID signature piece is especially interesting; I have run
 | into blockers scaling scrapers in the past, and this sort of
 | organizational/tracking platform sounds awesome.
| paramaw wrote:
 | You can connect your local browser to Till, but you need to
 | have your browser accept a custom CA (Certificate Authority).
| Instructions here:
| https://till.datahen.com/docs/installation#ca
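 |
 | (Once the CA is trusted, pointing an HTTP client at Till is
 | an ordinary proxy setup -- a Python sketch; the port and CA
 | file path here are assumptions, check your Till config:)
 |
 |   import requests
 |   till = "http://localhost:2933"
 |   r = requests.get("https://example.com",
 |                    proxies={"http": till, "https": till},
 |                    verify="till-ca.crt")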
| mellosouls wrote:
| This sounds technically interesting but ethically dubious; also
| requires sign-up - even though free - which (I think?) is
| discouraged for Show HN.
|
| If you are able to show how this service is administered
| responsibly with due regard for those running websites and
| blocking for good reason, best of luck with it.
| Kiro wrote:
| > also requires sign-up - even though free - which (I think?)
| is discouraged for Show HN.
|
| Look at past Show HNs and tell me how many respect this. It's
| such an obsolete rule that it's irrelevant and almost OT to
| even comment on.
| OJFord wrote:
| > also requires sign-up - even though free - which (I think?)
| is discouraged for Show HN.
|
| I don't doubt it's 'community discouraged' because people don't
| want to have to, but I think the only rule is that there must
| be something to try out - no announcement-only, sign up for
| wait-list, hype-building type thing.
| capableweb wrote:
| > This sounds technically interesting but ethically dubious
|
| > If you are able to show how this service is administered
| responsibly
|
 | Imagine if these were the comments about HTTP when it was
 | first launched (honestly, surely there were some, but not as
 | many as the praise).
|
| Protocols and tools are not ethically dubious or responsible
| for anything. The users who use those protocols and tools are.
| Anything that can be used for good/bad can also be used for
| bad/good.
| axiosgunnar wrote:
 | Why is this on GitHub if I still need an API token and cannot
 | run it locally, since the actual code is not public?
 |
 | Or is this some sort of "source available" content marketing
 | nonsense?
___________________________________________________________________
(page generated 2021-08-04 23:01 UTC)