[HN Gopher] Curl-Impersonate
       ___________________________________________________________________
        
       Curl-Impersonate
        
       Author : jakeogh
       Score  : 340 points
       Date   : 2024-12-30 09:18 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | londons_explore wrote:
       | > The resulting curl looks, from a network perspective, identical
       | to a real browser.
       | 
       | How close is it? If I ran wireshark, would the bytes be exactly
       | the same in the exact same packets?
        
         | dchest wrote:
         | What else could "identical" mean?
        
           | londons_explore wrote:
           | It could be that the TCP streams are the same, but
            | packetization is different.
           | 
           | It could mean that the packets are the same, but timing is
           | off by a few milliseconds.
           | 
           | It could mean a single HTTP request exactly matches, but when
           | doing two requests the real browser uses a connection pool
           | but curl doesn't. Or uses HTTP/3's fast-open abilities, etc.
           | 
           | etc.
        
             | zlagen wrote:
             | It replicates the browser at the HTTP/SSL level, not TCP.
              | From what I know this is good enough to bypass Cloudflare's
             | bot detection.
        
             | Retr0id wrote:
             | Two TLS streams are never byte-identical, due to randomness
             | inherent to the protocol.
             | 
             | Identical here means having the same fingerprint - i.e. you
             | could not write a function to reliably distinguish traffic
             | from one or the other implementation (and if you can then
             | that's a bug).
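              |
              | For intuition, a fingerprint like JA3 is just a hash over
              | the ClientHello parameters an implementation rarely
              | changes. A minimal sketch of the JA3-style construction
              | in Python:
              |
              |     import hashlib
              |
              |     def ja3(version, ciphers, extensions, curves, formats):
              |         # JA3: decimal values joined by '-' within each
              |         # field, fields joined by ','; result is MD5-hashed
              |         fields = [str(version)] + [
              |             "-".join(str(v) for v in vals)
              |             for vals in (ciphers, extensions, curves, formats)
              |         ]
              |         return hashlib.md5(",".join(fields).encode()).hexdigest()
              |
              | Randomized contents (key shares) and GREASE values are
              | excluded, which is why two non-identical streams can
              | still share a fingerprint.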
        
         | jsnell wrote:
         | The packets from Chrome wouldn't be exactly the same as packets
         | sent by Chrome at a different time either. "The exact same
         | packets" is not a viable benchmark, since both the client and
         | the server randomize the payloads in various ways. (E.g. key
         | exchange, GREASE).
        
         | peetistaken wrote:
         | You can check your fingerprint on https://tls.peet.ws
        
       | peetistaken wrote:
       | https://github.com/bogdanfinn/tls-client is the go-to package for
        | the Go world; it does the same thing.
        
       | zlagen wrote:
        | In case anyone is interested, I created something similar but
        | for Python (using Chromium's network stack):
        | https://github.com/lagenar/python-cronet I'm looking for help
        | to create the build for Windows.
        
         | hk__2 wrote:
         | Any reason you didn't use
         | https://github.com/lexiforest/curl_cffi?
        
           | zlagen wrote:
            | I wanted to try a different approach, which is to use
            | Chromium's network stack directly instead of patching curl
            | to impersonate it. In this case you're using the real
            | thing, so it's a bit easier to maintain when there are
            | changes in the fingerprint.
        
         | Klonoar wrote:
         | Similar projects exist for C#
         | (https://github.com/sleeyax/CronetSharp), Go
         | (https://github.com/sleeyax/cronet-go) and Rust
         | (https://github.com/sleeyax/cronet-rs).
         | 
          | These _can_ work well in some cases but it's always a
          | tradeoff.
        
         | thrdbndndn wrote:
         | Any plan to offer a sync API?
        
       | Retr0id wrote:
       | I recently used ja3proxy, which uses utls for the impersonation.
       | It exposes an HTTP proxy that you can use with any regular HTTP
       | client (unmodified curl, python, etc.) and wraps it in a TLS
       | client fingerprint of your choice. Although I don't think it does
       | anything special for http/2, which curl-impersonate does
       | advertise support for.
       | 
       | https://github.com/LyleMi/ja3proxy
       | 
       | https://github.com/refraction-networking/utls
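        |
        | A minimal sketch of the proxy setup, assuming ja3proxy is
        | listening locally on port 8080 (port and endpoint are
        | illustrative; check the project's README for the exact
        | invocation):
        |
        |     import requests  # any unmodified client works
        |
        |     proxies = {"http": "http://127.0.0.1:8080",
        |                "https": "http://127.0.0.1:8080"}
        |
        |     # The proxy terminates TLS with the spoofed fingerprint,
        |     # so the certificate seen locally won't be the origin's;
        |     # hence verify=False for this local hop.
        |     r = requests.get("https://tls.peet.ws/api/all",
        |                      proxies=proxies, verify=False)
        |     print(r.text)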
        
       | TekMol wrote:
       | What is the use case? If you have to read data from one specific
       | website which uses handshake info to avoid being read by
       | software?
       | 
       | When I have to do HTTP requests these days, I default to a
       | headless browser right away, because that seems to be the best
        | bet. Even then, some websites are not readable because they use
       | captchas and whatnot.
        
         | mschuster91 wrote:
         | > What is the use case? If you have to read data from one
         | specific website which uses handshake info to avoid being read
         | by software?
         | 
         | Evade captchas. curl user agent / heuristics are blocked by
         | many sites these days - I'd guess many popular CDNs have pre-
         | defined "block bots" stuff that blocks everything automated
         | that is not a well-known search engine indexer.
        
         | adastral wrote:
         | > I default to a headless browser
         | 
         | Headless browsers consume orders of magnitude more resources,
         | and execute far more requests (e.g. fetching images) than a
         | common webscraping job would require. Having run webscraping at
         | scale myself, the cost of operating headless browsers made us
         | only use them as a last resort.
        
           | TekMol wrote:
           | So you maintain a table of domains and how to access them?
           | 
           | How do you build that table and keep it up to date? Manually?
        
           | at0mic22 wrote:
           | Blocking all image/video/CSS requests is the rule of thumb
           | when working with headless browsers via CDP
        
             | sangnoir wrote:
             | Speaking as a person who has played on both offense and
             | defense: this is a heuristic that's not used frequently
             | enough by defenders. Clients that load a single HTML/JSON
              | endpoint without loading CSS or image resources associated
             | with the endpoints are likely bots (or user agents with a
             | fully loaded cache, but defenders control what gets cached
             | by legit clients and how). Bot data thriftiness is a huge
             | signal.
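              |
              | A toy sketch of the heuristic (all names and thresholds
              | are illustrative): count page loads and asset loads per
              | client, and flag clients that only ever fetch pages.
              |
              |     from collections import defaultdict
              |
              |     pages = defaultdict(int)
              |     assets = defaultdict(int)
              |
              |     ASSET_EXTS = (".css", ".js", ".png", ".jpg", ".woff2")
              |
              |     def observe(client, path):
              |         if path.endswith(ASSET_EXTS):
              |             assets[client] += 1
              |         else:
              |             pages[client] += 1
              |
              |     def data_thrifty(client, min_pages=5):
              |         # many page loads, zero associated asset loads
              |         return pages[client] >= min_pages and assets[client] == 0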
        
               | at0mic22 wrote:
                | As a high-load systems engineer you'd want to offload
                | asset serving to a CDN, which makes detection slightly
                | more complicated. The easy way is to attach an image
                | onload handler with client-side JS, but that would give
                | a high yield of false positives. I personally have
                | never seen such an approach and doubt it's useful for
                | many concerns.
        
               | sangnoir wrote:
               | Unless organization policy forces you to, you do not have
               | to put _all_ resources behind a CDN. As a matter of fact,
               | getting this heuristic to work requires a non-optimal
               | caching strategy of one or more real or decoy resources -
               | CDN or not.  "Easy" is not an option for the bot/anti-bot
               | arms race, all the low hanging fruit is now gone when
               | fighting a determined adversary on either end.
               | 
                | > I personally have never seen such an approach and
                | doubt it's useful for many concerns.
                |
                | It's an arms race and defenders are not keen on sharing
                | their secret sauce, though I can't be the only one who
                | thought of this rather basic bot characteristic;
                | multiple abuse teams probably realized this decades
                | ago. It works pretty well against the low-resource
                | scrapers with faked UA strings and all the right TLS
                | handshakes. It won't work against headless browsers,
                | which cost scrapers more in resources and bandwidth,
                | and there are specific countermeasures for headless
                | browsers [1], and counter-countermeasures. It's a cat
                | and mouse game.
                |
                | 1. e.g. Mouse movement, as made famous as one signal
                | evaluated by Google's reCAPTCHA v2, monitor resolution
                | & window size and position, and Canvas rendering, all
                | of which have been gradually degraded by browser anti-
                | fingerprinting efforts. The bot war is fought on the
                | long tail.
        
               | zzo38computer wrote:
               | Even legitimate users might want to disable CSS and
               | pictures and whatever, and I often do when I just want to
               | read the document.
               | 
                | Blind users also might have no use for the pictures.
                | Another possibility: if the document is longer than
                | the screen, so a picture is out of view, the user
                | might program the client software to lazy-load it,
                | etc.
        
       | jollyllama wrote:
       | >The Client Hello message that most HTTP clients and libraries
       | produce differs drastically from that of a real browser.
       | 
       | Why is this?
        
         | throwaway99210 wrote:
         | Based on what I've seen, most command-line clients and basic
         | HTTP libraries typically ship with leaner, more static
         | configurations (e.g., no GREASE extensions in the Client Hello,
          | limited protocols in the ALPN extension, a smaller set of
          | signature algorithms). Mirroring real browser TLS
          | fingerprints is also more difficult due to the randomization
          | of Client Hello parameters (e.g., in current versions of
          | Chrome).
        
         | Retr0id wrote:
         | The protocols are flexible and most browsers bring their own
          | HTTP+TLS clients.
        
         | zlagen wrote:
          | They use different SSL libraries/configurations. Chrome uses
          | BoringSSL, while other clients may use OpenSSL or some other
          | library. Beyond that, the SSL library may be configured with
          | different cipher suites and extensions. The solution these
          | impersonators provide is to use the same SSL library and
          | configuration as a real browser.
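          |
          | Even within a single library the offered cipher list is just
          | configuration, which is part of why fingerprints diverge. A
          | quick illustration with Python's ssl module:
          |
          |     import ssl
          |
          |     # The default context's cipher suites (and extensions)
          |     # are what end up in the ClientHello fingerprint.
          |     ctx = ssl.create_default_context()
          |     print([c["name"] for c in ctx.get_ciphers()][:5])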
        
       | cle wrote:
       | The same author also makes a Python binding of this which exposes
       | a requests-like API in Python, very helpful for making HTTP reqs
       | without the overhead of running an entire browser stack:
       | https://github.com/lexiforest/curl_cffi
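        |
        | A minimal usage sketch (the impersonation target string varies
        | by curl_cffi version; "chrome" is assumed here):
        |
        |     from curl_cffi import requests
        |
        |     # Sends Chrome's TLS/HTTP2 fingerprint; compare the result
        |     # against an unimpersonated request on a fingerprint echo
        |     # service like tls.peet.ws.
        |     r = requests.get("https://tls.peet.ws/api/all",
        |                      impersonate="chrome")
        |     print(r.text)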
       | 
       | I can't help but feel like these are the dying breaths of the
       | open Internet though. All the megacorps (Google, Microsoft,
       | Apple, CloudFlare, et al) are doing their damndest to make sure
       | everyone is only using software approved by them, and to ensure
       | that they can identify you. From multiple angles too (security,
       | bots, DDoS, etc.), and it's not just limited to browsers either.
       | 
       | End goal seems to be: prove your identity to the megacorps so
       | they can track everything you do and also ensure you are only
       | doing things they approve of. I think the security arguments are
       | just convenient rationalizations in service of this goal.
        
         | throwaway99210 wrote:
         | > I can't help but feel like these are the dying breaths of the
         | open Internet though
         | 
          | I agree about the overzealous tracking by the megacorps, but
          | this is also due to bad actors. I work for a financial
          | company, and the amount of API abuse, ATO, DDoS, nefarious
          | bot traffic, etc. we see on a daily basis is absolutely
          | insane.
        
           | berkes wrote:
            | But how much of this "bad actor" interaction is countered
            | with tracking? And how many of these attempts come even
            | close to succeeding against even the simplest out-of-the-
            | box security practices?
            |
            | And when it does get more dangerous, is overzealous
            | tracking the best counter?
            |
            | I've dealt with a lot of these threats as well, and a lot
            | are countered with rather common tools, from simple
            | fail2ban rules to application firewalls and private
            | subnets and whatnot. E.g. a large fail2ban rule to just
            | ban anything that attempts to HTTP GET /admin.php or
            | /phpmyadmin etc, even just once, gets rid of almost all
            | nefarious bot traffic.
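            |
            | A toy sketch of that idea (paths and log format are
            | illustrative, not a drop-in fail2ban filter):
            |
            |     import re
            |
            |     # probe paths no legitimate client should request
            |     HONEYPOT = re.compile(r'"GET /(?:admin\.php|phpmyadmin)')
            |
            |     def ips_to_ban(access_log_lines):
            |         # client IP is the first field in common/combined
            |         # log formats
            |         return {line.split()[0] for line in access_log_lines
            |                 if HONEYPOT.search(line)}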
           | 
            | So, yes, the number of attacks can indeed be insane. But
            | the number that need overzealous tracking to be countered
            | is, AFAICS, rather small.
        
             | throwaway99210 wrote:
              | > E.g. a large fail2ban rule to just ban anything that
              | attempts to HTTP GET /admin.php or /phpmyadmin etc, even
              | just once, gets rid of almost all nefarious bot traffic.
              |
              | Unfortunately fail2ban wouldn't even make a dent in the
              | attack traffic hitting the endpoints in my day-to-day
              | work. These are attackers utilizing residential proxy
              | infrastructure who are increasingly capable of solving
              | JS/client-puzzle challenges... the arms race is always
              | escalating.
        
               | JohnMakin wrote:
                | We see the same thing, also at a financial company.
                | The most successful strategy we've seen is making
                | stuff like this extremely expensive for whoever it is
                | when we see it, until they stop or slow down to the
                | point that it's not worth it and they move on.
                | Sometimes that's really all you can do without harming
                | legit traffic.
        
               | josephcsible wrote:
               | Such a rule is a great way to let malicious users lock
               | out a bunch of your legitimate customers. Imagine if
               | someone makes a forum post and includes this in it:
               | [img]https://example.com/phpmyadmin/whatever.png[/img]
        
             | tialaramex wrote:
             | A big problem is that where we have a good solution you'll
             | lose if you insist on that solution but other people get
             | away with doing something that's crap but customers like
              | better. We often have to _mandate_ a poor solution that
              | will be tolerated, because if we mandate the better
              | solution it will be rejected, and if we don't mandate
              | anything the outcomes are far worse.
             | 
             | Today for example I changed energy company+. I made a
             | telephone call, from a number the company has never seen
             | before. I told them my name (truthfully but I could have
             | lied) and address (likewise). I agreed to about five
             | minutes of parameters, conditions, etc. and I made one
             | actual meaningful choice (a specific tariff, they offer
             | two). I then provided 12 digits identifying a bank account
             | (they will eventually check this account exists and ask it
             | to pay them money, which by default will just work) and I'm
             | done.
             | 
              | Notice that _anybody_ could call from a burner and that
              | would work too. They could move Aunt Sarah's energy to
              | some random outfit, assign payments to Jim's bank
              | account, and cause maybe an hour of stress and confusion
              | for both Sarah and Jim when, months or years later, they
              | realise the problem.
             | 
             | We know how to do this properly, but it would be high
             | friction and that's not in the interests of either the
             | "energy companies" or the politicians who created this
             | needlessly complicated "Free Market" for energy. We could
             | abolish that Free Market, but again that's not in their
             | interests. So, we're stuck with this waste of our time and
             | money, indefinitely.
             | 
              | There have been _simpler_ versions of this system, which
              | had even worse outcomes. They're clumsier to use, they
              | cause more people to get scammed AND they result in
              | higher costs to consumers, so that's not great. And
              | there are _better_ systems we can't deploy because in
              | practice too few consumers will use them, so you'd have
              | 0% failure but lower total engagement, and that's what
              | matters.
             | 
             | + They don't actually supply either gas or electricity,
             | that's a last mile problem solved by a regulated monopoly,
             | nor do they make electricity or drill for gas - but they do
             | bill me for the gas and electricity I use - they're an
             | artefact of Capitalism.
        
             | Szpadel wrote:
             | I can tell you about my experience with blocking traffic
             | from scalpers bots that were very active during pandemic.
             | 
              | All requests produced by those bots were valid ones,
              | nothing that could be flagged by tools like fail2ban
              | etc. (my assumption is that it would be the same for
              | financial systems).
              |
              | Any blocking or rate limiting by IP is useless: we saw
              | about 2-3 requests per minute per IP, and those actors
              | had access to a ridiculous number of large CIDRs, so
              | blocking any IP caused it to be instantly replaced with
              | another.
              |
              | Blocking by AS number was also a mixed bag, as this list
              | grew really quickly; most of them were registered to
              | suspicious-looking Gmail addresses. (I suspect such
              | actors might own a significant percentage of the total
              | IPv4 space.)
              |
              | This was basically a cat and mouse game of finding some
              | specific characteristic in requests that matched all
              | that traffic and filtering on it, but the other side
              | would adapt the next day or on a Sunday.
              |
              | The aggregate traffic was in the range of 2-20k r/s to
              | basically the heaviest endpoint in the shop, which was
              | the main reason we needed to block it (it generated
              | 20-40x the load of organic traffic).
              |
              | Cloudflare was also not really successful with the
              | default configuration; we had to challenge everyone by
              | default, with a whitelist of the most common regions
              | where we expected customers.
              |
              | So the best solution is to track everyone and calculate
              | long-term reputation.
        
               | stareatgoats wrote:
                | Blocking scalper bot traffic by any means, be it by
                | source or certified identification, seems a lost
                | cause, i.e. not possible because it can always be
                | circumvented.
               | Why did you not have that filter at point of sale
               | instead? I'm sure there are reasons, but to have a
               | battery of captchas and a limit on purchases per credit
               | card seems on the surface much more sturdy. And it
               | doesn't require that everyone browsing the internet
               | announce their full name and residential address in order
               | to satisfy the requirements of a social score ...
        
               | Szpadel wrote:
                | The product they tried to buy was not in stock anyway,
                | but their strategy was to constantly try regardless,
                | so that if it came back in stock they would be the
                | first to get it. It was all via guest checkout, so
                | there was no address or credit card to validate yet.
                | Because they used the API endpoints used by the
                | frontend, we could not use any captcha at this stage
                | for technical reasons.
                |
                | As stated before, the main reason we needed to block
                | it was the volume of the traffic; you might imagine an
                | identical scenario for dealing with a DDoS attack.
        
               | bornfreddy wrote:
               | > Because they used API endpoints used by the frontend we
               | could not use any captcha at this place because of
               | technical requirements.
               | 
               | That doesn't compute... Captcha is almost always used in
               | such setups.
               | 
                | It also looks like you could just offer an API
                | endpoint which would return whether the article is in
                | stock, or even provide a webhook. Why fight them? Just
                | make the resource usage lighter.
               | 
               | I'm curious now though what the articles were, if you are
               | at liberty to share?
        
               | Szpadel wrote:
                | We had a captcha, but it was at a later stage of the
                | checkout process. This API endpoint needed to work
                | from cached pages, so it could not contain any dynamic
                | state in the request.
                |
                | Some bots checked the product page, where we had info
                | on whether the product was in stock (although they
                | tried heavily to bypass any caches by putting garbage
                | in the URL). These bots also scaled instantly to
                | thousands of checkout requests when the product became
                | available, which gave no time for auto scaling to
                | react (this was another challenge here).
                |
                | This was easy to mitigate, so it generated almost no
                | load on the system.
                |
                | I believe we had email notifications available, but
                | that may have been too high-latency for them.
                |
                | I'm not sure how much I can share about the articles
                | here, but I can say that they were fairly expensive
                | (and limited-series) wardrobe products.
        
               | shaky-carrousel wrote:
                | Hm, it's probably too late, but you could have
                | implemented some kind of proof of work in your API
                | calls. Something that's not too onerous for a casual
                | user but is hard for someone making many requests.
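                |
                | A hashcash-style sketch (difficulty and encoding are
                | illustrative):
                |
                |     import hashlib, itertools
                |
                |     def solve(challenge: bytes, prefix="0000"):
                |         # find a nonce whose hash has the prefix
                |         for nonce in itertools.count():
                |             h = hashlib.sha256(
                |                 challenge + str(nonce).encode()
                |             ).hexdigest()
                |             if h.startswith(prefix):
                |                 return nonce
                |
                |     def verify(challenge: bytes, nonce: int,
                |                prefix="0000"):
                |         h = hashlib.sha256(
                |             challenge + str(nonce).encode()
                |         ).hexdigest()
                |         return h.startswith(prefix)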
        
               | miki123211 wrote:
               | > Why fight them? Just make the resource usage lighter.
               | 
               | Because you presumably want real, returning customers,
               | and that means those customers need to get a chance at
               | buying those products, instead of them being scooped up
               | by a scalper the millisecond they appear on the website.
        
               | geysersam wrote:
                | Sounds like a dream, having customers scoop up your
                | products the millisecond they appear on the website.
                | They should increase their prices.
        
               | sesm wrote:
               | I remember people doing this with PS5 when they were in
               | short supply after release.
        
               | cute_boi wrote:
               | why not charge people? This is the only solution I can
               | think of.
        
               | shwouchk wrote:
               | Require a verified account to buy high demand items.
        
               | codingminds wrote:
                | I've learned that Akamai has a service that deals with
                | this specific problem; it might interest you as well:
                | https://www.akamai.com/products/content-protector
        
             | mattpallissard wrote:
              | That's not the same type of botnet. fail2ban simply is
              | not going to work when you have a popular
              | unauthenticated endpoint. You have hundreds of thousands
              | of rps spread across thousands of legitimate networks,
              | and the requests are always modified to look legitimate
              | in a never ending game of whack-a-mole.
             | 
              | You wind up having to use things like TLS fingerprinting
              | with other heuristics to identify what traffic to
              | reject. These all take engineering hours and require
             | infrastructure. It is SO MUCH SIMPLER to require auth and
             | reject everything else outright.
             | 
              | I know that the BigCos want to track us, and you
              | originally mentioned tracking, not auth. But my point
              | is: yeah, they have malicious reasons for locking things
              | down, but there are legitimate reasons too.
        
               | sangnoir wrote:
               | > You wind up having to use things like tls
               | fingerprinting
               | 
                | ...and we've circled back to the post's subject - a
                | version of curl that impersonates browsers' TLS
                | handshake behavior to bypass such fingerprinting.
        
               | fijiaarone wrote:
                | Easy solution to rate limiting: require an initial
                | request to get a one-time token, with a one-second
                | delay, and then require valid requests to include the
                | token. The returned token has a salt with something
                | like the timestamp and IP. That way they can only
                | bombard the token generator.
               | 
               | get /token
               | 
               | Returns token with timestamp in salted hash
               | 
               | get /resource?token=abc123xyz
               | 
               | Check for valid token and drop or deny.
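                |
                | A sketch of the token half in Python (secret, field
                | layout and max age are illustrative):
                |
                |     import hashlib, hmac, time
                |
                |     SECRET = b"server-side secret"
                |
                |     def issue(ip: str) -> str:
                |         ts = str(int(time.time()))
                |         sig = hmac.new(SECRET,
                |                        f"{ip}:{ts}".encode(),
                |                        hashlib.sha256).hexdigest()
                |         return f"{ts}.{sig}"
                |
                |     def valid(ip: str, token: str, max_age=60):
                |         ts, sig = token.split(".")
                |         good = hmac.new(SECRET,
                |                         f"{ip}:{ts}".encode(),
                |                         hashlib.sha256).hexdigest()
                |         return (hmac.compare_digest(sig, good)
                |                 and time.time() - int(ts) < max_age)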
        
             | jsnell wrote:
             | The question is a bit of a non sequitur, since this is not
             | tracking. The TLS fingerprint is not a useful tracking
             | vector, by itself nor as part of some composite
             | fingerprint.
        
               | fijiaarone wrote:
                | The point is that you have to use an approved client
                | (e.g. browser, OS) with an approved cert authority
                | that goes through approved gatekeepers (e.g.
                | Cloudflare, Akamai).
        
             | miki123211 wrote:
             | This depends on what you're fighting.
             | 
             | If you're fighting adversaries that go for scale, AKA
             | trying to hack as many targets as possible, mostly low-
             | sophistication, using techniques requiring 0 human work and
             | seeing what sticks, yes, blocking those simple techniques
             | works.
             | 
             | Those attackers don't ever expect to hack Facebook or your
             | bank, that's just not the business they're in. They're fine
             | with posting unsavory ads on your local church's website,
             | blackmailing a school principal with the explicit pictures
             | he stores on the school server, or encrypting all the data
             | on that server and demanding a ransom.
             | 
             | If your company does something that is specifically
             | valuable to someone, and there are people _whose literal
             | job it is to attack your company 's specific systems_, no,
             | those simple techniques won't be enough.
             | 
             | If you're protecting a Church with 150 members, the simple
             | techniques are probably fine, if you're working for a major
             | bank or a retailer that sells gaming consoles or concert
             | tickets, they're laughably inadequate.
        
           | cle wrote:
           | Yep totally agree these are problems. I don't have a good
           | alternative proposal either, I'm just disappointed with what
           | we're converging on.
        
           | code51 wrote:
           | Much of this "bad actor" activity is actually customer needs
           | left hanging - for either the customer to automate herself or
           | other companies to fill the gap to create value that's not
           | envisioned by the original company.
           | 
           | I'm guessing investors actually like a healthy dose of open
           | access and a healthy dose of defence. We see them (YC, as an
           | example) betting on multiple teams addressing the same
           | problem. The difference is their execution, the angle they
           | attack.
           | 
           | If, say, the financial company you work for is capable in
            | both product and technical aspects, I assume it leaves no
           | It's the main place to access the service and all the side
           | benefits.
        
             | miki123211 wrote:
             | > Much of this "bad actor" activity is actually customer
             | needs left hanging - for either the customer to automate
             | herself or other companies to fill the gap to create value
             | 
             | Sometimes the customer you have isn't the customer you
             | want.
             | 
             | As a bank, you don't want the customers that will try to
             | log in to 1000 accounts, and then immediately transfer any
             | money they find to the Seychelles. As a ticketing platform,
             | you don't want the customers that buy tickets and then
             | immediately sell them on for 4x the price. As a messaging
             | app, you don't want the customers who have 2000 bot
             | accounts and use AI to send hundreds of thousands of spam
             | messages a day. As a social network, you don't want the
             | customers who want to use your platform to spread pro-
             | russian misinformation.
             | 
              | In a sense, those are "customer needs left hanging", but
              | neither you nor other customers want those needs to be
              | automatable.
        
         | schnable wrote:
         | A lot of the motivation comes from government regulations too.
         | Right now this is mostly in banking, but social media and porn
         | regs are coming too.
        
           | lelandfe wrote:
           | PornHub and all of its affiliate sites now block all
           | residents of Alabama, Arkansas, Idaho, Indiana, Kansas,
           | Kentucky, Mississippi, Montana, Nebraska, North Carolina,
           | Texas, Utah, and Virginia (and Florida on Jan 1):
           | https://www.pcmag.com/news/pornhub-blocked-florida-
           | alabama-t...
           | 
           | Child safety, as always, was the sugar that made the medicine
           | go down in freedom-loving USA. I imagine these states'
           | approaches will try to move to the federal level after
           | Section 230 dies an ignominious death.
           | 
           | Keep an eye out for _Free Speech Coalition v. Paxton_ to hit
           | SCOTUS in January: https://www.oyez.org/cases/2024/23-1122
        
         | octocop wrote:
         | "I have nothing to hide" will eventually spread to everyone.
         | Very unfortunate.
        
           | cle wrote:
           | I'm in a similar boat but it's more like "I have nothing I
           | can hide".
           | 
           | These days I just tell friends & family to assume that
           | nothing they do is private.
        
           | Habgdnv wrote:
            | The answer is simple: I have something to hide. I have
            | many things to hide, actually. None of them are illegal
            | currently, but I still have many things to hide. And if I
            | have something to hide, I can be worried about many
            | things.
        
         | deadbabe wrote:
         | Even if the internet was wide open it's of little use these
         | days.
         | 
         | AI will replace any search you would want to do to find
         | information, the only reason to scour the internet now is for
         | social purposes: finding comments and forums or content from
         | other users, and you don't really need to be untracked to do
         | all that.
         | 
         | A megacorp's main motivation for tracking your identity is to
         | sell you shit or sell your data to other people who want to
         | sell you things. But if you're using AI the amount of ads and
         | SEO spam that you have to sift through will dramatically
         | reduce, rendering most of those efforts pointless.
         | 
         | And most people aren't using the internet like in the old days:
         | stumbling across quaint cozy boutique websites made by
         | hobbyists about some favorite topic. People just jump on social
         | platforms and consume content until satisfied.
         | 
         | There is no money to be made anymore in mass web scraping at
         | scale with impersonated clients, it's all been consumed.
        
       | oefrha wrote:
       | What are some example sites where this is both necessary and
       | sufficient? In my experience sites with serious anti-bot
       | protection basically always have JavaScript-based browser
       | detection, and some are capable of defeating puppeteer-extra-
       | plugin-stealth even in headful mode. I doubt sites without
       | serious anti-bot detection will do TLS fingerprinting. I guess it
       | is useful for the narrower use case of getting a short-lived
       | token/cookie with a headless browser on a heavily defended site,
       | then performing requests using said tokens with this lightweight
       | client for a while?
        
         | jonatron wrote:
         | There are sites that will block curl and python-requests
         | completely, but will allow curl-impersonate. IIRC, Amazon is an
         | example that has some bot protection but it isn't "serious".
        
           | ekimekim wrote:
            | In most cases this is just based on user agent. It's
            | widespread enough that I just habitually tell requests not
            | to set a User-Agent at all (requests with no UA aren't
            | blocked, but any UA containing "python" is).
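            |
            | In python-requests, setting a header to None omits it
            | entirely rather than sending the default
            | "python-requests/x.y.z" value:
            |
            |     import requests
            |
            |     # no User-Agent header is sent at all
            |     r = requests.get("https://example.com",
            |                      headers={"User-Agent": None})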
        
         | Retr0id wrote:
         | A lot of WAFs make it a simple thing to set up. Since it
         | doesn't require any application-level changes, it's an easy
         | "first move" in the anti-bot arms race.
         | 
         | At the time I wrote this up, r1-api.rabbit.tech required TLS
         | client fingerprints to match an expected value, and not much
         | else:
         | https://gist.github.com/DavidBuchanan314/aafce6ba7fc49b19206...
         | 
         | (I haven't paid attention to what they've done since so it
         | might no longer be the case)
        
           | oefrha wrote:
           | Makes sense, thanks.
        
         | Avamander wrote:
          | Cloudflare offers it. Even if it's not used for blocking, it
          | might be used for analytics or threat calculations, so you
          | might get hit later.
        
         | thrdbndndn wrote:
         | Lots of sites, actually.
         | 
         | > I doubt sites without serious anti-bot detection will do TLS
         | fingerprinting
         | 
          | They don't set it up themselves. Cloudflare offers such a
          | thing by default (?).
        
           | oefrha wrote:
            | Pretty sure it's not the default, and the Cloudflare
            | browser check and/or captcha is a way bigger problem than
            | TLS fingerprinting; at least that was the case the last
            | time I scraped a site behind Cloudflare.
        
         | remram wrote:
         | Those JavaScript scripts often get data from some API, and it's
         | that API that will usually be behind some fingerprinting wall.
        
       | ape4 wrote:
       | I like this project!
       | 
        | Is there a way to request impersonation of the current
        | version of Chrome (or whatever)?
        
       | aninteger wrote:
       | I think we should list the sites where this fingerprinting is
       | done. I have a suspicion that Microsoft does it for conditional
       | access policies but I am not sure of other services.
        
         | Galanwe wrote:
          | We cannot really list them, as 90% of the time it's not the
          | websites themselves, it's their WAF. And there is a trend
          | toward most company websites being behind a WAF nowadays, to
          | avoid 1) annoying regulations (US companies putting
          | geolocation blocks on their websites to avoid EU cookie
          | regulations) and 2) DDoS.
          |
          | It's now pretty common to have Cloudflare, AWS, etc. WAFs as
          | the main endpoints, and these do anti-bot checks (TLS
          | fingerprinting, header fingerprinting, JavaScript checks,
          | captchas, etc.).
        
         | pixelesque wrote:
          | Cloudflare (which seems to be fronting half the web these
          | days, based off the number of cf-ray headers that I see
          | being sent back) does this with bot protection on, and
          | Akamai has something similar I think.
        
       | Sytten wrote:
        | Thankfully only a small fraction of websites do JA3/JA4
        | fingerprinting. Some do more advanced stuff like correlating
        | headers to the fingerprint. We have been able to get away
        | without doing much in Caido for a long time, but I am working
        | on an OSS Rust-based equivalent. Neat trick: you can use the
        | fingerprint of our competitor (Burp Suite), since it is
        | whitelisted so the security folks can do their job. The only
        | time you will not hear me complain about checkbox security.
        
       | jandrese wrote:
        | The build scripts on this repo seem a bit cursed. It uses
        | autotools but has you build in a subdirectory. The default
        | build target is a help text instead of just building the
        | project. When you do use the listed build target it doesn't
        | have the dependencies set up correctly, so you have to run it
        | like 6 times to get to the point where it is building the
        | application.
        |
        | Ultimately I was not able to get it to build because the
        | BoringSSL distro it downloaded failed to build, even though I
        | made sure all of the dependencies INSTALL.md listed are
        | installed. This might be because the machine I was trying to
        | build it on is an older Ubuntu 20 release.
       | 
       | Edit: Tried it on Ubuntu 22, but BoringSSL again failed to build.
       | The make script did work better however, only requiring a single
       | invocation of make chrome-build before blowing up.
       | 
       | Looks like a classic case of "don't ship -Werror because compiler
       | warnings are unpredictable".
       | 
       | Died on:
       | 
       | /extensions.cc:3416:16: error: 'ext_index' may be used
       | uninitialized in this function [-Werror=maybe-uninitialized]
       | 
       | The good news is that removing -Werror from the CMakeLists.txt in
       | BoringSSL got around that issue. Bad news is that the dependency
       | list is incomplete. You will also need libc++-XX-dev and
       | libc++abi-XX-dev where the XX is the major version number of GCC
       | on your machine. Once you fix that it will successfully build,
       | but the install process is slightly incomplete. It doesn't run
       | ldconfig for you, you have to do it yourself.
       | 
        | On a final note, despite the name, BoringSSL is a huge library
        | that takes a surprisingly long time to build. I thought it
        | would be like LibreSSL, where they trim it down to the core to
        | keep the attack surface small, but apparently Google went in
        | the opposite direction.
        
         | at0mic22 wrote:
          | Played this game and switched to prebuilt libraries. I think
          | the builder Docker images have also been broken for a while.
        
         | 38 wrote:
          | That's exactly why I stopped using C/C++. Building is often
          | a nightmare, and the language teams seem to have no interest
          | in improving the situation.
        
       | kerblang wrote:
       | Interesting in light of another much-discussed story about AI
       | scraper farms swamping/DDOSing sites
       | https://news.ycombinator.com/item?id=42549624
        
       ___________________________________________________________________
       (page generated 2024-12-30 23:00 UTC)