[HN Gopher] How Bear does analytics with CSS
___________________________________________________________________
How Bear does analytics with CSS
Author : todsacerdoti
Score : 292 points
Date : 2023-11-01 08:08 UTC (14 hours ago)
(HTM) web link (herman.bearblog.dev)
(TXT) w3m dump (herman.bearblog.dev)
| user20231101 wrote:
| Smart approach!
| nannal wrote:
| > And not just the bad ones, like Google Analytics. Even Fathom
| and Plausible analytics struggle with logging activity on
| adblocked browsers.
|
 | I believe that's because they're trying to live in what
 | amounts to a toxic wasteland. Users like us are done with the
 | whole concept, and as such I assume that if CSS analytics
 | becomes popular, attempts will be made to bypass it too.
| account-5 wrote:
 | Reminds me of uMatrix, which could block the loading
 | of CSS too.
| momentary wrote:
 | Is uMatrix not in vogue any more? It's still my go-to tool!
| account-5 wrote:
 | It's not actively developed anymore, so I've been using
 | uBlock's advanced options, which are good but not as good as
 | uMatrix was.
| its-summertime wrote:
| ||somesite.example^$css
|
 | would work in uBlock
| account-5 wrote:
 | I didn't know this. But with uMatrix you could block by
 | default on all websites and then whitelist those you wanted
 | it for. At least that's the way I used it, and uBlock's
 | advanced user features.
| berkes wrote:
| Why?
|
 | I manually unblocked Piwik/Matomo, Plausible, and Fathom in
 | uBlock. I don't see any harm in what these track or how. And
 | they do give the people behind the site valuable information
 | "to improve the service".
|
| e.g. Plausible collects less information on me than the common
 | nginx or Apache logs do. For me, as a blogger, it's important to
| see when a post gets on HN, is linked from somewhere and what
| kinds of content are valued and which are ignored. So that I
| can blog about stuff you actually want to read and spread it
| through channels so that you are actually aware of it.
| morelisp wrote:
| You're just saying a smaller-scale version of "as a publisher
| it's important for me to collect data on my audience to
| optimize my advertising revenue." The adtech companies take
| the shit for being the visible 10% but publishers are
| consistently the ones pressuring for more collection.
| ordersofmag wrote:
| I'm a website 'publisher' for a non-profit that has zero
| advertising on our site. Our entire purpose for collecting
| analytics is to make the site work better for our users.
| Really. Folks like us may not be in the majority but it's
| worth keeping in mind that "analytics = ad revenue
| optimization" is over-generalizing.
| morelisp wrote:
| I'm sure your stated 13 years of data is absolutely
| critical to optimize your page load times.
| majewsky wrote:
| Can you give some examples of changes that you made
| specifically to make the site work better for users, and
| how those were guided by analytics? I usually just do
| user interviews because building analytics feels like
| summoning a compliance nightmare for little actual
| impact.
| arp242 wrote:
| I've decided to either stop working or keep working on
| some things based on the fact that I did or didn't get
| any traffic for it. I've become aware some pages were
| linked on Hacker News, Lobsters, or other sites, and
| reading the discussion I've been able to improve some
| things in the article.
|
| And also just knowing some people read what you write is
| nice. There is nothing wrong with having some validation
| (as long as you don't obsess over it) and it's a basic
| human need.
|
| This is just for a blog; for a product knowing "how many
| people actually use this?" is useful. I suspect that for
| some things the number is literally 0, but it can be hard
| to know for sure.
|
| User interviews are great, but it's time-consuming to do
| well and especially for small teams this is not always
| doable. It's also hard to catch things that are useful
 | for just a small fraction of your users. I.e. "it's
 | useful for 5%" means you need to do a lot of user
 | interviews (_and_ hope they don't forget to mention
 | it!)
| HuwFulcher wrote:
 | How horrifying that someone who potentially earns their
 | income from writing would seek to protect that revenue stream.
|
| Services like Plausible give you the bare minimum to
| understand what is viewed most. If you have a website that
| you want people to visit then it's a pretty basic
| requirement that you'll want to see what people are
| interested in.
|
| When you start "personalising" the experience based on some
| tracking that's when it becomes a problem.
| peoplefromibiza wrote:
| > a pretty basic requirement that you'll want to see what
| people are interested in.
|
| not really
|
| it should be what you are competent and proficient at
|
| people will come because they like what you do, not
| because you do the things they like (sounds like the same
| thing, but it isn't)
|
| there are many proxies to know what they like if you want
| to plan what to publish and when and for how long,
| website visits are one of the less interesting.
|
| a lot of websites such as this one get a lot of visits
| that drive no revenue at all.
|
 | OTOH there are websites that receive a small number of
 | visits, but make revenue based on the number of people
 | subscribing to the content (the textbook example is OF,
 | where people can get from a handful of subscribers what
 | others earn from hundreds of thousands of views on YT or
 | the like)
|
| so basically monitoring your revenues works better than
| constantly optimizing for views, in the latter case you
| are optimizing for the wrong thing
|
 | I know a lot of people who sell online that do not use
 | analytics at all, except for coarse-grained ones like
 | number of subscriptions, number of items sold, how many
 | emails they receive about something they published, or
 | messages from social platforms, etc.
|
 | that's been true in my experience through almost 30 years
 | of helping people publish creative content online and
 | offline (books, records, etc)
| HuwFulcher wrote:
| > people will come because they like what you do, not
| because you do the things they like (sounds like the same
| thing, but it isn't)
|
| This isn't true for all channels. The current state of
| search requires you to adapt your content to what people
| are looking for. Social channels are as you've said.
|
| It doesn't matter how you want to slice it. Understanding
| how many people are coming to your website, from where
| and what they're looking at is valuable.
|
| I agree the "end metric" is whatever actually drives the
| revenue. But number of people coming to a website can
| help tune that.
| cpill wrote:
 | Emails received or messages on social media are just
 | another analytic, filling the same need as knowing page
 | hits. Somehow these people are analytics junkies all the
 | same, just not mainlining page hits. You're unconvincing
 | in the argument that "analytics are not needed".
| marban wrote:
 | Plausible still works if you reverse-proxy the script and the
 | event URL through your own /randompath.
| chrismorgan wrote:
| This approach is no harder to block than the JavaScript
| approaches: you're just blocking requests to certain URL
| patterns.
| nannal wrote:
| That approach would work until analytics gets mixed in with
| actual styles and then you're trying to use a website without
| CSS.
| chrismorgan wrote:
 | You're blocking the _image_, not the CSS. Here's a rule to
 | catch it at present: ||bearblog.dev/hit/
 |
 | This is the shortest it can be written with certainty of no
 | false positives, but you can do things like making the URL
 | pattern more specific (e.g. _/hit/*/_) or adding the
 | _image_ option (append _$image_) or just removing the
 | ||bearblog.dev domain filter if it spread to other domains
 | as well (there probably aren't enough false positives to
 | worry about).
|
 | I find it also worth noting that _all_ of these techniques
 | are pretty easily circumventable by technical means, by
 | blending content and tracking/ads/whatever. In case of
| all-out war, content blockers _will_ lose. It's just that
| no one has seen fit to escalate that far (and in some cases
| there are legal limitations, potentially on both sides of
| the fight).
| macNchz wrote:
| > In case of all-out war, content blockers will lose.
| It's just that no one has seen fit to escalate that far
| (and in some cases there are legal limitations,
| potentially on both sides of the fight).
|
| The Chrome Manifest v3 and Web Environment Integrity
| proposals are arguably some of the clearest steps in that
| direction, a long term strategy being slow-played to
| limit pushback.
| ben_w wrote:
| The bit of the web that feels to me like a toxic wasteland is
| all the adverts; the tracking is a much more subtle issue,
| where the damage is the long-term potential of having a digital
| twin that can be experimented on to find how best to manipulate
| me.
|
| I'm not sure how many people actually fear that. Might get
| responses from "yes, and it's creepy" to "don't be daft that's
| just SciFi".
| input_sh wrote:
| Nothing's gonna block your webserver's access.log fed into an
| analytics service.
|
 | If anything, you're gonna get numbers that are inflated,
 | because it's essentially impossible to dismiss all of the bot
 | traffic just by looking at user agents.
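The access-log approach above can be sketched in a few lines. This is a minimal illustration of my own (not from the thread): it assumes nginx-style "combined" log lines with the user agent in the final quoted field, and filters only self-identified bots, which is exactly why the resulting counts stay inflated.

```python
import re
from collections import Counter

# Matches the request, status, referrer, and user-agent fields of a
# "combined"-format access log line.
LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider", "curl", "python-requests")

def page_hits(lines):
    hits = Counter()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        ua = m.group("ua").lower()
        if any(hint in ua for hint in BOT_HINTS):
            continue  # drops self-identified bots; spoofed ones slip through
        hits[m.group("path")] += 1
    return hits

sample = [
    '1.2.3.4 - - [01/Nov/2023:08:08:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Nov/2023:08:08:01 +0000] "GET /post HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(page_hits(sample))  # Counter({'/post': 1})
```

A real pipeline would read the log incrementally and keep a far larger bot list; this only shows the shape of the filtering.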
| victorbjorklund wrote:
| This does make sense! Might try it for my own analytics solution.
| Anyone can think of a downside of this vs js?
| berkes wrote:
| I can think of many "downsides" but whether those matter or are
| actually upsides really depends on your use-case and
| perspective.
|
 | * You cannot (easily) track interaction events (esp. relevant
 | for SPAs, but also things like "user highlighted X" or "user
 | typed Y, then backspaced, then typed Z")
|
| * You cannot track timings between events (e.g. how long a user
| is on the page)
|
| * You cannot track data such as screen-sizes, agents, etc.
|
| * You cannot track errors and exceptions.
| Wouter33 wrote:
 | Nice implementation! Just a heads-up: hashing the IP like that
 | is still considered tracking under GDPR and requires a privacy
 | banner in the EU.
| thih9 wrote:
| Can you explain why or link a source? I'd like to learn the
| details.
| fizzbuzz-rs wrote:
| Likely because the hash of an IP can easily be reversed as
| there are only ~2^32 IPv4 addresses.
| openplatypus wrote:
| It is not just that. Having user IP and such a hashing
| approach you can re-identify past sessions.
| thih9 wrote:
| What if my hashing function has high likelihood of
| collisions?
| firtoz wrote:
| Then you cannot trust the analytics
| thih9 wrote:
| Do you trust analytics that doesn't use JS? Or relies on
| mobile users to scroll the page before counting a hit?
|
| It's all a heuristic and even with high collision
| hashing, analytics would provide some additional insight.
| rjmunro wrote:
| You can estimate the actual numbers based on the
| collision rate.
|
| Analytics is not about absolute accuracy, it's about
| measuring differences; things like which pages are most
| popular, did traffic grow when you ran a PR campaign etc.
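The estimate-from-collisions idea above can be made concrete. This is my own construction, not from the thread: assuming visitor IDs hash uniformly into k buckets, n visitors occupy on average k*(1 - (1 - 1/k)^n) distinct buckets, and inverting that recovers an estimate of n from the observed distinct count.

```python
import math

def estimate_visitors(distinct_buckets: int, k: int) -> float:
    """Estimate true visitor count n from distinct hash buckets seen,
    assuming uniform hashing into k buckets."""
    if distinct_buckets >= k:
        raise ValueError("all buckets occupied; count is only a lower bound")
    return math.log(1 - distinct_buckets / k) / math.log(1 - 1 / k)

# With k = 1021 buckets and 400 distinct ones observed, the estimate
# comes out noticeably above 400 -- the collisions are accounted for.
print(round(estimate_visitors(400, 1021)))
```

The same reasoning underlies sketches like HyperLogLog, which the thread mentions below; this closed-form version only works when k is small enough to track exact bucket occupancy.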
| dsies wrote:
| https://gdpr-info.eu/art-4-gdpr/ paragraph 1:
|
| > 'personal data' means any information relating to an
| identified or identifiable natural person ('data subject');
| an identifiable natural person is one who can be identified,
| directly or indirectly, in particular by reference to an
| identifier such as a name, an identification number, location
| data, an online identifier or to one or more factors specific
| to the physical, physiological, genetic, mental, economic,
| cultural or social identity of that natural person;
| thih9 wrote:
| This does not reference hashing, which can be an
| irreversible and destructive operation. As such, it can
| remove the "relating" part - i.e. you'll no longer be able
| to use the information to relate it to an identifiable
| natural person.
|
 | In this context, if I define a hashing function that e.g.
 | sums all IP address octets, what then?
| jvdvegt wrote:
 | A hash (whether MD5 or some SHA) of an IPv4 address is easily
 | reversed.
 |
 | Summing octets is non-reversible, so it seems like a good
 | 'hash' to me (but note: you'll get a lot of collisions).
 | And of course, IANAL.
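The octet-sum "hash" discussed above is easy to demonstrate. A tiny sketch (the addresses are illustrative): the output space is only 0..1020, so distinct IPs collide constantly, which is what makes it non-reversible and also nearly useless for counting.

```python
def octet_sum(ip: str) -> int:
    # Toy "hash": add the four IPv4 octets together.
    return sum(int(part) for part in ip.split("."))

a = octet_sum("192.168.1.10")   # 192 + 168 + 1 + 10 = 371
b = octet_sum("10.168.1.192")   # same octets reordered, same sum
print(a == b)  # True -- two distinct addresses, identical "hash"
```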
| dsies wrote:
| I was answering your request for a source.
|
| The linked article talks about _identification numbers_
| that can be used to link a person. I am not a lawyer but
| the article specifically refers to one person.
|
| By that logic, if the hash you generate cannot be linked
| to exactly one, specific person/request - you're in the
| clear. I think ;)
| openplatypus wrote:
| Correct. This is a flawed hashing implementation as it allows
| for re-identification.
|
| Having that IP and user timezone you can generate the same hash
| and trace back the user. This is hardly anonymous hashing.
|
 | Wide Angle Analytics adds a daily, transient salt to each IP
 | hash, which is never logged, thus generating a truly anonymous
 | hash that prevents re-identification.
| thih9 wrote:
| What if my hashing function is really destructive and has
| high likelihood of collisions?
| hk__2 wrote:
| > What if my hashing function is really destructive and has
| high likelihood of collisions?
|
| If it's so destructive that it's impossible to track users,
| it's useless for you. If not, you need a privacy banner.
| thih9 wrote:
| A high collision hash would be useful for me on my low
| traffic page and I'd enjoy not having to display a cookie
| banner.
|
| Also: https://news.ycombinator.com/item?id=38096235
| victorbjorklund wrote:
 | Probably should be "salted hashes might be considered PII". It
 | has not been tried by the EU courts and the law is not 100%
 | clear. It might be. It might not be.
| e38383 wrote:
| If the data gets stored in this way (hash of IP[0]) for a long
| time I'm with you. But if you only store the data for 24 hours
| it might still count as temporary storage and should be
| "anonymized" enough.
|
| IMO (and I'm not a lawyer): if you store ip+site for 24 hours
| and after that only store "region" (maybe country or state) and
| site this should be GDPR compliant.
|
| [0] it should use sha256 or similar and not md5
| donohoe wrote:
| Actually no. It's very likely this is fine. Context is
| important.
|
 | Not a lawyer, but I discussed this previously with lawyers
 | when building a GDPR framework a while back.
| sleepyhead wrote:
| Context is irrelevant. What is relevant is whether a value,
| for example a hash, can be identified to a specific person in
| some way.
| donohoe wrote:
| I'm really not going to argue here.
|
| I've been told this directly by lawyers who specialize in
| GDPR and CCPA etc. I will take their word over yours.
|
| If you are a lawyer with direct expertise in this area then
| I'm willing to listen.
| mcny wrote:
| On the topic of analytics, how do you store them?
|
| Let's say I have an e-commerce website, with products I want to
| sell. In addition to analytics, I decide to log a select few
| actions myself such as visits to product detail page while logged
| in. So I want to store things like user id, product id,
| timestamp, etc.
|
| How do I actually store this? My naive approach is to stick it in
| a table. The DBA yelled at me and asked how long I need data. I
| said at least a month. They said ok and I think they moved all
| older data to a different table (set up a job for it?)
|
| How do real people store these logs? How long do you keep them?
| ludwigvan wrote:
| ClickHouse
| jon-wood wrote:
| Unless you're at huge volume you can totally do this in a
| Postgres table. Even if you are you can partition that table by
| date (or whatever other attributes make sense) so that you
| don't have to deal with massive indexes.
|
| I once did this, and we didn't need to even think about
| partitioning until we hit a billion rows or so. (But partition
| sooner than that, it wasn't a pleasant experience)
| n_e wrote:
| An analytics database is better (clickhouse, bigquery...).
|
| They can do aggregations much faster and can deal with
| sparse/many columns (the "paid" event has an "amount"
| attribute, the "page_view" event has an "url" attribute...)
| ordersofmag wrote:
 | We've got 13 years' worth of data stored in MySQL (5 million
 | visitors/year). It's a pain to query there, so we keep a copy
 | in ClickHouse as well (which is a joy to query).
| mcny wrote:
| I only track visits to a product detail page so far.
| Basically, some basic metadata about the user (logged in
| only), some metadata about the product, and basic "auditing"
| columns -- created by, created date, modified by, modified
 | date (although why I have modified by and modified date makes
 | no sense to me; I don't anticipate ever editing these,
| they're only there for "standardization". I don't like it but
| I can only fight so many battles at a time).
|
| I am approaching 1.5 million rows in under two months.
| Thankfully, my DBA is kind, generous, and infinitely patient.
|
| Clickhouse looks like a good approach. I'll have to look into
| that.
|
| > select count(*) from trackproductview;
|
| > 1498745
|
| > select top 1 createddate from TrackProductView order by
| createddate asc;
|
| > 2023-08-18 11:31:04.000
|
 | What is the maximum number of rows in a ClickHouse table? Is
 | there such a limit?
| victorbjorklund wrote:
 | I use Postgres with TimescaleDB. Works unless your e-commerce
 | site is amazon.com. The great thing with TimescaleDB is that
 | it takes care of creating materialized views with the
 | aggregates you care about (like product views per hour etc.)
 | and you can even choose to "throw away" the events themselves
 | and just keep the aggregations (to avoid getting a huge db if
 | you have a lot of events).
| p4bl0 wrote:
| The :hover pseudo-class could be applied and unapplied multiple
| times for a single page load. This can certainly be mitigated
 | using cache-related HTTP headers, but then if the same page is
| visited by the same person a second time coming from the same
| referrer, the analytics endpoint won't be loaded.
|
| But maybe I'm not aware that browsers guarantee that "images"
| loaded using url() in CSS will be (re)loaded exactly once per
| page?
| kevincox wrote:
| I'm not sure about `url()` in CSS but `<img>` tags are
| guaranteed to only be loaded once per URL per page. I would
| assume that `url()` works the same.
|
 | This bit me when I tried to make a page that reloads an image
 | as a form of monitoring. However, the URL interestingly
 | includes the fragment (after the #) even though it isn't sent
 | to the server. So I managed to work around this by appending
 | #1, #2, #3... to the image URL.
|
| https://gitlab.com/kevincox/image-monitor/-/blob/e916fcf2f9a...
| alabhyajindal wrote:
| Wow, I didn't know you could trigger a URL endpoint with CSS!
| dontlaugh wrote:
| Why not just get this info from the HTTP server?
| victorbjorklund wrote:
| Hard if you run serverless
| dontlaugh wrote:
| There's still a server somewhere and it can log URLs and IPs.
| tmikaeld wrote:
| Not if it's static generated html/css.
|
| And the real benefit of this trick is separating users from
| bots.
| berkes wrote:
| And even if there are many servers (a CDN or distributed
| caching) you can collect and merge these.
| victorbjorklund wrote:
| Tell me how to collect the logs for static sites on
| Cloudflare Pages (not functions. The Pages sites)
| berkes wrote:
| Cloudflare Pages are running on servers. These servers
| (can, quite certainly will) have logs.
|
| That you cannot access the logs because you don't own the
| servers doesn't mean there aren't any servers that have
| logs.
| victorbjorklund wrote:
| Yes, no one has argued that Cloudflare Pages arent using
| servers. But it is "hard" to track using logs if you are
| a cloudflare customers. Guess only way would be to hack
| into cloudflare itself and access my logs that way. But
| that is "hard" (because yes theoretically it is possible
| i know). And not a realistic alternative.
| victorbjorklund wrote:
| Of course. But you can't access it. You can't get logs for
| static sites on Cloudflare Pages.
| Spivak wrote:
| Huh? You can get logs just fine from your ALB's and API
| Gateways.
| hk__2 wrote:
| > Why not just get this info from the HTTP server?
|
| This is explained in the blog post:
|
| > There's always the option of just parsing server logs, which
| gives a rough indication of the kinds of traffic accessing the
| server. Unfortunately all server traffic is generally seen as
| equal. Technically bots "should" have a user-agent that
| identifies them as a bot, but few identify that since they're
| trying to scrape information as a "person" using a browser. In
| essence, just using server logs for analytics gives a skewed
| perspective to traffic since a lot of it are search-engine
| crawlers and scrapers (and now GPT-based parsers).
| dontlaugh wrote:
| Don't bots now load an entire browser including simulated
| user interaction, to the point where there's no difference?
| janosdebugs wrote:
 | Not for the most part; it's still very expensive. Even if
 | they do, they don't simulate mouse movement.
| spiderfarmer wrote:
| All bots
| jackjeff wrote:
| The whole anonymization of IP addresses by just hashing the date
| and IP is just security theater.
|
 | Cryptographic hashes are designed to be fast. You can do 6
 | billion MD5 hashes a second on a MacBook (M1 Pro) via hashcat,
 | and there are only ~4 billion IPv4 addresses. So you can brute
 | force the entire range and find the IP address, basically
 | reversing the hash.
|
| And that's true even if they used something secure like SHA-256
| instead of broken MD5
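The brute-force reversal described above can be sketched directly. This is a minimal illustration (the IPs, date, and /24 range are made up for demonstration); the real attack is the same four nested 0..255 loops over the full address space, which hashcat-class tooling chews through in minutes.

```python
import hashlib

def reverse_ip_hash(target: str, date: str) -> "str | None":
    """Brute-force a date-prefixed SHA-256 IP hash over a /24 range."""
    for last in range(256):  # full attack: four nested 0..255 loops
        ip = f"203.0.113.{last}"
        candidate = hashlib.sha256((date + ip).encode()).hexdigest()
        if candidate == target:
            return ip
    return None

# Simulate a "leaked" analytics hash, then recover the IP from it.
leaked = hashlib.sha256(("2023-11-01" + "203.0.113.7").encode()).hexdigest()
print(reverse_ip_hash(leaked, "2023-11-01"))  # 203.0.113.7
```

Note that using SHA-256 instead of MD5 changes nothing here: the input space, not the hash, is what's small.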
| berkes wrote:
 | Maybe they use a secret or rotating salt? The example code
 | doesn't, so I'm afraid you are right. But with one addition it
 | can be made reasonably secure.
 |
 | I am afraid, however, that this security theater is enough to
 | pass many laws and regulations on PII.
| ktta wrote:
| Not if they use a password hash like Argon2 or scrypt
| ale42 wrote:
| But that's very heavy to compute at scale...
| isodev wrote:
| True, but also it's a blogging platform - does it really
| have that kind of scale to be concerned with?
| ale42 wrote:
| Probably not, I was mainly thinking if that kind of
| solution was to be adopted at a scale like Google
| Analytics.
| __alexs wrote:
| Even then it is theatre because if you know the IP address
| you want to check it's trivial to see if there's a match.
| chrismorgan wrote:
| And _this_ is why such a hash will still be considered
| personal data under legislation like GDPR.
| TekMol wrote:
| That is easy to fix though. Just use a temporary salt.
|
 | Pseudo code:
 |
 |     if salt.day < today():
 |         salt = {day: today(), salt: random()}
 |     ip_hash = sha256(ip + salt.salt)
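A runnable rendering of that pseudo code (my sketch; I use a != day check rather than <, and the IPs are illustrative): the salt is regenerated each day and the old one discarded, so yesterday's hashes can no longer be linked to an IP even by the operator.

```python
import hashlib
import secrets
from datetime import date

# Module-level state holding the current day's salt.
_salt = {"day": None, "value": ""}

def ip_hash(ip: str) -> str:
    """Hash an IP with a salt that rotates (and is forgotten) daily."""
    today = date.today()
    if _salt["day"] != today:
        _salt["day"] = today
        _salt["value"] = secrets.token_hex(16)  # old salt is dropped here
    return hashlib.sha256((ip + _salt["value"]).encode()).hexdigest()

# Same IP hashes identically within a day, so per-day dedup still works.
print(ip_hash("198.51.100.4") == ip_hash("198.51.100.4"))  # True
```

On multiple servers the salt would need to live in shared storage, which is the complication kevincox raises further down the thread.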
| __alexs wrote:
| Assuming you don't store the salts, this produces a value
| that is useless for anything but counting something like DAU.
| Which you could equally just do by counting them all and
| deleting all the data at the end of the day, or using a
| cardinality estimator like HLL.
| TekMol wrote:
| DAU in regards to a given page.
|
| Have you read the article? That is what the author's goal
| seems to be.
|
 | He wants to prevent multiple requests to the same page by
 | the same IP from being counted multiple times.
| tatersolid wrote:
| Is that more efficiently done with an appropriate caching
| header on the page as it is served?
|
| Cache-Control: private, max-age=86400
|
| This prevents repeat requests for normal browsers from
| hitting the server.
| dvdkon wrote:
| That same uselessness for long-term identification of users
| is what makes this approach compliant with laws regulating
| use of PII, since what you have after a small time window
| isn't actually PII (unless correlated with another dataset,
| but that's always the case).
| SamBam wrote:
| That's precisely all that OP is storing in the original
| article.
|
| They're just getting a list of hashes per day, and
| associated client info. They have no idea if the same user
| visit them on multiple days, because the hashes will be
| different.
| kevincox wrote:
 | Of course if you have multiple servers or may reboot, you need
 | to store the salt somewhere. If you are going to bother
 | storing the salt and cleaning it up after the day is over, it
 | may be just as easy to clean the hashes at the end of the day
 | (and keep the total count), which is equivalent. This should
 | work unless you want to keep individual counts around for
 | something like seeing the distribution of requests per IP or
 | similar. But in that case you could just replace the hashes
 | with random values at the end of the day to fully anonymize
 | them, since you no longer need to increment them.
| Etheryte wrote:
| For context, this problem also came up in a discussion about
| Storybook doing something similar in their telemetry [0] and
| with zero optimization it takes around two hours to calculate
| the salted hashes for every IPv4 on my home laptop.
|
| [0] https://news.ycombinator.com/item?id=37596757
| hnreport wrote:
 | This is the type of comment that reinforces not even trying
 | to learn or outsource security.
|
| You'll never know enough.
| petesergeant wrote:
| I think the opposite? I'm a dev with a bit of an interest in
| security, and this immediately jumped out at me from the
| story; knowing enough security to discard bad ideas is
| useful.
| WhyNotHugo wrote:
| Aside from it being technically trivial to get an IP back from
| its hash, the EU data protection agency made it very clear that
| "hashing PII does not count as anonymising PII".
|
| Even if you hash somebody's full name, you can later answer the
| question "does this hash match the this specific full name".
| Being able to answer this question implies that the
| anonymisation process is reversible.
| kevincox wrote:
| I think the word "reversible" here is being stretched a bit.
| There is a significant difference between being able to list
| every name that has used your service and being able to check
| if a particular name has used your service. (Of course these
| can be effectively the same in cases where you can list all
| possible inputs such as hashed IPv4 addresses.)
|
| That doesn't mean that hashing is enough for pure anonymity,
| but used properly hashes are definitely a step above
| something fully reversible (like encryption with a common
| key).
| SamBam wrote:
| I'm not sure the distinction is meaningful. If the police
| demand your logs to find out whether a certain IP address
| visited in the past year, they'd be able to find that out
| pretty quickly given what's stored. So how is privacy being
| respected?
| pluto_modadic wrote:
| if it fulfills the same function, does it matter?
|
| if you have an ad ID for a person, say example@example.com,
| and you want to deduplicate it,
|
| if you provide them with the names, the company that buys
| the data can still "blend" it with data they know, if they
| know how the hash was generated... and effectively get back
| that person's email, or IP, or phone number, or at least
| get a good hunch that the closest match is such and such
| person with uncanny certainty
|
| de-anonymization of big data is trivial in basically every
| case that was written by an advertising company, instead of
| written by a truly privacy focused business.
|
| if it were really a non-reversible hash, it would be evenly
| distributed, not predictable, and basically useless for
| advertising, because it wouldn't preserve locality. It
| needs to allow for finding duplicates... so the person you
| give the hash to, can abuse that fact.
| bayindirh wrote:
| We're members of some EU projects, and they share a common
| help desk. To serve as a knowledge base, the tickets are
| kept, but all PII is anonymized after 2 years AFAIK.
|
| What they do is pretty simple. They overwrite the data fields
| with the text "<Anonymized>". No hashes, no identifiers,
| nothing. Everything is gone. Plain and simple.
| spookie wrote:
 | KISS. That's the best way to go about it.
| jefftk wrote:
| It depends. For example, if each day you generate a random
| nonce and use it to salt that day's PII (and don't store the
| nonce) then you cannot later determine (a) did person A visit
| on day N or (b) is visitor X on day N the same as visitor Y
| on day N+1. But you can still determine how many distinct
 | visitors you had on day N, and answer questions about
 | within-day usage patterns.
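The daily-nonce scheme described above is easy to demonstrate. A sketch under the stated assumptions (the IPs are made up): within a day you can still count distinct visitors, but the same visitor produces unrelated hashes on different days because each day's nonce is independent and never stored.

```python
import hashlib
import secrets

def day_hasher():
    """Return a hashing function bound to a fresh per-day nonce."""
    nonce = secrets.token_bytes(16)  # discarded when the day rolls over
    return lambda ip: hashlib.sha256(nonce + ip.encode()).hexdigest()

day_n, day_n1 = day_hasher(), day_hasher()  # two consecutive days
visits = ["1.2.3.4", "1.2.3.4", "5.6.7.8"]

print(len({day_n(ip) for ip in visits}))      # 2 distinct visitors on day N
print(day_n("1.2.3.4") == day_n1("1.2.3.4"))  # False: days are unlinkable
```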
| ilrwbwrkhv wrote:
| Yes but if the business is not in the EU they don't need to
| care one bit about GDPR or EU.
| troupo wrote:
| If they target residents of the EU, they must care.
|
 | _Edit:_ this is a different Bear (the note-taking app), but
 | it also claims to be GDPR compliant:
 | https://bear.app/faq/bear-is-gdpr-compliant/
| TylerE wrote:
 | Is an IPv4 address really classed as PII? Sounds a bit
 | insane.
| beardog wrote:
| It can be used to track you across the web, get a general
| geographic area, and if you have the right connections one
| can get the ISP subscriber address. Given that PII is
| anything that can be used to identify a person, I think it
| qualifies despite it being difficult for a rando to tie an
| IP to a person.
|
| Additionally in the case of ipv6 it can be tied to a
| specific device more often. One cannot rely on ipv6 privacy
| extensions to sufficiently help there.
| rtsil wrote:
| That's compounded by the increasing use of static IPs, or
| at least extremely long-lasting dynamic IPs in some ISPs.
| alkonaut wrote:
| Hashes should be salted. If you salt, you are fine, if you
| don't you aren't.
|
| Whether the salt can be kept indefinitely, or is rotated
| regularly etc is just an implementation detail, but the key
| with salting hashes for analytics is that the salt never leaves
| the client.
|
 | As explained in the article, there seems to be no salt (or
 | rather, the current date seems to be used as a salt, but
 | that's not a random salt and can easily be guessed by anyone
 | who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").
|
 | It's pretty easy to reason about these things if you look at
 | them from the perspective of an attacker. How would you figure
 | out anything about a specific person given the data? If you
 | can't, then the data is probably OK to store.
| piaste wrote:
| > Hashes should be salted. If you salt, you are fine, if you
| don't you aren't.
|
| > Whether the salt can be kept indefinitely, or is rotated
| regularly etc is just an implementation detail, but the key
| with salting hashes for analytics is that the salt never
| leaves the client.
|
| I think I'm missing something.
|
| If the salt is known to the server, then it's useless for
| this scenario. Because given a known salt, you can generate
| the hashes for every IP address + that salt very quickly.
| (Salting passwords works because the space for passwords is
| big, so rainbow tables are expensive to generate.)
|
| If the salt is unknown to the server, i.e. generated by the
| client and 'never leaves the client'... then why bother with
| hashes? Just have the client generate a UUID directly instead
| of a salt.
| rkangel wrote:
| Without a salt, you can generate the hash for every IP
| address _once_ , and then permanantly have a hash->IP
| lookup (effectively a Rainbow table). If you have a salt,
| then you need to do it for each database entry, which does
| make it computationally more expensive.
| tptacek wrote:
| People are obsessed with this attack from the 1970s, but
| in practice password cracking rigs just brute force the
| hashes, and that has been the practice since my career
| started in the 1990s and people used `crack`, into the
| 2000s and `jtr`, and today with `hashcat` or whatever it
| is the cool kids use now. "Rainbow tables" don't matter.
| If you're discussing the expense of attacking your scheme
| with or without rainbow tables, you've already lost.
| jonahx wrote:
| > If you're discussing the expense of attacking your
| scheme with or without rainbow tables, you've already
| lost.
|
| Can you elaborate on this or link to some info
| elaborating what you mean? I'd like to learn about it.
| alkonaut wrote:
| > > _the salt never leaves the client_
|
| > I think I'm missing something.
|
| ...
|
| > If the salt is known to the server,
|
| That's what you were missing yes
| SamBam wrote:
| Did you miss the second half where GP asked why the
| client doesn't just send up a UUID, instead of generating
| their own salt and hash?
| arp242 wrote:
| > why bother with hashes? Just have the client generate a
| UUID directly instead of a salt.
|
| The reason for all this bonanza is that the ePrivacy
| directive requires a cookie banner, making exceptions
| only for data that is _"strictly necessary in order to
| provide a [..] service explicitly requested by the
| subscriber or user"_.
|
| In the end, you only have "pinky promise" that someone
| isn't doing more processing on the server end, so in
| reality it doesn't matter much especially if the cookie
| lifetime is short (hours or even minutes). Actually, a
| cookie or other (short-lived!) client-side ID is probably
| better for everyone if it wasn't for the cookie banners.
| TylerE wrote:
| ALL of the faff around cookies is the biggest security
| theater of the past 40 years. I remember hearing the
| fear-mongering in the very early 2000's about cookies in
| the mainstream media - it was self-evidently a farce
| then, and a farce now.
| throwaway290 wrote:
| Isn't the data in this case part of the "strictly
| necessary" data (the IP address)? That's all that gets
| collected by that magic CSS + server, no?
| arp242 wrote:
| ePrivacy directive only applies to information stored on
| the client side (such as cookies).
| darken wrote:
| Salts are generally stored with the hash, and are only really
| intended to prevent "rainbow table" attacks. (I.e. use of
| precomputed hash tables.) Though a predictable and matching
| salt per entry does mean you can attack all the hashes for a
| timestamp per hash attempt.
|
| That being said, the previous responder's point still stands
| that you can brute force the salted IPs at about a second per
| IP with the colocated salt. Using multiple hash iterations
| (e.g. 1000x; i.e. "stretching") is how you'd meaningfully
| increase computational complexity, but still not in a way
| that makes use of the general "can't be practically reversed"
| hash guarantees.
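The stretching idea mentioned above is what PBKDF2 does; a minimal sketch (parameters are illustrative):

```python
import hashlib
import os

def stretched_hash(ip: str, salt: bytes, iterations: int = 100_000) -> bytes:
    # PBKDF2 applies the hash `iterations` times, so every brute-force
    # guess costs that many hash computations instead of one.
    return hashlib.pbkdf2_hmac("sha256", ip.encode(), salt, iterations)

salt = os.urandom(16)
digest = stretched_hash("203.0.113.42", salt)
print(digest.hex())
```

This raises the cost of sweeping the 2**32 IPv4 space proportionally, but as noted it still does not restore a genuine "can't be reversed" guarantee for such a small input space.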
| alkonaut wrote:
| As I said the key for hashing PII for telemetry is that the
| client does the hashing on the client side and the client
| never transmits the salt. This isn't a login system or
| similar. There is no "validation" of the hash. All the hash
| is is a unique marker for a user that doesn't contain any
| PII.
| SamBam wrote:
| How does the client generate the same salt every time
| they visit the page, without using cookies?
| donkeyd wrote:
| Use localstorage!
|
| Kidding, of course. I don't think there's a way to track
| users across sessions, without storing something and
| requiring a 'cookie notification'. Which is kind of the
| point of all these laws.
| alkonaut wrote:
| Storing a salt with 24h expiry would be the same thing as
| the solution in the article. It would be better from a
| privacy perspective because the IP would then not be
| transmitted in a reversible way.
|
| If I hadn't asked for permission to send hash(ip + date)
| then I'd sure not ask permission if I instead stored a
| random salt for each new 24h and sent the hash(ip +
| todays_salt).
|
| This is effectively a cookie and it's not strictly
| necessary if it's stats only. So I think on the server
| side I'd just invent some reason why it's necessary for
| the product itself too, and make the telemetry just an
| added bonus.
| alkonaut wrote:
| If you can use JS it's easy. For example
| localStorage.setItem("salt", Math.random()). Without JS
| it's hard I think. I don't know why this author doesn't
| want to use JS - perhaps out of respect for his visitors -
| but then
| I think it's worse to send PII over the wire (And an IP
| hashed in the way he describes is PII).
| SamBam wrote:
| EU's consent requirements don't distinguish between
| cookies and localStorage, as far as I understand. And a
| salt that is only used for analytics would not count as
| "strictly necessary" data, so I think you'd have to put
| up a consent popup. Which is precisely the kind of thing
| a solution like this is trying to avoid.
| alkonaut wrote:
| Indeed, but as I wrote in another reply: it doesn't
| matter. It's even worse to send PII over the wire. Using
| the date as the salt (as he does) just means it's
| reversible PII - a.k.a. PII.
|
| Presumably these are stored on the server side to
| identify returning visitors - so instead of storing a
| random number for 24 hours on the client, you now have
| PII stored on the server. So basically there is no way to
| do this that doesn't require consent.
|
| The only way to do it is to make the information required
| for some necessary function, and then let the analytics
| piggyback on it
| SamBam wrote:
| I think I agree with you there. But again, the idea of a
| "salt" is then overcomplicating things. It's exactly the
| same to have the client generate a UUID and just send
| that up, no salting or hashing required.
| alkonaut wrote:
| Yup for only identifying a system that's easier. If this
| is all the telemetry is ever planned to do then that's
| all you need. The benefit of having a local hash function
| is when you want to transmit multiple ids for data. E.g
| in a word processor you might transmit
| hash(salt+username) on start and hash(salt+filename) when
| opening a document and so on. That way you can send
| identifiers for things that are sensitive or private like
| file names in a standardized way and you don't need to
| keep track of N generated guids for N use cases.
|
| On the telemetry server you get e.g
|
| Function "print" used by user 123 document 345. Using
| that you can do things like answering how many times an
| average document is printed or how many times per year an
| average user uses the print function.
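The scheme described above can be sketched as follows (a hedged illustration: the class, names, and values are hypothetical, and HMAC is used here as the standard keyed construction for "hash(salt + value)" rather than naive concatenation):

```python
import hashlib
import hmac
import os

class TelemetryIds:
    """Client side: one secret key yields stable, opaque IDs for any value."""

    def __init__(self) -> None:
        self.key = os.urandom(32)  # generated locally, never transmitted

    def id_for(self, value: str) -> str:
        # Same value always maps to the same ID on this client, but the
        # server can't recover the value without the client-held key.
        return hmac.new(self.key, value.encode(), hashlib.sha256).hexdigest()[:16]

client = TelemetryIds()
user_id = client.id_for("alice")        # hypothetical username
doc_id = client.id_for("budget.docx")   # hypothetical file name
print(f"event=print user={user_id} document={doc_id}")
```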
| robertlagrant wrote:
| IP address is "non-sensitive PII"[0]. It's pretty hard to
| identify someone from an IP address. Hashing and then
| deleting every day is very reasonable.
|
| [0] https://www.ibm.com/topics/pii
| sysop073 wrote:
| What's the point in hashing the IP + salt then, just let
| each client generate a random nonce and use that as the
| key?
| tptacek wrote:
| Salting a standard cryptographic hash (like SHA2) doesn't do
| anything meaningful to slow a brute force attack. This
| problem is the reason we have password KDFs like scrypt.
|
| (I don't care about this Bear analytics thing at all, and
| just clicked the comment thread to see if it was the Bear I
| thought it was; I do care about people's misconceptions about
| hashing.)
| alkonaut wrote:
| What do you mean by "brute force" in the context of
| reversing PII that has been obscured by a one way hash? My
| IP number passed through SHA1 with a salt (a salt I
| generated and stored safely on my end) is
| 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5. Since this is all
| that would be sent over the wire for analytics, this is the
| only information an attacker will have available.
|
| The only thing you can brute force from that is _some_ IP
| and _some_ salt such that SHA1(IP+Salt) =
| 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5. But you'll find
| millions of such IPs. Perhaps all possible IPs will work
| with _some_ salt, and give that hash. It's not revealing
| my IP even if you manage to find a match?
| infinityio wrote:
| If you also explicitly mentioned the salt used (as bear
| appear to have done?), this just becomes a matter of
| testing 4 billion options and seeing which matches
| alkonaut wrote:
| I think it's just unsalted in the example code. Or you
| could argue that the date is kind of used as a salt. But
| the point was that salting + hashing is fine for PII in
| telemetry if and only if the salt stays on the client. It
| might be difficult to do without JS though.
| michaelmior wrote:
| > Salting a standard cryptographic hash (like SHA2) doesn't
| do anything meaningful to slow a brute force attack.
|
| Sure, but it does at least prevent the use of rainbow
| tables. Arguably not relevant in this scenario, but it
| doesn't mean that salting does nothing. Rainbow tables can
| speed up attacks by many orders of magnitude. Salting may
| not prevent each individual password from being brute
| forced, but for most attackers, it probably will prevent
| your entire database from being compromised due to the
| amount of computation required.
| tptacek wrote:
| Rainbow tables don't matter. If you're discussing the
| strength of your scheme with or without rainbow tables,
| you have already lost.
|
| https://news.ycombinator.com/item?id=38098188
| robertlagrant wrote:
| That's just a link where you claim the same thing. What's
| your actual rationale? Do you think salting is pointless?
| dspillett wrote:
| _> Cryptographic hashes are designed to be fast._
|
| Not _really_. They are designed to be fast _enough_ and even
| then only as a secondary priority.
|
| _> You can do 6 billion ... hashes /second on [commodity
| hardware] ... there's only 4 billion ipv4 addresses. So you can
| brute force the entire range_
|
| This is harder if you use a salt not known to the attacker.
| Per-entry salts can help even more, though that isn't relevant
| to IPv4 addresses in a web/app analytics context because after
| the attempt at anonymisation you want to still be able to tell
| that two addresses were the same.
|
| _> And that's true even if they used something secure like
| SHA-256 instead of broken MD5_
|
| Relying purely on the computation complexity of one hash
| operation, even one not yet broken, is not safe given how easy
| temporary access to mass CPU/GPU power is these days. This can
| be mitigated somewhat by running many rounds of the hash with a
| non-global salt - which is what good key derivation processes
| do for instance. Of course you need to increase the number of
| rounds over time to keep up with the rate of growth in
| processing availability, to keep undoing your hash more hassle
| than it is worth.
|
| But yeah, a single unsalted hash (or a hash with a salt the
| attacker knows) on IP address is not going to stop anyone who
| wants to work out what that address is.
| krsdcbl wrote:
| Don't forget that md5 is comparatively slow & there are way
| faster options for hashing nowadays:
|
| https://jolynch.github.io/posts/use_fast_data_algorithms/
| SAI_Peregrinus wrote:
| A "salt not known to the attacker" is a "key" to a keyed hash
| function or message authentication code. A salt isn't a
| secret, though it's not usually published openly.
| marcosdumay wrote:
| > only as a secondary priority
|
| That's not a reasonable way to say it. It's literally the
| second priority, and heavily evaluated when deciding what
| algorithms to take.
|
| > This is harder if you use a salt not known to the attacker.
|
| The "attacker" here is the server owner. So if you use a
| random salt and throw it away, you are good; anything
| resembling the way people use salts in practice is not fine.
| HermanMartinus wrote:
| Author here. I commented down below, but it's probably more
| relevant in this thread.
|
| For a bit of clarity around IP address hashes: the only use
| they have in this context is preventing duplicate hits in a day
| (making each page view unique by default). At the end of each
| day there is a worker job that scrubs the ip hash which is now
| irrelevant.
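A minimal sketch of that dedup-and-scrub scheme (the in-memory dict and function names are illustrative; in practice this would be a database plus the worker job described):

```python
import hashlib
from datetime import date

hits: dict[str, set[str]] = {}  # page -> set of today's IP hashes

def record_hit(page: str, ip: str) -> bool:
    # Mixing in today's date means yesterday's hashes can't be joined
    # with today's, and the nightly scrub removes them entirely.
    h = hashlib.sha256(f"{ip}{date.today().isoformat()}".encode()).hexdigest()
    seen = hits.setdefault(page, set())
    if h in seen:
        return False  # same visitor already counted today
    seen.add(h)
    return True       # unique view: count it

def nightly_scrub() -> None:
    for seen in hits.values():
        seen.clear()  # the IP hashes are now irrelevant
```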
| myfonj wrote:
| Have you considered serving actual small transparent image
| with caching headers set to expire at midnight?
| freitasm wrote:
| "The only downside to this method is if there are multiple reads
| from the same IP address but on separate devices, it will still
| only be seen as one read. And I'm okay with that since it
| constitutes such a minor fragment of traffic."
|
| Many ISPs are now using CG-NAT so this approach would miscount
| thousands of visitors seemingly coming from a single IP address.
| tmikaeld wrote:
| Only if all of them use the exact same user agent
| platform/browser.
|
| (It would be better if he used a hash of the raw user agent
| string)
| freitasm wrote:
| UA aren't unique these days.
| colesantiago wrote:
| How would one block this from tracking you?
|
| I think we would either need to send fake data to these analytics
| tools deliberately like https://adnauseam.io/
|
| Or now include CSS as a spy tracker that needs to be blocked.
| Kiro wrote:
| I don't see how this is more intrusive for privacy than what
| you can already get from access logs.
| colesantiago wrote:
| It is still tracking you so it needs to be blocked.
| Kiro wrote:
| So are access logs. How are you going to block those?
| colesantiago wrote:
| I never said anything about access logs, I specifically
| mentioned this CSS trick that will become popular for ad
| companies to track people.
|
| For this, this would need to block the endpoint or send
| obfuscated data deliberately in protest of this.
|
| Should you _want_ to cover access logs also, then forms
| of tracking then sending excessive, random obfuscation
| data with adnauseam would also help here.
|
| https://adnauseam.io/
| jokethrowaway wrote:
| I sure hope you're being sarcastic here and illustrating
| the ridiculousness of privacy extremists (who, btw, ruined
| the web, thanks to a few idiot politicians in the EU).
|
| If not, what's wrong with a service knowing you're
| accessing it? How can they serve a page without knowing
| you're getting a page?
| callalex wrote:
| Ruined the web? It sure seems like the web still works
| from my perspective. What has been ruined for you?
| matrss wrote:
| If it is not then it must be unnecessary, since you could get
| the same information from the access logs already.
| its-summertime wrote:
| ||/hit/*$image
|
| In your favorite ad blocker
| meiraleal wrote:
| Interesting approach but what about mobile users?
| welpo wrote:
| From the article:
|
| > Now, when a person hovers their cursor over the page (or
| scrolls on mobile) it triggers body:hover which calls the URL
| for the post hit
| cantSpellSober wrote:
| It _doesn't_ do that though.
|
| > The :hover pseudo-class is problematic on touchscreens.
| Depending on the browser, the :hover pseudo-class might never
| match
|
| https://developer.mozilla.org/en-US/docs/Web/CSS/:hover
|
| Don't take my word for it. Trying it in mobile emulators will
| have the same result.
| rzmmm wrote:
| > Now, when a person hovers their cursor over the page (or
| scrolls on mobile)...
|
| I can imagine many cases where a real human user doesn't scroll
| the page on a mobile platform. I like the CSS approach but I'm not sure
| it's better than doing some bot filtering with the server logs.
| freitzzz wrote:
| I attempted to do this back at the start of this year, but lost
| motivation building the web ui. My trick is not CSS but simply
| loading fake images with <img> tags:
|
| https://github.com/nolytics
| openplatypus wrote:
| The CSS tracker is as useful as server log-based analytics. If
| that is the information you need, cool.
|
| But JS trackers are so much more. Time spent on the website,
| scroll depth, screen sizes, some limited and compliant and yet
| useful unique sessions, those things cannot be achieved without
| some (simple) JS.
|
| Server side, JS, CSS... No one size fits all.
|
| Wide Angle Analytics has strong privacy, DNT support, an opt-out
| mechanism, EU cloud, compliance documentation, and full process
| adherence. Employs non-reversible short-lived sessions that still
| give you good tracking. Combine it with custom domain or first-
| party API calls and you get near 100% data accuracy.
| croes wrote:
| If it's an US company then EU cloud doesn't matter regarding
| data protection for EU citizens.
|
| The Cloud Act rendered that worthless.
| openplatypus wrote:
| Wide Angle Analytics is German company operating everything
| on EU cloud (EU owners, EU location).
| EspressoGPT wrote:
| You could probably even analyze screen sizes by doing the same
| thing but with CSS media queries.
| TekMol wrote:
| The CSS tracker is as useful as server log-based analytics.
|
| It is not. Have you read the article?
|
| The whole point of the CSS approach is to weed out user agents
| which are not doing mouse hover on the body events. You can't
| see that from server logs.
| jokethrowaway wrote:
| Lovely technique and probably more than adequate for most uses.
|
| My scraping bots use an instance of chrome and therefore trigger
| hover as well, but you'll cut out the less sophisticated bots.
|
| This is because of protection systems, if I try to scrape my
| target website with just code I just get insta banned /
| "captched".
| jerbear4328 wrote:
| Are you sure? Even if you run a headless browser, you might not
| be triggering the hover event, unless you specifically tell it
| to or your framework simulates a virtual mouse that triggers
| mouse events and CSS.
|
| You totally could be triggering it, but not every bot will,
| even the fancy ones.
| fatih-erikli wrote:
| This is known as "pixel tracker" for decades.
| cantSpellSober wrote:
| Used in emails as well. Loading a 1x1 transparent <img> is a
| more sure thing than triggering a hover event, but ad-blockers
| often block those
| t0astbread wrote:
| Occasionally I've seen people fail and add the pixel as an
| attachment instead.
| blacksmith_tb wrote:
| True, though doing it in CSS does have a couple of interesting
| aspects, using :hover would filter out bots that didn't use a
| full-on webdriver (most bots, that is). I would think that
| using an @import with 'supports' for an empty-ish .css file
| would be better in some ways (since adblockers are awfully good
| at spotting 1px transparent tracking pixels, but less likely to
| block .css files to avoid breaking layouts), but that wouldn't
| have the clever :hover benefits.
| chrismorgan wrote:
| I'd like to see a comparison of the server log information with
| the hit endpoint information: my feeling is that the reasons for
| separating it don't really hold water, and that the initial
| request server logs could fairly easily be filtered to acceptable
| quality levels, obviating the subsequent request.
|
| The basic server logs include declared bots, undeclared bots
| pretending to use browsers, undeclared bots actually using
| browsers, and humans.
|
| The hit endpoint logs will exclude almost all declared bots,
| almost all undeclared bots pretending to use browsers, and some
| humans, but will retain a few undeclared bots that search for and
| load subresources, and almost all humans. About undeclared bots
| that actually use browsers, I'm uncertain as I haven't inspected
| how they are typically driven and what their initial mouse cursor
| state is: if it's placed within the document it'll trigger, but
| if it's not controlled it'll probably be outside the document. (
| _Edit:_ actually, I hadn't considered that bearblog caps the body
| element's width and uses margin, so if the mouse cursor is not in
| the main column it won't trigger. My feeling is that this will
| get rid of almost all undeclared bots using browsers, but
| _significantly_ undercount users with large screens.)
|
| But my experience is that reasonably simple heuristics do a
| pretty good job of filtering out the bots the hit endpoint also
| excludes.
|
| * Declared bots: the filtration technique can be ported as-is.
|
| * Undeclared bots pretending to use browsers: that's a
| behavioural matter, but when I did a _little_ probing of this
| some years ago, I found that a great many of them were using
| unrealistic user-agent strings, either visibly wonky or
| impossible or just corresponding to browsers more than a year old
| (which almost no real users are using). I suspect you could get
| rid of the vast majority of them reasonably easily, though it
| might require occasional maintenance (you could do things like
| estimate the browser's age based on their current version number
| and release cadence, with the caveat that it may slowly drift and
| should be checked every few years) and will certainly exclude a
| very few humans.
|
| * Undeclared bots actually using browsers: this depends on the
| unknown I mentioned above: whether they position their mice in the
| document area. But my suspicion is that these simply aren't worth
| worrying about because they're not enough to notably skew things.
| Actually using browsers is _expensive_; people avoid it where
| possible.
|
| And on the matter of humans, it's worth clarifying that the hit
| endpoint is _worse_ in some ways, and honestly quite risky:
|
| * Some humans will use environments that _can't_ trigger the
| extra hit request (e.g. text-mode browsers, or using some service
| that fetches and presents content in a different way);
|
| * Some humans will behave in ways that _don't_ trigger the extra
| hit request (e.g. keyboard-only with no mouse movement, or
| loading then going offline);
|
| * Some humans will block the extra hit request; and if you upset
| the wrong people or potentially even become too popular, it'll
| make its way into a popular content blocker list and significant
| fractions of your human base will block it. _This_, in my
| opinion, is the biggest risk.
|
| * There's also the risk that at some point browsers might
| prefetch such resources to minimise the privacy leak. (Some email
| clients have done this at times, and browsers have wrestled from
| time to time with related privacy leaks, which have led to the
| hobbling of what properties :visited can affect, and other
| mitigations of clickjacking. I think it _conceivable_ that such a
| thing could be changed, though I doubt it will happen and there
| would be plenty of notice if it ever did.)
|
| But there's a deeper question to it: _if_ you don't exclude some
| bots; or _if_ the URL pattern gets on a popular content filter
| list: does it matter? Does it skew the ratios of your results
| significantly? (Absolute numbers have never been particularly
| meaningful or comparable between services or sources: you can
| only meaningfully compare numbers from within a source.) My
| feeling is that after filtering out most of the bots in fairly
| straightforward ways, the data that remains is likely to be of
| similar enough quality to the hit endpoint technique: both will
| be overcounting in some areas and undercounting in others, but I
| expect both to be Good Enough, at which point I prefer the
| simplicity of not having a separate endpoint.
|
| (I think I've presented a fairly balanced view of the facts and
| the risks of both approaches, and invite correction in any point.
| Understand also that I've never tried doing this kind of analysis
| in any _detail_, and what examination and such I have done was
| almost all 5-8 years ago, so there's a distinct possibility that
| my feelings are just way off base.)
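The user-agent heuristics sketched in the comment above could start out something like this (the patterns and version cutoff are hypothetical illustrations, not a vetted bot list):

```python
import re

DECLARED_BOT = re.compile(r"bot|crawler|spider|curl|wget", re.I)
CHROME_MAJOR = re.compile(r"Chrome/(\d+)")
CURRENT_CHROME = 119  # rough "current" major version as of late 2023

def looks_human(user_agent: str) -> bool:
    # Declared bots identify themselves in the UA string.
    if not user_agent or DECLARED_BOT.search(user_agent):
        return False
    # A browser build more than roughly a year old is rarely a real
    # visitor; undeclared bots often ship stale UA strings.
    m = CHROME_MAJOR.search(user_agent)
    if m and int(m.group(1)) < CURRENT_CHROME - 10:
        return False
    return True

print(looks_human("Mozilla/5.0 (Windows NT 10.0) Chrome/118.0.0.0"))  # True
print(looks_human("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # False
print(looks_human("Mozilla/5.0 Chrome/79.0.3945.88"))                  # False
```

As the comment notes, the version estimate would need occasional maintenance to track release cadence.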
| p4bl0 wrote:
| I have a genuine question that I fear might be interpreted as a
| dismissive opinion but I'm actually interested in the answer:
| what's the goal of collecting analytics data in the case of
| personal blogs in a non-commercial context such as what Bearblog
| seems to be?
| Veen wrote:
| Curiosity? I like to know if anyone is reading what I write.
| It's also useful to know what people are interested in. Even
| personal bloggers may want to tailor content to their audience.
| It's good to know that 500 people have read an article about
| one topic, but only 3 people read one about a different topic.
| mrweasel wrote:
| For the curiosity, one solution I've been pondering, but
| never gotten around to implementing is just logging the
| country of origin for a request, rather than the entire IP.
|
| IPs are useful in case of attack, but you could limit
| yourself to simply logging subnets. It's a little more
| aggressive block a subnet, or an entire ISP, but it seems
| like a good tradeoff.
| taurusnoises wrote:
| I can speak to this from the writer's perspective as someone
| who has been actively blogging since c. 2000 and has been
| consistently (very) interested in my "stats" the entire time.
|
| The primary reason I care about analytics is to see if posts
| are getting read, which on the surface (and in some ways) is
| for reasons of vanity, but is actually about writer-reader
| engagement. I'm genuinely interested in what my readers
| resonate with, because I want to give them more of that. The
| "that" could be topical, tonal, length, who knows. It helps me
| hone my material specifically for my readers. Ultimately, I
| could write about a dozen different things in two dozen
| different ways. Obviously, I do what I like, but I refine it to
| resonate with my audience.
|
| In this sense, analytics are kind of a way for me to get to
| know my audience. With blogs that had high engagement,
| analytics gave me a sort of fuzzy character description of who
| my readers were. As with above, I got to see what they liked,
| but also when they liked it. Were they reading first thing in
| the morning? Were they lunch time readers? Were they late at
| night readers? This helped me choose (or feel better about)
| posting at certain times. Of course, all of this was fuzzy
| intel, but I found it really helped me engage with my
| readership more actively.
| hennell wrote:
| Feedback loops. Contrary to what a lot of people seem to think,
| analytics is not just about advertising or selling data, it's
| about analysing site and content performance. Sure that can be
| used (and abused) for advertising, but it's also essential if
| you want any feedback about what you're doing.
|
| You might get no monetary value from having 12 people read the
| site or 12,000 but from a personal perspective it's nice to
| know what people want to read about from you, and so you can
| feel like the time you spent writing it was well spent, and
| adjust if you wish to things that are more popular.
| colesantiago wrote:
| If you want to send obfuscated data on purpose to prevent this
| dark pattern behaviour from spreading I recommend Adnauseam.
|
| (not the creator, just a regular user of this great tool)
|
| We need more tools that send random, fake data to analytics
| providers which renders the analytics useless to them in protest
| of tracking.
|
| If there are any more like Adnauseam, I would love to know.
|
| https://adnauseam.io/
| myfonj wrote:
| Seems clever and all, but `body:hover` will most probably
| completely miss all "keyboard-only" users and users with user
| agents (assistive technologies) that do not use pointer devices.
|
| Yes, these are marginal groups perhaps, but it is always a
| super bad sign to see them excluded in any way.
|
| I am not sure (I doubt) there is a 100 % reliable way to detect
| that "real user is reading this article (and issue HTTP request)"
| from baseline CSS in every single user agent out there (some of
| them might not support CSS at all, after all, or have loading of
| any kind of decorative images from CSS disabled).
|
| There are modern selectors that could help, like :root:focus-
| within (requiring that the user actually focus something
| interactive there, which again is not guaranteed to trigger
| that selector in all agents), and/or bleeding edge scroll-linked
| animations (`@scroll-timeline`). But again, braille readers will
| probably remain left out.
| demondemidi wrote:
| Keyboard only users? All 10 of them? ;)
| bayindirh wrote:
| Well with me, it's probably 11.
|
| Joking aside, I love to read websites with keyboards, esp. if
| I'm reading blogs. So, it's possible that sometimes my
| pointer is out there somewhere to prevent distraction.
| myfonj wrote:
| I think there might be more than ten [1] blind folks using
| computers out there, most of them not using pointing devices
| at all or not in a way that would produce "hover".
|
| [1] that was base ten, right?
| zichy wrote:
| Think about screen readers.
| vivekd wrote:
| I'm a keyboard user when on my computer, qutebrowser but I
| think your sentiments are correct, the numbers of keyboard
| only users are probably much much smaller than the number of
| people using Adblock. So OPs method is likely to produce a
| more accurate analytics than a JavaScript only design.
|
| OP just thought of a creative, effective and probably faster
| more code efficient way to do analytics. I love it, thanks OP
| for sharing it
| paulddraper wrote:
| https://www.youtube.com/watch?v=lKie-vgUGdI
| qingcharles wrote:
| Marginal? Surely this affects 50%+ of user-agents, i.e. phones
| and tablets which don't support :hover? (without a mouse being
| plugged in)
| myfonj wrote:
| I think most mobile browsers emit "hover" state whenever you
| tap / drag / swipe over something in the page. "active" state
| is even more reliable IMO. But yes, you are right that it is
| problematic. Quoting MDN page about ":hover" [1]:
|
| > Note: The :hover pseudo-class is problematic on
| touchscreens. Depending on the browser, the :hover pseudo-
| class might never match, match only for a moment after
| touching an element, or continue to match even after the user
| has stopped touching and until the user touches another
| element. Web developers should make sure that content is
| accessible on devices with limited or non-existent hovering
| capabilities.
|
| [1] https://developer.mozilla.org/en-US/docs/Web/CSS/:hover
| callalex wrote:
| I really wish modern touchscreens spent the extra few cents
| to support hover. Samsung devices from the ~2012 era all
| supported detection of fingers hovering near the screen. I
| suspect it's terrible patent laws holding back this
| technology, like most technologies that aren't headline
| features.
| gizmo wrote:
| > Even Fathom and Plausible analytics struggle with logging
| activity on adblocked browsers.
|
| The simple solution is to respect the basic wishes of those who
| do not want to be tracked. This is a "struggle" only because
| website operators don't want to hear no.
| reustle wrote:
| As much I agree with respecting folks wishes to not be tracked,
| most of these cases are not about "tracking".
|
| It's usually website hosts just wanting to know how many folks
| are passing through. If a visitor doesn't even want to
| contribute to incrementing a private visit counter by +1, then
| maybe don't bother visiting.
| gizmo wrote:
| If it was just about a simple count the host could just `wc
| -l access.log`. Clearly website hosts are not satisfied with
| that, and so they ignore DO_NOT_TRACK and disrespectfully try
| to circumvent privacy extensions.
| jakelazaroff wrote:
| Is there a meaningful difference between recording "this IP
| address made a request on this date" and "this IP address
| made a request on this date after hovering their cursor
| over the page body"? How is your suggestion more acceptable
| than what the blog describes?
| gizmo wrote:
| Going out of your way to specifically track people who
| indicate they don't want to be tracked is worse.
| vivekd wrote:
| Google cloud and AWS VPS and many hosting services
| collect and provide this info by default. Chances are
| most websites do this including this one you are using
| now. HN does up bans meaning they must access visitor IP.
|
| Why aren't people starting their protest against the
| website they're currently using, instead of at OP?
| gizmo wrote:
| We all know that tracking is ubiquitous on the web. This
| blogpost however discusses technology that specifically
| helps with tracking people who don't want to be tracked.
| I responded that an alternative approach is to just not.
| That's not hypocritical.
| joshmanders wrote:
| Again, you don't answer the question: what's the difference
| between an image pixel or JavaScript logging that you visited
| the site vs. nginx/apache logging that you visited the site?
|
| Being upset that OP used an image or JavaScript instead of
| grepping `access.log` makes absolutely no sense. The same
| data is shown there.
| gizmo wrote:
| It's rude to tell people how they feel and it's rude to
| assert a post makes "absolutely no sense" while at the
| same time demanding a response.
|
| One difference is intent. When you build an analytics
| system you have an obligation to let people opt out.
| Access logs serve many legitimate purposes, and yes, they
| can also be used to track people, but that is not why
| access logs exist. This difference is also reflected in
| law. Using access logs for security purposes is always
| allowed but using that same data for marketing purposes
| may require an opt-in or disclosure.
| jakelazaroff wrote:
| My point is that your `wc -l access.log` solution will
| _also_ track people who send the Do Not Track header
| unless you specifically prevent it from doing so. In
| fact, you could implement the _exact same system_
| described in the blog post by replacing the Python code
| run on every request with an aggregation of the access
| log. So what is the pragmatic difference between the two?
| gizmo wrote:
| Even the GDPR makes this distinction. Access logs (with
| IP addresses) are fine if you use them for technical
| purposes (monitor for 404, 500 errors) but if you use
| access logs for marketing purposes you need users to opt-
| in, because IP addresses are considered PII by law. And
| if you don't log IPs then you can't track uniques.
| Tracking and logging are not the same thing.
| arp242 wrote:
| > If it was just about a simple count the host could just
| `wc -l access.log`
|
| That doesn't really work because a huge amount of traffic is
| from 1) bots, 2) prefetches and other things that shouldn't
| be counted, 3) the same person loading the page 5 times,
| visiting every page on the site, etc. In short, these
| numbers will be wildly wrong (and in my experience "how
| wrong" can also differ quite a bit per site and over time,
| depending on factors that are not very obvious).
|
| What people want is a simple break-down of useful things
| like which entry pages are used, where people came from (as
| in: "did my post get on the frontpage of HN?")
|
| I don't see how anyone's privacy or anything is violated
| with that. You can object to that of course. You can also
| object to people wearing a red shirt or a baseball cap. At
| some point objections become unreasonable.
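|
| To make that concrete, here is a minimal sketch of that kind of
| log aggregation (the log format, bot markers, and sample lines
| are illustrative, not from any real site):

```python
import re

# Hypothetical sketch: count "unique-ish" visitors from a
# common-log-format access.log, filtering self-identified bots
# and collapsing repeat hits from the same IP.
BOT_MARKERS = ("bot", "crawler", "spider", "preview", "fetch")

def count_unique_visitors(lines):
    seen = set()
    for line in lines:
        m = re.match(r'(\S+) \S+ \S+ \[.*?\] "(.*?)" \d+ \S+ "(.*?)" "(.*?)"', line)
        if not m:
            continue
        ip, request, referrer, user_agent = m.groups()
        if any(marker in user_agent.lower() for marker in BOT_MARKERS):
            continue  # skip self-identified crawlers
        seen.add(ip)  # the same person loading the page 5 times counts once
    return len(seen)

log = [
    '1.2.3.4 - - [01/Nov/2023:08:08:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '1.2.3.4 - - [01/Nov/2023:08:09:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Nov/2023:08:10:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(count_unique_visitors(log))  # prints 1
```

| Even this undercounts badly once bots fake a browser user-agent,
| which is part of why hosts reach for client-side signals at all.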
| anonymouse008 wrote:
| I don't know how I feel about this overall. I think we took
| some rules from the physical world that we liked and discarded
| others, such that we've ended up with a cognitively dissonant
| space.
|
| For example, if you walked into my coffee shop, I would be able
| to lay eyes on you and count your visits for the week. I could
| also observe where you sit and how long you stay. If I were to
| better serve you with these data points, by reserving your
| table before you arrive with your order ready, you'd probably
| welcome my attention to detail. However, if I were to see that
| you pulled about x watts a month from my outlets and then
| suddenly locked up the outlets for a fee, you'd rightfully
| wish never to be observed again.
|
| So what I'm getting at is: the issue with tracking appears to
| be the perverse assholes versus the benevolent shopkeepers
| doing the tracking.
|
| To wrap up this thought: what's happening now though is a
| stalker is following us into every store, watching our every
| move. In physical space, we'd have this person arrested and
| assigned a restraining order with severe consequences. However,
| instead of holding those creeps accountable, we've punished the
| small businesses that just want to serve us.
|
| --
|
| I don't know how I feel about this or really what to do.
| croniev wrote:
| The coffee shop reserving my place and having my order ready
| before I arrive sounds nice - but is it not an unnecessary
| luxury, one that I would not miss had I never even thought of
| its possibility? I never asked for it; I was ready to stand in
| line for my order, and the tracking of my behavior resulted
| in a pleasant surprise, not a feature I was hoping for. If I
| really wanted my order to be ready when I arrive, then I
| would provide the information to you, not expect that you
| observe me to figure it out.
|
| My point is that I don't get why small businesses should
| have the right to track me to offer better services that I
| never even asked for. Sure, it's nice, but it's not worth
| deregulating tracking and allowing all the evil corps to
| track me too.
| joshmanders wrote:
| Here's a better analogy using the coffee shop:
|
| You walk into your favorite coffee shop and order your
| favorite coffee, every day. But for privacy reasons the
| coffee shop owner is unaware of anything. He doesn't even
| track inventory; he just orders whatever, whenever.
|
| One day you walk in and now you can't get your favorite
| coffee... Because the owner decided to remove that item
| from the menu. You get mad, "Where's my favorite coffee?"
| the barista says "owner removed it from menu" and you get
| even more upset "Why? Don't you know I come in here every
| day and order the same thing?!"
|
| Nope, because you don't want any amount of tracking
| whatsoever; knowing any type of habit from visitors is
| wrong!
|
| But in this scenario, the owner knowing that you order that
| coffee every day is what ensures it never leaves the menu,
| so you actually do like tracking.
| majewsky wrote:
| The coffee shop analogy falls apart after a few seconds
| because tracking in real life does not scale the same way
| that tracking in the digital space scales. If you wanted to
| track movements in a coffee shop as detailed as you can on
| websites or applications with heavy tracking, you would need
| to have a dozen people with clipboards strewn about the
| place, at which point it would feel justifiably dystopian.
| The only difference on a website is that the clipboard-
| bearing surveillers are not as readily apparent.
| lcnPylGDnU4H9OF wrote:
| > you would need to have a dozen people with clipboards
| strewn about the place
|
| Assuming you live in the US, next time you're in a grocery
| store, count how many cameras you can spot. Then consider:
| these cameras could possibly have facial recognition
| software; these cameras could possibly have object
| recognition software; these cameras could possibly have
| software that tracks eye movements to see where people are
| looking.
|
| Then wonder: do they have cameras in the parking lot? Maybe
| those cameras can read license plates to know which
| vehicles are coming and going. Any time I see any sort of
| news about information that can be retrieved from a photo,
| I assume that it will be done by many cameras at >1 Hertz
| in a handful of years.
| arp242 wrote:
| I don't think it's very important that people "can" do
| this; the only thing that matters is whether they actually
| "are" doing it.
| al_borland wrote:
| I think that's the point. It's the level of detail of
| tracking online that's the problem. If a website just wants
| to know someone showed up, that's one thing. If a site
| wants to know that I specifically showed up, and dig in to
| find out who I specifically am, and what I'm into so they
| can target me... that's too much.
|
| Current website tracking is like the coffee shop owner
| hiring a private investigator to dig into the personal lives
| of everyone who walks in the door so they can suggest the
| right coffee and custom cup without having to ask. They
| could not do that and just let someone pick their own
| cup... or give them a generic one. I'd like that better. If
| clipboards in coffee shops are dystopian, so is current web
| tracking, and we should feel the same about it.
|
| I think Bear strikes a good balance. It lets authors know
| someone is reading, but it's not keeping profiles on users
| to target them with manipulative advertising or some kind
| of curated reading list.
| OhMeadhbh wrote:
| I have, unfortunately, become cynical in my old age. Don't take
| this the wrong way, but...
|
| <cynical_statement> The purpose of the web is to distribute
| ads. The "struggle" is with people who think we made this
| infrastructure so you could share recipes with your
| grandmother. </cynical_statement>
| gizmo wrote:
| No matter how bad the web gets, it can still get worse.
| Things can always get worse. That's why I'm not a cynic. Even
| when the fight is hopeless --and I don't believe it is--
| delaying the inevitable is still worthwhile.
| al_borland wrote:
| The infrastructure was put in place for people to freely
| share information. The ads only came once people started
| spending their time online and that's where the eyeballs
| were.
|
| The interstate highway system in the US wasn't built with the
| intent of advertising to people, it was to move people and
| goods around (and maybe provide a means to move around the
| military on the ground when needed). Once there were a lot of
| eyes flying down the interstate, the billboard was used to
| advertise to those people.
|
| The same thing happened with the newspaper, magazines, radio,
| TV, and YouTube. The technology comes first and the ads come
| with popularity and as a means to keep it low cost. We're
| seeing that now with Netflix as well. I'm actually a little
| surprised that books don't come with ads throughout them...
| maybe the longevity of the medium makes ads impractical.
| digging wrote:
| Hm, I don't think that's the _purpose_ of the web. It's just
| the most common use case.
| fdaslkjlkjklj wrote:
| Looks like a clever way to do analytics. Would be neat to see how
| it compares with just munging the server logs since you're only
| looking at page views basically.
|
| Re: the hashing issue, it looks interesting, but adding more
| entropy from other client headers and using a stronger hash
| algorithm should be fine.
| HermanMartinus wrote:
| Hey, author here. For a bit of clarity around IP address
| hashes: the only use they have in this context is preventing
| duplicate hits in a day (making each page view unique by
| default). At the end of each day a worker job empties them
| out while retaining the hit info.
|
| I've added an edit to the essay for clarity.
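|
| As described, the scheme might look roughly like this in Python
| (the function names, hash inputs, and in-memory storage are my
| assumptions for illustration, not Bear's actual code):

```python
import hashlib

# Hypothetical sketch of daily-unique counting as described above.
# seen_today maps page -> set of hashed visitor IDs; it is wiped
# daily, so only the aggregate hit counts survive.
seen_today = {}
hit_counts = {}

def record_hit(page, ip, user_agent, day):
    # Salting with the date means the same visitor hashes
    # differently tomorrow, so hashes can't be linked across days.
    visitor = hashlib.sha256(f"{day}:{ip}:{user_agent}".encode()).hexdigest()
    if visitor in seen_today.setdefault(page, set()):
        return False  # duplicate hit today; not counted again
    seen_today[page].add(visitor)
    hit_counts[page] = hit_counts.get(page, 0) + 1
    return True

def end_of_day_purge():
    # The worker job: drop the hashes, keep the hit info.
    seen_today.clear()

record_hit("/post", "1.2.3.4", "Mozilla/5.0", "2023-11-01")
record_hit("/post", "1.2.3.4", "Mozilla/5.0", "2023-11-01")  # duplicate, ignored
end_of_day_purge()
print(hit_counts["/post"])  # prints 1
```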
| dantiberian wrote:
| You should add this as a reply to the top comment as well.
| bosch_mind wrote:
| If 10 users share an IP on a shared VPN around the globe and
| hit your site, you only count that as 1? What about corporate
| networks, etc.? IP is a bad indicator.
| Culonavirus wrote:
| It's a bad indicator, especially since people who would
| otherwise not use a VPN apparently started using this:
| https://support.apple.com/en-us/102602
| Galanwe wrote:
| Not even mentioning CGNAT.
| fishtoaster wrote:
| The idea of using CSS-triggered requests for analytics was really
| cool to me when I first encountered it.
|
| One guy on Twitter (no longer available) used it for mouse
| tracking: overlay an invisible grid of squares on the page, each
| with a unique background image triggered on hover. Each
| background image sends a specific request to the server, which
| interprets it!
|
| For fun one summer, I extended that idea to create a JS-free "css
| only async web chat": https://github.com/kkuchta/css-only-chat
| joewils wrote:
| This is really clever and fun. I'm curious how the results
| compare to something "old school" like AWStats?
|
| https://github.com/eldy/awstats
| stmblast wrote:
| Bearblog is AWESOME!!
|
| I use it for my personal stuff and it's great. No hassle, just
| paste your markdown in and you're good to go.
___________________________________________________________________
(page generated 2023-11-01 23:00 UTC)