[HN Gopher] Big data may not know your name, but it knows everyt...
___________________________________________________________________
Big data may not know your name, but it knows everything else
Author : nemoniac
Score : 16 points
Date : 2021-12-30 09:40 UTC (13 hours ago)
(HTM) web link (www.wired.com)
(TXT) w3m dump (www.wired.com)
| aledalgrande wrote:
| To me it is crazy that selling that data is even legal.
| throwaway_465 wrote:
| Enjoy:
|
| FinanceIQ by AnalyticsIQ - Consumer Finance Data USA - 241M
| Individuals https://datarade.ai/data-products/financeiq
|
| Individual Consumer Data
| https://datarade.ai/search?utf8=%E2%9C%93&category=individua...
| agsnu wrote:
| If you're interested in this topic, I recommend the chapter on
| Inference Control in Ross Anderson's excellent book "Security
| Engineering". It's one of the chapters freely available on his
| web site
| https://www.cl.cam.ac.uk/~rja14/Papers/SEv3-ch11-7sep.pdf
| specialist wrote:
| Two tangential "yes and" points:
|
| 1)
|
| I'm not smart enough to understand differential privacy.
|
| So my noob mental model is: fuzz the data to create hash
| collisions. Differential privacy's heuristics guide the effort,
| e.g. how much source data and how much fuzz you need to get X%
| certainty of "privacy", meaning the likelihood that someone
| could reverse the hash to recover the source identity.
|
| BUT: this is entirely moot if the original (now fuzzed) data
| set can be correlated with another data set.
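|
| A toy illustration of that correlation problem (made-up column
| names and data, just a sketch): even with names stripped, a
| join with an auxiliary data set on a few quasi-identifiers puts
| the names right back.
|
|     import pandas as pd
|
|     # "Anonymised" release: names removed, but quasi-identifiers
|     # (zip code, birth year, sex) left intact.
|     released = pd.DataFrame({
|         "zip": ["94107", "94107", "10001"],
|         "birth_year": [1984, 1991, 1975],
|         "sex": ["F", "M", "F"],
|         "diagnosis": ["asthma", "diabetes", "hypertension"],
|     })
|
|     # Auxiliary data an attacker already holds (voter roll,
|     # data broker file, scrape...), with names attached.
|     auxiliary = pd.DataFrame({
|         "name": ["Alice", "Bob", "Carol"],
|         "zip": ["94107", "94107", "10001"],
|         "birth_year": [1984, 1991, 1975],
|         "sex": ["F", "M", "F"],
|     })
|
|     # A plain join on the shared quasi-identifiers re-attaches
|     # names to the "anonymous" records.
|     reidentified = released.merge(
|         auxiliary, on=["zip", "birth_year", "sex"])
|     print(reidentified[["name", "diagnosis"]])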
|
| 2)
|
| All PII should be encrypted at rest, at the field level.
|
| I really wish Wayner's Translucent Databases were better known.
| TLDR: Wayner shows clever ways of using salt+hash to protect
| identity, just as properly protected password files are salted
| and hashed.
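|
| Something like this minimal sketch (not Wayner's actual schema,
| just the flavour of it): store only a salted hash of the
| identifier, so the table can still be keyed and queried on it
| without ever holding the raw identity.
|
|     import hashlib
|     import secrets
|
|     # Per-deployment salt, kept outside the database itself.
|     SALT = secrets.token_bytes(16)
|
|     def opaque_id(ssn: str) -> str:
|         """Salted hash standing in for the raw identifier."""
|         return hashlib.sha256(SALT + ssn.encode()).hexdigest()
|
|     # The stored row carries results keyed by the opaque id;
|     # the SSN itself never touches the table.
|     row = {"patient": opaque_id("078-05-1120"), "hdl": 62}
|
|     # Anyone who knows the SSN (and the salt) can still find
|     # the row, but the table alone doesn't say who it's about.
|     assert row["patient"] == opaque_id("078-05-1120")
|
| (A low-entropy identifier like an SSN can still be brute-forced
| if the salt leaks, so the salt needs the same care as a
| password file's.)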
|
| Again, entirely moot if protected data is correlated with another
| data set.
|
| http://wayner.org/node/46
|
| https://www.amazon.com/Translucent-Databases-Peter-Wayner/dp...
|
| Bonus point 3)
|
| The privacy "fix" is to extend property rights to all personal
| data.
|
| My data is me. I own it. If someone's using my data, for any
| reason, I want my cut.
|
| Pay me.
| lrem wrote:
| In Google we have a bunch of researchers on anonymity and that
| whole thing is _hard_. I vaguely remember supporting, a couple
| of years ago, a pipeline where logs stripped of "all PII" came
| in at one end and aggregated data came out of the middle... into
| an anonymity verifier, which then redirected much of it to
| /dev/null, because a technique was known to de-anonymise it. And
| the research on differential privacy has advanced quite a bit in
| the meantime.
| geoduck14 wrote:
| Did that stuff work?
| dang wrote:
| We've merged https://news.ycombinator.com/item?id=29734713 into
| this thread since the previous submission wasn't to the original
| source. That's why some comment timestamps are older.
| MrDresden wrote:
| Anecdotally, the only time I've seen a truly anonymized database
| was in a European genetics research company, due mainly to the
| rightly strict regulation required in the medical field.
|
| There was a whole separate legal entity, with its own board, that
| did the phenotype measurement gathering and stored the data in a
| big database on premises. The link between those measurements
| and the individual's personally identifiable record was then
| stored in a separate airgapped database which had cryptographic
| locks implemented (on the data, and on physical access to the
| server), so accessing the data took the physical presence of the
| privacy officers of each of the two companies (the measurement
| lab and the research lab) and finally, what I found at the time
| to be a unique move: a representative from the state-run privacy
| watchdog.
|
| To be able to backtrack the data to a person, there was always
| going to be a need to go through the watchdog. Technically
| required, not just legally mandated.
|
| All of the measurement data that was stored in the database came
| from very restricted input fields in the custom software that was
| made on prem (no long form text input fields for instance, where
| identifying data could be put in accidentally), and there was a
| lot of thought put into the design of the UI to limit the
| possibility that anyone could put identifiable data into the
| record.
|
| For instance, numerical ranges for a specific phenotype were all
| prefilled in a dropdown, so as to keep user key input to a
| minimum. Much of the data also came from direct connections to
| the medical equipment (I wrote a serial connector for a Humphrey
| medical eye scanner that parsed the results straight into the
| software, skipping the human element altogether).
|
| This didn't make for the nicest looking software (endless
| dropdowns or scales), but it fulfilled its goal of functionality
| and privacy concerns perfectly.
|
| The measurement data would then go through many automatic filters
| and further anonymizing processes before being delivered through
| a dedicated network pipeline (configured through the local ISP to
| be unidirectional) to the research lab.
|
| Is this guaranteed to never leak any private information? No,
| nothing is 100%. This comes damn near close to it, but of course
| would not work in most other normal business situations.
| Hendrikto wrote:
| > To be able to backtrack the data to a person, there was
| always going to be a need to go through the watchdog.
|
| The assumption in the deanonymization literature is that this
| data is unavailable. So no, you don't need to go through any
| watchdog.
| MrDresden wrote:
| Yes, they had to, in case the person giving the data had opted
| to be notified about a severe medical condition or other
| revelations that might show up during the analysis process.
| For those cases, these mappings were kept around, and did
| require going through the watchdog.
| Hendrikto wrote:
| I think you misunderstood my point. There are ways of
| deanonymizing data, just by looking at the data alone. In
| fact, this is the standard assumption.
|
| This watchdog stuff is nice for the "good" actors, but
| irrelevant for adversaries.
| MrDresden wrote:
| I did understand the point perfectly. The mechanism was
| there simply for the _good_ actors to backtrace the data
| to the matching person; its purpose was never to play a
| part in making the data more anonymous.
|
| If what you meant to say was the clearer statement that
| adversaries wouldn't need to do that, then I'd have
| agreed with you.
|
| Everything else that was mentioned - the strict processes
| for determining what data could be stored and what, if
| anything, it exposed about the user, eliminating as much
| of the human input as possible, and the post-processing
| of the data before it left the measurement lab - those
| are the steps that achieved anonymity (as far as anyone
| believed it had been achieved).
| benreesman wrote:
| One click removed from an original source that is soaked in
| second-rate adtech crap.
|
| To my dying day I will regret being one of the architects of this
| insidious mechanism.
|
| The problem with trying to evade, or defeat, or even sidestep
| this stuff is that latent-rep embeddings break human intuition in
| their effectiveness.
|
| There was a time when the uniqueness of one's signature could
| move money.
|
| Those pen twitches are still there to see in what order you click
| on links.
| kurthr wrote:
| Just wait until they have your unconscious eye-twitches.
|
| Foveal rendering is required for adequate resolution/refresh of
| VR/AR due to the bandwidths and GPU calcs involved (e.g. 6k x 6k
| x RGB x 2 eyes x 120 Hz = ~200 Gb/s). Updating only the 20-30
| degrees around the eye's focus allows reducing this by >10x,
| which cuts power/weight and GPU cost dramatically.
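|
| A quick back-of-the-envelope check of that figure, assuming 8
| bits per colour channel and no compression:
|
|     # Uncompressed stream at the quoted parameters.
|     width, height = 6000, 6000    # "6k x 6k"
|     channels, bits = 3, 8         # RGB, 8 bits per channel
|     eyes, hz = 2, 120
|
|     bps = width * height * channels * bits * eyes * hz
|     print(bps / 1e9)              # ~207, i.e. the ~200 Gb/s above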
| jaytaylor wrote:
| Having a conscience is good, but don't be too hard on yourself
| old friend.
| sanxiyn wrote:
| Remember that 33 bits of entropy are enough to identify everyone.
| It may not be legally so, but any data with 33 bits of entropy is
| technically PII, and you should treat it as such.
| unixhero wrote:
| What do you mean here? I am asking because this is potentially
| useful.
| jodrellblank wrote:
| There are ~8,000,000,000 people in the world; that's a ten-
| digit number, so ten digits is the smallest size of number
| which could count out a unique number for everyone in the
| world - 9 digits don't have enough possible values. If the digit
| values are based on details about you, e.g. being in USA sets
| the second digit to 0/1/2, being in Canada and male sets it
| to 3, being in Canada and female sets it to 4, the last two
| digits are your height in inches, etc. etc. then you don't
| have to count out the numbers and give one to everyone, the
| ten digits become a kind of identifier of their own.
| 1,445,234,170 narrows down to (a woman in Canada 70 inches
| tall ... ) until it only matches one person. There are lots
| of people of the same height so perhaps it won't quite
| identify a single person, but it will be close. Maybe one or
| two more digits is enough to tiebreak and reduce it to one
| person.
|
| Almost anything will do as a tie-break between two people -
| married, wears glasses, keeps snakes, once visited
| whitehouse.gov, walked past a Bluetooth advertising beacon on
| Main Street San Francisco. Starting from 8 billion people and
| making some yes/no tiebreaks that split people into two
| groups, a divide and conquer approach, split the group in
| two, split in two again, cheerful/miserable, speaks Spanish
| yes/no, once owned a dog yes/no, once had a Google account
| yes/no, once took a photo at a wedding yes/no, ever had a
| tooth filling yes/no, moved house more than once in a year
| yes/no, ever broke a bone yes/no, has a Steam account yes/no,
| anything which divides people you will "eventually" winnow
| down from 8 billion to 1 person and have a set of tiebreaks
| with enough information in them to uniquely identify
| individual people.
|
| I say "eventually": if you can find tiebreaks that split the
| groups perfectly in half each time, then you only need 33 of
| them to reduce 8 billion down to 1. This is all another way of
| saying counting in binary: 101001011010100101101010010110101
| is a 33-bit binary number, it can be seen as 33 yes/no
| tiebreaks, and it's long enough to count up past 8 billion.
| It's 2^33, two possible values in each position, 33 times.
|
| That means any collection of data about people which gets to
| 33 bits of information about each person is getting close to
| being enough data to have a risk of uniquely identifying
| people. If you end up gathering how quickly someone clicks a
| cookie banner, that has some information hiding in it about
| how familiar they are with cookie banners and how physically
| able they are, that starts to divide people into groups. If
| you gather data about their web browser, that tells you what
| OS they run, what version, how up to date it is, those divide
| people into buckets. What time they opened your email with a
| marketing advert in it gives a clue to their timezone and
| work hours. Not very precise, but it only needs 33 bits
| before it approaches enough to start identifying individual
| people. Gather megabytes of this stuff about each person, and
| identities fall out - the person who searched X is the same
| person who stood for longer by the advert beacon and supports
| X political candidate and lives in this area and probably has
| an income in X range and ... can only be Joe Bloggs.
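|
| A toy way to watch the bits add up (illustrative bucket counts,
| not real figures, and real attributes are correlated so this
| overstates the precision a little):
|
|     import math
|
|     # Rough number of buckets each observed attribute splits
|     # the population into.
|     attributes = {
|         "country": 50,
|         "browser + version": 100,
|         "timezone / work hours": 20,
|         "operating system": 8,
|         "height band": 30,
|         "rough zip / postcode area": 1000,
|         "owns a dog": 2,
|         "speaks Spanish": 2,
|         "has a Steam account": 2,
|     }
|
|     total = 0.0
|     for name, buckets in attributes.items():
|         bits = math.log2(buckets)  # information gained, bits
|         total += bits
|         print(f"{name:26} ~{bits:4.1f} bits, running {total:5.1f}")
|
|     print(f"needed for 1 of 8 billion: {math.log2(8e9):.1f} bits")
|
| With those made-up numbers the running total passes the ~33
| bits needed well before the list runs out, which is the point:
| a handful of mundane attributes is already enough.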
| unixhero wrote:
| Jawdropping. So this is the "33 bits" I've heard people
| throw around. Thank you so much for elaborating in such a
| detailed and insightful way.
| konschubert wrote:
| That makes no sense, sorry.
|
| Ok, 2^33 > world population, but that doesn't mean that the
| string "Hello world" is PII.
| halhen wrote:
| That depends on the encoding, does it not? The binary
| sequence equal to ASCII "Hello world" might well be PII with
| many different encodings. By accident, of course, but
| nevertheless 33 bits of information would be enough.
| stillicidious wrote:
| 33 bits of entropy, not just 33 bits
| [deleted]
| AstralStorm wrote:
| Unless someone is actually called Hello World. Or perhaps
| Bobby Tables. ;)
| lb1lf wrote:
| -Anecdotally, at a former employer we had an annual questionnaire
| used to estimate how content we were.
|
| The results, we were assured, would only be used in aggregate
| after having been anonymized.
|
| I laughed quite hard when the results were back - the 'Engineer,
| female, 40-49yrs, $SITE' in the office next door wasn't as
| amused. All her responses had been printed in the report. Sample
| size: 1.
| sqrt17 wrote:
| At our (fairly large) company, you can query by team (and maybe
| job role) but it will hide responses where the sample size is
| smaller than a set number (I think 8 or 10).
|
| So yes it can be done but people have to actually care about
| it.
|
| The cautionary tale about k-anonymity (from Aaron Swartz's
| book, I think) is that the behavior of aggregates can also be
| something that should be kept private - the example was that
| the morning run at an army base in a foreign country was
| revealed because enough people did it with their smartwatches
| on that it formed a neat cluster.
| pfraze wrote:
| Isn't location data particularly easy to de-anonymize? I
| remember reading some research that because people tend to be
| so consistent with their location, you could deanonymize most
| people in a dataset with 3 random location samples throughout
| the day.
| teraku wrote:
| In Germany (I think all of the EU), a dataset can only be
| published if the sample size is at least 7.
| williamtrask wrote:
| Just fwiw sample size isn't a robust defence against this
| kind of attack. Check out Differential Privacy.
| ocdtrekkie wrote:
| More details on the Strava Run incident:
| https://www.bbc.com/news/technology-42853072
| AlexTWithBeard wrote:
| Can confirm.
|
| I used to have a team and at some point they all had to submit
| their feedback on my performance. The answers were then fed
| back to me unattributed, but it was pretty obvious who wrote
| what.
| rotten wrote:
| Never answer those honestly. More than half the time they
| aren't there to "help management understand how to do better",
| but rather to purge people who aren't happy. "We don't need
| employees who don't love it here."
| lb1lf wrote:
| -Oh, we were already deemed beyond redemption - we'd been a
| small company, quite successful in our narrow niche, only to
| be bought up by $MEGACORP.
|
| That was a culture clash. Big time.
|
| Nevertheless, I thought the same thing you did and filled in
| my questionnaire so that my answers created a nice,
| symmetrical pattern - it looked almost like a pattern for an
| arts&crafts project...
| vorpalhex wrote:
| I always say I am unhappy and would take another offer in a
| heartbeat, and so far that strategy has worked well for me - I
| usually get offered a good salary bump and bonus every year.
| Obviously YMMV.
| AstralStorm wrote:
| Even setting aside sample size, having these aggregated results
| makes it very easy to predict who picked what with a modicum of
| extra information. (Even a silly binned personality type.)
| float4 wrote:
| Two things I'd like to say here
|
| 1. All anonymisation algorithms (k-anonymity, l-divergence,
| t-closeness, e-differential privacy, (e,d)-differential privacy,
| etc.) have, as you can see, at least one parameter that states
| _to what degree_ the data has been anonymised. This parameter
| should not be kept secret, as it tells entities that are part of
| a dataset how well, and in what way, their privacy is being
| preserved. Take something like k-anonymity: the k tells you that
| every equivalence class in the dataset has a size >= k, i.e. for
| every entity in the dataset, there are at least k-1 other
| identical entities in the dataset. There are a lot of things
| wrong with k-anonymity, but at least it's transparent. Tech
| companies, however, just state in their Privacy Policies that
| "[they] care a lot about your privacy and will therefore
| anonymise your data", without specifying _how_ they do that.
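|
| A minimal sketch of what that k means in practice (toy data,
| hypothetical quasi-identifier columns): group the released
| table by its quasi-identifiers, and the smallest group size is
| the k the release achieves.
|
|     import pandas as pd
|
|     # Released table: quasi-identifiers already generalised
|     # (age binned, zip truncated), plus one sensitive column.
|     released = pd.DataFrame({
|         "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
|         "zip3": ["941", "941", "941", "100", "100"],
|         "diagnosis": ["asthma", "flu", "asthma", "flu", "copd"],
|     })
|     quasi_identifiers = ["age_band", "zip3"]
|
|     # k = size of the smallest equivalence class, i.e. the
|     # smallest group of rows sharing all quasi-identifiers.
|     k = released.groupby(quasi_identifiers).size().min()
|     print(f"this release is {k}-anonymous")  # 2-anonymous here
|
| The k only says every entity hides among at least k-1
| identical-looking rows; it says nothing about the sensitive
| column itself, which is the gap l-diversity and t-closeness
| try to fill.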
|
| 2. Sharing anonymised data with other organisations (this is
| called Privacy Preserving Data Publishing, or PPDP) is virtually
| always a bad idea if you care about privacy, because there is
| something called the privacy-utility tradeoff: you either have
| data with sufficient utility, or you have data with sufficient
| privacy preservation, but you can't have both. You either
| publish/share useless data, or you publish/share data that does
| not preserve privacy well. You can decide for yourself whether
| companies care more about privacy or utility.
|
| Luckily, there's an alternative to PPDP: Privacy Preserving Data
| Mining (PPDM). With PPDM, data analysts can submit statistical
| queries (queries that only return aggregate information) to the
| owner of the original, non-anonymised dataset. The owner will run
| the queries, and return the result to the data analyst.
| Obviously, one can still infer the full dataset as long as they
| submit a sufficient number of specific queries (this is called a
| Reconstruction Attack). That's why a privacy mechanism is
| introduced, e.g. epsilon-differential privacy. With
| e-differential privacy, you essentially _guarantee_ that no query
| result depends significantly on one specific entity. This makes
| reconstruction attacks impossible.
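|
| Roughly, the mechanism looks like this (a minimal sketch of the
| Laplace mechanism for a counting query, not any particular
| product's implementation):
|
|     import numpy as np
|
|     rng = np.random.default_rng()
|
|     def private_count(values, predicate, epsilon):
|         # A counting query has sensitivity 1 (adding or
|         # removing one person changes the true answer by at
|         # most 1), so Laplace noise with scale 1/epsilon gives
|         # epsilon-differential privacy.
|         true_count = sum(1 for v in values if predicate(v))
|         return true_count + rng.laplace(0.0, 1.0 / epsilon)
|
|     # Toy data held by the owner; the analyst only ever sees
|     # noisy aggregates, never the rows themselves.
|     incomes = [31_000, 54_000, 87_000, 120_000, 45_000, 99_000]
|     print(private_count(incomes, lambda x: x > 80_000, 0.5))
|
| Each extra query spends privacy budget (the epsilons add up
| under composition), which is what bounds how much a sequence of
| queries can reveal about any one entity.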
|
| The problem with PPDM is that you can't sell your high-utility
| "anonymised" datasets, which sucks if you're a big boi data
| broker.
| motohagiography wrote:
| Important concepts. The key thing that has changed in privacy
| in the last couple of years is that de-identified data has been
| made into a legal concept instead of a technical one, whereby
| you do a re-identification risk assessment (not a very mature
| methodology in place yet), figure out who is accountable in the
| event of a breach, label the data as de-identified, and include
| the recipients' obligation to protect it in the data sharing
| agreement.
|
| The effect on data sharing has been notable because nobody
| wants to hold risk, whereas previously "de-identification"
| schemes (and even encryption or data masking) made their risk
| and obligation evaporate by magically transforming the data
| from sensitive to less sensitive. Privacy Preserving Data
| Publishing is sympathetic magic from a technical perspective,
| as it just obfuscates the data ownership/custodianship and
| accountability.
|
| FHE is the only candidate technology I am aware of that meets
| this need, and DBAs, whose jobs are to manage these issues, are
| notoriously insufficiently skilled to produce even a
| synthesized test data set from a data model, let alone
| implement privacy preserving query schemes like differential
| privacy. What I learned from working on the issue with
| institutions was that nobody really cared about the data
| subjects; they cared about avoiding accountability, which seems
| natural,
| but only if you remove altruism and social responsibility
| altogether. You can't rely on managers to respect privacy as an
| abstract value or principle.
|
| Whether you have a technical or policy control is really at the
| crux of security vs. privacy, where as technologists we mostly
| have a cryptographic/information theoretic understanding of
| data and identification, but the privacy side is really about
| responsibilities around collection, use, disclosure, and
| retention. Privacy really is a legal concept, and you can kick
| the can down the road with security tools, but the reason
| someone wants to pay you for your privacy tool is that you are
| telling them you are taking on breach risk on their behalf by
| providing a tool. The people using privacy tools aren't using
| them because they preserve privacy, they use them because it's
| a magic feather that absolves them of responsibility. It's a
| different understanding of tools.
|
| However, it does imply a market opportunity for a crappy
| snakeoil freemium privacy product that says it implements the
| aforementioned techniques but barely does anything at all, and
| just allows organizations to say they are using it. Their
| problem isn't cryptographic; it just has to be sophisticated
| enough that non-technical managers can't be held accountable
| for reasoning about it, and they're using a tool so they are
| compliant. I wonder what the "whitebox cryptography" people are
| doing these days...
| [deleted]
| amelius wrote:
| Can advertisers be legally forced to use these mathematical
| techniques?
___________________________________________________________________
(page generated 2021-12-30 23:00 UTC)