[HN Gopher] A Face Is Exposed for AOL Searcher No. 4417749 (2006)
       ___________________________________________________________________
        
       A Face Is Exposed for AOL Searcher No. 4417749 (2006)
        
       Author : acqbu
       Score  : 108 points
       Date   : 2024-02-25 09:22 UTC (13 hours ago)
        
 (HTM) web link (www.nytimes.com)
 (TXT) w3m dump (www.nytimes.com)
        
       | acqbu wrote:
       | https://archive.is/sfAMv
        
         | kleiba wrote:
          | Couldn't this be automated somehow? Every time there's a
          | link on HN to a paywalled article, I have to do the same
          | dance:
         | 
         | 1. Click on the link
         | 
         | 2. Find out it's behind a paywall
         | 
         | 3. Go back in the browser
         | 
         | 4. Click on the "comments" link.
         | 
         | 5. Look for the post that has the archive.is version of it.
         | 
         | 6. Click on that.
         | 
         | Surely that could somehow be collapsed into just a single
         | click?
        
           | hoherd wrote:
            | Open the current page in archive.li:
            | 
            |     javascript:(function(){ window.location = "https://archive.li/" + window.location; })()
        
           | rsaarelm wrote:
            | After step 2, you can go to the browser URL field, type
            | "archive.is/" in front of the URL, and press enter. It'll
            | either redirect you to an existing archive page or let you
            | create one if none exists.
        
             | Zambyte wrote:
              | For some reason it isn't loading for me, but if you use
              | a search engine that supports bangs in your URL bar (DDG
              | or Kagi), you can prefix the URL with !ais and just
              | search that. Same with !wbm or !ia for the Wayback
              | Machine.
        
             | extraduder_ire wrote:
             | I get good use out of this browser extension:
             | https://github.com/arantius/resurrect-pages
             | 
             | Sites are usually archived already.
        
           | danjc wrote:
            | It can't be a first-party feature.
        
           | rsaarelm wrote:
            | Here's how this could be done as an HN-side feature with
            | zero interaction with archive.is servers:
           | 
           | * Compile a list of domains like nytimes.com that have soft
           | paywalls.
           | 
           | * When a link like https://example.com/ is submitted and its
           | domain is on the paywall list, insert
           | [archive](https://archive.is/timegate/https://example.com/)
           | after it in the title area. Just prefix the timegate part and
           | it's a working link.
        
           | c22 wrote:
           | I usually just start with the comments. If I see an archive
           | link I'll use that (assuming I've determined that the source
           | article is worth reading at all).
        
       | DicIfTEx wrote:
        | There was also a theatre production based on AOL User 927:
       | https://arstechnica.com/uncategorized/2008/05/uare-what-u-se...
       | 
       | And a documentary series about User 711391:
       | https://www.imdb.com/title/tt1455044/
        
       | samwillis wrote:
        | This is important to look back on in the context of what's
        | happening now with AI tools. This story is obviously about
        | the public leak of the data, but what it really shows is the
        | profiling capability available to corporations.
       | 
       | Search has exposed so much data about ourselves to the services
       | we use with very little regulation on what they are permitted to
       | do with it inside their own walls.
       | 
        | My fear with AI is that we are moving toward sending even
        | more data to third-party services. Tools such as Copilot
        | (which I enjoy using) are a gold mine for behavioural
        | analysis. The profiling that will be possible with these
        | tools is extraordinary, and we don't yet fully understand the
        | implications.
       | 
       | It's because of this that I'm a massive proponent of "Local AI".
       | We need to be pushing for the industry to adopt a local inference
       | architecture asap. It needs to become the standard pattern as
       | early as possible to reduce the risk of the AI revolution being a
       | repeat of the invasive internet search and advertising industry.
        
         | Spooky23 wrote:
         | The issue with AI is that people are going to be creating work
         | product with it, and it doesn't require the extensive
         | infrastructure that search has.
         | 
          | Google knows a lot about your behavior - they can and have
          | been able to correlate online behavior with health and
          | meatspace actions to identify budding extremists or people
          | at risk of addiction, etc. AI will bring that capability to
          | business processes.
         | 
          | With the number of little companies that are springing up,
          | it will become much easier for outside parties to figure
          | out how institutions work. This capability exists, but it's
          | gated by Google and Microsoft, and they have drawn lines to
          | protect the overall business. Some jackass will install a
          | creepy AI tool to scrape Outlook, and salespeople will be
          | able to get a profile of who makes what decisions in a
          | company, for example.
        
           | dartos wrote:
           | AI does take a ton of infrastructure. You need data
           | collection and curation. Massive amounts of training
           | hardware.
           | 
           | And a large infrastructure to ensure you can scale. No easy
           | feat with the current GPU stack.
        
             | hunter2_ wrote:
             | Are there not relatively tiny workloads that fall under the
             | umbrella of AI? Or does the term itself inherently refer to
             | intense workloads, like how the phrase "big data" refers to
             | sets that can't be processed by typical means?
        
               | tomoyoirl wrote:
               | Training AIs is almost always done with massive data, as
               | small data usually doesn't have sufficient information
               | content, statistically speaking, to build a good model
               | from it. Certainly this is the case for the generative
               | models.
        
               | dartos wrote:
               | I think AI refers to large models.
               | 
                | Smaller probabilistic models like linear regression
                | I'd call machine learning.
        
             | Spooky23 wrote:
             | It takes a ton of infrastructure to maintain a sustained
             | misinformation campaign.
             | 
             | Yet cloud providers happily sell resources and APIs to
             | unethical companies. They rightfully don't insert
             | themselves into most legal business matters, with
             | exceptions.
        
           | paulmd wrote:
            | > Google knows a lot about your behavior - they can and have
           | been able to correlate online behavior with health and
           | meatspace actions to identify budding extremists or people at
           | risk of addiction, etc. AI will bring that capability to
           | business processes.
           | 
           | not in the EU it won't.
           | 
           | if you can de-anonymize people from the data it's not
           | anonymous, and collecting this data at all would be illegal
           | in the EU without user consent, unless it's being used solely
           | for the purpose of delivering the service.
        
         | whoisthemachine wrote:
          | I fully agree. If local AI gains traction, then we actually
          | have a unique opportunity to take away some of that massive
          | profiling, since some of what you use search engines for
          | today can be done by AI. This may be why there's some fear-
          | mongering around truly open-source AI from the big players.
        
         | akira2501 wrote:
         | Noise generation is the only answer the common person has
         | against the giant corporate machine.
         | 
         | You need a personal "AI" that just does random searches
         | unconnected with your life, constantly, in the background, and
         | then injects this data into all the portals that are watching
         | you.
         | 
          | Ultimately, their data will become dominated by noise and
          | therefore useless, to the point of severely undermining the
          | value of the entire data collection enterprise in the first
          | place.
         | 
         | No matter how many tools you make "local only" you're only a
         | forgotten "send telemetry back to the mothership" checkbox away
         | from being right where you started.
        
           | cj wrote:
           | This might work at an individual level, but it isn't scalable
           | to the general population.
           | 
           | It's hard to imagine any population-scale solution that
           | doesn't involve regulation.
           | 
            | The biggest problem with regulation (in my eyes) is that
            | it puts the regulating country at a disadvantage in the
            | competition between countries. E.g. if the US imposes
            | restrictions on technology, innovation is incentivized to
            | happen elsewhere. The EU has been bold on the privacy
            | regulatory front with GDPR and the like, and has probably
            | lost out on immeasurable monetary gains as a result.
            | There's a huge cost to regulation, but it works.
        
       | jll29 wrote:
        | That episode (releasing the AOL search query log file for
        | research purposes, and the subsequent aftermath) led to some
        | firings at the company, but some information retrieval
        | researchers used this log to conduct important experiments.
        | 
        | The "60-something lady with the dog that kept peeing on her
        | sofa" got her hour of fame, and the whole thing became a case
        | study in de-anonymization.
       | 
       | A few pointers:
       | 
       | https://en.wikipedia.org/wiki/AOL_search_log_release
       | 
       | https://www.researchgate.net/publication/233390862_Privacy_P...
       | 
       | https://github.com/wasiahmad/aol_query_log_analysis
       | 
       | https://www.technologyreview.com/2006/08/15/100592/who-benef...
       | 
       | https://www.sciencedirect.com/science/article/abs/pii/S00200...
       | 
       | https://isquared.wordpress.com/2014/04/24/mining-search-logs...
        
       | elzbardico wrote:
       | 2006... If they only knew....
        
         | rvnx wrote:
         | Now people create accounts to get their searches directly tied
         | to their profile
        
           | HeatrayEnjoyer wrote:
            | Like that new search engine that's better than Google?
        
             | alwa wrote:
             | You mean Kagi? They're pretty transparent about how they
             | approach query data, and it's as privacy-friendly as any
             | I've seen.
             | 
             | https://kagi.com/privacy
        
               | warkdarrior wrote:
               | > Kagi [...] [is] as privacy-friendly as any I've seen
               | 
               | For now
        
               | rvnx wrote:
               | No, I wasn't referring to Kagi (which is done by a hard-
               | working guy btw), just in general to the trend that the
               | internet is now completely different if seen from a
               | perspective of 2006.
               | 
               | Storage was expensive, and data wasn't seen as a goldmine
               | as now, so most long-term logs went to /dev/null.
               | 
               | That the normality now is to ask users to create an
               | account, have data-scientists (whose goal is precisely to
               | find needles in haystacks), etc.
        
             | piperswe wrote:
              | Kagi makes it very clear in their privacy policy and in
              | their settings that search queries are not saved. If
              | they saved them regardless, that would be a clear-cut
              | violation of the law.
              | 
              | From their settings page:
              | 
              |     > Save My Search History
              |     > Currently this option can not be turned on. Kagi
              |     > does not save any searches by default. In the
              |     > future we may add features that will utilize your
              |     > search history and then we will allow you to
              |     > enable this.
              | 
              | It sure seems like it will always be opt-in, even if
              | they add query saving in the future.
        
             | rvnx wrote:
             | On tons of search engines and AI services, you are nudged
             | or required to have an account (which forces you to de-
             | anonymize yourself and/or link your activity to a specific
             | account).
             | 
              | Almost 20 years ago, when this leak happened, the
              | situation wasn't like that.
              | 
              | Gmail was barely born, so Google accounts didn't make
              | much sense yet.
              | 
              | The article was like "wow, we managed to de-anonymize a
              | search query", but that's actually the norm now.
              | 
              | Essentially, this scandalous AOL leak became a
              | legitimized everyday routine (sadly).
             | 
              | Today, when you type anything into a ChatGPT-like AI
              | app (and this is also the case for many search
              | engines), tons of contractors, workers and partners
              | have access to that dataset: researchers, engineers,
              | support, advertising platforms, technical
              | intermediaries, legal, etc.
             | 
              | Though the future isn't all gloomy: in the short term,
              | with the advent of LLMs, we may actually see a really
              | good solution privacy-wise: fully local answers.
              | 
              | Which means that, for the first time, queries and
              | questions may not leave the device or be sent to
              | whoever you would otherwise need to trust.
        
       | karaterobot wrote:
       | Re-identification of supposedly anonymous data was a problem
       | twenty years ago, and is a bigger problem today. Soon it may
       | become a crisis, as the tools needed to do it become more and
       | more turnkey, effective, and commodified. Now we dox as in a
       | glass, darkly, etc.
        
       | neonate wrote:
       | https://web.archive.org/web/20170715075814/https://www.nytim...
        
       ___________________________________________________________________
       (page generated 2024-02-25 23:01 UTC)