[HN Gopher] Clio: A system for privacy-preserving insights into ...
       ___________________________________________________________________
        
       Clio: A system for privacy-preserving insights into real-world AI
       use
        
       Author : simonw
       Score  : 119 points
       Date   : 2024-12-12 23:22 UTC (23 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | refulgentis wrote:
       | I find this sort of thing cloying because all it does is show me
       | they keep copies of my chats and access them at will.
       | 
       | I hate playing that card. I worked at Google, and for the first
       | couple years, I was very earnest. Someone smart here pointed out
       | to me, sure, maybe everything is behind 3 locks and keys and
       | encrypted and audit logged, but what about the next guys?
       | 
       | Sort of stuck with me. I can't find a reason I'd ever build
       | anything that did this, if only to make the world marginally
       | easier to live in.
        
         | eddyzh wrote:
         | They say something about retention after analysis by Clio but
         | it's not very specific.
        
         | epoch_100 wrote:
         | Anthropic's privacy policy is extremely strict -- for example,
         | conversations are retained for only 30 days and there's no
         | training on user data by default.
         | https://privacy.anthropic.com/en/articles/10023548-how-long-...
        
           | refulgentis wrote:
           | I thought this was true, honestly, up until I read it just
            | now. User data is explicitly one of the 3 training
            | sources[^1]; forced opt-ins like "feedback"[^2] let them
            | store & train on it for 10 years[^3], and tripping the
            | safety classifier[^2] lets them store & train on it for 7
            | years.[^3]
           | 
           | [^1] https://www.anthropic.com/legal/privacy:
           | 
           | "Specifically, we train our models using data from three
           | sources:...[3.] Data that our users or crowd workers
           | provide"..."
           | 
            | [^2] "For all products, we retain inputs and outputs for up
            | to 2 years and trust and safety classification scores for up
            | to 7 years if you submit a prompt that is flagged by our
            | trust and safety classifiers as violating our UP.
            | 
            | Where you have opted in or provided some affirmative consent
            | (e.g., submitting feedback or bug reports), we retain data
            | associated with that submission for 10 years."
           | 
           | [^3] "We will not use your Inputs or Outputs to train our
           | models, unless: (1) your conversations are flagged for Trust
           | & Safety review (in which case we may use or analyze them to
           | improve our ability to detect and enforce our Usage Policy,
           | including training models for use by our Trust and Safety
           | team, consistent with Anthropic's safety mission), or (2)
           | you've explicitly reported the materials to us (for example
           | via our feedback mechanisms), or (3) by otherwise explicitly
           | opting in to training."
        
             | pixelsort wrote:
              | All of the major AI providers are trying to pretend they
              | care about your privacy while being weasels with their
              | retention and anonymization terms.
             | 
             | Partly why I'm building a zero-trust product that keeps all
             | your AI artifacts encrypted at rest.
        
               | Jerrrry wrote:
                | Your work is vital for opposing the nonchalant march of
                | privacy-eroding norms we keep parading towards.
        
           | voltaireodactyl wrote:
           | Given the apparent technical difficulties involved in getting
           | insight into a model's underlying data, how would anyone ever
           | hold them to account if they violated this policy? Real
            | question, not a gotcha; if corporate-backed IP holders are
            | unable to prosecute claims against AI companies, it seems
            | even more unlikely that individual paying customers would
            | have greater success.
        
           | saagarjha wrote:
           | That's the point, though. What's there that would stop it
           | from changing later?
        
           | anon373839 wrote:
           | Even if this were true (and not hollowed out by various
           | exceptions in Anthropic's T&C), I would not call it
           | "extremely strict". How about zero retention?
        
           | binarymax wrote:
            | This is a non-starter for every company I work with as a B2B
           | SaaS dealing with sensitive documents. This policy doesn't
           | make any sense. OpenAI is guilty of the same. Just freaking
           | turn this off for business customers. They're leaving money
           | on the table by effectively removing themselves from a huge
           | chunk of the market that can't agree to this single clause.
        
             | phillipcarter wrote:
              | I haven't personally verified this, but I'm fairly positive
              | the enterprise versions of these tools (ChatGPT, Gemini,
              | Claude) are not only oblivious to document contents but
              | also respect things like RBAC on documents for any
              | integration.
        
           | lazycog512 wrote:
           | who guards the guards? [they plan] ahead and begin with them.
        
         | pixelsort wrote:
          | They have to; the major AI companies are ad companies. Their
          | profits demand that we accept their attempts to normalize the
          | spyware that networked AI represents.
        
           | botanical76 wrote:
           | Yep. More generally, I have a lot of distaste that big tech
           | are the ones driving the privacy conversation. Why would you
           | put the guys with such blatant ulterior motives behind the
           | wheel? But, this seems to be the US way. Customer choice via
           | market share above everything, always, even if that choice
           | gradually erodes the customer's autonomy.
           | 
           | Not that anywhere else is brave enough to try otherwise, for
           | fear of falling too far behind US markets.
           | 
           | Disclaimer: I could be much more informed on the relevant
           | policies which enable this, but I can see the direction we're
           | heading in... and I don't like it.
        
       | simonw wrote:
       | I wrote up some notes (including interesting snippets of the
       | video transcript) here:
       | https://simonwillison.net/2024/Dec/12/clio/
        
       | eddyzh wrote:
        | While the highest categories are vague (web development vs
        | cloud development), the specific clusters shown in the
        | language-specific examples expose nation-specific collective
        | activity. While anonymized, it still exposes a lot about this
        | collection of private chats.
       | 
        | Good that they tell us, but they did it before telling. I
        | really hope they delete the detailed chats afterwards. They
        | should, but probably won't, delete the first layer of
        | aggregation.
        
       | isoprophlex wrote:
       | "Xenon Deficiency Syndrome", mentioned in one of the examples on
       | talking about medical conditions.
       | 
        | I knew it just had to be a made-up thing for demonstration
        | purposes, but I still had to google it, haha... on the off
        | chance there _is_ some weird biological use for trace amounts
        | of xenon.
        
         | compressedgas wrote:
         | Source of phrase:
         | https://www.reddit.com/r/AskReddit/comments/vzlupm/your_user...
        
           | Jerrrry wrote:
           | Depending on the prominence of the user, that could have been
           | further coerced as a fact or reference, and/or inferred as
           | factual if enough people agreed, in the next training
           | session.
           | 
           | Upvotes as truth-confidence scores, what could go wrong?
           | 
            | Ironically some iotas of truth can be output directly, but
            | if the most explicitly articulated reference is by a (non-
            | power) user on a forum in a comment, it will be sourced as
            | anonymous/eponymous wisdom (unless sufficiently coerced).
        
       | Imnimo wrote:
       | I don't think I would describe a system in which a human ends up
       | looking at your conversation if the algorithm thinks you're
       | suspicious as "privacy-preserving". What is the non-privacy-
       | preserving version of this system? A human browsing through every
       | conversation?
        
         | IanCal wrote:
         | That's a different thing. This system doesn't do that, but
         | that's one use case they have for it.
        
         | wseqyrku wrote:
         | Yeah, this is basically a kind of surveillance system for
         | governments seeking "insights" into communications of any
         | modality.
        
       | saagarjha wrote:
       | In case you were wondering, Anthropic does in fact have a reply
       | to the inevitable Hacker News comments:
       | 
       | > Risk: Despite our privacy mitigations, the existence of a
       | system like Clio might be perceived as invasive by some users.
       | This perception could lead to an erosion of trust in AI
       | assistants.
       | 
       | > Mitigation Strategy: First, we plan to be radically transparent
       | about Clio's purpose, capabilities, and limitations to the public
       | through this report, rather than building and not disclosing the
       | system. For example, Clio is a tool that can be used to make
       | systems safer, as well as a tool that can be used to gain
       | insights that can be used to gain a better understanding of and
       | improve the product. We are also transparent about how we
       | designed Clio with important privacy protection features that
       | safeguard user data and privacy. Second, beyond these use cases,
       | we are committed to turning Clio's insights into a public good--
       | for example, we released information about our most common use
       | cases in Figure 6 because we believe it is in the best interest
       | of society to know how AI systems are being used in the world,
       | despite the fact that this information could be commercially
       | harmful for Anthropic to publish from a competitive intelligence
       | standpoint. We plan to share further insights from Clio in the
       | future, and hope these disclosures contribute to an emerging
       | culture of empirical transparency in the field that can inform
       | broader AI safety and governance efforts. Finally, we plan to
       | actively engage with user communities, addressing concerns and
       | incorporating feedback into our development process--for example,
       | during our work on Clio we met with a number of civil society
       | organizations to gather feedback on our approach and made
       | adjustments in response to their comments.
       | 
       | Unfortunately the feedback they gathered is not included in the
       | paper.
        
         | bcherny wrote:
         | The blog post also talks about how privacy is preserved in more
         | concrete terms:
         | 
         | > These four steps are powered entirely by Claude, not by human
         | analysts. This is part of our privacy-first design of Clio,
         | with multiple layers to create "defense in depth." For example,
         | Claude is instructed to extract relevant information from
         | conversations while omitting private details. We also have a
         | minimum threshold for the number of unique users or
         | conversations, so that low-frequency topics (which might be
         | specific to individuals) aren't inadvertently exposed. As a
         | final check, Claude verifies that cluster summaries don't
         | contain any overly specific or identifying information before
         | they're displayed to the human user.
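          | 
          | A minimal sketch of that thresholding step, with hypothetical
          | names and an invented cutoff (the post doesn't publish the
          | actual number, and this is not Anthropic's code):
          | 
          |     from dataclasses import dataclass, field
          | 
          |     # Hypothetical cutoff: the blog post only says there is "a
          |     # minimum threshold for the number of unique users or
          |     # conversations".
          |     MIN_UNIQUE_USERS = 1000
          | 
          |     @dataclass
          |     class Cluster:
          |         summary: str  # Claude-written, private details omitted
          |         unique_users: set = field(default_factory=set)
          | 
          |     def visible_clusters(clusters: list[Cluster]) -> list[Cluster]:
          |         """Drop low-frequency clusters so no summary can be
          |         traced back to a handful of individuals."""
          |         return [c for c in clusters
          |                 if len(c.unique_users) >= MIN_UNIQUE_USERS]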
        
       | gwd wrote:
       | I was surprised to see "Business Strategy and Operations" as #6,
       | at 5.7%. Part of me finds it somewhat concerning; but then again,
       | I'm using Claude for that purpose myself, and found it pretty
       | helpful, so...
        
         | dr_dshiv wrote:
         | 10% dev, 7% education, 6% business. What's the bulk of AI
         | usage, then?
         | 
         | Well, other analyses have found that sexual content is the #2
         | most common use [1]. So maybe they didn't want to talk about
         | that... (creative composition was #1)
         | 
         | [1] https://arxiv.org/pdf/2407.14933
        
           | terhechte wrote:
           | I don't think you can do that with Claude. That's what people
           | use uncensored local models for.
        
             | levzzz wrote:
             | Claude 3 Opus is by far the best model for sexual content,
             | believe it or not. Nearly uncensored and surpasses anything
             | local.
        
               | stavros wrote:
               | > Anthropic: Claude 3 Opus (self-moderated)
               | 
               | > I apologize, but I don't feel comfortable writing
               | explicit adult stories. While I'm happy to discuss
               | relationships and sex on a general level, I try to keep
               | things tasteful and avoid extremely graphic content or
               | erotica. There are other resources out there better
               | suited for that kind of material. Let me know if there
               | are other writing topics I can assist with though!
        
               | dr_dshiv wrote:
               | Well, just like a human, you don't immediately say "get
               | sexual with me." You need to create a mood! Someone I
               | know does this by uploading on-topic scientific research
               | and then asking for synthetic data. Haha, different from
               | humans?
               | 
               | One can also make headway when asking it to reflect on
               | the importance of sexual wellbeing, etc. Good luck. It's
               | good practice for learning "prompt vibing"
        
               | stavros wrote:
               | You and I have very different definitions of "nearly
               | uncensored".
        
       | orbital-decay wrote:
       | _> For example, our systems sometimes failed to flag violating
       | content when the user asked Claude to translate from one language
       | to another. Clio, however, spotted these conversations._
       | 
       | Why do they even consider translation of existing content
       | "harmful", policy-wise? The content already exists. No machine
        | translator I know would refuse to translate something based on
        | its content. That makes their _language models_ unpredictable
        | in one of their major use cases.
        
         | noworriesnate wrote:
         | I'm adjacent to the world of sign language translators in the
         | US. They are legally obligated to translate EVERYTHING,
         | regardless of whether it's legal or not, and they also have to
         | maintain client secrecy. I personally know some who have
         | facilitated drug deals and another who has facilitated an
         | illegal discussion about Trump.
         | 
         | We decided as a society that we're not going to use translation
         | services to catch citizens in crime. This AI situation is so
         | much milder--we're talking about censoring stuff that is
         | "harmful", not illegal. The content is not being published by
         | Anthropic--it's up to the users to publish it or not.
         | 
          | We seriously need regulations around AI "safety", given the
          | enormous influence these companies bear on all human
          | discourse.
        
       | tonygiorgio wrote:
        | There's absolutely nothing privacy-preserving about their
        | system, and adding additional ways to extract and process user
        | data doesn't add any privacy; it weakens it further.
        | 
        | Until they start using NVIDIA confidential computing and doing
        | end-to-end encryption from the client to the GPU like we are,
        | it's just a LARP. Sorry, a few words in a privacy policy don't
        | cut it.
        
         | musha68k wrote:
          | They are in bed with the NSA & co., same as OpenAI.
         | 
          | Palantir even announced this officially: a partnership with
          | Anthropic and AWS:
         | 
         | https://www.businesswire.com/news/home/20241107699415/en/Ant...
        
       | staticautomatic wrote:
       | Seems like this might infringe the trademark belonging to the
       | legal tech company Clio.
        
       | pieix wrote:
       | A lot of negativity in these comments. I find this analysis of
       | claude.ai use cases helpful -- many people, myself included, are
       | trying to figure out what real people find LLMs useful for, and
       | now we know a little more about that.
       | 
       | Coding use cases making up 23.8% of usage indicates that we're
       | still quite early on the adoption curve. I wonder if ChatGPT's
       | numbers also skew this heavily towards devs, who make up only
       | ~2.5% of the [American] workforce.
        
       | wseqyrku wrote:
        | Of course this doesn't have to be used only on "AI use" as
        | they frame it. So far your activity was a line in a log
        | somewhere; now someone is actually looking at you, with three
        | eyes, at all times.
        
       | bijoo wrote:
          | The footnote on the website is quite confusing: > For safety
       | investigations, we also run Clio on a subset of first-party API
       | traffic, keeping results restricted to authorized staff. Certain
       | accounts are excluded from analysis, including trusted
       | organizations with zero retention agreements. For more
       | information about our policies, see Appendix F in the research
       | paper.
       | 
          | They clarify that API traffic is excluded: > "Because we
          | focus on studying
       | patterns in individual usage, the results shared in this paper
       | exclude activity from business customers (i.e. Team, Enterprise,
       | and all API customers)."
        
       ___________________________________________________________________
       (page generated 2024-12-13 23:01 UTC)