[HN Gopher] Show HN: Read Wikipedia privately using homomorphic ...
       ___________________________________________________________________
        
       Show HN: Read Wikipedia privately using homomorphic encryption
        
       Author : blintz
       Score  : 193 points
       Date   : 2022-06-08 15:29 UTC (7 hours ago)
        
 (HTM) web link (spiralwiki.com)
 (TXT) w3m dump (spiralwiki.com)
        
       | raxxorraxor wrote:
       | Does homophobic in this case mean that I can edit the content of
       | an article and the diff is directly applied to the crypt?
        
         | teraflop wrote:
         | This has nothing to do with editing Wikipedia. The problem this
         | demo is solving is "private information retrieval" -- that is,
         | you send a query for a particular article, and the server sends
         | back a response containing that data, but the server does not
         | learn which article you asked for. Homomorphic encryption is
         | just one of the building blocks used to achieve this.
         | 
         | A trivial solution to this problem would be for the client to
         | just download a copy of the entire Wikipedia database and pick
         | out the particular article it's interested in. With a proper
         | implementation of PIR, the server still needs to scan through
         | the entire encrypted dataset (this is unavoidable, otherwise
         | its I/O patterns would leak information) but the amount of
         | information that needs to be sent to the client is much
         | smaller.
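To make that contrast concrete, here is a toy sketch (Python, unrelated to the demo's actual Rust code) of the classic two-server XOR PIR scheme. It is a different construction from the single-server homomorphic one in the demo, but it cleanly shows the property described above: each server scans the entire database to answer, yet each server's view of the query is a uniformly random bit-mask.

```python
import secrets

# Toy two-server XOR PIR. Both servers hold the same database of
# equal-length records and scan ALL of it per query; each server alone
# sees only a uniformly random bit-mask, so neither learns the index.
DB = [b"article-0", b"article-1", b"article-2", b"article-3"]

def make_queries(index, n):
    mask_a = [secrets.randbelow(2) for _ in range(n)]
    mask_b = mask_a.copy()
    mask_b[index] ^= 1          # masks differ only at the wanted index
    return mask_a, mask_b       # send one mask to each server

def answer(db, mask):
    out = bytes(len(db[0]))
    for rec, bit in zip(db, mask):
        if bit:                 # XOR in every selected record
            out = bytes(x ^ y for x, y in zip(out, rec))
    return out

a, b = make_queries(2, len(DB))
# Client XORs the two answers; everything cancels except record 2.
rec = bytes(x ^ y for x, y in zip(answer(DB, a), answer(DB, b)))
assert rec == b"article-2"
```

The answers are small (one record each), so the client downloads far less than the whole database, at the cost of the servers scanning everything.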
        
           | calvinmorrison wrote:
            | Unfortunately, it's not that trivial. I tried a few offline
            | wikis with varying success, but haven't had much luck using
            | them day to day.
        
             | teraflop wrote:
             | Oh, certainly, I didn't mean to imply that there would be
             | no implementation difficulties -- just that if you're
             | willing to download the entire dataset, the _privacy_
             | aspect is trivially solved, without requiring any fancy
             | cryptography.
        
         | ReaLNero wrote:
         | Quite the curious typo you have there.
        
           | jesushax wrote:
           | I wouldn't be surprised if they originally wrote it correctly
           | and a phone's autocorrect "fixed" it for them
        
       | iFire wrote:
       | I wonder if this can be done on sqlite?
       | 
       | http://static.wiki/
       | 
       | See the previous news article.
       | https://news.ycombinator.com/item?id=28012829
        
         | lucb1e wrote:
         | You mean distributing 43GB of data to everyone rather than
         | uploading a few megabytes of data one time and getting 250KB
         | answer chunks? I'm not sure it compares.
        
         | rakoo wrote:
         | If you're going the sqlite route, the much easier path is to
         | distribute the whole database and do everything on the client
        
       | 0cVlTeIATBs wrote:
       | Could this be used for DNS?
        
         | freemint wrote:
          | Yes, but it would be quite expensive infrastructure-wise.
          | Also, I think the number of websites grows faster than
          | processor I/O, meaning one would need more and more processors
          | over time.
        
         | blintz wrote:
         | This is a great idea, and we think it would be relatively
         | practical assuming some aggressive caching. However, I couldn't
          | think of a threat model where this is useful, since presumably
          | your ISP can always see which sites you visit anyway by simply
          | reverse-resolving the IPs you connect to.
         | 
         | Do you think that people would want private DNS? I suppose it
          | would still be an improvement over what we have today, but
         | I'm not sure that it will make it meaningfully harder for ISPs
         | to collect and sell data to advertisers.
        
           | gojomo wrote:
           | There are ways to hide IP endpoints from relaying nodes, as
           | well.
        
           | 0cVlTeIATBs wrote:
           | On threat models, a malicious DNS server might also be one
           | compromised by a party demanding wiretap access.
           | 
           | Regardless, a person today has a choice of which DNS server
           | to use but they all could track the requests made. Tracking
           | site visits via IP is a different link in that chain.
           | 
            | Would people pay? I don't know, but I could see it being a
            | feature used to differentiate a VPN service from its
            | competitors.
        
             | blintz wrote:
             | That's a good point, I could see this being a
             | differentiating feature for a VPN provider. The only way to
             | know if people would pay is for someone to offer it, I
             | guess...
        
           | captn3m0 wrote:
            | OCSP would be a good target in a similar space:
            | https://en.m.wikipedia.org/wiki/Online_Certificate_Status_Pr...
        
             | blintz wrote:
             | I do think OCSP and certificate transparency stuff is a
             | clear application for PIR that makes a lot of sense. There,
             | you currently basically have to volunteer your browsing
             | patterns to a 3rd party (beyond your ISP).
        
               | captn3m0 wrote:
                | For SCT audits on the browser, see
                | https://www.lightest.eu/static/papers/10-Revisiting%20User%2...
               | 
                | Google's CT design docs also state PIR audits as the
                | long-term goal, so this would be a good area to focus
                | on. https://docs.google.com/document/d/1G1Jy8LJgSqJ-B673GnTYIG4b...
        
       | barbazoo wrote:
       | Can anyone recommend an explanation of this concept geared
       | towards people with only a superficial knowledge of encryption?
       | 
       | This seems to be some kind of search applied on an encrypted
       | dataset, is that right?
        
         | SilasX wrote:
         | For homomorphic encryption in general I wrote this blog post to
         | make the idea accessible:
         | 
         | http://blog.tyrannyofthemouse.com/2013/05/i-added-your-numbe...
        
           | Engineering-MD wrote:
           | I find it deeply ironic that the page is http not https!
        
         | extr wrote:
         | It's like, I send the server an encrypted math problem. The
         | server has no idea what the math problem is, but homomorphic
         | encryption allows it to compute an (encrypted) result and send
         | that back to me. I get the result and decrypt it for the
         | correct answer. It's novel because you don't have to trust the
         | server with your math problems.
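A minimal numeric version of that analogy (Python; this is just one-time additive masking modulo a prime, not a real homomorphic scheme - the keys here can never be reused, and the demo's lattice-based construction is far more involved - but it makes "compute on a ciphertext" concrete):

```python
import secrets

P = 2**61 - 1  # toy modulus, chosen only for this sketch

def encrypt(m, key):
    return (m + key) % P        # ciphertext looks uniformly random

def decrypt(c, key):
    return (c - key) % P

# Client encrypts two numbers under fresh one-time keys.
k1, k2 = secrets.randbelow(P), secrets.randbelow(P)
c1, c2 = encrypt(20, k1), encrypt(22, k2)

# Server adds the ciphertexts without learning 20, 22, or their sum.
c_sum = (c1 + c2) % P

# Client decrypts with the sum of the keys and recovers 20 + 22.
total = decrypt(c_sum, (k1 + k2) % P)
assert total == 42
```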
        
       | gojomo wrote:
       | Interesting! But, it'd be helpful to clarify further the strength
       | of the following claim:
       | 
       |  _> This demo allows private access to 6 GB (~30%) of English
       | Wikipedia. In theory, even if the server is malicious, it will be
       | unable to learn which articles you request. All article title
       | searches are performed locally, and no images are available._
       | 
       | In this demo, the number of article-titles is relatively small -
       | a few million - & enumerable.
       | 
       | If the server is truly malicious, and it issues _itself_ requests
       | for every known title, does it remain true that this  "Private
       | Information Retrieval" (PIR) scheme still gives it _no_ hints
       | that subsequent requests from others for individual articles
       | retrieve particular data?
       | 
       | (Presumably: _every_ request touches every byte of the same full
       | 6GB of data, and involves every such byte in constant-run-time
       | calculations that vary per request, and thus have the effect of
       | returning only what each request wanted - but not at all in any
       | way correlatable with other requests for the exact same article,
       | from the same or different clients?)
        
         | blintz wrote:
         | Indeed! The encryptions of the client queries are fully
         | semantically secure - under relatively solid lattice-based
         | cryptographic assumptions, a server would need to do more than
         | 2^128 work to recover the plaintext of someone's query. One
         | query for one item in the database is indistinguishable
         | (without the client's key) from another query for the same item
         | later; in other words, it's similar to something like the
         | guarantee of CBC or GCM modes, where as long as you use it
         | correctly, it is secure even if the attacker can see many
         | encryptions of its choosing.
        
           | danuker wrote:
           | Would this be vulnerable to a side-channel attack as follows?
           | 
           | 1. Record what item was retrieved from disk for a query
           | 
           | 2. Run a dictionary through the query system, and see which
           | item matches the record
        
             | [deleted]
        
             | [deleted]
        
             | blintz wrote:
             | The server processing has to scan the entire database to
             | answer a query- so the entire database stays in memory, and
             | access patterns tell you nothing about the query.
        
               | gojomo wrote:
               | Notably to some who may misinterpret the "stays in
               | memory" aspect: even the copy in memory is fully scanned
               | by every query. So it's not just _disk_ access patterns
               | that don 't give anything away, but _RAM_ access
               | patterns.
        
           | benlivengood wrote:
           | How indistinguishable is a Not Found result by the server? It
           | seems like user behavior would be to re-request the article a
           | second time, so the client should probably protect the user
           | against this kind of server (which could bisect on article
           | popularity to find the requested article in ~20 tries) by
           | throwing up a warning about an article in the index not being
           | retrievable.
        
             | blintz wrote:
             | Indeed, client behavior when an article fails to be
             | retrieved is very important. A malicious server could guess
             | what article is being retrieved, remove it from the
             | database, and see if the client 'reacts' in some observable
             | way.
             | 
             | The mitigations for this are somewhat heuristic; the client
             | should perhaps 'pace'/'debounce' subsequent requests when a
             | retrieval fails. For example, enforce that the re-retrieval
             | will happen at a uniformly random time between 1-5 seconds
             | later.
             | 
             | It's good to note that this kind of attack is somewhat
             | unavoidable, since denial-of-service is (generally) always
             | possible. An interesting mitigation might be for the server
             | to produce a ZKP that the full dot product was computed
             | correctly, and have the client always check the ZKP before
              | displaying the article. This is mostly a theoretical fix
              | for now, I believe, since ZKPs over homomorphic
              | computation are in the realm of theory more than practice.
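The pacing idea above can be sketched as follows (illustrative Python, not code from the demo; the 1-5 second uniform window comes from the comment, while `fetch`, the function name, and the retry structure are assumptions for the sketch):

```python
import random
import time

def fetch_with_paced_retry(fetch, max_retries=1, sleep=time.sleep):
    """Retry a failed retrieval after a uniformly random 1-5 s delay,
    so retry timing reveals nothing about which article was requested."""
    for attempt in range(max_retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries:
            sleep(random.uniform(1.0, 5.0))  # the 'pace'/'debounce' step
    return None
```

A client would pass its PIR query function as `fetch`; the random delay only kicks in when a retrieval fails, which is exactly the case a malicious server is probing.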
        
               | benlivengood wrote:
               | If the protocol is that the server returns a specific
               | "index N deleted" result for deleted N then the client
               | can at least trust a valid response from the server as
               | opposed to a DDoS or unmasking attempt. Any server that
               | returns something other than a valid document or "N
               | deleted" should no longer be trusted or communicated
               | with, but retries for communication failures or timeouts
               | should still be safe.
               | 
               | Edit: this assumes that the client gets a trusted index
               | from a set of trusted servers who are at least as up to
               | date as the latest index that is made available, which
               | timestamped signatures or similar should suffice for. Or
               | the client can supply the timestamped index signature and
               | the server can reply with a different message if it's too
               | old.
        
           | throwaway2016a wrote:
           | > without the client's key
           | 
            | Thank you, these 4 words really helped with my
            | understanding, so I'm calling it out in case it helps
            | others. So I was
           | thinking, what prevents you from replaying the query and
           | getting the same page back? But it seems the answer is: that
           | would only produce a gibberish response because you don't
           | have the key.
        
       | mihaitodor wrote:
       | Last year, there was a detailed presentation with several
       | speakers on state of the art Secure Multi-Party Computation for
       | practical applications in healthcare, fighting financial crime
       | and machine learning from CWI (Centrum Wiskunde & Informatica)
        | Netherlands. The recording is here (2.5h):
       | https://www.youtube.com/watch?v=gE7-S1sEf6Q
        
       | f38zf5vdt wrote:
       | Extremely cool. Now we can serve content without any ability to
       | observe what people are being served exactly. I was hoping that
       | someday soon such technology could be used to serve search
       | results and give us a _truly_ private search engine experience.
        
       | blintz wrote:
       | Hi, creator here.
       | 
       | This is a demo of our recent work presented at Oakland (IEEE
       | S&P): https://eprint.iacr.org/2022/368. The server and client
       | code are written in Rust and available here:
       | https://github.com/menonsamir/spiral-rs. The general aim of our
       | work is to show that homomorphic encryption is practical today
       | for real-world applications. The server we use to serve this
       | costs $35/month!
       | 
       | A quick overview: the client uses homomorphic encryption to
       | encrypt the article number that they would like to retrieve. The
       | server processes the query and produces an encrypted result
       | containing the desired article, and sends this back to the
       | client, who can decrypt and obtain the article. A malicious
       | server is unable to determine which article the client retrieved.
        | All search and autocomplete is done locally. The technical
       | details are in the paper, but the high level summary is that the
       | client creates a large one-hot vector of encrypted bits (0's
       | except for the index of the desired article, where they place a
       | 1) and then the server computes something like a 'homomorphic dot
       | product' between the query and the plaintext articles.
       | 
       | I'd like to caveat that this is an in-browser demo to show it is
       | practical to use homomorphic encryption at this scale. As a real
       | product, you'd probably want to distribute a signed client
       | executable (or Electron app) since otherwise, a malicious server
       | could simply deliver bad client JS on the fly.
       | 
       | Happy to answer any questions!
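The one-hot selection described above can be simulated in the clear (Python sketch; in the real protocol the query vector is encrypted and the dot product is computed homomorphically, so the server never sees where the 1 is):

```python
# Plaintext simulation of the selection step: a one-hot vector dotted
# with the database returns exactly one record, and computing it
# necessarily touches every record.
articles = [b"anarchism", b"wikipedia", b"goatherds"]  # fixed-size records

def make_query(index, n):
    return [1 if i == index else 0 for i in range(n)]

def dot_product_select(query, db):
    out = [0] * len(db[0])
    for bit, rec in zip(query, db):          # scans the whole database
        for j, byte in enumerate(rec):
            out[j] += bit * byte
    return bytes(out)

q = make_query(1, len(articles))
assert dot_product_select(q, articles) == b"wikipedia"
```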
        
         | Labo333 wrote:
         | I understand that you do some kind of dot product (with two
          | steps, Regev and GSW). However, it looks to me like those
          | steps involve fixed-dimension vectors.
         | 
         | - How do you handle variable length data? Do you need to pad
         | it?
         | 
         | - What is the memory overhead of the storage of encrypted data?
         | 
         | I think that at least for video data, the streaming scheme
         | "leaks" the size of the encrypted data with the number of
         | streaming packets.
        
           | blintz wrote:
           | Yeah, every record needs to be the same size. For the demo,
           | we batch the articles into 100KB chunks. We do a good job
           | packing, where we put many small articles into a 100KB chunk,
           | and we split large articles into many 100KB chunks. This
           | packing is pretty efficient, roughly 90% of space is not
           | wasted.
           | 
           | The memory overhead is significant but not prohibitive... we
           | have to keep something like a 5-8x larger encoded database in
           | memory, but this overhead is from encoding the plaintext in a
           | special format to allow a fast dot product, not from
           | inefficiency in the packing.
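The packing described above might look roughly like this greedy sketch (Python; illustrative only - the actual spiral-rs layout, and the index recording which chunk each article landed in, are not shown):

```python
CHUNK = 100_000  # every record in the PIR database is the same size

def pack(articles):
    """Greedily pack small articles together and split large ones so
    that every output chunk is exactly CHUNK bytes (zero-padded)."""
    chunks, buf = [], b""
    for art in articles:
        if len(art) >= CHUNK:
            if buf:                          # flush the partial chunk
                chunks.append(buf.ljust(CHUNK, b"\0"))
                buf = b""
            for i in range(0, len(art), CHUNK):
                chunks.append(art[i:i + CHUNK].ljust(CHUNK, b"\0"))
        elif len(buf) + len(art) > CHUNK:
            chunks.append(buf.ljust(CHUNK, b"\0"))
            buf = art
        else:
            buf += art
    if buf:
        chunks.append(buf.ljust(CHUNK, b"\0"))
    return chunks
```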
        
             | lucb1e wrote:
              | Is it not possible to determine which article(s) the user
              | downloaded based on the memory locations read? Of course,
              | multiple small articles within the same 100KB chunk can't
              | be told apart, but for any medium to large article, you'd
              | be able to make a good guess (if there are a handful of
              | articles there) or an exact match (if there is <=1 article
              | in that chunk), no?
             | 
             | Or does the server go through a large chunk of its memory
             | (say, at least a quarter of all of Wikipedia) and perform
             | some oblivious computation on _all of that data_ (applying
             | the result modulo this 100KB return buffer)? That sounds
             | very resource-intensive, at least for something large like
              | Wikipedia (a doctor's office with some information pages
             | of a few KB each could more easily do such a thing).
             | 
             | In the latter case, is each request unique (does it involve
             | some sort of IV that the client can xor out of the data
             | again) or could an index be built similar to a list of
             | hashed PIN codes mapped back to plain text numbers?
             | 
             | Edit: I had already read some comments but just two
             | comments further would have been my answer... :)
             | https://news.ycombinator.com/item?id=31669630
             | 
             | > > With a proper implementation of PIR, the server still
             | needs to scan through the entire encrypted dataset (this is
             | unavoidable, otherwise its I/O patterns would leak
             | information)
             | 
             | Now let's see if I can also find the answer regarding the
             | ability to build a lookup table...
             | 
             | Edit #2: found that also!
             | https://news.ycombinator.com/item?id=31669924
             | 
             | > One query for one item in the database is
             | indistinguishable (without the client's key) from another
             | query for the same item later; in other words, it's similar
             | to something like the guarantee of CBC or GCM modes, where
             | as long as you use it correctly, it is secure even if the
             | attacker can see many encryptions of its choosing.
             | 
             | That is some cool stuff indeed. I'm going to have to up my
             | game when building or reviewing privacy-aware applications.
             | Sure, a file sharing service is not going to practically
             | allow this, but I'm sure that with this knowledge, I will
             | come across places where it makes sense from both a
             | usefulness (e.g. medical info) and practicality (data set
             | size) perspective.
        
               | MauranKilom wrote:
               | > Sure, a file sharing service is not going to
               | practically allow this
               | 
               | Well, as the author points out here [0], it doesn't
               | actually translate to exorbitant cost increases when
               | almost all the cost is in bandwidth rather than compute.
               | A file sharing service rather seems like an ideal example
               | (assuming you can derive additional revenue from the
               | privacy promises).
               | 
               | [0]: https://news.ycombinator.com/item?id=31673122
        
         | j2kun wrote:
         | Do you have a blog or Twitter? I'd like to keep up with any
         | other cool projects you're working on!
        
           | blintz wrote:
           | Not at the moment, but will probably make a blog post or
           | something to explain how it all works at a slightly higher
           | level than the paper.
           | 
           | When it's done it'll be at https://samirmenon.com/.
        
         | jl6 wrote:
         | In another comment you've said:
         | 
         | > With a proper implementation of PIR, the server still needs
         | to scan through the entire encrypted dataset (this is
         | unavoidable, otherwise its I/O patterns would leak information)
         | 
         | Is this technique therefore practical only when the server side
         | dataset is relatively small (or full scans for every query are
         | tolerable)?
         | 
         | (edit: sorry, misattributed the quote)
        
           | blintz wrote:
           | Wasn't me, but it was accurate!
           | 
           | Indeed, except in some (exciting!) new theoretical
           | constructions, server work is always linear in the database
           | size.
           | 
            | However, I'd emphasize that our work shows that the constant
            | factor on this linear operation can be quite low. We process at
           | 300MB/s to 1.9GB/s on a single core, which is fast enough for
           | relatively large databases. Remember that the computation is
           | embarrassingly parallel, so you can always throw more compute
           | at larger databases. To summarize, we think the speed is now
           | fast enough that it really can be feasible to scan the whole
           | database to answer a single query.
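A quick back-of-the-envelope check of those figures for the 6 GB demo database (Python; the scan rates are the ones quoted above, and the near-perfect parallel scaling is the claim being illustrated):

```python
DB_BYTES = 6e9  # the demo serves ~6 GB of English Wikipedia

def seconds_per_query(scan_rate_bytes_per_s, cores=1):
    # Server work is linear in database size; the scan parallelizes
    # almost perfectly across cores.
    return DB_BYTES / (scan_rate_bytes_per_s * cores)

slow = seconds_per_query(300e6)        # one core at 300 MB/s: 20 s
fast = seconds_per_query(1.9e9, 6)     # six cores at 1.9 GB/s each
assert slow == 20.0
assert fast < 1.0
```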
        
             | teraflop wrote:
             | > except in some (exciting!) new theoretical constructions
             | 
             | Sorry if this is too much of a tangent, but I would love to
             | know what these are!
        
               | sangel wrote:
               | They are called PIR with sublinear online computation or
               | offline/online PIR.
               | 
                | The key idea is that the client issues a query that is
                | independent of what they really want. This is the
                | "offline" query. This query is linear (unavoidable), and
                | the client gets some hints as a result.
               | 
               | Then later, the client can issue one or more queries for
               | elements they want and those queries can be sublinear in
               | terms of computation. The client uses hints to make that
               | happen.
               | 
               | Some papers on this are:
               | 
               | https://eprint.iacr.org/2019/1075.pdf
               | 
               | https://eprint.iacr.org/2021/1438
               | 
               | https://eprint.iacr.org/2022/081.pdf
        
               | blintz wrote:
               | Sure! This paper has people really excited from a
               | theoretical standpoint:
               | https://eprint.iacr.org/2022/081.pdf. It's a bit far from
               | practical for single-server, but I'm sure people are
               | working on it.
        
           | wbeckler wrote:
            | Maybe the I/O pattern could be hidden using confidential
            | computing, like with a Nitro Enclave.
        
             | teraflop wrote:
             | Maybe, but if you have a secure enclave that can be trusted
             | not to leak data, then you don't really need PIR. You can
             | just have clients encrypt their queries with a key that can
             | only be decrypted by the code running in the enclave.
        
           | teraflop wrote:
           | Clarification: that was my comment, not OP's. I'm not a
           | cryptography expert, just an interested amateur. But my
           | understanding is that O(n) query times are inevitable if you
           | want information-theoretic security. Maybe it's possible to
           | do better with a weaker security property.
           | 
           | And there are clever ways you can make a system like this
           | "scale" even if the overall dataset size is limited. For
           | instance, the authors cite another interesting paper [1] that
           | uses a similar technique to build a fully-private voice chat
           | system. The basic idea seems to be that you build a
           | "database" consisting of the most recent snippet of audio
           | from every ongoing conversation, and let each client
           | privately query for the segment that's addressed to it. And
           | every fraction of a second when new audio data arrives, you
           | just throw away the old database and build a new one, so the
           | amount of data doesn't depend on the length of the
           | conversation.
           | 
           | Even if this doesn't scale to arbitrary numbers of users, it
           | could still be used to run a "cell" of a few thousand people,
           | in which it's not possible for an adversary to reconstruct
           | communication patterns within the cell.
           | 
           | [1]:
           | https://www.usenix.org/conference/osdi21/presentation/ahmad
        
         | Canada wrote:
         | Can this be applied usefully to non-public datasets?
         | 
         | Would it be feasible to add some other zero knowledge proof to
         | this that would confirm a user has paid a subscription for
         | access? For example, if this were a news site, the user would
         | have to prove a valid subscription to read articles, but the
         | site would not be able to know which articles any subscriber
         | decided to read?
         | 
          | If that is possible, what could the site do to prevent a
          | paying subscriber from sharing their access with an
          | unreasonable number of others? Would it be possible to impose
          | a rate limit per subscriber?
        
           | phh wrote:
            | I'm no cryptographer, but it seems to me you could implement
            | this using the same algorithm Cloudflare uses for Tor, which
            | generates anonymous tokens from an ad hoc webpage
        
           | blintz wrote:
           | The simplest approach would be to just charge people per-
           | query (or charge in levels, depending on the number of
           | queries). This could be done in the standard way (you have to
           | log in, pay the site, and then it gives you an API key or
           | just session token, and logs how many queries you make). I
           | think you can avoid having to use a ZKP this way, since that
           | will make things much more complicated and possibly costly.
        
         | JanisErdmanis wrote:
         | > A malicious server is unable to determine which article the
         | client retrieved.
         | 
         | This sounds like magic :O. How does it behave when new articles
         | (elements) are added, does it need to rebuild the whole
         | database and distribute new parameters?
         | 
          | I wonder how practical it would be for clients to synchronize
          | content without the server being able to deduce which
          | synchronization state the client is at.
        
           | blintz wrote:
           | It does sound like magic! This is what got me into this
           | field; it seems like something that should intuitively be
           | impossible... but it's not!
           | 
           | Parameters only need to be changed based on the number of
           | items in the database (not the content of the items). Also,
           | they don't really need to be modified as long as we are
           | within the same power of two number of items. So, I think
           | client and server agreement on parameters seems feasible.
           | Right now, it's just hard coded :)
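The power-of-two bucketing mentioned above can be sketched in a line (Python; the function name is hypothetical - as noted, the demo currently hard-codes its parameters):

```python
def param_bucket(num_items):
    """Smallest power of two >= num_items: parameters only need to
    change when the database crosses one of these boundaries."""
    assert num_items >= 1
    return 1 << (num_items - 1).bit_length()

# Adding items within the same bucket leaves parameters unchanged.
assert param_bucket(5) == param_bucket(8) == 8
assert param_bucket(9) == 16
```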
        
         | syrrim wrote:
         | What is the maximum throughput the server can maintain? Or, in
         | other words, how much does it cost per query?
        
           | blintz wrote:
           | The $35/month server uses 6 cores to answer a single query in
           | ~2.5-3 seconds. So it's 0.33 QPS :-)
           | 
           | Not high, which is why it might not be working well for folks
           | right now...
           | 
            | Query time goes down almost perfectly linearly with the
            | number of cores, since the computation is embarrassingly
            | parallel.
           | 
              | In terms of cost, we're still talking only 18 CPU-seconds
              | and ~300KB of outgoing bandwidth, which is not a ton at
              | today's prices.
        
             | IshKebab wrote:
              | It's embarrassingly parallel... but you also do N times
              | more work than a non-homomorphic system, so that's not
              | saying much!
              | 
              | This doesn't seem like a particularly compelling
              | application - can you give some practical problems that
              | homomorphic encryption solves? I've always heard vote
              | counting as the example.
        
               | blintz wrote:
               | Yeah, the computational overhead over no-privacy
               | retrieval will always be high. Something we point out in
               | the paper is that this does not necessarily translate to
               | server monetary costs; for example, on AWS, where
               | outgoing bandwidth is expensive, using our system to
               | stream a movie is less than 2x the monetary cost of
               | direct, no-privacy streaming. This is because the
               | computational costs are just very small in comparison to
               | the bandwidth costs.
               | 
               | As far as more compelling applications, several folks
               | have suggested DNS, OCSP, and certificate transparency as
               | potential examples. Using PIR as a building block, it is
               | also possible to build more complex systems, like
               | metadata-private messaging (systems like 'Pung') and even
               | private voice calling (systems like 'Addra').
        
         | rkagerer wrote:
         | Not able to read the full paper at the moment, and confused
         | about something:
         | 
         | If the server needs to go pull the article from Wikipedia, how
         | is it blind to which one is being requested?
         | 
         | If you've pre-seeded the server with an encrypted 30% of
         | Wikipedia, how can I trust you haven't retained information
         | that would enable you to derive what I requested?
         | 
         | The only way I understand this works is if the client itself
         | seeded the encrypted data in the first place (or at least an
         | encrypted index if all the server pushes back is article
         | numbers).
         | 
          | Maybe I'm ignorant of something; if so, thanks for the ELI5.
        
           | gojomo wrote:
           | The server already has a full copy of (its subset of)
           | Wikipedia.
           | 
           |  _Every_ query touches every byte of that snapshot, in the
           | same way... but the homomorphic-encryption math distills that
           | down to just the fragment the client requested.
           | 
            | And it does so _without_ giving even the CPU executing that
            | math any hint of what bytes/ranges/topics survive the math.
           | The much-smaller - but still, at every server step, encrypted
           | - result is then sent to the client, which performs the
           | decryption.
           | 
           | It's moon-math magic.
        
           | blintz wrote:
           | The server has a snapshot of Wikipedia, sitting in memory.
           | The server is blind to which article is requested because it
           | computes a homomorphic dot product between an encrypted one-
           | hot vector (encrypted under a key that _only_ the client
           | knows) and the total set of articles in plaintext. The
            | encrypted query does not reveal anything about the requested
            | index (i.e., it is semantically secure).
           | 
           | The 'magic' of homomorphic encryption is that the server is
           | able to take an encrypted one-hot vector of bits, and a
           | vector of plaintext articles, compute a homomorphic dot
           | product, and produce a single ciphertext encoding the single
           | desired plaintext article. This single output ciphertext is
            | crucially also encrypted under the _client's_ secret key, so
           | it reveals nothing to the server.
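The system described here uses lattice-based encryption, but the dot-product idea can be sketched with any additively homomorphic scheme. Below is a toy version using Paillier as a stand-in; the tiny fixed primes and the numeric "articles" are illustrative only (nothing here is from the paper, and this parameterization is completely insecure):

```python
import random
from math import gcd

# Toy Paillier cryptosystem (tiny fixed primes; illustration only, insecure).
p, q = 1000003, 1000033
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)  # with g = n+1, L(g^lam mod n^2) = lam mod n

def encrypt(m):
    while True:
        r = random.randrange(2, n)
        if gcd(r, n) == 1:
            break
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# The "database": each article reduced to a number for the sketch.
articles = [11, 22, 33, 44]

# Client: encrypted one-hot query for index 2; each ciphertext looks random.
want = 2
query = [encrypt(1 if i == want else 0) for i in range(len(articles))]

# Server: homomorphic dot product. Multiplying ciphertexts adds plaintexts,
# and c^a scales the hidden plaintext by a, so this computes
# Enc(sum(query_i * article_i)) = Enc(articles[want]).
result = 1
for c, a in zip(query, articles):
    result = result * pow(c, a, n2) % n2

print(decrypt(result))  # -> 33
```

The server only ever sees ciphertexts (the query and the running product), yet the client decrypts exactly the article it asked for.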
        
             | philipkglass wrote:
             | Don't you have to pad the output ciphertext size to match
             | the largest article you could possibly request from the set
             | of articles? Or is a fixed-size output an inherent property
             | of homomorphic encryption schemes? Otherwise it seems to
             | reveal something to the server just by the size of the
             | ciphertext (since WP articles vary in size).
        
               | blintz wrote:
               | Yes, every item needs to be the same size. We batch the
               | articles into 100KB chunks, and we split articles larger
               | than this into multiple chunks.
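A minimal sketch of that batching step. The 100 KB figure is from the comment above; the helper name and padding-with-zeros choice are illustrative assumptions:

```python
CHUNK = 100_000  # 100 KB records, per the comment above

def to_records(article: bytes) -> list[bytes]:
    """Split an article into fixed-size records, zero-padding the last,
    so every database entry the PIR scan touches has identical length."""
    records = []
    for i in range(0, max(len(article), 1), CHUNK):
        piece = article[i:i + CHUNK]
        records.append(piece.ljust(CHUNK, b"\x00"))
    return records

small = to_records(b"short article")   # 1 record, padded to CHUNK bytes
big = to_records(b"x" * 250_000)       # 3 records, all exactly CHUNK bytes
print(len(small), len(big))
```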
        
               | aildours wrote:
               | Wouldn't you then have to send out multiple ciphertexts
               | (for articles >100 KB)? Which would leak something about
               | the size of the article...
        
               | blintz wrote:
                | You would. It's important for the client to pace its
               | requests in a way that does not reveal too much (for
               | example, the client should not just request both chunks
               | of the article at the same time). The best thing to do
               | would probably be to have a 'load more' button at the
               | bottom of a very long article that makes a separate
               | request.
               | 
               | If you think about it, the pacing of user queries could
               | always reveal something (if there's a long gap between
                | queries, perhaps it's a dense mathematics article?). So
               | the best we can hope for is pacing requests either
               | randomly, or for even further protection, perhaps making
               | dummy requests on a fixed cadence.
        
               | Dylan16807 wrote:
               | The output is always going to be fixed size.
               | 
               | The nicer way to look at homomorphic encryption is like
               | math, but the more literal way is like a bunch of logic
               | gates. There are N output wires. You control the bits on
               | the input wires. Nothing you do will change the number of
               | wires.
        
           | Dylan16807 wrote:
           | > If you've pre-seeded the server with an encrypted 30% of
           | Wikipedia, how can I trust you haven't retained information
           | that would enable you to derive what I requested?
           | 
           | With homomorphic encryption, the client sends a series of
           | encrypted numbers. Nobody can decrypt them except the client.
           | The server can do arithmetic with them, making new secret
           | numbers that nobody can decrypt except the client.
           | 
           | There is no usable information _to_ retain.
           | 
           | So the question becomes: what can you calculate using
           | arithmetic on secret numbers?
           | 
           | Well, for this demo, treat every article as a number. Then
           | multiply all the articles you don't want by 0, and the
           | article you want by 1, and add them all together.
           | 
           | The server just sees that it's multiplying every article by a
           | secret number. It can't tell what the number is. It can't
           | tell if the output is "encrypted article" or "encrypted
           | 000000..."
           | 
           | Then the server adds them all up. If the client asked for no
           | articles, the result will be "encrypted 000000..." If the
           | client asked for one article, the result will be that
           | article, encrypted. If the client asked for multiple, the
           | result will be a garbled mush of overlapping articles,
           | encrypted. The server can't tell the difference. It just
           | knows it has an encrypted number.
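The select-by-0/1-and-sum arithmetic described above, shown in the clear (in the real protocol these query bits are ciphertexts the server cannot read, but the arithmetic is the same):

```python
# The selection arithmetic alone, with no encryption: in the real protocol
# the query entries are encrypted and the server never sees the 0s and 1s.
articles = [101, 202, 303, 404]

one_article = [0, 0, 1, 0]    # ask for index 2
no_articles = [0, 0, 0, 0]    # ask for nothing
two_articles = [0, 1, 1, 0]   # ask for two: result is a garbled sum

def select(query):
    # Multiply each article by its query bit, then add everything up.
    return sum(q * a for q, a in zip(query, articles))

print(select(one_article), select(no_articles), select(two_articles))
# -> 303 0 505
```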
        
             | barkingcat wrote:
              | is revealing that you accessed something from the index,
              | vs. nothing, a leak of information?
        
             | dheera wrote:
             | Thank you. I _really_ wish the paper started with exactly
             | what you wrote as the abstract.
        
         | jerf wrote:
         | This is the first thing out of homomorphic encryption _I
         | personally have seen_ that seems to be in the ballpark of
         | useful for some practical use, which is impressive. Have I
         | missed out on any other such things of interest?
         | 
         | (And this is not a criticism; this is a compliment. You start
          | so far behind the eight-ball with homomorphic encryption with
          | regard to the resources it consumes that I wasn't convinced it
          | was ever going to be even remotely useful for much of anything.
         | Precisely because I was so skeptical, I am that impressed to
         | see something work this well. It's not the fastest Wikipedia
         | mirror, but... honestly... I've been on slower websites!
         | Websites with _far_ less excuse.)
        
           | blintz wrote:
           | There has recently been a lot of great work on making
           | homomorphic encryption more practical for particular
           | applications. We definitely stand on the shoulders of all
           | that work!
           | 
           | One reason homomorphic encryption has a reputation as
           | absurdly slow is that people typically talk about "fully"
           | homomorphic encryption, which essentially means that you can
           | compute an arbitrarily large function on the encrypted data.
           | This involves a very expensive process called bootstrapping,
           | which incurs a roughly ~15 ms cost per binary operation
           | evaluated. As you can imagine, that adds up.
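To see how quickly that adds up, a back-of-envelope calculation using the ~15 ms figure above; the circuit size is a made-up example, not a number from the paper:

```python
# Back-of-envelope cost of bootstrapped FHE, using the ~15 ms per binary
# gate figure from the comment above. The circuit size is hypothetical.
BOOTSTRAP_MS = 15
gates = 1_000_000  # a modest million-gate circuit (made-up example)

total_seconds = gates * BOOTSTRAP_MS / 1000
print(total_seconds / 3600)  # roughly 4.2 hours for a single evaluation
```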
           | 
           | This work uses "leveled" homomorphic encryption, where we
           | only perform a function of 'bounded' size (that is, the
           | homomorphic dot product). So, we do not have to perform
           | bootstrapping, and thus avoid some of the more extreme costs.
           | 
           | The other reason this work is practical is that we 'tailored'
           | our construction to the particular problem of private
           | information retrieval. When people try to apply homomorphic
           | encryption _generically_ to their problem, they typically end
           | up with disappointingly slow and expensive results. Cool work
            | on better 'FHE compilers' is in progress, so hopefully that
           | will also help!
        
             | feanaro wrote:
             | > can compute an arbitrarily large function on the
             | encrypted data
             | 
             | What is the meaning of the word "large" here? How are we
             | defining the size of a function?
        
               | Confiks wrote:
                | Either the number of Boolean logic operations or the
                | number of integer additions and multiplications, as these
                | are the operations FHE schemes support.
        
           | halJordan wrote:
           | Apple's Differential Privacy uses homomorphic encryption for
           | parts of its functioning.
        
             | j2kun wrote:
             | Citation?
        
           | kccqzy wrote:
           | You missed the widely panned Apple iCloud child sexual abuse
           | imagery detection feature. The private set intersection is
           | basically doing homomorphic encryption. In raising some very
           | valid policy critiques, people forget that it's actually a
           | nifty piece of engineering. (This is not an endorsement of
           | that feature.) https://www.apple.com/child-
           | safety/pdf/Apple_PSI_System_Secu...
           | 
            | I'm also working closely with a team at $WORK who's
            | using a protocol very similar to Apple's but not doing CSAM
           | detection. We are seeing some severe pushback on this
           | technology. I wouldn't be surprised if there are multiple
            | homomorphic encryption based products at Big Tech that have
           | yet to see the light of day.
        
             | [deleted]
        
             | lifthrasiir wrote:
             | I think it is very counterintuitive---even to professional
             | programmers---that you can compute things without letting
             | the computer know them. I literally had to go through the
             | entire paper to understand that this can actually work as
              | long as the human in the loop doesn't screw up (see my
             | [1], which was revised multiple times at that time).
             | 
             | [1] https://news.ycombinator.com/item?id=28223141
        
         | ddjsn111 wrote:
         | How does the server select the article in a way that we can be
         | sure they don't record the article sent back? Are the articles
         | encrypted on the server too?
        
           | sp332 wrote:
           | Yes, in fact the article number is not even decrypted by the
           | server, so the server doesn't know which article you asked
           | for!
        
             | kaoD wrote:
             | How is this not vulnerable to side-channel attacks like
             | disk-access patterns?
             | 
             | Could I, as a malicious server, request myself a target
             | article and correlate that with legitimate user requests?
        
               | sodality2 wrote:
               | > With a proper implementation of PIR, the server still
               | needs to scan through the entire encrypted dataset (this
               | is unavoidable, otherwise its I/O patterns would leak
               | information)
               | 
               | https://news.ycombinator.com/item?id=31669370 (not
               | original poster)
        
       | dontbenebby wrote:
       | This is very cool OP! I interviewed to be a privacy engineer with
       | Wikimedia a while back.
       | 
       | I suggested that my goal would be to add a v3 onion service. They
       | had actually listed years of "homomorphic encryption" experience
       | as a requirement. I phoned up the recruiter and basically said
       | it's ok if there is a personality conflict, but the role as
       | written was
       | impossible to fill, and it scared me that very good suggestions
       | for privacy as well as the health of the Tor network were
       | discarded.
       | 
       | (If you set up a dot onion, that frees up traffic on exit nodes,
       | whose capacity is limited.)
       | 
       | Big thanks to the OP for being willing to share this work, it's
       | very cool and I'm about to read your eprint.
       | 
       | I'm excited about the potential of homomorphic encryption, though
       | I worry about things like CPU cost -- I recall when folks had to
       | really be nudged not to encrypt huge blocks of data with PGP, but
       | instead use it to encrypt the passphrase to a Truecrypt volume
       | using a symmetric cipher like AES.
       | 
       | (I'd love to know how we got to a point where Twitter added an
       | onion service then banned me, but Wikipedia continues to not
       | even support MFA for logins -- I recently registered an account
       | intending to eventually upload some art to the commons, but the
       | perpetual refusal to allow folks to make healthy choices
       | disturbs me.)
       | 
       | In fact, after reading articles like these[1][2], it makes
       | me question the integrity of the folks I interacted with during
       | the interview process.
       | 
       | On my end, it was especially disturbing since prior to enrolling
       | in my PhD, the alternative path I discussed was becoming an FBI
       | agent focused on counter intelligence in the "cyber" realm.
       | 
       | The agent I spoke with told me I'd serve "at the needs of the
       | bureau", so that would mean probably not using my computer
       | skills, which would then languish, then after a couple years I
       | might still not get my desired position, and gave me a card,
       | which I eventually lost.
       | 
       | Years later, prior to the insurrection, I had to walk down to
       | Carnegie Mellon and ask if anyone had his contact information,
       | and was shocked that folks refused to even point me at a link to
       | the lecture, which had been listed as open to the public.
       | 
       | I'm someone who _reads_ Wikipedia, not really edits, but the vast
       | majority of users are readers not editors, and this perpetual
       | pattern of refusing to enable privacy enhancing technologies,
       | paired with using privileged access to make hiring decisions
       | against folks who lack the physical ability to make good privacy
       | decisions offended me on a deep, personal level, and is why I
       | often post in a brash, erratic manner.
       | 
       | Because I see zero incentive to stay silent -- if I'm quiet,
       | people will slowly drain my bank account.
       | 
       | If I post, there is a chance someone will see what I say, notice
       | my skills, and offer full time employment. So I have to continue
       | risking offending folks until I find a full time job, which I
       | have not had since I left the Center for Democracy and Technology
       | under duress following a series of electronic and physical
       | attacks, paired with threats and harassment by staffers in the
       | organization.
       | 
       | TL;DR: Great research, but I hope they also add an onion service
       | rather than jump straight to using this :-)
       | 
       | [1]
       | https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@list...
       | 
       | [2] https://slate.com/technology/2021/10/wikipedia-mainland-
       | chin...
        
         | MauranKilom wrote:
         | I tried, but I simply can't follow your train of thought. You
         | keep going back and forth between criticizing Wikimedia hiring
         | and technology choices, advertising yourself, and deliberating
         | over onion services. And it all seems extremely tangential to
         | the article (which is really not about the future of Wikipedia,
         | or Tor, or your career).
        
       | badrabbit wrote:
       | Very nice! Great against snoopers that lack authority, but when
       | they do have some authority (bosses, governments), without
       | plausible deniability it can do more harm than good.
        
       | ajconway wrote:
       | Theoretically, can this scheme be turned into a generic O(N) key-
       | value retrieval for non-static content (in this example --
       | supporting adding, removing and replacing articles without re-
       | encrypting the whole database and re-sending the client setup
       | data)?
        
         | blintz wrote:
         | We never encrypt the database. Only the query is encrypted. The
         | client setup data is only dependent on the client key and the
         | size of the database (not the content). Adding and replacing
         | articles can happen whenever the server wants, and clients do
          | not need to "re-sync" or anything like that.
         | 
         | For arbitrary key-value retrieval, a hashing scheme would work
         | pretty well, modulo some imbalances that will occur.
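One possible shape for such a hashing scheme, as a hedged sketch (the bucket count, helper names, and records are all invented for illustration): the client hashes its key locally to find the PIR index, so the server never learns which key was looked up, and the "imbalances" are the unevenly filled buckets, which must be padded to a common size.

```python
import hashlib

NUM_BUCKETS = 8  # illustrative; a real deployment would use far more

def bucket_of(key: str) -> int:
    """Map a key to a PIR index with a public hash; computed client-side,
    so the server never sees the key itself."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# Server: group records by bucket. Buckets will be somewhat imbalanced,
# so each must be padded up to the size of the largest one.
records = {"alice": b"a-data", "bob": b"b-data", "carol": b"c-data"}
buckets: dict[int, list[bytes]] = {}
for k, v in records.items():
    buckets.setdefault(bucket_of(k), []).append(v)

# Client: PIR-retrieve the bucket at bucket_of("bob"), then locally pick
# "bob"'s record out of the returned bucket contents.
print(bucket_of("bob"))
```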
        
       ___________________________________________________________________
       (page generated 2022-06-08 23:00 UTC)