[HN Gopher] Pg-Emoji
       ___________________________________________________________________
        
       Pg-Emoji
        
       Author : JoelJacobson
       Score  : 41 points
       Date   : 2021-01-21 15:16 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jszymborski wrote:
       | So, am I right in my understanding that this is meant to add
       | support for emojis not present in UTF8?
        
         | Ancapistani wrote:
         | I don't _think_ so.
         | 
         | Without digging into the source - and I intend to do that, if
         | someone more familiar with this doesn't chime in - it appears
         | that it's targeted at reducing resource consumption.
         | 
         | UTF-8 can encode emoji fine. Consider 😁 ("grinning face with
         | smiling eyes"), which is `\xF0\x9F\x98\x81` in bytes. That's
         | four bytes. From the pg-emoji Readme:
         | 
         | > A lookup-table is constructed from the first 1024 emojis from
         | [https://unicode.org/Public/emoji/13.1/emoji-test.txt], where
         | each emoji maps to a unique 10 bit sequence.
         | 
         | > The input data is split into 10 bit fragments, mapped to the
         | corresponding emojis.
         | 
         | If my understanding is correct thus far, then instead of
         | storing four bytes for each emoji, you'd only need 10 bits.
         | 
         | I don't know where this would be worthwhile.
         | 
         | I'm further confused by the purpose of `to_text()` and
         | `from_text()`. Their example shows a string composed of mostly
         | Latin characters being encoded into a string of emoji and back.
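
The two quantities in the comment above can be checked with a minimal Python sketch. It shows the four UTF-8 bytes of U+1F601 and one way to split arbitrary input into 10-bit fragments, as the quoted README describes. The helper name `split_10bit` and the zero-padding of the last fragment are assumptions for illustration, not taken from pg-emoji. Each fragment is an index into the 1024-entry emoji table, so the 10 bits describe input data rather than a stored emoji, which matches the base-1024 reading given in later replies.

```python
# U+1F601, the emoji whose UTF-8 bytes are quoted in the comment above.
emoji = "\U0001F601"
utf8 = emoji.encode("utf-8")
print(utf8, len(utf8))  # b'\xf0\x9f\x98\x81' 4


def split_10bit(data: bytes) -> list[int]:
    """Split data into 10-bit fragments, each an index in range(1024).

    Zero-padding the last fragment is an assumption made for this sketch;
    the README excerpt does not say how a partial fragment is handled.
    """
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 10)  # pad up to a multiple of 10 bits
    return [int(bits[i:i + 10], 2) for i in range(0, len(bits), 10)]


fragments = split_10bit(b"hello")
print(fragments)       # one table index per 10 input bits
print(len(fragments))  # 4
```

Five input bytes are 40 bits, so they become four table indices, and each index would then be rendered as one emoji.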
        
           | JoelJacobson wrote:
           | > I'm further confused by the purpose of `to_text()` and
           | `from_text()`. Their example shows a string composed of
           | mostly Latin characters being encoded into a string of emoji
           | and back.
           | 
           | This is meant to be used if you want to pass some text
           | containing escape characters or perhaps JSON. Note also that
           | the first emoji is a checksum, which might be useful if you
           | want to make sure a user correctly copy/pasted a string, as
           | opposed to sending a raw text string (without checksum).
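
The checksum idea described above can be sketched in Python as follows. This is not pg-emoji's algorithm: the thread does not say how the checksum emoji is computed, so the CRC-32 truncated to 10 bits used here is a stand-in, and the names `split_10bit`, `checksum_symbol`, `encode_with_checksum` and `is_intact` are hypothetical. Symbols are kept as table indices (0 to 1023) instead of real emoji.

```python
import zlib


def split_10bit(data: bytes) -> list[int]:
    """Split data into zero-padded 10-bit fragments (padding is an assumption)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 10)
    return [int(bits[i:i + 10], 2) for i in range(0, len(bits), 10)]


def checksum_symbol(fragments: list[int]) -> int:
    """Hypothetical checksum: CRC-32 over the fragments, truncated to 10 bits."""
    raw = b"".join(f.to_bytes(2, "big") for f in fragments)
    return zlib.crc32(raw) & 0x3FF


def encode_with_checksum(data: bytes) -> list[int]:
    """Return the checksum symbol followed by the data symbols."""
    fragments = split_10bit(data)
    return [checksum_symbol(fragments)] + fragments


def is_intact(symbols: list[int]) -> bool:
    """Check that the first symbol matches the checksum of the rest."""
    if not symbols:
        return False
    return symbols[0] == checksum_symbol(symbols[1:])


encoded = encode_with_checksum(b'{"key": "value"}')
print(is_intact(encoded))       # True: nothing was lost
print(is_intact(encoded[:-1]))  # almost always False: last symbol missing
```

With a 10-bit checksum, roughly one damaged string in 1024 would still pass, so this catches copy/paste slips but is not a strong integrity check.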
        
             | Ancapistani wrote:
             | > This is meant to be used if you want to pass some text
             | containing escape characters or perhaps JSON.
             | 
             | I guess I don't understand how this is an improvement.
             | Perhaps it's because I typically interact with the DB
             | through a language-specific library/protocol like Python's
             | DB API, which handles escaping strings and parameterization
             | without my really having to think about it.
             | 
             | Could you provide a specific example of when this might
             | solve a real-world problem?
        
               | JoelJacobson wrote:
               | One intended use-case for the from_text()/to_text()
               | functions is when information is manually copied from
               | some place and pasted somewhere else, where you are
               | worried the user might make a mistake and select the
               | wrong piece of text.
               | 
               | For instance, if you instruct the user to copy "this text
               | string" and paste it somewhere, some users might copy the
               | text string with the double-quotes and some without them.
               | If the string is emoji-encoded instead, the receiver of the
               | copied emoji string can detect if not all emojis were
               | copied.
        
               | Ancapistani wrote:
               | This strikes me as data validation, which should reside
               | in the application layer - I don't see how pg-emoji helps
               | in any way.
               | 
               | Further... if the receiver can validate the encoded
               | string itself, they implicitly already have the string.
               | Why require the user to copy/paste at all? If you meant
               | "Ensure that the user hasn't copied quotation marks as
               | well", then we're back to it being application logic.
               | 
               | If I'm understanding correctly that the primary benefit
               | is that there is a checksum, then there are already many
               | solutions for this in common use - Base58Check, as
               | used to ensure the validity of Bitcoin addresses, comes
               | immediately to mind. I wrote an implementation of that
               | quite a while ago:
               | https://github.com/lyndsysimon/cryptocoin/blob/primary/crypt...
               | 
               | Please don't misunderstand, I'm in no way intending to be
               | argumentative. I don't understand the practical use of
               | this project, which leads me to believe that there is a
               | problem being solved that I lack the context to identify.
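
For comparison, the Base58 checksum scheme used for Bitcoin addresses (commonly called Base58Check) can be sketched in a few lines of Python. This is an independent illustration, not the implementation linked in the comment above; the function names are hypothetical. The checksum is the first four bytes of a double SHA-256 over the payload, and leading zero bytes are written as the character `1`.

```python
import hashlib

# Bitcoin's Base58 alphabet: no 0, O, I or l, to avoid look-alike characters.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def _checksum(payload: bytes) -> bytes:
    """First 4 bytes of SHA-256(SHA-256(payload))."""
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]


def b58check_encode(payload: bytes) -> str:
    """Append the checksum, then write the bytes as a base-58 number."""
    data = payload + _checksum(payload)
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = ALPHABET[rem] + out
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + out


def b58check_decode(text: str) -> bytes:
    """Decode and verify; raises ValueError on a bad character or checksum."""
    num = 0
    for ch in text:
        num = num * 58 + ALPHABET.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    data = b"\x00" * leading_ones + body
    payload, check = data[:-4], data[-4:]
    if _checksum(payload) != check:
        raise ValueError("checksum mismatch")
    return payload


encoded = b58check_encode(b"\x00hello")
print(encoded)
print(b58check_decode(encoded))  # b'\x00hello'
```

A single mistyped or dropped character changes the decoded bytes, so the 32-bit checksum fails with very high probability.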
        
               | JoelJacobson wrote:
               | In my case, PostgreSQL is the application layer, the
               | application is written in database functions, and I'm
               | using PostgREST to expose it to my front-end.
        
           | MrStonedOne wrote:
           | This is not an emoji support system or a system for storing
           | emojis.
           | 
           | It's a system for encoding data _as_ emojis: "this is a
           | string" => some emojis, i.e. baseemoji or base1024.
        
         | sfeng wrote:
         | I believe it is an encoding for data into base 1024, using
         | emoji as the symbol set. Similar in concept to Base64, meant to
         | encode data into a format which can be sent anywhere ASCII is
         | acceptable. This, one could think, would allow you to do the
         | same thing more efficiently but with systems which accept
         | emoji.
        
           | jfk13 wrote:
           | I'd expect that most places where emoji are accepted and
           | reliably preserved, you could also use things like Han
           | ideographs, which would give you a much larger symbol set to
           | work with.
        
             | WorldMaker wrote:
             | One reason to pick emoji is visual distinctiveness and user
             | familiarity. While admittedly there are large populations
             | familiar with CJK ideographs and their
             | construction/deconstruction, there are many more people
             | familiar with emoji at this point. In the case of an
             | encoding error or trying to visually "diff" two encodings,
             | many audiences will spot emoji differences and/or problems
             | with badly encoded emoji (much easier than they might spot
             | differences in CJK ideographs).
             | 
             | (Admittedly there are still issues within the emoji space:
             | some of the "faces" are quite similar in appearance
             | in many fonts and still easily confused. Plus in the larger
             | emoji space the subtle differences of skin color/gender can
             | be easily confused if you have to rely on them for
             | distinction. Restricting to only 1024 emoji and fewer ZWJ
             | sequence variations presumably takes care of most of those
             | issues.)
        
             | roywiggins wrote:
             | There's always base65536
             | 
             | https://github.com/qntm/base65536
        
         | cmeacham98 wrote:
         | It's a novelty fun project, it serves no practical purpose.
         | It's an encoding scheme similar to base64 or URL encoding, but
         | one that does nothing useful.
        
           | jeltz wrote:
           | A similar technique but with a different set of emojis is
           | used by the Element chat client to verify signatures for end
           | to end encryption.
           | 
           | https://matrix.org/docs/spec/client_server/latest#sas-method...
        
         | alex_duf wrote:
         | I think it's a tongue in cheek project that isn't meant to be
         | used in any production system.
         | 
         | I'm pretty sure you can already put emojis in the text fields
         | of postgres. (or at least I'd be surprised if you couldn't)
        
           | Ancapistani wrote:
           | > I think it's a tongue in cheek project
           | 
           | That would certainly make more sense than anything I've been
           | able to glean from it.
           | 
           | I've definitely used Postgres text columns to store user-
           | provided text values that included emoji in the past. They're
           | part of my standard test case for any user input.
        
           | JoelJacobson wrote:
           | The idea is to encode binary strings in a visually shorter
           | form than e.g. hex, and also make it easier to visually
           | detect differences. It's also possibly easier to remember a
           | bunch of emojis than a hex string.
        
             | jfk13 wrote:
             | Not really. Do you really notice whether someone uses
             | "Grinning face with smiling eyes" or just plain "Grinning
             | face"? Or was it "Grinning face with big eyes", or maybe
             | "Beaming face with smiling eyes". Or were they "squinting"
             | eyes? Maybe the face was just "smiling", not "grinning".
             | Sheesh.
        
               | jeltz wrote:
               | If you select a good set of clearly unambiguous emojis I
               | could see the use for it. See for example the Matrix spec
               | which recommends using emojis for verifying E2EE
               | signatures.
               | 
               | https://matrix.org/docs/spec/client_server/latest#sas-method...
        
               | jfk13 wrote:
               | Yes, a carefully selected set of 64 symbols would be much
               | more sensible from that point of view. This project,
               | though, apparently uses "the first 1024 emojis from
               | [https://unicode.org/Public/emoji/13.1/emoji-test.txt]",
               | which is an entirely different matter.
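
To make "the first 1024 emojis from emoji-test.txt" concrete, here is a hedged Python sketch of how such a table could be built. The function name `parse_emoji_test` is hypothetical, and keeping only `fully-qualified` entries is an assumption; the README excerpt quoted in the thread does not say which lines count. The sample lines follow the shape of the Unicode file: hex code points, a semicolon, a status, then a `#` comment.

```python
# A few lines in the shape used by emoji-test.txt (abbreviated sample).
SAMPLE = """\
# group: Smileys & Emotion
1F600                                                  ; fully-qualified     # 😀 E1.0 grinning face
1F603                                                  ; fully-qualified     # 😃 E0.6 grinning face with big eyes
263A FE0F                                              ; fully-qualified     # ☺️ E0.6 smiling face
263A                                                   ; unqualified         # ☺ E0.6 smiling face
"""


def parse_emoji_test(text: str, limit: int = 1024) -> list[str]:
    """Return up to `limit` emoji strings in file order."""
    table = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and headers
        if not line:
            continue
        code_points, status = (part.strip() for part in line.split(";", 1))
        if status != "fully-qualified":  # assumption: skip other variants
            continue
        table.append("".join(chr(int(cp, 16)) for cp in code_points.split()))
        if len(table) == limit:
            break
    return table


table = parse_emoji_test(SAMPLE)
print(len(table))  # 3
```

Taking entries in file order means the table starts with the smiley group, where many faces look alike, which is the concern raised in the comment above.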
        
       | jasperry wrote:
       | I was disappointed when I saw it wasn't emojis drawn to look like
       | Paul Graham.
        
         | sillysaurusx wrote:
         | I've been waiting for this moment for literally years.
         | 
         | Long ago, I made http://github.com/strayptr/memes
         | 
         | If you scroll down to Kappa, you'll see a Lambda, which is pg
         | in the style of twitch.tv's Kappa emote.
         | 
         | https://cloud.githubusercontent.com/assets/12214175/7581578/...
        
       | mathiasrw wrote:
       | This is genius.
       | 
       | Data encoded in base1024 (here using the 1024 safe chars
       | represented by emojis) gives much more efficient storage usage.
       | 
       | 16 kB raw data encoded in base64
       | ceil(16*1024/3)*4 = 21848 bytes long ~= 21.8kB.
       | 
       | 16 kB raw data encoded in base1024
       | ceil(16*1024/9)*10 = 18210 bytes long ~= 18.21kB.
       | 
       | So base64 needs about 20% more data storage than base1024 and
       | both can be used anywhere utf8 is supported.
       | 
       | Let the baseEmoji revolution begin...
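
The size comparison depends on whether symbols or stored bytes are counted. The Python sketch below separates the two for the same 16 KiB input. It assumes every emoji in the table is a single code point that takes 4 bytes in UTF-8, as in the `\xF0\x9F\x98\x81` example earlier in the thread; some emoji are shorter and sequences are longer, so the byte figure is only an estimate.

```python
import base64
import math

raw = bytes(16 * 1024)  # 16 KiB of input, as in the comment above

# Base64: 4 ASCII characters (1 byte each) per 3 input bytes.
b64_bytes = len(base64.b64encode(raw))

# Base1024: one symbol per 10 input bits.
symbols = math.ceil(len(raw) * 8 / 10)

# Assumption: each symbol is one 4-byte UTF-8 code point.
BYTES_PER_SYMBOL = 4
emoji_bytes = symbols * BYTES_PER_SYMBOL

print(b64_bytes)    # 21848
print(symbols)      # 13108
print(emoji_bytes)  # 52432
```

Under these assumptions the emoji form is shorter when counted in symbols (13108 against 21848 characters) but larger when counted in UTF-8 bytes, so any storage benefit depends on limits expressed in characters rather than bytes.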
        
       ___________________________________________________________________
       (page generated 2021-01-21 23:02 UTC)