[HN Gopher] Pg-Emoji
___________________________________________________________________
Pg-Emoji
Author : JoelJacobson
Score : 41 points
Date : 2021-01-21 15:16 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jszymborski wrote:
| So, am I right in my understanding that this is meant to add
| support for emojis not present in UTF8?
| Ancapistani wrote:
| I don't _think_ so.
|
| Without digging into the source - and I intend to do that, if
| someone more familiar with this doesn't chime in - it appears
| that it's targeted at reducing resource consumption.
|
| UTF-8 can encode emoji fine. Consider 😁 ("grinning face with
| smiling eyes"), which is `\xF0\x9F\x98\x81` in bytes. That's
| four bytes. From the pg-emoji Readme:
|
| > A lookup-table is constructed from the first 1024 emojis from
| [https://unicode.org/Public/emoji/13.1/emoji-test.txt], where
| each emoji maps to a unique 10 bit sequence.
|
| > The input data is split into 10 bit fragments, mapped to the
| corresponding emojis.
|
| If my understanding is correct thus far, then instead of
| storing four bytes for each emoji, you'd only need 10 bits.
|
| I don't know where this would be worthwhile.
|
| I'm further confused by the purpose of `to_text()` and
| `from_text()`. Their example shows a string composed of mostly
| Latin characters being encoded into a string of emoji and back.
| JoelJacobson wrote:
| > I'm further confused by the purpose of `to_text()` and
| `from_text()`. Their example shows a string composed of
| mostly Latin characters being encoded into a string of emoji
| and back.
|
| This is meant to be used if you want to pass some text
| containing escape characters or perhaps JSON. Note also that
| the first emoji is a checksum, which might be useful if you
| want to make sure a user correctly copy/pasted a string, as
| opposed to sending a raw text string (without checksum).
| Ancapistani wrote:
| > This is meant to be used if you want to pass some text
| containing escape characters or perhaps JSON.
|
| I guess I don't understand how this is an improvement.
| Perhaps it's because I typically interact with the DB
| through a language-specific library/protocol like Python's
| DB API, which handles escaping strings and parameterization
| without my really having to think about it.
|
| Could you provide a specific example of when this might
| solve a real-world problem?
| JoelJacobson wrote:
| One intended use-case for the from_text()/to_text()
| functions is when information is manually copied from
| some place and pasted somewhere else, where you are
| worried the user might make a mistake and select the
| wrong piece of text.
|
| For instance, if you instruct the user to copy "this text
| string" and paste it somewhere, some users might copy the
| text string with the double-quotes and some without them.
| By emoji-encoding the string instead, the receiver of the
| copied emoji string can detect if not all emojis were
| copied.
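The copy/paste check described above can be sketched as follows. This is a minimal illustration, not pg-emoji's algorithm: the additive checksum, the stand-in `ALPHABET`, and the helper names `seal` and `is_intact` are all assumptions, and a plain-text payload is used for readability where pg-emoji would carry an emoji string.

```python
# Stand-in symbol set (assumption): 1024 consecutive code points from U+1F300.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]


def checksum_symbol(payload: str) -> str:
    # Toy checksum (assumption): sum of the UTF-8 bytes modulo 1024.
    return ALPHABET[sum(payload.encode("utf-8")) % 1024]


def seal(payload: str) -> str:
    # Prepend the checksum symbol, mirroring "the first emoji is a checksum".
    return checksum_symbol(payload) + payload


def is_intact(pasted: str) -> bool:
    # The first symbol must match the checksum of everything after it.
    return bool(pasted) and pasted[0] == checksum_symbol(pasted[1:])


sealed = seal("this text string")
print(is_intact(sealed))              # complete copy
print(is_intact(sealed[:-1]))         # last character not selected
print(is_intact('"' + sealed + '"'))  # surrounding quotes copied too
```

A real checksum would be stronger than a byte sum; the point is only that the receiver can reject a partial or over-selected paste without knowing the original string in advance.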
| Ancapistani wrote:
| This strikes me as data validation, which should reside
| in the application layer - I don't see how pg-emoji helps
| in any way.
|
| Further... if the receiver can validate the encoded
| string itself, they implicitly already have the string.
| Why require the user to copy/paste at all? If you meant
| "Ensure that the user hasn't copied quotation marks as
| well", then we're back to it being application logic.
|
| If I'm understanding correctly that the primary benefit
| is that there is a checksum, then there are already many
| solutions for this in common use - Base58Check, as
| used to ensure the validity of Bitcoin addresses, comes
| immediately to mind. I wrote an implementation of that
| quite a while ago:
| https://github.com/lyndsysimon/cryptocoin/blob/primary/crypt...
|
| Please don't misunderstand, I'm in no way intending to be
| argumentative. I don't understand the practical use of
| this project, which leads me to believe that there is a
| problem being solved that I lack the context to identify.
| JoelJacobson wrote:
| In my case, PostgreSQL is the application layer, the
| application is written in database functions, and I'm
| using PostgREST to expose it to my front-end.
| MrStonedOne wrote:
| this is not an emoji support system or a system for storing
| emojis.
|
| It's a system for encoding data _as_ emojis: "this is a
| string" => some emojis, i.e. baseemoji or base1024.
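A minimal sketch of such a base1024 encoding, following the README's description of 10-bit fragments. The `ALPHABET` here is a stand-in of 1024 consecutive code points, not the table pg-emoji builds from emoji-test.txt, and the padding rule and the explicit `nbytes` argument are assumptions.

```python
# Stand-in alphabet (assumption): 1024 consecutive code points from U+1F300.
# pg-emoji instead takes the first 1024 emojis from emoji-test.txt.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Split data into 10-bit fragments and map each fragment to one symbol."""
    nbits = len(data) * 8
    pad = (-nbits) % 10  # zero bits appended to fill the last fragment
    value = int.from_bytes(data, "big") << pad
    count = (nbits + pad) // 10
    return "".join(
        ALPHABET[(value >> (10 * (count - 1 - i))) & 0x3FF] for i in range(count)
    )


def decode(text: str, nbytes: int) -> bytes:
    """Reverse encode(); nbytes is the original length in bytes."""
    value = 0
    for symbol in text:
        value = (value << 10) | INDEX[symbol]
    pad = len(text) * 10 - nbytes * 8  # drop the padding bits again
    return (value >> pad).to_bytes(nbytes, "big")


encoded = encode(b"this is a string")
print(len(encoded), decode(encoded, 16))
```

The length is passed explicitly because the symbol count alone is ambiguous: in this sketch a 4-byte and a 5-byte input both produce four symbols.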
| sfeng wrote:
| I believe it is an encoding for data into base 1024, using
| emoji as the symbol set. Similar in concept to Base64, meant to
| encode data into a format which can be sent anywhere ASCII is
| acceptable. This, one could think, would allow you to do the
| same thing more efficiently but with systems which accept
| emoji.
| jfk13 wrote:
| I'd expect that most places where emoji are accepted and
| reliably preserved, you could also use things like Han
| ideographs, which would give you a much larger symbol set to
| work with.
| WorldMaker wrote:
| One reason to pick emoji is visual distinctiveness and user
| familiarity. While admittedly there are large populations
| familiar with CJK ideographs and their
| construction/deconstruction, there are many more people
| familiar with emoji at this point. In the case of an
| encoding error or trying to visually "diff" two encodings,
| many audiences will spot emoji differences and/or problems
| with badly encoded emoji (much easier than they might spot
| differences in CJK ideographs).
|
| (Admittedly there are still issues within the emoji space:
| some of the "faces" are quite similar in appearance in many
| fonts and still easily confused. Plus, in the larger
| emoji space the subtle differences of skin color/gender can
| be easily confused if you have to rely on them for
| distinction. Restricting to only 1024 emoji and fewer ZWJ
| sequence variations presumably takes care of most of those
| issues.)
| roywiggins wrote:
| There's always base65536
|
| https://github.com/qntm/base65536
| cmeacham98 wrote:
| It's a novelty fun project, it serves no practical purpose.
| It's an encoding scheme similar to base64 or URL encoding, but
| one that does nothing useful.
| jeltz wrote:
| A similar technique but with a different set of emojis is
| used by the Element chat client to verify signatures for end
| to end encryption.
|
| https://matrix.org/docs/spec/client_server/latest#sas-method...
| alex_duf wrote:
| I think it's a tongue in cheek project that isn't meant to be
| used in any production system.
|
| I'm pretty sure you can already put emojis in the text fields
| of postgres. (or at least I'd be surprised if you couldn't)
| Ancapistani wrote:
| > I think it's a tongue in cheek project
|
| That would certainly make more sense than anything I've been
| able to glean from it.
|
| I've definitely used Postgres text columns to store user-
| provided text values that included emoji in the past. They're
| part of my standard test case for any user input.
| JoelJacobson wrote:
| The idea is to encode binary strings in a visually shorter
| form than e.g. hex, and also make it easier to visually
| detect differences. It's also possibly easier to remember a
| bunch of emojis than a hex string.
| jfk13 wrote:
| Not really. Do you really notice whether someone uses
| "Grinning face with smiling eyes" or just plain "Grinning
| face"? Or was it "Grinning face with big eyes", or maybe
| "Beaming face with smiling eyes". Or were they "squinting"
| eyes? Maybe the face was just "smiling", not "grinning".
| Sheesh.
| jeltz wrote:
| If you select a good set of clearly unambiguous emojis I
| could see the use for it. See for example the Matrix spec
| which recommends using emojis for verifying E2EE
| signatures.
|
| https://matrix.org/docs/spec/client_server/latest#sas-method...
| jfk13 wrote:
| Yes, a carefully selected set of 64 symbols would be much
| more sensible from that point of view. This project,
| though, apparently uses "the first 1024 emojis from
| [https://unicode.org/Public/emoji/13.1/emoji-test.txt]",
| which is an entirely different matter.
| jasperry wrote:
| I was disappointed when I saw it wasn't emojis drawn to look like
| Paul Graham.
| sillysaurusx wrote:
| I've been waiting for this moment for literally years.
|
| Long ago, I made http://github.com/strayptr/memes
|
| If you scroll down to Kappa, you'll see a Lambda, which is pg
| in the style of twitch.tv's Kappa emote.
|
| https://cloud.githubusercontent.com/assets/12214175/7581578/...
| mathiasrw wrote:
| This is genius.
|
| Data encoded in base1024 (here using the 1024 safe chars
| represented by emojis) gives much more efficient storage usage.
|
| 16 kB raw data encoded in base64
| ceil(16*1024/3)*4 = 21848 bytes long ~= 21.8kB.
|
| 16 kB raw data encoded in base1024
| ceil(16*1024/9)*10 = 18210 bytes long ~= 18.21kB.
|
| So base64 needs about 20.0% more data storage than base1024 and
| both can be used anywhere utf8 is supported.
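One way to check these figures is to count symbols and UTF-8 bytes separately. The sketch below assumes 10 bits of payload per base1024 symbol (per the README quoted earlier), 1 byte per base64 character, and 4 bytes per emoji in UTF-8, as with `\xF0\x9F\x98\x81` earlier in the thread. Emoji in the real table may take fewer or more bytes.

```python
import math

raw_bytes = 16 * 1024

# base64: every 3 input bytes become 4 ASCII characters (1 byte each).
base64_chars = 4 * math.ceil(raw_bytes / 3)

# base1024: every symbol carries 10 bits of the input.
base1024_symbols = math.ceil(raw_bytes * 8 / 10)

# Assumption: each emoji occupies 4 bytes when stored as UTF-8.
base1024_utf8_bytes = 4 * base1024_symbols

print(base64_chars)         # 21848
print(base1024_symbols)     # 13108
print(base1024_utf8_bytes)  # 52432
```

Under these assumptions base1024 needs fewer symbols than base64, but more bytes once the symbols are stored as UTF-8.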
|
| Let the baseEmoji revolution begin...
___________________________________________________________________
(page generated 2021-01-21 23:02 UTC)