[HN Gopher] Sqids - Generate Short Unique IDs from Numbers
___________________________________________________________________
Sqids - Generate Short Unique IDs from Numbers
Author : vyrotek
Score : 285 points
Date : 2023-11-25 17:30 UTC (5 hours ago)
(HTM) web link (sqids.org)
(TXT) w3m dump (sqids.org)
| dfc wrote:
| It's weird under "Get Started" they have links to 40 different
| languages. You can only get started with 15 of the 40 languages
| listed, the other 25 are skeleton repos asking for people to
| star the repo to indicate interest.
| hooverd wrote:
| Maybe a slam dunk first FOSS contribution?
| ctoth wrote:
| This seems like a perfect use case for an LLM :)
| LeFever wrote:
| It's kinda clever. The people most likely to look at this
| project are also likely ideal candidates for implementing the
| library in a new language (Developer, FOSS enthusiast,
| interested in the project, need the library in a language
| they're familiar with that isn't implemented yet).
|
| Also, the language pills differentiate between those that have
| been implemented (color logo, dark text, bold) and those that
| aren't (grayscale).
| 4kimov wrote:
| Good points. Those pages also contain links to old
| implementations (Hashids), because a lot of projects still
| use those and want to be able to find them.
| alas44 wrote:
| Also can help track which languages people click on, probably
| a good proxy of where there would be the need to develop a
| lib
| vyrotek wrote:
| The approach definitely works. Some time ago I saw .NET listed
| but discovered it wasn't complete. I was eager to replace an
| existing Hashids implementation so I made some comments, shared
| a starter-snippet, and then someone was excited enough to
| complete it in just a few days. It was great to see how quickly the
| community stepped in. Maybe there was a bit of Cunningham's Law
| in effect with my contribution, ha.
|
| https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-...
| c2xlZXB5Cg1 wrote:
| Reminds me of proquints https://github.com/dsw/proquint
|
| But 127.0.0.1 looks more "readable" to me than lusab-babad
| whalesalad wrote:
| This used to have a totally different name iirc, they used to be
| called hashids
| resoluteteeth wrote:
| Yeah, it says that both in the page title and the logo at the
| upper left
| no_wizard wrote:
| I like the idea, though I use nanoid with the safe letter
| dictionary (it excludes letters used for profanity[0])
|
| They should use a similar dictionary approach IMO because I
| looked at the implementation and it's hardcoded to look for "bad"
| words
|
| Otherwise looks real straightforward! I'd love to see some
| performance test suites for it
|
| [0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932...
|
| [1]: though with UUID v4 so common to generate and well optimized
| in most languages I wonder if these userland solutions are really
| better. You can always generate a UUID and re-encode with base32
| or base64, which is also well optimized in most languages
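|
| A minimal sketch of the "generate a UUID and re-encode it" idea, in
| Python and using only the standard library. The helper names
| short_id_from_uuid and uuid_from_short_id are made up for this
| illustration; the padding is dropped and lowercased purely for looks.

```python
import base64
import uuid


def short_id_from_uuid(u: uuid.UUID) -> str:
    """Base32-encode the 16 raw bytes of a UUID and drop the '=' padding."""
    return base64.b32encode(u.bytes).decode("ascii").rstrip("=").lower()


def uuid_from_short_id(s: str) -> uuid.UUID:
    """Reverse of short_id_from_uuid: restore the padding, then decode."""
    padded = s.upper() + "=" * (-len(s) % 8)
    return uuid.UUID(bytes=base64.b32decode(padded))


u = uuid.uuid4()
s = short_id_from_uuid(u)
print(u, "->", s)
```

| The result is 26 characters instead of the usual 36, and it decodes
| back to the same UUID without any lookup table.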
| lxgr wrote:
| > it excludes letters used for profanity
|
| That doesn't seem possible. How would that work?
|
| > I looked at the implementation and it's hardcoded to look for
| "bad" words.
|
| If you mean https://github.com/y-gagar1n/nanoid-good, that
| seems to be doing the same thing.
|
| In general, I'm a bit wary of solutions that "guarantee no bad
| words" - this is usually highly language-specific: One
| language's perfectly acceptable name is another language's
| swear word.
| no_wizard wrote:
| This is the implementation:
| https://github.com/CyberAP/nanoid-dictionary
|
| We use it in a highly internationalized product spanning
| multiple languages and haven't yet run into a complaint or
| value on audit that would constitute something offensive in any
| language, per our intl content teams anyway.
|
| That isn't to say it's 100% (and simply enough we don't audit
| every single URL) but I suspect we would have gotten at least
| a user heads up by now
|
| Nevertheless, we are moving our approach to UUIDs that get
| base32 encoded for some of our use cases. They're
| easier for us to work with in many scenarios
| Sharlin wrote:
| Omit vowels and you're 90% of the way there; omit the vowel-
| looking digits 0,1,3,4 and you're probably >99% of the way
| there.
| gberger wrote:
| fxck
| Sharlin wrote:
| Which is, evidently, why nanoid also excludes x and X,
| as well as v and V (fvck).
| Silasdev wrote:
| It's particularly funny because their example docs for .NET
| outputs "B4aajs", which to any Swedish l33t speaking
| individual, would read "Bajs", which means "shit"
| livrem wrote:
| Looks like the dictionaries used are from this file?
|
| https://registry.npmjs.org/naughty-words/-/naughty-words-1.2...
|
| From a quick look, the lists are pretty short, except for the
| one with English words that at least has some 404 words, but
| I can imagine there are far more bad words that you want to
| avoid than just those?
| ape4 wrote:
| Here's the C++ of the sqid blocked words
| https://github.com/sqids/sqids-cpp/blob/main/include/sqids/b...
| njharman wrote:
| > That doesn't seem possible. How would that work?
|
| agree; b00b, DlCK, cntfcker
|
| But I suppose, if the user doesn't get to craft the input, the
| collision space of converted numerical ids and words like
| above is sufficiently small to be ignorable.
| Sharlin wrote:
| Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X,
| v, V, and other lookalikes, so the chances of generating
| something naughty in _any_ language are close to zero.
| tttp wrote:
| I tried something similar with a fixed alphabet that guarantees
| no profanity and a checksum (luhn)
|
| https://github.com/tttp/dxid
| dumbo-octopus wrote:
| Odd design decision in that if you provide your own blocklist, it
| overwrites their (extensive) default list instead of adding to
| it.
|
| And in general the algorithm is surprisingly complicated for
| something that could be replaced with simply base64 encoding, the
| given example (1,2,3) base64 encodes to a string with just one
| more letter than this algorithm.
|
| That said I do appreciate the semicolon-free-style. I don't
| typically see that in libs besides my own.
|
| https://github.com/sqids/sqids-javascript/blob/main/src/sqid...
| 8organicbits wrote:
| The problem is their block list will change over time. If you
| don't override it, then your IDs won't decode right when you
| update. This is a huge risk.
|
| > You have to account for scenarios where a new word might be
| introduced to the default blocklist
|
| https://sqids.org/faq#future-blocklist
|
| Honestly, I think they need to rethink this. Otherwise you've
| got different library versions for different languages each
| using different default blocklists, none of which are
| compatible.
| jsf01 wrote:
| What's the use case for passing in an array of numbers? Typically
| when generating an ID my input is either a single random number,
| a string that's being hashed, or nothing at all.
| 4kimov wrote:
| [shard_number, primary_id_number, timestamp]
| dumbo-octopus wrote:
| But then why not just arbitrary text?
| James_K wrote:
| I guess they haven't heard of base-64.
| xjia wrote:
| Or base58, e.g.
| https://api.rubyonrails.org/classes/SecureRandom.html#method...
| dymk wrote:
| That doesn't solve the same set of problems as TFA. Randomized
| output order for sequential input, skips IDs that include
| profanity.
| majkinetor wrote:
| One of the points is also to use custom alphabet.
| canU4 wrote:
| Sad that it is not for user ids
| 8organicbits wrote:
| I think that's only if you don't want to leak user count when
| your ID is an autoincrement. Elsewhere people mention
| cryptographically remapping integers, which could work (by
| cryptographicly remapping integers, which could work (by
| itself, or before passing the ID to sqids).
| packetlost wrote:
| The name (but not function) seems really close to squuids from
| Datomic/Clojure.
| 3cats-in-a-coat wrote:
| I don't get it, that's like two lines of code, why does it have a
| library and even a domain
| k2xl wrote:
| Also wondering this
| its-summertime wrote:
| For a similar thing, (X bytes to X bytes, no collisions)
| https://en.wikipedia.org/wiki/Format-preserving_encryption is a
| good page
| jchook wrote:
| Also see Knuth Hash and k-dimensional equidistribution.
| habitue wrote:
| Skipping profanity seems like a liability in this design. It
| means in order to preserve the encoding you need to make the
| banned word list immutable, otherwise old sqids will decode to
| the wrong thing when you get them back.
| Etheryte wrote:
| I don't think this holds, you can enforce filtering in the
| encoding step, i.e. be strict about what you output, but always
| decode, even if the input is profanity. This means you can also
| be backwards compatible if you update the list etc. So in
| short, the old maxim of be strict about your outputs and
| lenient about your inputs.
| fimdomeio wrote:
| From their FAQ: "The best way to ensure your IDs stay
| consistent throughout future updates is to provide a custom
| blocklist, even if it is identical to the current default
| blocklist."
| Etheryte wrote:
| In that case it sounds like a shortcoming on their part.
| There is no fundamental reason to have that limitation. I
| understand it can make the implementation easier to not
| have it, but in my opinion being blocklist change agnostic
| would be a much better value offering.
| lights0123 wrote:
| The *encoding* changes. The decoding stays consistent:
|
| > Decoding IDs will usually produce some kind of numeric
| output, but that doesn't necessarily mean that the ID is
| canonical. To check that the ID is valid, you can re-encode
| decoded numbers and check that the ID matches.
|
| The reason this is not done automatically is that if the
| default blocklist changes in the future, we don't want to
| automatically invalidate the ID that has been generated in
| the past and might now be matching a new blocklist word.
| runlevel1 wrote:
| The stupid simple way I did this ages ago was:
|
| 1. Start with a-z.
|
| 2. Drop all vowels, numbers, most homoglyphs, and the letter
| 'x'.
|
| 3. Map digits 0-9 to one of the remaining letters.
|
| 4. Stringify the integer and replace the digit in each decimal
| place with its corresponding character.
|
| For my use-case, all the numbers were >7 digits long, so the
| odds of you getting an offensive acronym were reasonably low
| unless you started combining them.
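|
| The four steps above can be sketched in Python roughly as follows. The
| ten letters in DIGIT_LETTERS are an arbitrary pick for illustration
| (no vowels, no 'x', no 'l'); the original comment does not say which
| letters were actually used.

```python
# Hypothetical digit-to-letter table: position i holds the letter for digit i.
DIGIT_LETTERS = "bdghjkmnpr"

_ENCODE = str.maketrans("0123456789", DIGIT_LETTERS)
_DECODE = str.maketrans(DIGIT_LETTERS, "0123456789")


def encode(n: int) -> str:
    """Stringify n in decimal and swap each digit for its letter."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    return str(n).translate(_ENCODE)


def decode(s: str) -> int:
    """Swap each letter back to its digit and parse the decimal string."""
    if not s or any(ch not in DIGIT_LETTERS for ch in s):
        raise ValueError("input contains characters outside the alphabet")
    return int(s.translate(_DECODE))


print(encode(12345678))  # -> dghjkmnp
```

| The output is exactly as long as the decimal number, so this buys
| readability rather than compactness, and consecutive inputs still
| produce visibly consecutive outputs.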
|
| But there's no perfect solution. As this dataset shows, you can
| find offense in almost anything if you look hard enough:
|
| California Personalized License Plate Requests Flagged for
| Review 2015-2016:
| https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...
| arp242 wrote:
| Many of those reviewer comments are utterly moronic. And that
| is my _polite_ opinion.
|
| How does this work? Is there a review board? Is it put to
| public review? A few of them like "dick out" and "shtlord"
| are reasonable, but many of them seem so bonkers it looks
| like the work of trolls.
|
| Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that
| Octocat really means "eight vaginas".
| air7 wrote:
| > California Personalized License Plate Requests Flagged for
| Review 2015-2016:
| https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...
|
| Wow, this is a funny peek into a weird predicament where people
| need to justify that they have a good reason to have a
| specific license plate.
|
| Some seem obviously OK, such as:
|
| INT13H
|
| 314 PI
|
| And some are obviously not:
|
| DRY(hand emoji)JOB
|
| DICK OUT
|
| Come to think of it: Can license plates have emojis now?!
| 8organicbits wrote:
| Agreed, this is a big risk, made worse by the fact that the default word
| list can change over time.
|
| https://sqids.org/faq#future-blocklist
| kaetemi wrote:
| It's a base62 encoder that takes multiple integers as input.
| Probably a bit-length prefixed encoding. I am assuming it just
| pads an extra junk integer to re-roll the encoded number.
| 8organicbits wrote:
| The mention of one-time passcodes seems odd. Those need to be
| unguessable, but don't need to be unique. If you supply a
| suitable random source, then I suppose it works, but the "padded
| with junk" feature makes these look more complex than they really
| are.
|
| The standard choice of 4 to 8 random digits works well and it's
| clear what level of security they provide. Digits are easier to
| understand than case sensitive latin characters, especially when
| your native language uses a different character set.
| progne wrote:
| In a Ruby app we just convert to a high base, like
| > 1234567890.to_s(36) => "kf12oi"
|
| That gets us most of the way there, but Sqids has a Ruby library
| and lets you set a much higher base, including upper case
| characters, and I suppose, emoji. We're going to need much bigger
| numbers before that space savings makes much difference. I like
| it, but it's hard to know when something like that is worth
| adding a dependency.
| vyrotek wrote:
| I believe a big part of the idea is for the hash to be
| unpredictable as well.
|
| If I figure out you're using (36) then I know the next number
| 1234567891 is "kf12oj".
|
| Not the case with Sqids.
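|
| The predictability point can be reproduced in a few lines of Python.
| Python has no built-in equivalent of Ruby's Integer#to_s(36), so
| to_base36 below is a small hand-written helper (the name is made up
| for this sketch); decoding uses the built-in int(s, 36).

```python
import string

# Digits 0-9 followed by a-z, the same digit set Ruby uses for base 36.
BASE36_DIGITS = string.digits + string.ascii_lowercase


def to_base36(n: int) -> str:
    """Convert a non-negative integer to its base-36 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(BASE36_DIGITS[rem])
    return "".join(reversed(out))


print(to_base36(1234567890))  # -> kf12oi
print(to_base36(1234567891))  # -> kf12oj
print(int("kf12oi", 36))      # -> 1234567890
```

| Anyone who guesses the base can walk the ID space in order, which is
| the behavior being contrasted with Sqids here.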
| hot_gril wrote:
| You can easily brute-force this. Sqids also says it's not
| good for sensitive data.
| 8organicbits wrote:
| It looks like an easy brute force too; there are no compute-hard
| operations here. I guess you could scramble your
| alphabet? Otherwise Uk always comes after bM, etc.
| echelon wrote:
| I'd prefer to use crockford-encoded entropy with Stripe-style
| token prefixes to create unique ID namespaces. Run it through
| a bad words filter, and it's perfect.
|
| user_1hrpt0xpax7ps
|
| file_xpax7psaz0tv6az0tv6
|
| Etc.
|
| In distributed systems you can use the trailing bytes to
| encode things like author cluster, in case you're active-
| active and need to route subsequent writes before create
| event replication.
|
| Easy to copy, debug, run ops/incall against. If you have an
| API, they're user-friendly.
|
| Of course you still want to instruct people the prefixes are
| opaque.
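|
| A rough Python sketch of prefixed, Crockford-style random tokens like
| the ones above. new_token and its default length of 13 are choices
| made for this example, and the bad-words filter and the trailing
| cluster bytes mentioned in the comment are left out.

```python
import secrets

# Crockford base32 alphabet, lowercased: digits plus letters without i, l, o, u.
CROCKFORD = "0123456789abcdefghjkmnpqrstvwxyz"


def new_token(prefix: str, length: int = 13) -> str:
    """Return '<prefix>_<random body>' with 5 bits of entropy per character."""
    body = "".join(secrets.choice(CROCKFORD) for _ in range(length))
    return f"{prefix}_{body}"


print(new_token("user"))
print(new_token("file", length=19))
```

| Unlike Sqids, these tokens carry no decodable payload, so they have to
| be stored alongside the record they identify.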
| wombatpm wrote:
| Yeah don't forget the bad words filter. I worked on an IKEA
| mailing where the list processing house was adding an
| autogenerated discount code to the address label. The
| customers received codes with BOOB, DICK, TWAT, and CUNT
| embedded within. People were not happy.
| otteromkram wrote:
| Did they never make an IKEA purchase after that or did
| they get over it like a normal adult?
|
| I don't work retail, but something tells me people will
| make a stink out of just about anything if it meant
| potentially free products or other compensation.
|
| Plus, are you filtering just English curse words or all
| curse words for countries that use Latin characters?
| pelagicAustral wrote:
| Correct me if I'm wrong, but, It cannot be unpredictable,
| which makes the library redundant for security concerns,
| which would be the one business case to seek for anything
| other than a UUID (which is already built into Ruby).
| paulddraper wrote:
| No, squids are predictable
| candiddevmike wrote:
| BaseEmoji is a thing: https://github.com/amoallim15/base-emoji
| exxos wrote:
| I didn't think of that, but this is a nice trick!
| 1-6 wrote:
| Sqids vs Squids. Missing the 'U' for unique but nevertheless a
| unique shortened version of the regular spelling.
| urza wrote:
| I wanted to say that I use similar project called HashIDs, but I
| see that HashIDs rebranded to Sqids :)
| ComputerGuru wrote:
| I haven't been able to find a case for this because ids either
| need to be unique or they're not going to be large. If they're
| unique, I'm using uuid or ulid (uuidv7 of tomorrow) as the
| sortable primary key type to avoid conflicts without using the db
| to generate and maintain sequences.
|
| Where do you have unique ids that aren't the primary key? I would
| be more interested in a retrospectively unique truncated encoding
| for extant ulid/uuid; ie given that we've passed timestamp foo,
| we know that (where no external data is merged) we only need a
| bucketed time granularity of x for the random component of the id
| to remain unique (for when sortability is no longer needed).
|
| Or just more generally a way to convert a ulid/uuidv7 to a
| shorter sequence if we are using it for external hash table
| lookups only and can do without the timestamp component.
| Bytewave81 wrote:
| The idea is that you encode and decode database IDs with this.
| You wouldn't save them separately unless you were using it for
| a purpose other than shareable "identifiers" which don't leak
| significant amounts of database state. Imagine something like a
| link shortener where you want to provide a short link to users,
| but don't want it to just be a number.
| swyx wrote:
| why is ulid the uuidv7 of tomorrow?
| waffle_ss wrote:
| I wrote a Ruby gem to address this problem of hiding sequential
| primary keys that uses a Feistel network to effectively shuffle
| int64 IDs: https://github.com/abevoelker/gfc64
|
| So instead of /customers/1
| /customers/2
|
| You'll get something like
| /customers/4552956331295818987
| /customers/3833777695217202560
|
| Kinda similar idea to this library but you're encoding from an
| integer to another integer (i.e. it's format-preserving
| encryption). I like keeping the IDs as integers without having to
| reach for e.g. UUIDs
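|
| For readers unfamiliar with the technique, here is a generic sketch of
| a balanced Feistel network over 64-bit integers in Python. It is not
| the gfc64 implementation: the HMAC-SHA256 round function, the round
| count of 4, and the names encrypt_id and decrypt_id are all
| assumptions made for illustration, and it has not been reviewed as
| real cryptography.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def _round(half: int, key: bytes, i: int) -> int:
    """Keyed round function: 32-bit half in, 32 pseudo-random bits out."""
    msg = bytes([i]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def encrypt_id(n: int, key: bytes, rounds: int = 4) -> int:
    """Map a 64-bit integer to another 64-bit integer, reversibly."""
    if not 0 <= n < 1 << 64:
        raise ValueError("n must fit in 64 bits")
    left, right = n >> 32, n & MASK32
    for i in range(rounds):
        left, right = right, left ^ _round(right, key, i)
    return (left << 32) | right


def decrypt_id(n: int, key: bytes, rounds: int = 4) -> int:
    """Undo encrypt_id by running the rounds in reverse order."""
    left, right = n >> 32, n & MASK32
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, key, i), left
    return (left << 32) | right


key = b"demo-key-change-me"
for customer_id in (1, 2, 3):
    print(customer_id, "->", encrypt_id(customer_id, key))
```

| Because each round is invertible regardless of the round function,
| the mapping is a permutation: distinct inputs never collide. The
| outputs here are unsigned, so a signed int64 column would need an
| extra conversion step.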
| chupapimunyenyo wrote:
| Hashids seems way better than their new implementation
| Use wrote:
| Why should you hide your user count?
| sneak wrote:
| The rate of change over time can be used against you; many
| people consider their businesses' month-over-month growth (or
| lack thereof) to be private information.
|
| "$WEBSITE did 50,000 signups a month during the beginning of
| the pandemic, but now struggles to sign up a thousand a week"
| is a story.
| Schnitz wrote:
| It would be great to have a quick primer on why this is better
| than what people typically homebrew, like base62 encoding a
| random number.
| sneak wrote:
| Database PKs usually aren't random, which AFAIK is what is
| usually used as the number in this case.
| 8organicbits wrote:
| If you use a random number then you need to store it somewhere
| to map back to the original. Sqids is an encoding, you can
| decode the sqid back to the original without storage overhead.
|
| Features like the profanity filter avoid creating URL routes
| like /user/cuntFh.
|
| Cross language support allows interop between the encoder and
| decoder across microservices written in different languages.
| parhamn wrote:
| Side note: there are some business insights you can get from a
| company using serial ids.
|
| i.e if you sign up and get user id 32588 and make another account
| a few days later, you can tell the growth rate of the company.
|
| And this is possible with every resource type in the application.
|
| I do wonder how much the url bar junk thing matters these days. I
| tend to use ULIDs (waiting on wide UUID v7 adoption), and
| they're a bit ugly, but most browsers hide most of the urls now
| anyway. The fact that there is a builtin time component comes in
| clutch sometimes (e.g. object merging rules).
| pacificmint wrote:
| > you can tell the growth rate of the company.
|
| You can even do this when you don't know the exact interval by
| using probabilities. The Allies used this method to estimate
| German tank production in World War II by analyzing the serial
| numbers of captured or destroyed tanks.
|
| This is known as the German Tank Problem [1]
|
| [1] https://en.wikipedia.org/wiki/German_tank_problem
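|
| The standard frequentist estimate from that article is
| N = m + m/k - 1, where m is the largest serial number seen and k is
| the number of samples. A small Python sketch, assuming the observed
| IDs are distinct, start at 1, and were sampled more or less uniformly
| (estimate_total is a name made up here):

```python
def estimate_total(serials: list[int]) -> float:
    """Estimate the population size from a sample of sequential IDs."""
    if not serials:
        raise ValueError("need at least one observed serial number")
    k = len(serials)   # sample size
    m = max(serials)   # largest serial number observed
    return m + m / k - 1


# Four observed IDs, the largest being 60.
print(estimate_total([19, 40, 42, 60]))  # -> 74.0
```

| Applied to sign-up IDs, a handful of observed user IDs is enough for a
| ballpark figure of the total user count.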
| lhamil64 wrote:
| It also makes it slightly easier to perform certain attacks
| since it's trivial to figure out other IDs.
| paulddraper wrote:
| > most browsers
|
| Not chrome...
|
| Also, links are a thing in chat, etc.
| parhamn wrote:
| Here's what a recent YouTube (which Sqids documents as a
| sample use case) link I shared looked like:
|
| > https://www.youtube.com/watch?v=fFMzQ3tYTFU&pp=ygURY2hQImVzZ...
|
| Or Twitter:
|
| > https://x.com/elonmusk/status/172853302828286055507?s=20
|
| Or TikTok:
|
| > https://www.tiktok.com/@<userId>/video/730292574259232054785...
|
| While I tend to strip the tracking params and there are
| extensions that do this, I don't think most people do. These
| URLs are pretty 'ugly'.
|
| So if the links that are being shared most on the internet
| (YT, TikTok, Twitter) don't care, you probably shouldn't
| either. I think the onus is on the UI layers (Chat apps, etc)
| to show urls how they look best on their respective
| platforms.
|
| Edit: to this point, it looks like HN truncates these to make
| them less ugly too.
| swyx wrote:
| saving it to my list of uid implementations
| https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
| bufferoverflow wrote:
| Do we really need a library for that? Shouldn't it be a simple
| function?
| dustingetz wrote:
| anyone have a copy pasta for the widest possible alphabet (i.e.
| extended unicode safe chars)
| filleokus wrote:
| > Not Good For:
|
| > User IDs - Can be decoded, revealing user count
|
| Suppose you don't want to leak the count, what's a reasonable way
| of implementing that?
|
| You can of course have a UUID v7 / ULID or something as the
| primary key. Or have it as a public facing primary key, mapping
| back to a sequential ID PK (there might be some performance hits
| with larger PKs in e.g. Postgres? Or is that just FUD?)
|
| But you could also generate a public ID with something like
| encrypt(seq_id, secret) and then encode it with whatever alphabet
| and or profanity filter you'd like - right? The issue then is
| that all public ID's would be long (and of course dealing with a
| decrypt operation on all incoming requests).
|
| Don't know what's best really.
| 8n4vidtmkvmk wrote:
| Add an offset, multiply by a large prime number, and modulo. I
| don't think you can recover the original number without
| figuring out the prime.
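|
| A sketch of the offset-multiply-modulo idea in Python. The constants
| are placeholders picked for this example (1,000,000,007 is prime and
| odd, so it has an inverse modulo 2**32), and pow(x, -1, m) needs
| Python 3.8 or later. Note that this is obfuscation rather than
| encryption: the outputs for two consecutive inputs differ by exactly
| the multiplier modulo 2**32, so the constant leaks easily.

```python
MODULUS = 2**32          # size of the ID space (assumed 32-bit IDs)
PRIME = 1_000_000_007    # placeholder multiplier, coprime with MODULUS
OFFSET = 12_345          # placeholder offset
INVERSE = pow(PRIME, -1, MODULUS)  # modular inverse of the multiplier


def obfuscate(n: int) -> int:
    """Scramble a sequential ID into a non-obvious one."""
    if not 0 <= n < MODULUS:
        raise ValueError("n is outside the ID space")
    return ((n + OFFSET) * PRIME) % MODULUS


def deobfuscate(m: int) -> int:
    """Recover the original ID using the modular inverse."""
    return (m * INVERSE - OFFSET) % MODULUS


for user_id in (1, 2, 3):
    print(user_id, "->", obfuscate(user_id))
```

| If the IDs must resist someone who can create several accounts in a
| row, a keyed construction such as the encrypt(seq_id, secret) approach
| mentioned above is the safer direction.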
| kryptogeist wrote:
| Damn, those squids are getting smart
| orf wrote:
| How do you adjust or evolve the blocklist with this, without
| making previously generated IDs incorrect?
|
| The ID is simply incremented if it is blacklisted [1]. So the ID
| is fixed to the blacklist content, and adjusting it in any way
| invalidates certain segments of previously generated IDs?
|
| 1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782...
| revenga99 wrote:
| Is there any way to generate short unique IDs from UUIDs?
| Snowflake is incredibly slow when joining UUID => UUID columns.
| andrewstuart wrote:
| What is the decimal range of these values?
| sandstrom wrote:
| Neat library!
|
| We're using randomly generated strings for many things. IDs,
| password recovery tokens, etc. We've generated millions of them
| in our system, for various use-cases. Hundreds of thousands of
| people see them every day.
|
| I've never heard any complaints about a random content-id being
| "lR8vDick4r" (dick) or whatever.
|
| But nowadays our society is so afraid of offending anyone, that
| profanity filters have extended all the way to database IDs and
| password recovery tokens.
|
| (there are some legit cases, like randomly generated IDs for user
| profiles shared in public URLs, that users have to live with, but
| even there just make the min length 8 and you're unlikely to have
| any full-word profanity as the complete ID; put differently, I
| don't understand why they made the block list an opt-out thing)
___________________________________________________________________
(page generated 2023-11-25 23:00 UTC)