[HN Gopher] Representing Enums in PostgreSQL
       ___________________________________________________________________
        
       Representing Enums in PostgreSQL
        
       Author : wojcikstefan
       Score  : 103 points
       Date   : 2023-06-20 12:31 UTC (10 hours ago)
        
 (HTM) web link (making.close.com)
 (TXT) w3m dump (making.close.com)
        
       | jpalomaki wrote:
       | Native enums can be used in queries like strings, but with type
       | checking:                 select * from cust where type =
       | 'company' -- ok       select * from cust where type = 'c0mpany'
       | -- error
       | 
       | As mentioned, they take less space. Important if you use these
       | columns in index and have millions of rows.
        
       | cryptonector wrote:
       | TFA is all about how to make changes where you drop elements of
       | an enum, and how hard that is. The obvious thing though is not
       | covered: don't do that! Instead you should:                 a.
       | Add CHECK() constraints on columns of          the relevant ENUM
       | type checking that          the value is one of the "live"
       | values.       b. RENAME "dead" ENUM values to indicate
       | that they're dead.       c. UPDATE ... SET
       | column_of_that_enum_type = ...          WHERE
       | column_of_that_enum_type IN (dead_values_here) ...
       | 
       | (c) can be done slowly, say, with a LIMIT clause to keep load and
       | disruptions down. Yeah, you might have "broken" data for a while,
       | but you can make your queries report better values for columns of
       | that ENUM type in the same way that the UPDATE would fix them.
        
         | BulgarianIdiot wrote:
         | [dead]
        
         | dikei wrote:
         | Agree, PostgreSQL enum is really just for mapping with an enum
         | in application code. And just like you should not remove an
         | enum or change their ordinal in application code, don't do that
         | to a PostgreSQL enum
        
         | hn_throwaway_99 wrote:
         | Glad to see this as the top comment, completely agree. After
         | reading the article, was thinking that the only real downside
         | of using enum types (but there is one more issue not mentioned,
         | more on that below) is when you need to remove values. In
         | reality, I've found removing enum values to be a very rare
         | occurrence in prod. Removing an enum value fundamentally breaks
         | backwards compatibility, so usually a better option is to
         | essentially deprecate it, which is more like what you
         | recommend.
         | 
         | Note one other thing I have found is that postgres enums are
         | less "portable" than plain text columns. For example, if you're
         | using BigQuery to query postgres tables as an external
         | connection, BQ doesn't support enums, so your external
         | connection query has to cast any enum return values (e.g.
         | some_enum_column::text) before returning them to BQ, which can
         | be a pain if you want to do "SELECT * FROM ..." some table that
         | returns enum columns.
        
           | faangsticle wrote:
           | > BQ doesn't support enums
           | 
           | Sounds like a reason to use views to work around the
           | limitations of BigQuery
        
           | cryptonector wrote:
           | > Note one other thing I have found is that postgres enums
           | are less "portable" than plain text columns. For example, if
           | you're using BigQuery to query postgres tables as an external
           | connection, BQ doesn't support enums, so your external
           | connection query has to cast any enum return values (e.g.
           | some_enum_column::text) before returning them to BQ, which
           | can be a pain if you want to do "SELECT * FROM ..." some
           | table that returns enum columns.
           | 
           | Yeah, one has to write more adapter queries to add those
           | casts to text (or from text, for DMLs), but it's OK, and
           | ENUMs are just really nice. And as u/faangsticle says, use
           | VIEWs for that.
        
           | thom wrote:
           | You can create an implicit cast to and from text with CREATE
           | CAST and it becomes transparent.
        
             | aidos wrote:
             | Oh nice. I didn't know about that feature.
        
         | cropcirclbureau wrote:
         | Good approach. And with Postgres `DOMAIN` types features to DRY
         | up the `CHECK` the live constraint, this should be very
         | palatable. I usually opt for enum tables and regularly find
         | myself needing to declare `archived_at` columns for said
         | tables. Did not know of the `RENAME` command either and what
         | you describe renders native enums less warty.
        
       | rollulus wrote:
       | When would one use this approach versus a domain type [1]? How
       | does it differ?
       | 
       | [1]: https://www.postgresql.org/docs/current/domains.html
        
         | jpmelos wrote:
         | That would be an improvement on the `CHECK` constraint approach
         | if the same `CHECK` constraint is used in multiple columns, to
         | DRY it up. The `DOMAIN` types offer the same commands for
         | `CHECK` constraints, namely `NOT VALID` when updating the
         | constraint, and `VALIDATE CONSTRAINT` to make sure it is valid
         | after having updated it.
        
       | cmcconomy wrote:
       | I'm happy to have a FK to a reftable with int, shortname,
       | description
        
       | Icathian wrote:
       | Little known bug, but enums also choke dump/restore if you use
       | them as a hash partition key.
       | 
       | That said, they can be really nice for all the same reasons
       | static types are nice. Good article!
        
         | ComputerGuru wrote:
         | It sure sounds like a bug to me, too! Is there an upstream bug
         | id? I'd imagine dump/restore should "just work" with anything
         | that can be done in postgres oob.
        
           | Icathian wrote:
           | Donno about a bug Id, but there were a couple mailing list
           | discussions. Here: https://www.postgresql.org/message-
           | id/flat/1376149.167526827...
           | 
           | Disclosure: I reported this particular bug. It was a fun one
           | to run down.
        
         | wojcikstefan wrote:
         | Whew, good to know! Another point towards the `CHECK`
         | constraint approach. Thanks! :)
        
       | fabian2k wrote:
       | Native enums in Postgres just seem to have many more drawbacks
       | than advantages. I never really used them, it didn't seem worth
       | it.
       | 
       | I usually use C# Enums translated by EF Core now, which works
       | perfectly fine on the C# side. The only missing part would be to
       | give the DB the information about the enum names, so it could
       | show those to me instead of just the raw numbers. But I assume
       | there is no way to do that. Adding entire tables just doesn't
       | seem worth it for this use case alone.
        
         | ComputerGuru wrote:
         | > Adding entire tables just doesn't seem worth it for this use
         | case alone.
         | 
         | If you peek at a particular table often enough and it has a
         | crazy enough enum to maybe justify making your life a little
         | easier, what you can do is add a table just for debug purposes
         | and use a view that maps the int to it, but otherwise only use
         | the direct int:enum cast in your app/orm/etc. thereby not
         | paying any runtime cost for foreign keys or integrity checking.
        
         | bob1029 wrote:
         | > Adding entire tables just doesn't seem worth it for this use
         | case alone.
         | 
         | I am going down this path. There are a lot of downstream use
         | cases (e.g. support, reporting, etc) that can benefit from
         | having the canonical names & human-friendly labels in the
         | database.
         | 
         | Keeping database in sync with code like this requires some
         | discipline and manual scripting, but I think it's worth it.
        
         | wojcikstefan wrote:
         | > I usually use C# Enums translated by EF Core now (...)
         | 
         | Does this approach ensure data consistency? Or could you
         | accidentally insert a number value into your table which is not
         | represented by any of the enum values?
        
           | fabian2k wrote:
           | With raw SQL you could write anything in it. Within EF Core
           | it won't allow bad values, you could trick it I think but
           | that is not something you'd do accidentally. You can always
           | add a CHECK constraint in addition, if you have more
           | uncontrolled places modifying those values.
        
           | ComputerGuru wrote:
           | Well C# is strongly typed so that's punted to the CLR type
           | system (assuming you're not using raw sql to insert values
           | via other means/frontends) - but it isn't too hard to
           | forcibly to cast a raw int to an enum in C# which doesn't
           | actually contain a definition with that integral value
           | (unlike in even stronger typed languages eg in rust where
           | it's a lot more work).
           | 
           | I "manually" add a check constraint (via EF, so it's not so
           | much manual as it is "remember to copy and paste this in your
           | db.cs) to assert the value is greater than or equal to zero
           | and less than the number of inhabitants in the enum, but this
           | fails if you manually code the enum values (eg for flags,
           | legacy interop, etc).
        
       | akersten wrote:
       | Is there a reason ALTER TYPE name ADD VALUE new_enum_value
       | wouldn't Just Work for the first example for 99% of use cases?
       | Seems like the only drawback highlighted for native enums is that
       | you have to lock the whole table if you completely swap out the
       | type of one of the columns, which... Yes, that's true, but also
       | very rare? I guess TFA mentions removing a value from an enum,
       | but you shouldn't do that imo - leave the data in place and
       | ignore it at the app layer, like how you'd treat a deleted flag
       | or similar. Or migrate it and leave the old value hanging around
       | in the type.
        
         | jpmelos wrote:
         | > Is there a reason ALTER TYPE name ADD VALUE new_enum_value
         | wouldn't Just Work for the first example for 99% of use cases?
         | 
         | I'm sorry, we don't have any examples in the article about
         | adding values to enums, so I don't know to what you are
         | referring here.
         | 
         | > Yes, that's true, but also very rare?
         | 
         | It is rare, indeed, but we just have a strong inclination
         | towards having a clean schema, so we do prefer to clean up
         | things that would otherwise be unused. But I understand that
         | many people would prefer to leave them hanging there, even if
         | they are not going to be used in any way.
        
       | jpgvm wrote:
       | PostgreSQL enums feel like a bit of a hack in general. I end up
       | using an "enum table" approach in many cases as joining against a
       | very small table has negligible performance impact in all but the
       | most performance sensitive databases and foreign key constraints
       | are a well understood quantity.
        
         | marcosdumay wrote:
         | PostgreSQL enums are there for you to use on your code API.
         | They can go into tables because sometimes it's important to put
         | code data into tables.
         | 
         | What they are absolutely not for is to replace the one standard
         | way to define data enums that everybody use since relational
         | algebra was created. It's such an outrageously naive idea that
         | I'm sure most people here were attracted to the title thinking
         | it's about algebraic types or some other similar
         | misunderstanding.
        
           | wojcikstefan wrote:
           | Can you elaborate on what it means to use PostgreSQL enums
           | "on your code API"?
           | 
           | > It's such an outrageously naive idea that I'm sure most
           | people here were attracted to the title thinking it's about
           | algebraic types or some other similar misunderstanding.
           | 
           | Just to share how we at Close got into this discussion (which
           | I personally don't find as "outrageous" as you), SQLAlchemy -
           | which we use in our Python code - uses `native_enum=True` as
           | the default value for its `Enum` type [0], which then
           | translates to using types in PostgreSQL:
           | 
           | > native_enum - Use the database's native ENUM type when
           | available. Defaults to True. When False, uses VARCHAR + check
           | constraint for all backends.
           | 
           | [0]: https://docs.sqlalchemy.org/en/20/core/type_basics.html#
           | sqla...
        
             | marcosdumay wrote:
             | > Can you elaborate on what it means to use PostgreSQL
             | enums "on your code API"?
             | 
             | If you export a procedure for creating a socket into
             | pgPlSQL, you shouldn't use magical numbers for setting the
             | socket flags. You should use enums.
             | 
             | As for sqlalchmey, well that design is not good. It should
             | support more mappings than just to string. But well,
             | personally, I would ignore the feature and go without enum
             | types (notice that it's a recent addition). You can always
             | declare your own enum and convert the column if you want.
        
               | faangsticle wrote:
               | > But well, personally, I would ignore the feature and go
               | without enum types (notice that it's a recent addition).
               | 
               | Postgres has had enums since 2008
        
               | mike_hock wrote:
               | So, recent addition.
        
             | aidos wrote:
             | Heyya! Close customer here.
             | 
             | > When False, uses VARCHAR + check constraint for all
             | backends
             | 
             | Is this behaviour new? Or is it that alembic still doesn't
             | pick up changes to the enum to recreate the constraint? We
             | implemented our own alembic hooks to automate the
             | migrations for us. We've found the check constraint model
             | to be fairly effective.
        
               | wojcikstefan wrote:
               | Hiya Aidan! :D
               | 
               | > Is this behaviour new?
               | 
               | Don't think so. Even the docs for SQLA v1.3 ([0]) mention
               | it:
               | 
               | > native_enum - Use the database's native ENUM type when
               | available. Defaults to True. When False, uses VARCHAR +
               | check constraint for all backends. The VARCHAR length can
               | be controlled with Enum.length
               | 
               | [0]: https://docs.sqlalchemy.org/en/13/core/type_basics.h
               | tml#sqla...
               | 
               | --
               | 
               | > Or is it that alembic still doesn't pick up changes to
               | the enum to recreate the constraint?
               | 
               | Off the top of my head, I believe Alembic doesn't pick it
               | up automatically. We make a lot of manual tweaks to our
               | auto-generated Alembic migrations and I believe adding /
               | modifying the CHECK constrains is part of it.
        
               | aidos wrote:
               | Happy to ping you guys over the code if you want to look.
               | It's probably fairly tweak-able to handle your specific
               | cases.
        
         | cryptonector wrote:
         | PG enums _are_ "enum tables" under the hood. With native enum
         | support the JOINs with the enum tables happen at query
         | compilation time, which is a performance benefit you should not
         | ignore.
        
       | rpcope1 wrote:
       | After having suffered through the consequences of "type" enums on
       | MySQL, and see some things go through a long life that used
       | "enums" in the database (in multiple different databases, include
       | Postgres), I'm not convinced that either of these are the right
       | choice for representing enumerations. The string with check
       | constraint seems dumb if for no other reason than if the table
       | that uses it winds up having many rows, you're basically burning
       | up lots of extra space for arguably no reason, and if you ever
       | have to alter the name of an enum (or something similar), that
       | update is going to be really expensive.
       | 
       | I think the "right" choice for enums probably looks a little more
       | like:                  CREATE TABLE myEnum(          enumID
       | SERIAL NOT NULL PRIMARY KEY,          enumName TEXT UNIQUE NOT
       | NULL,          -- enum description or other metadata columns
       | here.        );            CREATE TABLE foo(         ...
       | associatedEnumID INTEGER NOT NULL REFERENCES myEnum(enumID),
       | ...       );
       | 
       | I think this has the benefit of being space efficient like the
       | native typed enum, while being relatively flexible (easy to
       | change names, add new enum values, add data about the enum
       | itself, etc.)
        
         | SigmundA wrote:
         | Yep thats what I typically do except a string "code" and string
         | "description" that way the raw data is more readable without
         | joining the coded value and many times you want a short code vs
         | long description, like with US states and their abbreviations.
         | Most codes are 1 or 2 characters up to maybe 4 so you end up
         | with less or the same space used than a 32 bit int with UTF-8.
         | 
         | Metadata like sorting and even what date ranges the code is
         | valid for and even security as in who's allowed to use the code
         | in the app.
        
         | netcraft wrote:
         | In my experience, you need those descriptions and other
         | metadata on these enum values more often than not.
        
         | SPBS wrote:
         | I prefer directly using strings as enums, and using the foreign
         | key constraint only to validate enum values.
         | CREATE TABLE my_enum ( name TEXT PRIMARY KEY );
         | CREATE TABLE foo ( my_enum TEXT REFERENCES my_enum (name) );
         | 
         | The reason is because a SELECT * FROM foo showing cryptic enum
         | ordinals is a headache, and having to join the enum table every
         | time is potentially slower than just reading from the column
         | directly. An ASCII character only takes 1 byte, so an INT enum
         | is just as space efficient as using 4 characters, which affords
         | way more descriptiveness than a meaningless ordinal number.
        
           | faangsticle wrote:
           | The point of enums is so the names can be descriptive, which
           | typically will be longer than 4 characters.
        
           | ledgerdev wrote:
           | If you use longer named enums(eg. my_enum_xyz) in my_enum,
           | does this store a full copy of the enum text bytes of
           | 'my_enum_xyz' into table foo?
        
             | SPBS wrote:
             | Yeah the full text of the enum is stored in the table
             | ('my_enum_xyz' is 11 ASCII characters so it takes up 11
             | bytes, plus 1 byte needed to store the length of the
             | string).
        
         | jfb wrote:
         | This is exactly what I do, save with natural keys.
        
         | cryptonector wrote:
         | This is... exactly what PG does under the covers for ENUM
         | types. And also of course this is historically the canonical
         | way to do ENUMs in SQL.
        
       | out_of_protocol wrote:
       | smallint looks like a good alternative, with dictionary in the
       | app or separate table. So far i've only seen storing dictionary
       | in app source code approach
        
         | wojcikstefan wrote:
         | Just to make sure I understand your suggestion fully, you're
         | saying: 1. Use `smallint` instead of `text` for the column
         | type. 2. Otherwise follow our "`CHECK` constraint" approach
         | (without it we're jeopardizing data consistency because we can
         | store anything between -32768 and 32767 in that column). 3.
         | Translate the int to the enum in your application.
         | 
         | Right?
        
           | out_of_protocol wrote:
           | Yep!
        
           | ComputerGuru wrote:
           | Most ORMs or just DALs can automate step 3 for you. If you're
           | using a strongly typed language with any type of ffi support,
           | you'll probably already have native cast from int to enum
           | available (unless you're using rust where that's considered
           | unsafe by default).
        
         | jpmelos wrote:
         | I don't like this approach much because getting cryptic
         | integers when you do a `SELECT` in the database is really
         | cumbersome.
        
           | arp242 wrote:
           | Agreed, but on the other hand it also saves a lot of
           | complexity and possible headaches down the line. It's kind of
           | a matter of choosing which headache you want. Especially if
           | your table gets larger all those extra bytes in text columns
           | can cost you dozens of GB of disk space, makes indexing
           | slower, etc.
           | 
           | I MySQL/MariaDB enums are just "aliases" for ints, and that
           | works much nicer IMHO, and adding a new value is cheap
           | because it doesn't recheck all the rows (removing values is
           | still expensive, as it needs to check it's not actually used
           | by any row).
        
       | jensenbox wrote:
       | What is the difference in size on disk between the two options?
        
         | jpmelos wrote:
         | An enum stores a small reference to the enum value in the
         | tuples, and the constraint approach stores the string itself.
         | The constraint approach takes more space with (a) larger enum
         | values, (b) more columns using them, and (c) more rows in the
         | tables that use them. The difference depends on the
         | characteristics of the database we're talking about.
        
       | Canada wrote:
       | I agree with this, I don't use enums they are always more trouble
       | than they are worth.
       | 
       | They break FDW unless they are pre-created on the importing side.
       | Super inconvenient.
        
         | efxhoy wrote:
         | I had to implement a workaround for that for our DW which
         | imports from two different app databases. IMO this should be
         | part of                 import foreign schema
         | 
         | but then you would have to qualify the names somehow if you
         | were importing from more than one db. I wrote this as a
         | workaround, it runs every day as part of our data import job.
         | DROP SCHEMA IF EXISTS fdw_enum CASCADE;       CREATE SCHEMA
         | fdw_enum;              -- Get names of the enums using
         | ::regtype casting and label and sort_order from app_a and
         | app_b.       CREATE VIEW fdw_enum.app_a_enums AS SELECT * FROM
         | dblink('fdw_app_a',       $QUERY$         SELECT
         | enumtypid::regtype AS name,           enumsortorder sort_order,
         | enumlabel label         FROM pg_enum;       $QUERY$       ) AS
         | t (name text, sort_order integer, label text);
         | CREATE VIEW fdw_enum.app_b_enums AS SELECT * FROM
         | dblink('fdw_app_b',       $QUERY$         SELECT
         | enumtypid::regtype AS name,           enumsortorder sort_order,
         | enumlabel label         FROM pg_enum;       $QUERY$       ) AS
         | t (name text, sort_order integer, label text);              --
         | Ensure enums with the same names aren't defined in both app_a
         | and app_b.       DO       $DO$       DECLARE
         | matching_count integer;       BEGIN         SELECT COUNT(*)
         | into matching_count         FROM fdw_enum.app_a_enums INNER
         | JOIN fdw_enum.app_b_enums USING (name);         ASSERT
         | matching_count = 0,           'app_a and app_b-NG have
         | identically named enums. Implement a check that they are
         | identically defined.';       END       $DO$;
         | CREATE VIEW fdw_enum.upstream_enums AS       SELECT * FROM
         | fdw_enum.app_a_enums       UNION ALL       SELECT * FROM
         | fdw_enum.app_b_enums;                     CREATE PROCEDURE
         | fdw_enum.create_type_if_not_exists(name text)       LANGUAGE
         | plpgsql       AS $PROC$         BEGIN           EXECUTE
         | format('CREATE TYPE %s AS ENUM ()', name);           EXCEPTION
         | WHEN duplicate_object THEN RAISE NOTICE 'Enum type % already
         | exists, skipping', name;         END;       $PROC$       ;
         | -- To make this idempotent we create the enums empty then alter
         | them by adding values.       -- So instead of `CREATE TYPE foo
         | AS ENUM ('bar', 'baz');` we do       -- `CREATE TYPE IF NOT
         | EXISTS foo AS ENUM ();       -- `ALTER TYPE foo ADD VALUE IF
         | NOT EXISTS ('bar');       -- `ALTER TYPE foo ADD VALUE IF NOT
         | EXISTS ('baz');       -- This ensure the procedure is re-
         | runnable and can add new values to the enum as they are added
         | upstream.       -- Order is ensured by the ORDER BY in the loop
         | query.       CREATE PROCEDURE fdw_enum.refresh_upsteram_enums()
         | LANGUAGE plpgsql       AS $PROC$       DECLARE         rec
         | record;         ddl text;       BEGIN         FOR rec IN SELECT
         | DISTINCT(name) AS name FROM fdw_enum.upstream_enums
         | LOOP           CALL
         | fdw_enum.create_type_if_not_exists(rec.name);         END LOOP;
         | FOR rec in           SELECT *           FROM
         | fdw_enum.upstream_enums           ORDER BY name, sort_order
         | LOOP           ddl := FORMAT('ALTER TYPE %s ADD VALUE IF NOT
         | EXISTS %s', rec.name, quote_literal(rec.label));
         | EXECUTE ddl;         END LOOP;       END       $PROC$       ;
         | CALL fdw_enum.refresh_upsteram_enums();
        
         | wojcikstefan wrote:
         | Apologies for my ignorance, but what does "FDW" stand for in
         | your comment? :)
        
           | Quekid5 wrote:
           | Foreign Data Wrapper
        
             | adgjlsfhk1 wrote:
             | specifically, they are one of the coolest features in
             | postgres. Have you ever wanted to pretend that a CSV (or
             | your filesystem or google) or basically anything else was a
             | SQL database? With a couple hundred lines of code, you can
             | do this. The performance won't be great, but it's
             | ridiculously flexible.
        
               | ComputerGuru wrote:
               | I have a two-liner shell script that turns a csv into a
               | virtual SQLite db (./to_sql.sh foo.csv) and lets me use
               | the full power of SQLite. Usually that's enough for my
               | purposes, but I do use pgsql's copy from csv to generate
               | tables representing foreign data imported from a csv (say
               | census info or NAICS occupation codes).
        
           | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-06-20 23:02 UTC)