[HN Gopher] Representing Enums in PostgreSQL
___________________________________________________________________
Representing Enums in PostgreSQL
Author : wojcikstefan
Score : 103 points
Date : 2023-06-20 12:31 UTC (10 hours ago)
(HTM) web link (making.close.com)
(TXT) w3m dump (making.close.com)
| jpalomaki wrote:
| Native enums can be used in queries like strings, but with type
| checking: select * from cust where type =
| 'company' -- ok select * from cust where type = 'c0mpany'
| -- error
|
| As mentioned, they take less space. Important if you use these
| columns in index and have millions of rows.
| cryptonector wrote:
| TFA is all about how to make changes where you drop elements of
| an enum, and how hard that is. The obvious thing though is not
| covered: don't do that! Instead you should: a.
| Add CHECK() constraints on columns of the relevant ENUM
| type checking that the value is one of the "live"
| values. b. RENAME "dead" ENUM values to indicate
| that they're dead. c. UPDATE ... SET
| column_of_that_enum_type = ... WHERE
| column_of_that_enum_type IN (dead_values_here) ...
|
| (c) can be done slowly, say, with a LIMIT clause to keep load and
| disruptions down. Yeah, you might have "broken" data for a while,
| but you can make your queries report better values for columns of
| that ENUM type in the same way that the UPDATE would fix them.
| BulgarianIdiot wrote:
| [dead]
| dikei wrote:
| Agree, PostgreSQL enum is really just for mapping with an enum
| in application code. And just like you should not remove an
| enum or change their ordinal in application code, don't do that
| to a PostgreSQL enum
| hn_throwaway_99 wrote:
| Glad to see this as the top comment, completely agree. After
| reading the article, was thinking that the only real downside
| of using enum types (but there is one more issue not mentioned,
| more on that below) is when you need to remove values. In
| reality, I've found removing enum values to be a very rare
| occurrence in prod. Removing an enum value fundamentally breaks
| backwards compatibility, so usually a better option is to
| essentially deprecate it, which is more like what you
| recommend.
|
| Note one other thing I have found is that postgres enums are
| less "portable" than plain text columns. For example, if you're
| using BigQuery to query postgres tables as an external
| connection, BQ doesn't support enums, so your external
| connection query has to cast any enum return values (e.g.
| some_enum_column::text) before returning them to BQ, which can
| be a pain if you want to do "SELECT * FROM ..." some table that
| returns enum columns.
| faangsticle wrote:
| > BQ doesn't support enums
|
| Sounds like a reason to use views to work around the
| limitations of BigQuery
| cryptonector wrote:
| > Note one other thing I have found is that postgres enums
| are less "portable" than plain text columns. For example, if
| you're using BigQuery to query postgres tables as an external
| connection, BQ doesn't support enums, so your external
| connection query has to cast any enum return values (e.g.
| some_enum_column::text) before returning them to BQ, which
| can be a pain if you want to do "SELECT * FROM ..." some
| table that returns enum columns.
|
| Yeah, one has to write more adapter queries to add those
| casts to text (or from text, for DMLs), but it's OK, and
| ENUMs are just really nice. And as u/faangsticle says, use
| VIEWs for that.
| thom wrote:
| You can create an implicit cast to and from text with CREATE
| CAST and it becomes transparent.
| aidos wrote:
| Oh nice. I didn't know about that feature.
| cropcirclbureau wrote:
| Good approach. And with Postgres `DOMAIN` types features to DRY
| up the `CHECK` the live constraint, this should be very
| palatable. I usually opt for enum tables and regularly find
| myself needing to declare `archived_at` columns for said
| tables. Did not know of the `RENAME` command either and what
| you describe renders native enums less warty.
| rollulus wrote:
| When would one use this approach versus a domain type [1]? How
| does it differ?
|
| [1]: https://www.postgresql.org/docs/current/domains.html
| jpmelos wrote:
| That would be an improvement on the `CHECK` constraint approach
| if the same `CHECK` constraint is used in multiple columns, to
| DRY it up. The `DOMAIN` types offer the same commands for
| `CHECK` constraints, namely `NOT VALID` when updating the
| constraint, and `VALIDATE CONSTRAINT` to make sure it is valid
| after having updated it.
| cmcconomy wrote:
| I'm happy to have a FK to a reftable with int, shortname,
| description
| Icathian wrote:
| Little known bug, but enums also choke dump/restore if you use
| them as a hash partition key.
|
| That said, they can be really nice for all the same reasons
| static types are nice. Good article!
| ComputerGuru wrote:
| It sure sounds like a bug to me, too! Is there an upstream bug
| id? I'd imagine dump/restore should "just work" with anything
| that can be done in postgres oob.
| Icathian wrote:
| Donno about a bug Id, but there were a couple mailing list
| discussions. Here: https://www.postgresql.org/message-
| id/flat/1376149.167526827...
|
| Disclosure: I reported this particular bug. It was a fun one
| to run down.
| wojcikstefan wrote:
| Whew, good to know! Another point towards the `CHECK`
| constraint approach. Thanks! :)
| fabian2k wrote:
| Native enums in Postgres just seem to have many more drawbacks
| than advantages. I never really used them, it didn't seem worth
| it.
|
| I usually use C# Enums translated by EF Core now, which works
| perfectly fine on the C# side. The only missing part would be to
| give the DB the information about the enum names, so it could
| show those to me instead of just the raw numbers. But I assume
| there is no way to do that. Adding entire tables just doesn't
| seem worth it for this use case alone.
| ComputerGuru wrote:
| > Adding entire tables just doesn't seem worth it for this use
| case alone.
|
| If you peek at a particular table often enough and it has a
| crazy enough enum to maybe justify making your life a little
| easier, what you can do is add a table just for debug purposes
| and use a view that maps the int to it, but otherwise only use
| the direct int:enum cast in your app/orm/etc. thereby not
| paying any runtime cost for foreign keys or integrity checking.
| bob1029 wrote:
| > Adding entire tables just doesn't seem worth it for this use
| case alone.
|
| I am going down this path. There are a lot of downstream use
| cases (e.g. support, reporting, etc) that can benefit from
| having the canonical names & human-friendly labels in the
| database.
|
| Keeping database in sync with code like this requires some
| discipline and manual scripting, but I think it's worth it.
| wojcikstefan wrote:
| > I usually use C# Enums translated by EF Core now (...)
|
| Does this approach ensure data consistency? Or could you
| accidentally insert a number value into your table which is not
| represented by any of the enum values?
| fabian2k wrote:
| With raw SQL you could write anything in it. Within EF Core
| it won't allow bad values, you could trick it I think but
| that is not something you'd do accidentally. You can always
| add a CHECK constraint in addition, if you have more
| uncontrolled places modifying those values.
| ComputerGuru wrote:
| Well C# is strongly typed so that's punted to the CLR type
| system (assuming you're not using raw sql to insert values
| via other means/frontends) - but it isn't too hard to
| forcibly to cast a raw int to an enum in C# which doesn't
| actually contain a definition with that integral value
| (unlike in even stronger typed languages eg in rust where
| it's a lot more work).
|
| I "manually" add a check constraint (via EF, so it's not so
| much manual as it is "remember to copy and paste this in your
| db.cs) to assert the value is greater than or equal to zero
| and less than the number of inhabitants in the enum, but this
| fails if you manually code the enum values (eg for flags,
| legacy interop, etc).
| akersten wrote:
| Is there a reason ALTER TYPE name ADD VALUE new_enum_value
| wouldn't Just Work for the first example for 99% of use cases?
| Seems like the only drawback highlighted for native enums is that
| you have to lock the whole table if you completely swap out the
| type of one of the columns, which... Yes, that's true, but also
| very rare? I guess TFA mentions removing a value from an enum,
| but you shouldn't do that imo - leave the data in place and
| ignore it at the app layer, like how you'd treat a deleted flag
| or similar. Or migrate it and leave the old value hanging around
| in the type.
| jpmelos wrote:
| > Is there a reason ALTER TYPE name ADD VALUE new_enum_value
| wouldn't Just Work for the first example for 99% of use cases?
|
| I'm sorry, we don't have any examples in the article about
| adding values to enums, so I don't know to what you are
| referring here.
|
| > Yes, that's true, but also very rare?
|
| It is rare, indeed, but we just have a strong inclination
| towards having a clean schema, so we do prefer to clean up
| things that would otherwise be unused. But I understand that
| many people would prefer to leave them hanging there, even if
| they are not going to be used in any way.
| jpgvm wrote:
| PostgreSQL enums feel like a bit of a hack in general. I end up
| using an "enum table" approach in many cases as joining against a
| very small table has negligible performance impact in all but the
| most performance sensitive databases and foreign key constraints
| are a well understood quantity.
| marcosdumay wrote:
| PostgreSQL enums are there for you to use on your code API.
| They can go into tables because sometimes it's important to put
| code data into tables.
|
| What they are absolutely not for is to replace the one standard
| way to define data enums that everybody use since relational
| algebra was created. It's such an outrageously naive idea that
| I'm sure most people here were attracted to the title thinking
| it's about algebraic types or some other similar
| misunderstanding.
| wojcikstefan wrote:
| Can you elaborate on what it means to use PostgreSQL enums
| "on your code API"?
|
| > It's such an outrageously naive idea that I'm sure most
| people here were attracted to the title thinking it's about
| algebraic types or some other similar misunderstanding.
|
| Just to share how we at Close got into this discussion (which
| I personally don't find as "outrageous" as you), SQLAlchemy -
| which we use in our Python code - uses `native_enum=True` as
| the default value for its `Enum` type [0], which then
| translates to using types in PostgreSQL:
|
| > native_enum - Use the database's native ENUM type when
| available. Defaults to True. When False, uses VARCHAR + check
| constraint for all backends.
|
| [0]: https://docs.sqlalchemy.org/en/20/core/type_basics.html#
| sqla...
| marcosdumay wrote:
| > Can you elaborate on what it means to use PostgreSQL
| enums "on your code API"?
|
| If you export a procedure for creating a socket into
| pgPlSQL, you shouldn't use magical numbers for setting the
| socket flags. You should use enums.
|
| As for sqlalchmey, well that design is not good. It should
| support more mappings than just to string. But well,
| personally, I would ignore the feature and go without enum
| types (notice that it's a recent addition). You can always
| declare your own enum and convert the column if you want.
| faangsticle wrote:
| > But well, personally, I would ignore the feature and go
| without enum types (notice that it's a recent addition).
|
| Postgres has had enums since 2008
| mike_hock wrote:
| So, recent addition.
| aidos wrote:
| Heyya! Close customer here.
|
| > When False, uses VARCHAR + check constraint for all
| backends
|
| Is this behaviour new? Or is it that alembic still doesn't
| pick up changes to the enum to recreate the constraint? We
| implemented our own alembic hooks to automate the
| migrations for us. We've found the check constraint model
| to be fairly effective.
| wojcikstefan wrote:
| Hiya Aidan! :D
|
| > Is this behaviour new?
|
| Don't think so. Even the docs for SQLA v1.3 ([0]) mention
| it:
|
| > native_enum - Use the database's native ENUM type when
| available. Defaults to True. When False, uses VARCHAR +
| check constraint for all backends. The VARCHAR length can
| be controlled with Enum.length
|
| [0]: https://docs.sqlalchemy.org/en/13/core/type_basics.h
| tml#sqla...
|
| --
|
| > Or is it that alembic still doesn't pick up changes to
| the enum to recreate the constraint?
|
| Off the top of my head, I believe Alembic doesn't pick it
| up automatically. We make a lot of manual tweaks to our
| auto-generated Alembic migrations and I believe adding /
| modifying the CHECK constrains is part of it.
| aidos wrote:
| Happy to ping you guys over the code if you want to look.
| It's probably fairly tweak-able to handle your specific
| cases.
| cryptonector wrote:
| PG enums _are_ "enum tables" under the hood. With native enum
| support the JOINs with the enum tables happen at query
| compilation time, which is a performance benefit you should not
| ignore.
| rpcope1 wrote:
| After having suffered through the consequences of "type" enums on
| MySQL, and see some things go through a long life that used
| "enums" in the database (in multiple different databases, include
| Postgres), I'm not convinced that either of these are the right
| choice for representing enumerations. The string with check
| constraint seems dumb if for no other reason than if the table
| that uses it winds up having many rows, you're basically burning
| up lots of extra space for arguably no reason, and if you ever
| have to alter the name of an enum (or something similar), that
| update is going to be really expensive.
|
| I think the "right" choice for enums probably looks a little more
| like: CREATE TABLE myEnum( enumID
| SERIAL NOT NULL PRIMARY KEY, enumName TEXT UNIQUE NOT
| NULL, -- enum description or other metadata columns
| here. ); CREATE TABLE foo( ...
| associatedEnumID INTEGER NOT NULL REFERENCES myEnum(enumID),
| ... );
|
| I think this has the benefit of being space efficient like the
| native typed enum, while being relatively flexible (easy to
| change names, add new enum values, add data about the enum
| itself, etc.)
| SigmundA wrote:
| Yep thats what I typically do except a string "code" and string
| "description" that way the raw data is more readable without
| joining the coded value and many times you want a short code vs
| long description, like with US states and their abbreviations.
| Most codes are 1 or 2 characters up to maybe 4 so you end up
| with less or the same space used than a 32 bit int with UTF-8.
|
| Metadata like sorting and even what date ranges the code is
| valid for and even security as in who's allowed to use the code
| in the app.
| netcraft wrote:
| In my experience, you need those descriptions and other
| metadata on these enum values more often than not.
| SPBS wrote:
| I prefer directly using strings as enums, and using the foreign
| key constraint only to validate enum values.
| CREATE TABLE my_enum ( name TEXT PRIMARY KEY );
| CREATE TABLE foo ( my_enum TEXT REFERENCES my_enum (name) );
|
| The reason is because a SELECT * FROM foo showing cryptic enum
| ordinals is a headache, and having to join the enum table every
| time is potentially slower than just reading from the column
| directly. An ASCII character only takes 1 byte, so an INT enum
| is just as space efficient as using 4 characters, which affords
| way more descriptiveness than a meaningless ordinal number.
| faangsticle wrote:
| The point of enums is so the names can be descriptive, which
| typically will be longer than 4 characters.
| ledgerdev wrote:
| If you use longer named enums(eg. my_enum_xyz) in my_enum,
| does this store a full copy of the enum text bytes of
| 'my_enum_xyz' into table foo?
| SPBS wrote:
| Yeah the full text of the enum is stored in the table
| ('my_enum_xyz' is 11 ASCII characters so it takes up 11
| bytes, plus 1 byte needed to store the length of the
| string).
| jfb wrote:
| This is exactly what I do, save with natural keys.
| cryptonector wrote:
| This is... exactly what PG does under the covers for ENUM
| types. And also of course this is historically the canonical
| way to do ENUMs in SQL.
| out_of_protocol wrote:
| smallint looks like a good alternative, with dictionary in the
| app or separate table. So far i've only seen storing dictionary
| in app source code approach
| wojcikstefan wrote:
| Just to make sure I understand your suggestion fully, you're
| saying: 1. Use `smallint` instead of `text` for the column
| type. 2. Otherwise follow our "`CHECK` constraint" approach
| (without it we're jeopardizing data consistency because we can
| store anything between -32768 and 32767 in that column). 3.
| Translate the int to the enum in your application.
|
| Right?
| out_of_protocol wrote:
| Yep!
| ComputerGuru wrote:
| Most ORMs or just DALs can automate step 3 for you. If you're
| using a strongly typed language with any type of ffi support,
| you'll probably already have native cast from int to enum
| available (unless you're using rust where that's considered
| unsafe by default).
| jpmelos wrote:
| I don't like this approach much because getting cryptic
| integers when you do a `SELECT` in the database is really
| cumbersome.
| arp242 wrote:
| Agreed, but on the other hand it also saves a lot of
| complexity and possible headaches down the line. It's kind of
| a matter of choosing which headache you want. Especially if
| your table gets larger all those extra bytes in text columns
| can cost you dozens of GB of disk space, makes indexing
| slower, etc.
|
| I MySQL/MariaDB enums are just "aliases" for ints, and that
| works much nicer IMHO, and adding a new value is cheap
| because it doesn't recheck all the rows (removing values is
| still expensive, as it needs to check it's not actually used
| by any row).
| jensenbox wrote:
| What is the difference in size on disk between the two options?
| jpmelos wrote:
| An enum stores a small reference to the enum value in the
| tuples, and the constraint approach stores the string itself.
| The constraint approach takes more space with (a) larger enum
| values, (b) more columns using them, and (c) more rows in the
| tables that use them. The difference depends on the
| characteristics of the database we're talking about.
| Canada wrote:
| I agree with this, I don't use enums they are always more trouble
| than they are worth.
|
| They break FDW unless they are pre-created on the importing side.
| Super inconvenient.
| efxhoy wrote:
| I had to implement a workaround for that for our DW which
| imports from two different app databases. IMO this should be
| part of import foreign schema
|
| but then you would have to qualify the names somehow if you
| were importing from more than one db. I wrote this as a
| workaround, it runs every day as part of our data import job.
| DROP SCHEMA IF EXISTS fdw_enum CASCADE; CREATE SCHEMA
| fdw_enum; -- Get names of the enums using
| ::regtype casting and label and sort_order from app_a and
| app_b. CREATE VIEW fdw_enum.app_a_enums AS SELECT * FROM
| dblink('fdw_app_a', $QUERY$ SELECT
| enumtypid::regtype AS name, enumsortorder sort_order,
| enumlabel label FROM pg_enum; $QUERY$ ) AS
| t (name text, sort_order integer, label text);
| CREATE VIEW fdw_enum.app_b_enums AS SELECT * FROM
| dblink('fdw_app_b', $QUERY$ SELECT
| enumtypid::regtype AS name, enumsortorder sort_order,
| enumlabel label FROM pg_enum; $QUERY$ ) AS
| t (name text, sort_order integer, label text); --
| Ensure enums with the same names aren't defined in both app_a
| and app_b. DO $DO$ DECLARE
| matching_count integer; BEGIN SELECT COUNT(*)
| into matching_count FROM fdw_enum.app_a_enums INNER
| JOIN fdw_enum.app_b_enums USING (name); ASSERT
| matching_count = 0, 'app_a and app_b-NG have
| identically named enums. Implement a check that they are
| identically defined.'; END $DO$;
| CREATE VIEW fdw_enum.upstream_enums AS SELECT * FROM
| fdw_enum.app_a_enums UNION ALL SELECT * FROM
| fdw_enum.app_b_enums; CREATE PROCEDURE
| fdw_enum.create_type_if_not_exists(name text) LANGUAGE
| plpgsql AS $PROC$ BEGIN EXECUTE
| format('CREATE TYPE %s AS ENUM ()', name); EXCEPTION
| WHEN duplicate_object THEN RAISE NOTICE 'Enum type % already
| exists, skipping', name; END; $PROC$ ;
| -- To make this idempotent we create the enums empty then alter
| them by adding values. -- So instead of `CREATE TYPE foo
| AS ENUM ('bar', 'baz');` we do -- `CREATE TYPE IF NOT
| EXISTS foo AS ENUM (); -- `ALTER TYPE foo ADD VALUE IF
| NOT EXISTS ('bar'); -- `ALTER TYPE foo ADD VALUE IF NOT
| EXISTS ('baz'); -- This ensure the procedure is re-
| runnable and can add new values to the enum as they are added
| upstream. -- Order is ensured by the ORDER BY in the loop
| query. CREATE PROCEDURE fdw_enum.refresh_upsteram_enums()
| LANGUAGE plpgsql AS $PROC$ DECLARE rec
| record; ddl text; BEGIN FOR rec IN SELECT
| DISTINCT(name) AS name FROM fdw_enum.upstream_enums
| LOOP CALL
| fdw_enum.create_type_if_not_exists(rec.name); END LOOP;
| FOR rec in SELECT * FROM
| fdw_enum.upstream_enums ORDER BY name, sort_order
| LOOP ddl := FORMAT('ALTER TYPE %s ADD VALUE IF NOT
| EXISTS %s', rec.name, quote_literal(rec.label));
| EXECUTE ddl; END LOOP; END $PROC$ ;
| CALL fdw_enum.refresh_upsteram_enums();
| wojcikstefan wrote:
| Apologies for my ignorance, but what does "FDW" stand for in
| your comment? :)
| Quekid5 wrote:
| Foreign Data Wrapper
| adgjlsfhk1 wrote:
| specifically, they are one of the coolest features in
| postgres. Have you ever wanted to pretend that a CSV (or
| your filesystem or google) or basically anything else was a
| SQL database? With a couple hundred lines of code, you can
| do this. The performance won't be great, but it's
| ridiculously flexible.
| ComputerGuru wrote:
| I have a two-liner shell script that turns a csv into a
| virtual SQLite db (./to_sql.sh foo.csv) and lets me use
| the full power of SQLite. Usually that's enough for my
| purposes, but I do use pgsql's copy from csv to generate
| tables representing foreign data imported from a csv (say
| census info or NAICS occupation codes).
| [deleted]
___________________________________________________________________
(page generated 2023-06-20 23:02 UTC)