[HN Gopher] You might as well timestamp it (2021)
___________________________________________________________________
You might as well timestamp it (2021)
Author : Tomte
Score : 115 points
Date : 2022-12-28 17:59 UTC (5 hours ago)
(HTM) web link (changelog.com)
(TXT) w3m dump (changelog.com)
| sgtnoodle wrote:
| But what timebase and precision? UTC in seconds? Unix time in
| nanoseconds? GPS time in 10us increments?
| jpollock wrote:
| The problem with the timestamp is that it keeps people from
| thinking that they've got an enum.
|
| For example, let's start with an object, which has a boolean flag
| "active".
|
| boolean is_active;
|
| We will always want to get a bit more advanced, and get into sub-
| states as it is brought up. We can't do that with a boolean. So
| we go to an enum.
|
| Enum current_state { not_configured; being_provisioned; active;
| disabled; deleted; }
|
| current_state state;
|
| We have an enum in hiding. Converting the boolean to a timestamp
| stops the conversion to an enum - we can't overload the presence
| of a timestamp into the multiple values in the enum.
|
| The timestamp is "what happened, when", an audit log/change
| history function. That should be kept separate from the rest of
| the database.
|
| Sometimes, there is a business reason to track a timestamp -
| charge the customer 30days after the item became "active". How to
| model that depends on your database and what is required -
| history table, convert the enum to struct, time_became_active,
| time_of_last_state_change, etc.
| TheRealPomax wrote:
| If you're gonna use an enum for a true/false field, you might
| as well timestamp it. If you need enums, you changed the game.
| jpollock wrote:
| The timestamp doesn't give you enough information to answer
| the questions it's typically going to be used for. It tells
| you "when", but not "who", and "what else at the same time".
|
| A change history is a better solution, which would track:
|
| * transaction requestor (user/job/etc)
|
| * transaction timestamp
|
| * diff encoding before/after state
|
| Then if someone asks "why were these customer's disabled", we
| can answer "the XYZ cronjob went rabid and tore them down."
| and then hopefully use the diffs to reverse the transactions.
| bwilliams wrote:
| I think this may miss the point of the article, which is
| pointing out that you can get a lot of value for very
| little effort by using a timestamp instead of a boolean. I
| don't think it's intention is to replace a complete change
| history/audit log implementation, which would require a
| significant amount more time/effort to implement.
| horsawlarway wrote:
| Sure - this is objectively better for tracking/auditing,
| but... now we're getting back into trade-off territory.
|
| Where is that change history stored? How much data are the
| diffs generating? Are we deriving the final state from the
| log, or can the log the and record disagree?
|
| Basically - this is now back in the "What is this history
| buying us?" realm, not in the "easy win" realm.
|
| In some cases it can be absolutely worth it. But probably
| not all.
| idealmedtech wrote:
| I recently converted some generation code to use a plan, so
| you can make a plan and _then_ execute the plan.
| Architecting your code this way gives you things like dry
| runs and rollbacks for free; I feel like diff storage is in
| a similar realm of usefulness.
| maxbond wrote:
| Timestamps are a huge observability win relative to the
| amount of effort they take. "When" will give you hints to
| pull the other information out of logs etc. What you're
| proposing is better, but is a heavier lift; if you have
| capacity for that, awesome, if you don't, timestamps go a
| long way, take very little time to implement, and not
| having any information will really sting.
| cryptonector wrote:
| Validity periods are superior to an is_active/is_expired
| boolean, as you can set an expiration ahead of time without any
| write transactions needed at expiration time to make the
| expiration happen.
|
| For your example, however, I'd be tempted to have an enum as
| you suggest and associated history/audit table that has
| timestamps, mainly because there's no sense in setting
| timestamps into the future for the first two enum values.
| Still, I'd keep a timestamp for inactivation for the reason I
| gave above.
| Retric wrote:
| Expiration times still need a status flag for cleanup.
|
| What should the state be vs what is the current state vs when
| should the state change.
| cryptonector wrote:
| Why repeat yourself? If you can have an expression-valued
| column, do that.
|
| For expiration it's essential that it be able to happen at
| the scheduled time (when it is scheduled anyways) without
| having to execute a transaction _at that time_.
| nine_k wrote:
| This enum is a list of states.
|
| Immediately should one think about a proper tool to handle it,
| which usually is a finite state machine. FSMs are somehow
| underappreciated outside of comm protocols and GUIs. They are
| an adequate tool for quite a few "backend" or "data
| representation" tasks though.
|
| What the original post suggests is to timestamp the latest
| state transition. The approach is trivially obvious if you
| explicitly track state transitions as a part of the logic of
| your FSM.
| jpollock wrote:
| That's a great way of phrasing it. A boolean can be a FSM in
| hiding.
|
| Tracking FSM transitions is the change history/transaction
| log, isn't it?
|
| It would be:
|
| {timestamp, old_state, event, new_state}
|
| If we convert booleans to timestamps for the FSM, I end up
| with:
|
| timestamp not_configured;
|
| timestamp being_provisioned;
|
| timestamp active;
|
| timestamp disabled;
|
| timestamp deleted;
|
| Which would cause confusion when there are multiple FSMs in
| the same database record.
|
| We could isolate each FSM into its own column.
|
| It still makes it kind-of hard to decide what's the current
| state. The set of timestamps needs to be considered and
| sorted, going from field to {field_name, timestamp} and then
| sorting by timestamp and extracting the max. It would make
| "select count(*) from customers where current_state =
| 'active'" brittle - the addition of a new state would require
| all existing queries to change.
| valw wrote:
| Or, you know, use a time-aware database like XTDB or Datomic.
| RajT88 wrote:
| I kind of understand why the example was set up this way, but I
| died inside a little when he wrote to the console log without the
| timestamp in the message.
| kevinob11 wrote:
| I've frequently just done both, which I suppose is rather silly,
| but I like having a nicely named boolean field. I suppose I
| should probably just make those derived properties though.
| newbieuser wrote:
| This should definitely be implemented in projects where the use
| of the feature flag is common. When you go with boolean values,
| after a while you have a pile of data that you don't know what
| was done and when.
| helmsb wrote:
| I always store a timestamp and have an "is_deleted" flag on the
| business object that checks if the "deleted_timestamp" is null or
| depending on the use case, whether the timestamp is in the future
| for records that are to be deleted.
|
| I can also use the timestamp later if I want to move deleted
| records to a different table or database to improve performance
| depending on requirements.
| [deleted]
| simonw wrote:
| This was one of Luke Plant's YAGNI exceptions:
| https://lukeplant.me.uk/blog/posts/yagni-exceptions/
| dang wrote:
| Discussed at the time:
|
| _You might as well timestamp it_ -
| https://news.ycombinator.com/item?id=26922759 - April 2021 (198
| comments)
| AJRF wrote:
| > It's like the exact opposite of YAGNI. You Ain't Gonna Regret
| It?
|
| Pagni's?
|
| https://simonwillison.net/2021/Jul/1/pagnis/
| sedatk wrote:
| If you need to hold timestamp, hold `deleted` and `deleted_at`
| together.
|
| If you can't add a separate field, just add a log table.
|
| Don't optimize for fictional requirements that haven't surfaced,
| and don't add shortcuts.
|
| EDIT: Let me clarify my point here. A timestamp is just a
| timestamp. Giving a timestamp a role on signifying the state of
| an entity is a semantic shortcut, and semantic shortcuts lead to
| problems. I believe that DB structures should be as clear as
| possible. For example, it's impossible to know whether a
| timestamp is the authority on an entity's state by looking at a
| table structure. But a boolean field like "is_deleted" or
| "deleted" doesn't have that problem.
| bob1029 wrote:
| Nullable timestamp columns (64-bit integers) are what I prefer.
|
| I don't know how much of a performance improvement I'd get if I
| had a separate indexed bool column for tracking _just_ the
| truthiness, but I do know that it probably wouldn 't be enough
| to matter, even _if_ SQL was a meaningful bottleneck.
|
| I really enjoy working with these as nullable DateTime in .NET
| as well: if (!record.CompletedUtc.HasValue) {
| /* completion logic */ }
|
| One source of truth also means only 1 thing to update each
| time. It makes many kinds of problems impossible.
| rektide wrote:
| Indexing a 64 bit versus a 1 bit column actually sounds like
| a potentially really good win to make the index much much
| smaller, which could help loads with keeping it in hot memory
| or cache.
|
| On the other hand, on postgres & maybe others, one could
| perhaps index an expression _deleted_at is not null_ , which
| would let one have the data shape they want while letting the
| index shape suit the typical access pattern.
| nightpool wrote:
| > For example, it's impossible to know whether a timestamp is
| the authority on an entity's state by looking at a table
| structure
|
| This is just a matter of convention--if your codebase is
| consistent about it and there's no other option for "is this
| record deleted?", then yes it will be absolutely clear that
| "deleted_at" is the right column to use.
|
| What's _actually_ worse for making the database structure clear
| is to have "vestigial" columns that store old data, like you
| proposed for "deleted" and "deleted_at". Now, if they disagree,
| which one should you trust? Maybe some code is using one, while
| some code is using the other. Having a single source-of-truth
| column--no matter _what_ the name is--is always going to be
| better then having two columns that could get out of sync or
| disagree.
| 0x457 wrote:
| I think you should have log table anyway because if you need
| that information storing it as column(s) next to the rest it's
| not going to work well.
|
| anyways i prefer to have nullable timeshamp columns instead of
| two like you suggested.
| samwillis wrote:
| Log tables are a super good idea in a lot of cases. And if your
| DB support it add a JSON column to that log table. Being able
| to dump either stricter data from our own logging or the
| reponcess from third part APIs in invaluable. You never know in
| advance what is going to be useful.
|
| If you start generating too much data, add a cron job to purge
| older items or scale back the logging.
|
| An example from our e-commerce platform, the "order/cart"
| tables have corresponding log table that log all changes, user
| interactions, and all requests/responses from Stripe during the
| checkout process.
|
| It's super useful both for tracking down bugs, but also
| confirming what happed when customers contact support.
| jrumbut wrote:
| Yeah I much prefer keeping the deleted column as a boolean
| and then having a log table. I've done the timestamps instead
| of boolean thing before and it leads to two problems:
|
| 1. What happens when the record is "un-deleted?" Now the
| timestamp is null again and history has been lost. It's not
| the end of the world but if you're asked to create an account
| history report it's an awkward conversation. "I thought we
| tracked when records were deleted..."
|
| 2. Similar to the end of the last problem, it creates an
| expectation that the times that changes are made are tracked
| but it doesn't quite live up to that expectation. I've
| learned that business people love a feature that works well
| (like a log table), they can deal with a feature not existing
| at all (like a boolean and no logging), but a half-working
| feature causes a lot of suffering.
|
| Just do logging right from the start if you are writing
| business software where the data and transaction volume are
| small compared to the value of the data.
| manigandham wrote:
| No need for both fields. A nullable timestamp column works
| fine: NULL until the record/row is deleted, at which point it
| holds the timestamp of when it happened.
|
| You can derive a boolean "is it deleted" flag from that without
| adding another column to (mis)manage.
| sedatk wrote:
| Because they signify different things. For example, when you
| want to undo that delete operation, erasing timestamp would
| actually deprive you of the information that when it was
| deleted in the first place.
| nightpool wrote:
| Sure, but that just kicks the can one iteration down the
| road, and now it's impossible to delete the record a SECOND
| time without "depriving" yourself of that information. If
| you need an audit table, use an audit table, but being able
| to say "deleted_at > 1.month.ago" to grab all records
| deleted in the past month is super useful.
| manigandham wrote:
| What is a delete operation that can be "undone"? What is
| the timestamp supposed to mean in that scenario? What
| happens when it's deleted again then?
|
| If you need a full record of state changes then just
| upgrade to an audit log, otherwise a nullable timestamp
| field is perfectly fine.
| sedatk wrote:
| That's actually my main point. I edited my top comment to
| clarify that. A timestamp doesn't implicate "state." A
| bool, on the other hand, unquestionably does.
| manigandham wrote:
| Again, the _nullable_ nature of the column denotes both
| the state and when the state changed. You have more
| information in a single column, no need for both.
| sedatk wrote:
| Nullability on timestamp only implies that the timestamp
| is optional that's all. It doesn't imply that the
| timestamp decides the state of the entity.
| temp2022account wrote:
| time-series designs are generally better because they hold
| more information, and you can throw a view on top of them to
| present a nullable timestamp table if that's what's already
| been coded against. On the other hand once you go nullable
| it's difficult to figure out why the field was nulled after
| the fact.
| manigandham wrote:
| It's a _deleted_ timestamp. It 's null until it's deleted.
| Why would there be confusion on it being null?
| echelon wrote:
| Make deleted_at nullable and put an index on it.
|
| An audit log table isn't a bad idea, though. You can capture
| rich entity edit histories, attribution, etc.
| [deleted]
| crazygringo wrote:
| I'm going to have to respectfully disagree on this one, just
| because of unintended behavior and complexity.
|
| Whether it's in a programming language or in a database... what
| is false? Is it false or null or 1970-01-01 00:00:00 UTC? What if
| different parts of the program/table use different versions of
| false? What happens when I combine two truthy values with a
| boolean operator -- for all combinations of values, what is the
| resulting type and is it the value even correct? (E.g. in SQL,
| NULL XOR 1672254005 is NULL, which is flat-out wrong if you're
| using NULL in place of FALSE, and it's easy to think of further
| examples.)
|
| There's so much research I'd have to do to determine that this
| would always be bug-free for a particular language/platform, and
| then I'd need to remember which languages/platforms it is safe
| for and which ones aren't.
| echelon wrote:
| For nullable timestamps, the timestamp itself is opaque. Having
| a value is indication that the thing happened. The time reading
| is orthogonal to that question.
|
| If your language has a tough or flexible time with falsy, I
| found your problem. You shouldn't even be in that headspace,
| and languages that put you there introduce so much mental
| overhead that they're not worth it.
|
| You shouldn't need to do research here. This should be a one
| and done solution from your toolbox that you can pull out at a
| moment's notice: "How do I handle record soft deletion and
| store some state around when it happened?" - zero effort, tried
| and true recipe borne of experience.
|
| The database and data stores drive your product. They're the
| heart and soul.
| crazygringo wrote:
| So the problem is with SQL, because NULL doesn't act the same
| as false, and boolean operators may give incorrect results if
| you expect them to? I'm not sure how to respond if you think
| people shouldn't be in the headspace of SQL.
| echelon wrote:
| In SQL you won't typically coerce null to false, though
| there are mechanisms to accomplish this (which can be
| useful for analytics, ETLs, etc). Presence is first class
| and is probably what you want in your business logic
| queries.
|
| If you let the SQL query predicates handle which records to
| select, I'm not sure I see the problem you originally
| outlined. Why do you care what value the timestamp has when
| the question is whether or not a record experienced an
| event.
| jchw wrote:
| No, you have to do this anyways. Assuming you're using a SQL
| database, a boolean field can be NULL too, unless you
| explicitly mark it non-nullable. If you are worried about
| obscure edge cases, use languages and tools with less of them.
| For example, it is VERY easy to reason about the date field
| being set or not if your language of choice represents it with
| an optional type you have to unwrap in some fashion.
|
| But it's not really that bad in the worst case. Failing docs or
| the ability to read your db connector's source code enough to
| ascertain, you can always just test. Both in the sense of, try
| it and see what happens, and also, writing automated tests that
| ensure you're getting the expected behavior. If your line of
| logic is that hope is not a strategy, I agree. However, there
| are alternatives to hope that don't involve vacating best
| practices out of fear of hitting edge cases, and you probably
| want those automated tests anyways.
| Twisell wrote:
| That is precisely the point in any data oriented application
| (like a database) a boolean should actually have at least 3
| possible value:
|
| - True
|
| - False
|
| - Null/Nil/unset/error
|
| Depending on how the application handle the third case it can
| be more than 3 return values.
|
| timestamp require ad-hoc and potentially breakable workaround
| to emulate this behavior.
| jchw wrote:
| I'll have to disagree. I don't think I'd want that in most
| cases.
|
| It's unclear what general problem it solves. If the
| database row is actually corrupted, I want the query to
| fail violently, and the database to fail all operations
| around it to avoid further corruption. If I really need to
| handle it, I'll handle it in my app layer.
|
| If it's not, I don't want it to be set to some invalid
| state. I don't want every piece of code to need to deal
| with this invalid state that shouldn't be allowed to begin
| with. I'd rather gate it at the door and stop it from
| getting in there in the first place.
|
| If you use a nullable timestamp as a boolean, this is fine;
| there's only two states w.r.t. set or not: NULL or not
| NULL, from the SQL perspective. How this is encoded in your
| language and tools of choice depends. If there's some
| database value not encodable to your language's native
| types, the db connector will probably decide what to do:
| truncate data, fail the whole query, use a type with
| additional entropy (like Go's NullString, for example.) In
| no way do I want there to be an error state for that field
| alone. If I did, though, for example because I need to
| handle partial data as an application concern, I'd want
| union typing. In that case you could have
| Error/Unset/Timestamp just fine. I don't believe SQL has
| this sort of union type natively, but it can be represented
| in many programming languages intuitively (like unsafely as
| a union in C, safely as an enum in Rust, safely as a
| discriminated union in TypeScript, and mostly safely in Go
| abusing interfaces.) However, that is only something I want
| to do if my application specifically needs it. Otherwise?
| Data that goes into the database must meet the constraints
| first.
| Twisell wrote:
| Still after all this explaining you agree there is only
| two state instead of three.
|
| Usually it's used with the following logic for database
| oriented workflow.
|
| True = true
|
| False = false
|
| Null = unknow (This which might be a real-world case that
| you can model using null)
|
| Of course if there is an error the engine will thrown an
| exception Which is a whole other kind of signal.
| jchw wrote:
| What I don't get is, why would I want there to be a third
| state in this boolean example? You're saying it could be
| useful. That's true, it could. However, we're talking
| about something that today is already just a boolean,
| such as "is_published." Even if you can envision a reason
| why you might want a third unknown state for
| "is_published", if you write a bunch of code that treats
| it as a boolean, without knowing what the null state
| might mean, you are not solving a problem. You're just
| modelling your data slightly less precisely. Saving a
| date is saving known, useful state, even if you don't
| need it yet. Using a nullable type for a field that can't
| yet be null is just adding an additional unknown state
| for no reason.
|
| When using Go and sqlc, what would happen is I would use
| a NOT NULL field for the boolean, because I'm modelling a
| boolean, not a tri-state. If I was modelling something
| where I had three states I wanted to handle, I would use
| an enum instead. But even if I wanted to use NULL instead
| of an enum that explicitly specifies the meaning of the
| third state, I'd rather start as NOT NULL. Why? Because
| then, if I ever want to remove the constraint, I can. And
| when I do, sqlc will helpfully change the type of the
| field from a raw boolean to a nullable boolean wrapper.
| And doing _this_ will _break_ the existing code. This is
| good. Why? Because in every place where I assumed there
| were only two possible states, I can now check and see
| what I 'd want to do in my new third case. This is better
| than starting with nullability, because if I don't know
| what NULL means or why I'd have it, it's really difficult
| to know what my code should do when it encounters it. You
| could always throw errors, but if you do that you have to
| ensure you clean all of that up before you start actually
| using NULL.
|
| This is fundamentally about data modelling though.
| Talking in this abstract sense is not useful. What I feel
| you're failing to make an adequate case for is why a
| field like the one described in the linked article needs
| a state other than "set" or "unset". Just putting a
| nullable timestamp is a nearly free upgrade from putting
| a non nullable boolean, which is what I would've done
| anyways.
|
| If I want to add more states, I want explicit data model
| changes, that result in explicit test and type checking
| errors, that highlight wherever I need to update my
| code's state management to account for new states.
|
| Additional note: particular to SQL, removing not NULL
| constraints is basically free from the database end as
| far as I can tell. In addition, NULL has specific
| behaviors, especially w.r.t. JOINs, so I would NOT want
| to use NULL to model a tri-state.
| mixmastamyk wrote:
| > false or null
|
| Not possible with a timestamp field.
|
| > 00:00:00 UTC
|
| I remember a bug around this in Python regarding times (only) a
| decade or two ago, it was fixed a long time ago. I assume this
| is not an issue for modern stacks, but I would double check
| under say PHP which has a known history. Not normally an issue
| I'd say.
|
| > What happens when I combine two truthy values with a boolean
| operator
|
| Type or Runtime error. Who is adding booleans and dates
| together? Such egregious mistakes don't take long to surface
| and are impossible if planned against beforehand.
|
| I don't think these are things to be worried about (if you're
| paying attention even a tiny bit) to be honest.
| grahamm wrote:
| Never really felt the need but I suppose there has always been an
| updated_when field, so for something that is deleted the
| updated_when provides the time stamp.
| echelon wrote:
| Create an `updated_at` with an on update trigger to set it to
| NOW(). Your code will never have to explicitly remember to set
| it.
|
| Go further and create an integer `version` vector clock that
| gets set to its value plus one. You can see the edit frequency
| of individual entities and construct a simple key for cache
| invalidation.
___________________________________________________________________
(page generated 2022-12-28 23:01 UTC)