[HN Gopher] You might as well timestamp it (2021)
       ___________________________________________________________________
        
       You might as well timestamp it (2021)
        
       Author : Tomte
       Score  : 115 points
       Date   : 2022-12-28 17:59 UTC (5 hours ago)
        
 (HTM) web link (changelog.com)
 (TXT) w3m dump (changelog.com)
        
       | sgtnoodle wrote:
       | But what timebase and precision? UTC in seconds? Unix time in
       | nanoseconds? GPS time in 10us increments?
        
       | jpollock wrote:
       | The problem with the timestamp is that it keeps people from
       | thinking that they've got an enum.
       | 
       | For example, let's start with an object, which has a boolean flag
       | "active".
       | 
       | boolean is_active;
       | 
       | We will always want to get a bit more advanced, and get into sub-
       | states as it is brought up. We can't do that with a boolean. So
       | we go to an enum.
       | 
       | Enum current_state { not_configured; being_provisioned; active;
       | disabled; deleted; }
       | 
       | current_state state;
       | 
       | We have an enum in hiding. Converting the boolean to a timestamp
       | stops the conversion to an enum - we can't overload the presence
       | of a timestamp into the multiple values in the enum.
       | 
       | The timestamp is "what happened, when", an audit log/change
       | history function. That should be kept separate from the rest of
       | the database.
       | 
       | Sometimes, there is a business reason to track a timestamp -
       | charge the customer 30days after the item became "active". How to
       | model that depends on your database and what is required -
       | history table, convert the enum to struct, time_became_active,
       | time_of_last_state_change, etc.
        
         | TheRealPomax wrote:
         | If you're gonna use an enum for a true/false field, you might
         | as well timestamp it. If you need enums, you changed the game.
        
           | jpollock wrote:
           | The timestamp doesn't give you enough information to answer
           | the questions it's typically going to be used for. It tells
           | you "when", but not "who", and "what else at the same time".
           | 
           | A change history is a better solution, which would track:
           | 
           | * transaction requestor (user/job/etc)
           | 
           | * transaction timestamp
           | 
           | * diff encoding before/after state
           | 
           | Then if someone asks "why were these customer's disabled", we
           | can answer "the XYZ cronjob went rabid and tore them down."
           | and then hopefully use the diffs to reverse the transactions.
        
             | bwilliams wrote:
             | I think this may miss the point of the article, which is
             | pointing out that you can get a lot of value for very
             | little effort by using a timestamp instead of a boolean. I
             | don't think it's intention is to replace a complete change
             | history/audit log implementation, which would require a
             | significant amount more time/effort to implement.
        
             | horsawlarway wrote:
             | Sure - this is objectively better for tracking/auditing,
             | but... now we're getting back into trade-off territory.
             | 
             | Where is that change history stored? How much data are the
             | diffs generating? Are we deriving the final state from the
             | log, or can the log the and record disagree?
             | 
             | Basically - this is now back in the "What is this history
             | buying us?" realm, not in the "easy win" realm.
             | 
             | In some cases it can be absolutely worth it. But probably
             | not all.
        
             | idealmedtech wrote:
             | I recently converted some generation code to use a plan, so
             | you can make a plan and _then_ execute the plan.
             | Architecting your code this way gives you things like dry
             | runs and rollbacks for free; I feel like diff storage is in
             | a similar realm of usefulness.
        
             | maxbond wrote:
             | Timestamps are a huge observability win relative to the
             | amount of effort they take. "When" will give you hints to
             | pull the other information out of logs etc. What you're
             | proposing is better, but is a heavier lift; if you have
             | capacity for that, awesome, if you don't, timestamps go a
             | long way, take very little time to implement, and not
             | having any information will really sting.
        
         | cryptonector wrote:
         | Validity periods are superior to an is_active/is_expired
         | boolean, as you can set an expiration ahead of time without any
         | write transactions needed at expiration time to make the
         | expiration happen.
         | 
         | For your example, however, I'd be tempted to have an enum as
         | you suggest and associated history/audit table that has
         | timestamps, mainly because there's no sense in setting
         | timestamps into the future for the first two enum values.
         | Still, I'd keep a timestamp for inactivation for the reason I
         | gave above.
        
           | Retric wrote:
           | Expiration times still need a status flag for cleanup.
           | 
           | What should the state be vs what is the current state vs when
           | should the state change.
        
             | cryptonector wrote:
             | Why repeat yourself? If you can have an expression-valued
             | column, do that.
             | 
             | For expiration it's essential that it be able to happen at
             | the scheduled time (when it is scheduled anyways) without
             | having to execute a transaction _at that time_.
        
         | nine_k wrote:
         | This enum is a list of states.
         | 
         | Immediately should one think about a proper tool to handle it,
         | which usually is a finite state machine. FSMs are somehow
         | underappreciated outside of comm protocols and GUIs. They are
         | an adequate tool for quite a few "backend" or "data
         | representation" tasks though.
         | 
         | What the original post suggests is to timestamp the latest
         | state transition. The approach is trivially obvious if you
         | explicitly track state transitions as a part of the logic of
         | your FSM.
        
           | jpollock wrote:
           | That's a great way of phrasing it. A boolean can be a FSM in
           | hiding.
           | 
           | Tracking FSM transitions is the change history/transaction
           | log, isn't it?
           | 
           | It would be:
           | 
           | {timestamp, old_state, event, new_state}
           | 
           | If we convert booleans to timestamps for the FSM, I end up
           | with:
           | 
           | timestamp not_configured;
           | 
           | timestamp being_provisioned;
           | 
           | timestamp active;
           | 
           | timestamp disabled;
           | 
           | timestamp deleted;
           | 
           | Which would cause confusion when there are multiple FSMs in
           | the same database record.
           | 
           | We could isolate each FSM into its own column.
           | 
           | It still makes it kind-of hard to decide what's the current
           | state. The set of timestamps needs to be considered and
           | sorted, going from field to {field_name, timestamp} and then
           | sorting by timestamp and extracting the max. It would make
           | "select count(*) from customers where current_state =
           | 'active'" brittle - the addition of a new state would require
           | all existing queries to change.
        
       | valw wrote:
       | Or, you know, use a time-aware database like XTDB or Datomic.
        
       | RajT88 wrote:
       | I kind of understand why the example was set up this way, but I
       | died inside a little when he wrote to the console log without the
       | timestamp in the message.
        
       | kevinob11 wrote:
       | I've frequently just done both, which I suppose is rather silly,
       | but I like having a nicely named boolean field. I suppose I
       | should probably just make those derived properties though.
        
       | newbieuser wrote:
       | This should definitely be implemented in projects where the use
       | of the feature flag is common. When you go with boolean values,
       | after a while you have a pile of data that you don't know what
       | was done and when.
        
       | helmsb wrote:
       | I always store a timestamp and have an "is_deleted" flag on the
       | business object that checks if the "deleted_timestamp" is null or
       | depending on the use case, whether the timestamp is in the future
       | for records that are to be deleted.
       | 
       | I can also use the timestamp later if I want to move deleted
       | records to a different table or database to improve performance
       | depending on requirements.
        
       | [deleted]
        
       | simonw wrote:
       | This was one of Luke Plant's YAGNI exceptions:
       | https://lukeplant.me.uk/blog/posts/yagni-exceptions/
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _You might as well timestamp it_ -
       | https://news.ycombinator.com/item?id=26922759 - April 2021 (198
       | comments)
        
       | AJRF wrote:
       | > It's like the exact opposite of YAGNI. You Ain't Gonna Regret
       | It?
       | 
       | Pagni's?
       | 
       | https://simonwillison.net/2021/Jul/1/pagnis/
        
       | sedatk wrote:
       | If you need to hold timestamp, hold `deleted` and `deleted_at`
       | together.
       | 
       | If you can't add a separate field, just add a log table.
       | 
       | Don't optimize for fictional requirements that haven't surfaced,
       | and don't add shortcuts.
       | 
       | EDIT: Let me clarify my point here. A timestamp is just a
       | timestamp. Giving a timestamp a role on signifying the state of
       | an entity is a semantic shortcut, and semantic shortcuts lead to
       | problems. I believe that DB structures should be as clear as
       | possible. For example, it's impossible to know whether a
       | timestamp is the authority on an entity's state by looking at a
       | table structure. But a boolean field like "is_deleted" or
       | "deleted" doesn't have that problem.
        
         | bob1029 wrote:
         | Nullable timestamp columns (64-bit integers) are what I prefer.
         | 
         | I don't know how much of a performance improvement I'd get if I
         | had a separate indexed bool column for tracking _just_ the
         | truthiness, but I do know that it probably wouldn 't be enough
         | to matter, even _if_ SQL was a meaningful bottleneck.
         | 
         | I really enjoy working with these as nullable DateTime in .NET
         | as well:                 if (!record.CompletedUtc.HasValue) {
         | /* completion logic */ }
         | 
         | One source of truth also means only 1 thing to update each
         | time. It makes many kinds of problems impossible.
        
           | rektide wrote:
           | Indexing a 64 bit versus a 1 bit column actually sounds like
           | a potentially really good win to make the index much much
           | smaller, which could help loads with keeping it in hot memory
           | or cache.
           | 
           | On the other hand, on postgres & maybe others, one could
           | perhaps index an expression _deleted_at is not null_ , which
           | would let one have the data shape they want while letting the
           | index shape suit the typical access pattern.
        
         | nightpool wrote:
         | > For example, it's impossible to know whether a timestamp is
         | the authority on an entity's state by looking at a table
         | structure
         | 
         | This is just a matter of convention--if your codebase is
         | consistent about it and there's no other option for "is this
         | record deleted?", then yes it will be absolutely clear that
         | "deleted_at" is the right column to use.
         | 
         | What's _actually_ worse for making the database structure clear
         | is to have  "vestigial" columns that store old data, like you
         | proposed for "deleted" and "deleted_at". Now, if they disagree,
         | which one should you trust? Maybe some code is using one, while
         | some code is using the other. Having a single source-of-truth
         | column--no matter _what_ the name is--is always going to be
         | better then having two columns that could get out of sync or
         | disagree.
        
         | 0x457 wrote:
         | I think you should have log table anyway because if you need
         | that information storing it as column(s) next to the rest it's
         | not going to work well.
         | 
         | anyways i prefer to have nullable timeshamp columns instead of
         | two like you suggested.
        
         | samwillis wrote:
         | Log tables are a super good idea in a lot of cases. And if your
         | DB support it add a JSON column to that log table. Being able
         | to dump either stricter data from our own logging or the
         | reponcess from third part APIs in invaluable. You never know in
         | advance what is going to be useful.
         | 
         | If you start generating too much data, add a cron job to purge
         | older items or scale back the logging.
         | 
         | An example from our e-commerce platform, the "order/cart"
         | tables have corresponding log table that log all changes, user
         | interactions, and all requests/responses from Stripe during the
         | checkout process.
         | 
         | It's super useful both for tracking down bugs, but also
         | confirming what happed when customers contact support.
        
           | jrumbut wrote:
           | Yeah I much prefer keeping the deleted column as a boolean
           | and then having a log table. I've done the timestamps instead
           | of boolean thing before and it leads to two problems:
           | 
           | 1. What happens when the record is "un-deleted?" Now the
           | timestamp is null again and history has been lost. It's not
           | the end of the world but if you're asked to create an account
           | history report it's an awkward conversation. "I thought we
           | tracked when records were deleted..."
           | 
           | 2. Similar to the end of the last problem, it creates an
           | expectation that the times that changes are made are tracked
           | but it doesn't quite live up to that expectation. I've
           | learned that business people love a feature that works well
           | (like a log table), they can deal with a feature not existing
           | at all (like a boolean and no logging), but a half-working
           | feature causes a lot of suffering.
           | 
           | Just do logging right from the start if you are writing
           | business software where the data and transaction volume are
           | small compared to the value of the data.
        
         | manigandham wrote:
         | No need for both fields. A nullable timestamp column works
         | fine: NULL until the record/row is deleted, at which point it
         | holds the timestamp of when it happened.
         | 
         | You can derive a boolean "is it deleted" flag from that without
         | adding another column to (mis)manage.
        
           | sedatk wrote:
           | Because they signify different things. For example, when you
           | want to undo that delete operation, erasing timestamp would
           | actually deprive you of the information that when it was
           | deleted in the first place.
        
             | nightpool wrote:
             | Sure, but that just kicks the can one iteration down the
             | road, and now it's impossible to delete the record a SECOND
             | time without "depriving" yourself of that information. If
             | you need an audit table, use an audit table, but being able
             | to say "deleted_at > 1.month.ago" to grab all records
             | deleted in the past month is super useful.
        
             | manigandham wrote:
             | What is a delete operation that can be "undone"? What is
             | the timestamp supposed to mean in that scenario? What
             | happens when it's deleted again then?
             | 
             | If you need a full record of state changes then just
             | upgrade to an audit log, otherwise a nullable timestamp
             | field is perfectly fine.
        
               | sedatk wrote:
               | That's actually my main point. I edited my top comment to
               | clarify that. A timestamp doesn't implicate "state." A
               | bool, on the other hand, unquestionably does.
        
               | manigandham wrote:
               | Again, the _nullable_ nature of the column denotes both
               | the state and when the state changed. You have more
               | information in a single column, no need for both.
        
               | sedatk wrote:
               | Nullability on timestamp only implies that the timestamp
               | is optional that's all. It doesn't imply that the
               | timestamp decides the state of the entity.
        
           | temp2022account wrote:
           | time-series designs are generally better because they hold
           | more information, and you can throw a view on top of them to
           | present a nullable timestamp table if that's what's already
           | been coded against. On the other hand once you go nullable
           | it's difficult to figure out why the field was nulled after
           | the fact.
        
             | manigandham wrote:
             | It's a _deleted_ timestamp. It 's null until it's deleted.
             | Why would there be confusion on it being null?
        
         | echelon wrote:
         | Make deleted_at nullable and put an index on it.
         | 
         | An audit log table isn't a bad idea, though. You can capture
         | rich entity edit histories, attribution, etc.
        
           | [deleted]
        
       | crazygringo wrote:
       | I'm going to have to respectfully disagree on this one, just
       | because of unintended behavior and complexity.
       | 
       | Whether it's in a programming language or in a database... what
       | is false? Is it false or null or 1970-01-01 00:00:00 UTC? What if
       | different parts of the program/table use different versions of
       | false? What happens when I combine two truthy values with a
       | boolean operator -- for all combinations of values, what is the
       | resulting type and is it the value even correct? (E.g. in SQL,
       | NULL XOR 1672254005 is NULL, which is flat-out wrong if you're
       | using NULL in place of FALSE, and it's easy to think of further
       | examples.)
       | 
       | There's so much research I'd have to do to determine that this
       | would always be bug-free for a particular language/platform, and
       | then I'd need to remember which languages/platforms it is safe
       | for and which ones aren't.
        
         | echelon wrote:
         | For nullable timestamps, the timestamp itself is opaque. Having
         | a value is indication that the thing happened. The time reading
         | is orthogonal to that question.
         | 
         | If your language has a tough or flexible time with falsy, I
         | found your problem. You shouldn't even be in that headspace,
         | and languages that put you there introduce so much mental
         | overhead that they're not worth it.
         | 
         | You shouldn't need to do research here. This should be a one
         | and done solution from your toolbox that you can pull out at a
         | moment's notice: "How do I handle record soft deletion and
         | store some state around when it happened?" - zero effort, tried
         | and true recipe borne of experience.
         | 
         | The database and data stores drive your product. They're the
         | heart and soul.
        
           | crazygringo wrote:
           | So the problem is with SQL, because NULL doesn't act the same
           | as false, and boolean operators may give incorrect results if
           | you expect them to? I'm not sure how to respond if you think
           | people shouldn't be in the headspace of SQL.
        
             | echelon wrote:
             | In SQL you won't typically coerce null to false, though
             | there are mechanisms to accomplish this (which can be
             | useful for analytics, ETLs, etc). Presence is first class
             | and is probably what you want in your business logic
             | queries.
             | 
             | If you let the SQL query predicates handle which records to
             | select, I'm not sure I see the problem you originally
             | outlined. Why do you care what value the timestamp has when
             | the question is whether or not a record experienced an
             | event.
        
         | jchw wrote:
         | No, you have to do this anyways. Assuming you're using a SQL
         | database, a boolean field can be NULL too, unless you
         | explicitly mark it non-nullable. If you are worried about
         | obscure edge cases, use languages and tools with less of them.
         | For example, it is VERY easy to reason about the date field
         | being set or not if your language of choice represents it with
         | an optional type you have to unwrap in some fashion.
         | 
         | But it's not really that bad in the worst case. Failing docs or
         | the ability to read your db connector's source code enough to
         | ascertain, you can always just test. Both in the sense of, try
         | it and see what happens, and also, writing automated tests that
         | ensure you're getting the expected behavior. If your line of
         | logic is that hope is not a strategy, I agree. However, there
         | are alternatives to hope that don't involve vacating best
         | practices out of fear of hitting edge cases, and you probably
         | want those automated tests anyways.
        
           | Twisell wrote:
           | That is precisely the point in any data oriented application
           | (like a database) a boolean should actually have at least 3
           | possible value:
           | 
           | - True
           | 
           | - False
           | 
           | - Null/Nil/unset/error
           | 
           | Depending on how the application handle the third case it can
           | be more than 3 return values.
           | 
           | timestamp require ad-hoc and potentially breakable workaround
           | to emulate this behavior.
        
             | jchw wrote:
             | I'll have to disagree. I don't think I'd want that in most
             | cases.
             | 
             | It's unclear what general problem it solves. If the
             | database row is actually corrupted, I want the query to
             | fail violently, and the database to fail all operations
             | around it to avoid further corruption. If I really need to
             | handle it, I'll handle it in my app layer.
             | 
             | If it's not, I don't want it to be set to some invalid
             | state. I don't want every piece of code to need to deal
             | with this invalid state that shouldn't be allowed to begin
             | with. I'd rather gate it at the door and stop it from
             | getting in there in the first place.
             | 
             | If you use a nullable timestamp as a boolean, this is fine;
             | there's only two states w.r.t. set or not: NULL or not
             | NULL, from the SQL perspective. How this is encoded in your
             | language and tools of choice depends. If there's some
             | database value not encodable to your language's native
             | types, the db connector will probably decide what to do:
             | truncate data, fail the whole query, use a type with
             | additional entropy (like Go's NullString, for example.) In
             | no way do I want there to be an error state for that field
             | alone. If I did, though, for example because I need to
             | handle partial data as an application concern, I'd want
             | union typing. In that case you could have
             | Error/Unset/Timestamp just fine. I don't believe SQL has
             | this sort of union type natively, but it can be represented
             | in many programming languages intuitively (like unsafely as
             | a union in C, safely as an enum in Rust, safely as a
             | discriminated union in TypeScript, and mostly safely in Go
             | abusing interfaces.) However, that is only something I want
             | to do if my application specifically needs it. Otherwise?
             | Data that goes into the database must meet the constraints
             | first.
        
               | Twisell wrote:
               | Still after all this explaining you agree there is only
               | two state instead of three.
               | 
               | Usually it's used with the following logic for database
               | oriented workflow.
               | 
               | True = true
               | 
               | False = false
               | 
               | Null = unknow (This which might be a real-world case that
               | you can model using null)
               | 
               | Of course if there is an error the engine will thrown an
               | exception Which is a whole other kind of signal.
        
               | jchw wrote:
               | What I don't get is, why would I want there to be a third
               | state in this boolean example? You're saying it could be
               | useful. That's true, it could. However, we're talking
               | about something that today is already just a boolean,
               | such as "is_published." Even if you can envision a reason
               | why you might want a third unknown state for
               | "is_published", if you write a bunch of code that treats
               | it as a boolean, without knowing what the null state
               | might mean, you are not solving a problem. You're just
               | modelling your data slightly less precisely. Saving a
               | date is saving known, useful state, even if you don't
               | need it yet. Using a nullable type for a field that can't
               | yet be null is just adding an additional unknown state
               | for no reason.
               | 
               | When using Go and sqlc, what would happen is I would use
               | a NOT NULL field for the boolean, because I'm modelling a
               | boolean, not a tri-state. If I was modelling something
               | where I had three states I wanted to handle, I would use
               | an enum instead. But even if I wanted to use NULL instead
               | of an enum that explicitly specifies the meaning of the
               | third state, I'd rather start as NOT NULL. Why? Because
               | then, if I ever want to remove the constraint, I can. And
               | when I do, sqlc will helpfully change the type of the
               | field from a raw boolean to a nullable boolean wrapper.
               | And doing _this_ will _break_ the existing code. This is
               | good. Why? Because in every place where I assumed there
               | were only two possible states, I can now check and see
               | what I 'd want to do in my new third case. This is better
               | than starting with nullability, because if I don't know
               | what NULL means or why I'd have it, it's really difficult
               | to know what my code should do when it encounters it. You
               | could always throw errors, but if you do that you have to
               | ensure you clean all of that up before you start actually
               | using NULL.
               | 
               | This is fundamentally about data modelling though.
               | Talking in this abstract sense is not useful. What I feel
               | you're failing to make an adequate case for is why a
               | field like the one described in the linked article needs
               | a state other than "set" or "unset". Just putting a
               | nullable timestamp is a nearly free upgrade from putting
               | a non nullable boolean, which is what I would've done
               | anyways.
               | 
               | If I want to add more states, I want explicit data model
               | changes, that result in explicit test and type checking
               | errors, that highlight wherever I need to update my
               | code's state management to account for new states.
               | 
               | Additional note: particular to SQL, removing not NULL
               | constraints is basically free from the database end as
               | far as I can tell. In addition, NULL has specific
               | behaviors, especially w.r.t. JOINs, so I would NOT want
               | to use NULL to model a tri-state.
        
         | mixmastamyk wrote:
         | > false or null
         | 
         | Not possible with a timestamp field.
         | 
         | > 00:00:00 UTC
         | 
         | I remember a bug around this in Python regarding times (only) a
         | decade or two ago, it was fixed a long time ago. I assume this
         | is not an issue for modern stacks, but I would double check
         | under say PHP which has a known history. Not normally an issue
         | I'd say.
         | 
         | > What happens when I combine two truthy values with a boolean
         | operator
         | 
         | Type or Runtime error. Who is adding booleans and dates
         | together? Such egregious mistakes don't take long to surface
         | and are impossible if planned against beforehand.
         | 
         | I don't think these are things to be worried about (if you're
         | paying attention even a tiny bit) to be honest.
        
       | grahamm wrote:
       | Never really felt the need but I suppose there has always been an
       | updated_when field, so for something that is deleted the
       | updated_when provides the time stamp.
        
         | echelon wrote:
         | Create an `updated_at` with an on update trigger to set it to
         | NOW(). Your code will never have to explicitly remember to set
         | it.
         | 
         | Go further and create an integer `version` vector clock that
         | gets set to its value plus one. You can see the edit frequency
         | of individual entities and construct a simple key for cache
         | invalidation.
        
       ___________________________________________________________________
       (page generated 2022-12-28 23:01 UTC)