[HN Gopher] Common data model mistakes made by startups
       ___________________________________________________________________
        
       Common data model mistakes made by startups
        
       Author : ReginaDeiPirati
       Score  : 85 points
       Date   : 2021-05-22 16:34 UTC (6 hours ago)
        
 (HTM) web link (www.metabase.com)
 (TXT) w3m dump (www.metabase.com)
        
       | giovannibonetti wrote:
       | > Queries for business metrics are usually scattered, written by
       | many people, and generally much less controlled. So do what you
       | can to make it easy for your business to get the metrics it needs
       | to make better decisions.
       | 
        | A simple but useful thing is setting the database default time
        | zone to match the one where most of your team is (instead of
        | UTC). This reduces the chance your metrics are wrong because you
        | forgot to set the time zone when extracting the date of a
        | timestamp.
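        | 
        | For example, in Postgres (a sketch; "app" and "orders" are
        | made-up names, and this assumes a US East Coast team):
        | 
        |   ALTER DATABASE app SET timezone = 'America/New_York';
        | 
        |   -- new sessions now truncate to the team's local day
        |   SELECT created_at::date AS day, count(*)
        |   FROM orders
        |   GROUP BY day;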
        
         | sethammons wrote:
          | We went this route and ended up with a DB set to PST and some
          | servers on Chicago time. Endless time bugs. Pick one time zone
          | for everything, or just use Unix timestamps.
        
         | xyzzy_plugh wrote:
          | I cannot overstate how bad this advice is. Everything should be
          | UTC by default. You can explicitly use timestamps with time
          | zones, and frankly it's trivial to query something like
          | midnight-to-midnight PST. Your team should learn this as early
          | as possible.
         | 
         | Build tooling around this, warn users, hell, educate them, but
         | don't set up foot-guns like non-UTC.
         | 
         | If I see a timestamp without a timezone, it must always be UTC.
         | To do anything else is to introduce insanity.
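          | 
          | A sketch of what I mean in Postgres, assuming a made-up
          | "events" table whose created_at is a timestamptz (stored
          | as UTC internally):
          | 
          |   -- midnight-to-midnight Pacific time, DST handled by
          |   -- the zone name
          |   SELECT count(*)
          |   FROM events
          |   WHERE created_at >= timestamp '2021-05-21 00:00'
          |                       AT TIME ZONE 'America/Los_Angeles'
          |     AND created_at <  timestamp '2021-05-22 00:00'
          |                       AT TIME ZONE 'America/Los_Angeles';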
        
         | dehrmann wrote:
         | I disagree on this because handling DST is error-prone.
        
       | rm999 wrote:
       | >Soft deletes
       | 
       | This section is totally wrong IMO. What is the alternative?
       | "Hard" deleting records from a table is usually a bad idea
       | (unless it is for legal reasons), especially if that table's
       | primary key is a foreign key in another table - imagine deleting
       | a user and then having no idea who made an order. Setting a
        | deleted/inactive flag is by far the lesser of two evils.
       | 
       | >when multiplied across all the analytics queries that you'll
       | run, this exclusion quickly starts to become a serious drag
       | 
        | I disagree; modern analytics databases filter cheaply and easily.
       | I have scaled data orgs 10-50x and never seen this become an
       | issue. And if this is really an issue, you can remove these
       | records in a transform layer before it hits your analytics team,
       | e.g. in your data warehouse.
       | 
       | >soft deletes introduce yet another place where different users
       | can make different assumptions
       | 
       | Again, you can transform these records out.
        
         | watermelon0 wrote:
          | Hard deletes most likely need to be supported, due to legal or
          | contractual obligations. Designing with this in mind makes
          | everything a lot easier in the long run.
        
           | rm999 wrote:
           | I've always NULL'd values, not deleted rows. E.g. GDPR
           | request? NULL out all identifying information, but keep the
           | record.
           | 
            | As long as your primary key has no business meaning, you
            | should never have to delete a row from a table.
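            | 
            | Roughly (a sketch; table and column names are made up):
            | 
            |   -- anonymize instead of deleting, so foreign keys
            |   -- into this table stay valid
            |   UPDATE users
            |   SET email = NULL, full_name = NULL, phone = NULL
            |   WHERE id = 42;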
        
             | FigmentEngine wrote:
              | IANAL, but... you might want to revisit that code. Article
              | 17, the right to erasure, is about erasure of personal
              | data, not about making it non-identifiable. Of course they
              | don't define erase or delete :-)
             | 
             | (edit: typo)
        
               | Tarq0n wrote:
               | If you erase all identifying parts it stops meeting the
               | definition of personal data. That should be sufficient.
        
               | Aperocky wrote:
                | Well, to me the transaction is the same as deleting a
                | record and populating a NULL'd record.
               | 
               | I don't see why the law should care in any way about a
               | company populating NULL records.
        
               | FigmentEngine wrote:
               | "NULL out all identifying information" is anonymization,
               | not deleting the information.
        
         | tgbugs wrote:
         | Hard deletes are also awful from the perspective of data
          | preservation. For example, when YouTube removes a video, they
          | also delete all the metadata and any indication that it ever
         | existed. Countless people have lost what they thought was a
         | secure record of at least the title of songs or videos they
         | saved to a playlist.
         | 
         | There is also a more sinister side, which is that the ability
         | to hard delete something forever means that bad actors can
         | fabricate old "deleted" documents and accuse someone of having
         | created and then deleted them.
        
           | bulhi wrote:
            | Exactly. I get OP's point (i.e. you can accidentally include
            | soft-deleted records in your results), but for some types of
            | data hard deletes are an absolute no-go anyway, so you just
            | have to live with it.
        
           | alexpetralia wrote:
           | I do think that hard deletes may sometimes be required to
           | comply with legal requirements (e.g. complete expungement of
           | personal information relating to a user). If it is not
           | required by statutory law, sometimes it is written into
           | commercial contracts.
        
         | irrational wrote:
         | > if that table's primary key is a foreign key in another table
         | - imagine deleting a user and then having no idea who made an
         | order
         | 
          | Assuming you have constraints set up correctly (ON DELETE NO
          | ACTION or ON DELETE RESTRICT), then how could this ever happen?
         | If you don't have constraints set up correctly...
        
           | andrewprock wrote:
           | The chance that you don't have constraints set up correctly
           | is indistinguishable from 100%.
        
           | quietbritishjim wrote:
            | That would just make the data loss problem worse still. I
            | realise OP just chose an arbitrary example, but if you really
            | are talking about users and orders, then when you delete a
            | user, really deleting the records for their associated orders
            | is even worse than losing track of who made them.
        
           | vericiab wrote:
           | Assuming you're deleting the row because it shouldn't be used
           | by read queries, constraints like you described prevent the
           | problem of having orphaned records in the child table but
            | also prevent you from achieving your goal. ON DELETE CASCADE
           | would allow you to achieve your goal and prevent the orphaned
           | records but could lead to deleting more than intended
           | (especially if the child table is also a parent table
           | referenced by further foreign key constraints, its children
           | could in turn have children, etc). Of course, with no
           | action/restrict you could also manually cascade the delete,
           | but if you actually don't want to delete a child row and
           | there's not an appropriate alternative value for its foreign
           | key then you're in a bit of a pickle.
           | 
           | So if you want to delete a user but keep the records of their
           | orders and still know who made those orders, then some form
           | of soft delete is probably your best option. I believe that's
           | the point rm999 was making (in response to the article
           | asserting that soft deletes are a "data model mistake").
            | Properly configured constraints can prevent an "oops", but
            | they don't really do anything to solve the problem of
            | removing this sort of row from some contexts but not others.
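            | 
            | As a concrete sketch (Postgres, made-up names) of the
            | soft-delete option:
            | 
            |   CREATE TABLE users (
            |     id bigint PRIMARY KEY,
            |     deleted_at timestamptz  -- NULL means active
            |   );
            | 
            |   CREATE TABLE orders (
            |     id bigint PRIMARY KEY,
            |     user_id bigint NOT NULL
            |       REFERENCES users (id) ON DELETE RESTRICT
            |   );
            | 
            |   -- "delete" the user; orders keep a valid user_id
            |   UPDATE users SET deleted_at = now() WHERE id = 42;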
        
       | jayd16 wrote:
        | What's the best way to construct a session?
       | 
       | >The exact definition of what comprises a session typically
       | changes as the app itself changes.
       | 
       | Isn't this an argument for post-hoc reconstruction? You can
       | consistently re-run your analytics. If the definition changes in
       | code, your persisted data becomes inconsistent, no?
        
       | brylie wrote:
        | If your company has a subscription business model, keep a history
        | of users' subscriptions. They change over time, and it is likely
        | you will need to measure the popularity and profitability of your
        | product offerings over time. Please don't force your analytics
        | team to rely on event logs to reconstruct a subscription history.
        
         | maneesh wrote:
         | Stripe manages this extremely well
        
         | jabo wrote:
         | This. You want to capture timestamps as users downgrade,
         | upgrade, change quantity, churn, etc. If you have a status
         | field, timestamp the changes to it. This way it's easy to get
         | the state of the world on any given day, which is a common
         | analysis that's done to study behavior of cohorts of
         | subscriptions over time.
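          | 
          | One way to structure that, as a sketch with invented
          | names: an append-only history table next to the
          | subscription itself.
          | 
          |   CREATE TABLE subscription_events (
          |     id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
          |     subscription_id bigint NOT NULL,
          |     status text NOT NULL,   -- active, churned, ...
          |     quantity int NOT NULL,
          |     changed_at timestamptz NOT NULL DEFAULT now()
          |   );
          | 
          | The state of the world on day X is then just the latest
          | event per subscription with changed_at <= X.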
        
       | pyrophane wrote:
       | I think the biggest mistake some startups make wrt their data
       | model is not really thinking about it at all. The data model
        | winds up being a byproduct of all the features they've
        | implemented and of the framework and libraries they've used,
        | rather than something that was deliberately designed.
        
         | duped wrote:
          | Practically speaking, the data model creates very little
          | value. If your startup is trying to make money, features are
          | more important than design for a good stretch.
         | 
         | There comes a time to refactor and fix your architecture but
         | it's usually not at the beginning.
         | 
          | You can't design a data model if you don't know what you're
          | building. And no startup really knows what they're building.
        
         | taeric wrote:
         | Oddly, I feel the opposite is also a trap. A carefully crafted
         | data model often stalls out compared to a grown one.
        
         | allie1 wrote:
         | I think it's a mistake that they don't revisit it occasionally,
         | and if necessary pull the trigger on a new schema + migration
         | scripts.
         | 
         | Some early mistakes just can't be solved without a do-over, and
         | from a recent experience, it ends up being less work than
         | maintaining a flawed schema.
        
           | vosper wrote:
           | This is the place I work at. The data model was designed with
           | a narrow focus. When that turned out to not be viable, the
           | company moved into an adjacent and much larger market. But
           | the names never changed, and the subtle differences between
            | the two worlds were never addressed. So now our application is
           | full of terminology and restrictions that confuse our
           | customers, and our database doesn't match anyone's mental
           | model of what the application does. It's all workable, but
            | IMO we've paid (and still pay) a not-insignificant price in
           | productivity and complexity because we never took the time to
           | fix these things.
           | 
           | At this point a ground-up rebuild is probably going to be no
           | slower than trying to update the existing app. Neither will
           | be cheap.
        
             | allie1 wrote:
              | I hear ya. Today, both would probably cost about the same
              | (the rebuild probably more), but a rebuild is still cheaper
              | in the long run. Unless the business goes back to what it
              | was, it will
             | keep diverging from the current terminology. No manager
             | wants to hear it, but taking a 3-6 month breather to
             | address tech debt like this is worth its weight in gold.
        
         | abraae wrote:
         | At the other end of the scale is a data model designed for
         | extreme extensibility.
         | 
         | If you ever hear anyone bragging that their data model is
         | entirely metadata driven, and can be used to model anything -
         | without changing the database - that's a huge red flag, as is
         | looking in and seeing tables called "element", "business
         | object" and the like.
         | 
         | Unfortunately, for most serious Enterprise systems, a degree of
         | flexibility is essential. It's being able to pick the right
         | balance between hard coding first class domain objects into the
         | database and allowing for extensibility that IMO marks the
         | truly expert system designer.
        
       | eterm wrote:
        | A more common mistake, I think, is just trying to collect and
        | hoard too much data.
       | 
       | Most of even these worries such as soft deletes disappear if
       | you're not trying to keep every scrap of data you can.
       | 
       | Focus on the core business requirements and competencies and you
        | likely don't need to store the minutiae of every interaction
       | forever.
        
       | Pxtl wrote:
        | How do you reconcile the first bullet point (polluting your
        | database with test data) with Test In Production being the
        | modern trend? Those sound irreconcilable.
        
         | allie1 wrote:
         | Include a cleanup step after each test?
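          | 
          | E.g. (just a sketch, assuming a hypothetical is_test flag
          | on rows your tests create):
          | 
          |   -- cleanup step after the test run
          |   DELETE FROM orders WHERE is_test;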
        
       | [deleted]
        
       | konha wrote:
       | > On the flip side, soft deletes require every single read query
       | to exclude deleted records.
       | 
        | You can use partial indexes to only index non-deleted rows. If
        | you are worried about having to remember to exclude deleted rows
        | from queries, use a view to abstract away the implementation
        | detail from your analytics queries.
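        | 
        | A Postgres sketch of both, assuming a made-up users table
        | with a deleted_at column:
        | 
        |   -- partial index: only live rows are indexed
        |   CREATE INDEX users_active_email_idx
        |     ON users (email)
        |     WHERE deleted_at IS NULL;
        | 
        |   -- view so queries can't forget the filter
        |   CREATE VIEW active_users AS
        |     SELECT * FROM users
        |     WHERE deleted_at IS NULL;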
        
       | handrous wrote:
       | > 5. The "right database for the job" syndrome
       | 
       | I once saw something a little similar to this, except with one
       | flavor of DB rather than several. A company you've likely heard
       | of went hard for a certain Java graph database product, due to a
       | combination of an internal advocate who seemed determined to be
       | The GraphDB Guy and an engineering manager who was weirdly
        | susceptible to marketing material. This was because some of
        | their data could be represented as graphs, so clearly a graph
        | database was a good idea.
       | 
       | However: the data for most of their products was tiny, rarely
       | written, not even read that much really, even less commonly
       | written concurrently, and was naturally sharded (with hard
       | boundaries) among clients. Their use of that graph database
       | product was plainly contributing to bugginess, operational pain,
       | mediocre performance (it was reasonably fast... as long as you
       | didn't want to both traverse a graph and fetch data related to
       | that graph, then it was laughably slow) and low development
       | velocity on multiple projects.
       | 
       | Meanwhile, the best DB to deliver the features they wanted
       | quickly & with some nice built-in "free" features for them
       | (ability to control access via existing file sharing tools they
       | had, for instance) was probably... SQLite.
        
       | ridaj wrote:
       | I would personally add:
       | 
        | - Having informal metric and dimension definitions: you throw
        | together something quick and dirty and then realize there's
        | something semantically broken or uneven about your data
        | definitions. For example, your Android and iOS apps report
        | "countries" differently, or they have meaningfully different
        | notions of "active users".
       | 
       | - Not anticipating backfill/restatement needs. Bugs in logging
       | and analytics stacks happen as much as anywhere else, so it's
       | important to plan for backfills. Without a plan, backfills can be
       | major fire drills or impossible.
       | 
        | - Being over-attentive to ratio metrics (CTR, conversion rates),
        | which are typically difficult to diagnose (step 1: figure out
        | whether the numerator or the denominator is the problem). Ratio
        | metrics can be useful to rank N alternatives (e.g. campaign
        | keywords), but absolute metrics are usually more useful for
        | overall day-to-day monitoring.
       | 
        | - Overlooking the usefulness of very simple, basic alerting. It's
        | common for bugs to cause a metric to go to zero, to be double
        | counted, or to not be updated with recent data, but often even
        | these highly obvious problems don't get detected until someone
        | inspects the data manually.
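        | 
        | On that last point, even a scheduled query like this (a
        | sketch over a made-up events table) catches the "metric
        | went to zero" and "data stopped arriving" classes of bug:
        | 
        |   -- fire an alert if nothing landed in the last day
        |   SELECT count(*) = 0 AS should_alert
        |   FROM events
        |   WHERE created_at >= now() - interval '1 day';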
        
         | void_mint wrote:
         | > - Not anticipating backfill needs. Bugs in logging and
         | analytics stacks happen, so it's important to plan for
         | backfills. Without a plan, backfills can be major fire drills
         | or impossible.
         | 
          | This matches my experience. Building tools that allow you to
          | rebuild some or all of a dataset with minimal headache makes
          | any individual task much easier, both in terms of safety and
          | in terms of things like branching/dev environments.
        
       | rectang wrote:
       | Metabase provides business analytics, and this list of "common
       | mistakes" is weighted towards "choices which get in the way of
       | business analytics".
       | 
       | For example:
       | 
       | > _1. Polluting your database with test or fake data_
       | 
       | > _[...] By polluting your database with test data, you've
       | introduced a tax on all analytics (and internal tool building) at
       | your company._
        
         | BigJono wrote:
         | The end of this article is particularly weird. Is it really
         | suggesting that a good general rule is to optimise for business
         | metric queries (which sounds like something that would
         | generally run daily during off peak hours or ad hoc when
         | someone needs the data) over the most commonly run
         | reads/updates (which sounds like something that will happen
         | multiple times per minute for every active user)?
         | 
         | I feel like I'm missing something because that seems insane to
         | me.
        
           | kwertyoowiyop wrote:
           | Consider the source. The barber is suggesting optimizing for
           | haircuts.
        
         | asperous wrote:
          | To your point, many of these could be addressed by making an
          | analytics database copy of the transactional database, for
          | example by scrubbing test data and removing soft deletes in
          | your ETL.
          | 
          | From my experience with Metabase, this makes it easier to use
          | anyway, but it means you have to maintain an ETL.
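          | 
          | The transform can be as simple as (a sketch, made-up
          | names and flags):
          | 
          |   -- analytics copy with test rows and soft deletes
          |   -- filtered out up front
          |   CREATE TABLE analytics.users AS
          |     SELECT *
          |     FROM prod.users
          |     WHERE NOT is_test
          |       AND deleted_at IS NULL;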
        
       | ineedasername wrote:
       | _Polluting your database with test or fake data_
       | 
        | Maybe I've been spoiled, but isn't it common to have dev, test,
        | and prod instances? Possibly multiples of the former two?
        
       | jasonhansel wrote:
       | > Typically semi-structured data have schemas that are only
       | enforced by convention
       | 
       | Technically, in Postgres you can (kind of) enforce arbitrary
       | schemas for semi-structured data using CHECK constraints.
       | Unfortunately this isn't well-documented and NoSQL DBs often
       | don't support similar mechanisms.
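        | 
        | A Postgres sketch of what that looks like (names are made
        | up):
        | 
        |   CREATE TABLE events (
        |     id bigint PRIMARY KEY,
        |     payload jsonb NOT NULL,
        |     -- enforce a minimal "schema" on the blob
        |     CHECK (payload ? 'type'
        |            AND jsonb_typeof(payload -> 'type') = 'string')
        |   );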
        
         | ridaj wrote:
         | Seems likely that the enforced schema would then break things
         | when someone updates the "live" schema without updating all of
         | the checks littered through downstream tables...
        
       | dugmartin wrote:
        | I would add to their semi-structured data fields section a
        | suggestion to add a version or type key. Otherwise the code
        | consuming those fields may grow over time into a bunch of
        | conditionals to figure out what is in the JSON.
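        | 
        | E.g. (a sketch, with made-up field names) write a
        | {"version": 2, ...} key into the blob so consumers can
        | branch once, cleanly:
        | 
        |   SELECT CASE (payload ->> 'version')::int
        |            WHEN 1 THEN payload ->> 'amount_cents'
        |            ELSE payload -> 'amount' ->> 'cents'
        |          END AS amount_cents
        |   FROM events;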
        
       ___________________________________________________________________
       (page generated 2021-05-22 23:00 UTC)