[HN Gopher] Common data model mistakes made by startups
___________________________________________________________________
Common data model mistakes made by startups
Author : ReginaDeiPirati
Score : 85 points
Date : 2021-05-22 16:34 UTC (6 hours ago)
(HTM) web link (www.metabase.com)
(TXT) w3m dump (www.metabase.com)
| giovannibonetti wrote:
| > Queries for business metrics are usually scattered, written by
| many people, and generally much less controlled. So do what you
| can to make it easy for your business to get the metrics it needs
| to make better decisions.
|
| A simple but useful thing is setting the database default time
| zone to match the one where most of your team is (instead of
| UTC). This reduces the chance your metrics are wrong because
| you forgot to set the time zone when extracting the date from
| a timestamp.
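|
| In Postgres, for instance, that's a one-liner (database name
| and zone here are made up):
|
|   -- default session TimeZone for every new connection
|   ALTER DATABASE app_db SET timezone TO 'America/New_York';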
| sethammons wrote:
| We went this route and ended up with a DB set to PST and some
| servers set to Chicago time. Endless time bugs. Pick one time
| zone for everything or just use Unix timestamps.
| xyzzy_plugh wrote:
| I cannot overstate how bad this advice is. Everything should be
| UTC by default. You can explicitly use timestamps with time
| zones, and frankly it's trivial to query something like
| midnight-to-midnight PST. Your team should learn this as early
| as possible.
|
| Build tooling around this, warn users, hell, educate them, but
| don't set up foot-guns like non-UTC.
|
| If I see a timestamp without a timezone, it must always be UTC.
| To do anything else is to introduce insanity.
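|
| For instance, in Postgres (table and column names made up),
| midnight-to-midnight Pacific over UTC-stored timestamptz data
| is just:
|
|   -- rows created on 2021-05-21, Pacific time (DST-safe)
|   SELECT count(*)
|   FROM events
|   WHERE created_at >=
|           timestamptz '2021-05-21 00:00 America/Los_Angeles'
|     AND created_at <
|           timestamptz '2021-05-22 00:00 America/Los_Angeles';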
| dehrmann wrote:
| I disagree on this because handling DST is error-prone.
| rm999 wrote:
| >Soft deletes
|
| This section is totally wrong IMO. What is the alternative?
| "Hard" deleting records from a table is usually a bad idea
| (unless it is for legal reasons), especially if that table's
| primary key is a foreign key in another table - imagine deleting
| a user and then having no idea who made an order. Setting a
| deleted/inactive flag is by far the lesser of two evils.
|
| >when multiplied across all the analytics queries that you'll
| run, this exclusion quickly starts to become a serious drag
|
| I disagree; modern analytics databases filter cheaply and
| easily. I have scaled data orgs 10-50x and never seen this
| become an issue. And if it really is an issue, you can remove
| these records in a transform layer before they hit your
| analytics team, e.g. in your data warehouse.
|
| >soft deletes introduce yet another place where different users
| can make different assumptions
|
| Again, you can transform these records out.
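|
| A sketch of that transform (table and column names made up):
|
|   -- analytics-facing model: soft-deleted rows never reach BI
|   CREATE TABLE analytics.users AS
|   SELECT id, email, created_at
|   FROM raw.users
|   WHERE deleted_at IS NULL;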
| watermelon0 wrote:
| Hard deletes most likely need to be supported, due to legal or
| contractual obligations. Designing with this in mind makes
| everything a lot easier in the long run.
| rm999 wrote:
| I've always NULL'd values, not deleted rows. E.g. GDPR
| request? NULL out all identifying information, but keep the
| record.
|
| As long as your primary key has no business meaning, you
| should never have to delete a row from a table.
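|
| Roughly like this (column names are hypothetical):
|
|   -- erase the PII, keep the row so FKs and history survive
|   UPDATE users
|   SET email = NULL, full_name = NULL, phone = NULL
|   WHERE id = 42;  -- the subject of the GDPR request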
| FigmentEngine wrote:
| IANAL, but... you might want to revisit that code. Article 17,
| the right to erasure, is about erasure of personal data, not
| about making it non-identifiable. Of course, they don't define
| "erase" or "delete" :-)
|
| (edit: typo)
| Tarq0n wrote:
| If you erase all identifying parts it stops meeting the
| definition of personal data. That should be sufficient.
| Aperocky wrote:
| Well, to me the transaction is the same as deleting a record
| and populating a NULL record in its place.
|
| I don't see why the law should care in any way about a
| company populating NULL records.
| FigmentEngine wrote:
| "NULL out all identifying information" is anonymization,
| not deleting the information.
| tgbugs wrote:
| Hard deletes are also awful from the perspective of data
| preservation. For example, when YouTube removes a video, they
| also delete all the metadata and any indication that it ever
| existed. Countless people have lost what they thought was a
| secure record of at least the title of songs or videos they
| saved to a playlist.
|
| There is also a more sinister side, which is that the ability
| to hard delete something forever means that bad actors can
| fabricate old "deleted" documents and accuse someone of having
| created and then deleted them.
| bulhi wrote:
| Exactly. I get OP's point (i.e. you can accidentally include
| soft-deleted records in your results), but for some types of
| data hard deletes are an absolute no-go anyway, so you just
| have to live with it.
| alexpetralia wrote:
| I do think that hard deletes may sometimes be required to
| comply with legal requirements (e.g. complete expungement of
| personal information relating to a user). If it is not
| required by statutory law, sometimes it is written into
| commercial contracts.
| irrational wrote:
| > if that table's primary key is a foreign key in another table
| - imagine deleting a user and then having no idea who made an
| order
|
| Assuming you have constraints set up correctly (ON DELETE NO
| ACTION or ON DELETE RESTRICT), then how could this ever happen?
| If you don't have constraints set up correctly...
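|
| By "set up correctly" I mean something like (schema made up):
|
|   CREATE TABLE orders (
|     id      bigint PRIMARY KEY,
|     user_id bigint NOT NULL
|       REFERENCES users (id) ON DELETE RESTRICT
|   );
|
| With that in place, deleting a user who still has orders
| fails loudly instead of silently orphaning them.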
| andrewprock wrote:
| The chance that you don't have constraints set up correctly
| is indistinguishable from 100%.
| quietbritishjim wrote:
| That would just make the data loss problem worse still. I
| realise OP just chose an arbitrary example, but if you really
| are talking about users and orders, and if you delete a user,
| then really deleting the records for their associated orders
| is even worse than losing track of who made them.
| vericiab wrote:
| Assuming you're deleting the row because it shouldn't be used
| by read queries, constraints like you described prevent the
| problem of having orphaned records in the child table but
| also prevent you from achieving your goal. On delete cascade
| would allow you to achieve your goal and prevent the orphaned
| records but could lead to deleting more than intended
| (especially if the child table is also a parent table
| referenced by further foreign key constraints, its children
| could in turn have children, etc). Of course, with no
| action/restrict you could also manually cascade the delete,
| but if you actually don't want to delete a child row and
| there's not an appropriate alternative value for its foreign
| key then you're in a bit of a pickle.
|
| So if you want to delete a user but keep the records of their
| orders and still know who made those orders, then some form
| of soft delete is probably your best option. I believe that's
| the point rm999 was making (in response to the article
| asserting that soft deletes are a "data model mistake").
| Properly configured constraints can prevent an "oops" but
| don't really do anything to solve the problem of needing a
| delete to apply in some contexts but not others.
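|
| To make the cascade risk concrete (hypothetical schema):
|
|   CREATE TABLE orders (
|     id      bigint PRIMARY KEY,
|     user_id bigint REFERENCES users (id) ON DELETE CASCADE
|   );
|   CREATE TABLE order_items (
|     id       bigint PRIMARY KEY,
|     order_id bigint REFERENCES orders (id) ON DELETE CASCADE
|   );
|
|   -- one statement quietly removes the user, every order,
|   -- and every order item down the chain
|   DELETE FROM users WHERE id = 42;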
| jayd16 wrote:
| What's the best way to construct a session?
|
| >The exact definition of what comprises a session typically
| changes as the app itself changes.
|
| Isn't this an argument for post-hoc reconstruction? You can
| consistently re-run your analytics. If the definition changes in
| code, your persisted data becomes inconsistent, no?
| brylie wrote:
| If your company has a subscription business model, keep a history
| of users' subscriptions. They change over time and it is likely
| you will need to measure popularity and profitability of product
| offerings over time. Please don't force your analytics team to
| rely on event logs to reconstruct a subscription history.
| maneesh wrote:
| Stripe manages this extremely well
| jabo wrote:
| This. You want to capture timestamps as users downgrade,
| upgrade, change quantity, churn, etc. If you have a status
| field, timestamp the changes to it. This way it's easy to get
| the state of the world on any given day, which is a common
| analysis that's done to study behavior of cohorts of
| subscriptions over time.
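|
| A minimal shape for this might be (names are made up):
|
|   CREATE TABLE subscription_events (
|     subscription_id bigint      NOT NULL,
|     status          text        NOT NULL,  -- active, churned...
|     quantity        int         NOT NULL,
|     changed_at      timestamptz NOT NULL DEFAULT now()
|   );
|
|   -- state of the world on 2021-01-01: latest event per
|   -- subscription at or before that date
|   SELECT DISTINCT ON (subscription_id) *
|   FROM subscription_events
|   WHERE changed_at < timestamptz '2021-01-02 00:00 UTC'
|   ORDER BY subscription_id, changed_at DESC;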
| pyrophane wrote:
| I think the biggest mistake some startups make wrt their data
| model is not really thinking about it at all. The data model
| winds up being the byproduct of all the features they've
| implemented and the framework and the libraries they've used,
| rather than something that was deliberately designed.
| duped wrote:
| Practically speaking, the data model creates very little value.
| If your startup is trying to make money, features are more
| important than design for a good stretch.
|
| There comes a time to refactor and fix your architecture but
| it's usually not at the beginning.
|
| You can't design a data model if you don't know what you're
| building. And no startup really knows what they're building.
| taeric wrote:
| Oddly, I feel the opposite is also a trap. A carefully crafted
| data model often stalls out compared to a grown one.
| allie1 wrote:
| I think it's a mistake that they don't revisit it occasionally,
| and if necessary pull the trigger on a new schema + migration
| scripts.
|
| Some early mistakes just can't be solved without a do-over, and
| from a recent experience, it ends up being less work than
| maintaining a flawed schema.
| vosper wrote:
| This is the place I work at. The data model was designed with
| a narrow focus. When that turned out to not be viable, the
| company moved into an adjacent and much larger market. But
| the names never changed, and the subtle differences between
| the two worlds were never addressed. So now our application is
| full of terminology and restrictions that confuse our
| customers, and our database doesn't match anyone's mental
| model of what the application does. It's all workable, but
| IMO we've paid (and still pay) a not-insignificant price in
| productivity and complexity because we never took the time to
| fix these things.
|
| At this point a ground-up rebuild is probably going to be no
| slower than trying to update the existing app. Neither will
| be cheap.
| allie1 wrote:
| I hear ya. Both would probably cost about the same today (the
| rebuild probably more), but the rebuild is still cheaper in
| the long run. Unless the business goes back to what it was,
| it will keep diverging from the current terminology. No
| manager wants to hear it, but taking a 3-6 month breather to
| address tech debt like this is worth its weight in gold.
| abraae wrote:
| At the other end of the scale is a data model designed for
| extreme extensibility.
|
| If you ever hear anyone bragging that their data model is
| entirely metadata driven, and can be used to model anything -
| without changing the database - that's a huge red flag, as is
| looking in and seeing tables called "element", "business
| object" and the like.
|
| Unfortunately, for most serious enterprise systems, a degree of
| flexibility is essential. It's being able to pick the right
| balance between hard coding first class domain objects into the
| database and allowing for extensibility that IMO marks the
| truly expert system designer.
| eterm wrote:
| A more common mistake, I think, is just trying to collect and
| hoard too much data.
|
| Even most of these worries, such as soft deletes, disappear if
| you're not trying to keep every scrap of data you can.
|
| Focus on the core business requirements and competencies, and
| you likely don't need to store the minutiae of every
| interaction forever.
| Pxtl wrote:
| How do you reconcile the first bullet point (polluting data with
| test data) vs Test In Production being the modern trend? Those
| sound irreconcilable.
| allie1 wrote:
| Include a cleanup step after each test?
| [deleted]
| konha wrote:
| > On the flip side, soft deletes require every single read query
| to exclude deleted records.
|
| You can use partial indexes to only index non-deleted rows. If
| you are worried about having to remember to exclude deleted rows
| from queries: Use a view to abstract away the implementation
| detail from your analytics queries.
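|
| In Postgres that could look like (names made up):
|
|   -- index only live rows; the planner uses it whenever a
|   -- query repeats the same predicate
|   CREATE INDEX users_email_live
|     ON users (email) WHERE deleted_at IS NULL;
|
|   -- analytics reads the view and never sees deleted rows
|   CREATE VIEW live_users AS
|     SELECT * FROM users WHERE deleted_at IS NULL;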
| handrous wrote:
| > 5. The "right database for the job" syndrome
|
| I once saw something a little similar to this, except with one
| flavor of DB rather than several. A company you've likely heard
| of went hard for a certain Java graph database product, due to a
| combination of an internal advocate who seemed determined to be
| The GraphDB Guy and an engineering manager who was weirdly
| susceptible to marketing material. This was because some of
| their data could be represented as graphs, so clearly a graph
| database was a good idea.
|
| However: the data for most of their products was tiny, rarely
| written, not even read that much really, even less commonly
| written concurrently, and was naturally sharded (with hard
| boundaries) among clients. Their use of that graph database
| product was plainly contributing to bugginess, operational pain,
| mediocre performance (it was reasonably fast... as long as
| you didn't want to both traverse a graph and fetch data
| related to that graph, at which point it became laughably
| slow) and low development
| velocity on multiple projects.
|
| Meanwhile, the best DB to deliver the features they wanted
| quickly & with some nice built-in "free" features for them
| (ability to control access via existing file sharing tools they
| had, for instance) was probably... SQLite.
| ridaj wrote:
| I would personally add:
|
| - Having informal metric and dimension definitions: you throw
| together something quick and dirty, and then realize there's
| something semantically broken or uneven about your data
| definitions. For example, your Android and iOS apps report
| "countries" differently, or have meaningfully different
| notions of "active users".
|
| - Not anticipating backfill/restatement needs. Bugs in logging
| and analytics stacks happen as much as anywhere else, so it's
| important to plan for backfills. Without a plan, backfills can be
| major fire drills or impossible.
|
| - Being over-attentive to ratio metrics (CTR, conversion
| rates), which are typically difficult to diagnose (step 1:
| figure out whether the numerator or the denominator is the
| problem). Ratio metrics can be useful for ranking N
| alternatives (e.g. campaign keywords), but absolute metrics
| are usually more useful for overall day-to-day monitoring.
|
| - Overlooking the usefulness of very simple, basic alerting.
| It's common for bugs to cause a metric to go to zero, be
| double-counted, or stop updating with recent data, but often
| even these highly obvious problems don't get detected until
| manual inspection.
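|
| Even a crude scheduled query catches most of these (table and
| column names are made up):
|
|   -- alert if yesterday's signups vanished or doubled
|   SELECT count(*) AS signups
|   FROM users
|   WHERE created_at::date = current_date - 1
|   HAVING count(*) = 0
|       OR count(*) > 2 * (SELECT count(*) FROM users
|                          WHERE created_at::date = current_date - 2);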
| void_mint wrote:
| > - Not anticipating backfill needs. Bugs in logging and
| analytics stacks happen, so it's important to plan for
| backfills. Without a plan, backfills can be major fire drills
| or impossible.
|
| This matches my experience. Building tools that allow you to
| rebuild some or all of a dataset with minimal headache makes
| any individual task much easier, both in terms of safety and
| in terms of things like branching/dev environments.
| rectang wrote:
| Metabase provides business analytics, and this list of "common
| mistakes" is weighted towards "choices which get in the way of
| business analytics".
|
| For example:
|
| > _1. Polluting your database with test or fake data_
|
| > _[...] By polluting your database with test data, you've
| introduced a tax on all analytics (and internal tool building) at
| your company._
| BigJono wrote:
| The end of this article is particularly weird. Is it really
| suggesting that a good general rule is to optimise for business
| metric queries (which sounds like something that would
| generally run daily during off peak hours or ad hoc when
| someone needs the data) over the most commonly run
| reads/updates (which sounds like something that will happen
| multiple times per minute for every active user)?
|
| I feel like I'm missing something because that seems insane to
| me.
| kwertyoowiyop wrote:
| Consider the source. The barber is suggesting optimizing for
| haircuts.
| asperous wrote:
| To your point, many of these could be addressed by making an
| analytics database copy of the transactional database, for
| example scrubbing test data and removing soft-deleted rows in
| your ETL.
|
| From my experience with Metabase, this makes it easier to use
| anyway, but it means you have to maintain an ETL.
| ineedasername wrote:
| _Polluting your database with test or fake data_
|
| Maybe I've been spoiled, but isn't it common to have dev, test,
| and prod instances? Possibly multiples of the former 2?
| jasonhansel wrote:
| > Typically semi-structured data have schemas that are only
| enforced by convention
|
| Technically, in Postgres you can (kind of) enforce arbitrary
| schemas for semi-structured data using CHECK constraints.
| Unfortunately this isn't well-documented and NoSQL DBs often
| don't support similar mechanisms.
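|
| A sketch of that (hypothetical table; requires jsonb):
|
|   CREATE TABLE events (
|     id      bigint PRIMARY KEY,
|     payload jsonb  NOT NULL,
|     CHECK (payload ? 'type'
|            AND jsonb_typeof(payload -> 'ts') = 'number')
|   );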
| ridaj wrote:
| Seems likely that the enforced schema would then break things
| when someone updates the "live" schema without updating all of
| the checks littered through downstream tables...
| dugmartin wrote:
| I would add to their semi-structured data fields section a
| suggestion to add a version or type key. Otherwise your code
| consuming those fields may grow over time into a bunch of
| conditionals to figure out what is in the JSON.
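|
| e.g. stamp each payload with a "v" key, and consumers can
| branch once instead of sniffing fields (sketch; key and
| column names are made up):
|
|   SELECT CASE payload ->> 'v'
|            WHEN '2' THEN payload ->> 'customer_id'
|            ELSE          payload ->> 'customerId'  -- v1 name
|          END AS customer_id
|   FROM events;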
___________________________________________________________________
(page generated 2021-05-22 23:00 UTC)