[HN Gopher] The Rise and Fall of the OLAP Cube (2020)
___________________________________________________________________
The Rise and Fall of the OLAP Cube (2020)
Author : smartmic
Score : 129 points
Date : 2021-07-05 11:18 UTC (11 hours ago)
(HTM) web link (www.holistics.io)
(TXT) w3m dump (www.holistics.io)
| random3 wrote:
| Brings up nostalgic memories from my previous life building
| distributed analytics systems. We had a fully distributed,
| scalable, real-time OLAP cube 10 years ago on top of HBase and
| Storm (or any other realtime processing pipeline). It was less
| than 30 KLOC and could be used as a library as well as fully
| distributed system.
|
| There are several OSS solutions but I still weep every time I
| think about the stupid reasons for which we delayed open-sourcing
| it. It was such a simple and scalable data model including many
| concepts that were not common at the time.
| https://medium.com/hstackdotorg/hbasecon-low-latency-olap-wi...
|
| I still believe there's room for a similar simple approach,
| likely taking advantage of some of the developments related to
| Algebird, Calcite, Gandiva, etc.
| georgewfraser wrote:
| The OLAP cube needs to come back, as an _optimization_ of a data-
| warehouse-centric workflow. If you are routinely running queries
| like:
|
|     SELECT dimension, measure
|     FROM table
|     WHERE filter = ?
|     GROUP BY 1
|
| You can save a lot of compute time by creating a materialized
| view [1] of:
|
|     SELECT dimension, filter, measure
|     FROM table
|     GROUP BY 1, 2
|
| and the query optimizer will even automatically redirect your
| query for you! At this point, the main thing we need is for BI
| tools to take advantage of this capability in the background.
|
| [1] https://docs.snowflake.com/en/user-guide/views-materialized....
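The pre-aggregation idea above can be sketched via Python's sqlite3. SQLite has no true materialized views, so a summary table stands in for one, and the commenter's `table` is renamed `fact` (TABLE is a keyword); all names here are placeholders, not a real schema:

```python
import sqlite3

# Sketch of the pre-aggregation idea. SQLite lacks materialized views,
# so a summary table stands in; table/column names are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (dimension TEXT, filter TEXT, measure REAL)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "x", 2.0), ("a", "y", 4.0), ("b", "x", 8.0)],
)

# "Materialized view": aggregate once at (dimension, filter) grain.
conn.execute("""
    CREATE TABLE fact_agg AS
    SELECT dimension, filter, SUM(measure) AS measure
    FROM fact GROUP BY 1, 2
""")

# The filtered query is now answered from the much smaller aggregate
# instead of rescanning the base table.
rows = conn.execute("""
    SELECT dimension, SUM(measure)
    FROM fact_agg WHERE filter = 'x'
    GROUP BY 1 ORDER BY 1
""").fetchall()
print(rows)  # [('a', 3.0), ('b', 8.0)]
```

In a warehouse like Snowflake the optimizer can redirect the original query to such a materialized view automatically; here the rewrite is done by hand.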
| motogpjimbo wrote:
| Did you mean to write HAVING in your first query? Otherwise
| your second query is not equivalent to the first, because the
| WHERE will not be performed prior to the aggregation.
| buremba wrote:
| While native materialized view features are a great start,
| unfortunately they're not very useful in practice if you have
| data lineage. They work like a black box, and they can't
| guarantee query performance.
|
| New-generation ELT tools such as dbt partially solve this
| problem. You can model your data and incrementally update the
| tables that can be used in your BI tools. Looker's Aggregate
| Awareness is also a great start but unfortunately it only works
| for Looker.
|
| We try to solve this problem with metriql as well:
| https://metriql.com/introduction/aggregates The idea is to
| define these measures & dimensions once and use them everywhere:
| your BI tools, data science tools, etc.
|
| Disclaimer: I'm the tech lead of the project.
| glogla wrote:
| Is that related to lightdash.com somehow?
|
| It seems like a very similar technology and also the webpages
| for both are almost identical.
| buremba wrote:
| We both use dbt as the data modeling layer but we don't
| actually develop a standalone BI tool. Instead, we
| integrate to third-party BI tools such as Data Studio,
| Tableau, Metabase, etc.
|
| We love Looker and wanted to bring the LookML experience to
| existing BI tools rather than introducing a new BI tool;
| that's how metriql was born. I believe that Lightdash is a
| cool project, especially for data analysts who are
| extensively using dbt, but metriql targets users who are
| already using a BI tool. I'm not sure which pages are
| identical; could you point me to them?
| glogla wrote:
| Compare https://metriql.com/introduction/creating-
| datasets and https://docs.lightdash.com/guides/how-to-
| create-metrics
|
| I thought you were affiliated somehow, but looking at it
| now, it seems you just use the same documentation site
| generator :)
| mulmen wrote:
| But those queries aren't equivalent so how is anything saved by
| materializing the second one?
|
| e:
|
| I believe (I could be wrong!) you edited the second query from
|
|     SELECT dimension, measure
|     FROM table
|     GROUP BY 1
|
| to
|
|     SELECT dimension, filter, measure
|     FROM table
|     GROUP BY 1, 2
|
| This addresses the filtering but how is that any different from
| the original table? Presumably `table` could have been a finer
| grain than the filter and dimension but you'd do better to add
| the rest of the dimensions as well, at which point you're most
| of the way to a star schema.
|
| This kind of pre-computed aggregate is typical in data
| warehousing. But is it really an "OLAP cube"?
|
| In general I agree there is value in the methods of the past
| and we would be well served to adapt those concepts to our work
| today.
| georgewfraser wrote:
| It's much smaller than the original table. If you compute
| lots of these, then voila, you have an OLAP cube.
| motogpjimbo wrote:
| He did edit his comment, but unfortunately didn't acknowledge
| the edit.
| _dark_matter_ wrote:
| looker does exactly this, though you do have to specify which
| dimensions to aggregate: https://docs.looker.com/data-
| modeling/learning-lookml/aggreg...
| unklefolk wrote:
| I'm not a big fan of OLAP cubes / data warehouses but it should
| be noted that these aren't always just about scale and speed.
| They also serve to provide a single view of the data when the
| data is held across multiple databases / systems. Your orders
| may be in PostgreSQL, your inventory tracked in MySQL, etc. So
| bringing all that data into a single database for analytics can
| be valuable.
| chipsa wrote:
| This isn't against data warehouses, but just the cubes.
| Bringing everything into a single database for analysis is
| still required. But you extract, load, then transform,
| instead of extract, transform, load.
| rossdavidh wrote:
| While that's true, it feels to me that your comment made this
| point a lot more clearly than the article did.
| AtlasBarfed wrote:
| Which is why hadoop took off?
| debarshri wrote:
| I am not a subject matter expert but I spent the first few
| years of my career doing data warehousing. I think OLAP cubes are
| falling out of fashion because traditional BI and reporting in
| general is falling out of fashion.
|
| With self-service BI tools like Looker, business folks are
| running more ad-hoc queries instead of relying on data
| traditionally piped through ETL processes.
| dang wrote:
| One past thread:
|
| _The Rise and Fall of the OLAP Cube_ -
| https://news.ycombinator.com/item?id=22189178 - Jan 2020 (53
| comments)
| gengstrand wrote:
| My first exposure to OLAP was on a team developing a front end to
| Essbase that semi-automated report generation after reloading the
| cube. Since then, I have worked with various OLAP technologies
| from Microsoft to Mondrian.
|
| I agree with Mr. Chin, the author of this blog. OLAP cubes
| can scale up but not out which limits their potential size so
| they can't really accommodate big data scenarios. My previous
| company used Apache Druid instead. Druid can do a lot of what
| OLAP did but using a distributed approach. Although Druid does
| not provide tight integration with spreadsheets, it does provide
| near real-time ingestion, on-the-fly rollup, and first class time
| series analysis.
| iblaine wrote:
| I'm a bit opinionated on data modelling, so a couple points.
|
| 1. "OLAP Cubes" arguably belong to Microsoft and refer to SQL
| Server cubes that require MDX queries. It's a solution given to
| us by Microsoft that comes with well understood features. "OLAP"
| itself is a general term used to describe any database used for
| analytical processing, so OLAP has a wide range of uses.
|
| 2. OLAP Cubes (as defined above) started to decline in
| popularity in 2015 (I'd argue).
|
| 3. Any solution to fixing "OLAP" that comes from a commercial
| vendor is suspicious. As painful as Kimball & Inmon are, they
| are ideas that don't require vendor lock-in.
|
| 4. At my current company, we recently wrapped up a process where
| we set out to define the future of our DW. Our DW was encumbered
| by various design patterns and contractors that came and went
| over the past decade. We analyzed the various tables & ETLs to
| come up with a clear set of principles. The end result looks
| eerily close to Kimball, but with our own naming conventions.
| Our new set of principles clearly solves our business needs.
| The point being, you don't need Kimball, Inmon, Looker, etc. to
| solve data modelling problems.
|
| 5. Columnar databases are so pervasive these days that you should
| no longer need to worry about a lack of options when choosing the
| right storage engine for your OLAP db.
|
| 6. More and more data marts are being defined by BI Teams
| w/little modeling/CS skills. To that end, I think it's important
| to educate your BI teams as a means to minimize the fallout, and
| be accommodating in your DW as means to work efficiently with
| your BI teams. This is to say settle on a set of data modelling
| principles that work for you, but may not work for someone else.
| willvarfar wrote:
| I think the article is spot on about workflow but off on tech.
| Under the hood, Cubes will be back.
|
| Snowflake is the cubeless column store.
|
| And now Firebolt comes along, promising orders-of-magnitude
| improvements. How? Largely "join indexes" and "aggregation
| indexes", which seem to smell like cubes.
|
| So take the modern column store and slap some cubes in front.
|
| In Firebolt these seem to be added by programmers. But I think
| the future will see OLAP MPP engines that transparently pick
| cubes, maintain them, reuse them, and discard them, all
| automatically.
| dominotw wrote:
| > How? Largely "join indexes" and "aggregation indexes" which
| seem to smell like cubes.
|
| Just curious: what stops Snowflake from adding these too?
| Trying to understand what innovation Firebolt is doing.
| elthor89 wrote:
| "Be suspicious of companies that are heavily locked into the OLAP
| cube workflow. (Learn how to do this here)."
|
| The opposite is true too. Be suspicious of companies trying to
| lock you into their self-service data visualization tool. I
| have seen many BI tool vendors trying to lock their clients
| into their tool. As a result they end up running unnecessarily
| complex models.
|
| I find OLAP a pretty good mental model that developers and
| users can understand.
|
| The sheer amount of handwritten, tailor-made ETL or ELT
| pipelines is something I'd like to see automated or replaced
| with something better.
| glogla wrote:
| Yup. At my job we ended up with some way-too-important stuff
| "modelled in the BI tool" in a way that nobody could understand
| how it actually worked, and we had to do a full rewrite to
| model it in SQL.
|
| But not all BI tools are alike - something like Superset which
| internally uses SQL and expects users to do self-service
| transformation using views is easy to industrialize since you
| already have the queries. Something like Tableau, not so much.
|
| (If only Superset were more mature)
| [deleted]
| [deleted]
| malkia wrote:
| At a previous job, on a human ads evaluation team, we had a big
| row-oriented table that got turned into a column-oriented one
| every couple of hours. I think it got close to a million columns
| (expanded from leaf protobuf fields). We had some black/white
| listing as the schema grew too big (~20GB? just the schema, and
| the limits were a few gigs). That was years ago, so it has
| probably changed a lot.
|
| But the use case was for various analysts (statisticians,
| linguists, etc., or any internal users of the table from other
| teams), as their queries might use any column they like. There
| was probably a way to track which columns got used over time,
| but since it was used for launching new things, you would not
| know in advance - better safe than sorry, I guess.
|
| Anyway, I'm still fascinated that this terabyte thing got
| converted using 20k machines for a few hours and worked! (It
| wasn't streaming; it was converting over and over from the
| beginning. This might've changed too!)
| intrasight wrote:
| I've been doing OLAP for over 20 years, so I can say with
| authority that OLAP is generally a bag of hurt. The tooling
| sucks, the languages suck, the costs suck. The name itself
| seemingly harks back to a time before computers. "Online"? As
| opposed to paper tablets?
| [deleted]
| myohmy wrote:
| OLAP is legacy hell. I built an MDX query to solve a report
| problem, but now I can't replace it: my seniors failed to
| deconstruct it because it is stupidly difficult to wrap your
| head around. So now it sits in limbo until I can find time to
| do the work myself. Let me just take a look at my calendar and
| find my next free week... oh, there is none.
| goodpoint wrote:
| Online as opposed to batch.
| intrasight wrote:
| I know. But still, you must agree, a stupid name. The rest of
| the IT industry would say "real-time vs batch". And OLAP is
| batch for most implementations.
| purerandomness wrote:
| The name was coined way before "online" meant what it means
| today.
| bradleyjg wrote:
| Real-time is engine control software not a webpage with a
| spinner.
| darepublic wrote:
| I remember, on a dark gloomy day, going to the bookstore and
| picking up a book about SQL Server Reporting Services. Some of
| the toughest times of my life.
| crazydoggers wrote:
| I always felt the value of an OLAP cube was uncovering questions
| you may not know to ask. If you've got a cube with a good
| interface, it means more people in your organization can inspect
| the data looking for trends across dimensions that might not be
| obvious. As an example, in a previous organization, someone
| just playing with an OLAP cube once found that certain DUI
| searches became much more common on weekend mornings (perhaps
| obvious in retrospect). That insight allowed better targeting
| and sales, and made concrete what would otherwise be just a
| guess.
|
| Sometimes just exploring the data allows stuff like that to pop
| out.
|
| Are there similar interfaces for columnar stores? Or do all the
| analytics need to be pre-compiled? The ability to
| slice/dice/filter/aggregate your data in perhaps non obvious ways
| is really the value of business analytics in my opinion.
| mulmen wrote:
| I don't think this benefit is unique to OLAP _cubes_. This is
| just generally the reason you have a data warehouse at all. The
| implementation is up to you but if you model the _business
| process_ then you will be ready to answer questions that have
| not been asked.
| lou1306 wrote:
| The author mentions the Codd-OLAP conflict of interest, lamenting
| that despite what happened "Codd is still regarded as the father
| of the relational database". But this remark sounds like a non
| sequitur to me. No matter the scandal he got into in the 90s, he
| still invented relational algebra back in the 70s, didn't he?
| data_ders wrote:
| the "scandal"
|
| > coined the term Online analytical processing (OLAP) and wrote
| the "twelve laws of online analytical processing". Controversy
| erupted, however, after it was discovered that this paper had
| been sponsored by Arbor Software (subsequently Hyperion, now
| acquired by Oracle), a conflict of interest that had not been
| disclosed, and Computerworld withdrew the paper.
|
| https://en.m.wikipedia.org/wiki/Edgar_F._Codd
| brazzy wrote:
| He published a _scientific paper_ that was to some degree
| paid advertising, and didn't disclose the payment.
|
| That kind of thing still totally counts as a scandal.
| ryanmarsh wrote:
| Indeed, but was the paper wrong?
| skeeter2020 wrote:
| It doesn't matter. The undisclosed conflict of interest
| should eliminate it from consideration.
| coldacid wrote:
| >If you're running a query like "give me the average price of all
| transactions over the past 5 years", a relational database would
| have to load all the rows from the previous 5 years even though
| it merely wants to aggregate the price field; a columnar database
| would only have to examine one column -- the price column.
|
| Wouldn't that be _two_ columns?
| glogla wrote:
| Yeah but it's nitpicky :)
|
| Maybe "past 5 years" means "over everything", and then it might
| technically be one column.
|
| Or if you're using something like S3 with Parquet for "columnar
| database" and it is partitioned by date, then the date values
| are stored in metadata - so you would logically read two
| columns but physically only read one from storage. Same story
| for something like Redshift and using date as sortkey.
| citrin_ru wrote:
| > Wouldn't that be two columns?
|
| Columnar databases can also have indices. If there is an index
| by date (or time), then your DB will know the row range from
| the index and will read only the price column within that
| range. If there is no index by date, it would be two columns,
| but that is still much less than a full row with many columns.
| Animats wrote:
| If the number of transactions is small enough to query quickly,
| a simple SQL query will give you that. If the number of
| transactions is very large, querying a random sample of perhaps
| 1 in 1000 will give you reasonably accurate results. You can
| calculate the probable error from the sample size.
|
| There are questions which require examining all the data, but
| often you don't really need to.
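The sampling approach described above can be sketched with synthetic data: estimate the mean from roughly 1-in-1000 rows and bound the probable error using the standard error of the sample (the data and numbers here are made up for illustration):

```python
import random
import statistics

# Synthetic stand-in for a large transactions table: 1M "prices".
random.seed(0)
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Sample roughly 1 in 1000 rows.
sample = [x for x in population if random.random() < 0.001]
n = len(sample)

# Point estimate plus its standard error (s / sqrt(n)) -- the
# "probable error from the sample size" mentioned above.
estimate = statistics.fmean(sample)
std_err = statistics.stdev(sample) / n ** 0.5

true_mean = statistics.fmean(population)
print(f"n={n}, estimate={estimate:.2f} +/- {std_err:.2f}, "
      f"true={true_mean:.2f}")
```

The estimate lands within a few standard errors of the true mean while touching ~0.1% of the rows.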
| ryanmarsh wrote:
| You can pry the star schema database from my cold dead hands. You
| can get a lot of work done, without much effort, in multi-
| dimensional row stores with bare SQL. The ETLs are easier to
| write than the map-reduce jobs I've seen on columnar stores. ETL
| pipelines get the data in the structure I need before I begin
| analysis. ELT requires me to know too much about the original
| structure of the data. Sure, that's useful when you find a novel
| case, but shouldn't that be folded back into the data warehouse
| for everyone?
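The "lots of work with bare SQL" point can be sketched against a minimal star schema via Python's sqlite3; the tables and names below are hypothetical, not from the comment:

```python
import sqlite3

# Minimal star schema: one fact table keyed to a date dimension.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, weekday TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (1, 'Mon'), (2, 'Sat'), (3, 'Sun');
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (2, 5.0), (3, 40.0);
""")

# Slicing by a dimension attribute is an ordinary join plus GROUP BY --
# no cube engine required.
rows = conn.execute("""
    SELECT d.weekday, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d USING (date_key)
    GROUP BY d.weekday
    ORDER BY d.weekday
""").fetchall()
print(rows)  # [('Mon', 10.0), ('Sat', 30.0), ('Sun', 40.0)]
```

Adding more dimensions (product, store, customer) just adds more joins of the same shape, which is what makes the star layout easy to query by hand.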
| garethrowlands wrote:
| Columnar stores speak SQL these days, though they may map-
| reduce under the hood. ELT vs. ETL just moves some
| transformation into SQL; the data normally still ends up in
| the data warehouse.
| mulmen wrote:
| I absolutely love Looker for this reason. It understands
| foreign keys so if you model everything as stars it just writes
| SQL for you.
|
| So simple, so powerful. I wish something open source would do
| this painfully obvious thing.
| marcinzm wrote:
| >Sure, that's useful when you find a novel case, but shouldn't
| that be folded back into the data warehouse for everyone?
|
| And this isn't possible to do in the data warehouse why
| exactly? Most every company seems to use DBT so even analysts
| can write transforms to generate tables and views in the
| warehouse from other tables. Hell, even Fivetran lets you run
| DBT or transformations after loading data.
| eyeball wrote:
| One big table combining all of your facts and dimensions for me.
|
| Less compute time, less time for business users since they don't
| need to figure out star schema joins. Storage is cheap.
| RantyDave wrote:
| I'm not sure we can call NoSQL "a fad that has gone away".
| ridaj wrote:
| I would say yes it's now possible to do analytics without cubes,
| except when butting up against two things:
|
| * Really big data (~ PB/day), which makes it impractical to
| process every analytics query by going back to the source
|
| * User data subject to GDPR-like retention constraints, where
| you cannot usually hold on to the base data for as long as you
| need the analytics
| didibus wrote:
| One thing about columnar stores that annoys me is that when you
| do want to retrieve the full records for something you've
| found, it's really slow. I wish there were something of a
| hybrid.
| mwexler wrote:
| "We model data according to rigorous frameworks like Kimball or
| Inmon because we must regularly construct OLAP cubes for our
| analyses."
|
| Actually, no. We model our data this way so it can be used for
| business decisions. It doesn't take long for any entity of any
| scale to discover that the logging and eventing done for
| heartbeat status, debugging, and scaling is just different from
| what you need to make a variety of business decisions.
|
| You can solve this with more events at different scales (button
| clicks nested in screen views), or pick events or event rollups
| that appear to be clean business stages ("completed checkout"),
| but still, your finance team and marketing group all have
| different needs.
|
| So, you decide to have some core shared metrics, derived and
| defined, and make them usable by everyone. Folks agree on the
| definitions, and thanks to ease and trust, you see more data
| supporting more decisions.
|
| You discover that some folks are doing 10 table joins to get an
| answer; it's fast but difficult to extend for new questions. You
| decide to build a view that solves some of these pains, and
| refactor to allow a better time dimension. Your version links
| with the metrics you created, and the resulting queries shed tons
| of CTEs while becoming readable to the average user.
|
| And now, you have some ELT pipelines, some event transforms that
| result in counts and filters that map nicely to your business
| needs but still let you get at the raw atomic data, and you and
| your teams start to trust in consistent results. Your metrics
| are mostly clearly summable, and the ones that aren't live in
| table views that precalc the "daily uniques" or other metrics
| that need a bit of special handling.
|
| You've started modeling your data.
|
| No, we don't need olap cubes. But we do need some type of rigor
| around analytic data. Otherwise, why go to all the trouble to
| collect it, count it, and predict from it if it may be wrong with
| no measure of that uncertainty?
|
| And yeah, Kimball et al. are from a world where OLAP was the
| answer, but it turns out they solved a broader set of SQL
| problems. So, learn the good, toss the dated, and see
| what good data modeling can do for your analysis and predictions.
| glogla wrote:
| > Actually, no. We model our data this way so it can be used
| for business decisions
|
| Yeah, that would be my disagreement as well.
|
| Sure, some of Kimball might be obsoleted by modern technology.
| But I don't think the point of Kimball is cubes, and even if it
| was, what better is there?
|
| I'd be really interested in what else is there, that is more
| modern and suited to the modern world.
|
| Data Vault? That probably isn't it, mostly because it's
| all about the "platform to build your Kimball on" rather than
| the actual business modelling. (But the parts about Metrics
| Vault and Quality Vault and the "service stuff" are really
| good.)
|
| Anchor Modelling? That one seems, to 2021 eyes, like a
| way to implement columnar storage and schema-on-read in MS SQL
| Server using stored procedures ... which is probably not
| actually a good idea.
|
| The Puppini Bridge, a.k.a. Unified Star Schema, seems like an
| interesting concept, especially in the self-service space. But
| even its proponents warn that it doesn't scale
| performance-wise, and it is also kind of an incremental change
| on Kimball. (But the way the bridge table is kind of an
| adjacency matrix of the graph of your objects tickles my inner
| computer scientist's fancy.)
|
| So really, what else is there?
| mwexler wrote:
| BTW, their follow-up piece at
| https://www.holistics.io/blog/olap-is-not-olap-cube/ is a nice
| drill-down into some of these issues. Also a good read.
| tresil wrote:
| Could not have written this any better. I've been in the data
| business for over 10 years. A number of technologies have come
| and gone during that time, but dimensional modeling and solid
| SQL skills keep delivering remarkable value.
| chrstr wrote:
| Indeed, it is generally good for analytics to have a conceptual
| data model that is reasonably efficient both on the technical
| side (to quickly get answers for queries), and also on the
| logical side (to quickly translate questions to queries).
|
| This model doesn't need to be OLAP cubes, but it's also not
| that easy to find something better.
___________________________________________________________________
(page generated 2021-07-05 23:00 UTC)