[HN Gopher] Big data is dead (2023)
___________________________________________________________________
Big data is dead (2023)
Author : armanke13
Score : 460 points
Date : 2024-05-27 08:30 UTC (14 hours ago)
(HTM) web link (motherduck.com)
(TXT) w3m dump (motherduck.com)
| clkao wrote:
| previous discussion:
| https://news.ycombinator.com/item?id=34694926
| spicyusername wrote:
| I find it interesting that this comment section and that
| comment section seem to focus on different things, despite
| being triggered by the same input.
| ahartmetz wrote:
| I guess that hype cycle ended at the plateau of being dead. A not
| uncommon outcome in this incredibly fashion-driven industry.
| silvestrov wrote:
| It has just been rebranded as AI.
|
| AI also uses all the data, just with a magick neural network to
| figure out what it all means.
| quonn wrote:
| The overlap in terms of the used technologies, the required
| skills, the actual products and the target market is minimal.
| AI is not mostly Hadoop, it's not MapReduce, the hardware is
| different, the software is different, the skillset is very
| different and a chatbot or image generator is very different
| from a batch job producing an answer to a query.
| renegade-otter wrote:
| But the underlying problem is the same - companies that use
| Big Data tech are clueless about data management. You can
| use unicorns - it's not going to do anything. "Garbage in,
| garbage out" is a timeless principle.
| WesolyKubeczek wrote:
| Given how often it hallucinates, it should be rebranded to
| "high data".
| Cyberdog wrote:
| Assuming you're serious for a moment, I don't think AI is
| really a practical tool for working with big data.
|
| - The "hallucination" factor means every result an AI tells
| you about big data is suspect. I'm sure some of you who
| really understand AI more than the average person can "um
| akshually" me on this and tell me how it's possible to
| configure ChatGPT to absolutely be honest 100% of the time
| but given the current state of what I've seen from general-
| purpose AI tools, I just can't trust it. In many ways this is
| worse than MongoDB just dropping data since at least Mongo
| won't make up conclusions about data that's not there.
|
| - At the end of the day - and I think we're going to be
| seeing this happen a lot in the future with other workflows
| as well - you're using this heavy, low-performance general-
| purpose tool to solve a problem which can be solved much more
| performantly by using tools which have been designed from
| the beginning to handle data management and analysis. The
| reason traditional SQL RDBMSes have endured and aren't going
| anywhere soon is partially because they've proven to be a
| very good compromise between general functionality and
| performance for the task of managing various types of data.
| AI is nowhere near as good of a balance for this task in
| almost all cases.
|
| All that being said, the same way Electron has proven to be a
| popular tool for writing widely-used desktop and mobile
| applications, performance and UI concerns be damned all the
| way to hell, I'm sure we'll be seeing AI-powered "big data"
| analysis tools very soon if they're not out there already,
| and they will suck but people will use them anyway to
| everyone's detriment.
| silvestrov wrote:
| A comment from the old post:
| https://news.ycombinator.com/item?id=34696065
|
| > _I used to joke that Data Scientists exist not to uncover
| insights or provide analysis, but merely to provide
| factoids that confirm senior management's prior beliefs._
|
| I think AI is used for the same purpose in companies: to
| signal to the world that the company is using the latest
| tech, and internally to support existing political
| beliefs.
|
| So same job. Hallucination is not a problem here as the AI
| conclusions are not used when they don't align to existing
| political beliefs.
| vitus wrote:
| > The "hallucination" factor means every result an AI tells
| you about big data is suspect.
|
| AI / ML means more than just LLM chat output, even if
| that's the current hype cycle of the last couple of years.
| ML can be used to build a perfectly serviceable classifier,
| or predictor, or outlier detector.
|
| It suffers from the lack of explainability that's always
| plagued AI / ML, especially as you start looking at deeper
| neural networks where you're more and more heavily reliant
| on their ability to approximate arbitrary functions as you
| add more layers.
|
| > you're using this heavy, low-performance general-purpose
| tool to solve a problem which can be solved much more
| performantly by using tools which have been designed from
| the beginning to handle data management and analysis
|
| You are not wrong here, but one challenge is that sometimes
| even your domain experts do not know how to solve the
| problem, and applying traditional statistical methods
| without understanding the space is a great way of
| identifying spurious correlations. (To be fair, this
| applies in equal measure to ML methods.)
| blagie wrote:
| Overall, I agree with much of this post, but there are several
| caveats:
|
| 1) Mongo is a bad point of reference.
|
| The one lesson I've learned is that there is nothing Mongo does
| which postgresql doesn't do better. Big data solutions aren't
| nosql / mongo, but usually things like columnar databases,
| map/reduce, Cassandra, etc.
|
| 2) Plan for success
|
| 95% of businesses never become unicorns, but that's the goal for
| most (for the 5% which do). If you don't plan for it, you won't
| make it. The reason to architect for scalability when you have 5
| customers is so if that exponential growth cycle hits, you can
| capitalize on it.
|
| That's not just architecture. To have any chance of becoming a
| unicorn, every part of the business needs to be planned for now
| and for later: How do we make this practical / sustainable today?
| How do we make sure it can grow later when we have millions of
| customers? A lot of this can be left as scaffolding (we'll swap
| in [X], but for now, we'll do [Y]).
|
| But the key lessons are correct:
|
| - Most data isn't big. I can fit data about every person in the
| world on a $100 Chromebook. (8 billion people * 8 bits of data =
| 8GB)
|
| - Most data is rarely queried, and most queries are tiny. The
| first step in most big data jobs I've done is taking terabytes of
| data and shrinking it down to the GB, MB, or even KB-scale data I
| need. Caveat: I have no algorithm for predicting what I'll need
| in the future.
|
| - The cost of holding data is increasing with regulatory requirements.
| brtkdotse wrote:
| > 95% of businesses never become unicorns, but that's the goal
| for most
|
| Is it really the general case or is it just a HN echo chamber
| meme?
|
| My pet peeve is that patterns used by companies that in theory
| could become global unicorns are mimicked by companies where
| 5000 paying customers would mean an immense success
| blagie wrote:
| It's neither.
|
| Lifestyle companies are fine, if that's what you're aiming
| for. I know plenty of people who run or work at ~1-30
| person companies with no intention to grow.
|
| However, if you're going for high-growth, you need to plan
| for success. I've seen many potential unicorns stopped by
| simple lack of planning early on. Despite all the pivots
| which happen, if you haven't outlined a clear path from 1-3
| people in a metaphorical garage to reaching $1B, it almost
| never happens, and sometimes for stupid reasons.
|
| If your goal is 5000 paying customers at $100 per year and
| $500k in annual revenues, that can lead to a very decent
| life. However, it's an entirely different ballgame: (1) Don't
| take in investment. (2) You probably can't hire more than one
| person. (3) You need a plan for break-even revenue before you
| need to quit your job / run out of savings. (4) You need much
| better than 1-in-10 odds of success.
|
| And it's very possible (and probably not even hard) to start
| a sustainable 1-5 person business with >>50% odds of success,
| especially late career:
|
| - Find a niche you're aware of from your job
|
| - Do ballpark numbers on revenues. These should land in the
| $500k-$10M range. Less, and you won't sustain. More, and
| there will be too much competition.
|
| - Do it better than the (likely incompetent or non-existent)
| people doing it now
|
| - Use your network of industry contacts to sell it
|
| That's not a big enough market that you need to worry about a
| lot of competition, competitors with VC funding, etc. Niches
| with tall moats do especially well -- pick some unique
| skillset, technology, or market access, for example.
|
| However, IF you've e.g. taken in VC funding, then you do need
| to plan for growth, and part of that is planning for the
| small odds your customer base (and ergo, your data) does
| grow.
| IneffablePigeon wrote:
| If you're in B2B, 5000 customers can be a lot more revenue
| than that: 10-100x, depending hugely on industry and
| product.
| davedx wrote:
| It's definitely an echo chamber. Most companies definitely do
| not want to become "unicorns" - most SMEs around the world
| don't even know what a "unicorn" is, let alone operate in an
| industry/sector where it's possible.
|
| Does a mining company want to become a "unicorn"?
|
| A fish and chip shop?
|
| Even within tech there is an extremely large number of
| companies whose goals are to steadily increase profits and
| return them to shareholders. 37 Signals is the posterchild
| there.
|
| Maybe if you're a VC funded startup then yeah.
| threeseed wrote:
| HN is the worst echo chamber around.
|
| Obsessed with this "you must use PostgreSQL for every use
| case" nonsense.
|
| And that anyone who actually has unique data needs is simply
| doing it for their resume or is over-engineering.
| paulryanrogers wrote:
| > Obsessed with this "you must use PostgreSQL for every use
| case" nonsense.
|
| Pg fans are certainly here asking "why not PG?". Yet so are
| fans of other DBs; like DuckDB, CouchDB, SQLite, etc.
| internet101010 wrote:
| I don't see so much DuckDB and CouchDB proselytizing, but the
| SQLite force is always out strong. I tend to base the
| Postgres vs. SQLite decision on whether the data in question
| is self-contained: am I pulling data from elsewhere
| (Postgres), or am I creating data within the application
| that is only used for the functionality of said
| application (SQLite)?
| int_19h wrote:
| SQLite, in addition to just being plain popular, is a
| fairly natural stepping stone - you get a lot of
| fundamental benefits of an SQL RDBMS (abstract high-level
| queries, ACID etc) without the overhead of maintaining a
| database server.
|
| Postgres is the next obvious stepping stone after that,
| and the one where the vast majority of actual real-world
| cases that are not hypotheticals end up fitting.
| citizen_friend wrote:
| Nobody is saying this.
|
| > who actually has unique data needs
|
| We are saying this is probably not true, and you just want
| to play with toys rather than ship working systems.
|
| Google search cannot be built in Postgres.
| babel_ wrote:
| Many startups seem to aim for this; naturally it's difficult
| to put actual numbers to it, and I'm sure many pursue
| multiple aims in the hope one of them sticks. Since "unicorn"
| really just describes private valuation, it's the same as
| saying many aim to get stupendously wealthy. Can't put a
| number on that, but you can at least see it's a hope for
| many, though "goal" probably makes it seem like they've got
| actually achievable plans for it... That, at least, I'm not
| so convinced of.
|
| Startups are, however, atypical of new businesses, ergo the
| unicorn myth, meaning we see many attempts to follow a path
| that likely stands in the way of many new businesses actually
| achieving the more real goals of, well, being a business:
| succeeding in their venture to produce whatever it is and
| reaching their customers.
|
| I describe it as a unicorn "myth" as it very much behaves in
| such a way, and is misinterpreted similarly to many myths we
| tell ourselves. Unicorns are rare and successful because they
| had the right mixture of novel business and the security of
| investment or buyouts. Startups purportedly are about new
| ways of doing business, however the reality is only a handful
| really explore such (e.g. if it's SaaS, it's probably not a
| startup), meaning the others are just regular businesses with
| known paths ahead (including, of course, following in the
| footsteps of prior startups, which really is self-refuting).
|
| With that in mind, many of the "real" unicorns are
| realistically just highly valued new businesses (that got
| lucky and had fallbacks), as they are often not actually
| developing new approaches to business, whereas the mythical
| unicorns that startups want to be are half-baked ideas of how
| they'll achieve that valuation and wealth without much idea
| of how they do business (or that it can be fluid, matching
| their nebulous conception of it), just that "it'll come",
| especially with "growth".
|
| There is no nominative determinism, and all that, so
| businesses may call themselves startups all they like, but if
| they follow the patterns of startups without the massive
| safety nets of support and circumstance many of the real
| unicorns had, then a failure to develop out the business
| proper means they do indeed suffer themselves by not
| appreciating 5000 paying customers and instead aim for "world
| domination", as it were, or acquisition (which they typically
| don't "survive" from, as an actual business venture). The
| studies have shown this really does contribute to the failure
| rate and instability of so-called startups, effectively due
| to not cutting it as businesses, far above the expected norm
| of new businesses...
|
| So that pet peeve really is indicative of a much more
| profound issue that, indeed, seems to be a bit of an echo
| chamber blind spot with HN.
|
| After all, if it ought to have worked all the time, reality
| would look very different from today. Just saying how many
| don't become unicorns (let alone the failure rate) doesn't
| address the dissonance from then concluding "but this time
| will be different". It also doesn't address the idea that you
| don't need to become a "unicorn", and maybe shouldn't want to
| either... but that's a line of thinking counter to the echo
| chamber, so I won't belabour it here.
| notachatbot1234 wrote:
| > - Most data isn't big. I can fit data about every person in
| the world on a $100 Chromebook. (8 billion people * 8 bits of
| data = 8GB)
|
| Nitpick but I cannot help myself: 8 bits are not even enough
| for a unique integer ID per person; that would require 8 bytes
| per person, and then we are at 60GB already.
|
| I agree with pretty much everything else you said, just this
| stood out as wrong and Duty Calls.
| iraqmtpizza wrote:
| meh. memory address is the ID
| L-four wrote:
| Airline booking numbers used to just be the sector number
| of your booking record on the mainframe's HDD.
| switch007 wrote:
| My jaw just hit the floor. What a fascinating fact!
| rrr_oh_man wrote:
| That's why they were constantly recycled?
| mauriciolange wrote:
| source?
| devsda wrote:
| This is such a simple scheme.
|
| I wonder how they dealt with common storage issues like
| backups and disks having bad sectors.
| giantrobot wrote:
| It's likely record-based formatting rather than file-based.
| At the high level the code is just asking for a record
| number from a data set. The data set is managed, including
| redundancy/ECC, by the hardware of that storage device.
| amenhotep wrote:
| Sure it is. You just need a one to one function from person
| to [0, eight billion]. Use that as your array index and
| you're golden. 8 GB is overkill, really, you could pack some
| boolean datum like "is over 18" into bits within the bytes
| and store your database in a single gigabyte.
|
| Writing your mapping function would be tricky! But definitely
| theoretically possible.
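|
| A minimal Python sketch of that bit-packing idea (the one-to-one
| person-to-index mapping is the hard, hypothetical part and is
| only assumed here):
|
|     POPULATION = 8_000_000_000
|     flags = bytearray((POPULATION + 7) // 8)  # one bit per person, ~1 GB
|
|     def set_flag(person_index, value):
|         byte, bit = divmod(person_index, 8)
|         if value:
|             flags[byte] |= 1 << bit
|         else:
|             flags[byte] &= ~(1 << bit)
|
|     def get_flag(person_index):
|         byte, bit = divmod(person_index, 8)
|         return bool(flags[byte] & (1 << bit))
|
|     # person_index would come from the hypothetical mapping above
|     set_flag(1_234_567_890, True)
|     print(get_flag(1_234_567_890))  # True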
| blagie wrote:
| I'm old enough to have built systems with similar
| techniques. We don't do that much anymore since we don't
| need to, but it's not rocket science.
|
| We had spell checkers before computers had enough memory to
| fit all words. They'd probabilistically find almost all
| incorrect words (but not suggest corrections). It worked
| fine.
| OJFord wrote:
| > 95% of businesses never become unicorns, but that's the goal
| for most (for the 5% which do).
|
| I think you're missing quite a few 9s!
| davedx wrote:
| > 2) Plan for success 95% of businesses never become unicorns,
| but that's the goal for most (for the 5% which do). If you
| don't plan for it, you won't make it.
|
| That's exactly what every architecture astronaut everywhere
| says. In my experience it's completely untrue, and actually
| "planning for success" more often than not causes huge drags on
| productivity, and even more important for startups, on agility.
| Because people never just make plans, they usually implement
| too.
|
| Plan for the next 3 months and you'll be much more agile and
| productive. Your startup will never become a unicorn if you
| can't execute.
| MOARDONGZPLZ wrote:
| In my experience, the drag caused by planning for scalability
| early is so much greater than the effort to rearchitect things
| when and if the company becomes a unicorn that one is
| significantly more likely to become a unicorn by simply
| focusing on execution and very fast iteration, saving the
| scalability work until it's actually needed (when they can
| hire a team of whomever to effect this change with their
| newly minted unicorn cachet).
| newaccount74 wrote:
| The biggest problem with planning for scale is that engineers
| often have no idea what problems they will actually run into
| when they scale and they build useless shit that slows them
| down and doesn't help later at all.
|
| I've come to the conclusion that the only strategy that works
| reliably is to build something that solves problems you have
| NOW rather than trying to predict the future.
| fuzzy2 wrote:
| Exactly this. Not only would they not know the tech
| challenges, they also wouldn't know the business/domain
| challenges.
| vegetablepotpie wrote:
| The flip side of that is that you end up with spaghetti
| code that is expensive to add features to and is expensive
| to clean up when you are successful. Then people in the
| business implement workarounds to handle special cases that
| are undocumented and hidden.
| citizen_friend wrote:
| No it doesn't. Simple and targeted solutions are not bad
| code. For example, start with a single postgres instance
| on a single machine, rather than Hadoop clusters and
| Kubernetes. Once that is maxed out, you will have time
| and money to solve bigger problems.
| smrtinsert wrote:
| The success planners almost always seem to be the same ones
| pushing everyone to "not overengineer". Uhhhh..
| CuriouslyC wrote:
| There's writing code to handle every eventuality, and there's
| considering 3-4 places you _MIGHT_ pivot and making sure you
| aren't making those pivots harder than they need to be.
| blagie wrote:
| This is exactly what I try to do and what I've seen
| successful systems do.
|
| Laying out adjacent markets, potential pivots, likely
| product features, etc. is a weekend-long exercise. That can
| help define both where the architecture needs to be
| flexible, and just as importantly, *where it does not*.
|
| Over-engineering happens when you plan / architect for
| things which are unlikely to happen.
| Spooky23 wrote:
| The exception is when you have people with skills in
| particular tools.
|
| The suggestion upthread to use awk is awesome if you're a
| bunch of Linux grey beards.
|
| But if you have access to people with particular skills or
| domain knowledge... spending extra cash on silly
| infrastructure is (within reason) way cheaper than having
| that employee be less productive.
| citizen_friend wrote:
| Nope, if every person does things completely differently,
| that's just a lack of technical leadership. Leaders pick an
| approach with tradeoffs that meet organizational goals and
| help their team to follow it.
| blagie wrote:
| > That's exactly what every architecture astronaut everywhere
| says. In my experience it's completely untrue, and actually
| "planning for success" more often than not causes huge drags
| on productivity, and even more important for startups, on
| agility. Because people never just make plans, they usually
| implement too.
|
| That's not my experience at all.
|
| Architecture != implementation
|
| Architecture astronauts will try to solve the world's
| problems in v0. That's very different from having an
| architectural vision and building a subset of it to solve
| problems for the next 3 months. Let me illustrate:
|
| * Agile Idiot: We'll stick it all in PostgreSQL, however it
| fits, and meet our 3-month milestone. [Everything crashes-
| and-burns on success]
|
| * Architecture Astronaut: We'll stick it all in a high-
| performance KVS [Business goes under before v0 is shipped]
|
| * Success: We have one table which will grow to petabytes if
| we reach scale. We'll stick it all in postgresql for now, but
| maintain a clean KVS abstraction for that one table. If we
| hit success, we'll migrate to [insert high-performance KVS].
| All the other stuff will stay in postgresql.
|
| The trick is to have a pathway to success while meeting
| short-term milestones. That's not just software architecture.
| That's business strategy (clean beachhead, large ultimate
| market), and every other piece of designing a successful
| startup. There should be a detailed 3-month plan, a long-term
| vision, and a rough set of connecting steps.
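|
| A minimal sketch of that kind of seam, assuming psycopg and an
| "events" table; every name here is illustrative, not anyone's
| real system:
|
|     from typing import Optional, Protocol
|
|     class EventStore(Protocol):
|         def put(self, key: str, value: bytes) -> None: ...
|         def get(self, key: str) -> Optional[bytes]: ...
|
|     class PostgresEventStore:
|         """v0: the one potentially-huge table lives in Postgres."""
|         def __init__(self, conn):
|             self.conn = conn  # e.g. a psycopg connection
|
|         def put(self, key: str, value: bytes) -> None:
|             with self.conn.cursor() as cur:
|                 cur.execute(
|                     "INSERT INTO events (key, value) VALUES (%s, %s) "
|                     "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
|                     (key, value),
|                 )
|
|         def get(self, key: str) -> Optional[bytes]:
|             with self.conn.cursor() as cur:
|                 cur.execute("SELECT value FROM events WHERE key = %s", (key,))
|                 row = cur.fetchone()
|                 return row[0] if row else None
|
|     # On success, a Cassandra/Dynamo-backed class implementing the
|     # same Protocol replaces this one without touching callers.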
| abetusk wrote:
| Another way to say that is that "planning for success" is
| prematurely optimizing for scale.
|
| Scaling up will bring its own challenges, with many of them
| difficult to foresee.
| littlestymaar wrote:
| This. If you plan for the time you'll be a unicorn, you will
| never get anything done in the first place, let alone become a
| unicorn. When you plan for the next 3 months, then hopefully
| in three months you're still here to plan for the next quarter
| again.
| underwater wrote:
| > To have any chance of becoming a unicorn, every part of the
| business needs to be planned for now and for later
|
| I think that in practice that's counterproductive. A startup
| has a limited runway. If your engineers are spending your money
| on something that doesn't pay off for years then they're
| increasing the chance you'll fail before it matters.
| blagie wrote:
| You're confusing planning with implementation.
|
| Planning is a weekend, or at most a few weeks.
| threeseed wrote:
| _> nothing Mongo does which postgresql doesn't do better_
|
| a) It has a built-in and supported horizontal scalability / HA
| solution.
|
| b) For some use cases e.g. star schemas it has significantly
| better performance.
|
| _> Big data solutions aren't nosql_
|
| Almost all big data storage solutions are NoSQL.
| ozkatz wrote:
| > Almost all big data storage solutions are NoSQL.
|
| I think it's important to distinguish between OLAP and OLTP.
|
| For OLAP use cases (which is what this post is mostly about)
| it's almost 100% SQL. The biggest players being Databricks,
| Snowflake and BigQuery. Other tools may include AWS's tools
| (Glue, Athena), Trino, ClickHouse, etc.
|
| I bet there's a <1% market for "NoSQL" tools such as
| MongoDB's "Atlas Data Lake" and probably a bunch of MapReduce
| jobs still being used in production, but these are the
| exception, not the rule.
|
| For OLTP "big data", I'm assuming we're talking about "scale-
| out" distributed databases which are either SQL (e.g.
| cockroachdb, vitess, etc), SQL-like (Cassandra's CQL,
| Elasticsearch's non-ANSI SQL, Influx' InfluxQL) or a purpose-
| built language/API (Redis, MongoDB).
|
| I wouldn't say OLTP is "almost all" NoSQL, but definitely a
| larger proportion compared to OLAP.
| blagie wrote:
| > Almost all big data storage solutions are NoSQL.
|
| Most I've seen aren't. NoSQL means non-relational database.
| Most big data solutions I've seen will not use a database at
| all. An example is Hadoop.
|
| Once you have a database, SQL makes a lot of sense. There are
| big data SQL solutions, mostly in the form of columnar read-
| optimized databases.
|
| On top of the above, a little bit of relational structure can
| make a huge performance difference: for example, a big table
| of compact rows with indexes into small data tables. That can
| be algorithmically a lot more performant than the same thing
| without relations.
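|
| A tiny illustration of that shape (SQLite as a stand-in; the
| table and column names are made up):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.executescript("""
|         CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
|         CREATE TABLE events (ts INTEGER,
|                              country_id INTEGER REFERENCES countries(id),
|                              amount REAL);
|     """)
|     con.execute("INSERT INTO countries (name) VALUES ('Portugal'), ('Japan')")
|     con.execute("INSERT INTO events VALUES (1716800000, 1, 9.5), "
|                 "(1716800001, 2, 3.2)")
|
|     # The big table stores a small integer per row; the strings
|     # live once in the lookup table and the join stays cheap.
|     print(con.execute("""
|         SELECT c.name, SUM(e.amount)
|         FROM events e JOIN countries c ON c.id = e.country_id
|         GROUP BY c.name
|     """).fetchall())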
| Toine wrote:
| > To have any chance of becoming a unicorn, every part of the
| business needs to be planned for now and for later
|
| Sources?
| boxed wrote:
| I see people planning for success to the point of guaranteeing
| failure, much more than people who suddenly must try to handle
| success in panic.
|
| It's a second system syndrome + survivor bias thing I think:
| people who had to clean up the mess of a good MVP complaining
| about what wasn't done before. But the companies that DID do
| that planning and architecting before _did not survive to be
| complained about_.
| CuriouslyC wrote:
| It's not either or. There are best practices that can be
| followed regardless with no time cost up front, and there is
| taking some time to think about how your product might evolve
| (which you really should be doing anyhow) then making choices
| with your software that don't make the evolution process
| harder than it needs to be.
|
| Layers of abstraction make code harder to reason about and
| work with, so they're a lose-lose when trying to iterate
| quickly. But there's also the idea of architectural "mise en
| place" vs "just dump shit where it's most convenient right
| now and don't worry about later", which will result in near-
| immediate productivity losses due to system incoherence and
| disorganization.
| boxed wrote:
| I'm a big fan of "optimize for deletion" (aka leaf-heavy)
| code. It's good for reasoning when the system is big, and
| it's good for growing a code base.
|
| It's a bit annoying how the design of Django templates
| works against this by not allowing free functions...
| nemo44x wrote:
| Mongo allows a developer to burn down a backlog faster than
| anything else. That's why it's so popular. The language drivers
| interface with the database which just says yes. And whatever
| happens later is someone else's problem. Although it's a far
| more stable thing today.
| zemo wrote:
| > The reason to architect for scalability when you have 5
| customers is so if that exponential growth cycle hits, you can
| capitalize on it.
|
| If you have a product gaining that much traction, it's usually
| because of some compound effect based on the existence and
| needs of its userbase. If on the way up you stumble to add new
| users, the userbase that's already there is unlikely to go back
| to the Old Thing or go somewhere else (because these events are
| actually rare). For a good while using Twitter meant seeing the
| fail whale every day. Most people didn't just up and leave, and
| nothing else popped up that could scale better that people
| moved to. Making a product that experiences exponential growth
| in that way is pretty rare, and struggling to scale those cases
| and having a period of availability degradation is common. What
| products hit an exponential growth situation failed because
| they couldn't scale?
| anon84873628 wrote:
| >The one lesson I've learned is that there is nothing Mongo
| does which postgresql doesn't do better. Big data solutions
| aren't nosql / mongo, but usually things like columnar
| databases, map/reduce, Cassandra, etc.
|
| I think that was exactly their point. If new architectures were
| actually necessary, we would have seen a greater rise in Mongo
| and the like. But we didn't, because the existing systems were
| perfectly adequate.
| estheryo wrote:
| "MOST PEOPLE DON'T HAVE THAT MUCH DATA" That's really true
| vegabook wrote:
| my experience is that while data keeps growing at an exponential
| rate, its information content does not. In finance at least, you
| can easily get 100 million data points per series per day if you
| want everything, and you might be dealing with thousands of
| series. That sample rate, and the number of series, is usually
| 99.99% redundant, because the eigenvalues drop off almost to zero
| very quickly after about 10 dimensions, and often far fewer.
| There's very little reason to store petabytes of ticks that you
| will never query. It's much more reasonable in many cases to do
| brutal (and yes, lossy) dimensionality reduction _at ingest
| time_, store the first few principal components + outliers, and
| monitor eigenvalue stability (in case some new, previously
| negligible factor starts increasing in importance). It results
| in a much smaller dataset that is tractable and in many cases
| revelatory, because it's actually usable.
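|
| A rough sketch of that ingest path using scikit-learn's
| IncrementalPCA (the shapes, the 3-sigma outlier rule and the
| storage choices are illustrative, not the actual pipeline):
|
|     import numpy as np
|     from sklearn.decomposition import IncrementalPCA
|
|     N_COMPONENTS = 10                      # keep ~10 dimensions
|     ipca = IncrementalPCA(n_components=N_COMPONENTS)
|
|     def ingest(batch):
|         """batch: (n_ticks, n_series) raw observations."""
|         ipca.partial_fit(batch)
|         scores = ipca.transform(batch)     # store these, not raw ticks
|         recon = ipca.inverse_transform(scores)
|         residual = np.linalg.norm(batch - recon, axis=1)
|         outliers = batch[residual > residual.mean() + 3 * residual.std()]
|         return scores, outliers            # persist scores + outliers
|
|     scores, outliers = ingest(np.random.randn(10_000, 500))
|
|     # Eigenvalue stability: if the tail starts gaining share, a
|     # previously negligible factor may be gaining importance.
|     print(ipca.explained_variance_ratio_)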
| CoastalCoder wrote:
| Could you point to something explaining that eigenvalue /
| dimensions topic?
|
| It sounds interesting, but it's totally new to me.
| mk67 wrote:
| https://en.wikipedia.org/wiki/Principal_component_analysis
| beng-nl wrote:
| Not OP, but I think they are referring to the fact that you
| can use PCA (principal component analysis) on a matrix of
| datapoints to approximate it. Works out of the box in scikit-
| learn.
|
| You can do (lossy) compression on rows of vectors (treated
| like a matrix) by taking the top N eigenvectors (largest N
| eigenvalues) and using them to approximate the original
| matrix with increasing accuracy (as N grows) by some simple
| linear operations. If the numbers are highly correlated, you
| can get a huge amount of compression with minor loss this
| way.
|
| Personally I like to use it to visualize linear separability
| of a high-dimensional set of vectors by taking a 2-component
| PCA and plotting them as x/y values.
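|
| The two-component visualization trick from the last paragraph,
| as a minimal scikit-learn sketch (the digits dataset is just
| stand-in data):
|
|     import matplotlib.pyplot as plt
|     from sklearn.datasets import load_digits
|     from sklearn.decomposition import PCA
|
|     X, y = load_digits(return_X_y=True)        # 64-dimensional vectors
|     xy = PCA(n_components=2).fit_transform(X)  # project onto top 2 PCs
|     plt.scatter(xy[:, 0], xy[:, 1], c=y, s=5)  # eyeball separability
|     plt.show()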
| bartart wrote:
| That's very interesting, so thank you -- how do you handle if
| the eigenvectors change over time?
| vegabook wrote:
| you can store the main eigenvectors for a set rolling period
| and see how the space evolves along them, all the while also
| storing the new ones. In effect the whole idea is to get away
| from "individual security space" and into "factor space",
| which is much smaller, and see how the factors are moving.
| Also, a lot of the time you just care about the outliers --
| those (small numbers of) instruments or clusters of
| instruments that are trading in an unusual way -- then you
| either try to explain it.... or trade against it. Also keep
| in mind that lower-order factors tend to be much more
| stationary so there's a lot of alpha there -- if you can
| execute the trades efficiently (which is why most successful
| quant shops like citadel and jane street are market MAKERS,
| not takers, btw).
| zurfer wrote:
| This is not fully correct.
|
| Originally big data was defined by 3 dimensions:
|
| - Volume (mostly what the author talks about) [solved]
|
| - Velocity, how fast data is processed etc [solved, but
| expensive]
|
| - Variety [not solved]
|
| Big Data today is not: I don't have enough storage or compute.
|
| It is: I don't have enough cognitive capacity to integrate and
| make sense of it.
| maayank wrote:
| What do you mean by 'variety'?
| boesboes wrote:
| Not OP, but I think they mean the data is complex,
| heterogeneous and noisy. You won't be able to extract meaning
| from it trivially; you need something to find the (hidden)
| meaning in the data.
|
| So AI currently, probably ;)
| threeseed wrote:
| Data that isn't internally managed database exports.
|
| One of the reasons big data systems took off was because
| enterprises had exports out of third party systems that they
| didn't want to model since they didn't own it. As well as a
| bunch of unstructured data e.g. floor plans, images, logs,
| telemetry etc.
| zurfer wrote:
| the other comments get it.
|
| It means that data comes in a ton of different shapes with
| poorly described schemas (technically and semantically).
|
| From the typical CSV export out of an ERP system to a
| proprietary message format from your own custom embedded
| device software.
| nairboon wrote:
| it doesn't fit the relational model, e.g. you have some
| tables, but also tons of different types of images, video,
| sounds, raw text, etc.
| vishnugupta wrote:
| I first heard of these 3 V's in a Michael Stonebraker
| talk[1]. For the uninitiated, he's a legend in the DBMS space
| and a Turing award winner[2].
|
| I highly recommend this and related talks by him; most of
| them are on YouTube.
|
| [1] https://www.youtube.com/watch?v=KRcecxdGxvQ
|
| [2]
| https://amturing.acm.org/award_winners/stonebraker_1172121.c...
| snakeyjake wrote:
| >Big Data today is not: I don't have enough storage or compute.
|
| It is for me. Six times per year I go out to the field for two
| weeks to do data acquisition. In the field we do a dual-
| aircraft synthetic aperture radar collection over four bands
| and dual polarities.
|
| That means two aircraft each with one radar system containing
| eight 20TiB 16-drive RAID-0 SSD storage devices.
|
| We don't usually fill up the RAIDs so we generate about 176TiB
| of data per day, and over the two weeks we do 7 flights, or
| 1.2PiB per deployment or 7.2PiB per year.
|
| We can only fly every other day because it takes a day between
| flights to offload the data via fiber onto storage servers that
| are usually haphazardly crammed into the corner of a hangar
| next to the apron. It is then duplicated to a second server for
| safekeeping and at the end of the mission everything is shipped
| back to our HQ for storage and processing.
|
| The data is valuable, but not "billions" valuable. It is used
| for resource extraction, mapping, environmental and geodetic
| research, and other applications (but that's not my department)
| so we have kept every single byte since 2008. This is
| especially useful because as new algorithms are created (not my
| department) the old data can be reprocessed to the new
| standard.
|
| Entire nations finally know how many islands they have, how
| large they are, how their elevations are changing, and how
| their coasts are being eradicated by sea level change because
| of our data and if you've ever used a mapping application and
| flown around a city with 3d buildings that don't look like shit
| because they were stitched together using AI and
| photogrammetry, you've used our data too.
|
| We have to use hard drives because SSDs would be space- and
| most certainly cost-prohibitive.
|
| We stream 800GiB-2TiB files each representing a complete stripe
| or circular orbit to GPU-equipped processing servers. Files are
| incompressible (the cosmic microwave background, the bulk of
| what we capture, tends to be a little random) and when I
| started I held on to the delusion that I could halve the
| infrastructure by writing to tape until I found out that tape
| capacities were calculated for the storage of gigabyte-sized
| text files of all zeros (or so it seems) that can be compressed
| down to nothing.
|
| GPUs are too slow. CPUs are too slow. PCIe busses are too slow.
| RAM is too slow. My typing speed is too slow. Everything needs
| to be faster all of the time.
|
| Everything is too slow, too hard, and too small. Hard drives
| are too small. Tuning the linux kernel and setting up fast and
| reliable networking to the processing clusters is too hard.
| Kernel and package updates that aren't even bug fixes but just
| changes in the way that something works internally that are
| transparent to all users except for us break things. Networks
| are too slow. Things exist in this fantasy world where RAM is
| scarce so out-of-the-box settings are configured to not hog
| memory for network operations. No. I've got a half a terabyte
| of RAM in this file server use ALL OF IT to make the network
| and filesystem go faster, please. Time to spend six hours
| reading the documentation for every portion of the network
| stack to increase the I/O to 2024-levels of sanity.
|
| I probably know more about sysctl.conf than almost every other
| human being on earth.
|
| Distributed persistent object storage systems for people who
| think they are doing big data but really aren't either
| completely fall apart under our workload or cost hundreds of
| millions of dollars -- which we don't have. When I tell all of
| the distributed filesystem salespeople that our objects are
| roughly a terabyte in size they stop replying to my emails.
| More than one vendor has referred me to their intelligence
| community customer service representative upon reading my
| requirements. I am not the NSA, buddy, and we don't have NSA
| money.
|
| Every once in a while we get a new MBA or PMP who read a
| Bloomberg article about the cloud and asks about moving to AWS
| or Azure after they see the costs of our on-premises
| datacenter. When I show them the numbers, in terms of both
| money and time, they throw up in their mouths and change the
| subject.
|
| To top it all off all of our vendors are jumping on the
| AI/cloud bandwagon and discontinuing product lines applicable
| to us.
|
| And now I've got to compete for GPUs with hedge funds and AI
| startups trying to figure out how to use a LLM to harvest
| customer data and use it to show them ads.
|
| I do not have enough storage or compute, and the storage and
| compute I do have is too slow.
|
| DPUs/IPUs look interesting but fall on their face when an
| object is larger than a SQL database query or compressed
| streaming video chunk.
| gbin wrote:
| IMHO the main driver for big data was company founders' egos.
| Of course your company will explode and will be a planet-scale
| success!! We need to design for scale! This is a truly tragic
| mistake when your product only needs one SQLite DB until you
| reach Series C.... All the energy should be focused on the
| product, not its scale yet.
| antupis wrote:
| Well, generally yes, although there are a couple of exceptions
| like IoT and GIS stuff where it is very common to see 10TB+
| datasets.
| threeseed wrote:
| No. Big data was driven by people who had big data problems.
|
| It started with Hadoop which was inspired by what existed at
| Google and became popular in enterprises all around the world
| who wanted a cheaper/better way to deal with their data than
| Oracle.
|
| Spark came about as a solution to the complexity of Hive/Pig
| etc. And then once companies were able to build reliable data
| pipelines we started to see AI being able to be layered on top.
| jandrewrogers wrote:
| It depends on the kind of data you work with. Many kinds of
| important data models -- geospatial, sensing, telemetry, et al
| -- can hit petabyte volumes at "hello world".
|
| Data models generated by intentional human action e.g. clicking
| a link, sending a message, buying something, etc are
| universally small. There is a limit on the number of humans and
| the number of intentional events they can generate per second
| regardless of data model.
|
| Data models generated by machines, on the other hand, can be
| several orders of magnitude higher velocity and higher volume,
| and the data model size is unbounded. These are often some of
| the most interesting and under-utilized data models that exist
| because they can get at many facts about the world that are not
| obtainable from the intentional human data models.
| tobilg wrote:
| I have witnessed overengineering around "big data" tools and
| pipelines for many years... For a lot of use cases, data
| warehouses and data lakes are only in the gigabytes or single-
| digit terabytes range, so their architecture could be much
| simpler, e.g. running DuckDB on a decent EC2 instance.
|
| In my experience, doing this will yield query results faster
| than some other systems can even start executing the query
| (yes, I'm looking at you, Athena)...
|
| I even think that a lot of queries can be run from a browser
| nowadays, that's why I created https://sql-workbench.com/ with
| the help of DuckDB WASM (https://github.com/duckdb/duckdb-wasm)
| and perspective.js (https://github.com/finos/perspective).
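|
| A minimal sketch of what I mean (paths and column names are
| made up for illustration):
|
|     import duckdb
|
|     con = duckdb.connect()   # in-memory; pass a file path to persist
|     top = con.execute("""
|         SELECT customer_id, SUM(amount) AS total
|         FROM read_parquet('warehouse/events/*.parquet')
|         GROUP BY customer_id
|         ORDER BY total DESC
|         LIMIT 10
|     """).fetchall()
|     print(top)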
| geertj wrote:
| I agree with the article that most data sets comfortably fit into
| a single traditional DB system. But I don't think that implies
| that big data is dead. To me big data is about storing data in a
| columnar storage format with a weak schema, and using a query
| system based on partitioning and predicate push down instead of
| indexes. This allows the data to be used in an ad-hoc way by data
| science or other engineers to answer questions you did not have
| when you designed the system. Most setups would be relatively
| small, but could be made to scale relatively well using this
| architecture.
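|
| A minimal sketch of that pattern with Arrow datasets
| (hive-partitioned Parquet; the layout and column names are
| illustrative):
|
|     import pyarrow.dataset as ds
|
|     # Layout: events/year=2024/month=05/part-0.parquet, ...
|     dataset = ds.dataset("events/", format="parquet", partitioning="hive")
|
|     # Only matching partitions/row groups and the two referenced
|     # columns are read; no pre-built index is needed.
|     table = dataset.to_table(
|         columns=["user_id", "amount"],
|         filter=(ds.field("year") == 2024) & (ds.field("month") == 5),
|     )
|     print(table.num_rows)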
| Shrezzing wrote:
| This is quite a good parallel for the way AI is currently
| discussed (perhaps the outcome will be different this time
| round). Particularly the scary slide[1] with the up-and-to-the-
| right graph, which is used in a near identical fashion today to
| show an apparently inevitable march of progress in the AI space
| due to scaling laws.
|
| [2]https://motherduck.com/_next/image/?url=https%3A%2F%2Fweb-
| as...
| teleforce wrote:
| Not dead, it's just having its winter, not unlike the AI
| winter, and once it has a similar "chatbot" moment, all will
| be well.
|
| My take on the killer application is climate change, for
| example earthquake monitoring. As a case study, China has just
| finished building the world's largest earthquake monitoring
| system, at a cost of around USD 1 billion, with 15K stations
| across the country [1]. At the moment it is just monitoring
| existing earthquakes. But suppose a big data analytics
| technique could reliably predict an impending earthquake
| within a few days; that could probably save many people, and
| China still holds the record for the largest mortality and
| casualty numbers due to earthquakes. Is it probable? The
| answer is a positive yes: based on our work and initial
| results it's already practical, but to do it we need
| integration with comprehensive in-situ IoT networks with
| regular and frequent data sampling, similar to China's.
|
| Secondly, China also has the largest radio astronomy
| telescope, and these telescopes, together with other radio
| telescopes, collaborate in real-time through e-VLBI to form a
| virtual giant radio telescope as big as the Earth to monitor
| distant stars and galaxies. This is how the black hole got its
| first image, but at the time, due to logistics, one of the
| telescopes' remote disks could not be shipped to the main
| processing centers in the US [2]. At that point they were not
| using real-time e-VLBI, only VLBI, and it took them several
| months just to get the complete set of black hole observation
| data. With e-VLBI everything is real-time, and with automatic
| processing it would be hours instead of months. These radio
| telescopes can also be used for other purposes, like
| monitoring climate change in addition to imaging black holes;
| their data is astronomical, pardon the pun [3].
|
| [1] Chinese Nationwide Earthquake Early Warning System and Its
| Performance in the 2022 Lushan M6.1 Earthquake:
|
| https://www.mdpi.com/2072-4292/14/17/4269
|
| [2] How Scientists Captured the First Image of a Black Hole:
|
| https://www.jpl.nasa.gov/edu/news/2019/4/19/how-scientists-c...
|
| [3] Alarmed by Climate Change, Astronomers Train Their Sights on
| Earth:
|
| https://www.nytimes.com/2024/05/14/science/astronomy-climate...
| Shrezzing wrote:
| I think these examples still loosely fit the author's
| argument:
|
| > There are some cases where big data is very useful. The
| number of situations where it is useful is limited
|
| Even though there are some great use-cases, the overwhelming
| majority of organisations, institutions, and projects will never
| have a "let's query ten petabytes" scenario that forces them
| away from platforms like Postgres.
|
| Most datasets, even at very large companies, fit comfortably
| into RAM on a server - which is now cost-effective, even in the
| _dozens of terabytes_.
| kmarc wrote:
| When I was hiring data scientists for a previous job, my favorite
| tricky question was "what stack/architecture would you build",
| with the somewhat detailed requirement of "6 TiB of data" in
| sight. I was careful not to require overly complicated sums; I
| simply said it's MAX 6TiB.
|
| I patiently listened to all the BigQuery/Hadoop habla-blabla,
| even asked questions about the financials
| (hardware/software/license BOM), and many of them came up with
| astonishing figures of tens of thousands of dollars yearly.
|
| The winner of course was the guy who understood that 6TiB is what
| 6 of us in the room could store on our smart phones, or on a $199
| enterprise HDD (or three of them for redundancy), and that it
| could be loaded (multiple times) into memory as CSV and processed
| with simple awk scripts.
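|
| Roughly the shape of that answer, as a minimal sketch (column
| names made up; an awk one-liner would do the same job):
|
|     import csv
|     from collections import defaultdict
|
|     totals = defaultdict(float)
|     with open("transactions.csv", newline="") as f:
|         for row in csv.DictReader(f):   # one pass, constant memory
|             totals[row["customer_id"]] += float(row["amount"])
|
|     top10 = sorted(totals.items(), key=lambda kv: -kv[1])[:10]
|     print(top10)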
|
| I am prone to the same fallacy: when I learn how to use a hammer,
| everything looks like a nail. Yet, not understanding the scale of
| "real" big data was a no-go in my eyes when hiring.
| geraldwhen wrote:
| I ask a similar question on screens. Almost no one gives a good
| answer. They describe elaborate architectures for data that
| fits in memory, handily.
| mcny wrote:
| I think that's the way we were taught in college / grad
| school. If the premise of the class is relational databases,
| the professor says, for the purpose of this course, assume
| the data does not fit in memory. Additionally, assume that
| some normalization is necessary and a hard requirement.
|
| Problem is most students don't listen to the first part "for
| the purpose of this course". The professor does not elaborate
| because that is beyond the scope of the course.
| kmarc wrote:
| FWIW if they were juniors, I would've continued the
| interview, directed them with further questions, and
| observed their flow of thinking to decide if they were good
| candidates to pursue further.
|
| But no, this particular person had been working
| professionally for decades (in fact, he was much older than
| me).
| geraldwhen wrote:
| Yeah. I don't even bother asking juniors this. At that
| level I expect that training will be part of the job, so
| it's not a useful screener.
| acomjean wrote:
| I took a Hadoop class. We learned Hadoop and were told by
| the instructor we probably wouldn't need it, and learned
| some other Java processing techniques (streams etc.)
| Joel_Mckay wrote:
| People can always find excuses to boot candidates.
|
| I would just back-track from a shipped product date, and try
| to guess who we needed to get there... given the scope of
| requirements.
|
| Generally, process people from a commercially
| "institutionalized" role are useless for solving unknown
| challenges. They will leave something like an SAP, C#, or
| MatLab steaming pile right in the middle of the IT ecosystem.
|
| One could check out Aerospike rather than try to write their
| own version (the dynamic scaling capabilities are very
| economical once set up right.)
|
| Best of luck, =3
| boppo1 wrote:
| You have 6 TiB of ram?
| ninkendo wrote:
| You don't need that much ram to use mmap(2)
| marginalia_nu wrote:
| To be fair, mmap doesn't put your data in RAM, it presents
| it as though it was in RAM and has the OS deal with whether
| or not it actually is.
| ninkendo wrote:
| Right, which is why you can mmap way more data than you
| have ram, and treat it as though you do have that much
| ram.
|
| It'll be slower, perhaps by a lot, but most "big data"
| stuff is already so god damned slow that mmap probably
| still beats it, while being immeasurably simpler and
| cheaper.
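|
| A minimal sketch of the mmap approach (file name illustrative;
| the file can be far larger than RAM):
|
|     import mmap
|
|     with open("big.bin", "rb") as f:
|         with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
|             # The whole file behaves like one big bytes object;
|             # the OS pages data in and out as it is touched.
|             lines, chunk = 0, 1 << 20
|             for offset in range(0, len(mm), chunk):
|                 lines += mm[offset:offset + chunk].count(b"\n")
|             print("lines:", lines)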
| cess11 wrote:
| The "(multiple times)" part probably means batching or
| streaming.
|
| But yeah, they might have that much RAM. At a rather small
| company I was at we had a third of it in the virtualisation
| cluster. I routinely put customer databases in the hundreds
| of gigabytes into RAM to do bug triage and fixing.
| kmarc wrote:
| Indeed, what I meant to say is that you can load it in
| multiple batches. However, now that I think about it, I did
| play around with servers with TiBs of memory :-)
| vitus wrote:
| If you're one of the public clouds targeting SAP use cases,
| you probably have some machines with 12TB [0, 1, 2].
|
| [0] https://aws.amazon.com/blogs/aws/now-available-amazon-
| ec2-hi...
|
| [1] https://cloud.google.com/blog/products/sap-google-
| cloud/anno...
|
| [2] https://azure.microsoft.com/en-us/updates/azure-
| mv2-series-v...
| qaq wrote:
| You can have 8TB of RAM in a 2U box for under 100K. Grab a
| couple and it will save you millions a year compared to an
| over-engineered big data setup.
| apwell23 wrote:
| BigQuery and Snowflake are software. They come with a SQL
| engine, data governance, integration with your LDAP, and
| auditing. Loading data into Snowflake isn't over-engineering.
| What you described is over-engineering.
|
| No business is passing 6TB of data around on their laptops.
| qaq wrote:
| So is ClickHouse, so what is your point? Please point out
| what a server being able to hold 8TB of RAM has to do with
| laptops.
| int_19h wrote:
| I wonder how much this costs:
| https://www.ibm.com/products/power-e1080
|
| And how that price would compare to the equivalent big data
| solution in the cloud.
| chx wrote:
| If my business depended on it? I can click a few buttons and
| have a 8TiB Supermicro server on my doorstep in a few days if
| I wanted to colo that. EC2 High Memory instances offer 3, 6,
| 9, 12, 18, and 24 TiB of memory in an instance if that's the
| kind of service you want. Azure Mv2 also does 2850 -
| 11400GiB.
|
| So yes, if need be, I have 6 TiB of RAM.
| david_allison wrote:
| https://yourdatafitsinram.net/
| compressedgas wrote:
| Was posted as https://news.ycombinator.com/item?id=9581862
| in 2015
| bluedino wrote:
| We are decomming our 5-year-old 4TB systems this year, and
| they could have been ordered with more.
| lizknope wrote:
| I personally don't, but our computer cluster at work has
| around 50,000 CPU cores. I can request specific configurations
| through LSF and there are at least 100 machines with over 4TB
| RAM and that was 3 years ago. By now there are probably
| machines with more than that. Those machines are usually
| reserved for specific tasks that I don't do but if I really
| needed it I could get approval.
| sfilipco wrote:
| I agree that keeping data local is great and should be the
| first option when possible. It works great on 10GB or even
| 100GB, but after that it starts to matter what you optimize
| for, because you start seeing execution bottlenecks.
|
| To mitigate these bottlenecks you get fancy hardware (e.g.
| an Oracle appliance) or you scale out (and get TCO/performance
| gains from separating storage and compute - which is how
| Snowflake sold 3x cheaper compared to appliances when they came
| out).
|
| I believe that Trino on HDFS would be able to finish faster
| than awk on 6 enterprise disks for 6TB data.
|
| In conclusion I would say that we should keep data local if
| possible but 6TB is getting into the realm where Big Data tech
| starts to be useful if you do it a lot.
| nottorp wrote:
| > I agree that keeping data local is great and should be the
| first option when possible. It works great on 10GB or even
| 100GB, but after that it starts to matter what you optimize
| for, because you start seeing execution bottlenecks.
|
| The point of the article is that 99.99% of businesses never
| pass even the 10 GB point, though.
| sfilipco wrote:
| I agree with the theme of the article. My reply was to
| parent comment which has a 6 TB working set.
| hectormalot wrote:
| I wouldn't underestimate how much a modern machine with a
| bunch of RAM and SSDs can do vs HDFS. This post[1] is now 10
| years old and has find + awk running an analysis in 12
| seconds (at speed roughly equal to his hard drive) vs Hadoop
| taking 26 minutes. I've had similar experiences with much
| bigger datasets at work (think years of per-second
| manufacturing data across 10ks of sensors).
|
| I get that that post is only on 3.5GB, but consumer SSDs are
| now much faster, at 7.5GB/s vs the 270MB/s HDD back when the
| article was written. Even with only mildly optimised
| solutions, people are churning through the 1 billion rows
| (+-12GB) challenge in seconds as well. And if you have the
| data in memory (not impossible), your bottleneck won't even
| be reading speed.
|
| [1]: https://adamdrake.com/command-line-tools-can-
| be-235x-faster-...
| pdimitar wrote:
| Blows my mind. I am a backend programmer and a semi-decent
| sysadmin and I would have immediately told you: "make a ZFS or
| BCacheFS pool with 20-30% redundancy bits and just go wild with
| CLI programs, I know dozens that work on CSV and XML, what's
| the problem?".
|
| And I am not a specialized data scientist. But with time I am
| wondering if such a thing even exists... being a good backender
| / sysadmin and knowing a lot of CLI tools has always seemed to
| do the job for me just fine (though granted I never actually
| managed a data lake, so I am likely over-simplifying it).
| WesolyKubeczek wrote:
| > just go wild with CLI programs, I know dozens that work on
| CSV and XML
|
| ...or put it into SQLite for extra blazing fastness! No
| kidding.
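|
| A minimal sketch of that workflow (file and column names are
| made up):
|
|     import csv
|     import sqlite3
|
|     con = sqlite3.connect("data.db")
|     con.execute("CREATE TABLE IF NOT EXISTS events "
|                 "(customer_id TEXT, amount REAL)")
|     with open("transactions.csv", newline="") as f:
|         rows = ((r["customer_id"], float(r["amount"]))
|                 for r in csv.DictReader(f))
|         con.executemany("INSERT INTO events VALUES (?, ?)", rows)
|     con.execute("CREATE INDEX IF NOT EXISTS idx_cust "
|                 "ON events(customer_id)")
|     con.commit()
|
|     print(con.execute(
|         "SELECT customer_id, SUM(amount) FROM events "
|         "GROUP BY customer_id ORDER BY 2 DESC LIMIT 10"
|     ).fetchall())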
| pdimitar wrote:
| That's included in CLI tools. Also duckdb and clickhouse-
| local are amazing.
| WesolyKubeczek wrote:
| I need to learn more about the latter for some log
| processing...
| fijiaarone wrote:
| Log files aren't data. That's your first problem. But
| that's the only thing that most people have that
| generates more bytes than can fit on screen in a single
| spreadsheet.
| thfuran wrote:
| Of course they are. They just aren't always structured
| nicely.
| WesolyKubeczek wrote:
| Everything is data if you are brave enough.
| c0brac0bra wrote:
| clickhouse-local had been astonishingly fast for
| operating on many GB of local CSVs.
|
| I had a heck of a time running the server locally before
| I discovered the CLI.
| nevi-me wrote:
| To be fair on candidates, CLI programs create technical debt
| the moment they're written.
|
| A good answer strikes a balance between size of data, latency
| and frequency requirements; a good candidate is able to show
| that they can choose the right tool, one that the next person
| will be comfortable with.
| pdimitar wrote:
| True on the premise, yep, though I'm not sure how using CLI
| programs like LEGO blocks creates a tech debt?
| ImPostingOnHN wrote:
| I remember replacing a CLI program built like Lego
| blocks. It was 90-100 LEGO blocks, written over the
| course of decades, in: Cobol; Fortran; C; Java; Bash; and
| Perl, and the Legos "connected" with environment
| variables. Nobody wanted to touch it lest they break it.
| Sometimes it's possible to do things too smartly. Apache
| Spark runs locally (and via CLI).
| pdimitar wrote:
| No no, I didn't mean that at all. I meant a script using
| well-known CLI programs.
|
| Obviously organically grown Frankenstein programs are a
| huge liability, I think every reasonable techie agrees on
| that.
| actionfromafar wrote:
| Well your little CLI-query is suddenly in production and
| then... it easily escalates.
| pdimitar wrote:
| I already said I never managed a data lake and simply got
| stuff when it was needed but if you need to criticize
| then by all means, go wild.
| __MatrixMan__ wrote:
| True but it's typically less debt than anything involving a
| gui, pricetag, or separate server.
| citizen_friend wrote:
| Configuring debugged, optimized software with a shell script
| is orders of magnitude cheaper than developing novel
| software.
| ImPostingOnHN wrote:
| _> But with time I am wondering if such a thing even exists_
|
| Check out "data science at the command line":
|
| https://jeroenjanssens.com/dsatcl/
| apwell23 wrote:
| > make a ZFS or BCacheFS pool with 20-30% redundancy bits and
| just go wild with CLI programs
|
| Lol. Data management is about safety, auditability, access
| control, knowledge sharing and a whole bunch of other stuff. I
| would've immediately shown you the door as someone who I
| cannot trust with data.
| zaphar wrote:
| What about his answer prevents any of that? As stated the
| question didn't require any of what you outline here. ZFS
| will probably do a better job of protecting your data than
| almost any other filesystem out there so it's not a bad
| foundation to start with if you want to protect data.
|
| Your entire post reeks of "I'm smarter than you" smugness
| while at the same time revealing no useful information or
| approaches. Near as I can tell no one should trust you with
| any data.
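| To make the "20-30% redundancy bits" concrete, a sketch of
| what such a pool might look like (disk names hypothetical):
|
|     zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd \
|                               /dev/sde /dev/sdf /dev/sdg
|     zfs set compression=lz4 tank
|
| raidz2 across six disks spends two on parity, i.e. roughly 33%
| redundancy, and a periodic zpool scrub will detect and repair
| silent corruption.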
| apwell23 wrote:
| > Your entire post reeks of "I'm smarter than you"
|
| unlike "blows my mind" ?
|
| > As stated the question didn't require any of what you
| outline here.
|
| Right. OP mentioned it was a "tricky question". What makes
| it tricky is that all those attributes are implicitly
| assumed. I wouldn't interview at Google and tell them my
| "stack" is "load it on your laptop". I would never say
| that in an interview even if I think that's the right
| "stack".
| zaphar wrote:
| "blows my mind" is similar in tone yes. But I wasn't
| replying to the OP. Further the OP actually goes into
| some detail about how he would approach the problem. You
| do not.
|
| You are assuming you know what the OP meant by tricky
| question. And your assumption contradicts the rest of the
| OP's post regarding what he considered good answers to
| the question and why.
| pdimitar wrote:
| Honest question: was "blows my mind" so offensive?
| Thought it was quite obvious I meant "it blows my mind
| that people don't try the simpler stuff first, especially
| bearing in mind that it works for a much bigger percentage
| of cases than cloud providers would have you believe"?
|
| I guess it wasn't, but even if so, it would be
| legitimately baffling how people manage to project so
| much negativity onto three words that are a slightly
| tongue-in-cheek casual comment on the state of affairs in
| an area whose value is not always clear (in my
| observation, it only starts to pay off to have a
| dedicated data team once you have 20+ data sources; I've
| been on teams of only 3-4 devs and we still managed to
| have 15-ish data dashboards for the executives without
| too much cursing).
|
| An anecdote, surely, but what isn't?
| zaphar wrote:
| I generally don't find that sort of thing offensive when
| combined with useful alternative approaches like your
| post provided. However the phrase does come with a
| connotation that you are surprised by a lack of knowledge
| or skill in others. That can be taken as smug or elitist
| by someone in the wrong frame of mind.
| pdimitar wrote:
| Thank you, that's helpful.
| pdimitar wrote:
| I already qualified my statement quite well by stating my
| background but if it makes you feel better then sure, show
| me the door. :)
|
| I was never a data scientist, just a guy who helped
| whenever it was necessary.
| apwell23 wrote:
| > I already qualified my statement quite well by stating
| my background
|
| No. You qualified it with "blows my mind". Why would it
| 'blow your mind' if you don't have any data background?
| zaphar wrote:
| He didn't say he didn't have any data background. He's
| clearly worked with data on several occasions as needed.
| pdimitar wrote:
| Are you trolling? Did you miss the part where I said I
| worked with data but wouldn't say I'm a professional data
| scientist?
|
| This negative cherry picking does not do your image any
| favors.
| koverstreet wrote:
| this is how you know when someone takes themself too
| seriously
|
| buddy, you're just rolling off buzzwords and lording it
| over other people
| apwell23 wrote:
| buddy you suffer from NIH syndrome upset that no one
| wants your 'hacks'.
| photonthug wrote:
| > Lol. Data management is about safety, auditability, access
| control, knowledge sharing and a whole bunch of other stuff. I
| would've immediately shown you the door as someone I cannot
| trust data with.
|
| No need to act smug and superior, especially since nothing
| about OP's plan here _actually precludes_ having all the
| nice things you mentioned, or even having them inside
| $your_favorite_enterprise_environment.
|
| You risk coming across as a person who feels threatened by
| simple solutions, perhaps someone who wants to spend $500k
| in vendor subscriptions every year for simple and/or
| imaginary problems... exactly the type of thing TFA talks
| about.
|
| But I'll ask the question... _why_ do you think safety,
| auditability, access control, and knowledge sharing are
| incompatible with CLI tools and a specific choice of file
| system? What's your preferred alternative? Are you
| sticking with that alternative regardless of how often the
| work load runs, how often it changes, and whether the data
| fits in memory or requires a cluster?
| apwell23 wrote:
| > No need to act smug and superior
|
| I responded with the same tone that the GP responded with:
| "blows my mind" (that people can be so stupid).
| photonthug wrote:
| Another comment mentions this classic meme:
|
| > Consulting service: you bring your big data problems to
| me, I say "your data set fits in RAM", you pay me $10,000
| for saving you $500,000.
|
| A lot of industry work really does fall into this
| category, and it's not controversial to say that going
| the wrong way on this thing is mind-blowing. More than
| not being controversial, it's not _confrontational_ ,
| because his comment was essentially re: the industry,
| whereas your comment is directed at a person.
|
| Drive by sniping where it's obvious you don't even care
| to debate the tech itself might get you a few "sick burn,
| bro" back-slaps from certain crowds, or the FUD approach
| might get traction with some in management, but overall
| it's not worth it. You don't sound smart or even
| professional, just nervous and afraid of every approach
| that you're not already intimately familiar with.
| apwell23 wrote:
| I repurposed the parent comment:
|
| "not understanding the scale of "real" big data was a no-
| go in my eyes when hiring", "real winner" etc.
|
| But yea you are right. I shouldn't have directed it at
| commenter. I was miffed at interviewers who use "tricky
| questions" and expect people to read their minds and come
| up with their preconceived solution.
| pdimitar wrote:
| The classic putting words in people's mouths technique it
| is then. The good old straw man.
|
| If you really must know: I said "blows my mind [that
| people don't try simpler and proven solutions FIRST]".
|
| I don't know what you have to gain by coming here and
| pretending to be in my head. Now here's another thing that
| blows my mind.
| apwell23 wrote:
| > that people don't try simpler and proven solutions
| FIRST
|
| Well why don't people do that, according to you?
|
| It's not 'mind blowing' to me because you can never guess
| what angle the interviewer is coming from. Especially when
| they use words like 'data stack'.
| pdimitar wrote:
| I don't know why and this is why I said it's mind-
| blowing. Because to me trying stuff that can work on most
| laptops comes naturally in my head as the first viable
| solution.
|
| As for interviews, sure, they have all sorts of traps. It
| really depends on the format and the role. Since I
| already disclaimed that I am not an actual data scientist,
| just a seasoned dev who can make some magic happen
| without a dedicated data team (if/when the need arises),
| I wouldn't even be in a data scientist interview in
| the first place. ¯\_(ツ)_/¯
| apwell23 wrote:
| That's fair. My comment wasn't directed at you. I was
| trying to be smart and write an inverse of the original
| comment, where I as an interviewer was looking for a
| proper 'data stack' and the interviewee responded with a
| bespoke solution.
|
| "not understanding the scale of "real" big data was a no-
| go in my eyes when hiring."
| pdimitar wrote:
| Sure, okay, I get it. My point was more like "Have you
| tried this obvious thing first that a lot of devs can do
| for you without too much hassle?". If I were to try for a
| dedicated data scientist position then I'd have done
| homework.
| StrLght wrote:
| > you can never guess what angle interviewer is coming at
| you
|
| Why would you _guess_ in that situation though?
|
| It's an interview, there's at least 1 person talking to
| you -- you should talk to them, ask them questions, share
| your thoughts. If talking to them is a red flag, then
| chances are high that you wouldn't want to work there anyway.
| HelloNurse wrote:
| Abstractly, "safety, auditablity, access control,
| knowledge sharing" are about people reading and writing
| files: simplifying away complicated management systems
| improves security. The operating system should be good
| enough.
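| A sketch of what "the operating system should be good enough"
| can look like in practice (group and path names hypothetical):
|
|     groupadd analysts
|     chgrp -R analysts /srv/datasets
|     chmod -R 750 /srv/datasets
|     setfacl -R -m g:auditors:rX /srv/datasets
|
| Plain POSIX permissions plus ACLs cover the access control,
| and something like auditd can log who read which files.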
| apwell23 wrote:
| Edit: for above comment.
|
| My comment wasn't directed at the parent. I was trying to be
| smart and write an inverse of the original comment: the
| opposite scenario, where I as an interviewer was looking for a
| proper 'data stack' and the interviewee responded with a
| bespoke solution.
|
| "not understanding the scale of "real" big data was a no-go
| in my eyes when hiring."
|
| I was trying to point out that you can never know where the
| interviewer is coming from. Unless I know the interviewer
| personally I would bias towards playing it safe and go with
| an 'enterprisey stack'.
| wslh wrote:
| In my context 99% of the problem is the ETL, nothing to do with
| complex technology. I see people stuck when they need to get
| this from different sources in different technologies and/or
| APIs.
| mattbillenstein wrote:
| I can appreciate the vertical scaling solution, but to be
| honest, this is the wrong solution for almost all use cases -
| consumers of the data don't want awk, and even if they did,
| spooling over 6TB for every kind of query without partitioning
| or column storage is gonna be slow on a single cpu - always.
|
| I've generally liked BigQuery for this type of stuff - the
| console interface is good enough for ad-hoc stuff, you can
| connect a plethora of other tooling to it (Metabase, Tableau,
| etc). And if partitioned correctly, it shouldn't be too
| expensive - add in rollup tables if that becomes a problem.
| __alexs wrote:
| A moderately powerful desktop processor has memory bandwidth
| of over 50GB/s, so yeah, it'll take a couple of minutes, sure.
| fijiaarone wrote:
| The slow part of using awk is waiting for the disk to spin
| over the magnetic head.
|
| And most laptops have 4 CPU cores these days, and a
| multiprocess operating system, so you don't have to wait
| for random access on a spinning platter to find every bit in
| order, you can simply have multiple awk commands running in
| parallel.
|
| Awk is most certainly a better user interface than whatever
| custom BrandQL you have to use in a textarea in a browser
| served from localhost:randomport
| Androider wrote:
| > The slow part of using awk is waiting for the disk to
| spin over the magnetic head.
|
| If we're talking about 6 TB of data:
|
| - You can upgrade to 8 TB of storage on a 16-inch MacBook
| Pro for $2,200, and the _lowest_ spec has 12 CPU cores.
| With up to 400 GB/s of memory bandwidth, it's truly a
| case of "your big data problem easily fits on my laptop".
|
| - Contemporary motherboards have 4 to 5 M.2 slots, so you
| could today build a 12 TB RAID 5 setup of 4 TB Samsung
| 990 PRO NVMe drives for ~ 4 x $326 = $1,304. Probably in
| a year or two there will be 8 TB NVMe's readily
| available.
|
| Flash memory is cheap in 2024!
| bewaretheirs wrote:
| You can go further.
|
| There are relatively cheap adapter boards which let you
| stick 4 M.2 drives in a single PCIe x16 slot; you can
| usually configure a x16 slot to be bifurcated
| (quadfurcated) as 4 x (x4).
|
| To pick a motherboard at quasi-random:
|
| Tyan HX S8050. Two M.2 on the motherboard.
|
| 20 M.2 drives in quadfurcated adapter cards in the 5 PCIe
| x16 slots
|
| And you can connect another 6 NVMe x4 devices to the MCIO
| ports.
|
| You might also be able to hook up another 2 to the
| SFF-8643 connectors.
|
| This gives you a grand total of 28-30 x4 NVME devices on
| one not particularly exotic motherboard, using most of
| the 128 regular PCIe lanes available from the CPU socket.
| hnfong wrote:
| I haven't been using spinning disks for perf critical
| tasks for a looong time... but if I recall correctly,
| using multiple processes to access the data is usually
| counter-productive since the disk has to keep
| repositioning its read heads to serve the different
| processes reading from different positions.
|
| Ideally if the data is laid out optimally on the spinning
| disk, a single process reading the data would result in a
| mostly-sequential read with much less time wasted on read
| head repositioning seeks.
|
| In the odd case where the HDD throughput is greater than
| a single-threaded CPU processing for whatever reason (eg.
| you're using a slow language and complicated processing
| logic?), you can use one optimized process to just read
| the raw data, and distribute the CPU processing to some
| other worker pool.
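| A minimal sketch of that split, assuming GNU parallel and a
| hypothetical events.csv whose fifth column is a numeric amount
| (header handling omitted for brevity): one sequential reader
| feeds blocks to a pool of awk workers, and a final awk adds up
| the partial sums.
|
|     cat events.csv \
|       | parallel --pipe --block 64M \
|           "awk -F, '{ s += \$5 } END { print s }'" \
|       | awk '{ total += $1 } END { print total }'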
| dahart wrote:
| Running awk on an in-memory CSV will come nowhere even
| close to the memory bandwidth your machine is capable of.
| fifilura wrote:
| I agree with this. BigQuery or AWS s3/Athena.
|
| You shouldn't have to set up a cluster for data jobs these
| days.
|
| And it kind of points out the reason for going with a data
| scientist with the toolset he has in mind instead of
| optimizing for a commandline/embedded programmer.
|
| The tools will evolve in the direction of the data scientist,
| while the embedded approach is a dead end in lots of ways.
|
| You may have outsmarted some of your candidates, but you
| would have hired a person not suited for the job long term.
| orhmeh09 wrote:
| It is actually pretty easy to do the same type of
| processing you would do on a cluster with AWS Batch.
| kjkjadksj wrote:
| He's hiring data scientists, not building a service though.
| This might realistically be a one-off analysis for those 6TB.
| At which point you are happy your data scientist has
| returned statistical information instead of spending another
| week making sure the pipeline works if someone puts a Greek
| character in a field.
| data-ottawa wrote:
| Even if I'm doing a one off, depending on the task it can
| be easier/faster/more reliable to load 6TiB into a BigQuery
| table than waiting hours for some task to complete
| and fiddling with parallelism and memory management.
|
| It's a couple hundred bucks a month and $36 to query the
| entire dataset; after partitioning that's not terrible.
| nostrademons wrote:
| A 6T hard drive and Pandas will cost you a couple hundred
| bucks, one time purchase, and then last you for years
| (and several other data analysis jobs). It also doesn't
| require that you be connected to the Internet, doesn't
| require that you trust 3rd-party services, and is often
| faster (even in execution time) than spooling up
| BigQuery.
|
| You can always save an intermediate data set partitioned
| and massaged into whatever format makes subsequent
| queries easy, but that's usually application-dependent,
| and so you want that control over how you actually store
| your intermediate results.
| data-ottawa wrote:
| I wouldn't make a purchase of either without knowing a
| bit more about the lifecycle and requirements.
|
| If you only needed this once, the BQ approach requires
| very little setup and many places already have a billing
| account. If this is recurring then you need to figure out
| what the ownership plan of the hard drive is (what's it
| connected to, who updates this computer, what happens
| when it goes down, etc.).
| pyrale wrote:
| Once you understand that 6tb fits on a hard drive, you can
| just as well put it in a run-of-the-mill pg instance, which
| metabase will reference just as easily. Hell, metabase is
| fine with even a csv file...
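| For scale, a sketch of that setup with hypothetical table and
| file names, using nothing but stock Postgres tooling:
|
|     createdb analytics
|     psql analytics -c "CREATE TABLE events (
|         ts timestamptz, user_id bigint, amount numeric)"
|     psql analytics -c "\copy events FROM 'events.csv' CSV HEADER"
|
| Point Metabase at that database and the analysts get their
| dashboards without a cluster in sight.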
| crowcroft wrote:
| I worked at a large company that had a remote desktop
| instance with 256GB of RAM running a PG instance that analysts
| would log in to in order to do analysis. I used to think it
| was a joke of a setup for such a large company.
|
| I later moved to a company with a fairly sophisticated
| setup with Databricks. While Databricks offered some QoL
| improvements, it didn't magically make all my queries run
| quickly, and it didn't allow me anything that I couldn't
| have done on the remote desktop setup.
| Stranger43 wrote:
| And here we see this strange thing that data science people
| do in forgetting that 6TB is small change for any SQL
| server worth its salt.
|
| Just dump it into Oracle, Postgres, MSSQL, or MySQL and be
| amazed by the kind of things you can do with 30-year-old data
| analysis technology on a modern computer.
| apwell23 wrote:
| you wouldn't have been a 'winner' per OP. real answer is
| loading it on their phones not on sqlserver or whatever.
| Stranger43 wrote:
| To be honest OP is kind of making the same mistake in
| assuming that the only real alternatives that exist as
| valuable tools are "new data science products" and old
| school scripting.
|
| The lengths people go to to not recognize how much the
| people creating the SQL language and the relational
| database engines we now take for granted actually knew
| what they were doing are a bit of a mystery to me.
|
| The right answer to any query that can be defined in SQL
| is pretty much always an SQL engine, even if it's just
| SQLite running on a laptop. But somehow people seem to
| keep coming up with reasons not to use SQL.
| ryguyrg wrote:
| you can scale vertically with a much better tech than awk.
|
| enter duckdb with columnar vectorized execution and full SQL
| support. :-)
|
| disclaimer: i work with the author at motherduck and we make
| a data warehouse powered by duckdb
| chx wrote:
| https://x.com/garybernhardt/status/600783770925420546 (Gary
| Bernhardt of WAT fame):
|
| > Consulting service: you bring your big data problems to me, I
| say "your data set fits in RAM", you pay me $10,000 for saving
| you $500,000.
|
| This is from 2015...
| RandomCitizen12 wrote:
| https://yourdatafitsinram.net/
| crowcroft wrote:
| I wonder if it's fair to revise this to 'your data set fits
| on NVME drives' these days. Astonishing how fast and how much
| storage you can get these days.
| fbdab103 wrote:
| You can always check available ram:
| https://yourdatafitsinram.net/
| xethos wrote:
| Based on a very brief search: Samsung's fastest NVME drives
| [0] could maybe keep up with the slowest DDR2 [1]. DDR5 is
| several orders of magnitude faster than both [2]. Maybe in
| a decade you can hit 2008 speeds, but I wouldn't consider
| updating the phrase before then (and probably not after,
| either).
|
| [0]
| https://www.tomshardware.com/reviews/samsung-980-m2-nvme-
| ssd...
|
| [1] https://www.tomshardware.com/reviews/ram-speed-
| tests,1807-3....
|
| [2] https://en.wikipedia.org/wiki/DDR5_SDRAM
| dralley wrote:
| The statement was "fits on", not "matches the speed of".
| Dylan16807 wrote:
| Several gigabytes per second, plus RAM caching, is
| probably enough though. Latency can be very important,
| but there exist some very low latency enterprise flash
| drives.
| int_19h wrote:
| I think the point is that if it fits on a single drive,
| you can still get away with a much simpler solution (like
| a traditional SQL database) than any kind of "big data"
| stack.
| marginalia_nu wrote:
| Problem is possibly that most people with that sort of hands-on
| intuition for data don't see themselves as data scientists and
| wouldn't apply for such a position.
|
| It's a specialist role, and most people with the skills you
| seek are generalists.
| deepsquirrelnet wrote:
| Yeah it's not really what you should be hiring a data
| scientist to do. I'm of the opinion that if you don't have a
| data engineer, you probably don't need a data scientist. And
| not knowing who you need for a job causes a lot of confusion
| in interviews.
| the_real_cher wrote:
| How would six terabytes fit into memory?
|
| It seems like it would get a lot of swap thrashing if you had
| multiple processes operating on disorganized data.
|
| I'm not really a data scientist and I've never worked on data
| that size so I'm probably wrong.
| coldtea wrote:
| > _How would six terabytes fit into memory?_
|
| What device do you have in mind? I've seen places use 2TB RAM
| servers, and that was years ago, and it isn't even that
| expensive (can get those for about $5K or so).
|
| Currently HP allows "up to 48 DIMM slots which support up to
| 6 TB for 2933 MT/s DDR4 HPE SmartMemory".
|
| Close enough to fit the OS, the userland, and 6 TiB of data
| with some light compression.
|
| > _It seems like it would get a lot of swap thrashing if you
| had multiple processes operating on disorganized data._
|
| Why would you have "disorganized data"? Or "multiple
| processes" for that matter? The OP mentions processing the
| data with something as simple as awk scripts.
| fijiaarone wrote:
| "How would six terabytes fit into memory?"
|
| A better question would be:
|
| Why would anyone stream 6 terabytes of data over the
| internet?
|
| In 2010 the answer was: because we can't fit that much data
| in a single computer, and we can't get accounting or
| security to approve a $10k purchase order to build a local
| cluster, so we need to pay Amazon the same amount every
| month to give our ever expanding DevOps team something to
| do with all their billable hours.
|
| That may not be the case anymore, but our devops team is
| bigger than ever, and they still need something to do with
| their time.
| the_real_cher wrote:
| Well yeah streaming to the cloud to work around budget
| issues is a whole nother convo haha.
| Terr_ wrote:
| I'm having flashbacks to some new outside-hire CEO making
| flim-flam about capex-vs-opex in order to justify sending
| business towards a contracting firm they happened to
| know.
| the_real_cher wrote:
| I mean if you're doing data science the data is not always
| organized and of course you would want multi-processing.
|
| 1 TB of memory is like 5 grand from a quick Google search
| and then you probably need specialized motherboards.
| coldtea wrote:
| > _I mean if you 're doing data science the data is not
| always organized and of course you would want multi-
| processing_
|
| Not necessarily - I might not want it or need it. It's a
| few TB, it can be on a fast HD, on an even faster SSD, or
| even in memory. I can crunch them quite fast even with
| basic linear scripts/tools.
|
| And organized could just mean some massaging or just
| having them in csv format.
|
| These are the same rushed notions about "needing
| this" and "must have that" that the OP describes people
| jumping to, which lead them to suggest huge setups,
| distributed processing, and multi-machine infrastructure for
| use cases and data sizes that could fit on a single
| server with redundancy and be done with it.
|
| DHH has often written about this for their Basecamp needs
| (scaling vertically where others scale horizontally, which
| has worked for them for most of their operation);
| there's also this classic post:
| https://adamdrake.com/command-line-tools-can-
| be-235x-faster-...
|
| > _1 TB of memory is like 5 grand from a quick Google
| search then you probably need specialized motherboards._
|
| Not that specialized; I've worked with server deployments
| (HP) with 1, 1.5 and 2TB RAM (and > 100 cores), and it's
| trivial to get.
|
| And 5 or even 30 grand would still be cheaper (and more
| effective and simpler) than the "big data" setups some of
| those candidates have in mind.
| the_real_cher wrote:
| Yeah I agree about over engineering.
|
| I'm just trying to understand the parent to my original
| comment.
|
| How would running awk for analysis on 6TB of data work
| quickly and efficiently?
|
| They say it would go into memory, but it's not clear to me
| how that would work, as you would still have paging and
| thrashing issues if the data didn't have often-used
| sections.
|
| Am I overthinking it, and were they just referring to
| buying a big-ass RAM machine?
| allanbreyes wrote:
| There are machines that can fit that and more:
| https://yourdatafitsinram.net/
|
| I'm not advocating that this is generally a good or bad idea,
| or even economical, but it's possible.
| the_real_cher wrote:
| I'm trying to understand what the person I'm replying to
| had in mind when they said fit six terabytes in memory and
| search with awk.
|
| Is this what they were referring to, just a big-ass RAM
| machine?
| capitol_ wrote:
| It would easily fit in RAM: https://yourdatafitsinram.net/
| jandrewrogers wrote:
| 6 TB does not fit in memory. However, with a good storage
| engine and fast storage this easily fits within the
| parameters of workloads that have memory-like performance.
| The main caveat is that if you are letting the kernel swap
| that for you then you are going to have a bad day, it needs
| to be done in user space to get that performance which
| constrains your choices.
| int_19h wrote:
| Per one of the links below, IBM Power System E980 can be
| configured for up to 64TB of RAM.
| rr808 wrote:
| If you look at the article the data space is more commonly 10GB
| which matches my experience. For these sizes definitely simple
| tools are enough.
| randomtoast wrote:
| Now you have to consider the cost it takes for your whole team
| to learn how to use AWK instead of SQL. Then you do the TCO
| calculations and revert back to the BigQuery solution.
| tomrod wrote:
| About $20/month for chatgpt or similar copilot, which really
| they should reach for independently anyhow.
| randomtoast wrote:
| And since the data scientist cannot verify the very complex
| AWK output that should be 100% compatible with his SQL
| query, he relies on the GPT output for business-critical
| analysis.
| tomrod wrote:
| Only if your testing frameworks are inadequate. But I
| believe you could be missing or mistaken on how code
| generation successfully integrates into a developer's and
| data scientist's workflow.
|
| Why not take a few days to get familiar with AWK, a skill
| which will last a lifetime? Like SQL, it really isn't so
| bad.
| randomtoast wrote:
| It is easier to write complex queries in SQL instead of
| AWK. I know both AWK and SQL, and I find SQL much easier
| for complex data analysis, including JOINS, subqueries,
| window functions, etc. Of course, your mileage may vary,
| but I think most data scientists will be much more
| comfortable with SQL.
| elicksaur wrote:
| Many people have noted how when using LLMs for things like
| this, the person's ultimate knowledge of the topic is less
| than it would've otherwise been.
|
| This effect then forces the person to be reliant on the LLM
| for answering all questions, and they'll be less capable of
| figuring out more complex issues in the topic.
|
| $20/mth is a siren's call to introduce such a dependency to
| critical systems.
| clwg wrote:
| Not necessarily. I always try to write to disk first, usually
| in a rotating compressed format if possible. Then, based on
| something like a queue, cron, or inotify, other tasks occur,
| such as processing and database logging. You still end up at
| the same place, and this approach works really well with
| tools like jq when the raw data is in jsonl format.
|
| The only time this becomes an issue is when the data needs to
| be processed as close to real-time as possible. In those
| instances, I still tend to log the raw data to disk in
| another thread.
| kjkjadksj wrote:
| For someone who is comfortable with SQL, we are talking
| minutes to hours to figure out awk well enough to see how it's
| used or use it.
| noisy_boy wrote:
| It is not only about whether people can figure out awk.
| It is also about how supportable the solution is. SQL
| provides many features specifically to support complex
| querying and is much more accessible to most people - you
| can't reasonably expect your business analysts to do
| complex analysis using awk.
|
| Not only that, it provides a useful separation from the
| storage format so you can use it to query a flat file
| exposed as table using Apache Drill or a file on s3 exposed
| by Athena or data in an actual table stored in a database
| and so on. The flexibility is terrific.
| esafak wrote:
| I have been using sql for decades and I am not comfortable
| with awk or intend to become so. There are better tools.
| RodgerTheGreat wrote:
| With the exception of regexes- which any programmer or data
| analyst ought to develop some familiarity with anyway- you
| can describe the entirety of AWK on a few sheets of paper.
| It's a versatile, performant, and enduring data-handling tool
| that is _already installed_ on all your servers. You would be
| hard-pressed to find a better investment in technical
| training.
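| As a taste of how far a few lines go, a sketch of a GROUP
| BY-style aggregation over a hypothetical sales.csv (column
| layout made up):
|
|     # roughly: SELECT region, sum(amount) FROM sales GROUP BY region
|     awk -F, 'NR > 1 { sum[$2] += $5 }
|              END    { for (r in sum) print r, sum[r] }' sales.csv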
| Dylan16807 wrote:
| No, if you want SQL you install postgresql on the single
| machine.
|
| Why would you use BigQuery just to get SQL?
| citizen_friend wrote:
| sqlite cli
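| A sketch of that route with a hypothetical CSV; the sqlite3
| shell imports it and from there it's plain SQL:
|
|     sqlite3 events.db <<'EOF'
|     .mode csv
|     .import events.csv events
|     SELECT count(*) FROM events;
|     EOF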
| bee_rider wrote:
| There'd still have to be some further questions, right? I guess
| if you store it on the interview group's cellphones you'll have
| to plan on what to do if somebody leaves or the interview room
| is hit by a meteor, if you plan to store it in ram on a server
| you'll need some plan for power outages.
| apwell23 wrote:
| What kind of business just has a static set of 6TiB data that
| people are loading on their laptops.
|
| You tricked candidates with your nonsensical scenario. Hate
| smartass interviewers like this that are trying some gotcha to
| feel smug about themselves.
|
| Most candidates don't feel comfortable telling people 'just
| load it on your laptops' even if they think that's sensible.
| They want to present a 'professional solution', especially when
| you tricked them with the word 'stack', which is how most of
| them probably perceived your trick question.
|
| This comment is so infuriating to me. Why be assholes to each
| other when world is already full of them.
| tomrod wrote:
| I disagree with your take. Your surly rejoinder aside, the
| parent commenter identifies an area where senior-level
| knowledge and process appropriately assess a problem. Not
| every job interview is about satisfying a checklist of prior
| experience or training; it is about assessing how well that
| skillset will fit the needed domain.
|
| In my view, it's an appropriate question.
| apwell23 wrote:
| What did you gather as the 'needed domain' from that comment?
| The 'needed domain' is often implicit; it's not a blank slate.
| Candidates assume all sorts of 'needed domain' even before
| the interview starts. If I am interviewing at a bank I
| wouldn't suggest 'load it on your laptops' as my 'stack'.
|
| OP even mentioned that it is his favorite 'tricky question'.
| It would definitely trick me because they used the word
| 'stack', which has a specific meaning in the industry. There
| are even websites dedicated to 'stacks':
| https://stackshare.io/instacart/instacart
| yxwvut wrote:
| Well put. Whoever asked this question is undoubtedly a
| nightmare to work with. Your data is the engine that drives
| your business and its margin improvements, so why hamstring
| yourself with a 'clever' cost saving but ultimately unwieldy
| solution that makes it harder to draw insight (or build
| models/pipelines) from?
|
| Penny wise and pound foolish, plus a dash of NIH syndrome.
| When you're the only company doing something a particular way
| (and you're not Amazon-scale), you're probably not as clever
| as you think.
| marcosdumay wrote:
| > What kind of business just has a static set of 6TiB data
| that people are loading on their laptops.
|
| Most business have static sets of data that people load on
| their PCs. (Why do you assume laptops?)
|
| The only weird part of that question is that 6TiB is so big
| it's not realistic.
| pizzafeelsright wrote:
| Big data companies or those that work with lots of data.
|
| The largest dataset I worked with was about 60TB
|
| While that didn't fit in ram most people would just load the
| sample data into the cluster when I told them it would be
| faster to load 5% locally and work off that.
| throwaway_20357 wrote:
| It depends on what you want to do with the data. It can be
| easier to just stick nicely-compressed columnar Parquets in S3
| (and run arbitrarily complex SQL on them using Athena or
| Presto) than to try to achieve the same with shell-scripting on
| CSVs.
| fock wrote:
| How exactly is this solution easier than putting the very
| same Parquet files on a classic filesystem? Why does the easy
| solution require an Amazon subscription?
| filleokus wrote:
| I think I've written about it here before, but I imported ~1
| TB of logs into DuckDB (which compressed it to fit in RAM of my
| laptop) and was done with my analysis before the data science
| team had even ingested everything into their spark cluster.
|
| (On the other hand, I wouldn't really want the average business
| analyst walking around with all our customer data on their
| laptops all the time. And by the time you have a proper ACL
| system with audit logs and some nice way to share analyses that
| updates in real time as new data is ingested, the Big Data
| Solution(tm) probably has a lower TCO...)
| marcosdumay wrote:
| > And by the time you have ... the Big Data Solution(tm)
| probably has a lower TCO...
|
| I doubt it. The common Big Data Solutions manage to have a
| very high TCO, where the least relevant share is spent on
| hardware and software. Most of its cost comes from
| reliability engineering and UI issues (because managing that
| "proper ACL" that doesn't fit your business is a hell of a
| problem that nobody will get right).
| riku_iki wrote:
| you probably didn't do joins for example on your dataset,
| because DuckDB is OOMing on them if they don't fit in memory.
| thunky wrote:
| > requirements of "6 TiB of data"
|
| How could anyone answer this without knowing how the data is to
| be used (query patterns, concurrent readers, writes/updates,
| latency, etc)?
|
| Awk may be right for some scenarios, but without specifics it
| can't be a correct answer.
| marginalia_nu wrote:
| Those are very appropriate follow up questions I think. If
| someone tasks you to deal with 6 TiB of data, it is very
| appropriate to ask enough questions until you can provide a
| good solution, far better than to assume the questions are
| unknowable and blindly architect for all use cases.
| kbolino wrote:
| Even if a 6 terabyte CSV file does fit in RAM, the only thing
| you should do with it is convert it to another format (even if
| that's just the in-memory representation of some program). CSV
| stops working well at billions of records. There is no way to
| find an arbitrary record because records are lines and lines
| are not fixed-size. You can sort it one way and use binary
| search to find something in it in semi-reasonable time but re-
| sorting it a different way will take hours. You also can't
| insert into it while preserving the sort without rewriting half
| the file on average. You don't need Hadoop for 6 TB but,
| assuming this is live data that changes and needs regular
| analysis, you do need something that actually works at that
| size.
| 7thaccount wrote:
| I am a big fan of these simplistic solutions. In my own area,
| it was incredibly frustrating as what we needed was a database
| with a smaller subset of the most recent information from our
| main long-term storage database for back end users to do
| important one-off analysis with. This should've been fairly
| cheap, but of course the IT director architect guy wanted to
| pad his resume and turn it all into multi-million project with
| 100 bells and whistles that nobody wanted.
| palata wrote:
| One thing that may have an impact on the answers: you are
| hiring them, so I assume they are passing a technical
| interview. So they expect that you want to check their
| understanding of the technical stack.
|
| I would not conclude that they over-engineer everything they do
| from such an answer, but rather just that they got tricked in
| this very artificial situation where you are in a dominant
| position and ask trick questions.
|
| I was recently in a technical interview with an interviewer
| roughly my age and my experience, and I messed up. That's the
| game, I get it. But the interviewer got judgemental towards my
| (admittedly bad) answers. I am absolutely certain that were the
| roles inverted, I could choose a topic I know better than him
| and get him in a similarly bad position. But in this case, he
| was in the dominant position and he chose to make me feel bad.
|
| My point, I guess, is this: when you are the interviewer, be
| extra careful not to abuse your dominant position, because it
| is probably counter-productive for your company (and it is just
| not nice for the human being in front of you).
| ufo wrote:
| From the point of view of the interviewee, it's impossible to
| guess if they expect you to answer "no need for big data" or
| if they expect you to answer "the company is aiming for
| exponential growth so disregard the 6TB limit and architect
| for scalability"
| kmarc wrote:
| FWIW, it's an extra 2.5 seconds to say "Although you don't
| need big data, if you insist, ..." and gimme the Hadoop
| answer.
| whamlastxmas wrote:
| Is this like interviewing for a chef position for a fancy
| restaurant and when asked how to perfectly cook a steak,
| you preface it with "well you can either go to McDonald's
| and get a burger, or..."
|
| It may not be reasonable to suggest that in a role that
| traditionally uses big data tools
| dkz999 wrote:
| Idk, in this instance I feel pretty strongly that cloud,
| and solutions with unnecessary overhead, are the fast
| food. The article proposes not eating it all the time.
| hnfong wrote:
| I see it more like "it's 11pm and a family member
| suddenly wants to eat a steak at home, what would you
| do?"
|
| The person who says "I'm going drive back to the
| restaurant and take my professional equipment home to
| cook the steak" is probably offering the wrong answer.
|
| I'm obviously not a professional cook, but presumably the
| ability to improvise with whatever tools you currently
| have is a desirable skill.
| palata wrote:
| Hmm I would say that the equivalent to your 11pm question
| is more something like "your sister wants to backup her
| holiday pictures on the cloud, how do you design it?".
| The person who says "I ask her 10 millions to build a
| data center" is probably offering the wrong answer :-).
| tored wrote:
| I think more like, how would you prepare and cook the
| best five course gala dinner for only $10. That requires
| true skill.
| bee_rider wrote:
| I'm not sure if you are referencing it intentionally or
| not, but some chefs (Gordon Ramsey for one) will ask an
| interviewee to make some scrambled eggs; something not
| super niche or specialized but enough to see what their
| technique is.
|
| It is a sort of "interview hack" example that's been used
| to emphasize the idea of a simple unspecialized skill-
| test that went around a while ago. I guess upcoming chefs
| probably practice egg scrambling nowadays, ruining the
| value of the test. But maybe they could ask to make a bit
| of steak now.
| Dylan16807 wrote:
| The fancy cluster is probably slower for most tasks than
| one big machine storing everything in RAM. It's not like
| a fast food burger.
| jancsika wrote:
| That's great, but it's really just desiderata about you
| and your personal situation.
|
| E.g., if a HN'er takes this as advice they're just as
| likely to be gated by some other interviewer who
| interprets hedging as a smell.
|
| I believe the posters above are essentially saying: you,
| the interviewer, can take the 2.5 seconds to ask the
| follow up, "... and if we're not immediately optimizing
| for scalability?" Then take that data into account when
| doing your assessment instead of attempting to optimize
| based on a single gate.
|
| Edit: clarification
| coffeebeqn wrote:
| This is the crux of it. Another interviewer would've
| marked "run on a local machine with a big SSD" - as: this
| fool doesn't know enough about distributed systems and
| just runs toy projects on one machine
| dartos wrote:
| That is what I think interviewers think when I don't
| immediately bring up kubernetes and sqs in an
| architecture interview
| theamk wrote:
| Depending on the shop? For some kinds of tasks, jumping
| to Kubernetes right away would be a minus during an
| interview.
| antisthenes wrote:
| > E.g., if a HN'er takes this as advice they're just as
| likely to be gated by some other interviewer who
| interprets hedging as a smell.
|
| If people in high stakes environments interpret hedging
| as a smell - run from that company as fast as you can.
|
| Hedging is a natural adult reasoning process. Do you
| really want to work with someone who doesn't understand
| that?
| llm_trw wrote:
| I once killed the deployment of a big data team in a
| large bank when I laid out in excruciating details
| exactly what they'd have to deal with during an
| interview.
|
| Last I heard they'd promoted one unix guy on the inside to
| babysit a bunch of cron jobs on the biggest server they
| could find.
| palata wrote:
| Sure, but as you said yourself: it's a trick question.
| How often does the employee have to answer trick
| questions without having any time to think in the actual
| job?
|
| As an interviewer, why not ask: "how would you do that
| in a setup that doesn't have much data and doesn't need
| to scale, and then how would you do it if it had a ton of
| data and a big need to scale?". There is no trick here;
| do you feel you lose information about the interviewee?
| zdragnar wrote:
| Depends on the level you're hiring for. At a certain
| point, the candidate needs to be able to identify the
| right tool for the job, including when that tool is not
| the usual big data tools but a simple script.
| hirsin wrote:
| Trick questions (although not known as such at the time)
| are the basis of most of the work we do? XY problem is a
| thing for a reason, and I cannot count the number of
| times my teams and I have ratholed on something complex
| only to realize we were solving for the wrong problem,
| i.e. a trick question.
|
| As a sibling puts it though, it's a matter of level.
| Senior/staff and above? Yeah, that's mostly what you do.
| Lower than that, then you should be able to mostly trust
| those upper folks to have seen through the trick.
| palata wrote:
| > are the basis of most of the work we do?
|
| I don't know about you, but in my work, I always have
| more than 3 seconds to find a solution. I can slowly
| think about the problem, sleep on it, read about it, try
| stuff, think about it while running, etc. I usually do at
| least some of those for _new_ problems.
|
| Then of course there is a bunch of stuff that is not
| challenging and for which I can start coding right away.
|
| In an interview, those trick questions will just show you
| who already has experience with the problem you mentioned
| and who doesn't. It doesn't say _at all_ (IMO) how good
| the interviewee is at tackling challenging problems. The
| question then is: do you want to hire someone who is good
| at solving challenging problems, or someone who already
| knows how to solve the one problem you are hiring them
| for?
| theamk wrote:
| If the interviewer expects you to answer entire design
| question in 3 seconds, that interview is pretty broken.
| Those questions should take longish time (minutes to tens
| of minutes), and should let candidate showcase their
| thought process.
| palata wrote:
| I meant that the interviewer expects you to start
| answering after 3 seconds. Of course you can elaborate
| over (tens of) minutes. But that's very far from actual
| work, where you have time to think before you start
| solving a problem.
|
| You may say "yeah but you just have to think out loud,
| that's what the interviewer wants". But again that's not
| how I work. If the interviewer wants to see me design a
| system, they should watch me read documentation for
| hours, then think about it while running, and read again,
| draw a quick thing, etc.
| coryrc wrote:
| Once had a coworker write a long proposal to rewrite some
| big old application from Python to Go. I threw in a
| single comment: why don't we use the existing code as a
| separate executable?
|
| Turns out he was laid off and my suggestion was used.
|
| (Okay, I'm being silly, the layoff was a coincidence)
| theamk wrote:
| Because the interview is supposed to ask the same questions
| as the real job, and in the real job there are rarely big
| hints like you are describing.
|
| On the other hand, "hey I have 6TiB data, please prepare
| to analyze it, feel free to ask any questions for
| clarification but I may not know the answers" is much
| more representative of a real-life task.
| int_19h wrote:
| Being able to ask qualifying questions like that, or
| presenting options with different caveats clearly spelled
| out, is part of the job description IMO, at least for
| senior roles.
| valenterry wrote:
| It doesn't matter. The answer should be "It depends, what
| are the circumstances - do we expect high growth in the
| future? Is it gonna stay around 6TB? How and by whom will
| it be used and what for?"
|
| Or, if you can guess what the interviewer is aiming for,
| state the assumption and go from there "If we assume it's
| gonna stay at <10TB for the next couple of years or even
| longer, then..."
|
| Then the interviewer can interrupt and change the
| assumptions to his needs.
| drubio wrote:
| It's almost a law "all technical discussions devolve into
| interview mind games", this industry has a serious
| interview/hiring problem.
| layer8 wrote:
| You shouldn't guess what they expect, you should say what
| you think is right, and why. Do you want to work at a
| company where you would fail an interview due to making a
| correct technical assessment? And even if the guess is
| right, as an interviewer I would be more impressed by an
| applicant that will give justified reasons for a different
| answer than what I expected.
| andoando wrote:
| It's great if the interviewer actually takes time to sort
| out the questions you have, because questions that seem
| simple to you carry a lot of assumptions you've made.
|
| I had an interview "design an app store". I tried asking,
| ok an app store has a ton of components, which part of the
| app store are you asking exactly? The response I got was
| "Have you ever used an app store? Design an app store". Umm
| ok.
| oivey wrote:
| Engineering for scalability here is the single server
| solution that you throw away later when scale is needed.
| The price is so small (in this case) for the simple
| solution that you should basically always start with it.
| mrtimo wrote:
| .parquet files are completely underrated, many people still do
| not know about the format!
|
| .parquet preserves data types (unlike CSV)
|
| They are 10x smaller than CSV. So 600GB instead of 6TB.
|
| They are 50x faster to read than CSV
|
| They are an "open standard" from Apache Foundation
|
| Of course, you can't peek inside them as easily as you can a
| CSV. But, the tradeoffs are worth it!
|
| Please promote the use of .parquet files! Make .parquet files
| available for download everywhere .csv is available!
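| For anyone curious, a sketch of the conversion with DuckDB
| (file names hypothetical):
|
|     duckdb -c "COPY (SELECT * FROM read_csv_auto('events.csv'))
|                TO 'events.parquet'
|                (FORMAT PARQUET, COMPRESSION ZSTD)"
|
| The resulting file keeps the inferred column types and is
| compressed per column, and most query engines read it directly.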
| sph wrote:
| Third consecutive time in 86 days that you mention .parquet
| files. I am out of my element here, but it's a bit weird
| fifilura wrote:
| FWIW I am the same. I tend to recommend BigQuery and
| AWS/Athena in various posts. Many times paired with
| Parquet.
|
| But it is because it makes a lot of things much simpler,
| and that a lot of people have not realized that. Tooling is
| moving fast in this space, it is not 2004 anymore.
|
| His arguments are still valid and 86 days is a pretty long
| time.
| ok_computer wrote:
| Sometimes when people discover or extensively use something
| they are eager to share in contexts they think are
| relevant. There is an issue when those contexts become too
| broad.
|
| 3 times across 3 months is hardly astroturfing for big
| parquet territory.
| mrtimo wrote:
| I've downloaded many csv files that were mal-formatted
| (extra commas or tabs etc.), or had dates in non-standard
| formats. Parquet format probably would not have had these
| issues!
| ddalex wrote:
| Why is .parquet better than protobuf?
| sdenton4 wrote:
| Parquet is columnar storage, which is much faster for
| querying. And typically for protobuf you deserialize each
| row, which has a performance cost - you need to deserialize
| the whole message, and can't get just the field you want.
|
| So, if you want to query a giant collection of protobufs,
| you end up reading and deserializing every record. For
| parquet, you get much closer to only reading what you need.
| nostrademons wrote:
| Parquet ~= Dremel, for those who are up on their Google
| stack.
|
| Dremel was pretty revolutionary when it came out in 2006 -
| you could run ad-hoc analyses in seconds that previously
| would've taken a couple days of coding & execution time.
| Parquet is awesome for the same reasons.
| thesz wrote:
| Parquet is underdesigned. Some parts of it do not scale well.
|
| I believe that Parquet files have rather monolithic metadata
| at the end, with a 4G max size limit. 600 columns (it is
| realistic, believe me), and we are at slightly less than 7.2
| million row groups. Give each row group 8K rows and we are
| limited to 60 billion rows total. It is not much.
|
| The flatness of the file metadata requires external data
| structures to handle it more or less well. You cannot just
| mmap it and be good. This external data structure most
| probably will take as much memory as file metadata, or even
| more. So, 4G+ of your RAM will be, well, used slightly
| inefficiently.
|
| (block-run-mapped log structured merge tree in one file can
| be as compact as parquet file and allow for very efficient
| memory mapped operations without additional data structures)
|
| Thus, while parqet is a step, I am not sure it is a step in
| definitely right direction. Some aspects of it are good, some
| are not that good.
| datadeft wrote:
| Nobody is forcing you to use a single Parquet file.
| thesz wrote:
| Of course.
|
| But nobody tells me that I can hit a hard limit and then
| I need a second Parquet file and should have some code
| for that.
|
| The situation looks to me as if my "Favorite DB server"
| supports, say, only 1.9 billion records per table, and if
| I hit that limit I need a second instance of my "Favorite
| DB server" just for that unfortunate table. And it is not
| documented anywhere.
| apwell23 wrote:
| some critiques of parquet by andy pavlo
|
| https://www.vldb.org/pvldb/vol17/p148-zeng.pdf
| thesz wrote:
| Thanks, very insightful.
|
| "Dictionary Encoding is effective across data types (even
| for floating-point values) because most real-world data
| have low NDV ratios. Future formats should continue to
| apply the technique aggressively, as in Parquet."
|
| So this is not critique, but assessment. And Parquet has
| some interesting design decisions I did not know about.
|
| So, let me thank you again. ;)
| imiric wrote:
| What format would you recommend instead?
| thesz wrote:
| I do not know a good one.
|
| A former colleague of mine is now working on a memory-
| mapped log-structured merge tree implementation and it
| can be a good alternative. LSM provides elasticity, one
| can store as much data as one needs, it is static, thus
| it can be compressed as well as Parquet-stored data,
| memory mapping and implicit indexing of data do not
| require additional data structures.
|
| Something like LevelDB and/or RocksDB can provide most of
| that, especially when used in covering index [1] mode.
|
| [1] https://www.sqlite.org/queryplanner.html#_covering_in
| dexes
| Renaud wrote:
| Parquet is not a database, it's a storage format that
| allows efficient column reads so you can get just the data
| you need without having to parse and read the whole file.
|
| Most tools can run queries across parquet files.
|
| Like everything, it has its strengths and weaknesses, but
| in most cases, it has better trade-offs over CSV if you
| have more than a few thousand rows.
| beryilma wrote:
| > Parquet is not a database.
|
| This is not emphasized often enough. Parquet is useless
| for anything that requires writing back computed results
| as in data used by signal processing applications.
| maxnevermind wrote:
| > 7.2 millions row groups
|
| Why would you need 7.2 mil row groups?
|
| Row group size when stored in HDFS is usually equal to the
| HDFS block size by default, which is 128MB
|
| 7.2 mil * 128MB ~ 1PB
|
| You have a single parquet file 1PB in size?
| thesz wrote:
| Parquet is not HDFS. It is a static format, not a B-tree
| in disguise like HDFS.
|
| You can have compressed Parquet columns with 8192 entries
| being a couple of tens bytes in size. 600 columns in a
| row group is then 12K bytes or so, leading us to 100GB
| file, not a petabyte. Four orders of magnitude of
| difference between your assessment and mine.
| riku_iki wrote:
| > They are 50x faster to read than CSV
|
| I actually benchmarked this and duckdb CSV reader is faster
| than parquet reader.
| wenc wrote:
| I would love to see the benchmarks. That is not my
| experience, except in the rare case of a linear read (in
| which CSV is much easier to parse).
|
| CSV underperforms in almost every other domain, like joins,
| aggregations, filters. Parquet lets you do that lazily
| without reading the entire Parquet dataset into memory.
| riku_iki wrote:
| > That is not my experience, except in the rare case of a
| linear read (in which CSV is much easier to parse).
|
| Yes, I think duckdb only reads CSV, then projects
| necessary data into internal format (which is probably
| more efficient than parquet, again based on my
| benchmarks), and does all ops (joins, aggregations) on
| that format.
| wenc wrote:
| Yes, it does that, assuming you read in the entire CSV,
| which works for CSVs that fit in memory.
|
| With Parquet you almost never read in the entire dataset
| and it's fast on all the projections, joins, etc. while
| living on disk.
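| e.g. something along these lines, with hypothetical file and
| column names, never materialising the whole dataset:
|
|     duckdb -c "SELECT user_id, sum(amount)
|                FROM 'events.parquet'
|                WHERE ts >= DATE '2024-01-01'
|                GROUP BY user_id"
|
| Only the referenced columns (and the row groups whose min/max
| statistics pass the filter) get read from disk.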
| riku_iki wrote:
| > which works for CSVs that fit in memory.
|
| What? Why is the CSV required to fit in memory in this case?
| I tested CSVs which are far larger than memory, and it
| works just fine.
| geysersam wrote:
| The entire csv doesn't have to fit in memory, but the
| entire csv has to pass through memory at some point
| during the processing.
|
| The parquet file has metadata that allows duckdb to only
| read the parts that are actually used, reducing total
| amount of data read from disk/network.
| riku_iki wrote:
| > The parquet file has metadata that allows duckdb to
| only read the parts that are actually used, reducing
| total amount of data read from disk/network.
|
| this makes sense, and it is what I hoped to have. But in
| reality it looks like parsing CSV strings works faster than the
| bloated and overengineered parquet format with its libs.
| wenc wrote:
| >But in reality looks like parsing CSV string works
| faster than bloated and overengineered parquet format
| with libs.
|
| Anecdotally having worked with large CSVs and large on-
| disk Parquet datasets, my experience is the opposite of
| yours. My DuckDB queries operate directly on Parquet on
| disk and never load the entire dataset, and are always
| much faster than the equivalent operation on CSV files.
|
| I think your experience might be due to -- what it sounds
| like -- parsing the entire CSV into memory first (CREATE
| TABLE) and then processing after. That is not an apples-
| to-apples comparison because we usually don't do this
| with Parquet -- there's no CREATE TABLE step. At most
| there's a CREATE VIEW, which is lazy.
|
| I've seen your comments bashing Parquet in DuckDB
| multiple times, and I think you might be doing something
| wrong.
| riku_iki wrote:
| > I think your experience might be due to -- what it
| sounds like -- parsing the entire CSV into memory first
| (CREATE TABLE) and then processing after. That is not an
| apples-to-apples
|
| The original discussion was about the CSV vs parquet "reader"
| part, so this is exactly apples-to-apples testing, easy to
| benchmark, and I stand my ground. What you are doing
| downstream is another question, which is not possible
| to discuss because no code for your logic is available.
|
| > I've seen your comments bashing Parquet in DuckDB
| multiple times, and I think you might be doing something
| wrong.
|
| like running one command from DuckDB doc.
|
| Also, I am not "bashing", I just state that CSV reader is
| faster.
| xnx wrote:
| For how many rows?
| riku_iki wrote:
| 10B
| jjgreen wrote:
| _Please promote the use of .parquet files!_
|     $ apt-cache search parquet
|     <nada>
|
| Maybe later
| seabass-labrax wrote:
| Parquet is a _file format_ , not a piece of software. 'apt
| install csv' doesn't make any sense either.
| jjgreen wrote:
| There is _no support_ for parquet in Debian, by contrast:
|
|   apt-cache search csv | wc -l
|   259
| fhars wrote:
| If you want to shine with snide remarks, you should at
| least understand the point being made:
|
|   $ apt-cache search csv | wc -l
|   225
|   $ apt-cache search parquet | wc -l
|   0
| nostrademons wrote:
| It's more like "sudo pip install pandas" and then Pandas
| comes with Parquet support.
| jjgreen wrote:
| Pandas cannot read Parquet files itself; it uses third-party
| "engines" for that purpose, and those are not available in
| Debian.
| nostrademons wrote:
| Ah yes, that's true, though a typical Anaconda installation
| will have them automatically installed. "sudo pip install
| pyarrow" or "sudo pip install fastparquet" then.
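|
| For what it's worth, a minimal sketch of the pandas side once
| one of those engines is installed (file and column names are
| made up):
|
|   import pandas as pd
|
|   # pandas delegates Parquet I/O to whichever engine is
|   # installed; engine="auto" tries pyarrow, then fastparquet.
|   df = pd.read_parquet("events.parquet", engine="pyarrow",
|                        columns=["user_id", "amount"])
|   df.to_parquet("subset.parquet", compression="zstd")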
| EdwardDiego wrote:
| If you were hiring me for a data engineering role and asked me
| how to store and query 6 TiB, I'd say you don't need my skills,
| you've probably got a Postgres person already.
| hotstickyballs wrote:
| And how many data scientists are familiar with using awk
| scripts? If you're the only one then you'll have failed at
| scaling the data science team.
| jrm4 wrote:
| This feels representative of _so many of our problems in tech,_
| overengineering, over- "producting," over-proprietary-ing, etc.
|
| Deep centralization at the expense of simplicity and true
| redundancy; like renting a laser cutter when you need a
| boxcutter, a pair of scissors, and the occasional toenail
| clipper.
| rgrieselhuber wrote:
| This is a great test / question. More generally, it tests
| knowledge with basic linux tooling and mindset as well as
| experience level with data sizes. 6TiB really isn't that much
| data these days, depending on context and storage format, etc.
| of course.
| deepsquirrelnet wrote:
| It could be a great question if you clarify the goals. As it
| stands it's "here's a problem, but secretly I have hidden
| constraints in my head you must guess correctly".
|
| The OP's desired solution could probably have been found from
| some of those other candidates if they were asked "here is the
| challenge, solve it in the most MacGyver way possible". Because
| if you change the second part, the correct answer changes.
|
| "Here is a challenge, solve in the most accurate, verifiable
| way possible"
|
| "Here is a challenge, solve in a way that enables
| collaboration"
|
| "Here is a challenge, 6TiB but always changing"
|
| ^ These are data science questions much more than the
| question he was asking. The answer in this case is that
| you're not actually looking for a data scientist.
| 6510 wrote:
| I don't know anything, but when doing that I always end up next
| Thursday having the same problem with 4TB, and the next week
| with 17, at which point I regret picking a solution that fit so
| exactly.
| wg0 wrote:
| I have lived through the hype of Big Data; it was the time of
| HDFS+HTable, I guess, and Hadoop, etc.
|
| One can't go wrong with DuckDB [0] + SQLite + Open/Elasticsearch
| either, with 6 to 8 or even 10 TB of data.
|
| [0]. https://duckdb.org/
| michaelcampbell wrote:
| My smartphone cannot store 1TiB. <shrug>
| dfgdfg34545456 wrote:
| The problem with your question is that they are there to show
| off their knowledge. I failed a tech interview once; the
| question was to build a web page/back end/DB that allows people
| to order, let's say, widgets, and that will scale huge. I went
| the simpleton answer route: all you need is Rails, a Redis
| cache and an AWS-provisioned relational DB; solve the big
| problems later if you get there, sort of thing. Turns out they
| wanted to hear all
| about microservices and sharding.
| lizknope wrote:
| I'm on some reddit tech forums and people will say "I need help
| storing a huge amount of data!" and people start offering
| replies for servers that store petabytes.
|
| My question is always "How much data do you actually have?"
| Many times they reply with 500GB or 2TB. I tell them that that
| isn't much data when you can get a 1TB microSD card the size of
| a fingernail or a 24TB hard drive.
|
| My feeling is that if you really need to store petabytes of
| data that you aren't going to ask how to do it on reddit. If
| you need to store petabytes you will have an IT team and
| substantial budget and vendors that can figure it out.
| rqtwteye wrote:
| Plenty of people get offended if you tell them that their data
| isn't really "big data". A few years ago I had a discussion
| with one of my directors about a system IT had built for us
| with Hadoop, API gateways, multiple developers, and hundreds of
| thousands in yearly cost. I told him that at our scale (now and
| any foreseeable future) I could easily run the whole thing on a
| USB drive attached to his laptop and a few python scripts. He
| looked really annoyed and I was never involved again with this
| project.
|
| I think it's part of the BS cycle that's prevalent in
| companies. You can't admit that you are doing something simple.
| noisy_boy wrote:
| In most non-tech companies, it comes down to the motive of
| the manager and in most cases it is expansion of reporting
| line and grabbing as much budget as possible. Using "simple"
| solutions runs counter to this central motivation.
| disqard wrote:
| This is also true of tech companies. Witness how the
| "GenAI" hammer is being used right now at MS, Google, Meta,
| etc.
| eloisant wrote:
| - the manager wants expansion
|
| - the developers want to get experience in a fancy stack to
| build up their resume
|
| Everyone benefits from the collective hallucination
| boh wrote:
| That's the tech sector in a nutshell. Very few innovations
| actually matter to non-tech companies. Most companies could
| survive on Windows 98 software.
| KronisLV wrote:
| > The winner of course was the guy who understood that 6TiB is
| what 6 of us in the room could store on our smart phones, or a
| $199 enterprise HDD (or three of them for redundancy), and it
| could be loaded (multiple times) to memory as CSV and simply
| run awk scripts on it.
|
| If it's not a very write heavy workload but you'd still want to
| be able to look things up, wouldn't something like SQLite be a
| good choice, up to 281 TB: https://www.sqlite.org/limits.html
|
| It even has basic JSON support, if you're up against some
| freeform JSON and not all of your data neatly fits into a
| schema: https://sqlite.org/json1.html
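|
| A minimal sketch of what that looks like from Python, assuming
| an SQLite build with the JSON1 functions enabled (the table and
| JSON fields are made up):
|
|   import json, sqlite3
|
|   con = sqlite3.connect("kb.db")
|   con.execute("CREATE TABLE IF NOT EXISTS docs "
|               "(id INTEGER PRIMARY KEY, body TEXT)")
|   con.execute("INSERT INTO docs (body) VALUES (?)",
|               (json.dumps({"title": "notes",
|                            "tags": ["storage"]}),))
|
|   # json_extract() pulls values out of the freeform JSON.
|   rows = con.execute(
|       "SELECT id, json_extract(body, '$.title') FROM docs "
|       "WHERE json_extract(body, '$.tags[0]') = 'storage'"
|   ).fetchall()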
|
| A step up from that would be PostgreSQL running in a container:
| giving you the support for all sorts of workloads, more
| advanced extensions for pretty much anything you might ever
| want to do, from geospatial data with PostGIS, to something
| like pgvector, timescaledb etc., while still having a plethora
| of drivers and still not making you drown in complexity and
| having no issues with a few dozen/hundred TB of data.
|
| Either of those would be something that most people on the
| market know, neither will make anyone want to pull their hair
| out and they'll give you the benefit of both quick data
| writes/retrieval, as well as querying. Not that everything
| needs or can even work with a relational database, but it's
| still an okay tool to reach for past trivial file storage
| needs. Plus, you have to build a bit less of whatever
| functionality you might need around the data you store, in
| addition to there even being nice options for transparent
| compression.
| hipadev23 wrote:
| Huh? How are you proposing loading a 6TB CSV into memory
| multiple times? And then processing with awk, which generally
| streams one line at a time.
|
| Obviously we can get boxes with multiple terabytes of RAM for
| $50-200/hr on-demand but nobody is doing that and then also
| using awk. They're loading the data into clickhouse or duckdb
| (at which point the ram requirement is probably 64-128GB)
|
| I feel like this is an anecdotal story that has mixed up sizes
| and tools for dramatic effect.
| dahart wrote:
| Wait, how would you split 6 TiB across 6 phones, how would you
| handle the queries? How long will the data live, do you need to
| handle schema changes, and how? And what is the cost of a
| machine with 15 or 20 TiB of RAM (you said it fits in memory
| multiple times, right?) - isn't the drive cost irrelevant here?
| How many requests per second did you specify? Isn't that
| possibly way more important than data size? Awk on 6 TiB, even
| in memory, isn't very fast. You might need some indexing, which
| suddenly pushes your memory requirement above 6 TiB, no? Do you
| need migrations or backups or redundancy? Those could increase
| your data size by multiples. I'd expect a question that
| specified a small data size to be asking me to estimate the
| _real_ data size, which could easily be 100 TiB or more.
| torginus wrote:
| It's astonishing how shit the cloud is compared to boring-ass
| pedestrian technology.
|
| For example, just logging stuff into a large text file is so
| much easier, more performant and more searchable than using AWS
| CloudWatch, presumably written by some of the smartest
| programmers who ever lived.
|
| On another note, I was once asked to create a big-data-ish
| object DB, and, knowing nothing about the domain, after a bit
| of benchmarking I decided to just use zstd-compressed JSON
| streams with a separate index in an SQL table. I'm sure any
| professional would recoil at it in horror, but it could do
| literally gigabytes/sec retrieval or deserialization on
| consumer grade hardware.
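|
| A minimal sketch of that kind of setup, assuming the
| third-party zstandard package (paths and keys are made up):
|
|   import json, sqlite3
|   import zstandard as zstd
|
|   db = sqlite3.connect("index.db")
|   db.execute("CREATE TABLE IF NOT EXISTS idx "
|              "(key TEXT PRIMARY KEY, off INTEGER, len INTEGER)")
|   cctx, dctx = zstd.ZstdCompressor(), zstd.ZstdDecompressor()
|
|   def append(store, key, obj):
|       # Append one compressed JSON record, index its byte range.
|       blob = cctx.compress(json.dumps(obj).encode())
|       off = store.seek(0, 2)          # current end of file
|       store.write(blob)
|       db.execute("INSERT OR REPLACE INTO idx VALUES (?, ?, ?)",
|                  (key, off, len(blob)))
|       db.commit()
|
|   def get(store, key):
|       off, n = db.execute("SELECT off, len FROM idx "
|                           "WHERE key = ?", (key,)).fetchone()
|       store.seek(off)
|       return json.loads(dctx.decompress(store.read(n)))
|
|   with open("objects.zst", "ab+") as store:
|       append(store, "user:42", {"name": "x", "score": 1.0})
|       print(get(store, "user:42"))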
| jandrewrogers wrote:
| As a point of reference, I routinely do fast-twitch analytics
| on _tens_ of TB on a single, fractional VM. Getting the data in
| is essentially wire speed. You won't do that on Spark or
| similar but in the analytics world people consistently
| underestimate what their hardware is capable of by something
| like two orders of magnitude.
|
| That said, most open source tools have terrible performance and
| efficiency on large, fast hardware. This contributes to the
| intuition that you need to throw hardware at the problem even
| for relatively small problems.
|
| In 2024, "big data" doesn't really start until you are in the
| petabyte range.
| buremba wrote:
| I can't really think of a product with a requirement of at most
| 6TiB of data. If the data is measured in TiB at all, most
| products have hundreds of TiB rather than a few.
| citizenpaul wrote:
| The funny thing is that is exactly the place I want to work at.
| I've only found one company so far and the owner sold during
| the pandemic. So far my experience is that the number of
| companies/people that want what you describe is incredibly low.
|
| I wrote a comment on here the other day that some place I was
| trying to do work for was using $11k USD a month on a BigQuery
| DB that had 375MB of source data. My advice was basically you
| need to hire a data scientist that knows what they are doing.
| They were not interested and would rather just band-aid the
| situation for a "cheap" employee. Despite the fact their GCP
| bill could pay for a skilled employee.
|
| As I've seen it for the last year job hunting most places don't
| want good people. They want replaceable people.
| itronitron wrote:
| >> "6 TiB of data"
|
| is not a somewhat detailed requirement, as it depends quite a
| bit on the nature of the data.
| tonetegeatinst wrote:
| I'm not even in data science, but I am a slight data hoarder.
| And heck even I'd just say throw that data on a drive and have
| a backup in the cloud and on a cold hard drive.
| SkipperCat wrote:
| That makes total sense if you're archiving the data, but what
| happens when you want to have 10,000 people have access to
| read/update the data concurrently. Then you start to need some
| fairly complex solutions.
| kmarc wrote:
| This thread blew up a lot, and some unfriendly commenters
| made many assumptions about this innocent story.
|
| You didn't, and indeed you have a point (missing
| specification of expected queries), so I expand it as a
| response here.
|
| Among the _MANY_ requirements I shared with the candidate,
| only _one_ was the 6TiB. Another one was that it was going to
| be serving as part of the backend of an internal banking
| knowledge base, with at most 100 requests a day (definitely
| not 10k people using it).
|
| To all the upset data infrastructure wizards here: calm down.
| It was a banking startup, with an experimental project, and
| we needed a sober-thinking generalist who could deliver
| solutions to real *small scale* problems, and not the one who
| was the winner at buzzword bingo.
|
| HTH.
| citizen_friend wrote:
| This load is well handled by a Postgres instance and 15-25k
| thrown at hardware.
| paulddraper wrote:
| Storing 6TB is easy.
|
| Processing and querying it is trickier.
| TeamDman wrote:
| Would probably try https://github.com/pola-rs/polars and go
| from there lol
| xLaszlo wrote:
| 6TB - Snowflake
|
| Why?
|
| That's the boring solution. If you don't know the use case or
| what kind of queries you would run, then opt for maximum
| flexibility with the minimum setup of a managed solution.
|
| If cost is prohibitive on the long run, you can figure out a
| more tailored solution based on the revealed preferences.
|
| Fiddling with CSVs is the DWH version of the legendary "Dropbox
| HN commenter".
| nostrademons wrote:
| I would've said "Pandas with Parquet files". If you're hiring a
| DS it's implied that you want to do some sort of aggregate or
| summary statistics, which is exactly what Pandas is good for,
| while awk + shell scripts would require a lot of clumsy number
| munging. And Parquet is an order of magnitude more storage
| efficient than CSV, and will let you query very quickly.
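|
| Something like this minimal sketch (a hypothetical directory
| of Parquet files, one per day, with made-up column names):
|
|   import pandas as pd
|
|   df = pd.read_parquet("sales/", columns=["region", "amount"])
|   summary = (df.groupby("region")["amount"]
|                .agg(["count", "mean", "median", "sum"]))
|   print(summary)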
| atomicnumber3 wrote:
| It's really hard because I've failed interviews by pitching "ok
| we start with postgres, and when that starts to fall over we
| throw more hardware at it, then when that fails we throw read
| replicas in, then we IPO, _then_ we can spend all our money and
| time doing distributed system stuff ".
|
| Whereas the "right answer" (I had a man on the inside) was to
| describe some wild tall and wide event based distributed
| system. For some nominal request volume that was nowhere near
| the limits of postgres. And they didn't even care if you solved
| the actual hard distributed system problems that would arise
| like distributed transactions etc.
|
| Anyway, I said I failed the interview, really they failed my
| filter because if they want me to ignore pragmatism and
| blindly regurgitate a YouTube video on "system design" FAANG
| interview prep, then I don't want to work there anyway.
| metadat wrote:
| Can you get a single machine with more than 6TiB of memory
| these days?
|
| That's quite a bit..
| 1vuio0pswjnm7 wrote:
| "... or a $199 enterprise HDD"
|
| External or internal? Any examples?
|
| "... it could be loaded (multiple times) to memory"
|
| All 6TiB at once, or loaded in chunks?
| dventimi wrote:
| Question for the Big Data folks: where do sampling and statistics
| fit into this, if at all? Unless you're summing to the penny, why
| would you ever need to aggregate a large volume of data (the
| population) rather than a small volume of data (a sample)? I'm
| not saying there isn't a reason. I just don't know what it is.
| Any thoughts from people who have genuine experience in this
| realm?
| disgruntledphd2 wrote:
| Sampling is almost always heavily used here, because it's Ace.
| However, if you need to produce row-level predictions then you
| can't sample, as you by definition need the row-level data.
|
| However, you can aggregate user-level info into just the
| features you need, which will get you a looooonnnnggggg way.
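|
| A minimal sketch of the kind of streaming sampler that makes
| this workable when the data doesn't fit in memory (Algorithm R;
| the file name is made up):
|
|   import random
|
|   def reservoir_sample(rows, k, seed=0):
|       # Uniform sample of k rows from a stream of unknown
|       # length, holding only k rows in memory at a time.
|       rng = random.Random(seed)
|       sample = []
|       for i, row in enumerate(rows):
|           if i < k:
|               sample.append(row)
|           else:
|               j = rng.randint(0, i)
|               if j < k:
|                   sample[j] = row
|       return sample
|
|   with open("events.csv") as f:
|       next(f)                      # skip header
|       sample = reservoir_sample(f, 10_000)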
| gregw2 wrote:
| Good question. I am not an expert but here's my take from my
| time in this space.
|
| Big data folks typically do sampling and such, but that doesn't
| eliminate the need for a big data environment where such
| sampling can occur. Just as a compiler can't predict every
| branch that could happen at compile time (sorry VLIW!) and thus
| CPUs need dynamic branch predictors, so too a sampling function
| can't be predicted in advance of an actual dataset.
|
| In a large dataset there are many ways the sample may not
| represent the whole. The real world is complex. You sample away
| that complexity at your peril. You will often find you want to
| go back to the original raw dataset.
|
| Second, in a large organization, sampling alone presumes you
| are only focused on org-level outcomes. But in a large org
| there may be individuals who care about the non-aggregated data
| relevant to their small domain. There can be thousands of such
| individuals. You do sample the whole but you also have to equip
| people at each level of abstraction to do the same. The
| cardinality of your data will in some way reflect the
| cardinality of your organization and you can't just sample that
| away.
| banku_brougham wrote:
| There is a problem with website data, where new features only
| touch a subset of customers and you need results for every
| single one.
|
| You won't be partitioned for this case, but the compute you
| need is just filtering out this set.
|
| But sampling won't get what you want, especially if you are
| doing QC at the business-team level about whether the CX is
| behaving as expected.
| kwillets wrote:
| I've done it both ways. Look into Data Sketches also if you
| want to see applications.
|
| The pros:
|
| -- Samples are small and fast most of the time.
|
| -- can be used opportunistically, eg in queries against the
| full dataset.
|
| -- can run more complex queries that can't be pre-aggregated
| (but not always accurately).
|
| The cons:
|
| -- requires planning about what to sample and what types of
| queries you're answering. Sudden requirements changes are
| difficult.
|
| -- data skew makes uniform sampling a bad choice.
|
| -- requires ETL pipelines to do the sampling as new data comes
| in. That includes re-running large backfills if data or
| sampling changes.
|
| -- requires explaining error to users
|
| -- Data sketches can be particularly inflexible; they're
| usually good at one metric but can't adapt to new ones. Queries
| also have to be mapped into set operations.
|
| These problems can be mitigated with proper management tools; I
| have built frameworks for this type of application before --
| fixed dashboards with slow-changing requirements are relatively
| easy to handle.
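|
| To make the sketch idea concrete, here is a toy K-minimum-
| values distinct-count sketch (illustrative only, not any
| particular library's implementation):
|
|   import hashlib
|
|   class KMV:
|       # Keeps the k smallest hash values seen; mergeable, so
|       # per-partition sketches can be combined later.
|       def __init__(self, k=1024):
|           self.k, self.mins = k, []
|
|       def _h(self, item):
|           d = hashlib.blake2b(str(item).encode(),
|                               digest_size=8).digest()
|           return int.from_bytes(d, "big") / 2**64
|
|       def add(self, item):
|           x = self._h(item)
|           if x in self.mins:
|               return
|           if len(self.mins) < self.k:
|               self.mins.append(x)
|           elif x < self.mins[-1]:
|               self.mins[-1] = x
|           else:
|               return
|           self.mins.sort()
|
|       def estimate(self):
|           if len(self.mins) < self.k:
|               return len(self.mins)
|           return int((self.k - 1) / self.mins[-1])
|
|   s = KMV()
|   for user_id in range(100_000):
|       s.add(f"user-{user_id}")
|   print(s.estimate())              # roughly 100,000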
| oli5679 wrote:
| BigQuery has a generous 1TiB/month free tier and $6.25/TiB
| afterwards. If you have small data, it's a pragmatic option; just
| make sure to use partitioning and sensible query patterns to
| limit the number of full-data scans, as you approach 'medium
| data' region.
|
| There are some larger data-sizes, and query patterns, where
| either BigQuery Capacity compute pricing, or another vendor like
| Snowflake, becomes more economical.
|
| https://cloud.google.com/bigquery/pricing
|   BigQuery offers a choice of two compute pricing models for
|   running queries: On-demand pricing (per TiB). With this
|   pricing model, you are charged for the number of bytes
|   processed by each query. The first 1 TiB of query data
|   processed per month is free.
|
|   Queries (on-demand): $6.25 per TiB; the first 1 TiB per
|   month is free.
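|
| A sketch of the partition-pruning pattern, assuming a table
| partitioned on an event-date column (project, dataset, table
| and column names are all made up):
|
|   from google.cloud import bigquery
|
|   client = bigquery.Client()   # default credentials/project
|
|   # Filtering on the partition column lets BigQuery prune
|   # partitions, so you're billed for the days you touch
|   # rather than a full-table scan.
|   job = client.query("""
|       SELECT user_id, COUNT(*) AS events
|       FROM `my_project.analytics.events`
|       WHERE event_date BETWEEN '2024-05-01' AND '2024-05-07'
|       GROUP BY user_id
|   """)
|   rows = list(job.result())
|   print(job.total_bytes_processed)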
| ml-anon wrote:
| Big data was always a buzzword that was weirdly coopted by
| database people in a way which makes zero sense. Of course
| there is a vanishingly small number of use cases where we need
| fast random access to look up any possible record.
|
| However, ML systems and in particular LLMs rely on having
| access to millions (if not billions or trillions) of examples,
| and the underlying infra for that is based on some of these
| tools.
|
| Big Data isn't dead; it's just that the tools and use cases
| around querying databases have finally been recognised as
| mostly useless to most people. It is and always has been about
| training ML models.
| stakhanov wrote:
| The funny thing about "big data" was that it came with a perverse
| incentive to avoid even the most basic and obvious optimizations
| on the software level, because the hardware requirement was how
| you proved how badass you were.
|
| Like: "Look, boss, I can compute all those averages for that
| report on just my laptop, by ingesting a SAMPLE of the data,
| rather than making those computations across the WHOLE dataset".
|
| Boss: "What do you mean 'sample'? I just don't know what you're
| trying to imply with your mathmo engineeringy gobbledigook! Me
| having spent those millions on nothing can clearly not be it,
| right?"
| Spooky23 wrote:
| It came with a few cohorts of Xooglers cashing their options
| out.
|
| The amount of salesman hype and chatter about big data,
| followed by the dick measuring contests about whose data was
| big enough to be worthy was intense for awhile.
| bpodgursky wrote:
| This is a pretty snarky outside view and just not actually true
| (I spent the first part of my career trying to reduce compute
| spend as a data engineer).
|
| It was extremely difficult to get > 64gb on a machine for a
| very long time, and implementation complexity gets hard FAST
| when you have a hard cap.
|
| And it's EXTREMELY disruptive to have a process that fails
| every 1/50 times, when data is slightly too large, because your
| team will be juggling dozens of these routine crons, and if
| each of them breaks regularly, you do nothing but dumb oncall
| work trying to trim bits off of each one.
|
| No, Hadoop and MapReduce were not hyperefficient, but they were
| OK if you wrote things correctly, and having something that ran
| reliably is WAY more valuable than boutique bit-optimized C++
| crap that nobody trusts or can maintain and that fails every
| Thursday with insane segfaults.
|
| (Nowadays, just use Snowflake. But it was a reasonable tool for
| the time.)
| lokimedes wrote:
| I was a researcher at the Large Hadron Collider around the time
| "Big Data" became a thing. We had one of the use cases where
| analyzing all the data made sense, since it boiled down to
| frequentist statistics, the more data, the better. Yet even with
| a global network of supercomputers at our disposal, we funnily
| figured out that fast local storage was better than waiting for
| huge jobs to finish. So, surprise, surprise, every single grad
| student managed somehow to boil the relevant data for her
| analysis down to exactly 1-5 TB, without much loss in analysis
| flexibility. There must be like a law of convenience here, that
| rivals Amdahl's scaling law.
| msl09 wrote:
| I think that your law of convenience is spot on. One thing I
| got from talking with commercial systems devs is that they are
| always under pressure from their clients to make their systems
| as cheap as possible; reducing the data stored and the size of
| the computations is one great way to minimize the client's
| monthly bill.
| civilized wrote:
| I think there is a law of convenience, and it also explains why
| many technologies improve at a consistent exponential rate.
| People are very good at finding convenient ways to make
| something a little better each year, but every idea takes some
| minimal time to execute.
| marcosdumay wrote:
| Let me try one:
|
| "If you can't do your statistical analysis in 1 to 5 TB of
| data, your methodology is flawed"
|
| This is probably more about human limitations than math.
| There's a clear ceiling in how much flexibility we can use.
| That will also change with easier ways to run new kinds of
| analysis, but it increases with the logarithm of the amount of
| things we want to do.
| kwillets wrote:
| Back in the 80's and 90's NASA built a National Aerodynamic
| Simulator, which was a big Cray or similar that could crunch
| FEA simulations (probably a low-range graphics card nowadays).
| IIRC they found that the queue for that was as long or longer
| than it took to run jobs on cheaper hardware; MPP systems such
| as Beowulf grew out of those efforts.
| debarshri wrote:
| My first job was doing hadoop stuff around ~2011. I think one of
| the biggest motivators for big data or rather hadoop adoption was
| that it was open source. Back then, the data warehousing market
| was dominated by Oracle, Netezza, EMC, Teradata, etc., which
| were super expensive on a per-GB basis. This was followed by a
| lot of success stories about how Facebook saved $$$, or how you
| could use Google's MapReduce for free, etc.
|
| Everyone was talking about data being the new "oil".
|
| Enterprise could basically deploy petabyte scale warehouse run
| HBase or Hive on top of it and build makeshift data-warehouses.
| It was also when the cloud was emerging, people started creating
| EMR clusters and deploy workloads there.
|
| I think it was a solution looking for a problem. And the
| problem existed only for a handful of companies.
|
| I think cloud providers somehow abstracted a lot of these tools
| and databases away, gave a better service, and we kind of
| forgot about Hadoop et al.
| LightFog wrote:
| When working in a research lab we used to have people boast that
| their analysis was so big it 'brought down the cluster' - which
| outed them pretty quickly to the people who knew what they were
| doing.
| kjkjadksj wrote:
| Must have been abusing the head node if they did that
| corentin88 wrote:
| No mention of Firebase? That might explain the slow decrease of
| MongoDB.
| WesolyKubeczek wrote:
| I remember going on one of "big data" conferences back in 2015,
| when it was the buzzword of the day.
|
| The talks were all concentrated around topics like: ingesting and
| writing the data as quickly as possible, sharding data for the
| benefit of ingesting data, and centralizing IoT data from around
| the whole world.
|
| Back then I had questions which were shrugged off -- back in the
| day it seemed to me -- as extremely naive, as if they signified
| that I was not the "in" crowd somehow for asking them. The
| questions were:
|
| 1) Doesn't optimizing highly for key-value access mean that you
| need to anticipate, predict, and implement ALL of the
| future access patterns? What if you need to change your queries a
| year in? The most concrete answer I got was that of course a good
| architect needs to know and design for all possible ways the data
| will be queried! I was amazed at either the level of prowess of
| the said architects -- such predictive powers that I couldn't
| ever dream of attaining! -- or the level of self-delusion, as the
| cynic in me put it.
|
| 2) How can it be faster if you keep shoving intermediate
| processing elements into your pipeline? It's not like you just
| mindlessly keep adding queues upon queues. That had never been
| answered. The processing speeds of high-speed pipelines may be
| impressive, but if some stupid awk over CSV can do it just as
| quickly on commodity hardware, something _must_ be wrong.
| nottorp wrote:
| > How many workloads need more than 24TB of RAM or 445 CPU cores?
|
| <cough> Electron?
| GuB-42 wrote:
| AI is the new "big data". In fact, AI as it is done today is
| nothing without at least terabytes of data.
|
| What the article talks about is more like a particular type of
| database architecture (often collectively called "NoSQL") that
| was a fad a few years ago, and as all fads, it went down. It
| doesn't mean having lots of data is useless, or that NoSQL is
| useless, just that it is not the solution to every problem. And
| also that there is a reason why regular SQL databases have been
| in use since the 70s: except in specific situations most people
| don't encounter, they just work.
| coldtea wrote:
| Selling "Big Data" tooling and consulting was a nice
| money-making scheme while it lasted.
| deadbabe wrote:
| Something similar will happen with generative AI someday.
|
| AI scientists will propose all sorts of elaborate complex
| solutions to problems using LLMs, and the dismissive response
| will be "Your problem is solvable with a couple of if
| statements."
|
| Most people just don't have problems that require AI.
| doubloon wrote:
| That's one of the first things Andrew Ng said in his old ML
| course.
| deadbabe wrote:
| This is why I personally just can't find motivation to even
| pay attention to most AI developments. It's a toy, it does
| some neat things, but there's no problem I've heard of or
| encountered where LLM style AI was the only tool for the job,
| or even the best tool. The main use seems to be content
| creation and manipulation at scale, which the vast majority
| of companies simply don't have to deal with.
|
| Similarly, a lot of companies talk about how they have tons
| of data, but there's never any real application or game
| changing insight from it. Just a couple neat tricks and
| product managers patting themselves on the back.
|
| Setting up a good database is probably the peak of a typical
| company's tech journey.
| int_19h wrote:
| Natural language processing is an obvious area in which LMs
| are consistently outperforming the best bunches of if-
| statements by a very large margin, and it has very broad
| applicability.
|
| E.g. I would argue that its translation capabilities alone
| make GPT-4 worthwhile, even if it literally couldn't do
| anything else.
| fijiaarone wrote:
| The problem with big data is that people don't have data, they
| have useless noise.
|
| For lack of data, they generate random bytes collected on every
| mouse movement on every page, and every packet that moves through
| their network. It doesn't tell them anything because the only
| information that means anything is who clicks that one button on
| their checkout page after filling out the form with their
| information or that one request that breaches their system.
|
| That's why big data is synonymous with meaningless charts on
| pointless dashboards sold to marketing and security managers who
| never look at them anyway
|
| It's like tracking the wind and humidity and temperature and
| barometer data every tenth of a second every square meter.
|
| It won't help you predict the weather any better than stepping
| outside and looking at the sky a couple times a day.
| thfuran wrote:
| It absolutely would. You can't build a useful model off of
| occasionally looking outside.
| iamleppert wrote:
| What he means to say, is the grift is dead. All the best fish
| have been fished in that data lake (pun intended), leaving most
| waiting on the line to truly catch an appealing mackerel. Most of
| the big data people I know have moved on to more lucrative grifts
| like crypto and (more recently) AI.
|
| There are still bags to be made if you can scare up a CTO or one
| of his lieutenants working for a small to medium size Luddite
| company. Add in storage on the blockchain and a talking AI parrot
| if you want some extra gristle in your grift.
|
| Long live the tech grifters!
| RyanHamilton wrote:
| For 10 years he sold companies on Big Data they didn't need and
| he only just realised most people don't have big data. Now he's
| switched to small data tools, and we should use/buy those. Is
| it harsh to say either a) 10 years = he isn't good at his job,
| or b) Jordan would sell whatever he gets paid to sell?
| renegade-otter wrote:
| Big Data is not dead - it has been reborn as AI, which is
| essentially Big Data 2.0.
|
| And just in the same fashion, there was massive hype around Big
| Data 1.0. From 2013: https://hbr.org/2013/12/you-may-not-need-
| big-data-after-all
|
| _Everyone_ has _so_ much data that they _must_ use AI in order
| to tame it. The reality, however, is that most of their data
| is crap and all over the place, and no amount of Big Data 1.0 or
| 2.0 is ever going to fix it.
| donatj wrote:
| The article only touches on it for a moment but GDPR killed big
| data. The vast majority of the data that any regular business
| would have and could be considered big almost certainly contained
| PII in one form or another. It became too much of a liability to
| keep that around.
|
| With GDPR, we went from keeping everything by default unless a
| customer explicitly requested it gone to deleting it all
| automatically after a certain number of days after their license
| expires. This makes opaque data lakes completely untenable.
|
| Don't get me wrong, this is all a net positive. The customers'
| data is physically removed and they don't have to worry about
| future leaks or malicious uses, and we get a more efficient
| database. The only people really fussed were the sales team
| trying to lure people back with promises that they could pick
| right back up where they left off.
| gigatexal wrote:
| DuckDB is nothing short of amazing. The only thing is that when
| the dataset is bigger than system RAM, it falls apart. Spilling
| to disk is still broken.
| surfingdino wrote:
| In my experience the only time I worked on a big data project was
| the public Twitter firehose. The team built an amazing pipeline
| and it did actually deal with masses of data. Every other team
| I've been on was delusional and kept building expensive and
| overcomplicated solutions that could be replaced with a single
| Postgres instance. The most overcomplicated system I've seen
| could not process 24hrs-worth of data in under 24 hours... I was
| happy to move on when an opportunity presented itself.
| rr808 wrote:
| You're never going to get a FAANG job with an approach like that.
| And basically most developers are working towards that goal.
| ricardo81 wrote:
| Is there a solid definition of big data nowadays?
|
| It seems to conflate somewhat with SV companies completely
| dismantling privacy concerns and hoovering up as much data as
| possible. Lots of scenarios I'm sure. I'm just thinking of FAANG
| in the general case.
| nextworddev wrote:
| Big data is there so that it can justify Databricks and Snowflake
| valuations /s
| cheptsov wrote:
| A good clickbait title. One should credit the author for that.
|
| As to the topic, IMO, there is a contradiction. The only way to
| handle big data is to divide it into chunks that aren't expensive
| to query. In that sense, no data is "big" as long as it's handled
| properly.
|
| Also, about big data being only a problem for 1 percent of
| companies: it's a ridiculous argument implying that big data was
| supposed to be a problem for everyone.
|
| I personally don't see the point behind the article, with all due
| respect to the author.
|
| I also see many awk experts here who have never been in charge of
| building enterprise data pipelines.
| markus_zhang wrote:
| I think one problem that arises from practical work is: Databases
| seem to be biased towards either transactional (including
| fetching single records) or aggregational workload, but in
| reality both are used extensively. This also brings difficulty in
| data modelling, when we DEs are mostly thinking about aggregating
| data while our users also want to investigate single records.
|
| Actually, now that I think about it, we should have two products
| for the users: one to let them query single records as fast as
| possible without hitting the production OLTP transactional
| database, even from really big data (find one record from PB
| level data in seconds), one to power the dashboards that ONLY
| show aggregation. Is Lakehouse a solution? I have never used it.
| lukev wrote:
| This is a good post, but it's somewhat myopically focused on
| typical "business" data.
|
| The most interesting applications for "big data" are all (IMO) in
| the scientific computing space. Yeah, your e-commerce business
| probably won't ever need "big data" but load up a couple genomics
| research sets and you sure will.
| maartet wrote:
| Reminds me of this gem from the previous decade:
| https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html (Don't
| use Hadoop - your data isn't that big)
|
| Which, of course, was discussed on HN back then:
| https://news.ycombinator.com/item?id=6398650
| unholyguy001 wrote:
| One of the problems with the article is BigQuery becomes
| astronomically expensive at mid to high volumes. So there is a
| strong incentive to keep the data in BigQuery manageable or to
| even move off it as data volumes get higher.
|
| Also larger enterprises don't even use gcp all that much to begin
| with
| tored wrote:
| Former prime minister of Sweden Stefan Lofven got wind of the
| Big Data buzzword back in 2015 and used it in one of his
| speeches, saying that this is the future. However, he used the
| Swedish translation of Big Data, "stordata". That generated
| some funny quips about where "lilldata" is.
|
| https://www.aftonbladet.se/senastenytt/ttnyheter/inrikes/a/8...
| xiaodai wrote:
| Big memory is eating big data.
| angarg12 wrote:
| In previous years I would have completely agreed with this
| post. Nowadays, with the AI and ML craze, I'm not so sure. I've
| seen plenty of companies using vast amounts of data to train ML
| models to incorporate into their products. Definitely more data
| than can be handled by a traditional DB, and well into Big Data
| territory.
|
| This isn't a value judgement about whether that's a good idea,
| just an observation from talking with many tech companies doing
| ML. This definitely feels like a bubble that will burst in due
| time, but for now ML is turbocharging Big Data.
| int_19h wrote:
| But do they need to _query_ that data?
| breckognize wrote:
| 99.9%+ of data sets fit on an SSD, and 99%+ fit in memory. [1]
|
| This is the thesis for https://rowzero.io. We provide a real
| spreadsheet interface on top of these data sets, which gives you
| all the richness and interactivity that affords.
|
| In the rare cases you need more than that, you can hire
| engineers. The rest of the time, a spreadsheet is all you need.
|
| [1] I made these up.
| BiboAlpha wrote:
| One of the major challenges with Big Data is that it often fails
| to deliver real business value, instead offering only misleading
| promises.
| mavili wrote:
| Big Data hasn't been about storage; I thought it was always
| about processing. The guy obviously knows his stuff, but I got
| the impression he stressed storage more, and how that's cheap
| and easy these days. When he does mention processing/computing,
| he mentions that most of the time people end up only querying
| recent data (i.e. a small chunk of the actual data they hold),
| but that raises the question: is querying only a small chunk of
| data what
| businesses need, or are they doing it because querying the whole
| dataset is just not manageable? In other words, if processing all
| data at once was as easy as querying the most recent X percent,
| would most businesses still choose to only query the small chunk?
| I think therein lies the answer to whether Big Data
| (processing) is needed or not.
| siliconc0w wrote:
| Storing data in object storage and querying it from compute,
| caching what you can, basically scales until your queries are
| too expensive for a single node.
| idunnoman1222 wrote:
| I love the way he talks about this. His paycheque from big data
| is what is dead. His service offering was always bullshit.
| fredstar wrote:
| I have been at a software services company for more than 15
| years. And to be honest, a lot of these big topics have always
| been some kind of sales talk or door opener. You write a white
| paper, nominate an 'expert' on your team and use these things
| in conversations with clients. Sure, some trends are way more
| real and useful than others. But for me the article hits the
| nail on the head.
| aorloff wrote:
| What is dead is the notion that some ever expanding data lake is
| a mine full of gems and not a cost center.
| schindlabua wrote:
| Used to work at a company that produced 20 gigs of analytics
| every day which is probably the biggest data I'll ever work on.
| My junior project was writing some data crunching jobs that did
| aggregations batched and in real time, and stored the results
| in Parquet blobs in Azure.
|
| My boss was smart enough to have stakeholder meetings where they
| regularly discussed what to keep and what to throw away, and with
| some smart algorithms we were able to compress all that data down
| into like 200MB per day.
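|
| The kind of daily rollup described is, conceptually, something
| like this sketch (DuckDB used here just for illustration; paths
| and columns are made up):
|
|   import duckdb
|
|   # Roll one day of raw events up into a much smaller daily
|   # aggregate before the raw data goes off to cold storage.
|   duckdb.sql("""
|       COPY (
|           SELECT event_type,
|                  date_trunc('hour', ts) AS hour,
|                  count(*)    AS events,
|                  sum(amount) AS amount
|           FROM read_parquet('raw/2024-05-27/*.parquet')
|           GROUP BY ALL
|       ) TO 'daily/2024-05-27.parquet'
|         (FORMAT parquet, COMPRESSION zstd)
|   """)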
|
| We loaded the last 2 months into an sql server and the last 2
| years further aggregated into another, and the whole company used
| the data in excel to do queries on it in reasonable time.
|
| The big data is rotting away on tape storage in case they ever
| need it in the future.
|
| My boss got a lot of stuff right and I learned a lot, though I
| only realized that in hindsight. Dude was a bad manager but he
| knew his data.
| dang wrote:
| Related:
|
| _Big data is dead_ -
| https://news.ycombinator.com/item?id=34694926 - Feb 2023 (433
| comments)
|
| _Big Data Is Dead_ -
| https://news.ycombinator.com/item?id=33631561 - Nov 2022 (7
| comments)
| kurts_mustache wrote:
| What was the source data here? It seems like a lot of the graphs
| are just the author's intuitive feel rather than being backed by
| any sort of hard data. Did I miss it in there somewhere?
| lkdfjlkdfjlg wrote:
| TLDR: he used to say that you needed the thing he's selling. He
| changed his mind now and you need the opposite, which he's
| selling.
___________________________________________________________________
(page generated 2024-05-27 23:01 UTC)