[HN Gopher] Big data is dead (2023)
       ___________________________________________________________________
        
       Big data is dead (2023)
        
       Author : armanke13
       Score  : 460 points
       Date   : 2024-05-27 08:30 UTC (14 hours ago)
        
 (HTM) web link (motherduck.com)
 (TXT) w3m dump (motherduck.com)
        
       | clkao wrote:
       | previous discussion:
       | https://news.ycombinator.com/item?id=34694926
        
         | spicyusername wrote:
         | I find it interesting that this comment section and that
         | comment section seem to focus on different things, despite
         | being triggered by the same input.
        
       | ahartmetz wrote:
       | I guess that hype cycle ended at the plateau of being dead. A not
       | uncommon outcome in this incredibly fashion-driven industry.
        
         | silvestrov wrote:
         | It has just been rebranded as AI.
         | 
          | AI also uses all the data, just with a magick neural network to
          | figure out what it all means.
        
           | quonn wrote:
            | The overlap in terms of the technologies used, the required
            | skills, the actual products and the target market is minimal.
           | AI is not mostly Hadoop, it's not MapReduce, the hardware is
           | different, the software is different, the skillset is very
           | different and a chatbot or image generator is very different
           | from a batch job producing an answer to a query.
        
             | renegade-otter wrote:
             | But the underlying problem is the same - companies that use
             | Big Data tech are clueless about data management. You can
             | use unicorns - it's not going to do anything. "Garbage in,
             | garbage out" is a timeless principle.
        
           | WesolyKubeczek wrote:
           | Given how often it hallucinates, it should be rebranded to
           | "high data".
        
           | Cyberdog wrote:
           | Assuming you're serious for a moment, I don't think AI is
           | really a practical tool for working with big data.
           | 
           | - The "hallucination" factor means every result an AI tells
           | you about big data is suspect. I'm sure some of you who
           | really understand AI more than the average person can "um
           | akshually" me on this and tell me how it's possible to
           | configure ChatGPT to absolutely be honest 100% of the time
           | but given the current state of what I've seen from general-
           | purpose AI tools, I just can't trust it. In many ways this is
           | worse than MongoDB just dropping data since at least Mongo
           | won't make up conclusions about data that's not there.
           | 
           | - At the end of the day - and I think we're going to be
           | seeing this happen a lot in the future with other workflows
           | as well - you're using this heavy, low-performance general-
            | purpose tool to solve a problem which can be solved much more
            | performantly by using tools which have been designed from
           | the beginning to handle data management and analysis. The
           | reason traditional SQL RDBMSes have endured and aren't going
           | anywhere soon is partially because they've proven to be a
           | very good compromise between general functionality and
           | performance for the task of managing various types of data.
           | AI is nowhere near as good of a balance for this task in
           | almost all cases.
           | 
           | All that being said, the same way Electron has proven to be a
           | popular tool for writing widely-used desktop and mobile
           | applications, performance and UI concerns be damned all the
           | way to hell, I'm sure we'll be seeing AI-powered "big data"
           | analysis tools very soon if they're not out there already,
           | and they will suck but people will use them anyway to
           | everyone's detriment.
        
             | silvestrov wrote:
             | A comment from the old post:
             | https://news.ycombinator.com/item?id=34696065
             | 
             | > _I used to joke that Data Scientists exist not to uncover
             | insights or provide analysis, but merely to provide
              | factoids that confirm senior management's prior beliefs._
             | 
             | I think AI is used for the same purpose in companies:
             | signal to the world that the company is using the latest
             | tech and internally for supporting existing political
             | beliefs.
             | 
             | So same job. Hallucination is not a problem here as the AI
             | conclusions are not used when they don't align to existing
             | political beliefs.
        
             | vitus wrote:
             | > The "hallucination" factor means every result an AI tells
             | you about big data is suspect.
             | 
             | AI / ML means more than just LLM chat output, even if
             | that's the current hype cycle of the last couple of years.
             | ML can be used to build a perfectly serviceable classifier,
             | or predictor, or outlier detector.
             | 
             | It suffers from the lack of explainability that's always
             | plagued AI / ML, especially as you start looking at deeper
             | neural networks where you're more and more heavily reliant
             | on their ability to approximate arbitrary functions as you
             | add more layers.
             | 
             | > you're using this heavy, low-performance general-purpose
             | tool to solve a problem which can be solved much more
              | performantly by using tools which have been designed from
             | the beginning to handle data management and analysis
             | 
             | You are not wrong here, but one challenge is that sometimes
             | even your domain experts do not know how to solve the
             | problem, and applying traditional statistical methods
             | without understanding the space is a great way of
             | identifying spurious correlations. (To be fair, this
             | applies in equal measure to ML methods.)
        
       | blagie wrote:
       | Overall, I agree with much of this post, but there are several
       | caveats:
       | 
       | 1) Mongo is a bad point of reference.
       | 
       | The one lesson I've learned is that there is nothing Mongo does
       | which postgresql doesn't do better. Big data solutions aren't
       | nosql / mongo, but usually things like columnar databases,
       | map/reduce, Cassandra, etc.
       | 
       | 2) Plan for success
       | 
       | 95% of businesses never become unicorns, but that's the goal for
       | most (for the 5% which do). If you don't plan for it, you won't
       | make it. The reason to architect for scalability when you have 5
       | customers is so if that exponential growth cycle hits, you can
       | capitalize on it.
       | 
       | That's not just architecture. To have any chance of becoming a
       | unicorn, every part of the business needs to be planned for now
       | and for later: How do we make this practical / sustainable today?
       | How do we make sure it can grow later when we have millions of
       | customers? A lot of this can be left as scaffolding (we'll swap
       | in [X], but for now, we'll do [Y]).
       | 
       | But the key lessons are correct:
       | 
       | - Most data isn't big. I can fit data about every person in the
       | world on a $100 Chromebook. (8 billion people * 8 bits of data =
       | 8GB)
       | 
       | - Most data is rarely queried, and most queries are tiny. The
       | first step in most big data jobs I've done is taking terabytes of
        | data and shrinking it down to the GB, MB, or even KB-scale data I
       | need. Caveat: I have no algorithm for predicting what I'll need
       | in the future.
       | 
        | - Cost of data is increasing with regulation.
        
         | brtkdotse wrote:
         | > 95% of businesses never become unicorns, but that's the goal
         | for most
         | 
         | Is it really the general case or is it just a HN echo chamber
         | meme?
         | 
          | My pet peeve is that patterns used by companies that in theory
          | could become global unicorns are mimicked by companies where
          | 5000 paying customers would mean an immense success.
        
           | blagie wrote:
           | It's neither.
           | 
           | Lifestyle companies are fine, if that's what you're aiming
            | for. I know plenty of people who run or work at ~1-30
            | person companies with no intention to grow.
           | 
           | However, if you're going for high-growth, you need to plan
           | for success. I've seen many potential unicorns stopped by
           | simple lack of planning early on. Despite all the pivots
           | which happen, if you haven't outlined a clear path from 1-3
           | people in a metaphorical garage to reaching $1B, it almost
           | never happens, and sometimes for stupid reasons.
           | 
           | If your goal is 5000 paying customers at $100 per year and
           | $500k in annual revenues, that can lead to a very decent
            | life. However, it's an entirely different ballgame: (1) Don't
            | take in investment (2) You probably can't hire more than one
            | person (3) You need a plan for break-even revenue before you
            | need to quit your job / run out of savings. (4) You need much
            | better odds of success than 1-in-10.
           | 
           | And it's very possible (and probably not even hard) to start
           | a sustainable 1-5 person business with >>50% odds of success,
           | especially late career:
           | 
           | - Find a niche you're aware of from your job
           | 
           | - Do ballpark numbers on revenues. These should land in the
           | $500k-$10M range. Less, and you won't sustain. More, and
           | there will be too much competition.
           | 
           | - Do it better than the (likely incompetent or non-existent)
           | people doing it now
           | 
           | - Use your network of industry contacts to sell it
           | 
            | That's not a big enough market that you need to worry about a
            | lot of competition, competitors with VC funding, etc. Niches
            | with tall moats do especially well -- pick some unique
            | skillset, technology, or market access, for example.
           | 
           | However, IF you've e.g. taken in VC funding, then you do need
           | to plan for growth, and part of that is planning for the
           | small odds your customer base (and ergo, your data) does
           | grow.
        
             | IneffablePigeon wrote:
              | If you're in B2B, 5000 customers can be a lot more revenue
              | than that. 10-100x, depending hugely on industry and
              | product.
        
           | davedx wrote:
            | It's definitely an echo chamber. Most companies definitely do
            | not want to become "unicorns" - most SMEs around the world
            | don't even know what a "unicorn" is, let alone be in an
            | industry/sector where it's possible.
           | 
           | Does a mining company want to become a "unicorn"?
           | 
           | A fish and chip shop?
           | 
           | Even within tech there is an extremely large number of
           | companies whose goals are to steadily increase profits and
            | return them to shareholders. 37signals is the poster child
            | there.
           | 
           | Maybe if you're a VC funded startup then yeah.
        
           | threeseed wrote:
           | HN is the worst echo chamber around.
           | 
           | Obsessed with this "you must use PostgreSQL for every use
           | case" nonsense.
           | 
            | And that anyone who actually has unique data needs is simply
            | doing it for their resume or is over-engineering.
        
             | paulryanrogers wrote:
             | > Obsessed with this "you must use PostgreSQL for every use
             | case" nonsense.
             | 
              | Pg fans are certainly here asking "why not PG?". Yet so are
              | fans of other DBs, like DuckDB, CouchDB, SQLite, etc.
        
               | internet101010 wrote:
                | I don't see so much DuckDB and CouchDB proselytizing, but
                | the SQLite force is always out strong. I tend to base the
                | Postgres vs. SQLite decision on whether the data in
                | question is self-contained. Like am I pulling data from
                | elsewhere (Postgres) or am I creating data within the
                | application that is only used for the functionality of
                | said application (SQLite).
        
               | int_19h wrote:
               | SQLite, in addition to just being plain popular, is a
               | fairly natural stepping stone - you get a lot of
               | fundamental benefits of an SQL RDBMS (abstract high-level
               | queries, ACID etc) without the overhead of maintaining a
               | database server.
               | 
               | Postgres is the next obvious stepping stone after that,
               | and the one where the vast majority of actual real-world
               | cases that are not hypotheticals end up fitting.
        
             | citizen_friend wrote:
             | Nobody is saying this.
             | 
             | > who actually has unique data needs
             | 
             | We are saying this is probably not true, and you just want
             | to play with toys rather than ship working systems.
             | 
              | Google search cannot be built on Postgres.
        
           | babel_ wrote:
            | Many startups seem to aim for this; naturally it's difficult
            | to put actual numbers to it, and I'm sure many pursue
            | multiple aims in the hope one of them sticks. Since "unicorn"
            | really just describes private valuation, it's the same as
            | saying many aim to get stupendously wealthy. Can't put a
            | number on that, but you can at least see it's a hope for
            | many, though "goal" is probably making it seem like they've
            | got actually achievable plans for it... That, at least, I'm
            | not so convinced of.
           | 
            | Startups are, however, atypical of new businesses, ergo the
            | unicorn myth, meaning we see many attempts to follow a path
            | that likely prevents many new businesses from actually
            | achieving the more real goals of, well, being a business:
            | succeeding in their venture to produce whatever it is and
            | reach their customers.
           | 
           | I describe it as a unicorn "myth" as it very much behaves in
           | such a way, and is misinterpreted similarly to many myths we
           | tell ourselves. Unicorns are rare and successful because they
           | had the right mixture of novel business and the security of
           | investment or buyouts. Startups purportedly are about new
           | ways of doing business, however the reality is only a handful
           | really explore such (e.g. if it's SaaS, it's probably not a
           | startup), meaning the others are just regular businesses with
           | known paths ahead (including, of course, following in the
           | footsteps of prior startups, which really is self-refuting).
           | 
           | With that in mind, many of the "real" unicorns are
           | realistically just highly valued new businesses (that got
           | lucky and had fallbacks), as they are often not actually
           | developing new approaches to business, whereas the mythical
           | unicorns that startups want to be are half-baked ideas of how
           | they'll achieve that valuation and wealth without much idea
           | of how they do business (or that it can be fluid, matching
           | their nebulous conception of it), just that "it'll come",
           | especially with "growth".
           | 
           | There is no nominative determinism, and all that, so
           | businesses may call themselves startups all they like, but if
           | they follow the patterns of startups without the massive
           | safety nets of support and circumstance many of the real
           | unicorns had, then a failure to develop out the business
           | proper means they do indeed suffer themselves by not
           | appreciating 5000 paying customers and instead aim for "world
           | domination", as it were, or acquisition (which they typically
           | don't "survive" from, as an actual business venture). The
           | studies have shown this really does contribute to the failure
           | rate and instability of so-called startups, effectively due
           | to not cutting it as businesses, far above the expected norm
           | of new businesses...
           | 
           | So that pet peeve really is indicative of a much more
           | profound issue that, indeed, seems to be a bit of an echo
           | chamber blind spot with HN.
           | 
           | After all, if it ought to have worked all the time, reality
           | would look very different from today. Just saying how many
           | don't become unicorns (let alone the failure rate) doesn't
           | address the dissonance from then concluding "but this time
           | will be different". It also doesn't address the idea that you
           | don't need to become a "unicorn", and maybe shouldn't want to
           | either... but that's a line of thinking counter to the echo
           | chamber, so I won't belabour it here.
        
         | notachatbot1234 wrote:
         | > - Most data isn't big. I can fit data about every person in
         | the world on a $100 Chromebook. (8 billion people * 8 bits of
         | data = 8GB)
         | 
          | Nitpick but I cannot help myself: 8 bits are not even enough
          | for a unique integer ID per person; that would require 8 bytes
          | per person, and then we are at 64GB already.
          | 
          | I agree with pretty much everything else you said, just this
          | stood out as wrong and Duty Calls.
        
           | iraqmtpizza wrote:
           | meh. memory address is the ID
        
             | L-four wrote:
             | Airline booking numbers used to just be the sector number
              | of your booking record on the mainframe's HDD.
        
               | switch007 wrote:
               | My jaw just hit the floor. What a fascinating fact!
        
               | rrr_oh_man wrote:
               | That's why they were constantly recycled?
        
               | mauriciolange wrote:
               | source?
        
               | devsda wrote:
               | This is such a simple scheme.
               | 
               | I wonder how they dealt with common storage issues like
               | backups and disks having bad sectors.
        
               | giantrobot wrote:
                | They're likely using record-based formatting rather than
                | file-based. At the high level the code is just asking for
                | a record number from a data set. The data set is managed,
                | including redundancy/ECC, by the hardware of that storage
                | device.
        
           | amenhotep wrote:
           | Sure it is. You just need a one to one function from person
           | to [0, eight billion]. Use that as your array index and
           | you're golden. 8 GB is overkill, really, you could pack some
           | boolean datum like "is over 18" into bits within the bytes
           | and store your database in a single gigabyte.
           | 
           | Writing your mapping function would be tricky! But definitely
           | theoretically possible.
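            | 
            | A rough sketch of the packed version (Python; assumes the
            | person -> index mapping already exists, and about 1 GB of
            | RAM for the flags):
            | 
            |     NUM_PEOPLE = 8_000_000_000
            |     bits = bytearray(NUM_PEOPLE // 8 + 1)  # one bit/person
            | 
            |     def set_flag(i, value):
            |         byte, bit = divmod(i, 8)
            |         if value:
            |             bits[byte] |= 1 << bit
            |         else:
            |             bits[byte] &= ~(1 << bit)
            | 
            |     def get_flag(i):
            |         byte, bit = divmod(i, 8)
            |         return bool(bits[byte] & (1 << bit))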
        
             | blagie wrote:
             | I'm old enough to have built systems with similar
             | techniques. We don't do that much anymore since we don't
             | need to, but it's not rocket science.
             | 
             | We had spell checkers before computers had enough memory to
             | fit all words. They'd probabilistically find almost all
             | incorrect words (but not suggest corrections). It worked
             | fine.
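              | 
              | Something Bloom-filter-ish, roughly: hash every dictionary
              | word into a few bit positions, and a word whose bits aren't
              | all set is definitely not in the dictionary (toy Python
              | sketch, not the historical code):
              | 
              |     import hashlib
              | 
              |     M = 1 << 20              # filter size in bits
              |     bits = bytearray(M // 8)
              | 
              |     def positions(word, k=4):
              |         d = hashlib.sha256(word.encode()).digest()
              |         return [int.from_bytes(d[4*i:4*i+4], "big") % M
              |                 for i in range(k)]
              | 
              |     def add(word):
              |         for p in positions(word):
              |             bits[p // 8] |= 1 << (p % 8)
              | 
              |     def probably_known(word):
              |         # may pass a typo (rarely); never flags a real word
              |         return all(bits[p // 8] & (1 << (p % 8))
              |                    for p in positions(word))
              | 
              |     for w in ("the", "quick", "brown", "fox"):
              |         add(w)
              |     print(probably_known("quick"), probably_known("qiuck"))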
        
         | OJFord wrote:
         | > 95% of businesses never become unicorns, but that's the goal
         | for most (for the 5% which do).
         | 
         | I think you're missing quite a few 9s!
        
         | davedx wrote:
         | > 2) Plan for success 95% of businesses never become unicorns,
         | but that's the goal for most (for the 5% which do). If you
         | don't plan for it, you won't make it.
         | 
         | That's exactly what every architecture astronaut everywhere
         | says. In my experience it's completely untrue, and actually
         | "planning for success" more often than not causes huge drags on
         | productivity, and even more important for startups, on agility.
         | Because people never just make plans, they usually implement
         | too.
         | 
         | Plan for the next 3 months and you'll be much more agile and
         | productive. Your startup will never become a unicorn if you
         | can't execute.
        
           | MOARDONGZPLZ wrote:
            | In my experience the drag caused by planning for scalability
            | early is so much greater than the effort to rearchitect
            | things when and if the company becomes a unicorn that one is
            | significantly more likely to become a unicorn by simply
            | focusing on execution and very fast iteration, saving the
            | scalability work until it's actually needed (when they can
            | hire a team of whomever to effect this change with their
            | newly minted unicorn cachet).
        
           | newaccount74 wrote:
           | The biggest problem with planning for scale is that engineers
           | often have no idea what problems they will actually run into
           | when they scale and they build useless shit that slows them
           | down and doesn't help later at all.
           | 
           | I've come to the conclusion that the only strategy that works
           | reliably is to build something that solves problems you have
           | NOW rather than trying to predict the future.
        
             | fuzzy2 wrote:
             | Exactly this. Not only would they not know the tech
             | challenges, they also wouldn't know the business/domain
             | challenges.
        
             | vegetablepotpie wrote:
             | The flip side of that is that you end up with spaghetti
             | code that is expensive to add features to and is expensive
             | to clean up when you are successful. Then people in the
             | business implement workarounds to handle special cases that
             | are undocumented and hidden.
        
               | citizen_friend wrote:
               | No it doesn't. Simple and targeted solutions are not bad
               | code. For example, start with a single postgres instance
               | on a single machine, rather than Hadoop clusters and
               | Kubernetes. Once that is maxed out, you will have time
               | and money to solve bigger problems.
        
           | smrtinsert wrote:
           | The success planners almost always seem to be the same ones
           | pushing everyone to "not overengineer". Uhhhh..
        
           | CuriouslyC wrote:
           | There's writing code to handle every eventuality, and there's
           | considering 3-4 places you _MIGHT_ pivot and making sure you
            | aren't making those pivots harder than they need to be.
        
             | blagie wrote:
             | This is exactly what I try to do and what I've seen
             | successful systems do.
             | 
             | Laying out adjacent markets, potential pivots, likely
             | product features, etc. is a weekend-long exercise. That can
             | help define both where the architecture needs to be
             | flexible, and just as importantly, *where it does not*.
             | 
             | Over-engineering happens when you plan / architect for
             | things which are unlikely to happen.
        
           | Spooky23 wrote:
           | The exception is when you have people with skills in
           | particular tools.
           | 
           | The suggestion upthread to use awk is awesome if you're a
           | bunch of Linux grey beards.
           | 
           | But if you have access to people with particular skills or
           | domain knowledge... spending extra cash on silly
           | infrastructure is (within reason) way cheaper than having
           | that employee be less productive.
        
             | citizen_friend wrote:
             | Nope, if every person does things completely differently,
             | that's just a lack of technical leadership. Leaders pick an
             | approach with tradeoffs that meet organizational goals and
             | help their team to follow it.
        
           | blagie wrote:
           | > That's exactly what every architecture astronaut everywhere
           | says. In my experience it's completely untrue, and actually
           | "planning for success" more often than not causes huge drags
           | on productivity, and even more important for startups, on
           | agility. Because people never just make plans, they usually
           | implement too.
           | 
           | That's not my experience at all.
           | 
           | Architecture != implementation
           | 
           | Architecture astronauts will try to solve the world's
           | problems in v0. That's very different from having an
           | architectural vision and building a subset of it to solve
           | problems for the next 3 months. Let me illustrate:
           | 
           | * Agile Idiot: We'll stick it all in PostgreSQL, however it
           | fits, and meet our 3-month milestone. [Everything crashes-
           | and-burns on success]
           | 
           | * Architecture Astronaut: We'll stick it all in a high-
           | performance KVS [Business goes under before v0 is shipped]
           | 
           | * Success: We have one table which will grow to petabytes if
           | we reach scale. We'll stick it all in postgresql for now, but
           | maintain a clean KVS abstraction for that one table. If we
           | hit success, we'll migrate to [insert high-performance KVS].
           | All the other stuff will stay in postgresql.
           | 
           | The trick is to have a pathway to success while meeting
           | short-term milestones. That's not just software architecture.
           | That's business strategy (clean beachhead, large ultimate
           | market), and every other piece of designing a successful
           | startup. There should be a detailed 3-month plan, a long-term
           | vision, and a rough set of connecting steps.
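            | 
            | Concretely, the "clean KVS abstraction" in that third option
            | can be as small as this (illustrative Python; the psycopg
            | connection and the two-column events table are assumptions
            | for the example):
            | 
            |     from abc import ABC, abstractmethod
            | 
            |     class EventStore(ABC):
            |         # the one table expected to grow to petabytes
            |         @abstractmethod
            |         def put(self, key: str, value: bytes) -> None: ...
            |         @abstractmethod
            |         def get(self, key: str) -> bytes | None: ...
            | 
            |     class PostgresEventStore(EventStore):
            |         def __init__(self, conn):
            |             self.conn = conn
            |         def put(self, key, value):
            |             with self.conn.cursor() as cur:
            |                 cur.execute(
            |                     "INSERT INTO events (k, v)"
            |                     " VALUES (%s, %s)", (key, value))
            |             self.conn.commit()
            |         def get(self, key):
            |             with self.conn.cursor() as cur:
            |                 cur.execute(
            |                     "SELECT v FROM events WHERE k = %s",
            |                     (key,))
            |                 row = cur.fetchone()
            |             return row[0] if row else None
            | 
            | If growth hits, only PostgresEventStore gets swapped for a
            | real KVS; callers never change.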
        
           | abetusk wrote:
           | Another way to say that is that "planning for success" is
           | prematurely optimizing for scale.
           | 
           | Scaling up will bring its own challenges, with many of them
           | difficult to foresee.
        
           | littlestymaar wrote:
            | This. If you plan for the time you'll be a unicorn, you will
            | never get anything done in the first place, let alone become
            | a unicorn. When you plan for the next 3 months, then
            | hopefully in three months you're still here to plan for the
            | next quarter again.
        
         | underwater wrote:
         | > To have any chance of becoming a unicorn, every part of the
         | business needs to be planned for now and for later
         | 
         | I think that in practice that's counterproductive. A startup
         | has a limited runway. If your engineers are spending your money
         | on something that doesn't pay off for years then they're
         | increasing the chance you'll fail before it matters.
        
           | blagie wrote:
           | You're confusing planning with implementation.
           | 
           | Planning is a weekend, or at most a few weeks.
        
         | threeseed wrote:
          | _> nothing Mongo does which postgresql doesn't do better_
         | 
         | a) It has a built-in and supported horizontal scalability / HA
         | solution.
         | 
          | b) For some use cases, e.g. star schemas, it has significantly
          | better performance.
         | 
          |  _> Big data solutions aren't nosql_
         | 
         | Almost all big data storage solutions are NoSQL.
        
           | ozkatz wrote:
           | > Almost all big data storage solutions are NoSQL.
           | 
            | I think it's important to distinguish between OLAP and OLTP.
           | 
           | For OLAP use cases (which is what this post is mostly about)
           | it's almost 100% SQL. The biggest players being Databricks,
           | Snowflake and BigQuery. Other tools may include AWS's tools
           | (Glue, Athena), Trino, ClickHouse, etc.
           | 
           | I bet there's a <1% market for "NoSQL" tools such as
           | MongoDB's "Atlas Data Lake" and probably a bunch of MapReduce
           | jobs still being used in production, but these are the
           | exception, not the rule.
           | 
           | For OLTP "big data", I'm assuming we're talking about "scale-
           | out" distributed databases which are either SQL (e.g.
            | cockroachdb, vitess, etc), SQL-like (Cassandra's CQL,
            | Elasticsearch's non-ANSI SQL, InfluxDB's InfluxQL) or a purpose-
           | built language/API (Redis, MongoDB).
           | 
           | I wouldn't say OLTP is "almost all" NoSQL, but definitely a
           | larger proportion compared to OLAP.
        
           | blagie wrote:
           | > Almost all big data storage solutions are NoSQL.
           | 
           | Most I've seen aren't. NoSQL means non-relational database.
           | Most big data solutions I've seen will not use a database at
            | all. An example is Hadoop.
           | 
           | Once you have a database, SQL makes a lot of sense. There are
           | big data SQL solutions, mostly in the form of columnar read-
           | optimized databases.
           | 
            | On the above, a little bit of relational can make a huge
            | performance difference: for example, a big table of compact
            | rows with indexes into small data tables. That can be
            | algorithmically a lot more performant than the same thing
            | without relations.
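            | 
            | A toy illustration of that shape, with SQLite standing in
            | just to keep it self-contained (table and column names made
            | up):
            | 
            |     import sqlite3
            | 
            |     con = sqlite3.connect(":memory:")
            |     con.executescript("""
            |         -- small lookup tables: a few thousand rows each
            |         CREATE TABLE country (id INTEGER PRIMARY KEY,
            |                               name TEXT);
            |         -- the big table: billions of compact integer rows
            |         CREATE TABLE sale (
            |             ts         INTEGER,
            |             country_id INTEGER REFERENCES country(id),
            |             amount     INTEGER
            |         );
            |         CREATE INDEX sale_country ON sale(country_id);
            |     """)
            | 
            |     # queries scan narrow integer columns in the big table
            |     # and only join out to the small table for labels
            |     rows = con.execute("""
            |         SELECT c.name, SUM(s.amount)
            |         FROM sale s JOIN country c ON c.id = s.country_id
            |         GROUP BY c.name
            |     """).fetchall()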
        
         | Toine wrote:
         | > To have any chance of becoming a unicorn, every part of the
         | business needs to be planned for now and for later
         | 
          | Sources?
        
         | boxed wrote:
         | I see people planning for success to the point of guaranteeing
         | failure, much more than people who suddenly must try to handle
         | success in panic.
         | 
         | It's a second system syndrome + survivor bias thing I think:
         | people who had to clean up the mess of a good MVP complaining
         | about what wasn't done before. But the companies that DID do
         | that planning and architecting before _did not survive to be
         | complained about_.
        
           | CuriouslyC wrote:
           | It's not either or. There are best practices that can be
           | followed regardless with no time cost up front, and there is
           | taking some time to think about how your product might evolve
           | (which you really should be doing anyhow) then making choices
           | with your software that don't make the evolution process
           | harder than it needs to be.
           | 
           | Layers of abstraction make code harder to reason about and
           | work with, so it's a lose lose when trying to iterate
            | quickly, but there's also the idea of architectural "mise en
            | place" vs "just dump shit where it's most convenient right
            | now and don't worry about later", which will result in near-
            | immediate productivity losses due to system incoherence and
            | disorganization.
        
             | boxed wrote:
             | I'm a big fan of "optimize for deletion" (aka leaf-heavy)
             | code. It's good for reasoning when the system is big, and
             | it's good for growing a code base.
             | 
             | It's a bit annoying how the design of Django templates
             | works against this by not allowing free functions...
        
         | nemo44x wrote:
         | Mongo allows a developer to burn down a backlog faster than
         | anything else. That's why it's so popular. The language drivers
         | interface with the database which just says yes. And whatever
         | happens later is someone else's problem. Although it's a far
         | more stable thing today.
        
         | zemo wrote:
         | > The reason to architect for scalability when you have 5
         | customers is so if that exponential growth cycle hits, you can
         | capitalize on it.
         | 
         | If you have a product gaining that much traction, it's usually
         | because of some compound effect based on the existence and
          | needs of its userbase. If on the way up you stumble while
          | adding new users, the userbase that's already there is unlikely
          | to go back to the Old Thing or go somewhere else (because these
          | events are actually rare).
         | fail whale every day. Most people didn't just up and leave, and
         | nothing else popped up that could scale better that people
         | moved to. Making a product that experiences exponential growth
         | in that way is pretty rare, and struggling to scale those cases
          | and having a period of availability degradation is common. What
          | products that hit an exponential growth situation failed because
          | they couldn't scale?
        
         | anon84873628 wrote:
         | >The one lesson I've learned is that there is nothing Mongo
         | does which postgresql doesn't do better. Big data solutions
         | aren't nosql / mongo, but usually things like columnar
         | databases, map/reduce, Cassandra, etc.
         | 
         | I think that was exactly their point. If new architectures were
         | actually necessary, we would have seen a greater rise in Mongo
         | and the like. But we didn't, because the existing systems were
         | perfectly adequate.
        
       | estheryo wrote:
       | "MOST PEOPLE DON'T HAVE THAT MUCH DATA" That's really true
        
       | vegabook wrote:
       | my experience is that while data keeps growing at an exponential
       | rate, its information content does not. In finance at least, you
       | can easily get 100 million data points per series per day if you
       | want everything, and you might be dealing with thousands of
       | series. That sample rate, and the number of series, is usually
       | 99.99% redundant, because the eigenvalues drop off almost to zero
       | very quickly after about 10 dimensions, and often far fewer.
       | There's very little reason to store petabytes of ticks that you
        | will never query. It's much more reasonable in many cases to do
        | brutal (and yes, lossy) dimensionality reduction _at ingest
        | time_, store the first few principal components + outliers, and
        | monitor eigenvalue stability (in case some new, previously
        | negligible factor starts increasing in importance). It results
       | in a much smaller dataset that is tractable and in many cases
       | revelatory, because it's actually usable.
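        | 
        | A minimal sketch of that ingest step with scikit-learn
        | (thresholds and shapes made up; a real pipeline would fit
        | incrementally):
        | 
        |     import numpy as np
        |     from sklearn.decomposition import PCA
        | 
        |     def reduce_batch(ticks, n_factors=10, z=6.0):
        |         # ticks: (n_samples, n_series) array for one window
        |         pca = PCA(n_components=n_factors).fit(ticks)
        |         scores = pca.transform(ticks)  # store these, not ticks
        |         resid = np.linalg.norm(
        |             ticks - pca.inverse_transform(scores), axis=1)
        |         outliers = ticks[resid > resid.mean() + z * resid.std()]
        |         return (scores, pca.components_,
        |                 pca.explained_variance_ratio_, outliers)
        | 
        | Watching explained_variance_ratio_ drift across windows is the
        | "eigenvalue stability" check: a previously negligible factor
        | creeping up is the cue to widen n_factors.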
        
         | CoastalCoder wrote:
         | Could you point to something explaining that eigenvalue /
         | dimensions topic?
         | 
         | It sounds interesting, but it's totally new to me.
        
           | mk67 wrote:
           | https://en.wikipedia.org/wiki/Principal_component_analysis
        
           | beng-nl wrote:
            | Not OP, but I think they are referring to the fact that you
            | can use PCA (principal component analysis) on a matrix of
            | datapoints to approximate it. Works out of the box in scikit-
            | learn.
           | 
           | You can do (lossy) compression on rows of vectors (treated
           | like a matrix) by taking the top N eigenvectors (largest N
           | eigenvalues) and using them to approximate the original
           | matrix with increasing accuracy (as N grows) by some simple
           | linear operations. If the numbers are highly correlated, you
           | can get a huge amount of compression with minor loss this
           | way.
           | 
            | Personally I like to use it to visualize linear separability
            | of a high-dimensional set of vectors by taking a 2-component
            | PCA and plotting them as x/y values.
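            | 
            | That last bit is only a few lines with scikit-learn (sketch;
            | X is your (n_samples, n_features) matrix and y your labels,
            | both assumed to exist already):
            | 
            |     from sklearn.decomposition import PCA
            |     import matplotlib.pyplot as plt
            | 
            |     xy = PCA(n_components=2).fit_transform(X)
            |     plt.scatter(xy[:, 0], xy[:, 1], c=y, s=5)
            |     plt.show()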
        
         | bartart wrote:
          | That's very interesting, so thank you -- how do you handle it
          | if the eigenvectors change over time?
        
           | vegabook wrote:
           | you can store the main eigenvectors for a set rolling period
           | and see how the space evolves along them, all the while also
           | storing the new ones. In effect the whole idea is to get away
           | from "individual security space" and into "factor space",
           | which is much smaller, and see how the factors are moving.
           | Also, a lot of the time you just care about the outliers --
           | those (small numbers of) instruments or clusters of
           | instruments that are trading in an unusual way -- then you
           | either try to explain it.... or trade against it. Also keep
           | in mind that lower-order factors tend to be much more
           | stationary so there's a lot of alpha there -- if you can
           | execute the trades efficiently (which is why most successful
           | quant shops like citadel and jane street are market MAKERS,
           | not takers, btw).
        
       | zurfer wrote:
       | This is not fully correct.
       | 
       | Originally big data was defined by 3 dimensions:
       | 
       | - Volume (mostly what the author talks about) [solved]
       | 
       | - Velocity, how fast data is processed etc [solved, but
       | expensive]
       | 
       | - Variety [not solved]
       | 
       | Big Data today is not: I don't have enough storage or compute.
       | 
       | It is: I don't have enough cognitive capacity to integrate and
       | make sense of it.
        
         | maayank wrote:
         | What do you mean by 'variety'?
        
           | boesboes wrote:
            | Not OP, but I think they mean the data is complex,
            | heterogeneous and noisy. You won't be able to extract meaning
            | trivially from it; you need something to find the (hidden)
            | meaning in the data.
           | 
           | So AI currently, probably ;)
        
           | threeseed wrote:
           | Data that isn't internally managed database exports.
           | 
           | One of the reasons big data systems took off was because
           | enterprises had exports out of third party systems that they
           | didn't want to model since they didn't own it. As well as a
           | bunch of unstructured data e.g. floor plans, images, logs,
           | telemetry etc.
        
           | zurfer wrote:
           | the other comments get it.
           | 
           | It means that data comes in a ton of different shapes with
           | poorly described schemas (technically and semantically).
           | 
           | From the typical CSV export out of an ERP system to a
           | proprietary message format from your own custom embedded
           | device software.
        
           | nairboon wrote:
           | it doesn't fit the relational model, e.g. you have some
           | tables, but also tons of different types of images, video,
           | sounds, raw text, etc.
        
         | vishnugupta wrote:
          | I first heard of these 3 V's in a Michael Stonebraker
          | talk[1]. For the uninitiated, he's a legend in the DBMS space
          | and a Turing award winner[2].
          | 
          | Highly recommend this and related talks by him; most of them
          | are on YouTube.
         | 
         | [1] https://www.youtube.com/watch?v=KRcecxdGxvQ
         | 
         | [2]
         | https://amturing.acm.org/award_winners/stonebraker_1172121.c...
        
         | snakeyjake wrote:
         | >Big Data today is not: I don't have enough storage or compute.
         | 
         | It is for me. Six times per year I go out to the field for two
         | weeks to do data acquisition. In the field we do a dual-
         | aircraft synthetic aperture radar collection over four bands
         | and dual polarities.
         | 
         | That means two aircraft each with one radar system containing
         | eight 20TiB 16-drive RAID-0 SSD storage devices.
         | 
          | We don't usually fill up the RAIDs so we generate about 176TiB
          | of data per day, and over the two weeks we do 7 flights, or
          | 1.2PiB per deployment or 7.2PiB per year.
         | 
         | We can only fly every other day because it takes a day between
         | flights to offload the data via fiber onto storage servers that
         | are usually haphazardly crammed into the corner of a hangar
         | next to the apron. It is then duplicated to a second server for
         | safekeeping and at the end of the mission everything is shipped
         | back to our HQ for storage and processing.
         | 
         | The data is valuable, but not "billions" valuable. It is used
         | for resource extraction, mapping, environmental and geodetic
         | research, and other applications (but that's not my department)
          | so we have kept every single byte since 2008. This is
         | especially useful because as new algorithms are created (not my
         | department) the old data can be reprocessed to the new
         | standard.
         | 
         | Entire nations finally know how many islands they have, how
         | large they are, how their elevations are changing, and how
         | their coasts are being eradicated by sea level change because
         | of our data and if you've ever used a mapping application and
         | flown around a city with 3d buildings that don't look like shit
         | because they were stitched together using AI and
         | photogrammetry, you've used our data too.
         | 
          | We have to use hard drives because SSDs would be space- and
          | most certainly cost-prohibitive.
         | 
         | We stream 800GiB-2TiB files each representing a complete stripe
         | or circular orbit to GPU-equipped processing servers. Files are
         | incompressible (the cosmic microwave background, the bulk of
         | what we capture, tends to be a little random) and when I
         | started I held on to the delusion that I could halve the
         | infrastructure by writing to tape until I found out that tape
         | capacities were calculated for the storage of gigabyte-sized
         | text files of all zeros (or so it seems) that can be compressed
         | down to nothing.
         | 
         | GPUs are too slow. CPUs are too slow. PCIe busses are too slow.
         | RAM is too slow. My typing speed is too slow. Everything needs
         | to be faster all of the time.
         | 
         | Everything is too slow, too hard, and too small. Hard drives
         | are too small. Tuning the linux kernel and setting up fast and
         | reliable networking to the processing clusters is too hard.
         | Kernel and package updates that aren't even bug fixes but just
         | changes in the way that something works internally that are
         | transparent to all users except for us break things. Networks
         | are too slow. Things exist in this fantasy world where RAM is
         | scarce so out-of-the-box settings are configured to not hog
         | memory for network operations. No. I've got a half a terabyte
         | of RAM in this file server use ALL OF IT to make the network
         | and filesystem go faster, please. Time to spend six hours
         | reading the documentation for every portion of the network
         | stack to increase the I/O to 2024-levels of sanity.
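          | 
          | For flavour, it ends up being the usual knobs cranked way past
          | their defaults, something like (illustrative values, not our
          | exact config):
          | 
          |     # /etc/sysctl.conf fragment: let the file server actually
          |     # use its RAM for socket buffers and dirty page cache
          |     net.core.rmem_max = 536870912
          |     net.core.wmem_max = 536870912
          |     net.ipv4.tcp_rmem = 4096 87380 536870912
          |     net.ipv4.tcp_wmem = 4096 65536 536870912
          |     net.core.netdev_max_backlog = 250000
          |     vm.dirty_background_ratio = 10
          |     vm.dirty_ratio = 40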
         | 
         | I probably know more about sysctl.conf than almost every other
         | human being on earth.
         | 
         | Distributed persistent object storage systems for people who
         | think they are doing big data but really aren't either
         | completely fall apart under our workload or cost hundreds of
         | millions of dollars-- which we don't have. When I tell all of
         | the distributed filesystem salespeople that our objects are
         | roughly a terabyte in size they stop replying to my emails.
         | More than one vendor has referred me to their intelligence
         | community customer service representative upon reading my
         | requirements. I am not the NSA, buddy, and we don't have NSA
         | money.
         | 
         | Every once in a while we get a new MBA or PMP who read a
         | Bloomberg article about the cloud and asks about moving to AWS
         | or Azure after they see the costs of our on-premises
         | datacenter. When I show them the numbers, in terms of both
         | money and time, they throw up in their mouths and change the
         | subject.
         | 
         | To top it all off all of our vendors are jumping on the
         | AI/cloud bandwagon and discontinuing product lines applicable
         | to us.
         | 
         | And now I've got to compete for GPUs with hedge funds and AI
         | startups trying to figure out how to use a LLM to harvest
         | customer data and use it to show them ads.
         | 
         | I do not have enough storage or compute, and the storage and
         | compute I do have is too slow.
         | 
         | DPUs/IPUs look interesting but fall on their face when an
         | object is larger than a SQL database query or compressed
         | streaming video chunk.
        
       | gbin wrote:
       | IMHO the main driver for big data was company founders egos. Of
       | course your company will explode and will be a planet scale
        | success!! We need to design for scale! This is really a tragic
        | mistake when your product only needs one SQLite DB until you
        | reach Series C... All the energy should be focused on the
        | product, not its scale yet.
        
         | antupis wrote:
          | Well, generally yes, although there are a couple of exceptions
          | like IoT and GIS stuff where it is very common to see 10TB+
          | datasets.
        
         | threeseed wrote:
         | No. Big data was driven by people who had big data problems.
         | 
         | It started with Hadoop which was inspired by what existed at
         | Google and became popular in enterprises all around the world
         | who wanted a cheaper/better way to deal with their data than
         | Oracle.
         | 
          | Spark came about as a solution to the complexity of Hive/Pig
          | etc. And then once companies were able to build reliable data
          | pipelines, we started to see AI layered on top.
        
         | jandrewrogers wrote:
         | It depends on the kind of data you work with. Many kinds of
         | important data models -- geospatial, sensing, telemetry, et al
         | -- can hit petabyte volumes at "hello world".
         | 
         | Data models generated by intentional human action e.g. clicking
         | a link, sending a message, buying something, etc are
         | universally small. There is a limit on the number of humans and
         | the number of intentional events they can generate per second
         | regardless of data model.
         | 
         | Data models generated by machines, on the other hand, can be
         | several orders of magnitude higher velocity and higher volume,
         | and the data model size is unbounded. These are often some of
         | the most interesting and under-utilized data models that exist
         | because they can get at many facts about the world that are not
         | obtainable from the intentional human data models.
        
       | tobilg wrote:
        | I have been witnessing the overengineering of "big" data tools
        | and pipelines for many years... For a lot of use cases, data
        | warehouses and data lakes are only in the gigabytes or single-
        | digit terabytes range, so their architecture could be much
        | simpler, e.g. running DuckDB on a decent EC2 instance.
       | 
       | In my experience, doing this will yield the query results faster
       | than some other systems even starting the query execution (yes,
       | I'm looking at you Athena)...
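        | 
        | The whole "warehouse" can then literally be a few lines (sketch;
        | file names and columns made up):
        | 
        |     import duckdb
        | 
        |     con = duckdb.connect("warehouse.db")  # one file on the box
        |     top = con.execute("""
        |         SELECT customer_id, sum(amount) AS total
        |         FROM read_parquet('events/*.parquet')
        |         GROUP BY customer_id
        |         ORDER BY total DESC
        |         LIMIT 10
        |     """).fetchall()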
       | 
       | I even think that a lot of queries can be run from a browser
       | nowadays, that's why I created https://sql-workbench.com/ with
       | the help of DuckDB WASM (https://github.com/duckdb/duckdb-wasm)
       | and perspective.js (https://github.com/finos/perspective).
        
       | geertj wrote:
       | I agree with the article that most data sets comfortably fit into
       | a single traditional DB system. But I don't think that implies
       | that big data is dead. To me big data is about storing data in a
       | columnar storage format with a weak schema, and using a query
       | system based on partitioning and predicate push down instead of
       | indexes. This allows the data to be used in an ad-hoc way by data
       | science or other engineers to answer questions you did not have
       | when you designed the system. Most setups would be relatively
       | small, but could be made to scale relatively well using this
       | architecture.
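        | 
        | A sketch of that pattern with pyarrow (hive-partitioned parquet;
        | the path, partition key and columns are made up):
        | 
        |     import pyarrow.dataset as ds
        | 
        |     # weak schema: whatever columns the parquet files carry
        |     events = ds.dataset("lake/events", format="parquet",
        |                         partitioning="hive")  # .../dt=2024-05-01
        | 
        |     # partition pruning + predicate pushdown instead of indexes
        |     table = events.to_table(
        |         columns=["user_id", "amount"],
        |         filter=(ds.field("dt") == "2024-05-01")
        |                & (ds.field("amount") > 100))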
        
       | Shrezzing wrote:
        | This is quite a good allegory for the way AI is currently
        | discussed (perhaps the outcome will be different this time
        | round). Particularly the scary slide[1] with the up-and-to-the-
       | right graph, which is used in a near identical fashion today to
       | show an apparently inevitable march of progress in the AI space
       | due to scaling laws.
       | 
        | [1]https://motherduck.com/_next/image/?url=https%3A%2F%2Fweb-
       | as...
        
       | teleforce wrote:
        | Not dead, it's just having its winter, not unlike the AI winter,
        | and once it has a similar "chatbot" moment, all will be well.
       | 
        | My take on the killer application is climate change, for
        | example earthquake monitoring. As a case study, China has just
        | finished building the world's largest earthquake monitoring
        | system, at a cost of around USD 1 billion, with 15K stations
        | across the country [1]. At the moment it only monitors
        | existing earthquakes. But let's say there were a big data
        | analytics technique that could reliably predict an impending
        | earthquake within a few days; that could probably save many
        | people, and China still holds the records for the largest
        | mortality and casualty numbers due to earthquakes. Is it
        | probable? Based on our work and initial results the answer is
        | a positive yes, it's already practical, but in order to do
        | that we need integration with comprehensive in-situ IoT
        | networks with regular and frequent data sampling similar to
        | that of China.
       | 
        | Secondly, China also has some of the largest radio astronomy
        | telescopes, and these telescopes, together with other radio
        | telescopes, collaborate in real-time through e-VLBI to form a
        | virtual giant radio telescope as big as the Earth to monitor
        | distant stars and galaxies. This is how the black hole got its
        | first image, but at the time, due to logistics, one of the
        | telescope's remote disks could not be shipped to the main
        | processing centers in the US [2]. At that moment they were not
        | using real-time e-VLBI, only VLBI, and it took them several
        | months just to get the complete set of black hole observation
        | data. With e-VLBI everything is real-time, and with automatic
        | processing it would be hours instead of months. These radio
        | telescopes can also be used for other purposes, like monitoring
        | climate change in addition to imaging black holes; their data
        | is astronomical, pardon the pun [3].
       | 
       | [1] Chinese Nationwide Earthquake Early Warning System and Its
       | Performance in the 2022 Lushan M6.1 Earthquake:
       | 
       | https://www.mdpi.com/2072-4292/14/17/4269
       | 
       | [2] How Scientists Captured the First Image of a Black Hole:
       | 
       | https://www.jpl.nasa.gov/edu/news/2019/4/19/how-scientists-c...
       | 
       | [3] Alarmed by Climate Change, Astronomers Train Their Sights on
       | Earth:
       | 
       | https://www.nytimes.com/2024/05/14/science/astronomy-climate...
        
         | Shrezzing wrote:
          | I think these examples still loosely fit the author's
          | argument:
         | 
         | > There are some cases where big data is very useful. The
         | number of situations where it is useful is limited
         | 
          | Even though there are some great use-cases, the overwhelming
          | majority of organisations, institutions, and projects will never
          | have a "let's query ten petabytes" scenario that forces them
          | away from platforms like Postgres.
         | 
         | Most datasets, even at very large companies, fit comfortably
         | into RAM on a server - which is now cost-effective, even in the
         | _dozens of terabytes_.
        
       | kmarc wrote:
       | When I was hiring data scientists for a previous job, my favorite
       | tricky question was "what stack/architecture would you build"
        | with the somewhat detailed requirements of "6 TiB of data" in
        | sight. I was careful not to require overly complicated sums; I
        | simply said it's MAX 6TiB.
       | 
        | I patiently listened to all the big query hadoop habla-blabla,
        | even asked questions about the financials
        | (hardware/software/license BOM), and many of them came up with
        | astonishing figures of tens of thousands of dollars yearly.
       | 
        | The winner of course was the guy who understood that 6TiB is what
        | 6 of us in the room could store on our smart phones, or on a $199
        | enterprise HDD (or three of them for redundancy), and that it
        | could be loaded (multiple times) into memory as CSV and have
        | simple awk scripts run over it.
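        | 
        | The spirit of the winning answer, give or take -- same idea in
        | Python instead of awk (toy sketch; column names invented):
        | 
        |     import csv, collections, sys
        | 
        |     totals = collections.Counter()
        |     with open(sys.argv[1], newline="") as f:  # stream the CSV
        |         for row in csv.DictReader(f):
        |             totals[row["customer_id"]] += float(row["amount"])
        | 
        |     for cust, amount in totals.most_common(10):
        |         print(cust, amount)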
       | 
       | I am prone to the same fallacy: when I learn how to use a hammer,
       | everything looks like a nail. Yet, not understanding the scale of
       | "real" big data was a no-go in my eyes when hiring.
        
         | geraldwhen wrote:
         | I ask a similar question on screens. Almost no one gives a good
         | answer. They describe elaborate architectures for data that
         | fits in memory, handily.
        
           | mcny wrote:
           | I think that's the way we were taught in college / grad
           | school. If the premise of the class is relational databases,
           | the professor says, for the purpose of this course, assume
           | the data does not fit in memory. Additionally, assume that
           | some normalization is necessary and a hard requirement.
           | 
           | The problem is that most students don't listen to the first
           | part, "for the purpose of this course". The professor does
           | not elaborate because that is beyond the scope of the
           | course.
        
             | kmarc wrote:
             | FWIW if they were juniors, I would've continued the
             | interview, directed them with further questions, and
             | observed their flow of thinking to decide whether they
             | were good candidates to pursue further.
             | 
             | But no, this particular person had been working
             | professionally for decades (in fact, he was much older than
             | me).
        
               | geraldwhen wrote:
               | Yeah. I don't even bother asking juniors this. At that
               | level I expect that training will be part of the job, so
               | it's not a useful screener.
        
             | acomjean wrote:
             | I took a Hadoop class. We learned Hadoop and were told by
             | the instructor we probably wouldn't need it, and learned
             | some other Java processing techniques (streams, etc.).
        
           | Joel_Mckay wrote:
           | People can always find excuses to boot candidates.
           | 
           | I would just back-track from a shipped product date, and try
           | to guess who we needed to get there... given the scope of
           | requirements.
           | 
           | Generally, process people from a commercially
           | "institutionalized" role are useless for solving unknown
           | challenges. They will leave something like an SAP, C#, or
           | MatLab steaming pile right in the middle of the IT ecosystem.
           | 
           | One could check out Aerospike rather than try to write
           | their own version (the dynamic scaling capabilities are
           | very economical once set up right).
           | 
           | Best of luck, =3
        
         | boppo1 wrote:
         | You have 6 TiB of ram?
        
           | ninkendo wrote:
           | You don't need that much ram to use mmap(2)
        
             | marginalia_nu wrote:
             | To be fair, mmap doesn't put your data in RAM, it presents
             | it as though it was in RAM and has the OS deal with whether
             | or not it actually is.
        
               | ninkendo wrote:
               | Right, which is why you can mmap way more data than you
               | have ram, and treat it as though you do have that much
               | ram.
               | 
               | It'll be slower, perhaps by a lot, but most "big data"
               | stuff is already so god damned slow that mmap probably
               | still beats it, while being immeasurably simpler and
               | cheaper.
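               |
               | A minimal sketch of that pattern in Python (the
               | file name is made up): the file is mapped, not
               | read up front, so it can be far larger than
               | physical RAM.
               |
               |   import mmap
               |
               |   with open("huge.csv", "rb") as f:
               |       mm = mmap.mmap(f.fileno(), 0,
               |                      access=mmap.ACCESS_READ)
               |       # the OS pages data in on demand
               |       lines = 0
               |       while mm.readline():
               |           lines += 1
               |       print(lines)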
        
           | cess11 wrote:
           | The "(multiple times)" part probably means batching or
           | streaming.
           | 
           | But yeah, they might have that much RAM. At a rather small
           | company I was at we had a third of it in the virtualisation
           | cluster. I routinely put customer databases in the hundreds
           | of gigabytes into RAM to do bug triage and fixing.
        
             | kmarc wrote:
             | Indeed, what I meant to say is that you can load it in
             | multiple batches. However, now that I think about it, I
             | did play around with servers with TiBs of memory :-)
        
           | vitus wrote:
           | If you're one of the public clouds targeting SAP use cases,
           | you probably have some machines with 12TB [0, 1, 2].
           | 
           | [0] https://aws.amazon.com/blogs/aws/now-available-amazon-
           | ec2-hi...
           | 
           | [1] https://cloud.google.com/blog/products/sap-google-
           | cloud/anno...
           | 
           | [2] https://azure.microsoft.com/en-us/updates/azure-
           | mv2-series-v...
        
           | qaq wrote:
           | You can have 8TB of RAM in a 2U box for under $100K. Grab a
           | couple and it will save you millions a year compared to an
           | over-engineered big data setup.
        
             | apwell23 wrote:
             | BigQuery and Snowflake are software. They come with a SQL
             | engine, data governance, integration with your LDAP, and
             | auditing. Loading data into Snowflake isn't
             | over-engineering. What you described is over-engineering.
             |
             | No business is passing 6TB of data around on their
             | laptops.
        
               | qaq wrote:
               | So is ClickHouse; your point being? Please point
               | out what a server being able to have 8TB of RAM
               | has to do with laptops.
        
             | int_19h wrote:
             | I wonder how much this costs:
             | https://www.ibm.com/products/power-e1080
             | 
             | And how that price would compare to the equivalent big data
             | solution in the cloud.
        
           | chx wrote:
           | If my business depended on it? I can click a few buttons
           | and have an 8 TiB Supermicro server on my doorstep in a few
           | days if I wanted to colo that. EC2 High Memory instances
           | offer 3, 6, 9, 12, 18, and 24 TiB of memory in an instance
           | if that's the kind of service you want. Azure Mv2 also does
           | 2850 - 11400 GiB.
           |
           | So yes, if need be, I have 6 TiB of RAM.
        
           | david_allison wrote:
           | https://yourdatafitsinram.net/
        
             | compressedgas wrote:
             | Was posted as https://news.ycombinator.com/item?id=9581862
             | in 2015
        
           | bluedino wrote:
           | We are decomming our 5-year-old 4TB systems this year, and
           | they could have been ordered with more.
        
           | lizknope wrote:
           | I personally don't, but our computer cluster at work has
           | around 50,000 CPU cores. I can request specific
           | configurations through LSF, and there are at least 100
           | machines with over 4TB of RAM - and that was 3 years ago.
           | By now there are probably machines with more than that.
           | Those machines are usually reserved for specific tasks that
           | I don't do, but if I really needed one I could get
           | approval.
        
         | sfilipco wrote:
         | I agree that keeping data local is great and should be the
         | first option when possible. It works great at 10GB or even
         | 100GB, but after that it starts to matter what you optimize
         | for, because you start seeing execution bottlenecks.
         |
         | To mitigate these bottlenecks you get fancy hardware (e.g. an
         | Oracle appliance) or you scale out (and get TCO/performance
         | gains from separating storage and compute - which is how
         | Snowflake sold 3x cheaper compared to appliances when they
         | came out).
         |
         | I believe that Trino on HDFS would be able to finish faster
         | than awk on 6 enterprise disks for 6TB of data.
         |
         | In conclusion, I would say that we should keep data local if
         | possible, but 6TB is getting into the realm where Big Data
         | tech starts to be useful if you do it a lot.
        
           | nottorp wrote:
           | > I agree that keeping data local is great and should be
           | the first option when possible. It works great at 10GB or
           | even 100GB, but after that it starts to matter what you
           | optimize for, because you start seeing execution
           | bottlenecks.
           |
           | The point of the article is that 99.99% of businesses never
           | pass even the 10 GB point, though.
        
             | sfilipco wrote:
             | I agree with the theme of the article. My reply was to
             | parent comment which has a 6 TB working set.
        
           | hectormalot wrote:
           | I wouldn't underestimate how much a modern machine with a
           | bunch of RAM and SSDs can do vs HDFS. This post[1] is now 10
           | years old and has find + awk running an analysis in 12
           | seconds (at speed roughly equal to his hard drive) vs Hadoop
           | taking 26 minutes. I've had similar experiences with much
           | bigger datasets at work (think years of per-second
           | manufacturing data across 10ks of sensors).
           | 
           | I get that that post is only on 3.5GB, but consumer SSDs
           | are now much faster, at 7.5GB/s vs the 270MB/s HDD from
           | back when the article was written. Even with only mildly
           | optimised solutions, people are churning through the 1
           | billion row (~12GB) challenge in seconds as well. And if
           | you have the data in memory (not impossible), your
           | bottleneck won't even be reading speed.
           | 
           | [1]: https://adamdrake.com/command-line-tools-can-
           | be-235x-faster-...
        
         | pdimitar wrote:
         | Blows my mind. I am a backend programmer and a semi-decent
         | sysadmin and I would have immediately told you: "make a ZFS or
         | BCacheFS pool with 20-30% redundancy bits and just go wild with
         | CLI programs, I know dozens that work on CSV and XML, what's
         | the problem?".
         | 
         | And I am not a specialized data scientist. But over time I
         | have been wondering if such a thing even exists... being a
         | good backender / sysadmin and knowing a lot of CLI tools has
         | always seemed to do the job for me just fine (though granted,
         | I never actually managed a data lake, so I am likely
         | over-simplifying it).
        
           | WesolyKubeczek wrote:
           | > just go wild with CLI programs, I know dozens that work on
           | CSV and XML
           | 
           | ...or put it into SQLite for extra blazing fastness! No
           | kidding.
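           |
           | A minimal sketch of that, using Python's bundled sqlite3
           | module (the file, table, and column names are made up):
           |
           |   import csv, sqlite3
           |
           |   db = sqlite3.connect("events.db")
           |   db.execute("CREATE TABLE IF NOT EXISTS events"
           |              "(country TEXT, amount REAL)")
           |   with open("events.csv", newline="") as f:
           |       rows = ((r["country"], float(r["amount"]))
           |               for r in csv.DictReader(f))
           |       db.executemany(
           |           "INSERT INTO events VALUES (?, ?)", rows)
           |   db.commit()
           |
           |   for row in db.execute(
           |           "SELECT country, SUM(amount) FROM events "
           |           "GROUP BY country ORDER BY 2 DESC LIMIT 10"):
           |       print(row)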
        
             | pdimitar wrote:
             | That's included in CLI tools. Also duckdb and clickhouse-
             | local are amazing.
        
               | WesolyKubeczek wrote:
               | I need to learn more about the latter for some log
               | processing...
        
               | fijiaarone wrote:
               | Log files aren't data. That's your first problem. But
               | that's the only thing that most people have that
               | generates more bytes than can fit on screen in a single
               | spreadsheet.
        
               | thfuran wrote:
               | Of course they are. They just aren't always structured
               | nicely.
        
               | WesolyKubeczek wrote:
               | Everything is data if you are brave enough.
        
               | c0brac0bra wrote:
               | clickhouse-local has been astonishingly fast for
               | operating on many GB of local CSVs.
               | 
               | I had a heck of a time running the server locally before
               | I discovered the CLI.
        
           | nevi-me wrote:
           | To be fair to candidates, CLI programs create technical
           | debt the moment they're written.
           | 
           | A good answer that strikes a balance between size of data,
           | latency and frequency requirements is a candidate who is able
           | to show that they can choose the right tool that the next
           | person will be comfortable with.
        
             | pdimitar wrote:
             | True on the premise, yep, though I'm not sure how using
             | CLI programs like LEGO blocks creates tech debt?
        
               | ImPostingOnHN wrote:
               | I remember replacing a CLI program built like LEGO
               | blocks. It was 90-100 LEGO blocks, written over the
               | course of decades, in Cobol, Fortran, C, Java, Bash,
               | and Perl, and the LEGOs "connected" with environment
               | variables. Nobody wanted to touch it lest they break
               | it. Sometimes it's possible to do things too smartly.
               | Apache Spark runs locally (and via CLI).
        
               | pdimitar wrote:
               | No no, I didn't mean that at all. I meant a script using
               | well-known CLI programs.
               | 
               | Obviously organically grown Frankenstein programs are a
               | huge liability, I think every reasonable techie agrees on
               | that.
        
               | actionfromafar wrote:
               | Well your little CLI-query is suddenly in production and
               | then... it easily escalates.
        
               | pdimitar wrote:
               | I already said I never managed a data lake and simply got
               | stuff when it was needed but if you need to criticize
               | then by all means, go wild.
        
             | __MatrixMan__ wrote:
             | True but it's typically less debt than anything involving a
             | gui, pricetag, or separate server.
        
             | citizen_friend wrote:
             | Configuring debugged, optimized software with a shell
             | script is orders of magnitude cheaper than developing
             | novel software.
        
           | ImPostingOnHN wrote:
           | _> But with time I am wondering if such a thing even exists_
           | 
           | Check out "data science at the command line":
           | 
           | https://jeroenjanssens.com/dsatcl/
        
           | apwell23 wrote:
           | > make a ZFS or BCacheFS pool with 20-30% redundancy bits and
           | just go wild with CLI programs
           | 
           | Lol. Data management is about safety, auditability, access
           | control, knowledge sharing, and a whole bunch of other
           | stuff. I would've immediately shown you the door as someone
           | I cannot trust data with.
        
             | zaphar wrote:
             | What about his answer prevents any of that? As stated the
             | question didn't require any of what you outline here. ZFS
             | will probably do a better job of protecting your data than
             | almost any other filesystem out there so it's not a bad
             | foundation to start with if you want to protect data.
             | 
             | Your entire post reeks of "I'm smarter than you" smugness
             | while at the same time revealing no useful information or
             | approaches. Near as I can tell no one should trust you with
             | any data.
        
               | apwell23 wrote:
               | > Your entire post reeks of "I'm smarter than you"
               | 
               | unlike "blows my mind" ?
               | 
               | > As stated the question didn't require any of what you
               | outline here.
               | 
               | Right. OP mentioned it was "tricky question" . What makes
               | it tricky is that all those attributes are implicitly
               | assumed. I wouldn't interview at google and tell them my
               | "stack" is "load it on your laptop". I would never say
               | that in an interview even if I think that's the right
               | "stack" .
        
               | zaphar wrote:
               | "blows my mind" is similar in tone yes. But I wasn't
               | replying to the OP. Further the OP actually goes into
               | some detail about how he would approach the problem. You
               | do not.
               | 
               | You are assuming you know what the OP meant by tricky
               | question. And your assumption contradicts the rest of the
               | OP's post regarding what he considered good answers to
               | the question and why.
        
               | pdimitar wrote:
               | Honest question: was "blows my mind" so offensive? I
               | thought it was quite obvious I meant "it blows my
               | mind that people don't try the simpler stuff first,
               | especially given that it works for a much bigger
               | percentage of cases than cloud providers would have
               | you believe".
               |
               | I guess it wasn't, but even if so, it would be
               | legitimately baffling how people manage to project
               | so much negativity onto three words that are a
               | slightly tongue-in-cheek casual comment on the state
               | of affairs in an area whose value is not always
               | clear (in my observation, a dedicated data team only
               | starts to pay off after you have 20+ data sources;
               | I've been in teams of only 3-4 devs and we still
               | managed to have 15-ish data dashboards for the
               | executives without too much cursing).
               |
               | An anecdote, surely, but what isn't?
        
               | zaphar wrote:
               | I generally don't find that sort of thing offensive when
               | combined with useful alternative approaches like your
               | post provided. However the phrase does come with a
               | connotation that you are surprised by a lack of knowledge
               | or skill in others. That can be taken as smug or elitist
               | by someone in the wrong frame of mind.
        
               | pdimitar wrote:
               | Thank you, that's helpful.
        
             | pdimitar wrote:
             | I already qualified my statement quite well by stating my
             | background but if it makes you feel better then sure, show
             | me the door. :)
             | 
             | I was never a data scientist, just a guy who helped
             | whenever it was necessary.
        
               | apwell23 wrote:
               | > I already qualified my statement quite well by stating
               | my background
               | 
               | No. You qualified it with "blows my mind". Why would
               | it "blow your mind" if you don't have any data
               | background?
        
               | zaphar wrote:
               | He didn't say he didn't have any data background. He's
               | clearly worked with data on several occasions as needed.
        
               | pdimitar wrote:
               | Are you trolling? Did you miss the part where I said I
               | worked with data but wouldn't say I'm a professional data
               | scientist?
               | 
               | This negative cherry picking does not do your image any
               | favors.
        
             | koverstreet wrote:
             | this is how you know when someone takes themself too
             | seriously
             | 
             | buddy, you're just rolling off buzzwords and lording it
             | over other people
        
               | apwell23 wrote:
               | buddy, you suffer from NIH syndrome, upset that no
               | one wants your 'hacks'.
        
             | photonthug wrote:
             | > Lol. Data management is about safety, auditability,
             | access control, knowledge sharing, and a whole bunch of
             | other stuff. I would've immediately shown you the door
             | as someone I cannot trust data with.
             | 
             | No need to act smug and superior, especially since nothing
             | about OP's plan here _actually precludes_ having all the
             | nice things you mentioned, or even having them inside
             | $your_favorite_enterprise_environment.
             | 
             | You risk coming across as a person who feels threatened by
             | simple solutions, perhaps someone who wants to spend $500k
             | in vendor subscriptions every year for simple and/or
             | imaginary problems... exactly the type of thing TFA talks
             | about.
             | 
             | But I'll ask the question... _why_ do you think safety,
             | auditability, access control, and knowledge sharing are
             | incompatible with CLI tools and a specific choice of
             | file system? What's your preferred alternative? Are you
             | sticking with that alternative regardless of how often
             | the workload runs, how often it changes, and whether the
             | data fits in memory or requires a cluster?
        
               | apwell23 wrote:
               | > No need to act smug and superior
               | 
               | I responded with the same tone that the GP responded
               | with: "blows my mind" (that people can be so stupid).
        
               | photonthug wrote:
               | Another comment mentions this classic meme:
               | 
               | > Consulting service: you bring your big data problems to
               | me, I say "your data set fits in RAM", you pay me $10,000
               | for saving you $500,000.
               | 
               | A lot of industry work really does fall into this
               | category, and it's not controversial to say that going
               | the wrong way on this thing is mind-blowing. More than
               | not being controversial, it's not _confrontational_ ,
               | because his comment was essentially re: the industry,
               | whereas your comment is directed at a person.
               | 
               | Drive-by sniping where it's obvious you don't even
               | care to debate the tech itself might get you a few
               | "sick burn, bro" back-slaps from certain crowds, or
               | the FUD approach might get traction with some in
               | management, but overall it's not worth it. You don't
               | sound smart or even professional, just nervous and
               | afraid of every approach that you're not already
               | intimately familiar with.
        
               | apwell23 wrote:
               | I repurposed the parent comment:
               |
               | "not understanding the scale of 'real' big data was a
               | no-go in my eyes when hiring", "real winner", etc.
               |
               | But yeah, you are right. I shouldn't have directed it
               | at the commenter. I was miffed at interviewers who
               | use "tricky questions" and expect people to read
               | their minds and come up with their preconceived
               | solution.
        
               | pdimitar wrote:
               | The classic putting words in people's mouths technique it
               | is then. The good old straw man.
               | 
               | If you really must know: I said "blows my mind [that
               | people don't try simpler and proven solutions FIRST]".
               | 
               | I don't know what you have to gain by coming here
               | and pretending to be in my head. Now here's another
               | thing that blows my mind.
        
               | apwell23 wrote:
               | > that people don't try simpler and proven solutions
               | FIRST
               | 
               | Well, why don't people do that, according to you?
               |
               | It's not 'mind blowing' to me, because you can never
               | guess what angle the interviewer is coming at you
               | from. Especially when they use words like 'data
               | stack'.
        
               | pdimitar wrote:
               | I don't know why, and that is why I said it's
               | mind-blowing. To me, trying stuff that can work on
               | most laptops comes naturally as the first viable
               | solution.
               |
               | As for interviews, sure, they have all sorts of
               | traps. It really depends on the format and the role.
               | Since I already disclaimed that I am not an actual
               | data scientist, just a seasoned dev who can make
               | some magic happen without a dedicated data team
               | (if/when the need arises), I wouldn't even be in a
               | data scientist interview in the first place.
               | ¯\_(ツ)_/¯
        
               | apwell23 wrote:
               | That's fair. My comment wasn't directed at you. I
               | was trying to be smart and write an inverse of the
               | original comment, where I as an interviewer was
               | looking for a proper 'data stack' and the
               | interviewee responded with a bespoke solution.
               |
               | "not understanding the scale of 'real' big data was
               | a no-go in my eyes when hiring."
        
               | pdimitar wrote:
               | Sure, okay, I get it. My point was more like "have
               | you tried this obvious thing first, which a lot of
               | devs can do for you without too much hassle?". If I
               | were to try for a dedicated data scientist position
               | then I'd have done my homework.
        
               | StrLght wrote:
               | > you can never guess what angle interviewer is coming at
               | you
               | 
               | Why would you _guess_ in that situation though?
               | 
               | It's an interview, there's at least 1 person talking to
               | you -- you should talk to them, ask them questions, share
               | your thoughts. If you talking to them is a red flag, then
               | high chances that you wouldn't want to work there anyway.
        
               | HelloNurse wrote:
               | Abstractly, "safety, auditability, access control,
               | knowledge sharing" are about people reading and
               | writing files: simplifying away complicated
               | management systems improves security. The operating
               | system should be good enough.
        
             | apwell23 wrote:
             | Edit: for the above comment.
             |
             | My comment wasn't directed at the parent. I was trying
             | to be smart and write an inverse of the original
             | comment - the opposite scenario, where I as an
             | interviewer was looking for a proper 'data stack' and
             | the interviewee responded with a bespoke solution.
             |
             | "not understanding the scale of 'real' big data was a
             | no-go in my eyes when hiring."
             |
             | I was trying to point out that you can never know where
             | the interviewer is coming from. Unless I know the
             | interviewer personally, I would bias towards playing it
             | safe and going with an 'enterprisey stack'.
        
         | wslh wrote:
         | In my context, 99% of the problem is the ETL, nothing to do
         | with complex technology. I see people get stuck when they
         | need to pull data from different sources in different
         | technologies and/or APIs.
        
         | mattbillenstein wrote:
         | I can appreciate the vertical scaling solution, but to be
         | honest, this is the wrong solution for almost all use cases -
         | consumers of the data don't want awk, and even if they did,
         | spooling over 6TB for every kind of query without
         | partitioning or column storage is gonna be slow on a single
         | CPU - always.
         | 
         | I've generally liked BigQuery for this type of stuff - the
         | console interface is good enough for ad-hoc stuff, you can
         | connect a plethora of other tooling to it (Metabase, Tableau,
         | etc). And if partitioned correctly, it shouldn't be too
         | expensive - add in rollup tables if that becomes a problem.
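         |
         | A minimal sketch of such a partitioned query with the
         | google-cloud-bigquery client (the project, dataset, and
         | column names here are made up):
         |
         |   from google.cloud import bigquery
         |
         |   client = bigquery.Client()
         |   sql = """
         |       SELECT user_id, COUNT(*) AS n
         |       FROM `my_project.analytics.events`
         |       -- assumes an ingestion-time partitioned table
         |       WHERE _PARTITIONDATE = DATE '2024-05-01'
         |       GROUP BY user_id
         |       ORDER BY n DESC
         |       LIMIT 100
         |   """
         |   for row in client.query(sql).result():
         |       print(row.user_id, row.n)
         |
         | Filtering on the partition column keeps the scanned (and
         | billed) bytes down to a single day's worth of data.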
        
           | __alexs wrote:
           | A moderately powerful desktop processor has memory
           | bandwidth of over 50GB/s, so yeah, it'll take a couple of
           | minutes, sure.
        
             | fijiaarone wrote:
             | The slow part of using awk is waiting for the disk to spin
             | over the magnetic head.
             | 
             | And most laptops have 4 CPU cores these days, and a
             | multiprocess operating system, so you don't have to wait
             | for random access on a spinning platter to find every
             | bit in order; you can simply have multiple awk commands
             | running in parallel.
             | 
             | Awk is most certainly a better user interface than whatever
             | custom BrandQL you have to use in a textarea in a browser
             | served from localhost:randomport
        
               | Androider wrote:
               | > The slow part of using awk is waiting for the disk to
               | spin over the magnetic head.
               | 
               | If we're talking about 6 TB of data:
               | 
               | - You can upgrade to 8 TB of storage on a 16-inch MacBook
               | Pro for $2,200, and the _lowest_ spec has 12 CPU cores.
               | With up to 400 GB /s of memory bandwidth, it's truly a
               | case of "your big data problem easily fits on my laptop".
               | 
               | - Contemporary motherboards have 4 to 5 M.2 slots, so you
               | could today build a 12 TB RAID 5 setup of 4 TB Samsung
               | 990 PRO NVMe drives for ~ 4 x $326 = $1,304. Probably in
               | a year or two there will be 8 TB NVMe's readily
               | available.
               | 
               | Flash memory is cheap in 2024!
        
               | bewaretheirs wrote:
               | You can go further.
               | 
               | There are relatively cheap adapter boards which let you
               | stick 4 M.2 drives in a single PCIe x16 slot; you can
               | usually configure a x16 slot to be bifurcated
               | (quadfurcated) as 4 x (x4).
               | 
               | To pick a motherboard at quasi-random:
               | 
               | Tyan HX S8050. Two M.2 on the motherboard.
               | 
               | 20 M.2 drives in quadfurcated adapter cards in the 5 PCIe
               | x16 slots
               | 
               | And you can connect another 6 NVMe x4 devices to the MCIO
               | ports.
               | 
               | You might also be able to hook up another 2 to the
               | SFF-8643 connectors.
               | 
               | This gives you a grand total of 28-30 x4 NVME devices on
               | one not particularly exotic motherboard, using most of
               | the 128 regular PCIe lanes available from the CPU socket.
        
               | hnfong wrote:
               | I haven't been using spinning disks for perf critical
               | tasks for a looong time... but if I recall correctly,
               | using multiple processes to access the data is usually
               | counter-productive since the disk has to keep
               | repositioning its read heads to serve the different
               | processes reading from different positions.
               | 
               | Ideally if the data is laid out optimally on the spinning
               | disk, a single process reading the data would result in a
               | mostly-sequential read with much less time wasted on read
               | head repositioning seeks.
               | 
               | In the odd case where the HDD throughput is greater than
               | a single-threaded CPU processing for whatever reason (eg.
               | you're using a slow language and complicated processing
               | logic?), you can use one optimized process to just read
               | the raw data, and distribute the CPU processing to some
               | other worker pool.
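               |
               | A minimal sketch of that split in Python (the file
               | name and the per-record work are made up):
               |
               |   import multiprocessing as mp
               |   from itertools import islice
               |
               |   def crunch(batch):
               |       # stand-in for CPU-heavy per-record work
               |       return sum(len(line) for line in batch)
               |
               |   def batches(path, n=100_000):
               |       # the only reader: one sequential scan
               |       with open(path) as f:
               |           while True:
               |               b = list(islice(f, n))
               |               if not b:
               |                   return
               |               yield b
               |
               |   if __name__ == "__main__":
               |       with mp.Pool() as pool:
               |           parts = pool.imap_unordered(
               |               crunch, batches("big.log"))
               |           print(sum(parts))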
        
             | dahart wrote:
             | Running awk on an in-memory CSV will come nowhere even
             | close to the memory bandwidth your machine is capable of.
        
           | fifilura wrote:
           | I agree with this. BigQuery or AWS s3/Athena.
           | 
           | You shouldn't have to set up a cluster for data jobs these
           | days.
           | 
           | And it kind of points out the reason for going with a data
           | scientist with the toolset he has in mind instead of
           | optimizing for a commandline/embedded programmer.
           | 
           | The tools will evolve in the direction of the data scientist,
           | while the embedded approach is a dead end in lots of ways.
           | 
           | You may have outsmarted some of your candidates, but you
           | would have hired a person not suited for the job long term.
        
             | orhmeh09 wrote:
             | It is actually pretty easy to do the same type of
             | processing you would do on a cluster with AWS Batch.
        
           | kjkjadksj wrote:
           | He's hiring data scientists, not building a service,
           | though. This might realistically be a one-off analysis for
           | those 6TB. At which point you are happy your data scientist
           | has returned statistical information instead of spending
           | another week making sure the pipeline works if someone puts
           | a Greek character in a field.
        
             | data-ottawa wrote:
             | Even if I'm doing a one-off, depending on the task it
             | can be easier/faster/more reliable to load 6 TiB into a
             | BigQuery table than to wait hours for some task to
             | complete while fiddling with parallelism and memory
             | management.
             |
             | It's a couple hundred bucks a month and $36 to query the
             | entire dataset; after partitioning, that's not terrible.
        
               | nostrademons wrote:
               | A 6TB hard drive and Pandas will cost you a couple
               | hundred bucks, a one-time purchase, and then last
               | you for years (and several other data analysis
               | jobs). It also doesn't require that you be connected
               | to the Internet, doesn't require that you trust
               | 3rd-party services, and is often faster (even in
               | execution time) than spooling up BigQuery.
               | 
               | You can always save an intermediate data set partitioned
               | and massaged into whatever format makes subsequent
               | queries easy, but that's usually application-dependent,
               | and so you want that control over how you actually store
               | your intermediate results.
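               |
               | A minimal sketch of that workflow (the file and
               | column names are made up; pandas with pyarrow
               | installed is assumed):
               |
               |   import pandas as pd
               |
               |   chunks = pd.read_csv("events.csv",
               |                        chunksize=5_000_000)
               |   for i, df in enumerate(chunks):
               |       # massage here, then store each chunk in a
               |       # columnar format for cheap follow-up work
               |       df.to_parquet(f"events_{i:04d}.parquet")
               |
               | Later passes can read just the columns they need
               | from the Parquet files instead of re-parsing the
               | CSV.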
        
               | data-ottawa wrote:
               | I wouldn't make a purchase of either without knowing a
               | bit more about the lifecycle and requirements.
               | 
               | If you only needed this once, the BQ approach requires
               | very little setup and many places already have a billing
               | account. If this is recurring then you need to figure out
               | what the ownership plan of the hard drive is (what's it
               | connected to, who updates this computer, what happens
               | when it goes down, etc.).
        
           | pyrale wrote:
           | Once you understand that 6TB fits on a hard drive, you can
           | just as well put it in a run-of-the-mill Postgres instance,
           | which Metabase will reference just as easily. Hell,
           | Metabase is fine with even a CSV file...
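           |
           | A minimal sketch of bulk-loading a CSV into such an
           | instance with psycopg2 (the table is assumed to already
           | exist; names are made up):
           |
           |   import psycopg2
           |
           |   conn = psycopg2.connect("dbname=analytics")
           |   # the connection context manager commits on exit
           |   with conn, conn.cursor() as cur, \
           |        open("events.csv") as f:
           |       cur.copy_expert(
           |           "COPY events FROM STDIN WITH (FORMAT csv, "
           |           "HEADER true)", f)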
        
             | crowcroft wrote:
             | I worked at a large company that had a remote desktop
             | instance with 256GB of RAM running a PG instance that
             | analysts would log in to to do analysis. I used to think
             | it was a joke of a setup for such a large company.
             | 
             | I later moved to a company with a fairly sophisticated
             | setup with Databricks. While Databricks offered some QoL
             | improvements, it didn't magically make all my queries run
             | quickly, and it didn't allow me anything that I couldn't
             | have done on the remote desktop setup.
        
           | Stranger43 wrote:
           | And here we see this strange thing that data science
           | people do in forgetting that 6TB is small change for any
           | SQL server worth its salt.
           |
           | Just dump it into Oracle, Postgres, MSSQL, or MySQL and be
           | amazed by the kind of things you can do with 30-year-old
           | data analysis technology on a modern computer.
        
             | apwell23 wrote:
             | You wouldn't have been a 'winner' per the OP. The real
             | answer is loading it onto their phones, not into SQL
             | Server or whatever.
        
               | Stranger43 wrote:
               | To be honest, the OP is kind of making the same
               | mistake in assuming that the only real alternatives
               | are "new data science products", when old-school
               | scripting also exists as a valuable tool.
               |
               | The extent people go to to not recognize how much
               | the people who created the SQL language and the
               | relational database engines we now take for granted
               | actually knew what they were doing is a bit of a
               | mystery to me.
               |
               | The right answer to any query that can be defined in
               | SQL is pretty much always an SQL engine, even if
               | it's just SQLite running on a laptop. But somehow
               | people seem to keep coming up with reasons not to
               | use SQL.
        
           | ryguyrg wrote:
           | you can scale vertically with a much better tech than awk.
           | 
           | enter duckdb with columnar vectorized execution and full SQL
           | support. :-)
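           |
           | A minimal sketch with the duckdb Python package (the file
           | and column names are made up); it queries the CSV in
           | place, with no separate load step:
           |
           |   import duckdb
           |
           |   con = duckdb.connect()
           |   rows = con.execute("""
           |       SELECT country, SUM(amount) AS total
           |       FROM read_csv_auto('events.csv')
           |       GROUP BY country
           |       ORDER BY total DESC
           |       LIMIT 10
           |   """).fetchall()
           |   print(rows)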
           | 
           | disclaimer: i work with the author at motherduck and we make
           | a data warehouse powered by duckdb
        
         | chx wrote:
         | https://x.com/garybernhardt/status/600783770925420546 (Gary
         | Bernhardt of WAT fame):
         | 
         | > Consulting service: you bring your big data problems to me, I
         | say "your data set fits in RAM", you pay me $10,000 for saving
         | you $500,000.
         | 
         | This is from 2015...
        
           | RandomCitizen12 wrote:
           | https://yourdatafitsinram.net/
        
           | crowcroft wrote:
           | I wonder if it's fair to revise this to 'your data set
           | fits on NVMe drives' these days. Astonishing how fast and
           | how much storage you can get now.
        
             | fbdab103 wrote:
             | You can always check available ram:
             | https://yourdatafitsinram.net/
        
             | xethos wrote:
             | Based on a very brief search: Samsung's fastest NVME drives
             | [0] could maybe keep up with the slowest DDR2 [1]. DDR5 is
             | several orders of magnitude faster than both [2]. Maybe in
             | a decade you can hit 2008 speeds, but I wouldn't consider
             | updating the phrase before then (and probably not after,
             | either).
             | 
             | [0]
             | https://www.tomshardware.com/reviews/samsung-980-m2-nvme-
             | ssd...
             | 
             | [1] https://www.tomshardware.com/reviews/ram-speed-
             | tests,1807-3....
             | 
             | [2] https://en.wikipedia.org/wiki/DDR5_SDRAM
        
               | dralley wrote:
               | The statement was "fits on", not "matches the speed of".
        
               | Dylan16807 wrote:
               | Several gigabytes per second, plus RAM caching, is
               | probably enough though. Latency can be very important,
               | but there exist some very low latency enterprise flash
               | drives.
        
               | int_19h wrote:
               | I think the point is that if it fits on a single drive,
               | you can still get away with a much simpler solution (like
               | a traditional SQL database) than any kind of "big data"
               | stack.
        
         | marginalia_nu wrote:
         | Problem is possibly that most people with that sort of hands-on
         | intuition for data don't see themselves as data scientists and
         | wouldn't apply for such a position.
         | 
         | It's a specialist role, and most people with the skills you
         | seek are generalists.
        
           | deepsquirrelnet wrote:
           | Yeah it's not really what you should be hiring a data
           | scientist to do. I'm of the opinion that if you don't have a
           | data engineer, you probably don't need a data scientist. And
           | not knowing who you need for a job causes a lot of confusion
           | in interviews.
        
         | the_real_cher wrote:
         | How would six terabytes fit into memory?
         | 
         | It seems like it would get a lot of swap thrashing if you had
         | multiple processes operating on disorganized data.
         | 
         | I'm not really a data scientist and I've never worked on data
         | that size so I'm probably wrong.
        
           | coldtea wrote:
           | > _How would six terabytes fit into memory?_
           | 
           | What device do you have in mind? I've seen places use 2TB RAM
           | servers, and that was years ago, and it isn't even that
           | expensive (can get those for about $5K or so).
           | 
           | Currently HP allows "up to 48 DIMM slots which support up to
           | 6 TB for 2933 MT/s DDR4 HPE SmartMemory".
           | 
           | Close enough to fit the OS, the userland, and 6 TiB of data
           | with some light compression.
           | 
           | > _It seems like it would get a lot of swap thrashing if you
           | had multiple processes operating on disorganized data._
           | 
           | Why would you have "disorganized data"? Or "multiple
           | processes" for that matter? The OP mentions processing the
           | data with something as simple as awk scripts.
        
             | fijiaarone wrote:
             | "How would six terabytes fit into memory?"
             | 
             | A better question would be:
             | 
             | Why would anyone stream 6 terabytes of data over the
             | internet?
             | 
             | In 2010 the answer was: because we can't fit that much data
             | in a single computer, and we can't get accounting or
             | security to approve a $10k purchase order to build a local
             | cluster, so we need to pay Amazon the same amount every
             | month to give our ever expanding DevOps team something to
             | do with all their billable hours.
             | 
             | That may not be the case anymore, but our devops team is
             | bigger than ever, and they still need something to do with
             | their time.
        
               | the_real_cher wrote:
               | Well yeah, streaming to the cloud to work around
               | budget issues is a whole nother convo haha.
        
               | Terr_ wrote:
               | I'm having flashbacks to some new outside-hire CEO making
               | flim-flam about capex-vs-opex in order to justify sending
               | business towards a contracting firm they happened to
               | know.
        
             | the_real_cher wrote:
             | I mean, if you're doing data science the data is not
             | always organized, and of course you would want
             | multi-processing.
             |
             | 1 TB of memory is like 5 grand from a quick Google
             | search, and then you probably need specialized
             | motherboards.
        
               | coldtea wrote:
               | > _I mean if you 're doing data science the data is not
               | always organized and of course you would want multi-
               | processing_
               | 
               | Not necessarily - I might not want it or need it. It's a
               | few TB, it can be on a fast HD, on an even faster SSD, or
               | even in memory. I can crunch them quite fast even with
               | basic linear scripts/tools.
               | 
               | And organized could just mean some massaging or just
               | having them in csv format.
               | 
               | This is already the same rushed notion of "needing
               | this" and "must have that" that the OP describes
               | people jumping to, which leads them to suggest huge
               | setups, distributed processing, and multi-machine
               | infrastructure for use cases and data sizes that
               | could fit on a single server with redundancy and be
               | done with it.
               | 
               | DHH has often written about this for their Basecamp
               | needs (scaling vertically where others scale
               | horizontally, which has worked for them for most of
               | their operation); there's also this classic post:
               | https://adamdrake.com/command-line-tools-can-
               | be-235x-faster-...
               | 
               | > _1 TB of memory is like 5 grand from a quick Google
               | search then you probably need specialized motherboards._
               | 
               | Not that specialized; I've worked with server
               | deployments (HP) with 1, 1.5 and 2TB of RAM (and >
               | 100 cores), it's trivial to get.
               | 
               | And 5 or even 30 grand would still be cheaper (and more
               | effective and simpler) than the "big data" setups some of
               | those candidates have in mind.
        
               | the_real_cher wrote:
               | Yeah, I agree about over-engineering.
               |
               | I'm just trying to understand the parent to my
               | original comment.
               |
               | How would running awk for analysis on 6TB of data
               | work quickly and efficiently?
               |
               | They say it would go into memory, but it's not clear
               | to me how that would work, as you would still have
               | paging and thrashing issues if the data didn't have
               | often-used sections.
               |
               | Am I overthinking it, and were they just referring
               | to buying a big-ass RAM machine?
        
           | allanbreyes wrote:
           | There are machines that can fit that and more:
           | https://yourdatafitsinram.net/
           | 
           | I'm not advocating that this is generally a good or bad idea,
           | or even economical, but it's possible.
        
             | the_real_cher wrote:
             | I'm trying to understand what the person I'm replying
             | to had in mind when they said to fit six terabytes in
             | memory and search with awk.
             |
             | Is this what they were referring to - just a big-ass
             | RAM machine?
        
           | capitol_ wrote:
           | It would easily fit in RAM: https://yourdatafitsinram.net/
        
           | jandrewrogers wrote:
           | 6 TB does not fit in memory. However, with a good storage
           | engine and fast storage this easily fits within the
           | parameters of workloads that have memory-like performance.
           | The main caveat is that if you are letting the kernel swap
           | that for you, then you are going to have a bad day; it
           | needs to be done in user space to get that performance,
           | which constrains your choices.
        
           | int_19h wrote:
           | Per one of the links below, the IBM Power System E980 can
           | be configured with up to 64TB of RAM.
        
         | rr808 wrote:
         | If you look at the article, the working set is more commonly
         | 10GB, which matches my experience. For these sizes, simple
         | tools are definitely enough.
        
         | randomtoast wrote:
         | Now, you have to consider the cost it takes for your whole
         | team to learn how to use AWK instead of SQL. Then you do
         | these TCO calculations and revert back to the BigQuery
         | solution.
        
           | tomrod wrote:
           | About $20/month for chatgpt or similar copilot, which really
           | they should reach for independently anyhow.
        
             | randomtoast wrote:
             | And since the data scientist cannot verify the very complex
             | AWK output that should be 100% compatible with his SQL
             | query, he relies on the GPT output for business-critical
             | analysis.
        
               | tomrod wrote:
               | Only if your testing frameworks are inadequate. But
               | I believe you could be missing or mistaken on how
               | code generation successfully integrates into a
               | developer's and data scientist's workflow.
               |
               | Why not take a few days to get familiar with AWK, a
               | skill which will last a lifetime? Like SQL, it
               | really isn't so bad.
        
               | randomtoast wrote:
               | It is easier to write complex queries in SQL instead of
               | AWK. I know both AWK and SQL, and I find SQL much easier
               | for complex data analysis, including JOINS, subqueries,
               | window functions, etc. Of course, your mileage may vary,
               | but I think most data scientists will be much more
               | comfortable with SQL.
        
             | elicksaur wrote:
             | Many people have noted how when using LLMs for things like
             | this, the person's ultimate knowledge of the topic is less
             | than it would've otherwise been.
             | 
             | This effect then forces the person to be reliant on the LLM
             | for answering all questions, and they'll be less capable of
             | figuring out more complex issues in the topic.
             | 
             | $20/mth is a siren's call to introduce such a dependency to
             | critical systems.
        
           | clwg wrote:
           | Not necessarily. I always try to write to disk first, usually
           | in a rotating compressed format if possible. Then, based on
           | something like a queue, cron, or inotify, other tasks occur,
           | such as processing and database logging. You still end up at
           | the same place, and this approach works really well with
           | tools like jq when the raw data is in jsonl format.
           | 
           | The only time this becomes an issue is when the data needs to
           | be processed as close to real-time as possible. In those
           | instances, I still tend to log the raw data to disk in
           | another thread.
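           |
           | A minimal sketch of that write-to-disk-first pattern
           | (paths and the rotation interval are made up); downstream
           | jobs can then pick the finished files up via cron or
           | inotify:
           |
           |   import gzip, json, time
           |
           |   def writer(records):
           |       # rotate to a new gzip'd jsonl file every hour
           |       current_hour, f = None, None
           |       for rec in records:
           |           hour = time.strftime("%Y%m%d%H")
           |           if hour != current_hour:
           |               if f:
           |                   f.close()
           |               f = gzip.open(
           |                   f"raw-{hour}.jsonl.gz", "at")
           |               current_hour = hour
           |           f.write(json.dumps(rec) + "\n")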
        
           | kjkjadksj wrote:
           | For someone who is comfortable with SQL, we are talking
           | minutes to hours to figure out awk well enough to see how
           | it's used, or to use it.
        
             | noisy_boy wrote:
             | It is not only about whether people can figure out awk.
             | It is also about how supportable the solution is. SQL
             | provides many features specifically to support complex
             | querying and is much more accessible to most people -
             | you can't reasonably expect your business analysts to do
             | complex analysis using awk.
             | 
             | Not only that, it provides a useful separation from the
             | storage format, so you can use it to query a flat file
             | exposed as a table using Apache Drill, or a file on S3
             | exposed by Athena, or data in an actual table stored in
             | a database, and so on. The flexibility is terrific.
        
             | esafak wrote:
             | I have been using SQL for decades and I am not
             | comfortable with awk, nor do I intend to become so.
             | There are better tools.
        
           | RodgerTheGreat wrote:
           | With the exception of regexes- which any programmer or data
           | analyst ought to develop some familiarity with anyway- you
           | can describe the entirety of AWK on a few sheets of paper.
           | It's a versatile, performant, and enduring data-handling tool
           | that is _already installed_ on all your servers. You would be
           | hard-pressed to find a better investment in technical
           | training.
        
           | Dylan16807 wrote:
           | No, if you want SQL you install PostgreSQL on the single
           | machine.
           |
           | Why would you use BigQuery just to get SQL?
        
           | citizen_friend wrote:
           | sqlite cli
        
         | bee_rider wrote:
         | There'd still have to be some further questions, right? I
         | guess if you store it on the interview group's cellphones
         | you'll have to plan for what to do if somebody leaves or the
         | interview room is hit by a meteor; if you plan to store it
         | in RAM on a server you'll need some plan for power outages.
        
         | apwell23 wrote:
         | What kind of business just has a static set of 6 TiB of data
         | that people are loading onto their laptops?
         |
         | You tricked candidates with your nonsensical scenario. I
         | hate smartass interviewers like this who are trying some
         | gotcha to feel smug about themselves.
         |
         | Most candidates don't feel comfortable telling people 'just
         | load it on your laptops' even if they think that's sensible.
         | They want to present a 'professional solution', especially
         | when you tricked them with the word 'stack', which is how
         | most of them probably perceived your trick question.
         |
         | This comment is so infuriating to me. Why be assholes to
         | each other when the world is already full of them?
        
           | tomrod wrote:
           | I disagree with your take. Your surly rejoinder aside, the
           | parent commenter identifies an area where senior-level
           | knowledge and process appropriately assess a problem. Not
           | every job interview is about satisfying a checklist of
           | prior experience or training; rather, it's about assessing
           | how well that skillset will fit the needed domain.
           |
           | In my view, it's an appropriate question.
        
             | apwell23 wrote:
             | What did you gather as the 'needed domain' from that
             | comment? The 'needed domain' is often implicit; it's not
             | a blank slate. Candidates assume all sorts of 'needed
             | domain' even before the interview starts. If I am
             | interviewing at a bank, I wouldn't suggest 'load it on
             | your laptops' as my 'stack'.
             |
             | OP even mentioned that it is his favorite 'tricky
             | question'. It would definitely trick me, because they
             | used the word 'stack', which has a specific meaning in
             | the industry. There are even websites dedicated to
             | 'stacks': https://stackshare.io/instacart/instacart
        
           | yxwvut wrote:
           | Well put. Whoever asked this question is undoubtedly a
           | nightmare to work with. Your data is the engine that drives
            | your business and its margin improvements, so why hamstring
            | yourself with a 'clever', cost-saving but ultimately unwieldy
            | solution that makes it harder to draw insight from (or build
            | models/pipelines on)?
           | 
           | Penny wise and pound foolish, plus a dash of NIH syndrome.
           | When you're the only company doing something a particular way
           | (and you're not Amazon-scale), you're probably not as clever
           | as you think.
        
           | marcosdumay wrote:
            | > What kind of business just has a static set of 6 TiB of
            | data that people are loading onto their laptops?
            | 
            | Most businesses have static sets of data that people load on
            | their PCs. (Why do you assume laptops?)
           | 
           | The only weird part of that question is that 6TiB is so big
           | it's not realistic.
        
           | pizzafeelsright wrote:
           | Big data companies or those that work with lots of data.
           | 
            | The largest dataset I worked with was about 60 TB.
            | 
            | While that didn't fit in RAM, most people would just load the
            | sample data into the cluster, even when I told them it would
            | be faster to load 5% locally and work off that.
        
         | throwaway_20357 wrote:
         | It depends on what you want to do with the data. It can be
         | easier to just stick nicely-compressed columnar Parquets in S3
         | (and run arbitrarily complex SQL on them using Athena or
         | Presto) than to try to achieve the same with shell-scripting on
         | CSVs.
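          | 
          | As a rough sketch (bucket, database and table names made up,
          | and assuming boto3 is already configured), kicking off such an
          | Athena query from Python looks something like this:
          | 
          |     import boto3
          | 
          |     athena = boto3.client("athena")
          |     sql = "SELECT status, COUNT(*) FROM logs.requests GROUP BY 1"
          |     # Athena scans the Parquet files in S3 directly; the
          |     # results land in S3 as well.
          |     job = athena.start_query_execution(
          |         QueryString=sql,
          |         QueryExecutionContext={"Database": "logs"},
          |         ResultConfiguration={"OutputLocation": "s3://my-results/"},
          |     )
          |     # Poll get_query_execution(QueryExecutionId=
          |     # job["QueryExecutionId"]) until SUCCEEDED, then fetch the
          |     # rows with get_query_results().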
        
           | fock wrote:
            | How exactly is this solution easier than putting the very
            | same Parquet files on a classic filesystem? Why does the easy
            | solution require an Amazon subscription?
        
         | filleokus wrote:
          | I think I've written about it here before, but I imported ~1
         | TB of logs into DuckDB (which compressed it to fit in RAM of my
         | laptop) and was done with my analysis before the data science
         | team had even ingested everything into their spark cluster.
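          | 
          | The ingestion side was roughly this kind of thing (paths and
          | column names made up; a persistent file here, though an
          | in-memory connection works the same way):
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect("logs.duckdb")  # columnar, compressed
          |     con.execute("""
          |         CREATE TABLE requests AS
          |         SELECT * FROM read_csv_auto('raw_logs/*.csv.gz')
          |     """)
          |     top = con.execute("""
          |         SELECT status, COUNT(*) AS n
          |         FROM requests GROUP BY status ORDER BY n DESC
          |     """).fetchdf()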
         | 
         | (On the other hand, I wouldn't really want the average business
         | analyst walking around with all our customer data on their
         | laptops all the time. And by the time you have a proper ACL
          | system with audit logs and some nice way to share analyses
          | that update in real time as new data is ingested, the Big Data
          | Solution(tm) probably has a lower TCO...)
        
           | marcosdumay wrote:
           | > And by the time you have ... the Big Data Solution(tm)
            | probably has a lower TCO...
           | 
           | I doubt it. The common Big Data Solutions manage to have a
           | very high TCO, where the least relevant share is spent on
           | hardware and software. Most of its cost comes from
           | reliability engineering and UI issues (because managing that
           | "proper ACL" that doesn't fit your business is a hell of a
           | problem that nobody will get right).
        
           | riku_iki wrote:
            | You probably didn't do joins on your dataset, for example,
            | because DuckDB OOMs on joins that don't fit in memory.
        
         | thunky wrote:
         | > requirements of "6 TiB of data"
         | 
         | How could anyone answer this without knowing how the data is to
         | be used (query patterns, concurrent readers, writes/updates,
         | latency, etc)?
         | 
         | Awk may be right for some scenarios, but without specifics it
         | can't be a correct answer.
        
           | marginalia_nu wrote:
            | Those are very appropriate follow-up questions, I think. If
            | someone tasks you with dealing with 6 TiB of data, it is
            | entirely reasonable to ask questions until you can provide a
            | good solution; that is far better than assuming the answers
            | are unknowable and blindly architecting for all use cases.
        
         | kbolino wrote:
         | Even if a 6 terabyte CSV file does fit in RAM, the only thing
         | you should do with it is convert it to another format (even if
         | that's just the in-memory representation of some program). CSV
         | stops working well at billions of records. There is no way to
         | find an arbitrary record because records are lines and lines
         | are not fixed-size. You can sort it one way and use binary
         | search to find something in it in semi-reasonable time but re-
         | sorting it a different way will take hours. You also can't
         | insert into it while preserving the sort without rewriting half
         | the file on average. You don't need Hadoop for 6 TB but,
         | assuming this is live data that changes and needs regular
         | analysis, you do need something that actually works at that
         | size.
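          | 
          | For the "convert it to another format" step, something like the
          | following streams the CSV through in batches instead of loading
          | 6 TB at once (file names made up, pyarrow assumed):
          | 
          |     import pyarrow.csv as pv
          |     import pyarrow.parquet as pq
          | 
          |     # Streaming CSV reader: one batch in memory at a time.
          |     reader = pv.open_csv("events.csv")
          |     writer = pq.ParquetWriter("events.parquet", reader.schema)
          |     for batch in reader:
          |         writer.write_batch(batch)
          |     writer.close()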
        
         | 7thaccount wrote:
          | I am a big fan of these simplistic solutions. In my own area,
          | it was incredibly frustrating: all we needed was a database with
          | a smaller subset of the most recent information from our main
          | long-term storage database, so that back-end users could do
          | important one-off analyses. This should've been fairly cheap,
          | but of course the IT director architect guy wanted to pad his
          | resume and turn it all into a multi-million-dollar project with
          | 100 bells and whistles that nobody wanted.
        
         | palata wrote:
         | One thing that may have an impact on the answers: you are
         | hiring them, so I assume they are passing a technical
         | interview. So they expect that you want to check their
         | understanding of the technical stack.
         | 
         | I would not conclude that they over-engineer everything they do
         | from such an answer, but rather just that they got tricked in
         | this very artificial situation where you are in a dominant
         | position and ask trick questions.
         | 
         | I was recently in a technical interview with an interviewer
         | roughly my age and my experience, and I messed up. That's the
         | game, I get it. But the interviewer got judgemental towards my
         | (admittedly bad) answers. I am absolutely certain that were the
         | roles inverted, I could choose a topic I know better than him
         | and get him in a similarly bad position. But in this case, he
         | was in the dominant position and he chose to make me feel bad.
         | 
         | My point, I guess, is this: when you are the interviewer, be
         | extra careful not to abuse your dominant position, because it
         | is probably counter-productive for your company (and it is just
         | not nice for the human being in front of you).
        
           | ufo wrote:
           | From the point of view of the interviewee, it's impossible to
           | guess if they expect you to answer "no need for big data" or
           | if they expect you to answer "the company is aiming for
           | exponential growth so disregard the 6TB limit and architect
           | for scalability"
        
             | kmarc wrote:
              | FWIW, it takes an extra 2.5 seconds to say "You don't need
              | big data for this, but if you insist, ..." and then give me
              | the Hadoop answer.
        
               | whamlastxmas wrote:
               | Is this like interviewing for a chef position for a fancy
               | restaurant and when asked how to perfectly cook a steak,
               | you preface it with "well you can either go to McDonald's
               | and get a burger, or..."
               | 
               | It may not be reasonable to suggest that in a role that
               | traditionally uses big data tools
        
               | dkz999 wrote:
                | Idk, in this instance I feel pretty strongly that cloud,
                | and solutions with unnecessary overhead, are the fast
                | food. The article proposes not eating it all the time.
        
               | hnfong wrote:
               | I see it more like "it's 11pm and a family member
               | suddenly wants to eat a steak at home, what would you
               | do?"
               | 
                | The person who says "I'm going to drive back to the
               | restaurant and take my professional equipment home to
               | cook the steak" is probably offering the wrong answer.
               | 
               | I'm obviously not a professional cook, but presumably the
               | ability to improvise with whatever tools you currently
               | have is a desirable skill.
        
               | palata wrote:
               | Hmm I would say that the equivalent to your 11pm question
               | is more something like "your sister wants to backup her
               | holiday pictures on the cloud, how do you design it?".
                | The person who says "I'll ask her for 10 million to
                | build a data center" is probably offering the wrong
                | answer :-).
        
               | tored wrote:
                | I think it's more like: how would you prepare and cook
                | the best five-course gala dinner for only $10? That
                | requires true skill.
        
               | bee_rider wrote:
                | I'm not sure if you are referencing it intentionally or
                | not, but some chefs (Gordon Ramsay for one) will ask an
                | interviewee to make some scrambled eggs; something not
                | super niche or specialized, but enough to see what their
                | technique is.
                | 
                | It's a sort of "interview hack" example that went around
                | a while ago, used to emphasize the idea of a simple,
                | unspecialized skill test. I guess upcoming chefs probably
                | practice egg scrambling nowadays, ruining the value of
                | the test. But maybe they could ask them to make a bit of
                | steak now.
        
               | Dylan16807 wrote:
               | The fancy cluster is probably slower for most tasks than
               | one big machine storing everything in RAM. It's not like
               | a fast food burger.
        
               | jancsika wrote:
               | That's great, but it's really just desiderata about you
               | and your personal situation.
               | 
               | E.g., if a HN'er takes this as advice they're just as
               | likely to be gated by some other interviewer who
               | interprets hedging as a smell.
               | 
               | I believe the posters above are essentially saying: you,
               | the interviewer, can take the 2.5 seconds to ask the
               | follow up, "... and if we're not immediately optimizing
               | for scalability?" Then take that data into account when
               | doing your assessment instead of attempting to optimize
               | based on a single gate.
               | 
               | Edit: clarification
        
               | coffeebeqn wrote:
                | This is the crux of it. Another interviewer would've
                | marked "run it on a local machine with a big SSD" as:
                | this fool doesn't know enough about distributed systems
                | and just runs toy projects on one machine.
        
               | dartos wrote:
                | That is what I think interviewers think when I don't
                | immediately bring up Kubernetes and SQS in an
                | architecture interview.
        
               | theamk wrote:
                | Depends on the shop? For some kinds of tasks, jumping
                | to Kubernetes right away would be a minus during the
                | interview.
        
               | antisthenes wrote:
               | > E.g., if a HN'er takes this as advice they're just as
               | likely to be gated by some other interviewer who
               | interprets hedging as a smell.
               | 
               | If people in high stakes environments interpret hedging
               | as a smell - run from that company as fast as you can.
               | 
               | Hedging is a natural adult reasoning process. Do you
               | really want to work with someone who doesn't understand
               | that?
        
               | llm_trw wrote:
                | I once killed the deployment of a big data team in a
                | large bank when I laid out, in excruciating detail,
                | exactly what they'd have to deal with during an
                | interview.
                | 
                | Last I heard, they'd promoted one Unix guy on the inside
                | to babysit a bunch of cron jobs on the biggest server
                | they could find.
        
               | palata wrote:
                | Sure, but as you said yourself: it's a trick question.
                | How often, in the actual job, does the employee have to
                | answer trick questions without having any time to think?
                | 
                | As an interviewer, why not ask: "how would you do that
                | in a setup that doesn't have much data and doesn't need
                | to scale, and then how would you do it if it had a ton of
                | data and a big need to scale?" There is no trick there;
                | do you feel you lose information about the interviewee?
        
               | zdragnar wrote:
               | Depends on the level you're hiring for. At a certain
               | point, the candidate needs to be able to identify the
               | right tool for the job, including when that tool is not
               | the usual big data tools but a simple script.
        
               | hirsin wrote:
               | Trick questions (although not known as such at the time)
               | are the basis of most of the work we do? XY problem is a
               | thing for a reason, and I cannot count the number of
               | times my teams and I have ratholed on something complex
               | only to realize we were solving for the wrong problem,
               | i.e. A trick question.
               | 
               | As a sibling puts it though, it's a matter of level.
               | Senior/staff and above? Yeah, that's mostly what you do.
                | Lower than that, and you should be able to mostly trust
                | those upper folks to have seen through the trick.
        
               | palata wrote:
               | > are the basis of most of the work we do?
               | 
               | I don't know about you, but in my work, I always have
               | more than 3 seconds to find a solution. I can slowly
               | think about the problem, sleep on it, read about it, try
               | stuff, think about it while running, etc. I usually do at
               | least some of those for _new_ problems.
               | 
               | Then of course there is a bunch of stuff that is not
               | challenging and for which I can start coding right away.
               | 
               | In an interview, those trick questions will just show you
               | who already has experience with the problem you mentioned
               | and who doesn't. It doesn't say _at all_ (IMO) how good
                | the interviewee is at tackling challenging problems. The
               | question then is: do you want to hire someone who is good
               | at solving challenging problems, or someone who already
               | knows how to solve the one problem you are hiring them
               | for?
        
               | theamk wrote:
                | If the interviewer expects you to answer the entire
                | design question in 3 seconds, that interview is pretty
                | broken. Those questions should take a longish time
                | (minutes to tens of minutes), and should let the
                | candidate showcase their thought process.
        
               | palata wrote:
               | I meant that the interviewer expects you to start
               | answering after 3 seconds. Of course you can elaborate
               | over (tens of) minutes. But that's very far from actual
               | work, where you have time to think before you start
               | solving a problem.
               | 
               | You may say "yeah but you just have to think out loud,
               | that's what the interviewer wants". But again that's not
               | how I work. If the interviewer wants to see me design a
               | system, they should watch me read documentation for
               | hours, then think about it while running, and read again,
               | draw a quick thing, etc.
        
               | coryrc wrote:
               | Once had a coworker write a long proposal to rewrite some
               | big old application from Python to Go. I threw in a
               | single comment: why don't we use the existing code as a
               | separate executable?
               | 
               | Turns out he was laid off and my suggestion was used.
               | 
               | (Okay, I'm being silly, the layoff was a coincidence)
        
               | theamk wrote:
                | Because the interview is supposed to ask the same
                | questions as the real job, and in the real job there are
                | rarely big hints like the ones you are describing.
                | 
                | On the other hand, "hey, I have 6 TiB of data, please
                | prepare to analyze it, feel free to ask any questions for
                | clarification but I may not know the answers" is much
                | more representative of a real-life task.
        
               | int_19h wrote:
               | Being able to ask qualifying questions like that, or
               | presenting options with different caveats clearly spelled
               | out, is part of the job description IMO, at least for
               | senior roles.
        
             | valenterry wrote:
             | It doesn't matter. The answer should be "It depends, what
             | are the circumstances - do we expect high growth in the
             | future? Is it gonna stay around 6TB? How and by whom will
             | it be used and what for?"
             | 
             | Or, if you can guess what the interviewer is aiming for,
             | state the assumption and go from there "If we assume it's
             | gonna stay at <10TB for the next couple of years or even
             | longer, then..."
             | 
             | Then the interviewer can interrupt and change the
             | assumptions to his needs.
        
             | drubio wrote:
              | It's almost a law that "all technical discussions devolve
              | into interview mind games"; this industry has a serious
              | interview/hiring problem.
        
             | layer8 wrote:
             | You shouldn't guess what they expect, you should say what
             | you think is right, and why. Do you want to work at a
             | company where you would fail an interview due to making a
             | correct technical assessment? And even if the guess is
             | right, as an interviewer I would be more impressed by an
             | applicant that will give justified reasons for a different
             | answer than what I expected.
        
             | andoando wrote:
              | It's great if the interviewer actually takes the time to
              | sort out the questions you have, because questions that
              | seem simple to them carry a lot of assumptions they've
              | made.
              | 
              | I had an interview: "design an app store". I tried asking,
              | OK, an app store has a ton of components, which part of the
              | app store are you asking about exactly? The response I got
              | was "Have you ever used an app store? Design an app store".
              | Umm, ok.
        
             | oivey wrote:
              | Engineering for scalability here means the single-server
              | solution that you throw away later when scale is actually
              | needed. The price of the simple solution is so small (in
              | this case) that you should basically always start with it.
        
         | mrtimo wrote:
         | .parquet files are completely underrated, many people still do
         | not know about the format!
         | 
         | .parquet preserves data types (unlike CSV)
         | 
          | They are often roughly 10x smaller than CSV. So 600 GB instead
          | of 6 TB.
          | 
          | They can be dramatically (often tens of times) faster to read
          | than CSV.
          | 
          | They are an "open standard" from the Apache Foundation.
         | 
         | Of course, you can't peek inside them as easily as you can a
         | CSV. But, the tradeoffs are worth it!
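          | 
          | The pandas round trip is a one-liner each way (file names made
          | up; needs pyarrow or fastparquet installed):
          | 
          |     import pandas as pd
          | 
          |     df = pd.read_csv("trips.csv", parse_dates=["pickup_time"])
          |     df.to_parquet("trips.parquet", compression="zstd")
          | 
          |     # Later: read back only the columns you need, types intact.
          |     fares = pd.read_parquet("trips.parquet",
          |                             columns=["pickup_time", "fare"])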
         | 
         | Please promote the use of .parquet files! Make .parquet files
         | available for download everywhere .csv is available!
        
           | sph wrote:
           | Third consecutive time in 86 days that you mention .parquet
           | files. I am out of my element here, but it's a bit weird
        
             | fifilura wrote:
             | FWIW I am the same. I tend to recommend BigQuery and
             | AWS/Athena in various posts. Many times paired with
             | Parquet.
             | 
              | But that is because they make a lot of things much simpler,
              | and a lot of people have not realized that yet. Tooling is
              | moving fast in this space; it is not 2004 anymore.
             | 
             | His arguments are still valid and 86 days is a pretty long
             | time.
        
             | ok_computer wrote:
             | Sometimes when people discover or extensively use something
             | they are eager to share in contexts they think are
             | relevant. There is an issue when those contexts become too
             | broad.
             | 
              | Three times across three months is hardly "astroturfing for
              | Big Parquet" territory.
        
             | mrtimo wrote:
              | I've downloaded many CSV files that were malformed (extra
              | commas or tabs, etc.), or had dates in non-standard formats.
              | The Parquet format probably would not have had these issues!
        
           | ddalex wrote:
           | Why is .parquet better than protobuf?
        
             | sdenton4 wrote:
             | Parquet is columnar storage, which is much faster for
             | querying. And typically for protobuf you deserialize each
             | row, which has a performance cost - you need to deserialize
             | the whole message, and can't get just the field you want.
             | 
              | So, if you want to query a giant collection of protobufs,
             | you end up reading and deserializing every record. For
             | parquet, you get much closer to only reading what you need.
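              | 
              | A sketch of the Parquet side (file and column names made
              | up): pyarrow can pull a single column and push a filter
              | down to the row-group statistics:
              | 
              |     import pyarrow.parquet as pq
              | 
              |     # Reads only the columns involved, and skips row
              |     # groups whose stats rule out status == 500.
              |     tbl = pq.read_table(
              |         "events.parquet",
              |         columns=["latency_ms"],
              |         filters=[("status", "=", 500)],
              |     )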
        
             | nostrademons wrote:
             | Parquet ~= Dremel, for those who are up on their Google
             | stack.
             | 
             | Dremel was pretty revolutionary when it came out in 2006 -
             | you could run ad-hoc analyses in seconds that previously
             | would've taken a couple days of coding & execution time.
             | Parquet is awesome for the same reasons.
        
           | thesz wrote:
           | Parquet is underdesigned. Some parts of it do not scale well.
           | 
            | I believe that Parquet files have rather monolithic metadata
            | at the end, with a 4 GiB maximum size. With 600 columns
            | (which is realistic, believe me), we are at slightly less
            | than 7.2 million row groups. Give each row group 8K rows and
            | we are limited to about 60 billion rows total. It is not
            | much.
            | 
            | The flatness of the file metadata requires external data
            | structures to handle it more or less well. You cannot just
            | mmap it and be good. This external data structure will most
            | probably take as much memory as the file metadata, or even
            | more. So, 4 GiB+ of your RAM will be, well, used slightly
            | inefficiently.
            | 
            | (A block-run-mapped log-structured merge tree in one file can
            | be as compact as a Parquet file and allow for very efficient
            | memory-mapped operations without additional data structures.)
            | 
            | Thus, while Parquet is a step, I am not sure it is a step in
            | a definitely right direction. Some aspects of it are good,
            | some are not that good.
        
             | datadeft wrote:
             | Nobody is forcing you to use a single Parquet file.
        
               | thesz wrote:
               | Of course.
               | 
               | But nobody tells me that I can hit a hard limit and then
               | I need a second Parquet file and should have some code
               | for that.
               | 
               | The situation looks to me as if my "Favorite DB server"
                | supports, say, only 1.9 billion records per table and if
               | I hit that limit I need a second instance of my "Favorite
               | DB server" just for that unfortunate table. And it is not
               | documented anywhere.
        
             | apwell23 wrote:
              | Some critiques of Parquet by Andy Pavlo:
             | 
             | https://www.vldb.org/pvldb/vol17/p148-zeng.pdf
        
               | thesz wrote:
               | Thanks, very insightful.
               | 
               | "Dictionary Encoding is effective across data types (even
               | for floating-point values) because most real-world data
               | have low NDV ratios. Future formats should continue to
               | apply the technique aggressively, as in Parquet."
               | 
                | So this is not a critique but an assessment. And Parquet
                | has
               | some interesting design decisions I did not know about.
               | 
               | So, let me thank you again. ;)
        
             | imiric wrote:
             | What format would you recommend instead?
        
               | thesz wrote:
               | I do not know a good one.
               | 
                | A former colleague of mine is now working on a memory-
                | mapped log-structured merge tree implementation, and it
                | could be a good alternative. An LSM provides elasticity
                | (one can store as much data as one needs); it is static,
                | so it can be compressed as well as Parquet-stored data;
                | and memory mapping plus implicit indexing of the data do
                | not require additional data structures.
               | 
               | Something like LevelDB and/or RocksDB can provide most of
               | that, especially when used in covering index [1] mode.
               | 
                | [1]
                | https://www.sqlite.org/queryplanner.html#_covering_indexes
        
             | Renaud wrote:
             | Parquet is not a database, it's a storage format that
             | allows efficient column reads so you can get just the data
             | you need without having to parse and read the whole file.
             | 
             | Most tools can run queries across parquet files.
             | 
             | Like everything, it has its strengths and weaknesses, but
             | in most cases, it has better trade-offs over CSV if you
             | have more than a few thousand rows.
        
               | beryilma wrote:
               | > Parquet is not a database.
               | 
               | This is not emphasized often enough. Parquet is useless
               | for anything that requires writing back computed results
               | as in data used by signal processing applications.
        
             | maxnevermind wrote:
             | > 7.2 millions row groups
             | 
             | Why would you need 7.2 mil row groups?
             | 
              | Row group size when stored in HDFS is usually equal to the
              | HDFS block size by default, which is 128 MB.
             | 
             | 7.2 mil * 128MB ~ 1PB
             | 
             | You have a single parquet file 1PB in size?
        
               | thesz wrote:
               | Parquet is not HDFS. It is a static format, not a B-tree
               | in disguise like HDFS.
               | 
                | You can have compressed Parquet columns with 8192 entries
                | being a couple of tens of bytes in size. 600 columns in a
                | row group is then 12K bytes or so, leading us to a 100 GB
                | file, not a petabyte. Four orders of magnitude of
                | difference between your assessment and mine.
        
           | riku_iki wrote:
           | > They are 50x faster to read than CSV
           | 
            | I actually benchmarked this, and the DuckDB CSV reader is
            | faster than its Parquet reader.
        
             | wenc wrote:
             | I would love to see the benchmarks. That is not my
             | experience, except in the rare case of a linear read (in
             | which CSV is much easier to parse).
             | 
             | CSV underperforms in almost every other domain, like joins,
             | aggregations, filters. Parquet lets you do that lazily
             | without reading the entire Parquet dataset into memory.
        
               | riku_iki wrote:
               | > That is not my experience, except in the rare case of a
               | linear read (in which CSV is much easier to parse).
               | 
                | Yes, I think DuckDB only reads the CSV, then projects the
                | necessary data into its internal format (which is
                | probably more efficient than Parquet, again based on my
                | benchmarks), and does all ops (joins, aggregations) on
                | that format.
        
               | wenc wrote:
               | Yes, it does that, assuming you read in the entire CSV,
               | which works for CSVs that fit in memory.
               | 
               | With Parquet you almost never read in the entire dataset
               | and it's fast on all the projections, joins, etc. while
               | living on disk.
        
               | riku_iki wrote:
               | > which works for CSVs that fit in memory.
               | 
                | What? Why would the CSV be required to fit in memory in
                | this case? I tested CSVs which are far larger than
                | memory, and it works just fine.
        
               | geysersam wrote:
               | The entire csv doesn't have to fit in memory, but the
               | entire csv has to pass through memory at some point
               | during the processing.
               | 
               | The parquet file has metadata that allows duckdb to only
               | read the parts that are actually used, reducing total
               | amount of data read from disk/network.
        
               | riku_iki wrote:
               | > The parquet file has metadata that allows duckdb to
               | only read the parts that are actually used, reducing
               | total amount of data read from disk/network.
               | 
                | This makes sense, and it is what I hoped to get. But in
                | reality it looks like parsing CSV strings works faster
                | than the bloated and overengineered Parquet format and
                | its libraries.
        
               | wenc wrote:
               | >But in reality looks like parsing CSV string works
               | faster than bloated and overengineered parquet format
               | with libs.
               | 
               | Anecdotally having worked with large CSVs and large on-
               | disk Parquet datasets, my experience is the opposite of
               | yours. My DuckDB queries operate directly on Parquet on
               | disk and never load the entire dataset, and is always
               | much faster than the equivalent operation on CSV files.
               | 
               | I think your experience might be due to -- what it sounds
               | like -- parsing the entire CSV into memory first (CREATE
               | TABLE) and then processing after. That is not an apples-
               | to-apples comparison because we usually don't do this
               | with Parquet -- there's no CREATE TABLE step. At most
               | there's a CREATE VIEW, which is lazy.
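                | 
                | Concretely, the two setups being compared look roughly
                | like this (paths made up):
                | 
                |     import duckdb
                | 
                |     con = duckdb.connect()
                |     # Lazy: queries hit the Parquet files on disk.
                |     con.execute("""CREATE VIEW t AS
                |       SELECT * FROM read_parquet('trips/*.parquet')""")
                |     # Eager: the whole CSV is parsed and materialized
                |     # before the first query runs.
                |     con.execute("""CREATE TABLE t_csv AS
                |       SELECT * FROM read_csv_auto('trips.csv')""")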
               | 
               | I've seen your comments bashing Parquet in DuckDB
               | multiple times, and I think you might be doing something
               | wrong.
        
               | riku_iki wrote:
               | > I think your experience might be due to -- what it
               | sounds like -- parsing the entire CSV into memory first
               | (CREATE TABLE) and then processing after. That is not an
               | apples-to-apples
               | 
                | The original discussion was about the CSV vs Parquet
                | "reader" part, so this is exactly apples-to-apples
                | testing, easy to benchmark, and I stand my ground. What
                | you are doing downstream is another question, which is
                | not possible to discuss because no code for your logic is
                | available.
               | 
               | > I've seen your comments bashing Parquet in DuckDB
               | multiple times, and I think you might be doing something
               | wrong.
               | 
                | Like running one command from the DuckDB docs?
                | 
                | Also, I am not "bashing"; I just state that the CSV
                | reader is faster.
        
             | xnx wrote:
             | For how many rows?
        
               | riku_iki wrote:
               | 10B
        
           | jjgreen wrote:
           | _Please promote the use of .parquet files!_
           | apt-cache search parquet       <nada>
           | 
           | Maybe later
        
             | seabass-labrax wrote:
             | Parquet is a _file format_ , not a piece of software. 'apt
             | install csv' doesn't make any sense either.
        
               | jjgreen wrote:
               | There is _no support_ for parquet in Debian, by contrast
               | apt-cache search csv | wc -l       259
        
               | fhars wrote:
                | If you want to shine with snide remarks, you should at
                | least understand the point being made:
                | 
                |     $ apt-cache search csv | wc -l
                |     225
                |     $ apt-cache search parquet | wc -l
                |     0
        
             | nostrademons wrote:
             | It's more like "sudo pip install pandas" and then Pandas
             | comes with Parquet support.
        
               | jjgreen wrote:
                | Pandas cannot read Parquet files itself; it uses
                | third-party "engines" for that purpose, and those are not
                | available in Debian.
        
               | nostrademons wrote:
               | Ah yes, that's true though a typical Anaconda
               | installation will have them automatically installed.
               | "sudo pip install pyarrow" or "sudo pip install
               | fastparquet" then.
        
         | EdwardDiego wrote:
         | If you were hiring me for a data engineering role and asked me
         | how to store and query 6 TiB, I'd say you don't need my skills,
         | you've probably got a Postgres person already.
        
         | hotstickyballs wrote:
         | And how many data scientists are familiar with using awk
         | scripts? If you're the only one then you'll have failed at
         | scaling the data science team.
        
         | jrm4 wrote:
         | This feels representative of _so many of our problems in tech,_
         | overengineering, over- "producting," over-proprietary-ing, etc.
         | 
         | Deep centralization at the expense of simplicity and true
         | redundancy; like renting a laser cutter when you need a
         | boxcutter, a pair of scissors, and the occasional toenail
         | clipper.
        
         | rgrieselhuber wrote:
          | This is a great test / question. More generally, it tests
          | knowledge of basic Linux tooling and mindset, as well as
          | experience level with data sizes. 6 TiB really isn't that much
          | data these days, depending on context and storage format, etc.,
          | of course.
        
           | deepsquirrelnet wrote:
           | It could be a great question if you clarify the goals. As it
           | stands it's "here's a problem, but secretly I have hidden
           | constraints in my head you must guess correctly".
           | 
            | The OP's desired solution probably could have been elicited
            | from some of those other candidates if they'd been asked
            | "here is the challenge, solve it in the most MacGyver way
            | possible". Because if you change the second part, the correct
            | answer changes.
           | 
           | "Here is a challenge, solve in the most accurate, verifiable
           | way possible"
           | 
           | "Here is a challenge, solve in a way that enables
           | collaboration"
           | 
           | "Here is a challenge, 6TiB but always changing"
           | 
           | ^ These are data science questions much more than the
           | question he was asking. The answer in this case is that
           | you're not actually looking for a data scientist.
        
         | 6510 wrote:
          | I don't know anything, but when doing that I always end up next
          | Thursday having the same problem with 4 TB, and the week after
          | with 17, at which point I regret picking a solution that fit so
          | exactly.
        
         | wg0 wrote:
          | I have lived through the hype of Big Data; it was the time of
          | HDFS + HTable, I guess, and Hadoop etc.
          | 
          | One can't go wrong with DuckDB [0] + SQLite +
          | Open/Elasticsearch either, with 6 to 8 or even 10 TB of data.
         | 
         | [0]. https://duckdb.org/
        
         | michaelcampbell wrote:
         | My smartphone cannot store 1TiB. <shrug>
        
         | dfgdfg34545456 wrote:
          | The problem with your question is that candidates are there to
          | show off their knowledge. I failed a tech interview once; the
          | question was to build a web page/back end/DB that allows people
          | to order, let's say, widgets, and that will scale huge. I went
          | the simpleton-answer route: all you need is Rails, a Redis
          | cache and an AWS-provisioned relational DB, and you solve the
          | big problems later if you get there, sort of thing. Turns out
          | they wanted to hear all about microservices and sharding.
        
         | lizknope wrote:
         | I'm on some reddit tech forums and people will say "I need help
         | storing a huge amount of data!" and people start offering
         | replies for servers that store petabytes.
         | 
         | My question is always "How much data do you actually have?"
          | Many times they reply with 500 GB or 2 TB. I tell them that
          | isn't much data when you can get a 1 TB microSD card the size
          | of a fingernail or a 24 TB hard drive.
          | 
          | My feeling is that if you really need to store petabytes of
          | data, you aren't going to ask how to do it on Reddit. If you
          | need to store petabytes, you will have an IT team, a
          | substantial budget and vendors that can figure it out.
        
         | rqtwteye wrote:
         | Plenty of people get offended if you tell them that their data
         | isn't really "big data". A few years ago I had a discussion
         | with one of my directors about a system IT had built for us
          | with Hadoop, API gateways, multiple developers and hundreds of
          | thousands in yearly costs. I told him that, at our scale (now
          | and for any foreseeable future), I could easily run the whole
          | thing on a USB drive attached to his laptop with a few Python
          | scripts. He looked really annoyed, and I was never involved
          | with this project again.
         | 
         | I think it's part of the BS cycle that's prevalent in
         | companies. You can't admit that you are doing something simple.
        
           | noisy_boy wrote:
           | In most non-tech companies, it comes down to the motive of
           | the manager and in most cases it is expansion of reporting
           | line and grabbing as much budget as possible. Using "simple"
           | solutions runs counter to this central motivation.
        
             | disqard wrote:
             | This is also true of tech companies. Witness how the
             | "GenAI" hammer is being used right now at MS, Google, Meta,
             | etc.
        
             | eloisant wrote:
             | - the manager wants expansion
             | 
             | - the developers want to get experience in a fancy stack to
             | build up their resume
             | 
             | Everyone benefits from the collective hallucination
        
           | boh wrote:
           | That's the tech sector in a nutshell. Very few innovations
           | actually matter to non-tech companies. Most companies could
           | survive on Windows 98 software.
        
         | KronisLV wrote:
         | > The winner of course was the guy who understood that 6TiB is
         | what 6 of us in the room could store on our smart phones, or a
         | $199 enterprise HDD (or three of them for redundancy), and it
         | could be loaded (multiple times) to memory as CSV and simply
         | run awk scripts on it.
         | 
          | If it's not a very write-heavy workload but you still want to
          | be able to look things up, wouldn't something like SQLite be a
          | good choice, up to 281 TB: https://www.sqlite.org/limits.html
         | 
         | It even has basic JSON support, if you're up against some
         | freeform JSON and not all of your data neatly fits into a
         | schema: https://sqlite.org/json1.html
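          | 
          | A minimal sketch with Python's built-in sqlite3 (assuming the
          | SQLite build ships the JSON functions, which recent ones do):
          | 
          |     import sqlite3
          | 
          |     con = sqlite3.connect("data.db")
          |     con.execute(
          |         "CREATE TABLE events (id INTEGER PRIMARY KEY, doc TEXT)")
          |     con.execute("INSERT INTO events (doc) VALUES (?)",
          |                 ('{"user": "a", "ms": 42}',))
          |     rows = con.execute(
          |         "SELECT json_extract(doc, '$.ms') FROM events").fetchall()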
         | 
         | A step up from that would be PostgreSQL running in a container:
         | giving you the support for all sorts of workloads, more
         | advanced extensions for pretty much anything you might ever
         | want to do, from geospatial data with PostGIS, to something
         | like pgvector, timescaledb etc., while still having a plethora
          | of drivers and still not making you drown in complexity and
         | having no issues with a few dozen/hundred TB of data.
         | 
         | Either of those would be something that most people on the
         | market know, neither will make anyone want to pull their hair
         | out and they'll give you the benefit of both quick data
         | writes/retrieval, as well as querying. Not that everything
         | needs or can even work with a relational database, but it's
         | still an okay tool to reach for past trivial file storage
         | needs. Plus, you have to build a bit less of whatever
         | functionality you might need around the data you store, in
         | addition to there even being nice options for transparent
         | compression.
        
         | hipadev23 wrote:
          | Huh? How are you proposing to load a 6 TB CSV into memory
          | multiple times? And then process it with awk, which generally
          | streams one line at a time?
         | 
         | Obviously we can get boxes with multiple terabytes of RAM for
         | $50-200/hr on-demand but nobody is doing that and then also
         | using awk. They're loading the data into clickhouse or duckdb
         | (at which point the ram requirement is probably 64-128GB)
         | 
         | I feel like this is an anecdotal story that has mixed up sizes
         | and tools for dramatic effect.
        
         | dahart wrote:
         | Wait, how would you split 6 TiB across 6 phones, how would you
         | handle the queries? How long will the data live, do you need to
         | handle schema changes, and how? And what is the cost of a
         | machine with 15 or 20 TiB of RAM (you said it fits in memory
         | multiple times, right?) - isn't the drive cost irrelevant here?
         | How many requests per second did you specify? Isn't that
         | possibly way more important than data size? Awk on 6 TiB, even
         | in memory, isn't very fast. You might need some indexing, which
         | suddenly pushes your memory requirement above 6 TiB, no? Do you
         | need migrations or backups or redundancy? Those could increase
         | your data size by multiples. I'd expect a question that
         | specified a small data size to be asking me to estimate the
         | _real_ data size, which could easily be 100 TiB or more.
        
         | torginus wrote:
         | It's astonishing how shit the cloud is compared to boring-ass
         | pedestrian technology.
         | 
          | For example, just logging stuff into a large text file is so
          | much easier, more performant and more searchable than using AWS
          | CloudWatch, presumably written by some of the smartest
          | programmers who ever lived.
         | 
          | On another note, I was once asked to create a big-data-ish
          | object DB, and I, knowing nothing about the domain and after a
          | bit of benchmarking, decided to just use zstd-compressed JSON
          | streams with a separate index in an SQL table. I'm sure any
          | professional would recoil at it in horror, but it could do
          | literally gigabytes/sec of retrieval or deserialization on
          | consumer-grade hardware.
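          | 
          | The idea was something like this (a sketch, not the actual
          | code; uses the zstandard package, names made up): each object
          | is its own zstd frame appended to a blob file, with offsets
          | kept in SQLite:
          | 
          |     import json, sqlite3, zstandard as zstd
          | 
          |     blob = open("objects.zst", "ab+")
          |     db = sqlite3.connect("index.db")
          |     db.execute("CREATE TABLE IF NOT EXISTS idx "
          |                "(key TEXT PRIMARY KEY, off INT, len INT)")
          | 
          |     def put(key, obj):
          |         frame = zstd.ZstdCompressor().compress(
          |             json.dumps(obj).encode())
          |         off = blob.seek(0, 2)      # append at end of file
          |         blob.write(frame)
          |         db.execute("REPLACE INTO idx VALUES (?, ?, ?)",
          |                    (key, off, len(frame)))
          |         db.commit()
          | 
          |     def get(key):
          |         off, n = db.execute(
          |             "SELECT off, len FROM idx WHERE key = ?",
          |             (key,)).fetchone()
          |         blob.seek(off)
          |         return json.loads(
          |             zstd.ZstdDecompressor().decompress(blob.read(n)))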
        
         | jandrewrogers wrote:
         | As a point of reference, I routinely do fast-twitch analytics
         | on _tens_ of TB on a single, fractional VM. Getting the data in
          | is essentially wire speed. You won't do that on Spark or
         | similar but in the analytics world people consistently
         | underestimate what their hardware is capable of by something
         | like two orders of magnitude.
         | 
         | That said, most open source tools have terrible performance and
         | efficiency on large, fast hardware. This contributes to the
         | intuition that you need to throw hardware at the problem even
         | for relatively small problems.
         | 
         | In 2024, "big data" doesn't really start until you are in the
         | petabyte range.
        
         | buremba wrote:
          | I can't really think of a product with a requirement of at most
          | 6 TiB of data. If the data is in the TiB range, most products
          | have 100x that rather than just a few.
        
         | citizenpaul wrote:
          | The funny thing is that is exactly the place I want to work at.
          | I've only found one company so far, and the owner sold during
          | the pandemic. So far my experience is that the number of
          | companies/people that want what you describe is incredibly low.
          | 
          | I wrote a comment on here the other day about a place I was
          | trying to do work for that was spending $11k USD a month on a
          | BigQuery DB that had 375 MB of source data. My advice was
          | basically: you need to hire a data scientist who knows what
          | they are doing. They were not interested and would rather just
          | band-aid the situation with a "cheap" employee, despite the
          | fact that their GCP bill could pay for a skilled one.
          | 
          | As I've seen it over the last year of job hunting, most places
          | don't want good people. They want replaceable people.
        
         | itronitron wrote:
         | >> "6 TiB of data"
         | 
          | is not a particularly detailed requirement, as it depends quite
          | a bit on the nature of the data.
        
         | tonetegeatinst wrote:
         | I'm not even in data science, but I am a slight data hoarder.
         | And heck even I'd just say throw that data on a drive and have
         | a backup in the cloud and on a cold hard drive.
        
         | SkipperCat wrote:
         | That makes total sense if you're archiving the data, but what
          | happens when you want 10,000 people to have access to
          | read/update the data concurrently? Then you start to need some
          | fairly complex solutions.
        
           | kmarc wrote:
           | This thread blew up a lot, and some unfriendly commenters
           | made many assumptions about this innocent story.
           | 
           | You didn't, and indeed you have a point (missing
           | specification of expected queries), so I expand it as a
           | response here.
           | 
           | Among the _MANY_ requirements I shared with the candidate,
           | only _one_ was the 6TiB. Another one was that it was going to
           | be serving as part of the backend of an internal banking
            | knowledge base, with at maximum 100 requests a day (definitely
           | not 10k people using it).
           | 
           | To all the upset data infrastructure wizards here: calm down.
           | It was a banking startup, with an experimental project, and
           | we needed the sober thinker generalist, who can deliver
            | solutions to real *small scale* problems, and not the one who
            | won the buzzword bingo.
           | 
           | HTH.
        
           | citizen_friend wrote:
           | This load is well handled by a Postgres instance and 15-25k
           | thrown at hardware.
        
         | paulddraper wrote:
         | Storing 6TB is easy.
         | 
         | Processing and querying it is trickier.
        
         | TeamDman wrote:
         | Would probably try https://github.com/pola-rs/polars and go
         | from there lol
        
         | xLaszlo wrote:
         | 6TB - Snowflake
         | 
         | Why?
         | 
          | That's the boring solution. If you don't have a use case or
          | know what kind of queries you would run, then opt for maximum
          | flexibility with the minimum setup of a managed solution.
          | 
          | If cost is prohibitive in the long run, you can figure out a
          | more tailored solution based on the revealed preferences.
         | 
         | Fiddling with CSVs is the DWH version of the legendary "Dropbox
         | HN commenter".
        
         | nostrademons wrote:
         | I would've said "Pandas with Parquet files". If you're hiring a
         | DS it's implied that you want to do some sort of aggregate or
         | summary statistics, which is exactly what Pandas is good for,
         | while awk + shell scripts would require a lot of clumsy number
         | munging. And Parquet is an order of magnitude more storage
         | efficient than CSV, and will let you query very quickly.
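          | 
          | i.e. something along these lines (paths and columns made up,
          | pyarrow assumed):
          | 
          |     import pandas as pd
          | 
          |     # Read only two columns from a (partitioned) Parquet dataset
          |     df = pd.read_parquet("events/",
          |                          columns=["region", "revenue"])
          |     summary = df.groupby("region")["revenue"].agg(
          |         ["count", "sum", "mean"])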
        
         | atomicnumber3 wrote:
         | It's really hard because I've failed interviews by pitching "ok
         | we start with postgres, and when that starts to fall over we
         | throw more hardware at it, then when that fails we throw read
         | replicas in, then we IPO, _then_ we can spend all our money and
         | time doing distributed system stuff ".
         | 
         | Whereas the "right answer" (I had a man on the inside) was to
         | describe some wild tall and wide event based distributed
         | system. For some nominal request volume that was nowhere near
         | the limits of postgres. And they didn't even care if you solved
         | the actual hard distributed system problems that would arise
         | like distributed transactions etc.
         | 
          | Anyway, I said I failed the interview, but really they failed
          | my filter, because if they want me to ignore pragmatism and
          | blindly regurgitate a YouTube video on "system design" FAANG
          | interview prep, then I don't want to work there anyway.
        
         | metadat wrote:
         | Can you get a single machine with more than 6TiB of memory
         | these days?
         | 
         | That's quite a bit..
        
         | 1vuio0pswjnm7 wrote:
         | "... or a $199 enterprise HDD"
         | 
         | External or internal? Any examples?
         | 
         | "... it could be loaded (multimple times) to memory"
         | 
         | All 6TiB at once, or loaded in chunks?
        
       | dventimi wrote:
       | Question for the Big Data folks: where do sampling and statistics
       | fit into this, if at all? Unless you're summing to the penny, why
       | would you ever need to aggregate a large volume of data (the
       | population) rather than a small volume of data (a sample)? I'm
       | not saying there isn't a reason. I just don't know what it is.
       | Any thoughts from people who have genuine experience in this
       | realm?
        
         | disgruntledphd2 wrote:
          | Sampling is almost always heavily used here, because it's ace.
          | However, if you need to produce row-level predictions then you
          | can't sample, as you by definition need the row-level data.
          | 
          | However, you can aggregate user-level info into just the
          | features you need, which will get you a looooonnnnggggg way.
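          | 
          | When a sampled estimate is good enough, it can be as simple as
          | (a DuckDB sketch, file name made up):
          | 
          |     import duckdb
          | 
          |     q = "SELECT AVG(fare) FROM 'trips.parquet'"
          |     exact = duckdb.sql(q).fetchone()[0]
          |     approx = duckdb.sql(q + " USING SAMPLE 1%").fetchone()[0]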
        
         | gregw2 wrote:
         | Good question. I am not an expert but here's my take from my
         | time in this space.
         | 
         | Big data folks typically do sampling and such, but that doesn't
         | eliminate the need for a big data environment where such
         | sampling can occur. Just as a compiler can't predict every
         | branch that could happen at compile time (sorry VLIW!) and thus
         | CPUs need dynamic branch predictors, so too a sampling function
         | can't be predicted in advance of an actual dataset.
         | 
         | In a large dataset there are many ways the sample may not
         | represent the whole. The real world is complex. You sample away
         | that complexity at your peril. You will often find you want to
         | go back to the original raw dataset.
         | 
         | Second, in a large organization, sampling alone presumes you
         | are only focused on org-level outcomes. But in a large org
         | there may be individuals who care about the non-aggregated data
         | relevant to their small domain. There can be thousands of such
         | individuals. You do sample the whole but you also have to equip
         | people at each level of abstraction to do the same. The
         | cardinality of your data will in some way reflect the
         | cardinality of your organization and you can't just sample that
         | away.
        
         | banku_brougham wrote:
          | There is a problem with website data, where new features only
          | touch a subset of customers and you need results for every
          | single one.
          | 
          | You won't be partitioned for this case, but the compute you
          | need is just for filtering out this set.
          | 
          | But sampling won't get you what you want, especially if you are
          | doing QC at the business-team level about whether the CX is
          | behaving as expected.
        
         | kwillets wrote:
         | I've done it both ways. Look into Data Sketches also if you
         | want to see applications.
         | 
         | The pros:
         | 
         | -- Samples are small and fast most of the time.
         | 
         | -- can be used opportunistically, eg in queries against the
         | full dataset.
         | 
         | -- can run more complex queries that can't be pre-aggregated
         | (but not always accurately).
         | 
         | The cons:
         | 
         | -- requires planning about what to sample and what types of
         | queries you're answering. Sudden requirements changes are
         | difficult.
         | 
         | -- data skew makes uniform sampling a bad choice.
         | 
         | -- requires ETL pipelines to do the sampling as new data comes
         | in. That includes re-running large backfills if data or
         | sampling changes.
         | 
         | -- requires explaining error to users
         | 
         | -- Data sketches can be particularly inflexible; they're
         | usually good at one metric but can't adapt to new ones. Queries
         | also have to be mapped into set operations.
         | 
         | These problems can be mitigated with proper management tools; I
         | have built frameworks for this type of application before --
         | fixed dashboards with slow-changing requirements are relatively
         | easy to handle.
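         | 
         | For the uniform-sampling case, a minimal sketch (standard
         | reservoir sampling, here in Python; the skew caveat above
         | still applies):
         | 
         |   import random
         | 
         |   def reservoir_sample(rows, k, seed=0):
         |       """Uniform sample of k rows from a stream of unknown length."""
         |       rng = random.Random(seed)
         |       sample = []
         |       for i, row in enumerate(rows):
         |           if i < k:
         |               sample.append(row)
         |           else:
         |               j = rng.randint(0, i)  # replace with prob. k/(i+1)
         |               if j < k:
         |                   sample[j] = row
         |       return sample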
        
       | oli5679 wrote:
       | BigQuery has a generous 1 TiB/month free tier, and $6.25/TiB
       | afterwards. If you have small data, it's a pragmatic option; just
       | make sure to use partitioning and sensible query patterns to
       | limit the number of full-data scans as you approach the 'medium
       | data' region.
       | 
       | There are some larger data-sizes, and query patterns, where
       | either BigQuery Capacity compute pricing, or another vendor like
       | Snowflake, becomes more economical.
       | 
       | https://cloud.google.com/bigquery/pricing
       | 
       |   BigQuery offers a choice of two compute pricing models for
       |   running queries:
       | 
       |   On-demand pricing (per TiB). With this pricing model, you are
       |   charged for the number of bytes processed by each query. The
       |   first 1 TiB of query data processed per month is free.
       | 
       |   Queries (on-demand): $6.25 per TiB; the first 1 TiB per month
       |   is free.
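       | 
       | A sketch of what 'sensible query patterns' can look like with
       | the Python client (dataset, table, and column names here are
       | hypothetical):
       | 
       |   from google.cloud import bigquery
       | 
       |   client = bigquery.Client()  # assumes default credentials
       | 
       |   # Fail fast if a query would scan more than ~50 GiB instead
       |   # of silently billing for a full-table scan.
       |   job_config = bigquery.QueryJobConfig(
       |       maximum_bytes_billed=50 * 1024**3
       |   )
       | 
       |   sql = """
       |   SELECT user_id, COUNT(*) AS events
       |   FROM `my_project.analytics.events`
       |   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
       |   GROUP BY user_id
       |   """
       | 
       |   # event_date above is the table's partition column, so only
       |   # the last week's partitions are scanned.
       |   for row in client.query(sql, job_config=job_config).result():
       |       print(row.user_id, row.events)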
        
       | ml-anon wrote:
       | Big data was always a buzzword that was weirdly coopted by
       | database people in a way which makes 0 sense. Of course there are
       | vanishingly small number of use cases where we need fast random
       | access to any possible record to look up.
       | 
       | However what ML systems and in particular LLMs rely on having
       | access to millions (if not billions or trillions) of examples.
       | The underlying infra of which is based on some of these tools.
       | 
       | Big Data isn't dead, just this weird idea that the tools and
       | usecases around querying databases has been finally recognised as
       | being mostly useless to most people. It is and always has been
       | about training ML models.
        
       | stakhanov wrote:
       | The funny thing about "big data" was that it came with a perverse
       | incentive to avoid even the most basic and obvious optimizations
       | on the software level, because the hardware requirement was how
       | you proved how badass you were.
       | 
       | Like: "Look, boss, I can compute all those averages for that
       | report on just my laptop, by ingesting a SAMPLE of the data,
       | rather than making those computations across the WHOLE dataset".
       | 
       | Boss: "What do you mean 'sample'? I just don't know what you're
       | trying to imply with your mathmo engineeringy gobbledigook! Me
       | having spent those millions on nothing can clearly not be it,
       | right?"
        
         | Spooky23 wrote:
         | It came with a few cohorts of Xooglers cashing their options
         | out.
         | 
         | The amount of salesman hype and chatter about big data,
         | followed by the dick-measuring contests about whose data was
         | big enough to be worthy, was intense for a while.
        
         | bpodgursky wrote:
         | This is a pretty snarky outside view and just not actually true
         | (I spent the first part of my career trying to reduce compute
         | spend as a data engineer).
         | 
         | It was extremely difficult to get more than 64 GB of RAM in a
         | machine for a very long time, and implementation complexity
         | gets hard FAST when you have a hard cap.
         | 
         | And it's EXTREMELY disruptive to have a process that fails
         | every 1/50 times when the data is slightly too large, because
         | your team will be juggling dozens of these routine crons, and
         | if each of them breaks regularly, you do nothing but dumb
         | oncall trying to trim bits off of each one.
         | 
         | No, Hadoop and MapReduce were not hyperefficient, but they
         | were OK if you wrote your jobs correctly, and having something
         | that ran reliably is WAY more valuable than boutique bit-
         | optimized C++ crap that nobody trusts or can maintain and that
         | fails every Thursday with insane segfaults.
         | 
         | (Nowadays, just use Snowflake. But it was a reasonable tool
         | for the time.)
        
       | lokimedes wrote:
       | I was a researcher at the Large Hadron Collider around the time
       | "Big Data" became a thing. We had one of the use cases where
       | analyzing all the data made sense, since it boiled down to
       | frequentist statistics: the more data, the better. Yet even with
       | a global network of supercomputers at our disposal, we funnily
       | enough figured out that fast local storage was better than
       | waiting for huge jobs to finish. So, surprise, surprise, every
       | single grad student somehow managed to boil the relevant data
       | for her analysis down to exactly 1-5 TB, without much loss in
       | analysis flexibility. There must be a law of convenience here
       | that rivals Amdahl's scaling law.
        
         | msl09 wrote:
         | I think that your law of convenience is spot on. One thing that
         | got by talking with commercial systems devs is that they are
         | always under pressure by their clients to make their systems as
         | cheap as possible, reducing the database stored and the size of
         | the computations is one great way to minimize the client's
         | monthly bill.
        
         | civilized wrote:
         | I think there is a law of convenience, and it also explains why
         | many technologies improve at a consistent exponential rate.
         | People are very good at finding convenient ways to make
         | something a little better each year, but every idea takes some
         | minimal time to execute.
        
         | marcosdumay wrote:
         | Let me try one:
         | 
         | "If you can't do your statistical analysis in 1 to 5 TB of
         | data, your methodology is flawed"
         | 
         | This is probably more about human limitations than math.
         | There's a clear ceiling in how much flexibility we can use.
         | That will also change with easier ways to run new kinds of
         | analysis, but it increases with the logarithm of the amount of
         | things we want to do.
        
         | kwillets wrote:
         | Back in the 80's and 90's NASA built a National Aerodynamic
         | Simulator, which was a big Cray or similar that could crunch
         | FEA simulations (probably a low-range graphics card nowadays).
         | IIRC they found that the wait in the queue for it was as long
         | as, or longer than, the time it took to run jobs on cheaper
         | hardware; MPP systems such as Beowulf grew out of those
         | efforts.
        
       | debarshri wrote:
       | My first job was doing hadoop stuff around ~2011. I think one of
       | the biggest motivators for big data or rather hadoop adoption was
       | that it was open source. Back then most of the data warehousing
       | platforms dominated by oracle, netezza, EMC, teradata etc. were
       | super expensive on per GB basis. It was followed by lot of
       | success stories about how facebook save $$$ or you could use
       | google's mapreduce for free etc.
       | 
       | Everyone was talking about data being the new "oil".
       | 
       | Enterprise could basically deploy petabyte scale warehouse run
       | HBase or Hive on top of it and build makeshift data-warehouses.
       | It was also when the cloud was emerging, people started creating
       | EMR clusters and deploy workloads there.
       | 
       | I think it was solution looking for problem. And the problem
       | existed only for a handful of companies.
       | 
       | I think somehow, how cloud providers abstracted lot of these
       | tools and databases, gave a better service and we kind of forgot
       | about hadoop et al.
        
       | LightFog wrote:
       | When working in a research lab we used to have people boast that
       | their analysis was so big it 'brought down the cluster' - which
       | outed them pretty quickly to the people who knew what they were
       | doing.
        
         | kjkjadksj wrote:
         | Must have been abusing the head node if they did that
        
       | corentin88 wrote:
       | No mention of Firebase? That might explain the slow decline of
       | MongoDB.
        
       | WesolyKubeczek wrote:
       | I remember going to one of those "big data" conferences back in
       | 2015, when it was the buzzword of the day.
       | 
       | The talks were all concentrated around topics like: ingesting
       | and writing the data as quickly as possible, sharding data for
       | the benefit of ingestion, and centralizing IoT data from around
       | the whole world.
       | 
       | Back then I had questions which were shrugged off -- or so it
       | seemed to me at the time -- as extremely naive, as if asking
       | them signified that I somehow wasn't part of the "in" crowd. The
       | questions were:
       | 
       | 1) Doesn't optimizing highly for key-value access mean that you
       | need to anticipate, predict, and implement ALL of the future
       | access patterns? What if you need to change your queries a year
       | in? The most concrete answer I got was that of course a good
       | architect needs to know and design for all possible ways the
       | data will be queried! I was amazed at either the level of
       | prowess of said architects -- such predictive powers that I
       | could never dream of attaining! -- or the level of self-
       | delusion, as the cynic in me put it.
       | 
       | 2) How can it be faster if you keep shoving intermediate
       | processing elements into your pipeline? It's not like you can
       | just mindlessly keep adding queues upon queues. That one never
       | got answered. The processing speeds of high-speed pipelines may
       | be impressive, but if some stupid awk over CSV can do it just as
       | quickly on commodity hardware, something _must_ be wrong.
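       | 
       | For reference, the kind of single-machine baseline that
       | comparison implies (Python stdlib over a CSV; the file and
       | column names are made up):
       | 
       |   import csv
       |   from collections import defaultdict
       | 
       |   revenue = defaultdict(float)
       |   with open("events.csv", newline="") as f:
       |       for row in csv.DictReader(f):
       |           revenue[row["country"]] += float(row["amount"])
       | 
       |   top = sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)
       |   for country, total in top[:10]:
       |       print(country, round(total, 2))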
        
       | nottorp wrote:
       | > How many workloads need more than 24TB of RAM or 445 CPU cores?
       | 
       | <cough> Electron?
        
       | GuB-42 wrote:
       | AI is the new "big data". In fact, AI as it is done today is
       | nothing without at least terabytes of data.
       | 
       | What the article talks about is more like a particular type of
       | database architecture (often collectively called "NoSQL") that
       | was a fad a few years ago, and, like all fads, it faded. That
       | doesn't mean having lots of data is useless, or that NoSQL is
       | useless, just that it is not the solution to every problem. And
       | also that there is a reason why regular SQL databases have been
       | in use since the 70s: except in specific situations most people
       | don't encounter, they just work.
        
       | coldtea wrote:
       | Selling "Big data" tooling and consulting was a nice money making
       | scheme for while it lasted.
        
       | deadbabe wrote:
       | Something similar will happen with generative AI someday.
       | 
       | AI scientists will propose all sorts of elaborate complex
       | solutions to problems using LLMs, and the dismissive response
       | will be "Your problem is solvable with a couple of if
       | statements."
       | 
       | Most people just don't have problems that require AI.
        
         | doubloon wrote:
         | That's one of the first things Andrew Ng said in his old ML
         | course.
        
           | deadbabe wrote:
           | This is why I personally just can't find motivation to even
           | pay attention to most AI developments. It's a toy, it does
           | some neat things, but there's no problem I've heard of or
           | encountered where LLM style AI was the only tool for the job,
           | or even the best tool. The main use seems to be content
           | creation and manipulation at scale, which the vast majority
           | of companies simply don't have to deal with.
           | 
           | Similarly, a lot of companies talk about how they have tons
           | of data, but there's never any real application or game
           | changing insight from it. Just a couple neat tricks and
           | product managers patting themselves on the back.
           | 
           | Setting up a good database is probably the peak of a typical
           | company's tech journey.
        
             | int_19h wrote:
             | Natural language processing is an obvious area in which
             | LMs consistently outperform the best bunches of if
             | statements by a very large margin, and it has very broad
             | applicability.
             | 
             | E.g. I would argue that its translation capabilities alone
             | make GPT-4 worthwhile, even if it literally couldn't do
             | anything else.
        
       | fijiaarone wrote:
       | The problem with big data is that people don't have data, they
       | have useless noise.
       | 
       | For lack of data, they generate random bytes collected on every
       | mouse movement on every page, and every packet that moves through
       | their network. It doesn't tell them anything because the only
       | information that means anything is who clicks that one button on
       | their checkout page after filling out the form with their
       | information or that one request that breaches their system.
       | 
       | That's why big data is synonymous with meaningless charts on
       | pointless dashboards sold to marketing and security managers who
       | never look at them anyway.
       | 
       | It's like tracking wind, humidity, temperature, and barometric
       | pressure every tenth of a second for every square meter.
       | 
       | It won't help you predict the weather any better than stepping
       | outside and looking at the sky a couple of times a day.
        
         | thfuran wrote:
         | It absolutely would. You can't build a useful model off of
         | occasionally looking outside.
        
       | iamleppert wrote:
       | What he means to say, is the grift is dead. All the best fish
       | have been fished in that data lake (pun intended), leaving most
       | waiting on the line to truly catch an appealing mackerel. Most of
       | the big data people I know have moved on to more lucrative grifts
       | like crypto and (more recently) AI.
       | 
       | There are still bags to be made if you can scare up a CTO or one
       | of his lieutenants working for a small to medium size Luddite
       | company. Add in storage on the blockchain and a talking AI parrot
       | if you want some extra gristle in your grift.
       | 
       | Long live the tech grifters!
        
       | RyanHamilton wrote:
       | For 10 years he sold companies on Big Data they didn't need, and
       | he only just realised most people don't have big data. Now he's
       | switched to small-data tools, and we should use/buy those. Is it
       | harsh to say either a) 10 years = he isn't good at his job, or
       | b) Jordan will sell whatever he gets paid to sell?
        
       | renegade-otter wrote:
       | Big Data is not dead - it has been reborn as AI, which is
       | essentially Big Data 2.0.
       | 
       | And in just the same fashion, there was massive hype around Big
       | Data 1.0. From 2013:
       | https://hbr.org/2013/12/you-may-not-need-big-data-after-all
       | 
       |  _Everyone_ has _so_ much data that they _must_ use AI in order
       | to tame it. The reality, however, is that most of their data is
       | crap and all over the place, and no amount of Big Data 1.0 or
       | 2.0 is ever going to fix it.
        
       | donatj wrote:
       | The article only touches on it for a moment but GDPR killed big
       | data. The vast majority of the data that any regular business
       | had, and that could be considered big, almost certainly contained
       | PII in one form or another. It became too much of a liability to
       | keep it around.
       | 
       | With GDPR, we went from keeping everything by default unless a
       | customer explicitly requested it gone to deleting it all
       | automatically after a certain number of days after their license
       | expires. This makes opaque data lakes completely untenable.
       | 
       | Don't get me wrong, this is all a net positive. The customers'
       | data is physically removed and they don't have to worry about
       | future leaks or malicious uses, and we get a more efficient
       | database. The only people really fussed were the sales team
       | trying to lure people back with promises that they could pick
       | right back up where they left off.
        
       | gigatexal wrote:
       | DuckDB is nothing short of amazing. The only thing is that when
       | the dataset is bigger than system RAM, it falls apart. Spilling
       | to disk is still broken.
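       | 
       | For context, these are the settings that are supposed to enable
       | out-of-core execution (DuckDB's Python API; the path and limit
       | here are illustrative):
       | 
       |   import duckdb
       | 
       |   con = duckdb.connect("analytics.duckdb")
       |   con.execute("SET memory_limit = '8GB'")          # cap RAM use
       |   con.execute("SET temp_directory = '/tmp/duck'")  # spill target
       | 
       |   # A larger-than-RAM aggregation that relies on spilling:
       |   con.execute("""
       |       SELECT user_id, count(*) AS events
       |       FROM read_parquet('events/*.parquet')
       |       GROUP BY user_id
       |   """).fetchall()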
        
       | surfingdino wrote:
       | In my experience, the only time I worked on a real big data
       | project was on the public Twitter firehose. The team built an
       | amazing pipeline, and it did actually deal with masses of data.
       | Every other team I've been on was delusional and kept building
       | expensive, overcomplicated solutions that could have been
       | replaced with a single Postgres instance. The most
       | overcomplicated system I've seen could not process 24 hours'
       | worth of data in under 24 hours... I was happy to move on when
       | an opportunity presented itself.
        
       | rr808 wrote:
       | You're never going to get a FAANG job with an approach like
       | that. And basically most developers are working towards that
       | goal.
        
       | ricardo81 wrote:
       | Is there a solid definition of big data nowadays?
       | 
       | It seems to be conflated somewhat with SV companies completely
       | dismantling privacy concerns and hoovering up as much data as
       | possible. There are lots of scenarios, I'm sure; I'm just
       | thinking of FAANG in the general case.
        
       | nextworddev wrote:
       | Big data is there so that it can justify Databricks and Snowflake
       | valuations /s
        
       | cheptsov wrote:
       | A good clickbait title. One should credit the author for that.
       | 
       | As to the topic, IMO, there is a contradiction. The only way to
       | handle big data is to divide it into chunks that aren't expensive
       | to query. In that sense, no data is "big" as long as it's handled
       | properly.
       | 
       | Also, about big data being only a problem for 1 percent of
       | companies: it's a ridiculous argument implying that big data was
       | supposed to be a problem for everyone.
       | 
       | I personally don't see the point behind the article, with all due
       | respect to the author.
       | 
       | I also see many awk experts here who have never been in charge of
       | building enterprise data pipelines.
        
       | markus_zhang wrote:
       | I think one problem that arises from practical work is that
       | databases seem to be biased towards either transactional
       | workloads (including fetching single records) or aggregational
       | ones, but in reality both are used extensively. This also makes
       | data modelling difficult, since we DEs are mostly thinking about
       | aggregating data while our users also want to investigate single
       | records.
       | 
       | Actually, now that I think about it, we should have two products
       | for the users: one that lets them query single records as fast
       | as possible without hitting the production OLTP database, even
       | from really big data (find one record in PB-level data in
       | seconds), and one to power the dashboards that ONLY show
       | aggregations. Is a lakehouse a solution? I have never used one.
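       | 
       | A toy version of that split, assuming Python (SQLite standing in
       | for the point-lookup store, DuckDB over Parquet for the
       | aggregation side; table and column names are made up):
       | 
       |   import sqlite3
       |   import duckdb
       | 
       |   # Point lookups: a copy keyed and indexed by record id.
       |   kv = sqlite3.connect("records.db")
       |   kv.execute(
       |       "CREATE TABLE IF NOT EXISTS records "
       |       "(id TEXT PRIMARY KEY, payload TEXT)"
       |   )
       |   one = kv.execute(
       |       "SELECT payload FROM records WHERE id = ?", ("abc123",)
       |   ).fetchone()
       | 
       |   # Dashboards: columnar scans over the same data as Parquet.
       |   daily = duckdb.sql("""
       |       SELECT date_trunc('day', ts) AS day, count(*) AS n
       |       FROM read_parquet('records/*.parquet')
       |       GROUP BY day ORDER BY day
       |   """).fetchall()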
        
       | lukev wrote:
       | This is a good post, but it's somewhat myopically focused on
       | typical "business" data.
       | 
       | The most interesting applications for "big data" are all (IMO) in
       | the scientific computing space. Yeah, your e-commerce business
       | probably won't ever need "big data", but load up a couple of
       | genomics research sets and you sure will.
        
       | maartet wrote:
       | Reminds me of this gem from the previous decade:
       | https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html (Don't
       | use Hadoop - your data isn't that big)
       | 
       | Which, of course, was discussed on HN back then:
       | https://news.ycombinator.com/item?id=6398650
        
       | unholyguy001 wrote:
       | One of the problems with the article is that BigQuery becomes
       | astronomically expensive at mid to high volumes. So there is a
       | strong incentive to keep the data in BigQuery manageable, or
       | even to move off it as data volumes get higher.
       | 
       | Also, larger enterprises don't use GCP all that much to begin
       | with.
        
       | tored wrote:
       | Former prime minister of Sweden Stefan Lofven got wind of the
       | Big Data buzzword back in 2015 and used it in one of his
       | speeches, saying it was the future. However, he used the Swedish
       | translation of Big Data, "stordata", which generated some funny
       | quips asking where "lilldata" (little data) was.
       | 
       | https://www.aftonbladet.se/senastenytt/ttnyheter/inrikes/a/8...
        
       | xiaodai wrote:
       | Big memory is eating big data.
        
       | angarg12 wrote:
       | In previous years I would have completely agreed with this post.
       | Nowadays, with the AI and ML craze, I'm not so sure. I've seen
       | plenty of companies using vast amounts of data to train ML
       | models to incorporate into their products. Definitely more data
       | than can be handled by a traditional DB, and well into Big Data
       | territory.
       | 
       | This isn't a value judgement about whether that's a good idea,
       | just an observation from talking with many tech companies doing
       | ML. This definitely feels like a bubble that will burst in due
       | time, but for now ML is turbocharging Big Data.
        
         | int_19h wrote:
         | But do they need to _query_ that data?
        
       | breckognize wrote:
       | 99.9%+ of data sets fit on an SSD, and 99%+ fit in memory. [1]
       | 
       | This is the thesis for https://rowzero.io. We provide a real
       | spreadsheet interface on top of these data sets, which gives you
       | all the richness and interactivity that a spreadsheet affords.
       | 
       | In the rare cases you need more than that, you can hire
       | engineers. The rest of the time, a spreadsheet is all you need.
       | 
       | [1] I made these up.
        
       | BiboAlpha wrote:
       | One of the major challenges with Big Data is that it often fails
       | to deliver real business value, instead offering only misleading
       | promises.
        
       | mavili wrote:
       | Big Data has never been about storage; I thought it was always
       | about processing. The guy obviously knows his stuff, but I got
       | the impression he stressed storage, and how cheap and easy that
       | is these days. When he does mention processing/computing, he
       | notes that most of the time people end up querying only recent
       | data (i.e. a small chunk of the actual data they hold), but that
       | raises the question: is querying only a small chunk of data what
       | businesses need, or are they doing it because querying the whole
       | dataset is just not manageable? In other words, if processing
       | all the data at once were as easy as querying the most recent X
       | percent, would most businesses still choose to query only the
       | small chunk? I think therein lies the answer to whether Big Data
       | (processing) is needed or not.
        
       | siliconc0w wrote:
       | Storing data in object storage and querying it from compute,
       | caching what you can, basically scales until your queries are
       | too expensive for a single node.
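       | 
       | A minimal sketch of that pattern (DuckDB's httpfs extension
       | reading Parquet straight out of a bucket; the bucket name and
       | credential setup are hypothetical):
       | 
       |   import duckdb
       | 
       |   con = duckdb.connect()
       |   con.execute("INSTALL httpfs; LOAD httpfs")
       |   con.execute("SET s3_region = 'us-east-1'")  # plus keys/env
       | 
       |   # Only the needed columns and row groups get fetched.
       |   rows = con.execute("""
       |       SELECT order_date, sum(amount) AS revenue
       |       FROM read_parquet('s3://my-bucket/orders/*.parquet')
       |       GROUP BY order_date
       |   """).fetchall()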
        
       | idunnoman1222 wrote:
       | I love the way he talks about this. His paycheque from big data
       | is what is dead. His service offering was always bullshit.
        
       | fredstar wrote:
       | I've been at a software services company for more than 15 years.
       | And to be honest, a lot of these big topics have always been
       | some kind of sales talk or door opener. You write a white paper,
       | nominate an 'expert' on your team, and use these things in
       | conversations with clients. Sure, some trends are way more real
       | and useful than others. But for me the article hits the nail on
       | the head.
        
       | aorloff wrote:
       | What is dead is the notion that some ever expanding data lake is
       | a mine full of gems and not a cost center.
        
       | schindlabua wrote:
       | I used to work at a company that produced 20 gigs of analytics
       | every day, which is probably the biggest data I'll ever work on.
       | My junior project was writing some data-crunching jobs that did
       | aggregations, batched and in real time, and stored the results
       | as Parquet blobs in Azure.
       | 
       | My boss was smart enough to have stakeholder meetings where they
       | regularly discussed what to keep and what to throw away, and
       | with some smart algorithms we were able to compress all that
       | data down to something like 200 MB per day.
       | 
       | We loaded the last 2 months into one SQL Server instance and the
       | last 2 years, further aggregated, into another, and the whole
       | company used Excel to query the data in reasonable time.
       | 
       | The big data is rotting away on tape storage in case they ever
       | need it in the future.
       | 
       | My boss got a lot of stuff right and I learned a lot, though I
       | only realized that in hindsight. Dude was a bad manager but he
       | knew his data.
        
       | dang wrote:
       | Related:
       | 
       |  _Big data is dead_ -
       | https://news.ycombinator.com/item?id=34694926 - Feb 2023 (433
       | comments)
       | 
       |  _Big Data Is Dead_ -
       | https://news.ycombinator.com/item?id=33631561 - Nov 2022 (7
       | comments)
        
       | kurts_mustache wrote:
       | What was the source data here? It seems like a lot of the graphs
       | are just the author's intuitive feel rather than being backed by
       | any sort of hard data. Did I miss it in there somewhere?
        
       | lkdfjlkdfjlg wrote:
       | TLDR: he used to say that you needed the thing he's selling. He
       | changed his mind now and you need the opposite, which he's
       | selling.
        
       ___________________________________________________________________
       (page generated 2024-05-27 23:01 UTC)