[HN Gopher] What's the origin of the phrase "big data doesn't fi...
       ___________________________________________________________________
        
       What's the origin of the phrase "big data doesn't fit in excel"?
        
       Author : edent
       Score  : 99 points
       Date   : 2021-04-17 11:57 UTC (11 hours ago)
        
 (HTM) web link (shkspr.mobi)
 (TXT) w3m dump (shkspr.mobi)
        
       | SunlightEdge wrote:
       | It's been common knowledge since... Forever really. I doubt there
       | was a specific person. It just spread across industries due to
       | the known limitations in performance.
       | 
        | Excel was well known to be unable to handle large datasets back
        | in 2006 when I started in the data analytics field. I'd guess
        | Excel's limitations go back to when it was originally released.
       | 
       | Excel is very good btw. A marvel in many ways. It just becomes
       | painful with big data around the 100-300k mark.
        
         | SunlightEdge wrote:
          | The million-row mark for Excel, while widely known, is never
          | taken seriously by anyone who actually uses Excel often. Its
          | performance collapses around the 100-300k mark.
        
           | tyingq wrote:
            | I would guess it depends on the columns as well. I just
            | opened a million-row, 10-column spreadsheet and added a few
            | formulas like =AVERAGE(A1:A999999) and it's responsive...no
            | hangs or other issues. "Save as" ran for 5 seconds or so,
            | on an older i5 laptop with 8GB.
        
             | SunlightEdge wrote:
              | It's difficult to comment without seeing your workbook.
              | But broadly, if you're developing any kind of useful
              | analytical model, you will find the performance falls off
              | a cliff very quickly at around 1 million rows: waiting
              | around 25 minutes to save the file (or having it
              | corrupted), the software suddenly becoming very slow and
              | weird. And that's even with tricks like turning off
              | automatic calculation, hard-pasting in the results of
              | calculations (rather than keeping all cells as formulas),
              | only running calculations on specific cells, etc.
             | 
              | There are things you can do to try to manage big data in
              | Excel. But it becomes a chore and ultimately a big bloated
              | monster.
             | 
              | And why bother with the pain when you can use Python
              | (e.g. pandas), R, or SAS?
             | 
             | Power excel / an SQL backend (with VBA) is also a common
             | solution. Or was. I mostly work in Python/R these days.
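The out-of-core style of processing hinted at above can be sketched with nothing but Python's standard library; the CSV layout and column index below are hypothetical:

```python
import csv

def column_average(path, col):
    """Stream a CSV and average one numeric column while holding
    only a single row in memory at a time."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            total += float(row[col])
            count += 1
    return total / count if count else 0.0
```

Equivalent in spirit to =AVERAGE(A2:A1048576), except the row count is bounded by disk rather than by the spreadsheet grid.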
        
             | CryptoBanker wrote:
             | Try adding a recursive formula or two in there and see what
             | happens
        
           | LegitShady wrote:
            | My guess is that even if it is refactorable, it would be
            | made into its own separate product or attached to another
            | program like Power Query or their BI offerings.
        
       | tyingq wrote:
       | Hm. Here's the various limits for Excel:
       | 
       | https://support.microsoft.com/en-us/office/excel-specificati...
       | 
        | Everything from Excel 2007 and up supports 1,048,576 rows by
        | 16,384 columns, if you use the XLSX format. The older XLS format
        | tops out at 65,536 rows and 256 columns.
       | 
       | You can, of course, "shard" worksheets ;)
        
         | carl_dr wrote:
         | As the UK found out just recently :
         | https://www.bbc.com/news/technology-54423988
        
         | [deleted]
        
         | EvanAnderson wrote:
         | A fun-to-troubleshoot "stupid Excel trick" that I ran into,
         | arguably "caused" by the increased row limit in >2007 versions:
         | 
          | User typically drags formatting and "default" values down to
          | row 65,536 in an Excel 2003-based "data collection"
          | spreadsheet. Does the same thing in a >2007 spreadsheet,
          | except they drag it down to row 1,048,576. The file size
          | doesn't take a dramatic "hit" (since XLSX files are really
          | ZIP archives), but performance goes in the toilet.
         | 
          | Knowing enough to unzip XLSX and DOCX files and eyeball the
          | XML directly can often identify fun corner cases. (There's
          | probably a ton of fuzzing fruit to be picked in the Office
          | products using "malformed" documents, too.)
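Since an XLSX file really is a ZIP archive, the "unzip and eyeball the XML" trick starts with something this small (a sketch; the part names are whatever the workbook actually contains):

```python
import zipfile

def xlsx_parts(path):
    """List the XML parts inside an .xlsx file, which is
    just a ZIP archive of XML documents."""
    with zipfile.ZipFile(path) as z:
        return z.namelist()
```

From there, z.read() on whichever part exists gives you the raw XML to inspect.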
        
         | dopidopHN wrote:
          | That's exactly what I was getting at. What prevents a user
          | from splitting data across thousands of spreadsheets?
          | 
          | Given how the corporate world runs on Excel... someone has
          | done it, somewhere.
        
           | chrismarlow9 wrote:
           | I'm sure it happens all the time from a logical perspective
           | without the user even knowing what a shard is.
           | 
           | Example: Too many rows? Okay let's just break it up into
           | monthly sheets and then change this formula to look up the
           | value based on the month. Maybe just use a drop-down for the
           | sheet selection.
           | 
            | I think most people with enough Excel experience could
            | arrive at this answer intuitively.
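Mechanically, the "monthly sheets plus a sheet-picking formula" workaround is a partition-by-key with a routed lookup; a standard-library sketch over hypothetical (date, value) rows:

```python
from collections import defaultdict

def shard_by_month(rows):
    """Partition (date, value) rows into per-month buckets, the
    way a user might split one huge sheet into monthly tabs."""
    sheets = defaultdict(list)
    for date, value in rows:
        sheets[date[:7]].append((date, value))  # key on "YYYY-MM"
    return sheets

def lookup(sheets, date):
    """The 'formula that picks the right sheet by month' step:
    route the query to the matching shard, then scan only it."""
    for d, v in sheets.get(date[:7], []):
        if d == date:
            return v
    return None
```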
        
         | conductr wrote:
          | In practice, Excel has all types of issues as you approach a
          | fraction of the available row/column maximums. Your files will
          | be crashing, corrupting, or just impossible to use. If any of
          | your rows or columns have formulas, it's an order of magnitude
          | worse at high volumes.
         | 
          | That said, Excel has Power Query/Pivot and connects to a
          | large number of external databases, flat files, etc., and
          | it's best to just never import the entire dataset into Excel.
          | That's my view as a tenured financial analyst working with
          | Excel professionally since version 2003.
        
           | AstralStorm wrote:
           | Yup, Excel easily interacts with SQL databases, some of which
           | can handle really big data pretty nicely.
        
         | m463 wrote:
          | I wonder if Excel could get CUDA support...
        
       | sumnole wrote:
       | If Microsoft releases a solution for viewing big data in Excel,
       | does data have to become even bigger to be "big"? They're stuck
       | endlessly chasing a carrot! :)
        
         | kreeben wrote:
         | My thesis is MS will never go down this path, not because it
         | isn't an excellent idea but because Excel has become
          | unrefactorable.
        
           | dragonwriter wrote:
           | > My thesis is MS will never go down this path, not because
           | it isn't an excellent idea but because Excel has become
            | unrefactorable.
           | 
           | They'll just make an optional feature called Power
           | _Something_ that is integrated with the Excel UI but mostly a
           | separate tool that interacts with the main Excel system
           | without sharing its limits.
        
           | bitwize wrote:
           | That's okay, they'll just rewrite it -- largely in JavaScript
           | -- and offer it as a service as part of Microsoft 365. It
           | could then ingest datasets from anywhere in the cloud, even
           | large SQL or MongoDB databases.
        
         | PeterisP wrote:
         | Yes, sure - the "big data" issues are essentially about data
         | above a certain point where very different techniques for
         | processing it (e.g. cluster computing and algorithms to split
         | and merge partial results of computing that's not trivially
         | parallelizable) start to make sense or even become required. If
         | technology progress grants a huge increase in available memory
         | and computing power and algorithm efficiency, then that
         | boundary moves upwards.
         | 
         | There are early "big data" publications on datasets that
         | decidedly shouldn't be treated as "big data" today because now
         | it's entirely appropriate to process them with simple methods
         | within the RAM on a decent server or in some cases even on my
         | laptop.
        
       | random5634 wrote:
        | I do a ton of Excel work - the 1M row limit is surprisingly
        | low. And sharding across worksheets is miserable, in my view.
        
         | edrobap wrote:
         | It amuses me how the low limit nature of Excel has affected the
         | ecosystem around it. Apache POI, which is a popular library to
         | operate on Excel files, has a weird 4GB limit [1] on
         | uncompressed file size.
         | 
         | throw new IllegalArgumentException("Max entry size is bounded
         | [0-4GB], but had " + maxEntrySize);
         | 
         | [1]
         | https://svn.apache.org/viewvc/poi/tags/REL_5_0_0/src/ooxml/j...
        
           | simias wrote:
           | What makes you think this limitation is related to Excel?
           | 4GiB is a very common limit due to it being the max size you
           | can fit on a 32bit integer. It's the max file size on FAT32
           | as well for instance.
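The arithmetic behind that boundary is easy to verify: a 32-bit unsigned length field simply cannot describe anything larger, which is why FAT32 files, POI entries, and plenty of other formats stop at the same place:

```python
# Largest value an unsigned 32-bit integer can hold.
MAX_U32 = 2**32 - 1

# One byte short of 4 GiB: the shared ceiling for any format
# that stores sizes or offsets in 32 bits.
GIB = 2**30
assert MAX_U32 == 4 * GIB - 1
```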
        
       | narush wrote:
        | Explicit data size limits (something on the order of 1M rows at
        | time of writing) are only one part of the problem in trying to
        | use Excel with Big Data (TM). Generally, if you've got millions
        | of rows, you're working with some sort of data export from
        | another tool (as opposed to, say, financial data you're using
        | to model your company's finances).
       | 
       | And with millions of rows of data pouring out of some other tool,
       | usually you're trying to define a repeatable process to
       | clean/munge/transform that data into something more useful to
       | you/your team/your management.
       | 
       | Within Excel, there are ways of accomplishing the "define
       | repeatable task" goal - but my personal experience working with
       | VBA (and talking to VBA users across the spectrum) is that it's a
       | horrible language that is absolutely no fun at all to write. Good
       | luck using a nice library to do anything with it, really.
       | 
       | I'm a co-founder of Mito [1], where we're taking a bit of a
       | different angle. Rather than bringing big data into Excel, we're
       | bringing an Excel ethos to where you might work with your big
       | data otherwise. Mito is a spreadsheet interface that lives inside
       | of a Jupyter notebook; you can write spreadsheet formulas, merge
       | datasets, explore summary stats, all from within this
       | spreadsheet. While you edit the spreadsheet, it generates valid
       | Python code for you.
       | 
        | Our current users mostly fit the bill of "previous Excel junkies
        | who started teaching themselves Python but still have a lot to
        | learn, so use Mito to augment/speed up their workflow."
       | 
       | Questions / comments / hard-hitting HN feedback greatly
       | appreciated!
       | 
       | [1] https://trymito.io/launch
        
       | shagie wrote:
       | This reminds me of DevOps Borat, from Jan 8, 2013:
       | https://twitter.com/devops_borat/status/288698056470315008?l...
       | 
       | > Big Data is any thing which is crash Excel.
        
         | edent wrote:
         | Thanks. Although that's a few years after the phrase was first
         | popularised - it's still funny :-)
        
       | indymike wrote:
       | The joke in our office was, "Big data doesn't fit on a thumb
       | drive." At the time a big thumb drive was 64GB.
        
       | ur-whale wrote:
       | Do glaringly obvious things need to have an origin?
       | 
       | What value could that have?
        
         | edent wrote:
         | As I say in the post, I saw it quoted in an academic paper and
         | I wanted to see who actually said it.
         | 
         | That's then a good way to understand _why_ they said it and
         | what they actually meant.
         | 
         | So, for me, the value is understanding the provenance and
         | assumptions of the quote.
        
       | nojito wrote:
        | How does no one in this thread know about Power Query and Power
        | Pivot?
       | 
       | I've been using excel with tens of millions of rows for years
       | now.
        
         | fencepost wrote:
         | Keep in mind that for the people discussing this 10-12 years
         | ago Excel had almost always had a limit of 64k rows. As another
         | comment notes Excel 2007 is where that was increased to >
         | 1,000,000.
         | 
          | If you were a recognized person in data science, you'd
          | probably have been using tools with the 64k limit for at
          | least a decade.
        
       | atat7024 wrote:
       | Academia.
        
       | dfilppi wrote:
       | Sounds like a feature request for MSFT
        
       | supernova87a wrote:
       | I'll get back to you on that as soon as my sheet finishes
       | calculating for 5 min because I changed the value of 1 cell.
       | 
       | Ohh..sorry, make that 10 min -- I had to save the file.
        
       | known wrote:
        | Complete/big data should be loaded into memory for doing any
        | processing in Excel.
        
       | Scarblac wrote:
       | I seem to remember a blog post one day about someone interviewing
       | for several "Big Data" positions - where the data turned out to
       | be some Excel sheets. Or fit on a thumb drive. The hiring
       | companies that thought they had "big data" really didn't, was the
       | point. The blog post was widely shared around my office at the
       | time and I see it as the origin of this idea.
       | 
       | Edit: I think it was
       | https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html, HN
       | discussion https://news.ycombinator.com/item?id=6398650
       | 
        | But the idea wasn't that data was big as soon as it didn't fit
        | in Excel anymore - Excel was a reductio ad absurdum. If it
        | still fits in Excel, it's _laughably far_ away from being
        | "big". _That_ was the idea.
       | 
        | Big data was about the problems you get when you need to join
        | data together that you can't fit well in one database server,
        | not even with the amounts of memory and disk space you can get
        | these days. It was Google-type problems, the stuff that
        | map-reduce needed to be invented for. _The kind of problem that
        | most companies just don't have_.
       | 
       | So the term mostly disappeared from job descriptions, "data
       | science" became popular instead, and people keep using Excel
       | because it is great.
       | 
       | (that blog post is from 2013 so it can't be the source of that
       | quote from 2012. But please don't define big data in terms of
       | Excel in a 2021 MSc thesis)
        
         | GoodDreams wrote:
          | A company I worked for defined big data as anything not
          | analyzable with a relational database - a PDF or video was
          | "big data". Companies say they want the latest buzzword even
          | though it's against their own interest, and other companies
          | say "we can give that to you" and instead deliver something
          | that is of actual value, so the license gets renewed. The
          | same thing happened with SOA, REST, microservices, and
          | blockchain. IBM developed a blockchain product just to get
          | the sales conversations going and then sold something else.
          | Imagine being on that product's team.
        
           | [deleted]
        
           | saalweachter wrote:
           | I would go with "big data is when you have to spend more
           | effort on the mechanism of performing the computation than on
           | the details of the analysis itself."
        
         | moksly wrote:
          | I work for an average-sized Danish city. Even when we
          | combined our datasets with 10 other average-sized cities,
          | accumulated the biggest national data set, and let the big
          | data wizards have a go at it, it wasn't really large enough
          | to amount to anything useful.
          | 
          | Don't get me wrong, Watson came up with a lot of BI
          | suggestions that were useful, and some of the scientists came
          | up with semi-interesting prediction models. The thing is,
          | though, our analytics team has much better BI models and our
          | finance department does the prediction much more efficiently
          | and, well, legally. The prediction could become a useful aid,
          | if it ever became legal, but not really at the license fees
          | we were looking at.
          | 
          | Not sure if automated BI is ever going to become good enough.
          | It's impressive that Watson can come up with stuff a
          | university post-bachelor intern can, but its license costs
          | more than our entire analytics team, so yeah...
        
           | Scarblac wrote:
           | I work in water management in the Netherlands and it feels
           | the same.
           | 
            | A human domain expert _knows_ what kind of thing to look
            | for. Custom tools can help to get an overview (e.g. we
            | visualize water velocities of a whole regional system on a
            | map), but generic machine learning can't add much even
            | though we record basically everything everywhere at
            | five-minute intervals.
           | 
           | The value of ML is perhaps in things that give little value
           | in the individual case but can be used a huge number of
           | times, like image classification.
        
             | bostik wrote:
              | I can give one example where the ML/AI _toolkit_ is
              | useful, even if only the largest providers in the world
              | may need to work with real Big Data[tm] volumes: fraud
              | detection. And you don't need to be a particularly big
              | operator to get sufficiently large & varied data sets.
             | 
             | In a way I think it helps that fraud is an ever-evolving,
             | fiercely adversarial domain. New avenues are explored all
             | the time. Occasionally old tricks are revived for a while,
             | because they may work in the margins but become
             | distinguishing features as soon as they see more use.
             | 
             | And even if your ML is based on nothing more than a random
             | forest, you can still get surprisingly useful feature
             | combinations out of it. (Or as our data scientists said:
             | individually meaningless features may become a valuable
             | signal when enough of them occur at the same time.)
        
             | pbhjpbhj wrote:
             | Human directed machine learning then? Presumably a domain
             | expert who can also expertly leverage ML/AI could make some
             | useful tools?
        
         | edent wrote:
          | Thanks for that link. Interesting that it was written a year
          | or so after the phrase became popular.
        
         | [deleted]
        
         | hkmurakami wrote:
         | Friend in one of the big data vendors: "we mostly work with
         | medium data"
        
           | nerdponx wrote:
           | The way I always think of it: Medium data fits on disk but
           | not in memory. Big data doesn't fit in either.
        
             | alexchamberlain wrote:
              | I like this definition too; the advantage is that it
              | scales with advances in technology. What was medium a
              | decade ago might be considered small(ish) now.
        
         | cortesoft wrote:
         | Luckily that blog post is out of date and there are a lot of
         | options now to handle datasets larger than 5TB without needing
         | Hadoop.
        
           | jleahy wrote:
           | Awk, for example.
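The awk suggestion is only half a joke: a one-pass, constant-memory aggregation covers a lot of "big" jobs. The same idiom in Python, with a hypothetical key-value line format:

```python
from collections import defaultdict

def group_sum(lines):
    """One-pass group-and-sum, the moral equivalent of
    awk '{s[$1] += $2} END {for (k in s) print k, s[k]}'."""
    totals = defaultdict(float)
    for line in lines:
        key, value = line.split()
        totals[key] += float(value)
    return dict(totals)
```

Memory use scales with the number of distinct keys, not the number of input lines.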
        
         | simias wrote:
          | I think it's kind of pointless to fight for an exact meaning
          | for these buzzwords. "Big Data", "Webscale", "Web 2.0",
          | "NoSQL" and whatever other trendy hashtag du jour have lost
          | most of their meaning because they hardly had any to begin
          | with. They're marketing drivel, not technical terms. And
          | since everybody wants to look cool, we all find ways to say
          | that we're doing it, one way or the other.
        
           | klodolph wrote:
           | Other than "webscale" I think the other terms still mean
           | something. They don't need to have a precise, agreed upon
           | meaning to be useful words.
           | 
           | "Webscale" is just a buzzword, though.
        
             | Quarrelsome wrote:
             | Sorry I can't resist:
             | https://www.youtube.com/watch?v=b2F-DItXtZs
        
               | srswtf123 wrote:
               | This video is one of the best critiques of modern
               | software development, and I suppose IT in general, that
               | I've ever seen. Even if I've seen it a hundred times, I
               | always appreciate seeing it again. Thank you!
        
           | stadium wrote:
            | NoSQL was jargon for "document store". It's really schema
            | on read vs. the traditional database, where the schema is
            | defined on (or before) write. And there are plenty of ways
            | to use SQL on various data formats including JSON, Presto
            | being one.
           | 
            | Around the time of peak NoSQL buzz, I went to Oracle
            | OpenWorld in the early-to-mid 2010s and there was an army
            | of MongoDB folks outside the convention center grounds
            | holding signs and handing out swag. NoSQL was definitely a
            | good marketing play to try to take customers away from
            | Oracle.
        
         | snidane wrote:
          | Once the new breed of "data scientists" using R and Python
          | dataframes replaced the Excel "data analysts", they realized
          | the "big" of big data hadn't disappeared, and started hiring
          | data engineers to develop and explain the new big data
          | platforms for stuff which didn't fit into their pandas
          | dataframes.
          | 
          | Still, some call it medium data, since this spans the range
          | of high gigabytes to low terabytes. There are still a couple
          | of business problems at the petabyte scale, but they are
          | probably so niche that they deserve a custom solution anyway.
        
       | aappleby wrote:
       | If you can lift the device it's stored on with your bare hands,
       | it's not big data.
        
       | social_quotient wrote:
        | Maybe it's the limits that used to push you into a corner of
        | pain?
        | 
        | - 118-character file name limit
        | - 65,536-row limit till 2011
        | - 256-column limit till 2011
        | - 2GB memory limit
        | 
        | Those are the structural limits, which should be good for a lot
        | of things, but... there are the practical issues of it freezing
        | and misbehaving while actually using it at any scale or any
        | sort of complexity.
        
       | Vaslo wrote:
        | There are definite practical limits in Excel. Even at 100k
        | rows, some of the more complex financial models we have made
        | get very difficult to work with. We've spent days learning and
        | trying new ways to speed them up (never use VLOOKUP or OFFSET,
        | etc.), and they became usable but still frustrating.
        | 
        | I guess the biggest upside is that it's forced my team to go
        | out and learn SQL, Python, etc. However, that has its own
        | downside: because we are the only ones in our division who know
        | these tools, we get stuck maintaining stuff that we shouldn't!
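For what it's worth, the column-of-VLOOKUPs pattern that bogs Excel down reduces to a hash join once the data leaves the grid; a minimal sketch (pandas' merge is the usual heavier-duty tool):

```python
def vlookup_join(lookup_rows, keys):
    """Build the lookup table once, then probe it per key in
    O(1), instead of Excel's linear scan per formula cell."""
    table = dict(lookup_rows)
    return [(k, table.get(k)) for k in keys]
```

Missing keys come back as None, the analogue of VLOOKUP's #N/A.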
        
         | NicoJuicy wrote:
         | Share linqpad scripts?
        
         | bob1029 wrote:
          | Having everyone on the team competent in SQL is not an
          | unreasonable objective. Even project managers and executives
          | could benefit from knowing how to write their own queries
          | against various data sources.
         | 
         | As wizened developers, we can assist our more business-like
         | coworkers by developing views and reporting databases that help
         | to present a more consistent and higher-order perspective of
         | the world. Just having a document that lists out examples of
         | SQL they can use is 99% of what most people need to get
         | bootstrapped.
         | 
         | Getting your problem domains modeled in SQL and using the
         | appropriate form of normalization is foundational for managing
         | non-trivial levels of complexity in larger projects. For me,
         | non-trivial complexity means any domain model with more than 10
         | related types or more than 100 total properties to deal with.
         | 
         | Relational modeling is one of the most powerful abstractions we
         | have available for working with anything that goes beyond the 3
         | spatial dimensions that we can see with our eyeballs. I have
         | dealt with queries in factory automation that join over 40
         | tables to produce some important projection.
         | 
         | You could make a bunch of assumptions about how the domain
         | model should be shaped in some complex object graph monstrosity
         | and then stick it into MongoDB, or you can leave it neatly
         | organized and indexed such that any reasonable query can be
         | made of the data to produce virtually any shape of output you
         | need.
         | 
         | With all of that in mind, Excel is still one of the most
         | powerful tools on your computer for documenting virtually
         | anything. Any problem domain can be represented as tables of
         | things and relations between them. Anyone can figure out the
         | most important parts of this tool with just a few minutes of
         | screwing around with it. It is trivial to take a model from
         | someone's xlsx and turn it into proper SQL tables and then slap
         | a front-end and business logic around it. When someone wants me
         | to write a new piece of software for a new problem area, we
         | always start with types, properties, and relationships between
         | these things. Excel is a perfect fit for the first phase of any
         | software project.
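The "tables of things and relations between them" workflow described above ports directly to SQLite, which ships with Python; the table and column names here are invented for illustration:

```python
import sqlite3

# An xlsx-style model turned into proper tables plus a view,
# so any reasonable query shape is one SELECT away.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    CREATE VIEW customer_totals AS
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id;
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])
rows = conn.execute("SELECT * FROM customer_totals ORDER BY name").fetchall()
```

The view plays the role of the "document that lists out example SQL": non-developers query customer_totals without ever touching the join.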
        
       | amelius wrote:
       | Does it matter? Because I'm guessing that sooner rather than
       | later Excel will be updated so that big data will fit just fine
       | (using cloud technologies).
        
       | CalChris wrote:
        | I'm reminded of Joe Hellerstein's remark (and Oz Nova's
        | summation of it, _You Are Not Google_ [1] [2]) about even
        | actual Big Data people getting their scale wrong and thinking
        | they're Google:
        | 
        | > The thing is there's like 5 companies in the world that run
        | > jobs that big. For everybody else... you're doing all this
        | > I/O for fault tolerance that you didn't really need. People
        | > got kinda Google mania in the 2000s: "we'll do everything the
        | > way Google does because we also run the world's largest
        | > internet data service" [tilts head sideways and waits for
        | > laughter].
       | 
       | [1] https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
       | 
       | [2] https://news.ycombinator.com/item?id=19576092
        
       | inshadows wrote:
        | Sorry for the off-topic question. Several years ago I watched
        | part of a video lecture about implementing Excel principles. It
        | was a talk about functional programming, and he may have used
        | Haskell or Scala. I found it interesting because it began with
        | him explaining how Excel is a huge array and how changes
        | propagate. Do you have a link to this talk/lecture? Thanks in
        | advance!
        
       ___________________________________________________________________
       (page generated 2021-04-17 23:01 UTC)