[HN Gopher] What's the origin of the phrase "big data doesn't fi...
___________________________________________________________________
What's the origin of the phrase "big data doesn't fit in excel"?
Author : edent
Score : 99 points
Date : 2021-04-17 11:57 UTC (11 hours ago)
(HTM) web link (shkspr.mobi)
(TXT) w3m dump (shkspr.mobi)
| SunlightEdge wrote:
| It's been common knowledge since... forever, really. I doubt
| there was a specific person. It just spread across industries
| due to Excel's well-known performance limitations.
|
| Excel was well known to not be able to handle large datasets
| back in 2006 when I started in the data analytics field. I'd
| guess Excel's limitations go way back to when it was
| originally released.
|
| Excel is very good, btw. A marvel in many ways. It just
| becomes painful with big data, around the 100-300k row mark.
| SunlightEdge wrote:
| The million-row mark for Excel, while widely known, is never
| taken seriously by anyone who actually uses Excel often. Its
| performance collapses around the 100-300k row mark.
| tyingq wrote:
| I would guess it depends on the columns as well. I just
| opened a million-row, 10 column spreadsheet and added a few
| formulas like =average(a1:a999999) and it's responsive...no
| hangs or other issues. "Save as" ran for 5 seconds or so.
| On an older i5 laptop with 8GB.
| SunlightEdge wrote:
| It's difficult to comment without seeing your workbook. But
| broadly, if you're developing any kind of useful analytical
| model, you will find performance falls off a cliff very
| quickly at 1 million rows. e.g. waiting around 25 minutes to
| save the file (or having it corrupted), the software suddenly
| becoming very slow and weird. And that's even with tricks
| like turning off automatic calculation, hard-pasting in the
| results of calculations (rather than keeping all cells as
| formulas), only running calculations on specific cells, etc.
|
| There are things you can do to try and manage big data in
| Excel. But it becomes a chore and ultimately a big bloated
| monster.
|
| And why bother with the pain when you can use Python (e.g.
| pandas) or R or SAS?
|
| Power Excel / an SQL backend (with VBA) is also a common
| solution. Or was. I mostly work in Python/R these days.
| CryptoBanker wrote:
| Try adding a recursive formula or two in there and see what
| happens
| LegitShady wrote:
| My guess is that even if it were refactorable, it would be
| made into its own separate product or attached to another
| program like Power Query or their BI offerings.
| tyingq wrote:
| Hm. Here's the various limits for Excel:
|
| https://support.microsoft.com/en-us/office/excel-specificati...
|
| Everything from Excel 2007 and up supports 1,048,576 rows by
| 16,384 columns, if you use the XLSX format. The older XLS format
| tops out at 65,536 rows and 256 columns.
|
| You can, of course, "shard" worksheets ;)
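|
| If you really had to, here's a rough sketch of
| sheet-sharding with pandas (hypothetical file names; assumes
| openpyxl is installed for the xlsx engine):
|
| import pandas as pd
| # Stay under the 1,048,576-row per-sheet limit:
| ROWS = 1_000_000
| with pd.ExcelWriter("sharded.xlsx") as writer:
|     for i, part in enumerate(pd.read_csv("big.csv", chunksize=ROWS)):
|         part.to_excel(writer, sheet_name=f"part_{i}", index=False)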
| carl_dr wrote:
| As the UK found out just recently:
| https://www.bbc.com/news/technology-54423988
| [deleted]
| EvanAnderson wrote:
| A fun-to-troubleshoot "stupid Excel trick" that I ran into,
| arguably "caused" by the increased row limit in >2007 versions:
|
| User typically drags formatting and "default" values down to
| row 65535 in Excel 2003-based "data collection" spreadsheet.
| Does the same thing in a >2007 spreadsheet, except they drag it
| down to row 1,048,576. The file size doesn't take a dramatic
| "hit" (since XLSX files are really ZIP archives), but
| performance goes in the toilet.
|
| Knowing enough to unzip XLSX and DOCX files and eyeball the
| XML directly can often help identify fun corner cases.
| (There's probably
| a ton of fuzzing fruit to be picked in the Office products
| using "malformed" documents, too.)
| dopidopHN wrote:
| That's exactly what I was getting at. What prevents a user
| from splitting data into a thousand spreadsheets?
|
| Given how the corporate world runs on Excel... someone has
| done it, somewhere.
| chrismarlow9 wrote:
| I'm sure it happens all the time from a logical perspective
| without the user even knowing what a shard is.
|
| Example: Too many rows? Okay let's just break it up into
| monthly sheets and then change this formula to look up the
| value based on the month. Maybe just use a drop-down for the
| sheet selection.
|
| I think most people with enough Excel experience could
| arrive at this answer intuitively.
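|
| (For instance - hypothetical layout - with the sheet name
| picked from a drop-down in B1, something like
| =VLOOKUP(A2, INDIRECT("'" & B1 & "'!A:B"), 2, FALSE)
| routes the lookup to the right monthly sheet.)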
| conductr wrote:
| In practice, Excel has all kinds of issues as you approach
| even a fraction of the available row/column maximums. Your
| files will be crashing, corrupting, or just impossible to
| use. If any of your rows or columns have formulas, it's an
| order of magnitude worse at high volumes.
|
| That said, Excel has Power Query/Pivot and connects to a
| large number of external databases, flat files, etc., and
| it's best to just never import the entire dataset into
| Excel. That's my view as a tenured financial analyst working
| with Excel professionally since version 2003.
| AstralStorm wrote:
| Yup, Excel easily interacts with SQL databases, some of which
| can handle really big data pretty nicely.
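|
| The sane division of labor is to let the database do the
| heavy lifting and hand Excel only the summary. A minimal
| sketch in Python (hypothetical table/file names; to_excel
| needs openpyxl):
|
| import sqlite3
| import pandas as pd
| con = sqlite3.connect("sales.db")
| # Aggregate the millions of rows in the database, not in Excel:
| summary = pd.read_sql_query(
|     "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
|     con)
| summary.to_excel("summary.xlsx", index=False)  # a few rows, not millions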
| m463 wrote:
| I wonder if Excel could get CUDA support...
| sumnole wrote:
| If Microsoft releases a solution for viewing big data in Excel,
| does data have to become even bigger to be "big"? They're stuck
| endlessly chasing a carrot! :)
| kreeben wrote:
| My thesis is MS will never go down this path, not because it
| isn't an excellent idea but because Excel has become
| unrefactorable.
| dragonwriter wrote:
| > My thesis is MS will never go down this path, not because
| it isn't an excellent idea but because Excel has become
| unrefactorable.
|
| They'll just make an optional feature called Power
| _Something_ that is integrated with the Excel UI but mostly a
| separate tool that interacts with the main Excel system
| without sharing its limits.
| bitwize wrote:
| That's okay, they'll just rewrite it -- largely in JavaScript
| -- and offer it as a service as part of Microsoft 365. It
| could then ingest datasets from anywhere in the cloud, even
| large SQL or MongoDB databases.
| PeterisP wrote:
| Yes, sure - the "big data" issues essentially start above a
| certain point where very different processing techniques
| (e.g. cluster computing, and algorithms to split and merge
| partial results of computations that aren't trivially
| parallelizable) start to make sense or even become required.
| If technological progress grants a huge increase in
| available memory, computing power, and algorithm efficiency,
| then that boundary moves upwards.
|
| There are early "big data" publications on datasets that
| decidedly shouldn't be treated as "big data" today, because
| it's now entirely appropriate to process them with simple
| methods in RAM on a decent server, or in some cases even on
| my laptop.
| random5634 wrote:
| I do a ton of Excel work - the 1M row limit is surprisingly
| low. And sharding across worksheets is miserable, in my view.
| edrobap wrote:
| It amuses me how Excel's low limits have affected the
| ecosystem around it. Apache POI, a popular library for
| operating on Excel files, has a weird 4GB limit [1] on
| uncompressed file size.
|
| throw new IllegalArgumentException("Max entry size is bounded
| [0-4GB], but had " + maxEntrySize);
|
| [1]
| https://svn.apache.org/viewvc/poi/tags/REL_5_0_0/src/ooxml/j...
| simias wrote:
| What makes you think this limitation is related to Excel?
| 4GiB is a very common limit because it's the largest size
| that fits in a 32-bit unsigned integer (2^32 bytes = 4GiB).
| It's the max file size on FAT32 as well, for instance.
| narush wrote:
| Explicit data size limits (on the order of 1M rows at the
| time of writing) are only one part of the problem in trying
| to use Excel with Big Data (TM). Generally, if you've got
| millions of rows, you're working with some sort of data
| export from another tool (as opposed to, say, financial data
| you're using to model your company's finances).
|
| And with millions of rows of data pouring out of some other tool,
| usually you're trying to define a repeatable process to
| clean/munge/transform that data into something more useful to
| you/your team/your management.
|
| Within Excel, there are ways of accomplishing the "define
| repeatable task" goal - but my personal experience working with
| VBA (and talking to VBA users across the spectrum) is that it's a
| horrible language that is absolutely no fun at all to write. Good
| luck using a nice library to do anything with it, really.
|
| I'm a co-founder of Mito [1], where we're taking a bit of a
| different angle. Rather than bringing big data into Excel,
| we're bringing an Excel ethos to wherever you might otherwise
| work with your big data. Mito is a spreadsheet interface that
| lives inside a Jupyter notebook; you can write spreadsheet
| formulas, merge
| datasets, explore summary stats, all from within this
| spreadsheet. While you edit the spreadsheet, it generates valid
| Python code for you.
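|
| To make that concrete: the generated code is ordinary
| pandas. A merge edit might emit something along these lines
| (a hypothetical illustration, not Mito's literal output):
|
| import pandas as pd
| orders = pd.read_csv("orders.csv")
| customers = pd.read_csv("customers.csv")
| # Left-merge the two sheets on their shared key column:
| df_merge = orders.merge(customers, on="customer_id", how="left")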
|
| Our current users mostly fit the bill of "previous Excel
| junkies who started teaching themselves Python but still have
| a lot to learn, so they use Mito to augment/speed up their
| workflow."
|
| Questions / comments / hard-hitting HN feedback greatly
| appreciated!
|
| [1] https://trymito.io/launch
| shagie wrote:
| This reminds me of DevOps Borat, from Jan 8, 2013:
| https://twitter.com/devops_borat/status/288698056470315008?l...
|
| > Big Data is any thing which is crash Excel.
| edent wrote:
| Thanks. Although that's a few years after the phrase was first
| popularised - it's still funny :-)
| indymike wrote:
| The joke in our office was, "Big data doesn't fit on a thumb
| drive." At the time a big thumb drive was 64GB.
| ur-whale wrote:
| Do glaringly obvious things need to have an origin?
|
| What value could that have?
| edent wrote:
| As I say in the post, I saw it quoted in an academic paper and
| I wanted to see who actually said it.
|
| That's then a good way to understand _why_ they said it and
| what they actually meant.
|
| So, for me, the value is understanding the provenance and
| assumptions of the quote.
| nojito wrote:
| How does no one in this thread know about Power Query and
| Power Pivot?
|
| I've been using Excel with tens of millions of rows for
| years now.
| fencepost wrote:
| Keep in mind that for the people discussing this 10-12 years
| ago, Excel had almost always had a limit of 64k rows. As
| another comment notes, Excel 2007 is where that was increased
| to over 1,000,000.
|
| If you were a recognized person in data science, you'd
| probably have been using tools with the 64k limit for at
| least a decade.
| atat7024 wrote:
| Academia.
| dfilppi wrote:
| Sounds like a feature request for MSFT
| supernova87a wrote:
| I'll get back to you on that as soon as my sheet finishes
| calculating for 5 min because I changed the value of 1 cell.
|
| Ohh..sorry, make that 10 min -- I had to save the file.
| known wrote:
| Complete/big data has to be loaded into memory before Excel
| can do any processing on it.
| Scarblac wrote:
| I seem to remember a blog post about someone interviewing
| for several "Big Data" positions - where the data turned out
| to be some Excel sheets, or would fit on a thumb drive. The
| point was that the hiring companies that thought they had
| "big data" really didn't. The blog post was widely shared
| around my office at the time and I see it as the origin of
| this idea.
|
| Edit: I think it was
| https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html, HN
| discussion https://news.ycombinator.com/item?id=6398650
|
| But the idea wasn't that data was big as soon as it didn't fit in
| Excel anymore - Excel was a reductio ad absurdum. If it still
| fits in Excel, it's _laughably far_ away from being "big".
| _That_ was the idea.
|
| Big data was about the problems you get when you need to join
| data together that you can't fit well in one database server, not
| even with the amounts of memory and disk space you can get these
| days. It was Google-type problems, the stuff that map-reduce
| needed to be invented for. _The kind of problem that most
| companies just don't have_.
|
| So the term mostly disappeared from job descriptions, "data
| science" became popular instead, and people keep using Excel
| because it is great.
|
| (that blog post is from 2013 so it can't be the source of that
| quote from 2012. But please don't define big data in terms of
| Excel in a 2021 MSc thesis)
| GoodDreams wrote:
| A company I worked for defined big data as anything not
| analyzable with a relational database - a PDF or a video was
| "big data". Companies say they want the latest buzzword even
| when it's against their own interest, other companies say
| "we can give that to you" and instead deliver something that
| is of actual value, so the license gets renewed. The same
| thing happened with SOA, REST, microservices, blockchain.
| IBM developed a blockchain product just to get the sales
| conversations going and then sold something else. Imagine
| being on that product's team.
| [deleted]
| saalweachter wrote:
| I would go with "big data is when you have to spend more
| effort on the mechanism of performing the computation than on
| the details of the analysis itself."
| moksly wrote:
| I work for an average-sized Danish city. Even when we
| combined our datasets with 10 other average-sized cities,
| accumulated the biggest national data set, and let the big
| data wizards have a go at it, it wasn't really large enough
| to amount to anything useful.
|
| Don't get me wrong, Watson came up with a lot of BI
| suggestions that were useful, and some of the scientists
| came up with semi-interesting prediction models. The thing
| is, though, our analytics team has much better BI models,
| and our finance department does the predictions much more
| efficiently and, well, legally. The prediction could become
| a useful aid, if it ever became legal, but not really at the
| license fees we were looking at.
|
| Not sure if automated BI is ever going to become good
| enough. It's impressive that Watson can come up with the
| stuff a university post-bachelor intern can, but its license
| costs more than our entire analytics team, so yeah...
| Scarblac wrote:
| I work in water management in the Netherlands and it feels
| the same.
|
| A human domain expert _knows_ what kind of thing to look
| for. Custom tools can help to get an overview (e.g. we
| visualize water velocities of a whole regional system on a
| map), but generic machine learning can't add much even
| though we record basically everything everywhere at
| five-minute intervals.
|
| The value of ML is perhaps in things that give little value
| in the individual case but can be used a huge number of
| times, like image classification.
| bostik wrote:
| I can give one example where the ML/AI _toolkit_ is useful,
| even if only the largest providers in the world may need to
| work with real Big Data[tm] volumes: fraud detection. And
| you don't need to be a particularly big operator to get
| sufficiently large & varied data sets.
|
| In a way I think it helps that fraud is an ever-evolving,
| fiercely adversarial domain. New avenues are explored all
| the time. Occasionally old tricks are revived for a while,
| because they may work in the margins but become
| distinguishing features as soon as they see more use.
|
| And even if your ML is based on nothing more than a random
| forest, you can still get surprisingly useful feature
| combinations out of it. (Or as our data scientists said:
| individually meaningless features may become a valuable
| signal when enough of them occur at the same time.)
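|
| For the curious, the random-forest version of that is only a
| few lines. A toy sketch with scikit-learn (synthetic data
| standing in for engineered transaction features):
|
| import numpy as np
| from sklearn.ensemble import RandomForestClassifier
|
| rng = np.random.default_rng(0)
| X = rng.normal(size=(1000, 8))      # 1000 "transactions", 8 features
| y = (X[:, 2] + X[:, 5] > 1.5).astype(int)  # "fraud" needs co-occurrence
|
| model = RandomForestClassifier(n_estimators=200, random_state=0)
| model.fit(X, y)
| # Features 2 and 5 stand out even though neither alone decides:
| print(model.feature_importances_.round(2))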
| pbhjpbhj wrote:
| Human-directed machine learning, then? Presumably a domain
| expert who can also expertly leverage ML/AI could make some
| useful tools?
| edent wrote:
| Thanks for that link. Interesting that it was written a year-
| or-so after the phrase became popular.
| [deleted]
| hkmurakami wrote:
| Friend in one of the big data vendors: "we mostly work with
| medium data"
| nerdponx wrote:
| The way I always think of it: Medium data fits on disk but
| not in memory. Big data doesn't fit in either.
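|
| Which is why medium data mostly just needs streaming. A
| minimal pandas sketch that averages a column of a CSV bigger
| than RAM (hypothetical file/column names):
|
| import pandas as pd
| total = count = 0
| # Stream the file in 1M-row chunks instead of loading it all:
| for chunk in pd.read_csv("medium.csv", chunksize=1_000_000):
|     total += chunk["amount"].sum()
|     count += len(chunk)
| print(total / count)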
| alexchamberlain wrote:
| I like this definition too; the advantage is that it scales
| with advances in technology. What was medium a decade ago
| might be considered small(ish) now.
| cortesoft wrote:
| Luckily that blog post is out of date and there are a lot of
| options now to handle datasets larger than 5TB without needing
| Hadoop.
| jleahy wrote:
| Awk, for example.
| simias wrote:
| I think it's kind of pointless to fight for an exact meaning
| for these buzzwords. "Big Data", "Webscale", "Web 2.0", "NoSQL"
| and whatever other trendy hashtag du jour have lost most of
| their meaning because they hardly had any to begin with.
| They're marketing drivel, not technical terms. And since
| everybody wants to look cool we all find ways to say that we're
| doing it, one way or the other.
| klodolph wrote:
| Other than "webscale" I think the other terms still mean
| something. They don't need to have a precise, agreed upon
| meaning to be useful words.
|
| "Webscale" is just a buzzword, though.
| Quarrelsome wrote:
| Sorry I can't resist:
| https://www.youtube.com/watch?v=b2F-DItXtZs
| srswtf123 wrote:
| This video is one of the best critiques of modern
| software development, and I suppose IT in general, that
| I've ever seen. Even if I've seen it a hundred times, I
| always appreciate seeing it again. Thank you!
| stadium wrote:
| NoSQL was jargon for "document store". It's really
| schema-on-read vs. the traditional database, where the
| schema is defined on (or before) write. And there are plenty
| of ways to use SQL on various data formats, including JSON -
| Presto being one.
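|
| A minimal schema-on-read sketch (SQLite via Python, assuming
| a build with the JSON1 functions, which most modern ones
| ship):
|
| import sqlite3
| con = sqlite3.connect(":memory:")
| con.execute("CREATE TABLE events (doc TEXT)")  # no schema up front
| con.execute("""INSERT INTO events VALUES
|     ('{"user": "ada", "amount": 42}')""")
| # The schema is applied at read time, inside the query itself:
| print(con.execute("""SELECT json_extract(doc, '$.user'),
|     json_extract(doc, '$.amount') FROM events""").fetchall())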
|
| Around the time of peak NoSQL buzz, I went to Oracle
| OpenWorld in the early-to-mid 2010s, and there was an army
| of MongoDB folks outside the convention center grounds
| holding signs and handing out swag. NoSQL was definitely a
| good marketing play to try and take customers away from
| Oracle.
| snidane wrote:
| Once the new breed of "data scientists" using R and Python
| dataframes replaced the Excel "data analysts", they realized
| the "big" of big data hadn't disappeared, and they started
| hiring data engineers to develop and explain the new big
| data platforms for stuff that didn't fit into their pandas
| dataframes.
|
| Still, some call it medium data, since it spans the range of
| high gigabytes to low terabytes. There are still a couple of
| business problems at the petabyte scale, but they are
| probably so niche that they deserve a custom solution anyway.
| aappleby wrote:
| If you can lift the device it's stored on with your bare hands,
| it's not big data.
| social_quotient wrote:
| Maybe it's the limits that used to push you into a corner of
| pain?
|
| 118 character file name limit
| 65,536 row limit (till 2011)
| 256 column limit (till 2011)
| 2GB memory limit
|
| Those are the structural limits, which should be good for a
| lot of things, but... there are the practical issues of it
| freezing and misbehaving while actually using it at any
| scale or any sort of complexity.
| Vaslo wrote:
| There are definite practical limits in Excel. Even at 100k
| rows, some of the more complex financial models we have made
| get very difficult to work with. We've spent days learning
| and trying new ways to speed them up (never use VLOOKUP or
| OFFSET, etc. - see the example below), and they became
| usable but still frustrating.
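|
| (One common substitution, with hypothetical ranges: in place
| of =VLOOKUP(A2, B:C, 2, FALSE), use
| =INDEX(C:C, MATCH(A2, B:B, 0)) - and on sorted data,
| switching MATCH to approximate mode turns the linear scan
| into a binary search, which is where the real speedup is.)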
|
| I guess the biggest upside is that it's forced my team to go
| out and learn SQL, Python, etc. However, that has its own
| downside, because we are the only ones in our division who
| know these tools, so we get stuck maintaining stuff that we
| shouldn't!
| NicoJuicy wrote:
| Share LINQPad scripts?
| bob1029 wrote:
| Having everyone on the team competent in SQL is not an
| unreasonable objective. Even project managers and executives
| could benefit from knowing how to write their own queries
| against various data sources.
|
| As seasoned developers, we can assist our more
| business-minded coworkers by developing views and reporting
| databases that present a more consistent and higher-order
| perspective of the world. Just having a document that lists
| out examples of SQL they can use is 99% of what most people
| need to get bootstrapped.
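|
| A minimal sketch of the idea (SQLite via Python, hypothetical
| schema):
|
| import sqlite3
| con = sqlite3.connect(":memory:")
| con.executescript("""
|     CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT,
|                          total REAL, placed_at TEXT);
|     -- The view is the stable surface business users query:
|     CREATE VIEW monthly_revenue AS
|         SELECT substr(placed_at, 1, 7) AS month,
|                SUM(total) AS revenue
|         FROM orders GROUP BY month;
| """)
| con.execute("INSERT INTO orders VALUES (1, 'acme', 120.0, '2021-03-15')")
| print(con.execute("SELECT * FROM monthly_revenue").fetchall())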
|
| Getting your problem domains modeled in SQL and using the
| appropriate form of normalization is foundational for managing
| non-trivial levels of complexity in larger projects. For me,
| non-trivial complexity means any domain model with more than 10
| related types or more than 100 total properties to deal with.
|
| Relational modeling is one of the most powerful abstractions we
| have available for working with anything that goes beyond the 3
| spatial dimensions that we can see with our eyeballs. I have
| dealt with queries in factory automation that join over 40
| tables to produce some important projection.
|
| You could make a bunch of assumptions about how the domain
| model should be shaped in some complex object graph monstrosity
| and then stick it into MongoDB, or you can leave it neatly
| organized and indexed such that any reasonable query can be
| made of the data to produce virtually any shape of output you
| need.
|
| With all of that in mind, Excel is still one of the most
| powerful tools on your computer for documenting virtually
| anything. Any problem domain can be represented as tables of
| things and relations between them. Anyone can figure out the
| most important parts of this tool with just a few minutes of
| screwing around with it. It is trivial to take a model from
| someone's xlsx and turn it into proper SQL tables and then slap
| a front-end and business logic around it. When someone wants me
| to write a new piece of software for a new problem area, we
| always start with types, properties, and relationships between
| these things. Excel is a perfect fit for the first phase of any
| software project.
| amelius wrote:
| Does it matter? Because I'm guessing that sooner rather than
| later Excel will be updated so that big data will fit just fine
| (using cloud technologies).
| CalChris wrote:
| I'm reminded of Joe Hellerstein's (and Oz Nova's summation of it,
| _You Are Not Google_ [1] [2]) about even actual Big Data people
| getting their scale wrong and thinking they 're Google:
| The thing is there's like 5 companies in the world that run jobs
| that big. For everybody else... you're doing all this I/O for
| fault tolerance that you didn't really need. People got kinda
| Google mania in the 2000s: "we'll do everything the way Google
| does because we also run the world's largest internet data
| service" [tilts head sideways and waits for laughter].
|
| [1] https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
|
| [2] https://news.ycombinator.com/item?id=19576092
| inshadows wrote:
| Sorry for the off-topic question. Several years ago I
| watched part of a video lecture about implementing Excel's
| principles. It was a talk about functional programming, and
| he may have used Haskell or Scala. I found it interesting
| because it began with him explaining how Excel is a huge
| array and how changes propagate. Does anyone have a link to
| this talk/lecture? Thanks in advance!
___________________________________________________________________
(page generated 2021-04-17 23:01 UTC)