[HN Gopher] How we made Jupyter notebooks load faster
___________________________________________________________________
How we made Jupyter notebooks load faster
Author : lneves12
Score : 143 points
Date : 2024-09-10 13:19 UTC (1 days ago)
(HTM) web link (www.singlestore.com)
(TXT) w3m dump (www.singlestore.com)
| davidgomes wrote:
| I was on this team at SingleStore and I can vouch for how hard
| this team worked on this project. I just opened a couple
| notebooks in production and they loaded *instantly*, so kudos to
| the team for seeing this project through.
|
| (If you're not familiar with SingleStore's Jupyter Notebooks,
| they're similar to Databricks Notebooks[1] or Azure Synapse
| Notebooks[2]).
|
| [1]: https://docs.databricks.com/en/notebooks/index.html
|
| [2]: https://learn.microsoft.com/en-us/azure/synapse-
| analytics/sp...
| yunohn wrote:
| I appreciate the write-up, it's very insightful. But I was
| quite concerned about the amount of response mocking being used
| as a solution. Did the team ever consider instead making
| upstream changes to jupyter-lab (which is FOSS), so that these
| requests are either deferred or can be configured to not run?
| That seems like it would benefit everyone, including your
| company - and might even uncover further optimizations.
| tgrine wrote:
| Hi, I'm one of the co-authors of the blog post. You raise a
| valid point and, in large part, I agree with you. To offer
| some explanation, there are essentially 2 main reasons for us
| taking this approach:
|
| 1. Contributing to an open source project, especially one the
| size and complexity of jupyter-lab, is generally going to be
| a slower process than finding a solution "in house".
| Improving the load times became a priority once we realized
| our notebooks were bringing value to users, and we wanted to
| deliver a better experience as soon as possible.
|
| 2. It's not always apparent whether the changes you are looking
| for from an open source project are useful to a more general
| audience or very specific to the way you are using the
| project. A lot of the requests that were mocked could only be
| so because we either didn't use them in our implementation
| (for example, users and workspaces) or because we know the
| response won't change (for example, some extension settings
| which we don't allow users to change). Is this a common
| situation for others or is it a niche circumstance of how we
| are using jupyter-lab? If it's not common, then adding these
| options in jupyter-lab itself could just increase its
| complexity while not bringing that much benefit (not saying
| this is necessarily the case here).
|
| To your point though, a good example of this is the
| checkpoints feature. There is an open issue requesting the
| option to disable checkpoints[1] as it is not always useful
| for people. We had the same issue, since we are not using
| checkpoints, but the requests were always being made.
| Ultimately, we just mocked the checkpoints requests, but it's
| probably the case that making the changes to jupyter-lab to
| disable this would benefit us and other people as well.
|
| [1]: https://github.com/jupyterlab/jupyterlab/issues/11826
| yunohn wrote:
| Thanks for the clarification, that's fair enough. But I
| hope you do look into upstreaming, maybe even just by
| opening an Issue to gauge community interest. Like the
| pre-existing checkpoints Issue you linked to, opening some for
| your functionality might show others what is possible.
| paddy_m wrote:
| Impressive work. I love jupyter, but it's a bear to work on.
|
| What mix of JS packages do you see? Could that be built into one
| uber package?
| lneves12 wrote:
| We are actually bundling everything inside one big main.js
| file, compared to jupyter-lab app that loads each extension
| from a different file using webpack federated modules. We did
| some benchmarking and it was actually faster than having one
| file per extension. There is definitely still some room for
| improvement here, but we have some other places we would like
| to optimize first, such as the fetching of the notebooks'
| content.
|
| You can take a look at the notebooks entrypoint network
| request: https://portal-notebooks.singlestore.com/
| paddy_m wrote:
| I take it that's supposed to be the pre-loading page to just
| look at the requests, not a full working UI?
|
| After cursory googling I couldn't find one, do you have a
| public notebook gallery?
| lneves12 wrote:
| Yeah, sorry should have made that more clear. This doesn't
| really load anything, it's just the entry point for our
| iframe. To try our notebooks you can create an account at
| portal.singlestore.com (we have a gallery of notebooks
| there)
| canucker2016 wrote:
| I don't see a Content-Encoding header on the response for the
| JS and HTML files, which suggests the 11.5MB JS and the HTML
| files aren't compressed.
|
| Not much of a worry on the tiny HTML file, but the 11.5MB JS
| file should compress to a much smaller file on the wire.
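|
| A minimal sketch of both points, assuming nothing about the
| actual CDN setup: `is_compressed` checks a response-header
| mapping for a Content-Encoding value, and `gzip_ratio` estimates
| how much a payload shrinks under gzip. Both helper names and the
| synthetic `fake_bundle` are made up for illustration; a real JS
| bundle will compress less dramatically than this repetitive
| stand-in.

```python
import gzip

# Encodings that indicate the body was compressed on the wire.
COMPRESSED_ENCODINGS = {"gzip", "br", "deflate", "zstd"}


def is_compressed(response_headers: dict) -> bool:
    """Return True if the *response* headers declare a compressed body."""
    for name, value in response_headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-encoding":
            return value.strip().lower() in COMPRESSED_ENCODINGS
    return False


def gzip_ratio(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size."""
    if not payload:
        raise ValueError("payload must be non-empty")
    return len(gzip.compress(payload, compresslevel=level)) / len(payload)


# Synthetic stand-in for a JS bundle (an assumption, not the real file).
fake_bundle = b"function render(cell) { return cell.source.trim(); }\n" * 20000
ratio = gzip_ratio(fake_bundle)
print(f"original: {len(fake_bundle)} bytes, gzip ratio: {ratio:.4f}")
print("compressed?", is_compressed({"Content-Type": "text/javascript"}))
```

| Note that the check has to run against the response headers:
| a request's Accept-Encoding only says what the client is willing
| to receive.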
| lneves12 wrote:
| ah that's an embarrassing oversight! we reuse the same cdn
| configuration for multiple projects and for some reason the
| compression isn't properly configured for our portal-
| notebooks.singlestore.com entry point. It's funny that I
| reconfirmed that before publishing the blogpost, but
| mistakenly looked at the request headers and not the
| response headers (facepalm). We are fixing that now, thank
| you! This will be helpful for cases where you access the
| notebooks UI directly. For cases where you come from another
| page, it shouldn't make that much difference, since the
| iframe is already pre-rendered.
| jasongrout wrote:
| Nice. Bundling everything together was how JupyterLab used to
| work before version 3, but it required a compilation step to
| install extensions, which made it inconvenient for users.
| With JupyterLab 3 (and maybe 4?), if I recall correctly, you
| could have both worlds - compile some extensions into a base
| js bundle, then install other extensions to be loaded as
| federated webpack modules.
| lneves12 wrote:
| Thanks for the context, that makes a lot of sense. In our
| case since we have a more controlled and less generic
| environment we have a bit more flexibility/control in what
| we can do.
| spiralk wrote:
| I dislike how Jupyter notebooks have become normalized. Yes, the
| interactive execution and visuals are nice for more academic
| workflows where the priority is quick results over code
| organization. However, when it comes to sharing code with others
| for the sake of doing reproducible science, jupyter notebooks
| cause more trouble than they are worth. Using cell based
| execution with python is so elegant with '# %%' lines in regular
| .py files (though it requires using VSCode or fiddling with vim
| plugins which not all scientists want to do I suppose). No .ipynb
| is necessary, .py files can be version controlled and shared like
| normal code while still retaining the ability to be used
| interactively, cell by cell.
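|
| For readers who haven't seen the convention, here is a minimal
| sketch of such a file. The file name and the data are made up.
| Editors that understand the `# %%` marker (for example the
| VS Code Python extension) can run each cell separately, while
| the Python interpreter just sees ordinary comments.

```python
# analysis.py: a plain Python file split into cells by "# %%" markers.

# %% Load the data (hard-coded here; a real script would read a file)
import statistics

samples = [2.0, 4.0, 4.0, 5.0, 7.0]

# %% Compute summary statistics
mean = statistics.mean(samples)
spread = statistics.pstdev(samples)

# %% Report
print(f"mean={mean:.2f} pstdev={spread:.2f}")
```

| Because the markers are comments, `python analysis.py` runs the
| whole file top to bottom, and the file diffs like any other
| source file.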
|
| It's much easier to organize .py files into a proper python
| module, and then share and collaborate with others. Instead,
| groups will collect jumbles of slightly different versions of the
| same jupyter notebooks that progressively become more complex and
| less manageable over time. It's not a hypothetical unfortunately,
| I've seen this happen at major university labs. I'm not blaming
| anyone because I understand -- the funding is there to do science
| and not rewrite code to build convenient software libraries. Yet,
| I can't help but wish jupyter notebooks could be removed from
| academic workflows.
| luplex wrote:
| In the end, usability wins. In a Jupyter notebook, you have a
| much better idea of state between cells, you can iterate much
| faster, you can write documentation in readable markdown.
| Often, Jupiter notebooks are more like interactive markdown
| than they are like python scripts.
| dxbydt wrote:
| > In a Jupyter notebook... > Often, Jupiter notebooks...
|
| Every time I search my Slack, I have to run two searches
| because DS can't agree on how to spell the damn thing.
| Twirrim wrote:
| I use jupyter notebooks at work, not so much for academic
| stuff, but often to help build and show a narrative to folks,
| including executives (where I have any even remotely technical
| leadership). It's great for narrative stuff, especially being
| able to emit PDFs and what not. I've been in a number of
| meetings where I've got the code up in Jupyter, sharing the
| screen, and leadership want us to tweak numbers and see the
| consequences.
|
| It's great for exploring code and data too, especially
| situations where I'm really trying to feel my way towards a
| solution. I get to merrily intermingle rich text narrative and
| code so I explain how I got to where I got to and can walk
| people through it (I did that while experimenting with an
| SMT solver several months ago, which meant that people who had
| no experience with an SMT solver could understand the model I
| built).
|
| I'd never use it to share code though. If we get to that stage,
| it's time to export from jupyter (which it natively supports),
| and then tidy up the code and productionise it. There's no way
| jupyter should be the deployed thing.
| spiralk wrote:
| That seems like a reasonable way to use jupyter notebooks
| since you have an actual plan to move beyond it when
| necessary. My issue is mostly with the way it's misused, often
| by people who are arguably at the top of the field.
| KolenCh wrote:
| I don't disagree with anything you said. Jupytext can be a good
| tool to bridge some of the gap: you pair the ipynb with a py
| script and can then commit only the py (git-ignoring all ipynb
| files for your collaborators).
|
| Also, while many practices out there are questionable, in
| alternative scenarios where ipynb doesn't exist, they might
| have been using something like matlab, for example. E.g., in my
| field (physics), oftentimes there are experimentalists doing
| some coding. Ipynb can be very enabling for them.
|
| I think a piece of research should be broken down and worked by
| multiple people to improve the state of the project. Some
| scientists might be passing you the initial prototype in the
| form of a notebook, and some others should be refactoring to
| something more suitable for deployment and archival purpose.
| Properly funding these roles is important, and is lacking but
| improving (eg hiring RSE.)
|
| In my field, the most prominent case where ipynb is shared a
| lot is training. It's a great application as that becomes
| literate programming. In this sense the notebook is highly
| underused, as literate programming still hasn't gone mainstream.
| spiralk wrote:
| I've looked into Jupytext, but ultimately decided to go with
| pure python. Most of the practical functionality can be
| replicated, but I do admit there isn't an easy single-install
| tool or guide to replace notebooks at the moment.
|
| I think the notebooks are a fine learning tool to introduce
| people to programming initially, but I'm afraid it doesn't
| allow for growth beyond a certain level. You have a good
| point about funding for those software roles. Perhaps this
| may not be as big of a concern if there were more software
| talent in these labs to handle the issues that arise.
| KolenCh wrote:
| In an ideal world where we control everything and/or don't
| need to collaborate with others, whatever tooling one uses
| is actually not that important (and each can choose what
| best fits their needs). So Jupyter+Jupytext is useful in
| the context of collaboration, where you can't control your
| collaborators but want something from them.
|
| While in an ideal world scientists who write software
| should write it professionally, the same goes for anything
| they do, including the math and stats used in their research,
| writing, typesetting, and generating publication-quality
| visualizations... That rarely happens because of how the
| academic world is financed, and the incentives associated
| with it. I can certainly complain about that all day, but
| in short a researcher hired by a research university,
| especially into a tenure-track position in the US, will
| not succeed in getting such a position, let alone getting
| tenure, if they have not focused their scarce resource of
| time on maximizing their "research output" (publications,
| grants, etc.), which software engineering is not part of.
| (Sorry, sentence too complicated.)
| ambicapter wrote:
| The form factor of Jupyter notebooks seems to fit well with
| people's workflows though. Looks like you just wish the
| internals of Jupyter were better architected.
| spiralk wrote:
| Imo, the better architected .ipynb is simply .py with '# %%'
| blocks. It does almost everything a .ipynb can do with the
| right VSCode extensions. Even interactive visualizations can
| be sent to a browser window or saved to disk with plotly.
| Though I do wish '# %%' cell based execution was accessible
| to more people.
|
| There isn't a single install tool that "just works" for this
| at the moment. If editors came with more robust support for
| it by default, I think the notebook format wouldn't be needed
| at that point and people could use regular python and
| interactive cell based python more interchangeably. I've seen
| important code get buried under collections of jupyter
| notebooks across different users so I have a good reason for
| this. Notebooks simply don't scale beyond a certain
| complexity.
| paddy_m wrote:
| The two can coexist. Store libraries in python code that is
| versioned and deployed properly. Notebooks with their data
| ingest, code, then output should read cleanly. Making the
| ingest and code readable is the job of library writers. A
| clean and elegantly coded notebook with inline outputs is a
| substantively different experience than searching all over
| the place for the correct browser window that corresponds
| to the output from a given piece of code.
| abdullahkhalids wrote:
| The same problem exists with spreadsheets. Should we get rid of
| excel (the single tool that literally runs half the world), and
| start manually writing markdown tables in text files?
|
| The tool and the tool maker are supposed to serve the user. The
| user is not supposed to conform to the whims of the tool maker.
| ants_everywhere wrote:
| Since 94% of business spreadsheets contain errors [0], then
| probably yes we should get rid of or significantly improve
| spreadsheets.
|
| Probably the solution is that things like Jupyter notebooks
| and spreadsheets should be views into some better source of
| truth rather than the source of truth themselves.
|
| [0] https://phys.org/news/2024-08-business-spreadsheets-
| critical.... I remember a similar figure from studies a
| decade or so ago.
| glzone1 wrote:
| The funny thing is I've seen folks try to deploy software
| to get rid of spreadsheets. It always ends badly to
| terribly.
|
| Spreadsheets are the nonprogrammer's programming / modeling
| tool in business.
|
| It does presentation, data filtering / sorting, modeling
| and more.
|
| No AI needed (and you can now plug AI in in some cases).
| ants_everywhere wrote:
| Sure, but here's the basic problem I think:
|
| Suppose you have some formula that computes a financial
| metric for your company. Someone you've shared it with
| drunkenly fat-fingers the formula 3/4 of the way down a
| long row, and that causes all entries below it to
| recompute with the wrong formula. Unless the change is
| really drastic, you may never know it happened.
|
| And this sort of mistake -- basically a typo or a bad
| mouse movement -- happens daily in every company in the
| world in some spreadsheet. Often people will notice the
| mistake, but not with probability 1.
|
| Software engineers have mechanisms to guard against some
| of these mistakes, and even we have a hard time getting
| people to take code review or tests seriously. What is
| the guard in the spreadsheet world?
| paddy_m wrote:
| Another issue is that jupyter, pandas, and polars don't take
| displaying tabular data seriously. Just have a better default
| table display widget. Look at ipydatagrid, perspective, or
| buckaroo (my project) for examples of how it could be done
| better.
| epistasis wrote:
| I think there's a fundamental misunderstanding and mismatch
| between what you want to do, and what Jupyter notebooks are
| for. The distinction is between code versus the results.
|
| If the code is the end product, sure, use a python package.
|
| But does your .py with `# %%` in it also store the outputs? If
| not, why even bring this up? A .py output without the plots
| tied to the code doesn't meet the basic use case.
|
| If the end product is the plot, I want to see how that plot was
| generated. And a Jupyter notebook is a much much better
| artifact than a Python package, unless that Python package hard
| codes the inputs and execution path like a notebook would.
|
| Over the past 20 years of my career I have run into this
| divergence of use cases a lot. Software engineers seem to not
| understand the end goals, how it should be performed, and the
| learnings of the practitioners that have been generating
| results for a long time. It's hard to protect data scientists
| from these inflexible software engineers that see "aha that's
| code, I know this!" without bothering to understand the actual
| use case at hand.
| spiralk wrote:
| Not having the outputs tied into the code is actually
| preferable if the ultimate goal is reproducible science. Code
| should be code, documentation should be documentation, and
| outputs should be outputs. Having multiple copies of
| important code in non-version controlled files is not a good
| practice. Having documentation dispersed with questionable
| organization in unsearchable files is not a good practice.
| Having outputs without run information and timestamps is not
| a good practice. It's easy to fall into those traps with
| Jupyter notebooks. It might speed up initial setup and
| experimentation, but I've been working in academic labs long
| enough to see the downstream effects.
| majormajor wrote:
| Having the outputs recorded _alongside specific versions of
| the code_ can actually be very valuable.
|
| But since most uses of Jupyter notebooks I've seen don't
| version control them much at all, it's often not as useful
| in practice.
| spiralk wrote:
| Yeah, jupyter notebooks don't guarantee any specifics
| about versions of code used for that output. In the real
| world you can expect everyone in the lab including all of
| the students to be editing jupyter notebooks at whim. The
| only way to do this would be to have proper version
| control of your code, a snapshot of the environment,
| and to log all this along with the run that generated the
| output. This is possible with regular python using git,
| proper log files, etc. Jupyter notebooks seem like an
| extra roadblock.
| paddy_m wrote:
| Ooh. That's a nice utility function that I will write
| soon. We tend to look at requirements as something we
| hope the package manager gets right, and then ignore
| at runtime, but there are a bunch of errors we could
| avoid if we verified at runtime. Sometimes when writing a
| library you have to have different code paths for
| different versions.
|
| Something like `if check_versions(pandas__gt="2.0.0",
| pandas__lt="3.0.0"):`
| yunohn wrote:
| Often the notebook was run on a beefy server with GPUs
| attached, potentially taking hours/days of compute. It
| would be senseless to force every viewer of a Jupyter
| notebook to have the same setup and time just to read
| through the results and output.
| ants_everywhere wrote:
| We've seen how this ends because mathematicians have been
| sharing Mathematica notebooks forever. It's not pretty.
|
| Like you I see the appeal, but they're a usability nightmare
| beyond a few lines. Part of the problem, I think, is that you
| can't really incrementally improve them. Who wants to refactor
| a notebook and deal with all the cell dependency breakage?
|
| So they start off okay and then slowly become terrible until
| they're either irreplaceable or too terrible to work with and a
| new one is started.
| pizza wrote:
| While we're on the topic of jupyter enhancements, would really
| love to be able to pop a cell off the pending execution stack if
| I realized running it would be a mistake and still have time
| before it gets there.. :^)
| sa-code wrote:
| And also preserving cell output if you happen to reload the
| page
___________________________________________________________________
(page generated 2024-09-11 23:02 UTC)