hngopher.com

       [HN Gopher] How we made Jupyter notebooks load faster
       ___________________________________________________________________
        
       How we made Jupyter notebooks load faster
        
       Author : lneves12
       Score  : 143 points
       Date   : 2024-09-10 13:19 UTC (1 days ago)
        
 (HTM) web link (www.singlestore.com)
        
       | davidgomes wrote:
       | I was on this team at SingleStore and I can vouch for how hard
       | this team worked on this project. I just opened a couple
       | notebooks in production and they loaded *instantly*, so kudos to
       | the team for seeing this project through.
       | 
       | (If you're not familiar with SingleStore's Jupyter Notebooks,
       | they're similar to Databricks Notebooks[1] or Azure Synapse
       | Notebooks[2]).
       | 
       | [1]: https://docs.databricks.com/en/notebooks/index.html
       | 
       | [2]: https://learn.microsoft.com/en-us/azure/synapse-
       | analytics/sp...
        
         | yunohn wrote:
         | I appreciate the write-up, it's very insightful. But I was
         | quite concerned about the amount of response mocking being used
         | as a solution. Did the team ever consider instead making
         | upstream changes to jupyter-lab (which is FOSS), so that these
         | requests are either deferred or can be configured to not run?
         | That seems like it would benefit everyone, including your
         | company - and might even uncover further optimizations.
        
           | tgrine wrote:
           | Hi, I'm one of the co-authors of the blog post. You raise a
           | valid point and, in large part, I agree with you. To offer
           | some explanation, there are essentially 2 main reasons for us
           | taking this approach:
           | 
           | 1. Contributing to an open source project, especially one the
           | size and complexity of jupyter-lab, is generally going to be
           | a slower process than finding a solution "in house".
           | Improving the load times became a priority once we realized
           | our notebooks were bringing value to users, and we wanted to
           | deliver a better experience as soon as possible.
           | 
           | 2. It's not always apparent if the changes you are looking
           | for from an open source project are useful to a more general
           | audience or if they're very specific to the way you are using
           | the project. A lot of the requests that were mocked could
           | only be so because we either didn't use them in our
           | implementation (for example, users and workspaces) or because
           | we know the response won't change (for example, some
           | extension settings which we don't allow users to change). Is
           | this a common situation for others or is it a niche
           | circumstance of how we are using jupyter-lab? If it's not
           | common, then adding these options in jupyter-lab itself could
           | just increase its complexity while not bringing that much
           | benefit (not saying this is necessarily the case here).
           | 
           | To your point though, a good example of this is the
           | checkpoints feature. There is an open issue requesting the
           | option to disable checkpoints[1] as it is not always useful
           | for people. We had the same issue, since we are not using
           | checkpoints, but the requests were always being made.
           | Ultimately, we just mocked the checkpoints requests, but it's
           | probably the case that making the changes to jupyter-lab to
           | disable this would benefit us and other people as well.
           | 
           | [1]: https://github.com/jupyterlab/jupyterlab/issues/11826
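
The mocking approach described above can be sketched roughly as follows. This is a minimal Python illustration, not the team's implementation (which would live in the browser frontend and is not shown in the thread). The `MOCKED_ROUTES` table, the `fetch` wrapper, and the second route are hypothetical; the checkpoints pattern assumes the `/api/contents/<path>/checkpoints` shape of the Jupyter Server contents API.

```python
import re
from typing import Any, Callable

# Hypothetical table of requests that never need to reach the server:
# either the feature is unused or the response is known not to change.
MOCKED_ROUTES: list[tuple[re.Pattern[str], Any]] = [
    # Checkpoints are unused, so always answer "no checkpoints".
    (re.compile(r"^/api/contents/.+/checkpoints$"), []),
    # Placeholder route standing in for settings users cannot change.
    (re.compile(r"^/api/example/static-settings$"), {"settings": {}}),
]


def fetch(path: str, real_fetch: Callable[[str], Any]) -> Any:
    """Return a canned response for mocked routes, else call the server."""
    for pattern, canned in MOCKED_ROUTES:
        if pattern.match(path):
            return canned
    return real_fetch(path)


calls: list[str] = []


def slow_server(path: str) -> dict:
    """Stand-in for a real network request; records what was asked for."""
    calls.append(path)
    return {"path": path}


checkpoints = fetch("/api/contents/demo.ipynb/checkpoints", slow_server)
notebook = fetch("/api/contents/demo.ipynb", slow_server)
```

Only the notebook request reaches `slow_server`; the checkpoints request is answered locally. The trade-off raised in the thread still applies: a mock like this has to be kept in sync with upstream behavior by hand.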
        
             | yunohn wrote:
             | Thanks for the clarification, that's fair enough. But I
             | hope you do look into upstreaming, maybe even just by
             | opening an Issue to gauge community interest. Like the pre-
             | existing checkpoints Issue you linked to, opening some for
             | your functionality might show others what is possible.
        
       | paddy_m wrote:
       | Impressive work. I love jupyter, but it's a bear to work on.
       | 
       | What mix of JS packages do you see? Could that be built into one
       | uber package?
        
         | lneves12 wrote:
         | We are actually bundling everything inside one big main.js
         | file, compared to jupyter-lab app that loads each extension
         | from a different file using webpack federated modules. We did
         | some benchmarking and it was actually faster than having one
         | file per extension. There is definitely still some room for
         | improvement here, but we have some other places we would like
         | to optimize first, such as the fetching of the notebooks'
         | content.
         | 
         | You can take a look at the notebooks entrypoint network
         | request: https://portal-notebooks.singlestore.com/
        
           | paddy_m wrote:
           | I take it that's supposed to be the pre-loading page to just
           | look at the requests, not a full working UI?
           | 
           | After cursory googling I couldn't find one, do you have a
           | public notebook gallery?
        
             | lneves12 wrote:
             | Yeah, sorry should have made that more clear. This doesn't
             | really load anything, it's just the entry point for our
             | iframe. To try our notebooks you can create an account at
             | portal.singlestore.com (we have a gallery of notebooks
             | there)
        
           | canucker2016 wrote:
           | I don't see a Content-Encoding header on the response for the
           | JS and HTML files, which suggests the 11.5MB JS and the HTML
           | files aren't compressed.
           | 
           | Not much of a worry on the tiny HTML file, but the 11.5MB JS
           | file should compress to a much smaller file on the wire.
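
To illustrate the point about compression, here is a small sketch using only the Python standard library. The `bundle` bytes are a made-up, highly repetitive stand-in for a bundled JS file, so the ratio it prints will be far better than a real bundle would see; `is_compressed` is a hypothetical helper showing that the check belongs on the response headers, not the request headers.

```python
import gzip

# Stand-in for a large bundled main.js. The content is invented for
# illustration and is much more repetitive than real JavaScript.
bundle = b"function render(cell) { return cell.source.join('\\n'); }\n" * 20000

compressed = gzip.compress(bundle)
ratio = len(compressed) / len(bundle)


def is_compressed(response_headers: dict[str, str]) -> bool:
    """Check the *response* headers for a compressing Content-Encoding."""
    lowered = {k.lower(): v.lower() for k, v in response_headers.items()}
    encoding = lowered.get("content-encoding", "identity")
    return encoding in {"gzip", "br", "deflate", "zstd"}


print(f"{len(bundle)} bytes -> {len(compressed)} bytes ({ratio:.1%})")
```

A request's `Accept-Encoding` header only says what the client can handle; whether the file was actually compressed on the wire shows up in the response's `Content-Encoding`.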
        
             | lneves12 wrote:
             | ah that's an embarrassing oversight! we reuse the same cdn
             | configuration for multiple projects and for some reason the
             | compression isn't properly configured for our portal-
             | notebooks.singlestore.com entry point. It's funny that I
             | reconfirmed that before publishing the blogpost, but
             | mistakenly looked at the request headers and not the
             | response headers (facepalm). We are fixing that now, thank
             | you! This will be helpful for cases where you access the
             | notebooks UI directly. For cases where you come from another
             | page, it shouldn't make that much difference, since the
             | iframe is already pre-rendered.
        
           | jasongrout wrote:
           | Nice. Bundling everything together was how JupyterLab used to
           | work before version 3, but it required a compilation step to
           | install extensions, which made it inconvenient for users.
           | With JupyterLab 3 (and maybe 4?), if I recall correctly, you
           | could have both worlds - compile some extensions into a base
           | js bundle, then install other extensions to be loaded as
           | federated webpack modules.
        
             | lneves12 wrote:
             | Thanks for the context, that makes a lot of sense. In our
             | case since we have a more controlled and less generic
             | environment we have a bit more flexibility/control in what
             | we can do.
        
       | spiralk wrote:
       | I dislike how Jupyter notebooks have become normalized. Yes, the
       | interactive execution and visuals are nice for more academic
       | workflows where the priority is quick results over code
       | organization. However, when it comes to sharing code with others
       | for the sake of doing reproducible science, jupyter notebooks
       | cause more trouble than they are worth. Using cell based
       | execution with python is so elegant with '# %%' lines in regular
       | .py files (though it requires using VSCode or fiddling with vim
       | plugins which not all scientists want to do I suppose). No .ipynb
       | is necessary, .py files can be version controlled and shared like
       | normal code while still retaining the ability to be used
       | interactively, cell by cell.
       | 
       | It's much easier to organize .py files into a proper python
       | module, and then share and collaborate with others. Instead,
       | groups will collect jumbles of slightly different versions of the
       | same jupyter notebooks that progressively become more complex and
       | less manageable over time. It's not a hypothetical unfortunately,
       | I've seen this happen at major university labs. I'm not blaming
       | anyone because I understand -- the funding is there to do science
       | and not rewrite code to build convenient software libraries. Yet,
       | I can't help but wish jupyter notebooks could be removed from
       | academic workflows.
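
For readers who have not seen the `# %%` convention mentioned above, a minimal sketch of such a file might look like this. The data and the `summarize` function are invented for illustration. Editors that support the format treat each `# %%` line as a cell boundary, while the file remains an ordinary Python script.

```python
# %% [markdown]
# # Exploring a small dataset
# Each `# %%` line starts a new cell in editors that support the format.

# %%
import statistics

samples = [2.0, 4.0, 4.0, 5.0, 7.0]

# %%
def summarize(values: list[float]) -> dict[str, float]:
    """Return basic summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

# %%
summary = summarize(samples)
print(summary)
```

Because it is plain Python, the file diffs cleanly under version control and can be imported or run from the command line; unlike an .ipynb, it does not store cell outputs.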
        
         | luplex wrote:
         | In the end, usability wins. In a Jupyter notebook, you have a
         | much better idea of state between cells, you can iterate much
         | faster, you can write documentation in readable markdown.
         | Often, Jupiter notebooks are more like interactive markdown
         | than they are like python scripts.
        
           | dxbydt wrote:
           | > In a Jupyter notebook...
           | > Often, Jupiter notebooks...
           | 
           | Every time I search my Slack, I have to run two searches
           | because DS can't agree on how to spell the damn thing.
        
         | Twirrim wrote:
         | I use jupyter notebooks at work, not so much for academic
         | stuff, but often to help build and show a narrative to folks,
         | including executives (where I have any even remotely technical
         | leadership). It's great for narrative stuff, especially being
         | able to emit PDFs and what not. I've been in a number of
         | meetings where I've got the code up in Jupyter, sharing the
         | screen, and leadership want us to tweak numbers and see the
         | consequences.
         | 
         | It's great for exploring code and data too, especially
         | situations where I'm really trying to feel my way towards a
         | solution. I get to merrily intermingle rich text narrative and
         | code so I explain how I got to where I got to and can walk
         | people through it (I did that with some experimenting with an
         | SMT solver several months ago, meant that people that had no
         | experience with an SMT solver could understand the model I
         | built).
         | 
         | I'd never use it to share code though. If we get to that stage,
         | it's time to export from jupyter (which it natively supports),
         | and then tidy up the code and productionise it. There's no way
         | jupyter should be the deployed thing.
        
           | spiralk wrote:
           | That seems like a reasonable way to use jupyter notebooks
           | since you have an actual plan to move beyond it when
           | necessary. My issue is mostly with the way it's misused, often
           | by people who are arguably at the top of the field.
        
         | KolenCh wrote:
         | I don't disagree with anything you said. Jupytext can be a good
         | tool to bridge some of the gap: you pair an ipynb with a py
         | script and can then commit the py only (git-ignoring all ipynb
         | for your collaborators).
         | 
         | Also, while many practices out there are questionable, in
         | alternative scenarios where ipynb doesn't exist, they might
         | have been using something like matlab, for example. E.g., in my
         | field (physics), oftentimes there are experimentalists doing
         | some coding. Ipynb can be very enabling for them.
         | 
         | I think a piece of research should be broken down and worked on
         | by multiple people to improve the state of the project. Some
         | scientists might be passing you the initial prototype in the
         | form of a notebook, and some others should be refactoring it
         | into something more suitable for deployment and archival
         | purposes. Properly funding these roles is important, and is
         | lacking but improving (e.g. hiring RSEs).
         | 
         | In my field, the most prominent case where ipynb is shared a
         | lot is training. It's a great application, as it becomes
         | literate programming. In this sense the notebook is highly
         | underused, as literate programming still hasn't gone mainstream.
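
The ipynb/py pairing mentioned above rests on a simple idea: a `# %%` script carries the same cells as a notebook, minus the outputs. The sketch below shows that idea with the standard library only. It is not Jupytext's implementation (Jupytext handles this pairing itself, for example through its `--set-formats` option), and `percent_script_to_notebook` is a hypothetical helper that ignores many details of the real notebook format.

```python
import json


def percent_script_to_notebook(script: str) -> dict:
    """Split a '# %%' script into a minimal nbformat-4-style notebook dict."""
    cells = []
    current = None  # the cell currently being filled
    for line in script.splitlines():
        if line.startswith("# %%"):
            kind = "markdown" if "[markdown]" in line else "code"
            current = {"cell_type": kind, "metadata": {}, "source": []}
            if kind == "code":
                current["execution_count"] = None
                current["outputs"] = []  # outputs live only in the .ipynb
            cells.append(current)
        elif current is not None:
            # Markdown cells are stored as comments in the .py file.
            if current["cell_type"] == "markdown":
                line = line[2:] if line.startswith("# ") else line.lstrip("#")
            current["source"].append(line)
    for cell in cells:
        cell["source"] = "\n".join(cell["source"]).strip()
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


script = """# %% [markdown]
# # Title

# %%
x = 1 + 1
print(x)
"""

notebook = percent_script_to_notebook(script)
print(json.dumps(notebook, indent=2))
```

Committing only the .py side keeps diffs readable, while each collaborator's local .ipynb holds their own outputs.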
        
           | spiralk wrote:
           | I've looked into Jupytext, but ultimately decided to go with
           | pure python. Most of the practical functionality can be
           | replicated, but I do admit there isn't an easy single install
           | tool or guide to replace notebooks at the moment.
           | 
           | I think the notebooks are a fine learning tool to introduce
           | people to programming initially, but I'm afraid it doesn't
           | allow for growth beyond a certain level. You have a good
           | point about funding for those software roles. Perhaps this
           | may not be as big of a concern if there were more software
           | talent in these labs to handle the issues that arise.
        
             | KolenCh wrote:
             | In an ideal world where we control everything and/or don't
             | need to collaborate with others, whatever tooling one
             | uses is actually not that important (and each can choose the
             | best fit for their needs). So Jupyter+Jupytext is useful in
             | the context of collaboration, where you can't control your
             | collaborators but want something from them.
             | 
             | While in an ideal world scientists who write software
             | should write it professionally, the same goes for anything
             | they do, including the math and stats used in their research,
             | writing and typesetting, and generating publication-quality
             | visualizations... That rarely happens because of how the
             | academic world is financed, and the incentives associated
             | with it. I can certainly complain about that all day, but
             | in short a researcher hired by a research university,
             | especially in a tenure-track position in the US, will
             | not succeed in getting such a position, let alone getting
             | tenure, if they have not focused their scarce resource of
             | time on maximizing their "research output" (publications,
             | grants, etc.), of which software engineering is not a part.
             | (Sorry, sentence too complicated.)
        
         | ambicapter wrote:
         | The form factor of Jupyter notebooks seems to fit well with
           | people's workflows though. Looks like you just wish the
         | internals of Jupyter were better architected.
        
           | spiralk wrote:
           | Imo, the better architected .ipynb is simply .py with '# %%'
           | blocks. It does almost everything a .ipynb can do with the
           | right VSCode extensions. Even interactive visualizations can
           | be sent to a browser window or saved to disk with plotly.
           | Though I do wish '# %%' cell based execution was accessible
           | to more people.
           | 
           | There isn't a single install tool that "just works" for this
           | at the moment. If editors came with more robust support for
           | it by default, I think the notebook format wouldn't be needed
           | at that point and people could use regular python and
           | interactive cell based python more interchangeably. I've seen
           | important code get buried under collections of jupyter
           | notebooks across different users so I have a good reason for
           | this. Notebooks simply don't scale beyond a certain
           | complexity.
        
             | paddy_m wrote:
             | The two can coexist. Store libraries in python code that is
             | versioned and deployed properly. Notebooks with their data
             | ingest, code, then output should read cleanly. Making the
             | ingest and code readable is the job of library writers. A
             | clean and elegantly coded notebook with inline outputs is a
             | substantively different experience than searching all over
             | the place for the correct browser window that corresponds
             | to the output from a given piece of code.
        
         | abdullahkhalids wrote:
         | The same problem exists with spreadsheets. Should we get rid of
         | excel (the single tool that literally runs half the world), and
         | start manually writing markdown tables in text files?
         | 
         | The tool and the tool maker are supposed to serve the user. The
         | user is not supposed to conform to the whims of the tool maker.
        
           | ants_everywhere wrote:
           | Since 94% of business spreadsheets contain errors [0], then
           | probably yes we should get rid of or significantly improve
           | spreadsheets.
           | 
           | Probably the solution is that things like Jupyter notebooks
           | and spreadsheets should be views into some better source of
           | truth rather than the source of truth themselves.
           | 
           | [0] https://phys.org/news/2024-08-business-spreadsheets-
           | critical.... I remember a similar figure from studies a
           | decade or so ago.
        
             | glzone1 wrote:
             | The funny thing is I've seen folks try to deploy software
             | to get rid of spreadsheets. It always ends badly to
             | terribly.
             | 
             | Spreadsheets are the nonprogrammer's programming / modeling
             | tool in business.
             | 
             | It does presentation, data filtering / sorting, modeling
             | and more.
             | 
             | No AI needed (and you can now plug AI in in some cases).
        
               | ants_everywhere wrote:
               | Sure, but here's the basic problem I think:
               | 
               | Suppose you have some formula that computes a financial
               | metric for your company. Someone you've shared it with
               | drunkenly fat-fingers the formula 3/4 of the way down a
               | long row, and that causes all entries below it to
               | recompute with the wrong formula. Unless the change is
               | really drastic, you may never know it happened.
               | 
               | And this sort of mistake -- basically a typo or a bad
               | mouse movement -- happens daily in every company in the
               | world in some spreadsheet. Often people will notice the
               | mistake, but not with probability 1.
               | 
               | Software engineers have mechanisms to guard against some
               | of these mistakes, and even we have a hard time getting
               | people to take code review or tests seriously. What is
               | the guard in the spreadsheet world?
        
           | paddy_m wrote:
           | Another issue is that jupyter, pandas, and polars don't take
           | displaying tabular data seriously. Just have a better default
           | table display widget. Look at ipydatagrid, perspective, or
           | buckaroo (my project) for examples of how it could be done
           | better.
        
         | epistasis wrote:
         | I think there's a fundamental misunderstanding and mismatch
         | between what you want to do, and what Jupyter notebooks are
         | for. The distinction is between code versus the results.
         | 
         | If the code is the end product, sure, use a python package.
         | 
         | But does your .py with `# %%` in it also store the outputs? If
         | not, why even bring this up? A .py output without the plots
         | tied to the code doesn't meet the basic use case.
         | 
         | If the end product is the plot, I want to see how that plot was
         | generated. And a Jupyter notebook is a much much better
         | artifact than a Python package, unless that Python package hard
         | codes the inputs and execution path like a notebook would.
         | 
         | Over the past 20 years of my career I have run into this
         | divergence of use cases a lot. Software engineers seem to not
         | understand the end goals, how it should be performed, and the
         | learnings of the practitioners that have been generating
         | results for a long time. It's hard to protect data scientists
         | from these inflexible software engineers that see "aha that's
         | code, I know this!" without bothering to understand the actual
         | use case at hand.
        
           | spiralk wrote:
           | Not having the outputs tied into the code is actually
           | preferable if the ultimate goal is reproducible science. Code
           | should be code, documentation should be documentation, and
           | outputs should be outputs. Having multiple copies of
           | important code in non-version controlled files is not a good
           | practice. Having documentation dispersed with questionable
           | organization in unsearchable files is not a good practice.
           | Having outputs without run information and timestamps is not
           | a good practice. It's easy to fall into those traps with
           | Jupyter notebooks. It might speed up initial set up and
           | experimentation, but I've been working in academic labs long
           | enough to see the downstream effects.
        
             | majormajor wrote:
             | Having the outputs recorded _alongside specific versions of
             | the code_ can actually be very valuable.
             | 
             | But since most uses of Jupyter notebooks I've seen don't
             | version control them much at all, it's often not as useful
             | in practice.
        
               | spiralk wrote:
               | Yeah, jupyter notebooks don't guarantee any specifics
               | about versions of code used for that output. In the real
               | world you can expect everyone in the lab including all of
               | the students to be editing jupyter notebooks at whim. The
               | only way to do this would be to have proper version
               | control of your code, a snapshot of the environment,
               | and to log all this along with the run that generated the
               | output. This is possible with regular python using git,
               | proper log files, etc. Jupyter notebooks seem like an
               | extra roadblock.
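
The kind of run logging described above (code version, environment snapshot, and timestamp recorded alongside the output) could be sketched like this with the standard library. The function names and the choice of fields are assumptions made for illustration; the sketch also assumes `git` may or may not be available and falls back to "unknown".

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata


def current_commit() -> str:
    """Return the current git commit hash, or 'unknown' outside a repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_record(packages: list[str]) -> dict:
    """Collect the facts needed to tie an output to the code that made it."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": current_commit(),
        "python": platform.python_version(),
        "packages": versions,
        "argv": sys.argv,
    }


record = run_record(["numpy", "pandas"])
print(json.dumps(record, indent=2))
```

Writing this record next to each saved figure or result file gives every output a traceable origin, whether the code ran from a script or a notebook.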
        
               | paddy_m wrote:
               | Ooh. That's a nice utility function that I will write
               | soon. We tend to look at requirements as something we
               | hope the package manager gets right, and then we ignore
               | at runtime, but there are a bunch of errors we could
               | avoid if we verified at runtime. Sometimes when writing a
               | library you have to have different code paths for
               | different versions.
               | 
               | Something like `if check_versions(pandas__gt="2.0.0",
               | pandas__lt="3.0.0"):`
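
One possible shape for the `check_versions` idea floated above, as a rough sketch only. The `package__op` keyword convention comes from the comment; everything else (the `lookup` parameter, the set of operators, the simplified `parse`) is an assumption. The parser only reads leading numeric components, so pre-release tags such as `rc1` are ignored; real code would more likely rely on `packaging.version.Version`.

```python
from importlib import metadata
from typing import Callable


def parse(version: str) -> tuple[int, ...]:
    """Parse the leading numeric parts of '2.1.4' or '2.0.0rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


OPS = {
    "gt": lambda a, b: a > b,
    "ge": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "le": lambda a, b: a <= b,
    "eq": lambda a, b: a == b,
}


def check_versions(
    lookup: Callable[[str], str] = metadata.version, **constraints: str
) -> bool:
    """Return True if every 'package__op=version' constraint holds."""
    for key, wanted in constraints.items():
        package, _, op = key.rpartition("__")
        if not package or op not in OPS:
            raise ValueError(f"bad constraint name: {key!r}")
        have, want = parse(lookup(package)), parse(wanted)
        # Pad so that '2.1' and '2.1.0' compare as equal.
        width = max(len(have), len(want))
        have += (0,) * (width - len(have))
        want += (0,) * (width - len(want))
        if not OPS[op](have, want):
            return False
    return True


installed = {"pandas": "2.1.4"}  # made-up version for the demo
ok = check_versions(installed.__getitem__, pandas__gt="2.0.0", pandas__lt="3.0.0")
```

Passing `lookup` explicitly keeps the demo independent of what is installed; with the default, the function reads the installed version through `importlib.metadata.version`.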
        
             | yunohn wrote:
             | Often the notebook was run on a beefy server with GPUs
             | attached, potentially taking hours/days of compute. It
             | would be senseless to force every viewer of a Jupyter
             | notebook to have the same setup and time just to read
             | through the results and output.
        
         | ants_everywhere wrote:
         | We've seen how this ends because mathematicians have been
         | sharing Mathematica notebooks forever. It's not pretty.
         | 
         | Like you I see the appeal, but they're a usability nightmare
         | beyond a few lines. Part of the problem, I think, is that you
         | can't really incrementally improve them. Who wants to refactor
         | a notebook and deal with all the cell dependency breakage?
         | 
         | So they start off okay and then slowly become terrible until
         | they're either irreplaceable or too terrible to work with and a
         | new one is started.
        
       | pizza wrote:
       | While we're on the topic of jupyter enhancements, would really
       | love to be able to pop a cell off the pending execution stack if
       | I realized running it would be a mistake and still have time
       | before it gets there.. :^)
        
         | sa-code wrote:
         | And also preserving cell output if you happen to reload the
         | page
        
       ___________________________________________________________________
       (page generated 2024-09-11 23:02 UTC)