[HN Gopher] Mito - Excel-like interface for Pandas dataframes in...
___________________________________________________________________
Mito - Excel-like interface for Pandas dataframes in Jupyter
notebook
Author : alefnula
Score : 207 points
Date : 2022-05-20 12:11 UTC (10 hours ago)
(HTM) web link (www.trymito.io)
(TXT) w3m dump (www.trymito.io)
| jpn wrote:
| I played around with many of these before:
|
| - https://github.com/quantopian/qgrid
|
| - https://github.com/man-group/dtale
|
| I find that I'm actually a lot faster using basic Pandas methods
| to get the data I want in exactly the form I want it.
|
| If I really want to show everything, I just use:
|
| ```
|
| import pandas as pd
| with pd.option_context('display.max_rows', None):
|     print(df)
|
| ```
| dekhn wrote:
| what irks me about dtale is if you scroll with the vertical
| slider, it can't update the view fast enough until you stop
| scrolling.
| Foivos wrote:
| I use a similar function when I want to see everything:
|
| ```
|
| def showAllRows(dataframeToShow):
|     # Lift the row and column display limits only inside this block.
|     with pd.option_context('display.max_rows', None,
|                            'display.max_columns', None):
|         display(dataframeToShow)
|
| # calling it while limiting the number of returned rows.
|
| showAllRows(df.head(1000))
|
| ```
|
| Be warned though! If you call this function without limiting
| the number of rows to be fetched, it is guaranteed you will
| crash your machine. Always use head, sample or slices.
|
| If you do get a crash, then your only option is to open the ipynb
| file with vi and manually delete the millions of lines this
| function created.
|
| Another function that I like is:
|
| ```
|
| def showColumns(df, substring):
|     print([x for x in df.columns if substring in x])
|     return
|
| # calling it
|
| showColumns(df, "year")
|
| ```
|
| This is useful in data frames with many columns, when you want
| to find all the columns that have a specific string in their
| name. It prints the list of matching names, which you can then
| pass to the dataframe to show only those columns.
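A variant of the helper above that returns the matching names instead of printing them, so the result can be used to select columns directly. This is only a sketch, not the commenter's original code; the name `columns_containing` and the sample frame are made up for illustration.

```python
import pandas as pd


def columns_containing(df: pd.DataFrame, substring: str) -> list:
    """Return the column labels whose string form contains `substring`."""
    return [col for col in df.columns if substring in str(col)]


# Hypothetical example frame.
df = pd.DataFrame(
    {"birth_year": [1990, 1985], "name": ["a", "b"], "year_joined": [2015, 2020]}
)

year_cols = columns_containing(df, "year")
print(year_cols)
print(df[year_cols])  # select only the matching columns
```

Returning a list rather than printing keeps the helper composable: the same result can feed `df[year_cols]`, `df.drop(columns=year_cols)`, and so on.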
| sodimel wrote:
| Looks like a Datasette[0] clone which runs on top of something
| (jupyter) which runs on top of Python (ipython). I think I would
| like to see how much time it takes to open a massive dataset in
| Mito & in Datasette :P
|
| [0]: https://datasette.io/
| aarondia wrote:
| Heyo, one of the Mito creators here. Thanks for sharing
| Datasette. I haven't seen that one before. It looks neat!
|
| You're right though, there are several tools that fit the
| general shape of: GUI on top of Jupyter on top of Python.
| There are a few general vectors to understand these tools by:
|
| 1. Excel-ness: Although most (if not all) of these tools
| incorporate some type of spreadsheet, the interface for
| interacting with the data in that spreadsheet differs greatly.
| Some tools, like Bamboolib [1] and Datasette [2] resemble Excel
| only in the spreadsheet. Other tools, like Mito [3], stick to a
| lot of the other Excel design decisions -- things like having a
| toolbar with buttons and menu items to access functionality,
| the ability to write spreadsheet formulas inside of the cell &
| formula bar, etc. In many ways, this Excel-ness design vector
| is a proxy for how easy it is to get started with the tool.
| What we see is that users are able to download Mito and get
| something useful out of their first analysis because the interface
| is one that they are used to!
|
| 2. Ownership of your analysis / lack of lockin: We believe that
| the most powerful low-code spreadsheet tools allow spreadsheet
| users to easily transition to full programming languages, if
| they want to. Instead of locking users into a limited and
| proprietary product, it's better if users can transition to a
| full programming language (like Python) very naturally. This
| transition is super natural in Mito because we generate Python
| code for every edit that a user makes. So if Mito doesn't
| support the exact transformation that you want, you can use
| Mito as a starting point for your analysis and customize the
| script that Mito generates.
|
| [1] https://bamboolib.8080labs.com/ [2] https://datasette.io/
| [3] https://www.trymito.io/
| kite_and_code wrote:
| bamboolib co-founder here. We are also thinking about adding
| Excel-type formulas to the UI and already have internal
| prototypes.
|
| However, please be aware that bamboolib might soon only be
| available within Databricks notebooks instead of local
| Jupyter notebooks like mito.
| narush wrote:
| Hey everyone. Mito cofounder here. Thanks to whoever posted this
| - was a real surprise to find it here :-)
|
| Mito (pronounced my-toe) was born out of our personal experience
| with spreadsheets, and a previous (failed) spreadsheet version
| control product.
|
| Spreadsheets were the original killer app for computers, and are
| the most popular programming language used worldwide today. That
| being said, spreadsheets have some growing to do! They don't
| handle large datasets well, they don't lead to repeatable or
| auditable processes, and generally they disrespect many of the
| hard won software engineering principles that we engineers fight
| for.
|
| More than that, as spreadsheet users run into these problems and
| turn to Python to solve them, they struggle to use pandas to
| accomplish what would have been two clicks in a spreadsheet.
| Pandas is great, but the syntax is not always so obvious (nor is
| learning to program in the first place!)
|
| Mito is our first step in addressing these problems. Take any
| dataframe, edit it like a spreadsheet, and generate code that
| corresponds to those edits. You can then take this Python code
| and use it in other scripts, send it to your colleagues, or just
| rerun it.
|
| We've been working on Mito for over a year now. Growth has really
| picked up in the past few months - and we've begun working with
| larger companies to help accelerate their transition to Python.
|
| To any companies who are somewhere in that Python transition
| process - please do reach out - we would love to see if we can be
| helpful for all your spreadsheet users!
|
| Feel free to browse my profile for other spreadsheet related
| thoughts, I'm a bit of a HN junkie. Of course, any and all
| feedback (positive or negative) is appreciated.
|
| My cofounders and I will be trolling about in the comments. Say
| hey! :-)
| kite_and_code wrote:
| If you are a large company trying to migrate to Python, you
| might also want to have a look at bamboolib.com which was
| acquired by Databricks.
|
| bamboolib is very similar to mito (hard to tell who was first).
|
| The advantage is that it runs within Databricks which gives you
| the ability to scale to any amount of data easily and
| Databricks has many (and growing) security certifications e.g.
| HIPAA compliance.
|
| bamboolib can be used in plain Jupyter. Also, the bamboolib private
| preview within Databricks is about to start within the next few
| days.
|
| Full disclosure: I am a co-founder of bamboolib and employed by
| Databricks
| NoImmatureAdHom wrote:
| bamboolib appears to be closed-source. You're at their mercy.
| kite_and_code wrote:
| bamboolib co-founder here:
|
| It's correct that bamboolib is (still) closed-source (which
| might be subject to change but I don't make promises).
|
| It's also correct that customers can extend the bamboolib
| UI in various ways via plugins that they can author
| themselves. That empowers them to build bamboolib into the
| kind of tool that they want.
|
| Also, all the code is always exported and thus, there is at
| least no "code lockin".
|
| Regarding being "at their mercy", I can say that there are
| many customers who are happy with the service that we
| provide.
| NoImmatureAdHom wrote:
| I'm sure you have good intentions, but the fact of the
| matter is the company may be acquired or the people
| replaced, and those intentions might change.
|
| IMHO investing in a closed-source product like bamboolib
| as a tool for an important business function is very
| risky. Imagine you're a small company, and you start
| using bamboolib for some part of your data analysis
| pipeline. Bamboolib gets acquired (you have exited
| kite_and_code, congratulations), and the now very large
| company that controls it decides to stop supporting some
| feature critical to what you're doing, make an addition
| that messes everything up, go full-on SaaS somehow, or
| just shut the product down. What now? You've been
| growing, so you've got a small team of junior non-experts
| who were getting the hang of it...switching will be
| painful (or you could lock yourself in that walled garden
| and pay the SaaS price...).
| nojito wrote:
| Excel is closed source and it powers the world.
| wanderingmind wrote:
| If MS shuts down, there are better FOSS tools that can
| process Excel files (LibreOffice Calc), or in general the entire
| office ecosystem. Can't say the same for small startups.
| aarondia wrote:
| Heyo! Another co-founder here. Excited to see Mito on HN :)
| Thanks @alefnula for posting!
|
| +1 to everything @narush said.
|
| It's important to us that the software we build is empowering
| to users and not restrictive. This plays out in two primary
| ways: 1) Since Mito is open source and generates Python code
| for every edit, Mito doesn't lock users into a 'Mito
| ecosystem', instead it helps users interact with the powerful &
| robust Python ecosystem. 2) Because Mito is an extension to
| Jupyter Notebooks + JupyterLab, Mito improves your existing
| workflows instead of completely altering your data analytics
| stack.
|
| Excited to interact with you all in the comments :)
| kite_and_code wrote:
| Can you please clarify what you mean by "mito is open-
| source"?
|
| Last time I checked the code was under a proprietary license.
|
| Edit: I found in another comment below that mito is now
| available under GPL license here:
| https://github.com/mito-ds/monorepo/blob/dev/LICENSE
|
| Edit2: Just saw your answer now - thanks for the
| clarification and links!
| aarondia wrote:
| Mito is licensed [1] under the AGPL license. The TLDR of
| the license is that you can use, distribute, and modify
| Mito for free, but any modifications that you make need to
| be shared back with the Mito community.
|
| There is an additional version of Mito, Mito Pro, that is
| licensed under a different license that provides access to
| advanced functionality only if you are paying for a Mito
| Pro / Enterprise subscription.
|
| [1] https://github.com/mito-ds/monorepo/blob/dev/LICENSE
| [2] https://github.com/mito-
| ds/monorepo/blob/dev/mitosheet/src/p...
| teruakohatu wrote:
| Does AGPL mean it can only be used in a notebook for
| which the notebook itself is open source?
|
| Or does it mean it can only be used with notebook
| software (eg. Jupyter) that is open source but in a
| closed source notebook?
| whoevercares wrote:
| Tricky question - what do you think about Databricks, who acquired
| Bamboolib and say they will integrate a pandas GUI into their
| workspace?
| rcarmo wrote:
| The telemetry thing is... weird. So we can use it for free but
| have no way to turn it off but upgrade to paid?
| MadameBanaan wrote:
| Yeah, I take a hard pass on anything that offers an "Open
| Source" version, but is actually meant to be a "Try it and be my
| Guinea Pig".
| [deleted]
| aarondia wrote:
| Thanks for that feedback. Mito's approach to telemetry is that
| we never log any of your data or metadata about your data. We
| don't track things like the size, shape, or content of your
| data.
|
| We do collect info about app usage, things like which buttons
| users click. This allows us to focus development time on
| improving the features that are used most often.
|
| That being said, it's important to us that there is a way to be
| totally telemetry-less if users don't want any information to
| leave their computer. Compared to most other cloud-based
| SaaS data science tools where you pretty much have no hope of
| total privacy, we're proud of the flexibility that we offer.
|
| But of course, we're always open to feedback about how we can
| continue to improve our practices!
| learndeeply wrote:
| I don't get it. What in the license prevents users from
| removing the telemetry? AGPL just means the user needs to
| open source that change, right?
|
| Edit: To remove telemetry, just call: from
| mitoinstaller.user_install import go_pro; go_pro();
|
| No licensing or payment required, and doesn't violate the
| license.
| narush wrote:
| Mito is open source, but using Pro features does actually
| require a Pro or enterprise license. You can check out this
| callout in the license [1], as well as the restrictions on
| Mito Pro features here [2]. We're in the process of fixing
| up the upgrade to Pro process a bit... as you can tell...
| :)
|
| You can of course fork Mito and turn off telemetry as long
| as you open source your changes! Go for it - happy to hop
| on a call and help you get set up with the codebase, if you
| want. Yay open source!
|
| [1] https://github.com/mito-
| ds/monorepo/blob/974091b455950c6c50e... [2]
| https://github.com/mito-
| ds/monorepo/blob/dev/mitosheet/mitos...
| teruakohatu wrote:
| You should consider using something other than pip to
| distribute an installer for the Pro version.
| NoImmatureAdHom wrote:
| Just to be very clear, the way to be "totally telemetry-less"
| is to pay you?
| aarondia wrote:
| Yes
| boringg wrote:
| Looks neat - pandas is very powerful and this makes it more
| approachable for non-programmers. However, with a paid product like
| this, I probably wouldn't make the switch and then risk having the
| company go belly up, leaving users stranded. Too much risk.
|
| Hope for the best though - pandas is pretty fantastic.
| okennedy wrote:
| You might want to check out a tool called Vizier:
| https://vizierdb.info (I'm one of the devs). Direct interaction
| with notebook state (e.g., dataframes as spreadsheets) is one
| of the central ideas, and it's fully open source.
| aarondia wrote:
| This looks cool :)
| aarondia wrote:
| One of the creators of Mito, here. Thanks for your feedback. I
| wanted to share a couple of nuggets about Mito that have been
| helpful in talking about this with other users.
|
| 1. The core Mito product is open source. You can see our GitHub
| here [1]. We also have a pro version that has some additional,
| code visible, but non-open source features. The way that we
| think about which features belong in which version of the
| product is as follows: Features that are needed to just get
| any average analysis done are open source features. On the
| other hand, features that are specifically useful in an
| organization -- connecting to company databases, formatting /
| styling data and graphs for a presentation, etc. -- are pro
| features. So if you are a team that is relying on our pro
| features, you're helping support the longevity & progress of
| Mito. If you are not one of those users and using the open
| source version, then you will always have access to Mito (and
| can even help improve it!). Of course the line between what
| features are specifically helpful in an organization and what
| features are needed for an average analysis is a bit blurry, and
| is a moving target as we continue to expand Mito's offering.
|
| 2. Mito is designed specifically to not force users to make a
| big 'switch'. I've commented this elsewhere in this thread, but
| just to recap: Because Mito is an extension to Jupyter and
| because we generate Python code for every edit you make, Mito
| is designed to improve your existing workflow instead of locking
| you into a new system. Many Mito users use Mito as a starting
| point! They do as much of their analysis as they can in the
| Mito spreadsheet and then continue writing more customized
| Python code to finish up their work.
|
| Not requiring a big switch is nice for the user and it's nice
| for Mito too! Lots of large companies have been able to get up
| and running with Mito in 30 minutes because it fits into their
| data stack.
|
| Anyways, not that these are the only two reasons you might feel
| uneasy about adopting Mito, but at least wanted to share why
| the switch to Mito might be less scary than switching to other
| tools.
|
| [1] https://github.com/mito-ds/monorepo
| kite_and_code wrote:
| I love how mito enables companies to use the power of open-
| source!
|
| You might want to think about enabling companies to create
| the company specific extensions themselves e.g. via a plugin
| API. You might still request them to pay for this version of
| Mito but they are enabled to extend it with their engineering
| power instead of relying on you.
|
| We had good experiences with this at bamboolib (I am one of
| the co-founders) and in addition to recurring license revenue
| it also increased demand for consulting from our end because
| the internal company devs started working on plugins and then
| wanted our direct guidance on how to get the more tricky
| things to work.
| narush wrote:
| Yeah, we've thought a bit about a plugin API - for the
| reasons you say, I think it would be an awesome feature to
| open up to teams!
|
| Any tips on going about it? No need to share the secret
| sauce, unless you want :P
|
| To be totally honest, we're not architected super well to
| support plugins currently. The big challenge would be
| allowing users to specify this plugin in pure Python (seems
| like we want this) - but we think that hand-coded UIs
| outperform autogenerated ones for now. We've been thinking
| about how to do better though... maybe soon.
|
| Of course, if Mito is missing features, we're open source
| [1] -- all contributions are welcome! Also feel free to
| open an issue and we can discuss :)
|
| [1] https://github.com/mito-ds/monorepo
| flakiness wrote:
| Nice! The page looks more like a SaaS offering or something,
| which initially scared me away a bit. I hope the emphasis is more
| on the opensource library and showing paying options as some
| premium thing.
|
| I didn't realize that the "too nice" landing page makes me
| anxious for open source software :-/
| aarondia wrote:
| Well, first off, thank you, I put a lot of effort into
| implementing that landing page :)
|
| We're super focused on the open source offering. The vast
| majority of our users are on the open source version and the
| vast majority of the features we release are open source! (You
| can check out our PR's if you're interested in verifying)
|
| The Mito Pro and Enterprise plans are designed for advanced
| users and teams. In those versions we provide features that
| make it easier to collaborate, create presentation-ready
| materials, and hook up to other company resources.
|
| But we're an open source tool through and through!
| narush wrote:
| Fancy seeing you here, writing the same comment as me... :}
| narush wrote:
| To pull the curtains back a bit: we probably spend about 85% of
| our product and development time on open source code. Just this
| week, we developed copy and paste, NaN value filling, and
| splitting a text column on a delimiter - all of these are open
| source features.
|
| As we've begun to engage with larger teams, we often take
| features that we build out for their workflow and open source
| them as well - a few of the teams have been explicit proponents
| for the open source tool, which is awesome to see.
|
| I'm sure our thinking on this will evolve over time, but we are
| highly focused on developing just a _great_ piece of open
| source software. And for folks that need more power, we want to
| give them the chance to get it - while also supporting Mito's
| development :)
|
| P.S. Check out our Mito Pro roadmap here:
| https://www.trymito.io/plans#mito_pro_roadmap. Feedback
| appreciated!
| kite_and_code wrote:
| I am not so sure about the open-source fact. Please see
| comments and thread below.
|
| Edit: It is GPL by now as seen here https://github.com/mito-
| ds/monorepo/blob/dev/LICENSE
| noobker wrote:
| Mito looks cool. I'm hopeful a tool like it can create a bridge
| between Excel-based analysts/researchers and more mature
| application flows.
|
| Another tool like Mito is Bamboo: https://bamboolib.8080labs.com/
| aarondia wrote:
| Heyo, Mito cofounder here, bridging that gap is one of the main
| ways that enterprises are using Mito today! Helping business
| users become data self-sufficient in a world where Excel's data
| size limitations make it a non-option is where Mito shines :)
| pipeline_peak wrote:
| The web page needs to be heavier
| narush wrote:
| Super fair, lol. We'll work on optimizing it - just a tiny team
| and lots of things on our plate rn. The main issue is our
| images / video, which I have tried compressing but can't do so
| while maintaining the quality. Any tips are greatly
| appreciated!
|
| Believe it or not, the last version of this website was even
| heavier...
| harabat wrote:
| For those who are going through the thread finding new tools:
| pandas-profiling[0] is a library for automatic EDA (which
| bamboolib[1], mentioned elsewhere, also does).
|
| [0]: https://github.com/pandas-profiling/pandas-profiling [1]:
| https://bamboolib.com/
| kite_and_code wrote:
| Lux might also be interesting: https://github.com/lux-org/lux
| narush wrote:
| Def check these all out! Lots of cool tools out there. For
| anyone who's tried a bunch of these... that's a great topic
| for a Medium post :)
| jpalomaki wrote:
| If others are interested, Mito does not work in vscode or Google
| Colab. Only classic Jupyter Notebooks and JupyterLab are
| supported currently [1].
|
| [1] https://docs.trymito.io/misc/faq
| malshe wrote:
| Thanks for checking that! I use vscode so this is a no go...
| aarondia wrote:
| Yeah, Mito is limited to the Jupyter ecosystem (for now). We
| want to expand to VSCode, Google Colab, and Streamlit!
|
| For the time being, because Mito generates pandas code for
| every edit you make, you can always use Mito in Jupyter to
| generate code, and then copy it over to VSCode. Admittedly,
| it's not as nice of a workflow, but it does work!
| pen2l wrote:
| I see you guys provide convenient installers that can be
| obtained with pip. Me, I run my JupyterLab that I got with
| Conda on a Windows setup. Can you comment on whether it's a
| thorn-free path to get Mito in such a setup? Or should I
| use this as another sign to completely migrate to a nix
| system for all my dev needs... :)
| aarondia wrote:
| You should be good to go with your conda setup on
| windows! I run Mito on a windows machine through a conda
| virtual environment often! We have some instructions for
| how to do that here [1]
|
| [1] https://docs.trymito.io/getting-started/installing-
| mito/inst...
| punk_ihaq wrote:
| Wow love it! It would be cool to see a bidirectional Streamlit
| custom component for Mito!
| narush wrote:
| This is on the roadmap! Would love to hear a bit more about how
| you would use this component...
|
| 1. Would you want the component to generate code? Or would it
| just be the editing of a dataframe that is useful to you?
|
| 2. What other components would be used in this dashboard? Would
| love to hear a bit more about the workflow around Mito here.
|
| The more detail you can provide - the more helpful in
| prioritizing this! I think Mito in streamlit would be ...
| awesome!
| kite_and_code wrote:
| Another alternative is bamboolib.com which was acquired by
| Databricks last September to offer it within Databricks notebooks
| filmor wrote:
| Are you affiliated? There are three comments in this comment
| page by you, and they all manage to mention bamboolib...
| kite_and_code wrote:
| Yes, I am one of the co-founders of bamboolib and employed by
| Databricks.
|
| I already added my disclosure to the following answer [0] in
| this thread but I was hesitant to add it to every answer.
|
| Do you prefer if I explicitly add my affiliation in every
| comment that mentions bamboolib? If so, I will try to edit
| them (if the HN UI still allows me to - I observed that it
| stops allowing this after some time)
|
| [0] https://news.ycombinator.com/item?id=31450910
| Closi wrote:
| > Do you prefer if I explicitly add my affiliation in every
| comment that mentions bamboolib?
|
| Personally, I thought your original post's tone implied that
| you weren't affiliated.
|
| You don't have to add a formal 'disclosure', but you could
| just say "I built x which is..." rather than "Another
| product is x which is...".
| santiagobasulto wrote:
| I like this. It's a "friendlier" way to browse data. That said, I
| have to add:
|
| Exploring large datasets requires a COMPLETELY different mindset.
| When your data starts growing, it's impossible to keep it all in
| a visual format (for 2 reasons[0]) and you have to start thinking
| analytically. You have to start looking at the statistical values
| of your data to understand what its shape is. That's why the
| `.describe()` and `.info()` methods in Pandas are so useful.
| After many years doing this, I can "see" the shape of my data
| just by looking at the statistical information about it (mean,
| median, std, min, max, etc).
|
| After some time you don't need to rely on visual tools; you can just
| run a few methods, look at some numbers, and understand all your
| data. Kinda feels like the operator of The Matrix that is looking
| at the green numbers descend and knows what's going on behind the
| scenes.
|
| [0] Your eyes are really inefficient at capturing information and
| there's only so much memory available: try loading a 15GB CSV in
| Excel.
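For readers who have not used the two methods mentioned above, here is a minimal sketch of what they report. The dataframe is invented for illustration and is not from the thread.

```python
import pandas as pd

# Hypothetical data with one missing value and one outlier.
df = pd.DataFrame(
    {
        "price": [10.0, 12.5, 11.0, None, 250.0],
        "category": ["a", "b", "a", "c", "b"],
    }
)

df.info()                # dtypes and non-null counts per column
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)
print(summary)

# A mean far above the median hints at a right-skewed column or an outlier.
mean_price = summary.loc["mean", "price"]
median_price = summary.loc["50%", "price"]
print(mean_price, median_price)
```

Comparing the mean against the 50% row is one of the quick "shape" checks the comment alludes to: here the single large value pulls the mean well above the median.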
| wenc wrote:
| I would caution against this approach in general (unless you're
| working with unusually uniform data from a deterministic source
| -- in my world that is rarely the case). Summary statistics are
| useful but taken in isolation they can mislead. One loses the
| ability to get a feel for interesting non-aggregated
| phenomena.
|
| I find it's important to actually "touch" the raw data even if
| only in a buffered, random sampling sort of way to get a feel
| for it. Sometimes with big datasets, looking through rows of
| data feels tedious and meaningless but I've found that I've
| often picked up on things I wouldn't have without actually
| looking at the raw data. Raw data is often flawed, but there's
| often some signal in it that tells a story hence it's important
| not to overlook these through a lens of aggregate statistics.
|
| The next step is to visualize the data multidimensionally in
| something like Tableau. Tableau works on very large datasets
| (it has an internal columnstore format called Hyper) and can
| dynamically disaggregate and drill down. Insights are usually
| obtained by looking at details, not aggregates.
| kite_and_code wrote:
| If you want to use open-source Python-based visualizations
| instead of Tableau, the following tools allow the creation of
| custom plots - including the ability to export the underlying
| code.
|
| - bamboolib (proprietary license - acquired by Databricks in
| order to run within the Databricks notebooks)
|
| - mito (GPL license)
|
| - dtale (MIT license)
| pea wrote:
| If you can write visualisations in Python itself, I am a
| big fan of Altair's syntax (https://github.com/altair-
| viz/altair), which is based on vega-lite. A while back, I
| wrote a brief guide and comparison of the main plotting
| libraries: https://datapane.com/reports/87NNEJ7/the-
| ultimate-guide-to-p...
|
| One benefit of having them in actual code is that you can
| programmatically automate the creation of things like
| dashboards and reports. For instance, schedule a script to
| share an interactive plot every Monday morning, or build a
| live dashboard that updates every 10m. This opens up a lot
| of possibilities that would be impossible in a traditional
| drag-and-drop tool.
| kite_and_code wrote:
| Thanks for mentioning Altair. I am personally also a big
| fan.
|
| I am one of the co-founders of bamboolib and we are
| actively thinking about adding support for altair to the
| Plot Creator (instead of just relying on Plotly).
|
| Since we are talking other viz options in Python, there
| are of course also matplotlib, seaborn, plotly, and more.
| aarondia wrote:
| > programmatically automate the creation of things like
| dashboards and reports.
|
| That's an awesome use case for Python, and that sort of
| script generation is one of the main reasons that we see
| people adopting Python/Mito. And specifically,
| graphing[1] is one of the most popular features in Mito.
|
| Mito generates Plotly [2] graphs, and of course generates
| the Plotly graph code too, so you can customize the
| graphs to your perfect liking (Plotly has great
| documentation and a lot of customizations) or schedule
| the script to run automatically.
|
| [1] https://docs.trymito.io/how-to/graphing [2]
| https://plotly.com/
| mejutoco wrote:
| A good example of what you are warning against is Anscombe's
| quartet
|
| https://en.wikipedia.org/wiki/Anscombe's_quartet
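To make the warning concrete, the sketch below compares the first two datasets of Anscombe's quartet using only the standard library. The `pearson` helper is written here for illustration; the point is that mean, variance and correlation nearly coincide even though the individual values clearly differ.

```python
from statistics import mean, pvariance

# First two datasets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


# Print mean, population variance and correlation with x for each dataset.
for name, ys in (("y1", y1), ("y2", y2)):
    print(name, round(mean(ys), 2), round(pvariance(ys), 2), round(pearson(x, ys), 2))
```

Plotting `y1` and `y2` against `x` is what reveals the difference (one is roughly linear with noise, the other follows a curve), which is the argument for looking at the raw points as well as the aggregates.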
| santiagobasulto wrote:
| Histograms and Boxplots (and IQRs) don't lie tho...
| mrbungie wrote:
| Boxplots don't lie, but they can mislead as any summary
| statistic, data viz or model can.
| https://blog.bioturing.com/wp-
| content/uploads/2018/11/BoxVio...
|
| Misleading histograms depend totally on the bin-width
| tho.
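A small standard-library sketch of the bin-width point: the same values are binned twice, and the wide bin hides a gap that the narrow bins show. The `histogram` helper and the data are invented for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data with two clusters and nothing in between.
values = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9, 9, 9]

narrow = histogram(values, 2)   # the empty region between the clusters is visible
wide = histogram(values, 10)    # everything lands in a single bin
print(narrow)
print(wide)
```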
| santiagobasulto wrote:
| Of course, `.head()`, `.tail()`, `iloc` and other
| mechanisms to view subsets of the data are always
| important. But would you really caution AGAINST this? Like,
| literally telling someone NOT to use summary statistics to
| explore a dataset?
| wenc wrote:
| No, I'm more cautioning against using summary statistics
| _in isolation_ without looking at the raw data.
|
| I was more responding to the statement that one can "see"
| the shape of data through them without needing visual
| tools. The lens of summary statistics is a very narrow one
| -- it's a necessary but almost always insufficient one.
| Even .ilocs are insufficient --- it's hard to know what to
| .iloc for. One really needs to browse the data
| interactively to get a good sense of it.
| santiagobasulto wrote:
| Ah, ok. Sorry, I misunderstood. Yes, we're on the same
| page. As usual, a good balance is necessary.
| mint2 wrote:
| Do you as a rule look at a sample of the individual raw data,
| non aggregated?
| santiagobasulto wrote:
| Usually aggregated... then can start looking at "subsets".
| For example, step 1 is look at the whole dataset. Then you
| identify that there are a lot of rows with a type of missing
| value, so you look at the statistical attributes of that
| subset (all the rows with value X in null).
|
| From time to time you can do a `.head()/.tail()` or an
| `.iloc[X:Y]` to check some things visually. But just as a
| "refresher".
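The workflow described above (whole dataset first, then the subset with a missing value, then an occasional visual check) might look like the following sketch. The table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical orders table; `discount` is missing for some rows.
df = pd.DataFrame(
    {
        "amount": [20.0, 35.0, 500.0, 15.0, 480.0],
        "discount": [0.1, None, None, 0.2, None],
    }
)

print(df.describe())                  # step 1: statistics for the whole dataset

missing = df[df["discount"].isna()]   # step 2: rows where `discount` is null
print(missing["amount"].describe())   # statistics of that subset only

print(df.iloc[1:3])                   # occasional visual spot check of a slice
```

In this made-up table the rows with a missing `discount` have a much higher average `amount` than the table as a whole, the kind of pattern that comparing subset statistics against the full dataset is meant to surface.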
| aarondia wrote:
| This sort of bouncing back and forth between the aggregate and
| the raw data is something that Mito is really great at. To
| view aggregate info, users tend to either look at graphs or
| pivot tables of their data in Mito. They use that aggregate
| view to identify subsets that need some further
| investigation/cleaning/transforming. And then they filter
| down to that subset, make the correction, and use the
| aggregate view again to see the results.
|
| Practically, this just looks like moving between two tabs
| in the spreadsheet!
|
| Something that we don't support right now, but would love
| to support in the future is cross-filtering. It would be a
| powerful/easy way of supporting that back and forth
| workflow.
| aarondia wrote:
| This is a great point and something that we're actively working
| on improving in Mito. If you have millions of rows of data, it's
| not enough to just scroll through your data, you need tools to
| build your understanding.
|
| Some of the tools that you mentioned exist in Mito today. For
| example, Mito generates summary information about each column
| (all of the .describe() info along with a histogram of the
| data). And we're creating features for gaining a global
| understanding of the data too.
|
| In practice, one of the main ways that we see people use Mito
| is for that initial exploration of the data. Often the first
| thing that users do when they import data into Mito is to
| correct the column dtype, delete columns that are irrelevant to
| their analysis, and filter out/replace missing values.
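For readers who want to see what those first cleanup steps look like in code, here is a rough pandas equivalent. It is a sketch under assumptions: it does not use Mito's API, and the data and column names (`year`, `sales`, `internal_id`) are hypothetical.

```python
import pandas as pd

# Hypothetical raw import; column names are invented for illustration.
raw = pd.DataFrame({
    "year": ["2019", "2020", "2021", None],
    "sales": [100.0, None, 250.0, 300.0],
    "internal_id": ["a1", "b2", "c3", "d4"],
})

# Filter out rows with a missing year first, so the cast below cannot fail.
cleaned = raw.dropna(subset=["year"]).copy()

# Correct the column dtype: "year" arrived as strings.
cleaned["year"] = cleaned["year"].astype(int)

# Delete a column that is irrelevant to the analysis.
cleaned = cleaned.drop(columns=["internal_id"])

# Replace the remaining missing values instead of dropping the rows.
cleaned["sales"] = cleaned["sales"].fillna(0.0)

# Per-column summary, comparable to the .describe() info mentioned above.
summary = cleaned.describe()
print(summary)
```

Whether a missing value should be filtered out or replaced depends on the analysis; the example does one of each only to show both calls.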
| pbronez wrote:
| It would be super fun to implement an intelligent head()
| function that shows a representative sample rather than the
| first X rows. Do the profiling & identify a collection of
| rows that represent the overall distribution.
|
| You could develop some IP around efficient and effective ways
| to do this. Probably would require an ensemble of
| unsupervised methods.
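A much simpler stand-in for this idea is quantile-stratified sampling: bin one numeric column into quantiles and show a few rows from each bin. The sketch below assumes that approach; `representative_head` and its parameters are hypothetical names, and it is not the ensemble of unsupervised methods suggested above.

```python
import numpy as np
import pandas as pd

def representative_head(df, column, n_bins=4, per_bin=2, seed=0):
    """Return a few rows from each quantile bin of `column`.

    Hypothetical helper: stratifies on a single numeric column only.
    """
    # Assign each row to a quantile bin (0 = lowest values).
    bins = pd.qcut(df[column], q=n_bins, labels=False, duplicates="drop")
    tagged = df.assign(_bin=bins)
    # Shuffle so the rows picked within each bin are random, then take
    # the first `per_bin` rows of every bin.
    shuffled = tagged.sample(frac=1.0, random_state=seed)
    picked = shuffled.groupby("_bin").head(per_bin)
    return picked.drop(columns="_bin").sort_values(column)

df = pd.DataFrame({"value": np.arange(100), "label": ["row"] * 100})
sample = representative_head(df, "value")
print(sample)
```

Unlike `df.head()`, the result covers the low, middle, and high ranges of the chosen column, at the cost of having to choose that column up front.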
| aarondia wrote:
| That's a cool idea! One helpful .head() function could
| surface the rows with the rarest data types. It could help you
| identify which columns have mixed dtypes: mostly numbers,
| and some cells that are supposed to be numbers but are
| actually strings because of additional decimals.
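One way to find such cells in plain pandas is to try parsing the column as numbers and keep whatever fails. This is a minimal sketch under assumptions: the helper names and the sample data are hypothetical, and it is not a Mito feature.

```python
import pandas as pd

# Hypothetical column: mostly numbers, plus strings with extra decimal points.
df = pd.DataFrame({
    "amount": [1.5, 2, "3.0.1", 4, "7..2"],
    "name": ["a", "b", "c", "d", "e"],
})

def non_numeric_cells(series):
    """Return the non-null cells of `series` that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    return series[parsed.isna() & series.notna()]

def python_types(series):
    """Count the Python type of each cell, to reveal mixed-dtype columns."""
    return series.map(lambda v: type(v).__name__).value_counts()

bad = non_numeric_cells(df["amount"])
type_counts = python_types(df["amount"])
print(bad)
print(type_counts)
```

The type counts show at a glance that the column is mixed, and `bad` points to the exact rows to fix.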
| narush wrote:
| Good points! I also think that this is an area that Mito could
| do better in. While we do provide pretty cool summary stats [1]
| and graphing capabilities [2], there isn't a great view for the
| summary stats of the entire dataframe. It's def on the roadmap
| -- but this comment makes me think we should move on it quick.
|
| Thanks for the feedback!
|
| [1] https://docs.trymito.io/how-to/summary-statistics
|
| [2] https://docs.trymito.io/how-to/graphing
| awild wrote:
| > try loading a 15GB CSV in Excel.
|
| Or visualising it in r or pandas without meaningful
| subsampling.
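As an illustration of what subsampling can look like in pandas, a large CSV can be read in chunks and a random fraction kept from each chunk. This is a sketch only: the in-memory CSV stands in for a large file, and the chunk size and sampling fraction are arbitrary assumptions.

```python
import io

import pandas as pd

# Small in-memory stand-in for a large CSV file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

pieces = []
# Read 100 rows at a time so the full file never sits in memory at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    # Keep a random 10% of each chunk.
    pieces.append(chunk.sample(frac=0.1, random_state=0))

subsample = pd.concat(pieces, ignore_index=True)
print(len(subsample))
```

Uniform sampling like this preserves the overall distribution but can miss rare rows, so it is a starting point, not a substitute for deliberate subsampling.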
| pea wrote:
| One cool library I saw recently for helping on the
| visualisation side is
| https://github.com/vegafusion/vegafusion
|
| It allows you to use Altair in Python for visualising data,
| but does the computation in the backend using Arrow
| DataFusion. Not for 15GB perhaps, but cool nonetheless.
| CJefferson wrote:
| I find the world is full of datasets with < 200 datapoints, and
| that is where excel (in my experience) is great. With such
| datasets it often makes sense to look through the data at
| particular outliers.
|
| Also, even with huge datasets I tend to always look at a random
| sample, and the "most extreme" datapoints -- mainly because in
| my experience there is a good chance some parts of the data are
| malformed, and need to be recollected/fixed. Of course, if you
| trust your data collection you don't need this!
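Both checks mentioned above are one-liners in pandas. The sketch below uses made-up sensor-style readings; the column name `reading` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical readings with two suspicious values (250.0 and -40.0).
df = pd.DataFrame(
    {"reading": [5.1, 4.9, 5.0, 250.0, 5.2, -40.0, 5.05, 4.95]}
)

# A random sample of rows, seeded so the check is repeatable.
random_rows = df.sample(n=3, random_state=42)

# The "most extreme" datapoints at both ends of the distribution.
largest = df.nlargest(2, "reading")
smallest = df.nsmallest(2, "reading")

print(random_rows)
print(largest)
print(smallest)
```

Malformed values tend to appear at the extremes, which is why looking at the largest and smallest rows often finds them faster than scrolling.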
| kite_and_code wrote:
| +1 - this is also how I operated as a Data Scientist myself
| kite_and_code wrote:
| To the founders of mito, regarding the mito GPL license:
|
| What is your take on that regarding usage inside cloud providers'
| notebooks like AWS, GCP, Azure, Databricks?
|
| Is it allowed or not allowed by the license? And who should/can
| control the usage since users can install any kind of Python
| library in those environments.
|
| And, separately from the maybe ambiguous legal answer: What is
| your personal intention with the license?
|
| Disclosure: I am employed by Databricks.
| narush wrote:
| Hiya kite_and_code - thanks for the question + good to see you
| here :)
|
| Our understanding of our license is evolving - we're first time
| open source devs, and as I'm sure you know it can be a tricky
| process. That being said: we totally support Mito users using
| Mito from notebooks hosted in the cloud!
|
| Currently, we have quite a few users using Mito in notebooks
| hosted through AWS, GCP, etc. We're aiming to be good stewards
| of open source software, and want to see Mito exist wherever
| it is solving users' problems!
|
| We've had lots of folks in lots of environments request Mito,
| and are actively working on prioritizing supporting those other
| environments. We added classic Notebook support last month
| (funnily, I thought it'd take weeks to support, and it took 2
| days lol) - and are looking into VS Code, Streamlit, Dash, and
| more!
|
| EDIT: due to comment below, I edited this comment for clarity
| that we 100% support users using Mito from notebooks in the
| cloud!
| kite_and_code wrote:
| I can totally relate that finding a suitable open-source
| business model is a fuzzy journey.
|
| Nevertheless, from the user perspective I would love to hear
| a more clear answer - at least for e.g. the next 6-12 months.
|
| Currently, it seems like you are tolerating usage inside the
| cloud providers without taking a clear stance. I think this
| creates fear, uncertainty, doubt and slows down mito adoption
| within the cloud.
|
| I would appreciate a clear statement in the near future
| around your thinking on how mito should be made available in
| those environments. After all, the cloud is an environment
| to which more and more users are migrating, or which they at
| least use in parallel to local setups.
|
| I can understand if you don't want to answer on the spot in
| case you don't have a clear stance yet. In this case, please
| take your time and let us know when you made your decision.
|
| Really love what you're doing and the innovation that you are
| pushing for! <3
| narush wrote:
| Oh, sorry I wasn't clear! We totally expect that users will
| use Mito in notebooks on the cloud, and we are in
| support of this usage!
|
| Ideally, we will continue to extend our support to these
| environments over time, as currently there are lots of
| environments where users want Mito but we don't support it
| yet (notebooks api differences, etc) - a good example being
| AWS Sagemaker.
|
| I'll edit my answer above to be more clear about this as
| well. Thanks for the ask for clarification!
| ryzvonusef wrote:
| https://www.youtube.com/watch?v=T7YkWuTIlTw
|
| video of it in use.
| aarondia wrote:
| Thanks for sharing. There's a few other YouTubers who have made
| some cool videos about Mito -- The Data Professor [1] and Talk
| Python to Me [2]
|
| And some cool Medium posts too! Mitosheet: enabling
| collaboration [3], Mito: One of the Coolest Python Libraries
| You Have Ever Seen [4], and Preparing a dataset for analysis [5]
|
| [1] https://www.youtube.com/watch?v=l2nBO_LkkcQ
|
| [2] https://www.youtube.com/watch?v=XAGmSPZsYLU
|
| [3] https://medium.com/trymito/mitosheet-empowering-collaboratio...
|
| [4] https://towardsdatascience.com/mito-one-of-the-coolest-pytho...
|
| [5] https://medium.com/@twelsh37/preparing-a-dataset-for-analysi...
| alefnula wrote:
| I have no affiliation with the project. Just found it, tried it
| out, and it looks very promising...
| aarondia wrote:
| Thanks for posting!
___________________________________________________________________
(page generated 2022-05-20 23:01 UTC)