[HN Gopher] Mito - Excel-like interface for Pandas dataframes in...
       ___________________________________________________________________
        
       Mito - Excel-like interface for Pandas dataframes in Jupyter
       notebook
        
       Author : alefnula
       Score  : 207 points
       Date   : 2022-05-20 12:11 UTC (10 hours ago)
        
 (HTM) web link (www.trymito.io)
 (TXT) w3m dump (www.trymito.io)
        
       | jpn wrote:
       | I played around with many of these before:
       | 
       | - https://github.com/quantopian/qgrid
       | 
       | - https://github.com/man-group/dtale
       | 
       | I find that I'm actually a lot faster using basic Pandas methods
       | to get the data I want in exactly the form I want it.
       | 
       | If I really want to show everything, I just use:
       | 
       | ```
       | 
       | with pd.option_context('display.max_rows', None):
       |     print(df)
       | 
       | ```
        
         | dekhn wrote:
         | what irks me about dtale is if you scroll with the vertical
         | slider, it can't update the view fast enough until you stop
         | scrolling.
        
         | Foivos wrote:
         | I use a similar function when I want to see everything:
         | 
         | ```
         | 
         | def showAllRows(dataframeToShow):
         |     with pd.option_context('display.max_rows', None,
         |                            'display.max_columns', None):
         |         display(dataframeToShow)
         | 
         | # calling it while limiting the number of returned rows.
         | 
         | showAllRows(df.head(1000))
         | 
         | ```
         | 
         | Be warned though! If you call this function without limiting
         | the number of rows to be fetched, it is guaranteed you will
         | crash your machine. Always use head, sample or slices.
         | 
         | If you do get a crash, then your only option is to open the
         | ipynb file with vi and manually delete the millions of lines
         | this function created.
         | 
         | Another function that I like is:
         | 
         | ```
         | 
         | def showColumns(df, substring):
         |     print([x for x in df.columns if substring in x])
         |     return
         | 
         | # calling it
         | 
         | showColumns(df, "year")
         | 
         | ```
         | 
         | This is useful in data frames with many columns, when you want
         | to find all the columns that have a specific string in their
         | name. It prints the list of matching names, which you can then
         | pass to the dataframe to show only those columns.
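
A minimal sketch of the two helpers described above, not the commenter's exact code. It assumes a pandas DataFrame, caps the number of rows before rendering (the `limit` parameter is an illustrative addition), and has the column search return a list instead of printing it, so the result can be passed straight back to the dataframe. The names `render_rows` and `matching_columns` are hypothetical.

```python
import pandas as pd


def render_rows(df, limit=1000):
    """Render at most `limit` rows of df as text, without truncation."""
    # Lift the display limits only inside this block, and only for a capped slice.
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        return repr(df.head(limit))


def matching_columns(df, substring):
    """Return the names of the columns whose name contains `substring`."""
    return [name for name in df.columns if substring in str(name)]


# Illustrative data only.
df = pd.DataFrame(
    {"year_built": [1990, 2005], "year_sold": [2010, 2020], "price": [100, 250]}
)
year_columns = matching_columns(df, "year")
print(render_rows(df[year_columns]))
```

Returning the list rather than printing it is the design choice that makes `df[matching_columns(df, "year")]` work; in a notebook, `display(...)` could be used in place of `print`.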
        
       | sodimel wrote:
       | Looks like a Datasette[0] clone which runs on top of something
       | (jupyter) which runs on top of Python (ipython). I think I would
       | like to see how much time it takes to open a massive dataset in
       | Mito & in Datasette :P
       | 
       | [0]: https://datasette.io/
        
         | aarondia wrote:
         | Heyo, one of the Mito creators here. Thanks for sharing
         | Datasette. I haven't seen that one before. It looks neat!
         | 
         | You're right though, there are several tools that fit the
         | general shape of: GUI on top of Jupyter on top of Python.
         | There's a few general vectors to understand these tools by:
         | 
         | 1. Excel-ness: Although most (if not all) of these tools
         | incorporate some type of spreadsheet, the interface for
         | interacting with the data in that spreadsheet differs greatly.
         | Some tools, like Bamboolib [1] and Datasette [2] resemble Excel
         | only in the spreadsheet. Other tools, like Mito [3], stick to a
         | lot of the other Excel design decisions -- things like having a
         | toolbar with buttons and menu items to access functionality,
         | the ability to write spreadsheet formulas inside of the cell &
         | formula bar, etc. In many ways, this Excel-ness design vector
         | is a proxy for how easy it is to get started with the tool.
         | What we see is that users are able to download Mito and get
         | something useful out of their first analysis because the
         | interface is one that they are used to!
         | 
         | 2. Ownership of your analysis / lack of lockin: We believe that
         | the most powerful low-code spreadsheet tools allow spreadsheet
         | users to easily transition to full programming languages, if
         | they want to. Instead of locking users into a limited and
         | proprietary product, it's better if users can transition to a
         | full programming language (like Python) very naturally. This
         | transition is super natural in Mito because we generate Python
         | code for every edit that a user makes. So if Mito doesn't
         | support the exact transformation that you want, you can use
         | Mito as a starting point for your analysis and customize the
         | script that Mito generates.
         | 
         | [1] https://bamboolib.8080labs.com/ [2] https://datasette.io/
         | [3] https://www.trymito.io/
        
           | kite_and_code wrote:
           | bamboolib co-founder here. We are also thinking about adding
           | Excel-type formulas to the UI and already have internal
           | prototypes.
           | 
           | However, please be aware that bamboolib might soon only be
           | available within Databricks notebooks instead of local
           | Jupyter notebooks like mito.
        
       | narush wrote:
       | Hey everyone. Mito cofounder here. Thanks to whoever posted this
       | - was a real surprise to find it here :-)
       | 
       | Mito (pronounced my-toe) was born out of our personal experience
       | with spreadsheets, and a previous (failed) spreadsheet version
       | control product.
       | 
       | Spreadsheets were the original killer app for computers, and are
       | the most popular programming language used worldwide today. That
       | being said, spreadsheets have some growing to do! They don't
       | handle large datasets well, they don't lead to repeatable or
       | auditable processes, and generally they disrespect many of the
       | hard-won software engineering principles that we engineers fight
       | for.
       | 
       | More than that, as spreadsheet users run into these problems and
       | turn to Python to solve them, they struggle to use pandas to
       | accomplish what would have been two clicks in a spreadsheet.
       | Pandas is great, but the syntax is not always so obvious (nor is
       | learning to program in the first place!)
       | 
       | Mito is our first step in addressing these problems. Take any
       | dataframe, edit it like a spreadsheet, and generate code that
       | corresponds to those edits. You can then take this Python code
       | and use it in other scripts, send it to your colleagues, or just
       | rerun it.
       | 
       | We've been working on Mito for over a year now. Growth has really
       | picked up in the past few months - and we've begun working with
       | larger companies to help accelerate their transition to Python.
       | 
       | To any companies who are somewhere in that Python transition
       | process - please do reach out - we would love to see if we can be
       | helpful for all your spreadsheet users!
       | 
       | Feel free to browse my profile for other spreadsheet related
       | thoughts, I'm a bit of a HN junkie. Of course, any and all
       | feedback (positive or negative) is appreciated.
       | 
       | My cofounders and I will be trolling about in the comments. Say
       | hey! :-)
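
To make the "two clicks in a spreadsheet" point above concrete, here is a hedged sketch of a pivot table written by hand in pandas. The data and column names are made up for illustration, and this is hand-written code, not output generated by Mito.

```python
import pandas as pd

# Illustrative data only.
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "product": ["A", "B", "A", "B"],
        "revenue": [100, 150, 200, 50],
    }
)

# Spreadsheet equivalent: insert a pivot table with region as rows,
# product as columns, and the sum of revenue as values.
pivot = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)
print(pivot)
```

The spreadsheet user picks three fields from a dialog; the pandas user has to know that `pivot_table` exists and which keyword maps to which field.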
        
         | kite_and_code wrote:
         | If you are a large company trying to migrate to Python, you
         | might also want to have a look at bamboolib.com which was
         | acquired by Databricks.
         | 
         | bamboolib is very similar to mito (hard to tell who was first).
         | 
         | The advantage is that it runs within Databricks which gives you
         | the ability to scale to any amount of data easily and
         | Databricks has many (and growing) security certifications e.g.
         | HIPAA compliance.
         | 
         | bamboolib can be used in plain Jupyter. Also, the bamboolib
         | private preview within Databricks is about to start within the
         | next few days.
         | 
         | Full disclosure: I am a co-founder of bamboolib and employed by
         | Databricks
        
           | NoImmatureAdHom wrote:
           | bamboolib appears to be closed-source. You're at their mercy.
        
             | kite_and_code wrote:
             | bamboolib co-founder here:
             | 
             | It's correct that bamboolib is (still) closed-source (which
             | might be subject to change but I don't make promises).
             | 
             | It's also correct that customers can extend the bamboolib
             | UI in various ways via plugins that they can author
             | themselves. That empowers them to build bamboolib into the
             | kind of tool that they want.
             | 
             | Also, all the code is always exported and thus, there is at
             | least no "code lockin".
             | 
             | Regarding being "at their mercy", I can say that there are
             | many customers who are happy with the service that we
             | provide.
        
               | NoImmatureAdHom wrote:
               | I'm sure you have good intentions, but the fact of the
               | matter is the company may be acquired or the people
               | replaced, and those intentions might change.
               | 
               | IMHO investing in a closed-source product like bamboolib
               | as a tool for an important business function is very
               | risky. Imagine you're a small company, and you start
               | using bamboolib for some part of your data analysis
               | pipeline. Bamboolib gets acquired (you have exited
               | kite_and_code, congratulations), and the now very large
               | company that controls it decides to stop supporting some
               | feature critical to what you're doing, make an addition
               | that messes everything up, go full-on SaaS somehow, or
               | just shut the product down. What now? You've been
               | growing, so you've got a small team of junior non-experts
               | who were getting the hang of it...switching will be
               | painful (or you could lock yourself in that walled garden
               | and pay the SaaS price...).
        
               | nojito wrote:
               | Excel is closed source and it powers the world.
        
               | wanderingmind wrote:
               | If MS shuts down, there are better FOSS tools that can
               | process Excel files (LibreOffice Calc), or in general the
               | entire office ecosystem. Can't say the same for small
               | startups.
        
         | aarondia wrote:
         | Heyo! Another co-founder here. Excited to see Mito on HN :)
         | Thanks @alefnula for posting!
         | 
         | +1 to everything @narush said.
         | 
         | It's important to us that the software we build is empowering
         | to users and not restrictive. This plays out in two primary
         | ways: 1) Since Mito is open source and generates Python code
         | for every edit, Mito doesn't lock users into a 'Mito
         | ecosystem', instead it helps users interact with the powerful &
         | robust Python ecosystem. 2) Because Mito is an extension to
         | Jupyter Notebooks + JupyterLab, Mito improves your existing
         | workflows instead of completely altering your data analytics
         | stack.
         | 
         | Excited to interact with you all in the comments :)
        
           | kite_and_code wrote:
           | Can you please clarify what you mean by "mito is
           | open-source"?
           | 
           | Last time I checked the code was under a proprietary license.
           | 
           | Edit: I found in another comment below that mito is now
           | available under GPL license here:
           | https://github.com/mito-ds/monorepo/blob/dev/LICENSE
           | 
           | Edit2: Just saw your answer now - thanks for the
           | clarification and links!
        
             | aarondia wrote:
             | Mito is licensed [1] under the AGPL license. The TLDR of
             | the license is that you can use, distribute, and modify
             | Mito for free, but any modifications that you make need to
             | be shared back with the Mito community.
             | 
             | There is an additional version of Mito, Mito Pro, that is
             | licensed under a different license that provides access to
             | advanced functionality only if you are paying for a Mito
             | Pro / Enterprise subscription.
             | 
             | [1] https://github.com/mito-ds/monorepo/blob/dev/LICENSE
             | [2] https://github.com/mito-ds/monorepo/blob/dev/mitosheet/src/p...
        
               | teruakohatu wrote:
               | Does AGPL mean it can only be used in a notebook for
               | which the notebook itself is open source?
               | 
               | Or does it mean it can only be used with notebook
               | software (eg. Jupyter) that is open source but in a
               | closed source notebook?
        
       | whoevercares wrote:
       | Tricky question - what do you think about Databricks, who
       | acquired Bamboolib and say they will integrate a pandas GUI
       | into their workspace?
        
       | rcarmo wrote:
       | The telemetry thing is... weird. So we can use it for free but
       | have no way to turn it off but upgrade to paid?
        
         | MadameBanaan wrote:
         | Yeah, I have a hard pass on anything that offers an "Open
         | Source" version but is actually meant to be a "Try it and be
         | my Guinea Pig".
        
           | [deleted]
        
         | aarondia wrote:
         | Thanks for that feedback. Mito's approach to telemetry is that
         | we never log any of your data or metadata about your data. We
         | don't track things like the size, shape, or content of your
         | data.
         | 
         | We do collect info about app usage, things like which buttons
         | users click. This allows us to focus development time on
         | improving the features that are used most often.
         | 
         | That being said, it's important to us that there is a way to be
         | totally telemetry-less if users don't want any information to
         | leave their computer. Compared to most other cloud-based SaaS
         | data science tools where you pretty much have no hope of total
         | privacy, we're proud of the flexibility that we offer.
         | 
         | But of course, we're always open to feedback about how we can
         | continue to improve our practices!
        
           | learndeeply wrote:
           | I don't get it. What in the license prevents users from
           | removing the telemetry? AGPL just means the user needs to
           | open source that change, right?
           | 
           | Edit: To remove telemetry, just call:
           | 
           |     from mitoinstaller.user_install import go_pro; go_pro();
           | 
           | No licensing or payment required, and doesn't violate the
           | license.
        
             | narush wrote:
             | Mito is open source, but using Pro features does actually
             | require a Pro or enterprise license. You can check out this
             | callout in the license [1], as well as the restrictions on
             | Mito Pro features here [2]. We're in the process of fixing
             | up the upgrade to Pro process a bit... as you can tell...
             | :)
             | 
             | You can of course fork Mito and turn off telemetry as long
             | as you open source your changes! Go for it - happy to hop
             | on a call and help you get set up with the codebase, if you
             | want. Yay open source!
             | 
             | [1] https://github.com/mito-ds/monorepo/blob/974091b455950c6c50e...
             | [2] https://github.com/mito-ds/monorepo/blob/dev/mitosheet/mitos...
        
               | teruakohatu wrote:
               | You should consider using something other than pip to
               | distribute an installer for the Pro version.
        
           | NoImmatureAdHom wrote:
           | Just to be very clear, the way to be "totally telemetry-less"
           | is to pay you?
        
             | aarondia wrote:
             | Yes
        
       | boringg wrote:
       | Looks neat - pandas is very powerful and it makes it more
       | approachable for non-programmers. However, with a paid product
       | like this, I probably wouldn't make the switch and risk the
       | company going belly up, leaving users stranded. Too much risk.
       | 
       | Hope for the best though - pandas is pretty fantastic.
        
         | okennedy wrote:
         | You might want to check out a tool called Vizier:
         | https://vizierdb.info (I'm one of the devs). Direct interaction
         | with notebook state (e.g., dataframes as spreadsheets) is one
         | of the central ideas, and it's fully open source.
        
           | aarondia wrote:
           | This looks cool :)
        
         | aarondia wrote:
         | One of the creators of Mito, here. Thanks for your feedback. I
         | wanted to share a couple of nuggets about Mito that have been
         | helpful in talking about this with other users.
         | 
         | 1. The core Mito product is open source. You can see our GitHub
         | here [1]. We also have a pro version that has some additional,
         | code visible, but non-open source features. The way that we
         | think about which features belong in which version of the
         | product is as follows: Features that are needed to just get
         | any average analysis done are open source features. On the
         | other hand, features that are specifically useful in an
         | organization -- connecting to company databases, formatting /
         | styling data and graphs for a presentation, etc. -- are pro
         | features. So if you are a team that is relying on our pro
         | features, you're helping support the longevity & progress of
         | Mito. If you are not one of those users and using the open
         | source version, then you will always have access to Mito (and
         | can even help improve it!). Of course the line between what
         | features are specifically helpful in an organization and what
         | features are needed for an average analysis is a bit blurry, and
         | is a moving target as we continue to expand Mito's offering.
         | 
         | 2. Mito is designed specifically to not force users to make a
         | big 'switch'. I've commented this elsewhere in this thread, but
         | just to recap: Because Mito is an extension to Jupyter and
         | because we generate Python code for every edit you make, Mito
         | is designed to improve your existing workflow instead of locking
         | you into a new system. Many Mito users use Mito as a starting
         | point! They do as much of their analysis as they can in the
         | Mito spreadsheet and then continue writing more customized
         | Python code to finish up their work.
         | 
         | Not requiring a big switch is nice for the user and it's nice
         | for Mito too! Lots of large companies have been able to get up
         | and running with Mito in 30 minutes because it fits into their
         | data stack.
         | 
         | Anyways, not that these are the only two reasons you might feel
         | uneasy about adopting Mito, but at least wanted to share why
         | the switch to Mito might be less scary than switching to other
         | tools.
         | 
         | [1] https://github.com/mito-ds/monorepo
        
           | kite_and_code wrote:
           | I love how mito enables companies to use the power of
           | open-source!
           | 
           | You might want to think about enabling companies to create
           | the company specific extensions themselves e.g. via a plugin
           | API. You might still request them to pay for this version of
           | Mito but they are enabled to extend it with their engineering
           | power instead of relying on you.
           | 
           | We had good experiences with this at bamboolib (I am one of
           | the co-founders) and in addition to recurring license revenue
           | it also increased demand for consulting from our end because
           | the internal company devs started working on plugins and then
           | wanted our direct guidance on how to get the more tricky
           | things to work.
        
             | narush wrote:
             | Yeah, we've thought a bit about a plugin API - for the
             | reasons you say, I think it would be an awesome feature to
             | open up to teams!
             | 
             | Any tips on going about it? No need to share the secret
             | sauce, unless you want :P
             | 
             | To be totally honest, we're not architected super well to
             | support plugins currently. The big challenge would be
             | allowing users to specify this plugin in pure Python (seems
             | like we want this) - but we think that hand-coded UIs
             | outperform autogenerated ones for now. We've been thinking
             | about how to do better though... maybe soon.
             | 
             | Of course, if Mito is missing features, we're open source
             | [1] -- all contributions are welcome! Also feel free to
             | open an issue and we can discuss :)
             | 
             | [1] https://github.com/mito-ds/monorepo
        
       | flakiness wrote:
       | Nice! The page looks more like a SaaS offering or something,
       | which initially scared me away a bit. I hope the emphasis is more
       | on the opensource library and showing paying options as some
       | premium thing.
       | 
       | I didn't realize that the "too nice" landing page makes me
       | anxious for open source software :-/
        
         | aarondia wrote:
         | Well, first off, thank you, I put a lot of effort into
         | implementing that landing page :)
         | 
         | We're super focused on the open source offering. The vast
         | majority of our users are on the open source version and the
         | vast majority of the features we release are open source! (You
         | can check out our PR's if you're interested in verifying)
         | 
         | The Mito Pro and Enterprise plans are designed for advanced
         | users and teams. In those versions we provide features that
         | make it easier to collaborate, create presentation-ready
         | materials, and hook up to other company resources.
         | 
         | But we're an open source tool through and through!
        
           | narush wrote:
           | Fancy seeing you here, writing the same comment as me... :}
        
         | narush wrote:
         | To pull the curtains back a bit: we probably spend about 85% of
         | our product and development time on open source code. Just this
         | week, we developed copy and paste, NaN value filling, and
         | splitting a text column on a delimiter - all of these are open
         | source features.
         | 
         | As we've begun to engage with larger teams, we often take
         | features that we build out for their workflow and open source
         | them as well - a few of the teams have been explicit proponents
         | for the open source tool, which is awesome to see.
         | 
         | I'm sure our thinking on this will evolve over time, but we are
         | highly focused on developing just a _great_ piece of open
         | source software. And for folks that need more power, we want to
         | give them the chance to get it - while also supporting Mito's
         | development :)
         | 
         | P.S. Check out our Mito Pro roadmap here:
         | https://www.trymito.io/plans#mito_pro_roadmap. Feedback
         | appreciated!
        
         | kite_and_code wrote:
         | I am not so sure about the open-source fact. Please see
         | comments and thread below.
         | 
         | Edit: It is GPL by now as seen here
         | https://github.com/mito-ds/monorepo/blob/dev/LICENSE
        
       | noobker wrote:
       | Mito looks cool. I'm hopeful a tool like it can create a bridge
       | between Excel-based analysts/researchers and more mature
       | application flows.
       | 
       | Another tool like Mito is Bamboo: https://bamboolib.8080labs.com/
        
         | aarondia wrote:
         | Heyo, Mito cofounder here, bridging that gap is one of the main
         | ways that enterprises are using Mito today! Helping business
         | users become data self-sufficient in a world where Excel's data
         | size limitations make it a non-option is where Mito shines :)
        
       | pipeline_peak wrote:
       | The web page needs to be heavier
        
         | narush wrote:
         | Super fair, lol. We'll work on optimizing it - just a tiny team
         | and lots of things on our plate rn. The main issue is our
         | images / video, which I have tried compressing but can't do so
         | while maintaining the quality. Any tips are greatly
         | appreciated!
         | 
         | Believe it or not, the last version of this website was even
         | heavier...
        
       | harabat wrote:
       | For those who are going through the thread finding new tools:
       | pandas-profiling[0] is a library for automatic EDA (which
       | bamboolib[1], mentioned elsewhere, also does).
       | 
       | [0]: https://github.com/pandas-profiling/pandas-profiling [1]:
       | https://bamboolib.com/
        
         | kite_and_code wrote:
         | Lux might also be interesting: https://github.com/lux-org/lux
        
           | narush wrote:
           | Def check these all out! Lots of cool tools out there. For
           | anyone who's tried a bunch of these... that's a great topic
           | for a Medium post :)
        
       | jpalomaki wrote:
       | If others are interested, Mito does not work in VS Code or Google
       | Colab. Only classic Jupyter Notebooks and JupyterLab are
       | supported currently [1].
       | 
       | [1] https://docs.trymito.io/misc/faq
        
         | malshe wrote:
         | Thanks for checking that! I use vscode so this is a no go...
        
           | aarondia wrote:
           | Yeah, Mito is limited to the Jupyter ecosystem (for now). We
           | want to expand to VSCode, Google Colab, and Streamlit!
           | 
           | For the time being, because Mito generates pandas code for
           | every edit you make, you can always use Mito in Jupyter to
           | generate code, and then copy it over to VSCode. Admittedly,
           | it's not as nice of a workflow, but it does work!
        
             | pen2l wrote:
             | I see you guys provide convenient installers that can be
             | obtained with pip. Me, I run my JupyterLab that I got with
             | Conda on a Windows setup. Can you comment on whether it's a
             | thorn-free path to get Mito in such a setup? Or should I
             | use this as another sign to completely migrate to a nix
             | system for all my dev needs... :)
        
               | aarondia wrote:
               | You should be good to go with your conda setup on
               | Windows! I run Mito on a Windows machine through a conda
               | virtual environment often! We have some instructions for
               | how to do that here [1]
               | 
               | [1] https://docs.trymito.io/getting-started/installing-mito/inst...
        
       | punk_ihaq wrote:
       | Wow love it! It would be cool to see a bidirectional Streamlit
       | custom component for Mito!
        
         | narush wrote:
         | This is on the roadmap! Would love to hear a bit more about how
         | you would use this component...
         | 
         | 1. Would you want the component to generate code? Or would it
         | just be the editing of a dataframe that is useful to you?
         | 
         | 2. What other components would be used in this dashboard? Would
         | love to hear a bit more about the workflow around Mito here.
         | 
         | The more detail you can provide - the more helpful in
         | prioritizing this! I think Mito in streamlit would be ...
         | awesome!
        
       | kite_and_code wrote:
       | Another alternative is bamboolib.com which was acquired by
       | Databricks last September to offer it within Databricks notebooks
        
         | filmor wrote:
         | Are you affiliated? There are three comments in this comment
         | page by you, and they all manage to mention bamboolib...
        
           | kite_and_code wrote:
           | Yes, I am one of the co-founders of bamboolib and employed by
           | Databricks.
           | 
           | I already added my disclosure to the following answer [0] in
           | this thread but I was hesitant to add it to every answer.
           | 
           | Do you prefer if I explicitly add my affiliation in every
           | comment that mentions bamboolib? If so, I will try to edit
           | them (if the HN UI still allows me to - I observed that it
           | stops allowing this after some time)
           | 
           | [0] https://news.ycombinator.com/item?id=31450910
        
             | Closi wrote:
             | > Do you prefer if I explicitly add my affiliation in every
             | comment that mentions bamboolib?
             | 
             | Personally, I thought your original post's tone implied
             | that you weren't affiliated.
             | 
             | You don't have to add a formal 'disclosure', but you could
             | just say "I built x which is..." rather than "Another
             | product is x which is...".
        
       | santiagobasulto wrote:
       | I like this. It's a "friendlier" way to browse data. That said,
       | I have to add:
       | 
       | Exploring large datasets requires a COMPLETELY different mindset.
       | When your data starts growing, it's impossible to keep it all in
       | a visual format (for 2 reasons[0]) and you have to start thinking
       | analytically. You have to start looking at the statistical values
       | of your data to understand what its shape is. That's why the
       | `.describe()` and `.info()` methods in Pandas are so useful.
       | After many years doing this, I can "see" the shape of my data
       | just by looking at the statistical information about it (mean,
       | median, std, min, max, etc).
       | 
       | After some time you don't need to rely on visual tools; you can
       | just run a few methods, look at some numbers, and understand all
       | your data. Kinda feels like the operator of The Matrix that is
       | looking at the green numbers descend and knows what's going on
       | behind the scenes.
       | 
       | [0] Your eyes are really inefficient at capturing information and
       | there's only so much memory available: try loading a 15GB CSV in
       | Excel.
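
A minimal sketch of the `.describe()` / `.info()` workflow mentioned above. The DataFrame and its column names are hypothetical toy data, not taken from the thread.

```python
import pandas as pd

# Hypothetical toy data standing in for a much larger dataset.
df = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 250.0, 9.5],
    "units": [3, 5, 4, 1, 6],
    "region": ["north", "south", "south", "north", "east"],
})

# Column dtypes, non-null counts and memory usage.
df.info()

# count, mean, std, min, quartiles and max for the numeric columns.
summary = df.describe()
print(summary)

# The median is reported as the "50%" row of describe().
median_price = summary.loc["50%", "price"]
max_price = summary.loc["max", "price"]
print(median_price, max_price)
```

Here the gap between the median and the max of `price` hints at an outlier without any row being displayed.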
        
         | wenc wrote:
         | I would caution against this approach in general (unless you're
         | working with unusually uniform data from a deterministic source
         | -- in my world that is rarely the case). Summary statistics are
         | useful but taken in isolation they can mislead. One loses the
         | ability to get a feel for interesting non-aggregated
         | phenomena.
         | 
         | I find it's important to actually "touch" the raw data even if
         | only in a buffered, random sampling sort of way to get a feel
         | for it. Sometimes with big datasets, looking through rows of
         | data feels tedious and meaningless but I've found that I've
         | often picked up on things I wouldn't have without actually
         | looking at the raw data. Raw data is often flawed, but there's
         | often some signal in it that tells a story, hence it's important
         | not to overlook it through a lens of aggregate statistics.
         | 
         | The next step is to visualize the data multidimensionally in
         | something like Tableau. Tableau works on very large datasets
         | (it has an internal columnstore format called Hyper) and can
         | dynamically disaggregate and drill down. Insights are usually
         | obtained by looking at details, not aggregates.
        
           | kite_and_code wrote:
           | If you want to use open-source Python-based visualizations
           | instead of Tableau, the following tools allow the creation of
           | custom plots - including the ability to export the underlying
           | code.
           | 
           | - bamboolib (proprietary license - acquired by Databricks in
           | order to run within the Databricks notebooks)
           | 
           | - mito (GPL license)
           | 
           | - dtale (MIT license)
        
             | pea wrote:
             | If you can write visualisations in Python itself, I am a
             | big fan of Altair's syntax
             | (https://github.com/altair-viz/altair), which is based on
             | vega-lite. A while back, I wrote a brief guide and comparison
             | of the main plotting libraries:
             | https://datapane.com/reports/87NNEJ7/the-ultimate-guide-to-p...
             | 
             | One benefit of having them in actual code is that you can
             | programmatically automate the creation of things like
             | dashboards and reports. For instance, schedule a script to
             | share an interactive plot every Monday morning, or build a
             | live dashboard that updates every 10m. This opens up a lot
             | of possibilities that would be impossible in a traditional
             | drag-and-drop tool.
        
               | kite_and_code wrote:
               | Thanks for mentioning Altair. I am personally also a big
               | fan.
               | 
               | I am one of the co-founders of bamboolib and we are
               | actively thinking about adding support for altair to the
               | Plot Creator (instead of just relying on Plotly).
               | 
               | Since we are talking about other viz options in Python, there
               | are of course also matplotlib, seaborn, plotly, and more.
        
               | aarondia wrote:
               | > programmatically automate the creation of things like
               | dashboards and reports.
               | 
               | That's an awesome use case for Python, and that sort of
               | script generation is one of the main reasons that we see
               | people adopting Python/Mito. And specifically,
               | graphing[1] is one of the most popular features in Mito.
               | 
               | Mito generates Plotly [2] graphs, and of course generates
               | the Plotly graph code too, so you can customize the
               | graphs to your perfect liking (Plotly has great
               | documentation and a lot of customizations) or schedule
               | the script to run automatically.
               | 
               | [1] https://docs.trymito.io/how-to/graphing
               | [2] https://plotly.com/
        
           | mejutoco wrote:
           | A good example of what you are warning against is Anscombe's
           | quartet
           | 
           | https://en.wikipedia.org/wiki/Anscombe's_quartet
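
A short sketch of the point, using the published Anscombe's quartet values and only the standard library. The variable names are illustrative.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x, mean of y and sample variance of y for each dataset.
stats = {
    name: (mean(x), mean(y), variance(y))
    for name, (x, y) in quartet.items()
}

for name, (mx, my, vy) in stats.items():
    print(f"{name}: mean(x)={mx:.2f} mean(y)={my:.2f} var(y)={vy:.2f}")
```

The four summaries come out nearly identical, yet the scatter plots look nothing alike, which is the case for looking at the data itself.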
        
             | santiagobasulto wrote:
             | Histograms and Boxplots (and IQRs) don't lie tho...
        
               | mrbungie wrote:
               | Boxplots don't lie, but they can mislead as any summary
               | statistic, data viz or model can.
               | https://blog.bioturing.com/wp-content/uploads/2018/11/BoxVio...
               | 
               | Whether a histogram misleads depends entirely on the bin
               | width tho.
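
A small illustration of the bin-width point, using only the standard library. The `histogram` helper and the data are hypothetical, made up for this sketch.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each bin is labelled by its left edge."""
    return Counter(int(v // bin_width) * bin_width for v in values)

# Two clusters, one around 2 and one around 8.
data = [1.5, 2, 2, 2.5, 7.5, 8, 8, 8.5]

wide = histogram(data, 10)  # one bin swallows both clusters
fine = histogram(data, 1)   # narrow bins expose the gap between them

print(sorted(wide.items()))
print(sorted(fine.items()))
```

With a width of 10 everything lands in a single bar; with a width of 1 the empty bins between 3 and 7 become visible.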
        
           | santiagobasulto wrote:
           | Of course, `.head()`, `.tail()`, `iloc` and other mechanisms
           | to visualize subsets of the data are always important. But
           | would you really caution AGAINST this? Like,
           | literally telling someone NOT to use summary statistics to
           | explore a dataset?
        
             | wenc wrote:
             | No, I'm more cautioning against using summary statistics
             | _in isolation_ without looking at the raw data.
             | 
             | I was more responding to the statement that one can "see"
             | the shape of data through them and not need visual
             | tools. The lens of summary statistics is a very narrow one
             | -- it's a necessary but almost always insufficient one.
             | Even .ilocs are insufficient --- it's hard to know what to
             | .iloc for. One really needs to browse the data
             | interactively to get a good sense of it.
        
               | santiagobasulto wrote:
               | Ah, ok. Sorry, I misunderstood. Yes, we're on the same
               | page. As usual, a good balance is necessary.
        
         | mint2 wrote:
         | Do you as a rule look at a sample of the individual raw data,
         | non aggregated?
        
           | santiagobasulto wrote:
           | Usually aggregated... then you can start looking at "subsets".
           | For example, step 1 is to look at the whole dataset. Then you
           | identify that there are a lot of rows with a type of missing
           | value, so you look at the statistical attributes of that
           | subset (all the rows where column X is null).
           | 
           | From time to time you can do a `.head()/.tail()` or an
           | `.iloc[X:Y]` to check some things visually. But just as a
           | "refresher".
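
The workflow described here might look roughly like the following in pandas. The column names and values are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" is missing for some rows.
df = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan, 41, 52],
    "income": [52000, 18000, 61000, 21000, 58000, 75000],
})

# Step 1: aggregate view of the whole dataset.
print(df.describe())

# Step 2: isolate the subset where "age" is null and summarize it.
missing_age = df[df["age"].isna()]
print(missing_age["income"].describe())

# Compare against the rows where "age" is present.
present_age = df[df["age"].notna()]
print(present_age["income"].describe())

# Occasional visual spot checks.
print(df.head(3))
print(df.tail(3))
print(df.iloc[2:4])
```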
        
             | aarondia wrote:
             | This sort of bouncing back and forth between the aggregate
             | and the raw data is something that Mito is really great at. To
             | view aggregate info, users tend to either look at graphs or
             | pivot tables of their data in Mito. They use that aggregate
             | view to identify subsets that need some further
             | investigation/cleaning/transforming. And then they filter
             | down to that subset, make the correction, and use the
             | aggregate view again to see the results.
             | 
             | Practically, this just looks like moving between two tabs
             | in the spreadsheet!
             | 
             | Something that we don't support right now, but would love
             | to support in the future is cross-filtering. It would be a
             | powerful/easy way of supporting that back and forth
             | workflow.
        
         | aarondia wrote:
         | This is a great point and something that we're actively working
         | on improving in Mito. If you have millions of rows of data, it's
         | not enough to just scroll through your data, you need tools to
         | build your understanding.
         | 
         | Some of the tools that you mentioned exist in Mito today. For
         | example, Mito generates summary information about each column
         | (all of the .describe() info along with a histogram of the
         | data). And we're creating features for gaining a global
         | understanding of the data too.
         | 
         | In practice, one of the main ways that we see people use Mito
         | is for that initial exploration of the data. Often the first
         | thing that users do when they import data into Mito is to
         | correct the column dtype, delete columns that are irrelevant to
         | their analysis, and filter out/replace missing values.
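
For reference, those first cleaning steps can be sketched in plain pandas as below. This is not the code Mito generates; the frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw import where everything arrived as strings.
raw = pd.DataFrame({
    "order_id": ["1", "2", "3", "4"],
    "amount": ["10.5", "n/a", "7.25", "3"],
    "internal_note": ["x", "y", "z", "w"],
})

clean = raw.copy()

# Correct the column dtypes; unparseable amounts become NaN.
clean["order_id"] = clean["order_id"].astype(int)
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Delete a column that is irrelevant to the analysis.
clean = clean.drop(columns=["internal_note"])

# Either replace missing values...
filled = clean.fillna({"amount": 0.0})

# ...or filter out the rows that contain them.
dropped = clean.dropna(subset=["amount"])

print(filled)
print(dropped)
```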
        
           | pbronez wrote:
           | It would be super fun to implement an intelligent head()
           | function that shows a representative sample rather than the
           | first X rows. Do the profiling & identify a collection of
           | rows that represent the overall distribution.
           | 
           | You could develop some IP around efficient and effective ways
           | to do this. Probably would require an ensemble of
           | unsupervised methods.
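
One very simple take on this idea, far short of the ensemble approach suggested above: bin a single numeric column into quantiles and draw one row per bin. The function name `representative_head` and its parameters are hypothetical.

```python
import pandas as pd

def representative_head(frame, column, n_bins=5, seed=0):
    """Return one randomly chosen row from each quantile bin of `column`."""
    bins = pd.qcut(frame[column], q=n_bins, duplicates="drop")
    return frame.groupby(bins, observed=True).sample(n=1, random_state=seed)

# Hypothetical data: values 1..100 in order.
df = pd.DataFrame({"value": list(range(1, 101))})

print(df.head(5))  # only shows the smallest values
preview = representative_head(df, "value")
print(preview)     # one row from each fifth of the distribution
```

This only stratifies on one column; covering the joint distribution of many columns would need something more elaborate.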
        
             | aarondia wrote:
             | That's a cool idea! One helpful .head() function could
             | surface the rows with the rarest data types. It could help you
             | identify which columns have mixed dtypes: mostly numbers,
             | and some cells that are supposed to be numbers but are
             | actually strings because of additional decimals.
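
A rough sketch of spotting such columns with plain pandas. The helper name `mixed_type_columns` and the sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: "price" is mostly numeric but has one malformed string.
df = pd.DataFrame({
    "price": [1.5, 2.0, "3.0.1", 4.25],
    "name": ["a", "b", "c", "d"],
})

def mixed_type_columns(frame):
    """List columns whose cells hold more than one Python type."""
    return [col for col in frame.columns if frame[col].map(type).nunique() > 1]

mixed = mixed_type_columns(df)
print(mixed)

# How many cells of each type does the suspicious column contain?
type_counts = df["price"].map(type).value_counts()
print(type_counts)

# Rows whose "price" cannot be parsed as a number.
bad_rows = df[pd.to_numeric(df["price"], errors="coerce").isna()]
print(bad_rows)
```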
        
         | narush wrote:
         | Good points! I also think that this is an area that Mito could
         | do better in. While we do provide pretty cool summary stats [1]
         | and graphing capabilities [2], there isn't a great view for the
         | summary stats of the entire dataframe. It's def on the roadmap
         | -- but this comment makes me think we should move on it quick.
         | 
         | Thanks for the feedback!
         | 
         | [1] https://docs.trymito.io/how-to/summary-statistics
         | 
         | [2] https://docs.trymito.io/how-to/graphing
        
         | awild wrote:
         | > try loading a 15GB CSV in Excel.
         | 
         | Or visualising it in r or pandas without meaningful
         | subsampling.
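
One way to subsample a CSV that does not fit in memory is to read it in chunks and keep a random fraction of each. In this sketch an in-memory string stands in for the large file, and `sample_csv` is a hypothetical helper.

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: 1000 rows generated in memory.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

def sample_csv(source, frac=0.1, chunksize=100, seed=0):
    """Read `source` chunk by chunk, keeping a random fraction of each chunk."""
    parts = [
        chunk.sample(frac=frac, random_state=seed)
        for chunk in pd.read_csv(source, chunksize=chunksize)
    ]
    return pd.concat(parts, ignore_index=True)

subsample = sample_csv(io.StringIO(csv_text))
print(len(subsample))
print(subsample.head())
```

Only one chunk is held in memory at a time, plus the rows that were kept, so the same pattern should scale to files far larger than RAM.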
        
           | pea wrote:
           | One cool library I saw recently for helping on the
           | visualisation side is
           | https://github.com/vegafusion/vegafusion
           | 
           | It allows you to use Altair in Python for visualising data,
           | but does the computation in the backend using Arrow
           | DataFusion. Not for 15GB perhaps, but cool nonetheless.
        
         | CJefferson wrote:
         | I find the world is full of datasets with < 200 datapoints, and
         | that is where excel (in my experience) is great. With such
         | datasets it often makes sense to look through the data at
         | particular outliers.
         | 
         | Also, even with huge datasets I tend to always look at a random
         | sample, and the "most extreme" datapoints -- mainly because in
         | my experience there is a good chance some parts of the data are
         | malformed, and need to be recollected/fixed. Of course, if you
         | trust your data collection you don't need this!
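
In pandas, looking at a random sample plus the "most extreme" datapoints could be sketched like this. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sensor readings with two obviously malformed values.
df = pd.DataFrame({
    "sensor": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "reading": [20.1, 19.8, 21.0, -999.0, 20.5, 20.3, 1500.0, 19.9],
})

# A reproducible random sample of rows.
random_rows = df.sample(n=3, random_state=42)
print(random_rows)

# The most extreme datapoints at either end.
highest = df.nlargest(2, "reading")
lowest = df.nsmallest(2, "reading")
print(highest)
print(lowest)
```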
        
           | kite_and_code wrote:
           | +1 - this is also how I operated as a Data Scientist myself
        
       | kite_and_code wrote:
       | To the founders of mito, regarding the mito GPL license:
       | 
       | What is your take on that regarding usage inside cloud providers'
       | notebooks like AWS, GCP, Azure, Databricks?
       | 
       | Is it allowed or not allowed by the license? And who should/can
       | control the usage since users can install any kind of Python
       | library in those environments.
       | 
       | And, separately from the maybe ambiguous legal answer: What is
       | your personal intention with the license?
       | 
       | Disclosure: I am employed by Databricks.
        
         | narush wrote:
         | Hiya kite_and_code - thanks for the question + good to see you
         | here :)
         | 
         | Our understanding of our license is evolving - we're first time
         | open source devs, and as I'm sure you know it can be a tricky
         | process. That being said: we totally support Mito users using
         | Mito from notebooks hosted in the cloud!
         | 
         | Currently, we have quite a few users using Mito in notebooks
         | hosted through AWS, GCP, etc. We're aiming to be good stewards
         | of open source software, and want to see Mito exist wherever
         | it is solving users' problems!
         | 
         | We've had lots of folks in lots of environments request Mito,
         | and are actively working on prioritizing supporting those other
         | environments. We added classic Notebook support last month
         | (funnily, I thought it'd take weeks to support, and it took 2
         | days lol) - and are looking into VS Code, Streamlit, Dash, and
         | more!
         | 
         | EDIT: due to comment below, I edited this comment for clarity
         | that we 100% support users using Mito from notebooks in the
         | cloud!
        
           | kite_and_code wrote:
           | I can totally relate that finding a suitable open-source
           | business model is a fuzzy journey.
           | 
           | Nevertheless, from the user perspective I would love to hear
           | a more clear answer - at least for e.g. the next 6-12 months.
           | 
           | Currently, it seems like you are tolerating usage inside the
           | cloud providers without taking a clear stance. I think this
           | creates fear, uncertainty, doubt and slows down mito adoption
           | within the cloud.
           | 
           | I would appreciate a clear statement in the near future
           | around your thinking on how mito should be made available in
           | those environments. After all, the cloud is an environment
           | to which more and more users are migrating, or which they at
           | least use in parallel to local setups.
           | 
           | I can understand if you don't want to answer on the spot in
           | case you don't have a clear stance yet. In this case, please
           | take your time and let us know when you have made your decision.
           | 
           | Really love what you're doing and the innovation that you are
           | pushing for! <3
        
             | narush wrote:
             | Oh, sorry I wasn't clear! We totally expect that users will
             | use Mito in notebooks on the cloud, and we are in
             | support of this usage!
             | 
             | Ideally, we will continue to extend our support to these
             | environments over time, as currently there are lots of
             | environments where users want Mito but we don't support it
             | yet (notebooks api differences, etc) - a good example being
             | AWS Sagemaker.
             | 
             | I'll edit my answer above to be more clear about this as
             | well. Thanks for the ask for clarification!
        
       | ryzvonusef wrote:
       | https://www.youtube.com/watch?v=T7YkWuTIlTw
       | 
       | video of it in use.
        
         | aarondia wrote:
         | Thanks for sharing. There's a few other YouTubers who have made
         | some cool videos about Mito -- The Data Professor [1] and Talk
         | Python to Me [2]
         | 
         | And some cool Medium posts too! Mitosheet: enabling
         | collaboration [3], Mito: One of the Coolest Python Libraries
         | You Have Ever Seen [4], and Preparing a dataset for analysis [5]
         | 
         | [1] https://www.youtube.com/watch?v=l2nBO_LkkcQ
         | [2] https://www.youtube.com/watch?v=XAGmSPZsYLU
         | [3] https://medium.com/trymito/mitosheet-empowering-collaboratio...
         | [4] https://towardsdatascience.com/mito-one-of-the-coolest-pytho...
         | [5] https://medium.com/@twelsh37/preparing-a-dataset-for-analysi...
        
       | alefnula wrote:
       | I have no affiliation with the project. Just found it, tried it
       | out, and it looks very promising...
        
         | aarondia wrote:
         | Thanks for posting!
        
       ___________________________________________________________________
       (page generated 2022-05-20 23:01 UTC)