[HN Gopher] Python for Data Analysis, 3rd Edition - The Open Acc...
       ___________________________________________________________________
        
       Python for Data Analysis, 3rd Edition - The Open Access Version
       Online
        
       Author : mariuz
       Score  : 194 points
       Date   : 2022-07-02 10:53 UTC (12 hours ago)
        
 (HTM) web link (wesmckinney.com)
 (TXT) w3m dump (wesmckinney.com)
        
       | argella wrote:
       | anyone have a suggestion for a best online course on Python
       | programming? Wanting to focus on Python the language and getting
       | solid at programming, not necessarily a particular library.
       | 
       | I'm pretty rusty at programming having last studied formally
       | algorithms, data structures, oop in c++ 15 years ago in an
       | undergrad compsci program.
       | 
       | I've done mostly sql coding and database work ever since but need
       | to level up my skill set.
        
       | civilized wrote:
       | Still waiting for Wes to admit that the pandas API is a mess and
       | support the development and adoption of a more dplyr-like Python
       | library.
       | 
        | Pandas was a great step on the path to making Python a decent
        | data analysis language. Wes is smarter than me, and I never
        | could have built it, but it's time to move on.
        
         | wesm wrote:
         | If you read my slide decks over the last 7 years or so (while
         | I've been working actively on Arrow and sibling projects like
         | Ibis) I've been saying exactly this.
         | 
         | See e.g. https://ibis-project.org/
        
         | RobinL wrote:
         | I think that's part of the point of his current project: Arrow.
         | Specifically, one of its goals is to implement the low level
         | computations, whilst enabling more competition in the space of
         | data analysis API design.
        
         | closed wrote:
         | Hey--I maintain a port of dplyr to python, called siuba[1]!
         | 
         | Right now it supports a decent number of verbs + SQL
         | generation. I tried to break down why R users find pandas
         | difficult in an RStudioConf talk last year[2].
         | 
         | Between siuba and tools like polars and duckdb, I'm hopeful
         | that someone hits the data analysis sweet spot for python in
         | the next couple years.
         | 
         | [1]: http://github.com/machow/siuba
         | 
         | [2]: https://youtu.be/w4Mi0u4urbQ
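
For readers who haven't seen the dplyr verb style that siuba ports, a rough approximation of a group_by/summarize pipeline using plain pandas method chaining (toy data and invented column names, for illustration only; assumes pandas is installed):

```python
import pandas as pd

# Invented example data (mtcars-like columns, illustration only).
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "mpg": [30.0, 32.0, 20.0, 22.0],
})

# dplyr: df %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg))
# Closest plain-pandas equivalent, written as a single chain:
result = (
    df.groupby("cyl", as_index=False)
      .agg(avg_mpg=("mpg", "mean"))
)
```

Libraries like siuba aim to make this verb-by-verb style first-class rather than something you emulate with chained methods.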
        
       | TrackerFF wrote:
       | I use Python extensively for my analysis projects - and while
       | Pandas is my go-to library for many things, I feel it's just very
       | slow. I know that just doing stuff in numpy instead speeds things
       | up considerably, and that's way before doing other optimization,
       | but are there any other libraries out there similar to Pandas,
       | but made for higher performance?
        
         | sireat wrote:
          | Curious: where do you experience slowness with Pandas?
          | 
          | Usually Pandas is slow when you start doing non-vectorized
          | things.
          | 
          | Whenever you catch yourself writing a loop over rows in
          | Pandas, you know you've gone the wrong way.
          | 
          | I too use Python extensively (with Pandas, among other
          | things) and usually Pandas is perfectly fine. (I haven't
          | gone over 64GB of memory usage yet.)
          | 
          | I consider Pandas to be Numpy with benefits (methods),
          | since a Pandas dataframe is basically just a collection of
          | Numpy arrays. Those are about as close to C-style arrays as
          | you can get in Python.
          | 
          | The only practical problems have been regressions in
          | supporting libraries, such as pyarrow not playing nicely
          | with numpy when working with parquet files.
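
A minimal sketch of the loop-versus-vectorized point above, assuming only pandas and numpy (toy data; the speed gap only shows up at millions of rows):

```python
import numpy as np
import pandas as pd

# Toy frame; the speed difference appears at scale.
df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5, 10)})

# Slow pattern: a Python-level loop over rows.
slow = [row.a + row.b for row in df.itertuples()]

# Fast pattern: one vectorized expression over whole columns,
# which dispatches to compiled NumPy code.
fast = (df["a"] + df["b"]).tolist()

assert slow == fast  # same answer, very different speed at scale
```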
        
         | HFguy wrote:
          | Polars and Modin support multiple cores and have other
          | optimizations. Modin supports most of the Pandas API. The
          | PySpark pandas API supports multiple cores and clusters.
        
         | kzrdude wrote:
         | polars, but it's still developing.
        
       | johnpublic wrote:
        | You can save so much time and effort by using Google Colab
        | notebooks rather than setting up Python on your own machine
        | (as is recommended in this guide).
        
         | tmaly wrote:
          | Is there anything like this online that could run code
          | written for pygame? I know lots of beginners start with
          | Scratch, but having some kind of gaming in the browser for
          | Python would be nice.
        
         | savant_penguin wrote:
          | Although this is true, I find these online notebooks
          | awfully slow.
        
         | TimSchumann wrote:
         | I think there may be many a use case where the data being
         | operated on cannot be shared with third parties.
        
         | anigbrowl wrote:
          | Absolutely true, but learning things the hard way was worth
          | it to me. Plus, I am old-fashioned enough to like doing
          | things on my own hardware and to not necessarily want to
          | share my data/code every time, for reasons of security or
          | modesty (as in: embarrassingly basic). I do like what Colab
          | offers and appreciate having all that processing
          | power/infrastructure available.
        
         | atty wrote:
          | Downloading and installing Anaconda is pretty much painless
          | and gives you more flexibility (and better responsiveness)
          | than Colab.
        
           | [deleted]
        
           | auxym wrote:
            | Not sure about Colab, but it's important to note that
            | Anaconda is not free for commercial use.
        
         | analog31 wrote:
         | Worth mentioning Jupyter Lite in this context too.
         | 
          | Warning: this link will open a Jupyter notebook in your
          | browser: https://jupyter.org/try-jupyter/lab/
         | 
         | It's worked pretty smoothly for me so far. I can't vouch for
         | how it handles big data sets or obscure libraries, but seems
         | like a pretty good starting point for those who are learning
         | Python. It has become how I prefer to share simple notebooks
         | with colleagues too.
         | 
          | But either of these options is nice for getting a beginner
          | past the Python installation process. Another is WinPython,
          | which is my preferred environment for local installation.
        
         | mkl wrote:
          | You save only a little bit of copy-paste, a few minutes at
          | most. If you do set things up locally, you can work
          | directly with your own files, work offline, control the
          | hardware, etc. I also think it's simpler than this guide
          | makes out, since the guide tries to minimise the amount of
          | disk space used, which is often not needed. Installing
          | Anaconda instead of Miniconda would get you pretty much set
          | up in one step, plus a single copy-paste step if you want
          | all the packages the book uses.
        
       | wirthjason wrote:
       | The first edition of this book was a game changer. I still have
       | my hardcopy paperback but haven't turned to it in years. It's
       | probably largely out of date but I look at it fondly. Glad to see
       | the material updated with new APIs and Python 3.10. The open
       | access is very nice because it can easily be used for teaching
       | and reference.
       | 
        | I would like to see some of Wes's cautionary tales included
        | in the book. Pandas feels so magical at times that you can
        | use it to solve anything, and that can create problems. Ideas
        | like this could form an appendix to the book.
       | 
       | https://wesmckinney.com/blog/apache-arrow-pandas-internals/
       | 
        | There are lots of tools in the data space; it feels like
        | choosing among JS web front ends. I use pandas quite a lot,
        | but less frequently: as my data has gotten larger, I turn to
        | Spark more. I'm also impressed by the work DuckDB is doing.
        
         | wdroz wrote:
          | I switched from pandas to polars[0] and I'm happy with the
          | performance gain. You still need Spark/Dask on "big" data,
          | but polars gives you more room before reaching for those
          | solutions.
         | 
         | [0] -- https://www.pola.rs/
        
           | ImageXav wrote:
            | Does polars have N-D labelled arrays, and if so, can it
            | perform computations on them quickly? I've been thinking
            | of moving from pandas to xarray [0], but might consider
            | polars too if it has some of that functionality.
           | 
           | [0] https://xarray.dev/
        
             | auxym wrote:
             | No experience with polars, but I've had quite a positive
             | experience with xarray.
             | 
             | I still use pandas for tabular data, but anytime I have to
             | deal with ND data, xarray is a lifesaver. No more dealing
             | with raw, unlabeled 5-D numpy arrays, trying to remember
             | which dimension is what.
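
A hand-rolled sketch of the labeled-dimensions idea that xarray provides, using only numpy; `size_of` is a hypothetical helper written here for illustration, not a numpy or xarray API:

```python
import numpy as np

# A raw 5-D array: which axis is time? channel? Easy to mix up.
arr = np.zeros((2, 3, 4, 5, 6))

# xarray-style remedy, sketched by hand: keep dimension names next
# to the data and look axes up by name instead of by position.
dims = ("time", "channel", "lat", "lon", "band")

def size_of(name):
    # Hypothetical helper for this sketch, not a numpy/xarray API.
    return arr.shape[dims.index(name)]

assert size_of("lat") == 4
```

xarray bakes this bookkeeping into the array type itself, so selections and reductions can be written by dimension name.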
        
             | kzrdude wrote:
              | polars only has 1D columns; it's columnar, just like
              | pandas.
             | 
             | IME xarray and pandas have to be used together. Neither can
             | do what the other does. (Well, some people try to use
             | pandas for stuff that should be numpy or xarray tasks.)
        
               | kzrdude wrote:
                | Addendum: polars doesn't even have an index, so no
                | multiindex either. I haven't gotten deep enough into
                | polars to understand why, or what this breaks, but it
                | feels wrong to replace pandas with something that has
                | no index.
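
A small sketch of what the pandas index enables, namely label-based selection that an index-free frame like polars would express as a filter on an ordinary column instead (toy data, assumes pandas):

```python
import pandas as pd

# Toy data: monthly values per ticker, labeled by a MultiIndex.
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.MultiIndex.from_product(
        [["2022-01", "2022-02"], ["AAPL", "MSFT"]],
        names=["month", "ticker"],
    ),
)

# Label-based selection via the index; an index-free frame would
# express this as a filter on a "month" column instead.
jan = s.loc["2022-01"]
```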
        
           | mistrial9 wrote:
            | Glad it works for you, but this sounds like bad advice
            | for most people. Why change to an outlier toolchain for
            | some percentage increase in performance and raw data
            | capacity, when you admit yourself that many big data sets
            | don't fit there anyway? No computer on my desk has less
            | than 32GB of RAM now, and pandas works well with that.
        
             | nojito wrote:
              | It's not so much larger data as it is analyst velocity.
              | 
              | Polars runs orders of magnitude faster than pandas,
              | which means EDA can be completed quicker.
        
               | geoalchimista wrote:
               | I think it depends on whether you use it for operations
               | or for data analysis. Speed is only one concern and it
               | may not always be the most relevant concern.
               | 
                | A statistician or data scientist wrangling data and
                | making plots won't care whether loading a CSV file
                | takes one second or one microsecond, because they may
                | only do it a handful of times in a project.
                | 
                | A data engineer has different requirements and
                | expectations. They may need to implement an
                | operational component that processes CSV files
                | billions of times a day.
               | 
               | If your use case is the latter, then pandas is probably
               | not for you.
        
               | nojito wrote:
               | 1 s vs 1 ms is not a great comparison.
               | 
               | Polars excels when pandas operations take 30 seconds or a
               | minute to complete. Bringing that time down to the second
               | or ms mark is really amazing.
        
               | anigbrowl wrote:
               | I love pandas and work with quite small datasets for EDA
               | (10^(3..6) most of the time) but even then I run into
               | slowdowns. I don't really mind as I'm pursuing my own
               | research rather than satisfying an employer/client, and
               | often figuring out why something is slow turns into a
               | useful learning experience (the canonical rite of passage
               | for new pandas users is using lambda functions with
               | df.apply instead of looping).
               | 
               | I've definitely procrastinated doing some analyses or
               | turning prototypes into dashboards because of the
               | potential for small slowdowns to turn into big slowdowns,
               | so it's nice to have other options available. I'm very
               | interested in Dask but have also been apprehensive about
               | doing something stupid and incurring a huge bill by
               | failing to think through my problem sufficiently.
        
             | Dzugaru wrote:
              | Some percentage? I haven't used polars, but I ditched
              | pandas after observing a 20x speedup just from
              | rewriting a piece of code to use plain csv.reader.
              | Pandas is unacceptably slow for medium data, period.
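
The plain csv.reader approach mentioned above, as a minimal stdlib-only sketch (an in-memory buffer stands in for a real file):

```python
import csv
import io

# In-memory stand-in for a CSV file (stdlib only).
data = io.StringIO("x,y\n1,2\n3,4\n")

reader = csv.reader(data)
header = next(reader)                    # skip the header row
total = sum(int(x) for x, _y in reader)  # sum the first column

assert header == ["x", "y"]
assert total == 4
```

For a simple single-pass aggregation like this, skipping dataframe construction entirely can be much cheaper than pandas' type inference and block management.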
        
               | curiousgal wrote:
               | Having had the pleasure of profiling pandas code, the
               | main bottleneck I observed was the handling of different
               | types and edge cases.
        
             | wdroz wrote:
             | I think that neither "outlier toolchain" nor "some
             | percentage increase" are fair. This benchmark [0] show
             | significant speedup while lowering the memory needs. You
             | still need to reach dask/spark for really big data where
             | you need a cluster of beefy computers for your tasks.
             | 
             | If you use an r5d.24xlarge-like[1] instance, you can skip
             | spark/dask for most workflows as 768 GB is plenty enough.
             | On top of that, polars will efficiently use the 96
             | available cores when you are computing your join, groupby,
             | etc.
             | 
             | Also polars is getting more and more popular[2]
             | 
             | [0] -- https://h2oai.github.io/db-benchmark/ [1] --
             | https://aws.amazon.com/fr/ec2/instance-types/c6a/ [2] --
             | https://star-history.com/#pola-rs/polars&Date
        
       | neves wrote:
        | Very nice to have an open access version. The paper version
        | is really expensive if your salary isn't in dollars. I
        | regretted buying the Brazilian print edition. O'Reilly is my
        | favorite technical publisher, but the non-English editions
        | are crap: my copy didn't have an index, making it useless as
        | a reference.
        
       ___________________________________________________________________
       (page generated 2022-07-02 23:00 UTC)