[HN Gopher] Python for Data Analysis, 3rd Edition - The Open Access Version
___________________________________________________________________
Python for Data Analysis, 3rd Edition - The Open Access Version
Online
Author : mariuz
Score : 194 points
Date : 2022-07-02 10:53 UTC (12 hours ago)
(HTM) web link (wesmckinney.com)
(TXT) w3m dump (wesmckinney.com)
| argella wrote:
| Anyone have a suggestion for the best online course on Python
| programming? I want to focus on Python the language and getting
| solid at programming, not necessarily a particular library.
|
| I'm pretty rusty at programming, having last formally studied
| algorithms, data structures, and OOP in C++ 15 years ago in an
| undergrad compsci program.
|
| I've done mostly sql coding and database work ever since but need
| to level up my skill set.
| civilized wrote:
| Still waiting for Wes to admit that the pandas API is a mess and
| support the development and adoption of a more dplyr-like Python
| library.
|
| Pandas was a great step on the path to making Python a decent
| data analysis language. Wes is smarter than me; I never could
| have built it. But it's time to move on.
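For readers unfamiliar with the contrast: pandas can approximate a dplyr-style pipeline through method chaining, though the verbs are less uniform. A minimal sketch with made-up data (the column names and values here are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "mass": [1.0, 2.0, 3.0, 5.0],
})

# dplyr: df |> filter(mass > 1) |> group_by(species) |> summarize(mean_mass = mean(mass))
result = (
    df[df["mass"] > 1.0]                       # filter
    .groupby("species", as_index=False)        # group_by
    .agg(mean_mass=("mass", "mean"))           # summarize
)
```

The chain reads top to bottom like a dplyr pipe, but mixes boolean indexing, a method call, and named aggregation, which is part of what the dplyr-like proposals aim to unify.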
| wesm wrote:
| If you read my slide decks over the last 7 years or so (while
| I've been working actively on Arrow and sibling projects like
| Ibis), you'll see I've been saying exactly this.
|
| See e.g. https://ibis-project.org/
| RobinL wrote:
| I think that's part of the point of his current project: Arrow.
| Specifically, one of its goals is to implement the low level
| computations, whilst enabling more competition in the space of
| data analysis API design.
| closed wrote:
| Hey--I maintain a port of dplyr to python, called siuba[1]!
|
| Right now it supports a decent number of verbs + SQL
| generation. I tried to break down why R users find pandas
| difficult in an RStudioConf talk last year[2].
|
| Between siuba and tools like polars and duckdb, I'm hopeful
| that someone hits the data analysis sweet spot for python in
| the next couple years.
|
| [1]: http://github.com/machow/siuba
|
| [2]: https://youtu.be/w4Mi0u4urbQ
| TrackerFF wrote:
| I use Python extensively for my analysis projects - and while
| Pandas is my go-to library for many things, I feel it's just very
| slow. I know that just doing stuff in numpy instead speeds things
| up considerably, and that's way before doing other optimization,
| but are there any other libraries out there similar to Pandas,
| but made for higher performance?
| sireat wrote:
| Curious: where do you experience slowness with Pandas?
|
| Usually, slowness in Pandas shows up when you start doing
| non-vectorized things.
|
| Whenever you catch yourself writing a loop in Pandas, you know
| you've gone the wrong way.
|
| I too use Python extensively (with Pandas among other things)
| and usually Pandas is perfectly fine (I haven't gone over 64 GB
| of memory usage yet).
|
| I consider Pandas to be Numpy with benefits (methods), since a
| Pandas DataFrame is basically just a collection of Numpy
| arrays. Those are about as close to C-style arrays as you can
| get in Python.
|
| The only practical problems have been with regressions among
| supporting libraries such as pyarrow not playing nice with
| numpy when working with parquet files.
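The loop-versus-vectorization point above, sketched with a toy frame (data is made up; the point is that the two styles agree on the result, while the vectorized form scales far better):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) * 10})

# Slow pattern: a Python-level loop (or .apply with a lambda) per row.
slow = [row.x + row.y for row in df.itertuples()]

# Fast pattern: one vectorized expression over the underlying numpy arrays.
fast = (df["x"] + df["y"]).tolist()
```

On five rows the difference is invisible; on millions of rows the per-row Python interpreter overhead in the first version dominates.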
| HFguy wrote:
| Polars and Modin support multiple cores and have various
| optimizations. Modin supports most of the Pandas API. The
| PySpark pandas API supports multiple cores and clusters.
| kzrdude wrote:
| polars, but it's still developing.
| johnpublic wrote:
| You can save so much time and effort by using Google Colab
| notebooks rather than setting up Python on your own machine (as
| is recommended in this guide).
| tmaly wrote:
| Is there anything like this online that could run code written
| for pygame? I know lots of beginners start with Scratch, but
| having some type of gaming in the browser for Python would be
| nice.
| savant_penguin wrote:
| Although this is true, I find these online notebooks awfully
| slow.
| TimSchumann wrote:
| I think there are many use cases where the data being operated
| on cannot be shared with third parties.
| anigbrowl wrote:
| Absolutely true, but learning things the hard way was worth it
| to me. Plus, I am old-fashioned enough to like doing things on
| my own hardware and to not necessarily want to share my
| data/code every time, for reasons of security or modesty (as in
| embarrassingly basic). I do like what Colab offers and
| appreciate having all that processing power/infrastructure
| available.
| atty wrote:
| Downloading and installing Anaconda is pretty much painless
| and gives you more flexibility (and better responsiveness) than
| Colab.
| [deleted]
| auxym wrote:
| Not sure about Colab, but it's important to note that Anaconda
| is not free for commercial use.
| analog31 wrote:
| Worth mentioning Jupyter Lite in this context too.
|
| Warning: This link will open a Jupyter notebook in your
| browser: https://jupyter.org/try-jupyter/lab/
|
| It's worked pretty smoothly for me so far. I can't vouch for
| how it handles big data sets or obscure libraries, but seems
| like a pretty good starting point for those who are learning
| Python. It has become how I prefer to share simple notebooks
| with colleagues too.
|
| Either of these options is nice for getting a beginner past
| the Python installation process. Another is WinPython, which is
| my preferred environment for local installation.
| mkl wrote:
| It only saves you a little bit of copy-paste that takes a few
| minutes at most. If you set things up locally, you can work
| directly with your own files, work offline, control the
| hardware, etc. I also think it's simpler than this guide makes
| out, as the guide tries to minimise the amount of space used,
| which is often not needed.
| Installing Anaconda instead of Miniconda would get you pretty
| much set up in one step, plus a single copy-paste step if you
| want all the packages the book uses.
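A sketch of the one-step route described above (the environment name and the package list beyond pandas/jupyter are illustrative, not the book's exact instructions):

```shell
# Assumes conda is already installed and on PATH.
conda create -n pydata-book python=3.10 pandas jupyter matplotlib
conda activate pydata-book
```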
| wirthjason wrote:
| The first edition of this book was a game changer. I still have
| my hardcopy paperback but haven't turned to it in years. It's
| probably largely out of date but I look at it fondly. Glad to see
| the material updated with new APIs and Python 3.10. The open
| access is very nice because it can easily be used for teaching
| and reference.
|
| I would like to see some of Wes' cautionary tales included in
| the book. Pandas feels so magical at times that you can use it
| to solve anything, and that can create problems. Ideas like
| this would make a good appendix to the book.
|
| https://wesmckinney.com/blog/apache-arrow-pandas-internals/
|
| There are lots of tools in the data space; it feels like
| choosing JS web front ends. I use pandas quite a lot, but less
| frequently: as my data has gotten larger, I turn to Spark more.
| I'm also impressed by the work DuckDB is doing.
| wdroz wrote:
| I switched from pandas to polars[0] and I'm happy with the
| performance gain. You still need to use Spark/Dask on "big"
| data, but polars gives you more room before reaching for those
| solutions.
|
| [0] -- https://www.pola.rs/
| ImageXav wrote:
| Does polars have N-D labelled arrays, and if so, can it
| perform computations on them quickly? I've been thinking of
| moving from pandas to xarray [0], but might consider polars
| too if it has some of that functionality.
|
| [0] https://xarray.dev/
| auxym wrote:
| No experience with polars, but I've had quite a positive
| experience with xarray.
|
| I still use pandas for tabular data, but anytime I have to
| deal with ND data, xarray is a lifesaver. No more dealing
| with raw, unlabeled 5-D numpy arrays, trying to remember
| which dimension is what.
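The labelled-dimension point above, as a minimal xarray sketch (the dimension names and data here are made up):

```python
import numpy as np
import xarray as xr

# Label a 3-D array so you never have to remember "axis 0 is time,
# axis 1 is lat, axis 2 is lon".
data = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"time": [2021, 2022]},
)

# Select and reduce by name instead of by positional axis number.
mean_2022 = data.sel(time=2022).mean(dim="lon")
```

With raw numpy the same operation would be `arr[1].mean(axis=-1)`, which silently breaks if the dimension order ever changes.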
| kzrdude wrote:
| polars only has 1-D columns; it's columnar just like pandas.
|
| IME xarray and pandas have to be used together. Neither can
| do what the other does. (Well, some people try to use
| pandas for stuff that should be numpy or xarray tasks.)
| kzrdude wrote:
| Addendum: Polars doesn't even have an index, so no
| multiindex either. I haven't gotten deep enough into polars
| to understand why, and what that breaks, but it feels wrong
| to replace pandas with something that has no index.
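For context on what an index buys you in pandas, a small MultiIndex sketch (region/year data invented for illustration):

```python
import pandas as pd

# A MultiIndex lets you address rows by label along several levels;
# index-less designs model everything as plain columns instead.
sales = pd.DataFrame(
    {"units": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_product(
        [["east", "west"], [2021, 2022]], names=["region", "year"]
    ),
)

east_2022 = sales.loc[("east", 2022), "units"]   # label-based lookup
by_region = sales.groupby(level="region").sum()  # aggregate on an index level
```

In an index-less library, both operations become filters and groupbys on ordinary columns, which is simpler to reason about but gives up label-aligned arithmetic.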
| mistrial9 wrote:
| glad it works for you but this sounds like bad advice for
| most people.. why change to an outlier toolchain for some
| percentage increase in performance and raw data capacity,
| when you admit yourself that many Big Data sets are not
| fitting there? No computer on my desk has less than 32GB RAM
| now -- pandas works well with that.
| nojito wrote:
| It's not so much about larger data as it is about analyst
| velocity.
|
| Polars runs orders of magnitude faster than pandas, which
| means EDA can be completed quicker.
| geoalchimista wrote:
| I think it depends on whether you use it for operations
| or for data analysis. Speed is only one concern and it
| may not always be the most relevant concern.
|
| A statistician/data scientist wrangling data and making
| plots won't care whether loading a CSV file takes one
| second or one microsecond, because they may only do it a
| handful of times for a project.
|
| A data engineer has different requirements and
| expectations. They may need to implement an operational
| component that processes CSV files billions of times a
| day.
|
| If your use case is the latter, then pandas is probably
| not for you.
| nojito wrote:
| 1 s vs 1 ms is not a great comparison.
|
| Polars excels when pandas operations take 30 seconds or a
| minute to complete. Bringing that time down to the second
| or ms mark is really amazing.
| anigbrowl wrote:
| I love pandas and work with quite small datasets for EDA
| (10^(3..6) most of the time) but even then I run into
| slowdowns. I don't really mind as I'm pursuing my own
| research rather than satisfying an employer/client, and
| often figuring out why something is slow turns into a
| useful learning experience (the canonical rite of passage
| for new pandas users is using lambda functions with
| df.apply instead of looping).
|
| I've definitely procrastinated doing some analyses or
| turning prototypes into dashboards because of the
| potential for small slowdowns to turn into big slowdowns,
| so it's nice to have other options available. I'm very
| interested in Dask but have also been apprehensive about
| doing something stupid and incurring a huge bill by
| failing to think through my problem sufficiently.
| Dzugaru wrote:
| Some percentage? I haven't used polars, but I ditched
| pandas after observing a 20x speedup just from rewriting a
| piece of code to use plain csv.reader. Pandas is
| unacceptably slow for medium data, period.
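A minimal sketch of the kind of rewrite described above, using only the stdlib csv module (the data and the column-sum task are invented for illustration):

```python
import csv
import io

# For a simple column sum, csv.reader skips all of pandas' type
# inference and DataFrame construction overhead.
raw = "name,value\na,1\nb,2\nc,3\n"

reader = csv.reader(io.StringIO(raw))
next(reader)                               # skip the header row
total = sum(int(row[1]) for row in reader)
```

The trade-off is that you convert types and handle malformed rows yourself, which is exactly the work pandas does for you (and, per the sibling comment, where much of its overhead lives).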
| curiousgal wrote:
| Having had the pleasure of profiling pandas code, the
| main bottleneck I observed was the handling of different
| types and edge cases.
| wdroz wrote:
| I think that neither "outlier toolchain" nor "some
| percentage increase" is fair. This benchmark[0] shows
| significant speedups while lowering memory needs. You
| still need to reach for Dask/Spark for really big data,
| where you need a cluster of beefy computers for your
| tasks.
|
| If you use an r5d.24xlarge-like[1] instance, you can skip
| Spark/Dask for most workflows, as 768 GB is plenty. On top
| of that, polars will efficiently use the 96 available
| cores when you are computing your joins, groupbys, etc.
|
| Also, polars is getting more and more popular.[2]
|
| [0] -- https://h2oai.github.io/db-benchmark/
| [1] -- https://aws.amazon.com/fr/ec2/instance-types/c6a/
| [2] -- https://star-history.com/#pola-rs/polars&Date
| neves wrote:
| Very nice to have an open access version. The paper version is
| really expensive if your salary isn't in dollars. I regretted
| buying the print Brazilian edition. O'Reilly is my favorite
| technical publisher, but the non-English editions are crap. My
| book didn't have an index, making it useless as a reference.
___________________________________________________________________
(page generated 2022-07-02 23:00 UTC)