[HN Gopher] Exploratory data analysis for humanities data
___________________________________________________________________
Exploratory data analysis for humanities data
Author : yarapavan
Score : 95 points
Date : 2023-10-06 16:31 UTC (6 hours ago)
(HTM) web link (awk.dev)
(TXT) w3m dump (awk.dev)
| simonw wrote:
| Wow, imagine being a humanities major and having Brian Kernighan
| teach you Awk!
| smlavine wrote:
| If I were at Princeton, I would take every one of Kernighan's
| classes that I could! I wonder if that's a problem there.
| EvanKelly wrote:
| I'm 15 years removed at this point, but Prof. Kernighan was one
| of the most accessible professors and taught the most popular
| CS survey course (333).
|
| There were at least a half dozen times when another professor
| pointed me in his direction. Once, Kernighan spent an hour with
| me looking into how to scrape a dynamic website for my auction
| theory project. When he was stumped, he introduced me to a
| professor at another school who he knew had looked into the
| topic.
| yarapavan wrote:
| Course website (linked from the article): https://www.hum307.com/
| mbb70 wrote:
| I'm all for this kind of exploratory hacking around before
| booting up python/R/Excel/duckdb, especially in constrained
| environments. A classic pain point is having to deal with column
| numbers, so I'll share my favorite trick:
|
| `head -n1 /path/to/file.csv | tr ',' '\n' | nl | grep desired_column`
|
| gives you the column number of desired_column
| tejtm wrote:
| yep, not knowing about `nl` I used `...| grep -n column_header`
| or `...| grep -n .` to replicate the 'nl' behavior.
|
| edit: I like your 'nl' better, since it uses whitespace instead
| of a colon as the separator.
| patrec wrote:
| Unless there is a quoted comma or an empty column beforehand
| (nl "helpfully" skips empty lines for numbering purposes).
| [deleted]
| i15e wrote:
| Something to watch out for with _nl_ is that by default it
| doesn't number empty lines. e.g.:
|
|     $ printf 'one\n\nthree\n' | nl
|          1  one
|
|          2  three
|
| Set _-ba_ to enable numbering all lines.
|
| For this use case I usually end up running _cat -n_ instead
| since I find it easier to remember.
| chaps wrote:
| grep -n also works in place of `nl`!
| pmarreck wrote:
| Recent Awk convert (after, like most people, just using it for
| one-liners for years); it's aged remarkably well (although I wish
| it used more functional constructs, permitted proper variable
| initialization, and had interrupt handling... but at that point,
| it's probably best to switch to a "full" language...)
| jph wrote:
| Awk is awesome and Dr. Kernighan has taught me so much.
|
| If you like exploratory data analysis using awk, you may like the
| "num" command:
|
| https://github.com/numcommand/num
|
| Num uses awk for command line statistics, such as standard
| deviation, kurtosis, quartiles, uniqueness, ordering, and more.
| Num runs on a very wide range of Unix systems, such as systems
| without package managers.
|
| Feature requests and PRs are welcome.
| qsort wrote:
| The article nails down a very real pain point with libraries like
| Pandas:
|
| > looping over a set of input lines seems more natural than the
| dataframe selectors that Pandas favors
|
| Row-oriented operations, as opposed to aggregations and other
| OLAP-style queries, are kind of painful. The generator machinery
| (yield from) is a partial fix for this, but Pandas itself offers
| little relief.
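|
| For example (a rough sketch with a made-up 'price' column), the
| row-wise version reads much more like processing input lines:
|
|     import pandas as pd
|
|     df = pd.DataFrame({"item": ["a", "b", "c"],
|                        "price": [3.0, -1.0, 5.0]})
|
|     # row-oriented: loop over rows like lines of input
|     def valid_rows(frame):
|         for row in frame.itertuples():
|             if row.price > 0:
|                 yield row.item, row.price
|
|     print(list(valid_rows(df)))
|
|     # dataframe selectors: the style Pandas favors
|     print(df.loc[df["price"] > 0, ["item", "price"]])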
| esafak wrote:
| pandas has a poor API. I'd rather use SQL with DuckDB.
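|
| Rough sketch (made-up file and columns):
|
|     import duckdb
|
|     # query the CSV directly with SQL, no load step
|     duckdb.sql("""
|         SELECT category, count(*) AS n, avg(price) AS avg_price
|         FROM 'data.csv'
|         GROUP BY category
|         ORDER BY n DESC
|     """).show()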
| importantbrian wrote:
| This has become my workflow too. Admittedly though I've spent
| most of my career writing large amounts of SQL, and was a
| pretty heavy Tidyverse user for a while, so that all makes a
| lot more sense to me than Pandas. I generally get my data
| into whatever shape I need it in and then load it into
| pandas.
| slt2021 wrote:
| pandas is way more powerful than how most people use it.
|
| when you have to deal with thousands of text files, a mish-mash
| of csv and tsv, some rows overlapping between the files, some
| files spread across multiple different locations (shared drive,
| s3 bucket, URL, SQL db, etc), with column names that look
| similar but aren't identical - this is a perfect use case for
| pandas.
|
| read csv file? just pd.read_csv()
|
| read and concat N csv files? just pd.concat([pd.read_csv(f)
| for f in glob("*.csv")])
|
| read parquet or read_sql()? not a problem at all.
|
| need some custom rules for data cleansing, regex or fuzzy
| matching on column names, or converting data from/to
| csv/parquet/sql? it will be a pandas one-liner
|
| a lot of painful data processing, cleaning, and correcting is
| just a one-liner in pandas, and I don't know of a better tool
| that can beat it - probably tidyr, but that is essentially the
| same as pandas, just for R
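|
| e.g. a rough sketch of that kind of mish-mash cleanup
| (hypothetical files and column rules):
|
|     import re
|     from glob import glob
|     import pandas as pd
|
|     def normalize(col):
|         # "Unit Price ($)" -> "unit_price"
|         return re.sub(r"[^a-z0-9]+", "_",
|                       col.strip().lower()).strip("_")
|
|     frames = []
|     for f in glob("*.csv") + glob("*.tsv"):
|         df = pd.read_csv(f, sep=None, engine="python")  # sniff sep
|         df.columns = [normalize(c) for c in df.columns]
|         frames.append(df)
|
|     combined = pd.concat(frames, ignore_index=True)
|     combined = combined.drop_duplicates()   # overlapping rows
|     combined.to_parquet("combined.parquet")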
| IKantRead wrote:
| > essentially the same as pandas, just for R
|
| You are aware that pandas was designed to replicate the
| behavior of base R's data frames?
|
| I've been a heavy user of both and R's data frames are
| still superior to pandas even without the tidyverse.
|
| Pandas is really nice for the use case it was designed for:
| working with financial data. This is a big part of why Pandas's
| indices feel so weird for everything else, but if your index is
| a time in a financial time series, then all of a sudden Pandas
| makes sense and works great.
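|
| (Toy sketch with made-up prices; the time-indexed operations
| are where it clicks:)
|
|     import numpy as np
|     import pandas as pd
|
|     # hypothetical minute bars with a DatetimeIndex
|     idx = pd.date_range("2023-10-06 09:30", periods=390,
|                         freq="min")
|     prices = pd.Series(100 + np.random.randn(390).cumsum(),
|                        index=idx)
|
|     hourly = prices.resample("1h").ohlc()   # OHLC bars
|     returns = prices.pct_change()           # simple returns
|     vol = returns.rolling("30min").std()    # time-based window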
|
| When not working with financial data I try to limit the
| amount of time my code touches pandas, and increasingly
| find numpy + regular python works better and is easier to
| build out larger software with. It also makes it much
| easier to port your code into another language for use in
| production (i.e. it's quick and easy to map standard python
| to language X, but not so much a large amount of non-
| trivial pandas).
| palae wrote:
| R also has data.table, which extends data.frame and is
| pretty powerful and very fast
| slt2021 wrote:
| with pandas 2.0 and the Arrow backend instead of numpy, pandas
| became "cloud datalake native" - you can read Arrow-format
| files in S3 very efficiently and at any scale, and store and
| process arbitrarily large numbers of files on cheap serverless
| infra. The Arrow format is also supported by other languages.
|
| with s3+sqs+lambda+pandas you can build cheap serverless data
| processing pipelines and iterate extremely quickly
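|
| (rough sketch, assuming pyarrow + s3fs and a hypothetical
| bucket layout - parquet files here rather than Arrow IPC:)
|
|     import pandas as pd
|
|     # pandas 2.x: keep the data Arrow-backed, not numpy
|     df = pd.read_parquet("s3://my-bucket/events/2023-10-06/",
|                          dtype_backend="pyarrow")
|     daily = df.groupby("user_id", as_index=False)["amount"].sum()
|     daily.to_parquet("s3://my-bucket/agg/2023-10-06.parquet")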
| Karrot_Kream wrote:
| Do you have any benchmarks for how much data a given lambda can
| search/process after loading Arrow data? Not trying to argue -
| I'm curious because I'd never thought of this architecture
| myself. I would have guessed that the time it takes to ingest
| the Arrow data and then search through it would be too long for
| a lambda, but I may be totally off base here. I haven't played
| around with lambdas in detail, so I don't have a particularly
| robust mental model of their limitations.
| slt2021 wrote:
| reading/writing Arrow has zero serde overhead between memory
| and disk.
|
| I think of a lambda as a thread: you can put a trigger on the
| S3 bucket so that each incoming file gets processed. This lets
| you get around the GIL and invokes your lambda once per
| mini-batch.
|
| assuming you have a high volume and frequency of data, you will
| need to "cool down" your high-frequency data and switch from a
| row basis (like millions of rows per second) to a mini-batch
| basis (like one batch file per 100 MB).
|
| This can be achieved by having kafka with a high partition
| count on the ingestion side, and a sink to s3.
|
| from S3, your lambda will be invoked for each new file and the
| minibatch will be processed by your python code. You can
| right-size your lambda's RAM; usually I reserve 2-3x the size
| of a batch file for the lambda.
|
| the killer feature is zero ops. Just by tuning your minibatch
| size you can regulate how many times your lambda will be
| invoked.
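|
| Bare-bones sketch of the handler (assumes a direct S3 event
| trigger and made-up bucket/column names; going through SQS
| adds one more layer of unwrapping):
|
|     import pandas as pd
|
|     def handler(event, context):
|         # invoked once per minibatch file landing in the bucket
|         for record in event["Records"]:
|             bucket = record["s3"]["bucket"]["name"]
|             key = record["s3"]["object"]["key"]
|             df = pd.read_parquet(f"s3://{bucket}/{key}",
|                                  dtype_backend="pyarrow")
|             clean = df.dropna(subset=["user_id"])  # example rule
|             clean.to_parquet(
|                 f"s3://{bucket}/{key.replace('raw/', 'clean/')}")
|         return {"status": "ok"}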
| esafak wrote:
| You can do that with other tools too.
|
| https://duckdb.org/docs/data/csv/overview.html
|
| https://duckdb.org/docs/data/parquet/overview
|
| https://duckdb.org/docs/data/multiple_files/overview.html
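|
| e.g. the concat-over-glob example from above as one query
| (sketch, hypothetical files):
|
|     import duckdb
|
|     df = duckdb.sql("SELECT * FROM read_csv_auto('*.csv')").df()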
| slt2021 wrote:
| interesting, but I would still prefer pandas for data cleansing
| and manipulation, just because I won't be limited by SQL syntax
| - and can always use df.apply() and/or any python package for
| custom processing.
|
| pandas with the apache arrow backend is also high performance
| and compatible with cloud-native data lakes.
|
| plus compatibility with the sklearn package is a killer
| feature: with just a few lines you can bolt an ML model on top
| of your data
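|
| (loose sketch - made-up cleanup rule and features:)
|
|     import pandas as pd
|     from sklearn.linear_model import LogisticRegression
|
|     df = pd.read_parquet("clean.parquet")   # hypothetical file
|
|     # arbitrary python mid-pipeline via apply()
|     df["domain"] = df["email"].apply(
|         lambda s: s.split("@")[-1].lower())
|
|     # bolt an ML model on top in a few lines
|     X, y = df[["age", "visits"]], df["churned"]
|     model = LogisticRegression().fit(X, y)
|     df["churn_prob"] = model.predict_proba(X)[:, 1]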
| aidos wrote:
| It definitely has its place. I like to use it to grab the data,
| clean it up, and get it out into python / Postgres. I don't
| like to have it spreading through the codebase.
| wheresmycraisin wrote:
| > Pandas is way more powerful
|
| Only if you 1) don't know SQL and 2) are working with tiny
| datasets that are around 5% of your total RAM.
| faizshah wrote:
| I guess it depends on who you ask, but personally I can write
| pandas much faster than I can load data into a DB and then
| process it. The reason is that pandas' defaults on the from_
| and to_ functions are very sane and you don't need to think
| about things like escaping strings. It's also easy to deal with
| nulls quickly in pandas and rapidly get some EDA graphs like
| in R.
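|
| Roughly (assumed file/columns):
|
|     import pandas as pd
|
|     df = pd.read_csv("sample.csv")     # sane defaults
|     print(df.isna().sum())             # quick look at nulls
|     df = df.dropna(subset=["amount"])
|     print(df.describe())               # summary stats
|     df["amount"].hist(bins=50)         # quick EDA plot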
|
| The other benefit of pandas is it's in python so you can
| use your other data analysis libraries whereas with SQL
| you need to marshal back and forth between python and
| SQL.
|
| My usual workflow is: explore the data in pandas/datasette (if
| it's big data, I explore just a sample and use bash tools to
| pull the sample out) -> write my notebook in pandas -> scale it
| up in spark/dask/polars depending on the use case.
|
| This is pretty good because ChatGPT understands pandas,
| pyspark, and SQL really well, so you can easily ask it to
| translate scripts or give you code for different things.
|
| On scalability: if you need scale, there are many options today
| for processing large datasets with a dataframe API, e.g.
| koalas, polars, dask, modin, etc.
| slt2021 wrote:
| >> Only if you 1) don't know SQL and 2) are working with tiny
| datasets that are around 5% of your total RAM.
|
| this is only true for newbie python devs who learned about
| pandas from blogs on medium.com. I have pipelines that process
| terabytes per day in a serverless datalake, and they require
| zero of the DBA work that usually comes with anything *Sql
| wslh wrote:
| Ah! That is awkward! Sorry, I couldn't resist; I have all
| respect for Awk.
___________________________________________________________________
(page generated 2023-10-06 23:00 UTC)