[HN Gopher] My favourite API is a zipfile on the European Centra...
___________________________________________________________________
My favourite API is a zipfile on the European Central Bank's
website
Author : qsantos
Score : 518 points
Date : 2023-09-15 16:59 UTC (6 hours ago)
(HTM) web link (csvbase.com)
(TXT) w3m dump (csvbase.com)
| lxgr wrote:
| The example is failing at decompression for me, and I suspect
| it's because of this:
|
| > That data comes as a zipfile, which gunzip will decompress.
|
| Doesn't gunzip expect gzip files, as opposed to ZIP (i.e. .zip)
| files?
|
| On Linux I get further (apparently Linux gunzip is more tolerant
| of the format error than the macOS default one), but there, I
| then run into:
|
| > Error: no such column: Date
| halJordan wrote:
| Here's the gzipped tarball of the docs you can read to find out
|
| https://www.gnu.org/software/gzip/manual/gzip.txt.gz
| jandrese wrote:
| This is slightly abusive of gunzip. It works only because the
| zipfile in question contains a single file and you are piping
| it in via the shell. If you had downloaded the file and then
| run gunzip on it, the program would have refused.
|
| As for the second error, I think you might be trying to import
| an empty file or maybe an error message? Probably related to
| the first problem.
| lovasoa wrote:
| the second error is because you need to add a `-csv` argument
| to sqlite
| lxgr wrote:
| > but it does work
|
| Only on some gzip implementations, e.g. the one shipping with
| many Linux distributions. It doesn't work on macOS, for
| example.
| pixl97 wrote:
| At least in Ubuntu it doesn't do anything at all.
|
|           # gunzip -d master.zip
|           gunzip: master.zip: unknown suffix -- ignored
| lxgr wrote:
| It's a ZIP file, so you'll want to use unzip, not gunzip.
|
| Ubuntu's version of g(un)zip actually does tolerate ZIP
| inputs though; for that usage, just omit the '-d'.
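|
| If you want the streaming pipeline without abusing gzip, funzip
| is a sketch that should be portable (it ships with Info-ZIP on
| most systems and extracts the first member of a zip read from a
| pipe, which is all this archive contains):
|
|           curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
|             | funzip \
|             | head -2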
| [deleted]
| stevoski wrote:
| Ah, I remember this specific file from my time at the ECB 15-ish
| years ago.
|
| IIRC it was by far the most downloaded file on the ECB website.
| Tons of people, including many financial institutions, downloaded
| it daily, and used it to update their own systems.
|
| IIRC #2: in the minutes immediately after the file's daily
| scheduled publication time, there was a massive traffic spike.
|
| It was a conscious decision to make it a simple CSV file (once
| unzipped): it made it possible to serve the file reliably, fast,
| and with little resources needed.
|
| The small team responsible at the time for the ECB's public
| website was inordinately proud of the technical decisions made
| to serve this data in a single static file. And rightly so.
| mstade wrote:
| Do you know why they decided to host a zip file instead of just
| hosting the CSV and relying on HTTP compression?
| theamk wrote:
| HTTP compression is optional, so they either have to compress
| on the fly (wasting cpu) or provide multiple versions
| (complicating setup and deployment) or make some HTTP clients
| not work.
|
| A single zip file is really the easiest solution for cases
| where the file absolutely must be compressed.
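|
| You can check what a given server would do for the plain CSV (a
| sketch; a Content-Encoding header in the output means it
| compressed on the fly):
|
|           curl -sI -H 'Accept-Encoding: gzip' \
|             https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv \
|             | grep -i '^content-encoding'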
| Nux wrote:
| I could see 2 reasons:
|
| 1 - save on cpu usage, compress once, serve many
|
| 2 - with zip you can have some rudimentary data integrity
| checks (unzip -t)
| nlehuen wrote:
| It's a bit scary to imagine the consequences if this file was
| somehow corrupted (wrong column headers for instance)!
| thghtihadanacct wrote:
| Just wait until you inherit a service that, sans
| documentation, pulls in a web resource file that suddenly is
| no longer available :(
| andylynch wrote:
| I know this file too, I was one of them. Of all the data
| sources we used, it was the best to work with.
| [deleted]
| supergeek133 wrote:
| I used to help run the Best Buy product catalog API. We did
| nightly CSVs instead of making people poll/paginate the entire
| catalog. It was super widely used.
| swader999 wrote:
| CSV is surprisingly common, and I prefer it despite the
| parsing, quoting, encoding and escaping woes.
|
| Amazon S3 lets you query CSV files already loaded in buckets,
| which is interesting but I haven't used it yet.
|
| One company I worked at a long time ago used free Dropbox
| accounts as an FTP-like drop that they would consume. It was
| hilarious, but it worked well and was easy to stay under the
| free limits.
| andrewmcwatters wrote:
| The EDGAR API bulk data is similar in nature, albeit in JSON
| instead.[1]
|
| [1]: https://github.com/andrewmcwattersandco/programming-
| language...
| eql5 wrote:
| Simplicity Wins.
|
| All. Ways. (TM)
| sdfghswe wrote:
| What's the use case here? For a quick look up you're not going to
| write that monster command. And for a more ongoing repeated use,
| you're not going to download and uncompress the file every time
| you want to query it...
|
| I get the satisfaction as a curiosity, but other than that, to
| me that wouldn't be enough to make it my "favorite" or even
| "good".
| accrual wrote:
| I viewed it as an example of how simple it could be
| (essentially a one-liner), but not meant for actual repeated
| use as your points suggest.
| Scene_Cast2 wrote:
| There's an issue with openly & freely available financial data.
| It tends to be limited (i.e. you need many sources to answer the
| questions you want and not just the questions a particular source
| is able to answer). And if you're bringing in different sources
| of data, each source requires custom fetching and parsing, along
| with being just annoyingly slightly different from other sources.
| Even if the value is given on a per-day basis (which is a huge
| assumption, lots of things like interest rates are published much
| slower), you have things like "is the value at the start of day,
| end of day, market close, median? What time zone? Which exchange?
| What about days when the exchange was closed? What about odd
| exchange hours or halts?", "what's the date and time format,
| what's the naming convention?"
| mhh__ wrote:
| This why Bloomberg and friends make money. They're a cartel but
| they at least do both making all this data work together and
| also transcribe stuff that is written in English into
| structured data etc (hence their interest in AI)
| mhh__ wrote:
| _Target_ rates aren't published very frequently, but interest
| rates themselves are typically daily or better.
| liamkinne wrote:
| I once had the unfortunate experience of building an API for a
| government org where the data changed once a year, or whenever
| amendments were made, which happened very infrequently.
|
| The whole data set could have been zipped into a <1MB file, but
| instead a "solution architect" got their hands on the
| requirements. We ended up with a slow API, because they wouldn't
| let us cache results in case the data had changed just as it
| was requested, and an overly complex webhook system for
| notifying subscribers of changes to the data.
|
| A zip file probably was too simple, but not far off what was
| actually required.
| ipaddr wrote:
| If the data changes only once a year, or rarely, that would
| imply usage of the API is a rare event for a user of the data,
| so speed isn't a huge concern. Caching would introduce more
| complexity and the risk of needing to manually revalidate the
| cache. The solution architect was probably right.
| justsomehnguy wrote:
|           cat /api/version.txt
|           2023.01.01
|
|           ls /api
|           version.txt data.zip
| accrual wrote:
| Or maybe encode the version into the filename? It would
| overwrite if nothing changed, and the previous versions
| would remain available.
| 2023.01.01-data.zip
| justsomehnguy wrote:
| That requires preprocessing on the client, and there are some
| people who have... weird assumptions about how dates should be
| written.
|
| The version file can be queried in at least two ways:
|
| the ETag/If-Modified-Since way (metadata only)
|
| the content itself
|
| The best part with the last one: you don't need semver
| shenanigans. Just compare it with the latest downloaded copy;
| if version != downloaded => do_the_thing
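|
| A polling sketch of that scheme (example.org and the file names
| are hypothetical, matching the listing above):
|
|           latest=$(curl -s https://example.org/api/version.txt)
|           if [ "$latest" != "$(cat last-version.txt 2>/dev/null)" ]; then
|               # version changed (or first run): fetch and remember it
|               curl -s -o data.zip https://example.org/api/data.zip
|               echo "$latest" > last-version.txt
|           fi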
| paulddraper wrote:
| > Caching would introduce more complexities
|
| Apache/nginx do it just fine...
| xg15 wrote:
| Why do rare writes imply rare usage? It's possible the file
| is read often and by different systems even if changes are
| infrequent.
|
| If the API was used rarely, that would be even more of an
| argument for a simple implementation and not a complex system
| involving webhooks.
| pests wrote:
| Can't cache so you need to read it whenever you use the data,
| not just when it changes.
| xg15 wrote:
| I think for <1MB of data, with changes once (or twice) a year,
| the correct API is a static webserver with working ETag/If-
| Modified-Since support.
|
| If you want to get really fancy, offer an additional webhook
| which triggers when the file changes - so clients know when to
| redownload it and don't have to poll once a day.
|
| ...or make a script that sends a predefined e-mail to a mailing
| list when there is a change.
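|
| A sketch of the client side with curl (example.org is a
| placeholder; -z sends If-Modified-Since based on the local
| file's mtime, so an unchanged file costs only a 304):
|
|           curl -s -o data.zip -z data.zip https://example.org/data.zip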
| justsomehnguy wrote:
| > ETag/If-Modified-Since
|
| See above. Also you can just publish the version in DNS with
| a long enough TTL
| [deleted]
| calpaterson wrote:
| > working ETag/If-Modified-Since support
|
| I completely agree and csvbase already implements this (so does
| curl btw), try:
|
|           curl --etag-compare stock-exchanges-etag.txt \
|             --etag-save stock-exchanges-etag.txt \
|             https://csvbase.com/meripaterson/stock-exchanges.parquet -O
| deeringc wrote:
| A zip file on a web server that supports etags, that's polled
| every time access is required. When nothing has changed since
| last time, you get an empty HTTP 304 response and if it has
| changed then you simply download the <1MB Zip file again with
| the updated etag. What am I missing?
| throwing_away wrote:
| You forgot to get yourself paid.
| mhh__ wrote:
| Lots of exchanges and banks still use FTP and friends to
| publish data to both the public and counterparties.
| accrual wrote:
| Indeed, I think ACH (Automated Clearing House) specifically
| still relies on FTP.
| swader999 wrote:
| I would say almost all do, especially the larger ones. Looking
| at you, Wells Fargo, with your tabbed and nested-record CSVs.
| pharrington wrote:
| Yeah having a local snapshot of an entire dataset is obviously
| nice when it's feasible. But there's no need to conflate that
| with "application programming interface" just to market csvbase.
| j7ake wrote:
| The other annoying thing is the large file that is behind some
| click button, so you cannot easily copy and paste the url and
| download it via wget.
| AceJohnny2 wrote:
| Key point:
|
| > _Some things we didn't have to do in this case: negotiate
| access (for example by paying money or talking to a salesman);
| deposit our email address/company name/job title into someone's
| database of qualified leads, observe any quota; authenticate
| (often a substantial side-quest of its own), read any API docs at
| all or deal with any issues more serious than basic formatting
| and shape._
| lovasoa wrote:
| I am skeptical about the "observe any quota" part. Bandwidth is
| not free.
| crazygringo wrote:
| No, it's a terrible _API_ if retrieving the result '2000-10-26'
| requires downloading a 565 KB file.
|
| I don't want to seem overly negative -- zipped CSV files are
| fantastic when you want to import lots of data that you then re-
| serve to users. I would vastly prefer it over e.g. protobufs that
| are currently used for mass transit system's live train times,
| but have terrible support in many languages.
|
| But it's incredibly wasteful to treat it like an API to retrieve
| a single value. I hope nobody would ever write something like
| that into an app...
|
| (So the article is cool, it's just the headline that's too much
| of a "hot take" for me.)
| piaste wrote:
| It's historical data. There is zero reason to request it more
| than once a day, and the users of such data will be interested
| in wildly different filters or aggregates over that data.
|
| If this were being used to get the current rates, then yes, it
| would be a terrible design choice. But there are other services
| for that. This one fits its typical use case.
| mhh__ wrote:
| But what if I want the average over time? The query depends on
| every value, should everything be computed on the server?
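|
| Though I suppose that's an argument for grabbing the file: once
| it's local, aggregates are one query away (a sketch in the style
| of the article's one-liner, using the uncompressed .csv variant):
|
|           curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv \
|             | sqlite3 -csv ':memory:' '.import /dev/stdin rates' \
|                 'select avg(USD) from rates;'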
| berkes wrote:
| Why do you assume that 565KB even matters?
|
| I've been building loads of financial software, both back and
| front ends. In the frontend world it is, sadly, quite common to
| send that amount of "data" over the wire before even getting to
| the actual data.
|
| And in backends it's really a design decision. There's nothing
| faster than a cron job parsing exchange rates nightly and
| writing them to a purpose-designed todays-rates.json served as
| a static file to your mobile, web or microservice apps.
|
| Nothing implies your mobile app has to consume this
| zip-csv-over-http directly.
| netsharc wrote:
| I'm an old fart, seeing the curl in every example is like nails
| on chalkboard.
|
| 565 KB, that's about 3 minutes download on a 28.8kbps modem I
| started getting online with...
| zamadatix wrote:
| When you got that modem were you concerned with how long it
| would have taken to type/write the information sent over it
| or were you just happy it didn't matter anymore?
|
| I often wonder what I'll think of technology in, say, another
| 20 years, but I can never tell if it's all just some general
| shift in perspective. Or, as you look farther back, maybe
| certain people were always just about a certain perspective
| (e.g. doing the most given the constraints), and technology
| changes enough that different perspectives (e.g. getting the
| most done easily for the same unit of human time) become the
| most common among users of the newer stuff; those people will
| in turn have a different perspective than the ones another 20
| years down the line from them, and so on.
|
| For example, maybe you think it's crazy to ask for the exact
| piece of data you need, I think it's crazy to do all the work
| to not just grab the whole half MB and just extract what you
| need quickly on the client side as often as you want, and
| someone equidistant from me will think it's nuts to not just
| feed all the data into their AI tool and ask it "what is the
| data for ${thing}" instead of caring about how the data gets
| delivered at all. Or maybe that's just something I hope for
| because I don't want to end up a guy who says everything is
| the same just done slower on faster computers since that
| seems... depressing in comparison :).
| maxbond wrote:
| Here's an API built on top of this data that allows for more
| fine tuned queries.
|
| https://exchangeratesapi.io/
|
| https://github.com/exchangeratesapi/exchangeratesapi
| Hamuko wrote:
| How much bandwidth does the average API documentation page use?
| eddythompson80 wrote:
| Does your client query the API documentation every time it's
| querying an API?
| ajcp wrote:
| Well of course it does; how else would it know how to query
| the API? /s
| dang wrote:
| > (So the article is cool, it's just the headline that's too
| much of a "hot take" for me.)
|
| It's one of those headlines that we'd never allow if it were an
| editorialization by the submitter (even if the submitter were
| the author, in this case), but since it's the article's own
| subtitle, it's ok. A bit baity but more in a whimsical than a
| hardened-shameless way.
|
| (I'm sure you probably noticed this but I thought the gloss
| might be interesting)
| jrm4 wrote:
| Kinda feels like 20 years ago called and wants its argument
| back, like have you seen any javascript ever?
|
| 565 KB plus the logic to get the big one is _minuscule_ today
| by any reasonable measure.
| scubbo wrote:
| True, though ironic that also OP is implicitly making the
| argument for GraphQL-like interfaces ("just specify the data
| that you want returned" rather than "get all the data, then
| process it"), which are themselves the New Hotness in some
| areas.
| thekashifmalik wrote:
| At risk of sounding pedantic, REST also allows for "just
| specify the data that you want returned" style APIs.
|
| Something like:
|
|           curl "https://www.ecb.europa.eu/stats/eurofxref/api?fields=Date&order=USD&limit=1"
| scubbo wrote:
| A very fair point well-made!
| smallpipe wrote:
| How much data do I need to download before I can do a protobuf
| request ?
| eddythompson80 wrote:
| Are you counting the client size? TLS handshake? Because
| otherwise the answer is none.
| theLiminator wrote:
| They should ship parquet: it supports predicate pushdown and is
| dramatically more compact. While you can't get at row-level
| data as easily, it's great for analytical queries.
| IanCal wrote:
| IMO this is the best simple option right now.
|
| For sorted data you only need the relevant row groups which
| can be tunable to sensible sizes for your data and access
| pattern.
| calpaterson wrote:
| csvbase does ship parquet!
|
| Just add ".parquet" - eg
| https://csvbase.com/meripaterson/stock-exchanges.parquet
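|
| For the curious, a sketch of querying that parquet file in
| place with duckdb's httpfs extension (which fetches only the
| byte ranges it needs):
|
|           duckdb -c "INSTALL httpfs; LOAD httpfs; \
|             SELECT * FROM 'https://csvbase.com/meripaterson/stock-exchanges.parquet' \
|             LIMIT 5;"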
| theLiminator wrote:
| Nice!
| RobinL wrote:
| Totally agree! So much that I wrote a whole article about it
| a while back!
|
| "Why parquet files are my preferred API for bulk open data"
| https://www.robinlinacre.com/parquet_api/
| swader999 wrote:
| Parquet is a level better than csv but difficult to get
| customers to adopt and transmit in that format.
| Telemakhos wrote:
| I think a lot of people in this thread are glossing over a
| difference in definitions. Some people see "API" as "how to get
| data, even if that means all data with no filtering."
| Personally, I regard downloading an entire table as downloading
| a data model without any logic operating on the model, and an
| API as logic that returns only part of the model filtered in
| some way that interests me.
| pixl97 wrote:
| > I hope nobody would ever write something like that into an
| app...
|
| I have got some bad news for you...
|
| Not directly API related, but I remember supporting some land
| management application, and a new version of their software
| came out. Before that point it was working fine on our slow
| satellite offices that may have been on something like ISDN at
| the time. New version didn't work at all. The company said to
| run it on an RDP server.
|
| I thought their answer was bullshit and investigated what they
| were doing. One particular call, for no particular reason, was
| doing a 'SELECT * FROM sometable'. There were many other calls
| that were using proper SQL select clauses in the same
| execution.
|
| I brought this up with the vendor and at first they were
| confused as hell about how we could even figure this out.
| Eventually they pushed out a new version with a fixed call that
| was usable over slow-speed lines, but for hell's sake, how
| could they not figure that out in their own testing, and
| instead push customers to expensive solutions?
| gray_-_wolf wrote:
| > how could they not figure that out in their own testing
|
| This one is easy. Testing with little data on fast network
| (likely localhost).
| codetrotter wrote:
| Also, if there are any ORMs involved it could be that it's
| not immediately obvious from their code itself that this
| would happen.
|
| I've seen code that was using an ORM, where they needed to
| find some data that matched certain criteria. With plain
| SQL it would have resulted in just a few rows of data. But
| instead, with their use of the ORM, they ended up selecting
| all rows in the table from the database, and then looping
| over the resulting rows in code.
|
| The result of that was that the code was really slow to
| run, for something that would've been super fast if it
| wasn't for the way they used the ORM in that code.
| mrighele wrote:
| I recently saw an example of that: a REST call
| returning a few KB of data that would fetch a few million
| rows from the database and use 10+ GB of memory
| (unfortunately some people think that you should always
| use join fetches with JPA...).
| 1vuio0pswjnm7 wrote:
|           curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
|             | gunzip \
|             | sqlite3 ':memory:' '.import /dev/stdin stdin' \
|                 "select Date from stdin order by USD asc limit 1;"
|
| SQLite can read and write zip files.
|
| https://sqlite.org/zipfile.html
|
| Is it possible to use sqlite3 instead of gunzip for
| decompression?
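|
| A sketch that may work (assuming a sqlite3 shell built with the
| zipfile and fileio extensions, which many are; zipfile() needs
| a seekable file, so the zip is saved locally first):
|
|           curl -sO https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
|           sqlite3 ':memory:' \
|             '.output /dev/null' \
|             "SELECT writefile('eurofxref-hist.csv', data)
|                FROM zipfile('eurofxref-hist.zip');"
|           sqlite3 -csv ':memory:' '.import eurofxref-hist.csv rates' \
|             'select Date from rates order by USD asc limit 1;'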
| hamachi4455 wrote:
| [dead]
| shortrounddev2 wrote:
| The site apparently serves .csv directly as well (replace .zip
| with .csv). I used this to convert the expression to powershell
| (using only built-in powershell cmdlets and commands):
|           $> (invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv").Content |
|                ConvertFrom-Csv | sort "usd" | select "date" -first 1
|
|           Date
|           ----
|           2000-10-26
|
| Doing it with a zip file would be a little more verbose since
| there is no built-in "gunzip"-type command which operates on
| streams, but you can write one which does basically that out of
| built-in .NET functions:
|
|           function ConvertFrom-Zip {
|               param(
|                   [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true)]
|                   [byte[]]$Data
|               )
|               process {
|                   $memoryStream = [System.IO.MemoryStream]::new($Data)
|                   $zipArchive = [System.IO.Compression.ZipArchive]::new($memoryStream)
|                   $outputStreams = @()
|                   foreach ($entry in $zipArchive.Entries) {
|                       $reader = [System.IO.StreamReader]::new($entry.Open())
|                       $outputStreams += $reader.ReadToEnd()
|                       $reader.Close()
|                   }
|                   $zipArchive.Dispose()
|                   $memoryStream.Dispose()
|                   return $outputStreams
|               }
|           }
|
| and call it like:
|
|           # unary "," operator required to have powershell
|           # pipe the byte[] as a single argument rather
|           # than piping each byte individually to
|           # ConvertFrom-Zip
|           $> ,(invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip").Content |
|                ConvertFrom-Zip | ConvertFrom-Csv |
|                sort "usd" | select "date" -first 1
|
|           Date
|           ----
|           2000-10-26
|
| I love powershell
| llimllib wrote:
| Here it is in nushell:
|
|           /tmp> # be kind to the server and only download the file if it's updated
|           /tmp> curl -s -o /tmp/euro.zip -z /tmp/euro.zip https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
|           /tmp> unzip -p /tmp/euro.zip | from csv | select Date USD | sort-by USD | first
|           +------+------------+
|           | Date | 2000-10-26 |
|           | USD  | 0.83       |
|           +------+------------+
|
| (I removed the pipe to gunzip because 1. gunzip doesn't work
| like that on mac and 2. it's not something you should expect to
| work anyway; zip files are a container format, and unzipping
| them can't normally be streamed.)
| shortrounddev2 wrote:
| > be kind to the server and only download the file if it's
| updated
|
| I wonder if their webserver supports the If-modified-since
| http header
| sltkr wrote:
| You can check with `curl -v`.
|
| tl;dr: the server doesn't support that header, but since
| the response does include a Last-Modified header, curl
| helpfully aborts the transfer if the Last-Modified date is
| the same as the mtime of the previously downloaded file.
| NelsonMinar wrote:
| Simple file downloads and CSV files are fantastic. I wish more
| folks would publish data in simple formats like this. Every time
| I have to fill a "shopping cart" for a US government data
| download I die a little.
|
| There's a bunch of wrapper tools to make this particular pipeline
| easier. Also something like Datasette is great if you want a web
| view and some fancier features.
| username_my1 wrote:
| it blows my mind that you can use sqlite with csv as input and
| then query it. It sounds so logical and useful, yet I never
| came across it.
|
| we have lots of reporting in CSV; can't wait to start using
| this to run queries quickly
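|
| A minimal sketch to start from (the file and column names here
| are made up):
|
|           sqlite3 -csv ':memory:' \
|             '.import reports/2023-08.csv report' \
|             'select Department, sum(Salary) from report group by Department;'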
| dima55 wrote:
| Or the vnlog tools can do this. There are many ways to do
| data processing on the commandline now.
| whartung wrote:
| What's really interesting about it is that Awk is now,
| finally, getting support for CSV. But I bet a large amount of
| Awk+CSV use cases can be met with SQLite and SQL+CSV.
| nuc1e0n wrote:
| AWK's new CSV and UTF-8 support is great, but when querying
| data I think in terms of SQL.
| wiredfool wrote:
| Look at duckdb. Queries against CSV, parquet, JSON, locally or
| via HTTP. It's like SQLite, but faster and better.
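|
| A sketch of the article's query via the duckdb CLI (using the
| uncompressed .csv variant of the ECB url; remote reads need the
| httpfs extension):
|
|           duckdb -c "INSTALL httpfs; LOAD httpfs; \
|             SELECT Date FROM read_csv_auto('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv') \
|             ORDER BY USD ASC LIMIT 1;"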
| shortrounddev2 wrote:
| You should check out powershell; it supports converting CSV
| into in-memory structured data, and then you can run regular
| powershell queries on that data:
|
|           $> $csvData = @"
|           Name,Department,Salary
|           John Doe,IT,60000
|           Jane Smith,Finance,75000
|           Alice Johnson,HR,65000
|           Bob Anderson,IT,71000
|           "@
|
|           $> $csvData | ConvertFrom-Csv | Select Name, Salary |
|                Sort Salary -Descending
|
|           Name          Salary
|           ----          ------
|           Jane Smith    75000
|           Bob Anderson  71000
|           Alice Johnson 65000
|           John Doe      60000
|
| You can also then convert the results back into CSV by piping
| into ConvertTo-Csv:
|
|           $> $csvData | ConvertFrom-Csv | Select Name, Salary |
|                Sort Salary -Descending | ConvertTo-Csv
|
|           "Name","Salary"
|           "Jane Smith","75000"
|           "Bob Anderson","71000"
|           "Alice Johnson","65000"
|           "John Doe","60000"
| llimllib wrote:
| nushell does too:
|
|           /tmp/> "Name,Department,Salary
|             ::: John Doe,IT,60000
|             ::: Jane Smith,Finance,75000
|             ::: Alice Johnson,HR,65000
|             ::: Bob Anderson,IT,71000" |
|             ::: from csv |
|             ::: select Name Salary |
|             ::: sort-by -r Salary
|           +---+---------------+--------+
|           | # | Name          | Salary |
|           +---+---------------+--------+
|           | 0 | Jane Smith    | 75000  |
|           | 1 | Bob Anderson  | 71000  |
|           | 2 | Alice Johnson | 65000  |
|           | 3 | John Doe      | 60000  |
|           +---+---------------+--------+
| shortrounddev2 wrote:
| Crazy how similar the commands are.
| marzell wrote:
| for something a bit more robust, check out DuckDB. It's a
| library you can embed, use it to run SQL on local files
| as well as connect to databases, do joins, analytics,
| etc.
| scumola wrote:
| Agreed. The article mentioned duckdb and I'm here to
| thumbs-up the use of DuckDB wholeheartedly. If you like
| the world of public CSV files as data sources that you
| can query or cross-query, duckdb is the tool for you.
| Just follow the demo on the duckdb website and you'll be
| wow'd for sure.
| rzmk wrote:
| qsv (https://github.com/jqnatividad/qsv) also has a sqlp
| command which lets you run Polars SQL queries on CSV(s).
|
| Here I'll:
|
|   - Send the csv data from stdin (using echo, referred to in
|     the command by -).
|   - Refer to the data in the query by stdin. You may also use
|     the _t_N syntax (first table is _t_1, then _t_2, etc.), or
|     the file name itself before the .csv extension if we were
|     using files.
|   - Pipe the output to the table command for formatting.
|   - Note the shape of the result is printed to stderr (the
|     (4, 2) below).
|
|           $ echo 'Name,Department,Salary
|           John Doe,IT,60000
|           Jane Smith,Finance,75000
|           Alice Johnson,HR,65000
|           Bob Anderson,IT,71000' | qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' | qsv table
|           (4, 2)
|           Name           Salary
|           Jane Smith     75000
|           Bob Anderson   71000
|           Alice Johnson  65000
|           John Doe       60000
| freedude wrote:
| Perl with the right modules makes data-janitor situations
| simple.
| fiddlerwoaroof wrote:
| The clickhouse-local tool is also really great for querying a
| bunch of systems, and it has connectors and converters for a
| whole bunch of other systems and formats.
| WorldMaker wrote:
| You might find a lot of interesting tools in the Datasette
| ecosystem. Data dashboarding for SQLite with all sorts of
| import and export and visualization plugins.
|
| https://datasette.io/
| j0hnyl wrote:
| Have you used duckdb? It's great for that.
| x86x87 wrote:
| Lol. Tell me you never had to parse CSV files without telling
| me.
|
| CSV files can be a nightmare to work with, depending on where
| they come from and the various liberties taken when generating
| or reading the file.
|
| Use a goddamn battle-tested library, people, and don't reinvent
| the wheel. /oldman rant over
| thefurdrake wrote:
| Hey, CSV is hard, guys.
|
| I've found template injection in a CSV upload before because
| they didn't anticipate a doublequote being syntactically
| relevant or something.
|
| It was my job to find these things and I still felt betrayed
| by a file format I didn't realize wasn't just comma separated
| values only.
| wavemode wrote:
| Yes, you eventually realize the hard way that "CSV" is
| actually a blanket of various similar formats, with different
| rules and conventions. The way one program outputs CSV's may
| be completely different from another.
| nuc1e0n wrote:
| It's not as bad as all that. There are some gotchas, sure, but
| you can cover them all with about 200 lines of code.
|
| However, I would recommend using a tested library to do the
| parsing, sqlite for example, rather than rolling your own.
| Unless you have to of course.
| jhwhite wrote:
| clinicaltrials.gov lets you save a search term then download
| the results as a zip, with an XML file inside for each trial
| in the search results.
|
| One of the first things I played around with was using Python
| to get that file, unzip it, then iterate through the XML files,
| grabbing the info I wanted and putting it into ElasticSearch to
| make it searchable, then putting an Angular front end on it.
|
| I used to have it published somewhere but I think I let it all
| die. :(
| wefarrell wrote:
| One nice thing about CSV files being zipped and served via the
| web is they can be streamed directly into the database
| incredibly fast without having to persist them anywhere (aside
| from the db).
|
| You can load the zip file as a stream, read the CSV line by
| line, transform it, and then load it to the db using COPY FROM
| stdin (assuming Postgres).
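|
| A sketch of that pipeline for the ECB file (the rates table is
| hypothetical and must already exist; funzip unpacks the zip's
| single member from the stream):
|
|           curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
|             | funzip \
|             | psql -c "\copy rates from stdin with (format csv, header)"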
| heavenlyblue wrote:
| That doesn't sound like an amazingly safe idea
| dambi0 wrote:
| What specific risks do you foresee with this approach?
| diroussel wrote:
| Seem totally fine to me. As long as you can rollback if
| the download is truncated or the crc checksum doesn't
| match.
| berkes wrote:
| It isn't. But that's easily mitigated with temp tables, an
| ephemeral database, COPY, etc.
|
| Upstream can easily f-up and (accidentally) delete production
| data if you do this on a live db. Which is why PostgreSQL and
| nearly all other DBs have a myriad of tools for solving this by
| not doing it directly on a production database.
| eastbound wrote:
| If you feel risky, try a Foreign Data Wrapper ;)
| thibaut_barrere wrote:
| Definitely, it is much easier to stream CSV than, say, JSON or
| XML (even if JSONL/SAX parsers exist, etc).
| mhh__ wrote:
| Someone posted me a pdf in the last year! Online even! (But
| posted!)
| digging wrote:
| Can you explain what the shopping cart is?
|
| I mean... do they make you select one or more files, then
| navigate to another page to download your selected files?
| pininja wrote:
| Pretty much. USGS's EarthExplorer requires a login just to
| download free files https://earthexplorer.usgs.gov/
|
| There are other ways to access the data on here, but they're
| fragmented. It's nicely organized here so it's a bummer they
| make it hard to programmatically retrieve files once you find
| what you're looking for.
| RosanaAnaDana wrote:
| Quite literally yes. There are often multiple click throughs.
| Every Department, agency, sub-agency, all the way down to
| federal dog catcher has decided the most important thing is
| to invent a new way of getting data from them.
| imchillyb wrote:
| Solutions vary as the budgets and talent assigned the
| projects do.
|
| Federal, State, local and hyper local solutions cannot be
| the same unless the financier is also the same.
| crizzlenizzle wrote:
| Sounds like Germany's company register[1].
|
| You can search for companies, select the documents you'd like
| to see (like shareholder lists), then you go through a
| checkout process and pay 0 EUR (used to be like a few euros
| years ago), and then you can finally download your file.
| Still a super tedious process, but at least it's free
| nowadays.
|
| [1] https://www.unternehmensregister.de/ureg/
| rewmie wrote:
| > Simple file downloads and CSV files are fantastic. I wish
| more folks would publish data in simple formats like this.
|
| The document format doesn't seem to have much to do with the
| problem. I mean, if the CSV is replaced by a zipped JSON doc
| then the benefits are the same.
|
| > Every time I have to fill a "shopping cart" for a US
| government data download I die a little.
|
| Now that seems to be the real problem: too many hurdles in the
| way of a simple download of a statically-served file.
| dividedbyzero wrote:
| > if the CSV is replaced by a zipped JSON doc then the
| benefits are the same.
|
| Being able to use things like jq and gron might make simple
| use cases extremely straightforward. I'm not aware of
| anything similarly nimble for CSV.
| BeefWellington wrote:
| csvtool is probably what you're looking for, though I think
| the use case for JSON vs CSV is different as one is
| hierarchy-oriented and the other is flat.
| darkwater wrote:
| You don't need gron with CSV; normal grep will work wonders.
| two_handfuls wrote:
| For CSV? DuckDB, Datasette, awk, pawk, and tons of others.
| msla wrote:
| CSV would be great if there were _one_ CSV, and if you
| absolutely guarantee that nobody has "improved" the CSV by
| editing it with a spreadsheet program (including opening it
| with a spreadsheet program) or their own deft little fingers in
| a text editor.
|
| For example:
|
|           "Look, this contains \"quotes\"!",012345
|
| Or:
|
|           "Look, this contains ""quotes""!",012345
|
| Or, for some degenerate examples:
|
|           "Look, this contains "quotes"!",012345
|
| Or:
|
|           Look, this contains "quotes"!,012345
|
| Or the spoor of a spreadsheet:
|
|           "Look, this contains ""quotes""!",12345
|
| Theoretically, JSON isn't immune to being hand-hacked into a
| semi-coherent mess. In practice, people don't seem to do that
| to JSON files, at least not that I've seen. Ditto number
| problems, in that in JSON, serial numbers and such tend to be
| strings instead of integers a "helpful" application can lop a
| few zeroes off of.
| AshamedCaptain wrote:
| > CSV would be great if there were one CSV, and if you
| absolutely guarantee that nobody has "improved" the CSV by
| editing it with a spreadsheet program (including opening it
| with a spreadsheet program)
|
| Practically no formats actually pass those rules. Even plain
| text is bound to be "improved" by text editors frequently
| (normalization of line endings, removal of data not in a known
| encoding, UTF BOMs, UTF normalization, etc.)
|
| Just don't do that.
| nuc1e0n wrote:
| Stick to files conforming to RFC-4180 then
| ianburrell wrote:
| JSONL should replace CSV. It is standardized and the escapes
| are mostly well specified. It is effectively CSV with "[" and
| "]" surrounding each line.
|
| Regular JSON would work fine for a static file, and would make
| schemas and links (JSON-LD) possible. But then the file could
| be any structure. JSONL works better for systems that assume
| line-based records, and is more likely to have consistent,
| simple records.
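|
| For comparison, the article's lookup over a hypothetical JSONL
| version of the data (rates.jsonl and its Date/USD keys are made
| up; jq -s slurps the lines into one array):
|
|           jq -s 'min_by(.USD) | .Date' rates.jsonl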
| pbreit wrote:
| Pretty much all data eventually ends up in 2 dimensions (rows &
| columns) so all these complicated data models are just mostly
| complicating things.
| thibaut_barrere wrote:
| Agreed! People are surprised but CSV files (while I would not
| use them for everything) work great in low-tech environment,
| for instance, or when files must be produced by many different
| actors with different technical levels.
|
| They also work nicely as an interop between different stacks
| (Cobol <-> Ruby).
|
| One concrete example is the French standard to describe
| Electrical Vehicles Charge Points, which is made of 2 parts
| (static = position & overall description, dynamic = current
| state, occupancy etc). Both "files" are just CSV files:
|
| https://github.com/etalab/schema-irve
|
| Both sub-formats are specified via TableSchema
| (https://frictionlessdata.io/schemas/table-schema.json).
|
| Files are produced directly by electrical charge points
| operators, which can have widely different sizes &
| technicality, so CSV works nicely in that case.
| rjh29 wrote:
| I worked for a company that used TSV for huge datasets. It
| was extremely efficient, far more so than any database.
| xp84 wrote:
| TSV is criminally underrated compared to CSV.
| chankstein38 wrote:
| I don't know if this is the cart you're talking about but the
| USGS does this for LiDAR data and yeah I'm with you I die a
| little every time I use that site. I love that the data is
| available but why in the world do we have to cart everything?
| Just give me a one click download
| stormfather wrote:
| I've never stopped to question the absurdity of the gov data
| shopping cart thing. WHY!?!? Is there some justification for
| that?
| phatskat wrote:
| Bureaucracy = $$$
| nerevarthelame wrote:
| Who is making money with shopping cart-style downloads?
| [deleted]
| redavni wrote:
| The web development companies that are subcontracted by
| the government agencies to repurpose their silly shopping
| cart software.
|
| I will decline to share my personal anecdotes about these
| companies because I am like 10+ years out of date, but I can
| tell you that most of these companies seemed to have certain
| very specific non-technical things in common.
| NelsonMinar wrote:
| I assume it's a vestige from the old days when you ordered
| actual tapes or printouts from government agencies. The
| notion of an order that had to be prepared is baked into how
| they think about the service and product.
|
| All sorts of strange things happen with accessing US
| government data. But most agencies have a lot of excellent
| data available for free and motivated data scientists who
| want to make it available to you.
| icyfox wrote:
| I can almost assure you it was an explicit requirement in an
| RFP that was copied from some master template. So not a good
| justification - but a justification in the eyes of the
| bureaucracy.
|
| The book Recoding America has a lot of anecdotes to this
| effect; most of these situations reduce to a Congressional
| mandate that got misinterpreted along the way. My favorite
| was for an update to Social Security. The department in
| charge of the implementation swore that Congress was forcing
| them to build a facebook for doctors (literally where doctors
| could friend other doctors and message them). Congress had no
| such intention; it was actually 3rd party lobbying that
| wanted the requirement so they could build their own solution
| outside of government. Really crazy stuff.
| reverius42 wrote:
| > Congress had no such intention; it was actually 3rd party
| lobbying
|
| Right, but 3rd party lobbying can't force anyone to do
| anything, whereas Congress can (and did) give this mandate
| the force of law. The fact that lobbyists got Congress to
| do something that they had "no such intention" to do is its
| own problem, but let's not lose sight of who is responsible
| for laws.
| icyfox wrote:
| That's the interesting part of this story; Congress
| didn't think this requirement existed, neither did the
| lobbyists. But the language that congress adopted (with
| the consultation of this lobbying group) made the
| agencies _think_ it was what congress wanted. So the
| agency was left holding the bag for something no one
| actually wanted in the first place. Like a big game of
| telephone.
|
| I agree with your broader point however. Congress needs
| to do a better job of owning outcomes.
| d1sxeyes wrote:
| Probably because it's the out-of-the-box functionality on
| ServiceNow or whatever tool they're using.
| btown wrote:
| Probably made by the same people who made
| https://pacer.uscourts.gov/pacer-pricing-how-fees-work ...
|
| (As a side note, I can understand why in years past it would
| cost multiple cents per page to physically photocopy a federal
| document - but it is absolutely absurd that _already-digitized_
| documents, documents which are fundamentally part of the
| precedent that decides whether our behavior does or doesn't
| cause civil or criminal liability, are behind a paywall for a
| digital download!)
| dmvdoug wrote:
| Pacer physically hurts to use. They should have to pay us
| for having to use it.
| RicoElectrico wrote:
| My guess would be to gather information about who is using
| the data - and present it to the stakeholders.
|
| Sometimes they send questionnaires to data consumers.
| jstarfish wrote:
| Some records are only released for a fee, so I always assumed
| implementing a standard interface to handle commercial
| transactions and making select items free was easier than
| maintaining separate commercial/noncommercial sites.
| dTal wrote:
| Which in itself is fairly antidemocratic.
| ics wrote:
| I for one was tickled the first time I paid an NYC parking
| ticket online and had to add it to my cart, as if they might
| upsell me on some other more serious violation. Act now
| before it's too late!
| dkarl wrote:
| When you're paying for things, and can pay for several
| things at the same time, it makes sense. I helped my mother
| pay her property taxes this year, and for two properties we
| had to make four payments. Without the shopping cart (or a
| functional equivalent) I would have had to enter my payment
| information four times instead of once.
| yard2010 wrote:
| And if you don't like the service you have free returns for
| 14 days no questions asked
| lovasoa wrote:
|       > curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
|           | gunzip \
|           | sqlite3 ':memory:' '.import /dev/stdin stdin' \
|               "select Date from stdin order by USD asc limit 1;"
|       Error: in prepare, no such column: Date (1)
|
| There is a typo in the example (that is not in the screenshot):
| you need to add a -csv argument to sqlite.
| calpaterson wrote:
| This is odd, I did originally have that argument but I removed
| it because it didn't seem to matter - it works without ("On My
| Machine"(tm))
|
| Erk - readding it and busting the cache. After I put my kids to
| bed I will figure out what is wrong
|
| EDIT: the reason it works for me is because I've got this
| config in ~/.sqliterc:
|
|           .separator ','
|
| Apparently at some point in the past I realised that I mostly
| insert csv files into it and set that default.
| c7b wrote:
| Generally curious - any particular reasons you chose sqlite
| and gnuplot for this task rather than, say, Python?
| calpaterson wrote:
| I do also use Python (pandas) partway down the page :) I'm
| just trying to show different tools to give people ideas
| about how much is possible with stuff that is already in
| /usr/bin/
|
| If you were to ask my own tool preferences which I use for
| $DAYJOB: pandas for data cleaning and small data because I
| think that dataframes are genuinely a good API. I use SQL
| for larger datasets and I am not keen on matplotlib but
| still use it for graphs I have to show to other people.
| taldo wrote:
| A very simple optimization for those complaining about having to
| fetch a large file every time you need a little datapoint: if
| they promised the file was append-only, and used HTTP
| gzip/brotli/whatever compression (as opposed to shipping a zip
| file), you could use range requests to only get the new data
| after your last refresh. Throw in an extra checksum header for
| peace of mind, and you have a pretty efficient, yet extremely
| simple incremental API.
|
| (Yes, this assumes you keep the state, and you have to pay the
| price of the first download + state-keeping. Yes, it's also
| inefficient if you just need to get the EUR/JPY rate from
| 2007-08-22 a single time.)
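|
| A sketch of the incremental fetch under those assumptions
| (append-only file and range support, neither of which the ECB
| promises; stat -c is the GNU form, macOS wants stat -f %z):
|
|           offset=$(stat -c %s eurofxref-hist.csv 2>/dev/null || echo 0)
|           curl -s -r "${offset}-" \
|             https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv \
|             >> eurofxref-hist.csv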
| acqq wrote:
| Also, on the topic of range requests, when a server allows the
| range requests for zip files, the zip files are huge and one
| needs just a few files from inside, one can actually download
| just the "central directory" and the compressed data of the
| needed files without downloading the whole zip file:
|
| https://github.com/gtsystem/python-remotezip
| calpaterson wrote:
| Absolutely! I have a plan for a client lib that uses ETags (+
| other tricks) to do just that.
|
| Very WIP but check out my current "research quality" code here:
| https://pypi.org/project/csvbase-client/
| jakswa wrote:
| I'll throw out that the GTFS transit standard involves
| publishing zipped CSVs representing your DB tables regularly.
| There are nice things about it and downsides, IMO. This is how
| Google Maps' transit directions function -- they rely on
| agencies to publish their schedules regularly in zipped CSVs, I
| think.
|
| One downside is that dev experience is pretty bad in my opinion.
| It took me years of fiddling in my free time to realize that if
| you happen to try to use the CSVs near a schedule change, you
| don't know what data you're missing, or that you're missing data,
| until you go to use it. My local agency doesn't publish
| historical CSVs so if you just missed that old CSV and need it,
| you're relying on the community to have a copy somewhere.
| Similarly, if a schedule change just happened but you failed to
| download the CSV, you don't know until you go to match up IDs in
| the APIs.
| tomp wrote:
| Interesting.
|
| As someone with a lot of data experience, and in particular
| including financial data (trading team in a hedge fund) I
| definitely prefer the _column_ format where each currency is its
| own column.
|
| That way it's very easy to filter for only the currencies I
| care about, and common data processing software (e.g. Pandas
| for Python) natively supports columns, so you can get e.g. the
| USDGBP rate simply by dividing two columns.
|
| The biggest drawback of `eurofxref-hist` IMO is that it uses EUR
| as the _numeraire_ (i.e. EUR is always 1), whereas most of the
| finance world uses USD. (Yeah I know it's in the name, it's
| just that I'd rather use a file with USD-denominated FX rates,
| if one was available.)
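|
| The column format does make cross rates a one-liner, though,
| e.g. GBP per USD by dividing the two EUR-based columns (a
| sketch using the uncompressed .csv variant):
|
|           curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv \
|             | sqlite3 -csv ':memory:' '.import /dev/stdin rates' \
|                 'select Date, GBP/USD from rates limit 3;'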
| andylynch wrote:
| But why would the ECB publish its Euro fixings in some other
| currency?
| fastasucan wrote:
| Are you really asking for the european central bank to not use
| its own currency?
| dist-epoch wrote:
| Not sure what you mean by "most of the finance world", but in
| forex trading the EUR/USD rate is denominated exactly as in
| this file - 1.066 for today and not the inverse 0.938.
| tomp wrote:
| right, I mean quant trading - i.e. where people actually use
| data :D you want things relatively standardised, which means
| using a common unit of account - for most funds AFAIK that's
| USD
| babblingfish wrote:
| I believe that curl does not have client-side caching, so every
| time you run the command it downloads the CSV. Downloading the
| CSV once and then analyzing it would no longer make it a magic
| one-liner you can send to people, but it would save bandwidth
| and reduce traffic on the API.
|
| Unless there is caching going on here? Perhaps a CDN cache on the
| server side?
| RyanHamilton wrote:
| csvbase looks really good. Nice landing page, well written docs.
| Great work shipping.
| gigatexal wrote:
| Yup! I built a pipeline to grab this and create a historical
| table of rates on GCP. It was a fun little project.
| gkfasdfasdf wrote:
| How about caching the zip file and adding a '-z' to the curl
| command to only download if newer.
| paradox460 wrote:
| If this is his favorite API, he should check out patent office
| gazettes
| albertzeyer wrote:
| I get:
|
|           gunzip: unknown compression format
| lucb1e wrote:
| _Your comment was closed with reason: Missing steps to
| reproduce_
| Waterluvian wrote:
| For little problems where you can just download the entire
| database every time and do read-only work, never underestimate
| the value of keeping it simple! I really like SQLite for this
| because it has the portability of a .json or .csv file but it's
| more ready to interact with as a database.
| semi-extrinsic wrote:
| Just use clickhouse-local, and you can interact with any old
| CSV file as if it was a database.
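|
| A sketch against the ECB file (assuming it has been downloaded
| and unzipped locally):
|
|           clickhouse-local --query \
|             "SELECT Date FROM file('eurofxref-hist.csv', 'CSVWithNames') \
|              ORDER BY USD ASC LIMIT 1"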
| mmcnl wrote:
| Great read. I didn't know it's so easy to plot something in the
| terminal. Also really shows the power of 2D data structures. A
| good long format and you can do anything.
___________________________________________________________________
(page generated 2023-09-15 23:00 UTC)