[HN Gopher] Flat Data
___________________________________________________________________
Flat Data
Author : idan
Score : 179 points
Date : 2021-05-18 17:21 UTC (5 hours ago)
(HTM) web link (octo.github.com)
| FemmeAndroid wrote:
| The really interesting thing about this to me is that if this
| wasn't being put out via GitHub, I would have dismissed it as
| being potentially against the TOS or abuse of GitHub's free
| service. But with them putting it out, I'm quite interested in
| reevaluating my use cases for GitHub.
| gerner wrote:
| See the comment from @jasoncwarner about GitHub actions being a
| platform for much more than CI.
|
| I wonder how far that extends to non-GitHub provided services.
| For instance, could we leverage GitHub actions, perhaps even
| Flat Data, to scrape some web site and store it (perhaps
| uploading elsewhere) in a more comprehensive way vs. storing
| some small snippet of the data in a git repo?
| VWWHFSfQ wrote:
| you mean like in a database
| gerner wrote:
| Yes. Or S3 bucket, or whatever. The thing I'm getting at
| is, can we use GitHub actions for application tasks like
| web scraping that need compute and network access, but that
| don't really do much with a git repo. Does GitHub want
| to support that?
| yNeolh wrote:
| Interesting. It's not an official product from GitHub, but I love
| the idea and that they're upfront about their inspiration from
| Simon, a really interesting person to follow. I love his
| investment in Datasette, SQLite utils and Django.
|
| As for Git scraping: although I think the idea is awesome, I
| thought it was against GitHub Actions rules, or at the very least
| on the edge. So I don't know what GitHub's position on this is,
| as this is not an official thing from them, but this gives me
| positive vibes.
| simonw wrote:
| Same here! I was reasonably confident that Git scraping was
| within the boundaries of GitHub Actions supported use-cases, but
| it did always feel a little bit on the edge. This is fantastic
| confirmation that it's a supported technique.
| ellimilial wrote:
| Very interesting how GitHub keeps coming up with more and more
| interesting 'actions' to turn repos into 'platforms', moving us
| closer to a serverless future.
|
| @idan how does it scale with the size (including storage)? Is 'a
| billion rows' a goal or an actual tested use case?
| jasoncwarner wrote:
| Hi! Jason, CTO @ GitHub, here
|
| You're getting at the heart of Actions. Actions was never
| intended to be "CI" or any such vertical capability. It has
| always been intended to be a platform that exposes capabilities
| like CI or packages etc. out to the world, but the underlying
| serverless, very flexible workflow platform is the bedrock upon
| which we want to build the future.
|
| My long-held view is that the only real 'competitor' to what I
| want github to be is AWS/major cloud infra companies, and if
| you believe in that view along with me, you likely see why the
| past four years of github and the next few years of github make
| a lot of sense.
|
| And it even makes more sense when you squint just a bit and
| realize what codespaces + repos + actions (CI/security/packages
| + other things) + automated workflows would eventually do. Now
| imagine a bit further out into the future and what it would
| mean if we understood your production workloads a bit more
| ellimilial wrote:
| Hi Jason, thank you very much for the background and the
| explanation. It is fascinating to see the progress in this
| direction.
|
| I started raising my eyebrow (in the best possible sense)
| upon seeing parts of tooling very similar to ours but simpler
| and more importantly - without moving parts. We operate in
| biomedical data space and deal with flat/static data a lot,
| for example we power https://biokeanos.com with data-in-repo,
| so Flat Data was immediately interesting.
|
| It is really inspiring to see GitHub Actions making a foray in
| this direction; definitely something to keep an eye on.
| eksabajt wrote:
| It's storing the files in the repository which has a file size
| limit of 100MB. I think the repositories themselves have a soft
| limit of 5GB and a hard limit of 100GB.
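|
| The per-file limit mentioned above can be checked before a
| scheduled job commits anything. This is only a rough sketch: the
| 100MB figure is taken from the comment, reading "MB" as 2**20
| bytes is an assumption, and `oversized_files` is a hypothetical
| helper, not part of Flat Data or GitHub's tooling.

```python
import os

# Limit quoted in the comment above; an assumption, not an official constant.
MAX_FILE_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=MAX_FILE_BYTES):
    """Return sorted relative paths under `root` whose size exceeds `limit` bytes."""
    too_big = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; only working-tree files matter here.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > limit:
                too_big.append(os.path.relpath(path, root))
    return sorted(too_big)
```

| A job could call `oversized_files(".")` and skip the commit (or
| shrink the data first) whenever the returned list is non-empty.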
| idan wrote:
| It doesn't scale! This isn't a replacement for databases.
|
| Our take on this is about "working sets" of data -- if you have
| billions of rows, that's a lot bigger than a working set! At
| some point, you have to query, filter, and aggregate to get
| your data down to a chewable size for work.
|
| You can do that in your code too, and sometimes that's
| absolutely the right approach! But often it's easier to push
| that work to "outside your code," and that is what Flat is
| great for.
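|
| To make the "working set" idea concrete, here is a small sketch
| of the filter-and-aggregate step done outside application code.
| The column names in the usage note are invented for illustration,
| and `summarize` / `to_csv` are hypothetical helpers, not Flat
| Data APIs.

```python
import csv
import io
from collections import defaultdict


def summarize(rows, key, value):
    """Drop rows with a blank `value`, then total `value` per `key`."""
    totals = defaultdict(float)
    for row in rows:
        if row[value] == "":
            continue  # filter: skip incomplete rows
        totals[row[key]] += float(row[value])  # aggregate
    return dict(totals)


def to_csv(totals, key, value):
    """Render the summary as a small CSV string suitable for committing."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([key, value])
    for k in sorted(totals):
        writer.writerow([k, totals[k]])
    return out.getvalue()
```

| For example, feeding `csv.DictReader` rows with hypothetical
| `region` and `sales` columns through `summarize` and then
| `to_csv` yields one line per region instead of one per sale.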
| ellimilial wrote:
| Thank you for the response and for clearing up the 'billion
| rows' / surly bonds confusion I had from reading the project's
| "Why Flat Data?" section. I think I understand the target use
| case slightly better now.
|
| One of the strong arguments for object-like storage (S3 etc)
| in the context of plain / flat data is scalability and
| availability for large scale processing frameworks. Databases
| are only occasionally relevant.
| danso wrote:
| As someone who has written so much boilerplate data-collection
| code (i.e. scripts that I cron on my local repo, then push to
| Github), this is really incredible. I've been really impressed
| with what Simon W. has shown off with Github Actions but hadn't
| yet felt compelled enough to dive in and learn the
| conventions...but this looks like a great entry point.
|
| Don't know if this is the place to report bugs, but I was trying
| the github>>flatgithub data viewer trick on an old repo that has
| a name of `white_house_salaries`.
|
| My data subdirectories have several files named
| _white_house_salaries.csv_ -- e.g.
| _data/wrangled/white_house_salaries.csv_ is the "finished" version.
| However, visiting that file in flatgithub.com gives me a "No
| valid data" error:
|
| https://flatgithub.com/storydrivendatasets/white_house_salar...
|
| I get the same error when visiting
| _data/fused/white_house_salaries.csv_.
|
| However, when I rename the file to something other than
| "white_house_salaries.csv", like
| _data/wrangled/white_house_salaries_wrangled.csv_, it works as
| expected:
|
| https://flatgithub.com/storydrivendatasets/white_house_salar...
|
| I'm guessing there must be some issue with the data filename
| (white_house_salaries.csv) sharing the same name as the repo
| (storydrivendatasets/white_house_salaries)?
| rothenbizzle wrote:
| Hey there! Matt from the DevEx team here. Apologies for the
| lack of polish - I _think_ the issue here is that the
| flatgithub.com URL only works when you specify the repo owner
| and repo name, a la
| https://flatgithub.com/storydrivendatasets/white_house_salar....
|
| It gets confused by all of the other stuff afterward,
| "tree/master/data/wrangled".
|
| Let me know if that gets you sorted!
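|
| If the viewer really does want only the owner and repo name, as
| suggested above, a URL could be trimmed like this. It is a sketch
| under that assumption; `flat_viewer_url` is a hypothetical helper
| and the repo in the usage note is made up.

```python
from urllib.parse import urlparse


def flat_viewer_url(github_url):
    """Reduce a github.com URL to the owner/repo form of a flatgithub.com URL."""
    parts = [p for p in urlparse(github_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected a URL containing an owner and a repo name")
    owner, repo = parts[0], parts[1]
    # Anything after owner/repo (tree/<branch>/<path>) is dropped.
    return f"https://flatgithub.com/{owner}/{repo}"
```

| Passing a hypothetical
| https://github.com/octocat/hello-world/tree/master/data/wrangled
| would give back https://flatgithub.com/octocat/hello-world.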
| whats_spinning wrote:
| How big of data can this handle?
| [deleted]
| trinovantes wrote:
| I once ran a web scraper on an hourly schedule with GitHub
| Actions that wrote to a JSON file in my gh-pages branch and saved
| its results with sh "git commit --amend". Glad to see this
| workflow in a more integrated environment than my janky hack.
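|
| The "write a JSON file, then commit" half of a setup like this
| might look roughly as follows. It is not the commenter's actual
| code: `write_if_changed` is a hypothetical helper, and the git
| step is left to the caller.

```python
import json
import os


def write_if_changed(path, data):
    """Write `data` as stable JSON to `path`; return True only if the file changed."""
    # sort_keys and a fixed indent keep the output byte-stable,
    # so a diff only appears when the scraped data really changed.
    new_text = json.dumps(data, sort_keys=True, indent=2) + "\n"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            if f.read() == new_text:
                return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_text)
    return True
```

| A scheduled job would run its git commands only when this
| returns True, which keeps unchanged scrapes out of the history.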
| gerner wrote:
| I don't know much about Flat Data, but I'm impressed with how
| much GitHub is doing as GitHub since the MSFT acquisition. They
| continue to offer compelling services to developers, and
| increasingly to enterprise customers. All without abandoning much
| of what made GitHub great: a focus on developers and easy to
| access dev productivity.
|
| Notice the prominence of the VSCode integration here. Notice the
| dramatically increased presence of MSFT on GitHub in general. It
| seems like they've managed to integrate these two cultures and
| product-sets in sensible ways. Given how hard big integrations
| like this are to pull off, I feel like the community really
| dodged a bullet in terms of access to products/tools.
| alexander-litty wrote:
| Dodged a bullet for now.
|
| I'm worried this is their embrace-and-extend stage, and the
| extinguish is yet to come.
|
| I truly hate to be pessimistic, and I'm not trying to start a
| flame war. I just can't see this behavior lasting in the long
| run.
| pwdisswordfish8 wrote:
| It's already here; it's just that the userbase and third
| parties are (happily) doing the dirty work for them. Try
| going GitHub-free for a month or three and you'll notice how
| many things rest on the assumption that you have a GitHub
| account.
|
| Look at how it shat on Markdown with what it calls "GitHub
| Flavored Markdown". Look at the things that it calls "wikis".
| Look at how GitHub's PR merge tool junks up the commit log.
| Look at how many projects don't even have a way to accept a
| fix unless you submit it with GitHub's janky pull request
| workflow. Hell, a bug in Netlify's command-line client
| managed to make its way into release versions that would
| straight up cause the process to terminate for bog standard
| "hello world"-style static sites due to unhandled exception
| when cwd was a repo that wasn't hosted on github.com.
|
| The tacit assumption that you're using GitHub is like the
| tacit assumption 15 years ago that you were using Visual
| Studio, and "Log in with GitHub" is essentially what
| Microsoft hoped for with Passport, if Passport had actually
| been successful.
| agency wrote:
| I have no particular love for MSFT but I don't think any of
| the issues you mentioned began after the acquisition.
| pwdisswordfish8 wrote:
| ...so?
|
| They acquired a company that was doing the thing that
| they are wont to do and are criticized for, and have
| poured the significant resources at their disposal into
| growing the circle of impact. Where it originates from
| and whether it was or wasn't already independently in
| full swing (or partial, in this case) before their
| involvement doesn't matter, the effect on the user is the
| same. Besides that, if a person's problem with a given
| practice is whether or not Microsoft is the perpetrator,
| then that person is a hypocrite and doesn't actually give
| a shit about the thing they claim to be concerned
| about.
| gerner wrote:
| Agree, it's important that we keep an eye on things and,
| however we can, hold MSFT and GitHub accountable to keep up
| the good showing.
|
| We've seen new features launched (e.g. this one) long enough
| after the acquisition that much (most? all?) of the work must
| have happened in the post-acquisition environment, so I'm
| optimistic. But I've been wrong before.
| idan wrote:
| The OCTO DevEx team reaaaaaallly loves VS Code -- beyond the
| editor, it's just a great surface for experimental developer
| tooling!
|
| GitHub Codespaces aren't generally available yet, but being
| able to target both "native" VS Code as well as in-browser VS
| Code with the same extension is super powerful. Expect a lot
| more from us on that front.
|
| We've also released a pair of little projects re VS Code
| development that we've extracted from our work:
|
| https://github.com/githubocto/tailwind-vscode: a Tailwind CSS
| plugin which creates Tailwind color tokens for each of the VS
| Code theme colors, easing theme-native styling in VS Code.
|
| https://github.com/githubocto/snowpack-vscode-extension-temp...:
| a VS Code extension template that incorporates the fastest
| toolchain with the wisdom we've accumulated about webview
| development.
| adamcstephens wrote:
| Can you help me get a Codespaces invite? ;)
| duped wrote:
| The monthly downtime during working hours has been getting to
| me lately.
| dataangel wrote:
| ...they reinvented cron? it just commits a file on a timer
| idan wrote:
| Correct! And if you're Simon Willison, this is a super easy
| thing to Just(tm) implement manually.
|
| The point of Flat Data is to push the edges of that bubble
| outwards. Add tooling and examples. Add a viewer. Make the
| "happy path" situations where this is helpful really fast and
| easy.
|
| We're pretty upfront about this not being a major technological
| advance. The difference between a difficult-to-use API and a
| good API is usually just about the mental model. We like this
| mental model, and the kinds of patterns it encourages!
| abuehrle wrote:
| This is really cool! I would have liked to have incorporated this
| into my vaccine appointment slot finder tool a few months ago. I
| like using git commits for change tracking too. Seems not
| dissimilar (though not identical) to what they're doing at Dolt
| (https://www.dolthub.com/).
| idan wrote:
| Yup, there's Dolt, and DVC, and probably a dozen other projects
| I'm forgetting or haven't heard of. Dat!
|
| There's more than one way to data. We looked at a bunch of
| them, and the key thing we keep coming back to is git
| semantics. In many ways, all these other projects attempt to
| graft git semantics on top of more scalable datastores,
| allowing you to "fork" your data or roll it back to a given
| version. Trouble is, these abstractions have subtly different
| semantics or behaviors. These aren't inherently bad -- just not
| the same as the ones you know from git.
|
| This approach sacrifices "scalability" in order to let you Just
| Use Git(tm). It won't work (well) for a larger dataset, but we
| find that it's useful in a ton of situations.
|
| For example: I have personally shipped bugs to production
| because my test fixtures had stale example data. I should have
| remembered to create new fixtures, but I didn't. Flat could
| have made them for me, on a schedule, subsampling and
| anonymizing production data as it worked.
|
| It's a subtle difference in application. If your goal is to
| version $BIGDATA, then Flat isn't the right tool for the job,
| and you should check out Dolt, DVC &co.
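|
| The stale-fixtures example above boils down to "subsample, then
| scrub". A minimal sketch of that step, with assumptions stated:
| `make_fixture` is a hypothetical helper, the field names in the
| usage note are invented, and truncated hashing is only
| pseudonymization, which may be too weak for real production data.

```python
import hashlib
import random


def make_fixture(rows, sample_size, secret_fields, seed=0):
    """Subsample `rows` and replace values of `secret_fields` with short hashes."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    fixture = []
    for row in sample:
        clean = dict(row)  # copy so the input rows are left untouched
        for field in secret_fields:
            digest = hashlib.sha256(str(row[field]).encode("utf-8")).hexdigest()
            clean[field] = digest[:12]
        fixture.append(clean)
    return fixture
```

| Called as `make_fixture(rows, 100, ["email"])` on a schedule, the
| result could be written to a fixtures file and committed like any
| other flat data.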
| FractalHQ wrote:
| Funny, I'm currently working on a project where I'm fetching post
| data from a Wordpress backend with a few GQL queries via the
| WPGraphQL plug-in and `@urql/svelte` to populate a static SSG'd
| frontend. While developing locally, I copied and pasted the JSON
| response into a local file in the repo to develop against. I was
| thinking this would be nice to automate.
|
| If I'm understanding correctly, it seems like this tool more or
| less automates that process?
|
| Can it send a GQL query?
| idan wrote:
| This is a really powerful use-case! If you saw Alex Gaynor's
| election tracker[1] during the US 2020 elections, it's exactly
| how it worked. Actions scraped the NYT election results.json,
| and a static site on GH pages rendered the data, XHRing the
| scraped JSON out of the repo periodically.
|
| There's no GraphQL backend yet! We've only done HTTP and SQL
| backends so far. If your GQL query is simple enough, you might
| be able to squeak by with an HTTP flat action whose target is
| https://your.site/graphql?query=whatever ?
|
| [1] https://alex.github.io/nyt-2020-election-scraper/battlegroun...
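|
| The "squeak by with HTTP" idea above amounts to putting the query
| in the URL. A sketch of building such a target, assuming the
| server accepts GraphQL over GET via a `query` parameter (not every
| server does); `graphql_get_url` is a hypothetical helper and the
| endpoint is the placeholder from the comment.

```python
from urllib.parse import urlencode


def graphql_get_url(endpoint, query):
    """Build a GET URL that carries a GraphQL query in the `query` parameter."""
    # Collapse whitespace so the URL stays short; this simple collapse
    # assumes the query has no string literals with meaningful spacing.
    compact = " ".join(query.split())
    return endpoint + "?" + urlencode({"query": compact})
```

| The resulting string is what would go in as the HTTP target,
| e.g. `graphql_get_url("https://your.site/graphql", "{ posts {
| title } }")`, where `posts` and `title` are made-up field names.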
| simonw wrote:
| If you want to run GraphQL queries against this kind of data I
| have a roundabout way of doing it:
|
| 1. Set up a repo that uses actions to scrape data into a CSV
|
| 2. Set up another action that converts that CSV to SQLite
| (using my sqlite-utils tool) and then...
|
| 3. Publishes that database to Cloud Run or Vercel with
| Datasette and with the datasette-graphql plugin
|
| Here's an example repo that does exactly that:
| https://github.com/simonw/cdc-vaccination-history
|
| It scrapes vaccination data from the CDC, compiles that into a
| SQLite database and publishes it using Datasette on Vercel at
| https://cdc-vaccination-history.datasette.io/
|
| Then you can run GraphQL queries at
| https://cdc-vaccination-history.datasette.io/graphql
|
| (Here's the plugin:
| https://datasette.io/plugins/datasette-graphql)
|
| Another demo: https://covid-19.datasettes.com/graphql runs from
| this repo: https://github.com/simonw/covid-19-datasette
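|
| Step 2 of the list above (CSV to SQLite) can be approximated with
| the Python standard library alone. Note that this swaps in
| `sqlite3` and `csv` in place of the sqlite-utils tool named in the
| comment; `csv_to_sqlite` is a hypothetical helper and every column
| is stored as TEXT for simplicity.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(csv_text, conn, table):
    """Load CSV text into `table`, creating it with one TEXT column per header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote identifiers so headers with spaces still work; values use placeholders.
    cols = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in header)
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quoted_table} ({cols})")
    marks = ", ".join("?" for _ in header)
    # Rows whose length does not match the header are skipped.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f"INSERT INTO {quoted_table} VALUES ({marks})", rows)
    conn.commit()
    return len(rows)
```

| The resulting database file is what a later step would hand to
| Datasette for publishing.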
| bob1029 wrote:
| I am sensing some interesting capabilities here, but also get the
| impression that this is more about denormalized views of data
| (JSON/CSV/etc) than anything else. It's also in the name -
| 'Flat'.
|
| Perhaps it is actually supported and I can't read properly, but I
| feel like you are just 1 tiny step away from allowing someone to
| write one of these things such that it can ETL any arbitrary data
| source into a SQLite database (i.e. many tables). There's not a
| whole lot of difference between CSV and SQLite when it comes to
| repository file management. Granted, SQLite databases would
| present as opaque blobs at code review time, but this is
| something we can tolerate because you still get all of the nice
| versioning & project consistency. Hell, you could probably write
| a special GitHub-branded diff viewer that allows you to compare 2
| different SQLite databases, schema & all.
|
| SQLite in general is such a force to be reckoned with. You could
| do a lot of damage (in a good way) with product features built up
| around the most popular database engine on earth.
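|
| The schema half of the diff-viewer idea above is small enough to
| sketch with the standard `sqlite3` module. This is illustrative
| only and compares table and column definitions, not row data;
| `schema` and `schema_diff` are hypothetical helpers.

```python
import sqlite3


def schema(conn):
    """Map each table name to its list of (column name, declared type)."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    result = {}
    for (name,) in tables:
        quoted = '"{}"'.format(name.replace('"', '""'))
        info = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        result[name] = [(col[1], col[2]) for col in info]
    return result


def schema_diff(old, new):
    """Return tables added, removed, and changed between two connections."""
    a, b = schema(old), schema(new)
    return {
        "added": sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "changed": sorted(t for t in set(a) & set(b) if a[t] != b[t]),
    }
```

| Opening the database file from two commits and passing both
| connections to `schema_diff` gives a review-friendly summary in
| place of an opaque binary diff.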
| nt2h9uh238h wrote:
| I'm actually very excited about it. Could start a new era of how
| we develop and work with data.
| idan wrote:
| Hi HN! Our team has loved building this, as well as all of the
| storytelling and examples. We'd love your feedback!
| everybodyknows wrote:
| The screen videos are interesting, but too fast to follow, and
| make reading the accompanying text impossible for those of us
| with fragile concentration. Reader View (Safari) drops the
| screen imagery entirely, so goes too far the other way.
|
| How about a video pause/seek control?
| idan wrote:
| Hey! This is a great callout, we'll think about how to make
| it better!
| dariosalvi78 wrote:
| nice idea, but exploring the data is very limited. Would be even
| better if it had some sort of query language and maybe an API.
___________________________________________________________________
(page generated 2021-05-18 23:01 UTC)