[HN Gopher] Architecture Notes: Datasette
___________________________________________________________________
Architecture Notes: Datasette
Author : pcr910303
Score : 216 points
Date : 2022-05-27 15:25 UTC (7 hours ago)
(HTM) web link (architecturenotes.co)
(TXT) w3m dump (architecturenotes.co)
| simonw wrote:
| This is the first issue of the new Architecture Notes publication
| (website and newsletter) run by Mahdi Yusuf.
|
| Mahdi interviewed me about the architecture of
| https://datasette.io/ - topics we covered include:
|
| - Building a modern Python app using ASGI
|
| - Benefits of SQLite
|
| - Designing plugin hooks
|
| - Safely allowing SQL injection
|
| - Using SQLite from asyncio
|
| - The Baked Data architectural pattern
|
| - Bundling a Python web application in Electron
|
| - Packaging Python for WebAssembly
| samwillis wrote:
| Simon, I have a wish list item for Datasette: pagination of ad
| hoc queries. I know that's a really difficult thing to
| implement as it would require parsing and altering the SQL
| query, but with large datasets it would be so useful!
|
| (Love Datasette!)
| simonw wrote:
| I want this too!
|
| Datasette avoids offset/limit pagination because it performs
| poorly on huge queries - and I don't want random visitors
| (and crawlers) hurting performance of a public instance by
| crawling through offset/limit of thousands of pages.
|
| That's why table pages implement keyset pagination instead -
| so you can do
| https://congress-legislators.datasettes.com/legislators/legi...
| and get back records following the one with A000106, which is
| a fast query because the ID column has an index on it.
|
| Supporting this with arbitrary queries is harder. One idea I
| had is to allow the user to specify which column and sort
| order should be used for keyset pagination - so you could
| construct a URL like this:
| /?sql=select+*+from+legislators+order+by+id&_pagination_column=id
|
| If a pagination column has been specified, Datasette would
| use the same trick it uses on regular table pages and add
| next links that way.
|
| Would that work for you?
|
| The other, probably easier option is a setting that enables
| offset/limit pagination of arbitrary SQL queries - turned off
| by default, but easy to turn on for users who are running
| Datasette on a private server. If that takes several seconds
| people can at least opt into it.
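
A minimal sketch of the keyset pagination idea described in the comment above, using Python's standard `sqlite3` module. The `legislators` table contents, the `keyset_page` helper and the page size are assumptions made for illustration; this is not Datasette's actual implementation.

```python
import sqlite3


def keyset_page(conn, after_id=None, page_size=3):
    """Return (rows, next_key) for one page ordered by the indexed id column.

    Hypothetical helper: fetches one extra row to learn whether another
    page exists, then trims it off before returning.
    """
    if after_id is None:
        cursor = conn.execute(
            "select id, name from legislators order by id limit ?",
            (page_size + 1,),
        )
    else:
        # The where clause uses the primary key index, so the cost does not
        # grow with how deep into the table we are (unlike offset/limit).
        cursor = conn.execute(
            "select id, name from legislators where id > ? order by id limit ?",
            (after_id, page_size + 1),
        )
    rows = cursor.fetchall()
    has_more = len(rows) > page_size
    rows = rows[:page_size]
    next_key = rows[-1][0] if has_more else None
    return rows, next_key


conn = sqlite3.connect(":memory:")
conn.execute("create table legislators (id text primary key, name text)")
# Made-up sample rows, only here so the example runs end to end.
conn.executemany(
    "insert into legislators values (?, ?)",
    [(f"A{n:06d}", f"Legislator {n}") for n in range(1, 8)],
)

first_page, next_key = keyset_page(conn)
second_page, following_key = keyset_page(conn, after_id=next_key)
third_page, last_key = keyset_page(conn, after_id=following_key)
```

An offset/limit version would replace the `where id > ?` clause with `limit ? offset ?`, which is simpler to bolt onto arbitrary SQL but forces SQLite to step over every skipped row.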
| samwillis wrote:
| Either would work, but as you said the latter is easier to
| implement and would have done the job for what I was
| working on. If I had had more time I would have looked
| into trying to do it myself, but there is never enough
| time...
| prepend wrote:
| What can government data providers like data.cdc.gov do to
| support datasette better? Is it expected that these providers
| would make SQLite distros of data? Or do you think people will
| just suck stuff down in csv and convert it to datasette to make
| it easier to work with and reproduce the work of others?
| simonw wrote:
| My ultimate dream is that providers like that will use
| Datasette itself - or a similar system that has the same
| characteristics: make it really easy for people to slice and
| dice the data using querystring parameters or even SQL
| queries and get back out just the data they want as JSON, CSV
| and other formats.
|
| I do think that SQLite is a really interesting format for
| publishing data, and I'd love to see more places publish raw
| SQLite files. It's much better at preserving things like
| column type information and relationships between tables than
| CSV is.
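
To illustrate the point about type information and relationships, here is a small sketch using only the Python standard library. The `states` and `counties` tables and their values are invented for the example; the two `pragma` statements are standard SQLite.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up schema and data, purely for illustration.
conn.executescript(
    """
    create table states (code text primary key, name text);
    create table counties (
        id integer primary key,
        name text,
        population integer,
        state_code text references states(code)
    );
    insert into states values ('XX', 'Example State');
    insert into counties values (1, 'Example County', 12345, 'XX');
    """
)

# A SQLite file carries declared column types...
column_types = {
    row[1]: row[2] for row in conn.execute("pragma table_info(counties)")
}
# ...and foreign key relationships as (column, other table, other column).
foreign_keys = [
    (row[3], row[2], row[4])
    for row in conn.execute("pragma foreign_key_list(counties)")
]
sqlite_population = conn.execute(
    "select population from counties where id = 1"
).fetchone()[0]

# Round-tripping the same table through CSV keeps the values but
# every field comes back as a string and the relationship is gone.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "population", "state_code"])
writer.writerows(conn.execute("select * from counties"))
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))
```

After the round trip, `sqlite_population` is an `int` while `csv_row["population"]` is the string `"12345"`, and nothing in the CSV records that `state_code` points at another table.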
| googletron wrote:
| Hello creator of the project. This entire project was motivated
| by another HN post where someone was asking where we can get
| information about architectures and system designs across our
| industry. Seeing there wasn't anything available, I decided to
| start it; glad to see it's resonated with the community.
|
| I hope to create high quality posts from engineers that work on
| these systems and the challenges they are trying to solve. Really
| dig into the problems and the technologies and strategies they
| used to solve it.
|
| What have they learned? Where have they failed?
|
| I also plan to write up deep technical dives on technologies we
| all rely on and use everyday.
|
| If there is any feedback to improve or you have something you
| want to write up with me please reach out.
| eatonphil wrote:
| By creator of the project I assume you mean creator of the
| blog/newsletter, not creator of Datasette?
| googletron wrote:
| Correct. :)
| singhrac wrote:
| Hey, I love this. Please keep doing it, we need more of this
| kind of high quality writing available on the web. When I was
| just getting started programming, I read the AOSA guides, but
| they were considerably harder to read than this.
|
| There are so many things I love about this: the art style, the
| large font, the summary image, etc. Great work.
| googletron wrote:
| We worked hard to make it very approachable and to explain
| concepts that people assume everyone is aware of. This means a
| lot, and we hope to continue producing these and more.
| singhrac wrote:
| Is there a way for me to sponsor or donate? I know the most
| limiting resource is probably your time but let me know if
| there's anything we can do to help.
| googletron wrote:
| We have a paid membership that currently doesn't offer much
| yet: https://architecturenotes.co/membership/
|
| You can support us there, though. I do appreciate it. Hoping
| to do more technical dives on technologies and use cases
| here soon.
| pc86 wrote:
| Looks great so far. One nitpick is that the menu isn't playing
| well[0] for Brave Version 1.39.111 Chromium: 102.0.5005.61
| (Official Build) (arm64)
|
| [0] https://imgur.com/a/WSEVCFG
| googletron wrote:
| Thanks for the feedback. I will look into it.
| lelandfe wrote:
| Kill the z-index properties entirely on `.toc-wrap` and
| `.toc` to make it play nicely with full width desktop
| images :)
|
| (or maybe make intentional with e.g. a big blurry box-
| shadow)
|
| Also: thanks for the resource! I felt incredibly lost doing
| system design interviews over the last few months, this
| seems like a fantastic site.
| googletron wrote:
| Thanks for the help there. I am struggling my way through
| that.
|
| I am glad it can be of help and plan to do much more. :)
| klooney wrote:
| There's some prior art: http://aosabook.org/en/index.html
| usrme wrote:
| Thanks for doing this! The only things I noticed are that in
| this image
| (https://architecturenotes.co/content/images/2022/05/Datasett...)
| the word "acquire" is misspelled and near the end GitHub is
| written as "Github" with incorrect casing, despite it being
| correct earlier on.
| googletron wrote:
| Good catch. I will revise this. Thanks for pointing it out.
| googletron wrote:
| Here is the sketch note of the entire post as well.
|
| https://architecturenotes.co/content/images/size/w1600/2022/...
| emadda wrote:
| You can load your Stripe account into a Datasette instance using
| this:
|
| https://github.com/tabledog/datasette-stripe
| jrvarela56 wrote:
| Congrats on the launch and what an awesome first guest! Looking
| fwd to more posts :D
| beebmam wrote:
| I have a hard time believing people feel comfortable using Python
| applications in production for anything other than prototyping. I
| have seen some shit over my career, across many kinds of
| interpreted languages, that will never let me approve of that.
|
| Unless one simply doesn't care about runtime quality.
| simonw wrote:
| Believe it. I've spent almost my entire career running Python
| applications in production, as have many of my friends, and
| many large companies that I've worked for or worked with.
|
| Given the number of terrible, buggy sites I've seen built using
| Java or .NET I personally have trouble believing companies run
| those in production, but evidently they do!
| simonw wrote:
| A slightly less snarky answer: the thing I care about isn't
| the language, it's the process and environment around the
| project.
|
| If I'm going to put something in production, I want it to
| have:
|
| - Comprehensive tests, protected by CI
|
| - Thorough, up-to-date documentation
|
| - Code that lives in version control, with good commit
| messages that help answer "why" questions about how it works
|
| - Good development environments
|
| - A robust deployment process
|
| The language influences these in as much as different
| languages have different cultures and tooling around them,
| but conceptually they are pretty language agnostic.
|
| I know how to do all of these things well in Python, which is
| why I tend to continue to spend my time in Python land.
| googletron wrote:
| I am surprised this hot take still has legs. The world's
| biggest sites use Python.
| sontek wrote:
| Yeah, it'd be crazy if sites like Reddit, SurveyMonkey,
| Dropbox, Spotify, Instagram, Pinterest, Lyft, and Sentry were
| built on a silly little prototyping language like Python
| instead of a _real_ programming language. Right?
| noSyncCloud wrote:
| What about production apps running on scripting languages like
| JS?
| getpost wrote:
| Production sites like YouTube, Instagram, Netflix, Reddit, ...?
| https://www.botreetechnologies.com/blog/top-15-websites-buil...
| redredrobot wrote:
| Datasette is pretty cool.
|
| But AFAICT, it just doesn't scale whatsoever. That SQLite db is
| both the dataset index and the dataset content combined, right?
| So you're limited by how big that SQLite db can realistically be.
| The docs say "share data of any shape or any size", but AFAICT it
| can't handle large datasets containing large unstructured data
| like images and video and multi-billion data point datasets are
| hard to store in a single machine/file.
|
| Not really a criticism, but more wondering if there are scale
| optimizations in Datasette I'm not aware of since the docs do say
| any shape or size.
| wswope wrote:
| > AFAICT it can't handle large datasets containing large
| unstructured data like images and video and multi-billion data
| point datasets are hard to store in a single machine/file
|
| Images and videos can easily be yeeted in as binary blobs (same
| as with any other standard DB), and SQLite DBs scale into the
| hundreds of TB range as a single file. Are you comparing the
| single file strategy to something like a sharded cluster of
| DBs, or is your thought that a DB that stores objects as
| independent files is somehow superior?
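
As a rough sketch of the "binary blobs" approach mentioned above, using the standard `sqlite3` module. The `media` table and the byte string standing in for an image are invented for the example. Note that SQLite's default compile-time limit on a single blob is about 1 GB, so very large videos may not fit in one value without a custom build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per file, content stored inline as a blob.
conn.execute("create table media (name text primary key, content blob)")

# Stand-in for real image bytes (1,024 bytes of made-up data).
fake_image = bytes(range(256)) * 4
conn.execute("insert into media values (?, ?)", ("example.bin", fake_image))

# Blobs come back from sqlite3 as Python bytes objects.
stored = conn.execute(
    "select content from media where name = ?", ("example.bin",)
).fetchone()[0]
# length() on a blob returns its size in bytes.
size = conn.execute(
    "select length(content) from media where name = ?", ("example.bin",)
).fetchone()[0]
```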
| simonw wrote:
| You're right, Datasette isn't the right tool for sharing
| billion point datasets (actually low-billions might be OK if
| each row is small enough).
|
| I think of Datasette as a tool for working with "small data" -
| where I define small data as data that will fit on a USB stick,
| or on my phone.
|
| My iPhone has a TB of storage these days, so small data can get
| you a very long way!
|
| Using it for unstructured image and video would work fine using
| the pattern where those binary files live somewhere like S3 and
| the Datasette instance exposes URLs to them. I should find
| somewhere in the documentation to talk about that.
|
| But yes, I should probably take "of any size" off the homepage,
| it does give a misleading impression.
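
The pattern described above, where binary files live in object storage and the database only holds URLs to them, might look roughly like this. The `photos` table and the `BASE_URL` value are hypothetical placeholders, not a real bucket or a Datasette feature.

```python
import sqlite3

# Hypothetical location of the externally hosted files.
BASE_URL = "https://files.example.com/photos/"

conn = sqlite3.connect(":memory:")
# The database stores searchable metadata plus a URL, not the bytes.
conn.execute(
    "create table photos (id integer primary key, caption text, url text)"
)
conn.executemany(
    "insert into photos (caption, url) values (?, ?)",
    [
        ("First photo", BASE_URL + "1.jpg"),
        ("Second photo", BASE_URL + "2.jpg"),
    ],
)

rows = conn.execute("select id, caption, url from photos order by id").fetchall()
```

This keeps the SQLite file small and queryable while the large files are served from wherever they are hosted.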
| simonw wrote:
| Opened an issue exploring alternatives here:
| https://github.com/simonw/datasette.io/issues/109
|
| I decided to just drop "any size" but keep "any shape".
| bspammer wrote:
| Very interesting idea to use GPT3 as a starting point for
| rewording text. I can see it being an effective way to
| break writer's block.
| samwillis wrote:
| Not quite the scale you are suggesting, but I used it with a
| 7 GB, 20M-row dataset and it worked incredibly well.
| redredrobot wrote:
| Yeah - it's probably unfair of me to say it doesn't scale at
| all. But between large data and 2 extra orders of magnitude
| of rows, the single SQLite file approach quickly breaks down,
| even if you don't store the large content in-db.
| jandrese wrote:
| I was slightly disappointed that this wasn't an article
| describing the technical details of systems that store data on
| cassette tapes.
| jmbwell wrote:
| This is the second time this 'datassette' has tricked me for
| the same reason.
| tesseract wrote:
| Here's one then (in German):
| https://www.c64-wiki.de/wiki/Aufzeichnungsformat_der_Datasse...
___________________________________________________________________
(page generated 2022-05-27 23:00 UTC)