[HN Gopher] File Attachments: Databases can now store files and ...
___________________________________________________________________
File Attachments: Databases can now store files and images
Author : todsacerdoti
Score : 90 points
Date : 2023-08-30 16:21 UTC (6 hours ago)
(HTM) web link (xata.io)
(TXT) w3m dump (xata.io)
| excalibur wrote:
| This sounds useful for malware persistence after a backup
| recovery
| pmpjr wrote:
| [flagged]
| pmpjr wrote:
| [flagged]
| TOGoS wrote:
| Skimmed the article, grepped for 'hash', didn't find it.
|
| > they are stored in AWS S3
|
| Not sure why anyone would feel the need to couple this sort of
| thing to S3. Just store file hashes in the database, and have a
| blobstore mapping hashes to content. I like bitprint URNs for
| this purpose[1] but any strong cryptographic hash function will
| do. Then you can use any old datastore you want, be it a plain
| old filesystem, S3, a Git repo, or something else. It's trivial
| to bolt on alternate backends because you can always guarantee
| that what you asked for is what you got. If not, try the next
| one.
|
| [1] http://www.nuke24.net/docs/2015/HashURNs.html
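| 
| A minimal sketch of the idea in Python, assuming a plain
| filesystem backend (the function names here are made up):
| 
|     import hashlib, os
| 
|     def put_blob(store_dir, data):
|         # Key the blob by its content hash; any backend that
|         # returns bytes matching the hash is interchangeable.
|         digest = hashlib.sha256(data).hexdigest()
|         path = os.path.join(store_dir, digest)
|         if not os.path.exists(path):
|             with open(path, "wb") as f:
|                 f.write(data)
|         return digest  # store this in the database row
| 
|     def get_blob(store_dir, digest):
|         with open(os.path.join(store_dir, digest), "rb") as f:
|             data = f.read()
|         # Verify that what you asked for is what you got.
|         assert hashlib.sha256(data).hexdigest() == digest
|         return data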
| tudorg wrote:
| hey HN, and thank you todsacerdoti for posting it!
|
| This is a feature that we wanted for a long time, but we also
| wanted to get it right. It's roughly the equivalent of storing
| files and images in an S3 bucket and putting URLs to them in the
| database, except that with that approach you have to keep things
| in sync, take care of security permissions, delete the files when
| the database row is deleted, etc. We automate all of that for you
| and more: we cache the files in a CDN, and if they are images we
| support transformations out of the box.
| adrianmsmith wrote:
| I think for smaller projects just storing images as BLOBs in e.g.
| PostgreSQL works quite well.
|
| I know this is controversial. People will claim it's bad for
| performance.
|
| But that's only bad if performance is something you're having
| problems with, or going to have problems with. If the system
| you're working on is small, and going to stay small, then that
| doesn't matter. Not all systems are like this, but some are.
|
| Storing the images directly in the database has a lot of
| advantages. They share a transactional context with the other
| updates you're doing. They get backed up at the same time, and if
| you do a "point in time" restore they're consistent with the rest
| of your data. No additional infrastructure to manage to store the
| files.
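| 
| A minimal sketch of that approach with psycopg2 (table and
| column names are made up):
| 
|     import psycopg2
| 
|     conn = psycopg2.connect("dbname=app")
|     # The image shares a transaction with the rest of the row,
|     # and pg_dump / point-in-time restores stay consistent.
|     with conn, conn.cursor() as cur:
|         cur.execute("""CREATE TABLE IF NOT EXISTS users (
|                            id serial PRIMARY KEY,
|                            email text,
|                            avatar bytea)""")
|         with open("avatar.png", "rb") as f:
|             cur.execute(
|                 "INSERT INTO users (email, avatar) VALUES (%s, %s)",
|                 ("a@example.com", psycopg2.Binary(f.read())))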
| andyp-kw wrote:
| I've seen companies store PDF invoices in the database too, for
| the same reasons you mentioned.
| lovasoa wrote:
| How big does the database need to be for large BLOBs to become
| a problem? "Big" and "small" are quite subjective terms.
|
| How many BLOBs does one need to have, and how often do we need
| to touch them, for this solution to become untenable?
| spiffytech wrote:
| One datapoint on BLOB performance: _SQLite: 35% Faster Than
| The Filesystem_
|
| https://www.sqlite.org/fasterthanfs.html
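| 
| For reference, a minimal sketch of a SQLite blob table (file
| and table names are made up):
| 
|     import sqlite3
| 
|     con = sqlite3.connect("files.db")
|     con.execute(
|         "CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB)")
|     with open("photo.jpg", "rb") as f:
|         con.execute("INSERT INTO blobs VALUES (?, ?)",
|                     ("photo.jpg", f.read()))
|     con.commit()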
| crabbone wrote:
| In storage you measure things in blocks. Historically, blocks
| were 512 bytes, but today the tendency is to make them bigger;
| 4K would be the typical size in a server setting.
|
| So, the idea here is this: databases that store structured
| information, i.e. integers, booleans, and short strings, are
| typically something like relational databases, e.g. PostgreSQL.
|
| Filesystems (e.g. ext4) usually think in terms of whole blocks,
| but are designed with an eye toward smaller files, i.e. for
| optimal performance files aren't expected to be more than some
| tens or hundreds of blocks in size.
|
| Object stores (eg. S3) are the kinds of storage systems that
| are supposed to work well for anything larger than typical
| files.
|
| This gives the answer to your question: blobs in a relational
| database are probably OK if they are under one block in size.
| Databases will probably be able to handle bigger ones too,
| but you will start seeing serious drops in performance when
| it comes to indexing, filtering, searching etc. because such
| systems optimize internal memory buffers in such a way that
| they can fit a "perfect" number of elements of the "perfect"
| size.
|
| Another concern here is that with stored elements larger than a
| single block you need a different approach to parallelism.
| Ultimately, the number of blocks used by an I/O operation
| determines its performance. If you are reading/writing sub-
| block sized elements, you try to make it so that they come
| from the same block to minimize the number of requests made
| to the physical storage. If you work with multi-block
| elements, your approach to performance optimization is
| different -- you try to pre-fetch the "neighbor" blocks
| because you expect you might need them soon. Modern storage
| hardware has a decent degree of parallelism that allows you
| to queue multiple I/O requests without awaiting completion. This
| latter mechanism is a lot less relevant to something like an
| RDBMS, but is at the heart of an object store.
|
| In other words: the problem is not a function of the size of
| the database. In principle, nothing stops e.g. PostgreSQL from
| special-casing blobs and dealing with them differently than it
| normally does with "small" objects... but they probably aren't
| interested in doing so, because you already have appropriate
| storage for that kind of stuff, and PostgreSQL, like most other
| RDBMSs, sits _on top_ of the storage for larger objects (the
| filesystem), so it has no hope of doing it better than the
| layer below it.
| brazzy wrote:
| Most of what you wrote there is simply not true for modern
| DBMSs. Specifically, PostgreSQL has a mechanism called TOAST
| (https://www.enterprisedb.com/postgres-
| tutorials/postgresql-t...) that does exactly what you claim they
| "probably aren't interested in doing" and completely eliminates
| any performance penalty of large objects in a table when they
| are not used.
| doubled112 wrote:
| I've seen BLOBs in an Oracle DB used to store Oracle install
| ISOs, which I think is ironic on some level.
|
| Let's attach what we're using to the ticket. All of it. Why is
| the ticketing DB huge? Well, you attached all of it.
| rajman187 wrote:
| Several years ago Walmart dramatically sped up their online
| store's performance by storing images as blobs in their
| distributed Cassandra cluster.
|
| https://medium.com/walmartglobaltech/building-object-store-s...
| mohamedattahri wrote:
| BLOBs on Postgres are awesome, but there's also a full-featured
| file API called Large Objects for when the use case requires
| seeking/streaming.
|
| Wrote a small Go library to interface with it:
| https://github.com/mohamedattahri/pgfs
|
| Large Objects:
| https://www.postgresql.org/docs/current/largeobjects.html
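| 
| Not that Go library, but a rough sketch of the seek/stream idea
| using psycopg2's large-object support:
| 
|     import psycopg2
| 
|     conn = psycopg2.connect("dbname=app")
|     with conn:
|         lo = conn.lobject(0, "wb")     # create a new large object
|         with open("video.mp4", "rb") as f:
|             lo.write(f.read())
|         oid = lo.oid                   # store this oid in a table
|         lo.close()
| 
|     with conn:
|         lo = conn.lobject(oid, "rb")
|         lo.seek(1024 * 1024)           # random access: jump 1 MB in
|         chunk = lo.read(8192)          # ...then stream in chunks
|         lo.close()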
| dventimi wrote:
| Large Objects work great for images, stylesheets, and other
| file assets in my small PostgREST projects.
| ibgeek wrote:
| TIL thanks!
| bastawhiz wrote:
| I think the one big problem with BLOBs, especially if you have
| a heavily read-biased DB, is that you're going to run up against
| bandwidth/throughput as a bottleneck. One of the DBs I help
| maintain has some very large JSON columns and we frequently see
| this problem when traffic is at its peak: simply pulling the
| data down from Postgres is the problem.
|
| If the data is frequently accessed, it also means there are
| extra hops the data has to take before getting to the user.
| It's a lot faster to pull static files from S3 or a CDN (or
| even just a dumb static file server) than it is to round trip
| through your application to the DB and back. For one, it's
| almost impossible to stream the response, so the whole BLOB
| needs to be copied in memory in each system it passes through.
|
| It's rare that any request for, say, user data would _also_
| return the user avatar, so you ultimately just end up with one
| endpoint for structured data and one to serve binary BLOB data,
| which have very little overlap except for ACL stuff; and signed
| S3 URLs will get you the same security properties with much
| better performance overall.
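| 
| For the signed-URL part, a minimal boto3 sketch (bucket and key
| names are made up):
| 
|     import boto3
| 
|     s3 = boto3.client("s3")
|     url = s3.generate_presigned_url(
|         "get_object",
|         Params={"Bucket": "user-avatars", "Key": "avatars/42.png"},
|         ExpiresIn=300,  # link stays valid for 5 minutes
|     )
|     # Hand `url` to the client; it downloads straight from S3/CDN
|     # without round-tripping through the app server or the DB.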
| refulgentis wrote:
| Do you have any thoughts on when a JSON column is too large?
| I've been wondering about the tradeoffs between a jsonb
| column in postgres that may have values, at the extreme, as
| large as 10 MB, usually just 100 KB, versus using S3.
| cryptonector wrote:
| > But that's only bad if performance is something you're having
| problems with, or going to have problems with.
|
| The first part is easy enough, but how do you predict when
| you're going to hit a knee in the system's performance?
| SigmundA wrote:
| The main issues with large blobs in a DB are moving and/or
| deleting them. I'm not sure if PG has this issue (I'd like to
| confirm), but in, say, SQL Server, if you delete a blob or try
| to move it to another filegroup, you get transaction logging
| equal to the size of the blobs.
|
| You would think that, at least with a delete, especially in PG
| with the way it handles MVCC, there would just be an entry
| saying which blob was deleted or something; then if you need to
| roll back you just undelete the blob.
|
| So just being able to move and delete the data after it is in
| there becomes a real problem with a lot of it.
| vbezhenar wrote:
| It makes backups a PITA. I migrated blobs to S3 and managed to
| improve backups from once a month to once a day. The database is
| now very slim. HDD space is no longer an issue. I could delete
| the code that serves files. Lots of improvements with no
| downsides so far.
| crooked-v wrote:
| Of course, that also means that now you don't have backups of
| your files.
| 0x457 wrote:
| They do have backups if versioning is enabled on S3.
| ericbarrett wrote:
| Yes, S3 is extremely reliable and versioning protects
| against "oopsies." I do always recommend disallowing
| s3:DeleteObjectVersion for all IAM roles, and/or as a
| bucket policy; manage old versions with a lifecycle
| policy instead. This will protect against programmatic
| deletion by anybody except the root account.
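| 
| Roughly what that could look like as a bucket policy, applied
| with boto3 (the bucket name is a placeholder):
| 
|     import json, boto3
| 
|     policy = {
|         "Version": "2012-10-17",
|         "Statement": [{
|             "Sid": "DenyVersionDeletes",
|             "Effect": "Deny",
|             "Principal": "*",
|             "Action": "s3:DeleteObjectVersion",
|             "Resource": "arn:aws:s3:::my-backup-bucket/*",
|         }],
|     }
|     boto3.client("s3").put_bucket_policy(
|         Bucket="my-backup-bucket", Policy=json.dumps(policy))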
| brazzledazzle wrote:
| I second these sensible protections. I'd also recommend
| replicating your bucket as a DR measure, ideally to another
| (infrequently used and separately secured) account. Even better
| if you do it to another region, but I think another account is
| more important, since the chances of your account getting owned
| are higher than Amazon suffering a total region failure.
| [deleted]
| [deleted]
| ltbarcly3 wrote:
| I think there is too much emphasis on 'the right way'
| sometimes, which leads people to skip over an analysis of the
| problem they are solving right now.
|
| For example, if you are building something that will have
| thousands or millions of records, and storing small binaries in
| postgresql lets you avoid integrating with S3 at all, then you
| should seriously consider doing it. The simplification gain
| almost certainly pays for any 'badness' and then some.
|
| If you are working on something that you hope to scale to
| millions of users then you should just bite the bullet and
| integrate with S3 or something, because you will use too much
| db capacity to store binaries (assuming they aren't like 50
| bytes each or something remarkably small) and that will force
| you to scale-out the database far before you would otherwise be
| forced to do so.
| sorintm wrote:
| Brilliant point! It always depends on the problem you're trying
| to solve. What Xata did was integrate files as a database
| column, BUT store the binary content in S3, thus offering
| transactional file management and point-in-time recovery as
| well as high-performance direct (S3) file access through a CDN.
| Basically, Xata implements the orchestration and takes away the
| pain of managing two services.
| harshitaneja wrote:
| Going on a tangent here, but why S3? Why not avoid vendor
| lock-in and just serve static files?
|
| My setup is to post blobs over a small REST API which balances
| storage load over all the file servers using nginx. It responds
| with the serverSubdomainId, which in turn gets saved in the
| database, so the URL for file access can be generated as
| https://{serverSubdomainId}.domain.tld/{resourceId}. The only
| costly part is writing, which with caching of the available
| storage of all the servers isn't bad. The whole system is
| horizontally and vertically scalable. What am I missing, given
| that S3 is still the de facto choice even with vendor lock-in
| and high egress costs?
| connordoner wrote:
| I can't speak for everyone but, in my case, it comes down
| to the ease of setup and having someone else manage it.
| timando wrote:
| "S3" is often used to refer to any S3 compatible object
| storage system. You can even self-host it with something
| like MinIO.
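| 
| e.g. the same boto3 client code, pointed at a local MinIO
| instead of AWS (endpoint and credentials are placeholders):
| 
|     import boto3
| 
|     s3 = boto3.client(
|         "s3",
|         endpoint_url="http://localhost:9000",  # local MinIO
|         aws_access_key_id="minioadmin",
|         aws_secret_access_key="minioadmin",
|     )
|     s3.upload_file("avatar.png", "avatars", "42.png")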
| aforwardslash wrote:
| There are plenty of OSS solutions that speak "S3", like Swift,
| Ceph, and SeaweedFS.
|
| Why object storage? It's easier to back up, version, and scale
| horizontally. Most solutions will also provide encryption
| capabilities that can be driven either via external key
| management systems or from the client application. Also, custom
| access policies are great for private documents and file
| uploads.
|
| Using static files is a good solution in some cases, nothing
| against it. But in many scenarios there are huge benefits to
| using object storage instead, even if it is quite a bit slower.
| crooked-v wrote:
| > and that will force you to scale-out the database far
| before you would otherwise be forced to do so
|
| Disk space is cheap, even when it's on a DB server and not
| S3. Why worry that much about it?
| tomnipotent wrote:
| > Disk space is cheap
|
| Disk I/O less so. An average RDBMS writing a 10MB BLOB is
| actually writing at minimum 20-30MB to disk - once to
| journal/WAL, once to table storage, and for updates/deletes
| a third copy in the redo/undo log.
|
| You also get a less efficient buffer manager with a higher
| eviction rate, which can further exacerbate disk I/O.
| crooked-v wrote:
| But if the files are important enough that you need
| transactional updates or point-in-time recovery for them,
| you're stuck with the equivalents of those things anyway,
| wherever those files end up, plus the extra overhead of
| coordinating two systems.
| [deleted]
| fiedzia wrote:
| > If the system you're working on is small, and going to stay
| small, then that doesn't matter.
|
| Having a good default solution saves a lot of migration
| problems later. Storing files outside of the database today
| really isn't that much more complicated, while the benefits
| even for small files are significant: you can use a CDN, you
| don't want traffic spikes to affect database performance, you
| will want different availability guarantees for your media than
| for the database, and so on.
| EvanAnderson wrote:
| The issue I've run into re: storing files as BLOBs in a
| database has been the "impedance mismatch" coming from others
| wanting to use tools that act on filesystem objects against the
| files stored in the database.
|
| That aside I've had good experiences for some applications.
| It's certainly a lot easier than keeping a filesystem hierarchy
| in sync w/ the database, particularly if you're trying to
| replicate the database's native access control semantics to the
| filesystem. (So many applications get this wrong and leave
| their entire BLOB store, sitting out on a filesystem,
| completely exposed to filesystem-level actors with excessive
| permission.)
|
| Microsoft has an interesting feature in SQL Server to expose
| BLOBs in a FUSE-like manner: https://learn.microsoft.com/en-
| us/sql/relational-databases/b...
|
| I see postgresqlfs[0] on Github, but it's unclear exactly what
| it does and it looks like it has been idle for at least 6
| years.
|
| [0] https://github.com/petere/postgresqlfs
| harlanji wrote:
| MinIO is easy enough to spin up and is S3-compatible. That
| seems like my default path to persistence going forward. More
| and more deployment options seem like they'll benefit from not
| using the disk directly but instead going through a dedicated
| storage service, so you might as well use a tool designed for
| that.
|
| S3 can be a bit of a mismatch for people who want to work
| with FS objects as well, but there are a couple options that
| are a lot easier than dealing with blob files in PGsql.
| S3cmd, S3fs; perhaps SSHfs to the backing directory of Minio
| or direct access on the host (direct routes untested, unsure
| if it maps 1:1).
| maerF0x0 wrote:
| You may not like Mongo, but it's had this feature for a long
| time. https://www.mongodb.com/docs/manual/core/gridfs/ Title
| maybe needs to say "Xata now can store..." ?
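| 
| A minimal GridFS sketch with pymongo (database and file names
| are made up):
| 
|     import gridfs
|     from pymongo import MongoClient
| 
|     db = MongoClient()["app"]
|     fs = gridfs.GridFS(db)
| 
|     with open("report.pdf", "rb") as f:
|         file_id = fs.put(f, filename="report.pdf")  # chunked into fs.chunks
| 
|     data = fs.get(file_id).read()  # stream it back out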
| winrid wrote:
| With built in replication, horizontal scalability, sharding,
| detailed monitoring, built in LRU memory cache, reading your
| own writes, automatic file expiration, etc.
| [deleted]
| benatkin wrote:
| It's hard to find when CouchDB and MongoDB got attachments, but
| I suspect CouchDB had them first, so it may have been an
| obvious choice for MongoDB.
| EMM_386 wrote:
| "Databases can now store files and images"
|
| With SQL Server alone, you've been able to store BLOB data for
| decades now.
|
| Need more than 2GB?
|
| There's FILESTREAM, where enormous files can sit on the disk,
| outside the DB, but be referenced as if they were in the database
| via a varbinary(max) column.
|
| "Think of it as having a new database column type where you can
| store files of any size"
|
| We have that. It might not be AWS S3 and cached on a CDN, but
| that is just adding additional cost, complexity, and lock-in.
| saulrh wrote:
| Being able to demote files from "this is an object on disk" to
| "this is just another record, albeit a very big one" is a huge
| win for data architecture. I've seen this implemented ad-hoc in
| dozens of different instances. Having it work out-of-the-box with
| a plugin and custom column type would be really nifty.
| orf wrote:
| Just a note: if you are reading this post and then click the
| logo in the top right, you cannot navigate back to the post -
| it immediately jumps to the homepage again.
|
| Safari on iOS.
| meepmorp wrote:
| Works fine for me with Safari on iOS.
| GreeningRun wrote:
| Works fine on chrome/android
| crabbone wrote:
| Coming from storage, I really hate it when people (unknowingly or
| deliberately) misuse terminology in this field.
|
| Whoever Xata is, they aren't storing _files_ in their database.
| They are storing blobs, or objects... depends on how you look at
| it, but definitely not files.
|
| A simple way to see that that's not true: can they store
| write-only files in their database (e.g. /dev/null)? Or can
| they store UNIX socket files? What about device files? Can they
| recognize that a file has the setuid bit? And so on...
| practically no useful attributes of files are stored in their
| database.
|
| They obviously didn't even want to store files... It just feels
| like some kind of marketing trick (especially since the article
| is excessively peppered with self-congratulatory quotes from
| people who have allegedly found the product useful). Just say it
| as it is: _Blob attachments_.
| HumanOstrich wrote:
| I think you're conflating files with file metadata.
| berkes wrote:
| I'm no expert in this field. I presume you are technically
| correct. But also probably too focused on technical details.
|
| Send a file or path to a DB server in an INSERT query (INSERT
| INTO users (email, pwhash, avatar) VALUES (?,?,?)), and have it
| handle the actual storage, all abstracted away. For the user
| (programmer) of this system, this is "storing the file in the
| DB". If I can then do a "SELECT email, variation('tiny',
| avatar)::cdn_url FROM users", that is "getting the image from
| the DB". Well, technically, getting the email and a URL where
| this file can be retrieved.
|
| To me, as a user of such a database, it matters not if some
| engine stores it as a blob, base64, pointer, or just a URI to
| an S3 block. What matters is that it's in the same layer of
| abstraction.
| DaiPlusPlus wrote:
| > To me, as a user of such a database, it matters not if some
| engine stores it as a blob, base64, pointer, or just a URI to
| an S3 block. What matters is that it's in the same layer of
| abstraction.
|
| By "user of such a database" you mean the end-user of an
| application, right? As opposed to the devs, DBAs, SREs, et
| cetera? Because those users absolutely care about how this
| feature works and absolutely do not want anything to be
| abstracted away, because that's the dif
|
| Also, data: URIs need to die. It started off with lazy JS
| devs using <canvas>'s toDataURL (instead of toBlob) and now
| it's infecting the entire dev ecosystem: just go to
| StackOverflow and there'll be a post every few hours where
| some asker genuinely believes we're supposed to store large,
| gigabyte-sized files in a Postgres or MSSQL table as a Base64
| string in a varchar column with a text collation
| (aiiieeeeee).
| rat9988 wrote:
| "practically no useful attributes of files are stored in their
| database."
|
| Probably because you work in the field, you miss what the
| layman wants to see. The most useful attribute for me is the
| data. The other use cases you cited are legitimate; it's just
| that I, as a layperson, don't think about them when we talk
| about files.
| deadbeeves wrote:
| Early operating systems did not support directory trees, and
| I'm sure some didn't even support timestamps and such. Would
| you say that those systems do not store files at all? What even
| is the defining characteristic of a "file"?
| fiedzia wrote:
| >Think of it as having a new database column type where you can
| store files of any size, and behind the scenes they are stored in
| AWS S3 and cached through a global CDN.
|
| So the files are not stored in the database, but somewhere else.
| This belongs in a web framework, not the database.
| karlmush wrote:
| [dead]
___________________________________________________________________
(page generated 2023-08-30 23:00 UTC)