[HN Gopher] Ask HN: How to handle user file uploads?
___________________________________________________________________
Ask HN: How to handle user file uploads?
Hey, I work as an SRE for a company where we allow users to upload
media files (e.g. profile pictures, docs or videos attached to
tasks... the usual). We currently just generate an S3 pre-signed
URL and let the user upload stuff. Occasionally, limits are set on
the <input/> element for file types. I don't feel this is safe enough.
I also feel we could do better by optimizing images on the BE, or
creating thumbnails for videos. But then there is the question of
cost on the AWS side. Anybody have experience with any of this? I
imagine having a big team and dedicated services for media
processing could work, but what about small teams? All
thoughts/discussions are welcome.
Author : matusgallik008
Score : 168 points
Date : 2024-05-01 09:39 UTC (3 days ago)
| cebert wrote:
| We allow uploads with pre-signed URLs at my work too. We added a
| few more constraints. First, we only allow clients to upload
| files with extensions that are reasonable for our apps. Files
| uploaded to S3 are quarantined with tags until a Lambda validates
| that the binary content appears to match the signature for the
| given extension/mime-type. Second, we use ClamAV to scan the
| uploaded files. Once these checks have completed, we generate a
| thumbnail with a Lambda and then make the file available to
| users.
|
| I'm honestly surprised this isn't a value-added capability offered
| by AWS S3 because it's such a common need and undifferentiated
| work.
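|
| Roughly, the validation Lambda looks something like this (a
| boto3 sketch, not our actual code; the bucket wiring, tag
| values and the two signatures shown are illustrative):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     # First bytes for a couple of the types we accept.
|     MAGIC = {
|         "image/jpeg": b"\xff\xd8\xff",
|         "image/png": b"\x89PNG\r\n\x1a\n",
|     }
|
|     def handler(event, context):
|         # Fired by an ObjectCreated event on the quarantine bucket.
|         rec = event["Records"][0]["s3"]
|         bucket, key = rec["bucket"]["name"], rec["object"]["key"]
|
|         # Declared type vs. the first bytes of the actual object.
|         declared = s3.head_object(Bucket=bucket, Key=key)["ContentType"]
|         head = s3.get_object(Bucket=bucket, Key=key,
|                              Range="bytes=0-39")["Body"].read()
|
|         sig = MAGIC.get(declared)
|         ok = sig is not None and head.startswith(sig)
|         s3.put_object_tagging(
|             Bucket=bucket, Key=key,
|             Tagging={"TagSet": [{"Key": "scan-status",
|                                  "Value": "clean" if ok else "rejected"}]},
|         )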
| arethuza wrote:
| Azure Storage seems to have an integrated AV option:
|
| https://learn.microsoft.com/en-us/azure/defender-for-cloud/d...
|
| Mind you, last time I designed something using this we just
| used ClamAV - which is pretty easy to develop against even if
| it is a slight pain to manage on an ongoing basis.
| LinuxBender wrote:
| I am not a proper developer so I don't have the technical
| details on this, but another thing I have seen done is to buy
| access to an API on VirusTotal [1] and submit file hashes to be
| compared against their database of about 70 or so virus
| scanners. Not foolproof, as malware can be recompiled and
| hidden from VT until the actual file is submitted for sandbox
| analysis _which you probably don't want to do with potentially
| sensitive data_, so it really only catches known-knowns, but
| apparently some companies find it useful. I utilize it on a
| Windows machine using Process Explorer, which can automatically
| submit hashes, but the API volume is limited for free /
| anonymous access.
|
| [1] - https://docs.virustotal.com/reference/overview
| rpigab wrote:
| If I were a malware distributor, I'd just throw a couple of
| random bytes at the end each time I upload a malicious file;
| it'd still work and have a different checksum/hash.
| ecaradec wrote:
| Antiviruses are really stupid tools, but not that stupid. I
| say that from a time when I had to work around tools being
| flagged by antivirus. Among the stupid things they do is
| flagging part of an executable; some NSIS plugins got the
| whole package flagged as a virus as soon as you included
| them. I think they probably hash files by chunks; if you
| have too many bad chunks then you are a virus. A few bytes
| at the end doesn't change that.
| eyegor wrote:
| I'm not sure if it's still true, but it used to be that
| ~half of all antivirus products would flag an executable if
| you compressed it with UPX.
| cinntaile wrote:
| You could sometimes bypass this by opening the file with
| a hex editor and changing a meaningless value. When UPX was
| popular there were also alternative file compressors that
| could sometimes be used to bypass this issue.
| toomuchtodo wrote:
| CISA is offering a malware analysis service now; it requires
| a Login.gov account and that the user be a US citizen. Unsure
| how it compares to VirusTotal; it's in our backlog to spike a
| PoC in our IR workflow.
|
| https://www.cisa.gov/resources-tools/services/malware-
| next-g...
| yard2010 wrote:
| I have to say this user agreement sounds like TikTok
| privacy policy, but my oh my! The government does its job.
| RESPECT.
| matusgallik008 wrote:
| Thanks for the insights! I had a look at ClamAV - I'd never
| heard of or worked with it. This was something I was looking for.
|
| Do you also handle large files, e.g. videos, with Lambda? Don't
| you struggle with Lambda limits then?
| cebert wrote:
| Great question. We use spot EC2 instances for ClamAV scans.
| For thumbnails, we still use Lambdas but use AWS MediaConvert
| for video files to select a frame for the thumbnail.
| dividuum wrote:
| Is using ClamAV because of some compliance checklist that
| forces you to include a virus scanner in there? Otherwise I
| don't see how adding a ton of additional complexity with a
| proven track record of CVEs would add any security at all.
| cebert wrote:
| No compliance reason other than adding it in couldn't hurt.
| robertlagrant wrote:
| How does
|
| > a ton of additional complexity with a proven track record
| of CVEs
|
| become:
|
| > adding it in couldn't hurt
|
| ?
| matja wrote:
| By running it in an ephemeral container with no network
| access for each file you want to check?
| Kinrany wrote:
| If S3 did this, what would that look like?
| thebeardedone wrote:
| It's been a while, but I would benchmark how ClamAV fares. At
| $(day job) we implemented scanning for file uploads, and after
| our security team tested it they noticed that it detected
| very well-known malicious content fairly poorly (60% was on
| the high side), iirc.
|
| We ultimately scrapped it, if anyone else has any better
| experience I'd love to hear it.
| GordonS wrote:
| Virus scanning user uploaded files always seemed like a very
| sensible thing to me.
|
| I've used ClamAV before in web apps, and it was fairly easy to
| get working - but it has a few big flaws: it's very slow, it
| uses a huge amount of memory, and detection rates are...
| honestly, pretty terrible.
|
| I've used Verisys Antivirus API for 2 recent projects, and it
| works great, a huge improvement over self-hosted ClamAV. There
| are a couple of other virus scan APIs available, including
| Scanii[1], which I think has been mentioned on HN before, and
| Trend Micro has something too - but their pricing is secret, so
| I have to assume it's expensive.
|
| [0] https://www.ionxsolutions.com/products/verisys-virus-api
| [1] https://scanii.com
| brudgers wrote:
| _I don't feel this is safe enough. I also feel we could do
| better..._
|
| What is the business case for making the necessary changes?
|
| Good luck.
| matusgallik008 wrote:
| Well, security for starters. Having user A upload a
| malicious_file.png and then user B download it would not be
| ideal. Secondly, UX - having thumbnails for files can increase
| the loading speed. Arguably not the most important UX feature,
| but we have the resources to spend on this, so why not.
| julik wrote:
| > Having user A upload a malicious_file.png and then user B
| download it
|
| You will usually want the profile image of user A to be shown
| to user B. Same for videos and other files - users likely
| upload them so that others can download them / consume them
| in some way, right?
|
| I think you need to start with some threat analysis and
| proceed from there. Hosting user content can be built out
| from "very small" to "very big", depending on the particular
| threat scenario/use cases/your particular userbase. With the
| description you are giving, I would say you are more at risk
| of building an overcomplicated ("oversecured") solution which
| might compromise UX for the sake of some protection that is
| not necessarily needed.
|
| If you are a small team, likely you could use an image
| resizing / video thumbnailing proxy server such as
| https://www.imgix.com/ https://imgproxy.net/ etc. You
| generate a signed URL to it and then the service picks up the
| file from S3 and does $thing to it. https://www.thumbor.org/
| is another such tool. There are quite a few.
|
| Re. uploads and downloads - you have quite a few options with
| S3. You can generate the presigned upload URL on the server
| (in fact: you should do just that), make it time limited and
| add the Content-Length of the upload to the signed headers -
| this way the server may restrict the size of the upload.
| Similarly, access to display the images can be done via a CDN
| or using low-TTL signed URLs... plenty of things to do.
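|
| For illustration, generating such a URL with boto3 could look
| roughly like this (bucket name, size cap and expiry are made
| up; the client first tells your server how big the file is):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     def make_upload_url(key: str, size: int) -> str:
|         if size > 10 * 1024 * 1024:  # refuse anything over 10 MB
|             raise ValueError("file too large")
|         # ContentLength goes into the signed headers, so the PUT
|         # must send exactly this many bytes or S3 rejects it.
|         return s3.generate_presigned_url(
|             "put_object",
|             Params={"Bucket": "user-uploads", "Key": key,
|                     "ContentLength": size},
|             ExpiresIn=300,  # time limited, as above
|         )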
| jjice wrote:
| You'll probably have no issue sending the image to your backend
| directly, doing whatever you want to it (compression, validation,
| etc), and then uploading it to S3 from there. It's not a lot of
| overhead (and I'd argue more testable and easy to run locally).
|
| You can do the math on the ingress to your service (let's say
| it's EC2), and then the upload from EC2 to S3.
|
| It appears that AWS doesn't charge for ingress [0]. "There is no
| charge for inbound data transfer across all services in all
| Regions".
|
| S3 is half a cent per 1000 PUT requests [1], but you were already
| going to pay that anyway. Storage costs would also be paid anyway
| with a presigned URL.
|
| You'll have more overhead on your server, but how often do people
| upload files? It'll depend on your application. Generally, I'd
| lean towards sending it to your backend until it becomes an
| issue, but I doubt it ever will. Having the ability to run all
| the post processing you want and validate it is a big win. It's
| also just so much easier to test when everything is right there
| in your application.
|
| [0] https://aws.amazon.com/blogs/architecture/overview-of-
| data-t...
|
| [1] https://aws.amazon.com/s3/pricing/
| JimDabell wrote:
| Aside from the extra load (and _long_ requests, which is a
| bigger factor for non-async app servers), you need to take into
| account the physical location of your app servers. Depending on
| where the user is, there can be _drastic_ differences in
| performance uploading to, say, us-east-1 vs ap-southeast-1.
| With S3, you can enable transfer acceleration so that the user
| uploads to whichever is their nearest region and then AWS
| handles it from there. This can give huge speedups.
| moomoo11 wrote:
| I use lambda for image uploads.
|
| The function ensures images aren't bigger than 10mb, the image is
| compressed to our sizing, and put into s3.
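|
| In case it helps, the shape of that function is roughly this
| (a Pillow + boto3 sketch, not my exact code; it assumes an
| S3-event-triggered flow, and the bucket names and frame size
| are illustrative):
|
|     import io
|     import boto3
|     from PIL import Image
|
|     s3 = boto3.client("s3")
|     MAX_BYTES = 10 * 1024 * 1024
|     FRAME = (512, 512)  # one of our UI "frame" sizes
|
|     def handler(event, context):
|         rec = event["Records"][0]["s3"]
|         bucket, key = rec["bucket"]["name"], rec["object"]["key"]
|         if rec["object"]["size"] > MAX_BYTES:
|             return {"rejected": key}  # over the 10mb limit
|         raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
|         img = Image.open(io.BytesIO(raw))
|         img.thumbnail(FRAME)  # shrink to fit the frame
|         out = io.BytesIO()
|         img.convert("RGB").save(out, "JPEG", quality=80)
|         s3.put_object(Bucket="processed-bucket", Key=key,
|                       Body=out.getvalue(), ContentType="image/jpeg")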
| pier25 wrote:
| With Google Cloud Storage you can simply set a header with the
| max size allowed when signing the upload URL.
| matusgallik008 wrote:
| I believe AWS S3 allows setting the Content-Length header as
| well, so you might not need the lambda.
| moomoo11 wrote:
| Can you explain what you mean? I'm confused about the
| "might not need the lambda". Thanks!
|
| In my case we allow up to 10mb file upload. For example,
| just testing something right now from my iPhone, I selected
| a 3.7mb image which was uploaded and resized to the "frame"
| it will go into in the UI, and it's now 288kb in S3. We have
| a few "frame" sizes which are consistent throughout our
| application. Now, my CloudFront is serving a 288kb file
| instead of 3.7mb, which is good for me because I want to
| avoid the bandwidth costs and users honestly can't tell the
| difference, and then also we aren't a photo gallery app.
|
| So in my case I need the lambda to resize the image to our
| "frame" sizes. I wonder if I didn't explain right. I'm ESL
| so just want to make sure, since I'm confused about your
| last sentence.
| aprilnya wrote:
| They mean you might not need the lambda if all you want
| to do is enforce a max size; if you want to do
| resizing/other processing then you'd still need the
| lambda
| matusgallik008 wrote:
| You're good mate. I meant exactly what aprilnya said. I
| thought you were using the Lambda just to validate the
| 10mb limit - and just for that you don't need the lambda
| :) and could save yourself some cents & effort.
|
| But if you do resizing as well, then ofc you need the
| Lambda (or some other compute engine).
| pier25 wrote:
| For images we simply use Cloudflare Images which takes care of
| everything.
|
| Images are easy to display but for other media files you will
| probably need some streaming solution, a player, etc.
|
| We're in the audio hosting and processing space. We still don't
| have an api though.
|
| For video maybe look into Wistia, Cloudflare, BunnyCdn, etc.
| kevincox wrote:
| I would encourage not directly using the user-uploaded images.
| But uploading directly to S3 is probably fine. I just wouldn't
| use the raw file.
|
| 1. Re-encoding the image is a good idea to make it harder to
| distribute exploits. For example, imagine the recent WebP
| vulnerability. A malicious user could upload a compromised image
| as their profile picture and pwn anyone who saw that image in the
| app. There is a chance that the image survives the re-encoding
| but it is much less likely, and at the very least it makes your
| app not the easiest channel to distribute it.
|
| 2. It gives a good place to strip metadata. For example, you
| should almost certainly be stripping geo location. But in
| general I would recommend stripping everything non-essential.
|
| 3. Generating different sizes as you mentioned can be useful.
|
| 4. Allows accepting a variety of formats without requiring
| consumers to support them all. As you just transcode in one
| place.
|
| I don't know much about the cost on the AWS side, but it seems
| like you are always at some sort of risk given that if the user
| knows the bucket name they can create infinite billable requests.
| Can you create a size limit on the pre-signed URL? That would be
| a basic line of defence. But you probably also want to validate
| the uploaded data once the URL expires and decide if it conforms
| to your expectations (and delete it if you aren't interested in
| preserving the original data).
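|
| To make 1 and 2 concrete, a minimal sketch with Pillow (the
| pixel limit is arbitrary): decoding and re-saving drops
| EXIF/GPS metadata and breaks most crafted-file tricks.
|
|     import io
|     from PIL import Image
|
|     MAX_PIXELS = 25_000_000  # reject decompression bombs
|
|     def sanitize(data: bytes) -> bytes:
|         img = Image.open(io.BytesIO(data))  # reads the header only
|         if img.width * img.height > MAX_PIXELS:
|             raise ValueError("image too large")
|         img.load()  # full decode happens here
|         out = io.BytesIO()
|         # Fresh encode; no metadata is copied over.
|         img.convert("RGB").save(out, "JPEG", quality=85)
|         return out.getvalue()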
| hju22_-3 wrote:
| Do note that re-encoding can itself be used as part of the
| exploit. E.g. Team Fortress 2 recently had one that exploited a
| similar system.
| toast0 wrote:
| There've been exploits in image, video, and audio codecs...
| which is why it's important to protect your users, but also
| your servers...
|
| Best to sandbox/jail/etc as tightly as possible, and limit
| the codecs to only what you need. You can configure the
| ffmpeg builds pretty granularly... default will include too
| much.
| beeboobaa3 wrote:
| I care more about protecting my servers than my users. If
| one user attacks another, that's not really my problem;
| blame can easily be shifted to the browser. But if someone
| hacks my servers and leaks everyone's data, that's my
| problem.
|
| So not encoding is probably the safer way to go for the
| business.
| mthoms wrote:
| Yikes.
|
| What if the other user getting attacked is _you_ or
| another admin on your team?
|
| Now the attacker has admin access and can compromise your
| servers and "leak everyone's data" just fine.
|
| I don't think you've thought this through.
| bawolff wrote:
| Trust me, your website will always get blamed, regardless
| of whether it's the user's fault.
|
| If websites get blamed when users reuse passwords, they
| are definitely getting blamed if you distribute malicious
| files.
| kevincox wrote:
| I don't think the exploit in that case was re-encoding. What
| happened is an image with very large dimensions was uploaded.
| When this was decoded into a raw pixel buffer on the client
| it used tons of memory. It was effectively a zip bomb attack.
|
| In fact re-encoding probably would have solved this as the
| server could enforce the expected dimensions and rescale or
| reject the image.
| arrowsmith wrote:
| Sounds complicated. Any recommendations for existing libraries
| I can use to handle this?
| RobotToaster wrote:
| I know lemmy uses pict-rs https://git.asonix.dog/asonix/pict-
| rs/
| JimDabell wrote:
| https://www.libvips.org
| tetris11 wrote:
| > recent WebP vulnerability
|
| For anyone wondering:
|
| https://blog.cloudflare.com/uncovering-the-hidden-webp-vulne...
|
| > The vulnerability allows an attacker to create a malformed
| WebP image file that makes libwebp write data beyond the buffer
| memory allocated to the image decoder. By writing past the
| legal bounds of the buffer, it is possible to modify sensitive
| data in memory, eventually leading to execution of the
| attacker's code.
| vitro wrote:
| > 3. Generating different sizes as you mentioned can be useful.
|
| For years now I've used nginx's image filter [1], which handles
| image resizing quite nicely. Resized images are cached by nginx.
| For some use cases it works very well, and I no longer need to
| specify sizes beforehand; you just ask for the size by crafting
| your URL properly.
|
| [1]
| https://nginx.org/en/docs/http/ngx_http_image_filter_module....
| TylerE wrote:
| How does that scale if, say, image access is relatively
| random and your data set exceeds server RAM by a couple of
| orders of magnitude?
| smackeyacky wrote:
| This is good advice. Just pick a resize dimension and try to
| resize everything that comes in. If it fails it's not an image.
|
| You can hang an event off S3 and have a lambda that does the
| work / warns you of a bad upload.
| kijin wrote:
| Resizing also helps reduce costs. The latest phones can
| generate ridiculously large images with 200+ megapixels. You
| really don't want to dump that kind of behemoth in your S3
| bucket and serve it as somebody's profile pic.
|
| Videos will add even more to your AWS bill if you're not
| careful. Re-encode that 4K cat video as soon as it comes in,
| or wire up a CDN to do it for you.
| mmsc wrote:
| >Re-encoding the image is a good idea to make it harder to
| distribute exploits. For example imaging the recent WebP
| vulnerability.
|
| And now your server which is doing re-encoding is pwned. Gotta
| segment the server doing that somehow, or use pledge, capsicum,
| or seccomp I guess.
| dpkirchner wrote:
| This is a pretty good use case for serverless endpoints,
| assuming the volume is pretty low.
| hamandcheese wrote:
| Serverless + making sure the serverless function doesn't
| have any privileges. More often than not I see people use
| serverless in the name of security, and then give the
| function write access to prod resources.
| Hamuko wrote:
| I've run into a security issue where a serverless function
| had a pretty large range of AWS access and a pentester was
| able to utilise that.
| dpkirchner wrote:
| That's a bad use case for serverless :)
| cyberpunk wrote:
| Ideal use case for a cheeky OpenBSD machine...
|
| Not that OpenBSD is actually unhackable or anything, but I
| doubt many attackers would guess you're running imagemagick
| on OpenBSD in your image pipeline.
|
| I rather like it for such use cases; it has the added
| benefit that it never, ever seems to die. The other week I
| found a 6.0 machine I set up doing some kind of risky Kafka
| processing that had an uptime of 6 years (since migrated).
| Moru wrote:
| I had to reboot one of our servers last year, also at just
| over 6 years. Rebooted because of a physical move to another
| server hotel, not because it wasn't working :-)
|
| It is running ImageMagick to optimize images, create
| different resolutions and re-encode them. It's only open
| for us to upload manually on behalf of our customers though;
| they can't upload themselves. Input anything, output jpg,
| very easy to use.
| forgotusername6 wrote:
| Draw image to canvas in the browser. Read image from canvas.
| Upload. If you are completely paranoid you could then upload
| raw pixel data only and construct whatever image format you
| wanted server side.
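|
| Server side, the nice property is that a fixed-layout RGBA
| buffer is the whole interface, so there's no format parser
| left to exploit. A rough Pillow sketch (the dimension cap is
| arbitrary; width/height come from the client and must be
| validated):
|
|     from PIL import Image
|
|     def from_raw_rgba(buf: bytes, width: int, height: int):
|         if width * height > 4096 * 4096:
|             raise ValueError("dimensions too large")
|         if len(buf) != width * height * 4:
|             raise ValueError("buffer does not match dimensions")
|         # Interpret the bytes as pixels; nothing else is parsed.
|         return Image.frombytes("RGBA", (width, height), buf)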
| samus wrote:
| The user could push the file to the endpoint directly
| without using the client-side functionality.
| beeboobaa3 wrote:
| Did you forget to never trust the client? Please tell me
| you haven't built any products using this philosophy.
| Arch485 wrote:
| It's not a terrible idea... This doesn't "trust" the
| client, it just interprets the data that the client sent
| as an array of pixel values. In a memory safe language
| (e.g. JS, C#, Go, Rust, ...), that would make it
| basically impossible to pwn: the worst thing an attacker
| could do is upload an arbitrary image.
| cyberpunk wrote:
| The image is still being posted somewhere, right? What
| guarantee do you have that it was your wasm blob doing
| the post vs some j33t haxxors curl command from his kali
| vm?
| TylerE wrote:
| The image handling libraries in those langs are almost
| all written in C/C++. If you're just wrapping ffmpeg or
| imagemagick or libpng you're not really protected from
| much of anything.
|
| False security, if anything.
| smsm42 wrote:
| It's just not secure - anything you do on the client can
| be trivially circumvented.
| afiori wrote:
| For video you might want to use the GPU, but for images this
| sounds like a good use case for Wasm
| SkyPuncher wrote:
| We run these services on isolated machines.
| JimDabell wrote:
| > For example you should almost certianlly be stripping geo
| location. But in general I would recommend stripping everything
| non-essential.
|
| Including animation in most cases. Otherwise somebody can use a
| single frame with very long duration that will be reviewed by
| moderators, then follow that frame with a different frame
| containing objectionable material which will eventually be
| shown to people.
| joshstrange wrote:
| Wow, that's something I had never considered. I did run into
| a bug with someone uploading a gif to our servers and our
| resize script spitting out a file per resized gif frame
| (base_1.png, base_2.png, ...) that I had to fix but I'd never
| have thought of your example. We just took the first gif
| frame in our case, which would have been safe from this
| thankfully.
| qingcharles wrote:
| I often find you can upload animated GIFs to sites that
| don't allow them by just renaming them to PNG first.
| busymom0 wrote:
| > Can you create a size limit on the pre-signed URL
|
| Yes, a pre-signed URL can have the `Content-Length` set, and
| Amazon S3 checks it. However, note that this is true for Amazon
| S3 but not for others like Backblaze or R2. Last time I tried,
| Backblaze didn't support it.
| gehen88 wrote:
| Only with createPresignedPost, not with getSignedUrl.
| perpil wrote:
| Presigned post lets you set content-length-range: https://d
| ocs.aws.amazon.com/AmazonS3/latest/API/sigv4-authen... You
| can specify content length on presigned PUTs, but it needs
| to be set as a header and added to the signed headers for
| SIGv4. It can't be set as a query param.
| badrabbit wrote:
| You can tell a lot from the exif metadata of images so that's
| one reason (user privacy) to always re-encode images.
| sdsd wrote:
| >Re-encoding the image is a good idea to make it harder to
| distribute exploits.
|
| Famously, the Dangerous Kitten hacking toolset was distributed
| with the classic zip-in-a-jpeg technique, because imageboards
| used to not re-encode images.
|
| https://web.archive.org/web/20120322025258/http://partyvan.i...
| smsm42 wrote:
| There's also an old bug where browsers re-interpreted images as
| HTML (even with the correct MIME type set), which allowed
| hosting exploits on user-upload sites. Not sure if any modern
| browser still has this problem, but it used to be a concern.
| Recoding the image usually broke those exploits. Though it could
| break some metadata - e.g. if you go from JPEG to PNG you could
| lose EXIF data.
| perpil wrote:
| > Can you create a size limit on the pre-signed URL?
|
| Yes, if you use the POST method you can set the content-length-
| range property in your presigned URL form inputs to limit min
| and max bytes.
| https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-authen...
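|
| With boto3 it looks something like this (bucket/key and the
| limits are just for illustration):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     post = s3.generate_presigned_post(
|         Bucket="user-uploads",
|         Key="avatars/123.png",
|         # Reject bodies outside 1 byte .. 10 MB at the S3 side.
|         Conditions=[["content-length-range", 1, 10 * 1024 * 1024]],
|         ExpiresIn=300,
|     )
|     # post["url"] and post["fields"] become the form inputs the
|     # client submits along with the file.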
| p2hari wrote:
| Look at https://uppy.io/ - open source with a lot of
| integrations. You can keep moving to different levels of
| abstraction as required and see some good practices for how
| things are done.
| junto wrote:
| You should quarantine them until you've analyzed them.
|
| Like you stated, an async process using a function would suffice.
| I've previously used ClamAV for this in a private cloud solution;
| I've also used the built-in anti-virus support on Azure Blob
| Storage if you don't mind multi-cloud, plus an Azure Function has
| the ability to support blob triggers, which is a nice feature.
|
| The file types scan is relatively simple. You just need a list of
| known "magic string" header values to do a comparison, and for
| that you only need a max of 40 bytes of the beginning of the file
| to do the check (from memory). Depending on your stack, there are
| usually some libraries already available to perform the matching.
|
| And it goes without saying, but never trust the client, and
| always generate your own filenames.
|
| https://en.m.wikipedia.org/wiki/List_of_file_signatures
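|
| In Python, the check really is this small (a few signatures
| taken from that list; extend as needed):
|
|     SIGNATURES = {
|         b"\xff\xd8\xff": "image/jpeg",
|         b"\x89PNG\r\n\x1a\n": "image/png",
|         b"GIF87a": "image/gif",
|         b"GIF89a": "image/gif",
|         b"%PDF-": "application/pdf",
|     }
|
|     def sniff(head: bytes):
|         # `head` only needs the first ~40 bytes of the file.
|         for magic, mime in SIGNATURES.items():
|             if head.startswith(magic):
|                 return mime
|         return None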
| dividuum wrote:
| I'm pretty sure ClamAV would add more vulnerabilities to your
| stack than it might prevent from being exploited.
| junto wrote:
| It's an old stack. Can you suggest an alternative self hosted
| option?
| dividuum wrote:
| I consider the approach of using virus scanning inherently
| flawed: They rely on heuristics and rules to essentially
| create a blacklist. And they add a ton of complexity (more
| code -> usually more bugs) which is not something you want
| if you work with untrustworthy data.
|
| What you instead want is a whitelist: Only allow properly
| formatted images and videos and ruthlessly reject anything
| else. I wrote about how I implemented this for my service
| in another response.
| stevoski wrote:
| My SaaS uses Cloudinary for uploading and storing images.
|
| It's not particularly cheap. But it is fast and flexible and
| safe.
| sairamkunala wrote:
| My previous company used Cloudinary for this purpose, and it
| saved an amazing amount of time.
| arcza wrote:
| Kinda incredible how nearly every comment here mentions S3. Cloud
| storage is not the only backend in existence :)
| Dowwie wrote:
| This is your moment. Let's hear about alternatives.
| hnlmorg wrote:
| That's because the requester is already using AWS and S3
| specifically.
|
| One of the most important, yet oft ignored on HN, principles of
| architecting solutions is having an unbiased view of products
| and then choosing what will fit within an existing
| organisation's architecture.
|
| Given the specifications shared, utilising S3 further is the
| correct advice.
| morbicer wrote:
| It gives you at least some basic security. Nothing can get
| executed on S3; there is no lateral movement. If you are using
| your own servers for user uploads you have to do a damn good
| job of quarantining them.
| ytch wrote:
| https://github.com/google/magika
|
| "Magika is a novel AI powered file type detection tool that
| relies on the recent advance of deep learning to provide
| accurate detection."
|
| It's not a silver bullet, but I've been using it recently for
| inspecting file types instead of the magic file database.
|
| One advantage is that it detects composite files. Take a
| pdf+exe file for example: the library will report something
| like 70% pdf and 30% pebin.
| mceachen wrote:
| I don't think this project was actually meant for production
| use, and especially not under decidedly hostile conditions.
|
| I'd suggest existing tooling like `exiftool` to do mimetype
| detection and metadata stripping.
| PaywallBuster wrote:
| 2 buckets:
|
| - upload bucket
|
| - processed bucket
|
| The upload bucket has an event triggered on new file uploads,
| which invokes a lambda; the lambda will re-encode, do whatever
| you deem fit, and upload to the processed bucket.
|
| Your app will use the processed bucket.
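|
| Wiring the event up is a one-time thing, e.g. with boto3 (the
| ARN is a placeholder; the lambda also needs permission to be
| invoked by S3):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     s3.put_bucket_notification_configuration(
|         Bucket="upload-bucket",
|         NotificationConfiguration={
|             "LambdaFunctionConfigurations": [{
|                 "LambdaFunctionArn":
|                     "arn:aws:lambda:region:acct:function:reencode",
|                 "Events": ["s3:ObjectCreated:*"],
|             }]
|         },
|     )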
| sim7c00 wrote:
| As pointed out already, magika is useful for good file type
| analysis. Also, scan thoroughly for viruses, potentially using
| multiple engines. Now, this can be tricky depending on the
| confidentiality of the uploaded files, so take good care not to
| submit them to sandboxes, AV platforms etc. if not allowed. I
| would really recommend it though.
|
| If you want to get into the nitty gritty of filetype abuse, to
| learn how to detect it well: ange albertini's work on
| polymorphic filetypes and the ezine poc||gtfo, as well as lots
| of articles by AV devs, are available. It's a really hard
| problem and also depends a lot on what program is interpreting
| the submitted files. If it's some custom tool, there might even
| be unique vectors to take into account. Fuzzing and pentesting
| the upload form and any tools handling these files can
| potentially shed light on those issues.
|
| (edit: fat fingers)
| marcpaq wrote:
| Have you considered a commercial solution?
|
| https://developer.massive.io/js-uploader/
|
| Just point it to your local files, as big as you want, and it
| does the rest. It handles the complexities of S3 juggling and
| browser constraints. Your users pay nothing to upload, you pay
| for egress.
|
| Full disclosure: I'm MASV's developer advocate.
| fy20 wrote:
| Lots of other comments give good suggestions on how to handle
| uploading and processing, but none mention serving the resulting
| content, so let me chime in:
|
| Do not serve content from S3 directly.
|
| ISPs often deprioritize traffic from S3, so downloading assets
| can be very slow. I've seen kbytes/s on a connection that
| Speedtest.net says has a download speed of 850 mbit. Putting
| Cloudfront in front of S3 solves that.
| throwaway984393 wrote:
| Not to mention using Cloudfront in front of S3 can lower cost,
| due to caching.
| maayank wrote:
| > ISPs often deprioritize traffic from S3
|
| I wonder why
| bagels wrote:
| Cloudfront does not solve it. Comcast deprioritizes any content
| that comes from Amazon.
| grishka wrote:
| In my fediverse server project[1], I convert all user-uploaded
| images to high-quality webp and store them like that. I discard
| the original files after that's done. I use imgproxy[2] to
| further resize and convert them on the fly for actual display. In
| general, I try my best to treat the original user-uploaded files
| like they're radioactive, getting rid of them as soon as
| possible.
|
| I don't do videos yet, but I'm kinda terrified of the idea of
| putting user-uploaded files through ffmpeg if/when I support
| them.
|
| [1] https://github.com/grishka/Smithereen
|
| [2] https://github.com/imgproxy/imgproxy
| JohnCClarke wrote:
| What do you do with the uploaded images? You could be exposed to
| risks that may not be immediately obvious.
|
| I have seen a team struggle for over a month to eliminate NSFW
| content - avatars - uploaded by a user that led to their site
| being demonetised.
| PaulHoule wrote:
| I had a site demonetized because about 1 in 10,000 pictures
| were of deformed penises, dead nazis, things like that.
| samus wrote:
| I think these days you'd have to deploy an NSFW detector model.
| And you'd have to silently reject the upload* and limit profile
| picture change rate, else the attacker could use the endpoint
| as an oracle to build an adversarial model, which can
| eventually generate inputs that the detector won't flag.
|
| *: act as if successful, including processing time, but throw
| away the upload
| santiagobasulto wrote:
| It's amazing that this was an issue in 2004 and it's still an
| issue today. I don't have much to add aside from what was
| already said. There are services like Uppy, Transloadit, etc.
| that simplify this, but they might be more expensive than S3+CF.
| dividuum wrote:
| I recently redesigned my stack for validating uploaded files and
| creating thumbnails from them. My approach is to have different
| binaries per file type (currently images JPEG/PNG, videos
| H264/265 and TrueType fonts). Each of them is implemented in a
| way that it receives the raw data stream via stdin and then
| either generates an error message or a raw planar RGBA data
| stream via stdout. The validation and thumbnail process first
| locks the process into a seccomp strict mode jail before
| touching any of the untrustworthy data. Seccomp prevents
| basically every syscall except read/write. Even if there were
| an exploit in the format parser, it would very likely not get
| anywhere, as there's literally nothing it could do except write
| to stdout. Outside, a strict time limit is enforced.
|
| The raw RGBA output is then received and converted back into PNG
| or similar. It was a bit tricky to get everything working without
| additional allocation and without syscalls triggered by glibc
| somewhere, but it works pretty well now and is fast enough for my
| use case (around 20ms/item).
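|
| The parent side is roughly the following (a Python sketch, not
| my exact wrapper; the decoder binary name and limits are
| illustrative, and the binary locks itself into seccomp before
| reading stdin):
|
|     import subprocess
|
|     def decode(data: bytes, decoder: str = "./decode-jpeg") -> bytes:
|         try:
|             proc = subprocess.run(
|                 [decoder],
|                 input=data,           # untrusted bytes via stdin
|                 capture_output=True,
|                 timeout=5,            # the outside time limit
|             )
|         except subprocess.TimeoutExpired:
|             raise ValueError("decoder timed out")
|         if proc.returncode != 0:
|             raise ValueError(proc.stderr.decode(errors="replace"))
|         return proc.stdout            # raw planar RGBA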
| artwr wrote:
| Oh could you expand briefly on what the stack looks like to
| accomplish this? Or do you have a write up on a blog/site you
| could share?
| dividuum wrote:
| I wrote a little bit more about this here:
| https://community.info-beamer.com/t/an-updated-approach-
| to-c...
|
| I'm a huge fan of building minimal self-contained tools, so
| all of the C programs statically link in the required parser
| libs (libavcodec/wuffs/freetype) so the resulting binaries
| don't require additional dependencies on the target machine.
| The python wrapping code is rather straightforward as well
| and is only like 300 lines of code.
| TacticalCoder wrote:
| > I don't feel this is safe enough. I also feel we could do
| better by optimizing images on the BE, or creating thumbnails for
| videos.
|
| Yeah definitely. Even optimizing the vids. I just spent time
| writing scripts to convert, in parallel, a _massive_ amount of
| JPG, PNG, PDF, _mp4_ and even some HEIC files customers sent
| of their ID (identity card or passport, basically). I did
| shrink them all to a reasonable size.
|
| The issue is: if you let users do anything, you'll have that one
| user, once in a while, who will send a 30 MB JPG of his ID.
| Recto. Then verso.
|
| Then the signed contracts: imagine a user printing a 15-page
| contract, signing/paraphing every single page, then not scanning
| it but taking a 30 MB picture, with his phone, in diagonal, in
| perspective. And sending all the files individually.
|
| After a decade, this represented a _massive_ amount of data.
|
| It was beautiful to crush that data to anywhere from 1/4th to
| 1/10th of its size and see all the cores working at full speed,
| compressing everything to reasonable sizes.
|
| Many sites and 3rd party identity verification services (whatever
| these are called) do put limits on the allowed size per document,
| which already helps.
|
| In my case I simply used ImageMagick (mogrify), ffmpeg (to
| convert to x265) and... GhostScript (good old _gs_ command). PDFs
| didn't have to be searchable for text so there's that too (and
| often already weren't at least not easily, due to users taking
| pictures then creating a PDF out of the picture).
|
| This was not in Amazon S3 but basically all in Google Workspace:
| it was for an SME, to make everything leaner, snappier, quicker,
| smaller. Cheaper too (no need to buy additional storage).
|
| Backups of all the original, full-size files were of course
| made too, but these will probably never be needed.
|
| In my case I downloaded _everything_. Both to create backups
| (offsite, offline) and to crush everything locally (simply on an
| AMD 7700X: powerful enough as long as you don't have months of
| videos to encode).
|
| > Anybody have experience with any of this? I imagine having a
| big team and dedicated services for media processing could work,
| but what about small teams?
|
| I did it as a one-person job. Putting limits in place, or
| automatically resizing a 30 MB JPG file which you know is of
| an ID card down to a 3 MB JPG right after upload, doesn't
| require a team.
|
| Same for invoking the following to downsize vids (I think
| that's what I'm using):
|
|     ffmpeg -i input.mp4 -vcodec libx265 -crf 28 output.mp4
|
| My script's logic was quite simple: files above a certain size
| were candidates for downsizing; if the re-encode succeeded and
| took less than a certain amount of time, use the output,
| otherwise keep the original.
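|
| Something like this sketch captures the logic (not my actual
| script; the threshold, timeout and suffix handling are made up):
|
|     import subprocess
|     from pathlib import Path
|
|     THRESHOLD = 5 * 1024 * 1024  # only bother above ~5 MB
|
|     def shrink_video(src: Path) -> None:
|         if src.stat().st_size < THRESHOLD:
|             return
|         out = src.with_suffix(".x265.mp4")
|         try:
|             r = subprocess.run(
|                 ["ffmpeg", "-y", "-i", str(src), "-vcodec",
|                  "libx265", "-crf", "28", str(out)],
|                 capture_output=True, timeout=3600,
|             )
|         except subprocess.TimeoutExpired:
|             out.unlink(missing_ok=True)  # too slow: keep original
|             return
|         if r.returncode == 0 and out.stat().st_size < src.stat().st_size:
|             out.replace(src)             # use the smaller re-encode
|         else:
|             out.unlink(missing_ok=True)  # failed or bigger: keep original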
|
| I didn't bother verifying that the files visually matched (once
| again: all the originals are available on offline backups in case
| something went south and some file is really badly needed) but I
| could have done that too. There was a blog post posted here a few
| years ago where a book author would visually compare thumbnails
| of different revisions of his book, to make sure that nothing
| changed too much between two minor revisions. I considered doing
| something similar but didn't bother.
|
| Needless to say my client is _very_ happy with the results and
| the savings.
|
| YMMV but worked for me and worked for my client.
| mattpavelle wrote:
| If you're looking for a good image optimization product, I've had
| excellent results from ImageOptim (I have no affiliation with
| them). They have a free Mac App, they have an API service, but
| also they kindly link to lots of other free similar products and
| services: https://imageoptim.com/versions.html
|
| If you can spare the CPU cycles (depending on how you optimize,
| they can actually be expensive) and if your images will be
| downloaded frequently, your users will thank you.
| 4RealFreedom wrote:
| Read through the comments and was surprised no one mentioned
| libvips - https://github.com/libvips/libvips. At my current small
| company we were trying to allow image uploads and started with
| imagemagick but certain images took too long to process and we
| were looking for faster alternatives. It's a great tool with
| minimum overhead. For video thumbnails, we use ffmpeg which is
| really heavy. We off-load video thumbnail generation to a queue.
| We've had great luck with these tools.
| gh123man wrote:
| +1 to vips! It's amazingly fast and stable. I even wrote (some
| minimal) Swift bindings for it to be used with a Swift backend:
| https://github.com/gh123man/SwiftVips
| tylergetsay wrote:
| I have used https://github.com/cshum/imagor in front of S3 before
| and liked it. There are many (some commercial) offerings for this.
| jfengel wrote:
| I don't understand the details of cloud stuff, and it took me a
| heck of a long time to google just what a "pre-signed URL" was.
| In case anybody else is in the same bucket (ahem):
|
| Users can't upload to your S3 storage because they lack
| credentials. (It would be dangerous to make it public.) But you
| can give them access with a specially-generated URL (generated
| for each time they want to upload). So your server makes a
| special URL, "signed" with its own authorization. That lets them
| upload one file, to that specific URL.
|
| (I dunno about anybody else, but I find working with AWS always
| involves cramming in a lot of security-related concepts that I
| had never had to think about before. It puts a lot of overhead on
| simple operations, but presumably is mandatory if you want an
| application accessible to the world.)
| ilogik wrote:
| You can use an EBS volume attached to the pod, or a shared EFS
| volume as well.
|
| but for certain applications, s3 is simpler to use, especially
| when you need to scale
| jfengel wrote:
| I have zero idea what an EBS, EFS, or pod is.
|
| Not that I'm asking for an explanation. Just illustrating how
| much stuff there is to learn for the basic operation of "run
| this on your computer", even for an experienced developer.
| (I've been doing this for nearly 40 years.)
| danrob wrote:
| You should probably learn.
| bombela wrote:
| It's a cool feature. You don't need to proxy the upload via
| your web server; instead it is directly handled by AWS.
|
| It's most likely more efficient, faster, and cheaper (no need
| to handle the traffic and hardware to proxy).
| busymom0 wrote:
| Pre-signed URLs also allow you to set conditions such as file
| size limit, type of file, file name etc. to ensure a malicious
| party isn't uploading a massive 1 TB file into your bucket
| which you serve as profile pics. However, while Amazon S3
| supports these "conditions", others like Backblaze implementing
| S3 may or may not implement them. So beware.
| gehen88 wrote:
| Note that limiting content length only works with
| createPresignedPost, not with getSignedUrl.
| junto wrote:
| Both Azure and AWS have this feature and it's very useful to
| avoid having to proxy large files through your own backend.
| rendaw wrote:
| Calling it a "single use upload url" would probably be a lot
| clearer. Describe the purpose, not the mechanism...
| ben_jones wrote:
| Assuming you serve that content out through a CDN a lot of
| optimization work will be handled there and customization should
| also be handled there. I'd be shocked if CDNs don't allow you to
| do much/all of that out of the box.
|
| Honestly though if this is an authenticated function and you have
| a small user base... who cares? Is there a reasonable chance at
| this disrupting any end user services? Maybe it's not the best
| way to spend hundreds of hours and thousands of dollars.
|
| Granted, you're an SRE, so it's your job to think about this.
| I'd just push back on defaulting to dropping serious resources
| on a process that might be entirely superfluous for your use
| case.
| tootie wrote:
| If it's AWS, there's basically nothing like this available by
| default. There are transcoding services and edge functions but
| you need to set them up yourself.
| j45 wrote:
| I'm not affiliated, but a cloud file service from someone like
| Backblaze may interest you.
| eastoeast wrote:
| Following a similar stack, has anyone found success handling
| iPhone HDR and Live Photos? Both seem to cause issues with
| standard HTML formats. I believe we're using an AWS service to
| convert videos to various qualities (maybe Elastic Transcoder or
| MediaConvert), and those iPhone video formats cause the service
| to error out.
| squigz wrote:
| Is HN turning into StackOverflow now?
| lobito14 wrote:
| This would probably be closed on SO as too broad.
| whoknowsidont wrote:
| Good. It doesn't work here either.
| giaour wrote:
| > Occasionally, limits are set on the <input/> element for file
| types.
|
| Since this isn't enforced by the presigned PUT URL, you can't
| trust that the limits have been respected without inspecting
| whatever was uploaded to S3. You can get a lot more flexibility
| in what is allowed if you use an S3 presigned POST, which lets
| you set minimum and maximum allowed content lengths.
|
| [0]:
| https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPO...
| sigil wrote:
| Like you, we use pre-signed S3 upload urls. From there we use
| Transloadit [0] to crop and sanitize and convert and generate
| thumbnails. Transloadit is basically ImageMagick-as-a-Service.
| Running ImageMagick yourself on a huge variety of untrusted user
| input would be terrifying.
|
| [0] https://transloadit.com/
| b-karl wrote:
| I was part of designing a user file upload; it was a B2B
| product with a limited number of users, in principle trusted
| users, but similar to other comments we did something like:
|
| - some file type and size checks in web app
|
| - pre-signed URL
|
| - upload bucket
|
| - lambdas for processing and sanity checks
|
| - processed bucket
| atonse wrote:
| We use Cloudflare Images and Cloudflare Stream (Video) to process
| images and video that are uploaded to our site.
|
| Both have worked well for us so far but I don't know about your
| scale and impact on pricing (we're small scale so far).
|
| Cloudflare Images lets you auto resize images to generate
| thumbnails, etc. Same with video, where they will auto-encode the
| video based on who's watching it where. So for us it's just a
| matter of uploading it, getting an identifier, and storing that
| identifier.
| hamandcheese wrote:
| For what it's worth, processing the files is probably more risky
| for your internal infra than doing nothing. I've seen a RCE
| exploit from resizing profile images before.
|
| On the other hand, not processing/scanning your uploads is
| probably more risky for your users/the rest of the internet.
| time0ut wrote:
| We map the TUS[0] protocol to S3 multipart upload operations.
| This lets us obscure the S3 bucket from the client and
| authorize each interaction. The TUS operations are handled by a
| dedicated micro-service. It could be done in a Lambda or
| anything.
|
| Once the upload completes we kick off a workflow to virus scan,
| unzip, decrypt, and process the file depending on what it is. We
| do some preliminary checks in the service looking at the file
| name, extension, magic bytes, that sort of stuff and reject
| anything that is obviously wrong.
|
| For virus scanning, we started with ClamAV[1], but eventually
| bought a Trend Micro product[2] for reasons that may not apply to
| you. It is serverless based on SQS, Lambda, and SNS. Works fine.
|
| Once scanned, we do a number of things. For images that you are
| going to serve back out, you for sure want to re-encode those and
| strip metadata. I haven't worked directly on that part in years,
| but my prototype used ImageMagick[3] to do this. I remember being
| annoyed with a Java binding for it.
|
| [0] https://tus.io/ [1] https://www.clamav.net/ [2]
| https://cloudone.trendmicro.com/ [3]
| https://imagemagick.org/index.php
| efxhoy wrote:
| We run an image scaler on AWS Lambda based on libvips. We cache
| the responses from it with Cloudflare. We compared it to letting
| Cloudflare handle the scaling, and the Lambda was several times
| cheaper.
| whoknowsidont wrote:
| Is this really the state of the industry? Where an SRE is asking
| how to handle user media on the web?
|
| I'm not diminishing asking the question in principle, I'm
| questioning the role and forum that the question is being asked
| on.
| dgoldstein0 wrote:
| Some application security thoughts for serving untrusted content.
| Not all are required but the main thing is that you don't want
| the user to be able to serve html or similar (pdf, SVG?) file
| formats that can use your origin and therefore gain access to
| anything your origin can do:
|
| - serve on a different top level domain, ideally with random
| subdomains per uploaded file or user who provides the content.
| This is really most important for serving document types and
| probably not for images though SVG I think is the exception as it
| can have scripting and styling within when loaded outside of an
| IMG tag
|
| - set "content-security-policy: sandbox" (don't allow scripts and
| definitely don't allow same origin)
|
| - set "X-Content-Type-Options: no sniff" - disabling sniffing
| makes it a lot harder to submit an image that's actually
| interpreted as html or js later.
|
| Transforming the uploaded file also would defeat most exploit
| paths that depend on sniffing the content type.
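|
| For the header part, a minimal sketch (Flask here just for
| illustration; if you serve straight from S3/CloudFront you'd
| attach these via a response headers policy instead):
|
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     @app.after_request
|     def harden(resp: Response) -> Response:
|         # No scripts, no same-origin access for served uploads.
|         resp.headers["Content-Security-Policy"] = "sandbox"
|         # Don't let browsers second-guess the declared type.
|         resp.headers["X-Content-Type-Options"] = "nosniff"
|         return resp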
| bawolff wrote:
| Pdf isn't too bad; it has javascript but is generally sandboxed
| so you can't do anything bad with it.
|
| SVG is basically the same as html, and you can do standard XSS
| stuff.
|
| Hosting on a separate domain (not just a subdomain but an
| actual separate domain) is a must if you allow formats like
| svg, and a good idea generally.
| bagels wrote:
| If you are a big enough target, people will try to compromise
| your infrastructure or your users through these uploads.
|
| Some problems you can run into:
|
| - Exploiting image or video processing tools with crafted
| inputs, leading to server compromise when encoding/resizing.
|
| - Having you host illegal or objectionable material.
|
| - Causing extreme resource consumption, especially around video
| processing and storage.
|
| - Having you host material that in some way compromises your
| clients (exploits bugs in jpeg libraries, cross site scripting,
| etc.)
|
| I can't really talk about what is done at the FAANG that I worked
| at on this stuff, but if you are a large enough target, this is a
| huge threat vector.
| qingcharles wrote:
| I remember one major web host in 2004... I noticed they weren't
| checking the extension of profile pic uploads, so I uploaded an
| .aspx file that I wrote a file tree explorer into.
|
| From there I could browse through all of their customers' home
| directories; eventually I found the SQL database admin password,
| which turned out to be the same as their administrator password
| for the Windows server it was running on: "internet".
|
| This was a big lesson for me in upload sanitizing.
| nickjj wrote:
| Beyond taking advantage of validations that are enforced with IAM
| policies, you can also have a background job handle making
| thumbnails or whatever you want.
|
| Also I don't think the Content-Type is actually verified by S3 so
| technically users can still upload malicious files such as an
| executable with a png extension.
|
| On the bright side, S3 supports requesting a range of bytes. You
| can use that to perform validation server side afterwards, to
| verify it's really a png, jpg or whatever format you want.
| Here's examples in Python and Ruby to verify common image types
| by reading bytes: https://nickjanetakis.com/blog/validate-file-
| types-by-readin...
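|
| The ranged read itself is a one-liner with boto3 (bucket/key
| for illustration):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     head = s3.get_object(
|         Bucket="user-uploads",
|         Key="avatars/123.png",
|         Range="bytes=0-7",  # 8 bytes covers the PNG magic
|     )["Body"].read()
|
|     is_png = head == b"\x89PNG\r\n\x1a\n"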
| erhaetherth wrote:
| I tried to do the pre-signed URL thing but gave up quickly. I
| don't know how you'd do it properly. You're going to want a
| record of that in your database, right? So what, you have the
| client upload the image and then send a 2nd request to your
| server to tell you they uploaded it?
|
| I ended up piping it through my server. I can limit file size,
| authenticate, insert it into my DB, and whatnot this way.
| brianhama wrote:
| Usually it's the opposite order. The client requests an upload,
| an entry is made in the database, and a pre-signed URL is
| generated and returned to the client. The client uploads the
| image. Optionally, the client could then tell the server the
| upload was completed, although I've never done that final step.
| jftuga wrote:
| Slight OT...
|
| I created a program for profile pictures. It uses face
| recognition technology so as not to deform faces when resizing
| photos. This may be useful to you.
|
| https://github.com/jftuga/photo_id_resizer
___________________________________________________________________
(page generated 2024-05-04 23:00 UTC)