[HN Gopher] Ways to shoot yourself in the foot with Redis
___________________________________________________________________
Ways to shoot yourself in the foot with Redis
Author : philbo
Score : 88 points
Date : 2023-07-29 14:25 UTC (8 hours ago)
(HTM) web link (philbooth.me)
(TXT) w3m dump (philbooth.me)
| koolba wrote:
| > I wrote a basic session cache using GET, which fell back to a
| database query and SET to populate the cache in the event of a
| miss. Crucially, it held onto the Redis connection for the
| duration of that fallback condition and allowed errors from SET
| to fail the entire operation. Increased traffic, combined with a
| slow query in Postgres, caused this arrangement to effectively
| DOS our Redis connection pool for minutes at a time.
|
| This has nothing to do with the redis server. This is bad
| application code monopolizing a single connection waiting for an
| unrelated operation. A stateless request / response to interact
| with redis for the individual operations does not hold any such
| locks.
| zgluck wrote:
| And the default limit is 10k connections.
|
| https://redis.io/docs/reference/clients/#maximum-concurrent-...
| philbo wrote:
| > This has nothing to do with the redis server. This is bad
| application code monopolizing a single connection waiting for
| an unrelated operation.
|
| Well, yes. That is why the preceding sentence, which you didn't
| quote, said "poorly-implemented application logic". So thanks
| for agreeing with my post, I guess.
|
| The point, in case you missed it, was to advertise ways I'd
| fucked up and hopefully help others not to fuck up the same way
| in future. It was never my intention to say Redis was the
| problem and I'm sorry if it made you think that.
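
A minimal sketch of the fix implied by this subthread: borrow a connection only for the individual GET and SET, run the slow fallback query with no connection held, and treat the cache write as best effort. `FakePool`, `get_session` and `load_from_db` are hypothetical names used for illustration; the pool is an in-memory stand-in, not a real Redis client.

```python
from contextlib import contextmanager


class FakePool:
    """Hypothetical in-memory stand-in for a Redis connection pool."""

    def __init__(self, fail_set=False):
        self.store = {}
        self.in_use = 0          # connections currently checked out
        self.fail_set = fail_set

    @contextmanager
    def connection(self):
        self.in_use += 1
        try:
            yield self
        finally:
            self.in_use -= 1     # always hand the connection back

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        if self.fail_set:
            raise ConnectionError("simulated SET failure")
        self.store[key] = value


def get_session(pool, session_id, load_from_db):
    key = f"session:{session_id}"

    # 1. Hold a connection only for the GET itself.
    try:
        with pool.connection() as conn:
            cached = conn.get(key)
    except Exception:
        cached = None            # a cache failure is treated as a miss

    if cached is not None:
        return cached

    # 2. The slow fallback runs with no cache connection held.
    value = load_from_db(session_id)

    # 3. Populate the cache, but never let a failed SET fail the request.
    try:
        with pool.connection() as conn:
            conn.set(key, value)
    except Exception:
        pass                     # real code would log this and narrow the except

    return value
```

With this shape, a slow Postgres query delays only its own request; it no longer ties up a pooled connection while it waits.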
| badrabbit wrote:
| Don't expose your redis to the internet (please!). Don't
| whitelist large swathes of your cloud/hosting provider's subnets
| either. Of course redis isn't special: the same goes for mongo,
| elastic, docker, k8s, etc., even if it is a testing server and
| you will never put important data on it.
| amenghra wrote:
| This. Configure private vlans and/or Wireguard or whatever VPN
| software you prefer.
| nforgerit wrote:
| And what about mTLS?
| GauntletWizard wrote:
| MTLS doesn't affect this advice at all. You should, where
| possible, use MTLS because it's good security. You
| shouldn't leave your redis server open to the internet
| anyway, to cut down on logspam.
|
| With MTLS, a good security posture is to log every
| connection establishment, with basic metadata about the
| certificate involved - its SAN and public key hash are the
| best bet. For troubleshooting, do that logging before the
| authentication decision. But anyone can make their own
| certificate, so keeping network controls keeps that list
| free of clutter.
| berkle4455 wrote:
| > Crucially, it held onto the Redis connection for the duration
| of that fallback condition and allowed errors from SET to fail
| the entire operation.
|
| What? Was this inside a MULTI (transaction) or something? This
| isn't a flaw of Redis being single-threaded. Honestly all of
| these "footguns" sound like amateur programmer mistakes and have
| zero to do with Redis.
| philbo wrote:
| No. As it explains at the beginning of the paragraph you're
| quoting:
|
| > If you're particularly naive, like I was on one occasion,
| you'll exacerbate these failures with some poorly-implemented
| application logic.
|
| Then a few paragraphs above that is this sentence:
|
| > The gotchas that follow were all occasions when I didn't use
| it correctly.
|
| I'm not sure how to make it more clear that I'm criticising
| myself, not Redis, in the post, but that's the intention. If
| you have suggestions how I could make it more obvious, please
| let me know.
| berkle4455 wrote:
| The title comes across like these are faults of Redis and
| that if you're not particularly careful you'll shoot
| yourself in the foot.
|
| > I'm not sure how to make it more clear that I'm criticising
| myself
|
| "Mistakes I made while building applications on Redis"
| philbo wrote:
| Thanks, I'll update the post and link to your comment for
| attribution.
| scrame wrote:
| I had a jr dev connect and type 'flushall' because he thought it
| would refresh the dataset to disk.
|
| Thankfully it was on a staging env. I think he's at Google now.
| spacephysics wrote:
| One time during my internship years ago I took down a
| production server because of a command I ran on it that I
| didn't fully understand.
|
| Since then I treat any prod server terminal like I'm entering
| launch codes for a missile system.
|
| Anything outside of ls or cd I'm very careful, read the command
| a couple times before executing, etc.
| returningfory2 wrote:
| In my opinion you weren't at fault here. Production systems
| should be designed so that one person can't inadvertently
| destroy things.
| RyanHamilton wrote:
| This is the way.
| ljm wrote:
| In almost every place I've worked at, the most difficult
| thing has been getting people out of ad-hoc JFDI style
| development and debugging, where everything in production
| is fair game, and into a process where you avoid touching
| production as much as humanly possible.
|
| Takes a lot of effort to stop people opening up a shell in
| prod or grabbing a prod DB dump or even just connecting to
| the prod datastore directly from their local env.
| tetha wrote:
| For critical and overall... fiddly things, we've grown into a
| culture of writing down reviewable plans and possibly
| executing these plans in pairs.
|
| We tend to go ahead and either use a runbook, or whatever
| experience we might have, to set up a pretty detailed plan of
| what to run on which systems with which purpose. You can then
| throw these plans at someone else to review. Sure, it takes
| an hour or two more to set up a solid plan, and waiting for a
| review takes time as well. But this has turned into a great
| tool to build up experience in weird parts of the
| infrastructure.
| [deleted]
| js2 wrote:
| You can use rename-command to help avoid these kinds of
| mistakes:
|
|     # To disable:
|     rename-command FLUSHALL ""
|
|     # To rename:
|     rename-command FLUSHALL DANGER_WILL_ROBINSON_FLUSH_ALL
| kgeist wrote:
| Another one: don't use distributed locks using Redis (Redlock) as
| if they were just another mutex.
|
| Someone on the team decided to use Redlock to guard a section of
| code which accessed a third-party API. The code was racy when
| accessed from several concurrently running app instances, so
| access to it had to be serialized. A property of distributed
| locking is that it has timeouts (based on Redis' TTL if I
| remember correctly) - other instances will assume the lock is
| released after N seconds, to make sure an app instance which died
| does not leave the lock in the acquired state forever. So one day
| responses from the third party API started taking more time than
| Redlock's timeout. Other app instances were assuming the lock was
| released and basically started accessing the API simultaneously
| without any synchronization. Data corruption ensued.
| Racing0461 wrote:
| That doesn't make any sense. The timeout is how long to block
| for and retry, not how long to block for and continue.
| GauntletWizard wrote:
| You should do two things to combat this. One is to carefully
| monitor third party API timings and lock acquisition timings.
| Knowing when you approach your distributed locking timeouts
| (and alerting if they time out more than occasionally) is key
| to... Well, using distributed locks at all. There are
| distributed locking systems that require active unlocking
| without timeout, but they break pretty easily if your process
| crashes and require manual intervention.
|
| The second is to use a redis client that has its own thread -
| your application blocking on a third party API response
| shouldn't prevent you from updating/reacquiring the lock. You
| want a short timeout on the lock for liveness but a longer
| maximum lock acquire time so that if it takes several periods
| to complete a task you still can.
|
| The third is to not use APIs without idempotency. :)
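
A rough sketch of the "short timeout, keep renewing" idea from the comment above, using an in-memory fake instead of Redis so the timing is deterministic. `FakeLockServer`, `acquire` and `extend` are hypothetical names, and the injected `clock` is an assumption made for testability; a real client would call `extend` from its own thread or timer, as the comment suggests.

```python
class FakeLockServer:
    """Hypothetical stand-in for a lock service with expiring leases."""

    def __init__(self, clock):
        self._clock = clock      # callable returning the current time in seconds
        self._locks = {}         # lock name -> (owner, expires_at)

    def _holder(self, name):
        entry = self._locks.get(name)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        return None              # never taken, or the lease has expired

    def acquire(self, name, owner, ttl):
        """Take the lock if nobody currently holds a live lease on it."""
        if self._holder(name) is not None:
            return False
        self._locks[name] = (owner, self._clock() + ttl)
        return True

    def extend(self, name, owner, ttl):
        """Renew the lease, but only if `owner` still holds it."""
        if self._holder(name) != owner:
            return False         # lease was lost: the caller must stop working
        self._locks[name] = (owner, self._clock() + ttl)
        return True
```

The detail that matters is the return value of `extend`: a holder that renews on schedule keeps the lock for as long as the slow call takes, and a holder whose renewal fails learns that it has lost the lock instead of carrying on unsynchronized. This narrows the window but does not close it, which is what the fencing discussion further down the thread is about.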
| Phelinofist wrote:
| I found this blog post about Redlock quite interesting:
| https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
| remote_phone wrote:
| That doesn't make sense, they can't assume the lock is freed
| after the timeout. They have to retry to get the lock again,
| because another process might have taken the lock. Also, redis
| is single threaded so access to redis is by definition
| serialized.
| cbzoiav wrote:
| As the other guy says the lock is released by the server. If
| you don't have a mechanism to release it after a timeout,
| what happens if a node fails?
| codegladiator wrote:
| The lock is explicitly released by the redis server itself
| after the ttl. It's not that the Client will assume that the
| lock is released.
| processunknown wrote:
| The problem here is that the request timeout is greater than
| the lock timeout.
| pipe_connector wrote:
| While this might make this situation more likely to occur,
| you can _never_ prevent concurrent accesses from happening in
| a distributed system.
| mjb wrote:
| All distributed locking systems have a liveness problem: what
| should you do when a participant fails? You can block forever,
| which is always correct but not super helpful. You can assume
| after some time that the process is broken, which preserves
| liveness. But what if it comes back? What if it was healthy all
| along and you just couldn't talk to it?
|
| The classic solution is leases: assume bounded clock drift, and
| make lock holders promise to stop work some time after taking
| the lock. This is only correct if all clients play by the
| rules, and your clock drift hypothesis is right.
|
| The other solution is to validate that the lock holder hasn't
| changed on every call. For example, with a lock generation
| epoch number. This needs to be enforced by the callee, or by a
| middle layer, which might seem like you've just pushed the
| fault tolerance problem to somebody else. In practice, pushing
| it to somebody else, like a DB is super useful!
|
| Finally, you can change call semantics to offer idempotency (or
| other race-safe semantics). Nice if you can get it.
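
The "lock generation epoch number" approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Redlock or any particular library works: `LockService`, `FencedStore` and `StaleLockError` are hypothetical names, and the store stands in for whatever callee (for example a database) enforces the check.

```python
class StaleLockError(Exception):
    """Raised when a write carries an epoch older than one already seen."""


class LockService:
    """Hypothetical lock server: every grant gets a larger epoch number."""

    def __init__(self):
        self._epoch = 0

    def acquire(self):
        self._epoch += 1
        return self._epoch


class FencedStore:
    """Hypothetical callee that enforces the epoch check on every write."""

    def __init__(self):
        self.highest_epoch = 0
        self.value = None

    def write(self, epoch, value):
        # Reject writers whose lock grant has been superseded.
        if epoch < self.highest_epoch:
            raise StaleLockError(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.value = value


def demo():
    locks, store = LockService(), FencedStore()
    slow = locks.acquire()       # holder A gets epoch 1, then stalls
    fast = locks.acquire()       # lease expired, holder B gets epoch 2
    store.write(fast, "from B")
    try:
        store.write(slow, "from A")   # A wakes up still believing it holds the lock
    except StaleLockError:
        return store.value
    return None
```

In `demo`, the late write from the stalled holder is rejected by the store rather than silently overwriting newer data, so correctness no longer depends on the lock timeout being longer than the slowest request.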
| ljm wrote:
| I've found that your mileage will vary when using Redis in
| clustered mode because even if there is an official Redis
| driver in your language of choice that supports it, this might
| not be exposed by any libraries that depend on it. In those cases
| you'll just be connecting to a single specific instance in the
| cluster but will mistakenly believe that isn't the case.
|
| I've noticed this particularly with Ruby where the official gem
| has cluster and sentinel support, but many other gems that depend
| on Redis expose their own abstraction for configuring it and it
| isn't compatible with the official package.
|
| Of course, I think that running Redis in clustered mode is
| actually just another way to shoot yourself in the foot,
| especially if a standalone instance isn't causing you any
| trouble, as you can easily run into problems with resharding or
| poorly distributing the keyspace. Maybe just try out Sentinel for
| HA and failover support if you want some resilience.
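
For readers wondering how a keyspace ends up "poorly distributed": as far as I understand the Redis Cluster specification, each key maps to one of 16384 hash slots via CRC16 of the key, and only the part inside the first non-empty `{...}` hash tag is hashed when one is present. The sketch below assumes Python's `binascii.crc_hqx` with an initial value of 0 matches that CRC16 variant; `hash_slot` and `slots_used` are hypothetical helper names.

```python
import binascii

SLOT_COUNT = 16384


def hash_slot(key: str) -> int:
    """Approximate the slot a key would map to in Redis Cluster."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag such as {user:42} replaces the hashed bytes.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def slots_used(keys):
    """Return the set of distinct slots a collection of keys lands in."""
    return {hash_slot(k) for k in keys}


if __name__ == "__main__":
    tagged = [f"{{tenant:1}}:item:{i}" for i in range(1000)]
    plain = [f"tenant:1:item:{i}" for i in range(1000)]
    print(len(slots_used(tagged)), "slot(s) for tagged keys")
    print(len(slots_used(plain)), "slot(s) for untagged keys")
```

Keys that share a hash tag all land in one slot, and therefore on one node. That is handy for multi-key operations, but overusing a single tag is one way to concentrate load on one shard.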
| jrockway wrote:
| It seems like you can run Envoy as a sidecar next to each
| application instance to allow non-cluster-aware libraries to
| use the cluster:
| https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...
| efrecon wrote:
| This is cool! Didn't know
| ljm wrote:
| That's interesting, I suppose you still lose the benefit of
| your redis driver being unaware of cluster-mode though, so
| errors are going to be at protocol level and not application
| level.
|
| Better than nothing though.
| adventured wrote:
| I have been using Redis for a long time and one of the things I
| love about it, is how difficult it is to shoot yourself in the
| foot with it. From the first use, after briefly reading some
| basic tips on what not to do, it was ridiculously simple to just
| get to work with it. I've never once run into a security or
| performance issue with it.
| jontonsoup wrote:
| Has anyone seen max (p100) client latencies of 300 to 400ms but
| totally normal p99? We see this across almost all our redis
| clusters on elasticache and have no idea why. CPU usage is tiny.
| Slowlog shows nothing.
| secondcoming wrote:
| Is it doing backups?
| jontonsoup wrote:
| My understanding is elasticache does not let you turn them
| off.
| GauntletWizard wrote:
| I would guess your problem is probably scheduler based. The
| default(ish) Linux scheduler operates in 100ms increments, and
| the first use of a client takes 3-4 round-trips: the TCP
| connection opens and the client blocks, the request is sent
| and the client blocks on write, then the client attempts to
| read and blocks on read. If CPU usage is high momentarily,
| each of these yields to another process and your client isn't
| scheduled for another 100ms.
| jontonsoup wrote:
| Hmm. We have super low CPU utilization- something like 9%.
| This is also across 10+ different clusters.
| jontonsoup wrote:
| We also pool our clients heavily. Maybe we could reduce the
| new connections to zero to test.
| welder wrote:
| Change the default `stop-writes-on-bgsave-error` to "no" or
| you're asking for trouble... a ticking time bomb.
| [deleted]
| chrisbolt wrote:
| Isn't it another ticking time bomb to accept writes that will
| be lost if the server is shut down?
| GauntletWizard wrote:
| Expecting that any key in redis will be there next time you
| read it is a ticking timebomb. Redis is not a database. It's
| a cache.
|
| Unless you're using AOF mode with fsync always, you can lose
| writes. If you're doing that, you should be using a real
| database instead.
___________________________________________________________________
(page generated 2023-07-29 23:01 UTC)