[HN Gopher] Ways to shoot yourself in the foot with Redis
       ___________________________________________________________________
        
       Ways to shoot yourself in the foot with Redis
        
       Author : philbo
       Score  : 88 points
       Date   : 2023-07-29 14:25 UTC (8 hours ago)
        
 (HTM) web link (philbooth.me)
 (TXT) w3m dump (philbooth.me)
        
       | koolba wrote:
       | > I wrote a basic session cache using GET, which fell back to a
       | database query and SET to populate the cache in the event of a
       | miss. Crucially, it held onto the Redis connection for the
       | duration of that fallback condition and allowed errors from SET
       | to fail the entire operation. Increased traffic, combined with a
       | slow query in Postgres, caused this arrangement to effectively
       | DOS our Redis connection pool for minutes at a time.
       | 
       | This has nothing to do with the redis server. This is bad
       | application code monopolizing a single connection waiting for an
       | unrelated operation. A stateless request / response to interact
       | with redis for the individual operations does not hold any such
       | locks.
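
A minimal sketch of the cache-aside flow discussed above, assuming a pooled client such as redis-py, whose `get` and `set(..., ex=...)` borrow a connection only for the duration of each command. The names `get_session`, `load_from_db`, and the 300-second TTL are hypothetical, chosen for illustration. The slow database fallback runs with no Redis connection held, and a failed SET is logged rather than failing the request.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger(__name__)


def get_session(
    cache: Any,
    load_from_db: Callable[[str], Optional[bytes]],
    key: str,
    ttl_seconds: int = 300,
) -> Optional[bytes]:
    """Cache-aside lookup that treats the cache as strictly optional.

    `cache` is any object with get(key) and set(key, value, ex=...),
    which is the shape of a redis-py client (an assumption of this sketch).
    """
    try:
        cached = cache.get(key)
    except Exception:
        # A real client would catch its own error type here,
        # e.g. redis.exceptions.RedisError.
        log.warning("cache GET failed for %r, falling back to database", key)
        cached = None
    if cached is not None:
        return cached

    # No cache connection is held while the (possibly slow) query runs.
    value = load_from_db(key)

    if value is not None:
        try:
            cache.set(key, value, ex=ttl_seconds)
        except Exception:
            # Populating the cache is best effort; the caller still gets its data.
            log.warning("cache SET failed for %r, serving uncached value", key)
    return value
```

With this shape, a slow Postgres query delays only the request that missed, and a Redis outage degrades to uncached reads.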
        
         | zgluck wrote:
         | And the default limit is 10k connections.
         | 
         | https://redis.io/docs/reference/clients/#maximum-concurrent-...
        
         | philbo wrote:
         | > This has nothing to do with the redis server. This is bad
         | application code monopolizing a single connection waiting for
         | an unrelated operation.
         | 
         | Well, yes. That is why the preceding sentence, which you didn't
         | quote, said "poorly-implemented application logic". So thanks
         | for agreeing with my post, I guess.
         | 
         | The point, in case you missed it, was to advertise ways I'd
         | fucked up and hopefully help others not to fuck up the same way
         | in future. It was never my intention to say Redis was the
         | problem and I'm sorry if it made you think that.
        
       | badrabbit wrote:
       | Don't expose your redis to the internet (please!). Don't
       | whitelist large swathes of your cloud/hosting provider's subnets
       | either. Of course redis isn't special: the same goes for mongo,
       | elastic, docker, k8s, etc., even if it is a testing server and
       | you will never put important data on it.
        
         | amenghra wrote:
         | This. Configure private vlans and/or Wireguard or whatever VPN
         | software you prefer.
        
           | nforgerit wrote:
           | And what about mTLS?
        
             | GauntletWizard wrote:
             | MTLS doesn't affect this advice at all. You should, where
             | possible, use MTLS because it's good security. You
             | shouldn't leave your redis server open to the internet
             | anyway, to cut down on logspam.
             | 
             | With MTLS, a good security posture is to log every
             | connection establishment, with basic metadata about the
             | certificate involved - its SAN and public key hash are the
             | best bet. For troubleshooting, do that logging before the
             | authentication decision. But anyone can make their own
             | certificate, so keeping network controls keeps that list
             | free of clutter.
        
       | berkle4455 wrote:
       | > Crucially, it held onto the Redis connection for the duration
       | of that fallback condition and allowed errors from SET to fail
       | the entire operation.
       | 
       | What? Was this inside a MULTI (transaction) or something? This
       | isn't a flaw of Redis being single-threaded. Honestly all of
       | these "footguns" sound like amateur programmer mistakes and have
       | zero to do with Redis.
        
         | philbo wrote:
         | No. As it explains at the beginning of the paragraph you're
         | quoting:
         | 
         | > If you're particularly naive, like I was on one occasion,
         | you'll exacerbate these failures with some poorly-implemented
         | application logic.
         | 
         | Then a few paragraphs above that is this sentence:
         | 
         | > The gotchas that follow were all occasions when I didn't use
         | it correctly.
         | 
         | I'm not sure how to make it more clear that I'm criticising
         | myself, not Redis, in the post, but that's the intention. If
         | you have suggestions how I could make it more obvious, please
         | let me know.
        
           | berkle4455 wrote:
           | The title comes across like these are faults of Redis and
           | that if you're not particularly careful you'll shoot
           | yourself in the foot.
           | 
           | > I'm not sure how to make it more clear that I'm criticising
           | myself
           | 
           | "Mistakes I made while building applications on Redis"
        
             | philbo wrote:
             | Thanks, I'll update the post and link to your comment for
             | attribution.
        
       | scrame wrote:
       | I had a jr dev connect and type 'flushall' because he thought it
       | would refresh the dataset to disk.
       | 
       | thankfully it was on a staging env, I think he's at google now.
        
         | spacephysics wrote:
         | One time during my internship years ago I took down a
         | production server because of a command I ran on it that I
         | didn't fully understand.
         | 
         | Since then I treat any prod server terminal like I'm entering
         | launch codes for a missile system.
         | 
         | Anything outside of ls or cd I'm very careful, read the command
         | a couple times before executing, etc.
        
           | returningfory2 wrote:
           | In my opinion you weren't at fault here. Production systems
           | should be designed so that one person can't inadvertently
           | destroy things.
        
             | RyanHamilton wrote:
             | This is the way.
        
             | ljm wrote:
             | In almost every place I've worked at, the most difficult
             | thing has been getting people out of ad-hoc JFDI style
             | development and debugging, where everything in production
             | is fair game, and into a process where you avoid touching
             | production as much as humanly possible.
             | 
             | Takes a lot of effort to stop people opening up a shell in
             | prod or grabbing a prod DB dump or even just connecting to
             | the prod datastore directly from their local env.
        
           | tetha wrote:
           | For critical and overall... fiddly things, we've grown into a
           | culture of writing down reviewable plans and possibly
           | executing these plans in pairs.
           | 
           | We tend to go ahead and either use a runbook, or whatever
           | experience we might have, to setup a pretty detailed plan of
           | what to run on which systems with which purpose. You can then
           | throw these plans at someone else to review. Sure, it takes
           | an hour or two more to setup a solid plan and waiting for a
           | review takes time as well. But this has turned into a great
           | tool to build up experience in weird parts of the
           | infrastructure.
        
           | [deleted]
        
         | js2 wrote:
         | You can use rename-command to help avoid these kinds of
         | mistakes:
         | 
         |     # To disable:
         |     rename-command FLUSHALL ""
         | 
         |     # To rename:
         |     rename-command FLUSHALL DANGER_WILL_ROBINSON_FLUSH_ALL
        
       | kgeist wrote:
       | Another one: don't use distributed locks using Redis (Redlock) as
       | if they were just another mutex.
       | 
       | Someone on the team decided to use Redlock to guard a section of
       | code which accessed a third-party API. The code was racy when
       | accessed from several concurrently running app instances, so
       | access to it had to be serialized. A property of distributed
       | locking is that it has timeouts (based on Redis' TTL if I
       | remember correctly) - other instances will assume the lock is
       | released after N seconds, to make sure an app instance which died
       | does not leave the lock in the acquired state forever. So one day
       | responses from the third party API started taking more time than
       | Redlock's timeout. Other app instances were assuming the lock was
       | released and basically started accessing the API simultaneously
       | without any synchronization. Data corruption ensued.
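
The failure described above can be reproduced deterministically with a small simulation. This is an in-memory stand-in for a single-node TTL lock (the `SET key token NX PX ttl` pattern), not Redlock itself; `FakeClock`, `TtlLockServer`, and the 10 and 15 second figures are hypothetical and exist only to make the timing visible.

```python
import secrets
from typing import Dict, Optional, Tuple


class FakeClock:
    """Manually advanced clock so the example needs no real sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


class TtlLockServer:
    """Stand-in for a server that grants locks which expire after a TTL."""

    def __init__(self, clock: FakeClock) -> None:
        self._clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}  # name -> (token, expiry)

    def acquire(self, name: str, ttl: float) -> Optional[str]:
        held = self._locks.get(name)
        if held is not None and held[1] > self._clock.now:
            return None  # still held and not yet expired
        token = secrets.token_hex(8)
        self._locks[name] = (token, self._clock.now + ttl)
        return token

    def release(self, name: str, token: str) -> bool:
        # Only the current, unexpired holder may release (compare, then delete).
        held = self._locks.get(name)
        if held is None or held[0] != token or held[1] <= self._clock.now:
            return False
        del self._locks[name]
        return True


clock = FakeClock()
server = TtlLockServer(clock)

token_a = server.acquire("third-party-api", ttl=10)  # instance A takes the lock
clock.advance(15)                                    # A's API call takes 15 s
token_b = server.acquire("third-party-api", ttl=10)  # B succeeds: lease expired

# At this point A is still inside its critical section and B has entered it too.
a_released = server.release("third-party-api", token_a)  # A's token is stale
b_released = server.release("third-party-api", token_b)
```

Comparing the token on release keeps A from deleting B's lock, but nothing in the lock itself stops A's late call from reaching the API.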
        
         | Racing0461 wrote:
         | That doesn't make any sense. the timeout is how long to block
         | for and retry, not how long to block for and continue.
        
         | GauntletWizard wrote:
         | You should do two things to combat this- one is to carefully
         | monitor third party API timings and lock acquisition timings.
         | Knowing when you approach your distributed locking timeouts
         | (and alerting if they time out more than occasionally) is key
         | to... Well, using distributed locks at all. There are
         | distributed locking systems that require active unlocking
         | without timeout, but they break pretty easily if your process
         | crashes and require manual intervention.
         | 
         | The second is to use a redis client that has its own thread -
         | your application blocking on a third party API response
         | shouldn't prevent you from updating/reacquiring the lock. You
         | want a short timeout on the lock for liveness but a longer
         | maximum lock acquire time so that if it takes several periods
         | to complete a task you still can.
         | 
         | The third is to not use APIs without idempotency. :)
        
         | Phelinofist wrote:
         | I found this blog post about Redlock quite interesting:
         | https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
        
         | remote_phone wrote:
         | That doesn't make sense, they can't assume the lock is freed
         | after the timeout. They have to retry to get the lock again,
         | because another process might have taken the lock. Also, redis
         | is single threaded so access to redis is by definition
         | serialized.
        
           | cbzoiav wrote:
           | As the other guy says the lock is released by the server. If
           | you don't have a mechanism to release it after a timeout,
           | what happens if a node fails?
        
           | codegladiator wrote:
           | The lock is explicitly released by the redis server itself
           | after the ttl. It's not that the Client will assume that the
           | lock is released.
        
         | processunknown wrote:
         | The problem here is that the request timeout is greater than
         | the lock timeout.
        
           | pipe_connector wrote:
           | While this might make this situation more likely to occur,
           | you can _never_ prevent concurrent accesses from happening in
           | a distributed system.
        
         | mjb wrote:
         | All distributed locking systems have a liveness problem: what
         | should you do when a participant fails? You can block forever,
         | which is always correct but not super helpful. You can assume
         | after some time that the process is broken, which preserves
         | liveness. But what if it comes back? What if it was healthy all
         | along and you just couldn't talk to it?
         | 
         | The classic solution is leases: assume bounded clock drift, and
         | make lock holders promise to stop work some time after taking
         | the lock. This is only correct if all clients play by the
         | rules, and your clock drift hypothesis is right.
         | 
         | The other solution is to validate that the lock holder hasn't
         | changed on every call. For example, with a lock generation
         | epoch number. This needs to be enforced by the callee, or by a
         | middle layer, which might seem like you've just pushed the
         | fault tolerance problem to somebody else. In practice, pushing
         | it to somebody else, like a DB is super useful!
         | 
         | Finally, you can change call semantics to offer idempotency (or
         | other race-safe semantics). Nice if you can get it.
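
A minimal sketch of the lock generation epoch idea above, with the check enforced by the callee. `EpochIssuer` and `FencedStore` are hypothetical names; in practice the issuer would be the lock service and the store would be a database or a layer in front of the third-party API.

```python
from typing import List


class StaleEpochError(Exception):
    """Raised when a caller presents an epoch older than one already seen."""


class EpochIssuer:
    """Hands out a strictly increasing epoch each time the lock is granted."""

    def __init__(self) -> None:
        self._epoch = 0

    def grant(self) -> int:
        self._epoch += 1
        return self._epoch


class FencedStore:
    """Callee that rejects writes from any holder older than the newest seen."""

    def __init__(self) -> None:
        self._highest_epoch = 0
        self.writes: List[str] = []

    def write(self, epoch: int, payload: str) -> None:
        if epoch < self._highest_epoch:
            raise StaleEpochError(
                f"epoch {epoch} is older than {self._highest_epoch}"
            )
        self._highest_epoch = epoch
        self.writes.append(payload)


issuer = EpochIssuer()
store = FencedStore()

epoch_a = issuer.grant()  # A takes the lock, then stalls past its lease
epoch_b = issuer.grant()  # the lease expires and B takes the lock

store.write(epoch_b, "from B")
try:
    store.write(epoch_a, "from A")  # A wakes up and tries to write late
    stale_rejected = False
except StaleEpochError:
    stale_rejected = True
```

The comparison uses `<` rather than `<=` so the current holder can make several writes under the same epoch.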
        
       | ljm wrote:
       | I've found that your mileage will vary when using Redis in
       | clustered mode because even if there is an official Redis
       | driver in your language of choice that supports it, this might
       | not be exposed by any libraries that depend on it. In those cases
       | you'll just be connecting to a single specific instance in the
       | cluster but will mistakenly believe that isn't the case.
       | 
       | I've noticed this particularly with Ruby where the official gem
       | has cluster and sentinel support, but many other gems that depend
       | on Redis expose their own abstraction for configuring it and it
       | isn't compatible with the official package.
       | 
       | Of course, I think that running Redis in clustered mode is
       | actually just another way to shoot yourself in the foot,
       | especially if a standalone instance isn't causing you any
       | trouble, as you can easily run into problems with resharding or
       | poorly distributing the keyspace. Maybe just try out Sentinel for
       | HA and failover support if you want some resilience.
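
For context on how the keyspace is distributed in cluster mode: each key maps to one of 16384 hash slots via CRC16 (the XMODEM variant) of the key, or of its `{hash tag}` if one is present. The sketch below follows the rules in the Redis Cluster specification; the function names and example keys are illustrative, not taken from any client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Map a key to a slot in 0..16383, honouring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


# Keys that share a hash tag land in the same slot, so they stay on one node.
same_slot = (
    key_hash_slot(b"{user1000}.following")
    == key_hash_slot(b"{user1000}.followers")
)
```

Overusing one hash tag is one way to end up with the poorly distributed keyspace mentioned above, since every tagged key is forced into a single slot.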
        
         | jrockway wrote:
         | It seems like you can run Envoy as a sidecar next to each
         | application instance to allow non-cluster-aware libraries to
         | use the cluster:
         | https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...
        
           | efrecon wrote:
           | This is cool! Didn't know
        
           | ljm wrote:
           | That's interesting, I suppose you still lose the benefit of
           | a cluster-aware redis driver though, so
           | errors are going to be at protocol level and not application
           | level.
           | 
           | Better than nothing though.
        
       | adventured wrote:
       | I have been using Redis for a long time and one of the things I
       | love about it, is how difficult it is to shoot yourself in the
       | foot with it. From the first use, after briefly reading some
       | basic tips on what not to do, it was ridiculously simple to just
       | get to work with it. I've never once run into a security or
       | performance issue with it.
        
       | jontonsoup wrote:
       | Has anyone seen max (p100) client latencies of 300 to 400ms but
       | totally normal p99? We see this across almost all our redis
       | clusters on elasticache and have no idea why. CPU usage is tiny.
       | Slowlog shows nothing.
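
As a small illustration of how a normal p99 and a bad max can coexist: with enough requests, a handful of slow ones sit entirely above the 99th percentile. The numbers below are synthetic and chosen only to show the arithmetic; `nearest_rank_percentile` is a hypothetical helper using the nearest-rank definition.

```python
from typing import List


def nearest_rank_percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    # ceil(p * n / 100) computed in integer arithmetic to avoid float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic data: 9,995 requests at 1 ms and 5 outliers at 350 ms.
latencies_ms = [1.0] * 9_995 + [350.0] * 5

p99 = nearest_rank_percentile(latencies_ms, 99)
p100 = nearest_rank_percentile(latencies_ms, 100)
```

Here up to 100 of the 10,000 requests could be slow before p99 moves at all, so tracking p99.9 or the count of requests above a threshold can show how often the outliers occur.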
        
         | secondcoming wrote:
         | Is it doing backups?
        
           | jontonsoup wrote:
           | My understanding is elasticache does not let you turn them
           | off.
        
         | GauntletWizard wrote:
         | I would guess your problem is probably scheduler based. The
         | default(ish) Linux scheduler operates in 100ms increments, the
         | first use of a client takes 3-4 round-trips. TCP opens, blocks,
         | request is sent, the client blocks on write, the client
         | attempts to read and blocks on read. If CPU usage is high
         | momentarily, each of these yields to another process and your
         | client isn't scheduled for another 100ms
        
           | jontonsoup wrote:
           | Hmm. We have super low CPU utilization- something like 9%.
           | This is also across 10+ different clusters.
        
             | jontonsoup wrote:
             | We also pool our clients heavily. Maybe we could reduce the
             | new connections to zero to test.
        
       | welder wrote:
       | Change the default `stop-writes-on-bgsave-error` to "no" or
       | you're asking for trouble... a ticking time bomb.
        
         | [deleted]
        
         | chrisbolt wrote:
         | Isn't it another ticking time bomb to accept writes that will
         | be lost if the server is shut down?
        
           | GauntletWizard wrote:
           | Expecting that any key in redis will be there next time you
           | read it is a ticking timebomb. Redis is not a database. It's
           | a cache.
           | 
           | Unless you're using AOF mode with fsync always, you can lose
           | writes. If you're doing that, you should be using a real
           | database instead.
        
       ___________________________________________________________________
       (page generated 2023-07-29 23:01 UTC)