[HN Gopher] Logs and tracing: not just for production, local dev...
       ___________________________________________________________________
        
       Logs and tracing: not just for production, local development too
        
       Author : lawrjone
       Score  : 55 points
       Date   : 2021-11-11 12:40 UTC (10 hours ago)
        
 (HTM) web link (incident.io)
 (TXT) w3m dump (incident.io)
        
       | lmeyerov wrote:
       | What's a good local dev version of this nowadays?
       | 
        | We invested heavily in `logger` in Python and its equivalent in
        | JS, which makes debugging most straight-line issues a breeze.
        | However, not so much for some of our async & cross-service
        | flows. It's not hard to add correlation ID handoffs across REST
        | API calls, so I'm curious about both agent instrumentation and a
        | local UI (e.g., 10 lines of docker compose config)?
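        | 
        | For context, the kind of minimal local setup I have in mind is
        | roughly the below (a sketch, assuming Jaeger's all-in-one image
        | with OTLP ingest enabled; the version tag is illustrative):
        | 
        |     # docker-compose.yml
        |     services:
        |       jaeger:
        |         image: jaegertracing/all-in-one:1.35
        |         environment:
        |           - COLLECTOR_OTLP_ENABLED=true  # accept OTLP from local apps
        |         ports:
        |           - "16686:16686"  # Jaeger web UI
        |           - "4317:4317"    # OTLP over gRPC
        |           - "4318:4318"    # OTLP over HTTP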
        
       | coldacid wrote:
       | I almost always instrument all my code with logging messages for
       | the purpose of local development. Sometimes it can be very
       | inconvenient for different reasons to run something in the
       | debugger, and so generating log files while running the code to
       | test it helps me identify a lot of issues that I may have
       | overlooked or that otherwise come up in the code while it's still
       | being worked on.
       | 
       | I almost never remove those log messages either, but thanks to
       | the magic of log levels, I don't have to. Anything debug or trace
       | level can be skipped in production logs unless we need to turn on
       | those levels due to bugs that are commonly hit in production yet
       | difficult to diagnose.
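        | 
        | As an illustration only, a sketch of the level switch in Go's
        | log/slog (the env var and fields are made up):
        | 
        |     package main
        | 
        |     import (
        |         "log/slog"
        |         "os"
        |     )
        | 
        |     func main() {
        |         // Local dev gets Debug; production stays at Info, so debug
        |         // and trace lines are skipped unless we turn them on.
        |         level := slog.LevelInfo
        |         if os.Getenv("APP_ENV") == "development" {
        |             level = slog.LevelDebug
        |         }
        | 
        |         logger := slog.New(slog.NewJSONHandler(os.Stdout,
        |             &slog.HandlerOptions{Level: level}))
        | 
        |         logger.Debug("loaded fixture data", "rows", 42) // dropped at Info
        |         logger.Info("server started", "port", 8080)
        |     }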
        
         | lawrjone wrote:
         | That all sounds very normal to me, and it's what I'd expect
         | most people do.
         | 
          | The bit that surprised me, probably because my background is
          | mostly in Ruby and similar languages, was around debugging
          | highly concurrent apps.
         | 
          | Our app spins up goroutines all the time. One request will fan
          | out to a number of concurrent threads, doing everything from
          | hydrating responses from Slack to syncing issue tracker
          | tickets.
         | 
          | With plain STDOUT logs, that's really hard to follow, as it's
          | all interwoven, and picturing all the threads and how they
          | interact is fairly difficult.
         | 
          | I think that's why this was such an "ah-ha!" moment for us.
          | Suddenly we could find the "error" log in our terminal, then
          | click the trace URL and see everything in context.
         | 
         | That's probably the key difference to our past experience, that
         | the app we're working with is so concurrent.
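          | 
          | As a rough sketch of what that fan-out looks like when each
          | goroutine carries the trace context (using OpenTelemetry's Go
          | API purely for illustration; our setup uses OpenCensus, and
          | the span names here are made up):
          | 
          |     package main
          | 
          |     import (
          |         "context"
          |         "sync"
          | 
          |         "go.opentelemetry.io/otel"
          |     )
          | 
          |     // Each goroutine starts a child span from the request
          |     // context, so all the work shows up under a single trace.
          |     func handleRequest(ctx context.Context) {
          |         tracer := otel.Tracer("example")
          | 
          |         ctx, span := tracer.Start(ctx, "handle_request")
          |         defer span.End()
          | 
          |         var wg sync.WaitGroup
          |         for _, job := range []string{"hydrate_slack_response",
          |             "sync_issue_tracker"} {
          |             wg.Add(1)
          |             go func(name string) {
          |                 defer wg.Done()
          |                 _, child := tracer.Start(ctx, name)
          |                 defer child.End()
          |                 // ... do the actual work here ...
          |             }(job)
          |         }
          |         wg.Wait()
          |     }
          | 
          |     func main() { handleRequest(context.Background()) }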
        
       | stavros wrote:
       | In the past, I've used Honeycomb to do tracing, with great
       | success. Google Cloud Trace seems interesting, though I think
       | that something purpose-built would be a better fit.
       | 
        | Right now we have a need for tracing the frontend. Does anyone
        | here have any suggestions for that? In particular, we'd like
        | something that can reconstruct a user's journey, especially when
        | it results in an error, so we can fix it.
        
         | lawrjone wrote:
         | Sentry (https://sentry.io/) has been building a load of tracing
         | functionality, and the Javascript client captures logs and
         | requests leading up to an error by default.
         | 
         | I've used this in the past quite a bit, and found it really
         | useful.
         | 
         | And in terms of Cloud Trace vs other tracing providers, I think
         | it was the ease of setup and the simple log + trace inline view
         | that sold us on it. Only took a couple of hours to get working,
         | and it's great that it works for local dev too.
        
           | stavros wrote:
           | We did try Sentry, but we couldn't really get the UI to work
           | for us. By default, the logical unit it uses is the error,
           | and our project view would quickly be spammed with all the
           | events and become unusable. I haven't personally researched
           | it (I've used Sentry for errors-only needs, and it works well
           | for that), but from what I heard the error-centric
           | functionality made it hard to work with traces.
           | 
           | Would you happen to have any tips on how to fix that? Maybe
           | we just got it wrong.
        
         | candiddevmike wrote:
         | Proxy the open tracing requests from your backend. You could
         | even add a correlated request ID for the backend API calls:
         | 
         | https://github.com/opentracing/opentracing-javascript
        
           | stavros wrote:
           | I'm not sure of the workflow here, the backend is
           | instrumented separately. We're only looking to trace user
           | actions in the backend, can you explain a bit about how
           | OpenTracing fits into that? Where do the traces go?
        
       | cbushko wrote:
       | I really enjoyed this article. I will assume that logs are being
       | sent to a dedicated GCP project.
       | 
       | Can you shorten the retention of logs and traces in Stackdriver?
       | If so, that would be amazing and make this practical for all dev
       | use.
        
         | lawrjone wrote:
          | We have some Terraform that generates a GCP project for each
          | developer, along with all the infrastructure and build
          | pipelines they'd need to run their own stack.
         | 
          | It looks something like this:
          | 
          |     module "incident_io" {
          |       for_each = {
          |         "staging" = {
          |           project = "incident-io-staging"
          |         }
          |         "production" = {
          |           project = "incident-io-production"
          |         }
          |         "dev-lawrence" = {
          |           project    = "incident-io-dev-lawrence"
          |           autodeploy = true
          |         }
          |         "dev-lisa" = {
          |           project    = "incident-io-dev-lisa"
          |           autodeploy = true
          |         }
          |         # ...
          |       }
          | 
          |       source = "./modules/stack"
          | 
          |       application       = "incident-io"
          |       instance          = each.key
          |       google_project_id = each.value.project
          |       autodeploy        = lookup(each.value, "autodeploy", false)
          |     }
         | 
         | So their traces + logs get sent to their own StackDriver
         | instances, rather than polluting either staging or production.
        
           | cbushko wrote:
            | That is an interesting way to do it. A project per developer
            | would allow you to give each of them all the permissions
            | they need.
           | 
            | The worry I have with this is that as you grow you will
            | eventually end up with a bunch of dead projects. You need to
            | clean up that list every so often as employees come and go.
           | 
            | We have a dev project so developers can send their logs
            | there.
        
       | jakozaur wrote:
        | Personally, I find local logs and tracing super useful. They let
        | you debug tests and ensure you have the right instrumentation to
        | troubleshoot similar issues in production. Design and write code
        | that you can debug later.
       | 
        | E.g. we've used them to troubleshoot slow build times and to
        | analyse the adoption of different internal tooling.
       | 
        | Minor nitpick: these days I would use OpenTelemetry for
        | instrumentation instead of OpenCensus. Even the OpenCensus
        | website (https://opencensus.io/) says "OpenCensus and
        | OpenTracing have merged into OpenTelemetry!".
       | 
       | Disclaimer: I work at Sumo Logic (https://www.sumologic.com). We
       | do logs, metrics and tracing.
        
         | lawrjone wrote:
          | You're totally right on OC vs OT: we couldn't use OT when I
          | first put this together, as the StackDriver plugin wasn't
          | compatible. We also found it didn't work out of the box with a
          | load of our stuff (notably the GCP client libraries), while OC
          | did.
         | 
          | Another commenter has pointed out that things have moved on
          | quite a bit since then, which is really good to hear. We'll
          | revisit this when we get some time to make the shift :)
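          | 
          | For anyone curious, my understanding is the OT version of the
          | setup would look roughly like this (a sketch assuming the
          | opentelemetry-operations-go Cloud Trace exporter; untested
          | against our codebase, and the project ID is just an example):
          | 
          |     package main
          | 
          |     import (
          |         "context"
          |         "log"
          | 
          |         texporter "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
          |         "go.opentelemetry.io/otel"
          |         sdktrace "go.opentelemetry.io/otel/sdk/trace"
          |     )
          | 
          |     func main() {
          |         // Export spans straight to Cloud Trace.
          |         exporter, err := texporter.New(
          |             texporter.WithProjectID("incident-io-dev-lawrence"))
          |         if err != nil {
          |             log.Fatal(err)
          |         }
          | 
          |         tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
          |         defer tp.Shutdown(context.Background())
          | 
          |         otel.SetTracerProvider(tp)
          | 
          |         // From here, spans started via otel.Tracer(...) end
          |         // up in Cloud Trace.
          |     }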
        
       | bob1029 wrote:
       | We built a centralized tracing/logging infrastructure that all of
       | our environments talk to - including local developer
       | environments. This is a 100% in-house creation which is purpose-
       | built to convey important facts specific to our product.
       | 
       | Anyone from developer to end user is able to submit a trace to
       | this system simply by selecting a menu option and entering a
       | comment. These traces contain sufficient information to perfectly
       | reconstruct all user interactions and review all communications
       | with external systems.
       | 
       | There is a dashboard that updates in real-time containing all of
       | these submissions, so everyone on the team typically has this off
       | to the side for a quick glance throughout the day.
       | 
        | The amount of time this sort of tooling has bought us is hard to
        | quantify, but I can't imagine it hasn't paid for itself yet. I
        | honestly don't see how you could do what we did by just taping
        | together 3rd party solutions. Certainly not when also
        | considering the security & deployment model we have to cope
        | with due to banking industry regulations.
        
         | lawrjone wrote:
         | That's really interesting!
         | 
          | In terms of incident.io, we're a team of 3 engineers (looking
          | for more!) and a company of 6 in total, so our priority was
          | buy, not build.
         | 
         | But in my most recent experience at a larger company (fintech
         | unicorn, ~1k employees) we:
         | 
          | - Used Sentry to capture exceptions from all over the app. Our
          | language was Ruby, which meant the stacktraces were usually
          | explanatory, and attached debugging information to the event
          | whenever we sent it
          | 
          | - We didn't use tracing much, though when we did it would be
          | at a request level. Our Sentries would link to the trace, and
          | you'd see what happened during your request
          | 
          | - All logs were associated with the trace
         | 
         | So our error report would be a Sentry, which could be joined
         | against the logs and trace information.
         | 
         | In the situation where error reporting (probably better phrased
         | as crash report?) was triggered by a user, it would normally
         | come from a frontend site and trigger a Sentry.
         | 
          | The Sentry plugin would have tracked the console.log/debug
          | calls up until that moment, and any requests that took place.
          | So it was quite easy to piece together what had happened, and
          | why it went wrong.
         | 
          | I've not seen anything as custom as what you describe, though.
          | Interesting to know that those systems are out there, and that
          | you feel it was worth building over buying!
        
       | lawrjone wrote:
       | Author of the post here.
       | 
       | We're a super small start-up, but with prior experience of great
       | observability toolchains.
       | 
       | Been really surprised at how little effort it required to get a
       | richly integrated trace and logging setup, using Cloud Trace +
       | StackDriver.
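        | 
        | For a flavour, the core of the wiring looks roughly like this (a
        | sketch of an OpenCensus + Stackdriver setup in the spirit of
        | what the post describes; the project ID and span name are
        | illustrative):
        | 
        |     package main
        | 
        |     import (
        |         "context"
        |         "log"
        | 
        |         "contrib.go.opencensus.io/exporter/stackdriver"
        |         "go.opencensus.io/trace"
        |     )
        | 
        |     func main() {
        |         // Send every span to Cloud Trace in the dev project.
        |         exporter, err := stackdriver.NewExporter(stackdriver.Options{
        |             ProjectID: "incident-io-dev-lawrence",
        |         })
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         trace.RegisterExporter(exporter)
        |         trace.ApplyConfig(trace.Config{
        |             DefaultSampler: trace.AlwaysSample(),
        |         })
        | 
        |         ctx, span := trace.StartSpan(context.Background(), "dev_smoke_test")
        |         _ = ctx // pass ctx into downstream calls so they join the trace
        |         span.End()
        |     }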
       | 
       | Equally, we've all been surprised by how much it changed our
       | local development experience. Going from an error in dev to the
       | trace view and all the logs inline is just one click, which is
       | pretty transformative.
       | 
       | Happy to answer any questions, hope you enjoy the article!
        
         | ogjunkyard wrote:
         | I enjoyed the article a fair bit!
         | 
         | I was wondering, how should one get started with observability
         | and implementing it? Are there specific books/courses/talks
         | you'd recommend?
         | 
         | The reason I ask is because I've never been directly exposed to
         | good observability at any of the companies I've worked at for a
         | handful of reasons. It mostly boils down to the fact that I'm a
         | DevOps engineer, so building observability is a set-up-and-
         | keep-running sort of deal for other teams, not a useful-for-my-
         | applications thing that I'm going to be working with often.
         | Teams let us know "Splunk is down", "I can't reach Kibana", or
         | "Looks like disk space is filling up" and that's about it after
         | it's been initially set up.
         | 
          | There's a whole host of questions I'd ideally like to answer,
          | but a lot of it boils down to the fact that I don't know what
          | I don't know, and I'd suggest assuming I know nothing rather
          | than assuming I know something just because I know the word.
          | Questions I'd like to be able to answer are:
         | 
          | - What makes a good log?
          | 
          | - What is a trace? Why is it useful? How does it help me debug
          | issues faster?
          | 
          | - How do you increase observability for loosely-coupled
          | microservice systems?
          | 
          | - How do you observe multi-threaded applications?
          | 
          | - ... and I'm sure there are a whole bunch more.
        
           | lawrjone wrote:
           | Urgh, yes. This is a really difficult place to be coming
           | from, and I totally feel your pain!
           | 
            | My background was working as an SRE at GoCardless, starting
            | when the company was ~30 people and leaving at around 700.
            | During that time we went through the whole "oh crap, wtf is
            | observability" phase, which coincided with a big push in the
            | industry to define the term, and I worked on the team
            | (Observability Working Group) that tried rolling out these
            | practices.
           | 
            | The truth is this is much easier if you have someone with
            | you who knows what good looks like, though I was in your
            | position at GC and it's possible to learn it from first
            | principles.
           | 
           | If you're doing this, the best advice I can give you is to
           | think really critically about _why_ you want observability.
           | 
           | Usually it's "when something goes wrong, I want to be able to
           | understand what lead to it, and what was going on at that
           | time". If that's the case, you can't make a wrong step if it
           | improves your ability to understand that- even if what you do
           | is simple.
           | 
           | At GC we began with logs, as everyone was familiar with them.
           | We encouraged people to start thinking about logs as
           | structured data, so drop the "Posted message to Slack" log
           | line and go for something like:
           | 
            |     {
            |       event: "slack_message.posted",
            |       slack_channel_id: "CH123",
            |       slack_user_id: "US123",
            |       etc...
            |     }
           | 
            | When you get your logs looking like that, you can set up
            | something like Grafana to expose visualisations that are
            | built from your logs. We were using Elasticsearch for log
            | storage, which is quite simple to build graphs on top of.
           | 
           | Visualisations are really compelling, and help you persuade
           | people it's worthwhile to consider this stuff.
           | 
            | Beyond structured logging, you'd want to look into time-
            | series metrics (Prometheus), which can help you monitor
            | things closer to real time, then traces if you want that
            | type of insight.
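            | 
            | If it helps, a minimal sketch of the Prometheus side in Go
            | (assuming the official client_golang library; the metric
            | name just mirrors the Slack example above):
            | 
            |     package main
            | 
            |     import (
            |         "log"
            |         "net/http"
            | 
            |         "github.com/prometheus/client_golang/prometheus"
            |         "github.com/prometheus/client_golang/prometheus/promauto"
            |         "github.com/prometheus/client_golang/prometheus/promhttp"
            |     )
            | 
            |     // Prometheus scrapes /metrics on a schedule to build
            |     // the time series.
            |     var slackMessagesPosted = promauto.NewCounter(prometheus.CounterOpts{
            |         Name: "slack_messages_posted_total",
            |         Help: "Number of Slack messages posted.",
            |     })
            | 
            |     func main() {
            |         slackMessagesPosted.Inc() // call wherever the event happens
            | 
            |         http.Handle("/metrics", promhttp.Handler())
            |         log.Fatal(http.ListenAndServe(":2112", nil))
            |     }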
           | 
           | I've often compared observability to testing, in terms of how
           | you should think about it/use it. You'll find a load of dev
           | teams who think testing is a waste of time, but most high
           | performance teams won't ship without tests.
           | 
           | They'll say testing doesn't just help catch errors, it helps
           | them build faster, due to the confidence it gives them.
           | 
            | You'll know your org has adopted observability when people
            | feel that way about instrumenting their code, and it's
            | second nature to write logs/traces/metrics into their
            | software.
           | 
           | Not sure I have any reference links in mind just yet, but
           | I'll give it a think.
        
             | ogjunkyard wrote:
             | Thanks so much for this response!
             | 
              | I've been thinking about your response for probably over
              | an hour as I've been going about my day, and the thing
              | that is sticking out to me is your directive to think
              | critically about WHY I want observability. I think I've
              | figured out what's motivating me to look into all of this.
             | 
             | I have a side business I'm working on that causes me to
             | think about the customer experience a lot since it's a
             | fully self-service, no-touch product where I'm not actively
             | engaged in the sales, onboarding, etc. experience a new
             | user has. When someone does have an issue, I want to be
             | able to help them accomplish what they are trying to do as
             | quickly as possible.
             | 
             | I recently had a user/friend who was trying to get
             | something set up in the application I'm building. The only
             | reason I knew he had an issue was because he reached out to
             | me. Luckily, when I finally saw his message 4-5 hours
             | later, he was around and able to work with me on
              | troubleshooting his issue. It took me a while to work out
              | exactly what was going on, and the friend was very
              | patient/helpful the entire time. I remember having him try
              | to initiate his request probably a dozen or so times as I
              | worked through my application, teasing out the root cause
              | of his problem. Ultimately, this led to me building better
              | error messages into my application to address this
              | specific point, but if there's a way to get ahead of the
              | user-issue whack-a-mole game, I'm all for it.
             | 
              | Instead of him reaching out to me and the two of us
              | troubleshooting the issue together in real time, it would
              | have been more helpful to simply have had an Error Code
              | and Request ID. That would allow me to tell him, "I dug
              | into this and found out what's going on. Here's exactly
              | what the issue is. Do X, Y, and Z to get this working."
             | 
              | Other points that particularly resonate with me, although
              | I may not consciously know why, are:
             | 
             | - JSON-structured logging
             | 
             | - Visualizations could help sell the idea of observability
             | at $DAYJOB (but no clue what would make for a good
             | graph/diagram/etc.)
             | 
              | - High-functioning teams want observability the same way
              | they want automated testing.
        
           | Nezteb wrote:
           | RE your questions at the bottom, some good reading materials
           | are:
           | 
           | - https://sre.google/sre-book/monitoring-distributed-systems/
           | 
           | - https://cloud.google.com/architecture/devops/devops-
           | measurem...
           | 
           | - https://www.oreilly.com/library/view/distributed-systems-
           | obs...
        
       ___________________________________________________________________
       (page generated 2021-11-11 23:03 UTC)