[HN Gopher] Logs and tracing: not just for production, local dev...
___________________________________________________________________
Logs and tracing: not just for production, local development too
Author : lawrjone
Score : 55 points
Date : 2021-11-11 12:40 UTC (10 hours ago)
(HTM) web link (incident.io)
(TXT) w3m dump (incident.io)
| lmeyerov wrote:
| What's a good local dev version of this nowadays?
|
| We invested heavily in `logger` in Python and the equivalent in
| JS, which makes debugging most straight-line issues a breeze.
| However, not so much with some of our async & cross-service
| flows. It's not hard to add correlation ID handoffs across REST
| API calls, so I'm curious about both agent instrumentation and a
| local UI (e.g., 10 lines of docker-compose config)?
| coldacid wrote:
| I almost always instrument all my code with logging messages for
| the purpose of local development. Sometimes it can be very
| inconvenient for different reasons to run something in the
| debugger, and so generating log files while running the code to
| test it helps me identify a lot of issues that I may have
| overlooked or that otherwise come up in the code while it's still
| being worked on.
|
| I almost never remove those log messages either, but thanks to
| the magic of log levels, I don't have to. Anything debug or trace
| level can be skipped in production logs unless we need to turn on
| those levels due to bugs that are commonly hit in production yet
| difficult to diagnose.
| lawrjone wrote:
| That all sounds very normal to me, and it's what I'd expect
| most people do.
|
| The bit that surprised me is more likely due to my background
| being mostly Ruby/similar based, which is around debugging
| highly concurrent apps.
|
| Our app spins up goroutines all the time. One request will fan
| out to a number of concurrent goroutines, doing everything from
| hydrating responses from Slack to syncing issue tracker
| tickets.
|
| With plain STDOUT logs, that's very difficult to understand, as
| everything is interwoven. And picturing all the threads and how
| they interact is fairly difficult.
|
| I think that's why this was such an "ah-ha!" moment for us.
| Suddenly we could find the "error" log in our terminal, then
| click the trace URL and see everything in context.
|
| That's probably the key difference to our past experience, that
| the app we're working with is so concurrent.
| stavros wrote:
| In the past, I've used Honeycomb to do tracing, with great
| success. Google Cloud Trace seems interesting, though I think
| that something purpose-built would be a better fit.
|
| Right now we have a need for tracing the frontend. Does anyone
| here have any suggestions for that? In particular, we'd like
| something that can reconstruct a user's journey, particularly if
| it results in an error, so we can fix it.
| lawrjone wrote:
| Sentry (https://sentry.io/) has been building a load of tracing
| functionality, and the Javascript client captures logs and
| requests leading up to an error by default.
|
| I've used this in the past quite a bit, and found it really
| useful.
|
| And in terms of Cloud Trace vs other tracing providers, I think
| it was the ease of setup and the simple log + trace inline view
| that sold us on it. Only took a couple of hours to get working,
| and it's great that it works for local dev too.
| stavros wrote:
| We did try Sentry, but we couldn't really get the UI to work
| for us. By default, the logical unit it uses is the error,
| and our project view would quickly be spammed with all the
| events and become unusable. I haven't personally researched
| it (I've used Sentry for errors-only needs, and it works well
| for that), but from what I heard the error-centric
| functionality made it hard to work with traces.
|
| Would you happen to have any tips on how to fix that? Maybe
| we just got it wrong.
| candiddevmike wrote:
| Proxy the open tracing requests from your backend. You could
| even add a correlated request ID for the backend API calls:
|
| https://github.com/opentracing/opentracing-javascript
| stavros wrote:
| I'm not sure of the workflow here, the backend is
| instrumented separately. We're only looking to trace user
| actions in the backend, can you explain a bit about how
| OpenTracing fits into that? Where do the traces go?
| cbushko wrote:
| I really enjoyed this article. I will assume that logs are being
| sent to a dedicated GCP project.
|
| Can you shorten the retention of logs and traces in Stackdriver?
| If so, that would be amazing and make this practical for all dev
| use.
| lawrjone wrote:
| We have some Terraform that generates a GCP project for each
| developer, along with all the infrastructure and build pipelines
| they'd need to run their stack themselves.
|
| It looks something like this:
|
|     module "incident_io" {
|       for_each = {
|         "staging"      = { project = "incident-io-staging" }
|         "production"   = { project = "incident-io-production" }
|         "dev-lawrence" = { project = "incident-io-dev-lawrence", autodeploy = true }
|         "dev-lisa"     = { project = "incident-io-dev-lisa", autodeploy = true }
|         # ...
|       }
|
|       source            = "./modules/stack"
|       application       = "incident-io"
|       instance          = each.key
|       google_project_id = each.value.project
|       autodeploy        = lookup(each.value, "autodeploy", false)
|     }
|
| So their traces + logs get sent to their own StackDriver
| instances, rather than polluting either staging or production.
| cbushko wrote:
| That is an interesting way to do it. Per-project setup would
| allow you to give each developer all the permissions they need.
|
| The worry I have with this is that, as you grow, you will
| eventually end up with a bunch of dead projects. You'd need to
| clean up that list every so often as employees come and go.
|
| We have a dev project, so developers can send their logs there.
| jakozaur wrote:
| Personally, I find local logs and tracing super useful. It allows
| you to debug tests and ensure you have the right instrumentation
| to troubleshoot similar issues in production. Design and write
| code that you can debug later.
|
| E.g. we've used it to troubleshoot slow build times and analyse
| the adoption of different internal tooling.
|
| Minor nitpick, these days I would use OpenTelemetry for
| instrumentation instead of OpenCensus. Even on OpenCensus website
| (https://opencensus.io/) you can read "OpenCensus and OpenTracing
| have merged into OpenTelemetry!".
|
| Disclaimer: I work at Sumo Logic (https://www.sumologic.com). We
| do logs, metrics and tracing.
| lawrjone wrote:
| You're totally right on OC vs OT: we couldn't use OT when I
| first put this together, as the StackDriver plugin wasn't
| compatible. We also found it didn't work out of the box with a
| load of our stuff (notably the GCP client libraries), while OC
| did.
|
| Another commenter has pointed out that things have moved on
| quite a bit since then. That's really good to hear, and we'll
| revisit this when we get some time to make the shift :)
| bob1029 wrote:
| We built a centralized tracing/logging infrastructure that all of
| our environments talk to - including local developer
| environments. This is a 100% in-house creation which is purpose-
| built to convey important facts specific to our product.
|
| Anyone from developer to end user is able to submit a trace to
| this system simply by selecting a menu option and entering a
| comment. These traces contain sufficient information to perfectly
| reconstruct all user interactions and review all communications
| with external systems.
|
| There is a dashboard that updates in real-time containing all of
| these submissions, so everyone on the team typically has this off
| to the side for a quick glance throughout the day.
|
| The amount of time this sort of tooling has bought us is hard to
| quantify, but I can't imagine this stuff hasn't paid for itself
| yet. I honestly don't see how you could do what we did by just
| taping together third-party solutions. Certainly not when also
| considering the security & deployment model we have to cope with
| due to banking industry regulations.
| lawrjone wrote:
| That's really interesting!
|
| In terms of incident.io, we're a team of 3 engineers (looking
| for more!) and a company of 6 in total, so our priority was on
| buy and not build.
|
| But in my most recent experience at a larger company (fintech
| unicorn, ~1k employees) we:
|
| - Used Sentry to capture exceptions from all over the app. Our
| language was Ruby, which meant the stacktraces were usually
| explanatory, and we attached debugging information to the event
| whenever we sent it.
|
| - We didn't use tracing much, though when we did it would be at
| a request level. Our Sentries would link to the trace, and you'd
| see what happened during your request.
|
| - All logs were associated with the trace.
|
| So our error report would be a Sentry, which could be joined
| against the logs and trace information.
|
| In the situation where error reporting (probably better phrased
| as crash report?) was triggered by a user, it would normally
| come from a frontend site and trigger a Sentry.
|
| The Sentry plugin will have tracked the console.log/debug calls
| up until that moment, and any requests that took place. So it
| was quite easy to piece together what had happened, and why it
| went wrong.
|
| I've never seen anything as custom as what you describe, though.
| Interesting to know that those systems are out there, and that
| you feel it was worth building over buying!
| lawrjone wrote:
| Author of the post here.
|
| We're a super small start-up, but with prior experience of great
| observability toolchains.
|
| Been really surprised at how little effort it required to get a
| richly integrated trace and logging setup, using Cloud Trace +
| StackDriver.
|
| Equally, we've all been surprised by how much it changed our
| local development experience. Going from an error in dev to the
| trace view and all the logs inline is just one click, which is
| pretty transformative.
|
| Happy to answer any questions, hope you enjoy the article!
| ogjunkyard wrote:
| I enjoyed the article a fair bit!
|
| I was wondering, how should one get started with observability
| and implementing it? Are there specific books/courses/talks
| you'd recommend?
|
| The reason I ask is because I've never been directly exposed to
| good observability at any of the companies I've worked at for a
| handful of reasons. It mostly boils down to the fact that I'm a
| DevOps engineer, so building observability is a set-up-and-
| keep-running sort of deal for other teams, not a useful-for-my-
| applications thing that I'm going to be working with often.
| Teams let us know "Splunk is down", "I can't reach Kibana", or
| "Looks like disk space is filling up" and that's about it after
| it's been initially set up.
|
| There's a whole host of questions I'd ideally like to answer,
| but a lot of it boils down to the fact that I don't know what I
| don't know and I'd suggest assuming I know nothing over
| assuming I know something because I know the word. Questions
| I'd like to be able to answer are:
|
| - What makes a good log?
|
| - What is a trace? Why is it useful? How does it help me debug
| issues faster?
|
| - How do you increase observability for loosely-coupled
| microservice systems?
|
| - How do you observe multi-threaded applications?
|
| - ... and I'm sure there are a whole bunch more.
| lawrjone wrote:
| Urgh, yes. This is a really difficult place to be coming
| from, and I totally feel your pain!
|
| My background was working as an SRE at GoCardless, starting
| when the company was around ~30 and leaving at around 700.
| During that time we did the whole "oh crap, wtf is
| observability" that coincided with a big push in the industry
| to define the term, and I worked on the team (Observability
| Working Group) that tried rolling out these practices.
|
| The truth is this is much easier if you have someone with you
| who knows what good looks like, though I was in your position at
| GC and it's possible to learn it from first principles.
|
| If you're doing this, the best advice I can give you is to
| think really critically about _why_ you want observability.
|
| Usually it's "when something goes wrong, I want to be able to
| understand what led to it, and what was going on at that
| time". If that's the case, you can't make a wrong step if it
| improves your ability to understand that, even if what you do
| is simple.
|
| At GC we began with logs, as everyone was familiar with them.
| We encouraged people to start thinking about logs as
| structured data, so drop the "Posted message to Slack" log
| line and go for something like:
|
| ```
| {
|   event: "slack_message.posted",
|   slack_channel_id: "CH123",
|   slack_user_id: "US123",
|   ...
| }
| ```
|
| When you get your logs looking like that, you can set up
| something like Grafana to expose visualisations built from your
| logs. We were using Elasticsearch for log storage, which is
| quite simple to build graphs on top of.
|
| Visualisations are really compelling, and help you persuade
| people it's worthwhile to consider this stuff.
|
| Beyond structured logging, you'd want to look into time-
| series metrics (Prometheus) which can help you monitor things
| in a bit more real-time, then traces if you want that type of
| insight.
|
| I've often compared observability to testing, in terms of how
| you should think about it/use it. You'll find a load of dev
| teams who think testing is a waste of time, but most
| high-performing teams won't ship without tests.
|
| They'll say testing doesn't just help catch errors, it helps
| them build faster, due to the confidence it gives them.
|
| You'll know when your org has adopted observability when they
| feel that way about instrumenting their code, and it's second
| nature to write log/trace/metrics into their software.
|
| Not sure I have any reference links in mind just yet, but
| I'll give it a think.
| ogjunkyard wrote:
| Thanks so much for this response!
|
| I've been thinking on your response for probably over an
| hour as I've been going about my day, and the thing that is
| sticking out to me is your directive to think critically
| about WHY I want observability. I think I figured out the
| motivation on why I'm looking into all of this stuff.
|
| I have a side business I'm working on that causes me to
| think about the customer experience a lot since it's a
| fully self-service, no-touch product where I'm not actively
| engaged in the sales, onboarding, etc. experience a new
| user has. When someone does have an issue, I want to be
| able to help them accomplish what they are trying to do as
| quickly as possible.
|
| I recently had a user/friend who was trying to get
| something set up in the application I'm building. The only
| reason I knew he had an issue was because he reached out to
| me. Luckily, when I finally saw his message 4-5 hours
| later, he was around and able to work with me on
| troubleshooting his issue. It took me a bit to troubleshoot
| exactly what was going on and the friend was very
| patient/helpful the entire time. I remember having him try
| to initiate his request probably a dozen or so times as I
| worked through my application and teasing out the root
| cause of his problem. Ultimately, this led to me building
| in better error messages into my application to address
| this specific point, but if there's a way to get ahead of
| the user issue whack-a-mole game, I'm all for it.
|
| Instead of him trying to reach out to me and us
| troubleshoot this issue together in real time, it would be
| more helpful to simply have had an Error Code and Request
| ID instead. This would allow me to instead tell him, "I dug
| into this and found out what's going on. Here's exactly
| what the issue is. Do X, Y, and Z to get this working."
|
| Other points that particularly resonate with me, although I
| may not consciously know why, are:
|
| - JSON-structured logging
|
| - Visualizations could help sell the idea of observability
| at $DAYJOB (but no clue what would make for a good
| graph/diagram/etc.)
|
| - High-functioning teams want observability like high-
| functioning teams want automated testing.
| Nezteb wrote:
| RE your questions at the bottom, some good reading materials
| are:
|
| - https://sre.google/sre-book/monitoring-distributed-systems/
|
| - https://cloud.google.com/architecture/devops/devops-
| measurem...
|
| - https://www.oreilly.com/library/view/distributed-systems-
| obs...
___________________________________________________________________
(page generated 2021-11-11 23:03 UTC)