[HN Gopher] Debugging audio artifacts caused by... a serial port?
___________________________________________________________________
Debugging audio artifacts caused by... a serial port?
Author : davidgu
Score : 36 points
Date : 2024-10-23 15:54 UTC (5 days ago)
(HTM) web link (www.recall.ai)
(TXT) w3m dump (www.recall.ai)
| davidgu wrote:
| At Recall.ai, we built a 10,000-node cluster, processing over
| 1TB/sec of raw video in real-time.
|
| After a major migration, we faced a strange audio issue that led
| us on a deep dive through our infrastructure.
|
| The culprit? Not the audio code--but a hidden interaction with
| AWS's virtual serial ports.
|
| We wrote about our journey discovering the artifacts and finding
| a clean fix!
| jakekw wrote:
| Wild af
| snozolli wrote:
| I'm curious why this is even an issue. I don't understand why an
| actual interrupt would get tripped for virtual serial port
| writes, and I don't understand why a virtual serial port (i.e.
| logging) gets swamped by what seems like a moderate stream of
| data.
|
| The first result for "8250 too much work for irq4" is this:
| https://unix.stackexchange.com/questions/387600/understandin...
| (2017)
|
| _The problem is that the UART hardware that is emulated by
| various brands of virtual machine behaves impossibly, sending
| characters at an impossibly fast line speed. To the kernel, this
| is indistinguishable from faulty real UART hardware that is
| continually raising an interrupt for an empty output buffer /full
| input buffer. (Such faulty real hardwares exist, and you will
| find embedded Linux people also discussing this problem here and
| there.) The kernel pushes the data out/pulls the data in, and the
| UART is immediately raising an interrupt saying that it is ready
| for more._
|
 | So, is this seven-year-old problem still a problem, or is the
 | virtual serial port driver actually unable to keep up with the
 | stream of text?
| toast0 wrote:
| > I don't understand why an actual interrupt would get tripped
| for virtual serial port writes, and I don't understand why a
| virtual serial port (i.e. logging) gets swamped by what seems
| like a moderate stream of data.
|
 | If you're running in a VM, how do you know the difference
 | between a virtual interrupt and an actual interrupt? Either
 | way, you're handling an interrupt.
|
| If I understand the issue, it's not that the virtual serial
| port is swamped really. It seems like the issue is that the
| virtual serial port is firing its empty/ready interrupt so
| often (probably after every byte written to the port?) that the
| driver thinks the interrupt is broken, so then it falls back to
| polling for readiness. The driver polling rate seems like it's
| not high enough to keep up with the logging, so then logging
| blocks.
|
| Probably, it'd be better for the virtual serial port to limit
| the number of irqs it will send. Limiting the number of
| interrupts is a key benefit of the 16550 over earlier UART
| chips; it does this by having a 16 byte buffer and sending
| interrupts only when the buffer is empty/below a threshold (for
| outgoing) or full/above a threshold (for incoming). Getting an
| interrupt for each outbound byte is too many interrupts if
| you're logging a lot of junk on the serial console (which I'm
| guessing happens; modern software is full of junky logs IMHO).
|
 | Probably the right thing to do would be for virtual machine
 | hosts to offer virtio consoles as well as virtual serial
 | ports, and for virtual machine guests to prefer to use a
 | virtio console rather than a serial port for logging. But it's
 | probably possible to adjust the serial driver options as well.
 | Virtio console could be much more efficient, as it allows
 | transfers larger than a single byte at a time.
| toast0 wrote:
| Too late for edit... But rereading the blog and references...
| I suspect if this was on real hardware, with a real 16550
| UART, you'd have the same issue, but without the message
| about too many interrupts.
|
| You'd just have your logs backing up because you're writing
| more than 115kbaud, and then things that write logs become
| blocking. You'd need to figure that out by seeing what
| processes are blocked on what, rather than getting a hint
| because the irq behavior is weird.
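The "writing more than 115kbaud" ceiling is easy to put numbers on. A back-of-envelope calculation (the average line length is an assumption for illustration):

```python
# How much log text a classic serial console can actually move:
# 115200 baud with 8N1 framing is 10 bits on the wire per byte.
baud = 115200
bytes_per_sec = baud // 10      # 11520 bytes/s, ~11 KB/s
line_len = 120                  # assumed average log line length
lines_per_sec = bytes_per_sec / line_len  # ~96 lines/s before writers block

print(bytes_per_sec, round(lines_per_sec))
```

Anything chattier than roughly a hundred lines per second backs up behind the port, and writers start blocking.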
| rcxdude wrote:
| It's not clear from the thread whether the logging to the
| serial port was in the same thread as the audio processing.
| If it was, then it would be a problem either way. If it
 | wasn't, then the fact that the irq is not preemptible would
| mean that it will cause problems for all threads in the
| virtual case, but not in the real hardware case.
| rcxdude wrote:
| Curiously, I can't find this message in recent kernels, it was
| patched out in 2018:
| https://github.com/torvalds/linux/commit/9d7c249a1ef9bf0d569...
| . Is it being re-added by distros or something?
|
| In terms of how it causes the issue, I think it's more that the
| serial driver will monopolise the (virtual) CPU for as long as
| it is receiving or sending text. If there's a large buffer
| being dumped all at once (a common default for non-interactive
| streams), it could stall for long enough to cause an underrun,
| which doesn't need to be very long, even if the average data
| rate isn't very high. Consider that the virtual 8250 serial
| port interface isn't exactly the most efficient interface to
| begin with, since it's at most copying 16 bytes from host to
 | guest at a time.
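The stall-versus-underrun point can be made concrete with illustrative numbers (the burst size and audio period below are assumptions for the sake of the arithmetic, not measurements from the article):

```python
# Even a modest buffered log burst, pushed through a serial path at
# ~11.5 KB/s, stalls for far longer than one real-time audio period.
burst_bytes = 4096            # assume one page of buffered log output
serial_bytes_per_sec = 11520  # 115200 baud, 8N1
stall_ms = burst_bytes * 1000 / serial_bytes_per_sec   # ~356 ms

sample_rate = 48_000
frames_per_period = 480       # a common 10 ms audio period
period_ms = frames_per_period * 1000 / sample_rate

print(round(stall_ms, 1), period_ms)
```

A ~356 ms stall against a 10 ms deadline means dozens of consecutive missed periods from a single buffered dump, even though the *average* log rate looks harmless.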
| kiririn wrote:
| The more we abstract things, treat servers like cattle, and lose
| low level knowledge, the more things like this will happen
|
 | You shouldn't have to try to reproduce this in a test environment
 | - your infrastructure should allow live profiling in production
 | for cases like this. And it should be solved with profiling, not
 | guesswork and bisecting
| theamk wrote:
| Crazy talk! Next thing you are going to say is that realtime-
| focused infrastructure should have native counters to detect
| missing deadlines, so those pops could be preemptively detected
| via dashboards, instead of via customer complaints?
|
| That's not how you live in AI age. Move fast and break things;
| YOLO; K8S all the things; etc..
|
| (/s in case it's not clear)
| theamk wrote:
 | I am surprised... all this complex discovery, and the syslog
 | contained the culprit line from the very start:
 | serial8250: too much work for irq4
|
| do people not look at syslog anymore? It's one of the first
| things I do on unexplainable problems - check for OOM's, thermal
| throttling, BUGs, etc... Sure, it's not the most common problem,
| but the check is fast and easy.
| myself248 wrote:
| What I don't understand, is why the upstream audio doesn't just
| buffer while the downstream thing processing it is blocked. Why
| should that result in audible artifacts, can't it just catch up
| with the rest of the buffer later?
|
| Buffer overruns feels very 1996-cd-burner-ish. Ope, burned a
| coaster, let's try this hellaciously real-time-bound thing again
| with inadequate buffering and I/O devices that have unpredictable
| latency.
|
| What am I missing?
| theamk wrote:
 | it's remote meeting infrastructure, so latency is critical.
 | When burning CDs, or playing music, it's OK to have a second
 | or two of buffer. On a conference call, a second of buffer
 | means a second of latency, which means you ask a question and
 | get the response two seconds later, which is a pretty bad
 | experience. And that's why conference software tries to keep
 | latency as low as possible.
|
| (Now, why does it produce a pop as opposed to
| silence/hiccup/stretched sound? probably because it was easiest
| to code)
| anyfoo wrote:
| Even if the buffer is large enough, at some point (i.e. a
| long enough meeting in this case), it will fill up.
|
| > (Now, why does it produce a pop as opposed to
| silence/hiccup/stretched sound? probably because it was
| easiest to code)
|
| Sudden "silence" pretty much _is_ a pop, and so is the
| silence suddenly ending. The sharp transition at least
| theoretically contains energy in all frequencies (or rather
| the full bandwidth of this bandwidth-limited signal), which
| we perceive as a pop.
|
| Bang a steel bar against a hard table, and you get a whole
| range of frequencies as well, also very pop-like. Do the same
| with a tuning fork, and after the initial bang you get a
| nice, clean, single tone, because the tuning fork effectively
| filters out all the other frequencies through its impulse
| response.
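The claim that a sharp transition "contains energy in all frequencies" can be checked directly with a tiny pure-Python DFT: a clean sine concentrates in one spectral bin, while the same sine cut abruptly to silence smears energy across the spectrum.

```python
import cmath, math

def dft_mag(x):
    """Naive DFT magnitudes (pure Python; fine for tiny N)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n)]

N, tone_bin = 64, 4
sine = [math.sin(2 * math.pi * tone_bin * t / N) for t in range(N)]
# Same sine, cut abruptly to silence halfway through: the "pop".
cut = sine[:N // 2] + [0.0] * (N // 2)

def energy_outside_tone(mag):
    """Fraction of total energy NOT in the tone's two DFT bins."""
    total = sum(m * m for m in mag)
    tone = mag[tone_bin] ** 2 + mag[N - tone_bin] ** 2
    return (total - tone) / total

smooth = energy_outside_tone(dft_mag(sine))  # ~0: a single spectral line
popped = energy_outside_tone(dft_mag(cut))   # half the energy leaves the tone bins
```

The truncation acts as a rectangular window, spreading energy into every bin; that broadband splash is what the ear hears as a click or pop.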
| rcxdude wrote:
| It's possible to handle buffer underruns more elegantly
| than that, but it does require more processing power on the
| receive side of the buffer (basically by using some
| strategy to extrapolate the audio forward and decay, as
| opposed to just dropping the signal to zero when there's no
| data coming from the other side). It's a common thing to do
| in streaming audio contexts, especially voice, but
| generally at the end of the user's network connection which
| is presumed to be unreliable, not in the middle of a
| processing pipeline which is presumed to be able to hit its
| latency targets.
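The extrapolate-and-decay strategy described above can be sketched minimally. Real concealers (e.g. in voice codecs) use pitch-aware extrapolation; this stand-in just repeats the last good frame with a fading gain rather than dropping to zero.

```python
# Minimal sketch of loss concealment: when no fresh frame arrives,
# repeat the last good frame at a decaying gain instead of slamming
# the output to silence. `decay` is an illustrative choice.

def conceal(frames, decay=0.5):
    """frames: list of audio frames (lists of samples), with None for
    a missing frame. Returns a gap-free stream of frames."""
    out = []
    last = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last, gain = frame, 1.0
            out.append(list(frame))
        elif last is not None:
            gain *= decay  # fade the repeated frame toward silence
            out.append([s * gain for s in last])
        else:
            out.append([0.0])  # nothing received yet: emit silence
    return out
```

The smooth fade avoids the broadband discontinuity of an instant drop to zero, at the cost of a little extra work on the receive side.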
| anyfoo wrote:
| Oh yeah, I didn't mean to imply that the pops are a
| necessary consequence of buffer overruns. But as you say,
| gracefully mitigating the symptom requires non-negligible
| engineering effort (and potentially resources), when the
| actual problem, namely buffer overruns occurring because
| the consumer is too slow, shouldn't exist in this system
| in the first place.
| anyfoo wrote:
| > What I don't understand, is why the upstream audio doesn't
| just buffer while the downstream thing processing it is
| blocked. Why should that result in audible artifacts, can't it
| just catch up with the rest of the buffer later?
|
| So you make your buffer larger. But the thing consuming the
| data is still slower than the thing producing the data, so that
| buffer will eventually overrun, too. The graceful way to fail
| is not to block, but to quickly drop the excess data. However,
| you're now losing data, and doing so very unexpectedly overall,
| because apparently this transcription service was supposed to
| be well able to keep up with the audio stream. (If I understood
| this whole thing right, it's a bit vague.)
|
| So really, unless the root cause is taken care of, you either
| block or lose data, without good reason in this case.
|
| > Buffer overruns feels very 1996-cd-burner-ish.
|
| That was a buffer _underrun_ , the opposite. The CD burner,
| moving at a fixed speed, ran out of data because the producer
| (the rest of the PC, effectively) was not able to replenish the
| buffer quickly enough. The fix, besides faster PCs, was to have
| very large buffers, that were able to pick up the slack for
| longer than the duration of a "dropout" on the PC side. But
| between those "dropouts", the PC was probably well able to
| produce the data to the CD burner at a _higher_ rate than
| necessary, so the large buffer stayed filled on average.
|
| Both buffer overruns and underruns are still very much a
| concern. They exist anywhere where producers and consumers
| operate at a mismatched rate for more than just a short time,
| in a real-time setting.
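The drop-instead-of-block failure mode contrasted above can be sketched as a bounded buffer that refuses new data when full (note `collections.deque(maxlen=...)` drops from the *old* end; this sketch drops the *new* sample to keep the illustration explicit):

```python
from collections import deque

class DropOnFullBuffer:
    """Bounded buffer that loses data rather than stalling the producer."""

    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity
        self.dropped = 0

    def push(self, sample):
        if len(self.q) >= self.capacity:
            self.dropped += 1   # lose data, but never block
            return False
        self.q.append(sample)
        return True

    def pop(self):
        return self.q.popleft() if self.q else None
```

Whether you drop old data or new data, the accounting (`dropped`) matters: a nonzero counter is the signal that the consumer is structurally too slow and the root cause needs fixing.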
| PaulDavisThe1st wrote:
| If you (a) need low latency (b) have hardware with
| unpredictable/unreliable latency behavior, you're screwed by
| definition.
|
| If you can manage with large latency, then sure, buffer the
| hell out of everything and your chances of losing data will be
| close to zero.
|
| If you need low latency, you _cannot_ buffer the hell out of
| everything, and anything that interferes with the data flow
| runs the risk of causing an underrun.
|
| The solution in the CD burner case was easy: massive latency is
| fine, so buffer the hell out of it, and it just works.
|
| The solution for low-latency audio anything is not so easy: you
| can't buffer the hell out of it, which means you're susceptible
| to hardware and software issues that disrupt timing and end up
| causing underruns/overruns.
| PaulDavisThe1st wrote:
| > There are a few processes in the bot that are especially
| latency sensitive, which we have tuned the nice value for.
|
| This immediately signifies some level of "does not understand how
| this stuff works". Latency sensitive (audio or other) tasks need
| to be in the SCHED_RR or SCHED_FIFO scheduling class, which
| nice(1) has no effect on.
|
 | Conversely, using nice(1) on a SCHED_OTHER task is also
 | unlikely to help, given that nice only biases scheduling
 | decisions and cannot provide RT-like behavior.
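For reference, moving a task into a real-time class from Python looks like the sketch below, using the real `os.sched_setscheduler` API. It is Linux-only and requires `CAP_SYS_NICE` (or root); without privileges it falls back gracefully, which is exactly the situation an unprivileged, merely-reniced process is stuck in.

```python
import os

def try_make_realtime(priority=50):
    """Attempt to put the current process into SCHED_FIFO.
    Returns True on success, False when unsupported or unprivileged."""
    if not hasattr(os, "SCHED_FIFO"):   # non-Linux platforms
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except PermissionError:
        # Without CAP_SYS_NICE, plain nice levels in SCHED_OTHER are
        # all this task gets -- no RT guarantees.
        return False
```

In containerized deployments the permission question is its own project: granting `CAP_SYS_NICE` (or configuring rtprio limits) has to be plumbed through the orchestration layer.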
| Animats wrote:
 | Bottlenecking on logging is a common problem. Classically,
 | Linux assumed it had a serial console device, and that can
 | still be enabled at kernel compile time.[1] Apparently, this
 | mass of AWS instances, VMs, and Docker images works that way.
|
| I liked the QNX approach, where the kernel sends log messages to
| another process, loaded on boot. When you build a boot image, you
| provide a logger process to read those messages. There's no
| expectation in embedded that there's a console available. If your
| system is a pump or an auto dashboard, you need to send the
| messages somewhere other than a "console". There might not even
| be a file system. In QNX, messaging is a primitive on top of
| which file systems and networking are built.
|
| When I had to do logging in real time, each process used a
| library which did "logprintf" calls. Those went into a circular
| buffer which was written to a log file from another thread. If
| the circular buffer filled, "..." would appear in the log, and
| log messages would be lost, but the real-time thread would not
| block.
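The "logprintf" scheme described above can be sketched as follows. Names here are illustrative, not from any real library; a production version would use a lock-free ring rather than a mutex, but the behavior is the same: the real-time side never blocks, and a "..." marker records any gap.

```python
from collections import deque
import threading

class NonBlockingLog:
    """Bounded in-memory log: real-time threads push and never block;
    a separate writer thread drains to the log file."""

    def __init__(self, capacity=1024):
        self.buf = deque()
        self.capacity = capacity
        self.lock = threading.Lock()
        self.overflowed = False

    def logprintf(self, msg):           # called from real-time threads
        with self.lock:
            if len(self.buf) >= self.capacity:
                self.overflowed = True  # drop the message, never block
                return
            if self.overflowed:
                self.buf.append("...")  # mark the gap in the output
                self.overflowed = False
            self.buf.append(msg)

    def drain(self):                    # called from the writer thread
        with self.lock:
            out, self.buf = list(self.buf), deque()
            return out
```

Losing a few log lines under pressure is a deliberate trade: the alternative, as the thread above shows, is the logger back-pressuring your real-time path.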
|
| Interaction between real-time and non-real-time is always tough.
| It comes up a lot in networked game development.
|
| [1] https://www.kernel.org/doc/html/latest/admin-guide/serial-
| co...
___________________________________________________________________
(page generated 2024-10-28 23:00 UTC)