[HN Gopher] Debugging audio artifacts caused by... a serial port?
       ___________________________________________________________________
        
       Debugging audio artifacts caused by... a serial port?
        
       Author : davidgu
       Score  : 36 points
       Date   : 2024-10-23 15:54 UTC (5 days ago)
        
 (HTM) web link (www.recall.ai)
 (TXT) w3m dump (www.recall.ai)
        
       | davidgu wrote:
       | At Recall.ai, we built a 10,000-node cluster, processing over
       | 1TB/sec of raw video in real-time.
       | 
       | After a major migration, we faced a strange audio issue that led
       | us on a deep dive through our infrastructure.
       | 
       | The culprit? Not the audio code--but a hidden interaction with
       | AWS's virtual serial ports.
       | 
       | We wrote about our journey discovering the artifacts and finding
       | a clean fix!
        
       | jakekw wrote:
       | Wild af
        
       | snozolli wrote:
       | I'm curious why this is even an issue. I don't understand why an
       | actual interrupt would get tripped for virtual serial port
       | writes, and I don't understand why a virtual serial port (i.e.
       | logging) gets swamped by what seems like a moderate stream of
       | data.
       | 
       | The first result for "8250 too much work for irq4" is this:
       | https://unix.stackexchange.com/questions/387600/understandin...
       | (2017)
       | 
       |  _The problem is that the UART hardware that is emulated by
       | various brands of virtual machine behaves impossibly, sending
       | characters at an impossibly fast line speed. To the kernel, this
       | is indistinguishable from faulty real UART hardware that is
       | continually raising an interrupt for an empty output buffer /full
       | input buffer. (Such faulty real hardwares exist, and you will
       | find embedded Linux people also discussing this problem here and
       | there.) The kernel pushes the data out/pulls the data in, and the
       | UART is immediately raising an interrupt saying that it is ready
       | for more._
       | 
       | So, is this seven year old problem still a problem, or is the
       | virtual serial port driver actually unable to keep up with the
       | stream of text?
        
         | toast0 wrote:
         | > I don't understand why an actual interrupt would get tripped
         | for virtual serial port writes, and I don't understand why a
         | virtual serial port (i.e. logging) gets swamped by what seems
         | like a moderate stream of data.
         | 
         | If you're running in a VM, how do you know the difference
         | between a virtual interrupt and a actual interrupt? Either way,
         | you're handling an interrupt.
         | 
         | If I understand the issue, it's not that the virtual serial
         | port is swamped really. It seems like the issue is that the
         | virtual serial port is firing its empty/ready interrupt so
         | often (probably after every byte written to the port?) that the
         | driver thinks the interrupt is broken, so then it falls back to
         | polling for readiness. The driver polling rate seems like it's
         | not high enough to keep up with the logging, so then logging
         | blocks.
         | 
         | Probably, it'd be better for the virtual serial port to limit
         | the number of irqs it will send. Limiting the number of
         | interrupts is a key benefit of the 16550 over earlier UART
         | chips; it does this by having a 16 byte buffer and sending
         | interrupts only when the buffer is empty/below a threshold (for
         | outgoing) or full/above a threshold (for incoming). Getting an
         | interrupt for each outbound byte is too many interrupts if
         | you're logging a lot of junk on the serial console (which I'm
         | guessing happens; modern software is full of junky logs IMHO).
         | 
         | Probably the right thing to do would be for virtual machine
         | hosts to over virtio consoles as well as virtual serial ports,
         | and for virtual machine guests to prefer to use a virtualio
         | console rather than a serial port for logging. But it's
         | probably possible to adjust the serial driver options as well.
         | Virtio console could be much more efficient, as it allows
         | transfers larger than a single byte at a time.
        
           | toast0 wrote:
           | Too late for edit... But rereading the blog and references...
           | I suspect if this was on real hardware, with a real 16550
           | UART, you'd have the same issue, but without the message
           | about too many interrupts.
           | 
           | You'd just have your logs backing up because you're writing
           | more than 115kbaud, and then things that write logs become
           | blocking. You'd need to figure that out by seeing what
           | processes are blocked on what, rather than getting a hint
           | because the irq behavior is weird.
        
             | rcxdude wrote:
             | It's not clear from the thread whether the logging to the
             | serial port was in the same thread as the audio processing.
             | If it was, then it would be a problem either way. If it
             | wasn't, then the fact that the irq is not premptable would
             | mean that it will cause problems for all threads in the
             | virtual case, but not in the real hardware case.
        
         | rcxdude wrote:
         | Curiously, I can't find this message in recent kernels, it was
         | patched out in 2018:
         | https://github.com/torvalds/linux/commit/9d7c249a1ef9bf0d569...
         | . Is it being re-added by distros or something?
         | 
         | In terms of how it causes the issue, I think it's more that the
         | serial driver will monopolise the (virtual) CPU for as long as
         | it is receiving or sending text. If there's a large buffer
         | being dumped all at once (a common default for non-interactive
         | streams), it could stall for long enough to cause an underrun,
         | which doesn't need to be very long, even if the average data
         | rate isn't very high. Consider that the virtual 8250 serial
         | port interface isn't exactly the most efficient interface to
         | begin with, since it's at most copying 16 bytes from host to
         | guest at a time
        
       | kiririn wrote:
       | The more we abstract things, treat servers like cattle, and lose
       | low level knowledge, the more things like this will happen
       | 
       | You shouldn't have to try to reproduce this in a test environment
       | - your infrastructure should allow profiling in live for cases
       | like this. And it should be solved with profiling, not guesswork
       | and bisecting
        
         | theamk wrote:
         | Crazy talk! Next thing you are going to say is that realtime-
         | focused infrastructure should have native counters to detect
         | missing deadlines, so those pops could be preemptively detected
         | via dashboards, instead of via customer complaints?
         | 
         | That's not how you live in AI age. Move fast and break things;
         | YOLO; K8S all the things; etc..
         | 
         | (/s in case it's not clear)
        
       | theamk wrote:
       | I am surprised.. all this complex discovery and the syslogs were
       | containing the culprit line from the first moment:
       | serial8250: too much work for irq4
       | 
       | do people not look at syslog anymore? It's one of the first
       | things I do on unexplainable problems - check for OOM's, thermal
       | throttling, BUGs, etc... Sure, it's not the most common problem,
       | but the check is fast and easy.
        
       | myself248 wrote:
       | What I don't understand, is why the upstream audio doesn't just
       | buffer while the downstream thing processing it is blocked. Why
       | should that result in audible artifacts, can't it just catch up
       | with the rest of the buffer later?
       | 
       | Buffer overruns feels very 1996-cd-burner-ish. Ope, burned a
       | coaster, let's try this hellaciously real-time-bound thing again
       | with inadequate buffering and I/O devices that have unpredictable
       | latency.
       | 
       | What am I missing?
        
         | theamk wrote:
         | it's remote meeting infrastructure, so the latency is critical.
         | When burning CDs, or playing music, it's OK to have a second or
         | two of buffer. When doing conference call, a second of buffer
         | means a second of latency, which means you ask a question and
         | get a response 2 seconds back, which is pretty bad experience.
         | And that's why conference software tries to keep latency as low
         | as possible.
         | 
         | (Now, why does it produce a pop as opposed to
         | silence/hiccup/stretched sound? probably because it was easiest
         | to code)
        
           | anyfoo wrote:
           | Even if the buffer is large enough, at some point (i.e. a
           | long enough meeting in this case), it will fill up.
           | 
           | > (Now, why does it produce a pop as opposed to
           | silence/hiccup/stretched sound? probably because it was
           | easiest to code)
           | 
           | Sudden "silence" pretty much _is_ a pop, and so is the
           | silence suddenly ending. The sharp transition at least
           | theoretically contains energy in all frequencies (or rather
           | the full bandwidth of this bandwidth-limited signal), which
           | we perceive as a pop.
           | 
           | Bang a steel bar against a hard table, and you get a whole
           | range of frequencies as well, also very pop-like. Do the same
           | with a tuning fork, and after the initial bang you get a
           | nice, clean, single tone, because the tuning fork effectively
           | filters out all the other frequencies through its impulse
           | response.
        
             | rcxdude wrote:
             | It's possible to handle buffer underruns more elegantly
             | than that, but it does require more processing power on the
             | receive side of the buffer (basically by using some
             | strategy to extrapolate the audio forward and decay, as
             | opposed to just dropping the signal to zero when there's no
             | data coming from the other side). It's a common thing to do
             | in streaming audio contexts, especially voice, but
             | generally at the end of the user's network connection which
             | is presumed to be unreliable, not in the middle of a
             | processing pipeline which is presumed to be able to hit its
             | latency targets.
        
               | anyfoo wrote:
               | Oh yeah, I didn't mean to imply that the pops are a
               | necessary consequence of buffer overruns. But as you say,
               | gracefully mitigating the symptom requires non-negligible
               | engineering effort (and potentially resources), when the
               | actual problem, namely buffer overruns occurring because
               | the consumer is too slow, shouldn't exist in this system
               | in the first place.
        
         | anyfoo wrote:
         | > What I don't understand, is why the upstream audio doesn't
         | just buffer while the downstream thing processing it is
         | blocked. Why should that result in audible artifacts, can't it
         | just catch up with the rest of the buffer later?
         | 
         | So you make your buffer larger. But the thing consuming the
         | data is still slower than the thing producing the data, so that
         | buffer will eventually overrun, too. The graceful way to fail
         | is not to block, but to quickly drop the excess data. However,
         | you're now losing data, and doing so very unexpectedly overall,
         | because apparently this transcription service was supposed to
         | be well able to keep up with the audio stream. (If I understood
         | this whole thing right, it's a bit vague.)
         | 
         | So really, unless the root cause is taken care of, you either
         | block or lose data, without good reason in this case.
         | 
         | > Buffer overruns feels very 1996-cd-burner-ish.
         | 
         | That was a buffer _underrun_ , the opposite. The CD burner,
         | moving at a fixed speed, ran out of data because the producer
         | (the rest of the PC, effectively) was not able to replenish the
         | buffer quickly enough. The fix, besides faster PCs, was to have
         | very large buffers, that were able to pick up the slack for
         | longer than the duration of a "dropout" on the PC side. But
         | between those "dropouts", the PC was probably well able to
         | produce the data to the CD burner at a _higher_ rate than
         | necessary, so the large buffer stayed filled on average.
         | 
         | Both buffer overruns and underruns are still very much a
         | concern. They exist anywhere where producers and consumers
         | operate at a mismatched rate for more than just a short time,
         | in a real-time setting.
        
         | PaulDavisThe1st wrote:
         | If you (a) need low latency (b) have hardware with
         | unpredictable/unreliable latency behavior, you're screwed by
         | definition.
         | 
         | If you can manage with large latency, then sure, buffer the
         | hell out of everything and your chances of losing data will be
         | close to zero.
         | 
         | If you need low latency, you _cannot_ buffer the hell out of
         | everything, and anything that interferes with the data flow
         | runs the risk of causing an underrun.
         | 
         | The solution in the CD burner case was easy: massive latency is
         | fine, so buffer the hell out of it, and it just works.
         | 
         | The solution for low-latency audio anything is not so easy: you
         | can't buffer the hell out of it, which means you're susceptible
         | to hardware and software issues that disrupt timing and end up
         | causing underruns/overruns.
        
       | PaulDavisThe1st wrote:
       | > There are a few processes in the bot that are especially
       | latency sensitive, which we have tuned the nice value for.
       | 
       | This immediately signifies some level of "does not understand how
       | this stuff works". Latency sensitive (audio or other) tasks need
       | to be in the SCHED_RR or SCHED_FIFO scheduling class, which
       | nice(1) has no effect on.
       | 
       | Conversely, using nice(1) on a SCHED_OTHER task is also unlikely
       | to work, given that nice only impacts scheduling decisions and
       | cannot provide RT-like behavior.
        
       | Animats wrote:
       | Bottlenecking on logging is a common problem. Classically, Linux
       | assumed it has a serial console device. It can still be enabled
       | at kernel compile time.[1] Apparently, this mass of AWS
       | instances, VMs, and Docker images works that way.
       | 
       | I liked the QNX approach, where the kernel sends log messages to
       | another process, loaded on boot. When you build a boot image, you
       | provide a logger process to read those messages. There's no
       | expectation in embedded that there's a console available. If your
       | system is a pump or an auto dashboard, you need to send the
       | messages somewhere other than a "console". There might not even
       | be a file system. In QNX, messaging is a primitive on top of
       | which file systems and networking are built.
       | 
       | When I had to do logging in real time, each process used a
       | library which did "logprintf" calls. Those went into a circular
       | buffer which was written to a log file from another thread. If
       | the circular buffer filled, "..." would appear in the log, and
       | log messages would be lost, but the real-time thread would not
       | block.
       | 
       | Interaction between real-time and non-real-time is always tough.
       | It comes up a lot in networked game development.
       | 
       | [1] https://www.kernel.org/doc/html/latest/admin-guide/serial-
       | co...
        
       ___________________________________________________________________
       (page generated 2024-10-28 23:00 UTC)