https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-what-flips-your-bit/

This Week in Glean: What Flips Your Bit?

Travis Long, April 13, 2022

("This Week in Glean" is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

The idea of "soft-errors", particularly "single-event upsets", often comes up when we have strange errors in telemetry. Wikipedia defines single-event upsets as: "a change of state caused by one single ionizing particle (ions, electrons, photons...) striking a sensitive node in a micro-electronic device, such as in a microprocessor, semiconductor memory, or power transistors. The state change is a result of the free charge created by ionization in or close to an important node of a logic element (e.g. memory "bit")". And what exactly causes these single-event upsets? Well, from the same Wikipedia article: "Terrestrial SEU arise due to cosmic particles colliding with atoms in the atmosphere, creating cascades or showers of neutrons and protons, which in turn may interact with electronic circuits". In other words, energy from space can affect your computer and turn a 1 into a 0 or vice versa.

There are examples in our data, collected by Glean from Mozilla projects like Firefox, that appear to differ by a single bit from the value we would expect. In almost every case we cannot find any plausible explanation or bug in any of the infrastructure from client to analysis, so we often shrug and say "oh well, it must be cosmic rays". It is a totally fantastical explanation for an empirical measurement of some anomaly that we cannot explain.
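To make "malformed by a single bit" concrete, here is a minimal Python sketch of how one flipped bit changes a string, and how an observed value could be checked against a list of expected values. The function names and the example labels are hypothetical, chosen only for illustration; this is not how the Glean pipeline itself looks for such anomalies.

```python
from typing import List, Optional


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted.

    Bit 0 is the least significant bit of the first byte.
    """
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def bit_distance(a: bytes, b: bytes) -> Optional[int]:
    """Count the bits that differ between two equal-length byte strings.

    Returns None when the lengths differ, since a single bit flip
    cannot change the length of a value.
    """
    if len(a) != len(b):
        return None
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def single_bit_match(observed: str, expected_labels: List[str]) -> Optional[str]:
    """Return the expected label exactly one bit away from `observed`, if any."""
    obs = observed.encode("utf-8")
    for label in expected_labels:
        if bit_distance(obs, label.encode("utf-8")) == 1:
            return label
    return None


if __name__ == "__main__":
    # Hypothetical label set; real label names would come from the data.
    labels = ["baseline", "metrics", "events"]
    corrupted = flip_bit(b"metrics", 0).decode("utf-8")
    print(corrupted, "->", single_bit_match(corrupted, labels))
```

A value that is exactly one bit away from a known label is only a candidate for a bit flip. A typo or a bug can produce the same difference, so a check like this narrows down where to look rather than proving the cause.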
What if it wasn't just some fantastical explanation? What if there was some grain of truth in there and somehow we could detect cosmic rays with browser telemetry data? These questions struck me recently, when I became aware of a newly filed bug that described just these sorts of errors in its data. The errors were showing up as strings with a single character different in the data (well, a single bit actually). At about the same time, I read an article about a geomagnetic storm that hit at the end of March. Something clicked and I started to really wonder if we could possibly have detected a cosmic event through these single-event upsets in our telemetry data.

I did a little research to see if there was any data on the frequency of these events and found a handful of articles that kept referring to a study done by IBM in the 1990s that referenced 1 cosmic ray bit flip per 256MB of memory per month. After a little digging, I was able to come up with two papers by J.F. Ziegler, an IBM researcher. The first paper, from 1979, on "The Effects of Cosmic Rays on Computer Memories", goes into the mechanisms by which cosmic rays can affect bits in computer memory, and makes some rough estimates of the frequency of such events, as well as the effect of elevation on that frequency. The later article from the 1990s, "Accelerated Testing For Cosmic Soft-Error Rate", went into more detail, measuring the soft-error rates of different chips from different manufacturers. While I never found the exact source of the "1 bit-flip per 256MB per month" quote in either of these papers, the figure could possibly be generalized from the soft-error rate data in the papers. So, while I'm not entirely sure that the number for the rate is accurate, it's probably close enough for us to do some simple calculations.
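As a sketch of the kind of simple calculation that figure allows, the Python below converts "1 bit flip per 256MB per month" into a per-terabyte rate and scales it by a data volume. The 100 TB per day volume is the figure quoted later in this post. The function names, the use of binary units (1 TB = 1024 × 1024 MB), and the treatment of a 4-week period as one "month" are assumptions made for illustration.

```python
MB_PER_TB = 1024 * 1024  # assumption: binary units, 1 TB = 1,048,576 MB


def flips_per_tb_per_month(mb_per_flip: int = 256) -> float:
    """Convert '1 flip per `mb_per_flip` MB per month' to flips per TB per month."""
    return MB_PER_TB / mb_per_flip


def expected_flips(tb_per_day: float, days: int, mb_per_flip: int = 256) -> float:
    """Estimate bit flips for `days` of data arriving at `tb_per_day`.

    Hand-wavy by design: every byte received in the period is treated as
    if it had been exposed at the monthly rate, and the period is treated
    as one 'month'.
    """
    total_tb = tb_per_day * days
    return flips_per_tb_per_month(mb_per_flip) * total_tb


if __name__ == "__main__":
    print(flips_per_tb_per_month())   # flips per TB per month
    print(expected_flips(100, 28))    # flips in a 4-week period at 100 TB/day
```

Keeping the rate as a parameter makes it easy to see how sensitive the estimate is to the uncertain "256MB" figure. Doubling `mb_per_flip` halves the result.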
So, now that I had checked out the facts behind cosmic-ray-induced errors, it was time to see if there was any evidence of this in our data. First of all, where would I most likely find these sorts of errors? I thought about the types of data that we collect and decided that a bit flip in a numeric field would be nearly impossible to detect, unless the field had a very limited expected range. String fields seemed to be an easier candidate to search, since a single bit flip tends to make a string look a little weird by introducing a single unexpected character. Our error streams are also good places to go looking for bit flips, such as when a column or table name is affected.

Secondly, I had to make a few hand-wavy assumptions in order to crunch some numbers. The main assumption is that every bit in our data has the same chance of being flipped as any other bit in any other memory. The secondary assumption is that the bits are getting flipped at the client side of the connection, and not while on our servers.

We have a lot of users, and the little bit of data we collect from each client really adds up. Let's convert that error rate to some more compatible units. Using the figure of 1 bit flip per 256MB per month from the articles, and taking 1 TB as 1,048,576 MB, that's 4096 cosmic soft-errors per terabyte per month. According to my colleague, chutten, we receive about 100 terabytes of data per day, or 2,800 TB in a 4-week period. If we multiply that out, it looks like we have the potential to find 11,468,800 bit flips in a given 4-week period of our data. WHAT!

That seemed like an awful lot of possibilities, even if I suspect a good portion of them to be undetectable simply because they are not an "obvious" bit flip. Looking at the Bugzilla issue that had originally sparked my interest in this, it contained some evidence of labels embedded in the data being affected by bit-flips.
This was pretty easy to spot because we knew what labels we were expecting and the handful of anomalies stood out. Not only that, the effect seemed to be somewhat localized to a geographical area. Maybe this wasn't such a bad place to try to correlate this information with space-weather forecasts. Back to the internet I went, and I found an interesting space-weather article that seemed to line up with the dates from the bug.

I finally hit a bit of a wall in this fantastical investigation when I found it difficult to get data on solar radiation by day and geographical location. There is a rather nifty site, SpaceWeatherLive.com, which has quite a bit of interesting data on solar radiation, but I was starting to hit the limits of my current knowledge and of the time I had set aside to write this blog post. So, rather reluctantly, I had to put off any deeper investigation for another day.

I do leave the search here feeling that not only is it possible that our data contains signals of cosmic activity, but that it is very likely that it could be used to correlate or even measure the impact of cosmic-ray-induced single-event upsets. I hope that sometime in the future I can come back to this and dig a little deeper. Perhaps someone reading this will also be inspired to poke around at this possibility and would be interested in collaborating on it. If you are, you can reach me via the Glean Channel on Matrix as @travis. For now, I've turned something that seemed like a crazy possibility in my mind into something that seems a lot more likely than I ever expected. Not a bad investigation at all.