Remix Hacker News Clone

10% of Firefox crashes are caused by bitflips

I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.

Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!

We'd save the test result to the registry and include the result in automated bug reports.
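
A per-frame self-test of this kind might look like the sketch below. It is not the Guild Wars code: the buffer size, the seeded fill pattern, the CRC32 workload, and every name are assumptions chosen for illustration. The idea is to fill freshly allocated memory deterministically, run a computation over it, and compare the result with a known value.

```python
import random
import zlib

BUFFER_SIZE = 64 * 1024  # bytes per test buffer (hypothetical size)
SEED = 0xC0FFEE          # fixed seed, so every run should produce the same bytes


def make_buffer(seed=SEED, size=BUFFER_SIZE):
    """Allocate a fresh buffer and fill it with deterministic pseudo-random bytes."""
    rng = random.Random(seed)
    return bytearray(rng.getrandbits(8) for _ in range(size))


# Stand-in for the table of known values: a real client would ship constants
# computed on trusted hardware rather than deriving them at startup.
EXPECTED_CRC = zlib.crc32(make_buffer())


def self_test(inject_bit=None):
    """Return True if a freshly computed checksum matches the known value.

    inject_bit simulates a hardware fault by flipping one bit of the buffer.
    """
    buf = make_buffer()
    if inject_bit is not None:
        buf[inject_bit // 8] ^= 1 << (inject_bit % 8)
    return zlib.crc32(buf) == EXPECTED_CRC


if __name__ == "__main__":
    print("healthy run passes:", self_test())
    print("run with one flipped bit passes:", self_test(inject_bit=12345))
```

CRC32 is used here only because it is guaranteed to change on any single-bit error; the original test reportedly used math-heavy computations, which would also exercise the CPU.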

The common causes we discovered for the problem were:

- overclocked CPU

- bad memory wait-state configuration

- underpowered power supply

- overheating due to under-specced cooling fans or dusty intakes

These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.

Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

by netcoyote 1772694403

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

by Animats 1772755063

Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime/debug.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.

Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.

In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in my own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

by adonovan 1772675599

Firefox is about the only piece of software in my setup that occasionally crashes. I say "occasionally" for lack of a better word, it's not "all the time", but it is definitely more than I would want to.

If that was caused by bad memory, I would expect other software to be similarly affected and hence crash with about comparable frequency. However, it looks like I'm falling more into the other 90% of cases (unsurprisingly) because I do not observe other software crashing as much as firefox does.

Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.

by kleiba 1772781394

I've written genetic programming experiments that do not require an explicit mutation operator because the machine would tend to flip bits in the candidate genomes under the heavy system load. It took me a solid week to determine that I didn't actually have a bug in my code. It happens so fast on my machine (when it's properly loaded) that I can depend on it to some extent.

by bob1029 1772785293

Strange. I have a tab hoarding problem, I often have over 1000 tabs open [1][2], and I cannot remember the last time Firefox crashed. I'm thinking it must have been years? I use ublock origin though, which might help since ads do their best to steal your computer and soul through any means possible of course.

I also use a bunch of other extensions though, dark reader, vimium, sideberry... I'd expect me to be a bit more exposed than the average user. Yet it's just rock stable for me. Maybe it just works better on linux?

1: I know this because I installed https://addons.mozilla.org/en-US/firefox/addon/tab-counter-p... to check :)

2: However after finding Karakeep I don't actually have 1000 tabs anymore!

by bergheim 1772806342

It's worth noting that the thread says "up to 10%," not "10%" as the title suggests. So it's reasonable to believe the rate is as low as 5% based on the only real figure given (25000 / 470000)
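
The arithmetic behind the 5% figure can be checked in a few lines. This is only a sketch using the two counts quoted in the thread; the variable names are mine, and the doubling step reflects the original post's "at least twice as much" guess rather than any measurement.

```python
flagged_reports = 25_000   # reports the heuristic attributed to bad memory
total_reports = 470_000    # all crash reports received that week

measured_share = flagged_reports / total_reports
doubled_share = 2 * measured_share  # the post's "at least twice as much" guess

print(f"measured: {measured_share:.1%}")
print(f"doubled:  {doubled_share:.1%}")
```

The measured share comes out at about 5.3%, and only the unexplained doubling brings it to roughly 10.6%.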

I think our education system should include a unit on "marketing bullshit" sometime early in elementary school. Maybe as part of math class, after they learn inequalities. "Ok kids, remind me, what does 'up to' mean?" "less than or equal to!"

by ryukoposting 1772810226

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

by thegrim33 1772658546

I'll submit my bit flip story for consideration also :) https://julialang.org/blog/2020/09/rr-memory-magic/

by KenoFischer 1772770813

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

by shevy-java 1772751655

As someone who has a strong background from hobby projects with five-digit users before going into work, I think one of the most interesting differences I experienced was that the problems you see at scale simply don't exist on small scale projects. Bit flips/bad memory is one of them.

by petterroea 1772780246

That's super interesting because I remember Linus Torvalds saying he requires ECC RAM in his computers, because he got tired of weird issues that were resolved by a reboot.

But non-ECC is fine for most of us mortals gaming and streaming.

I would expect pro gamers to opt for ECC though.

by INTPenis 1772782472

It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."

by camkego 1772677806

Bit flips aren’t always bad hardware. I remember an anecdote from Sandia from my HPC days - they found they were getting more bit flips on some machines than others on their cluster and sometimes correlated.

Turned out at their altitude cosmic rays were flipping bits in the top-most machines in the racks, sometimes then penetrating lower and flipping bits in more machines too.

by moconnor 1772782765

I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.

by kdklol 1772661255

I would love to see DDR4 vs DDR5 bitflips. As I understand it DDR5 must come with some level of ECC [1].

[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...

by bhelkey 1772753064

This is quite surprising to me, since I thought the percentage would be a lot lower.

But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.

I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).

Granted, I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I’m loath to troubleshoot by disabling all extensions and then re-enabling them one by one. Why should an extension even cause a crash, especially when it’s a WebExtension (unlike the older XUL extensions that had a deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.

I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).

I would sincerely appreciate Firefox not crashing as often for me.

by newscracker 1772766292

Just out of interest, is ECC memory supposed to be more resilient to these types of failure?

by bilekas 1772795505

Oh, on my old PC, FF sometimes mysteriously crashed for no apparent reason. I sent bug reports and cleared the profile, and it seemed to help for a while; then it crashed again. Much later, I suspected the RAM, tested it, and it turned out to have a faulty module!

by sinuhe69 1772802683

>>> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger

Having the number of unique machines would be great to see how skewed this estimate is.

by fasteo 1772788856

Maybe a partial solution would be to duplicate pointer data, compare the copies at every dereference, and panic if they don't match. In essence a poor man's version of ECC. It's a considerable runtime overhead, but it might be possible to hide it behind a flag, only to be turned on to reproduce bugs. Also, anti-cheat measures already do something similar.
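
A minimal sketch of that duplicate-and-compare idea follows. The class and exception names are hypothetical, and storing the second copy bit-inverted is my variation on plain duplication (it also catches faults that corrupt both copies identically), not something the comment specifies.

```python
class CorruptionError(RuntimeError):
    """Raised when the two stored copies of a value disagree."""


class GuardedWord:
    """A 64-bit value stored twice: once as-is, once bit-inverted."""

    MASK = (1 << 64) - 1

    def __init__(self, value):
        self._primary = value & self.MASK
        self._shadow = ~value & self.MASK  # inverted copy

    def get(self):
        # An intact pair XORs to all ones; any single flipped bit breaks that.
        if self._primary ^ self._shadow != self.MASK:
            raise CorruptionError("stored copies disagree")
        return self._primary


if __name__ == "__main__":
    word = GuardedWord(0x7FFF12345678)
    print(hex(word.get()))
    word._primary ^= 1 << 17  # simulate a bitflip in one copy
    try:
        word.get()
    except CorruptionError as exc:
        print("detected:", exc)
```

Every read pays for an extra XOR and comparison, and every value takes twice the space, which is the overhead the comment suggests hiding behind a flag.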

Certain data is more sensitive as well and requires extra protection. Pointers and indexes, obviously, since corrupting them might send the whole application on a wild goose chase around memory. But machine code, especially JIT-generated traces, is also worth checksumming and verifying before it is executed.
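
Checksumming generated code before running it could be sketched as below. The payload is an opaque byte string standing in for a JIT trace, the `execute` callback stands in for jumping into it, and all names are hypothetical; CRC32 is just one possible checksum.

```python
import zlib
from dataclasses import dataclass


class ChecksumMismatch(RuntimeError):
    """Raised when a payload no longer matches the checksum taken at seal time."""


@dataclass(frozen=True)
class SealedCode:
    payload: bytes  # stand-in for generated machine code
    crc: int        # checksum recorded when the code was generated


def seal(payload):
    """Record a checksum alongside freshly generated code."""
    return SealedCode(payload, zlib.crc32(payload))


def run_verified(sealed, execute):
    """Re-check the checksum, then hand the payload to the execute callback."""
    if zlib.crc32(sealed.payload) != sealed.crc:
        raise ChecksumMismatch("payload changed since it was sealed")
    return execute(sealed.payload)


if __name__ == "__main__":
    trace = seal(bytes([0x01, 0x02, 0x03]))
    print(run_verified(trace, len))
```

Verification only narrows the window: a flip that lands after the check but before execution still gets through.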

by samus 1772789200

I guess the percentage of crashes due to hardware is high because people with faulty hardware are experiencing the vast majority of crashes. It sounds kind of dumb when put like that, I'm actually surprised it's that low a percentage.

by Neil44 1772788325

This seems like the kind of metric that 3 users with 15 year old machines can skew significantly.

Has to be normalized, and outliers eliminated in some consistent manner.

by fooker 1772756469

I bought my PC like 2 weeks ago, ran my RAM at 5800 to test its limits, and forgot to lower it. After a few strange crashes of my Fedora desktop (super strange behavior: apps refusing to start or stop, couldn't even escape to the console) I ran memtest today and it lit up all red in the first 2 minutes! Then I logged in to my now-stable desktop at 5200 MT and saw this on the HN front page! What are the chances?!!

by Habgdnv 1772773876

It's high enough that I would wonder if some systems software issues are mixed in, like rare races in malloc or page table management.

by kev009 1772753220

> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger.

470k crashes in a single week, and this is under-reported! I bet the number of crashes is far higher. My snap Firefox on Ubuntu would lock-up, forcing me to kill it from the system monitor, and this was never reported as a crash.

Once upon a time I wrote software for safety critical systems in C/C++, where the code was deployed and expected to work for 10 years (or more) and interact with systems not built yet. Our system could lose power at any time (no battery) and we would have at best 1ms warning.

Even if Firefox moves to Rust, it will not resolve these issues. 5% of their crashes could be coming from resource exhaustion, likely mostly RAM - why is this not being checked prior to allocation? 5% of their crashes could be resolved tomorrow if they just checked how much RAM was available prior to trying to allocate it. That accounts for ~23k crashes a week. Madness.

With the RAM shortages and 8GB looking like it will remain the entry laptop norm, we need to start thinking more carefully about how software is developed.

by bArray 1772787011

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.

Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't overcommitted.

by tredre3 1772661404

There is this app https://github.com/Smerity/bitflipped _Your computer is a cosmic ray detector. Literally._

by jurakovic 1772781144

I’m pretty sure Torvalds tells a story of spending days hunting down a compiler bug, only to find it was memory, and then simply never using anything other than ECC memory again.

10+% is huge

by lifeisstillgood 1772799148

I’ve also found that compiling large packages in GCC or similar tends to surface problems with the system’s RAM. Which probably means most typical software is resilient to a bit-flip; makes you wonder how many typos in actual documents might have been caused by bad R@M.

by soletta 1772772236

Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5
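
A toy calculation makes the concentration effect concrete. Every number below is made up for illustration (none comes from Mozilla's data): a small group of machines with bad memory crashes far more often than everyone else.

```python
def hardware_crash_share(users, faulty_users,
                         software_crashes_each, hardware_crashes_each):
    """Fraction of all crashes that come from hardware faults.

    Every user suffers software crashes; only faulty machines add hardware ones.
    """
    software_total = users * software_crashes_each
    hardware_total = faulty_users * hardware_crashes_each
    return hardware_total / (software_total + hardware_total)


USERS = 1000          # hypothetical population
FAULTY_USERS = 10     # hypothetical machines with bad memory (1%)
SOFTWARE_CRASHES = 4  # hypothetical software crashes per user per year
HARDWARE_CRASHES = 44 # hypothetical extra crashes per faulty machine per year

share = hardware_crash_share(USERS, FAULTY_USERS,
                             SOFTWARE_CRASHES, HARDWARE_CRASHES)
print(f"users with faulty memory:   {FAULTY_USERS / USERS:.1%}")
print(f"crashes caused by hardware: {share:.1%}")
```

With these inputs, 1% of users account for about 9.9% of all crashes, while the other 99% see only software crashes.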

by conartist6 1772671603

When debugging something, I often remember the quote, often misattributed to Einstein: "Insanity is doing the same thing over and over again and expecting different results". Then I remember about bitflips, and run a second, maybe a third time, just expecting the next bit to flip to not be in the routine I'm trying to debug.

by dbolgheroni 1772761392

Travis Long had done something similar in 2022 at Mozilla.

https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...

by spiffy2025 1772759618

Hmm, can someone educate me here? Why don't bit flips ever seem to impact the results of calculations in settings like big-data analytics on AWS?

Is it a difference between server hardware managed by knowledgeable people and random hardware thrown together by home PC builders?

by SeanSullivan86 1772780429

So, why aren't we all using ECC in 2026?

by _0xdd 1772769307

The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.

by kmoser 1772661544

This is a pretty big claim which seems to imply this is much more common than expected, but there's no real information here and the numbers don't even stack up:

> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.

So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?

by CamouflagedKiwi 1772756135

I wonder if the Chrome dev team can corroborate this finding in their crash reporting.

by devy 1772766026

what happens if bitflip occurs while you are detecting bitflip?

bitflippin...

by pulkas 1772782300

>That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job.

CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?

by AndriyKunitsyn 1772753258

Yet the operating system keeps running.

by chazburger 1772755963

Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.

by stnvh 1772665429

So does this mean bool true = 3 or should bool true = 5?

This will bloat the code a bit.

by ptek 1772764148

so could software engineering somehow catch those crashes?

by est 1772766602

How many are caused by cosmic radiation bitflips?

by brador 1772707112

Going to be downvoted, but I call bullshit on this. Bitflips are frequent (and yes, ECC is an improvement but does not solve the problem), but not that frequent. One can either assume users who enabled telemetry are an odd bunch with flaky hardware, or that the implementation isn't actually detecting bitflips (potentially, as the messages indicate) but a plethora of problems. Having a 1/10 probability that a given struct is processed wrong, parsed wrong, or saved wrong would have pretty severe effects in many, many scenarios, from image editing to CAD. Also, bitflips on flaky hardware don't choose protection rings: they would also affect OS routines such as reading from and writing to devices, and everything else that touches memory. Yup, I've seen plenty of systems with faulty RAM (many WinME crashes were actually caused by defective RAM sticks that would run fine with W98); it doesn't choose browsers or applications.

by aforwardslash 1772755591

The title should start with "Up to 10%"

by wosined 1772788011

Is there a way to get the memory tester he mentioned? Is it open source? Once RAM goes bad, is there a way of recovering it or is it toasted forever?

by vsgherzi 1772663182

When I had bad memory, Firefox was the only program which would crash because of it. I think there is also something to say about how Firefox's design could be improved to handle them better.

by charcircuit 1772786421

I was running my PC with bad memory for a few weeks last year. Firefox crashed a LOT, way more than any other application I used during that time, so I've probably contributed a decent amount to these numbers...

by bakugo 1772751478

Does anyone know how they can detect hardware defects like this? This sounds like an incredibly hard problem. And I don’t see how they can do this without impacting performance significantly.

by d--b 1772780552

Curious why this article is divided up into chunks?

by 1over137 1772753315

Guesstimation at its finest.

by phendrenad2 1772755579

And.. how do they not know it's their software being leaky and causing these bitflips?

These are potential bitflips.

I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.

My guess is that the software is riddled with edge-case bugs.

by dana321 1772754904

What brands or types of memory cards are less likely to crash by bitflips?

by darkhorn 1772752410

People I think are overindexing on this being about "Bad hardware".

We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...

They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
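
To get a feel for what a per-Mbit rate means for a whole machine, the figure can be scaled up as in the sketch below. The helper name, the 8 GiB example size, and the choice of 1 Mbit = 2**20 bits are my assumptions. The result is a fleet-wide mean: since the study reports a limited share of DIMMs seeing errors at all, a minority of bad modules presumably dominates it, so it should not be read as what a typical machine experiences.

```python
MBIT_PER_GIB = 8 * 1024  # 1 GiB = 8192 Mbit, taking 1 Mbit as 2**20 bits (assumption)


def expected_errors_per_hour(errors_per_billion_hours_per_mbit, gib):
    """Scale a per-Mbit error rate to a whole memory size, in errors per hour."""
    per_hour_per_mbit = errors_per_billion_hours_per_mbit / 1e9
    return per_hour_per_mbit * gib * MBIT_PER_GIB


if __name__ == "__main__":
    for rate in (25_000, 70_000):  # the range quoted from the study
        mean = expected_errors_per_hour(rate, gib=8)  # 8 GiB is a hypothetical size
        print(f"{rate} per 1e9 device-hours per Mbit -> {mean:.2f} errors/hour")
```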

At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.

A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.

by mrguyorama 1772663301

But muh memory-safe Rust!!! :'(

by wartywhoa23 1772795583

This matches what I have long said, which is that adding ECC memory to consumer devices will not result in any incredible stability improvement. It will barely be a blip really.

As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, a good proportion of which could be weeded out by running a memory tester for a while. So if you do that you're probably looking at a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority software.

How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.

ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent with 10% more memory than ECC.
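
The 12.5% figure is consistent with a single-error-correcting, double-error-detecting (SECDED) Hamming code over a 64-bit word, which needs 8 check bits. The sketch below assumes that scheme (the comment does not name one) and derives the check-bit count from the standard Hamming bound; the function name is mine.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a SECDED Hamming code over data_bits of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    for width in (32, 64):
        check = secded_check_bits(width)
        print(f"{width} data bits: {check} check bits, "
              f"{check / width:.1%} overhead")
```

For 64 data bits this gives 8 check bits, i.e. a 72-bit word and 12.5% extra storage; narrower words pay proportionally more.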

People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.

Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up a new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if they weren't responsible for most crashes by an order of magnitude...

by stinkbeetle 1772758210

Rust would fix this. Oh wait.

by nickhodge 1772790157

Definitely going to hard disagree with Gabriele Svelto's take. I could point to the comments, however, let me bring up my own experiences across personal devices and organizational devices. In particular, note where he says this:

"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."

You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine, have Firefox crash a few hundred thousand times a day, and he'll use my data to support his position. Further, see below:

First, a preface: I use Firefox, even now, despite what I post below. I use it because it is generally reliable (outside of the specific pain points I mention), free, open source, compatible with most sites, and, for now, more privacy oriented than Chrome.

Second: On both corporate and home devices, Firefox has shown to crash more often than Chrome/Chromium/Electron powered stuff. Only Safari on Windows beats it out in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are chromium based browsers such as edge and Chrome so much more reliable?

Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)

Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.

Just my thoughts.

by eek2121 1772753925

Ugh just write a real blog post dude.

by wakawaka28 1772761206

I had a refurbished ThinkPad that had memory corruption. I only noticed because Firefox started to crash an unreasonable amount. Ran memcheck through BIOS and sure enough it was bad RAM.

Have we considered that maybe Firefox is the cause of bad memory?

by titzer 1772805628

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

I find this impossible to believe.

If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.

>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.

Might be the case, but 10% is still huge.

There imo has to be something else going on. Either their userbase/tracking is biased or something else...

by NotGMan 1772664540

470k crashes in a week? Considering how low their market share is, that would suggest every install crashes several times a day... I gotta call bs.

by nubinetwork 1772664797