Software Development

Correcting The Future

Failing Drastically or Quietly Produce Incorrect Results?

The paraphrased words in the title were uttered by Andrew Koenig, who blogs on the Dobbs Code Talk web site. The specific comment is in his own blog article titled "What Dijkstra said was harmful about goto statements". That article is a follow-up to a previous blog entry where he asks readers for their opinions about goto statements.

Yes, it's an older topic from March 2009, but I must have missed it. Here is the specific portion of the comment.

To my way of thinking, a program that fails in a dramatic way, such as an assertion failure or infinite loop, is less harmful than one that quietly produces incorrect results.



I am utterly SHOCKED by this quote. Not because it's out there. But because I agree with it, despite the countless people who have devoted their entire careers to opposing this principle.

We should back up a little because there are many issues at hand here.

Let's go back to the original issue about goto statements. Andrew talks about two things that Dijkstra talked about. First is that gotos make it more difficult to ascertain the program's state at a particular point, especially at the destination label, because you now have to inspect the entire program to make sure you know of everything that could jump to that label. Second is that gotos make it more difficult to ascertain how far the program's execution has gone.

There is some other talk about continue and break statements. I agree with Andrew that these don't add much complexity. Then again, I was never big on being against goto because I still code in assembly, where avoiding gotos is simply not an option.

So gotos are bad because they make debugging and finding problems with your code much more difficult. They create spaghetti code and all that. At the risk of another "X is considered harmful" article, what else makes it that much more difficult to find problems in code?

There are plenty of things we could talk about, but the title of this article says exactly what's on my mind. Is it better for a program to stop abruptly than it is to continue with incorrect data? This is where the quote at the beginning of this article comes into play. It also fits in perfectly with what's wrong with gotos because they both have the same issues.

Java has taken the opposite view, where continuing with corrupt data is better than crashing. In fact, crashing is seen as the worst possible outcome. This was not done because the designers actually believed this (well, maybe they did). It was because newcomers tended to walk away when their programs stopped abruptly. It was a shock and a slap in the face. Their program was wrong. The abruptness of it all was too much to ignore. Many people would rather do something else than face that all the time.

This has created what I'm calling the leaky roof syndrome. A leaky roof almost never leaks where the real problem is because most people have something called tar paper installed between the shingles and the sheathing. Tar paper is a product meant to absorb moisture. Aside from the questionable use of a material that absorbs moisture being placed directly on a wooden surface (inducing rot), what happens is that your shingles will leak at a certain spot and the tar paper will bring the water down to a lower spot where it will eventually leak into your home. If you have trouble getting someone to fix your roof, this is why. It's a world of hurt and you may never find where the real leak is actually located. In other words, no matter who you get to fix your roof, they will never know how long it will take. It can actually be cheaper to just redo your entire roof with new shingles. That's how bad it can get.

If your software keeps running with faulty data, then you have exactly the leaky roof problem. By the time you notice the faulty data, you know that the problem lies earlier. But where? I've heard people say that they never have a problem with this. I've also seen people who just put the equivalent of buckets in their code, hoping to catch the spill. It turns into a big ball of patches.
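To make the leaky roof concrete, here is a minimal sketch in Python. The function names and the sample readings are made up for illustration. The first function absorbs a bad value and keeps going, so the only symptom is a total that looks plausible. The second stops at the bad value and reports its position, which is about as close as Python gets to crashing on the exact spot.

```python
def total_quiet(readings):
    """Treat anything unreadable as 0.0 and keep going (the tar paper version)."""
    total = 0.0
    for raw in readings:
        try:
            total += float(raw)
        except ValueError:
            total += 0.0  # the leak is absorbed here and surfaces somewhere else
    return total


def total_strict(readings):
    """Stop at the first bad reading and say exactly where it is."""
    total = 0.0
    for index, raw in enumerate(readings):
        try:
            total += float(raw)
        except ValueError:
            raise ValueError(f"bad reading at index {index}: {raw!r}") from None
    return total


# Hypothetical input: the third reading is missing.
readings = ["1.5", "2.5", "n/a", "4.0"]

print(total_quiet(readings))  # prints 8.0, which looks plausible and is wrong

try:
    total_strict(readings)
except ValueError as err:
    print(err)  # prints: bad reading at index 2: 'n/a'
```

The try/except around the strict call is only there so the sketch runs to the end. In a real program you would let the error stop the run.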

Java isn't the only language that has this problem. Any language that forces you to use exceptions will give you the same scenario. C++ can have this problem even though many claim the opposite. I don't actually mind exceptions, though I prefer old-style error checking. An old complaint of mine was that you couldn't have multiple return values (without resorting to pointers): one for the actual return value and the other for the error. With pass-by-reference arguments, you can indeed have multiple return values with little effort. In fact, many of the best APIs use some form of this. DirectX uses the return value only for error codes. The actual return values are passed back via pointers or pointers to pointers (instead of by reference). Despite the extensive use of pointers, it's an incredibly good API. I also recently used the NetCDF API for reading in scientific data, and it uses old-style error checking as well, even though it has C++ wrappers. It too works extremely well. The documentation is actually very good, with plenty of examples.
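As a rough illustration of that old-style pattern, here is a sketch in Python. It is not DirectX or NetCDF code. The Status enum, parse_percent and describe are hypothetical names, and since Python has no out-parameters, the status and the value come back together as a pair instead of the value going through a pointer. The idea is the same: the caller has to look at the status before touching the value.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical error codes; 0 means success, as in many C APIs."""
    OK = 0
    EMPTY = 1
    NOT_A_NUMBER = 2
    OUT_OF_RANGE = 3


def parse_percent(text):
    """Return (status, value). The value is meaningful only when status is OK."""
    stripped = text.strip()
    if not stripped:
        return Status.EMPTY, None
    if not all(ch in "0123456789" for ch in stripped):
        return Status.NOT_A_NUMBER, None
    value = int(stripped)
    if value > 100:
        return Status.OUT_OF_RANGE, None
    return Status.OK, value


def describe(text):
    """Check the status at the call site, before the value is used."""
    status, value = parse_percent(text)
    if status != Status.OK:
        return f"rejected {text!r}: {status.name}"
    return f"accepted {value}%"


print(describe("42"))   # prints: accepted 42%
print(describe("abc"))  # prints: rejected 'abc': NOT_A_NUMBER
print(describe("250"))  # prints: rejected '250': OUT_OF_RANGE
```

The check sits right next to the call, so a failure is dealt with where it happens instead of traveling somewhere else.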

Note that this isn't an argument against exceptions. I like using them in many languages, such as Perl or VB, when doing database handling for example. Localized use of exceptions is fine, even beneficial. What I'm saying is bad is anything that will send control to who knows where, effectively reproducing a goto. This is what happens when exceptions must be used everywhere, because people tend not to handle all exceptions.
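Here is a sketch of what localized use might look like for database handling, using Python's standard sqlite3 module. The users table and the add_user helper are invented for the example. Only the one failure we expect is caught, right next to the statement that can raise it. Anything else, such as a missing table, is left alone and stops the program at the point of failure.

```python
import sqlite3


def add_user(conn, name):
    """Insert one user; return False if the name is already taken."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        # Expected, recoverable case: the primary key already exists.
        return False
    # Other errors (sqlite3.OperationalError for a missing table, say)
    # are deliberately not caught here.


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")

first = add_user(conn, "ada")
second = add_user(conn, "ada")
print(first, second)  # prints: True False
```

The handler and the code that can fail are a few lines apart, so control never ends up somewhere unexpected.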

So is getting a core dump better than blank exceptions that keep your program running with corrupt data? Hell yeah! When I get a core dump, it's usually at the exact spot where the error was produced. I very rarely have to go looking around.

Andrew kept saying in his article that there are FACTS about gotos that make them bad. That these facts are not up for negotiation. Only their severity. Well, what *I* am saying is that it doesn't matter what it's called. If it has the same FACTS that make it bad, then you're going to have the same problems. No amount of opinion is going to convince me that one thing is gonna be different than another when they both share the same problems.

And one thing I can say is that core dumps do NOT have the same problems as gotos. There is no leaky roof syndrome there. What's more, multiple things that don't inherently have a leaky roof syndrome can produce this syndrome when combined in certain circumstances. That's when you start to see patches applied and your program slowly turning into a big ball of mud. So it's not just specific things you have to be concerned about. It's how the whole thing interacts. And this is one thing I've been talking about for ages with respect to programming with the execution point. Functional programming is not immune either. Neither is dataflow, for that matter, if one uses it the way monads are implemented, where only one data item may use the network grid at a time. That forces components to execute sequentially, reproducing what is effectively imperative programming.

The leaky roof syndrome can happen anywhere. When you know what causes it (or are simply aware of it), you can be better prepared to avoid it.


Comments

Unregistered user Sunday, January 10, 2010 10:45:47 AM

Dan writes: Totally agree. Corrupt data is never never never the correct choice. Doesn't matter what you're doing. If I want to do X, having the program do X wrong (but not crash) is NOT what I want or expect. Having the program potentially destroy other data (by corrupting memory or filesystems or whatever) is definitely NOT what I want. Having the program fail dramatically, whilst annoying, at least makes it damned obvious that something went horribly wrong. I'd much rather have a program die horribly (but leave my data intact) than damage my data somehow, or continue running but give me wrong results. Imagine I was working in a hospital and a program gave me incorrect results. This could lead to someone's death. If it failed, yes, that would be a bad thing, but wrong results are potentially much more harmful. So yes, I completely agree that if you cannot gracefully recover, it's better to bail out than to corrupt data or produce incorrect results.

Sean Conner (spc476) Tuesday, January 12, 2010 5:17:24 AM

But what do you do in an embedded system? Like my friend's pool chlorinator? (You add regular salt to the pool, and as the water is pumped through the system, the chlorinator converts the sodium chloride to chlorine to keep the pool clean) While there's a small LCD screen and a keypad, he doesn't check it daily; he just expects it to work (and yes, it's computer controlled, and my friend wrote the software for it).

Conversely, take a long-lived Unix daemon. Same friend is using one I wrote ( http://www.x-grey.com/ ) and when it fails, it just stops. No core dump, nothing. It took me a bit of time to figure out exactly what was going on (thankfully, the logging he did get pointed me in the right direction), but the fact that it affected his email was troubling (the daemon implements a type of spam filter and when it wasn't running, the default action was to accept all emails). It's hard to say what the proper thing to do is in that case.

Vorlath Tuesday, January 12, 2010 7:45:55 PM

Too much chlorine is not good, and no chlorine lets microorganisms and bacteria spread. But if your program keeps running, one might think everything is fine until it's too late, no? If the software fails drastically, then you have to fix it. Perhaps I'm missing something, but I don't see how keeping it running is the best option.

With email, the same thing applies. Imagine an email daemon that keeps running, but ends up discarding valid messages. You would never know it until later when you start wondering why you're not getting email anymore.

But yeah, it depends on the error. The severity is always up for debate. But I'm hard pressed to find a situation where running with corrupt data is better than failing. Maybe in situations where a little bit is better than nothing at all. An email spam daemon that filters some messages and fails to work properly with a lot of others may be better than nothing. But then you have to wonder if it won't corrupt valid emails or filter out valid emails.

No, I'd still rather have the software terminate, even without a core dump, than keep running. I understand if you see this differently, but the reasoning escapes me. It always has. From my experience, the view of keeping the program running is the common one, much to my dismay.

Sean Conner (spc476) Tuesday, January 12, 2010 10:41:39 PM

I wrote about errors last month where I made (in my opinion) a rather interesting observation about the types of errors. But in the end, I still have no idea of how to best handle errors.

Sean Conner (spc476) Tuesday, January 12, 2010 10:43:42 PM

And just to make it clear, I like the "fail fast" method but the issue I have is when I don't notice the error for a month or so ...

Unregistered user Wednesday, January 13, 2010 11:59:23 AM

Dan writes: I think in cases where software can either terminate or continue incorrectly, it should terminate. If it terminates in the most drastic and obvious way possible, then it becomes immediately obvious that something went wrong. I think this is what should happen. For example, if I run a cli program and get "Segmentation Fault", I immediately know something went horribly, horribly wrong. Of course, the more details about _what_ went wrong that it can give me, the better.

Vorlath Thursday, January 14, 2010 3:05:54 AM

Sean: the example you give is a good one about error checking. I have code that looks a LOT like that (the one where you close your sockets and clean up). And you know what? I like that kind of code. When I see that, I know that the developer took the time to take control over what happens when things don't go quite right. Even if it's just reporting back that something went wrong.

And yeah, a really bad scenario is when a program terminates and no one knows and no info is given. A segmentation fault without a core dump is frustrating.
