Saturday, 26 September 2009

Sometimes there just isn't enough

I'm starting to get confused. (OK, starting is wrong). What is the problem with more error/debug than less.

I mean, sure, your average user is not going to have the foggiest idea what your error information is, and so a nice helpful message is good. But what about in a system,(which has many many many tests) where the average time to run in production takes 1000 times longer than any of your tests. And you get for Joe User

HI,

I received this error message:

An error occured in processing your request. Please let us know at line 360 in something.pl

Could you do something about this. I had been running this process for 10 hours.

Cheers

Joe

This is great, except, I know absolutely nothing further. Joe, understandably, reported the error he got (all of it!) and so thinks he has been helpful, but when you look at line 360, it's a croak of an eval block, where all your code is in multiple modules running inside it. AAAHHHH!

From looking around (and I may only have looked at a relatively small subset) people seem to think that letting the user see all of the error stack is a bad thing. Why? Speaking with some users, they don't understand the output, and would like an additional friendly message, but want to be helpful. They just don't know what to say.

My response to get info from Joe above would be:

On which node did you run this?
What parameters did you set/include?
At what time did you start and stop this process?
Did you have an error log?


To which Joe replies:

Parameters: a,b,c
I submitted it to the farm from work_cluster2
My bsub command: ....
Error log: dev/null (because, that is what he was instructed)

Great, no further output. I can't reproduce, I can't write a test, I can't track down the problem without running the whole thing myself.

So what is the potential solution:

1) Don't be afraid to show your user what looks like the internal workings of the script. Give them the whole stack output. With a friendly message too, of course.
2) Training to ensure that they write an error log (dev/null is not a log, but users don't know that often).
3) Training to ensure that they email you/bug report through rt the whole stack. If they know it will help you solve their problem, they ought to be happy. (Certainly users I spoke to said they would be happy to provide 200 lines of 'garbage output' if it meant that I could solve a problem even 25% faster.

I don't know who suggested it, or who started the culture. It might not be great for commercially sold apps, but certainly where you have internal or free (in it's many forms) software, then surely you shouldn't be worried about 'what the user might find out', because, I would be 95% certain, they don't actually care!

So big it up for confess and cluck, or even occasional print 'debug info trace points' as they ought to be the way of the future, in dealing with those bugs in a timely fashion.

3 comments:

2shortplanks said...

One technique that I've found particularly helpful is to write copious infomation to a logfile somewhere I can get at it (or get the user to get it for me.) This is obviously somewhat easier with hosted webapps.

The key thing is to prepend to the start of each logline a unique "tag" for the request/run/user interaction. When we send a user visible error (or API visible error at least) then we output this tag in the error response so we know what to extract from all the logfiles when the user asks why they saw this error.

One of the advantages of this is that I can write to the logfile *in* *case* something goes wrong, before I know the error condition is actually going to occur.

Andy Brown - SetitesUK said...

That is a great idea, however, a small problem I have sometimes come up against is where to write those blogs too.

Whilst not always the case, and some people blatantly break the principle, we are frowned upon writing too much out to some log files in web apps (our servers are handling so much, it can be tricky to find something in the logs even with a good unique search string, 10 mintues post the event), and it can be tricky to determine where log files can be written to with permission settings.

I try to output as much as I can to log files, but I know that just at that one time it breaks, I won't have the info I could have had if I routinely had used something to get and output the full stack (and that is the bit which seems to be frowned upon).

But as I said before, A great idea. Something that clearly needs some looking into.

Michael said...

I don't think users should be exposed to full stack traces or system errors at all. But I as a developer do enjoy good full stack traces for tracking down problems. A good solution we have at $work is to have any internal errors generate an email to our development team with a full stack trace and a dump of the whole environment. But sometimes even this is not enough and we need to try to contact the users who had a problem to gather more information so we can reproduce the error.