31 March, 2009

EurekaLog's anti-freeze feature

As you know, EurekaLog is a great tool for catching exceptions in your application. However, there are cases that you would consider "bugs" even though there is no exception to catch!

For example, suppose your application deadlocks. What about an endless loop that loads a CPU core up to 100%? Or a UI that is not updated for 10 minutes? Are those issues bugs or not?

Yes, all these cases are bugs in your application (*). But no exception is raised in such cases - your application simply stops working. Those cases can be very nasty and hard to diagnose, because your application doesn't show any error message and doesn't generate any log files.

So, how can you catch them?

Okay, if you look in EurekaLog's project options, you can see something called "Anti-Freeze Options" on the "Advanced Options" tab:

“Advanced Options” tab in EurekaLog's options.

You can read the EurekaLog documentation to find out that, if you enable this option, EurekaLog will detect when your application becomes frozen. What does that mean exactly, and how does it work? Some of you may say: "Whoa, that one is cool! Let's use it!", enable the option, and then face strange behavior. You should understand what you are doing before enabling it.

There is no clear definition of what a "hung application" is. Usually a hung application is either deadlocked or busy with some heavy processing or computation. However, from the user's point of view, your application has stopped working - therefore it is a bug.

To detect such cases you can periodically check if your application is still responding.

Okay, back to EurekaLog. EurekaLog uses a common technique based on the WM_NULL message. WM_NULL is a special message that performs no operation; an application should process it by simply ignoring it. It is useful in many situations, and detecting a hung application is one of them.

What does it mean for an application to be hung? A deadlock, an endless loop and a UI that is not updated have one thing in common: the application does not process window messages. In all those cases the application is busy with some work and doesn't call PeekMessage/GetMessage. Therefore, you can send a WM_NULL message to the application and see whether it responds within a specified time. If you don't get a reply within, say, 30 seconds, then the application is probably hung. Notice the word "probably". The application may be doing some actual work (accessing the network, performing complex computations, etc.) and will come back to life later. Yes, technically this is not an error. But the end-user doesn't care about this stuff. Their application has stopped responding - therefore it is a bug.
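As a rough illustration of the WM_NULL technique described above (this is not EurekaLog's actual code), a responsiveness check can be sketched with the Win32 SendMessageTimeout function. The names HangCheck, PingWindow and IsWindowResponding are made up for this example, and the API is imported manually here because the declaration of its last parameter differs between Delphi versions:

```delphi
unit HangCheck;

interface

uses
  Windows, Messages;

// Returns True if the thread that owns Wnd processed WM_NULL within TimeoutMs.
// Must be called from a thread other than the one that owns Wnd.
function IsWindowResponding(Wnd: HWND; TimeoutMs: UINT): Boolean;

implementation

// Manual import of SendMessageTimeoutW under a hypothetical local name.
// The last parameter (result pointer) is optional, so it is declared as a
// plain pointer and nil is passed.
function PingWindow(Wnd: HWND; Msg: UINT; WP: WPARAM; LP: LPARAM;
  Flags, Timeout: UINT; ResultPtr: Pointer): LRESULT; stdcall;
  external 'user32.dll' name 'SendMessageTimeoutW';

function IsWindowResponding(Wnd: HWND; TimeoutMs: UINT): Boolean;
begin
  // SendMessageTimeout returns 0 on timeout or on failure.
  Result := PingWindow(Wnd, WM_NULL, 0, 0, SMTO_ABORTIFHUNG,
    TimeoutMs, nil) <> 0;
end;

end.
```

A False result only means "probably hung": a slow but healthy application fails this check too, which is why the timeout should be generous.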

EurekaLog launches a helper thread (TFreezeThread) at application start-up. This thread constantly polls your main thread with WM_NULL messages. If the main thread is unable to answer within the specified amount of time, the helper thread considers the application "hung" and raises an EFrozenApplication exception. Since this is an exception, it can be handled by EurekaLog's standard hooking code. EurekaLog will generate a full bug report and send it to you. By default the application is not terminated after the exception is processed. Therefore the exception will be raised a few more times, and then the "Restart application" check-box will appear in the exception dialog.


If you don't like this standard behavior, you can easily change it. For example, I personally prefer adding a new exception filter (there is an "Exceptions Filters" tab in EurekaLog's project options) for the EFrozenApplication class and changing its action to "Restart" or "Terminate".

It is important to note that the exception itself is raised in the main thread (and not in the helper thread), so the call stack for the exception contains the exact place where the main thread was hung. By analyzing this call stack, you can find the source of the problem. Raising an exception in another thread is certainly a dirty trick: it involves direct manipulation of the thread's context, which in rare cases can lead to memory leaks (**), so it cannot be used in regular situations. But it is okay in an emergency like this one - it is better to have a bug report on a hung application with a (barely possible) memory leak than to have no bug report at all.

BTW, there is run-time control too! So, if you want to do some heavy lifting in your Button3Click, you can disable anti-freeze checking before it and restore it afterwards. To do so, use the CurrentEurekaLogOptions.FreezeActivate property. Setting it to False stops the monitoring helper thread, and setting it to True creates a new monitoring thread:
procedure TForm1.Button3Click(Sender: TObject);
begin
  CurrentEurekaLogOptions.FreezeActivate := False;
  try
    // <- here goes your actual code
  finally
    CurrentEurekaLogOptions.FreezeActivate := True;
  end;
end;
Okay, what does all this mean for you if you enable this option?

First: your application will periodically receive WM_NULL messages (WM_NULL is 0). That is how EurekaLog checks whether you are alive or not. This should not bother you, since you are not supposed to do any processing of the WM_NULL message.

Second: you should design your application very carefully. If you write some "bad" code in your Button1Click (and by "bad" I mean code that performs a long-running operation), your application may encounter a "bug". If you have code that can potentially take a long time to complete (connecting to a DB, for example), now is a good time to offload such code to a helper thread (and do not forget to abort or kill it after some period of time if it doesn't complete its job).

Third: do not make the check timeout too short. Otherwise you'll be flooded with false bug reports, only because Button2Click takes a whopping two seconds to complete on the user's machine instead of half a second on your developer machine. A minute or two is usually a good estimate.

Fourth: the anti-freeze feature starts working only after your main form is created. The start-up of a more or less complex application usually takes some time, during which the application does not respond to messages. This is normal behavior, and therefore EurekaLog waits for the application to start up.

Fifth: nothing is free, and there is a certain overhead for using this feature. Specifically: a) there is one more thread in your application and b) it is constantly pumping WM_NULL messages into your message queue. "Constantly" here means a few times per second (of course EurekaLog does not want to eat 100% of your CPU just to check your activity :) ). From the CPU's point of view, the helper thread is asleep almost all the time. Therefore the overhead is very low and I don't know why anyone should care about it - but I mention it just in case.

Sixth: it is important not to confuse EurekaLog's anti-freeze feature with TIdAntiFreeze or similar Delphi components - they are entirely different things that merely share a similar name.
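To illustrate the second point above, a long operation can be moved out of a button handler into a worker thread, so that the main thread keeps processing messages (including WM_NULL). This is only a sketch under stated assumptions: TLoadThread is a hypothetical name, Sleep stands in for the real slow call, and the timeout/abort handling mentioned above is left out.

```delphi
// Requires Classes (TThread) and Windows (Sleep) in the uses clause.
type
  // Hypothetical worker that runs the slow operation off the main thread.
  TLoadThread = class(TThread)
  protected
    procedure Execute; override;
  public
    constructor Create;
  end;

constructor TLoadThread.Create;
begin
  inherited Create(False);  // not suspended: starts once construction ends
  FreeOnTerminate := True;  // the thread object frees itself when done
end;

procedure TLoadThread.Execute;
begin
  // Stand-in for a slow operation such as connecting to a DB.
  // Do not touch VCL controls directly from here.
  Sleep(5000);
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  // Returns immediately; the main thread's message loop stays responsive.
  TLoadThread.Create;
end;
```

Results would normally be passed back to the UI via Synchronize or a posted message rather than by accessing the form from Execute.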

(*) Well, the last case (not updating UI) is not a bug for you, the developer - since after 10 minutes your application will suddenly come back to life. But this is certainly an error from the end-user's viewpoint.

(**) Consider what happens with this (typical) code:
SomeObj := TSomeClass.Create();
try
  // ...
finally
  FreeAndNil(SomeObj);
end;
What if EFrozenApplication is raised right after your code leaves the constructor, but before the object pointer is assigned to the SomeObj variable? The try-finally block will be skipped, since the exception is raised before try. And even if FreeAndNil were executed, it would have nothing to free, because SomeObj has not been assigned yet. The object reference is lost - therefore it is a memory leak. As I said, this is a very rare scenario. You can ignore it in a case like a hung application: a hung application is already a bug, so you are not making the situation worse. But you surely shouldn't use such a dangerous technique in ordinary code.