09 October, 2020

EurekaLog causes AccessViolation?

We were contacted by a person who complained that his application was working fine until he added EurekaLog to it. An AccessViolation exception was raised after adding EurekaLog to project. The message was: "Access violation at address 00E15025 in module 'Project.exe'. Read of address 83EC8B69". And the stack was:
  • Contoso.pas TContosoEventMessage.BasePerform
  • Vcl.Forms.pas TApplication.WndProc
  • System.Classes.pas StdWndProc
  • Vcl.Forms.pas TApplication.HandleMessage
  • Vcl.Forms.pas TApplication.Run
  • Project.dpr Initialization
(all names have been replaced; Contoso represents some well-known library for Delphi).

The person did not want to provide a reproducible example, but, fortunately, we had access to the source code of the library.

The address in the error message (83EC8B69 - e.g. "random") and the short stack may indicate a problem like "execution flow followed a trash pointer and crashed in a random place". Unfortunately, the report did not include the CPU and assembler sections (were disabled in the settings). In this case, diagnostics based on the source code alone would be impossible. Frankly, at first I wanted to answer this way, giving a recommendation to strengthen control over memory.

But just such a stack is also possible if the BasePerform is called by the main thread as a procedure queued for Synchronize. Indeed, a search through the source code revealed the following code:
function TContosoEventMessage.Perform(AThread: TContosoThread): Boolean;
begin
  FMsgThread := AThread as TContosoEventThread;
  if FMsgThread.Active then begin
    if FMsgThread.Options.Synchronize then
      TContosoThread.Queue(nil, BasePerform)
    else
      BasePerform;
  end;
  Result := True;
end;
E.g. the message class has a function to handle/process this message. And if a Synchronize flag is specified in the options of something, then the message processing (BasePerform) will be performed not in the current thread, but will be sent for execution to the main thread - through TThread.Queue.

And yes, the report also shows this background thread from Contoso that could have executed such code.

The method BasePerform itself is just a wrapper around InternalHandle:
procedure TContosoEventMessage.BasePerform;
begin
  FMsgThread.InternalHandle(Self);
end;
And the EurekaLog report shows that the crash occurred on the very first line.

What do we have?
  1. The background thread enqueued the method and filling in the FMsgThread field;
  2. The main thread tried to execute this method and crashed when trying to read the FMsgThread field.
The second line is an assumption - since, as I said, there was no assembler in the report. But the hypothesis is good, because if otherwise - the crash would be inside InternalHandle.

E.g. it seems like there is a problem with memory. And by the way, this is the most common "problem" of projects that include EurekaLog. Indeed, consider this code:
procedure TForm1.Button1Click(Sender: TObject);
var
  List: TList;
begin
  List := TList.Create;
  List.Free;

  List.Clear; // - accesses already deleted object
end;
Is this a correct code? Of course not. But will it "work"? Yes, it will work correctly with a high probability. This happens for the simple reason that the "freed" memory (and, therefore, the "freed" object) is not actually deleted, but simply marked as free. E.g. their memory remains available unchanged. Of course, if there are many operations between releasing memory/object and re-accessing it, then there is some chance that these operation will change the former memory. But you won't even notice the problem on short runs.

Adding EurekaLog to the program changes the situation radically, since EurekaLog enables memory checks by default and actively fights these bugs.

Well, let's see if we can have such a situation now. Since FMsgThread comes from a method's argument, let's see who is calling the method BasePerform. A search through the source code suggested the following code (actually, the search highlighted several places, but a specific place was chosen based on the state of the background thread from the report):
procedure TContosoThread.Execute;
// ...
begin
// ...
              for i := 0 to oReceivedMsgs.Count - 1 do 
              begin
                oMsg := TContosoThreadMsgBase(oReceivedMsgs[i]);
                try
                  if not oMsg.Perform(Self) then
                    Break;
                finally
                  oMsg.Destroy;
                end;
// ...
end;
Oops, here it is. Scenario of what is happening:
  1. The background thread iterates over incoming messages;
  2. Each message is processed (Perform);
  3. Some message indicates that it needs to be synchronized to the main thread;
  4. The background thread queues the message (Queue);
  5. The background thread deletes the message (Destroy), but the message is still in the queue;
  6. The main thread starts processing scheduled message (BasePerform);
  7. Processing the message causes Access Violation because the message has already been deleted.
E.g. the bug is either in memory management or in thread synchronization. How did this error occur? Obviously, the base classes were originally written with assumed linear execution: "take message - process - release". Then a special subclass was written that redefined the "process" behavior to "enqueue and exit". And the base class (with a loop) is not aware of this. Which led to the bug.

In fact, the subclass clearly violated an implicit contract: all function's arguments are valid only for the duration of the function call. E.g. if you store the arguments to a function somewhere where they will be available after the function completes - then you must ensure that the arguments are available. The subclass did not.

I suspect that fixing the bug will be replacing Queue with Synchronize.

Here is how an investigation turned out - solely on the source code of an unfamiliar library and an incomplete crash report, without having a working example.