UI freeze on GHC 9.2 on some operations (notmuch related) #468

frasertweedale · 2022-07-22T00:12:00Z

Describe the bug

Some operations lock the UI. For example, using the database generated from UAT
test data, perform the actions from testUserCanMoveBetweenThreads. That is:

<Enter> to show the first thread/mail
J to display the next thread
K to display the previous thread

If these actions are performed SLOWLY (say a 1 second interval) it can be done over and over and everything works.
If these actions are performed QUICKLY, the UI instantly locks.

While the UI is locked, purebred is observed to have forked:

% pgrep -f -l purebred
75692 purebred --database /tmp/mail/Maildir
75600 purebred --database /tmp/mail/Maildir

kill -9 <child-pid> unblocks the UI and reveals an error message:

A Xapian exception occurred opening database:
  Unable to get write lock on /tmp/mail/Maildir/.notmuch/xapian:
    Got EOF reading from child process

Analysis

Reading the Xapian source code shows that the "FlintLock" facility is used to get an exclusive (write) lock on the database. The implementation forks, and the child uses fcntl(lockfd, F_SETLK, fl) to acquire the lock. This is where it gets complicated. My guess as to what is happening:

The file is already locked due to a previous database open that read the thread's messages. That "session" is done, but the DB is not yet closed (and the lock not released) because closing is performed by the finalizer upon GC of the database handle in the parent process.
The new child therefore blocks as it waits for the lock.
GC in the parent process does not get triggered because the parent is still in an (unsafe) foreign call, waiting for the child to exit. This is confirmed by the GHC documentation, which states:

...since version 8.4 ... GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.

This error did not occur before GHC 9.2, so it is probably a GC change that triggers the bug. The bug was always present and this seems to be a "how did this ever work" scenario.
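
The difference between unsafe and safe foreign calls can be observed in isolation. The following is a minimal sketch, not code from hs-notmuch: it imports POSIX sleep(3) twice as a stand-in for a blocking C call such as the database open, and counts how often a background Haskell thread gets to run during each call. The names sleepUnsafe, sleepSafe and ticksDuring are made up for illustration, and the program is assumed to be built with -threaded (without it, a safe call also stalls other Haskell threads).

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Monad (forever)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)
import Foreign.C.Types (CUInt (..))

-- POSIX sleep(3), imported twice. It stands in for any blocking C call;
-- it is NOT the notmuch binding.
foreign import ccall unsafe "unistd.h sleep"
  sleepUnsafe :: CUInt -> IO CUInt

foreign import ccall safe "unistd.h sleep"
  sleepSafe :: CUInt -> IO CUInt

-- Run a blocking action and count how often a background Haskell thread
-- managed to run while the action was in progress.
ticksDuring :: IO a -> IO Int
ticksDuring blockingCall = do
  counter <- newIORef (0 :: Int)
  ticker <- forkIO . forever $ do
    atomicModifyIORef' counter (\n -> (n + 1, ()))
    threadDelay 10000 -- 10 ms
  _ <- blockingCall
  n <- readIORef counter
  killThread ticker
  pure n

main :: IO ()
main = do
  unsafeTicks <- ticksDuring (sleepUnsafe 1)
  safeTicks <- ticksDuring (sleepSafe 1)
  putStrLn ("ticks during unsafe call: " ++ show unsafeTicks)
  putStrLn ("ticks during safe call:   " ++ show safeTicks)
```

With a single capability, one would expect few or no ticks during the unsafe call and many during the safe one, because only the safe call releases the capability so that other Haskell threads, GC and finalizers can make progress. Exact counts depend on the RTS and scheduler.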

Proposed solution

First, change the notmuch_database_open call in hs-notmuch to be safe rather than unsafe. This may allow the parent process to GC the previous database handle, releasing the lock and unblocking the child process.
If that doesn't work, we have to move to a "client/server" DB access paradigm, where all DB access goes through a single thread using a single database handle. This idea has come up before as a way to avoid concurrency issues with notmuch/xapian, including the long-running issue "SIGABRT when opening mail" (#284). But it is a huge change, so we have not embarked on it yet.
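
For the second option, the rough shape of a "client/server" arrangement might look like the sketch below. It is only an illustration under stated assumptions: DbHandle, openDb and countQuery are hypothetical stand-ins (an IORef counter, not a notmuch database), and DbServer, startDbServer and withDb are invented names, not part of purebred or hs-notmuch. The point is that one worker thread opens the handle once and runs every request against it in order.

```haskell
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)
import Control.Monad (forever)
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- Hypothetical stand-in for a database handle; a real one would wrap
-- the notmuch database pointer.
data DbHandle = DbHandle { dbPath :: FilePath, dbQueries :: IORef Int }

openDb :: FilePath -> IO DbHandle
openDb path = DbHandle path <$> newIORef 0

-- Hypothetical "query": bumps and returns the number of queries served.
countQuery :: DbHandle -> IO Int
countQuery h = atomicModifyIORef' (dbQueries h) (\n -> (n + 1, n + 1))

-- The server is a queue of jobs to run against the single handle.
newtype DbServer = DbServer (Chan (DbHandle -> IO ()))

-- Start the worker thread. It opens the handle once and keeps it for
-- its whole lifetime, so no second open can contend for the lock.
startDbServer :: IO DbHandle -> IO DbServer
startDbServer open = do
  requests <- newChan
  _ <- forkIO $ do
    h <- open
    forever $ do
      job <- readChan requests
      job h
  pure (DbServer requests)

rethrow :: Either SomeException a -> IO a
rethrow = either throwIO pure

-- Client side: enqueue an action, wait for its result, and re-raise
-- any exception in the caller rather than killing the worker.
withDb :: DbServer -> (DbHandle -> IO a) -> IO a
withDb (DbServer requests) action = do
  reply <- newEmptyMVar
  writeChan requests (\h -> try (action h) >>= putMVar reply)
  takeMVar reply >>= rethrow

main :: IO ()
main = do
  server <- startDbServer (openDb "/tmp/mail/Maildir")
  a <- withDb server countQuery
  b <- withDb server countQuery
  print (a, b) -- prints (1,2)
```

Because every request is serialised through the one worker, there is never a second open of the database in the same process. The cost is that each caller blocks until its request has been served.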

frasertweedale closed this as completed in purebred-mua/hs-notmuch@e033d74 Jul 22, 2022
