Some operations lock the UI. For example, using the database generated from UAT
test data, perform the actions from testUserCanMoveBetweenThreads. That is:
<Enter> to show the first thread/mail
J to display the next thread
K to display the previous thread
If these actions are performed SLOWLY (say a 1 second interval) it can be done over and over and everything works.
If these actions are performed QUICKLY, the UI instantly locks.
While the UI is locked, the purebred process can be seen to have forked:
kill -9 <child-pid> unblocks the UI and reveals an error message:
A Xapian exception occurred opening database:
Unable to get write lock on /tmp/mail/Maildir/.notmuch/xapian:
Got EOF reading from child process
Analysis
Reading the Xapian source code shows that the "FlintLock" facility is used to get an exclusive (write) lock on the database. The implementation forks, and the child uses fcntl(lockfd, F_SETLK, fl) to acquire the lock. Here is where it gets complicated. My guess as to what is happening:
The file is already locked by a previous database open that read a thread's messages. That "session" is finished, but the DB is not yet closed (and the lock not released), because closing is performed by the finalizer when the database handle is garbage collected in the parent process.
The new child therefore blocks as it waits for the lock.
GC in the parent process is never triggered, because the parent is still in an (unsafe) foreign call waiting for the child to exit. This is confirmed by the GHC documentation, which states:
...since version 8.4 ... GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.
This error did not occur before GHC 9.2, so it is probably a GC change that triggers the bug. The bug was always present and this seems to be a "how did this ever work" scenario.
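The first step of the analysis, that a write lock taken through one handle makes a later locking attempt in a forked child fail or wait, can be reproduced outside purebred. The following is only a minimal sketch under stated assumptions: it uses the `unix` package (`setLock` issues `F_SETLK`, `getLock` issues `F_GETLK`), and the function name `lockConflictSeenByChild` and the scratch file path are hypothetical stand-ins. It is not Xapian's FlintLock code.

```haskell
module Main where

import System.Exit (ExitCode (..))
import System.IO (IOMode (ReadWriteMode), SeekMode (AbsoluteSeek), openFile)
import System.Posix.IO (FileLock, LockRequest (WriteLock), getLock, handleToFd, setLock)
import System.Posix.Process (ProcessStatus (..), exitImmediately, forkProcess, getProcessStatus)

-- A write lock over the whole file: offset 0, length 0 means "to end of file".
wholeFile :: FileLock
wholeFile = (WriteLock, AbsoluteSeek, 0, 0)

-- Hypothetical helper: the parent takes the lock, then a forked child asks
-- the kernel whether its own write lock would conflict with an existing one.
lockConflictSeenByChild :: FilePath -> IO Bool
lockConflictSeenByChild path = do
  fd <- openFile path ReadWriteMode >>= handleToFd
  setLock fd wholeFile -- parent acquires the lock (F_SETLK)
  child <- forkProcess $ do
    -- fcntl locks belong to a process, so the child does not inherit
    -- the parent's lock and sees it as held by someone else.
    conflict <- getLock fd wholeFile -- F_GETLK
    exitImmediately (maybe ExitSuccess (const (ExitFailure 1)) conflict)
  status <- getProcessStatus True False child
  pure (status == Just (Exited (ExitFailure 1)))

main :: IO ()
main = do
  -- The path is an assumption for illustration, not the real Xapian lock file.
  blocked <- lockConflictSeenByChild "/tmp/flint-lock-demo"
  putStrLn ("child sees a conflicting lock: " ++ show blocked)
```

As long as the parent keeps its lock, every attempt by the child conflicts. In the scenario described above, the parent cannot release the lock because the release depends on a finalizer that never gets to run.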
Proposed solution
First, change the notmuch_database_open call in hs-notmuch to be safe rather than unsafe. This may allow the parent process to GC the previous database handle, releasing the lock and unblocking the child process.
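At the FFI level, the difference between the two kinds of call is a single keyword on the import. The sketch below uses libc's `getpid` purely as a stand-in, because importing `notmuch_database_open` would require linking against libnotmuch. The names `getpidUnsafe` and `getpidSafe` are hypothetical, and the actual binding in hs-notmuch may be declared or generated differently.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import System.Posix.Types (CPid (..))

-- unsafe: cheaper to call, but no garbage collection can occur while the
-- foreign call is in progress (see the GHC documentation quoted above).
foreign import ccall unsafe "unistd.h getpid" getpidUnsafe :: IO CPid

-- safe: more overhead per call, but the runtime is left in a state where
-- GC may occur while the C function runs.
foreign import ccall safe "unistd.h getpid" getpidSafe :: IO CPid

main :: IO ()
main = do
  a <- getpidUnsafe
  b <- getpidSafe
  -- Both imports call the same C function; only the calling convention
  -- towards the GHC runtime differs.
  putStrLn ("same result from both imports: " ++ show (a == b))
```

Whether a safe call is enough to let the finalizer run in time is exactly the open question of this proposal. The sketch only shows what the change looks like.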
If that doesn't work, we will have to move to a "client/server" DB access paradigm, where all DB access goes through a single thread using a single database handle. This idea has come up before as a way to avoid concurrency issues with notmuch/xapian, including the long-running issue SIGABRT when opening mail #284. But it is a huge change, so we have not embarked on it yet.
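A rough sketch of what such a "client/server" arrangement could look like, using only `base`. Everything here is hypothetical: `Db` is a stand-in for a notmuch database handle, and `startDbServer` and `withDb` are illustrative names, not existing purebred or hs-notmuch functions.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)
import Control.Monad (forever)

-- Hypothetical stand-in for a notmuch database handle.
newtype Db = Db FilePath

-- A request is an action on the single handle that delivers its own reply.
newtype Request = Request (Db -> IO ())

runCaught :: IO a -> IO (Either SomeException a)
runCaught = try

-- Start the one thread that owns the handle and serves requests in order.
startDbServer :: FilePath -> IO (Chan Request)
startDbServer path = do
  requests <- newChan
  _ <- forkIO $ do
    let db = Db path -- a real server would open the database once, here
    forever $ do
      Request run <- readChan requests
      run db
  pure requests

-- Client side: send an action to the server thread and wait for its result.
-- Exceptions raised by the action are passed back and rethrown in the caller,
-- so a failing request does not kill the server thread.
withDb :: Chan Request -> (Db -> IO a) -> IO a
withDb requests action = do
  reply <- newEmptyMVar
  writeChan requests (Request (\db -> runCaught (action db) >>= putMVar reply))
  takeMVar reply >>= either throwIO pure

main :: IO ()
main = do
  server <- startDbServer "/tmp/mail/Maildir"
  n <- withDb server (\(Db p) -> pure (length p))
  print n
```

Because only the server thread ever touches the handle, two opens can never race for the lock. The cost is that every database operation in the application has to be rewritten to go through something like `withDb`, which is why this is the fallback rather than the first attempt.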