UI freeze on GHC 9.2 on some operations (notmuch related) #468

Closed
frasertweedale opened this issue Jul 22, 2022 · 0 comments

frasertweedale commented Jul 22, 2022

Describe the bug

Some operations lock the UI. For example, using the database generated from UAT
test data, perform the actions from testUserCanMoveBetweenThreads. That is:

  • <Enter> to show the first thread/mail
  • J to display the next thread
  • K to display the previous thread

If these actions are performed SLOWLY (say, at one-second intervals), they can be repeated over and over and everything works.
If they are performed QUICKLY, the UI locks immediately.

During the lock, the purebred process is observed to have forked:

% pgrep -f -l purebred
75692 purebred --database /tmp/mail/Maildir
75600 purebred --database /tmp/mail/Maildir

kill -9 <child-pid> unblocks the UI and reveals an error message:

A Xapian exception occurred opening database:
  Unable to get write lock on /tmp/mail/Maildir/.notmuch/xapian:
    Got EOF reading from child process

Analysis

Reading the Xapian source code shows that the "FlintLock" facility is used to get an exclusive (write) lock on the database. The implementation forks, and the child uses fcntl(lockfd, F_SETLK, fl) to acquire the lock. This is where it gets complicated; my guess as to what is happening is:

  • The file is already locked by a previous database open that was used to read the thread's message. That "session" is done, but the DB is not yet closed (and the lock not released), because closing is performed by the finalizer when the database handle is garbage collected in the parent process.

  • The new child therefore blocks as it waits for the lock.

  • GC in the parent process does not get triggered because the parent is still in an unsafe foreign call, waiting for the child to exit. This is consistent with the GHC documentation, which states:

    ...since version 8.4 ... GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.

This error did not occur before GHC 9.2, so it is probably a GC change that triggers the bug. The bug was always present and this seems to be a "how did this ever work" scenario.
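
To make the safe/unsafe distinction concrete, here is a minimal sketch that imports the same C function, sleep(3), both ways and counts how often a background Haskell thread gets to run while the call blocks. It does not touch notmuch or Xapian; sleepUnsafe, sleepSafe and ticksDuring are names invented for this illustration. It assumes a POSIX system and a build with -threaded, since without the threaded runtime a safe call blocks other Haskell threads as well.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Monad (forever)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)
import Foreign.C.Types (CUInt (..))

-- The same C function imported twice; only the safety annotation differs.
-- (Illustrative names, not part of hs-notmuch.)
foreign import ccall unsafe "unistd.h sleep"
  sleepUnsafe :: CUInt -> IO CUInt

foreign import ccall safe "unistd.h sleep"
  sleepSafe :: CUInt -> IO CUInt

-- Run a blocking action and report how many times a background
-- Haskell thread managed to tick (every 100 ms) in the meantime.
ticksDuring :: IO a -> IO Int
ticksDuring call = do
  ticks <- newIORef (0 :: Int)
  ticker <- forkIO $ forever $ do
    threadDelay 100000
    atomicModifyIORef' ticks (\n -> (n + 1, ()))
  _ <- call
  n <- readIORef ticks
  killThread ticker
  pure n

main :: IO ()
main = do
  unsafeTicks <- ticksDuring (sleepUnsafe 1)
  safeTicks <- ticksDuring (sleepSafe 1)
  putStrLn ("ticks during unsafe call: " ++ show unsafeTicks)
  putStrLn ("ticks during safe call:   " ++ show safeTicks)
```

With a single capability, the unsafe call holds that capability for its whole duration, so the ticker (and the garbage collector) cannot run; the safe call releases it. Whether this alone lets the finalizer run in time in purebred is exactly the open question in this issue.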

Proposed solution

  • First, change the notmuch_database_open call in hs-notmuch to be safe rather than unsafe. This may allow the parent process to GC the previous database handle, releasing the lock and unblocking the child process.

  • If that doesn't work, we have to move to a "client/server" DB access paradigm, where all DB access goes through a single thread using a single database handle. This idea has come up before as a way to avoid concurrency issues with notmuch/xapian, including the long-running issue #284 (SIGABRT when opening mail). But it is a huge change, so we have not embarked on it yet.
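
As a rough illustration of the "client/server" idea, the sketch below funnels every database action through one worker thread that owns the only handle. It is not purebred or hs-notmuch code: Db, Request, DbServer, startDbServer and withDb are hypothetical names, and Db is a stand-in for a real notmuch database handle that would be opened once when the worker starts.

```haskell
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)
import Control.Monad (forever)

-- Hypothetical stand-in for a database handle.
newtype Db = Db FilePath

-- A request is an action to run against the single handle.
newtype Request = Request (Db -> IO ())

-- Clients only ever see the request queue, never the handle itself.
newtype DbServer = DbServer (Chan Request)

-- Start the worker thread that owns the one and only handle.
startDbServer :: FilePath -> IO DbServer
startDbServer path = do
  requests <- newChan
  let db = Db path -- real code would open the database once here
  _ <- forkIO $ forever $ do
    Request run <- readChan requests
    run db
  pure (DbServer requests)

-- Run an action on the worker thread and wait for its result.
-- Exceptions are caught on the worker and rethrown in the caller,
-- so a failing request does not kill the worker.
withDb :: DbServer -> (Db -> IO a) -> IO a
withDb (DbServer requests) action = do
  reply <- newEmptyMVar
  writeChan requests (Request (\db -> try (action db) >>= putMVar reply))
  result <- takeMVar reply
  case result of
    Left err -> throwIO (err :: SomeException)
    Right value -> pure value

main :: IO ()
main = do
  server <- startDbServer "/tmp/mail/Maildir"
  path <- withDb server (\(Db p) -> pure p)
  putStrLn ("worker thread used database at " ++ path)
```

Because requests are served one at a time against a single handle, no second open can contend for the write lock. The cost is that every caller is serialised behind the queue, which is part of why this would be a large change.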
