Performance Improvements #94

Draft · wants to merge 24 commits into main

Conversation

realchonk commented Feb 16, 2025

TODO:

  • Implement benchmarking
  • Inode LruCache (see the lru crate) (icache); a sketch follows this list
  • Directory Entry Cache (dcache)
  • Cache the last accessed file block
  • Add a small block cache (bcache)
    • Optimize, or remove
  • Make caching optional
  • Find other performance improvements
  • Track write performance
  • More benchmarks
  • Speed up inode_resolve_block()
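
A minimal sketch of what the icache item above could look like, assuming the `lru` crate and hypothetical `InodeNum`/`Inode` stand-ins for the real fuse-ufs types:

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

// Hypothetical stand-ins for the real fuse-ufs types.
type InodeNum = u64;

#[derive(Clone)]
struct Inode {
    // on-disk inode fields
}

/// LRU-bounded inode cache: memory stays bounded even if the kernel
/// never sends FUSE_FORGET, because `put` past capacity evicts the
/// least recently used entry.
struct InodeCache {
    cache: LruCache<InodeNum, Inode>,
}

impl InodeCache {
    fn new(capacity: usize) -> Self {
        Self {
            cache: LruCache::new(NonZeroUsize::new(capacity).unwrap()),
        }
    }

    /// Return the cached inode, falling back to `read_inode` on a miss.
    fn get(
        &mut self,
        ino: InodeNum,
        read_inode: impl FnOnce(InodeNum) -> std::io::Result<Inode>,
    ) -> std::io::Result<Inode> {
        if let Some(inode) = self.cache.get(&ino) {
            return Ok(inode.clone());
        }
        let inode = read_inode(ino)?;
        self.cache.put(ino, inode.clone());
        Ok(inode)
    }
}
```

The dcache could follow the same pattern, keyed by a (parent inode number, entry name) pair.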

Original performance numbers (on a Ryzen 9 7900):

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 20.2 us | |
| read | 1MiB | 27.8 ms | 983 MiB/s |
| read | 64KiB | 28.7 ms | 955 MiB/s |
| read | 16KiB | 30.6 ms | 896 MiB/s |
| read | 4KiB | 118 ms | 232 MiB/s |
| read | 512B | 934 ms | 29 MiB/s |
| find | direct | 610 us | 3.1 Mfile/s |
| find | lookup | 1.77 ms | 1.1 Mfile/s |
| find | lookup+stat | 109 ms | 17 Kfile/s |
| find | lookup+stat+read-64K | 193 ms | 9.8 Kfile/s |
| find | lookup+stat+read-16K | 202 ms | 9.3 Kfile/s |
| find | lookup+stat+read-4K | 455 ms | 4.2 Kfile/s |
| find | lookup+stat+read-512 | 2.9 s | 652 file/s |

realchonk commented Feb 16, 2025

Performance numbers with an inode cache:

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 20.5 us | |
| read | 1MiB | 26.7 ms | 1023 MiB/s |
| read | 64KiB | 27.4 ms | 998 MiB/s |
| read | 16KiB | 27.5 ms | 994 MiB/s |
| read | 4KiB | 106 ms | 257 MiB/s |
| read | 512B | 826 ms | 33 MiB/s |
| find | direct | 441 us | 4.3 Mfile/s |
| find | lookup | 924 us | 2.1 Mfile/s |
| find | lookup+stat | 75 ms | 25 Kfile/s |
| find | lookup+stat+read-64K | 156 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 158 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 400 ms | 4.7 Kfile/s |
| find | lookup+stat+read-512 | 2.6 s | 718 file/s |

realchonk commented

Performance numbers with:

  • Inode cache
  • redundant read_inode() removal

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 19.7 us | |
| read | 1MiB | 26 ms | 1029 MiB/s |
| read | 64KiB | 26.6 ms | 1004 MiB/s |
| read | 16KiB | 26.3 ms | 1015 MiB/s |
| read | 4KiB | 104 ms | 264 MiB/s |
| read | 512B | 797 ms | 34 MiB/s |
| find | direct | 429 us | 4.4 Mfile/s |
| find | lookup | 885 us | 2.1 Mfile/s |
| find | lookup+stat | 73 ms | 26 Kfile/s |
| find | lookup+stat+read-64K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 157 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 380 ms | 5.0 Kfile/s |
| find | lookup+stat+read-512 | 2.4 s | 783 file/s |

realchonk commented

Performance numbers with:

  • inode cache
  • redundant read_inode() removal
  • block cache

With block cache

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 23 us | |
| read | 1MiB | 26.1 ms | 1023 MiB/s |
| read | 64KiB | 26.4 ms | 1010 MiB/s |
| read | 16KiB | 27.0 ms | 1012 MiB/s |
| read | 4KiB | 99 ms | 276 MiB/s |
| read | 512B | 747 ms | 37 MiB/s |
| find | direct | 457 us | 4.1 Mfile/s |
| find | lookup | 913 us | 2.1 Mfile/s |
| find | lookup+stat | 70 ms | 27 Kfile/s |
| find | lookup+stat+read-64K | 153 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 367 ms | 5.1 Kfile/s |
| find | lookup+stat+read-512 | 2.3 s | 808 file/s |

realchonk commented

Performance numbers with:

  • inode cache
  • redundant read_inode() removal
  • block cache
  • directory entry cache

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 23 us | |
| read | 1MiB | 26.8 ms | 1019 MiB/s |
| read | 64KiB | 27.5 ms | 996 MiB/s |
| read | 16KiB | 27.9 ms | 983 MiB/s |
| read | 4KiB | 102 ms | 270 MiB/s |
| read | 512B | 776 ms | 35 MiB/s |
| find | direct | 461 us | 4.1 Mfile/s |
| find | lookup | 661 us | 2.9 Mfile/s |
| find | lookup+stat | 68 ms | 28 Kfile/s |
| find | lookup+stat+read-64K | 152 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 375 ms | 5.1 Kfile/s |
| find | lookup+stat+read-512 | 2.4 s | 795 file/s |

realchonk commented Feb 16, 2025

Summary so far

| Test / Variant | Baseline | icache | dcache | bcache | icache+dcache |
| --- | --- | --- | --- | --- | --- |
| open | 20.7 us | 20.3 us | 20.6 us | 22.4 us | 20.9 us |
| read 1MiB | 26.0 ms / 1029 MiB/s | 24.9 ms / 1074 MiB/s | 24.3 ms / 1099 MiB/s | 27.2 ms / 1005 MiB/s | 25.6 ms / 1044 MiB/s |
| read 64KiB | 26.8 ms / 1020 MiB/s | 25.1 ms / 1065 MiB/s | 25.0 ms / 1071 MiB/s | 28.1 ms / 975 MiB/s | 26.2 ms / 1021 MiB/s |
| read 16KiB | 28.8 ms / 952 MiB/s | 25.3 ms / 1055 MiB/s | 27.6 ms / 990 MiB/s | 29.9 ms / 915 MiB/s | 26.3 ms / 1016 MiB/s |
| read 4KiB | 109 ms / 251 MiB/s | 99.1 ms / 276 MiB/s | 105 ms / 262 MiB/s | 109 ms / 251 MiB/s | 103 ms / 265 MiB/s |
| read 512B | 863 ms / 31.7 MiB/s | 763 ms / 35.8 MiB/s | 828 ms / 33.0 MiB/s | 847 ms / 32.3 MiB/s | 777 ms / 35.2 MiB/s |
| find direct | 430 us / 4.41 Mfile/s | 431 us / 4.40 Mfile/s | 421 us / 4.50 Mfile/s | 443 us / 4.28 Mfile/s | 439 us / 4.32 Mfile/s |
| find lookup | 1.22 ms / 1.56 Mfile/s | 876 us / 2.16 Mfile/s | 760 us / 2.50 Mfile/s | 1.20 ms / 1.58 Mfile/s | 639 us / 2.97 Mfile/s |
| find lookup+stat | 77.8 ms / 24.4 Kfile/s | 77.0 ms / 24.6 Kfile/s | 72.6 ms / 26.1 Kfile/s | 75 ms / 25.2 Kfile/s | 70.1 ms / 27.0 Kfile/s |
| find lookup+stat+read-64K | 162 ms / 11.7 Kfile/s | 148 ms / 12.8 Kfile/s | 147 ms / 12.9 Kfile/s | 160 ms / 11.9 Kfile/s | 147 ms / 12.9 Kfile/s |
| find lookup+stat+read-16K | 170 ms / 11.2 Kfile/s | 150 ms / 12.6 Kfile/s | 155 ms / 12.2 Kfile/s | 166 ms / 11.4 Kfile/s | 150 ms / 12.6 Kfile/s |
| find lookup+stat+read-4K | 411 ms / 4.61 Kfile/s | 370 ms / 5.13 Kfile/s | 377 ms / 5.03 Kfile/s | 405 ms / 4.68 Kfile/s | 373 ms / 5.09 Kfile/s |
| find lookup+stat+read-512 | 2.62 s / 722 file/s | 2.36 s / 804 file/s | 2.50 s / 759 file/s | 2.59 s / 732 file/s | 2.48 s / 763 file/s |

Conclusions:

  • bcache is useless (or badly optimized)
  • icache+dcache is king

realchonk commented

@asomers What do you think about this? Do you have any suggestions on what could improve the performance?

asomers commented Feb 17, 2025

> @asomers What do you think about this? Do you have any suggestions on what could improve the performance?

I'll share what I found when optimizing the performance of https://github.com/KhaledEmaraDev/xfuse . The biggest problem I found was that even when the kernel cache was enabled, read amplification was still high. That was because the kernel only cached the contents of files, not metadata structures. For example, files' indirect blocks were not cached. So when reading a very large file, the daemon would be forced to reread some of those indirect blocks over and over. Fixing that problem required the daemon to cache that metadata itself. The most logical way I found to do it was to attach said metadata in memory to the inode and cache it until FUSE_FORGET dismissed the inode. That does have the potential disadvantage of high memory consumption if the kernel rarely or never forgets a vnode. But in practice I did not find it to be a problem.
To measure the performance of the metadata cache, a time-based benchmark wasn't ideal. Instead, I constructed a read-amplification benchmark [1]. It runs a reproducible workload on a fuse-xfs file system backed by a gnop device. Then it uses gnop's measurement of throughput compared with the workload's throughput to calculate the read amplification of fuse-xfs. You can probably use the same program for fuse-ufs, though you'll have to tailor the workloads and golden images appropriately. For some workloads, I was able to achieve improvements of hundreds or even over 1000x [2].
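
For reference, read amplification here is simply the ratio of bytes read from the backing device to bytes requested by the workload; a minimal sketch with hypothetical numbers (not the xfuse benchmark itself):

```rust
/// Read amplification: bytes read from the backing device (as reported
/// by gnop, iostat, or similar) divided by the bytes the workload
/// itself requested. 1.0 is ideal; large values mean the daemon keeps
/// re-reading metadata such as indirect blocks.
fn read_amplification(device_bytes_read: u64, workload_bytes_read: u64) -> f64 {
    device_bytes_read as f64 / workload_bytes_read as f64
}

fn main() {
    // Hypothetical numbers: the workload read 1 GiB, but the device
    // serviced 5 GiB of reads, i.e. 5x amplification.
    let amp = read_amplification(5 << 30, 1 << 30);
    println!("read amplification: {amp:.2}x");
}
```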

Footnotes

  1. https://github.com/KhaledEmaraDev/xfuse/blob/main/benches/read-amplification.rs

  2. https://github.com/KhaledEmaraDev/xfuse/issues/107#issuecomment-2051985520

realchonk commented

Thanks for your insight. My approach was to implement a block cache for BlockReader, but that failed spectacularly: it decreased performance by 10-20%.

As for the indirect block cache, I could probably integrate it into the inode cache somehow.

Right now I'm using time-based benchmarks for simplicity, plus flamegraphs to check what's taking up most of the time,
but my workstation (where I ran the benchmarks) died and I'll only be able to repair it on Saturday.

By using flamegraphs I was able to optimize inode_resolve_block(), which yielded an improvement of up to 500%.
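
As an illustration of the kind of change that helps here (a sketch of the "cache the last accessed file block" TODO item, not the actual fuse-ufs code), remembering the last logical-to-physical mapping lets sequential reads skip re-walking the indirect blocks on every request:

```rust
// Hypothetical stand-ins for the real fuse-ufs types.
type LogicalBlock = u64;
type PhysicalBlock = u64;

struct Inode {
    /// Last logical -> physical mapping resolved for this inode.
    /// Sequential reads hit this and skip the indirect-block walk.
    last_block: Option<(LogicalBlock, PhysicalBlock)>,
}

impl Inode {
    fn resolve_block(
        &mut self,
        lbn: LogicalBlock,
        walk_indirect_blocks: impl FnOnce(LogicalBlock) -> std::io::Result<PhysicalBlock>,
    ) -> std::io::Result<PhysicalBlock> {
        // Fast path: same block as last time.
        if let Some((cached_lbn, pbn)) = self.last_block {
            if cached_lbn == lbn {
                return Ok(pbn);
            }
        }
        // Slow path: walk the direct/indirect pointers, then remember
        // the result for the next call.
        let pbn = walk_indirect_blocks(lbn)?;
        self.last_block = Some((lbn, pbn));
        Ok(pbn)
    }
}
```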

The problem of the kernel never forgetting can be solved with an LRU cache (the lru crate), which I'm currently using.
In fact, it might be useful to just ignore forget calls entirely.

asomers commented Feb 17, 2025

> Right now I'm using time-based benchmarks for simplicity, plus flamegraphs to check what's taking up most of the time, but my workstation (where I ran the benchmarks) died and I'll only be able to repair it on Saturday.

Beware: a flamegraph will only tell you what's taking the most CPU cycles, not the most time. I/O-bound operations won't show up in a flamegraph.

realchonk commented

That's interesting, because right now llseek is taking a lot of time.
