Performance Improvements #94

Draft · wants to merge 24 commits into main

Conversation

realchonk commented Feb 16, 2025

TODO:

  • Implement benchmarking
  • Inode LruCache (see the lru crate) (icache); a sketch follows this list
  • Directory Entry Cache (dcache)
  • Cache the last accessed file block
  • Add a small block cache (bcache)
    • Optimize, or remove
  • Make caching optional
  • Find other performance improvements
  • Track write performance
  • More benchmarks
  • Speed up inode_resolve_block()
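
A minimal sketch of what the icache item above could look like, assuming the `lru` crate and hypothetical `InodeNum`/`Inode` stand-ins for the real fuse-ufs types:

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

// Hypothetical stand-ins for the real fuse-ufs types.
type InodeNum = u64;

#[derive(Clone)]
struct Inode {
    // on-disk inode fields
}

/// LRU-bounded inode cache: memory stays bounded even if the kernel
/// never sends FUSE_FORGET, because `put` past capacity evicts the
/// least recently used entry.
struct InodeCache {
    cache: LruCache<InodeNum, Inode>,
}

impl InodeCache {
    fn new(capacity: usize) -> Self {
        Self {
            cache: LruCache::new(NonZeroUsize::new(capacity).unwrap()),
        }
    }

    /// Return the cached inode, falling back to `read_inode` on a miss.
    fn get(
        &mut self,
        ino: InodeNum,
        read_inode: impl FnOnce(InodeNum) -> std::io::Result<Inode>,
    ) -> std::io::Result<Inode> {
        if let Some(inode) = self.cache.get(&ino) {
            return Ok(inode.clone());
        }
        let inode = read_inode(ino)?;
        self.cache.put(ino, inode.clone());
        Ok(inode)
    }
}
```

The dcache could follow the same pattern, keyed by a (parent inode number, entry name) pair.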

Original performance numbers (on a Ryzen 9 7900):

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 20.2 us | |
| read | 1MiB | 27.8 ms | 983 MiB/s |
| read | 64KiB | 28.7 ms | 955 MiB/s |
| read | 16KiB | 30.6 ms | 896 MiB/s |
| read | 4KiB | 118 ms | 232 MiB/s |
| read | 512B | 934 ms | 29 MiB/s |
| find | direct | 610 us | 3.1 Mfile/s |
| find | lookup | 1.77 ms | 1.1 Mfile/s |
| find | lookup+stat | 109 ms | 17 Kfile/s |
| find | lookup+stat+read-64K | 193 ms | 9.8 Kfile/s |
| find | lookup+stat+read-16K | 202 ms | 9.3 Kfile/s |
| find | lookup+stat+read-4K | 455 ms | 4.2 Kfile/s |
| find | lookup+stat+read-512 | 2.9 s | 652 file/s |

realchonk commented Feb 16, 2025

Performance numbers with an inode cache:

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 20.5 us | |
| read | 1MiB | 26.7 ms | 1023 MiB/s |
| read | 64KiB | 27.4 ms | 998 MiB/s |
| read | 16KiB | 27.5 ms | 994 MiB/s |
| read | 4KiB | 106 ms | 257 MiB/s |
| read | 512B | 826 ms | 33 MiB/s |
| find | direct | 441 us | 4.3 Mfile/s |
| find | lookup | 924 us | 2.1 Mfile/s |
| find | lookup+stat | 75 ms | 25 Kfile/s |
| find | lookup+stat+read-64K | 156 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 158 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 400 ms | 4.7 Kfile/s |
| find | lookup+stat+read-512 | 2.6 s | 718 file/s |

realchonk commented

Performance numbers with:

  • Inode cache
  • redundant read_inode() removal

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 19.7 us | |
| read | 1MiB | 26 ms | 1029 MiB/s |
| read | 64KiB | 26.6 ms | 1004 MiB/s |
| read | 16KiB | 26.3 ms | 1015 MiB/s |
| read | 4KiB | 104 ms | 264 MiB/s |
| read | 512B | 797 ms | 34 MiB/s |
| find | direct | 429 us | 4.4 Mfile/s |
| find | lookup | 885 us | 2.1 Mfile/s |
| find | lookup+stat | 73 ms | 26 Kfile/s |
| find | lookup+stat+read-64K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 157 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 380 ms | 5.0 Kfile/s |
| find | lookup+stat+read-512 | 2.4 s | 783 file/s |

realchonk commented

Performance numbers with:

  • inode cache
  • redundant read_inode() removal
  • block cache

With block cache

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 23 us | |
| read | 1MiB | 26.1 ms | 1023 MiB/s |
| read | 64KiB | 26.4 ms | 1010 MiB/s |
| read | 16KiB | 27.0 ms | 1012 MiB/s |
| read | 4KiB | 99 ms | 276 MiB/s |
| read | 512B | 747 ms | 37 MiB/s |
| find | direct | 457 us | 4.1 Mfile/s |
| find | lookup | 913 us | 2.1 Mfile/s |
| find | lookup+stat | 70 ms | 27 Kfile/s |
| find | lookup+stat+read-64K | 153 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 367 ms | 5.1 Kfile/s |
| find | lookup+stat+read-512 | 2.3 s | 808 file/s |

realchonk commented

Performance numbers with:

  • inode cache
  • redundant read_inode() removal
  • block cache
  • directory entry cache

| Test | Variant | Time | Throughput |
| --- | --- | --- | --- |
| open | | 23 us | |
| read | 1MiB | 26.8 ms | 1019 MiB/s |
| read | 64KiB | 27.5 ms | 996 MiB/s |
| read | 16KiB | 27.9 ms | 983 MiB/s |
| read | 4KiB | 102 ms | 270 MiB/s |
| read | 512B | 776 ms | 35 MiB/s |
| find | direct | 461 us | 4.1 Mfile/s |
| find | lookup | 661 us | 2.9 Mfile/s |
| find | lookup+stat | 68 ms | 28 Kfile/s |
| find | lookup+stat+read-64K | 152 ms | 12 Kfile/s |
| find | lookup+stat+read-16K | 155 ms | 12 Kfile/s |
| find | lookup+stat+read-4K | 375 ms | 5.1 Kfile/s |
| find | lookup+stat+read-512 | 2.4 s | 795 file/s |

realchonk commented Feb 16, 2025

Summary so far

| Test / Variant | Baseline | icache | dcache | bcache | icache+dcache |
| --- | --- | --- | --- | --- | --- |
| open | 20.7 us | 20.3 us | 20.6 us | 22.4 us | 20.9 us |
| read 1MiB | 26.0 ms / 1029 MiB/s | 24.9 ms / 1074 MiB/s | 24.3 ms / 1099 MiB/s | 27.2 ms / 1005 MiB/s | 25.6 ms / 1044 MiB/s |
| read 64KiB | 26.8 ms / 1020 MiB/s | 25.1 ms / 1065 MiB/s | 25.0 ms / 1071 MiB/s | 28.1 ms / 975 MiB/s | 26.2 ms / 1021 MiB/s |
| read 16KiB | 28.8 ms / 952 MiB/s | 25.3 ms / 1055 MiB/s | 27.6 ms / 990 MiB/s | 29.9 ms / 915 MiB/s | 26.3 ms / 1016 MiB/s |
| read 4KiB | 109 ms / 251 MiB/s | 99.1 ms / 276 MiB/s | 105 ms / 262 MiB/s | 109 ms / 251 MiB/s | 103 ms / 265 MiB/s |
| read 512B | 863 ms / 31.7 MiB/s | 763 ms / 35.8 MiB/s | 828 ms / 33.0 MiB/s | 847 ms / 32.3 MiB/s | 777 ms / 35.2 MiB/s |
| find direct | 430 us / 4.41 Mfile/s | 431 us / 4.40 Mfile/s | 421 us / 4.50 Mfile/s | 443 us / 4.28 Mfile/s | 439 us / 4.32 Mfile/s |
| find lookup | 1.22 ms / 1.56 Mfile/s | 876 us / 2.16 Mfile/s | 760 us / 2.50 Mfile/s | 1.20 ms / 1.58 Mfile/s | 639 us / 2.97 Mfile/s |
| find lookup+stat | 77.8 ms / 24.4 Kfile/s | 77.0 ms / 24.6 Kfile/s | 72.6 ms / 26.1 Kfile/s | 75 ms / 25.2 Kfile/s | 70.1 ms / 27.0 Kfile/s |
| find lookup+stat+read-64K | 162 ms / 11.7 Kfile/s | 148 ms / 12.8 Kfile/s | 147 ms / 12.9 Kfile/s | 160 ms / 11.9 Kfile/s | 147 ms / 12.9 Kfile/s |
| find lookup+stat+read-16K | 170 ms / 11.2 Kfile/s | 150 ms / 12.6 Kfile/s | 155 ms / 12.2 Kfile/s | 166 ms / 11.4 Kfile/s | 150 ms / 12.6 Kfile/s |
| find lookup+stat+read-4K | 411 ms / 4.61 Kfile/s | 370 ms / 5.13 Kfile/s | 377 ms / 5.03 Kfile/s | 405 ms / 4.68 Kfile/s | 373 ms / 5.09 Kfile/s |
| find lookup+stat+read-512 | 2.62 s / 722 file/s | 2.36 s / 804 file/s | 2.50 s / 759 file/s | 2.59 s / 732 file/s | 2.48 s / 763 file/s |

Conclusions:

  • bcache is useless (or badly optimized)
  • icache+dcache is king

realchonk commented

@asomers What do you think about this? Do you have any suggestions on what could improve the performance?

asomers commented Feb 17, 2025

> @asomers What do you think about this? Do you have any suggestions on what could improve the performance?

I'll share what I found when optimizing the performance of https://github.com/KhaledEmaraDev/xfuse . The biggest problem I found was that even when the kernel cache was enabled, read amplification was still high. That was because the kernel only cached the contents of files, not metadata structures. For example, files' indirect blocks were not cached. So when reading a very large file, the daemon would be forced to reread some of those indirect blocks over and over. Fixing that problem required the daemon to cache that metadata itself. The most logical way I found to do it was to attach said metadata in memory to the inode and cache it until FUSE_FORGET dismissed the inode. That does have the potential disadvantage of high memory consumption if the kernel rarely or never forgets a vnode. But in practice I did not find it to be a problem.
To measure the performance of the metadata cache, a time-based benchmark wasn't ideal. Instead, I constructed a read-amplification benchmark [1]. It runs a reproducible workload on a fuse-xfs file system backed by a gnop device. Then it uses gnop's measurement of throughput compared with the workload's throughput to calculate the read amplification of fuse-xfs. You can probably use the same program for fuse-ufs, though you'll have to tailor the workloads and golden images appropriately. For some workloads, I was able to achieve improvements of hundreds or even over 1000x [2].
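
For reference, read amplification here is simply the ratio of bytes read from the backing device to bytes requested by the workload; a minimal sketch with hypothetical numbers (not the xfuse benchmark itself):

```rust
/// Read amplification: bytes read from the backing device (as reported
/// by gnop, iostat, or similar) divided by the bytes the workload
/// itself requested. 1.0 is ideal; large values mean the daemon keeps
/// re-reading metadata such as indirect blocks.
fn read_amplification(device_bytes_read: u64, workload_bytes_read: u64) -> f64 {
    device_bytes_read as f64 / workload_bytes_read as f64
}

fn main() {
    // Hypothetical numbers: the workload read 1 GiB, but the device
    // serviced 5 GiB of reads, i.e. 5x amplification.
    let amp = read_amplification(5 << 30, 1 << 30);
    println!("read amplification: {amp:.2}x");
}
```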

Footnotes

  1. https://github.com/KhaledEmaraDev/xfuse/blob/main/benches/read-amplification.rs

  2. https://github.com/KhaledEmaraDev/xfuse/issues/107#issuecomment-2051985520

realchonk commented

Thanks for your insight. My approach was to implement a block cache for BlockReader, but that failed spectacularly: it decreased performance by 10-20%.

As for the indirect block cache, I could probably integrate it into the inode cache somehow.

Right now I'm using time-based benchmarks for simplicity, plus flamegraphs to check what's taking up most of the time,
but my workstation (where I ran the benchmarks) died and I'll only be able to repair it on Saturday.

By using flamegraphs I was able to optimize inode_resolve_block(), which yielded an improvement of up to 500%.
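
As an illustration of the kind of change that helps here (a sketch of the "cache the last accessed file block" TODO item, not the actual fuse-ufs code), remembering the last logical-to-physical mapping lets sequential reads skip re-walking the indirect blocks on every request:

```rust
// Hypothetical stand-ins for the real fuse-ufs types.
type LogicalBlock = u64;
type PhysicalBlock = u64;

struct Inode {
    /// Last logical -> physical mapping resolved for this inode.
    /// Sequential reads hit this and skip the indirect-block walk.
    last_block: Option<(LogicalBlock, PhysicalBlock)>,
}

impl Inode {
    fn resolve_block(
        &mut self,
        lbn: LogicalBlock,
        walk_indirect_blocks: impl FnOnce(LogicalBlock) -> std::io::Result<PhysicalBlock>,
    ) -> std::io::Result<PhysicalBlock> {
        // Fast path: same block as last time.
        if let Some((cached_lbn, pbn)) = self.last_block {
            if cached_lbn == lbn {
                return Ok(pbn);
            }
        }
        // Slow path: walk the direct/indirect pointers, then remember
        // the result for the next call.
        let pbn = walk_indirect_blocks(lbn)?;
        self.last_block = Some((lbn, pbn));
        Ok(pbn)
    }
}
```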

The problem of the kernel never forgetting can be solved with an LRU cache (the lru crate), which I'm currently using.
In fact, it might be useful to just ignore forget calls entirely.

asomers commented Feb 17, 2025

> Right now I'm using time-based benchmarks for simplicity, plus flamegraphs to check what's taking up most of the time, but my workstation (where I ran the benchmarks) died and I'll only be able to repair it on Saturday.

Beware: a flamegraph will only tell you what's taking the most CPU cycles, not the most time. I/O-bound operations won't show up in a flamegraph.

realchonk commented

That's interesting, because right now llseek is taking a lot of time.
