BTRFS: Finding and fixing highly fragmented files
What is BTRFS fragmentation?
Most of the best BTRFS features are powered by its copy-on-write technology. If an application wants to rewrite a part of a file, like the first megabyte, the data is not written in place but into a so-called extent. This enables BTRFS to keep multiple versions of partially rewritten files while only claiming disk space for the changes, not for multiple full copies of the file. The old data can be discarded at some point (i.e. once it is not used by any snapshots anymore) and the new extent will serve the file's current version.
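You can watch this happen yourself. The following is a minimal sketch, assuming a BTRFS mount at /mnt/btrfs (adjust the path to your system) with compression and autodefrag disabled: it creates a 128MB file and then rewrites random 4k blocks in place, which makes the extent count grow with every copy-on-write rewrite.
dd if=/dev/zero of=/mnt/btrfs/testfile bs=1M count=128
sync
filefrag /mnt/btrfs/testfile    # typically just a handful of extents
for i in $(seq 1 1000); do dd if=/dev/urandom of=/mnt/btrfs/testfile bs=4k count=1 seek=$RANDOM conv=notrunc 2>/dev/null; done
sync
filefrag /mnt/btrfs/testfile    # now hundreds of extents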
BTRFS fragmentation can hurt the performance of your system
As you can guess, reading a file with 100k+ extents and adding even more extents requires a lot of bookkeeping and storage seeks from your system. That 10GB file is internally shattered into 100k parts that all need to be collected if you want to read the whole file. This clearly adds complexity - and decreases performance.
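If you want to see the individual pieces of a suspect file rather than just the count, filefrag can also list every extent with its offsets (the path below is only an example):
filefrag -v /var/lib/mysql/ibdata1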
BTRFS fragmentation can block huge amounts of disk space
Yes, BTRFS has to store the locations of these 100k extents somewhere, easily adding some extra GB of used disk space to your system. The bad thing is that BTRFS does not tell you that.
If you see your BTRFS filesystem using 80GB in df and btrfs fi show while du -hsx only shows 54GB, there are only two causes I am aware of: either you have snapshots that keep old extents, or you have massive fragmentation.
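To compare these numbers on your own system, these are the commands I am referring to (run them against the mount point in question, / is just an example):
df -h /
btrfs filesystem show /
btrfs filesystem df /
du -hsx /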
BTRFS filesystem defrag
It is possible to use BTRFS filesystem defrag on your whole file system, but that causes all your snapshots to duplicate the data. It also causes a lot of IO, so this is nothing you want to do on your production server without a reason. There is really no point in defragmenting static files that are almost never changed.
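For reference, the whole-filesystem variant would look like the line below (recursive and verbose; /home is just an example path), with all the caveats mentioned above:
btrfs filesystem defragment -r -v /home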
Find the most fragmented files on your system
There is a Linux tool called filefrag which reports how many fragments a file consists of. So I thought: "Why not try to find the most fragmented files and fix just these?" Here you go:
find / -xdev -type f| xargs filefrag 2>/dev/null | sed 's/^\(.*\): \([0-9]\+\) extent.*/\2 \1/' | awk -F ' ' '$1 > 500' | sort -n -r
You should review this list. If there is something with 10k+ extents, it is a candidate to be flagged as nodatacow. In my case, I discovered that the fail2ban SQLite database was using 170k extents, which is a lot! If you have database files with high fragmentation while already using nodatacow, it is better to run an "OPTIMIZE TABLE" on them, as this also cleans up the database-internal fragmentation of frequently rewritten tables. If you use snapshots, make sure to have some free space, as the defrag does an in-place copy of the files while snapshots are blocking the old version from being released.
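If you decide to flag a file like that as nodatacow, a rough sketch looks like this. The fail2ban database path is how it is named on many distributions, so adjust it to your system, and note that chattr +C only takes effect on empty files, which is why the copy dance is needed:
systemctl stop fail2ban
cd /var/lib/fail2ban
mv fail2ban.sqlite3 fail2ban.sqlite3.old
touch fail2ban.sqlite3
chattr +C fail2ban.sqlite3        # nodatacow has to be set while the file is still empty
cat fail2ban.sqlite3.old > fail2ban.sqlite3
rm fail2ban.sqlite3.old
systemctl start fail2ban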
If everything is fine, you can go ahead and defrag all files on that list:
find / -xdev -type f| xargs filefrag 2>/dev/null | sed 's/^\(.*\): \([0-9]\+\) extent.*/\2 \1/' |
awk -F ' ' '$1 > 500' | cut -d ' ' -f2- | xargs -r -d '\n' btrfs fi defrag -f -v
This will print out all filenames that are processed.
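Afterwards you can verify the result by checking a single file again (the path is just my example from above) or by re-running the pipeline from the previous step, which should now return a much shorter list:
filefrag /var/lib/fail2ban/fail2ban.sqlite3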
A short explanation of the command
find gets all files on the specified path (/) without descending into other mounted filesystems (-xdev). Then filefrag determines the fragmentation, and the sed command reformats the output so that the extent count is in the first position, followed by the filename. Then awk parses that list, keeping only files that have more than 500 extents. After that is done, the output is "cut" to only contain the filenames and passed to btrfs defrag for defragmentation. -v on the defrag command prints out all processed files. Also take a look at the long-term IO usage before and after the defrag to see how big the difference is in the real world.
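A simple way to watch that IO over time is iostat from the sysstat package (this assumes sysstat is installed; the 5-second interval is arbitrary):
iostat -x 5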