XFS>16TB: Duncan White's XFS beyond 16TB Notes

XFS beyond 16TB!

Hi there! I'm Duncan White, an experienced professional Unix programmer.
I just (late Nov 2008) wasted most of a week trying to do something that should have been completely routine: expand an XFS filesystem from 14TB to 21TB on an AMD64 Ubuntu Feisty server. In case anyone else suffers similarly, here are some notes.

0. The Hardware:
The core of this system was a hardware RAID5 array unit (Xyratex F5404E) which, rather like a Sun thumper, takes up to 48 SATA or SAS drives, inserted vertically from above. We had 24 750GB SATA disks in at the start, forming a single 14TB block device, with a single 14TB XFS filesystem on it used for online backups. This filesystem had reached 100% full some months ago. We wanted to add another 12 x 750GB disks, giving us a 21TB filesystem. I started by quiescing the filesystem (un-NFS exporting it, and unmounting it).
1. Expanding the RAID:
This started out straightforwardly: I eased the RAID array out of the rack, added the disks, and pushed the array back into the rack. The rack catches tried to impale my hands as usual; apart from that, no problem. The first problem was how to control the RAID array, as I hadn't set it up. It had serial and network management options, but I didn't know if the network management port had been set up, and if so, what IP address (on a private management network) to use. So I attached my Ubuntu laptop to the RAID controller's serial interface and read the Xyratex manuals in detail, but nowhere in about 250 pages did they bother to mention that the menus only appear when you type CTRL-E. I must have rebooted that RAID array about 10 times, trying all single keys (but not control key combinations, sadly), before we called the supplier who said "oh, you need CTRL-E". Sigh. I love documentation.
Now that I could get into the menus, expanding the LUN was straightforward. I found that the LUN was actually formed of two "logical regions", each of which was a 10+1 drive RAID5 volume. I think that this combining of regions into a LUN meant striping, i.e. RAID50, although the manuals didn't make that clear - it could have been all-of-region-1 followed by all-of-region-2. Anyway, that left 2 hot spares. I checked that all the new disks were seen, and in a good state, and then created a third 10+1 drive RAID5 volume, and added it as a third logical region to the existing LUN. For the next N hours, the RAID controller logged progress in creating the new RAID5 volume, and all the disk lights flashed enthusiastically. Stage 1 done.
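As a rough sanity check of the sizes involved, here is a minimal sketch of the arithmetic. It assumes each disk holds a nominal 750 x 10^9 bytes and that each 10+1 RAID5 region contributes 10 disks' worth of usable space; the function name usable_gib is invented for this example, and the real controller may well reserve a little more.

```shell
#!/bin/sh
# Rough capacity arithmetic for the LUN (a sketch, not vendor-exact figures).
# Assumptions: each disk is a nominal 750 * 10^9 bytes, and each 10+1 RAID5
# region yields 10 disks of usable space. usable_gib is a hypothetical name.

DISK_BYTES=750000000000      # nominal 750GB disk
DATA_DISKS_PER_REGION=10     # 10+1 RAID5: one disk's worth goes to parity
GIB=1073741824               # 2^30 bytes

usable_gib() {
    # $1 = number of logical regions in the LUN
    echo $(( $1 * DATA_DISKS_PER_REGION * DISK_BYTES / GIB ))
}

echo "2 regions: $(usable_gib 2) GiB"
echo "3 regions: $(usable_gib 3) GiB"
echo "16 TiB:    $(( 16 * 1024 )) GiB"
```

Under those assumptions the two-region LUN comes to 13969 GiB and the three-region LUN to 20954 GiB, in line with the rounded 14TB and 21TB figures, with the 16 TiB mark (16384 GiB) falling between them.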
2. Expanding the block device.
On the Linux server, I used the old trick to make the kernel re-detect the block device (with its changed size). The RAID volume happened to be on controller 0, target 0, disk 0, lun 0, so I removed it by:
echo 'scsi remove-single-device 0 0 0 0' > /proc/scsi/scsi
and then added it back again (forcing redetection of properties like size):
echo 'scsi add-single-device 0 0 0 0' > /proc/scsi/scsi
Sure enough, dmesg now revealed the device had been redetected with the new size (21TB). fdisk -l /dev/sda confirmed this. Stage 2 was easy; too easy!
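The remove-and-re-add sequence above can be wrapped up as a small helper. This is only a sketch: rescan_scsi_device and the RUN and SCSI_PROC variables are names invented for this example, and by default it only prints what it would write rather than touching /proc/scsi/scsi.

```shell
#!/bin/sh
# Sketch of the remove/re-add cycle for one SCSI device.
# rescan_scsi_device, RUN and SCSI_PROC are hypothetical names for this example.
# By default this is a dry run; set RUN=1 (as root, with the filesystem
# unmounted) to actually write to the proc file.

rescan_scsi_device() {
    # $1..$4 = the four numbers identifying the device, in the same order
    # as in the echo commands above
    target_file=${SCSI_PROC:-/proc/scsi/scsi}
    for action in remove-single-device add-single-device; do
        line="scsi $action $1 $2 $3 $4"
        if [ "${RUN:-0}" = "1" ]; then
            echo "$line" > "$target_file"
        else
            echo "would write: $line"
        fi
    done
}

# Dry run for the device used in these notes
rescan_scsi_device 0 0 0 0
```

Keeping the dry run as the default seems the safer choice here, since removing the wrong device from under a mounted filesystem is not something you want to do by typo.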
3. Expanding the XFS filesystem.
XFS offers the ability to expand a filesystem, and (rather unusually) requires the filesystem to be mounted (but not NFS exported) to do it. So I mounted the filesystem up (mount /export/vol), and then grew the filesystem via the near instantaneous command:
xfs_growfs /export/vol
Don't you just love extent-based filesystems? As well as XFS on Linux, we use the Veritas filesystem on Solaris, for exactly the same reason. Now, df /export/vol reveals that the size has changed to 21TB as expected, and ls /export/vol shows the old directory contents so the filesystem really seems to be up and ok. Yippee, stage 3 only takes a few seconds; all done now, yes? At this point, I went home for the weekend, satisfied that it was basically all done.
4. An unexpected problem!
On returning after the weekend, I found the server had a load average in the hundreds, because every online backup process that attempted to write to the filesystem had hung. Rebooting fixed the hanging problem but revealed another one:
touch /export/vol/wibble
reported "No space left on device". strace'ing this revealed little more: the open('/export/vol/wibble', for writing) call returned ENOSPC.
What on earth was wrong? df, remember, knows the filesystem was now 21TB and 67% full. xfs_info revealed an expanded number of Allocation Groups (AGs) and data blocks.
I also found the useful xfs_db filesystem debugger. It has many useful features; the two that I highlight here are:

show me the free block list: xfs_db -r -c freesp /dev/sda, which confirmed that after the filesystem had been grown there were a small number of huge free extents, comprising 99.9% of the free block list.

show me how many inodes are used, and free: xfs_db -r -c sb -c print /dev/sda | egrep 'icount|ifree'.
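The second of those can be boiled down to a one-line summary with a little awk. This is a minimal sketch, assuming xfs_db prints the superblock fields in the form "icount = N" and "ifree = N"; inode_usage is a helper name made up for this example.

```shell
#!/bin/sh
# Sketch: summarise the inode counters from an xfs_db superblock dump.
# Assumes the fields are printed as "icount = N" and "ifree = N";
# inode_usage is a hypothetical helper name.

inode_usage() {
    # Reads the superblock dump on stdin and ignores every other field
    awk '$1 == "icount" { icount = $3 }
         $1 == "ifree"  { ifree = $3 }
         END { printf "allocated=%d free=%d in_use=%d\n", icount, ifree, icount - ifree }'
}

# Intended use (read-only, as root):
#   xfs_db -r -c sb -c print /dev/sda | inode_usage
```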

5. Attempting to repair the filesystem!
I began to wonder, despite all the above (df and xfs_info) evidence, whether the grow operation had failed in some strange way, corrupting the filesystem in a subtle way. XFS provides two utilities: xfs_check which nearly always runs out of memory when analysing large filesystems (and did this time, as expected), and xfs_repair. So I unmounted the filesystem and tried:
xfs_repair /dev/sda
This took hour upon hour, so I left it running overnight, while doing numerous google searches to see if anyone had seen this problem before - but I couldn't find any clearly analogous case. I did note that 32-bit Linuxes had a 16TB filesystem limit - and I'd gone from 14TB to 21TB, remember, so I'd crossed that limit - but of course my server was 64-bit.
To my annoyance, the next day I found that the xfs_repair process had died before completing the repair scan. I ran it again, but again when I came in the next day it had died before completion (wasting another day!).
I began to wonder whether I needed a newer version of the XFS utilities. I considered building the latest XFS utilities for Ubuntu Feisty (there were no later packages in the repository), but thought instead that I might as well upgrade the server to Ubuntu Hardy, which we are using widely now anyway.
6. Latest XFS: Upgrading the server!
So, I decided to upgrade the server. I tried the experiment of changing the /etc/apt/sources.list file to point at Hardy and running aptitude dist-upgrade several times, i.e. an inline upgrade. Debian and Ubuntu are supposed to be able to do this; however, we've had mixed success in the past. This time, after about 4 runs, aptitude appeared to be completely happy, but rebooting failed miserably because the md (software mirroring) packages were no longer installed and the root filesystem was mirrored!
So, the inline upgrade having failed, I detached the RAID array's fibrechannel connection for safety and then reinstalled the server via our home brew networked installation system, booting from a USB stick that always lives on my key ring nowadays. I suspect that this took less time than the inline upgrade, given that we only install a few hundred packages on servers!
7. Repairing the filesystem.. again!
Now that my server was running Ubuntu Hardy, I reattached the RAID array and rebooted. Then I was ready to try xfs_repair /dev/sda once again. This time, it seemed to run much faster, and completed successfully within about 5 hours rather than taking more than 11 hours. Had they really improved the algorithm that much? It seemed surprising.
Unfortunately, despite finishing cleanly, xfs_repair failed to find a single problem to correct, claiming that the filesystem was intact and correct, and after mounting the filesystem the touch /export/vol/wibble command still failed with the no space available message.
8. The Solution!
By now, more than a week had passed, and I was getting very irritated and rather desperate. As the filesystem in question only contained online backups of most of our important filesystems, and we had a second independent online backup, I came very close to destroying the filesystem with its 14TB of data, and recreating it, to see whether a newly created XFS filesystem still suffered the same problem. Almost as a last step before that, I thought that if the filesystem was truly not corrupted, and if it wasn't a layout problem at creation time, then logically something at mount time wasn't happening right. I thought again about the 16TB filesystem limit, and possible 32-bit vs 64-bit differences that might occur, and I remembered that the Veritas filesystem required filesystems containing 64-bit sized files to be mounted with a special mount option (largefiles), so I reread the XFS section of the mount manpage looking very closely for 64-bit comments. I found the following option:
inode64 Indicates that XFS is allowed to create inodes at any location in the filesystem, including those which will result in inode numbers occupying more than 32 bits of significance. This is provided for backwards compatibility, but causes problems for backup applications that cannot handle large inode numbers.

This sounded worth trying, despite the confusing backwards compatibility aside which made it sound like it was now the default behaviour, so I remounted the filesystem with:
umount /export/vol
mount -o inode64 /export/vol
and to my delight, I was now able to create files with impunity!
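To confirm that the option has actually taken effect, the mount table can be checked. The following is a sketch only: it assumes the running kernel lists inode64 in the options field of /proc/mounts, and has_inode64 is a name invented for this example.

```shell
#!/bin/sh
# Sketch: check whether a mount point is listed with the inode64 option.
# Reads /proc/mounts-style lines on stdin; has_inode64 is a hypothetical name.
# Assumption: the running kernel reports inode64 in the options field.

has_inode64() {
    # $1 = mount point, e.g. /export/vol
    # Exit status 0 if the option is listed for that mount point, 1 otherwise
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "inode64") found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Intended use:
#   has_inode64 /export/vol < /proc/mounts && echo "inode64 is active"
```

Since an option given on the mount command line only lasts until the next unmount, inode64 also needs adding to the options field of the filesystem's /etc/fstab entry to survive a reboot.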
9. Conclusions
XFS filesystems on 64-bit Linux machines still have some kind of limit. It is not as simple as the old 16TB size limit, but presumably once the inode numbers that the filesystem uses need 64 bits (i.e. no longer fit in a 32-bit unsigned integer), the above magic option is still required. For my system, this boundary appeared to coincide with the filesystem being expanded above 16TB, but I don't know whether this is always true: it may be specific to my system's history, or generic to all XFS filesystems. Alternatively, the inode limit could be hit completely independently of the filesystem size, as a property of the number of inodes already in use.
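One way to probe the "no longer fit in a 32-bit unsigned integer" question on a live filesystem is to look for the largest inode number actually in use. This is a hedged sketch rather than a diagnosis: exceeds_32bit and max_inode are invented helper names, and it says nothing about where XFS would place the next new inode.

```shell
#!/bin/sh
# Sketch: does a given inode number still fit in 32 bits?
# exceeds_32bit and max_inode are hypothetical helper names.

MAX_32BIT=4294967295   # 2^32 - 1, the largest unsigned 32-bit value

exceeds_32bit() {
    # $1 = inode number; prints yes or no
    if [ "$1" -gt "$MAX_32BIT" ]; then echo yes; else echo no; fi
}

max_inode() {
    # Reads one inode number per line on stdin, prints the largest
    sort -n | tail -n 1
}

# Intended use (GNU find; walks the whole filesystem, so it is slow):
#   exceeds_32bit "$(find /export/vol -xdev -printf '%i\n' | max_inode)"
```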
It is extremely irritating that the mount command can't figure this out for itself, or warn that the option is missing when needed, but if it really can't then the man page should clearly describe this.
The newer version of the XFS utilities that comes with Hardy (2.9.4 vs Feisty's 2.8.18) did make some difference, in that xfs_repair completed at least twice as fast as the previous version. However, all this did was eliminate filesystem corruption as the most likely explanation! Generally, I have some suspicion of the prevalent Linux sysadmin belief that "newest is best", but this time it was genuinely useful.
dcw@doc.ic.ac.uk

Updated: 1st Dec 2008