Saturday, November 10, 2018

windows - Performance of file operations on thousands of files on NTFS vs HFS, ext3, others



[Crossposted from my Ask HN post. Feel free to close it if the question's too broad for superuser.]




This is something I've been curious about for years, but I've never found any good discussions on the topic. Of course, my Google-fu might just be failing me...



I often deal with projects involving thousands of relatively small files. This means that I'm frequently performing operations on all of those files or a large subset of them: copying the project folder elsewhere, deleting a bunch of temporary files, and so on. Of all the machines I've worked on over the years, I've noticed that NTFS handles these tasks consistently more slowly than HFS on a Mac or ext3/ext4 on a Linux box. However, as far as I can tell, the raw throughput isn't actually slower on NTFS (at least not significantly); the delay between each individual file is just a tiny bit longer, and that little delay really adds up over thousands of files.
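One rough way to put a number on that per-file delay is to time the creation and deletion of a few thousand small files and divide by the count. Here's a minimal Python sketch along those lines; the directory name, file count, and file size are arbitrary choices, not anything measured so far:

import os
import time

def bench(dirpath, count=5000, size=1024):
    # Create the target directory if it doesn't exist yet.
    os.makedirs(dirpath, exist_ok=True)
    payload = b"x" * size

    # Time creating `count` small files.
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(dirpath, f"f{i:05d}.tmp"), "wb") as f:
            f.write(payload)
    create_s = time.perf_counter() - start

    # Time deleting them again.
    start = time.perf_counter()
    for i in range(count):
        os.remove(os.path.join(dirpath, f"f{i:05d}.tmp"))
    delete_s = time.perf_counter() - start

    print(f"create: {create_s / count * 1000:.3f} ms/file, "
          f"delete: {delete_s / count * 1000:.3f} ms/file")

if __name__ == "__main__":
    bench("bench_dir")  # arbitrary test directory

Running the same script on the same hardware under Windows/NTFS and under Linux/ext4 (or macOS/HFS) would give exactly the per-file comparison described above.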



(Side note: From what I've read, this is one of the reasons git is such a pain on Windows, since it relies so heavily on the file system for its object database.)



Granted, my evidence is merely anecdotal—I don't currently have any real performance numbers, but it's something that I'd love to test further (perhaps with a Mac dual-booting into Windows). Still, my geekiness insists that someone out there already has.



Can anyone explain this, or perhaps point me in the right direction to research it further myself?



Answer



I'm not an HFS expert, but I've looked into NTFS and ext3 filesystems. It sounds like you should consider two things.



First, the ext2/3/4 file systems pre-allocate the on-disk areas for storing file metadata (permissions, ownership, and the blocks or extents that make up the file's data). I don't think NTFS does. The equivalent of an ext3 "inode" is the $MFT record, and it's my understanding that $MFT records aren't necessarily already allocated when you create a file; the $MFT can be grown if need be. By contrast, it's much harder to increase the number of inodes in an ext2/3/4 filesystem.
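One easy way to see ext2/3/4's fixed inode budget from user space is os.statvfs (POSIX only, so this won't run on Windows). A minimal sketch, with the path chosen arbitrarily:

import os

# f_files / f_ffree are the total and free inode counts for the filesystem
# holding this path; on ext2/3/4 the total is fixed when the filesystem is made.
st = os.statvfs("/")
print("total inodes:", st.f_files)
print("free inodes: ", st.f_ffree)

The same numbers show up in the output of df -i. NTFS has no fixed total to report, since the $MFT grows on demand.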



I'm not privy to any NT internals, but everything I've read suggests that $MFT records get created as needed, so you can have small files, directories, and large files interspersed.



For BSD FFS-style filesystems, which the ext2/3/4 filesystems most definitely are, a lot of thought has gone into grouping on-disk inodes and separating directory files from inodes, and a lot of thought has gone into writing out directories and metadata both efficiently and safely. See http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf as an example.



Second, if I read things correctly, the data for small files is kept right inside the $MFT record (as a "resident" attribute). This isn't true of ext2/3/4, and that's why I mentioned above that small files and large files are treated a bit differently.
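If you want to poke at that resident-data behaviour from user space, a rough probe is to time file creation at a handful of sizes and look for a jump once the data no longer fits inside the $MFT record. This is only a sketch; the sizes, count, and directory name are guesses, and the exact threshold depends on the MFT record layout:

import os
import time

def time_creates(dirpath, size, count=2000):
    # Create `count` files of `size` bytes and return the average ms per file.
    os.makedirs(dirpath, exist_ok=True)
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(dirpath, f"s{size}_{i}.tmp"), "wb") as f:
            f.write(payload)
    return (time.perf_counter() - start) / count * 1000

if __name__ == "__main__":
    for size in (128, 256, 512, 1024, 2048, 4096):
        print(f"{size:5d} bytes: {time_creates('probe_dir', size):.3f} ms/file")

If my reading is right, the same sweep on ext3/4 shouldn't show that particular jump, since small-file data never lives in the inode there.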




It sounds to me like NT (the operating system) is suffering from contention for the $MFT. Directories get updated, which is a $MFT record update. Small files get created, which is also a $MFT update. The OS can't order reads and writes efficiently because the metadata updates and the small-file data writes all go to the same "file", the $MFT.



But, like I said, this is just a guess. My knowledge of NTFS is mainly from reading, and only a very little of it from experimenting. You could double-check my guess by seeing whether HFS keeps "directories" separate from "inodes" separate from "file data". If it does, that might be a big hint.

