I'm planning to build a new NAS to store a large amount of media (20TB+). I would like to use btrfs for both the NAS and the backup (which might be a separate system, not sure yet).
- I want to use raid1 or raid10 to cover disk failure & bit rot
- I want to use one large file system and 8-15 subvolumes - efficient space usage etc.
My issue is: it does not look like raid6 is up to scratch yet, and a single raid1 or raid10 file system will only protect me from a single disk failure. I'm worried that rebuilding my file system after a disk failure with 5TB-10TB sized disks will take days at least and expose me to total loss if another disk fails. I know I will then still have my backup, but I have the same issues there again.
- what are my options with btrfs for the above scenario
- is there any btrfs mode for combining disks that will lose only the files on a failed disk, rather than the whole file system?
- can btrfs use a backup file system, rather than RAID, to recover from a checksum error?
- what about zfs
- what about unRAID, FlexRAID, etc. for my scenario?
Thanks
Answer
- what about zfs
Hello Shaun,
I can't tell you much about btrfs; it's still on my to-do list. For ZFS, there are a few solutions available, some with a graphical interface (they usually offer versions that are free for private use). I've also tested it on the command line on Solaris, OpenIndiana and OmniOS, but for ease of use I'd recommend a dedicated NAS distribution like NexentaStor (more business-oriented, less intuitive GUI) or, in your case, probably FreeNAS (good all-rounder, web GUI, free).
FreeNAS installation is a breeze (e.g. write the image to a USB stick (I prefer SLC-based chips for better resilience), plug it into the mainboard, boot, configure the network on the command line and plug the box into the network - after that, everything else is done via the web GUI), and the community is quite lively. AND it has an easy option to install (as an isolated module) a media server (Plex Media Server) and let it see a selected directory or file system, optionally read-only.
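If you have never written an image to a USB stick before, it is a one-liner on any Unix-like machine - a minimal sketch, where both the image file name and the target device are placeholders (double-check the device node before writing, it will be overwritten):

    # 'FreeNAS-x.y.img' and '/dev/da0' are placeholders - writing to the wrong device destroys its data
    dd if=FreeNAS-x.y.img of=/dev/da0 bs=64k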
And, most important to me: you get (almost limitless) snapshots and snapshot-based replication to another box. Meaning: you can set up a task that periodically takes snapshots and then replicates them to another box. That box doesn't have to be identical; it can be a low-cost configuration (even based on a different system / OS) that only serves as an archive - or a full-fledged twin.
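Under the hood these are plain ZFS snapshots plus send/receive; FreeNAS just wraps them into periodic tasks in the GUI. As a rough sketch (pool, dataset, snapshot and host names are made up):

    # take a snapshot of the media dataset
    zfs snapshot tank/media@2015-06-01
    # send it incrementally (relative to the previous snapshot) to the backup box
    zfs send -i tank/media@2015-05-01 tank/media@2015-06-01 | ssh backupbox zfs receive -F backup/media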
Now, when it comes to the disk configuration, some basic information is required, mainly about the kind of usage:
Media files are usually large; copying them from and to storage one by one is usually no big task for any system. What else will you need? Multiple simultaneous accesses to different media? Heavy skipping forward / backward? Or basically put: how random is your read access? The same goes for write access. A single user, storing files and watching from time to time, shouldn't be a big deal. A home theater box regularly scanning through all the media on the NAS to build an index for each file, or streaming out to 5 or 50 clients, is a totally different thing. 20 people working on separate projects, editing, cutting and merging media files, is another story completely.
The good news: ZFS can satisfy any of the above. Even all of them. But the costs will naturally vary. Let me give you some examples:
An 'entry configuration' (mainly single user throughput) providing 24TB might look like this:
* one pool with a RAIDZ2 or Z3 configuration of 6 or 7 6TB HDs, respectively (the number after 'Z' is how many disks may fail without actual data loss, max. 3)
* 8GB RAM (4GB is a bit tight, with ZFS it's generally: the more, the merrier!)
* one or more 1GBit ethernet ports (best to add one dedicated network for replication if needed/feasible)
This setup (about 24TB usable) should suffice for mainly single-user access: big files copied serially onto the box, then read/streamed one at a time. Paired with an adequate CPU (recent generation, 2-4 cores, 2.5+ GHz) it should offer good read and write throughput, but due to the monolithic disk layout it would suffer from low IO performance (esp. writing). Throughput would be expected to stay below 4x single-disk performance, and write IOPS in particular would be expected to be no more than that of a single disk (apart from cached reads, naturally). Rebuilds after a disk failure would naturally curb performance even further, but since only used blocks are replicated, a rebuild usually finishes much faster (depending on the fill rate of the pool) than 'usual' RAID rebuilds.
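Purely for illustration (pool and device names are placeholders; FreeNAS does the same through its volume manager GUI), creating such a pool on the command line would look roughly like this:

    # one RAIDZ2 vdev of six 6TB disks: ~24TB usable, any two disks may fail
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5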
To improve parallel read performance, you can add a 'performance SSD' (high IO, good throughput) as L2ARC, a second level of the intelligent read cache (ARC) that otherwise resides completely in RAM. That should greatly enhance read performance, but the L2ARC is 'emptied' on reboot, afaik. So after a reboot it would have to gradually 'refill', based on the 'working set' of files / pattern of access.
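Adding an L2ARC device to an existing pool is a one-liner (pool and device names are again just examples):

    # attach an SSD as L2ARC read cache to the pool 'tank'
    zpool add tank cache ada0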
Here's an example of a better parallel (read/write) performer:
* one pool containing 6 mirror vdevs of 3x 4TB disks each (i.e. every piece of data exists on three disks; during a rebuild one copy can be read for re-mirroring while another still serves read requests, reducing the load)
* 32GB RAM
* 2x 200GB+ L2ARC
* one or more 10GBit ethernet ports (again, add one for replication between boxes)
This setup should offer several times the (read and write) IO of the first one (data is spread over 6 mirrors instead of a single RAIDZ vdev), performance during rebuilds should be much better and rebuild times shorter (due to smaller disks). And the redundancy (ok-to-fail) is 2 disks - for each mirror. Naturally you have more disks in total -> more likely to have a failed disk at some point. But a rebuild is faster and has much less of an impact.
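Again just as a sketch with placeholder device names, the striped three-way-mirror layout would be created like this:

    # six three-way mirror vdevs, data striped across all of them (~24TB usable)
    zpool create tank \
        mirror da0  da1  da2  \
        mirror da3  da4  da5  \
        mirror da6  da7  da8  \
        mirror da9  da10 da11 \
        mirror da12 da13 da14 \
        mirror da15 da16 da17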
Naturally, the IO also depends on the disks: compare 10,000 rpm with <3ms seek time to 5,400 rpm with >12ms seek time, not to mention SSDs with a fraction of that.
Speaking of SSDs, there is also an option for speeding things up using a separate device for 'write logging' called SLOG (Separate LOG), usually one or more SSDs (or PCIe cards), but this is often misunderstood and thus used incorrectly. I won't delve further into this topic here, except for one point: it only comes into play for SYNCHRONOUS transfers (write transactions are acknowledged only once the data has actually been written to stable storage, e.g. disks - in short: 'I'm finished'), as opposed to asynchronous transfers (write transactions are acknowledged as soon as the data has been received, but part (or all) of it may still reside in cache/RAM waiting to be written to stable storage - in short: 'I shall do it ASAP'). When we're talking network shares for file storage, we are usually talking about asynchronous transfers. Without any 'tweaks', sync writes are always slower than async ones. If you need this kind of integrity, just come back and ask for more. ;-)
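Should you ever need it, adding a SLOG and controlling sync behaviour per dataset is simple (device and dataset names are examples; note that forcing or disabling sync trades performance against integrity and shouldn't be done lightly):

    # add a (preferably power-loss-protected) SSD as separate log device
    zpool add tank log ada1
    # control synchronous write behaviour per dataset
    zfs set sync=standard tank/media   # default: honour the client's sync requests
    zfs set sync=always tank/media     # treat every write as synchronous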
Almost forgot: to ensure data integrity, it is best to use ECC RAM (and a compatible mainboard and CPU) to avoid data corruption due to unnoticed faulty memory. In a production environment you would definitely not want that.
A few other features you might want to know about:
* ZFS is generally (but not always, sigh) compatible among distributions/OSes based on the same ZFS version (as long as no additional 'special features' are activated)
* several good 'inline' compression options - but probably not much use in your case (pre-compressed media, I suppose) - see the one-liners after this list
* integrity with auto-repair
* ZFS rebuild after disk failure only replicates live data on disk, not free space
* integration with Active Directory (for business use)
* FreeNAS has a built-in disk encryption option - best used with appropriate CPUs (acceleration) - but beware, it breaks compatibility with other distributions
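A couple of these (compression, checksum-based auto-repair) are again simple one-liners on the command line; FreeNAS exposes them as dataset options and periodic tasks. Names below are placeholders:

    # enable lightweight inline compression on a dataset
    zfs set compression=lz4 tank/media
    # read and verify all data on the pool, repairing bad blocks from redundancy
    zpool scrub tank
    zpool status tank    # shows scrub progress and any repaired / unrecoverable errors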
Ok, so much for a short write-up on a ZFS-based solution... I hope it offers more answers than it provokes new questions.
Regards,
Kjartan