Not knowing anything about hard drives, I am wondering how a cloud service provider monitors its hard drives for problems (data corruption, loss of data, hard drive failure, etc.). Searching Google doesn't reveal much other than "download your hard drive manufacturer's repair kit and press repair". I would like to know what is happening in that repair process and, better yet, how a cloud provider regularly monitors the health of its hard drives. I read somewhere that Backblaze does some sort of daily SMART stats analysis to see how their hard drives are doing, but I'm not really sure what that means.
We use Smartmontools to capture the SMART data.
The repo is here, but I would like to know what Smartmontools is actually doing. Not necessarily in detail, just a quick outline. I can't really tell from the repo what it does.
What I imagine would happen (to monitor a hard drive) is this: create a database with an MD5 hash of every file, then periodically scan the entire hard drive and compare each file's checksum against the saved MD5 hash. But this seems like it would be very slow, especially on multi-terabyte hard drives. I am not really sure what kinds of failures you can have, or what kinds of notifications you can get. Maybe you can use file system events, but I'm not sure how that would work on an external drive rather than on the main machine. And even if it did work on the external hard drive, I'm not sure it would fire when data got corrupted because of the device getting old. So it seems the only way to check that the data is correct is to actually compare the current data with the old data. But other than a checksum, I'm not really sure what efficient way there is to do this.
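To make the checksum idea concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. The function names (`file_md5`, `build_manifest`, `verify`) are my own, made up for illustration, and the "database" is just an in-memory dictionary rather than anything persistent.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return (missing, changed) relative paths compared with a saved manifest."""
    root = Path(root)
    missing, changed = [], []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            missing.append(rel)
        elif file_md5(target) != expected:
            changed.append(rel)
    return missing, changed
```

The manifest could be saved with `json.dump` after the first scan and loaded again before each later `verify` call. Two limitations are visible right away: every verification pass has to read every byte on the drive, which is the slowness I'm worried about, and a changed hash cannot by itself distinguish silent corruption from a legitimate edit to the file.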
The main thing you would want from the monitoring process is to know when a drive is starting to degrade, so you can get ready to replace it. Repairing a drive is a whole other topic that I don't understand either, but I won't ask about that here. I would just like to know how you typically monitor an external hard drive, and how you know when it is starting to not work correctly (i.e. how you can efficiently tell that data has been corrupted or lost, and things like that).
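For the "is the drive starting to fail" part, my rough understanding is that `smartctl` (the command-line tool that ships with Smartmontools) can print the drive's SMART attribute table, and that version 7.0 and later can emit it as JSON. The sketch below is only a guess at how an app might use that. It assumes the JSON layout `ata_smart_attributes.table` with `id`, `name`, `value`, `thresh` and `raw.value` fields, that `smartctl` is installed and run with enough privileges, and that the drive is ATA/SATA; the helper names are hypothetical.

```python
import json
import subprocess


def read_smart(device):
    """Run `smartctl -A -j <device>` and return the parsed JSON report.

    smartctl's exit status is a bitmask that can be non-zero even when it
    produced a report, so the return code is deliberately not checked here.
    """
    result = subprocess.run(
        ["smartctl", "-A", "-j", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def failing_attributes(report):
    """Names of attributes whose normalized value is at or below threshold.

    A threshold of 0 marks an informational attribute, so it is skipped.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return [
        row["name"]
        for row in table
        if row["thresh"] > 0 and row["value"] <= row["thresh"]
    ]


def raw_value(report, attribute_id):
    """Raw counter for one attribute ID (e.g. 5), or None if not reported."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for row in table:
        if row["id"] == attribute_id:
            return row["raw"]["value"]
    return None
```

A daily job could call `read_smart("/dev/sda")` (the device path is a placeholder), log `raw_value(report, 5)` over time, and raise an alert if `failing_attributes(report)` is ever non-empty. Whether this is anything like what a cloud provider actually does is exactly what I'd like to find out.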
This seems to offer some information.
Instead of just being told "just apply x technology", I would like to know how to actually implement it, as an app or something similar, or at least the basics to get started.