Storage

Of late I have been spending my work (and outside of work) time thinking about storage, specifically how to keep all of my Center’s digital stuff safe, secure, resilient, and available to us when and where we need it. Surely the stultifying stuff born to cure the native insomniac in all of us, right?

Maybe. It turns out to present an interesting set of problems, not just to data and IT professionals, but also to all of us who have an ever-bigger pile of digital stuff that we’d not like to lose.

What matters most, as it turns out, is the size of the pile. For most of us individuals, unless we have extensive digital image/music collections, we can probably keep most of our digital stuff on thumb drives, and carry around multiple copies in our pockets. Or we could put it up “in the cloud” using any one of a variety of services — Amazon, Google, Dropbox, and Flickr, to mention but a few. I believe that Flickr these days offers as much as 1 terabyte of storage for nothing or next to nothing.

But what if you have a big pile, and it’s getting bigger? Big as in 20 TB, which is not big by industrial standards, but big enough for the purposes of this exercise, and about what my Center is looking at. Let’s say that they are big image files (some upwards of 100GB), and we have more than half a million of them. So I have them on disk and want to be sure that if some bad thing happens to that disk (hardware failure, power outage, whatever), I can get that content back. Some of the questions are: where do I put it? How long does it take me to put it there? How much can I afford to lose if I do have a failure? How long will it take me to get it back? And how do I do this when my organization has tech-savvy users, but no real technologists, and no technical infrastructure of its own?

Some of those questions may not matter quite so much to individuals — hey, if it takes me a day or two to get back Aunt Hilda’s wicked hula pics from Hawaii, maybe I don’t care so much — as they do to businesses, where failure to recover quickly could prove costly or fatal. If I have a copy from yesterday, either on a thumb drive or on another system that’s still OK, recovery means copying that data back to the “good” system. The time to recover is the time it takes to make that copy, and how much I’ve lost (if anything) depends on when I last made the copy and what has changed since then.

Hardware to hardware (thumb drive to your new hard drive) is relatively quick, but still, try copying 100GB of data and time it. Hardware to hardware over a network (mounted network drive to new local storage) is constrained by bandwidth. For the average user who has a 50Mbps network (about as fast as you can get from your ISP), even if all of it were available to you (and it’s not), it would take about 4 hours to restore 100GB of data. And that 20TB number I mentioned above? More than a month, running 24/7.
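
Those figures come straight from dividing the data size by the link speed. Here’s a quick back-of-envelope sketch (assuming the full advertised rate, no protocol overhead, and no contention, none of which you get in practice):

```python
# Rough transfer-time estimate: size in bits divided by link rate in bits/second.
# Assumes the full advertised rate with no overhead, so real restores take longer.

def transfer_time_hours(size_gb: float, link_mbps: float) -> float:
    bits = size_gb * 1e9 * 8              # decimal GB -> bits
    return bits / (link_mbps * 1e6) / 3600

print(transfer_time_hours(100, 50))           # ~4.4 hours for 100GB at 50Mbps
print(transfer_time_hours(20_000, 50) / 24)   # ~37 days for 20TB at 50Mbps
```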

All of us who have done large back-ups know that you probably can’t run 24/7 for even a week unless you’re in a professionally managed data center and they know that they shouldn’t shut you down/reboot you/patch you/do anything that would adversely affect your restore. And what’s going on during that restore with the rest of your data/business, hmm? Even with the very fast network that I have available (more than 240Mbps), a complete restore, given the effective rate, would take a week and change.
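
Plugging the faster link into the same arithmetic bears that out (again assuming the full rate end to end, which you won’t actually see):

```python
# 20TB over a ~240Mbps link, assuming the full rate with no overhead.
seconds = 20e12 * 8 / 240e6
print(seconds / 86400)   # ~7.7 days -- "a week and change"
```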

So what are some of the ways to reduce risk/address these issues?

  1. Triage — determine what data is absolutely mission-critical and develop a plan to back it up/sync it/have it available on multiple, physically redundant sites over a network that you control or have managed for you, so that you can recover by pointing your system at the other storage sites. The best solutions in this area use a combination of hardware/software to accomplish this. They are not cheap, and require dedicated resources to manage.
  2. Hybrid (i.e., mix of hardware and cloud) storage solutions — after having done 1 (in some fashion), consider solutions that manage, from a single piece of hardware, data that lives both on local disk and in “the cloud”. In these solutions, heavily used data is kept locally, either in memory or on disk, while less-used data is relegated to cloud storage and only fetched when needed (see the sketch after this list). These are becoming less expensive, but still require management of the hardware and the network, and some thought towards using additional local systems for back-up.
  3. Pure Cloud — solutions where backup/sync storage is located entirely with a cloud provider are the least expensive, but also, as we have seen, the slowest, even on a very fast network. There are some good solutions in this space, but they need to be combined with the triage approach mentioned above.
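
To make option 2 a bit more concrete, here is a minimal sketch of the kind of tiering logic those hybrid appliances implement: keep recently used files on local disk, push everything else out to a cloud store, and pull files back only when someone asks for them. The CloudStore class, paths, and thresholds here are hypothetical stand-ins, not any particular vendor’s product:

```python
import os
import time

LOCAL_CACHE = "/var/hybrid-cache"    # hypothetical local tier
HOT_WINDOW = 30 * 86400              # "heavily used" = touched in the last 30 days

class CloudStore:
    """Hypothetical stand-in for a cloud object-store client."""
    def upload(self, local_path: str, key: str) -> None: ...
    def download(self, key: str, local_path: str) -> None: ...

def evict_cold_files(cloud: CloudStore) -> None:
    """Push files that haven't been read recently out to the cloud tier."""
    now = time.time()
    for name in os.listdir(LOCAL_CACHE):
        path = os.path.join(LOCAL_CACHE, name)
        if now - os.path.getatime(path) > HOT_WINDOW:
            cloud.upload(path, name)
            os.remove(path)

def fetch(cloud: CloudStore, name: str) -> str:
    """Return a local path for a file, pulling it back from the cloud if needed."""
    path = os.path.join(LOCAL_CACHE, name)
    if not os.path.exists(path):
        cloud.download(name, path)
    return path
```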

At the end of the process, we determined that the mission-critical data, about 300GB in our case, could be kept redundantly on local storage and synced to our cloud back-up solution. In a disaster, this 300GB could be recovered to local storage in a few hours, meeting the Center’s requirements. The rest of the data would also be local, and recovery could occur over time, as it is largely “archival” data that would not change much (if at all), and where recovery could be done selectively, at a much faster pace for urgently needed data, and then in a more leisurely fashion for the rest. So the Pure Cloud play, with some twiddles, looks like it’s going to be the best and most cost-effective option for us in the near term (the next year).
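
As a rough illustration of that selective recovery (the names and the restore_file helper below are hypothetical, not our actual tooling), the idea is simply to pull the mission-critical set back first and trickle the archival bulk in afterwards; at the ~240Mbps rate above, the 300GB critical tier alone comes back in roughly 2.8 hours under ideal conditions:

```python
# Hypothetical sketch of a selective restore: mission-critical data first, archive later.
# restore_file() stands in for whatever copies one object back from the cloud back-up.

def restore_file(key: str) -> None:
    ...  # e.g., download the object from cloud storage to local disk

def selective_restore(manifest: list[tuple[str, bool]]) -> None:
    """manifest holds (key, is_critical) pairs for everything in the back-up."""
    critical = [key for key, urgent in manifest if urgent]
    archival = [key for key, urgent in manifest if not urgent]
    for key in critical:    # first pass: the ~300GB the Center can't work without
        restore_file(key)
    for key in archival:    # second pass: the bulk archival data, at leisure
        restore_file(key)
```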

Right now I think the hybrid solutions hold the most promise, but they still require technical management, which is not something my Center has readily available or that my University is currently ready to provide. And if you’re still awake at this point, good for you. Hope you found it interesting, and just consider what this problem looks like at a larger scale: hundreds of TB, petabytes (1,000 TB), or even exabytes (1,000 PB). At that point, I think I’d want to be in the storage hardware business rather than trying to figure it all out; that’s the problem that Amazon, Google, and their ilk will need to solve.
