Hardly a day goes by in my travels and discourses that I don’t encounter an IT pro, new to an organization, who has discovered that an inherited server, application, or service doesn’t make sense in its current implementation. This usually manifests as the panicked exclamation, “Help! I just inherited a fill-in-the-blank and it’s a mess. What should I do?”
The answer to this, of course, is always, “It depends.” So should you ever find yourself in this situation (perish the thought) or maybe just feel altruistic enough to share this article with somebody you know in this situation, here’s some food for thought.
Why do you have a predecessor?
First, let’s recognize that you have a predecessor. That’s the first clue. Why do you have a predecessor? What happened to that predecessor? Was your predecessor invited to move on? Actually, these are questions you probably should have asked in the job interview, and the answers might have kept you from taking the job, but you’re here now, so let’s press on. The answers are all clues as to why the system is in the state it’s in, though of course they don’t necessarily help remediate the situation.
Are you an expert?
The second key question is an objective evaluation of your own expertise with the technology of interest. If you’re enough of an expert in the technology to determine authoritatively that the environment is a legitimate nightmare, well, you probably don’t need this article. This article is intended to help those who do not have expertise with their inherited messes.
Before you do anything with the system, become knowledgeable about the system. What does it do? How is it used? What is the basic architecture? How does it function? How is it normally implemented? Is there anything unique about this environment that requires an implementation that deviates from the norm? Also, consider the possibility that what you see as a mess may simply reflect your own lack of familiarity with the system. But, for the sake of argument, let’s assume not.
As a next step, find and read the product documentation. If there are communities of experts on the system, engage with those communities. Take note, however, that the point of engaging is not to ask them how to fix it, but rather to educate yourself about the system. Once you are educated, you’ll be in a position to understand why what exists is wrong and why a particular remediation step is right. Remember, the only thing worse than inheriting a messed-up system is breaking it even further because you thought it was messed up and just went ahead and “tried something.”
Make a plan
Once you’ve become familiar enough with the system, you can begin developing an educated and rational plan for dealing with your nightmare. Here are some suggestions for getting started:
System is offline
First, ask yourself if the system is currently online and functioning in some capacity, even if in a degraded capacity. If the answer is no, then start from scratch and build the system the way it should have been built to begin with. There’s absolutely no point expending effort trying to fix something that’s already dead.
System is partially functional
If the system is online and providing some functionality, then you’ll need to determine to what extent the system can be disrupted. Some systems can be disrupted without any noticeable impact, while some systems are mission critical and can’t be offline at all or only for a limited time.
System can be disrupted
If the system can be disrupted and the allowable disruption time is longer than the time it would take to rebuild the system, I say it’s always better to rebuild and know what you have than it is to try to repair it and hope you’ve caught all of the defects.
System cannot be disrupted
If the system cannot be disrupted, ask yourself if it is possible to build a replacement system side-by-side and then switch over to the new system. This gives you the advantage of building the replacement system correctly from the ground up while eliminating the risk of negatively impacting the services that are currently available.
System cannot be disrupted and a side-by-side system is not an option
If building a parallel system is not a viable option and you have to make repairs with the system online, which is the worst-case scenario, you need a written plan that outlines your execution step by step, and each step must change only one thing. Your plan should include tests to verify that each step has achieved its intended objective and has not negatively impacted the existing level of service. Do not perform the next step in the remediation plan until you have absolutely verified that the previous step completed successfully with no negative impact.
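For those who like to see a written plan expressed as something executable, here is a minimal Python sketch of the discipline described above: one change per step, a test after every step, and a hard stop at the first failure. The names `Step`, `run_plan`, and `service_ok`, along with the settings in the example, are hypothetical and exist only for illustration; a real plan would wrap whatever commands and checks your particular system calls for.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One entry in the written plan: a single change plus its test."""
    name: str
    apply: Callable[[], None]    # makes exactly one change
    verify: Callable[[], bool]   # True if the change achieved its objective


def run_plan(steps: List[Step],
             service_ok: Callable[[], bool]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; stop at the first one that fails verification.

    Returns (names of verified steps, name of the failed step or None).
    """
    completed: List[str] = []
    for step in steps:
        step.apply()
        # Check both the step's own objective and the existing service level.
        if not (step.verify() and service_ok()):
            return completed, step.name
        completed.append(step.name)
    return completed, None


if __name__ == "__main__":
    # Hypothetical stand-in for a live system: an in-memory settings dict.
    settings = {"ntp_server": "old.example.com", "log_level": "debug"}

    plan = [
        Step("point NTP at new server",
             lambda: settings.update(ntp_server="time.example.com"),
             lambda: settings["ntp_server"] == "time.example.com"),
        Step("lower log verbosity",
             lambda: settings.update(log_level="info"),
             lambda: settings["log_level"] == "info"),
    ]

    done, failed = run_plan(plan, service_ok=lambda: True)
    print("verified:", done)
    print("stopped at:", failed)
```

The sketch deliberately does nothing clever when a step fails. It simply refuses to continue, which leaves you with exactly one unverified change to investigate.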
In short, rebuild whenever you can, because that produces a system you know first-hand. Where rebuilding is not possible, remediate with exceptionally controlled efforts, testing and verifying after every action.
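The decision path laid out under “Make a plan” can also be sketched as a small Python function. This is only an illustration: the names `Approach` and `choose_approach` and the hour-based parameters are my own inventions for the example. It also makes one assumption the discussion above leaves implicit, namely that a system whose allowable downtime is shorter than the rebuild time is treated the same as a system that cannot be disrupted.

```python
from enum import Enum


class Approach(Enum):
    REBUILD_FROM_SCRATCH = "system is dead: rebuild it the way it should have been built"
    REBUILD_IN_PLACE = "take the outage and rebuild"
    BUILD_SIDE_BY_SIDE = "build a replacement in parallel, then switch over"
    REMEDIATE_ONLINE = "repair online, one verified change at a time"


def choose_approach(online: bool,
                    allowed_downtime_hours: float,
                    rebuild_hours: float,
                    parallel_possible: bool) -> Approach:
    """Pick a remediation approach from the questions asked in the plan."""
    if not online:
        # Nothing to preserve, so there is no point trying to repair it.
        return Approach.REBUILD_FROM_SCRATCH
    if allowed_downtime_hours > rebuild_hours:
        # The allowable disruption is longer than a rebuild would take.
        return Approach.REBUILD_IN_PLACE
    # Assumption: too little allowable downtime counts as "cannot be disrupted".
    if parallel_possible:
        return Approach.BUILD_SIDE_BY_SIDE
    return Approach.REMEDIATE_ONLINE


if __name__ == "__main__":
    # Hypothetical case: a live system, 2 hours of allowable downtime,
    # an 8-hour rebuild, and no capacity for a parallel build.
    print(choose_approach(True, 2, 8, False).value)
```

Writing the decision down this way forces you to put actual numbers on the allowable downtime and the rebuild time, which is worth doing even if you never run the code.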