Escaping the Rabbit Hole

Or: How To Function Without A Functioning System

“Why, sometimes I’ve believed as many as six impossible things before breakfast.”
― Lewis Carroll, Alice’s Adventures in Wonderland & Through the Looking-Glass

I love being faced with an impossible challenge. The limits of what is possible in IT are constantly expanding, and when something seems impossible there are almost certainly solutions available – solutions that aren’t visible to us yet.

The Setup

I was called into an emergency meeting by the Student Management team at a large university. They were weeks away from their busiest time of year, when prospective students would be sending in applications.

If a student has studied elsewhere, they may be offered credit towards their new course. Assessing this credit is complex, and most universities only allow students to request credit after they have already enrolled.

We had collaboratively built a streamlined CPS workflow, backed by historical data, course information, and decision precedents. The complexity was hidden from users, and assessing a request dropped from weeks to hours or minutes.

This allowed the university to offer credit to prospective students, who are then much more likely to apply rather than choosing a different university that might not grant them credit. This became a cornerstone of international and postgraduate student recruitment

… until we were told that the system would be down for around a week, right before enrollment applications began.

Not only would this cost millions in lost tuition fees, the delayed workload would create a knock-on effect throughout the entire enrollment period.

The Problem

Due to new legislation the central Student Management System needed to be upgraded before enrollments. As part of the upgrade, the Student Identity Management system needed to be upgraded. As part of upgrading the Student Identity Management system, the Staff Identity Management needed to be upgraded. During this week any of these systems could have outages without warning.

This meant staff couldn’t log in to process these requests. Even worse, form submissions depended on the same identity systems: students’ forms were submitted under their own accounts, and the public forms were submitted under a role-based account, so those would fail too. We did foresee this problem, so the student would be told that the submission had failed and that they needed to try again later – but that’s hardly ideal, and was only intended as a failsafe for short outages.

Updates to our system went through a release process that took a minimum of 2-4 weeks, which would take us well into the enrollment period. We also did not have direct write access to the database; writes could only be made via the form submission. We needed a solution now.

That meant:

  • The submission form would be regularly failing
  • Prospective students would be submitting this form
  • Staff login would be regularly failing
  • Staff needed to process the requests ASAP, which meant they needed access even when they couldn’t log in
  • Information necessary to process the requests would regularly be unavailable
  • We could not update any of the code
  • We could not update any of the text, either to warn or redirect to a different form

If this seems like a crazy situation, that’s because it was. Downtime during the upgrade was completely avoidable: core services such as identity management should be built to be highly available, load-balanced, and functional even while the systems using them are cut off from the rest of the network. The delay in releasing new code to our system could have been avoided with a better release process and the right tools.

However, we deal with the real world, and central IT had mandated these restrictions.

In other words:

  • The system is going to break
  • We cannot change the central upgrade plan so it won’t break the system
  • We cannot update the system so it won’t break
  • We cannot allow the system to break

Seems impossible? Surely we will need some compromise. An emergency meeting was called because all attempts to find a workable solution had already failed.

The Proposed Solution

In this emergency meeting I learned about the downtime, and after discussion we eliminated some early ideas, including a manual stopgap that was costly and wouldn’t provide much help. The Student Management team already understood its limits, but it was their best available solution to an impossible situation.

The Impossible Solution

So far, we’ve only been talking about systems and business needs. IT systems are built on a technology stack, and understanding this allows us to break down the problem and look for solutions at a deeper level.

Looking past our assumptions, we had three key problems to solve:

  • Students will be visiting a form that might not be functional. We need to ensure that they can submit their request.
  • The processing is complex, so the request needs to be submitted into and processed via the system.
  • Subject Matter Experts need to continue processing requests whether or not the system is available.

To address the first problem, I created a new form identical to the current form, which submitted the request to an SQL database. Now we could capture the student’s requests, even during peak submission.

We still needed students to reach this new form, but we could not update the old form to redirect to it, nor could we update links and search engine information. Updating the DNS for these webpages decoupled the current system from its known URL, allowing us to move the old site aside and put the new one in its place. Because the system had much more than this one form, we set the new website to redirect visitors to the appropriate URL at the new address. Anyone following a link to one of our webpages would arrive at the new website, where we could display appropriate messages or redirect to the main system. Other than possibly noticing that they had been redirected to a slightly different URL, visitors would find everything working as before.

We could now disable the redirection for credit requests, allowing prospective students to submit the new form during the downtime. Problem 1 had been solved.
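The routing rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code we ran: the host name, the form path, and the function name are all hypothetical.

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values: the real host names and paths are not given in this article.
MAIN_SYSTEM_HOST = "main-system.example.edu"
LOCAL_PATHS = {"/credit-request"}  # pages the temporary site serves itself


def route(url: str) -> Tuple[str, str]:
    """Return ("serve", path) for locally handled pages, else ("redirect", new_url)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    if path in LOCAL_PATHS:
        # Redirection is disabled for this page, so the new form handles it.
        return ("serve", path)
    # Keep the path and query intact so deep links and bookmarks still land correctly.
    target = urlunsplit(("https", MAIN_SYSTEM_HOST, parts.path, parts.query, ""))
    return ("redirect", target)
```

Adding a path to LOCAL_PATHS corresponds to disabling the redirect for that page. Everything else passes straight through to the main system at its new address.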

At this point submitting the request into the main system was also relatively straightforward – using the role-based account we could submit the exact same information in the exact same format as the previous form did. The catch was that this submission would still fail during the Identity Management System downtime.

Instead of submitting the request immediately when the form was submitted, this new webpage cached the request in a database first. With a little extra code the website would spot when the main system was down and queue the requests. Once the system was available again it would submit the requests with a timestamp of the actual submission date, so the final results had all the correct student submission information. Problem 2 had been solved.
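A minimal sketch of this store-and-forward idea, using Python's built-in sqlite3 module, might look like the following. The table layout, the function names, and the `submit` callback (a stand-in for the role-based submission to the main system) are assumptions made for illustration, not the original implementation.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, Optional


def open_queue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local holding table for incoming requests."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " submitted_at TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, payload: str,
            now: Optional[datetime] = None) -> None:
    """Store the request first, stamped with the moment the student submitted it."""
    submitted_at = (now or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO pending (submitted_at, payload) VALUES (?, ?)",
        (submitted_at, payload),
    )
    conn.commit()


def flush(conn: sqlite3.Connection, submit: Callable[[str, str], bool]) -> int:
    """Forward queued requests oldest first, stopping at the first failure.

    `submit(payload, submitted_at)` is hypothetical: it should return True
    only when the main system has accepted the request.
    """
    sent = 0
    rows = conn.execute(
        "SELECT id, submitted_at, payload FROM pending ORDER BY id"
    ).fetchall()
    for row_id, submitted_at, payload in rows:
        if not submit(payload, submitted_at):
            break  # main system is down: keep this row and all later ones queued
        conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        conn.commit()
        sent += 1
    return sent
```

Calling `flush` on a timer means a failed attempt costs nothing: the rows stay queued, and the original submission time travels with each request when it is finally delivered.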

The subject matter experts would process requests via the main system using information gathered from multiple other systems – including the Student Management System that was being upgraded.

If the Student Management System was down they would not have all the information needed to make the right decision. If the Staff Identity Management system was down they could not log in to process the requests. As these could be down at different times, we needed solutions for both.

Due to our solution to problem 2, we could check whether the Student Management system was available before creating the request in our system. Usually we would query information as the staff member needed it. As a temporary measure we added code to gather, format, and append that information onto the body of the request text. Staff could now continue whether or not the Student Management system was available.
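As a rough sketch of that temporary measure, the snippet below appends a snapshot of the student's record to the request text, and signals "not yet" when the record cannot be fetched. The `fetch_record` callback, the field names, and the snapshot heading are hypothetical; the article does not describe the real data or interfaces.

```python
from typing import Callable, Mapping, Optional


def build_request_body(request_text: str, record: Mapping[str, str]) -> str:
    """Append a plain-text snapshot of the student's record to the request."""
    lines = [f"{key}: {value}" for key, value in record.items()]
    return (
        request_text.rstrip()
        + "\n\n--- Student record snapshot ---\n"
        + "\n".join(lines)
    )


def prepare(
    request_text: str,
    student_id: str,
    fetch_record: Callable[[str], Optional[Mapping[str, str]]],
) -> Optional[str]:
    """Return the enriched body, or None if the record is unavailable.

    `fetch_record` is a stand-in for a lookup against the Student Management
    System; it is assumed to return None while that system is down.
    """
    record = fetch_record(student_id)
    if record is None:
        return None  # leave the request queued and try again later
    return build_request_body(request_text, record)
```

Because the request is only created once the snapshot has been captured, staff see everything they need in the request itself, even if the Student Management system goes down again afterwards.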

We added the ability for staff to log in to our new website (using our own authentication and a username/password we provided), view incoming submissions, and write notes. When the request was submitted to the system the notes would also be created and attached to the record. Although the subject matter experts were not able to actually process the requests, the downtime could still be used productively and with their notes those records could be processed by general admin staff once the system was available again. Downtime to the Staff Identity Management system was minimal, and ultimately this feature did not get used – but knowing we had this ability allowed the Student Management team to plan their staffing and workloads. Problem 3 had been solved.

After the upgrade was complete, the DNS for both the old URL and the new URL could be pointed to the existing system, and everything would continue working as before. Anyone who had bookmarked a page using the temporary URL would be redirected to the original URL.

Things went without a hitch. Students submitted their requests without interruption. There was downtime during heavy usage, and this temporary system handled tens of thousands of requests. Many staff did not even notice our changes, though some recognized that this system continued working while other systems throughout the university were failing.

Similar solutions were used to ensure other critical processes remained available. Less critical processes accepted submissions while the core systems were available, and showed a temporary downtime message when they were not. As term had not started and this was a quiet period for most of the rest of the university, the downtime was not an issue.

We’d managed to continue business as usual without losing any submissions, data, or work hours, despite the system becoming repeatedly, unexpectedly unavailable.

I love challenges like this because they force me to look for solutions outside my usual work. I’m proud of the outcome – more than that, I’m proud that I could help out some very stressed, good, kind people who thought they were facing a nightmare situation.

I truly believe that good IT costs nothing, because every dollar spent will save more than a dollar elsewhere in the business. A knowledgeable team making the right decisions would have ensured there was no downtime during upgrades, and avoided this fragility in interconnected systems where the downtime to one system cascaded into downtime to multiple other systems and uncountable costs throughout the university.

Even though it’s fun to find an impossible solution, I’d much rather design systems that avoid problems than face problems that are “impossible” to solve. But nothing’s impossible . . . so if you’re facing an impossible-looking problem, let us know: we’d be glad to help.

“Door: Why it’s simply impassible!
Alice: Why, don’t you mean impossible?
Door: No, I do mean impassible. (chuckles) Nothing’s impossible!”
― Lewis Carroll, Alice’s Adventures in Wonderland & Through the Looking-Glass