The Decision Nobody Before Me Would Make: Betting a SaaS Business on a Cloud Migration
Why I committed a SaaS business to migrating 100 live customers from IBM Cloud to Azure with no downtime window, and the FTP wall that nearly broke it.
The short version: I committed Sunrise to migrating its entire production estate from IBM Cloud to Azure, a hundred live customers with no acceptable downtime window, because the platform's AI future was gated behind getting onto Azure AI Foundry and modern tooling first. The business had discussed the move for years without committing. The hardest moment came when the planned topology routed FTP through an Azure Application Gateway, which is a layer-7 HTTP load balancer and cannot carry FTP's dual-channel passive-mode traffic. We pulled the Application Gateway out of that path, moved FTP onto a transport-layer route, and migrated all hundred customers with no migration-related outage anyone felt.
Migrating a SaaS platform's entire production estate from one cloud to another has no upside the customer can see. Done perfectly, nobody notices. Done badly, you've potentially ended the company. A hundred organisations run their service desks on the platform every working hour, and if the migration breaks, they don't experience a technical incident. They experience their service desk being gone. For a SaaS business, that's not a bad week. That's the kind of event a company doesn't come back from.
So the rational-looking move is the one the business had made for years: talk about it, agree it matters, and never actually commit. The migration from IBM Cloud to Azure had been discussed long before I arrived. What hadn't happened was anyone deciding to own it. One of the first real decisions I made was to commit to it. Properly, with a date, with my name on the outcome.
Here's why I made a decision that could have shattered the business, and what happened when the plan hit a wall halfway through.
Why take a business-ending risk at all?
You don't take a business-ending risk for a lateral move. You take it because standing still is the more expensive option once you look past the next quarter.
Everything I could see coming for this platform ran through AI. Not as a bolt-on, but as the actual direction of the product. And the AI future was going to be dramatically easier to build on Azure: native access to the AI tooling, Azure AI Foundry, the model infrastructure, the services you want sitting next to your application rather than reached across the public internet in some other provider's cloud. Modernising the apps themselves, the work of dragging a legacy estate into something maintainable, was also going to be smoother on a modern platform built for it.
Staying on the legacy cloud wasn't free. It looked free because the cost was deferred, paid later as every future piece of work got harder, every AI capability got bolted on awkwardly from the outside, every modernisation effort fought the platform instead of using it. The years of discussing the migration without doing it weren't years of avoiding the risk. They were years of the risk compounding quietly while everyone agreed it was important and nobody signed.
That's the real argument for committing. The danger isn't in the migration. The danger is in the platform you're stuck on if you never do it. Deciding wasn't the brave bit. Deciding was just refusing to keep paying interest on a decision the business had been dodging for years.
What makes a live SaaS migration hard?
The constraint that made this migration hard was a single, non-negotiable one: no downtime window.
Not a maintenance slot on a Sunday night. None. Service management is the system you reach for when something else has already broken. You can't take it offline to relocate it. The thing that catches everyone's outages can't become one. So the brief was to move a hundred live production customers onto entirely new infrastructure and never hand a single one of them a reason to notice it happened.
The planned Azure topology looked reasonable on the diagram. Most do. The trouble with network architecture is that the diagram is a set of promises about how traffic will behave, and traffic only tells you whether the promises were true once it's actually flowing.
Why does an Azure Application Gateway fail with FTP?
The first real failure came from an FTP path, and for a moment it looked like the thing that might prove the doubters right.
Several customers relied on FTP for inbound file transfer into the platform. Legacy, yes. Replaceable eventually, yes. But "eventually" is not "during a migration where nothing is allowed to break," so FTP had to work on day one in the new environment exactly as it had on the old one.
FTP is a genuinely awkward protocol to put behind modern cloud networking. It uses a separate control channel and data channel, and in passive mode the server tells the client, over the control channel, which port to open the data connection to, a port the client couldn't otherwise have known. Stateful firewalls and gateways have to actually understand the protocol to keep the two channels associated. That understanding is the Application Level Gateway, the ALG. When the ALG works, FTP flows. When it doesn't, the control channel connects, the user authenticates, everything looks healthy, and then the data transfer hangs because the data channel has nowhere legal to land.
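To make the passive-mode handoff concrete, here is a minimal sketch of how a client reads the server's reply to a PASV command. The reply format (code 227 followed by six comma-separated numbers, with the port computed as p1 * 256 + p2) comes from the FTP specification; the function name and the address in the example are illustrative assumptions, not anything taken from our environment.

```python
import re

# A 227 reply carries six numbers: four address octets, then two port bytes.
PASV_REPLY = re.compile(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)")


def parse_pasv_reply(line: str) -> tuple[str, int]:
    """Return the (host, port) the server wants the data connection opened to.

    Hypothetical helper for illustration only.
    """
    match = PASV_REPLY.search(line)
    if match is None:
        raise ValueError(f"not a passive-mode reply: {line!r}")
    h1, h2, h3, h4, p1, p2 = (int(group) for group in match.groups())
    host = f"{h1}.{h2}.{h3}.{h4}"
    port = p1 * 256 + p2  # the data port is split across two bytes
    return host, port


# Example reply with a made-up private address.
print(parse_pasv_reply("227 Entering Passive Mode (10,0,0,5,195,80)"))
```

The port only exists inside the text of the control channel. Anything in the middle that doesn't read that text has no way of knowing the second connection belongs to the first.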
The planned design routed this traffic through an Azure Application Gateway. On paper, fine. In practice, the Azure Application Gateway is an HTTP/HTTPS layer-7 load balancer. It is very good at the thing it's for. FTP's dual-channel passive-mode behaviour is not the thing it's for, and the ALG handling that FTP needs simply wasn't there to configure. The control channel came up. The data channel died. We had authenticated sessions that transferred nothing.
This is the moment a migration succeeds or quietly rots. You can spend two weeks trying to bend a component into a shape it was never built for, generating increasingly creative firewall rules to paper over a fundamental protocol mismatch. Or you can call it: the Application Gateway is the wrong tool for this path, and no amount of configuration changes what the tool is.
We called it. The Application Gateway came out of the FTP path entirely.
What carries FTP traffic if not the Application Gateway?
Once you accept that a layer-7 HTTP load balancer can't carry a stateful, dual-channel protocol like FTP, which needs its TCP connections passed through untouched, the replacement gets clearer. FTP traffic needed a path that operated at the transport layer, where the dual-channel relationship could be preserved without something in the middle trying to reinterpret it as web traffic.
That meant pulling the FTP flows onto a routing path that handled them at the network and transport level, with firewall rules written to permit the passive-mode data port range explicitly rather than hoping a gateway would infer it. The deviations from the original firewall plan weren't sloppiness. They were the specific, deliberate set of rules required once you stop pretending FTP is HTTP. Every deviation had a reason, and the reason was always the same: the planned topology assumed a protocol behaviour the real protocol doesn't exhibit.
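As a rough illustration of what "permit the passive-mode data port range explicitly" means, here is a small sketch of an allow-list check. The rule names and the 50000-50100 range are hypothetical placeholders, not the values we used, and this models the idea rather than any Azure API. It also assumes the FTP server is configured to hand out passive ports only from that same range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    """One inbound TCP allow rule covering an inclusive port range."""

    name: str
    first_port: int
    last_port: int

    def permits(self, port: int) -> bool:
        return self.first_port <= port <= self.last_port


# Hypothetical rule set: the control port plus an explicit passive data range.
FTP_RULES = (
    AllowRule("ftp-control", 21, 21),
    AllowRule("ftp-passive-data", 50000, 50100),
)


def is_permitted(port: int, rules: tuple[AllowRule, ...] = FTP_RULES) -> bool:
    """True if any rule explicitly allows inbound TCP to this port."""
    return any(rule.permits(port) for rule in rules)


for port in (21, 50000, 49999):
    print(port, is_permitted(port))
```

Nothing is inferred here. If the server offers a data port outside the declared range, the connection is refused, which is the behaviour you want to discover in testing rather than in production.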
The HTTP and HTTPS traffic, the actual web application customers spend their day in, stayed on the path it belonged on. The mistake would have been forcing one architecture to serve two protocols with incompatible needs. Different traffic, different paths. Obvious in hindsight. Most good architecture is.
How do you cut over a hundred live customers safely?
The architecture being right is necessary and not sufficient. You still have to move a hundred live customers across, and the cutover is its own discipline.
The principle we held to: every customer's cutover had to be reversible until the moment it was confirmed working, and no customer's migration could affect any other's. You migrate, you verify against real traffic, and only then do you commit. If verification fails, you're back on the old path before anyone's logged a ticket about it. Treat each customer as an independent transaction that either fully completes or fully rolls back, and a migration of a hundred stops being one business-ending event and becomes a hundred small controlled ones.
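The shape of that principle can be sketched in a few lines. This is an illustration under stated assumptions, not our actual tooling: `switch_to_new`, `verify`, and `roll_back` are hypothetical callables standing in for whatever routing change, traffic check, and reversal a real cutover would use.

```python
from typing import Callable, Iterable

Step = Callable[[str], None]
Check = Callable[[str], bool]


def cut_over(customer: str, switch_to_new: Step, verify: Check, roll_back: Step) -> bool:
    """Move one customer, verify, and roll back if verification does not pass."""
    switch_to_new(customer)
    try:
        ok = verify(customer)
    except Exception:
        ok = False  # a verification that crashes counts as a failed one
    if not ok:
        roll_back(customer)
    return ok


def migrate_all(
    customers: Iterable[str], switch_to_new: Step, verify: Check, roll_back: Step
) -> dict[str, bool]:
    """Cut customers over one at a time; one failure never touches the others."""
    return {
        customer: cut_over(customer, switch_to_new, verify, roll_back)
        for customer in customers
    }


# Toy run: a set stands in for "who is on the new path".
on_new_path: set[str] = set()
outcome = migrate_all(
    ["alpha", "beta", "gamma"],
    switch_to_new=on_new_path.add,
    verify=lambda customer: customer != "beta",  # pretend beta fails its checks
    roll_back=on_new_path.discard,
)
print(outcome, sorted(on_new_path))
```

In the toy run, the customer that fails verification ends up back on the old path while the other two stay migrated. That isolation is the whole point of treating each cutover as its own transaction.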
That's how you get a number I'm still quietly proud of. A hundred production customers, off legacy infrastructure, onto Azure, with no migration-related disruption that landed on a customer as a real outage. The work was enormous. The customer experience of the work was close to nothing. That gap, between how hard it was and how little anyone felt it, is the entire job.
What I'd take from it
The lesson isn't about FTP or ALGs. It's about the decisions a business avoids precisely because they're the ones that matter.
This migration sat undone for years not because it was hard to understand but because it was frightening to own. The downside was real and the upside was invisible, which is the exact profile of decision organisations defer indefinitely. But deferring it wasn't safety. The platform we'd have been stuck on was quietly getting more expensive to build on every month, and the AI direction the product needed was gated behind a migration nobody would sign.
Committing meant accepting that if the FTP wall, or any of the dozen other walls, had brought the whole thing down, it would have been my decision that did it. That's the cost of owning the call. You don't get the upside of the platform decision without carrying the downside of it going wrong.
A hundred customers moved. None of them felt it. We're on the platform the next decade of the product actually needs. And the reason any of that happened is that the decision finally had someone willing to put their name to it, including the part where it might not have worked.
That's the job. Not avoiding the risky decision. Being the person who'll own it, wall and all.