How to Evaluate a Cloud Provider Before Migration: Technical Due Diligence for CTOs

Martin Klein

Technical due diligence is not about checking the cloud provider’s storefront. It is about testing real scenarios: what happens during peak load, an outage, data recovery, bill growth, an SLA dispute, a security audit, and a cloud exit two or three years later.

A provider may look strong during the selection stage: a low price for virtual machines, a familiar set of services, a “99.9%” SLA, and a fast response from sales. But problems often appear later: outbound traffic sharply increases the bill, limits interfere with scaling, support does not escalate the incident to engineers, and backup recovery takes longer than the business can tolerate.

The evaluation should start not from the pricing page, but from the company’s requirements:

  • Which systems are business-critical;
  • Which RTO/RPO targets are required for recovery;
  • Which network, security, region, and compliance requirements exist;
  • Which limits may block growth;
  • How support works during a real incident;
  • What the full operating cost consists of;
  • How the provider assists with exit or data migration.

The main result of due diligence is not a formal “the provider is suitable,” but a risk map: what is confirmed by documents, what has been verified in a pilot, what must be fixed in the contract, and which limitations the company consciously accepts.

Evaluate Real Scenarios, Not the Cloud Storefront

A provider may look convincing during the selection stage: a low price for virtual machines, a familiar service portfolio, a “99.9%” SLA, and a fast response from the sales team. But for a CTO, the presentation matters less than how the platform behaves under production conditions.

Problems usually appear after migration. The bill grows because of outbound traffic, backups, logs, and cross-zone data transfer. Support responds quickly, but does not provide an engineering solution during a critical incident. Quotas block scaling before a peak. Data recovery turns out to be slower than the business expected. And cloud exit terms become clear only when the company is already dependent on the provider’s managed services and formats.

Technical due diligence is needed to verify whether the provider can meet the company’s real requirements in the situations that matter: an outage, peak load, a contractual dispute, a security audit, and a possible exit in two or three years.

The evaluation should follow a clear sequence: first the business criticality of systems, then risks, cost, technical resilience, security, support, and pilot operation, and only after that the formalization of terms in the contract. Without this sequence, selection quickly turns into a comparison of prices and SLA percentages, even though the real risk lies in operational details.

Start with System Requirements, Not the Pricing Page

The same SLA, storage volume, or network limit may be acceptable for an analytics environment and unacceptable for a customer-facing service with contractual obligations. That is why, before the first technical call with a provider, the CTO should prepare not a “cloud wishlist,” but a system map.

This map should define critical parameters: which services are being migrated, who on the business side owns the requirements, what downtime is acceptable, what data loss is tolerable, where data must be stored, which integrations exist, and what load growth is expected over the next 1–3 years.
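One way to keep the system map comparable across services is to record it as structured data. The sketch below is a minimal illustration: the `SystemProfile` class, its field names, and the sample systems are hypothetical and are not taken from any provider's tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemProfile:
    """One row of the internal system map (all field names are illustrative)."""
    name: str
    business_owner: str
    max_downtime_minutes: int         # acceptable downtime per incident
    max_data_loss_minutes: int        # tolerable data loss, measured in time
    data_regions: List[str] = field(default_factory=list)
    integrations: List[str] = field(default_factory=list)
    expected_growth_pct: float = 0.0  # expected load growth over the planning horizon


def by_criticality(systems: List[SystemProfile]) -> List[SystemProfile]:
    """Order systems from the least to the most downtime-tolerant."""
    return sorted(systems, key=lambda s: s.max_downtime_minutes)


# Hypothetical example entries.
systems = [
    SystemProfile("analytics", "Head of BI", 480, 1440, ["eu-central"], ["warehouse"], 30.0),
    SystemProfile("customer-api", "Head of Product", 15, 5, ["eu-central"], ["billing", "crm"], 80.0),
]

for s in by_criticality(systems):
    print(s.name, s.max_downtime_minutes, s.max_data_loss_minutes)
```

Sorting by acceptable downtime puts the systems that deserve the strictest provider checks at the top of the list.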

For recovery, two indicators should be defined in advance:

  • RTO — the acceptable time to restore a service after an incident;
  • RPO — the acceptable amount of data loss measured in time.

A basic matrix helps connect internal requirements with provider evaluation:

| Area | What to Define Internally | What to Check with the Provider | Risk |
| --- | --- | --- | --- |
| Availability | Service criticality and acceptable downtime | SLA, maintenance windows, exclusions | A high SLA does not eliminate business risk |
| Data | RTO/RPO and retention periods | Backups, recovery testing, data return | Data exists, but recovery takes too long |
| Network | Latency, channels, integrations | Connectivity, bandwidth, resilience | Integrations work unreliably |
| Security | Regulatory constraints and audit requirements | Certifications, encryption, access controls, logs | The provider does not pass audit |
| Limits | Peaks, growth, resource requirements | Quotas, API limits, expansion procedure | Scaling hits hidden constraints |
| Support | Required response and escalation criticality | Channels, response time, engineer access | The incident gets stuck at first-line support |
| Cost | Resources, traffic, storage | Full cost model | The budget is calculated by VM, but paid across the ecosystem |
| Exit | Volumes, formats, migration timelines | Export, deletion, exit cost | Changing providers becomes a separate project |


The conclusion is simple: without an internal requirements matrix, the provider’s answers cannot be properly evaluated. “99.9%,” backups, or 24/7 support may look acceptable until they are matched against the criticality of a specific system.

After the internal matrix is ready, the next step is the SLA and contract. This is where it becomes clear what the provider actually promises, how availability is measured, and which events are excluded from its responsibility.

SLA and Contract: Check Measurability, Exclusions, and Liability

The first thing to examine is what exactly the provider promises in the SLA. An SLA is a contractual obligation, while an SLO is a target service level the provider aims to meet operationally. Actual reliability depends not only on the provider’s promises, but also on the customer’s architecture: placement across zones, redundancy, backups, and network connectivity.

The figure “99.9% availability” tells a CTO very little on its own. What matters is which services it applies to, how downtime is calculated, and what happens if the obligation is breached.
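To make the percentage tangible, it helps to convert it into a downtime budget and to see how exclusions change the reported figure. The sketch below is a simplified illustration: the 30-day period, the function names, and the rule that excluded minutes are simply subtracted from counted downtime are assumptions, since each provider defines its own calculation method.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime allowed by an availability percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def reported_availability_pct(period_minutes: float,
                              outage_minutes: float,
                              excluded_minutes: float = 0.0) -> float:
    """Availability after excluded events (assumed rule: excluded minutes do not count)."""
    counted = max(outage_minutes - excluded_minutes, 0.0)
    return 100 * (1 - counted / period_minutes)


for pct in (99.0, 99.9, 99.95, 99.99):
    print(f"{pct}% allows {downtime_budget_minutes(pct):.1f} min per 30 days")

# Hypothetical month: 120 minutes of unavailability, 90 of them scheduled maintenance.
period = 30 * 24 * 60
print(round(reported_availability_pct(period, 120), 2))      # what the customer experienced
print(round(reported_availability_pct(period, 120, 90), 2))  # what the SLA report may show
```

Over 30 days, 99.9% corresponds to about 43 minutes of allowed downtime. In the hypothetical month above, the customer experienced roughly 99.72% availability, while the figure after exclusions is roughly 99.93%, which would still satisfy a 99.9% commitment.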

SLA review can be reduced to several practical questions:

| What to Check | Why It Matters |
| --- | --- |
| Which services are covered by the SLA | Not all managed services, networks, storage systems, and APIs may be included in the obligations |
| How availability is calculated | The calculation method is more important than the percentage itself |
| What counts as downtime | Full outage, degradation, API errors, and loss of connectivity may be interpreted differently |
| Which events are excluded | Scheduled maintenance, external networks, and customer actions may be excluded from the SLA |
| How maintenance windows are announced | Unlimited maintenance windows reduce the value of the SLA |
| What compensation is provided | Service credits do not always cover real business damage |
| Who records the incident | It matters whether the customer can use its own monitoring metrics |
| Whether RTO/RPO are documented | Availability does not replace data and service recovery requirements |


A red flag is an SLA where exclusions are broader than obligations: scheduled maintenance is not limited, network issues are placed outside the provider’s responsibility, managed services are not covered, compensation is available only as service credits, and only if the customer submits a claim within a short period.
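The gap between service credits and business damage is easy to estimate before signing. The sketch below is purely illustrative: the credit tiers, the monthly fee, and the hourly loss figure are hypothetical and must be replaced with the values from the actual contract and the company's own revenue data.

```python
# Hypothetical tiers: (minimum availability %, credit as % of the monthly fee).
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25), (0.0, 50)]


def service_credit(monthly_fee: float, availability_pct: float) -> float:
    """Credit owed under the hypothetical tier table above."""
    for threshold, credit_pct in CREDIT_TIERS:
        if availability_pct >= threshold:
            return monthly_fee * credit_pct / 100
    return 0.0


def business_loss(outage_hours: float, loss_per_hour: float) -> float:
    """Rough business damage, assuming a constant loss per hour of outage."""
    return outage_hours * loss_per_hour


# Hypothetical incident: 99.5% availability in a 30-day month is 3.6 hours of downtime.
credit = service_credit(monthly_fee=11200, availability_pct=99.5)
loss = business_loss(outage_hours=3.6, loss_per_hour=5000)
print(f"credit ${credit:,.0f} vs estimated loss ${loss:,.0f}")
```

With these assumed numbers the credit is $1,120 against an estimated loss of $18,000, which is why compensation terms should be read together with the company's own damage estimate.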

The conclusion for the CTO is that the SLA should be read as a legal and technical mechanism. If it does not define service boundaries, calculation methodology, exclusions, compensation, and monitoring data, the availability percentage does not create a manageable risk model.

After the SLA, the technical foundation should be checked: whether the platform can handle networking, limits, recovery, and load growth not on paper, but in real operation.

Technical Resilience: Network, Limits, Backups, and Recovery

Contractual availability will not help if the network degrades under load, quotas prevent resources from being provisioned quickly, and backup recovery has never been tested. That is why technical resilience should be evaluated across three areas.

Network

For the network layer, request the connectivity architecture, availability zones, redundant routes, VPN or private connection parameters, and DDoS protection terms. Latency and bandwidth should be checked separately for integrations with offices, data centers, partners, and other clouds.

During the pilot, it is worth running a network load test, not only inside the cloud but also across the real routes used by the application. For a B2B service with external customers, it is important to test not the average latency during a quiet period, but how the network behaves under peak traffic.
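The difference between average and peak behavior shows up most clearly in latency percentiles. The sketch below uses a simple nearest-rank percentile and made-up latency samples in milliseconds; in a real pilot the samples would come from the load-testing tool, and the choice of percentile is an assumption to agree on internally.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical round-trip latencies in milliseconds.
quiet = [12, 13, 12, 14, 13, 12, 15, 13, 12, 14]
peak = [14, 15, 16, 15, 90, 17, 16, 250, 15, 18]

for label, samples in (("quiet", quiet), ("peak", peak)):
    print(label, "mean", round(mean(samples), 1),
          "p50", percentile(samples, 50),
          "p95", percentile(samples, 95))
```

In the made-up peak sample the median stays at 16 ms while the 95th percentile reaches 250 ms, which an average taken during a quiet period would never reveal.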

Limits and Scaling

The CTO should obtain current quotas for compute, memory, disks, I/O operations, network, snapshots, load balancers, managed services, and APIs. It is important not only to see the limit, but also to test the procedure for increasing it: who approves it, how long it takes, and whether it can be done before a peak.

The behavior when a limit is reached should also be clarified separately: resource creation failure, degradation, manual request, or automatic expansion. A red flag is when scaling a critical system depends on an informal request to support.
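A quota review can be reduced to a comparison between current limits and the projected peak plus headroom. The sketch below is an illustration under assumptions: the resource names, the quota values, and the 20% safety margin are hypothetical, and real quotas should be taken from the provider's console or API.

```python
import math


def quota_gaps(quotas, projected_peak, margin_pct=20):
    """Return how much each quota falls short of projected peak plus a safety margin."""
    gaps = {}
    for resource, needed in projected_peak.items():
        required = math.ceil(needed * (100 + margin_pct) / 100)
        limit = quotas.get(resource, 0)  # a missing quota entry is treated as zero
        if limit < required:
            gaps[resource] = required - limit
    return gaps


# Hypothetical current quotas and projected peak demand.
quotas = {"vcpu": 512, "load_balancers": 10, "snapshots": 200}
projected_peak = {"vcpu": 480, "load_balancers": 6, "snapshots": 150}

print(quota_gaps(quotas, projected_peak))
```

Any non-empty result is a list of quota increases to request and time during the pilot, well before the peak itself.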

Backups and Recovery

A backup is not the same as a restored business service. The review should cover backup frequency, retention period, geography, isolation, encryption, and the recovery procedure.

For critical systems, a trial recovery procedure is mandatory: restore the service or its replica from a backup, measure the actual time, verify data integrity, and compare the result with the target RTO/RPO. If the provider is not ready to support such a test before production migration, recovery risk remains unverified.
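The result of a trial recovery can be recorded as three timestamps and compared with the targets. The sketch below is a minimal illustration: the `RecoveryTest` class, the timestamps, and the target values are hypothetical, and "restored" is assumed to mean that the service passed its health checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTest:
    incident_at: datetime     # simulated moment of failure
    last_backup_at: datetime  # timestamp of the newest restorable backup
    restored_at: datetime     # moment the restored service passed its health checks

    @property
    def actual_rto(self) -> timedelta:
        """Time from the incident until the service was usable again."""
        return self.restored_at - self.incident_at

    @property
    def actual_rpo(self) -> timedelta:
        """Window of data lost: everything written after the last restorable backup."""
        return self.incident_at - self.last_backup_at


def evaluate(test: RecoveryTest, target_rto: timedelta, target_rpo: timedelta) -> dict:
    """Compare measured recovery figures with the targets."""
    return {
        "rto_met": test.actual_rto <= target_rto,
        "rpo_met": test.actual_rpo <= target_rpo,
    }


# Hypothetical trial: backup at 09:20, failure at 10:00, service back at 13:30.
trial = RecoveryTest(
    incident_at=datetime(2024, 1, 10, 10, 0),
    last_backup_at=datetime(2024, 1, 10, 9, 20),
    restored_at=datetime(2024, 1, 10, 13, 30),
)
print(trial.actual_rto, trial.actual_rpo)
print(evaluate(trial, target_rto=timedelta(hours=2), target_rpo=timedelta(hours=1)))
```

In this made-up trial the data loss window of 40 minutes fits a one-hour RPO, but the 3.5-hour recovery misses a two-hour RTO, so the backup exists and the business requirement is still not met.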

Even if technical resilience looks convincing, security should be checked separately. Certifications are useful, but they do not replace analysis of the specific architecture, access rights, keys, and logs.

Security and Compliance: Do Not Reduce the Review to Certificates

A certificate is useful, but it does not replace technical due diligence. It confirms compliance within a defined scope, not the security of the customer’s specific architecture, roles, keys, and access settings.

The review should be built around the shared responsibility model. In infrastructure as a service, the provider is responsible for physical infrastructure and the basic cloud layer, while the customer is responsible for operating systems, configurations, access, and data. In platform services, some operational tasks move to the provider, but access policies, encryption, logs, and architectural decisions still remain within the CTO’s area of responsibility.

It is worth requesting not only an ISO certificate or SOC report, but evidence of specific controls: access management, MFA, role model, encryption at rest and in transit, key management, administrator activity logs, customer isolation, vulnerability management, change management, and incident response.

For certifications, the provider should clarify their validity, scope, covered services, and covered regions. For regulated data, storage and processing geography, provider personnel access, data processing obligations, and audit rights should be documented separately.

The outcome of the security review should not be a folder of certificates, but clearly defined responsibility boundaries, available reports, a list of controls, and requirements for logging, encryption, and incident handling.

Support: Verify Engineering Availability, Not Just the 24/7 Label

24/7 support does not mean that, at a critical moment, the customer will reach an engineer capable of diagnosing network degradation, a storage issue, or a managed service failure. Sometimes “round-the-clock support” only means first-line ticket intake, not full engineering escalation.

The review should cover not only first response time, but the entire incident path: communication channels, severity levels, status update timelines, escalation procedure, availability of an engineering line, communication language, technical account manager coverage, and the boundary between basic and paid support.

During the pilot, it is worth opening several real tickets: a technical network question, a quota increase request, a degradation incident, and a billing question. This shows not only response speed, but also routing quality: whether the request remains with first-line support or quickly reaches a specialized engineer.
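The pilot tickets are most useful when two numbers are recorded for each: time to first response and time until a specialized engineer joins. The sketch below shows one way to compute them; the function name, the timestamps, and the 60-minute engineering target are hypothetical.

```python
from datetime import datetime


def ticket_timings(opened_at, first_response_at, engineer_joined_at):
    """Minutes to first response and to first contact with a specialized engineer."""
    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "first_response_min": minutes(first_response_at - opened_at),
        "engineer_min": minutes(engineer_joined_at - opened_at),
    }


# Hypothetical degradation ticket opened during the pilot.
timings = ticket_timings(
    opened_at=datetime(2024, 1, 10, 14, 0),
    first_response_at=datetime(2024, 1, 10, 14, 8),
    engineer_joined_at=datetime(2024, 1, 10, 16, 45),
)
engineer_target_min = 60  # assumed internal target for critical tickets
print(timings, timings["engineer_min"] <= engineer_target_min)
```

Here the first response arrives in 8 minutes, but an engineer joins only after 165 minutes, which is the pattern of fast intake without engineering escalation described above.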

For critical systems, it is important to agree in advance who from the provider participates in emergency communication, how a shared incident channel is opened, how often status is updated, and what information the customer receives after closure: timeline, root cause, and measures to prevent recurrence.

The main criterion is simple: support must be tested before migration, not during the first real incident. If the provider cannot show a clear escalation and engineering communication model, the 24/7 promise remains a storefront claim, not a managed process.

Real Operating Cost: Calculate the Workload Scenario, Not the Resource Price

The price of a virtual machine or a terabyte of storage does not show the real cost of operation. A cloud bill consists of many line items: compute, disks, snapshots, backups, outbound traffic, cross-zone data transfer, load balancers, NAT, public IP addresses, logs, metrics, managed databases, licenses, support, and redundancy.

That is why cost should be evaluated not by a single resource, but across several operating scenarios. The CTO should request calculations for at least three situations:

  1. A normal month;
  2. A peak month;
  3. An emergency scenario with recovery or data migration.

Below is an illustrative example of a cost structure. The figures are provided only for demonstration; actual rates should be taken from the provider’s commercial proposal.

| Cost Item | Normal Month | Peak Month | Emergency Scenario |
| --- | --- | --- | --- |
| Compute | $4,000 | $6,500 | $8,000 |
| Disks and storage | $1,800 | $2,200 | $2,600 |
| Backups and snapshots | $900 | $1,200 | $1,800 |
| Outbound and cross-zone traffic | $1,600 | $4,200 | $7,000 |
| Logs and metrics | $700 | $1,100 | $1,300 |
| Support | $1,000 | $1,000 | $1,000 |
| Licenses | $1,200 | $1,200 | $1,200 |
| Total | $11,200 | $17,400 | $22,900 |


This example shows why a low starting price may be unrepresentative. In the emergency scenario, costs are about twice as high as in a normal month ($22,900 against $11,200) because of traffic, temporary duplication of resources, recovery, and additional storage. These scenarios should be discussed before migration, not after the first large bill.
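The totals in the illustrative table can be reproduced with a few lines of code, which also makes it easy to swap in figures from a real commercial proposal. The numbers below are the demonstration values from the table above, not real rates, and the structure of the `COSTS` dictionary is only one possible layout.

```python
# Illustrative figures copied from the cost table above (USD per month).
SCENARIOS = ("normal", "peak", "emergency")
COSTS = {
    "Compute": (4000, 6500, 8000),
    "Disks and storage": (1800, 2200, 2600),
    "Backups and snapshots": (900, 1200, 1800),
    "Outbound and cross-zone traffic": (1600, 4200, 7000),
    "Logs and metrics": (700, 1100, 1300),
    "Support": (1000, 1000, 1000),
    "Licenses": (1200, 1200, 1200),
}


def scenario_totals(costs):
    """Sum every cost item per scenario."""
    return {
        name: sum(row[i] for row in costs.values())
        for i, name in enumerate(SCENARIOS)
    }


totals = scenario_totals(COSTS)
traffic = COSTS["Outbound and cross-zone traffic"]
for i, name in enumerate(SCENARIOS):
    ratio = totals[name] / totals["normal"]
    share = traffic[i] / totals[name]
    print(f"{name}: ${totals[name]:,} ({ratio:.2f}x normal, traffic {share:.0%} of bill)")
```

On these figures the emergency month costs about 2.04 times the normal month, and traffic grows from roughly 14% to roughly 31% of the bill.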

It is also worth checking expenses that often appear only after launch: outbound traffic to users, partners, and other clouds; cross-zone and cross-region data transfer; storage of backups, snapshots, and archives; long-term retention of logs and metrics; growth in managed service costs; licenses; extended support; recovery tests; a reserve site; and the cost of exiting the cloud.

The result of the review should be a full operating cost model for 12–36 months, with assumptions, growth limits, and separate scenarios for normal operation, peak load, an incident, and cloud exit.
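A multi-year model can start from a monthly baseline and an explicit growth assumption. The sketch below compounds a monthly cost over a planning horizon; the 3% monthly growth rate is a hypothetical assumption, and the baseline reuses the illustrative normal-month total from the example above.

```python
def projected_spend(monthly_cost: float, monthly_growth_pct: float, months: int) -> float:
    """Total spend over a horizon if the monthly bill grows by a fixed percentage."""
    total = 0.0
    cost = monthly_cost
    for _ in range(months):
        total += cost
        cost *= 1 + monthly_growth_pct / 100
    return total


baseline = 11200  # illustrative normal-month total in USD
for months in (12, 24, 36):
    flat = projected_spend(baseline, 0, months)
    growing = projected_spend(baseline, 3, months)  # assumed 3% monthly growth
    print(f"{months} months: flat ${flat:,.0f}, with growth ${growing:,.0f}")
```

Peak months, incidents, and exit costs can be added on top as separate line items, so that each assumption stays visible in the model.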

Exit Terms: Check Them Before Signing the Contract

Exit terms should be reviewed before migration, not when the business has already decided to change providers. The more actively a company uses managed services, provider-specific APIs, proprietary backup formats, and the provider’s networking tools, the more expensive the way back may become.

For the CTO, it is important to understand in advance not only whether the company can retrieve its data, but exactly how that process will work: in which formats databases, disks, objects, logs, and configurations can be exported; how long it will take to export the current and projected data volume; and how much outbound traffic, temporary storage, and parallel operation of two environments will cost.
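Export time and egress cost can be estimated from the data volume before any contract is signed. The sketch below is a rough back-of-the-envelope model: the 200 TB volume, the 10 Gbps link, the 70% effective throughput, and the $0.05 per GB egress rate are all hypothetical and should be replaced with measured and quoted values.

```python
def export_estimate(data_tb: float, link_gbps: float,
                    egress_usd_per_gb: float, efficiency: float = 0.7) -> dict:
    """Estimate transfer time and egress cost for a full data export."""
    data_gb = data_tb * 1000                  # decimal terabytes to gigabytes
    effective_gbps = link_gbps * efficiency   # assumed share of the link actually achieved
    seconds = data_gb * 8 / effective_gbps    # gigabytes to gigabits, then divide by rate
    return {
        "hours": seconds / 3600,
        "egress_usd": data_gb * egress_usd_per_gb,
    }


estimate = export_estimate(data_tb=200, link_gbps=10, egress_usd_per_gb=0.05)
print(f"{estimate['hours']:.1f} hours, ${estimate['egress_usd']:,.0f} egress")
```

With these assumptions the export takes about 63.5 hours of continuous transfer and costs about $10,000 in egress alone, before temporary storage and the parallel operation of two environments are counted.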

It is also worth checking how long the data remains available after contract termination, how deletion is confirmed, whether access can be extended for the migration period, and which provider obligations for exit assistance can be documented in writing.

Exit terms should cover not only data, but also architecture. If the application relies heavily on the provider’s unique managed services, migration may require reworking code, data schemas, pipelines, monitoring, and access policies. This does not mean such services cannot be used, but the cost of dependency should be assessed in advance.

After reviewing exit terms, the next practical step is the pilot. It should validate not the provider’s presentation, but how the platform behaves in real scenarios.

Minimum Set of Pilot Test Scenarios

The pilot should test future operations, not a demo environment. If the test is limited to launching the application and running a few manual requests, it does not show how the platform will behave under peak load, failure, recovery, or support interaction.

A minimum set of scenarios can be summarized as follows:

| Scenario | What to Check |
| --- | --- |
| Load | Latency, throughput, load balancers, limits, and application behavior as traffic grows |
| Failure | System response to the unavailability of a component, zone, network, or managed service |
| Data recovery | Restoring the service from a backup and comparing actual RTO/RPO with target values |
| Support | Response to tickets of different severity, escalation, and quality of engineering answers |
| Cost | Actual resource consumption and alignment with the cost model |
| Exit | Test export of critical data, including format, speed, and export cost |


If the provider passes only a functional check, but not failure, network, financial, and exit scenarios, the migration decision remains incomplete. The pilot should give the CTO not the feeling that “everything has started,” but a set of measurable results: where the platform met the requirements, where contractual clarifications are needed, and where the risk remains unacceptable.

Ready-Made List of Questions for a Cloud Provider

The final list of questions is best structured by due diligence area. It should be adapted to the internal requirements matrix, system criticality, and pilot results.

Service and Operations Terms

SLA and contract. Which services are covered by the SLA, how is availability calculated, what counts as downtime, which events are excluded, what compensation is provided, and are RTO/RPO documented in writing?

Network. How are availability zones, redundant routes, VPN or private connections, and DDoS protection organized? What limits apply to load balancers, NAT, public IP addresses, and inter-region traffic?

Backups and recovery. How often are backups created, where are they stored, who manages the keys, how long does recovery take, and can a trial recovery be performed before production migration?

Limits and scaling. What quotas apply to compute, memory, disks, network, APIs, and managed services? How quickly can they be increased, and what happens when a limit is reached?

Support. Which communication channels, severity levels, response times, escalation procedures, engineering lines, technical account managers, and extended support options are available?

Risks, Cost, and Exit

Security and compliance. Which certificates and reports are available, which services and regions are covered by them, and how are MFA, IAM, encryption, key management, administrator logs, and incident response implemented?

Operating cost. How are normal, peak, and emergency months calculated? How much do traffic, backups, snapshots, logs, metrics, support, licenses, a reserve site, and load growth cost?

Exit terms. In which formats are data and configurations exported, how much does export cost, how long does data remain available after contract termination, how is deletion confirmed, and can a test export be performed in advance?

This list does not replace architectural review, but it covers the main areas of technical due diligence. The provider’s answers should be compared with the internal requirements matrix, pilot results, and contract wording. Only then can the CTO see not the marketing storefront, but the real profile of risks, costs, and responsibilities.

Conclusion

The result of technical due diligence is not a formal “suitable / unsuitable” decision, but a map of risks, terms, and checks that should be carried into the contract, SLA, migration plan, and pilot operation.

A provider can be considered properly evaluated only when its promises are confirmed by documents, technical tests, and a clear responsibility model. For the CTO, the main outcome is predictability: what will happen during an outage, load growth, an SLA dispute, an audit, bill growth, or cloud exit.

If these scenarios are analyzed before migration, the cloud becomes a manageable infrastructure platform. If not, it may turn into a new source of technical, financial, and contractual uncertainty.

FAQ

Which documents should be requested from a cloud provider before migration?

Request the SLA, the contract and its appendices, a description of the responsibility model, support procedures, backup and recovery policy, security documents, current certifications, pricing description, and contract termination terms.

Is a 99.9% SLA enough for critical systems?

No. It is necessary to check which services the SLA covers, how downtime is calculated, which exclusions apply, who records the incident, and what compensation is provided. For critical systems, RTO and RPO should be documented separately.

How can support be checked beyond the 24/7 promise?

Clarify communication channels, response times by severity level, escalation procedure, availability of an engineering line, communication during large-scale incidents, and the cost of extended support. Ideally, this should be tested with a real ticket during the pilot.

Which costs are most often underestimated during cloud migration?

Outbound traffic, backup and snapshot storage, logs and metrics, load balancers, NAT, public IP addresses, licenses, support, redundancy, migration work, and the later cost of exiting the cloud.

Why should exit terms be checked before entering?

Because dependence on managed services, APIs, backup formats, networking services, and the access model can make changing providers expensive and slow. Before signing the contract, the company should understand how to retrieve its data and how much it will cost.

Is a pilot needed before production migration?

Yes. A pilot tests not the presentation, but the platform’s actual behavior: load, network, backup recovery, limits, support, traffic costs, and the selected architecture.

Sources

1. OMG / Cloud Standards Customer Council — Practical Guide to Cloud Service Agreements V3.0


2. Cloud Security Alliance — Cloud Controls Matrix / CAIQ


3. ISO/IEC 19086 — Cloud computing service level agreement framework
