A look behind Qonversion’s most reliable in-app subscription management service and SDKs for mobile apps
Subscription management is a mission-critical component of subscription-based app businesses. The cost of outages can be high. Poor end-user experience and the loss of end-user trust, though, can be even more costly.
This is why the reliability of an app’s subscription management infrastructure should be a priority when choosing a service to manage in-app subscriptions for any mobile app. In this post, we will delve into the methods that helped us achieve and guarantee a remarkable uptime of 99.99% for Qonversion’s subscription management service.
Qonversion In-App Subscription Management Uptime: 99.99%
At Qonversion, reliability has been our priority from the start. Even so, it took several short-term outages for us to realize that we needed to rethink our approach and introduce new fallback options and a backup infrastructure to guarantee the uptime of this mission-critical service. We now have several protection mechanisms that help us avoid downtime and let us offer an SLA for our in-app subscription management service. These mechanisms can be divided into three levels:
- Internal infrastructure reliability,
- External backend-replacement service,
- SDK-level protection.
Let’s dig deeper into each of these levels and see what we have done to guarantee a 99.99% uptime to our clients.
Internal Infrastructure Reliability
“Internal infrastructure” – it’s a broad term, but we don’t need to examine all of its facets here. For this article’s purposes, it’s sufficient to think about “internal infrastructure” as a core element of our service. As a core element, it warrants the corresponding level of protection from any disaster. Our infrastructure predominantly consists of databases and computing clusters. Let’s talk about the reliability of these mechanisms.
All our data is split across several databases, each of which has at least one replica. If a primary database has a problem, its replica automatically takes its place. Replicas are kept as current as possible: the maximum replication lag is 3 seconds.
Each database is backed up daily. If for some reason we lose data in any of the databases, we can quickly recover it from the backups in a semi-automatic way. We check the integrity of the backups once a day to ensure that they are in working order.
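As an illustration of what a daily integrity check can look like, here is a minimal Python sketch that compares a backup's SHA-256 digest with the digest recorded when the backup was taken. This is an assumption made for illustration, not a description of Qonversion's actual tooling; `sha256_of`, `backup_is_intact`, and the sample data are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a backup payload."""
    return hashlib.sha256(data).hexdigest()


def backup_is_intact(backup: bytes, expected_digest: str) -> bool:
    """A backup passes the check if its digest matches the recorded one."""
    return sha256_of(backup) == expected_digest


# Hypothetical example data: the digest is recorded when the backup is made.
snapshot = b"subscriptions table dump"
digest_at_backup_time = sha256_of(snapshot)

# Later, the daily check recomputes the digest and compares.
intact = backup_is_intact(snapshot, digest_at_backup_time)
corrupted = backup_is_intact(snapshot + b"\x00", digest_at_backup_time)
```

A digest comparison only detects corruption of the stored file. A fuller check would also restore the backup into a scratch database and query it.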
What about the computing power our service runs on? Our service relies on several machines combined into a Kubernetes cluster. All the machines in the cluster are interchangeable, which means that if any of them fails, the others take over its load. That redistribution is automated.
We’ve also created an identical second cluster, which works in tandem with the first one and adds redundancy to our computing power. If one cluster goes down, the other covers for it and takes on the entire load. This might increase request handling time, but the requests will still be handled, which is much better than a steady “503”, isn’t it?
Finally, Qonversion uses a microservice architecture: every module of our service runs as a separate microservice, and every microservice has its own replicas. As with the K8s clusters, a microservice and its replicas share the load and automatically redistribute it if one of them is terminated.
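The failover idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production routing: it tries clusters in order instead of splitting traffic between two healthy clusters, and `ClusterDown`, `route`, `primary`, and `secondary` are hypothetical names.

```python
from typing import Callable, Optional, Sequence


class ClusterDown(Exception):
    """Raised by a (hypothetical) cluster handler that cannot serve a request."""


def route(request: str, clusters: Sequence[Callable[[str], str]]) -> str:
    """Try each cluster in order and return the first successful response."""
    last_error: Optional[ClusterDown] = None
    for handle in clusters:
        try:
            return handle(request)
        except ClusterDown as err:
            last_error = err  # this cluster is unavailable; try the next one
    raise ClusterDown("no cluster available") from last_error


def primary(request: str) -> str:
    # Simulates the first cluster being down.
    raise ClusterDown("primary cluster is down")


def secondary(request: str) -> str:
    # Simulates the second cluster taking on the entire load.
    return f"200 OK: {request}"


response = route("GET /entitlements", [primary, secondary])
```

The caller still gets a normal response even though the first cluster failed. An error surfaces only when every cluster is down.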
External Backend-Replacement Service
In the previous section, I described a few things that ensure our services’ reliability. But let’s assume, for the sake of argument, that something critical happens that leaves our internal infrastructure unable to handle incoming requests. For this unlikely case, we have developed a completely independent service that can replace our API when needed. It is built on Cloudflare infrastructure: Workers, storage, queues, and so on. We’ve named it Aegis.
You might suppose that this service would respond with automated template responses, but you’ll be pleased to know that it in fact acts the same way our main API does. To achieve this, we prepare all the necessary caches in Cloudflare’s storage from our database while it is up.
Later, if we face an outage on our API, we redirect all incoming traffic to a specially developed Cloudflare Worker, which can handle all the main requests using the cached data. It also stores all incoming requests in queues so that they can be resent to the API once it has recovered, bringing our database state up to date. Once again, everything is built on Cloudflare infrastructure, which itself commits to a 100% uptime SLA.
Aegis can now handle all of the most common API requests from our SDKs, and the number of supported endpoints is growing.
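The pattern described here (answer from a prepared cache, queue every request, replay the queue after recovery) can be sketched as follows. Aegis itself runs on Cloudflare Workers; this Python version only illustrates the pattern, and `FallbackService`, its methods, and the sample data are hypothetical.

```python
from collections import deque
from typing import Callable


class FallbackService:
    """Serves requests from a cache snapshot and queues them for later replay."""

    def __init__(self, cache: dict):
        self.cache = dict(cache)  # snapshot prepared while the main API was up
        self.pending = deque()    # requests to resend once the API recovers

    def handle(self, user_id: str, request: dict) -> dict:
        """Answer from cached data and remember the request for replay."""
        self.pending.append((user_id, request))
        return self.cache.get(user_id, {"entitlements": []})

    def replay(self, send: Callable[[str, dict], None]) -> int:
        """Resend queued requests in order; return how many were delivered."""
        replayed = 0
        while self.pending:
            user_id, request = self.pending[0]
            send(user_id, request)  # if this raises, the request stays queued
            self.pending.popleft()
            replayed += 1
        return replayed


# During an outage: respond from the cache.
fallback = FallbackService({"user-1": {"entitlements": ["premium"]}})
answer = fallback.handle("user-1", {"type": "check_entitlements"})

# After recovery: replay the queued requests against the main API (a list here).
received = []
count = fallback.replay(lambda uid, req: received.append((uid, req)))
```

A request is removed from the queue only after `send` returns, so a failure during replay leaves the remaining requests queued for the next attempt.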
When we talk about such a big service, questions about its quality inevitably arise. To ensure that everything works as expected, we’ve covered all of our code with unit tests (yes, we have achieved the coveted 100% test coverage) and integration tests that run daily for both Android and iOS platforms (Web is coming). We have also run several manual outage tests, ranging in duration from several minutes to nearly an hour, to validate all the steps needed to make outages imperceptible to our clients both in the moment and after recovery.
SDK-Level Protection
If you think that the above isn’t sufficient to ensure reliability, then let me introduce you to our Mobile SDKs’ offline mode. This is the SDK-level protection against any kind of API outage. It comes into play when neither our API nor Aegis is responding, and is capable of handling requests itself.
How does it work? Normally, our SDKs cache the necessary information about clients’ products, offerings, entitlements, and so on, on the device. If our infrastructure later faces an outage and the SDKs are unable to reach the API, they use these caches to respond to requests. Among other things, this makes it possible for end users to make purchases even if our API is down. This was previously impossible, since without any information about the product being purchased we couldn’t launch the purchase flow. After a successful purchase, the SDKs also grant users entitlements based on the cached information and store the request so it can be resent to the API when it becomes available again. All of this means that your end users won’t lose their paid access when we face an outage; they will gain paid access after a new purchase; and they will continue to use your app as usual, unaware of any issues.
The mobile SDKs’ offline mode is time-tested; it performed well during previous short-term outages (while Aegis was not yet in service), making these outages imperceptible to end users and saving our clients money. The only requirement for offline mode to work is at least one successful launch, so the only users it cannot cover are new users who first launch the app during an outage.
You can read more about the offline mode in our documentation, including the configuration setting and the SDK versions that support it. Since part of our reliability improvements are based on the SDKs, which we are constantly improving, it’s important to keep the SDK versions up to date in your apps.
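Here is a minimal Python sketch of that flow, assuming a simplified client that caches a product-to-entitlement mapping on launch. It is not the Qonversion SDK API: `OfflineCapableClient`, `FakeApi`, and all method names are hypothetical, and the store purchase itself is assumed to have succeeded.

```python
class ApiUnavailable(Exception):
    """Raised when the (hypothetical) backend cannot be reached."""


class FakeApi:
    """Stand-in backend with an on/off switch to simulate an outage."""

    def __init__(self):
        self.up = True
        self.received = []

    def fetch_products(self) -> dict:
        if not self.up:
            raise ApiUnavailable()
        return {"monthly": "premium"}  # product id -> entitlement id

    def send_purchase(self, product_id: str) -> None:
        if not self.up:
            raise ApiUnavailable()
        self.received.append(product_id)


class OfflineCapableClient:
    def __init__(self, api: FakeApi):
        self.api = api
        self.products = {}        # on-device cache, filled on a successful launch
        self.entitlements = set()
        self.pending = []         # purchases to resend after recovery

    def launch(self) -> None:
        try:
            self.products = self.api.fetch_products()
        except ApiUnavailable:
            pass  # keep whatever was cached on an earlier launch

    def purchase(self, product_id: str) -> str:
        if product_id not in self.products:
            raise LookupError("no cached product data; one successful launch is required")
        try:
            self.api.send_purchase(product_id)
        except ApiUnavailable:
            self.pending.append(product_id)  # resend when the API is back
        entitlement = self.products[product_id]
        self.entitlements.add(entitlement)   # grant access from cached data
        return entitlement

    def sync(self) -> None:
        still_pending = []
        for product_id in self.pending:
            try:
                self.api.send_purchase(product_id)
            except ApiUnavailable:
                still_pending.append(product_id)
        self.pending = still_pending


api = FakeApi()
client = OfflineCapableClient(api)
client.launch()                       # successful launch fills the cache
api.up = False                        # outage begins
granted = client.purchase("monthly")  # still works, using cached data
queued_during_outage = list(client.pending)
api.up = True                         # outage ends
client.sync()                         # stored request is resent
```

A client whose first launch happens during the outage has an empty cache, so `purchase` raises `LookupError`. That mirrors the one limitation of offline mode: it needs at least one successful launch.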
Let’s summarize what Qonversion has done to guarantee reliability and accuracy for your application:
- Database replicas and backups with daily integrity checks
- Multiple interchangeable K8s clusters with multiple interchangeable machines inside
- Microservices with interchangeable replicas
- Backend-replacement service based on Cloudflare infrastructure that is completely autonomous and fully test-covered
- Mobile SDKs’ offline mode
With all these measures, we aim to protect our clients from any possible outage. They allow us to commit to 99.99% uptime for our clients. We are also introducing an SLA with that commitment for Growth and Enterprise plans. 99.99% uptime means that we guarantee our services will be unavailable for no more than about 53 minutes per year. If we fail to meet that service level, we will credit our customers.
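The 53-minute figure follows directly from the uptime percentage. A quick Python check, assuming a 365-day year (`downtime_budget_minutes` is a name introduced here for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Maximum downtime per year allowed by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


budget = downtime_budget_minutes(99.99)  # roughly 52.56 minutes
```

Rounded up, that is the 53 minutes per year quoted above. By the same formula, 99.9% uptime would allow about 526 minutes, or nearly nine hours.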
If you’re looking for the most stable and reliable platform to handle the complexities of in-app subscription management, Qonversion is your best choice. With Qonversion, you can be confident that you have a platform that can help you manage your subscriptions seamlessly, allowing you to focus on growing your app and providing excellent service to your users.
Implementing Qonversion is fairly quick and simple. You can add in-app subscriptions to your app in as little as one hour, and you don’t have to invest resources in building and maintaining all of the required infrastructure. You get a solution that covers all the corner cases of cross-platform subscription management. You don’t have to worry about complex integrations or spend time learning how to manage your subscriptions. Furthermore, you get the best subscription analytics and an advanced set of growth tools, including a flexible A/B testing module built specifically for subscription apps.
If you have any questions about the reliability of our infrastructure or our product itself, feel free to contact us. We’ll be happy to assist!