Reliability
Reliability is one of the most important features of Minfx. We are dedicated to ensuring your data is always available and secure:
- We run multiple backends in distinct physical locations across multiple providers, so even if one of our cloud providers fails, we keep working just fine.
- We keep our services reliable during server upgrades and deployments by using blue-green staged rollouts. We finalize an upgrade only once it passes our reliability tests in production.
- Even services like Cloudflare have outages. To handle problems with networking, DNS resolution, or anything else even slightly outside our control, our client logs data locally and submits it to our backends as soon as it can connect again.
- In case of intermittent network failures, the client retries and keeps track of which data it still needs to push to the backend. Only after the backend acknowledges that an event has been safely stored does the client remove it from its local storage (see the sketch after this list).
- An interesting case is spot preemptions, which typically give between 30 seconds and 2 minutes to finish the current job (2 minutes on AWS, 30 seconds on GCP and Azure). The client will do its best to flush all logged data. If it can't, it falls back to writing to disk. If it can't even do that, we notify you that we couldn't salvage the data in this extreme situation.
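As a rough illustration of the client-side behavior described above, here is a minimal sketch of a durable outbox: events are written to local storage first, uploaded with retries, and deleted only after the backend acknowledges them, with a signal handler that flushes whatever it can when a preemption notice arrives. The endpoint URL, directory, and function names are all hypothetical and only illustrate the pattern, not the actual Minfx client API.

```python
import json
import os
import signal
import time
import urllib.request
import uuid

# Hypothetical endpoint and local storage path, for illustration only.
BACKEND_URL = "https://backend.example.com/events"
OUTBOX_DIR = os.path.expanduser("~/.minfx_outbox")


def log_event(event: dict) -> str:
    """Write the event to local storage first, so it survives crashes."""
    os.makedirs(OUTBOX_DIR, exist_ok=True)
    path = os.path.join(OUTBOX_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(event, f)
    return path


def push_pending(max_retries: int = 5) -> None:
    """Upload every pending event; delete a file only after the backend acks it."""
    os.makedirs(OUTBOX_DIR, exist_ok=True)
    for name in sorted(os.listdir(OUTBOX_DIR)):
        path = os.path.join(OUTBOX_DIR, name)
        with open(path, "rb") as f:
            payload = f.read()
        for attempt in range(max_retries):
            try:
                req = urllib.request.Request(
                    BACKEND_URL, data=payload,
                    headers={"Content-Type": "application/json"},
                )
                with urllib.request.urlopen(req, timeout=5) as resp:
                    if resp.status == 200:   # backend confirmed safe storage
                        os.remove(path)      # only now drop the local copy
                        break
            except OSError:
                time.sleep(2 ** attempt)     # exponential backoff, then retry


def on_preemption(signum, frame):
    """Best effort on a preemption notice: flush what we can.

    Anything that cannot be pushed in time stays on disk in OUTBOX_DIR
    and is retried on the next start.
    """
    push_pending(max_retries=1)


# Assumes the platform delivers SIGTERM to the process when the VM is reclaimed.
signal.signal(signal.SIGTERM, on_preemption)

if __name__ == "__main__":
    log_event({"metric": "loss", "value": 0.42, "step": 1})
    push_pending()
```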
Testing of Reliability
We have a small staging cluster running nonstop on real networks, with loggers logging to the backend and clients validating the integrity of the data. A "chaos monkey" runs around this cluster, randomly unplugging machines and/or individual nodes from the service cluster. This way, we make sure that our environment, infrastructure, and services can handle whatever the real world throws at us.
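For flavor, a chaos monkey can be as simple as a loop that periodically picks a random service instance and kills it. The sketch below assumes the staging services run as Docker containers carrying a `app=minfx-staging` label, which is purely an assumption for illustration, not a description of our actual setup.

```python
import random
import subprocess
import time

# Assumption: staging services run as Docker containers with this label.
LABEL = "label=app=minfx-staging"


def list_targets() -> list[str]:
    """Return the container IDs of all running staging services."""
    out = subprocess.run(
        ["docker", "ps", "-q", "--filter", LABEL],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def chaos_loop(interval_s: int = 600) -> None:
    """Every `interval_s` seconds, 'unplug' one random container."""
    while True:
        targets = list_targets()
        if targets:
            victim = random.choice(targets)
            subprocess.run(["docker", "kill", victim], check=False)
            print(f"chaos monkey killed {victim}")
        time.sleep(interval_s)


if __name__ == "__main__":
    chaos_loop()
```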
We always strive to do more, so in the future we are aiming to implement Deterministic Simulation Testing, akin to TigerBeetle's extreme engineering approach. You can think of it as fuzzing for distributed systems: running the whole system against a simulated clock and network lets us compress hundreds of years of execution into every day of testing. Adopting this matters because distributed systems are hard to model-check formally: reachability for timed automata is PSPACE-complete, so formal verification does not scale to realistic distributed systems*.
* Unless P=PSPACE, which is a stronger statement than P=NP
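To make the idea concrete, here is a toy sketch of deterministic simulation testing: all randomness comes from a single seed, time is virtual, and network faults (drops, latency, duplicates) are injected by the simulator, so any failing run can be replayed exactly from its seed. The node class and fault probabilities are invented for illustration; this is not how TigerBeetle's simulator or our future harness actually looks.

```python
import heapq
import random


class Simulator:
    """A toy deterministic simulator: one seeded PRNG, a virtual clock,
    and an event queue. Replaying the same seed replays the same run."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)   # the ONLY source of randomness
        self.now = 0.0                   # virtual time, no wall clock
        self.queue = []                  # (time, seq, callback)
        self.seq = 0

    def schedule(self, delay, callback):
        self.seq += 1
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback))

    def send(self, node, msg):
        """Deliver a message with random latency; sometimes drop or duplicate it."""
        if self.rng.random() < 0.05:                  # injected fault: drop
            return
        delay = self.rng.uniform(0.001, 0.5)          # injected fault: latency
        self.schedule(delay, lambda: node.on_message(msg))
        if self.rng.random() < 0.02:                  # injected fault: duplicate
            self.schedule(delay * 2, lambda: node.on_message(msg))

    def run(self, until=3600.0):
        while self.queue and self.now < until:
            self.now, _, callback = heapq.heappop(self.queue)
            callback()


class CounterNode:
    """A trivial 'replica' that just counts delivered events."""

    def __init__(self):
        self.delivered = 0

    def on_message(self, msg):
        self.delivered += 1


def simulate(seed: int) -> int:
    sim = Simulator(seed)
    node = CounterNode()
    for i in range(1000):
        sim.schedule(i * 0.01, lambda m=i: sim.send(node, m))
    sim.run()
    # An invariant check would go here; a violated assertion plus the seed
    # is enough to reproduce the exact failing execution.
    return node.delivered


if __name__ == "__main__":
    assert simulate(42) == simulate(42)   # deterministic: same seed, same result
    print(simulate(42), simulate(7))
```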