Reliability
Reliability is one of the most important features of Minfx. We are dedicated to ensuring your data is always available and secure:
- We run multiple backends in distinct physical locations across multiple providers, so even if one of our cloud providers fails, we keep working just fine.
- We keep our services reliable during server upgrades and deployments by using blue-green staged rollouts. We finalize an upgrade only once it passes our reliability tests in production.
- Even services like Cloudflare have outages. To handle problems with networking, DNS resolution, or anything else even slightly outside our control, our client logs data locally and submits it to our backends as soon as it can connect again.
- In case of intermittent network failures, the client retries and keeps track of which data it still needs to push to the backend. Only after the backend acknowledges that an event has been safely stored does the client remove it from its local storage (see the sketch after this list).
- An interesting case is spot preemptions, which typically give between 30 seconds and 2 minutes to finish the current job (2 minutes on AWS, 30 seconds on GCP and Azure). The client will do its best to flush all logged data. If it can't, it falls back to writing to disk. If it can't even do that, we notify you that we couldn't salvage the data in this extreme situation.
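As a rough illustration of the client-side behavior described above, here is a minimal sketch of a durable outbox: events are written to local storage first, uploaded with retries, and deleted only after the backend acknowledges them, with a signal handler that flushes whatever it can when a preemption notice arrives. The endpoint URL, directory, and function names are all hypothetical and only illustrate the pattern, not the actual Minfx client API.

```python
import json
import os
import signal
import time
import urllib.request
import uuid

# Hypothetical endpoint and local storage path, for illustration only.
BACKEND_URL = "https://backend.example.com/events"
OUTBOX_DIR = os.path.expanduser("~/.minfx_outbox")


def log_event(event: dict) -> str:
    """Write the event to local storage first, so it survives crashes."""
    os.makedirs(OUTBOX_DIR, exist_ok=True)
    path = os.path.join(OUTBOX_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(event, f)
    return path


def push_pending(max_retries: int = 5) -> None:
    """Upload every pending event; delete a file only after the backend acks it."""
    os.makedirs(OUTBOX_DIR, exist_ok=True)
    for name in sorted(os.listdir(OUTBOX_DIR)):
        path = os.path.join(OUTBOX_DIR, name)
        with open(path, "rb") as f:
            payload = f.read()
        for attempt in range(max_retries):
            try:
                req = urllib.request.Request(
                    BACKEND_URL, data=payload,
                    headers={"Content-Type": "application/json"},
                )
                with urllib.request.urlopen(req, timeout=5) as resp:
                    if resp.status == 200:   # backend confirmed safe storage
                        os.remove(path)      # only now drop the local copy
                        break
            except OSError:
                time.sleep(2 ** attempt)     # exponential backoff, then retry


def on_preemption(signum, frame):
    """Best effort on a preemption notice: flush what we can.

    Anything that cannot be pushed in time stays on disk in OUTBOX_DIR
    and is retried on the next start.
    """
    push_pending(max_retries=1)


# Assumes the platform delivers SIGTERM to the process when the VM is reclaimed.
signal.signal(signal.SIGTERM, on_preemption)

if __name__ == "__main__":
    log_event({"metric": "loss", "value": 0.42, "step": 1})
    push_pending()
```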
Testing of Reliability
We have a small staging cluster running nonstop on real networks, with loggers logging to the backend and clients validating the integrity of the data. A "chaos monkey" runs around this cluster, randomly unplugging machines and/or individual nodes from the service cluster. This way, we make sure that our environment, infrastructure, and services can handle whatever the real world throws at us.
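For flavor, a chaos monkey can be as simple as a loop that periodically picks a random service instance and kills it. The sketch below assumes the staging services run as Docker containers carrying a `app=minfx-staging` label, which is purely an assumption for illustration, not a description of our actual setup.

```python
import random
import subprocess
import time

# Assumption: staging services run as Docker containers with this label.
LABEL = "label=app=minfx-staging"


def list_targets() -> list[str]:
    """Return the container IDs of all running staging services."""
    out = subprocess.run(
        ["docker", "ps", "-q", "--filter", LABEL],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def chaos_loop(interval_s: int = 600) -> None:
    """Every `interval_s` seconds, 'unplug' one random container."""
    while True:
        targets = list_targets()
        if targets:
            victim = random.choice(targets)
            subprocess.run(["docker", "kill", victim], check=False)
            print(f"chaos monkey killed {victim}")
        time.sleep(interval_s)


if __name__ == "__main__":
    chaos_loop()
```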
We always strive to do more, so in the future we are aiming to implement Deterministic Simulation Testing, akin to TigerBeetle's extreme engineering approach. You can think of it as fuzzing for distributed systems: running the whole system against a simulated clock and network lets us compress hundreds of years of execution into every day of testing. Adopting this matters because distributed systems are hard to model-check formally: reachability for timed automata is PSPACE-complete, so formal verification does not scale to realistic distributed systems*.
* Unless P=PSPACE, which is a stronger statement than P=NP
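To make the idea concrete, here is a toy sketch of deterministic simulation testing: all randomness comes from a single seed, time is virtual, and network faults (drops, latency, duplicates) are injected by the simulator, so any failing run can be replayed exactly from its seed. The node class and fault probabilities are invented for illustration; this is not how TigerBeetle's simulator or our future harness actually looks.

```python
import heapq
import random


class Simulator:
    """A toy deterministic simulator: one seeded PRNG, a virtual clock,
    and an event queue. Replaying the same seed replays the same run."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)   # the ONLY source of randomness
        self.now = 0.0                   # virtual time, no wall clock
        self.queue = []                  # (time, seq, callback)
        self.seq = 0

    def schedule(self, delay, callback):
        self.seq += 1
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback))

    def send(self, node, msg):
        """Deliver a message with random latency; sometimes drop or duplicate it."""
        if self.rng.random() < 0.05:                  # injected fault: drop
            return
        delay = self.rng.uniform(0.001, 0.5)          # injected fault: latency
        self.schedule(delay, lambda: node.on_message(msg))
        if self.rng.random() < 0.02:                  # injected fault: duplicate
            self.schedule(delay * 2, lambda: node.on_message(msg))

    def run(self, until=3600.0):
        while self.queue and self.now < until:
            self.now, _, callback = heapq.heappop(self.queue)
            callback()


class CounterNode:
    """A trivial 'replica' that just counts delivered events."""

    def __init__(self):
        self.delivered = 0

    def on_message(self, msg):
        self.delivered += 1


def simulate(seed: int) -> int:
    sim = Simulator(seed)
    node = CounterNode()
    for i in range(1000):
        sim.schedule(i * 0.01, lambda m=i: sim.send(node, m))
    sim.run()
    # An invariant check would go here; a violated assertion plus the seed
    # is enough to reproduce the exact failing execution.
    return node.delivered


if __name__ == "__main__":
    assert simulate(42) == simulate(42)   # deterministic: same seed, same result
    print(simulate(42), simulate(7))
```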