The platform is well monitored via a combination of New Relic and CloudWatch metrics.
Both monitoring tools are configured to raise alerts:
Via SMS for senior team members
Via Slack for all relevant team members
Via email for all relevant team members
Alerts are configured across a variety of health points, such as:
HTTP ping endpoints
Database replication lag
Application error rates
Scheduled task throughput
Server resource usage and capacity
Typically we would expect the relevant staff members to be aware of a critical system issue before it is reported by users.
There is also a separate emergency support escalation process for customers, which allows direct SMS messaging to support team staff, who can then escalate to the technical team.
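As an illustration, one of the health-point alerts described above might be defined in CloudWatch along the following lines. This is a minimal sketch only; the region, metric namespace, threshold and SNS topic ARN are placeholders rather than our actual configuration.

```python
# Minimal sketch of a CloudWatch alarm feeding an SNS topic (which in turn could
# fan out to SMS, Slack and email). All names, thresholds and ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # assumed region

cloudwatch.put_metric_alarm(
    AlarmName="platform-error-rate",            # placeholder alarm name
    Namespace="Platform/Application",           # placeholder custom metric namespace
    MetricName="ErrorRate",                     # placeholder application error metric
    Statistic="Average",
    Period=300,                                 # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=5.0,                              # placeholder error-rate threshold (%)
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",               # missing data itself indicates a problem
    AlarmActions=[
        "arn:aws:sns:eu-west-1:123456789012:ops-alerts",  # placeholder SNS topic
    ],
)
```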
Distributed smartlinks (AWS)
Platform (UKFast)
Distributed smartlinks are the primary tool through which end user transactions are created and therefore have the highest priority. They have been designed to be largely independent of the Platform, so in most cases end users would still be able to generate transactions even if the Platform were unavailable.
The Platform primarily consists of:
Order management and routing
Product management
Supplier workflow and fulfilment
Downtime of the Platform is therefore less business critical and, given a relatively short outage (i.e. < 3 hours), would in most cases not negatively affect end user transactions to any great degree.
Distributed smartlinks are powered almost completely by static files hosted on AWS S3. S3 is a highly fault tolerant and redundant system.
In the event of any data loss within the distributed smartlink S3 bucket, data can be redeployed to the distributed smartlink system via the Platform.
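Should such a redeployment be required, it could in principle be scripted along these lines. This is a sketch only; the bucket name and local export path are assumptions, and the real deployment is driven from the Platform.

```python
# Sketch of redeploying smartlink static files to S3 after data loss.
# Bucket name and local build path are placeholders.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "distributed-smartlinks"        # placeholder bucket name
BUILD_DIR = "/srv/smartlinks/build"      # placeholder local export of smartlink files

for root, _dirs, files in os.walk(BUILD_DIR):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, BUILD_DIR)
        # Re-upload each generated static file to the smartlink bucket
        s3.upload_file(path, BUCKET, key)
```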
AWS Elastic Beanstalk (EB) is used to provide some key smartlink functionality, such as saving print jobs and handling user image uploads.
Application code is executed on EB inside Docker containers.
In the event of any outage (assuming there is not a more widespread AWS outage):
A redeployment to a new EB environment should be performed.
This would typically take 10 - 30 minutes.
A CNAME switch should then be performed, swapping the old and new environments (see the sketch after these steps).
This would typically take a few minutes for DNS settings to propagate.
Once traffic to the new environment has been confirmed and validated, a post-mortem should be carried out on the old environment.
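A sketch of the CNAME switch step, assuming boto3 and placeholder environment names; the health check before swapping is equally illustrative.

```python
# Sketch of swapping CNAMEs between the old and new EB environments.
# Environment names and region are placeholders.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="eu-west-1")

# Confirm the replacement environment is healthy before switching traffic
env = eb.describe_environments(EnvironmentNames=["smartlinks-prod-new"])["Environments"][0]
assert env["Health"] == "Green", "new environment is not healthy yet"

# Swap the CNAMEs so the production URL now points at the new environment
eb.swap_environment_cnames(
    SourceEnvironmentName="smartlinks-prod-old",
    DestinationEnvironmentName="smartlinks-prod-new",
)
```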
In the event of any disaster, all scheduled tasks should be stopped immediately. This can be achieved by halting the Docker daemon on the scheduled task servers.
Scheduled tasks should only be resumed once the Platform has regained stability.
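On a systemd-managed host this could be as simple as the following sketch, which assumes sudo access and that all scheduled tasks run as Docker containers on that server.

```python
# Sketch: halt the Docker daemon on a scheduled task server so no further task
# containers can run. Assumes a systemd-managed Docker service and sudo access.
import subprocess

# Stopping docker.socket as well prevents socket activation from restarting the daemon
subprocess.run(["sudo", "systemctl", "stop", "docker.socket", "docker"], check=True)
```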
A full database backup is performed every 3 hours, with 24 hours of backups stored on site and then archived to S3 for 3 months.
Additionally, as an extra layer of redundancy, UKFast performs daily off-site incremental backups, with a full backup every Sunday night.
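The 3-month S3 retention for archived backups could, for example, be enforced with a lifecycle rule such as the following. Bucket and prefix names are placeholders.

```python
# Sketch of the 3-month retention rule for database backups archived to S3.
# Bucket and key prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="platform-db-backups",                     # placeholder backup bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-db-dumps-after-3-months",
                "Filter": {"Prefix": "mysql/"},       # placeholder key prefix
                "Status": "Enabled",
                "Expiration": {"Days": 90},           # ~3 months, per the backup policy
            }
        ]
    },
)
```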
In the event of database corruption at the block or file level, a failover to a redundant database server should be performed rather than a full or partial restore.
Currently, failovers are a manual process and require an application-level configuration update.
In the event of data being corrupted via SQL (e.g. an application bug that causes loss of data), a full restore should only be considered if the loss of data is significant (i.e. greater than 75%).
Where data loss is not significant, a partial restore should be performed by first restoring a previous database backup to a temporary VM followed by a manual restore of the affected data to the primary database server.
A full database restore would be expected to take around 6 hours, during which all Platform functionality would be unavailable; it should therefore be treated as a last resort.
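As an illustration of the partial restore path, and assuming a MySQL-compatible database, the affected table could be dumped from the temporary restore VM and replayed against the primary roughly as follows. Hostnames, credentials, database and table names are all placeholders, and real incidents may need finer-grained row selection.

```python
# Sketch of a partial restore, assuming a MySQL-compatible database.
# All hostnames, credentials and table names are placeholders.
import subprocess

# Dump only the affected table's rows from the temporary restore VM,
# using REPLACE statements so existing rows on the primary are overwritten.
dump = subprocess.run(
    ["mysqldump", "-h", "temp-restore-vm", "-u", "restore", "-pPLACEHOLDER",
     "--no-create-info", "--replace", "platform", "orders"],
    check=True, capture_output=True,
)

# Replay the dump against the primary database server
subprocess.run(
    ["mysql", "-h", "db-primary", "-u", "restore", "-pPLACEHOLDER", "platform"],
    input=dump.stdout, check=True,
)
```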
Generally, file storage is used for items such as product assets, fonts, etc. No sensitive PII is contained within file storage, with the exception of user uploaded images.
The majority of assets are stored within S3. S3 versioning is enabled for most asset types to allow for “undeletion”.
In the event of any data loss, files should be restored from either the S3 backup (for legacy assets) or from the S3 object’s version history.
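Where version history is the recovery route, "undeleting" an object is a matter of removing the delete marker so the previous version becomes current again. A sketch, with placeholder bucket and key names:

```python
# Sketch of "undeleting" an S3 object by removing the delete marker left by an
# accidental delete. Bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "platform-assets", "fonts/brand-regular.woff2"   # placeholders

versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == KEY and marker["IsLatest"]:
        # Removing the latest delete marker makes the previous version current again
        s3.delete_object(Bucket=BUCKET, Key=KEY, VersionId=marker["VersionId"])
```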
In the event of a ransomware attack against our web tier, all web tier VMs should be spun down and replaced with fresh VMs running rebuilt Docker container images.
Once system stability has been restored, the compromised VMs should be isolated and investigated to learn more about the possible attack vectors.
In the event of a ransomware attack against our primary database server, a failover to a read replica should be attempted. The old primary server should then be investigated, formatted and brought back into service.
If all read replicas have also been compromised then it would be necessary to revert the primary to a previous known-good snapshot and rebuild and reinitialise all read replicas.