By: Rodney Ellis, Synthesis Service Manager
With Black Friday around the corner, most, if not all, e-commerce businesses and retailers will feel anxiety and dread in the weeks leading up to the day, asking: “Will the site or systems stay up?”, “Will they perform under the additional load?” and “If they go down, how fast can we recover?”
Where to Start?
We know that things can go wrong and, at some point, will go wrong. So what can be done to keep this from affecting customer experience and, at the end of the day, sales and business confidence?
Monitoring
A good monitoring foundation and accurate metric thresholds need to be in place. By monitoring the critical metrics, many potential bottlenecks and issues can be identified before they become a problem. For example, issues with database usage, CPU/memory utilisation, disk/network performance or endpoint latency, or errors in the application logs, can have a cascading effect that eventually causes downtime if they go unnoticed.
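As a rough illustration of what "accurate metric thresholds" can mean in practice, the sketch below checks one sample of metrics against warning and critical limits. The metric names, the limits and the `evaluate` helper are all hypothetical and chosen for illustration only; real values should come from your own baseline and monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical metric names and limits; tune these to your own baseline.
THRESHOLDS = {
    "cpu_percent": Threshold(warning=70.0, critical=90.0),
    "memory_percent": Threshold(warning=75.0, critical=90.0),
    "disk_iops_percent": Threshold(warning=80.0, critical=95.0),
    "endpoint_latency_ms": Threshold(warning=500.0, critical=2000.0),
}


def evaluate(sample: dict[str, float]) -> dict[str, str]:
    """Return the alert level for each known metric that breaches a threshold."""
    alerts = {}
    for name, value in sample.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # metrics without a configured threshold are ignored
        if value >= limit.critical:
            alerts[name] = "critical"
        elif value >= limit.warning:
            alerts[name] = "warning"
    return alerts


if __name__ == "__main__":
    sample = {"cpu_percent": 82.5, "memory_percent": 60.0, "endpoint_latency_ms": 2300.0}
    print(evaluate(sample))
    # {'cpu_percent': 'warning', 'endpoint_latency_ms': 'critical'}
```

Keeping a warning level below the critical one gives the team time to act before a single saturated resource starts to cascade.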
Architecture
A good architecture design and the decoupling of services are needed to decrease the blast radius of outages, whether they stem from infrastructure or application-related issues.
Utilising multiple availability zones or datacenters for application hosting, with automatic failover, decreases the risk of an infrastructure outage affecting availability. Converting services to microservices orchestrated by Kubernetes can further increase service availability.
Partnering with a cloud provider like AWS solves several of the potential issues that colocated or on-premises infrastructure cannot handle, or cannot handle easily.
One example is the ability to rapidly scale resources up or down, either manually or by using AWS autoscaling triggers that execute on predefined metric thresholds. This can provide a seamless and predictable experience for your customers.
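To make the idea of a threshold-driven scaling trigger concrete, here is a minimal sketch of the decision such a trigger encodes. It is not the AWS API: the `desired_capacity` function, its parameter names and the default limits are assumptions made for illustration, loosely modelled on a simple step-scaling rule.

```python
def desired_capacity(current: int, cpu_percent: float, *,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     step: int = 2, minimum: int = 2, maximum: int = 20) -> int:
    """Return the instance count after applying one scaling decision.

    All thresholds and limits are illustrative defaults, not recommendations.
    """
    if cpu_percent >= scale_out_at:
        target = current + step      # add capacity quickly under load
    elif cpu_percent <= scale_in_at:
        target = current - 1         # remove capacity slowly to avoid flapping
    else:
        target = current             # inside the comfortable band: no change
    # Never go below the minimum or above the maximum fleet size.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    print(desired_capacity(4, 85.0))   # 6
    print(desired_capacity(19, 85.0))  # 20 (capped at the maximum)
    print(desired_capacity(2, 10.0))   # 2 (held at the minimum)
```

Scaling out in larger steps than scaling in is a common design choice for traffic spikes, since being briefly over-provisioned usually costs less than being under-provisioned during a sale.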
Hypercare
Synthesis Managed Services provides this service to our retail and e-commerce partners: a group of DevOps and systems engineers actively monitors and tests the environment during the day and reacts within seconds to resolve problems as they arise.
Utilising these methodologies and best practices, we have been able to provide our partners with a predictable business outcome during Black Friday, knowing that no matter how many customers sign up, or buy toasters, there will be no impact on business as usual.
When the load hits and things go wrong
Usually, the first reaction from the support team is “it’s the developers’ fault”, and from the developers, “it’s the infrastructure”. Good monitoring makes finding the root cause of a problem a lot faster, and this eliminates the blame game.
We have seen issues ranging from disk IOPS constraints and API limits to high CPU load and even memory leaks in a service.
Usually, at around 11pm on the Thursday, users start logging in and refreshing your site continuously. Some of them have become quite crafty, utilising web scraping tools to poll for deals and keywords, which can lead to hundreds if not thousands of web requests per second.
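One way to see this kind of polling in your own traffic is to count requests per client over a short sliding window. The sketch below is an assumption-laden illustration rather than a description of any particular tool: the `RequestRateTracker` class, the one-second window and the limit of 50 requests are all hypothetical, and it assumes timestamps for each client arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class RequestRateTracker:
    """Track requests per client over a sliding time window."""

    def __init__(self, window_seconds: float = 1.0, limit: int = 50):
        self.window_seconds = window_seconds
        self.limit = limit
        self._hits = defaultdict(deque)  # client_id -> recent request timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client now exceeds the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop requests that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.limit


if __name__ == "__main__":
    tracker = RequestRateTracker(window_seconds=1.0, limit=3)
    for t in (0.0, 0.1, 0.2, 0.3):
        print(t, tracker.record("client-a", t))
    # The fourth request inside one second exceeds the limit of 3.
```

What to do with a flagged client (alert, throttle or block) is a separate decision that depends on the business; the point here is only that the pattern is measurable.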
The teams need to be on standby and ready to deal with any issue that might arise during the day.
When the dust settles
After the day has come and gone, take a moment to go through all the issues experienced and come up with remediation plans. The first Black Friday is usually the most error-prone, but by taking the learnings from the day and implementing fixes for them, the next Black Friday will go a lot smoother.
A couple of tips to prep for the day:
- Monitoring – Make sure all the critical metrics and endpoints are monitored.
- Service autoscaling – Make sure your services and webservers can scale.
- No change Friday – Try to make sure a stable release of the services and sites is up for that week.
Side note:
Do not panic. With good preparation, a good team can mitigate most, if not all, interruptions to business, and when an issue pops up they will be able to deal with it quickly.