Performance
- Response time: Time between a client sending a request and receiving a response
- Response time/End to end latency = Processing time + Waiting time
- Processing time: System processing time
- Waiting time/Latency: Duration of time the request spends inactive (waiting) in the system
- Throughput: Amount of work performed by the system per unit of time, e.g., the amount of data processed by the system per unit of time
- Other important considerations:
- When evaluating the performance of a system, response time is not the only factor to consider. Depending on the nature of the task or the specific requirements of the system, throughput can be an important indicator of performance.
- Sometimes engineers measure only processing time, but processing time alone cannot represent response time.
- Measure the distribution of response times. Depending on the needs, it can be summarized by the average/median/maximum response time.
- Define response time goals using percentiles (see the percentile sketch at the end of this section)
- Performance degradation due to high resource utilization
- Potential overly-utilized resources:
- High CPU utilization
- High memory consumption
- Too many connections/IO
- Message queue is at capacity
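To make percentile-based response-time goals concrete, here is a minimal sketch in Python (the numbers and function name are hypothetical) that summarizes a set of measured response times by average, median, p95, and p99:

```python
import statistics

def summarize_response_times(samples_ms):
    """Summarize measured response times (in milliseconds)."""
    samples = sorted(samples_ms)
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    percentiles = statistics.quantiles(samples, n=100)
    return {
        "average": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentiles[94],   # 95th percentile
        "p99": percentiles[98],   # 99th percentile
        "max": samples[-1],
    }

# Hypothetical measurements: most requests are fast, a few are very slow
measurements = [12, 15, 14, 18, 22, 16, 13, 250, 17, 19] * 10
print(summarize_response_times(measurements))
# A percentile-based goal could then read: "p99 response time < 300 ms"
```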
Scalability
The measure of a system's ability to handle a growing amount of work, in an easy and cost-effective way, by adding resources to the system.
- Vertical scalability: Adding more resources or upgrading the existing resources on a single computer, to allow our system to handle higher traffic or load.
- Pros
- Any application can benefit from it
- No code changes are required
- The migration between different machines is very easy
- Cons
- The scope of upgrade is limited
- We are locked into a centralized system, which cannot provide
- High availability
- Fault tolerance
- Horizontal scalability: Adding more resources in the form of new instances running on different machines, to allow our system to handle higher traffic or load (see the round-robin sketch at the end of this section).
- Pros
- No limit on scalability
- It’s easy to add/remove machines
- If designed correctly we get
- High availability
- Fault tolerance
- Cons
- Initial code changes may be required
- Increased complexity and coordination overhead
- Team/Organizational scalability: Increasing productivity while hiring more engineers into the team.
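To make horizontal scalability a bit more concrete, here is a minimal sketch in Python (instance names are hypothetical) of routing requests across several identical instances in round-robin order; scaling out means adding another instance to the pool:

```python
import itertools

class RoundRobinBalancer:
    """Distributes incoming requests across a pool of identical instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cycle = itertools.cycle(self.instances)

    def route(self, request):
        instance = next(self._cycle)
        # A real load balancer would forward the request over the network here.
        return f"{request} -> {instance}"

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
for i in range(6):
    print(balancer.route(f"request-{i}"))
```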
Availability
The fraction of time/probability that the service is operationally functional and accessible to the user.
- Uptime: Time that the system is operationally functional and accessible to the user.
- Downtime: Time that the system is unavailable to the user
- Availability (in %) = Uptime / (Uptime + Downtime)
- MTBF (Mean Time Between Failures): Represents the average time the system is operational
- MTTR (Mean Time To Recovery): The average time it takes us to detect and recover from a failure
- Availability = MTBF / (MTBF + MTTR)
- In practice, MTTR cannot be zero
- This shows that fast detection and fast recovery help us achieve high availability
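As a small worked example of the two formulas above (hypothetical numbers), note how shrinking MTTR directly raises availability:

```python
def availability_from_uptime(uptime_hours, downtime_hours):
    # Availability = Uptime / (Uptime + Downtime)
    return uptime_hours / (uptime_hours + downtime_hours)

def availability_from_mtbf(mtbf_hours, mttr_hours):
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical year: 8,750 hours up, 10 hours down
print(f"{availability_from_uptime(8750, 10):.4%}")   # ≈ 99.89%

# Same MTBF, but faster detection and recovery improves availability
print(f"{availability_from_mtbf(1000, 10):.4%}")     # MTTR = 10h -> ≈ 99.01%
print(f"{availability_from_mtbf(1000, 1):.4%}")      # MTTR = 1h  -> ≈ 99.90%
```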
The risks of low availability:
- Loss of revenue
- Loss of customers to our competitors
Source of Failures
- Human Errors
- Pushing a faulty config to production
- Running the wrong command/script
- Deploying an incompletely tested new version of software
- Software Errors
- Long garbage collections
- Out-of-memory exceptions
- Null pointer exceptions
- Segmentation faults
- Hardware Failures
- Devices breaking down due to limited shelf-life
- Power outages due to natural disasters
- Network failures due to infrastructure issues
Fault Tolerance
Fault tolerance is the best way to achieve high availability in our system.
Fault tolerance enables our system to remain operational and available to the users despite failures within one or multiple of its components.
Fault tolerance revolves around 3 major tactics
- Failure Prevention: Eliminate any SINGLE POINT OF FAILURE in the system
- A single server running the application
- Storing all the data on one instance of the database running on a single computer
- The best way to eliminate single points of failure is through Replication and Redundancy
- Types of redundancy
- Spatial Redundancy: Running replicas of the application on different computers
- Time Redundancy: Repeating the same operation/request multiple times until we succeed/give up (see the retry sketch after this list)
- Two strategies for Redundancy and Replication
- Active-Active architecture: In an active-active configuration, multiple systems or components are actively serving and processing requests simultaneously. These systems share the workload and distribute the incoming traffic between them. Each system is capable of handling requests and providing services independently. If one system fails or becomes unavailable, the other systems in the configuration continue to handle the workload seamlessly. Active-active setups are designed for scalability, load balancing, and maximizing resource utilization.
- Active-Passive architecture: In an active-passive configuration, there is a primary active system that handles all the requests and provides services, while the secondary passive system remains idle and does not process any requests actively. The passive system is on standby, ready to take over if the primary system fails or becomes unavailable. In case of a failure, the passive system activates and starts handling the workload. The active-passive setup is mainly focused on fault tolerance and ensuring uninterrupted service in case of primary system failure.
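As an illustration of time redundancy from the list above, here is a minimal sketch in Python (the flaky operation and its parameters are hypothetical) that repeats the same operation with exponential backoff until it succeeds or we give up:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=0.5):
    """Time redundancy: repeat the same operation until it succeeds or we give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential backoff with a little jitter before the next attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical flaky operation that fails about 70% of the time
def flaky_request():
    if random.random() < 0.7:
        raise ConnectionError("transient failure")
    return "OK"

print(retry_with_backoff(flaky_request))
```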
- Failure Detection and Isolation
- Requires a monitoring system to detect the service’s heartbeats
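A minimal sketch of heartbeat-based failure detection (class and instance names are hypothetical): each instance reports a heartbeat periodically, and the monitor flags an instance as failed when its last heartbeat is older than a timeout:

```python
import time

class HeartbeatMonitor:
    """Flags instances whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # instance name -> timestamp of its last heartbeat

    def record_heartbeat(self, instance):
        self.last_seen[instance] = time.monotonic()

    def failed_instances(self):
        now = time.monotonic()
        return [name for name, ts in self.last_seen.items()
                if now - ts > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("app-1")
monitor.record_heartbeat("app-2")
# Later, if "app-2" stops sending heartbeats, failed_instances() returns ["app-2"]
# and we stop sending traffic to that host.
```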
- Recovery
- Actions after detecting faulty instance/server:
- Stop sending traffic/workload to that host
- Restart the host to make the problem go away
- Rollback: Going back to a version that was stable and correct
SLA – Service Level Agreement
It’s a legal contract that represents the quality of service, covering aspects such as
- Availability
- Performance
- Data durability
- Time to respond to system failures
- and so on…
It states the penalties and financial consequences if we breach the contract, which could include:
- Full/Partial refunds
- Subscription/License extensions
- Service credits
- and so on…
SLOs – Service Level Objectives
- Individual goals set for our system
- Each SLO represents a target value/range that our service needs to meet.
SLIs – Service Level Indicators
- Quantitative measure of our compliance with a service-level objective
- It’s the actual numbers:
- Measured using a monitoring service
- Calculated from our logs
- It can later be compared to our SLOs
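As an example of comparing SLIs against SLOs, here is a minimal sketch in Python (the SLO targets and log-derived numbers are hypothetical) that computes an availability SLI from request counts and checks it, together with a latency SLI, against the objectives:

```python
# Hypothetical SLOs
SLO_AVAILABILITY = 0.999     # 99.9% of requests succeed
SLO_P99_LATENCY_MS = 300     # 99th-percentile response time under 300 ms

def availability_sli(successful_requests, total_requests):
    """SLI: fraction of requests that succeeded (e.g., calculated from logs)."""
    return successful_requests / total_requests

def check_slos(successes, total, p99_latency_ms):
    sli = availability_sli(successes, total)
    return {
        "availability_sli": sli,
        "availability_ok": sli >= SLO_AVAILABILITY,
        "latency_ok": p99_latency_ms <= SLO_P99_LATENCY_MS,
    }

# Numbers that would normally come from a monitoring service or log pipeline
print(check_slos(successes=999_200, total=1_000_000, p99_latency_ms=280))
```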
SLAs are crafted by the business and the legal team
SLOs and SLIs are defined and set by the software engineers and architects
Important considerations
- Define the most important SLOs that our users care about and then find the SLIs based on those objectives
- Commit to the bare minimum in terms of
- Number of objectives
- Aggressiveness (Leaving a budget for error)
- Have a recovery plan ahead of time to deal with situations of potential breach of SLOs
SLA Examples in real world:
https://aws.amazon.com/tw/legal/service-level-agreements/