Performance
- Response time: Time between a client sending a request and receiving a response
- Response time/End to end latency = Processing time + Waiting time
- Processing time: System processing time
- Waiting time/Latency: Duration of time the request spends inactive (waiting) in the system
- Throughput: Amount of work performed by the system per unit of time, e.g., the amount of data processed by the system per unit of time
- Other important considerations:
- When evaluating the performance of a system, response time is not the only factor to consider. Depending on the nature of the task or the specific requirements of the system, throughput can be an important indicator of performance.
- Sometimes engineers measure only processing time, but processing time alone cannot represent response time.
- Measure the distribution of response times. Depending on the needs, it can be summarized by the average/median/maximum response time.
- Define response time goals using percentiles (see the percentile sketch at the end of this section)
- Performance degradation due to high resource utilization
- Potential overly-utilized resources:
- High CPU utilization
- High memory consumption
- Too many connections/IO
- Message queue is at capacity
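To make percentile-based response-time goals concrete, here is a minimal sketch in Python (the numbers and function name are hypothetical) that summarizes a set of measured response times by average, median, p95, and p99:

```python
import statistics

def summarize_response_times(samples_ms):
    """Summarize measured response times (in milliseconds)."""
    samples = sorted(samples_ms)
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    percentiles = statistics.quantiles(samples, n=100)
    return {
        "average": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentiles[94],   # 95th percentile
        "p99": percentiles[98],   # 99th percentile
        "max": samples[-1],
    }

# Hypothetical measurements: most requests are fast, a few are very slow
measurements = [12, 15, 14, 18, 22, 16, 13, 250, 17, 19] * 10
print(summarize_response_times(measurements))
# A percentile-based goal could then read: "p99 response time < 300 ms"
```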
Scalability
The measure of a system's ability to handle a growing amount of work, in an easy and cost-effective way, by adding resources to the system.
- Vertical scalability: Adding more resources or upgrading the existing resources on a single computer, to allow our system to handle higher traffic or load.
- Pros
- Any application can benefit from it
- No code changes are required
- The migration between different machines is very easy
- Cons
- The scope of upgrade is limited
- We are locked into a centralized system, which cannot provide
- High availability
- Fault tolerance
- Horizontal scalability: Adding more resources in the form of new instances running on different machines, to allow our system to handle higher traffic or load (see the round-robin sketch at the end of this section).
- Pros
- No limit on scalability
- It’s easy to add/remove machines
- If designed correctly we get
- High availability
- Fault tolerance
- Cons
- Initial code changes may be required
- Increased complexity and coordination overhead
- Team/Organizational scalability: Increasing productivity while hiring more engineers into the team.
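To make horizontal scalability a bit more concrete, here is a minimal sketch in Python (instance names are hypothetical) of routing requests across several identical instances in round-robin order; scaling out means adding another instance to the pool:

```python
import itertools

class RoundRobinBalancer:
    """Distributes incoming requests across a pool of identical instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cycle = itertools.cycle(self.instances)

    def route(self, request):
        instance = next(self._cycle)
        # A real load balancer would forward the request over the network here.
        return f"{request} -> {instance}"

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
for i in range(6):
    print(balancer.route(f"request-{i}"))
```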
Availability
The fraction of time/probability that the service is operationally functional and accessible to the user.
- Uptime: Time that the system is operationally functional and accessible to the user.
- Downtime: Time that the system is unavailable to the user
- Availability (in %) = Uptime / (Uptime + Downtime)
- MTBF (Mean Time Between Failures): Represents the average time the system is operational
- MTTR (Mean Time To Recovery): The average time it takes us to detect and recover from a failure
- Availability = MTBF / (MTBF + MTTR)
- In practice, MTTR cannot be zero
- This shows that fast detection and fast recovery help us achieve high availability
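As a small worked example of the two formulas above (hypothetical numbers), note how shrinking MTTR directly raises availability:

```python
def availability_from_uptime(uptime_hours, downtime_hours):
    # Availability = Uptime / (Uptime + Downtime)
    return uptime_hours / (uptime_hours + downtime_hours)

def availability_from_mtbf(mtbf_hours, mttr_hours):
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical year: 8,750 hours up, 10 hours down
print(f"{availability_from_uptime(8750, 10):.4%}")   # ≈ 99.89%

# Same MTBF, but faster detection and recovery improves availability
print(f"{availability_from_mtbf(1000, 10):.4%}")     # MTTR = 10h -> ≈ 99.01%
print(f"{availability_from_mtbf(1000, 1):.4%}")      # MTTR = 1h  -> ≈ 99.90%
```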
The risks of low availability:
- Loss of revenue
- Loss of customers to our competitors
Source of Failures
- Human Errors
- Pushing a faulty config to production
- Running the wrong command/script
- Deploying an incompletely tested new version of software
- Software Errors
- Long garbage collections
- Out-of-memory exceptions
- Null pointer exceptions
- Segmentation faults
- Hardware Failures
- Devices breaking down due to limited shelf-life
- Power outages due to natural disasters
- Network failures due to infrastructure issues
Fault Tolerance
Fault tolerance is the best way to achieve high availability in our system.
Fault tolerance enables our system to remain operational and available to the users despite failures within one or multiple of its components.
Fault tolerance revolves around 3 major tactics
- Failure Prevention: Eliminate any SINGLE POINT OF FAILURE in the system
- A single server running the application
- Storing all the data on one instance of the database running on a single computer
- The best way to eliminate single points of failure is through Replication and Redundancy
- Types of redundancy
- Spatial Redundancy: Running replicas of the application on different computers
- Time Redundancy: Repeating the same operation/request multiple times until we succeed/give up (see the retry sketch after this list)
- Two strategies for Redundancy and Replication
- Active-Active architecture: In an active-active configuration, multiple systems or components are actively serving and processing requests simultaneously. These systems share the workload and distribute the incoming traffic between them. Each system is capable of handling requests and providing services independently. If one system fails or becomes unavailable, the other systems in the configuration continue to handle the workload seamlessly. Active-active setups are designed for scalability, load balancing, and maximizing resource utilization.
- Active-Passive architecture: In an active-passive configuration, there is a primary active system that handles all the requests and provides services, while the secondary passive system remains idle and does not process any requests actively. The passive system is on standby, ready to take over if the primary system fails or becomes unavailable. In case of a failure, the passive system activates and starts handling the workload. The active-passive setup is mainly focused on fault tolerance and ensuring uninterrupted service in case of primary system failure.
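As an illustration of time redundancy from the list above, here is a minimal sketch in Python (the flaky operation and its parameters are hypothetical) that repeats the same operation with exponential backoff until it succeeds or we give up:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=0.5):
    """Time redundancy: repeat the same operation until it succeeds or we give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential backoff with a little jitter before the next attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical flaky operation that fails about 70% of the time
def flaky_request():
    if random.random() < 0.7:
        raise ConnectionError("transient failure")
    return "OK"

print(retry_with_backoff(flaky_request))
```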
- Failure Detection and Isolation
- Requires a monitoring system to detect the service’s heartbeats
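A minimal sketch of heartbeat-based failure detection (class and instance names are hypothetical): each instance reports a heartbeat periodically, and the monitor flags an instance as failed when its last heartbeat is older than a timeout:

```python
import time

class HeartbeatMonitor:
    """Flags instances whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # instance name -> timestamp of its last heartbeat

    def record_heartbeat(self, instance):
        self.last_seen[instance] = time.monotonic()

    def failed_instances(self):
        now = time.monotonic()
        return [name for name, ts in self.last_seen.items()
                if now - ts > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("app-1")
monitor.record_heartbeat("app-2")
# Later, if "app-2" stops sending heartbeats, failed_instances() returns ["app-2"]
# and we stop sending traffic to that host.
```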
- Recovery
- Actions after detecting faulty instance/server:
- Stop sending traffic/workload to that host
- Restart the host to make the problem go away
- Rollback: Going back to a version that was stable and correct
SLA – Service Level Agreement
It’s a legal contract that represents the quality of service, covering aspects such as
- Availability
- Performance
- Data durability
- Time to respond to system failures
- and so on…
It states the penalties and financial consequences if we breach the contract, which could include:
- Full/Partial refunds
- Subscription/License extensions
- Service credits
- and so on…
SLOs – Service Level Objectives
- Individual goals set for our system
- Each SLO represents a target value/range that our service needs to meet.
SLIs – Service Level Indicators
- Quantitative measure of our compliance with a service-level objective
- It’s the actual numbers:
- Measured using a monitoring service
- Calculated from our logs
- It can later be compared to our SLOs
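As an example of comparing SLIs against SLOs, here is a minimal sketch in Python (the SLO targets and log-derived numbers are hypothetical) that computes an availability SLI from request counts and checks it, together with a latency SLI, against the objectives:

```python
# Hypothetical SLOs
SLO_AVAILABILITY = 0.999     # 99.9% of requests succeed
SLO_P99_LATENCY_MS = 300     # 99th-percentile response time under 300 ms

def availability_sli(successful_requests, total_requests):
    """SLI: fraction of requests that succeeded (e.g., calculated from logs)."""
    return successful_requests / total_requests

def check_slos(successes, total, p99_latency_ms):
    sli = availability_sli(successes, total)
    return {
        "availability_sli": sli,
        "availability_ok": sli >= SLO_AVAILABILITY,
        "latency_ok": p99_latency_ms <= SLO_P99_LATENCY_MS,
    }

# Numbers that would normally come from a monitoring service or log pipeline
print(check_slos(successes=999_200, total=1_000_000, p99_latency_ms=280))
```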
SLAs are crafted by the business and the legal team
SLOs and SLIs are defined and set by the software engineers and architects
Important considerations
- Define the most important SLOs that our users care about and then find the SLIs based on those objectives
- Commit to the bare minimum in terms of
- Number of objectives
- Aggressiveness (Leaving a budget for error)
- Have a recovery plan ahead of time to deal with situations of potential breach of SLOs
SLA Examples in real world:
https://aws.amazon.com/tw/legal/service-level-agreements/