docs: add spot_instance_diversity #71

Open: wants to merge 1 commit into base: main
3 changes: 2 additions & 1 deletion src/app/_meta.global.ts
@@ -26,7 +26,8 @@ export default {
 workload_config: {},
 workload_diversity: {},
 keep_part_nodes: {},
-node_template_configuration: {}
+node_template_configuration: {},
+spot_instance_diversity: {}
 }
 },
 security: {

@@ -0,0 +1,52 @@
# Optimizing Cluster Resilience with Spot Instance Diversity Management

This document describes a feature that improves the resilience and efficiency of Kubernetes clusters running on Spot instances. By distributing workloads across heterogeneous instance types, the system reduces the risk that a single Spot interruption event disrupts a large part of the cluster, while preserving Spot cost savings.

---

## Key Features

### Automated Instance Type Diversification
Dynamically distributes workloads across multiple Spot instance types using a decentralized scheduling strategy. This reduces dependency risk by ensuring that no single instance type holds a disproportionate share of the cluster.

### Operational Risk Mitigation
Reduces service disruptions caused by Spot instance interruptions. By spreading workloads across diverse instance families (e.g., `m5`, `t3`, `c5`), the cluster maintains elasticity even during sudden Spot market volatility.
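
One way to reason about this is to treat the interruption of a single instance type as the failure unit: the largest share held by any one type then bounds the capacity lost in one event. The sketch below is illustrative only. `worstCaseLoss` and the sample shares are hypothetical and not part of any CloudPilot AI API, and it assumes that one interruption wave reclaims every node of exactly one instance type.

```typescript
// Hypothetical illustration, not a CloudPilot AI API.
// Maps an instance type to the fraction of the cluster it holds (fractions sum to 1).
type TypeShares = Record<string, number>;

// Worst-case fraction of capacity lost if all nodes of a single
// instance type are interrupted at the same time.
function worstCaseLoss(shares: TypeShares): number {
  const values = Object.keys(shares).map((t) => shares[t]);
  return values.length === 0 ? 0 : Math.max(...values);
}

// Assumed example shares: a concentrated cluster and a more diverse one.
const concentrated: TypeShares = { "m5.large": 0.6, "t3.medium": 0.2, "c5.xlarge": 0.2 };
const diversified: TypeShares = { "m5.large": 0.4, "t3.medium": 0.3, "c5.xlarge": 0.3 };

console.log(worstCaseLoss(concentrated)); // 0.6
console.log(worstCaseLoss(diversified)); // 0.4
```

Under that assumption, moving from the concentrated mix to the diversified one lowers the worst single-type loss from 60% to 40% of the cluster.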

### Cost-Stability Balance
Balances Spot instance cost savings against workload reliability. The scheduler adapts to real-time market conditions without requiring manual intervention.

---

## How It Works

1. **Initial State Analysis**

   The system evaluates current cluster composition. For example:

   | Instance Type | Allocation |
   |---------------|------------|
   | `m5.large`    | 60%        |
   | `t3.medium`   | 20%        |
   | `c5.xlarge`   | 20%        |

2. **Gradual Redistribution**

   New workloads are redirected toward underrepresented instance types. Over time, the distribution evolves toward:

   | Instance Type | Allocation |
   |---------------|------------|
   | `m5.large`    | 40%        |
   | `t3.medium`   | 30%        |
   | `c5.xlarge`   | 30%        |

3. **Real-Time Adaptation**

   The scheduler continuously monitors:

   - Availability zone capacity
   - Spot price fluctuations
   - Instance termination rate history

   Adjustments occur incrementally to maintain workload stability.
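
The first two steps can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the scheduler's actual algorithm: the function names, the fixed target shares, and the greedy "largest gap first" rule are all hypothetical, and the sketch ignores the price, capacity, and termination-rate signals from step 3.

```typescript
// Hypothetical names throughout; this is not CloudPilot AI's API.
// Maps an instance type to a fraction of the cluster's nodes.
type Shares = Record<string, number>;

// Step 1: derive the current share of each instance type from a node list.
function currentShares(nodeTypes: string[]): Shares {
  const shares: Shares = {};
  for (const t of nodeTypes) {
    shares[t] = (shares[t] || 0) + 1; // count nodes per type
  }
  for (const t of Object.keys(shares)) {
    shares[t] = shares[t] / nodeTypes.length; // convert counts to fractions
  }
  return shares;
}

// Step 2: pick the type whose current share is furthest below its target.
// Returns undefined when no type is below target. Ties go to the type
// listed first in `target`.
function mostUnderrepresented(current: Shares, target: Shares): string | undefined {
  let best: string | undefined;
  let bestGap = 0;
  for (const t of Object.keys(target)) {
    const gap = target[t] - (current[t] || 0);
    if (gap > bestGap) {
      best = t;
      bestGap = gap;
    }
  }
  return best;
}

// Place new nodes one at a time, re-evaluating the mix after each placement
// so the distribution shifts gradually instead of in one large move.
function redistribute(nodeTypes: string[], target: Shares, newNodes: number): string[] {
  const result = nodeTypes.slice();
  for (let i = 0; i < newNodes; i++) {
    const pick = mostUnderrepresented(currentShares(result), target);
    if (pick === undefined) break; // already at or above every target
    result.push(pick);
  }
  return result;
}

// Assumed example: 10 nodes matching the 60/20/20 starting mix.
const initial = [
  "m5.large", "m5.large", "m5.large", "m5.large", "m5.large", "m5.large",
  "t3.medium", "t3.medium",
  "c5.xlarge", "c5.xlarge",
];
const target: Shares = { "m5.large": 0.4, "t3.medium": 0.3, "c5.xlarge": 0.3 };

console.log(mostUnderrepresented(currentShares(initial), target)); // "t3.medium"
console.log(currentShares(redistribute(initial, target, 5)));
```

Because existing nodes are never moved in this sketch, the mix only approaches the target as new nodes arrive, which mirrors the incremental behavior described above.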

---

## Implementation Notes

- **Manual Activation Required**: This feature must be configured by the CloudPilot AI engineering team. Contact [email protected] for activation.
- **Limitations**: Actual performance depends on real-time Spot market conditions and regional instance availability.

Technical Guide for DevOps & SRE Teams | For detailed configuration support or advanced implementation scenarios, contact the CloudPilot AI Engineering Team.