Commit c0ab2bb

Add SLA policy.
1 parent 3db1b40 commit c0ab2bb

2 files changed: +118 -0 lines changed

docs/sla-policy.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
toc_max_heading_level: 2
---

# Service Level Agreement (SLA) Policy

We are committed to providing reliable, high-quality services to our customers. This Service Level Agreement (SLA) outlines our availability commitments, incident response procedures, and the transparency measures we employ to keep you informed about the health of our services.

## Service Availability Targets

### Production API Services

- **Monthly Uptime Target**: 99.9% (allows for about 43 minutes of downtime in a 30-day month)
- **Measured Services**:
  - Primary API endpoints ([api.prod.app.gruntwork.io](https://api.prod.app.gruntwork.io))
  - Authentication services

### How We Calculate Uptime

- **Simple Math**: (Total minutes in month - Downtime minutes) ÷ Total minutes in month × 100
- **What Counts as Downtime**: A measured service is completely down or is failing more than 5% of requests
- **What Doesn't Count**: Scheduled maintenance (we'll tell you 72 hours ahead of time)
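
To make the arithmetic concrete, here is a minimal sketch of the uptime formula and the 5% failure rule. The function names and the per-minute sampling model are assumptions made for illustration, not a description of how our monitoring is implemented, and `downtimeMinutes` is assumed to already exclude scheduled maintenance.

```javascript
// Illustrative sketch only: names and the per-minute model are assumptions.
const DOWNTIME_FAILURE_THRESHOLD = 0.05; // "failing more than 5% of requests"

// Decide whether one minute of traffic counts as downtime.
// Assumption: a minute with no requests at all is not counted as downtime.
function isDowntimeMinute({ total, failed }) {
  if (total === 0) return false;
  return failed / total > DOWNTIME_FAILURE_THRESHOLD;
}

// (Total minutes in month - Downtime minutes) / Total minutes in month * 100
function uptimePercent(totalMinutes, downtimeMinutes) {
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

// Downtime budget, in minutes, implied by an uptime target.
function allowedDowntimeMinutes(totalMinutes, targetPercent) {
  return (totalMinutes * (100 - targetPercent)) / 100;
}

// A 30-day month has 30 * 24 * 60 = 43,200 minutes,
// so a 99.9% target leaves a budget of roughly 43.2 minutes.
const minutesIn30Days = 30 * 24 * 60;
console.log(allowedDowntimeMinutes(minutesIn30Days, 99.9).toFixed(1));
```

A 31-day month has 44,640 minutes, so the same target allows slightly more downtime (about 44.6 minutes).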

## Customer Remedies

While we strive to meet our SLA targets, we recognize that outages impact your business. Paying customers can request the service credits described below.

### Service Credits

| Monthly Uptime | Service Credit (% of monthly service fees) |
|----------------|--------------------------------------------|
| 99.0% to < 99.5% | 2.5% |
| 95.0% to < 99.0% | 5% |
| < 95.0% | 10% |
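
As a rough illustration of how the credit tiers could be looked up, the sketch below maps a monthly uptime percentage to a credit percentage. The function name is hypothetical. It assumes each range includes its lower bound and excludes its upper bound, and it returns no credit at 99.5% or above because the table lists no tier for that range.

```javascript
// Illustrative sketch only: function name and boundary handling are assumptions.
function serviceCreditPercent(monthlyUptimePercent) {
  if (monthlyUptimePercent < 95.0) return 10;
  if (monthlyUptimePercent < 99.0) return 5;
  if (monthlyUptimePercent < 99.5) return 2.5;
  // Assumption: the table defines no credit for uptime of 99.5% or higher.
  return 0;
}

console.log(serviceCreditPercent(99.2)); // 2.5
console.log(serviceCreditPercent(97.0)); // 5
console.log(serviceCreditPercent(90.0)); // 10
```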

### Credit Request Process

1. Submit your request within 30 days of the incident
2. Include the affected services and timeframe
3. Credits are applied to your next billing cycle
4. Maximum credit per month: 10% of monthly service fees
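
The request window and the monthly cap can be sketched as two small checks. Everything named here is hypothetical. The sketch assumes the 30 days are counted from the incident timestamp in elapsed time, which the policy does not spell out.

```javascript
// Illustrative sketch only: names and day-counting rule are assumptions.
const REQUEST_WINDOW_DAYS = 30;
const MAX_MONTHLY_CREDIT_PERCENT = 10;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the request is filed no later than 30 days after the incident.
function isRequestWithinWindow(incidentDate, requestDate) {
  const elapsedDays =
    (requestDate.getTime() - incidentDate.getTime()) / MS_PER_DAY;
  return elapsedDays >= 0 && elapsedDays <= REQUEST_WINDOW_DAYS;
}

// Credit amount in the same currency as monthlyFee, capped at 10% of the fee.
function creditAmount(monthlyFee, creditPercent) {
  const cappedPercent = Math.min(creditPercent, MAX_MONTHLY_CREDIT_PERCENT);
  return (monthlyFee * cappedPercent) / 100;
}

const incident = new Date("2025-01-01T00:00:00Z");
console.log(isRequestWithinWindow(incident, new Date("2025-01-20T00:00:00Z"))); // true
console.log(creditAmount(1000, 5)); // 50
```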

## Incident Classification & Response Times

| Severity | Definition | Response Time | Resolution Time | Communication |
|----------|------------|---------------|-----------------|---------------|
| **Severity 1 (Critical)** | Complete service outage or critical functionality unavailable affecting multiple customers | 30 minutes | 4 hours | Immediate notification via status page and email |
| **Severity 2 (High)** | Significant degradation of service or critical functionality unavailable for a subset of customers | 1 hour | 8 hours | Status page update within 2 hours |
| **Severity 3 (Medium)** | Minor service degradation or non-critical functionality unavailable | 4 hours | 24 hours | Status page update within 4 hours |
| **Severity 4 (Low)** | Cosmetic issues or minor bugs with workarounds available | 1 business day | Best effort | As needed |
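
One way to read the table is as a lookup from severity to response and resolution deadlines. The sketch below is hypothetical throughout: the names are invented, and it assumes the clock starts when the incident is detected, which the table does not specify. Severity 4 is left out because "1 business day" and "best effort" do not translate into a fixed duration.

```javascript
// Illustrative sketch only: names and the "clock starts at detection" rule are assumptions.
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

const SEVERITY_TARGETS = {
  1: { responseMs: 30 * MINUTE_MS, resolutionMs: 4 * HOUR_MS },
  2: { responseMs: 1 * HOUR_MS, resolutionMs: 8 * HOUR_MS },
  3: { responseMs: 4 * HOUR_MS, resolutionMs: 24 * HOUR_MS },
  // Severity 4 is omitted: its targets depend on a business calendar.
};

// Return the latest acceptable response and resolution times for an incident.
function incidentDeadlines(severity, detectedAt) {
  const targets = SEVERITY_TARGETS[severity];
  if (!targets) {
    throw new RangeError(`No fixed targets for severity ${severity}`);
  }
  const start = detectedAt.getTime();
  return {
    respondBy: new Date(start + targets.responseMs),
    resolveBy: new Date(start + targets.resolutionMs),
  };
}

const deadlines = incidentDeadlines(1, new Date("2025-03-01T12:00:00Z"));
console.log(deadlines.respondBy.toISOString()); // 2025-03-01T12:30:00.000Z
console.log(deadlines.resolveBy.toISOString()); // 2025-03-01T16:00:00.000Z
```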

## How We Catch Problems Fast

We've set up several systems to catch issues before you even notice them:

- **Real-time Monitoring**: Our systems watch critical endpoints 24/7 and alert us the moment something goes wrong
- **Automated Testing**: We regularly test authentication and pipeline workflows to catch issues before they affect you
- **Error Tracking**: We use tools like Sentry to get instant notifications when errors occur
- **Support Monitoring**: Our team watches support channels during business hours to catch issues you report

## Communication & Transparency

### When Things Go Wrong

Here's what you can expect from us during an incident:

1. **First Update** (within the response time for the incident's severity)
   - We've found the problem and are working on it
   - How bad it is and who's affected
   - When you'll hear from us next
2. **Regular Updates** (every 2 hours for critical issues)
   - What's happening right now
   - What we're doing to fix it
   - Updated timeline if things change
3. **All Clear**
   - Everything is back to normal
   - Quick summary of what happened
   - We'll do a full review and share lessons learned

### After We Fix It

For serious incidents (Severity 1 & 2), we'll publish a full report within 5 business days that includes:

- **What Happened**: Step-by-step timeline of the incident
- **Who Was Affected**: How many customers and what services were impacted
- **Root Cause**: What actually caused the problem
- **How We'll Prevent It**: Specific steps we're taking to avoid this happening again
- **Lessons Learned**: What worked well and what we'll do better next time
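
The 5-business-day window can be turned into a calendar date with a small helper. This is a sketch under stated assumptions: business days are Monday through Friday in UTC, public holidays are ignored, and counting starts the day after the incident is resolved. The function name is hypothetical.

```javascript
// Illustrative sketch only: Monday-Friday in UTC, no holiday calendar.
function addBusinessDays(startDate, businessDays) {
  const result = new Date(startDate.getTime());
  let remaining = businessDays;
  while (remaining > 0) {
    result.setUTCDate(result.getUTCDate() + 1);
    const day = result.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (day !== 0 && day !== 6) remaining -= 1;
  }
  return result;
}

// An incident resolved on Wednesday 2025-03-05 would have its report due
// by the following Wednesday.
const reportDue = addBusinessDays(new Date("2025-03-05T00:00:00Z"), 5);
console.log(reportDue.toISOString().slice(0, 10)); // 2025-03-12
```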

### Our Status Page

- **Check Status**: [status.gruntwork.io](https://status.gruntwork.io/)
- **Live Updates**: Real-time health indicators for all our services
- **Incident History**: 90 days of past incidents and resolutions
- **Get Notified**: Subscribe to email or SMS alerts for outages

## What's Not Covered

This SLA doesn't apply to:

- Beta or preview features (they're still experimental)
- Scheduled maintenance (we'll give you 72 hours' notice)
- Issues outside our control (internet outages, AWS problems, etc.)
- Problems you caused (wrong configuration, hitting rate limits, etc.)
- Third-party service failures

## Need Help?

Here's how to reach us:

- **Support Portal**: [support.gruntwork.io](https://support.gruntwork.io) - Submit tickets and track issues
- **Email**: [email protected] - Direct email support
- **Slack**: Customer-specific channels (where available) - Real-time chat with our team

sidebars/docs.js

Lines changed: 5 additions & 0 deletions
@@ -81,6 +81,11 @@ const sidebar = [
     type: "doc",
     id: "support",
   },
+  {
+    label: "SLA Policy",
+    type: "doc",
+    id: "sla-policy",
+  },
   {
     value: "Getting Started",
     type: "html",
