| 1 | +--- |
| 2 | +toc_max_heading_level: 2 |
| 3 | +--- |
| 4 | + |
| 5 | +# Service Level Objective (SLO) Policy |
| 6 | + |
| 7 | +We are committed to providing reliable, high-quality services to our customers. |
| 8 | + |
| 9 | +## Incident Classification & Response Times |
| 10 | + |
| 11 | +| Severity | Definition | Response Time | Resolution Time | Communication | |
| 12 | +|----------|------------|---------------|-----------------|---------------| |
| 13 | +| **Severity 1 (Critical)** | Complete service outage or critical functionality unavailable affecting multiple customers | 30 minutes | 4 hours | Immediate notification via email | |
| 14 | +| **Severity 2 (High)** | Significant degradation of service or critical functionality unavailable for a subset of customers | 1 hour | 8 hours | Notification via email upon resolution | |
| 15 | +| **Severity 3 (Medium)** | Minor service degradation or non-critical functionality unavailable | 4 hours | 24 hours | As needed | |
| 16 | +| **Severity 4 (Low)** | Cosmetic issues or minor bugs with workarounds available | 1 business day | Best effort | As needed | |
| 17 | + |
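For teams that want to track these targets in tooling, the table above can be encoded as data. The following is a minimal sketch, not part of the policy itself: the names `SloTarget`, `SLO_TARGETS`, and `deadlines` are hypothetical, and it assumes both clocks start when the incident is detected, which the policy does not state. Severity 4's "1 business day" response and "best effort" resolution depend on a business calendar, so they are represented as `None` rather than guessed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SloTarget:
    """Hypothetical container for one row of the severity table."""
    response: Optional[timedelta]    # None: no fixed clock duration
    resolution: Optional[timedelta]  # None: best effort


# Durations copied from the table above, keyed by severity number.
SLO_TARGETS = {
    1: SloTarget(timedelta(minutes=30), timedelta(hours=4)),
    2: SloTarget(timedelta(hours=1), timedelta(hours=8)),
    3: SloTarget(timedelta(hours=4), timedelta(hours=24)),
    4: SloTarget(None, None),  # "1 business day" / "best effort"
}


def deadlines(
    severity: int, detected_at: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (respond_by, resolve_by) for an incident.

    Assumes the clock starts at detection time. An entry is None
    when the table gives no fixed duration for that severity.
    """
    target = SLO_TARGETS[severity]
    respond_by = (
        detected_at + target.response if target.response is not None else None
    )
    resolve_by = (
        detected_at + target.resolution if target.resolution is not None else None
    )
    return respond_by, resolve_by
```

For example, `deadlines(1, datetime(2024, 1, 1, 12, 0))` gives a response deadline of 12:30 and a resolution deadline of 16:00 on the same day.
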
| 18 | +## Incident Detection Procedures |
| 19 | + |
| 20 | +We've set up several systems to identify incidents: |
| 21 | + |
| 22 | +- **Real-time Monitoring**: We have observability and monitoring on our core infrastructure. |
| 23 | +- **Error Tracking**: We use tools like Sentry to aggregate errors and send notifications about them. |
| 24 | +- **Support Monitoring**: Our team watches support channels during business hours to catch issues you report. |
| 25 | + |
| 26 | +## Communication & Transparency |
| 27 | + |
| 28 | +### When Things Go Wrong |
| 29 | + |
| 30 | +Here's what you can expect from us during an incident: |
| 31 | + |
| 32 | +1. **First Update** (within our response time) |
| 33 | + - We've found the problem and are working on it |
| 34 | + - How bad it is and who's affected |
| 35 | + - When you'll hear from us next |
| 36 | +2. **Regular Updates** |
| 37 | + - What's happening right now |
| 38 | + - What we're doing to fix it |
| 39 | + - Updated timeline if things change |
| 40 | +3. **All Clear** |
| 41 | + - Everything is back to normal |
| 42 | + - Quick summary of what happened |
| 43 | + - We'll do a full review and share lessons learned |
| 44 | + |
| 45 | +### After We Fix It |
| 46 | + |
| 47 | +For serious incidents (Severity 1 & 2), we'll write a Root Cause Analysis and share it with customers upon request. It includes: |
| 48 | + |
| 49 | +- **What Happened**: Step-by-step timeline of the incident |
| 50 | +- **Who Was Affected**: How many customers and what services were impacted |
| 51 | +- **Root Cause**: What actually caused the problem |
| 52 | +- **How We'll Prevent It**: Specific steps we're taking to avoid this happening again |
| 53 | +- **Lessons Learned**: What worked well and what we'll do better next time |
| 54 | + |
| 55 | +## What's Not Covered |
| 56 | + |
| 57 | +This SLO doesn't apply to: |
| 58 | + |
| 59 | +- Beta or preview features (they're still experimental) |
| 60 | +- Scheduled maintenance |
| 61 | +- Issues outside our control (internet outages, AWS problems, etc.) |
| 62 | +- Problems you caused (wrong configuration, hitting rate limits, etc.) |
| 63 | +- Third-party service failures |
| 64 | + |
| 65 | +## Need Help? |
| 66 | + |
| 67 | +Here's how to reach us: |
| 68 | + |
| 69 | +- **Support Portal**: [support.gruntwork.io](https://support.gruntwork.io) - Submit tickets and track issues |
| 70 | +- **Email**: [email protected] - Direct email support |
| 71 | +- **Slack**: Customer-specific channels (where available) - Real-time chat with our team |