# AI-powered Pod Diagnosis

The AI-powered Pod Diagnosis feature automatically analyzes the health and status of Kubernetes Pods. It generates structured prompts based on observed behaviors and logs to support large language model (LLM)-based reasoning and troubleshooting.

This feature helps users pinpoint issues in Pods by combining failure conditions, historical warnings, and runtime logs into a unified diagnosis report.


## Architecture and Workflow

The diagnosis process follows these key steps:

### 1. Locate the Pod Object

- The target Pod is identified by its namespace and name.
- The Pod is retrieved from the Kubernetes API and validated (see the sketch below).
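
For illustration only, a lookup with client-go might look like the following sketch; the function name `locatePod` and the error wrapping are assumptions, not the project's actual code.

```go
package diagnosis

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// locatePod fetches the target Pod by namespace and name and fails fast if it
// cannot be found, mirroring the "locate and validate" step described above.
func locatePod(ctx context.Context, client kubernetes.Interface, namespace, name string) (*corev1.Pod, error) {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to locate Pod %s/%s: %w", namespace, name, err)
	}
	return pod, nil
}
```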

### 2. Analyze Pod Status

Several categories of data are extracted:

#### Failures

- Failure messages are extracted under the following conditions:
  - The Pod is in the `Pending` phase and the scheduling reason is `Unschedulable`.
  - Any init container fails.
  - Any container is in a failing state (e.g., `CrashLoopBackOff`, `CreateContainerError`, unhealthy readiness probes, abnormal terminations).
- These messages are marked as Failures (a sketch of the checks follows below).
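
As a rough illustration of these checks (not the project's exact code), a Go sketch against the `core/v1` Pod status might look like this; the helper name `collectFailures` and the message formats are assumptions.

```go
package diagnosis

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// collectFailures gathers failure messages for the conditions listed above:
// an unschedulable Pending Pod, failed init containers, and containers in
// failing waiting states such as CrashLoopBackOff or CreateContainerError.
func collectFailures(pod *corev1.Pod) []string {
	var failures []string

	// Pending Pod whose scheduling condition reports Unschedulable.
	if pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled && cond.Reason == corev1.PodReasonUnschedulable {
				failures = append(failures, fmt.Sprintf("unschedulable: %s", cond.Message))
			}
		}
	}

	// Init containers that terminated with a non-zero exit code.
	for _, cs := range pod.Status.InitContainerStatuses {
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			failures = append(failures, fmt.Sprintf("init container %s failed: %s", cs.Name, cs.State.Terminated.Reason))
		}
	}

	// Main containers stuck in a failing waiting state.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil {
			switch cs.State.Waiting.Reason {
			case "CrashLoopBackOff", "CreateContainerError":
				failures = append(failures, fmt.Sprintf("container %s: %s: %s", cs.Name, cs.State.Waiting.Reason, cs.State.Waiting.Message))
			}
		}
	}
	return failures
}
```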

#### Warnings

- The system fetches all Events related to the Pod via Prometheus or the Kubernetes API.
- These events are included as Warnings (see the sketch below).
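
A sketch of the Kubernetes API path (the Prometheus path is omitted) could use a field selector to list Warning events for the Pod; the function name and output format here are illustrative assumptions.

```go
package diagnosis

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// collectWarnings lists Warning events that reference the Pod and renders
// them as plain strings for the diagnosis report.
func collectWarnings(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) ([]string, error) {
	selector := fmt.Sprintf("involvedObject.kind=Pod,involvedObject.name=%s,type=Warning", pod.Name)
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{FieldSelector: selector})
	if err != nil {
		return nil, err
	}

	warnings := make([]string, 0, len(events.Items))
	for _, ev := range events.Items {
		warnings = append(warnings, fmt.Sprintf("%s: %s (count %d)", ev.Reason, ev.Message, ev.Count))
	}
	return warnings, nil
}
```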

#### Infos

- Container logs are included as Info context to support runtime-level diagnostics.
- Log fetching is conditional: it is performed only when the Pod matches specific failure patterns.

The decision is made by the `shouldFetchLog(pod)` function, which returns true if either of the following is detected:

1. **Terminated containers with errors**: any init or main container has exited with a non-zero exit code (`Terminated.ExitCode != 0`).
2. **CrashLoopBackOff states**: any container is stuck in a `Waiting` state with reason `CrashLoopBackOff`.

These conditions ensure that logs are collected only when they are likely to provide useful insight into the failure's root cause, avoiding unnecessary overhead. A sketch of this check is shown below.
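
A minimal sketch of `shouldFetchLog`, assuming the two conditions above are the only triggers (the real implementation may cover more cases):

```go
package diagnosis

import corev1 "k8s.io/api/core/v1"

// shouldFetchLog reports whether container logs are worth collecting:
// either a container terminated with a non-zero exit code, or a container
// is waiting in CrashLoopBackOff.
func shouldFetchLog(pod *corev1.Pod) bool {
	failing := func(statuses []corev1.ContainerStatus) bool {
		for _, cs := range statuses {
			// Condition 1: terminated with a non-zero exit code.
			if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
				return true
			}
			// Condition 2: stuck in CrashLoopBackOff.
			if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
				return true
			}
		}
		return false
	}
	return failing(pod.Status.InitContainerStatuses) || failing(pod.Status.ContainerStatuses)
}
```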

The following diagram illustrates the diagnosis logic for a Pod:

*(Figure: Pod Diagnosis Flow)*


## AI Prompt Construction

After collecting the data, the system builds a structured prompt using a fixed template. Example template:

```text
You are a helpful Kubernetes cluster failure diagnosis expert. Please analyze the following symptoms and respond in Chinese.

Abnormal information: --- {{.ErrorInfo}} ---
Historical Pod warning events (use if helpful): --- {{.EventInfo}} ---
Pod logs (use if helpful): --- {{.LogInfo}} ---

Please respond with the following format (not exceeding 1000 characters):

Healthy: {Yes or No}
Error: {Explain the problem}
Solution: {Step-by-step recommended solution}
```

The prompt includes:

- `ErrorInfo`: extracted failure messages.
- `EventInfo`: warning events.
- `LogInfo`: container logs.

This enables the LLM to infer root causes and generate actionable outputs. A sketch of the template rendering follows below.
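
For illustration, rendering the collected data into this template with Go's `text/template` might look like the sketch below; the struct and function names are assumptions, not the project's actual API.

```go
package diagnosis

import (
	"bytes"
	"text/template"
)

// promptData carries the three pieces of context referenced by the template.
type promptData struct {
	ErrorInfo string
	EventInfo string
	LogInfo   string
}

// buildPrompt fills the fixed diagnosis template with the collected
// failures, warning events, and logs.
func buildPrompt(tmplText string, data promptData) (string, error) {
	tmpl, err := template.New("pod-diagnosis").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```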


## Example Use Case: Diagnosing a Pod with Permission Issues

This example shows how to diagnose a problematic Pod using the `AegisDiagnosis` CRD.

### Step 1: Apply the Diagnosis CR

```bash
kubectl apply -f diagnosis-pod.yaml
```

```yaml
# diagnosis-pod.yaml
apiVersion: aegis.io/v1alpha1
kind: AegisDiagnosis
metadata:
  name: diagnose-pod
  namespace: monitoring
spec:
  object:
    kind: Pod
    name: workflow-controller-xxxxx
    namespace: scitix-system
```

### Step 2: Watch Diagnosis Execution

```bash
kubectl get -f diagnosis-pod.yaml --watch
```

Once the task completes, you should see the phase change to `Completed`.

### Step 3: Inspect the Result

```bash
kubectl describe -n monitoring aegisdiagnosises.aegis.io diagnose-pod
```

#### Sample Output

```text
Status:
  Phase: Completed
  Explain: Healthy: No
  Error: The container attempted to register a watch on a ConfigMap during startup but failed due to insufficient permissions.
         Error message: "configmaps 'workflow-controller-configmap' is forbidden: User 'system:serviceaccount:...:argo' cannot get resource 'configmaps' in the namespace 'scitix-system'."

  Result:
    Failures:
      - the last termination reason is Error container=workflow-controller
    Infos:
      [pod logs]
      - ... Failed to register watch for controller config map ...
    Warnings:
      - BackOff restarting failed container (count 1930)
```

### Suggested Solution by AI

1. **Check RBAC permissions.** Confirm whether the service account (`argo`) has access to `configmaps` in the namespace. Use:

   ```bash
   kubectl get rolebinding -n scitix-system
   kubectl get clusterrolebinding
   ```

2. **Create a Role and RoleBinding (if needed).** If missing, create them with:

   ```yaml
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     name: argo-configmaps-reader
     namespace: scitix-system
   rules:
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["get", "watch", "list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: argo-configmaps-reader-binding
     namespace: scitix-system
   subjects:
   - kind: ServiceAccount
     name: argo
     namespace: scitix-system
   roleRef:
     kind: Role
     name: argo-configmaps-reader
     apiGroup: rbac.authorization.k8s.io
   ```

3. **Redeploy the Pod.** Delete the failing Pod so it can restart with the updated permissions:

   ```bash
   kubectl delete pod workflow-controller-xxxxx -n scitix-system
   ```

4. **Monitor logs and status.** Ensure the new Pod starts correctly and the permission issue is resolved.

## Custom Prompt Support

Users can customize the diagnosis prompt to control how the analysis result is structured and phrased.

The available variables for Pod diagnosis prompts are the same as those used in Node diagnosis. Please refer to that section for details.

➡️ For instructions on defining a custom prompt, see the Custom Prompt Guide.