relogiosoft-api-governance-kit

Health Endpoints

Category: Operations Tags: health, readiness, liveness, health-check, kubernetes, probes, operational, uptime


Summary of Rules


Why Health Endpoints?

Health endpoints allow infrastructure components (load balancers, container orchestrators, monitoring systems) to determine whether a service instance is safe to receive traffic.

Component Uses Health Check For
Load balancer Route traffic only to healthy instances
Kubernetes Restart unhealthy pods (liveness); hold traffic until ready (readiness)
API gateway Remove degraded upstreams from the rotation
Monitoring / alerting Alert on-call when service health degrades

Endpoint Definitions

/health — General Health

A combined health check suitable for load balancers and monitoring tools.

GET /health HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-store

{
  "status": "Healthy",
  "timestamp": "2024-07-23T10:30:00Z",
  "version": "1.4.2",
  "checks": [
    { "name": "database", "status": "Healthy" },
    { "name": "cache", "status": "Healthy" }
  ]
}

/health/live — Liveness (Kubernetes)

Answers: “Is this process alive and not deadlocked?”

A failing liveness probe causes Kubernetes to restart the pod. Keep this check lightweight — it should only verify that the process is running and responsive:

It MUST NOT check external dependencies (database, downstream services).

GET /health/live HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-store

{
  "status": "Healthy",
  "timestamp": "2024-07-23T10:30:00Z"
}

/health/ready — Readiness (Kubernetes)

Answers: “Is this instance ready to serve traffic?”

A failing readiness probe causes Kubernetes to remove the pod from the service’s load balancer without restarting it. Use this to signal temporary unavailability:

GET /health/ready HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-store

{
  "status": "Healthy",
  "timestamp": "2024-07-23T10:30:00Z",
  "checks": [
    { "name": "database", "status": "Healthy", "responseTimeMs": 4 },
    { "name": "config-service", "status": "Healthy", "responseTimeMs": 12 }
  ]
}

Response Schema

Status Values

Status HTTP Code Meaning
Healthy 200 OK All checks passed; service is operating normally.
Degraded 200 OK One or more non-critical checks failed; service is operating with reduced capability. Load balancers SHOULD continue routing traffic.
Unhealthy 503 Service Unavailable One or more critical checks failed; service cannot serve requests. Load balancers MUST stop routing traffic.

Top-Level Fields

{
  "status": "Healthy",
  "timestamp": "2024-07-23T10:30:00Z",
  "version": "1.4.2",
  "checks": [ ... ]
}
Field Required Type Description
status Yes string Overall health: Healthy, Degraded, or Unhealthy.
timestamp Yes string (ISO 8601 UTC) When the health check was performed.
version No string Deployed version of the service.
checks No array Individual dependency check results.

Check Object Fields

Field Required Type Description
name Yes string Identifier for this check (e.g. "database", "cache", "payment-service").
status Yes string Healthy, Degraded, or Unhealthy.
responseTimeMs No integer Time taken for this check, in milliseconds.
description No string Human-readable detail about the check result (never include PII or credentials).

Example: Degraded Response

HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-store

{
  "status": "Degraded",
  "timestamp": "2024-07-23T10:30:00Z",
  "version": "1.4.2",
  "checks": [
    { "name": "database", "status": "Healthy", "responseTimeMs": 5 },
    { "name": "email-service", "status": "Unhealthy", "description": "Connection timeout after 3 retries" }
  ]
}

Email service is non-critical (emails are queued); the service continues operating but is degraded.

Example: Unhealthy Response

HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Cache-Control: no-store
Retry-After: 30

{
  "status": "Unhealthy",
  "timestamp": "2024-07-23T10:30:00Z",
  "version": "1.4.2",
  "checks": [
    { "name": "database", "status": "Unhealthy", "description": "Cannot connect to primary database" },
    { "name": "cache", "status": "Healthy", "responseTimeMs": 2 }
  ]
}

What to Check

Liveness Checks (fast, internal only)

Check What to Verify
Process health Is the service responding to HTTP?
Memory Is the process within its memory limits?
Thread pool Are worker threads available?

Readiness Checks (may include external dependencies)

Check What to Verify
Database Can the service execute a simple query?
Cache Is the cache connection available?
Required config Has configuration loaded successfully?
Startup tasks Have migrations / warm-ups completed?

What NOT to Check in Health Endpoints


URI Conventions

/health           → General health (combined liveness + readiness)
/health/live      → Liveness only
/health/ready     → Readiness only

These endpoints SHOULD be at the root of the service path, not under a versioned prefix (e.g. not /v1/health). Health endpoints are operational infrastructure, not part of the API version contract.


Kubernetes Probe Configuration

Example Kubernetes deployment configuration:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Caching

Health endpoint responses MUST include Cache-Control: no-store to prevent intermediaries from caching health status. Stale health responses could cause traffic to be routed to an unhealthy instance.

Cache-Control: no-store