Health Check Configuration Guide
KredSLA scans your infrastructure during every discovery cycle and flags resources where the health check configuration may produce false-positive or false-negative SLA breach signals. These warnings are advisory — KredSLA does not modify your existing configuration — but acting on them makes breach detection more accurate and SLA credit evidence stronger.
This article explains each warning, why it matters for SLA claims, and how to fix it for each supported cloud provider.
Why health checks matter for SLA claims
KredSLA detects SLA breaches by combining two signals:
- Metric anomalies: CloudWatch / Azure Monitor / GCP Cloud Monitoring alarms that fire when a metric crosses an SLA threshold (e.g. HTTPCode_ELB_5XX_Count, DatabaseConnections).
- Provider health events: RSS/Atom feeds from the provider's status page, which confirm that a managed service was genuinely degraded.
Both signals must align for KredSLA to open a claim. A poorly configured health check contaminates the first signal: either genuine outages are hidden (false negatives) or noisy alarms fire during normal operation (false positives). Either way, claim accuracy suffers.
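The alignment rule can be illustrated with a small sketch. KredSLA's actual matching logic is not documented here, so the interval-overlap test below is an assumption, and the names Window, overlaps, and should_open_claim are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Window:
    """A closed time interval during which a signal was active."""
    start: datetime
    end: datetime


def overlaps(a: Window, b: Window) -> bool:
    """True when the two intervals share at least one instant."""
    return a.start <= b.end and b.start <= a.end


def should_open_claim(metric_anomalies, provider_events) -> bool:
    """Assumed rule: open a claim only if some metric anomaly
    coincides in time with some provider health event."""
    return any(
        overlaps(m, p) for m in metric_anomalies for p in provider_events
    )


t0 = datetime(2024, 1, 1, 12, 0)
alarm = Window(t0, t0 + timedelta(minutes=10))
incident = Window(t0 + timedelta(minutes=5), t0 + timedelta(hours=1))

print(should_open_claim([alarm], [incident]))  # True: both signals align
print(should_open_claim([alarm], []))          # False: alarm without a provider event
print(should_open_claim([], [incident]))       # False: provider event without an alarm
```

The third case is the false negative described above: if a misconfigured health check suppresses the alarm, the provider event alone is not enough to open a claim.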
Warning: ALB health check path is /
Severity: high
What it means: The ALB target group is configured to send health check requests to / (the root path). A server can return 200 OK from / even when its backend dependencies — the database, a cache, an upstream service — are completely unreachable. The load balancer sees the target as healthy and continues routing traffic. From ALB's perspective, no outage occurred.
Why it matters for SLA claims: If ALB considers the target healthy, no HTTPCode_ELB_5XX_Count alarm fires. KredSLA's breach detector finds no metric anomaly and skips the incident even if the provider declared a managed-service outage.
How to fix it
The fix is a dedicated /health endpoint in your application that probes its critical backend dependencies and returns a non-200 status when they are unhealthy.
Example — minimal Python /health endpoint that TCP-probes a database:
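import socket

def health_check(db_host, db_port=5432):
    """TCP-probe the database and return (status_code, body)."""
    try:
        # A successful connect shows the database port is reachable.
        with socket.create_connection((db_host, db_port), timeout=3):
            return 200, {"status": "ok"}
    except Exception as e:
        # Refused, timed out, or unresolvable: report the dependency as down.
        return 503, {"status": "error", "detail": str(e)}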
Return 503 Service Unavailable (not 500) when unhealthy. ALB's default Matcher.HttpCode is 200, so 503 marks the target unhealthy. 500 would also work, but 503 more accurately signals a dependency problem rather than an application crash.
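One way to expose the probe over HTTP is with the Python standard library alone. This is a minimal sketch, not a production server: make_handler is a hypothetical helper, and the host name and port in the usage note are placeholders. In a real application you would normally add a /health route to your existing web framework instead.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_check(db_host, db_port=5432):
    """TCP-probe the database and return (status_code, body)."""
    try:
        with socket.create_connection((db_host, db_port), timeout=3):
            return 200, {"status": "ok"}
    except OSError as e:
        # OSError covers refused connections, timeouts and DNS failures.
        return 503, {"status": "error", "detail": str(e)}


def make_handler(db_host, db_port=5432):
    """Build a request handler class bound to one database address."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, body = health_check(db_host, db_port)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            # Load balancers probe frequently; keep the probes out of stderr.
            pass

    return HealthHandler
```

To serve it, pass the handler to HTTPServer, for example HTTPServer(("0.0.0.0", 8080), make_handler("db.internal")).serve_forever(), where db.internal and 8080 stand in for your own database host and listening port.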
AWS — update the target group:
aws elbv2 modify-target-group \
  --target-group-arn <arn> \
  --health-check-path /health \
  --matcher HttpCode=200
Or in a CloudFormation template:
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    HealthCheckPath: /health
    Matcher:
      HttpCode: '200'
Azure Application Gateway — update the health probe in the portal under Health probes, or via the CLI:
az network application-gateway probe update \
  --gateway-name <gw-name> \
  --resource-group <rg> \
  --name <probe-name> \
  --path /health
GCP HTTP(S) Load Balancer — update the backend service health check:
gcloud compute health-checks update http <check-name> \
  --request-path /health
OCI Load Balancer — update the backend set health checker:
oci lb backend-set update \
  --load-balancer-id <lb-ocid> \
  --backend-set-name <name> \
  --health-checker '{"protocol":"HTTP","urlPath":"/health","returnCode":200}'
Warning: ALB health check matcher allows 5xx codes
Severity: high
What it means: The target group's Matcher.HttpCode is set to a range or list that includes 5xx responses (e.g. 200-599, 200,503,504). The load balancer treats a 5xx response as healthy, so the HTTPCode_ELB_5XX_Count metric is not generated for those responses.
Why it matters for SLA claims: HTTPCode_ELB_5XX_Count is one of the primary signals KredSLA uses to confirm a backend outage. If 5xx responses are masked as "healthy", the alarm never fires and no breach is detected.
How to fix it
Set the matcher to 200 only:
aws elbv2 modify-target-group \
  --target-group-arn <arn> \
  --matcher HttpCode=200
If your application legitimately returns other 2xx or 3xx codes from the health endpoint (e.g. 204 No Content), include only those:
--matcher HttpCode=200,204
Never include 4xx or 5xx in the health check matcher.
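A matcher string can be checked against this rule before it is applied. The sketch below assumes the three matcher shapes used in this article (a single code, a comma-separated list, and a range); matcher_codes and matcher_is_safe are hypothetical helper names, not part of any AWS SDK.

```python
def matcher_codes(matcher: str) -> set:
    """Expand a matcher such as '200', '200,204' or '200-299' into a set of ints."""
    codes = set()
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def matcher_is_safe(matcher: str) -> bool:
    """True when every accepted code is below 400, i.e. no 4xx or 5xx."""
    return all(code < 400 for code in matcher_codes(matcher))


for candidate in ("200", "200,204", "200-599", "200,503,504"):
    print(candidate, "->", "ok" if matcher_is_safe(candidate) else "masks errors")
```

The first two candidates pass; the last two are the misconfigurations this warning describes.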
Warning: ASG uses HealthCheckType=EC2
Severity: high
What it means: The Auto Scaling group uses EC2-level health checks. An instance is only replaced when it is stopped, terminated, or fails the EC2 system status check. The ALB's view of the instance (healthy vs. unhealthy) is ignored.
Why it matters for SLA claims: An instance can be serving 5xx responses — the ALB marks it unhealthy — but the ASG does not replace it because EC2 reports it as running. Traffic continues hitting the broken instance. The HTTPCode_ELB_5XX_Count metric rises but no replacement occurs, making it look like a sustained application fault rather than a transient instance failure. This muddies the evidence for a provider-side claim.
How to fix it
Switch the ASG health check type to ELB:
AWS Console: Auto Scaling Groups → Edit → Health check type → ELB
AWS CLI:
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <asg-name> \
  --health-check-type ELB \
  --health-check-grace-period 300
Set a grace period long enough to cover your application's startup time. If your app takes 60 seconds to become healthy after launch, a 60-second grace period is the minimum; 300 seconds is a safe default for most web applications.
CloudFormation:
WebASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    HealthCheckType: ELB
    HealthCheckGracePeriod: 300
Note: existing instances are not immediately replaced when you make this change. The new health check type applies when the ASG next evaluates instance health.
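The difference between the two health check types can be modeled in a few lines. This is a deliberately simplified illustration of the behavior described in this section, not AWS's implementation: asg_marks_unhealthy is a hypothetical function, and it ignores details such as how the grace period interacts with EC2 status checks.

```python
def asg_marks_unhealthy(health_check_type, ec2_ok, elb_healthy,
                        seconds_since_launch, grace_period=300):
    """Simplified model: would the ASG mark this instance for replacement?

    ec2_ok      -- instance is running and passes EC2 status checks
    elb_healthy -- the load balancer's health check currently passes
    """
    if not ec2_ok:
        # EC2-level failures count under both health check types.
        return True
    if health_check_type == "ELB" and seconds_since_launch >= grace_period:
        # The load balancer's verdict counts only when the type is ELB,
        # and only once the grace period has elapsed.
        return not elb_healthy
    return False


# Instance is running but failing the load balancer health check.
print(asg_marks_unhealthy("EC2", True, False, 600))  # False: left in service
print(asg_marks_unhealthy("ELB", True, False, 600))  # True: replaced
print(asg_marks_unhealthy("ELB", True, False, 30))   # False: still in grace period
```

The first case is the situation this warning flags: the broken instance stays in the group and keeps receiving traffic.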
Warning: RDS enhanced monitoring is disabled
Severity: medium
What it means: MonitoringInterval is set to 0 on the RDS instance, so OS-level metrics (CPU, memory, disk I/O, network) are not published to CloudWatch Logs. Only the standard 1-minute CloudWatch metrics are available.
Why it matters for SLA claims: Without enhanced monitoring, diagnosing why a database slowed down or became unreachable relies on coarse-grained CloudWatch metrics alone. When filing a support case for an SLA credit, OS-level evidence (CPU saturation, I/O wait, network drops) collected at 60-second or 1-second granularity substantially strengthens the case and speeds provider review.
There is also a specific false-positive risk: when the application has no real traffic, DatabaseConnections drops to zero. Without enhanced monitoring, it is harder to distinguish genuine connection loss from idle-app silence.
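To see which instances are affected, you can inspect the JSON printed by aws rds describe-db-instances. The sketch below assumes that output has already been parsed into a dictionary; instances_without_enhanced_monitoring is a hypothetical helper and the sample identifiers are made up.

```python
def instances_without_enhanced_monitoring(describe_output: dict) -> list:
    """Return identifiers of DB instances whose MonitoringInterval is 0."""
    return [
        db["DBInstanceIdentifier"]
        for db in describe_output.get("DBInstances", [])
        if db.get("MonitoringInterval", 0) == 0
    ]


# Trimmed-down sample in the shape of `aws rds describe-db-instances` output.
sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "orders-db", "MonitoringInterval": 0},
        {"DBInstanceIdentifier": "billing-db", "MonitoringInterval": 60},
    ]
}

print(instances_without_enhanced_monitoring(sample))  # ['orders-db']
```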
How to fix it
Enhanced monitoring requires an IAM role that grants RDS permission to publish metrics to CloudWatch Logs.
Step 1 — create the monitoring role (skip if it already exists):
# Create trust policy
cat > rds-monitoring-trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "monitoring.rds.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role \
  --role-name rds-enhanced-monitoring \
  --assume-role-policy-document file://rds-monitoring-trust.json
aws iam attach-role-policy \
  --role-name rds-enhanced-monitoring \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole
Step 2 — enable on the instance (this triggers a brief apply window; no reboot required):
aws rds modify-db-instance \
  --db-instance-identifier <instance-id> \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::<account-id>:role/rds-enhanced-monitoring \
  --apply-immediately
CloudFormation (opt-in via parameter — see the reference 3-tier template for the full pattern):
Parameters:
  EnableEnhancedMonitoring:
    Type: String
    Default: "false"
    AllowedValues: ["true", "false"]
Conditions:
  EnhancedMonitoring: !Equals [!Ref EnableEnhancedMonitoring, "true"]
Resources:
  RDSMonitoringRole:
    Type: AWS::IAM::Role
    Condition: EnhancedMonitoring
    Properties:
      AssumeRolePolicyDocument: ...
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      MonitoringInterval: !If [EnhancedMonitoring, 60, 0]
      MonitoringRoleArn: !If [EnhancedMonitoring, !GetAtt RDSMonitoringRole.Arn, !Ref AWS::NoValue]
Azure SQL / Cosmos DB / PostgreSQL Flexible Server — enable diagnostic settings to send resource logs and metrics to a Log Analytics workspace:
az monitor diagnostic-settings create \
  --name kredsla-diag \
  --resource <resource-id> \
  --workspace <log-analytics-workspace-id> \
  --metrics '[{"category":"AllMetrics","enabled":true}]' \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]'
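GCP Cloud SQL: enable Query Insights and the pgAudit database flag (PostgreSQL instances) for deeper visibility into database activity. Note that --database-flags replaces the instance's full set of flags, so include any flags that are already set:
gcloud sql instances patch <instance-name> \
  --database-flags cloudsql.enable_pgaudit=on \
  --insights-config-query-insights-enabled=true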
OCI Autonomous Database — enable Performance Hub and Operations Insights from the OCI Console (Database → Performance Hub → Enable Operations Insights).
The idle-app false positive
A specific pattern worth understanding: when a web application has zero real traffic, the database connection pool drains. CloudWatch's DatabaseConnections metric drops to zero and the DatabaseConnections < 1 alarm fires. KredSLA checks the provider health feed and finds no incident — so it correctly skips the alert. But the noise makes genuine outages harder to spot.
The reference 3-tier CloudFormation template (test/aws/cfn-3-tier-lowest-cost.yaml) avoids this by running a background thread on each EC2 instance that holds a persistent TCP connection to RDS port 3306 open at all times. The connection uses only the MySQL server greeting (no authentication needed) and refreshes every 8 seconds — just under MySQL's default 10-second connect_timeout. CloudWatch always sees at least one connection, so the alarm only fires during genuine database outages.
If you are using the KredSLA reference templates, this is already in place. If you are onboarding an existing application, consider adding a lightweight connection keep-alive mechanism or, better, a real application health check that queries the database. Running SELECT 1 every 30 seconds from a background thread or a sidecar process is sufficient.
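A connection keep-alive of this kind might look like the following. It is a sketch under stated assumptions and not the code from the reference template: hold_connection and start_keepalive are hypothetical names, the sketch opens a bare TCP connection without reading the MySQL greeting, and the 8-second refresh mirrors the interval mentioned above.

```python
import socket
import threading


def hold_connection(host, port, stop_event, refresh_seconds=8.0):
    """Keep one TCP connection to the database open until stop_event is set."""
    while not stop_event.is_set():
        try:
            with socket.create_connection((host, port), timeout=3):
                # Hold the socket open, then reconnect after the refresh
                # interval (or exit early if stop_event is set).
                stop_event.wait(refresh_seconds)
        except OSError:
            # Database unreachable: back off briefly, then retry.
            stop_event.wait(1.0)


def start_keepalive(host, port=3306, refresh_seconds=8.0):
    """Run hold_connection in a daemon thread.

    Returns (stop_event, thread); call stop_event.set() to shut it down.
    """
    stop_event = threading.Event()
    thread = threading.Thread(
        target=hold_connection,
        args=(host, port, stop_event, refresh_seconds),
        daemon=True,
    )
    thread.start()
    return stop_event, thread
```

Call start_keepalive with your database host at application startup. Because the thread is a daemon, it does not block interpreter shutdown.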
Checking your current warning state
After connecting a cloud account, KredSLA surfaces any detected warnings in the onboarding wizard (Step 4 — SLA Alerting) as an amber advisory panel. You can also retrieve them programmatically:
GET /api/v1/assets/{account_id}/onboarding-status
The response includes healthcheck_warnings — a list of warnings with resource_type, resource_id, issue, recommendation, and severity fields. Warnings are re-evaluated on every discovery scan (every 30 minutes) and automatically cleared once the configuration is corrected.
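A client for this endpoint could be sketched as follows. The endpoint path and the healthcheck_warnings field names come from this article; the base URL, the bearer-token authentication, and the sample resource_type values are assumptions for illustration, so check them against your KredSLA deployment.

```python
import json
import urllib.request


def fetch_onboarding_status(base_url, account_id, token):
    """GET the onboarding status for one account.

    base_url and bearer-token auth are assumptions, not documented here.
    """
    url = f"{base_url}/api/v1/assets/{account_id}/onboarding-status"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def warnings_by_severity(status, severity):
    """Filter healthcheck_warnings down to one severity level."""
    return [
        w for w in status.get("healthcheck_warnings", [])
        if w.get("severity") == severity
    ]


# Hypothetical response body, using the documented field names.
sample_status = {
    "healthcheck_warnings": [
        {"resource_type": "alb_target_group", "resource_id": "tg-1",
         "issue": "health check path is /",
         "recommendation": "use a dedicated /health endpoint",
         "severity": "high"},
        {"resource_type": "rds_instance", "resource_id": "db-1",
         "issue": "enhanced monitoring is disabled",
         "recommendation": "set MonitoringInterval to 60",
         "severity": "medium"},
    ]
}

for warning in warnings_by_severity(sample_status, "high"):
    print(warning["resource_id"], "-", warning["issue"])
```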