AWS ECS DNS Services Guide for Network Engineers

Prerequisites: This guide assumes familiarity with traditional networking concepts but no prior container experience. We'll explain all ECS-specific terminology and components.

1. ECS Components Overview

ECS Cluster
Think of this as a logical grouping of compute resources (like a rack of servers). It's where your containers will run, similar to how VMs run on physical hosts.
Task Definition
This is like a VM template or blueprint. It defines what container images to run, how much CPU/memory to allocate, networking configuration, and other runtime parameters.
Service
A service ensures a specified number of tasks (running containers) are always running. It's like a process manager that automatically restarts failed containers and handles rolling updates.
Task
A task is a running instance of a task definition. Think of it as a running VM created from a template. One task can contain multiple containers that share networking and storage.
Service Discovery
ECS Service Discovery uses AWS Cloud Map to create and manage DNS records for your services automatically as tasks start and stop, much as DHCP hands out IP addresses without manual configuration.

2. ECS DNS Service Discovery Architecture

graph TB
    subgraph "VPC (10.0.0.0/16)"
        subgraph "Private Subnet A (10.0.1.0/24)"
            ECS1["ECS Task 1<br/>IP: 10.0.1.10"]
            ECS2["ECS Task 2<br/>IP: 10.0.1.11"]
        end
        subgraph "Private Subnet B (10.0.2.0/24)"
            ECS3["ECS Task 3<br/>IP: 10.0.2.10"]
            ECS4["ECS Task 4<br/>IP: 10.0.2.11"]
        end
        subgraph "AWS Cloud Map"
            NS["Private DNS Namespace<br/>myapp.local"]
            SRV["Service Registry<br/>web.myapp.local"]
        end
    end
    Client[Client Application] --> NS
    NS --> SRV
    SRV --> ECS1
    SRV --> ECS2
    SRV --> ECS3
    SRV --> ECS4
    style ECS1 fill:#e1f5fe
    style ECS2 fill:#e1f5fe
    style ECS3 fill:#e1f5fe
    style ECS4 fill:#e1f5fe
    style NS fill:#fff3e0
    style SRV fill:#f3e5f5
Architecture Explanation:
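This diagram shows how ECS Service Discovery works within a VPC. The key components are the ECS tasks, each with its own private IP address in one of two private subnets, and AWS Cloud Map, which holds the private DNS namespace (myapp.local) and the service registry (web.myapp.local). When a task starts, ECS registers it with the service registry automatically. When it stops or fails health checks, it is removed just as automatically.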
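To see which task IPs are currently in the registry, Cloud Map's list-instances call can be queried directly. The following is a minimal sketch: the function name registered_ips is hypothetical, and srv-12345678 is the placeholder service ID used elsewhere in this guide. It assumes an A-record service, where the address is stored in the AWS_INSTANCE_IPV4 instance attribute.

```shell
# Print the IPv4 address of every instance registered in a Cloud Map service.
# $1: Cloud Map service ID (for example srv-12345678, a placeholder).
# registered_ips is a hypothetical helper name, not part of the AWS CLI.
registered_ips() {
    aws servicediscovery list-instances \
        --service-id "$1" \
        --query 'Instances[].Attributes.AWS_INSTANCE_IPV4' \
        --output text
}
```

Calling registered_ips srv-12345678 before and after stopping a task is a quick way to watch deregistration happen.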

3. DNS Resolution Flow

sequenceDiagram
    participant Client
    participant Route53Resolver
    participant CloudMap
    participant ECSService
    participant Task1
    participant Task2
    Client->>Route53Resolver: DNS Query: web.myapp.local
    Route53Resolver->>CloudMap: Forward to Private DNS Zone
    CloudMap->>CloudMap: Check Service Registry
    CloudMap->>Route53Resolver: Return IP List [10.0.1.10, 10.0.2.10]
    Route53Resolver->>Client: DNS Response with IPs
    Client->>Task1: HTTP Request to 10.0.1.10
    Task1->>Client: HTTP Response
    Note over CloudMap,ECSService: Automatic Registration/Deregistration
    ECSService->>Task1: Health Check
    Task1->>ECSService: Health OK
    ECSService->>CloudMap: Keep IP in registry
    ECSService->>Task2: Health Check
    Task2->>ECSService: Health FAIL
    ECSService->>CloudMap: Remove IP from registry
DNS Resolution Flow Explanation:
This sequence shows how DNS resolution works with ECS Service Discovery:
  1. Client Query: Application requests DNS resolution for service name
  2. Route 53 Resolver: The VPC's DNS resolver answers from the private hosted zone that Cloud Map maintains for the namespace
  3. Cloud Map Lookup: The response lists the IP addresses of healthy registered tasks
  4. Client Connection: Client connects to one of the returned IPs
  5. Health Management: ECS reports task health to Cloud Map, which adds or removes DNS records accordingly
The system automatically handles task failures by removing unhealthy IPs from DNS responses.
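Because the response can contain several A records, the client (or its resolver library) decides which address to use. The sketch below shows one simple client-side approach, picking an address at random; pick_task_ip is a hypothetical helper, and the usage line assumes dig is installed on the client.

```shell
# Read A-record answers (one IP per line) on stdin and print one at random.
# pick_task_ip is a hypothetical helper used only for illustration.
pick_task_ip() {
    awk 'NF { ips[++n] = $1 }
         END { if (n) { srand(); print ips[int(rand() * n) + 1] } }'
}

# Example usage from a host inside the VPC:
#   dig +short web.myapp.local | pick_task_ip
```

Random selection spreads connections roughly evenly across tasks, but clients that cache the answer for the full TTL may keep using an address after its task has been deregistered.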

4. Service Discovery Types

graph LR
    subgraph "DNS-Only Discovery"
        DNS["DNS A Records<br/>web.myapp.local → IP List"]
        DNS --> IP1["10.0.1.10"]
        DNS --> IP2["10.0.1.11"]
    end
    subgraph "DNS + SRV Discovery"
        SRV["SRV Records<br/>_http._tcp.web.myapp.local"]
        SRV --> PORT1["10.0.1.10:8080"]
        SRV --> PORT2["10.0.1.11:8080"]
    end
    subgraph "API-Only Discovery"
        API["Cloud Map API<br/>DiscoverInstances"]
        API --> RESP["JSON Response<br/>with IPs + metadata"]
    end
    style DNS fill:#e8f5e8
    style SRV fill:#fff3e0
    style API fill:#f3e5f5
Service Discovery Types Explanation:
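ECS service discovery through Cloud Map offers three discovery types: DNS A records, which map the service name to a list of task IPs; DNS SRV records, which also carry the port; and API-only discovery, where clients call the Cloud Map DiscoverInstances API and receive a JSON response with IPs and metadata. DNS-only is the most common choice as it works with existing applications without code changes.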
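For the API-based option, the AWS CLI exposes DiscoverInstances as aws servicediscovery discover-instances. A minimal sketch, assuming the myapp.local namespace and web service names used in this guide; discover_healthy is a hypothetical wrapper name.

```shell
# Ask Cloud Map's DiscoverInstances API for the healthy instances of a service.
# $1: namespace name (for example myapp.local), $2: service name (for example web).
# discover_healthy is a hypothetical helper name.
discover_healthy() {
    aws servicediscovery discover-instances \
        --namespace-name "$1" \
        --service-name "$2" \
        --health-status HEALTHY \
        --query 'Instances[].Attributes' \
        --output json
}
```

Unlike a DNS lookup, this call needs AWS credentials with permission for servicediscovery:DiscoverInstances, which is the trade-off for getting metadata alongside the addresses.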

5. Implementation Command Sequence

Setup Order and Dependencies

graph TD
    A["Step 1: Create VPC & Subnets"] --> B["Step 2: Create ECS Cluster"]
    B --> C["Step 3: Create Cloud Map Namespace"]
    C --> D["Step 4: Create Cloud Map Service"]
    D --> E["Step 5: Create Task Definition"]
    E --> F["Step 6: Create ECS Service"]
    F --> G["Step 7: Verify DNS Resolution"]
    style A fill:#ffebee
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e1f5fe
    style F fill:#fce4ec
    style G fill:#f1f8e9
Command Sequence Dependencies:
This diagram shows the order in which components must be created. Each step depends on the previous ones:
  1. Infrastructure First: VPC and networking must exist before ECS
  2. Cluster Creation: ECS cluster provides the compute environment
  3. DNS Setup: Cloud Map namespace and service define DNS structure
  4. Application Definition: Task definition specifies container configuration
  5. Service Launch: ECS service starts tasks and registers them with DNS
  6. Verification: Test DNS resolution and service connectivity

6. Step-by-Step AWS CLI Commands

Step 1: Create VPC and Networking

aws ec2 create-vpc \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ecs-vpc}]'
VPC Creation: This creates the virtual network where ECS tasks will run. The CIDR block defines the IP address range available for subnets.
Parameter | Description | Alternatives
--cidr-block | IP address range for the VPC | 172.16.0.0/16, 192.168.0.0/16
--tag-specifications | Tags for resource identification | Optional, but recommended for organization
aws ec2 create-subnet \
    --vpc-id vpc-12345678 \
    --cidr-block 10.0.1.0/24 \
    --availability-zone us-east-1a \
    --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=ecs-subnet-1}]'
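Subnet Creation: Creates a subnet within the VPC for placing ECS tasks. Repeat the command with a different CIDR block (for example 10.0.2.0/24) and Availability Zone to create a second subnet; spreading tasks across subnets in different AZs provides high availability.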
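Private DNS namespaces are backed by Route 53 private hosted zones, which only resolve when both the enableDnsSupport and enableDnsHostnames attributes are true on the VPC. A sketch of turning them on, assuming the placeholder VPC ID from this guide; enable_vpc_dns is a hypothetical helper name. The modify-vpc-attribute command accepts only one attribute per call, hence the two invocations.

```shell
# Enable DNS resolution and DNS hostnames on a VPC.
# $1: VPC ID (for example vpc-12345678, a placeholder).
# enable_vpc_dns is a hypothetical helper name.
enable_vpc_dns() {
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}' &&
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}
```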

Step 2: Create ECS Cluster

aws ecs create-cluster \
    --cluster-name my-ecs-cluster \
    --capacity-providers FARGATE \
    --default-capacity-provider-strategy capacityProvider=FARGATE,weight=1 \
    --tags key=Environment,value=production
ECS Cluster: This creates the logical grouping where containers will run. Fargate is serverless compute, meaning AWS manages the underlying infrastructure.
Parameter | Description | Alternatives
--capacity-providers | How to run containers | FARGATE_SPOT for cost savings, or an Auto Scaling group capacity provider for EC2 instances
--default-capacity-provider-strategy | Default compute allocation | Can mix multiple providers with weights

Step 3: Create Cloud Map Namespace

aws servicediscovery create-private-dns-namespace \
    --name myapp.local \
    --vpc vpc-12345678 \
    --description "Private DNS namespace for ECS services"
Private DNS Namespace: This creates a private DNS zone within your VPC. Services registered here are only resolvable from within the VPC, providing internal service discovery.
Parameter | Description | Alternatives
--name | DNS domain name | Any valid domain: internal, corp.local, etc.
--vpc | VPC where namespace is available | Must be existing VPC ID
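Important: Namespace creation is asynchronous. The command returns an operation ID, not the namespace ID. Once the operation succeeds, retrieve the namespace ID with aws servicediscovery get-operation (or list-namespaces) and save it for the next step.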
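Since create-private-dns-namespace responds with an OperationId, the namespace ID has to be read from the finished operation. The sketch below polls get-operation until the status is SUCCESS and then prints the NAMESPACE target; the function name, the 30-attempt limit, and the 2-second delay are illustrative assumptions.

```shell
# Wait for a Cloud Map operation to finish and print the namespace ID it created.
# $1: operation ID returned by create-private-dns-namespace.
# namespace_id_from_operation is a hypothetical helper; the retry limits are arbitrary.
namespace_id_from_operation() {
    op_tries=0
    while [ "$op_tries" -lt 30 ]; do
        op_status=$(aws servicediscovery get-operation --operation-id "$1" \
            --query 'Operation.Status' --output text) || return 1
        case "$op_status" in
            SUCCESS)
                aws servicediscovery get-operation --operation-id "$1" \
                    --query 'Operation.Targets.NAMESPACE' --output text
                return
                ;;
            FAIL)
                return 1
                ;;
        esac
        op_tries=$((op_tries + 1))
        sleep 2
    done
    return 1
}
```

A typical use is NAMESPACE_ID=$(namespace_id_from_operation "$OPERATION_ID"), where OPERATION_ID was captured from the create command with --query OperationId --output text.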

Step 4: Create Cloud Map Service

aws servicediscovery create-service \
    --name web \
    --namespace-id ns-12345678 \
    --dns-config 'NamespaceId=ns-12345678,DnsRecords=[{Type=A,TTL=300}]' \
    --health-check-custom-config FailureThreshold=3 \
    --description "Web service discovery"
Cloud Map Service: This creates the actual service registry within the namespace. It defines how DNS records are created and managed for your ECS service.
Parameter | Description | Alternatives
--name | Service name (becomes DNS record) | Any valid hostname: api, db, cache
DnsRecords | Type A for IP addresses | SRV for port information (ECS service discovery registers only A or SRV records)
TTL | DNS cache time in seconds | 60-3600 seconds (1 min to 1 hour); shorter values let clients notice task changes sooner
FailureThreshold | Health check failures before removal | Deprecated in current Cloud Map API versions, which always use 1
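The ECS service in Step 6 needs the ARN of this Cloud Map service. If the create-service output was not saved, the ARN can be looked up from the service ID. A minimal sketch; cloudmap_service_arn is a hypothetical helper name and srv-12345678 is this guide's placeholder ID.

```shell
# Print the ARN of an existing Cloud Map service.
# $1: Cloud Map service ID (for example srv-12345678, a placeholder).
# cloudmap_service_arn is a hypothetical helper name.
cloudmap_service_arn() {
    aws servicediscovery get-service \
        --id "$1" \
        --query 'Service.Arn' \
        --output text
}
```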

Step 5: Create Task Definition

aws ecs register-task-definition \
    --family web-service \
    --network-mode awsvpc \
    --requires-compatibilities FARGATE \
    --cpu 256 \
    --memory 512 \
    --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
    --container-definitions '[
        {
            "name": "web-container",
            "image": "nginx:latest",
            "portMappings": [
                {
                    "containerPort": 80,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web-service",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ]'
Task Definition: This is the blueprint for your containers. It specifies what container image to run, resource requirements, networking, and logging configuration.
Parameter | Description | Alternatives
--family | Task definition name/group | Any descriptive name
--network-mode | awsvpc gives each task its own ENI | bridge, host (for EC2 only)
--cpu | CPU units (1024 = 1 vCPU) | 256, 512, 1024, 2048, 4096
--memory | Memory in MB | 512, 1024, 2048, 4096, 8192 (on Fargate, valid values depend on --cpu)
containerPort | Port the container listens on | Any port 1-65535
Note: The execution role allows ECS to pull container images and write logs. You may need to create this role first if it doesn't exist. The log group named in awslogs-group must also exist before tasks start.
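If the execution role and log group are missing, they can be created along these lines. This is a sketch under the assumption that the role should be named ecsTaskExecutionRole and the log group /ecs/web-service, matching the task definition above; the two function names are hypothetical. It attaches the AWS managed policy AmazonECSTaskExecutionRolePolicy.

```shell
# Print the trust policy that lets ECS tasks assume the execution role.
# ecs_trust_policy and create_task_prereqs are hypothetical helper names.
ecs_trust_policy() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Create the role and the log group that the task definition refers to.
create_task_prereqs() {
    aws iam create-role \
        --role-name ecsTaskExecutionRole \
        --assume-role-policy-document "$(ecs_trust_policy)" &&
    aws iam attach-role-policy \
        --role-name ecsTaskExecutionRole \
        --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy &&
    aws logs create-log-group --log-group-name /ecs/web-service
}
```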

Step 6: Create ECS Service with Service Discovery

aws ecs create-service \
    --cluster my-ecs-cluster \
    --service-name web-service \
    --task-definition web-service:1 \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-12345678,subnet-87654321],securityGroups=[sg-12345678],assignPublicIp=DISABLED}' \
    --service-registries '[
        {
            "registryArn": "arn:aws:servicediscovery:us-east-1:123456789012:service/srv-12345678"
        }
    ]' \
    --tags key=Environment,value=production
ECS Service: This creates the service that runs and manages your containers. It ensures the desired number of tasks are always running and registers them with service discovery.
Parameter | Description | Alternatives
--desired-count | Number of tasks to run | 1-100+ depending on needs
subnets | Where to place tasks | Multiple subnets for HA
securityGroups | Firewall rules for tasks | Must allow required ports
assignPublicIp | DISABLED for private services | ENABLED for tasks in public subnets that need direct internet access
registryArn | Cloud Map service ARN | From previous step's output
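Tasks are registered in DNS only once they are running, so it helps to wait for the service to settle before testing resolution. A minimal sketch using the CLI's built-in waiter; wait_for_service is a hypothetical helper name, and the cluster and service names are the ones used in this guide.

```shell
# Block until the service reaches a steady state, then print running and desired counts.
# $1: cluster name, $2: service name.
# wait_for_service is a hypothetical helper name.
wait_for_service() {
    aws ecs wait services-stable --cluster "$1" --services "$2" || return 1
    aws ecs describe-services --cluster "$1" --services "$2" \
        --query 'services[0].[runningCount,desiredCount]' --output text
}
```

For example, wait_for_service my-ecs-cluster web-service prints the two counts separated by a tab once the deployment has stabilized. If the tasks cannot pull their image, for instance from a private subnet without a NAT gateway or VPC endpoints, the waiter eventually times out and the function returns a non-zero status.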

Step 7: Verify DNS Resolution

# Test DNS resolution from within VPC
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type t3.micro \
    --subnet-id subnet-12345678 \
    --security-group-ids sg-12345678 \
    --user-data '#!/bin/bash
        yum update -y
        yum install -y bind-utils
        echo "Testing DNS resolution..."
        nslookup web.myapp.local
        dig web.myapp.local
        curl -I http://web.myapp.local'
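DNS Verification: This launches a test instance in the same VPC whose user-data script runs the DNS lookups and an HTTP request against the service at first boot. The results are written to /var/log/cloud-init-output.log on the instance rather than to your terminal, and the yum commands need outbound internet access (for example through a NAT gateway).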
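When checking the lookup by hand, a useful test is whether the number of A records matches the service's desired count. The sketch below assumes dig is available on the test instance; has_expected_answers is a hypothetical helper that only inspects the text it is given.

```shell
# Succeed only if stdin contains exactly $1 distinct IPv4 addresses.
# has_expected_answers is a hypothetical helper name.
has_expected_answers() {
    answer_count=$(grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | sort -u | wc -l | tr -d ' ')
    [ "$answer_count" -eq "$1" ]
}

# Example usage on the test instance, with a desired count of 2:
#   dig +short web.myapp.local | has_expected_answers 2 && echo "all tasks registered"
```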

7. Traffic Flow with Load Balancer Integration

graph TD
    subgraph "Internet"
        USER[User Request]
    end
    subgraph "VPC"
        subgraph "Public Subnets"
            ALB["Application Load Balancer<br/>public-facing"]
        end
        subgraph "Private Subnets"
            subgraph "ECS Cluster"
                T1["Task 1<br/>10.0.1.10:80"]
                T2["Task 2<br/>10.0.1.11:80"]
                T3["Task 3<br/>10.0.2.10:80"]
            end
            subgraph "Service Discovery"
                DNS["web.myapp.local<br/>→ 10.0.1.10, 10.0.1.11, 10.0.2.10"]
            end
            subgraph "Internal Service"
                INT["Internal App<br/>calling web.myapp.local"]
            end
        end
    end
    USER --> ALB
    ALB --> T1
    ALB --> T2
    ALB --> T3
    INT --> DNS
    DNS --> T1
    DNS --> T2
    DNS --> T3
    style USER fill:#ffebee
    style ALB fill:#e8f5e8
    style T1 fill:#e1f5fe
    style T2 fill:#e1f5fe
    style T3 fill:#e1f5fe
    style DNS fill:#fff3e0
    style INT fill:#f3e5f5
Traffic Flow Explanation:
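This diagram shows how ECS services can be accessed both externally and internally. External users reach the tasks through a public-facing Application Load Balancer in the public subnets, while internal applications resolve web.myapp.local through service discovery and connect directly to the task IPs in the private subnets. This pattern allows for secure internal communication while still providing external access when needed.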
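Because internal callers connect straight to task IPs, the tasks' security group must admit them. One way to express that is a rule that references the callers' security group instead of an IP range, sketched below; allow_internal_http is a hypothetical helper name, both group IDs are placeholders you supply, and port 80 matches the container port used in this guide.

```shell
# Allow members of a client security group to reach the tasks on TCP port 80.
# $1: security group attached to the tasks, $2: security group of the internal callers.
# allow_internal_http is a hypothetical helper name.
allow_internal_http() {
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" \
        --protocol tcp \
        --port 80 \
        --source-group "$2"
}
```

Referencing a group rather than a CIDR keeps the rule valid as task and client IP addresses change.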

8. Health Check and DNS Management

stateDiagram-v2
    [*] --> TaskStarting
    TaskStarting --> HealthCheck : Task starts
    HealthCheck --> Healthy : Passes health check
    HealthCheck --> Unhealthy : Fails health check
    Healthy --> DNSRegistered : Register with DNS
    DNSRegistered --> ServingTraffic : Receive traffic
    ServingTraffic --> HealthCheck : Continuous monitoring
    Unhealthy --> TaskStopping : Stop unhealthy task
    TaskStopping --> DNSDeregistered : Remove from DNS
    DNSDeregistered --> [*] : Task terminated
    ServingTraffic --> Unhealthy : Health check fails
    Unhealthy --> Healthy : Health check passes
Health Check Lifecycle:
This state diagram shows how ECS manages task health and DNS registration:
  1. Task Starting: New task begins startup process
  2. Health Check: ECS evaluates the container health check defined in the task definition and reports the result to Cloud Map as a custom health status
  3. DNS Registration: Healthy tasks are added to DNS records
  4. Traffic Serving: Task receives traffic from service discovery
  5. Continuous Monitoring: Health checks continue throughout task lifecycle
  6. Failure Handling: Unhealthy tasks are removed from DNS and replaced
This ensures that only healthy tasks receive traffic, providing automatic failover.
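To see the health status that ECS itself reports for running tasks, describe-tasks can be queried as sketched below. The helper name task_health is hypothetical. Tasks whose containers define no health check report UNKNOWN rather than HEALTHY.

```shell
# Print the ARN and health status (HEALTHY, UNHEALTHY, or UNKNOWN) of each given task.
# $1: cluster name; remaining arguments: task IDs or ARNs.
# task_health is a hypothetical helper name.
task_health() {
    th_cluster=$1
    shift
    aws ecs describe-tasks --cluster "$th_cluster" --tasks "$@" \
        --query 'tasks[].[taskArn,healthStatus]' --output text
}
```

Task IDs for the service can be obtained with aws ecs list-tasks --cluster my-ecs-cluster --service-name web-service.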

9. DNS Query Types and Use Cases

graph LR
    subgraph "A Record Query"
        A1["Client: nslookup web.myapp.local"]
        A2["Response: 10.0.1.10<br/>10.0.1.11<br/>10.0.2.10"]
    end
    subgraph "SRV Record Query"
        S1["Client: dig SRV _http._tcp.web.myapp.local"]
        S2["Response: 10 0 8080 web-1.myapp.local<br/>10 0 8080 web-2.myapp.local"]
    end
    subgraph "API Discovery"
        API1["Client: DiscoverInstances API"]
        API2["Response: JSON with IPs,<br/>ports, metadata"]
    end
    A1 --> A2
    S1 --> S2
    API1 --> API2
    style A1 fill:#e8f5e8
    style S1 fill:#fff3e0
    style API1 fill:#f3e5f5
DNS Query Types:
Different query types serve different use cases. Choose A records for simplicity, SRV records when clients must also discover the port, and the API for advanced scenarios that need instance metadata.
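An SRV answer carries four fields per record: priority, weight, port, and target host. The sketch below turns such answers into host:port pairs that a client could connect to; srv_to_endpoints is a hypothetical helper, and the usage line assumes dig and the SRV name shown in the diagram.

```shell
# Convert "priority weight port target" SRV answers on stdin into host:port pairs.
# srv_to_endpoints is a hypothetical helper name.
srv_to_endpoints() {
    awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}

# Example usage from a host inside the VPC:
#   dig +short SRV _http._tcp.web.myapp.local | srv_to_endpoints
```

The target is a hostname, so the client still needs an A-record lookup on it to obtain the task IP.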

10. Troubleshooting Common Issues

Common Issue #1: DNS resolution not working
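# Check VPC DNS settings (both attributes must be true for private DNS namespaces)
aws ec2 describe-vpc-attribute --vpc-id vpc-12345678 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-12345678 --attribute enableDnsHostnames

# List Route 53 Resolver endpoints hosted in the VPC (only relevant for hybrid DNS forwarding)
aws route53resolver list-resolver-endpoints --filters Name=HostVPCId,Values=vpc-12345678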
Common Issue #2: Tasks not registering with service discovery
# Check service registry instances
aws servicediscovery list-instances --service-id srv-12345678

# Check ECS service events
aws ecs describe-services --cluster my-ecs-cluster --services web-service --query 'services[0].events[0:5]'
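A further check for either issue is whether the namespace's private hosted zone is actually associated with the VPC the client runs in, since Cloud Map backs each private DNS namespace with a Route 53 private hosted zone. A minimal sketch; hosted_zones_for_vpc is a hypothetical helper name, and the VPC ID and region are this guide's placeholders.

```shell
# List the names of the private hosted zones associated with a VPC.
# $1: VPC ID (for example vpc-12345678), $2: region of the VPC (for example us-east-1).
# hosted_zones_for_vpc is a hypothetical helper name.
hosted_zones_for_vpc() {
    aws route53 list-hosted-zones-by-vpc \
        --vpc-id "$1" \
        --vpc-region "$2" \
        --query 'HostedZoneSummaries[].Name' \
        --output text
}
```

If myapp.local. does not appear in the output, clients in that VPC cannot resolve web.myapp.local regardless of how many tasks are registered.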

11. Best Practices and Security Considerations

Security Best Practices:
Performance Considerations:

12. Monitoring and Observability

# Monitor service discovery health
aws servicediscovery get-instances-health-status --service-id srv-12345678

# Search the service's container logs for errors (--start-time is in epoch milliseconds)
aws logs filter-log-events \
    --log-group-name /ecs/web-service \
    --start-time 1640995200000 \
    --filter-pattern "ERROR"

# Monitor DNS resolution
aws cloudwatch get-metric-statistics \
    --namespace AWS/Route53Resolver \
    --metric-name QueryCount \
    --dimensions Name=VPC,Value=vpc-12345678 \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-02T00:00:00Z \
    --period 3600 \
    --statistics Sum
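Monitoring Commands: These commands help you monitor the health and performance of your ECS DNS setup. Treat the CloudWatch metric name and dimension above as illustrative; aws cloudwatch list-metrics --namespace AWS/Route53Resolver shows which metrics your account actually publishes. Regular monitoring helps identify issues before they impact users.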
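The filter-log-events command expects --start-time as epoch milliseconds, which is awkward to type by hand. A small sketch for computing a relative timestamp; ms_ago is a hypothetical helper name and the one-hour default is an arbitrary choice.

```shell
# Print the epoch time in milliseconds for N seconds ago (default: 3600).
# ms_ago is a hypothetical helper name.
ms_ago() {
    echo $(( ($(date +%s) - ${1:-3600}) * 1000 ))
}

# Example usage: search the last hour of logs for errors.
#   aws logs filter-log-events \
#       --log-group-name /ecs/web-service \
#       --start-time "$(ms_ago 3600)" \
#       --filter-pattern "ERROR"
```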

Summary

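AWS ECS Service Discovery provides automatic DNS management for containerized applications, similar to how DHCP automatically assigns IP addresses. The key benefits are automatic registration and deregistration of tasks, removal of unhealthy tasks from DNS responses, and name-based access that works with existing applications without code changes.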

This setup enables microservices to communicate using simple DNS names while AWS handles the complexity of service registration, health monitoring, and traffic routing.