How We Deploy Django on AWS ECS Fargate

Avinash Thakur, 7 min read

Most Django deployment tutorials stop at docker run. That gap between a container that boots on your laptop and a service that survives a deploy, a health check, and a database failover is where the real work lives. This is the configuration we reach for when we put a Django application on AWS ECS Fargate for a client, written down so you can adapt it rather than rediscover it.

We use Fargate (rather than EC2-backed ECS) because we would rather not patch and scale a fleet of container hosts. You hand AWS a task definition and it finds capacity. The trade-off is that you give up host-level control and pay a small premium per vCPU-hour. For a typical Django API or admin-backed product, that trade is worth it.

The Dockerfile

Build for a small, predictable image. Use a slim Python base, install dependencies in a layer that only rebuilds when requirements.txt changes, collect static files at build time, and run as a non-root user.

The image you build on your laptop is the exact artifact that runs in production. Keep it small and reproducible.
FROM python:3.12-slim AS base
 
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1
 
WORKDIR /app
 
# System deps for psycopg and building wheels.
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*
 
COPY requirements.txt .
RUN pip install -r requirements.txt
 
COPY . .
 
# Bake static assets into the image so the container is self-contained.
RUN python manage.py collectstatic --noinput
 
RUN useradd --create-home appuser
USER appuser
 
EXPOSE 8000
CMD ["gunicorn", "myproject.wsgi:application", "-c", "gunicorn.conf.py"]

A note on collectstatic: when static assets are served through a CDN, you usually push the collected files to S3 and serve them from CloudFront, not from the container. We still run collectstatic in the build so the manifest exists and ManifestStaticFilesStorage can resolve hashed filenames. Where the files physically live is a separate decision from whether the manifest is present.
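
For reference, the manifest-backed storage is selected in settings. The fragment below is a minimal sketch, assuming Django 4.2 or later (where the STORAGES setting exists). The STATIC_ROOT path is a hypothetical choice, and if the files are pushed to S3 a third-party storage backend would take the place of the one shown.

```python
# settings/production.py (fragment; assumes Django 4.2+)
STORAGES = {
    # Defining STORAGES replaces Django's default dict, so keep a "default" entry.
    "default": {"BACKEND": "django.core.files.storage.FileSystemStorage"},
    "staticfiles": {
        # Writes staticfiles.json at collectstatic time and resolves hashed names from it.
        "BACKEND": "django.contrib.staticfiles.storage.ManifestStaticFilesStorage",
    },
}
STATIC_URL = "/static/"
STATIC_ROOT = "/app/staticfiles"  # hypothetical path inside the image
```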

gunicorn configuration

Put gunicorn settings in a file rather than a long CMD line. The worker count is the setting people get wrong most often. A common starting point is (2 x CPU cores) + 1, but that formula assumes CPU-bound work. Django request handlers spend most of their time waiting on the database, so we usually run a small number of workers sized to the task's vCPU allocation and use the gthread worker class so each worker gets a few threads for I/O overlap.

# gunicorn.conf.py
import os
 
bind = "0.0.0.0:8000"
workers = int(os.environ.get("GUNICORN_WORKERS", "3"))
threads = int(os.environ.get("GUNICORN_THREADS", "4"))
worker_class = "gthread"
 
# Recycle workers to bound memory growth from long-lived processes.
max_requests = 1000
max_requests_jitter = 100
 
# Must be shorter than the ALB idle timeout so gunicorn closes first.
# The ALB default is 60 s, so raise it on the ALB or lower this value.
timeout = 60
graceful_timeout = 30
 
accesslog = "-"   # stdout -> CloudWatch Logs
errorlog = "-"

Sizing is something you tune against your own traffic. Start conservative, watch the task's CPU and memory in CloudWatch, and raise the worker count only when the workers are actually saturated rather than idle-waiting on the database.

The ECS task definition

The task definition is where Django meets AWS. It declares the container, its CPU and memory, the environment, secrets, and the logging driver. Here is a trimmed version of one we run.

{
  "family": "myproject-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/myproject-ecs-execution",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/myproject-task",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/myproject:GIT_SHA",
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
      "environment": [
        { "name": "DJANGO_SETTINGS_MODULE", "value": "myproject.settings.production" },
        { "name": "ALLOWED_HOSTS", "value": ".myproject.com" }
      ],
      "secrets": [
        { "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:myproject/db" },
        { "name": "SECRET_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:myproject/django" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myproject-web",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}

Two decisions worth calling out. First, image tags are the Git SHA, never latest, so a rollback is just pointing the service back at the previous task definition revision. Second, DATABASE_URL and SECRET_KEY come from secrets (resolved from Secrets Manager by the execution role at launch), not from environment. Plaintext secrets in a task definition are readable by anyone with ecs:DescribeTaskDefinition.
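
Both rules are easy to check mechanically before a task definition is registered. The function below is a rough sketch of such a check, not part of our pipeline as described here; the name lint_task_definition and the list of secret-looking name fragments are assumptions made for illustration.

```python
# Hypothetical pre-deploy check for the two rules above.
SECRET_HINTS = ("SECRET", "PASSWORD", "TOKEN", "DATABASE_URL")  # assumed naming hints


def lint_task_definition(task_def: dict) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    for container in task_def.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Look for the tag only after the last "/" so a registry port is not mistaken for one.
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            problems.append(f"{name}: image must be pinned to a Git SHA tag, got {image!r}")
        for env in container.get("environment", []):
            if any(hint in env["name"].upper() for hint in SECRET_HINTS):
                problems.append(f"{name}: {env['name']} belongs in 'secrets', not 'environment'")
    return problems
```

Calling it on the parsed JSON of a task definition and failing the deploy when the list is non-empty is one way to wire it in.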

ALB health checks

The Application Load Balancer needs a path that returns 200 only when the app can actually serve traffic. Do not point it at / if / runs an expensive query or redirects. Add a dedicated lightweight endpoint.

The load balancer only sends traffic to a task once its health check passes, so the check has to mean something.
# urls.py
from django.http import JsonResponse
from django.urls import path
 
def healthz(request):
    return JsonResponse({"status": "ok"})
 
urlpatterns = [
    path("healthz", healthz),
    # ... your routes
]

In the target group, set the health check path to /healthz and the success code to 200. Keep the interval and healthy threshold tight enough that a bad task is pulled quickly, but not so aggressive that a brief GC pause cycles a healthy one. Also account for the health check in ALLOWED_HOSTS validation: the ALB hits the container by IP, so the Host header on the check will not match .myproject.com. Either allow the health-check host or special-case the path before host validation.
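
One way to special-case the path before host validation is to answer it in a thin WSGI wrapper, so the request never reaches Django's Host check. This is a sketch under that assumption; the name health_check_first is hypothetical. Because it bypasses Django entirely, it only proves that gunicorn is accepting requests.

```python
# wsgi.py (sketch)
def health_check_first(app, path="/healthz"):
    """Wrap a WSGI app so `path` is answered before Django validates the Host header."""
    body = b'{"status": "ok"}'

    def wrapper(environ, start_response):
        if environ.get("PATH_INFO") == path:
            start_response("200 OK", [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else goes to the wrapped application unchanged.
        return app(environ, start_response)

    return wrapper


# In a real wsgi.py this would wrap Django's application:
#   from django.core.wsgi import get_wsgi_application
#   application = health_check_first(get_wsgi_application())
```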

One subtlety: the ALB idle timeout must be longer than gunicorn's timeout, and gunicorn's timeout must be longer than your slowest legitimate request. If those are out of order you get truncated responses that are painful to trace.
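
The ordering can be written down as a check that runs in CI or at deploy time. A minimal sketch, assuming you record the three values somewhere your tooling can read them; the function name and the example numbers are illustrative.

```python
def check_timeout_order(slowest_request_s, gunicorn_timeout_s, alb_idle_timeout_s):
    """Raise ValueError unless slowest request < gunicorn timeout < ALB idle timeout."""
    if not slowest_request_s < gunicorn_timeout_s:
        raise ValueError(
            f"gunicorn timeout ({gunicorn_timeout_s}s) must exceed the slowest "
            f"legitimate request ({slowest_request_s}s)"
        )
    if not gunicorn_timeout_s < alb_idle_timeout_s:
        raise ValueError(
            f"ALB idle timeout ({alb_idle_timeout_s}s) must exceed the gunicorn "
            f"timeout ({gunicorn_timeout_s}s)"
        )
    return True
```

For example, check_timeout_order(20, 60, 120) passes, while check_timeout_order(20, 60, 60) raises because the two timeouts are equal.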

RDS PostgreSQL and connection handling

Use RDS for PostgreSQL and give Django persistent connections so it does not pay for a fresh TCP and TLS handshake on every request.

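# settings/production.py
import dj_database_url

# Reads DATABASE_URL from the environment.
_db = dj_database_url.config(conn_max_age=600, ssl_require=True)

# Merge into OPTIONS instead of replacing it: ssl_require stores
# sslmode=require there, and a literal "OPTIONS" key would discard it.
_db.setdefault("OPTIONS", {})["connect_timeout"] = 5

DATABASES = {"default": _db}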

conn_max_age keeps a connection alive across requests within a worker. Be deliberate here: persistent connections multiply by worker count by task count, so check that (workers x threads x tasks) stays under the RDS instance's max_connections. When you outgrow that, put PgBouncer in transaction-pooling mode in front of RDS rather than raising max_connections indefinitely.
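
The arithmetic is worth doing explicitly. A small sketch, assuming the worst case where every gthread thread in every task holds its own persistent connection; the max_connections value and the reserved headroom below are made-up numbers, not RDS defaults.

```python
def connection_budget(workers, threads, tasks, max_connections, reserved=10):
    """Return (peak, available, fits) for the worst-case connection count.

    `reserved` is hypothetical headroom for migrations, admin sessions, and
    anything else that connects outside the web service.
    """
    peak = workers * threads * tasks
    available = max_connections - reserved
    return peak, available, peak <= available


# 3 workers x 4 threads (the gunicorn defaults used earlier) across 8 tasks.
print(connection_budget(3, 4, 8, max_connections=100))  # (96, 90, False)
print(connection_budget(3, 4, 4, max_connections=100))  # (48, 90, True)
```

Celery workers open connections too, so a real budget would add them to the peak.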

Run migrations as a separate one-off ECS task before the new service version goes live, not in the container CMD. If migrations run on every task start, a scale-up event can fire several migration attempts at once.
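
A one-off migration task can reuse the web task definition with the command overridden. The sketch below only builds the arguments for boto3's ECS run_task call. The cluster name, subnet IDs, and security group IDs are placeholders you would supply, and assignPublicIp is set to DISABLED on the assumption that the task runs in private subnets with a route to ECR and Secrets Manager.

```python
def migration_run_task_kwargs(cluster, task_definition, subnets, security_groups,
                              container="web"):
    """Build keyword arguments for ecs.run_task that run `manage.py migrate` once."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,  # e.g. "myproject-web:42" (hypothetical revision)
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",  # assumption: private subnets
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container,
                    "command": ["python", "manage.py", "migrate", "--noinput"],
                }
            ]
        },
    }
```

The result would be passed as boto3.client("ecs").run_task(**kwargs). The deploy should then wait for the task to stop and check the container's exit code before updating the service.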

Where Celery fits

Celery workers run as a second ECS service from the same image, with a different command. They use the same image and secrets as the web service but do not sit behind the ALB; they pull work from the broker instead.

{
  "name": "worker",
  "image": "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/myproject:GIT_SHA",
  "command": ["celery", "-A", "myproject", "worker", "--concurrency", "4"],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/myproject-worker",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "worker"
    }
  }
}

Because the worker has no health-check endpoint, monitor it on queue depth and task failure rate in CloudWatch rather than on HTTP status. Scale the worker service on broker backlog, and scale the web service on ALB request count or CPU. They have different load shapes and should autoscale independently.
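
Scaling on backlog comes down to a target number of queued messages per worker task. The function below illustrates only that arithmetic. In practice ECS Service Auto Scaling would apply it through a scaling policy on a metric you publish, and the target and bounds here are hypothetical.

```python
import math


def desired_worker_tasks(backlog, per_task_target, min_tasks=1, max_tasks=10):
    """Worker task count that keeps roughly `per_task_target` queued messages per task."""
    if per_task_target <= 0:
        raise ValueError("per_task_target must be positive")
    wanted = math.ceil(backlog / per_task_target)
    # Clamp so an empty queue keeps a floor and a spike cannot scale without bound.
    return max(min_tasks, min(max_tasks, wanted))


print(desired_worker_tasks(120, per_task_target=50))  # 3
```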

Putting it together

The pieces are: an image tagged by SHA, a task definition that pulls secrets at launch, a web service behind an ALB with a real health endpoint, a worker service on the same image, and RDS with bounded persistent connections. None of it is exotic. The work is in getting the boundaries right: timeouts ordered correctly, connections counted, migrations run once.

If you would rather hand this off, our Django cloud deployment team builds and operates exactly this setup, and if your data model is the hard part we also do PostgreSQL database engineering alongside it. Tell us what you are running and we will tell you, plainly, what we would change. Start the conversation on our Django cloud deployment service page.
