Docker in Production: The Practices That Survive Real Incidents
Battle-tested Docker practices for production workloads — multi-stage builds, security hardening, health checks, log management and the Compose-to-Kubernetes migration path.
Key takeaways
- Multi-stage builds are non-negotiable: your production image should contain only the runtime, your built artefact and nothing else.
- Run as non-root, always. Use distroless or Alpine base images with a dedicated UID.
- Health checks should test application readiness (can it serve a request?) not just process liveness (is PID 1 running?).
- Log to stdout in structured JSON. Let the orchestrator handle routing, rotation and aggregation.
- Set CPU and memory limits on every container. A container without limits is a host failure waiting to happen.
We deploy Docker containers for every client project — from single-container Next.js apps on Railway to multi-service architectures on Kubernetes. The mistakes that cause production incidents are remarkably consistent, and they are almost never about Docker itself. They are about the gap between a Dockerfile that works in development and one that survives real-world traffic, security scans and 3 AM incidents.
Multi-stage builds: the non-negotiable baseline
A production Docker image should contain exactly three things: a minimal OS layer, the language runtime, and your built application. It should not contain build tools, source code, dev dependencies, package manager caches or test fixtures. Multi-stage builds achieve this by using a 'builder' stage to compile and a 'runner' stage that copies only the output.
# Builder stage
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
# Production stage
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 app && adduser -u 1001 -G app -s /bin/sh -D app
COPY --from=builder --chown=app:app /app/.next/standalone ./
COPY --from=builder --chown=app:app /app/public ./public
COPY --from=builder --chown=app:app /app/.next/static ./.next/static
USER app
EXPOSE 3000
CMD ["node", "server.js"]
This produces an image that is typically 80-150MB instead of 800MB+, has a dramatically smaller attack surface, and pulls and starts faster because there is less data to download and unpack.
Security: run as non-root, always
By default, processes in a Docker container run as root. Unless user namespace remapping is enabled, that is the same UID 0 as root on the host, so a container escape can give the attacker root on the host. Running as a non-root user with a dedicated UID is the single most impactful security measure you can take. Combine this with a read-only root filesystem (the --read-only flag) and explicit tmpfs mounts (--tmpfs) for directories that need write access. We also run container image scans (Trivy) in CI and block deployments with critical CVEs.
Health checks that actually work
Docker's HEALTHCHECK instruction and Kubernetes liveness/readiness probes serve different purposes. A liveness probe answers 'is this process broken beyond recovery?' (if so, restart it). A readiness probe answers 'can this instance serve traffic right now?' (if not, stop routing to it). We implement both: liveness checks the process, readiness hits an application endpoint that verifies database connectivity and dependency health. Docker's HEALTHCHECK defaults to a 30-second interval and a 30-second timeout; we keep the 30-second interval and tighten the timeout to 5 seconds, which works for most services.
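One way to keep the two probes separate is sketched below, assuming a service that exposes two HTTP paths. The function names and the `/api/live` path are hypothetical and not taken from any framework; only `/api/health` appears in this article. The functions are pure, so they can sit behind whatever HTTP layer the service already uses.

```typescript
// Hypothetical sketch: names and the /api/live path are illustrative assumptions.

type CheckResult = { name: string; ok: boolean };

type ProbeResponse = {
  status: 200 | 404 | 503;
  body: { status: "ok" | "unavailable" | "not_found"; checks: CheckResult[] };
};

// Liveness: "is this process able to respond at all?"
// It touches no dependencies, so a database outage alone
// never causes the orchestrator to restart the container.
function livenessResponse(): ProbeResponse {
  return { status: 200, body: { status: "ok", checks: [] } };
}

// Readiness: "can this instance serve traffic right now?"
// Any failing dependency yields 503 so traffic is routed elsewhere.
function readinessResponse(checks: CheckResult[]): ProbeResponse {
  const allOk = checks.every((check) => check.ok);
  return {
    status: allOk ? 200 : 503,
    body: { status: allOk ? "ok" : "unavailable", checks },
  };
}

// Maps a request path to the matching probe. The caller gathers
// `checks` (for example, a database ping) before calling this.
function routeProbe(path: string, checks: CheckResult[]): ProbeResponse {
  if (path === "/api/live") return livenessResponse();
  if (path === "/api/health") return readinessResponse(checks);
  return { status: 404, body: { status: "not_found", checks: [] } };
}
```

With a failing database check, `routeProbe("/api/health", [{ name: "db", ok: false }])` returns status 503 while `/api/live` still returns 200, which is the distinction the two probe types rely on.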
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget -qO- http://localhost:3000/api/health || exit 1
Logging: structured JSON to stdout
Containers should log to stdout in structured JSON. Not to files inside the container (they are lost when the container is replaced), not to syslog (it adds a dependency), and not in unstructured plain text (it is hard to parse reliably at scale). The runtime or orchestrator (Docker Compose, ECS, Kubernetes) collects stdout and routes it to your log aggregation system. We use pino for Node.js, structlog for Python, and slog for Go. Every log line includes: timestamp, level, message, request_id, and any relevant business context.
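The shape of such a log line can be sketched without any library. This is a dependency-free stand-in to show the fields listed above, not pino's API; `formatLogLine` and the `order_id` field are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch of one structured log line; not pino's API.

type Level = "debug" | "info" | "warn" | "error";

interface LogContext {
  request_id: string;
  // Any additional business context (order IDs, tenant IDs, ...).
  [key: string]: unknown;
}

// Builds a single-line JSON string. Core fields are written last so
// that business context cannot overwrite timestamp, level or message.
function formatLogLine(
  level: Level,
  message: string,
  context: LogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...context,
    timestamp: now.toISOString(),
    level,
    message,
  });
}

// One JSON object per line on stdout; the platform handles the rest.
console.log(
  formatLogLine("info", "order created", { request_id: "req-123", order_id: 42 }),
);
```

In a real service a library such as pino would do this job with better performance; the sketch only shows the fields a line would carry.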
Resource limits: prevent cascade failures
A container without CPU and memory limits can consume the entire host's resources, starving every other container and the orchestrator itself. This is the most common cause of 'everything went down at once' incidents. Set limits on every container. For Node.js services we typically start with 512MB memory and 0.5 CPU, then adjust based on production metrics. The --memory and --cpus flags on docker run, deploy.resources.limits in Docker Compose, or resources.limits in Kubernetes, are not optional.
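A rule like "set limits on every container" is easy to check mechanically. The sketch below assumes a Compose file that has already been parsed into an object (the YAML parsing step is left out) and flags services with no `deploy.resources.limits`. The function name `servicesMissingLimits` and the example services are hypothetical.

```typescript
// Hypothetical CI-style check; assumes the Compose YAML is already parsed.

interface ComposeService {
  image?: string;
  deploy?: {
    resources?: {
      limits?: { cpus?: string; memory?: string };
    };
  };
}

interface ComposeFile {
  services: Record<string, ComposeService>;
}

// Returns the names of services that lack a CPU limit, a memory limit, or both.
function servicesMissingLimits(compose: ComposeFile): string[] {
  const missing: string[] = [];
  for (const name of Object.keys(compose.services)) {
    const limits = compose.services[name]?.deploy?.resources?.limits;
    if (!limits?.cpus || !limits?.memory) {
      missing.push(name);
    }
  }
  return missing;
}

// Example input using the starting values mentioned above for Node.js.
const example: ComposeFile = {
  services: {
    web: {
      image: "web:latest",
      deploy: { resources: { limits: { cpus: "0.5", memory: "512M" } } },
    },
    worker: { image: "worker:latest" }, // no limits: should be flagged
  },
};

console.log(servicesMissingLimits(example)); // only "worker" is flagged
```

A check like this could run in CI and fail the build when the returned list is not empty.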
The Compose-to-Kubernetes path
Docker Compose is the right tool for single-server deployments and development environments. When you need multi-node orchestration, automated scaling, rolling deployments and self-healing, you need Kubernetes. We start every project on Compose (or a managed platform like Railway) and migrate to Kubernetes only when the operational requirements justify the complexity. The migration path is straightforward if your containers follow the practices above: stateless, configurable via environment variables, with proper health checks and resource limits.
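"Configurable via environment variables" can be sketched as a small loader that reads everything from the environment and fails fast on bad values, so the same image runs unchanged under Compose or Kubernetes. The variable names (`DATABASE_URL`, `PORT`, `LOG_LEVEL`) and the `loadConfig` function are assumptions for illustration, not something this article's projects are known to use.

```typescript
// Hypothetical sketch: variable names and defaults are illustrative assumptions.

interface AppConfig {
  port: number;
  databaseUrl: string;
  logLevel: string;
}

// Takes the environment as a plain object so it can be tested
// without touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const databaseUrl = env["DATABASE_URL"];
  if (!databaseUrl) {
    throw new Error("DATABASE_URL is required");
  }

  const port = Number(env["PORT"] ?? "3000");
  // Rejects NaN, fractions and out-of-range values in one check.
  if (port !== Math.floor(port) || port < 1 || port > 65535) {
    throw new Error("PORT must be an integer between 1 and 65535");
  }

  return { port, databaseUrl, logLevel: env["LOG_LEVEL"] ?? "info" };
}
```

In a Node.js service this would be called once at startup as `loadConfig(process.env)`, so a misconfigured container exits immediately instead of failing on its first request.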
Frequently asked questions
Direct answers to questions readers and AI assistants commonly ask about this topic.
Should I use Docker in production in 2026?
Yes. Containers are the standard deployment unit for web applications. Whether you deploy to a managed platform (Vercel, Railway, Fly.io), ECS, or Kubernetes, you are running containers. Understanding Docker production best practices is foundational.
Alpine or Distroless for production images?
Alpine for most cases: it is small (about 5MB) and has a package manager for debugging. Note that it uses musl libc rather than glibc, which can matter for native dependencies built against glibc. Distroless (from Google) for maximum security: no shell, no package manager, minimal attack surface. We use Alpine as the default and Distroless for high-security workloads.
When should I switch from Docker Compose to Kubernetes?
When you need multi-node deployment, auto-scaling, rolling updates with zero downtime, or you are running more than 10-15 services. For most startups and mid-size applications, a managed platform or Docker Compose on a single server is sufficient and significantly simpler to operate.
Last updated: April 27, 2026 · Written by Ribbsaeter Systems Engineering · DevOps & Platform Engineering