AWS VPC Networking Patterns That Hold Up in Production
Most VPC designs we audit have the same five flaws. These are the patterns we apply on every multi-account AWS platform that has to pass security review.
VPC design is one of those decisions that’s painful to redo and easy to get wrong on the first try. CIDR blocks chosen carelessly limit future expansion; subnet topology that doesn’t anticipate multi-AZ pinch points hits limits at scale; NAT gateway placement that ignores cost adds five-figure annual surprises. We’ve audited VPCs across hospital platforms, banking workloads, and SaaS clients — the same patterns appear in most well-designed ones, and the same flaws appear in the rest.
Here are the patterns we apply on every production VPC we deploy on AWS.
Pattern 1: One VPC per workload-tier per region, not “the corporate VPC”
The legacy pattern of one giant VPC shared by everything is what blast-radius nightmares are made of. A misconfigured security group in dev shouldn’t be able to reach prod databases.
The shape that works:
- Production VPC in primary region (and DR region for HA)
- Non-prod VPC (dev/staging) separately
- Sandbox VPC for experimentation
- Shared services VPC for things genuinely shared across environments (DNS resolvers, package mirrors, etc.)
Each VPC lives in its own AWS account (see our IAM patterns piece for the account structure). Cross-account, cross-VPC connectivity goes through Transit Gateway or VPC Peering, so every path is explicit and auditable.
This is more operational surface than “one big VPC,” but the blast radius and audit story are dramatically better.
Pattern 2: CIDR blocks with room to grow
Pick CIDR blocks deliberately. A /16 gives you 65k IPs and room for 16 /20 subnets. A /20 gives you 4k IPs and pinches fast.
Our defaults:
- Production VPC: /16 (e.g., 10.10.0.0/16)
- Non-prod VPC: /16 (e.g., 10.20.0.0/16)
- Shared services: /16 (e.g., 10.30.0.0/16)
- Each subnet: /20 (4096 IPs) per AZ per tier: public, private app, private data
Allocate non-overlapping CIDR ranges per account so future peering / Transit Gateway / VPN doesn’t fail on address conflicts. Document your allocation scheme; the team will thank you in 3 years.
What we avoid: overlapping with common on-prem corporate ranges (10.0.0.0/16, 192.168.0.0/16) if you ever expect VPN/Direct Connect to your office.
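The non-overlap rule is easy to check mechanically. The sketch below uses Python's standard ipaddress module against the three example /16 ranges and the two on-prem ranges named above. The ALLOCATIONS dictionary and the find_overlaps helper are hypothetical names for illustration, not part of any AWS tooling.

```python
import ipaddress
from itertools import combinations

# Hypothetical allocation plan; the three /16 ranges are the example defaults above.
ALLOCATIONS = {
    "production": "10.10.0.0/16",
    "non-prod": "10.20.0.0/16",
    "shared-services": "10.30.0.0/16",
}

# On-prem ranges to steer clear of if VPN/Direct Connect is ever likely.
RESERVED = ["10.0.0.0/16", "192.168.0.0/16"]


def find_overlaps(allocations, reserved=()):
    """Return sorted (name_a, name_b) pairs whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    for cidr in reserved:
        nets[f"reserved:{cidr}"] = ipaddress.ip_network(cidr)
    return sorted(
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    )


# The example plan is clean, so this prints an empty list.
print(find_overlaps(ALLOCATIONS, RESERVED))

# A new account that reuses part of production's range is flagged.
print(find_overlaps({**ALLOCATIONS, "sandbox": "10.10.128.0/17"}))
```

Running a check like this in CI against the allocation document is one way to catch a conflict before a peering or Transit Gateway attachment fails on it.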
Pattern 3: Three-tier subnet topology, replicated per AZ
Every VPC has the same shape:
Per AZ (us-east-1a, us-east-1b, us-east-1c):
├── public subnet (/24) — ALBs, NAT Gateway, bastion
├── private app subnet (/22) — EC2, ECS, EKS nodes, Lambda
└── private data subnet (/22) — RDS, ElastiCache, etc.
Three AZs minimum for production (HA). Two AZs if you accept the reduced fault tolerance and want to save NAT Gateway costs.
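As a rough illustration of the per-AZ layout, the sketch below carves the /24 and /22 subnets from the diagram out of a /16 using Python's ipaddress module. The placement scheme is an assumption made for this example: each AZ gets its own /20 block, with the public /24 in the first quarter, the app /22 in the second, and the data /22 in the third. The plan_subnets name is hypothetical.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs):
    """Assign one public /24, one app /22 and one data /22 per AZ.

    Assumed layout: every AZ owns one /20 of the VPC; the rest of that /20
    stays unallocated as room to grow.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    az_blocks = vpc.subnets(new_prefix=20)  # generator of consecutive /20 blocks
    plan = {}
    for az, block in zip(azs, az_blocks):
        quarters = list(block.subnets(new_prefix=22))  # four /22s per /20
        plan[az] = {
            "public": next(quarters[0].subnets(new_prefix=24)),
            "private-app": quarters[1],
            "private-data": quarters[2],
        }
    return plan


plan = plan_subnets("10.10.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"])
for az, tiers in plan.items():
    for tier, subnet in tiers.items():
        print(f"{az:11} {tier:13} {subnet}")
```

Under this layout only three of the sixteen /20 blocks in the /16 are used, which leaves space for more AZs or extra tiers later.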
Route tables:
- Public subnets: default route to Internet Gateway
- Private subnets: default route to NAT Gateway in same AZ (or shared NAT — see below)
- Data subnets: no default route at all, or default to a NAT Gateway depending on whether DBs need outbound for patches/CDC
This separation lets you put security group rules like “DB only accepts traffic from app subnets, never from public subnets” naturally.
Pattern 4: NAT Gateway economics
NAT Gateways are surprisingly expensive: $0.045/hour (about $33/month) per gateway, plus $0.045/GB of data processed. A busy app pushing 1 TB/month through one NAT Gateway pays about $45 in data processing plus $33 in gateway hours, roughly $78/month per AZ. With three AZs each carrying that load, NAT alone costs about $234/month.
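The arithmetic above can be reproduced with a few lines of Python. The two rates are the ones quoted in the paragraph; the 730-hour month, the 1 TB = 1000 GB convention, and the nat_monthly_cost name are assumptions for this sketch. Check current AWS pricing for your region before relying on the numbers.

```python
HOURS_PER_MONTH = 730    # assumed average month length
NAT_HOURLY_USD = 0.045   # per-gateway hourly rate quoted above
NAT_PER_GB_USD = 0.045   # per-GB data processing rate quoted above


def nat_monthly_cost(gateways, gb_per_gateway):
    """Monthly NAT cost in USD: gateway hours plus data processing."""
    hours = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH
    data = gateways * gb_per_gateway * NAT_PER_GB_USD
    return round(hours + data, 2)


print(nat_monthly_cost(1, 1000))  # one AZ, 1 TB/month: about 78
print(nat_monthly_cost(3, 1000))  # three AZs, 1 TB/month each: about 234
```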
Two patterns to reduce this:
Single NAT Gateway (in one AZ, shared across all private subnets). Saves money but creates an AZ failure domain — if that AZ’s NAT goes down, the whole VPC’s private subnets lose outbound.
VPC Endpoints for AWS services. S3, DynamoDB, KMS, Secrets Manager, ECR: these tend to be the highest-volume AWS services. With VPC Endpoints in place, traffic to those services stays on private paths (gateway endpoints for S3 and DynamoDB, PrivateLink interface endpoints for the rest) instead of going through NAT. Massive cost savings if you’re pulling lots from S3 or ECR.
Our default: VPC Endpoints for the heavy hitters (the S3 and DynamoDB gateway endpoints are free; interface endpoints for the others run ~$7/month per endpoint per AZ, plus a per-GB charge), NAT Gateway per AZ for everything else. For cost-sensitive workloads, single NAT in one AZ with the trade-off documented.
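A quick break-even estimate shows when a paid endpoint earns its keep. The sketch uses the ~$7/month endpoint figure and the $0.045/GB NAT rate from this section. The $0.01/GB endpoint data charge is an assumption not taken from the article, and breakeven_gb is a hypothetical helper name.

```python
def breakeven_gb(endpoint_monthly_usd=7.0,
                 nat_per_gb_usd=0.045,
                 endpoint_per_gb_usd=0.01):  # assumed endpoint data rate
    """Monthly GB to one service above which a paid endpoint beats NAT."""
    saving_per_gb = nat_per_gb_usd - endpoint_per_gb_usd
    if saving_per_gb <= 0:
        raise ValueError("endpoint data rate must be lower than the NAT rate")
    return endpoint_monthly_usd / saving_per_gb


# With the figures above, the endpoint pays for itself at roughly 200 GB/month.
print(round(breakeven_gb()))
```

Below that volume the fixed endpoint fee costs more than the NAT data charges it avoids, so low-traffic services may not justify their own endpoint on cost alone.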
Pattern 5: Security groups, not NACLs
Network ACLs (NACLs) are stateless and operate at the subnet level. Security Groups are stateful and operate at the ENI level. Both are valid, but security groups are simpler and cover most needs.
Our default: leave NACLs at the subnet boundary in their default state (the default NACL allows all traffic in both directions) and use security groups as the primary access control mechanism.
Security group hygiene:
- One SG per role, not per app. sg-rds-postgres, sg-redis, sg-app-tier, sg-alb-public. Reference by name across stacks.
- Reference other security groups, not CIDR blocks. “Allow port 5432 from sg-app-tier” is clearer and less brittle than “allow port 5432 from 10.10.20.0/22.”
- Tight outbound rules. Default SG outbound is “all traffic, anywhere.” Tighten to specific destinations and ports for sensitive workloads.
- Document each SG’s purpose in its description. Future-you in 18 months won’t remember why a rule exists.
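The hygiene rules above lend themselves to a small lint. The sketch below checks ingress rules written as plain dictionaries shaped like the IpPermissions entries that the EC2 DescribeSecurityGroups API returns. The lint_ingress function, the sample rules, and the use of role names in place of real group IDs are all illustrative assumptions.

```python
def lint_ingress(group_name, ip_permissions, internet_facing=False):
    """Flag CIDR-based ingress rules; SG-to-SG references pass untouched."""
    findings = []
    for perm in ip_permissions:
        port = perm.get("FromPort", "all")
        for ip_range in perm.get("IpRanges", []):
            cidr = ip_range["CidrIp"]
            if cidr == "0.0.0.0/0":
                if not internet_facing:
                    findings.append(f"{group_name}: port {port} open to 0.0.0.0/0")
            else:
                findings.append(
                    f"{group_name}: port {port} allows CIDR {cidr}; prefer an SG reference"
                )
    return findings


# Preferred form: the database SG admits the app-tier SG (placeholder ID).
GOOD_RULES = [{"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
               "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-app-tier"}]}]

# Brittle form: the same intent expressed as a subnet CIDR.
BRITTLE_RULES = [{"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
                  "IpRanges": [{"CidrIp": "10.10.20.0/22"}], "UserIdGroupPairs": []}]

print(lint_ingress("sg-rds-postgres", GOOD_RULES))     # no findings
print(lint_ingress("sg-rds-postgres", BRITTLE_RULES))  # one finding
```

The internet_facing flag exists so that a public ALB's 0.0.0.0/0 rule on 443 is not reported, while the same rule on any other group is.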
Pattern 6: Transit Gateway for multi-VPC connectivity
For 3+ VPCs that need to talk, Transit Gateway is the answer. VPC Peering is per-pair; Transit Gateway is hub-and-spoke. The math:
- 5 VPCs needing full mesh via peering: 10 connections
- 5 VPCs via Transit Gateway: 5 attachments
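The gap widens quickly as VPCs are added, because a full mesh of n VPCs needs n(n-1)/2 peering connections while a hub needs n attachments. A minimal sketch of that count, with hypothetical function names:

```python
def peering_connections(n_vpcs):
    """Connections needed for a full mesh: one per unordered pair of VPCs."""
    return n_vpcs * (n_vpcs - 1) // 2


def tgw_attachments(n_vpcs):
    """Attachments needed for hub-and-spoke: one per VPC."""
    return n_vpcs


for n in (3, 5, 10, 20):
    print(f"{n:2} VPCs: {peering_connections(n):3} peerings vs {tgw_attachments(n):2} attachments")
```

At 5 VPCs this gives the 10 versus 5 shown above; at 20 VPCs it is 190 peering connections versus 20 attachments.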
Transit Gateway also supports VPN and Direct Connect attachments — one connection point for the corporate network.
Cost: ~$0.05/hour per attachment + $0.02/GB processed. Adds up at scale but typically less than the engineering cost of managing many VPC peering connections manually.
What we avoid: full-mesh VPC peering across many VPCs. Maintenance nightmare.
Pattern 7: PrivateLink for cross-account/cross-VPC service exposure
When workloads in different VPCs/accounts need to consume each other’s services, PrivateLink is the cleanest pattern.
Example: a centralized observability platform in one account, exposed to consumer accounts via PrivateLink. Consumer VPCs see a VPC Endpoint they can target via DNS; the actual service stays in its own VPC. No CIDR conflicts, no transit gateway routing, scoped IAM.
PrivateLink is also how SaaS providers (Snowflake, Datadog, etc.) expose their services to your VPC privately.
Pattern 8: Bastion-less access
Bastion hosts are the legacy pattern. Modern alternatives:
- AWS Systems Manager Session Manager: shell access to private EC2 instances, authorized via IAM, with no bastion and no open inbound port. Audit logs are native.
- AWS Client VPN: full VPC access for ops workflows that need it.
- EC2 Instance Connect Endpoint: SSH proxy into private instances via IAM.
- Tailscale / Twingate: zero-trust mesh that works across cloud/on-prem.
Our default: SSM Session Manager for occasional access; Tailscale or Client VPN for ongoing ops work. Bastion hosts are extra surface to manage.
Pattern 9: Flow logs to S3 + Athena/CloudWatch Logs for incident response
VPC Flow Logs capture every flow (5-tuple + action). Without them, “what was this instance talking to before it got compromised?” is unanswerable.
Pattern: Flow logs to S3 in Parquet format (cheaper than CloudWatch Logs at scale), Athena queries when you need them. ~$0.01/GB stored, queries on demand.
For active security monitoring: forward flow logs to your SIEM (Splunk, Datadog, etc.) or use AWS Security Lake.
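To make the incident-response question concrete, the sketch below answers "what was this instance talking to?" from flow log records. It assumes the default (version 2) space-delimited record format rather than Parquet, and the sample records are fabricated for illustration. In the S3 + Athena setup described above, the same grouping would be a GROUP BY query instead.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Map one space-delimited flow log line onto the default field names."""
    return dict(zip(FIELDS, line.split()))


def bytes_by_destination(lines, src_ip):
    """Total bytes sent by src_ip per (dstaddr, dstport, action), largest first."""
    totals = Counter()
    for line in lines:
        rec = parse_record(line)
        # NODATA/SKIPDATA records carry "-" placeholders, so skip them.
        if rec.get("log_status") != "OK" or rec["srcaddr"] != src_ip:
            continue
        key = (rec["dstaddr"], int(rec["dstport"]), rec["action"])
        totals[key] += int(rec["bytes"])
    return totals.most_common()


# Fabricated sample records, for illustration only.
SAMPLE = [
    "2 123456789012 eni-0aaa 10.10.20.55 10.10.36.10 44321 5432 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0aaa 10.10.20.55 203.0.113.7 44400 443 6 50 120000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0aaa 10.10.20.55 203.0.113.7 44401 443 6 5 3000 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0aaa - - - - - - - 1700000120 1700000180 - NODATA",
]

for (dst, port, action), total in bytes_by_destination(SAMPLE, "10.10.20.55"):
    print(f"{dst}:{port} {action} {total} bytes")
```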
Anti-patterns we routinely rip out
- Public-subnet-only deployments. Apps in public subnets, exposed by security groups. One SG misconfiguration = direct internet exposure. Use private subnets behind ALB/NLB.
- Default SG modified. AWS’s default security group is a footgun — leaving it as the active SG with default rules is a recipe for accidental exposure. Use named SGs; let the default rot.
- 0.0.0.0/0 inbound on anything that isn’t an internet-facing ALB. “Allow SSH from anywhere” is how a Tuesday becomes an incident.
- Single AZ deployments for prod. Saving NAT Gateway cost by going single-AZ is fine for non-prod; in prod, you’ve accepted an AZ outage = full outage.
- Hardcoded private IPs. Service A calling Service B at 10.10.20.55 is brittle. Use service discovery (Route 53 Private Zones, ECS Service Discovery, EKS DNS).
- AssociatePublicIpAddress: true on instances in private subnets. Confusing and often unintentional. Audit and remove.
What we deploy by default
For new AWS platform builds:
- One VPC per environment per region, in its own AWS account
- /16 CIDRs allocated from a non-overlapping plan
- Three-tier subnet topology (public / private-app / private-data) replicated across 3 AZs
- VPC Endpoints for S3, DynamoDB, KMS, Secrets Manager, ECR
- NAT Gateway per AZ (or single NAT for cost-sensitive non-prod)
- Transit Gateway for multi-VPC connectivity once we have 3+ VPCs
- Security groups for access control; NACLs at default
- SSM Session Manager for instance access (no bastion)
- VPC Flow Logs to S3 in Parquet, retention per compliance regime
- Reserved IP space documented in a CIDR allocation spreadsheet
For hospital and banking platforms with strict isolation requirements, we typically add: dedicated VPCs per tenant/customer, PrivateLink for inter-tenant service exposure, AWS Network Firewall in front of NAT for egress filtering.
The thing VPC design doesn’t solve
VPC design is the network layer. It doesn’t solve:
- Application security: SQL injection, auth bypass, etc. — that’s app-layer.
- Identity and access: who can call what API — see IAM patterns piece.
- Secret management: how creds get into apps — see Vault piece.
- Observability: VPC Flow Logs tell you what flowed; not what failed.
VPC design is a foundation. It limits the blast radius of mistakes; it doesn’t prevent mistakes.
The pattern of patterns
VPC design is one of those “decisions you make once and live with” things. The patterns above aren’t novel — they’re in the AWS Well-Architected Framework, AWS Whitepapers, and a thousand re:Invent talks. The discipline is applying them consistently before the platform has 200 workloads on it.
The teams that ship AWS platforms that hold up under audit aren’t the ones with the most clever networking. They’re the ones who set up the basics deliberately and resisted the urge to take shortcuts when it was time to ship.
VPC design is foundational. The shortcuts cost more later than the deliberate setup costs now. If you’re standing up a new AWS platform and want a second opinion on the network shape, our cloud infrastructure team has shipped this pattern for healthcare, finance, and SaaS clients. Tell us about the platform.