NLBs Get Security Groups

In August 2023 AWS introduced Security Groups for Network Load Balancers and now recommends that any new NLB be associated with a Security Group. What AWS does not state is the obvious corollary: you should stop using NLBs without Security Groups. That may be easier said than done. Here is some advice with examples in Terraform and Python.

AWS does not allow an NLB created without Security Groups to be converted into one that has them, or vice versa. Since v5.13.0, Terraform's hashicorp/aws provider handles these cases by triggering a resource replacement. Some older versions of the provider, e.g. v4.59.0, allow NLBs to be created with Security Groups but do not trigger replacement when they should; this behavior is undocumented.

Receiving PrivateLink Traffic

NLB without a Security Group

Our NLB without Security Groups is used as part of a VPC Endpoint Service which will receive PrivateLink traffic targeting an ECS Service.

resource "aws_vpc_endpoint_service" "svc" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
}

resource "aws_lb" "svc" {
  name               = "our-svc-${random_string.nlb_suffix.result}"
  internal           = true
  load_balancer_type = "network"
  subnets            = [for subnet in aws_subnet.private : subnet.id]
  lifecycle {
    create_before_destroy = true
  }
}

resource "random_string" "nlb_suffix" {
  length  = 6
  special = false
}

The aws_vpc_endpoint_service resource requires at least one NLB and while it is attached that NLB cannot be destroyed. Replacing it requires creating a new NLB and updating the VPC Endpoint Service to use it before destroying the old one. This behavior is configured in Terraform with the create_before_destroy lifecycle setting. NLBs must have unique names so we avoid collision by appending a suffix from the random_string resource (which automatically becomes create_before_destroy since aws_lb.svc depends on it).
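Outside Terraform, the same attach-before-detach ordering could be sketched with boto3 roughly as follows. This is only an illustration, not what the provider does internally: swap_nlb and RecordingClient are hypothetical names, the IDs and ARNs are placeholders, and the client is passed in so the ordering can be checked against a stub instead of a real account.

```python
def swap_nlb(ec2_client, service_id, old_nlb_arn, new_nlb_arn):
    """Attach the new NLB to the Endpoint Service before detaching the old one."""
    # Step 1: attach the replacement so the Service never has zero NLBs.
    ec2_client.modify_vpc_endpoint_service_configuration(
        ServiceId=service_id,
        AddNetworkLoadBalancerArns=[new_nlb_arn],
    )
    # Step 2: detach the old NLB; only after this can it be destroyed.
    ec2_client.modify_vpc_endpoint_service_configuration(
        ServiceId=service_id,
        RemoveNetworkLoadBalancerArns=[old_nlb_arn],
    )


class RecordingClient:
    """Hypothetical stand-in for boto3.client("ec2") that records calls."""

    def __init__(self):
        self.calls = []

    def modify_vpc_endpoint_service_configuration(self, **kwargs):
        self.calls.append(kwargs)


# Placeholder identifiers, for illustration only.
stub = RecordingClient()
swap_nlb(stub, "vpce-svc-example", "arn:example:old-nlb", "arn:example:new-nlb")
```

With a real client in place of the stub, the same function would issue the two calls in that order.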

Security Group Rules

An NLB without a Security Group behaves as though it had a very permissive one assigned, allowing all traffic in both directions.

resource "aws_lb" "svc" {
  ...
  security_groups = [
    aws_security_group.svc_nlb.id
  ]
}

resource "aws_security_group" "svc_nlb" {
  name   = "our_svc_nlb"
  vpc_id = aws_vpc.vpc.id
  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [
      "0.0.0.0/0"
    ]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [
      "0.0.0.0/0"
    ]
  }
}

We can update the IP-based rules permitting ingress to our target so they use the ID of our NLB Security Group. Egress rules from our NLB should be restricted to the ID of the target Security Group.

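# Inline rules refer to another group through the security_groups list.
# Two groups that reference each other inline form a dependency cycle;
# standalone rule resources (e.g. aws_security_group_rule) are one way around that.
resource "aws_security_group" "svc_ecs" {
  ...
  ingress {
    ...
    security_groups = [aws_security_group.svc_nlb.id]
  }
}

resource "aws_security_group" "svc_nlb" {
  ...
  ingress {}
  egress {
    ...
    security_groups = [aws_security_group.svc_ecs.id]
  }
}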

Ingress from PrivateLink traffic is a special case because it uses the IP of the client Endpoint, which we don't necessarily control. To allow this traffic without allowing all traffic, we can set enforce_security_group_inbound_rules_on_private_link_traffic (a flag introduced in hashicorp/aws v5.30.0) to "off", so that inbound rules are not applied to PrivateLink traffic, and then remove all ingress rules.

resource "aws_lb" "svc" {
  ...
  enforce_security_group_inbound_rules_on_private_link_traffic = "off"
  security_groups = [
    aws_security_group.svc_nlb.id
  ]
}
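One way to confirm the setting after an apply is to read it back from the elbv2 describe_load_balancers response. The sketch below works on a hand-written sample entry shaped like that response; privatelink_enforcement is a hypothetical helper and the Security Group ID is a placeholder.

```python
def privatelink_enforcement(lb):
    """Return 'on' or 'off' for an NLB that has Security Groups, else None."""
    # An NLB without Security Groups has no inbound rules to enforce.
    if lb.get("Type") != "network" or not lb.get("SecurityGroups"):
        return None
    return lb.get("EnforceSecurityGroupInboundRulesOnPrivateLinkTraffic")


# Hand-written sample entry (placeholder values, not real output).
sample_nlb = {
    "Type": "network",
    "SecurityGroups": ["sg-example"],
    "EnforceSecurityGroupInboundRulesOnPrivateLinkTraffic": "off",
}
setting = privatelink_enforcement(sample_nlb)
```

Against a real account, the entries would come from boto3.client("elbv2").describe_load_balancers(...)["LoadBalancers"].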

Terraform will create this new Security Group, then a new NLB, then modify the VPC Endpoint Service before destroying the old NLB.

New NLB replacing the old

Active Client Connections

The VPC Endpoint Service cannot be updated to use a different NLB while it has active client Endpoint connections. We'll need to reject any active connections before modifying the Service and then accept them again afterward. This process will cause downtime. Endpoint connections can be managed in the AWS Console, but instead we'll use a pair of aws_lambda_invocation resources. The Python code for the Lambda function is provided later.

resource "aws_lambda_invocation" "reject" {
  function_name = "vpc_endpoint_connection_toggle"
  input = jsonencode({
    svc_name_tag_val = "our-svc",
    accept_or_reject = "reject"
  })
}

resource "aws_vpc_endpoint_service" "svc" {
  ...
  tags = {
    Name = "our-svc"
  }
  depends_on = [aws_lambda_invocation.reject]
}

resource "aws_lambda_invocation" "accept" {
  function_name = "vpc_endpoint_connection_toggle"
  input = jsonencode({
    svc_name_tag_val = "our-svc",
    accept_or_reject = "accept"
  })
  depends_on = [aws_vpc_endpoint_service.svc]
}

We enforce the desired order of operations using depends_on. The "accept" invocation depends on the Endpoint Service so it will only be run after Service changes are complete. The Endpoint Service depends on the "reject" invocation which will be run first. The Lambda function uses svc_name_tag_val to discover our Endpoint Service indirectly because using aws_vpc_endpoint_service.svc.id in a resource the Service also depends on would form a cycle.
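The ordering and the cycle can be illustrated with a small dependency graph. This is a simplified sketch using Python's standard graphlib module, not Terraform's actual graph; the deps and cyclic dictionaries are illustrative and only model the three resources discussed here.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the resources in its set, which must be applied first.
deps = {
    "aws_lambda_invocation.reject": set(),
    "aws_vpc_endpoint_service.svc": {"aws_lambda_invocation.reject"},
    "aws_lambda_invocation.accept": {"aws_vpc_endpoint_service.svc"},
}
order = list(TopologicalSorter(deps).static_order())

# If "reject" referenced the Service ID directly, it would also depend on
# the Service, which already depends on "reject": a cycle.
cyclic = dict(deps)
cyclic["aws_lambda_invocation.reject"] = {"aws_vpc_endpoint_service.svc"}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
```

The first graph sorts into reject, Service, accept; the second cannot be ordered at all, which is why the function looks the Service up by its Name tag instead.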

As described so far, the function accepts or rejects every connection and could cause problems if invoked unexpectedly. Rejecting client connections causes downtime, and accepting all connections might allow unintended traffic into the VPC. We make the function much safer to invoke by introducing declarative fail-safe conditions.

resource "aws_lambda_invocation" "reject" {
  ...
  input = jsonencode({
    if_using_nlb_arn = "..."
    ...
  })
  triggers = {
    uuid = uuid()
  }
}

resource "aws_lambda_invocation" "accept" {
  ...
  input = jsonencode({
    client_endpoints = ["...","..."]
    ...
  })
  triggers = {
    uuid = uuid()
  }
}

When rejecting connections, the ARN of the NLB to be removed is passed as if_using_nlb_arn and the function checks whether that NLB is attached to the Endpoint Service. If the NLB is not attached, the function does nothing. When accepting connections, a set of Endpoint IDs is provided in client_endpoints and only those connections are accepted. The triggers block uses uuid(), which produces a new value on every run, so both invocations are re-run on every apply. These sanity checks make changes safer and simpler to reason about.
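The shape of the two events can be made explicit with a small builder that refuses to produce a payload lacking its fail-safe field. This is a sketch under assumptions: build_payload is a hypothetical helper, and the ARN and Endpoint ID are placeholders standing in for the values elided above.

```python
import json


def build_payload(action, svc_name_tag_val, if_using_nlb_arn=None, client_endpoints=None):
    """Build the JSON event for the toggle function, requiring its fail-safe field."""
    if action == "reject":
        if not if_using_nlb_arn:
            raise ValueError("reject requires if_using_nlb_arn")
        extra = {"if_using_nlb_arn": if_using_nlb_arn}
    elif action == "accept":
        if not client_endpoints:
            raise ValueError("accept requires client_endpoints")
        extra = {"client_endpoints": list(client_endpoints)}
    else:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps(
        {"svc_name_tag_val": svc_name_tag_val, "accept_or_reject": action, **extra}
    )


# Placeholder values, for illustration only.
reject_event = build_payload("reject", "our-svc", if_using_nlb_arn="arn:example:old-nlb")
accept_event = build_payload("accept", "our-svc", client_endpoints=["vpce-example1"])
```

Either string, encoded to bytes, could be passed as the Payload of a manual boto3 Lambda invoke call when testing the function outside Terraform.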

The Lambda function with these features can be written in Python.

import boto3
import time

ec2 = boto3.client("ec2")

def handler(event, ctx):
    """Accept or reject Endpoint connections on every Service matching the Name tag."""
    accept_or_reject = event['accept_or_reject'].lower()
    for svc_id in service_ids_by_name_tag_value(event['svc_name_tag_val']):
        if accept_or_reject == "accept":
            # Only the explicitly listed Endpoints are accepted.
            accept_client_connections(svc_id, event['client_endpoints'])
        elif accept_or_reject == "reject":
            # Fail-safe: do nothing unless the NLB being removed is still attached.
            if service_has_nlb(svc_id, event['if_using_nlb_arn']):
                ec2.reject_vpc_endpoint_connections(
                    ServiceId = svc_id,
                    VpcEndpointIds = all_endpoint_ids(svc_id)
                )

def service_ids_by_name_tag_value(name):
    svcs = ec2.describe_vpc_endpoint_service_configurations(
        Filters = [
            {
                'Name': 'tag:Name',
                'Values': [
                    name
                ]
            }
        ]
    )
    return [svc['ServiceId'] for svc in svcs['ServiceConfigurations']]

def service_has_nlb(svc_id, nlb_arn):
    svcs = ec2.describe_vpc_endpoint_service_configurations(
        ServiceIds = [
            svc_id
        ]
    )
    return nlb_arn in svcs['ServiceConfigurations'][0]['NetworkLoadBalancerArns']

def all_endpoint_ids(svc_id):
    conns = ec2.describe_vpc_endpoint_connections(
        Filters = [
            {
                'Name': 'service-id',
                'Values': [
                    svc_id
                ]
            }
        ]
    )
    return [conn['VpcEndpointId'] for conn in conns['VpcEndpointConnections']]

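def accept_client_connections(svc_id, endpoint_ids):
    # Retry every 2 seconds until no connection is reported as unsuccessful.
    # There is no attempt limit, so the Lambda timeout is the only upper bound.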
    is_successful = False
    while not is_successful:
        resp = ec2.accept_vpc_endpoint_connections(
            ServiceId = svc_id,
            VpcEndpointIds = endpoint_ids
        )
        is_successful = len(resp['Unsuccessful']) == 0
        if not is_successful:
            time.sleep(2)
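The helpers above read a single page of results. For a Service with many connections, a paginated variant might look like the sketch below. It is not part of the original function: all_endpoint_ids_paginated and TwoPageClient are hypothetical names, the client is passed in so the loop can be exercised against a stub, and the IDs are placeholders.

```python
def all_endpoint_ids_paginated(ec2_client, svc_id):
    """Collect Endpoint IDs for a Service across every page of results."""
    endpoint_ids = []
    kwargs = {"Filters": [{"Name": "service-id", "Values": [svc_id]}]}
    while True:
        page = ec2_client.describe_vpc_endpoint_connections(**kwargs)
        endpoint_ids += [c["VpcEndpointId"] for c in page["VpcEndpointConnections"]]
        token = page.get("NextToken")
        if not token:
            # No further pages to fetch.
            return endpoint_ids
        kwargs["NextToken"] = token


class TwoPageClient:
    """Hypothetical stub that serves two pages of connections."""

    def describe_vpc_endpoint_connections(self, **kwargs):
        if "NextToken" not in kwargs:
            return {
                "VpcEndpointConnections": [{"VpcEndpointId": "vpce-example1"}],
                "NextToken": "page2",
            }
        return {"VpcEndpointConnections": [{"VpcEndpointId": "vpce-example2"}]}


ids = all_endpoint_ids_paginated(TwoPageClient(), "vpce-svc-example")
```

Passing the client as an argument is a design choice for testability; the original helpers use the module-level ec2 client instead.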

Six Years Of Tech Debt

Network Load Balancers were announced in September 2017 and were not touted for their security. Integrating them with existing Security Groups has been a thorny problem for AWS customers, and it is finally solved. You should do anything necessary to prevent Network Load Balancers from being created without an associated Security Group and to replace any you are now using.
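A starting point for finding the NLBs that still need replacing is to filter the elbv2 describe_load_balancers output. This sketch assumes the SecurityGroups field is empty or absent for an NLB created without one; nlbs_without_security_groups is a hypothetical helper and the sample entries are hand-written placeholders, not real output.

```python
def nlbs_without_security_groups(load_balancers):
    """Return ARNs of Network Load Balancers that have no Security Group."""
    return [
        lb["LoadBalancerArn"]
        for lb in load_balancers
        if lb.get("Type") == "network" and not lb.get("SecurityGroups")
    ]


# Hand-written entries shaped like describe_load_balancers output.
sample = [
    {"LoadBalancerArn": "arn:example:nlb-old", "Type": "network"},
    {"LoadBalancerArn": "arn:example:nlb-new", "Type": "network",
     "SecurityGroups": ["sg-example"]},
    {"LoadBalancerArn": "arn:example:alb", "Type": "application",
     "SecurityGroups": ["sg-example"]},
]
flagged = nlbs_without_security_groups(sample)
```

In a real account the entries could be gathered from every page of boto3.client("elbv2").get_paginator("describe_load_balancers").paginate().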