In August 2023 AWS introduced Security Groups for Network Load Balancers and now recommends that any new NLB be associated with a Security Group. AWS does not state the obvious: you should stop using NLBs without Security Groups. That may be easier said than done. Here is some advice with examples in Terraform and Python.
AWS does not allow us to transform an NLB which has no Security Groups into one which does, nor vice versa. Since v5.13.0 Terraform's hashicorp/aws provider handles these cases by triggering a resource replacement. It's an undocumented fact that some older versions of the provider, e.g. v4.59.0, allow NLBs to be created with Security Groups but will not trigger replacement when they should.
Our NLB without Security Groups is used as part of a VPC Endpoint Service which will receive PrivateLink traffic targeting an ECS Service.
resource "aws_vpc_endpoint_service" "svc" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
}

resource "aws_lb" "svc" {
  name               = "our-svc-${random_string.nlb_suffix.result}"
  internal           = true
  load_balancer_type = "network"
  subnets            = [for subnet in aws_subnet.private : subnet.id]

  lifecycle {
    create_before_destroy = true
  }
}

resource "random_string" "nlb_suffix" {
  length  = 6
  special = false
}
The aws_vpc_endpoint_service resource requires at least one NLB, and while it is attached that NLB cannot be destroyed. Replacing it requires creating a new NLB and updating the VPC Endpoint Service to use it before destroying the old one. This behavior is configured in Terraform with create_before_destroy. NLBs must have unique names, so we avoid collision by appending a suffix from the random_string resource (which automatically becomes create_before_destroy since aws_lb.svc depends on it).
An NLB without a Security Group behaves as if it had a very permissive one, so we start by attaching a Security Group that allows all traffic.
resource "aws_lb" "svc" {
  ...
  security_groups = [
    aws_security_group.svc_nlb.id
  ]
}

resource "aws_security_group" "svc_nlb" {
  name   = "our_svc_nlb"
  vpc_id = aws_vpc.vpc.id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = [
      "0.0.0.0/0"
    ]
  }

  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = [
      "0.0.0.0/0"
    ]
  }
}
We can update the IP-based rules permitting ingress to our target so they use the ID of our NLB Security Group. Egress rules from our NLB should be restricted to the ID of the target Security Group.
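resource "aws_security_group" "svc_ecs" {
  ...
  ingress {
    ...
    # Inline rules reference other groups through security_groups.
    security_groups = [aws_security_group.svc_nlb.id]
  }
}

resource "aws_security_group" "svc_nlb" {
  ...
  # The inline egress block is replaced by the standalone rule below.
}

# A standalone rule is used here because two groups that reference each
# other through inline rules would form a dependency cycle.
resource "aws_security_group_rule" "svc_nlb_egress" {
  type = "egress"
  ...
  security_group_id        = aws_security_group.svc_nlb.id
  source_security_group_id = aws_security_group.svc_ecs.id
}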
Ingress from PrivateLink traffic is a special case as it uses the IP of the client Endpoint, which we don't necessarily control. To allow ingress for this traffic but not all traffic, we can use an argument introduced in hashicorp/aws v5.30.0 named enforce_security_group_inbound_rules_on_private_link_traffic. Setting it to "off" exempts PrivateLink traffic from the inbound rules, so we can then remove all ingress rules.
resource "aws_lb" "svc" {
  ...
  enforce_security_group_inbound_rules_on_private_link_traffic = "off"
  security_groups = [
    aws_security_group.svc_nlb.id
  ]
}
Terraform will create this new Security Group, then a new NLB, then modify the VPC Endpoint Service before destroying the old NLB.
The VPC Endpoint Service cannot be updated to use a different NLB if it has active client Endpoint connections. We'll need to reject any active connections before modifying the Service and then accept them again afterward. This process will cause downtime. Endpoint connections can be managed in the AWS Console, but instead we'll use a pair of aws_lambda_invocation resources. Python code is provided later.
resource "aws_lambda_invocation" "reject" {
  function_name = "vpc_endpoint_connection_toggle"
  input = jsonencode({
    svc_name_tag_val = "our-svc",
    accept_or_reject = "reject"
  })
}

resource "aws_vpc_endpoint_service" "svc" {
  ...
  tags = {
    Name = "our-svc"
  }
  depends_on = [aws_lambda_invocation.reject]
}

resource "aws_lambda_invocation" "accept" {
  function_name = "vpc_endpoint_connection_toggle"
  input = jsonencode({
    svc_name_tag_val = "our-svc",
    accept_or_reject = "accept"
  })
  depends_on = [aws_vpc_endpoint_service.svc]
}
We enforce the desired order of operations using depends_on. The "accept" invocation depends on the Endpoint Service, so it will only be run after Service changes are complete. The Endpoint Service depends on the "reject" invocation, which will be run first. The Lambda function uses svc_name_tag_val to discover our Endpoint Service indirectly because using aws_vpc_endpoint_service.svc.id in a resource the Service also depends on would form a cycle.
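The ordering argument above can be checked with a small dependency graph. This is only an illustrative sketch using Python's standard graphlib module (Python 3.9+), not how Terraform itself is implemented; the helper names apply_order and has_cycle are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a resource to the set of resources it depends on,
# mirroring the depends_on arguments in the Terraform configuration.
dependencies = {
    "aws_vpc_endpoint_service.svc": {"aws_lambda_invocation.reject"},
    "aws_lambda_invocation.accept": {"aws_vpc_endpoint_service.svc"},
}


def apply_order(graph):
    """Return one valid order in which the resources could be applied."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """Report whether the dependency graph contains a cycle."""
    try:
        apply_order(graph)
    except CycleError:
        return True
    return False


# If the "reject" invocation read aws_vpc_endpoint_service.svc.id it would
# depend on the Service, which already depends on it: a cycle.
cyclic = {
    **dependencies,
    "aws_lambda_invocation.reject": {"aws_vpc_endpoint_service.svc"},
}

print(apply_order(dependencies))
print(has_cycle(cyclic))
```

The first graph sorts cleanly with "reject" first and "accept" last, while the second raises CycleError, which is why the function looks the Service up by its Name tag instead.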
This function would either accept or reject all connections and could cause problems if invoked unexpectedly. Rejecting client connections causes downtime and accepting all connections might allow unintended traffic into the VPC. We make the function much safer to invoke by introducing declarative fail-safe conditions.
resource "aws_lambda_invocation" "reject" {
  ...
  input = jsonencode({
    if_using_nlb_arn = "..."
    ...
  })
  # uuid() yields a new value on every run, so the function is invoked on every apply.
  triggers = {
    uuid = uuid()
  }
}

resource "aws_lambda_invocation" "accept" {
  ...
  input = jsonencode({
    client_endpoints = ["...","..."]
    ...
  })
  triggers = {
    uuid = uuid()
  }
}
When rejecting connections, the ARN of the NLB to be removed is identified with if_using_nlb_arn and the function will check if that NLB is attached to the Endpoint Service. If the NLB is not attached, the function does nothing. When accepting connections, a set of Endpoint IDs is provided in client_endpoints and only those connections are accepted. These sanity checks make changes safer and simpler to reason about.
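One way to make these conditions harder to bypass is to validate the payload before touching any AWS API. The sketch below is an assumption layered on top of the article's design: validate_event is a hypothetical helper, and only the key names (svc_name_tag_val, accept_or_reject, if_using_nlb_arn, client_endpoints) come from the configuration above.

```python
def validate_event(event):
    """Return a list of problems with an invocation payload; empty means OK."""
    problems = []
    action = str(event.get("accept_or_reject", "")).lower()

    if not event.get("svc_name_tag_val"):
        problems.append("svc_name_tag_val is required")

    if action == "reject":
        # Rejecting is only allowed when tied to a specific NLB.
        if not event.get("if_using_nlb_arn"):
            problems.append("reject requires if_using_nlb_arn")
    elif action == "accept":
        # Accepting is only allowed for an explicit list of Endpoints.
        endpoints = event.get("client_endpoints")
        if not isinstance(endpoints, list) or not endpoints:
            problems.append("accept requires a non-empty client_endpoints list")
    else:
        problems.append("accept_or_reject must be 'accept' or 'reject'")

    return problems
```

A handler could call this first and raise an error when the returned list is non-empty, so a malformed invocation fails loudly instead of rejecting or accepting everything.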
The Lambda function with these features can be written in Python.
import time

import boto3

ec2 = boto3.client("ec2")


def handler(event, ctx):
    """Accept or reject Endpoint connections on Services matched by Name tag."""
    accept_or_reject = event['accept_or_reject'].lower()
    for svc_id in service_ids_by_name_tag_value(event['svc_name_tag_val']):
        if accept_or_reject == "accept":
            accept_client_connections(svc_id, event['client_endpoints'])
        elif accept_or_reject == "reject":
            # Fail-safe: only reject while the NLB being replaced is attached.
            if service_has_nlb(svc_id, event['if_using_nlb_arn']):
                endpoint_ids = all_endpoint_ids(svc_id)
                # Skip the call when there is nothing to reject.
                if endpoint_ids:
                    ec2.reject_vpc_endpoint_connections(
                        ServiceId=svc_id,
                        VpcEndpointIds=endpoint_ids
                    )


def service_ids_by_name_tag_value(name):
    """Find Endpoint Service IDs whose Name tag equals the given value."""
    svcs = ec2.describe_vpc_endpoint_service_configurations(
        Filters=[{'Name': 'tag:Name', 'Values': [name]}]
    )
    return [svc['ServiceId'] for svc in svcs['ServiceConfigurations']]


def service_has_nlb(svc_id, nlb_arn):
    """Check whether the given NLB is attached to the Endpoint Service."""
    svcs = ec2.describe_vpc_endpoint_service_configurations(
        ServiceIds=[svc_id]
    )
    return nlb_arn in svcs['ServiceConfigurations'][0]['NetworkLoadBalancerArns']


def all_endpoint_ids(svc_id):
    """List the IDs of every client Endpoint connected to the Service."""
    conns = ec2.describe_vpc_endpoint_connections(
        Filters=[{'Name': 'service-id', 'Values': [svc_id]}]
    )
    return [conn['VpcEndpointId'] for conn in conns['VpcEndpointConnections']]


def accept_client_connections(svc_id, endpoint_ids):
    """Accept the listed Endpoints, retrying every 2 seconds until all succeed.

    The Lambda timeout is the only upper bound on this loop.
    """
    is_successful = False
    while not is_successful:
        resp = ec2.accept_vpc_endpoint_connections(
            ServiceId=svc_id,
            VpcEndpointIds=endpoint_ids
        )
        is_successful = len(resp['Unsuccessful']) == 0
        if not is_successful:
            time.sleep(2)
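The accept loop retries until every connection succeeds, so a permanently failing Endpoint would keep it running until the Lambda times out. A bounded variant might look like the following sketch. The function name accept_with_deadline and its max_attempts, delay_seconds and sleep parameters are assumptions for illustration; the client is passed in so the logic can be exercised without AWS.

```python
import time


def accept_with_deadline(ec2_client, svc_id, endpoint_ids,
                         max_attempts=10, delay_seconds=2, sleep=time.sleep):
    """Retry accept_vpc_endpoint_connections a bounded number of times.

    Returns True once the API reports no unsuccessful items, or False
    if max_attempts calls have been made without full success.
    """
    for attempt in range(1, max_attempts + 1):
        resp = ec2_client.accept_vpc_endpoint_connections(
            ServiceId=svc_id,
            VpcEndpointIds=endpoint_ids,
        )
        if not resp.get("Unsuccessful"):
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return False
```

Returning False rather than looping forever lets the caller decide whether to raise, which would surface the failure in the Terraform run instead of as a timeout.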
Network Load Balancers were announced in September 2017 and were not touted for their security. Integrating them with existing Security Groups has been a thorny problem for AWS customers, and it is finally solved. You should do whatever is necessary to prevent Network Load Balancers from being created without an associated Security Group and to replace any you are using now.
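To find the NLBs that still need replacing, an audit along these lines could help. It is a minimal sketch that assumes the response shape of the elbv2 describe_load_balancers call, where each entry carries Type, LoadBalancerArn and an optional SecurityGroups list; the function names are hypothetical.

```python
def nlbs_without_security_groups(load_balancers):
    """Return ARNs of Network Load Balancers that have no Security Groups."""
    return [
        lb["LoadBalancerArn"]
        for lb in load_balancers
        # A missing or empty SecurityGroups list both mean "none attached".
        if lb.get("Type") == "network" and not lb.get("SecurityGroups")
    ]


def audit(elbv2_client):
    """Page through every load balancer visible to the client's region."""
    paginator = elbv2_client.get_paginator("describe_load_balancers")
    found = []
    for page in paginator.paginate():
        found.extend(nlbs_without_security_groups(page["LoadBalancers"]))
    return found
```

With boto3 installed and credentials configured, audit(boto3.client("elbv2")) would return the ARNs to replace in the current region; it needs to be run once per region and account.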