Happy New Year! I hope you've had a chance to update your copyright line on all your static websites!
I also hope you've had a chance to upgrade all your dependencies in 2023, because a lot has changed in the past few years, and if you're anything like me, you pinned all your dependencies in 2020 and haven't looked back since.
Recently, I've been working on deploying a GCP GKE cluster via terraform "from scratch", which packs in a lot of jargon, so let me explain:
- GCP: Google Cloud Platform, Google's cloud system that competes well on cost and performance, but not in mindshare, so the ecosystem is lighter.
- GKE: Google Kubernetes Engine, Google's managed Kubernetes offering. Despite being the authors of Kubernetes itself, they're not the largest provider of Kubernetes.
- terraform: One of a few "Infrastructure-as-Code" tools, though it isn't actually "code" as I'd call it; it's more "infrastructure-as-declarative-configuration", with some programming ability injected into it (see the tiny sketch after this list). When you have simple needs, it's great; the issues arise when you try to be fancy or your needs get complicated.
- "from scratch": This isn't really from scratch, since I'm not physically touching any hardware, I'm not writing to any SSD's directly, and I'm not even able to SSH into these boxes (easily). Instead, I'm sitting on top of layers and layers of abstractions that hide the fact that there's people running around in Oregon somewhere replacing hard drives or whole systems.
Before you can terraform a GCP GKE Cluster
Something that you don't typically deal with in AWS or Azure is whether APIs are enabled or not. In GCP, Google made the decision to enable only a handful of APIs for everyone; if you want to use others, you have to enable them yourself. These APIs range from basic things like the ability to start a VM (`compute.googleapis.com`) to more complicated ones like setting up a container registry (`containerregistry.googleapis.com`).
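The module we'll use below handles a whole list of APIs, but to show what it wraps: enabling a single API is one resource. A minimal sketch, assuming a configured `google` provider:

```hcl
# Enable one API with the raw resource instead of the module below.
resource "google_project_service" "compute" {
  project = var.project_id
  service = "compute.googleapis.com"

  # Keep the API enabled even if this resource is destroyed; see the
  # disable_services_on_destroy notes in the module below.
  disable_on_destroy = false
}
```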
Before we get to the example code, note that the examples can be copy-pasted into a folder and will make up a terraform module, as well as a user of that module. As a result, there will be references to variables (`var.project_id`) that are defined in the module and set by the calling module.
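For orientation, here's a hedged sketch of that calling side; the `gcp-gke` folder name and file layout are my assumptions, not something this article prescribes:

```hcl
# main.tf (root module), with the snippets below saved in a gcp-gke/ folder.
provider "google" {
  project = var.project_id
  region  = var.region
}

module "gcp-gke" {
  source = "./gcp-gke"

  project_id   = var.project_id
  region       = var.region
  cluster_name = var.cluster_name
  gcr_name     = var.gcr_name
  gcr_writers  = var.gcr_writers
}
```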
Let's enable some APIs:
```hcl
# gcp-gke/enable-apis.tf
module "enables-google-apis" {
  # The double // is necessary so that terraform knows how to split this into a source address and a folder path.
  source = "terraform-google-modules/project-factory/google//modules/project_services"
  # Latest as of 2023-10-20
  # https://github.com/terraform-google-modules/terraform-google-project-factory/releases
  version = "14.4.0"

  # Optional: you can leave this off and set it in the "google" provider instead
  project_id = var.project_id

  activate_apis = [
    "iam.googleapis.com",                  # Identity management, always necessary
    "cloudresourcemanager.googleapis.com", # Metadata management
    "compute.googleapis.com",              # Compute nodes
    "containerregistry.googleapis.com",    # GCR: Container image repository
    "artifactregistry.googleapis.com",     # Artifact Registry, needed by gcr.tf below
    "container.googleapis.com",            # GKE: Kubernetes itself
    "storage-component.googleapis.com",    # GCS: Cloud Storage, for creating the storage behind GCR
    "logging.googleapis.com",              # Manages Cloud Logging
    "monitoring.googleapis.com",           # and Cloud Monitoring
    "dns.googleapis.com",                  # For setting up DNS nameservers and records for our services
    # "tpu.googleapis.com",                # Optional, if you want to use TPU instances
  ]

  # If you want to use TPUs:
  # activate_api_identities = [
  #   {
  #     api   = "tpu.googleapis.com"
  #     roles = ["roles/viewer", "roles/storage.admin"]
  #   }
  # ]

  # Only set these to "true" if this is the only terraform or manually-deployed
  # infrastructure in the whole of this GCP project. Otherwise, you risk disabling
  # services that others are using when you tear this down.
  disable_dependent_services  = false
  disable_services_on_destroy = false
}
```
Create a VPC (Virtual Private Cloud) network
If you don't do this, I believe the GKE cluster would create one for you, but then it would be tougher to control parts of it. For my use-case this was particularly important, since I wanted to create other resources in the same VPC, such as a Cloud SQL database instance (PostgreSQL). If there's interest, I can write a follow-up post on how to do that.
```hcl
# Docs: https://registry.terraform.io/modules/terraform-google-modules/network/google/latest
module "vpc" {
  source  = "terraform-google-modules/network/google"
  version = "~> 8.1.0"

  project_id   = var.project_id # Optional, see above
  network_name = "${local.cluster_name}-vpc"

  subnets = [
    {
      subnet_name   = "${local.cluster_name}-vpc-subnet-01"
      subnet_ip     = "10.0.0.0/16"
      subnet_region = var.region # Such as "europe-west1"
    }
  ]

  secondary_ranges = {
    "${local.cluster_name}-vpc-subnet-01" = [
      {
        range_name    = "${local.cluster_name}-vpc-subnet-01-pods"
        ip_cidr_range = "192.168.0.0/18"
      },
      {
        range_name    = "${local.cluster_name}-vpc-subnet-01-services"
        ip_cidr_range = "192.168.64.0/18"
      },
    ]
  }
}
```
This creates a VPC with one subnet and three IP ranges, used for different purposes:
- Nodes: 10.0.0.0/16 (as in, up to 10.0.255.255), the subnet's primary range
- Pods: 192.168.0.0/18 (as in, up to 192.168.63.255)
- Services: 192.168.64.0/18 (as in, up to 192.168.127.255)
We need these for the GKE cluster, as the subnetwork and secondary ranges are required arguments.
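As an aside, if you'd rather derive those two secondary ranges than hardcode them, terraform's `cidrsubnet` function can split a parent range; a small sketch (treating 192.168.0.0/16 as the parent is my assumption):

```hcl
# Split 192.168.0.0/16 into /18 quarters; the first two match the
# pods and services ranges from the list above.
locals {
  pods_range     = cidrsubnet("192.168.0.0/16", 2, 0) # "192.168.0.0/18"
  services_range = cidrsubnet("192.168.0.0/16", 2, 1) # "192.168.64.0/18"
}
```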
Deploy the cluster itself
Watch out: this isn't the end of the story, it's just the second step. Once the cluster exists, we need to export the ability to connect to it, so we can deploy an actual application to it.
With that out of the way:
```hcl
# Docs: https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/latest
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google"
  # Unlike the other modules here, no version pin; consider pinning one from the registry.

  project_id = var.project_id
  name       = var.cluster_name
  region     = var.region
  # Either make it regional, or specify which zones you want.
  regional = true
  # zones = var.zones

  network           = module.vpc.network_name
  subnetwork        = module.vpc.subnets_names[0]
  ip_range_pods     = "${local.cluster_name}-vpc-subnet-01-pods"
  ip_range_services = "${local.cluster_name}-vpc-subnet-01-services"

  # Set these based on your needs
  http_load_balancing        = false
  network_policy             = false
  horizontal_pod_autoscaling = false
  filestore_csi_driver       = false

  # cluster_autoscaling = {
  #   enabled             = true
  #   autoscaling_profile = "OPTIMIZE_UTILIZATION"
  #   max_cpu_cores       = 100
  #   min_cpu_cores       = 2
  #   max_memory_gb       = 1024
  #   min_memory_gb       = 4
  #   gpu_resources       = []
  # }

  node_pools = [
    {
      # Let's make sure there's at least 1 node in this pool
      name               = "standard"
      machine_type       = "e2-standard-2"
      min_count          = 1
      max_count          = 10
      initial_node_count = 1
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard" # HDD
      auto_repair        = true
      auto_upgrade       = true
      # Don't use a spot/preemptible instance so this node is (mostly) stable.
      # It can still go away, due to repairs, upgrades, etc.
      preemptible = false
    },
    # ... We can have many node pools, but for brevity I'll keep it to one
  ]

  # Give all the nodes the ability to write logs, read/write monitoring, and read/write cloud data
  node_pools_oauth_scopes = {
    all = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  node_pools_labels = {
    all      = {}
    standard = { "purpose" = "any" }
  }

  # node_pools_metadata = ...
  # node_pools_taints   = ...
  # node_pools_tags     = ...

  depends_on = [module.enables-google-apis]
}
```
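If you're curious about those commented-out placeholders, they take the same per-pool map shape as `node_pools_labels` above. A hedged sketch for `node_pools_taints` (the key and value are made up for illustration):

```hcl
# Hypothetical taints, keyed by node pool name like the labels above.
node_pools_taints = {
  all = []
  standard = [{
    key    = "dedicated"     # made-up key
    value  = "standard-pool" # made-up value
    effect = "NO_SCHEDULE"
  }]
}
```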
Side note: for anyone else who looks at these docs, did you know the tables scroll horizontally to show more columns? I only just realized, because I was checking whether some fields were required!
You should change and set almost every value on the right-hand side of those = signs according to your needs. Some of these values were picked up from a recent project, others were changed, and others were made up outright, so don't think this is "ready to go" unless all you want is any kubernetes/GKE cluster at all.
Export the ability to deploy to it
So far, we've only created the cluster; we don't actually have a way to use it. One way would be through more terraform resources, but I'm not a fan of that. Almost all of the kubernetes ecosystem is defined in YAML files, whether as raw resources, in helm charts, or elsewhere, but terraform is obstinately HCL, and the only kubernetes resource that takes in YAML files says to avoid its use. That means you can't really use these two ecosystems together, and coupled with the slowness of a `terraform plan`/`apply` cycle, I wouldn't want to.
Instead, let's export a valid `kubeconfig.yaml` file to disk to use with `kubectl`, `kustomize`, `k9s`, and other k-prefixed tools.
module "gke_auth" {
source = "terraform-google-modules/kubernetes-engine/google//modules/auth"
version = "~> 19.0"
project_id = var.project_id
cluster_name = module.gke.name
location = module.gke.location
depends_on = [module.gke]
}
output "kubeconfig" {
description = "kubeconfig.yaml contents"
sensitive = true
value = module.gke_auth.kubeconfig_raw
}
resource "local_file" "kubeconfig" {
content = module.gke_auth.kubeconfig_raw
filename = "kubeconfig_${var.cluster_name}.yaml"
}
That includes two ways to output this. If you haven't dealt with terraform modules yet, I recommend you read about them next. This article so far includes samples meant to go into a module that you use, and I'll post the `variables.tf` and `outputs.tf` files at the end. However, if you're just following along and want something that works, you can use these snippets directly with a `.tfvars` file, which I'll post an example of as well.
Now, if you used all these snippets (and the ones in the appendix), then you should be able to run `terraform apply` to get a working GKE cluster up. You can then run `kubectl run -it --image ubuntu:20.04 bash` to get a terminal into a pod in that cluster.
Create a Container Registry to Deploy "Through"
If you're part of an organization creating its own software, then you'll also want a private container registry to deploy to. You could push to Docker Hub, or to a manually-deployed container registry, but let's be honest, you're here for that sweet infrastructure as code! (Strictly speaking, the code below creates an Artifact Registry repository, the successor to Google Container Registry, but the idea is the same.)
```hcl
# gcr.tf
# Creating the actual registry is pretty easy
resource "google_artifact_registry_repository" "registry" {
  repository_id = var.gcr_name
  location      = var.region
  format        = "DOCKER"
}

# Give write access to the developers and any tools that deploy on their behalf.
# This uses the outputs from the previous resource to avoid duplicating things.
resource "google_artifact_registry_repository_iam_binding" "registry-binding-writers" {
  project    = google_artifact_registry_repository.registry.project
  location   = google_artifact_registry_repository.registry.location
  repository = google_artifact_registry_repository.registry.name
  role       = "roles/artifactregistry.writer"
  members    = var.gcr_writers
}

# Give read access to the cluster itself
resource "google_artifact_registry_repository_iam_binding" "registry-binding-readers" {
  project    = google_artifact_registry_repository.registry.project
  location   = google_artifact_registry_repository.registry.location
  repository = google_artifact_registry_repository.registry.name
  role       = "roles/artifactregistry.reader"
  members    = ["serviceAccount:${module.gke.service_account}"]
}
```
For most of these things, I wouldn't talk about their outputs, but this one is special. The default outputs don't include anything that lets you just `docker push` to it. Instead, you have to do a lot of string concatenation, so here's the output definition:
```hcl
# This shortens things, but it's still super long...
locals {
  garrr = google_artifact_registry_repository.registry
}

output "container_registry" {
  value = "${local.garrr.location}-docker.pkg.dev/${local.garrr.project}/${local.garrr.repository_id}"
}
```
With this, you get something like `europe-west1-docker.pkg.dev/project-1234/repo-name` that you can append an image name/tag onto:

```
docker push europe-west1-docker.pkg.dev/project-1234/repo-name/image:version
```
Use our new cluster and container registry
I like to test things out with as few moving parts as possible, so I would start with the `kubectl` command from above:
```
$ export KUBECONFIG=kubeconfig_example-gke.yaml
$ kubectl run -it --image ubuntu:20.04 bash
root@bash:/#
^D
```
Appendix: Extra files to make this actually work
```hcl
# variables.tf
variable "project_id" {
  type        = string
  description = "The GCP project ID"
}

variable "region" {
  type        = string
  description = "The GCP region"
}

variable "cluster_name" {
  type        = string
  description = "Name for the GKE cluster"
}

# The VPC and GKE snippets above reference local.cluster_name, so define it here
locals {
  cluster_name = var.cluster_name
}

variable "gcr_name" {
  type        = string
  description = "Name of the container registry repository"
}

variable "gcr_writers" {
  type        = list(string)
  description = "List of members allowed to push images to the registry"
}
```
```hcl
# outputs.tf
# Note: the "kubeconfig" and "container_registry" outputs (and the locals)
# shown inline above live in this file; don't define them twice.
output "cluster_endpoint" {
  description = "Endpoint for GKE control plane."
  sensitive   = true
  value       = module.gke.endpoint
}

output "kubeconfig" {
  description = "kubectl config as generated by the module."
  sensitive   = true
  value       = module.gke_auth.kubeconfig_raw
}

locals {
  garrr = google_artifact_registry_repository.registry
}

output "container_registry" {
  value = "${local.garrr.location}-docker.pkg.dev/${local.garrr.project}/${local.garrr.repository_id}"
}
```
```hcl
# .tfvars
# Examples only, all of these should be set by you
project_id   = "..."
region       = "europe-west1"
cluster_name = "example-gke"
gcr_name     = "example-gke-gcr"
gcr_writers = [
  "serviceAccount:deployment-tool@project.iam.gserviceaccount.com",
]
```