Create GKE Cluster with Terraform/OpenTofu 0x01

Still a not-so-comprehensive guide to creating a GKE cluster with Terraform/OpenTofu.

Visit the previous post before proceeding with this one.

Repo for this post: g-ulap-demo.

Overview

This topic is quite big, and no single blog post will answer all your questions. I advise you to read the documentation and other blog posts as well; this post would be too long if I explained every detail.

As a curious and sometimes reckless engineer (don’t be reckless when using a company account or in production, hahaha), I suggest you deploy first and then fix whatever errors pop up.

We’ll cover creating both Standard GKE and Autopilot GKE. I won’t be able to show some of the code blocks here, such as variables.tf, so make sure to look at the repo for reference.

I also structured the code to be modular, so if you are not yet familiar with Terraform modules, I suggest you look them up too.

Project Structure

├── env
│   ├── dev
│   │   ├── main.tf
│   │   ├── provider.tf
│   │   ├── terraform.tfvars
│   │   └── variables.tf
│   └── prod
│       ├── main.tf
│       ├── provider.tf
│       ├── terraform.tfvars
│       └── variables.tf
└── modules
    ├── addons
    │   ├── cert-manager
    │   │   └── main.tf
    │   └── ingress-nginx
    │       └── main.tf
    ├── gke
    │   ├── autopilot
    │   │   ├── main.tf
    │   │   ├── outputs.tf
    │   │   └── variables.tf
    │   └── standard
    │       ├── main.tf
    │       ├── outputs.tf
    │       └── variables.tf
    ├── iam
    │   ├── api
    │   │   ├── main.tf
    │   │   └── variables.tf
    │   └── service-account
    │       ├── main.tf
    │       ├── outputs.tf
    │       └── variables.tf
    ├── network
    │   ├── firewall
    │   │   ├── main.tf
    │   │   └── variables.tf
    │   ├── main.tf
    │   ├── output.tf
    │   └── variables.tf
    ├── nodepool
    │   ├── main.tf
    │   └── variables.tf
    └── storage
        ├── main.tf
        └── variables.tf

Services & Roles

First, enable the GKE API in your project and grant the Terraform service account the roles it needs. We are using the same service account as in the previous post.

gcloud services enable container.googleapis.com --project=<project_id>

gcloud projects add-iam-policy-binding <project_id> \
  --member="serviceAccount:tofu-sa@<project_id>.iam.gserviceaccount.com" \
  --role="roles/serviceusage.serviceUsageAdmin"

gcloud projects add-iam-policy-binding <project_id> \
  --member="serviceAccount:tofu-sa@<project_id>.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountAdmin"

IAM Module

api

Enables required Google Cloud APIs.

API               Purpose
compute           Required for networking and VM instances
container         Core GKE service
logging           Collects cluster logs
monitoring        Provides metrics and observability
secretmanager     Secure storage for secrets
artifactregistry  Stores container images

modules/iam/api/main.tf

resource "google_project_service" "api" {
  for_each = toset([
    "compute.googleapis.com",
    "container.googleapis.com",
    "logging.googleapis.com",
    "secretmanager.googleapis.com",
    "monitoring.googleapis.com",
    "artifactregistry.googleapis.com"
  ])

  project            = var.project_id
  service            = each.key
  disable_on_destroy = false
}
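
The module above reads var.project_id, which is declared in the module's variables.tf (not shown in this post). A minimal sketch of what that file might contain is below; only the variable name comes from main.tf, the description is my assumption, so check the repo for the actual definition.

```hcl
# Hypothetical modules/iam/api/variables.tf.
# Only the name "project_id" is taken from main.tf above.
variable "project_id" {
  description = "ID of the GCP project in which to enable the APIs"
  type        = string
}
```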

Service Account

Creates a service account for the node pool with logging and metrics roles.

  • logging.logWriter - Allows nodes to send logs
  • monitoring.metricWriter - Allows nodes to send metrics

modules/iam/service-account/main.tf

resource "google_service_account" "node" {
  account_id = var.node_sa
}

resource "google_project_iam_member" "gke_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.node.email}"
}

resource "google_project_iam_member" "gke_metrics" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.node.email}"
}
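
If you later need more roles for the node service account, the two near-identical bindings could be collapsed into a single resource with for_each. This is only a sketch of an alternative, not what the repo does, and the resource name node_roles is made up for illustration. Keep in mind that switching an existing deployment to this form changes the resource addresses, so the bindings would be recreated unless you move them in state.

```hcl
# Alternative sketch: one binding resource driven by a set of roles.
# "node_roles" is a hypothetical name; the roles are the two used above.
resource "google_project_iam_member" "node_roles" {
  for_each = toset([
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.node.email}"
}
```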

Export the service account email so that the node pool module can use it.

modules/iam/service-account/outputs.tf

output "node_sa_email" {
  value = google_service_account.node.email
}
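
To show where this output ends up, here is a rough sketch of how an environment's main.tf might pass it to the node pool module. The module input names match the variables used in the modules of this post; the module labels, the root-level var.* names, and module.gke.cluster_id are assumptions for illustration, so compare with env/dev/main.tf in the repo.

```hcl
# Hypothetical excerpt of env/dev/main.tf.
module "service_account" {
  source = "../../modules/iam/service-account"

  project_id = var.project_id
  node_sa    = var.node_sa # account_id for the node service account
}

module "nodepool" {
  source = "../../modules/nodepool"

  # Email exported by modules/iam/service-account/outputs.tf
  node_sa = module.service_account.node_sa_email

  # "cluster_id" is an assumed output name of the Standard GKE module.
  cluster_id   = module.gke.cluster_id
  region       = var.region
  min_node     = var.min_node
  max_node     = var.max_node
  machine_type = var.machine_type
  disk_size_gb = var.disk_size_gb
  disk_type    = var.disk_type
}
```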

Network Module

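Skip this module if you are okay with your nodes and cluster being exposed to the public internet. In that case you can rely on the default network that GCP creates with auto-generated subnets.

We’ll create a private network and a Cloud NAT so that the private nodes get outbound internet access (pulling container images, calling APIs, updates). Cloud NAT only covers outbound traffic; inbound traffic for Ingress reaches the cluster through a load balancer.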

VPC

modules/network/main.tf

resource "google_compute_network" "vpc" {
  name                            = "gke-network"
  routing_mode                    = "REGIONAL"
  auto_create_subnetworks         = false
  delete_default_routes_on_create = true
}

# -----------------------------
# Subnet (with secondary ranges)
# -----------------------------
resource "google_compute_subnetwork" "private" {
  name                     = "private"
  ip_cidr_range            = "10.0.32.0/19"
  region                   = var.region
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "k8s-pods"
    ip_cidr_range = "172.16.0.0/14"
  }

  secondary_ip_range {
    range_name    = "k8s-services"
    ip_cidr_range = "172.20.0.0/18"
  }
}

# -----------------------------
# Cloud Router
# -----------------------------
resource "google_compute_router" "router" {
  name    = "router"
  region  = var.region
  network = google_compute_network.vpc.id
}

# -----------------------------
# Cloud NAT
# -----------------------------
resource "google_compute_router_nat" "nat" {
  name   = "nat"
  region = var.region
  router = google_compute_router.router.name

  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.private.self_link
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}

# -----------------------------
# Firewall
# -----------------------------
module "firewall" {
  source = "./firewall"

  network = google_compute_network.vpc.name
}

modules/network/output.tf

output "network" {
  value = google_compute_network.vpc.id
}

output "subnetwork" {
  value = google_compute_subnetwork.private.id
}

output "network_name" {
  value = google_compute_network.vpc.name
}

Firewall

modules/network/firewall/main.tf

# -----------------------------
# Default route (required!)
# The VPC is created with delete_default_routes_on_create = true,
# so the route to the internet has to be added back explicitly.
# -----------------------------
resource "google_compute_route" "internet" {
  name             = "egress-internet"
  network          = var.network
  dest_range       = "0.0.0.0/0"
  next_hop_gateway = "default-internet-gateway"
}

# -----------------------------
# Firewall: Internal traffic
# -----------------------------
resource "google_compute_firewall" "allow_internal" {
  name    = "gke-allow-internal"
  network = var.network

  direction = "INGRESS"

  source_ranges = [
    "10.0.0.0/8",    # covers the node subnet (10.0.32.0/19)
    "172.16.0.0/14", # pods
    "172.20.0.0/18"  # services
  ]

  allow {
    protocol = "all"
  }
}

# -----------------------------
# Firewall: Egress (internet access)
# -----------------------------
resource "google_compute_firewall" "allow_egress" {
  name    = "gke-allow-egress"
  network = var.network

  direction = "EGRESS"

  destination_ranges = ["0.0.0.0/0"]

  allow {
    protocol = "all"
  }
}

# -----------------------------
# Firewall: Control plane → nodes
# -----------------------------
resource "google_compute_firewall" "allow_master_to_nodes" {
  name    = "gke-master-to-nodes"
  network = var.network

  direction = "INGRESS"

  # Must match the control plane CIDR configured on the cluster.
  source_ranges = ["192.168.0.0/28"]

  allow {
    protocol = "tcp"
    ports    = ["443", "10250"]
  }
}
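
The CIDRs in these rules only make sense if the cluster actually uses them: the pod and service ranges come from the subnet's secondary ranges, and 192.168.0.0/28 has to be the control plane range. As a rough sketch of how a private Standard cluster might reference those values (the resource label and var.cluster_name, var.region, var.network, var.subnetwork are hypothetical names; the real cluster module lives in the repo):

```hcl
# Sketch only: shows where the network values from this section are consumed.
resource "google_container_cluster" "this" {
  name     = var.cluster_name
  location = var.region

  network    = var.network    # e.g. the network module's "network" output
  subnetwork = var.subnetwork # e.g. the network module's "subnetwork" output

  # Node pools are managed by a separate module, so drop the default one.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Names of the secondary ranges defined on the "private" subnet.
  ip_allocation_policy {
    cluster_secondary_range_name  = "k8s-pods"
    services_secondary_range_name = "k8s-services"
  }

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    # Same range as the source of the gke-master-to-nodes firewall rule.
    master_ipv4_cidr_block = "192.168.0.0/28"
  }
}
```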

Node Pool Module

This is where your workloads actually run. Change the configuration based on your app requirements; this can get costly if not managed correctly.

Disable this module if you are planning to use Autopilot.

In this module, we define the compute layer of our GKE setup:

  • configured a scalable node pool
  • enabled automatic repair and upgrades
  • defined machine and disk configurations
  • connected nodes to a secure service account

modules/nodepool/main.tf

resource "google_container_node_pool" "gke-node" {
  name     = "gke-node"
  cluster  = var.cluster_id
  location = var.region

  autoscaling {
    total_min_node_count = var.min_node
    total_max_node_count = var.max_node
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    disk_size_gb = var.disk_size_gb
    machine_type = var.machine_type
    disk_type    = var.disk_type

    # labels = {
    #   role = "worker"
    # }

    # taint {
    #   key    = "instance_type"
    #   value  = "spot"
    #   effect = "NO_SCHEDULE"
    # }

    service_account = var.node_sa
    oauth_scopes = [
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}
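
The node pool above depends on eight input variables. Their declarations live in modules/nodepool/variables.tf, which is not shown here; the sketch below is my guess at a minimal version. The variable names are taken from main.tf, while the types and descriptions are assumptions, so treat the repo as the source of truth.

```hcl
# Hypothetical modules/nodepool/variables.tf (names from main.tf, rest assumed).
variable "cluster_id" {
  description = "ID of the GKE cluster this node pool attaches to"
  type        = string
}

variable "region" {
  description = "Location of the cluster"
  type        = string
}

variable "min_node" {
  description = "Minimum total number of nodes across all zones"
  type        = number
}

variable "max_node" {
  description = "Maximum total number of nodes across all zones"
  type        = number
}

variable "machine_type" {
  description = "Compute Engine machine type for the nodes"
  type        = string
}

variable "disk_size_gb" {
  description = "Boot disk size per node, in GB"
  type        = number
}

variable "disk_type" {
  description = "Boot disk type for the nodes"
  type        = string
}

variable "node_sa" {
  description = "Email of the service account the nodes run as"
  type        = string
}
```

Concrete values for these would then go into each environment's terraform.tfvars.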