Module 09 — Bootstrap the etcd Cluster
etcd is a distributed key-value store that holds the entire Kubernetes cluster state — every pod, service, deployment, secret, and config map. When you run kubectl get pods, the API server reads from etcd. When you create a deployment, the API server writes to etcd. If etcd is down, the API server cannot function.
In this module you install etcd on both control plane nodes (cp1, cp2), configure TLS for both peer communication (etcd-to-etcd) and client access (API server-to-etcd), and verify the cluster is healthy.
1. Why a Cluster, Not a Single Node
A single etcd instance is a single point of failure. If it dies, the entire Kubernetes cluster becomes read-only (or worse, unresponsive). Clustering provides:
- Replication — data is replicated across members
- Consensus — writes require agreement from a quorum (majority of members)
- Availability — the cluster continues serving if a minority of members fail
Quorum and the 2-node trade-off
| Cluster size | Quorum | Tolerates failures |
|---|---|---|
| 1 | 1 | 0 (single point of failure) |
| 2 | 2 | 0 (both must be up for writes) |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
Your 2-node cluster requires both members to be up for writes. If one node goes down, the cluster loses quorum and rejects writes until the member returns. This is acceptable for training — you learn the clustering mechanics — but production clusters use 3 or 5 members.
Why not 3 in this training? Running 3 control plane VMs would push the total to 6 VMs and ~12 GB RAM. Two control planes teach the same concepts with less resource overhead.
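The quorum column in the table follows the formula floor(n/2) + 1, and failure tolerance is n minus quorum. A quick shell sketch (the `quorum` function name is just for illustration) reproduces the table:

```shell
#!/usr/bin/env bash
# Quorum for an n-member etcd cluster: floor(n/2) + 1.
# Failure tolerance: n - quorum. Note that going from 3 to 4 members
# raises quorum to 3 but tolerance stays at 1 — hence odd sizes.
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 1 2 3 5; do
  q=$(quorum "$n")
  echo "size=$n quorum=$q tolerates=$((n - q))"
done
```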
2. Download and Install etcd
Run these steps on both cp1 and cp2. SSH into each node:
ssh cp1
2.1 Download etcd
ETCD_VER=v3.5.16
curl -sL "https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz" -o /tmp/etcd.tar.gz
2.2 Extract and install
tar -xzf /tmp/etcd.tar.gz -C /tmp/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcd /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdctl /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdutl /usr/local/bin/
rm -rf /tmp/etcd*
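The download step above does not verify integrity. Each etcd release page publishes a SHA256SUMS file alongside the tarballs; the sketch below shows the `sha256sum -c` workflow against a locally generated file, since the real checksum for v3.5.16 is not reproduced here:

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"

# Stand-in for the downloaded tarball.
echo "pretend-tarball-contents" > etcd.tar.gz

# Generate a checksum file the way a release page would publish one...
sha256sum etcd.tar.gz > SHA256SUMS

# ...then verify the download against it before extracting.
sha256sum -c SHA256SUMS   # prints: etcd.tar.gz: OK
```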
2.3 Verify
etcd --version
etcdctl version
Expected:
etcd Version: 3.5.16
...
etcdctl version: 3.5.16
Repeat these steps on cp2 before continuing.
Checkpoint: etcd --version returns 3.5.16 on both cp1 and cp2.
3. Prepare the Configuration
Run these steps on both cp1 and cp2.
3.1 Create directories
sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chmod 700 /var/lib/etcd
- /etc/etcd — TLS certificates and configuration
- /var/lib/etcd — data directory (where etcd stores its database). The restrictive permissions prevent other users from reading cluster data.
3.2 Copy certificates
The certificates were distributed to ~/ in Module 07. Copy them to /etc/etcd:
sudo cp ~/ca.pem ~/etcd.pem ~/etcd-key.pem /etc/etcd/
3.3 Set environment variables
Each node needs to know its own name and IP. These variables are used by the systemd unit file.
On cp1:
ETCD_NAME=cp1
INTERNAL_IP=192.168.56.21
On cp2:
ETCD_NAME=cp2
INTERNAL_IP=192.168.56.22
Checkpoint: Certificates exist in /etc/etcd/ on both nodes.
4. Create the systemd Unit File
This is the most flag-heavy configuration in the entire track. Each flag has a specific purpose — read through the explanations before creating the file.
Run on each node (with the correct ETCD_NAME and INTERNAL_IP variables set from Section 3.3):
cat <<EOF | sudo tee /etc/systemd/system/etcd.service
[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \\
--name ${ETCD_NAME} \\
--cert-file=/etc/etcd/etcd.pem \\
--key-file=/etc/etcd/etcd-key.pem \\
--peer-cert-file=/etc/etcd/etcd.pem \\
--peer-key-file=/etc/etcd/etcd-key.pem \\
--trusted-ca-file=/etc/etcd/ca.pem \\
--peer-trusted-ca-file=/etc/etcd/ca.pem \\
--peer-client-cert-auth \\
--client-cert-auth \\
--initial-advertise-peer-urls https://${INTERNAL_IP}:2380 \\
--listen-peer-urls https://${INTERNAL_IP}:2380 \\
--listen-client-urls https://${INTERNAL_IP}:2379,https://127.0.0.1:2379 \\
--advertise-client-urls https://${INTERNAL_IP}:2379 \\
--initial-cluster-token etcd-cluster-0 \\
--initial-cluster cp1=https://192.168.56.21:2380,cp2=https://192.168.56.22:2380 \\
--initial-cluster-state new \\
--data-dir=/var/lib/etcd
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
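Note that the heredoc delimiter EOF above is unquoted, so your shell expands ${ETCD_NAME} and ${INTERNAL_IP} as the file is written — the unit file ends up with the literal values, which is what systemd needs. A minimal demonstration of the difference:

```shell
#!/usr/bin/env bash
ETCD_NAME=cp1

# Unquoted delimiter: the shell expands variables while writing the file.
cat <<EOF > /tmp/expanded.txt
--name ${ETCD_NAME}
EOF

# Quoted delimiter: the text is written verbatim, variables untouched.
cat <<'EOF' > /tmp/literal.txt
--name ${ETCD_NAME}
EOF

cat /tmp/expanded.txt   # --name cp1
cat /tmp/literal.txt    # --name ${ETCD_NAME}
```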
What each flag does
TLS flags:
| Flag | Purpose |
|---|---|
| --cert-file | Server certificate for client connections |
| --key-file | Private key for the server certificate |
| --peer-cert-file | Certificate for peer-to-peer connections |
| --peer-key-file | Private key for the peer certificate |
| --trusted-ca-file | CA certificate to verify client certificates |
| --peer-trusted-ca-file | CA certificate to verify peer certificates |
| --client-cert-auth | Require clients to present a valid certificate |
| --peer-client-cert-auth | Require peers to present a valid certificate |
Clustering flags:
| Flag | Purpose |
|---|---|
| --name | This member's name (must be unique in the cluster) |
| --initial-advertise-peer-urls | URL this member advertises to peers |
| --listen-peer-urls | Address to listen on for peer connections (port 2380) |
| --listen-client-urls | Addresses to listen on for client connections (port 2379) |
| --advertise-client-urls | URL clients should use to connect to this member |
| --initial-cluster | All members and their peer URLs (bootstrap config) |
| --initial-cluster-token | Shared token to prevent cross-cluster connections |
| --initial-cluster-state | new for first-time bootstrap |
| --data-dir | Where to store the etcd database on disk |
Port assignments
| Port | Protocol | Used for |
|---|---|---|
| 2379 | HTTPS | Client connections (API server → etcd) |
| 2380 | HTTPS | Peer connections (etcd node → etcd node) |
Checkpoint: /etc/systemd/system/etcd.service exists on both cp1 and cp2.
5. Start the etcd Cluster
Both members must start in close succession because the cluster waits for quorum during initial bootstrap.
5.1 Start etcd on both nodes
On cp1:
sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
On cp2 (immediately after):
sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
The first node to start will wait for the second member to join. Once both are up, the cluster forms and elects a leader.
5.2 Check the service status
On each node:
sudo systemctl status etcd
Expected: Active: active (running).
If the service is not running, check the logs:
sudo journalctl -u etcd --no-pager -l | tail -30
Common issues at this stage:
- Certificate file not found → verify /etc/etcd/*.pem exists
- Connection refused on peer port → the other node is not started yet
- Bind address in use → another process is on port 2379 or 2380
Checkpoint: sudo systemctl status etcd shows active (running) on both cp1 and cp2.
6. Verify Cluster Health
6.1 Member list
From either cp1 or cp2:
sudo ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table
Expected output:
+------------------+---------+------+----------------------------+----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------+----------------------------+----------------------------+------------+
| xxxxxxxxxxxx | started | cp1 | https://192.168.56.21:2380 | https://192.168.56.21:2379 | false |
| yyyyyyyyyyyy | started | cp2 | https://192.168.56.22:2380 | https://192.168.56.22:2379 | false |
+------------------+---------+------+----------------------------+----------------------------+------------+
Both members show started status.
6.2 Endpoint health
sudo ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://192.168.56.21:2379,https://192.168.56.22:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table
Expected output:
+----------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+----------------------------+--------+-------------+-------+
| https://192.168.56.21:2379 | true | 10.123ms | |
| https://192.168.56.22:2379 | true | 11.456ms | |
+----------------------------+--------+-------------+-------+
Both endpoints are healthy.
6.3 Endpoint status (shows leader)
sudo ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://192.168.56.21:2379,https://192.168.56.22:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table
One of the members will show true in the IS LEADER column. The other shows false. The leader handles all write operations and replicates them to the follower.
Checkpoint: etcdctl member list shows 2 members with started status. etcdctl endpoint health shows both healthy.
7. Create a Helper Alias
The etcdctl commands are verbose because of the TLS flags. Create an alias to simplify:
On both cp1 and cp2:
cat >> ~/.bashrc <<'EOF'
# etcdctl with TLS flags
alias etcdctl='sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem'
EOF
source ~/.bashrc
Now you can run:
etcdctl member list --write-out=table
Much cleaner.
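One caveat: bash expands aliases only in interactive shells, so the alias will not work inside scripts unless you enable expand_aliases. A shell function works in both contexts; a sketch of the function form (the name `etcdctl_tls` and the echoed body are illustrative — the real body would be the etcdctl command with the TLS flags above):

```shell
#!/usr/bin/env bash
# A function works in scripts and interactive shells alike, unlike an alias.
# The body here just echoes what would run, to keep the sketch self-contained.
etcdctl_tls() {
  echo "would run: etcdctl --endpoints=https://127.0.0.1:2379 $*"
}

etcdctl_tls member list
# prints: would run: etcdctl --endpoints=https://127.0.0.1:2379 member list
```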
8. Test Data Operations
etcd is a key-value store. Test it by writing and reading data:
8.1 Write a key
On cp1:
etcdctl put /test/greeting "Hello from etcd"
OK
8.2 Read the key
On cp2 (proves replication):
etcdctl get /test/greeting
/test/greeting
Hello from etcd
The data written on cp1 is immediately available on cp2.
8.3 List all keys under a prefix
etcdctl get /test --prefix --keys-only
/test/greeting
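Worth knowing: etcd keys are flat byte strings, and --prefix is a byte-wise prefix match, not directory traversal — so the prefix /test would also match a key like /testing/other. A shell sketch of the matching rule (the key names are made up):

```shell
#!/usr/bin/env bash
# etcd's --prefix flag matches any key whose bytes start with the prefix.
keys=(/test/greeting /test/answer /testing/other /prod/x)
prefix=/test

for k in "${keys[@]}"; do
  case "$k" in
    "$prefix"*) echo "match: $k" ;;   # /testing/other matches too!
  esac
done
```

This is why Kubernetes-style prefixes conventionally end with a separator (e.g. /test/) when you want only the "children".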
8.4 Delete the test key
etcdctl del /test/greeting
1
8.5 Preview what Kubernetes will store
When the API server connects, it will store data under the /registry/ prefix. After the cluster is running (Module 10+), you can inspect Kubernetes data directly:
# This will work after Module 10 — shown here for context
etcdctl get /registry --prefix --keys-only | head -20
You will see paths like /registry/pods/default/..., /registry/services/..., etc. — the entire cluster state.
Checkpoint: A key written on cp1 is readable on cp2. Data replication works.
9. Understand the Data Directory
etcd stores its database in /var/lib/etcd. Take a look:
sudo ls -la /var/lib/etcd/member/
drwx------ snap
drwx------ wal
| Directory | Purpose |
|---|---|
| wal/ | Write-ahead log — records every change before applying it. Used for crash recovery. |
| snap/ | Snapshots — periodic full copies of the database. Used for faster recovery and new member bootstrapping. |
Important: Never modify files in /var/lib/etcd directly. Always use etcdctl or the etcd API.
Backup
To back up the etcd database:
etcdctl snapshot save /tmp/etcd-backup.db
Snapshot saved at /tmp/etcd-backup.db
Verify the backup:
etcdctl snapshot status /tmp/etcd-backup.db --write-out=table
This shows the snapshot revision, total key count, and size. In a production cluster, you would run this backup on a cron schedule and store the snapshots off-node.
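As a sketch of what that schedule might look like (the script path, backup directory, and retention below are illustrative assumptions, not part of this track):

```shell
# /etc/cron.d/etcd-backup — illustrative sketch only.
# cron entries cannot span lines, so the etcdctl flags live in a
# hypothetical wrapper script, /usr/local/bin/etcd-backup.sh, which would run:
#   ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
#     --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem \
#     snapshot save /var/backups/etcd-$(date +%F).db

0 2 * * * root /usr/local/bin/etcd-backup.sh
5 2 * * * root find /var/backups -name 'etcd-*.db' -mtime +7 -delete
```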
10. Troubleshooting
context deadline exceeded when starting etcd
The other member is not reachable on port 2380. Check:
- The other node is started: ssh cp2 "sudo systemctl status etcd"
- Network connectivity: ping 192.168.56.22
- Port is open: ssh cp2 "ss -tlnp | grep 2380"
Start both nodes within a few seconds of each other. etcd waits up to ~30 seconds for the cluster to form.
x509: certificate is valid for X, not Y
The etcd certificate's SANs do not include the IP or hostname being used. Verify with:
openssl x509 -in /etc/etcd/etcd.pem -text -noout | grep -A 1 "Subject Alternative"
The SANs must include 192.168.56.21, 192.168.56.22, cp1, cp2, and 127.0.0.1. If any are missing, regenerate the etcd certificate in Module 07.
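If you want to see what a correct SAN list looks like without touching the real certificate, you can generate a throwaway self-signed cert carrying the expected SANs and inspect it the same way (assumes OpenSSL 1.1.1+ for the -addext flag):

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"

# Throwaway self-signed cert with the SANs this module expects.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=etcd" \
  -addext "subjectAltName=IP:192.168.56.21,IP:192.168.56.22,IP:127.0.0.1,DNS:cp1,DNS:cp2" \
  -keyout test-key.pem -out test.pem 2>/dev/null

# Same inspection command as above, against the throwaway cert.
openssl x509 -in test.pem -text -noout | grep -A 1 "Subject Alternative"
```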
permission denied on data directory
The data directory must be owned by root with 700 permissions:
sudo chown -R root:root /var/lib/etcd
sudo chmod 700 /var/lib/etcd
member already bootstrapped
etcd refuses to start because the data directory has data from a previous bootstrap attempt. If you need to start fresh:
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*
sudo systemctl start etcd
Do this on both nodes simultaneously.
etcdctl returns Error: context deadline exceeded
The TLS flags are wrong or missing. Verify you are using --cacert, --cert, and --key with the correct file paths, or use the alias from Section 7.
11. What You Have Now
| Capability | Verification Command |
|---|---|
| etcd installed on cp1 and cp2 | ssh cp1 "etcd --version" |
| 2-member cluster formed | etcdctl member list --write-out=table — 2 members |
| Peer TLS encryption | Peer URLs use https:// |
| Client TLS encryption | Client URLs use https:// |
| Client certificate auth required | --client-cert-auth flag in unit file |
| Data replication across members | Write on cp1, read on cp2 |
| Cluster healthy | etcdctl endpoint health — both true |
| Backup capability | etcdctl snapshot save /tmp/etcd-backup.db |
The data store is running and healthy. The API server (Module 10) will connect to etcd using the certificates from Module 07 and store all Kubernetes cluster state here.
Next up: Module 10 — Bootstrap the Control Plane — install kube-apiserver, kube-controller-manager, and kube-scheduler on both control plane nodes.