CFL Summit

# CFL Summit | Rancher Lab

Please join slack channel: #cfl-rancher-lab-oct-2024

---

## Derek Keightley

## &

## Ryan Elliott-Smith

???

No sheep

---

## What are we doing?

- Quick presentation to cover some relevant topics (~15mins)

- Lab

## What are we working with?

In the lab environment the following has been pre-created:
- Rancher

- RKE2 downstream clusters

- Some deployments to troubleshoot

---

# Webhooks in Kubernetes

## What are they?

???

Ask the audience, does anyone want to answer?

Webhooks provide a way to add logic into the Kubernetes API while processing certain requests

There are two points in the API workflow where webhooks can be integrated:

???

A basic definition of webhooks, more details in the following slides

---

### Validation (validating webhook)

Allows an approve/deny stage for API requests. Some common use cases:

- Ensuring standards - e.g. labels, resource limits

- Security tools

- Protection of important objects/CRDs

- Rejecting invalid specs

### Mutation (mutating webhook)

Allows changes to be injected into API objects. Some common use cases:

- Sidecar injection (e.g. Istio)

- Adding context - e.g. labels, annotations

- Removing invalid fields

---

![basic-view](webhook.jpeg)

---

![detailed-view](webhook-detailed.jpeg)

More info on webhooks

https://book-v1.book.kubebuilder.io/beyond_basics/what_is_a_webhook

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/

---

## Webhook best practices

- Exempt system namespaces (kube-system, cattle-system, etc.), ideally with both of the below:
  - (**recommended**) A `namespaceSelector` in the webhook configuration (`ValidatingWebhookConfiguration` / `MutatingWebhookConfiguration`)
  - Depending on the webhook - labels or flags

- Fail open - `failurePolicy: Ignore`

- Multiple webhook pods for HA

- Avoid deploying unnecessary webhooks - they add latency

- Ensure webhooks and their configuration objects are cleaned up when related apps are deleted

???

When a webhook is offline, API calls within its scope can be blocked

This can be problematic during cluster upgrades, rollbacks, restores or node replacement

- Adding an exemption to the webhook configuration object prevents kube-apiserver from making the request to the webhook at all
- Adding labels/flags only instructs the webhook not to apply its logic to those requests - if the webhook is down, this doesn't help
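
A minimal sketch of the recommended exemption, assuming a validating webhook that checks pods. Every name here (`example-policy-webhook`, `validate.example.com`, `example-webhook`, `example-system`, `/validate`) is hypothetical; only the field names come from the `admissionregistration.k8s.io/v1` API. It relies on the `kubernetes.io/metadata.name` label that current Kubernetes versions set on every namespace.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook        # hypothetical name
webhooks:
  - name: validate.example.com        # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore             # fail open if the webhook is unreachable
    timeoutSeconds: 5                 # keep added latency bounded
    namespaceSelector:                # kube-apiserver skips these namespaces entirely
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "cattle-system"]
    clientConfig:
      service:
        name: example-webhook         # hypothetical Service
        namespace: example-system     # hypothetical namespace
        path: /validate               # hypothetical path
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
```

Because the selector is evaluated by kube-apiserver, the exempted namespaces keep working even when every webhook pod is down.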

---

# Common network issues

![switch-poe](switch-poe.jpg)

---

## Wait, first let's step back a bit..

---

### TCP - 3 way handshake

![tcphandshake](tcp-handshake.png)

???

Cover the 3-way handshake (SYN, SYN-ACK, ACK)
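- Destination starts an app which binds to a port and listens (a TCB, transmission control block, is allocated)
- Source creates its own TCB, assigns a source port, and sends a SYN packet to the destination IP and port
- Destination replies with SYN-ACK, and the source completes the handshake with ACK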

Come back to this slide if it's useful to reference in the following slides

---

![tcp-session](tcp-full-session-cropped.png)

???

- The first (green) handshake section shows what we covered in the previous slide; the session starts once the handshake completes
- The session is where the data transmission happens - this is what matters
- The closure of the session - a nice, polite way to finish
- Not all TCP sessions close this way; they often end abruptly

---

# Common network issues

???

Ask the audience about their understanding of what these messages mean

---

## connection refused

```bash
# kubectl describe pod -n kube-system rke2-canal-zoidberg

E0924 22:18:04.357762   4491 memcache.go:265] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
```

```bash
zapp.brannig.an rke2[2783338]: {"level":"warn","ts":"2024-09-27T11:57:31.335237-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xcd34db33ff/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
```

???

Can be a few reasons:
- (most common) Nothing is listening on the destination port - the destination OS received the connection request and actively rejected it (with a TCP RST)
  - This can be temporary, for example if ingress-nginx is restarting, the port will not be bound for a short period
- The destination kernel refuses connections due to a backlog of queued connections
- A firewall rule with a `REJECT` rather than `DROP`
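
A refused connection fails immediately, while a dropped one hangs until the client gives up. A minimal sketch for telling the two apart from the source host, assuming bash (for its `/dev/tcp` redirection) and coreutils `timeout` are available; `probe_tcp` is a hypothetical helper name, not an existing tool.

```bash
# probe_tcp HOST PORT - prints one of: connected | failed | timeout
probe_tcp() {
  local host="$1" port="$2" err rc=0
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
  # `timeout` kills the attempt after 2 seconds and exits with status 124.
  err=$(timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>&1) || rc=$?
  case "$rc" in
    0)   echo "connected" ;;   # handshake completed, something is listening
    124) echo "timeout" ;;     # no reply within 2s, e.g. a DROP rule or a broken path
    *)   echo "failed"         # failed immediately, e.g. connection refused or no route to host
         echo "$err" >&2 ;;    # the underlying error message goes to stderr
  esac
}
```

For example, `probe_tcp 127.0.0.1 6443` on a server node shows whether kube-apiserver has its port bound yet.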

---

## i/o timeout

```bash
# curl localhost:8080
curl: (28) Connection timed out after 2005 milliseconds
```

```bash
[ERROR] plugin/errors: 2 3994503566595593402.4565890997905689978. HINFO: read udp 10.42.2.188:45439->10.17.130.43:53: i/o timeout
```

```bash
E0912 19:08:00.809037       1 run.go:74] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 127.0.0.1:6443: i/o timeout"
```

```bash
2024-07-21T14:36:07.543416281Z E0721 14:36:07.543243       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

???

Can be a few reasons:
- No SYN-ACK response from the destination host; more context about the connectivity path is needed to understand the cause
- Firewall rule with `DROP`: the SYN packet is silently discarded, so no SYN-ACK is returned to the source
- Destination host is under load and the app doesn't reply within the timeout period
- The last error isn't necessarily a TCP timeout: the HTTP request did not get a response from kube-apiserver within the 5s client timeout

---

## connection reset by peer

```bash
philip-j-fry.com rancher-system-agent[14615]: time="2024-09-27T20:07:43-05:00" level=fatal msg="[K8s] encountered an error while attempting to update the secret: Put \"https://leela.bender.com/api/v1/namespaces/fleet-default/secrets/custom-8b78ea0e6d6d-machine-plan\": read tcp 10.47.248.198:59390->10.47.130.35:443: read: connection reset by peer"
```

```bash
ERROR: https://prof-farmsworth.edu/ping is not accessible (Recv failure: Connection reset by peer)
```

```bash
2024/07/26 14:46:51 [error] 29#29: *283832 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.224.253, server: hermes-conrad.jm, request: "GET /apis/snapshot.storage.k8s.io/v1beta1?timeout=32s HTTP/1.1", upstream: "http://10.42.4.181:80/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s", host: "hermes-conrad.jm"
```

???

Equivalent to hanging up the phone on a caller, an abrupt closure of the TCP session - almost always by the destination but can be from the source as well
- A TCP packet was sent by the destination with the RST (reset) flag set, indicating a forced immediate closure
- More polite than not sending anything (a timeout), can give more context for troubleshooting

---

## no route to host

```bash
2024-06-25T18:57:02.136622425-05:00 stderr F W0625 23:57:02.136398       1 egress_controller.go:1001] Failed to start watch for EgressGroup: Get "https://10.43.131.109:443/apis/controlplane.antrea.io/v1beta2/egressgroups?fieldSelector=nodeName%3Dsomething-47024a6c-xdrfv&watch=true": dial tcp 10.43.131.109:443: connect: no route to host
```

```bash
[ERROR] plugin/errors: 2 3641072525830743004.8496191176616642290. HINFO: read udp 10.42.1.253:43929->8.8.8.8:53: read: no route to host  
```

```bash
2024/04/05 03:18:05 [error] 2681#2681: *2305048 connect() failed (113: No route to host) while connecting to upstream, client: 10.2.176.17, server: anchovies-on-pizza.it, request: "GET /hello HTTP/2.0", upstream: "http://10.42.96.250:1234/hello", host: "anchovies-on-pizza.it"
```

- Can sometimes be caused by firewalld REJECT rules, which use `reject-with icmp-host-prohibited`
```bash
# iptables -nvL
  [...]
pkts bytes   target           prot opt in     out     source               destination
458K   70M REJECT             all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
```

???

Uncommon, but it does happen. Some causes:
- A genuine issue with routes in the OS main route table or pod network sandbox
- A firewall rule with a REJECT type that misleads source clients; firewalld commonly adds rules with `--reject-with icmp-host-prohibited`
- REJECT is useful if you want to notify the client; DROP is better for hardening, as it avoids confirming that the destination exists and that the port may be in use
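
To check whether such a REJECT rule is actually being hit, the packet counters in `iptables -nvL` output can be filtered. A minimal sketch, assuming the column layout shown on the slide (`pkts` first, `target` third); `hit_host_prohibited` is a hypothetical helper name.

```bash
# Reads `iptables -nvL` output on stdin and prints REJECT rules that use
# icmp-host-prohibited and have a non-zero packet counter.
hit_host_prohibited() {
  awk '$3 == "REJECT" && /reject-with icmp-host-prohibited/ && $1 != "0"'
}
```

Usage: `iptables -nvL | hit_host_prohibited`. Running it before and after reproducing the error shows whether the counter grows, which points at the firewall rather than at routing.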

---

## dns failure

```bash
Oct 17 20:36:18 old-bessie-1 rke2[12378]: time="2022-10-17T20:36:18Z" level=warning msg="Failed to get image from endpoint: Get \"https://planet.express.com/v2/\": dial tcp: lookup planet.express.com: i/o timeout"
```

```bash
Post "http://api.prod.domain.local/admin": dial tcp: lookup api.prod.domain.local: no such host
```

```bash
Caused by: java.net.UnknownHostException:  foo.bar.com
	at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_211]
	at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211]
      [...]
```

???

- The first code block is interesting: golang logs can be a bit misleading. The key word here is `lookup`, which indicates the i/o timeout happened during the DNS lookup, not the connection itself. The hostname is also a clue: once DNS has resolved, the error reports the destination IP instead

Lots of potential causes:
- Try to triangulate: if the issue is affecting pods, determine whether it affects internal names, external names, or both
- Based on the above, focus on the key areas:
  - For external, checking coredns logs is often a useful first step, as is verifying resolution from another host on the network
  - For internal, query each coredns pod (endpoint) directly to rule out pod/overlay network issues
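
The `lookup` heuristic above can be turned into a rough first-pass triage of log lines. A minimal sketch; `classify_net_error` is a hypothetical helper, and the patterns are only the phrases seen in the examples on these slides, so expect false positives on other log formats.

```bash
# classify_net_error "LOG LINE" - prints dns | refused | reset | no-route | timeout | unknown
classify_net_error() {
  local line
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    # DNS is checked first: "lookup <name>: i/o timeout" is a DNS problem, not a TCP timeout
    *"lookup "*|*"no such host"*|*"unknownhostexception"*) echo "dns" ;;
    *"connection refused"*)       echo "refused" ;;
    *"connection reset by peer"*) echo "reset" ;;
    *"no route to host"*)         echo "no-route" ;;
    *"i/o timeout"*|*"timed out"*|*"timeout exceeded"*) echo "timeout" ;;
    *) echo "unknown" ;;
  esac
}
```

For example, `classify_net_error 'dial tcp: lookup planet.express.com: i/o timeout'` reports `dns`, while the same message with an IP and port in place of `lookup` reports `timeout`.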

---

# Time for the lab!

Please join slack channel: #cfl-rancher-lab-oct-2024