class: center, middle

# CFL Summit | Rancher Lab

Please join slack channel: #cfl-rancher-lab-oct-2024

---
class: center, middle

## Derek Keightley
## &
## Ryan Elliott-Smith

???

No sheep

---

## What are we doing?

--

- Quick presentation to cover some relevant topics (~15mins)
- Lab

--

## What are we working with?

In the lab environment the following has been pre-created:

- Rancher
- RKE2 downstream clusters
- Some deployments to troubleshoot

---

# Webhooks in Kubernetes

--

## What are they?

???

Ask the audience, does anyone want to answer?

--

Webhooks provide a way to add logic into the Kubernetes API while processing certain requests

There are two points in the API workflow where webhooks can be integrated:

???

A basic definition of webhooks, more details in the following slides

---

### Validation (validating webhook)

Adds an approve/deny stage for API requests, some common use cases:

- Ensuring standards - e.g. labels, resource limits
- Security tools
- Protection of important objects/CRDs
- Rejecting invalid spec

--

### Mutation (mutating webhook)

Allows changes to be injected into API objects, some common use cases:

- Sidecar injection (Istio, for example)
- Adding context - e.g. labels, annotations etc.
- Removing invalid fields

---
class: center, middle

![basic-view](webhook.jpeg)

---
class: center, middle

![detailed-view](webhook-detailed.jpeg)

More info on webhooks

https://book-v1.book.kubebuilder.io/beyond_basics/what_is_a_webhook

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/

---

## Webhook best practices

- Exempt system namespaces (kube-system, cattle-system, etc.), ideally with both of the below:
  - (**recommended**) Exemption in the webhook configuration object with a `namespaceSelector`
  - Depending on the webhook - labels or flags
- Fail open - `failurePolicy: Ignore`
- Multiple webhook pods for HA
- Avoid deploying unnecessary webhooks - they add latency
- Ensure webhooks and their configuration objects are cleaned up when related apps are deleted

???

When a webhook is offline, API calls within its scope can be blocked

This can be problematic during cluster upgrades, rollbacks, restores or node replacement

- Adding an exemption to the webhook configuration object will prevent kube-api from making the request to the webhook
- Adding labels/flags will only instruct the webhook to not apply its logic to those requests - if it's down this doesn't help

---
class: center, middle

# Common network issues

![switch-poe](switch-poe.jpg)

---
class: center, middle

## Wait, first let's step back a bit..

---
class: center, middle

### TCP - 3 way handshake

--

![tcphandshake](tcp-handshake.png)

???

Cover the 3-way handshake (SYN, SYN-ACK, ACK)

- Destination starts an app which binds to a port and listens (a TCB, transmission control block, is allocated)
- Source creates a TCB, assigns a source port, sends a SYN packet to the destination and port

Come back to this slide if it's useful to reference in the following slides

---
class: center, middle

![tcp-session](tcp-full-session-cropped.png)

???

- First green handshake section shows what we covered in the previous slide, the session is started after the handshake
- The session is where the data transmission happens, which is what matters
- The closure of the session is the nice, polite way to finish
- Not all TCP sessions close this way, they often end abruptly

---
class: center, middle

# Common network issues

???

Ask the audience about their understanding of what these messages mean

---

## connection refused

```bash
# kubectl describe pod -n kube-system rke2-canal-zoidberg
E0924 22:18:04.357762 4491 memcache.go:265] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
```
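
The most common cause of this error, a destination port with nothing bound to it, can be reproduced locally. A minimal Python sketch, assuming a Linux host with a loopback interface; the helper name `probe_closed_port` is hypothetical and used only for illustration:

```python
import socket


def probe_closed_port() -> str:
    """Connect to a local port that nothing is listening on."""
    # Ask the OS for a free port, then release it so the port is unbound again.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tmp:
        tmp.bind(("127.0.0.1", 0))
        port = tmp.getsockname()[1]

    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            return "connected"
    except ConnectionRefusedError:
        # The kernel answered the SYN with a RST because no socket is bound.
        return "connection refused"


print(probe_closed_port())
```

The failure comes back immediately rather than after a wait, which is one quick way to tell a refusal apart from a timeout.
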
```bash
zapp.brannig.an rke2[2783338]: {"level":"warn","ts":"2024-09-27T11:57:31.335237-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xcd34db33ff/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
```

???

Can be a few reasons:

- (most common) The destination port is not bound by the destination host
  - The destination OS received the inbound connection request and intentionally rejected it (with a RST)
  - This can be temporary, for example if ingress-nginx is restarting, the port will not be bound for a short period
- The destination kernel refuses connections due to a backlog of queued connections
- A firewall rule with a `REJECT` rather than `DROP`

---

## i/o timeout

```bash
# curl localhost:8080
curl: (28) Connection timed out after 2005 milliseconds
```
```bash
[ERROR] plugin/errors: 2 3994503566595593402.4565890997905689978. HINFO: read udp 10.42.2.188:45439->10.17.130.43:53: i/o timeout
```
```bash
E0912 19:08:00.809037 1 run.go:74] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 127.0.0.1:6443: i/o timeout"
```
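
A timeout can also occur after the TCP handshake succeeds, when the application never replies within the client's deadline. A minimal Python sketch of that case, assuming a Linux host; the listener here completes the handshake in the kernel but the application never accepts or answers. The function name `read_from_silent_server` and the 0.2s deadline are hypothetical choices for illustration:

```python
import socket


def read_from_silent_server(timeout_s: float = 0.2) -> str:
    """Connect to a listener that never replies, then wait for data."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        # The kernel completes the 3-way handshake for queued connections,
        # but this "app" never calls accept() or sends anything back.
        server.listen(1)
        with socket.create_connection(server.getsockname(), timeout=2) as client:
            client.settimeout(timeout_s)
            try:
                client.recv(1)
                return "got data"
            except socket.timeout:
                # Nothing arrived before the client-side deadline expired.
                return "timed out"


print(read_from_silent_server())
```

Unlike a refusal, the client only learns about the problem once its own deadline expires, so the error says little about where on the path the silence originates.
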
```bash
2024-07-21T14:36:07.543416281Z E0721 14:36:07.543243 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

???

Can be a few reasons:

- No SYN-ACK response from the destination host, need more context about the connectivity path to understand the cause
- Firewall rule with `DROP`, the source SYN packet is silently discarded so no SYN-ACK is returned
- Destination host is under load, app doesn't reply in the timeout period
- Last error isn't necessarily a TCP timeout, the HTTP request did not get a response from kube-api in the 5s timeout

---

## connection reset by peer

```bash
philip-j-fry.com rancher-system-agent[14615]: time="2024-09-27T20:07:43-05:00" level=fatal msg="[K8s] encountered an error while attempting to update the secret: Put \"https://leela.bender.com/api/v1/namespaces/fleet-default/secrets/custom-8b78ea0e6d6d-machine-plan\": read tcp 10.47.248.198:59390->10.47.130.35:443: read: connection reset by peer"
```
```bash
ERROR: https://prof-farmsworth.edu/ping is not accessible (Recv failure: Connection reset by peer)
```
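
A reset can be reproduced locally by forcing one side to close with a RST instead of the normal FIN exchange. A minimal Python sketch, assuming a Linux host; it uses the `SO_LINGER` socket option with a zero linger time, which makes `close()` abort the connection. The function name `read_after_reset` is hypothetical:

```python
import socket
import struct


def read_after_reset() -> str:
    """Have the server abort the connection, then read on the client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        with socket.create_connection(server.getsockname(), timeout=2) as client:
            conn, _ = server.accept()
            # l_onoff=1, l_linger=0: close() sends a RST rather than a FIN.
            conn.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
            conn.close()
            try:
                client.recv(1)
                return "closed cleanly"
            except ConnectionResetError:
                return "connection reset by peer"


print(read_after_reset())
```

In real incidents the RST usually comes from an application crash, a proxy, or a load balancer closing the session, not from a deliberate socket option.
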
```bash
2024/07/26 14:46:51 [error] 29#29: *283832 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.224.253, server: hermes-conrad.jm, request: "GET /apis/snapshot.storage.k8s.io/v1beta1?timeout=32s HTTP/1.1", upstream: "http://10.42.4.181:80/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s", host: "hermes-conrad.jm"
```

???

Equivalent to hanging up the phone on a caller, an abrupt closure of the TCP session - almost always by the destination but can be from the source as well

- A TCP packet was sent by the peer (usually the destination) with the RST (reset) flag set, indicating a forced immediate closure
- More polite than not sending anything (a timeout), can give more context for troubleshooting

---

## no route to host

```bash
2024-06-25T18:57:02.136622425-05:00 stderr F W0625 23:57:02.136398 1 egress_controller.go:1001] Failed to start watch for EgressGroup: Get "https://10.43.131.109:443/apis/controlplane.antrea.io/v1beta2/egressgroups?fieldSelector=nodeName%3Dsomething-47024a6c-xdrfv&watch=true": dial tcp 10.43.131.109:443: connect: no route to host
```

```bash
[ERROR] plugin/errors: 2 3641072525830743004.8496191176616642290. HINFO: read udp 10.42.1.253:43929->8.8.8.8:53: read: no route to host
```

```bash
2024/04/05 03:18:05 [error] 2681#2681: *2305048 connect() failed (113: No route to host) while connecting to upstream, client: 10.2.176.17, server: anchovies-on-pizza.it, request: "GET /hello HTTP/2.0", upstream: "http://10.42.96.250:1234/hello", host: "anchovies-on-pizza.it"
```

- Can sometimes be caused by firewalld REJECT rules, which use `reject-with icmp-host-prohibited`

```bash
# iptables -nvL
[...]
 pkts bytes target     prot opt in     out     source               destination
 458K   70M REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
```

???

Uncommon, some possible causes:

- A genuine issue with routes in the OS main route table or pod network sandbox
- A firewall rule with a REJECT type that misleads source clients, firewalld commonly adds rules with `--reject-with icmp-host-prohibited`
- REJECT is useful if you want to notify the client, DROP is better for hardening to avoid confirming that the destination exists and that the port may be in use

---

## dns failure

```bash
Oct 17 20:36:18 old-bessie-1 rke2[12378]: time="2022-10-17T20:36:18Z" level=warning msg="Failed to get image from endpoint: Get \"https://planet.express.com/v2/\": dial tcp: lookup planet.express.com: i/o timeout"
```
```bash
Post "http://api.prod.domain.local/admin": dial tcp: lookup api.prod.domain.local: no such host
```
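
In Go-style errors, the text after `dial tcp` hints at which stage failed: `lookup <hostname>` points at name resolution, while an `<ip>:<port>` pair means the connection attempt itself failed. A rough Python sketch of that triage, assuming IPv4 addresses only; the function name `failing_stage` and the regular expressions are illustrative and not taken from any tool:

```python
import re


def failing_stage(message: str) -> str:
    """Classify a Go-style dial error as a DNS or a connection failure."""
    # "dial tcp: lookup <hostname>" means name resolution itself failed.
    if re.search(r"dial tcp: lookup \S+", message):
        return "dns"
    # "dial tcp <ipv4>:<port>:" means an address was known and the
    # connection attempt to it failed (IPv6 is not handled in this sketch).
    if re.search(r"dial tcp \d{1,3}(\.\d{1,3}){3}:\d+:", message):
        return "connect"
    return "unknown"


print(failing_stage("dial tcp: lookup planet.express.com: i/o timeout"))
print(failing_stage("dial tcp 127.0.0.1:6443: i/o timeout"))
```

Both sample messages end in `i/o timeout`, yet only the first one is a DNS problem.
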
```bash
Caused by: java.net.UnknownHostException: foo.bar.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_211]
    at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211]
[...]
```

???

- The first code block is interesting, golang logs can be a bit misleading. The key word here is `lookup`, which indicates the i/o timeout happened during the DNS lookup and not while connecting. The hostname is also a clue: if DNS is successful, a destination IP is reported in the error instead

Lots of potential causes:

- Try to triangulate, if the issue is affecting pods, try to determine if it's internal vs external or both
- Based on the above, focus on the key areas:
  - For external, checking coredns logs is often a useful first step, and verifying from another host on the network
  - For internal, checking against each coredns pod (endpoint) to eliminate pod/overlay network issues

---
class: center, middle

# Time for the lab!

Please join slack channel: #cfl-rancher-lab-oct-2024