
Notes on a k8s Calico Network Issue

  • While checking pods, we noticed that calico-node kept crashing on some machines.
```shell
$ kubectl get pods -n kube-system
NAME                READY   STATUS             RESTARTS   AGE
calico-node-6sqlf   1/1     Running            0          2d5h
calico-node-c6k88   1/1     Running            0          2d5h
calico-node-g6wkt   1/1     Running            0          2d5h
calico-node-mk756   0/1     CrashLoopBackOff   6          9m31s
calico-node-mxmsv   1/1     Running            0          2d7h
```
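With a large node count, filtering the listing for pods whose STATUS is not Running makes the broken nodes easier to spot. A minimal sketch, using a copy of the listing above as sample input (in practice you would pipe `kubectl get pods -n kube-system` straight into the awk filter):

```shell
# Sample listing trimmed from the `kubectl get pods -n kube-system` output above.
pods='NAME                READY   STATUS             RESTARTS   AGE
calico-node-6sqlf   1/1     Running            0          2d5h
calico-node-mk756   0/1     CrashLoopBackOff   6          9m31s
calico-node-mxmsv   1/1     Running            0          2d7h'

# Print pod names whose STATUS column ($3) is not Running; NR > 1 skips the header.
bad=$(echo "$pods" | awk 'NR > 1 && $3 != "Running" { print $1 }')
echo "$bad"   # calico-node-mk756
```

The same filter works unchanged for any `kubectl get pods` output, since the STATUS column position is stable.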
  • The problem pod's log shows the following. Note that the pod picked 172.18.0.1 as its IPv4 address:
```shell
$ kubectl logs -f pod/calico-node-mk756 -n kube-system

2022-05-21 17:39:46.508 [INFO][9] startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface br-a2f97d7bac67: 172.18.0.1/16
2022-05-21 17:39:46.508 [INFO][9] startup/startup.go 559: Node IPv4 changed, will check for conflicts
2022-05-21 17:39:46.513 [WARNING][9] startup/startup.go 984: Calico node 'aliyun-172-20-197-145' is already using the IPv4 address 172.18.0.1.
2022-05-21 17:39:46.513 [INFO][9] startup/startup.go 389: Clearing out-of-date IPv4 address from this node IP="172.18.0.1/16"
2022-05-21 17:39:46.523 [WARNING][9] startup/utils.go 49: Terminating
```
  • Checking the interfaces shows this machine has more than one. Inter-node traffic should go over eth0, but Calico autodetected a different interface (the br-a2f97d7bac67 bridge, whose 172.18.0.1 address collides with another node's bridge), so the nodes could not communicate and the pod kept crashing.
```shell
$ ifconfig
br-a2f97d7bac67: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:5c:4c:8e:45  txqueuelen 0  (Ethernet)
        RX packets 227708125  bytes 313047951956 (291.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 227708125  bytes 313047951956 (291.5 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:4f:25:15:a8  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.20.197.141  netmask 255.255.240.0  broadcast 172.20.207.255
        inet6 fe80::216:3eff:fe18:a605  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:18:a6:05  txqueuelen 1000  (Ethernet)
        RX packets 285703387  bytes 86199772873 (80.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 533428566  bytes 532603608817 (496.0 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
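Before changing any configuration, it is worth confirming which interface actually owns the node's Kubernetes IP (172.20.197.141 here). A minimal sketch, using a trimmed copy of the ifconfig output above as sample input; in practice you would pipe real `ifconfig` output into the same awk filter:

```shell
# Find the interface that owns a given IPv4 address in ifconfig-style output.
NODE_IP="172.20.197.141"
sample='br-a2f97d7bac67: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.20.197.141  netmask 255.255.240.0  broadcast 172.20.207.255'

detected_iface=$(echo "$sample" | awk -v ip="$NODE_IP" '
  /^[a-z]/ { iface = $1; sub(/:$/, "", iface) }  # header line: remember interface name
  $1 == "inet" && $2 == ip { print iface }       # this interface owns the target IP
')
echo "$detected_iface"   # eth0
```

If the printed interface does not match what Calico logged, the autodetection method needs to be overridden.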

  • Since the root cause is interface selection, the fix is to change the rule calico-node uses to pick its interface.
  • This is done through Calico's configuration: add an IP_AUTODETECTION_METHOD entry under spec.template.spec.containers.env.
```shell
$ kubectl edit daemonset.apps/calico-node -n kube-system
```
```diff
         - name: NODENAME
           valueFrom:
             fieldRef:
               apiVersion: v1
               fieldPath: spec.nodeName
         - name: CALICO_NETWORKING_BACKEND
           valueFrom:
             configMapKeyRef:
               key: calico_backend
               name: calico-config
         - name: CLUSTER_TYPE
           value: k8s,bgp
+        - name: IP_AUTODETECTION_METHOD
+          value: interface=eth0
         - name: IP
           value: autodetect
```
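`interface=eth0` is the simplest fix when every node names its uplink eth0. Calico's autodetection supports other methods that can be more robust on heterogeneous nodes; the destination and regex values below are illustrative examples, not taken from this cluster:

```yaml
# Alternatives for IP_AUTODETECTION_METHOD (pick one; values are examples):
- name: IP_AUTODETECTION_METHOD
  value: can-reach=www.aliyun.com       # use the interface that routes toward this host
# value: interface=eth.*                # first interface whose name matches the regex
# value: skip-interface=br-.*,docker0   # exclude bridges, autodetect among the rest
```

`skip-interface` would also have solved this case by excluding the Docker bridges instead of naming the uplink explicitly.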
  • Saving the kubectl edit session applies the change immediately; Kubernetes then re-deploys the calico-node pods.
  • After adjusting the interface selection, all calico-node pods returned to normal.