Environment
172.17.42.30 kube-master
172.17.42.31 kube-node1
172.17.42.32 kube-node2
/usr/bin/kube-apiserver --logtostderr=true --v=0 \
  --etcd-servers=http://kube-master:2379 \
  --insecure-bind-address=127.0.0.1 --secure-port=443 \
  --allow-privileged=true \
  --service-cluster-ip-range=10.254.0.0/16 \
  --admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota \
  --tls-cert-file=/etc/kubernetes/certs/server.crt \
  --tls-private-key-file=/etc/kubernetes/certs/server.key \
  --client-ca-file=/etc/kubernetes/certs/ca.crt \
  --token-auth-file=/etc/kubernetes/tokens/known_tokens.csv \
  --service-account-key-file=/etc/kubernetes/certs/server.crt
flannel
implementation
- node1
[root@kube-node1 ~]# ip route show
default via 172.17.42.1 dev eth0
172.16.0.0/12 dev flannel.1 proto kernel scope link src 172.16.28.0
172.16.28.0/24 dev docker0 proto kernel scope link src 172.16.28.1
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.42.31
[root@kube-node1 ~]# iptables-save -t nat
# Generated by iptables-save v1.4.21 on Thu Mar 17 04:26:56 2016
*nat
-A POSTROUTING -s 172.16.28.0/24 ! -o docker0 -j MASQUERADE
[root@kube-node1 ~]# ip a
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
link/ether b2:69:2c:67:63:b8 brd ff:ff:ff:ff:ff:ff
inet 172.16.28.0/12 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::b069:2cff:fe67:63b8/64 scope link
valid_lft forever preferred_lft forever
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
link/ether 02:42:fb:83:70:4d brd ff:ff:ff:ff:ff:ff
inet 172.16.28.1/24 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:fbff:fe83:704d/64 scope link
valid_lft forever preferred_lft forever
### flannel.1 is a VXLAN device
[root@kube-node1 ~]# ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT
link/ether b2:69:2c:67:63:b8 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 172.17.42.31 dev eth0 srcport 32768 61000 dstport 8472 nolearning ageing 300
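The VXLAN parameters above (VNI 1, the /12 network carved into per-node /24 subnets, UDP port 8472) correspond to a flannel network config along the following lines in etcd. This is a sketch rather than the config captured from this cluster; /coreos.com/network/config is flannel's default key, and the etcd server is the one on kube-master:
[root@kube-master ~]# etcdctl set /coreos.com/network/config '{"Network": "172.16.0.0/12", "SubnetLen": 24, "Backend": {"Type": "vxlan", "VNI": 1}}'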
- node2
[root@kube-node2 ~]# ip route show
default via 172.17.42.1 dev eth0
172.16.0.0/12 dev flannel.1 proto kernel scope link src 172.16.78.0
172.16.78.0/24 dev docker0 proto kernel scope link src 172.16.78.1
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.42.32
[root@kube-node2 ~]# iptables-save -t nat
# Generated by iptables-save v1.4.21 on Thu Mar 17 04:30:51 2016
*nat
-A POSTROUTING -s 172.16.78.0/24 ! -o docker0 -j MASQUERADE
- network topology
- packet flow
Suppose sshd-2 accesses nginx-0:
When the packet {172.16.28.5:port => 172.16.78.9:80} reaches docker0, the routing rule
172.16.0.0/12 dev flannel.1 proto kernel scope link src 172.16.28.0
sends the packet out through flannel.1. At the same time, the iptables SNAT rule rewrites the packet's source IP to flannel.1's address (172.16.28.0/12). flannel.1 is a VXLAN device, so it encapsulates the packet in a tunnel and sends it to node2. node2 decapsulates the packet and, according to the routing rule
172.16.78.0/24 dev docker0 proto kernel scope link src 172.16.78.1
sends it out through docker0 and on to nginx-0.
Capturing traffic on the VXLAN port on node2:
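For example (assuming tcpdump is installed on the node; flannel's VXLAN traffic is UDP on port 8472, as shown in the flannel.1 link details above):
[root@kube-node2 ~]# tcpdump -nn -i eth0 udp port 8472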
- difference between flannel and docker overlay
In flannel, the bridge that connects the containers (docker0) and the VXLAN device (flannel.1) are independent of each other, whereas docker overlay attaches the VXLAN device directly as a port on the bridge.
A correction to an earlier misunderstanding: the SNAT rule above comes from Docker's --ip-masq option; the flags Kubernetes recommends for the Docker daemon are --iptables=false --ip-masq=false.
The downside of this SNAT is that containers communicating across nodes cannot see each other's real IP; for example, nginx-0 sees the source IP as flannel.1's address.
The upside is that a node can reach containers directly, e.g. node1 can access nginx-0 directly. This is the basis of Kubernetes services. With docker overlay, the host cannot communicate with the containers directly.
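For reference, on sysconfig-style distributions those recommended flags would go into the Docker daemon's options; the exact file and variable name depend on the distribution, so this is only a sketch:
# /etc/sysconfig/docker
OPTIONS="--iptables=false --ip-masq=false"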
kubernetes service
test service
- create rc
# cat replication.yml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: my-nginx
  template:
    metadata:
      name: nginx
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.0
        ports:
        - containerPort: 80
# kubectl create -f ./replication.yml
# kubectl get rc
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS AGE
nginx nginx nginx:1.0 app=my-nginx 2 3s
# kubectl describe replicationcontrollers/nginx
Name: nginx
Namespace: default
Image(s): nginx:1.0
Selector: app=my-nginx
Labels: app=my-nginx
Replicas: 2 current / 2 desired
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
5m 5m 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-x5qqd
5m 5m 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-wzcbn
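The pods created by the controller can be listed with the same label selector (pod names are generated, so they will differ from run to run):
# kubectl get pods -l app=my-nginx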
- create service
# cat service.yml
{
  "kind": "Service",
  "apiVersion": "v1",
  "metadata": {
    "name": "my-nginx-service"
  },
  "spec": {
    "selector": {
      "app": "my-nginx"
    },
    "ports": [
      {
        "protocol": "TCP",
        "port": 8080,
        "targetPort": 80
      }
    ]
  }
}
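For comparison, the same service can be written in YAML form, equivalent to the JSON above:
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-service
spec:
  selector:
    app: my-nginx
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80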
# kubectl create -f ./service.yml
service "my-nginx-service" created
# kubectl get services
NAME CLUSTER-IP EXTERNAL-IP PORT(S) SELECTOR AGE
kubernetes 10.254.0.1 <none> 443/TCP <none> 27d
my-nginx-service 10.254.247.121 <none> 8080/TCP app=my-nginx 11s
# kubectl describe services/my-nginx-service
Name: my-nginx-service
Namespace: default
Labels: <none>
Selector: app=my-nginx
Type: ClusterIP
IP: 10.254.247.121
Port: <unnamed> 8080/TCP
Endpoints: 172.16.78.10:80,172.16.78.9:80
Session Affinity: None
No events.
As shown above, the service is assigned the VIP 10.254.247.121.
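The endpoints behind the VIP can also be listed directly; they are the pod IPs shown in the describe output above:
# kubectl get endpoints my-nginx-service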
service implementation
If you are not familiar with services yet, see the earlier post Kubernetes解析: services first.
Two points are worth noting:
(1) kube-proxy creates two NAT iptables rules on each host: one in the PREROUTING chain, which applies when containers on the host access a service, and one in the OUTPUT chain, which applies when the host itself accesses a service.
(2) kube-proxy listens on a local port. When a container on the host (or the host itself) accesses a service, iptables redirects the request to the local kube-proxy, which then forwards it to one of the service's actual backend containers (on this host or another host), sending it out through the local flannel.1 interface.
This implies that Kubernetes services require every host to be able to reach all containers; as shown in the previous section, flannel satisfies that requirement.
- node1 rule
[root@kube-node1 ~]# iptables-save -t nat
-A PREROUTING -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-CONTAINER
-A OUTPUT -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-HOST
## applies when a container on the host accesses the service
-A KUBE-PORTALS-CONTAINER -d 10.254.247.121/32 -p tcp -m comment --comment "default/my-nginx-service:" -m tcp --dport 8080 -j REDIRECT --to-ports 50948
## applies when the host itself accesses the service
-A KUBE-PORTALS-HOST -d 10.254.247.121/32 -p tcp -m comment --comment "default/my-nginx-service:" -m tcp --dport 8080 -j DNAT --to-destination 172.17.42.31:50948
[root@kube-node1 ~]# netstat -ltnp
tcp6 0 0 :::50948 :::* LISTEN 2615/kube-proxy
- node2 rule
[root@kube-node2 ~]# iptables-save -t nat
-A PREROUTING -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-CONTAINER
-A OUTPUT -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-HOST
-A KUBE-PORTALS-CONTAINER -d 10.254.247.121/32 -p tcp -m comment --comment "default/my-nginx-service:" -m tcp --dport 8080 -j REDIRECT --to-ports 55231
-A KUBE-PORTALS-HOST -d 10.254.247.121/32 -p tcp -m comment --comment "default/my-nginx-service:" -m tcp --dport 8080 -j DNAT --to-destination 172.17.42.32:55231
[root@kube-node2 ~]# netstat -tlnp|grep kube-proxy
tcp6 0 0 :::55231 :::* LISTEN 1651/kube-proxy
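Since KUBE-PORTALS-HOST simply DNATs to the local kube-proxy port, that port can also be hit directly as a sanity check (assuming curl is available on the node); kube-proxy then forwards the request to one of the service's endpoints:
[root@kube-node2 ~]# curl -s http://172.17.42.32:55231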
access test
container access service
[root@sshd-2 ~]# ip a
13: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
link/ether 02:42:ac:10:1c:05 brd ff:ff:ff:ff:ff:ff
inet 172.16.28.5/24 scope global eth0
valid_lft forever preferred_lft forever
[root@sshd-2 ~]# telnet 10.254.247.121 8080
Trying 10.254.247.121...
Connected to 10.254.247.121.
Escape character is '^]'.
[root@sshd-2 ~]# netstat -tnp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.16.28.5:38380 10.254.247.121:8080 ESTABLISHED 48/telnet
Host's socket:
[root@kube-node1 ~]# netstat -tnp|grep kube-proxy
tcp 0 0 172.16.28.0:59023 172.16.78.10:80 ESTABLISHED 2615/kube-proxy
tcp6 0 0 172.16.28.1:50948 172.16.28.5:38380 ESTABLISHED 2615/kube-proxy
The first connection above is between kube-proxy and nginx-1; the second is between telnet and kube-proxy.
packet flow:
{SPod_IP:SPod_port -> VIP:VPort} => {SPod_IP:SPod_port -> Docker0_IP:Proxy_port} => {flannel.1_IP:flannel.1_port -> DPod_IP:DPod_port}
node access service
- node1 -> service:
[root@kube-node1 ~]# telnet 10.254.247.121 8080
Trying 10.254.247.121...
Connected to 10.254.247.121.
Escape character is '^]'.
[root@kube-node1 ~]# netstat -ntp|grep telnet
tcp 0 0 172.17.42.31:51498 10.254.247.121:8080 ESTABLISHED 15524/telnet
[root@kube-node1 ~]# netstat -ntp|grep proxy
tcp6 0 0 172.17.42.31:50948 172.17.42.31:51498 ESTABLISHED 2615/kube-proxy
tcp 0 0 172.16.28.0:59012 172.16.78.10:80 ESTABLISHED 2615/kube-proxy
The first kube-proxy connection is between telnet and kube-proxy; the second is between kube-proxy and nginx-1.
- node2 -> service:
[root@kube-node2 ~]# telnet 10.254.247.121 8080
Trying 10.254.247.121...
Connected to 10.254.247.121.
Escape character is '^]'.
[root@kube-node2 ~]# netstat -tnp|grep telnet
tcp 0 0 172.17.42.32:46588 10.254.247.121:8080 ESTABLISHED 25878/telnet
[root@kube-node2 ~]# netstat -tnp|grep kube-proxy
tcp 0 0 172.16.78.1:56942 172.16.78.10:80 ESTABLISHED 1651/kube-proxy
tcp6 0 0 172.17.42.32:55231 172.17.42.32:46588 ESTABLISHED 1651/kube-proxy
The difference from node1 is that on node2 the source IP of the kube-proxy-to-nginx-1 connection is docker0's IP, whereas on node1 it is flannel.1's IP.