MPI_Barrier
synchronizes MPI processes: no process may continue past the barrier (the synchronization point) until every process has reached it.
MPI_Barrier(MPI_Comm communicator)
In the figure above, process P0 reaches the barrier at time T1 and then blocks, because the other processes have not arrived yet. Only at T4, when all processes have reached the barrier, can they all continue.
MPI_Bcast
implements a broadcast: one process sends the same data to all other processes:
MPI_Bcast(
void* data,
int count,
MPI_Datatype datatype,
int root,
MPI_Comm communicator)
Both the sender and the receivers call MPI_Bcast; root designates the sending process. For the sender, data points to the data being sent; for a receiver, the received data is stored into the buffer that data points to.
The simplest implementation has the root loop over the other processes and send (MPI_Send) the data to each in turn, but its time complexity is O(N). A more efficient scheme works as follows:
stage1: 0 -> 1
stage2: 0 -> 2, 1 -> 3
stage3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7
In stage 1, process 0 sends to process 1; in stage 2, process 0 sends to process 2 while process 1 sends to process 3; and so on. After 3 rounds, the broadcast over 8 processes is complete. This tree-based algorithm clearly has O(logN) complexity.
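The staged schedule above can be sketched in plain C; this simulates the send pattern only (it is not MPI code, and `tree_broadcast` is an illustrative name):

```c
/* Sketch (not MPI): simulate the tree broadcast schedule. In stage s,
 * each rank r < 2^s that already holds the data sends it to rank
 * r + 2^s. Returns the number of stages needed to cover all n ranks. */
static int tree_broadcast(int n, int has_data[]) {
    int stages = 0;
    for (int span = 1; span < n; span <<= 1, stages++)
        for (int r = 0; r < span && r + span < n; r++)
            if (has_data[r])
                has_data[r + span] = 1;   /* r -> r + span */
    return stages;
}
```

For n = 8 this reproduces exactly the 3 stages listed above.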
MPI_Scatter
is similar to MPI_Bcast: the root process sends data to all other processes. The difference is that MPI_Scatter splits the send buffer into equal-sized chunks and sends a different chunk to each process.
MPI_Scatter(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator)
send_data is an array of data on the root process; send_count and send_datatype describe the number and type of elements sent to each process. recv_data is the receive buffer on each process, able to hold recv_count elements of type recv_datatype. root designates the sending process.
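The chunking rule can be sketched as follows (illustrative C, not the MPI implementation, with int elements assumed): rank i receives elements [i*send_count, (i+1)*send_count) of the root's send buffer.

```c
#include <string.h>

/* Sketch (not MPI): the chunk MPI_Scatter delivers to rank `rank`.
 * Rank i gets send_data[i*send_count .. (i+1)*send_count - 1]
 * from the root's buffer. */
static void scatter_chunk(const int *send_data, int send_count,
                          int rank, int *recv_data) {
    memcpy(recv_data, send_data + rank * send_count,
           send_count * sizeof(int));
}
```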
MPI_Gather
is the inverse of MPI_Scatter: it collects data from all processes onto a single process.
MPI_Gather(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator)
In MPI_Gather, only the root process needs a valid receive buffer; all other calling processes may pass NULL for recv_data. Also note that recv_count is the number of elements received from each process, not the total across all processes.
MPI_Allgather
is a many-to-many operation: every process gathers data from all processes. In general, MPI_Allgather is equivalent to an MPI_Gather followed by an MPI_Bcast.
MPI_Allgather(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
MPI_Comm communicator)
Compared to MPI_Gather, there is no root parameter.
Like MPI_Gather, MPI_Reduce has the root process collect data from all processes, but it takes an additional MPI_Op parameter that specifies the operation applied to each element of the send_data buffer.
MPI_Reduce(
void* send_data,
void* recv_data,
int count,
MPI_Datatype datatype,
MPI_Op op,
int root,
MPI_Comm communicator)
The i-th element of send_data from every process is combined with the specified operation, and the result is written to the i-th element of recv_data. In other words, instead of summing all of the elements from all the arrays into one element, the i-th element from each array is summed into the i-th element of the result array on process 0.
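The elementwise semantics can be sketched as follows (illustrative C with the op fixed to sum, not the MPI implementation):

```c
/* Sketch (not MPI): elementwise MPI_Reduce with op = MPI_SUM.
 * send_bufs holds each process's buffer back to back (row-major);
 * element i of every process's buffer is summed into recv_data[i]. */
static void reduce_sum(const int *send_bufs, int nprocs, int count,
                       int *recv_data) {
    for (int i = 0; i < count; i++) {
        recv_data[i] = 0;
        for (int p = 0; p < nprocs; p++)
            recv_data[i] += send_bufs[p * count + i];  /* send_bufs[p][i] */
    }
}
```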
MPI_Allreduce
is to MPI_Reduce what MPI_Allgather is to MPI_Gather:
MPI_Allreduce(
void* send_data,
void* recv_data,
int count,
MPI_Datatype datatype,
MPI_Op op,
MPI_Comm communicator)
MPI_Allreduce needs no root process; it is equivalent to an MPI_Reduce followed by an MPI_Bcast.
Behind the MPI_Allreduce primitive, MPI implements several AllReduce algorithms, including Butterfly, Ring AllReduce, and Segmented Ring.
Ring AllReduce
targets the case where the data block is large: each node's data is split into N chunks (analogous to a scatter). The ring allreduce then proceeds in 2 phases:
In (N-1) steps, each node obtains one complete 1/N chunk of the reduced data. Each step costs α+S/(NB) in communication and (S/N)*C in computation. This phase can be viewed as a scatter-reduce.
In another (N-1) steps, every node's copy of each 1/N chunk becomes complete. Each step again costs α+S/(NB) in communication, with no computation. This phase can be viewed as an allgather.
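The cost model above can be written out directly; this is a sketch, with symbols taken from the text (N nodes, data size S, bandwidth B, per-byte reduce cost C, latency α):

```c
/* Sketch of the ring allreduce cost model above: both phases take
 * (N-1) steps of alpha + S/(N*B) communication each; only the
 * scatter-reduce phase adds (S/N)*C computation per step. */
static double ring_allreduce_cost(int N, double S, double B,
                                  double C, double alpha) {
    double comm = 2.0 * (N - 1) * (alpha + S / (N * B));
    double comp = (N - 1) * (S / N) * C;
    return comm + comp;
}
```

Note that each node transfers about 2S(N-1)/N ≈ 2S bytes in total, nearly independent of N, which is why the ring algorithm scales well for large messages.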
The TCP/IP stack cannot meet the demands of modern IDC workloads, for two main reasons: (1) kernel packet processing consumes a large amount of CPU; (2) TCP cannot deliver the low latency applications require: the kernel protocol stack adds tens of milliseconds of latency, and TCP's congestion control and retransmission timeout mechanisms add more.
RDMA implements the transport protocol inside the NIC, so the first problem disappears; meanwhile, zero-copy and kernel bypass avoid the kernel-induced latency.
Unlike TCP, RDMA requires a lossless network: for example, a switch must not drop packets due to buffer overflow. To this end, RoCE uses PFC (Priority-based Flow Control) for flow control. Once a switch port's receive queue exceeds a threshold, the switch sends a PFC pause frame to the peer, telling the sender to stop transmitting. Once the receive queue falls below another threshold, it sends a pause frame with zero duration, telling the sender to resume.
PFC classifies traffic and assigns different priorities to different classes of flows, for example separating RoCE traffic from TCP/IP and other traffic. For details, see Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.
For IP/Ethernet, there are 2 ways to classify network traffic:
For details, see Understanding QoS Configuration for RoCE.
For RoCE, 2 mechanisms provide flow control: Flow Control (PFC) and Congestion Control (DCQCN); they can work together or independently.
PFC is a link-layer protocol that can only throttle per port, which is coarse-grained: when congestion occurs, the entire port is paused. This is undesirable; see Understanding RoCEv2 Congestion Management. RoCE therefore introduces Congestion Control.
DC-QCN
is the congestion control protocol used by RoCE, based on Explicit Congestion Notification (ECN). It is described in detail below.
As introduced above, there are 2 ways to classify network traffic, so PFC also has 2 implementations.
The Priority Code Point (PCP, 3 bits) of the VLAN tag defines 8 priorities.
In case of an L2 network, PFC uses the priority bits within the VLAN tag (IEEE 802.1p) to differentiate up to eight types of flows, each of which can be flow-controlled independently.
See HowTo Run RoCE and TCP over L2 Enabled with PFC.
## map skb priorities 0..7 to VLAN priority 3
for i in {0..7}; do ip link set dev eth1.100 type vlan egress-qos-map $i:3 ; done
## enable PFC on TC3
mlnx_qos -i eth1 -f 0,0,0,1,0,0,0,0
For example:
[root@node1 ~]# cat /proc/net/vlan/eth1.100
eth1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: eth1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings:
[root@node1 ~]# for i in {0..7}; do ip link set dev eth1.100 type vlan egress-qos-map $i:3 ; done
[root@node1 ~]# cat /proc/net/vlan/eth1.100
eth1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: eth1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3
See HowTo Set Egress Priority VLAN on Linux.
VLAN-based PFC has 2 main problems: (1) the switch must work in trunk mode; (2) there is no standard way to carry the VLAN PCP across an L3 network (VLAN is an L2 protocol).
DSCP-based PFC
solves both problems by using the DSCP field of the IP header.
DSCP-based PFC requires both NICs and switches to classify and queue packets based on the DSCP value instead of the VLAN tag.
The type of service (ToS) field in the IPv4 header has had various purposes over the years, and has been defined in different ways by five RFCs.[1] The modern redefinition of the ToS field is a six-bit Differentiated Services Code Point (DSCP) field[2] and a two-bit Explicit Congestion Notification (ECN) field.[3] While Differentiated Services is somewhat backwards compatible with ToS, ECN is not.
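The field layout described above can be sketched as bit operations on the ToS byte (illustrative helpers, not a kernel API): the upper 6 bits are the DSCP and the lower 2 bits are the ECN field.

```c
#include <stdint.h>

/* The ToS byte: upper 6 bits DSCP, lower 2 bits ECN. */
static uint8_t tos_dscp(uint8_t tos) { return tos >> 2; }
static uint8_t tos_ecn(uint8_t tos)  { return tos & 0x3; }
```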
For details, see:
RDMA's PFC mechanism can cause several problems:
Although PFC avoids packet loss caused by buffer overflow, other causes, such as FCS errors, can still drop packets. With RDMA's go-back-0 algorithm, every loss forces all packets of the entire message to be retransmitted, which can lead to livelock. TCP has SACK, but because the RDMA transport layer is implemented in the NIC, limited hardware resources make SACK hard to implement there. A go-back-N algorithm can be used to avoid this problem.
When PFC interacts with Ethernet's broadcast mechanism, a PFC deadlock can arise. In short, PFC pauses a port, while Ethernet broadcast packets can create new PFC pause dependencies (for example, when the server on the other end of a port goes down), forming a circular dependency. Broadcast and multicast are dangerous for losslessness; it is recommended not to assign them to lossless classes.
Because PFC pauses propagate, they can easily cause a pause frame storm. For example, if a NIC bug fills its receive buffer, the NIC will keep sending pause frames. Watchdog mechanisms are needed on both the NIC and the switch to prevent pause storms.
Because NIC resources are limited, most data structures, such as the QPC (Queue Pair Context) and WQE (Work Queue Element), are kept in host memory; the NIC caches only some of these objects, and whenever a cache miss occurs, its processing rate drops.
ECN is an end-to-end congestion notification mechanism that avoids dropping packets. It is an optional feature: the endpoints must enable ECN support, and the underlying network must support it as well.
A traditional TCP/IP network signals congestion by dropping packets; routers, switches, and servers all do so. An ECN-capable router instead sets the ECN flag (2 bits) in the IP header when congestion occurs; the receiver returns an echo of the congestion indication to the sender, and the sender then reduces its sending rate.
Since the sending rate is controlled by the transport layer (TCP), ECN requires TCP and IP to cooperate.
RFC 3168 defines ECN for TCP/IP.
The IP header has a 2-bit ECN field:
If an endpoint supports ECN, it marks its packets with ECT(0) or ECT(1).
To support ECN, TCP uses 3 flags in the TCP header: Nonce Sum (NS), ECN-Echo (ECE), and Congestion Window Reduced (CWR).
RoCEv2 introduces ECN for congestion control, namely RoCEv2 Congestion Management (RCM). With RCM, once the network becomes congested, the sender is notified to reduce its sending rate. Similar to TCP, RoCEv2 uses the FECN flag of the transport-layer Base Transport Header (BTH) to mark congestion.
RoCEv2 HCAs implementing RCM must follow these rules:
(1) On receiving a packet whose IP.ECN is 11, the HCA generates a RoCEv2 CNP (Congestion Notification Packet) and returns it to the sender;
(2) On receiving a RoCEv2 CNP packet, it reduces the sending rate of the corresponding QP;
(3) After a configured time or byte count has elapsed since the last RoCEv2 CNP, the HCA may increase the sending rate of the corresponding QP.
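Rule (1) on the notification side can be sketched as a check on the ECN codepoint (illustrative names, not a real driver API): only CE-marked (11) packets trigger a CNP.

```c
/* Sketch of the NP-side rule: on receiving a packet whose IP.ECN
 * field is 11 (CE), the HCA generates a CNP back to the sender.
 * These names are illustrative, not a real driver API. */
enum { ECN_NOT_ECT = 0, ECN_ECT1 = 1, ECN_ECT0 = 2, ECN_CE = 3 };

static int should_send_cnp(unsigned ip_ecn) {
    return ip_ecn == ECN_CE;   /* only CE-marked packets trigger a CNP */
}
```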
Term | Description
---|---
RP (Injector) | Reaction Point - the end node that performs rate limitation to prevent congestion
NP | Notification Point - the end node that receives the packets from the injector and sends back notifications to the injector for indications regarding the congestion situation
CP | Congestion Point - the switch queue in which congestion happens
CNP | The RoCEv2 Congestion Notification Packet - the notification message an NP sends to the RP when it receives CE-marked packets.
The Communication Manager (CM) exchanges QP information before the two sides actually establish a connection. Each QP contains a Send Queue (SQ) and a Receive Queue (RQ).
QP Setup. When it is set up by software, a RC QP is initialized with:
(1) The port number on the local CA through which it will send and receive all messages.
(2) The QP Number (QPN) that identifies the RC QP that it is married to in a remote CA.
(3) The port address of the remote CA port behind which the remote RC QP resides.
struct ibv_qp {
struct ibv_context *context;
void *qp_context;
struct ibv_pd *pd;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
uint32_t handle;
uint32_t qp_num;///QPN
enum ibv_qp_state state; /// state
enum ibv_qp_type qp_type; ///type
pthread_mutex_t mutex;
pthread_cond_t cond;
uint32_t events_completed;
};
ibv_create_qp() is used to create a QP:
struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,struct ibv_qp_init_attr *qp_init_attr);
/*
* @max_write_sge: Maximum SGE elements per RDMA WRITE request.
* @max_read_sge: Maximum SGE elements per RDMA READ request.
*/
struct ib_qp {
struct ib_device *device;
struct ib_pd *pd;
struct ib_cq *send_cq;
struct ib_cq *recv_cq;
///...
void *qp_context;
u32 qp_num; ///QP number(QPN)
u32 max_write_sge;
u32 max_read_sge;
enum ib_qp_type qp_type; ///QP type
///..
}
The creation API is ib_uverbs_create_qp.
struct mlx4_ib_qp {
union {
struct ib_qp ibqp; //QP in ib_core
struct ib_wq ibwq;
};
struct mlx4_qp mqp; // QP in mlx4_core
struct mlx4_buf buf;
struct mlx4_db db;
struct mlx4_ib_wq rq;///RQ
///...
struct mlx4_ib_wq sq; ///SQ
///...
}
The creation API is mlx4_ib_create_qp.
struct mlx4_qp {
void (*event) (struct mlx4_qp *, enum mlx4_event);
int qpn; /// QP number
atomic_t refcount;
struct completion free;
u8 usage;
};
The creation API is mlx4_qp_alloc.
A QP has many attributes, including its state; see enum ibv_qp_attr_mask for the full list. A few important ones are discussed here.
ibv_modify_qp modifies QP attributes, including the QP state.
ibv_modify_qp this verb changes QP attributes and one of those attributes may be the QP state.
/**
* ibv_modify_qp - Modify a queue pair.
*/
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
int attr_mask);
See here.
A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).
This provides needed information used by the QP to be able send / receive data.
A QP has the following states:
RESET Newly created, queues empty.
INIT Basic information set. Ready for posting to receive queue.
RTR Ready to Receive. Remote address info set for connected QPs, QP may now receive packets.
RTS Ready to Send. Timeout and retry parameters set, QP may now send packets.
When a QP is created, it is in the RESET state; we can move it to the INIT state by calling ibv_modify_qp:
///...
{
struct ibv_qp_attr attr = {
.qp_state = IBV_QPS_INIT,
.pkey_index = 0,
.port_num = port,
.qp_access_flags = 0
};
if (ibv_modify_qp(ctx->qp, &attr,
IBV_QP_STATE |
IBV_QP_PKEY_INDEX |
IBV_QP_PORT |
IBV_QP_ACCESS_FLAGS)) {
fprintf(stderr, "Failed to modify QP to INIT\n");
goto clean_qp;
}
}
Once the QP is in the INIT state, we can call ibv_post_recv to post receive buffers to the receive queue.
Once a queue pair (QP) has receive buffers posted to it, it is now possible to transition the QP into the ready to receive (RTR) state.
For example, for a client/server pair, the QP must be brought to the RTS state; see rc_pingpong@pp_connect_ctx.
When setting the QP state to RTR, several other attributes must also be filled in, including the remote address information (LID, QPN, PSN, GID). If the RDMA CM verb API is not used, this information must be exchanged between client and server by other means, such as a TCP/IP socket connection; see rc_pingpong@pp_client_exch_dest. The client first sends its (LID, QPN, PSN, GID) to the server; the server reads and stores it, then sends back its own (LID, QPN, PSN, GID); once the client receives this, it can set its QP to the RTR state.
static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
enum ibv_mtu mtu, int sl,
struct pingpong_dest *dest, int sgid_idx)
{
struct ibv_qp_attr attr = {
.qp_state = IBV_QPS_RTR,
.path_mtu = mtu,
.dest_qp_num = dest->qpn, /// remote QPN
.rq_psn = dest->psn, /// remote PSN
.max_dest_rd_atomic = 1,
.min_rnr_timer = 12,
.ah_attr = {
.is_global = 0,
.dlid = dest->lid, /// remote LID
.sl = sl, ///service level
.src_path_bits = 0,
.port_num = port
}
};
if (dest->gid.global.interface_id) {
attr.ah_attr.is_global = 1;
attr.ah_attr.grh.hop_limit = 1;
attr.ah_attr.grh.dgid = dest->gid;///remote GID
attr.ah_attr.grh.sgid_index = sgid_idx;
}
if (ibv_modify_qp(ctx->qp, &attr,
IBV_QP_STATE |
IBV_QP_AV |
IBV_QP_PATH_MTU |
IBV_QP_DEST_QPN |
IBV_QP_RQ_PSN |
IBV_QP_MAX_DEST_RD_ATOMIC |
IBV_QP_MIN_RNR_TIMER)) {
fprintf(stderr, "Failed to modify QP to RTR\n");
return 1;
///...
ah_attr/IBV_QP_AV an address handle (AH) needs to be created and filled in as appropriate. Minimally, ah_attr.dlid needs to be filled in.
dest_qp_num/IBV_QP_DEST_QPN QP number of remote QP.
rq_psn/IBV_QP_RQ_PSN starting receive packet sequence number (should match remote QP's sq_psn)
Note IBV_QP_AV in particular: it instructs the kernel to perform address resolution; for RoCE, that is the translation from the L3 address to the MAC address. Its implementation is covered in detail below.
Alternatively, when the RDMA CM verb API is used, e.g. rdma_connect to establish the connection, the CM Connect Request it sends carries this information:
struct cm_req_msg {
struct ib_mad_hdr hdr;
__be32 local_comm_id;
__be32 rsvd4;
__be64 service_id;
__be64 local_ca_guid;
__be32 rsvd24;
__be32 local_qkey;
/* local QPN:24, responder resources:8 */
__be32 offset32; ///QPN
/* local EECN:24, initiator depth:8 */
__be32 offset36;
/*
* remote EECN:24, remote CM response timeout:5,
* transport service type:2, end-to-end flow control:1
*/
__be32 offset40;
/* starting PSN:24, local CM response timeout:5, retry count:3 */
__be32 offset44; ///PSN
__be16 pkey;
/* path MTU:4, RDC exists:1, RNR retry count:3. */
u8 offset50;
/* max CM Retries:4, SRQ:1, extended transport type:3 */
u8 offset51;
__be16 primary_local_lid;
__be16 primary_remote_lid;
union ib_gid primary_local_gid; /// local GID
union ib_gid primary_remote_gid;
///...
The CM Connect Response the server replies with also carries the corresponding information:
struct cm_rep_msg {
struct ib_mad_hdr hdr;
__be32 local_comm_id;
__be32 remote_comm_id;
__be32 local_qkey;
/* local QPN:24, rsvd:8 */
__be32 offset12;
/* local EECN:24, rsvd:8 */
__be32 offset16;
/* starting PSN:24 rsvd:8 */
__be32 offset20;
u8 resp_resources;
u8 initiator_depth;
/* target ACK delay:5, failover accepted:2, end-to-end flow control:1 */
u8 offset26;
/* RNR retry count:3, SRQ:1, rsvd:5 */
u8 offset27;
__be64 local_ca_guid;
u8 private_data[IB_CM_REP_PRIVATE_DATA_SIZE];
} __attribute__ ((packed));
attr.qp_state = IBV_QPS_RTS;
attr.timeout = 14;
attr.retry_cnt = 7;
attr.rnr_retry = 7;
attr.sq_psn = my_psn;
attr.max_rd_atomic = 1;
if (ibv_modify_qp(ctx->qp, &attr,
IBV_QP_STATE |
IBV_QP_TIMEOUT |
IBV_QP_RETRY_CNT |
IBV_QP_RNR_RETRY |
IBV_QP_SQ_PSN |
IBV_QP_MAX_QP_RD_ATOMIC)) {
fprintf(stderr, "Failed to modify QP to RTS\n");
return 1;
}
Related attributes:
timeout/IBV_QP_TIMEOUT local ack timeout (recommended value: 14)
retry_cnt/IBV_QP_RETRY_CNT retry count (recommended value: 7)
rnr_retry/IBV_QP_RNR_RETRY RNR retry count (recommended value: 7)
sq_psn/IBV_QP_SQ_PSN send queue starting packet sequence number (should match remote QP's rq_psn)
ibv_modify_qp -> mlx4_modify_qp -> ibv_cmd_modify_qp:
///libibverbs/cmd.c
int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
int attr_mask,
struct ibv_modify_qp *cmd, size_t cmd_size)
{
/*
* Masks over IBV_QP_DEST_QPN are only supported by
* ibv_cmd_modify_qp_ex.
*/
if (attr_mask & ~((IBV_QP_DEST_QPN << 1) - 1))
return EOPNOTSUPP;
IBV_INIT_CMD(cmd, cmd_size, MODIFY_QP);
copy_modify_qp_fields(qp, attr, attr_mask, &cmd->base);
if (write(qp->context->cmd_fd, cmd, cmd_size) != cmd_size)
return errno;
return 0;
}
# ./funcgraph ib_uverbs_modify_qp
Tracing "ib_uverbs_modify_qp"... Ctrl-C to end.
0) | ib_uverbs_modify_qp [ib_uverbs]() {
0) | modify_qp.isra.24 [ib_uverbs]() {
0) 0.090 us | kmem_cache_alloc_trace();
0) | rdma_lookup_get_uobject [ib_uverbs]() {
0) 0.711 us | lookup_get_idr_uobject [ib_uverbs]();
0) 0.036 us | uverbs_try_lock_object [ib_uverbs]();
0) 2.012 us | }
0) 0.272 us | copy_ah_attr_from_uverbs.isra.23 [ib_uverbs]();
0) | ib_modify_qp_with_udata [ib_core]() {
0) | ib_resolve_eth_dmac [ib_core]() {
0) | ib_query_gid [ib_core]() {
0) | ib_get_cached_gid [ib_core]() {
0) 0.159 us | _raw_read_lock_irqsave();
0) 0.036 us | __ib_cache_gid_get [ib_core]();
0) 0.041 us | _raw_read_unlock_irqrestore();
0) 1.367 us | }
0) 1.677 us | }
0) 2.200 us | }
0) 2.742 us | }
0) | rdma_lookup_put_uobject [ib_uverbs]() {
0) 0.023 us | lookup_put_idr_uobject [ib_uverbs]();
0) 0.395 us | }
0) 0.055 us | kfree();
0) 7.688 us | }
0) 8.331 us | }
In ib_modify_qp_with_udata, ib_resolve_eth_dmac is called to resolve the MAC address corresponding to the remote GID:
int ib_modify_qp_with_udata(struct ib_qp *qp, struct ib_qp_attr *attr,
int attr_mask, struct ib_udata *udata)
{
int ret;
if (attr_mask & IB_QP_AV) {
ret = ib_resolve_eth_dmac(qp->device, &attr->ah_attr); /// resolve remote mac address
if (ret)
return ret;
}
ret = ib_security_modify_qp(qp, attr, attr_mask, udata);
if (!ret && (attr_mask & IB_QP_PORT))
qp->port = attr->port_num;
return ret;
}
ib_resolve_eth_dmac -> rdma_addr_find_l2_eth_by_grh:
int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
const union ib_gid *dgid,
u8 *dmac, u16 *vlan_id, int *if_index,
int *hoplimit)
{
int ret = 0;
struct rdma_dev_addr dev_addr;
struct resolve_cb_context ctx;
struct net_device *dev;
union {
struct sockaddr _sockaddr;
struct sockaddr_in _sockaddr_in;
struct sockaddr_in6 _sockaddr_in6;
} sgid_addr, dgid_addr;
rdma_gid2ip(&sgid_addr._sockaddr, sgid);
rdma_gid2ip(&dgid_addr._sockaddr, dgid);
memset(&dev_addr, 0, sizeof(dev_addr));
if (if_index)
dev_addr.bound_dev_if = *if_index;
dev_addr.net = &init_net; /// not support net namespace
ctx.addr = &dev_addr;
init_completion(&ctx.comp);
ret = rdma_resolve_ip(&self, &sgid_addr._sockaddr, &dgid_addr._sockaddr,
&dev_addr, 1000, resolve_cb, &ctx);
///..
if (dmac)
memcpy(dmac, dev_addr.dst_dev_addr, ETH_ALEN); ///set MAC address
As the code above shows, version 4.2 does not yet support net namespaces.
rdma_resolve_ip
|- addr_resolve
|- addr4_resolve /// route
|- addr_resolve_neigh /// ARP
The overall architecture of ovs-vswitchd is as follows:
_
| +-------------------+
| | ovs-vswitchd |<-->ovsdb-server
| +-------------------+
| | ofproto |<-->OpenFlow controllers
| +--------+-+--------+ _
| | netdev | |ofproto-| |
userspace | +--------+ | dpif | |
| | netdev | +--------+ |
| |provider| | dpif | |
| +---||---+ +--------+ |
| || | dpif | | implementation of
| || |provider| | ofproto provider
|_ || +---||---+ |
|| || |
_ +---||-----+---||---+ |
| | |datapath| |
kernel | | +--------+ _|
| | |
|_ +--------||---------+
||
physical
NIC
ovs-vswitchd The main Open vSwitch userspace program, in vswitchd/. It reads the desired Open vSwitch configuration from the ovsdb-server program over an IPC channel and passes this configuration down to the "ofproto" library. It also passes certain status and statistical information from ofproto back into the database.
ofproto The Open vSwitch library, in ofproto/, that implements an OpenFlow switch. It talks to OpenFlow controllers over the network and to switch hardware or software through an “ofproto provider”, explained further below.
netdev The Open vSwitch library, in lib/netdev.c, that abstracts interacting with network devices, that is, Ethernet interfaces. The netdev library is a thin layer over “netdev provider” code, explained further below.
struct ofproto represents an OpenFlow switch:
///ofproto/ofproto-provider.h
/* An OpenFlow switch.
*
* With few exceptions, ofproto implementations may look at these fields but
* should not modify them. */
struct ofproto {
struct hmap_node hmap_node; /* In global 'all_ofprotos' hmap. */
const struct ofproto_class *ofproto_class; /// see ofproto_dpif_class
char *type; /* Datapath type. */
char *name; /* Datapath name. */
///...
/* Datapath. */
struct hmap ports; /* Contains "struct ofport"s. */
struct shash port_by_name;
struct simap ofp_requests; /* OpenFlow port number requests. */
uint16_t alloc_port_no; /* Last allocated OpenFlow port number. */
uint16_t max_ports; /* Max possible OpenFlow port num, plus one. */
struct hmap ofport_usage; /* Map ofport to last used time. */
uint64_t change_seq; /* Change sequence for netdev status. */
/* Flow tables. */
long long int eviction_group_timer; /* For rate limited reheapification. */
struct oftable *tables;
int n_tables;
ovs_version_t tables_version; /* Controls which rules are visible to
* table lookups. */
///...
struct ofproto contains two key kinds of information: ports (struct ofport) and flow tables (struct oftable):
///ofproto/ofproto-provider.h
/* An OpenFlow port within a "struct ofproto".
*
* The port's name is netdev_get_name(port->netdev).
*
* With few exceptions, ofproto implementations may look at these fields but
* should not modify them. */
struct ofport {
struct hmap_node hmap_node; /* In struct ofproto's "ports" hmap. */
struct ofproto *ofproto; /* The ofproto that contains this port. */
struct netdev *netdev; /// network device
struct ofputil_phy_port pp;
ofp_port_t ofp_port; /* OpenFlow port number. */
uint64_t change_seq;
long long int created; /* Time created, in msec. */
int mtu;
};
///ofproto/ofproto-provider.h
/* A flow table within a "struct ofproto".
*/
struct oftable {
enum oftable_flags flags;
struct classifier cls; /* Contains "struct rule"s. */
char *name; /* Table name exposed via OpenFlow, or NULL. */
////...
struct oftable links all flow rules through a struct classifier.
///ofproto/ofproto-provider.h
struct rule {
/* Where this rule resides in an OpenFlow switch.
*
* These are immutable once the rule is constructed, hence 'const'. */
struct ofproto *const ofproto; /* The ofproto that contains this rule. */
const struct cls_rule cr; /* In owning ofproto's classifier. */
const uint8_t table_id; /* Index in ofproto's 'tables' array. */
enum rule_state state;
///...
/* OpenFlow actions. See struct rule_actions for more thread-safety
* notes. */
const struct rule_actions * const actions;
///...
The actions field of struct rule holds the rule's actions:
///ofproto/ofproto-provider.h
/* A set of actions within a "struct rule".
*/
struct rule_actions {
/* Flags.
*
* 'has_meter' is true if 'ofpacts' contains an OFPACT_METER action.
*
* 'has_learn_with_delete' is true if 'ofpacts' contains an OFPACT_LEARN
* action whose flags include NX_LEARN_F_DELETE_LEARNED. */
bool has_meter;
bool has_learn_with_delete;
bool has_groups;
/* Actions. */
uint32_t ofpacts_len; /* Size of 'ofpacts', in bytes. */
struct ofpact ofpacts[]; /* Sequence of "struct ofpacts". */
};
///include/openvswitch/ofp-actions.h
/* Header for an action.
*
* Each action is a structure "struct ofpact_*" that begins with "struct
* ofpact", usually followed by other data that describes the action. Actions
* are padded out to a multiple of OFPACT_ALIGNTO bytes in length.
*/
struct ofpact {
/* We want the space advantage of an 8-bit type here on every
* implementation, without giving up the advantage of having a useful type
* on implementations that support packed enums. */
#ifdef HAVE_PACKED_ENUM
enum ofpact_type type; /* OFPACT_*. */
#else
uint8_t type; /* OFPACT_* */
#endif
uint8_t raw; /* Original type when added, if any. */
uint16_t len; /* Length of the action, in bytes, including
* struct ofpact, excluding padding. */
};
OFPACT_OUTPUT action:
/* OFPACT_OUTPUT.
*
* Used for OFPAT10_OUTPUT. */
struct ofpact_output {
struct ofpact ofpact;
ofp_port_t port; /* Output port. */
uint16_t max_len; /* Max send len, for port OFPP_CONTROLLER. */
};
struct ofproto_dpif represents a bridge based on a dpif datapath:
///ofproto/ofproto-dpif.h
/* A bridge based on a "dpif" datapath. */
struct ofproto_dpif {
struct hmap_node all_ofproto_dpifs_node; /* In 'all_ofproto_dpifs'. */
struct ofproto up;
struct dpif_backer *backer;
///...
The OVS datapath interface (ofproto_dpif->backer->dpif):
///ofproto/dpif-provider.h
/* Open vSwitch datapath interface.
*
* This structure should be treated as opaque by dpif implementations. */
struct dpif {
const struct dpif_class *dpif_class;
char *base_name;
char *full_name;
uint8_t netflow_engine_type;
uint8_t netflow_engine_id;
};
struct dpif_class
:
/* Datapath interface class structure, to be defined by each implementation of
* a datapath interface.
*
* These functions return 0 if successful or a positive errno value on failure,
* except where otherwise noted.
*
* These functions are expected to execute synchronously, that is, to block as
* necessary to obtain a result. Thus, they may not return EAGAIN or
* EWOULDBLOCK or EINPROGRESS. We may relax this requirement in the future if
* and when we encounter performance problems. */
struct dpif_class { /// see dpif_netlink_class/dpif_netdev_class
/* Type of dpif in this class, e.g. "system", "netdev", etc.
*
* One of the providers should supply a "system" type, since this is
* the type assumed if no type is specified when opening a dpif. */
const char *type;
///...
struct dpif_class corresponds to the dpif provider in the diagram above:
The “dpif” library in turn delegates much of its functionality to a “dpif provider”.
struct dpif_class
, inlib/dpif-provider.h
, defines the interfaces required to implement a dpif provider for new hardware or software.There are two existing dpif implementations that may serve as useful examples during a port:
lib/dpif-netlink.c
is a Linux-specific dpif implementation that talks to an Open vSwitch-specific kernel module (whose sources are in the “datapath” directory). The kernel module performs all of the switching work, passing packets that do not match any flow table entry up to userspace. This dpif implementation is essentially a wrapper around calls into the kernel module.
lib/dpif-netdev.c
is a generic dpif implementation that performs all switching internally. This is how the Open vSwitch userspace switch is implemented.
//lib/dpif-netdev.c
const struct dpif_class dpif_netdev_class = {
"netdev",
dpif_netdev_init,
///...
This is the userspace switch implementation; DPDK requires this type:
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
options:dpdk-devargs=0000:01:00.0
See Using Open vSwitch with DPDK.
The system datapath is the switch implementation based on the Linux kernel:
//lib/dpif-netlink.c
const struct dpif_class dpif_netlink_class = {
"system",
NULL, /* init */
///...
struct netdev represents a network device:
///lib/netdev-provider.h
/* A network device (e.g. an Ethernet device).
*
* Network device implementations may read these members but should not modify
* them. */
struct netdev {
/* The following do not change during the lifetime of a struct netdev. */
char *name; /* Name of network device. */
const struct netdev_class *netdev_class; /* Functions to control
this device. */
///...
const struct netdev_class netdev_linux_class =
NETDEV_LINUX_CLASS(
"system",
netdev_linux_construct,
netdev_linux_get_stats,
netdev_linux_get_features,
netdev_linux_get_status,
LINUX_FLOW_OFFLOAD_API);
const struct netdev_class netdev_tap_class =
NETDEV_LINUX_CLASS(
"tap",
netdev_linux_construct_tap,
netdev_tap_get_stats,
netdev_linux_get_features,
netdev_linux_get_status,
NO_OFFLOAD_API);
const struct netdev_class netdev_internal_class =
NETDEV_LINUX_CLASS(
"internal",
netdev_linux_construct,
netdev_internal_get_stats,
NULL, /* get_features */
netdev_internal_get_status,
NO_OFFLOAD_API);
The basic functions of ovs-vswitchd include bridge maintenance, flow table maintenance, and upcall handling.
int
main(int argc, char *argv[])
{
///...
bridge_run();
unixctl_server_run(unixctl);
netdev_run();
///...
bridge_run builds, configures, and updates bridges according to the configuration read from ovsdb-server; see OVS网桥建立和连接管理:
bridge_run
|- bridge_init_ofproto ///Initialize the ofproto library
| |- ofproto_init
| |- ofproto_class->init
|
|- bridge_run__
| |- ofproto_type_run
| |- ofproto_run
| |- ofproto_class->run
| |- handle_openflow /// handle OpenFlow messages
|
|- bridge_reconfigure /// config bridge
netdev_run invokes the run callback of every netdev_class:
/* Performs periodic work needed by all the various kinds of netdevs.
*
* If your program opens any netdevs, it must call this function within its
* main poll loop. */
void
netdev_run(void)
OVS_EXCLUDED(netdev_mutex)
{
netdev_initialize();
struct netdev_registered_class *rc;
CMAP_FOR_EACH (rc, cmap_node, &netdev_classes) {
if (rc->class->run) {
rc->class->run(rc->class);//netdev_linux_class->netdev_linux_run
}
}
}
Taking netdev_linux_class as an example, netdev_linux_run obtains the state of virtual NICs via a netlink socket and updates their status.
///ofproto/ofproto.c
/* Attempts to add 'netdev' as a port on 'ofproto'. If 'ofp_portp' is
* non-null and '*ofp_portp' is not OFPP_NONE, attempts to use that as
* the port's OpenFlow port number.
*
* If successful, returns 0 and sets '*ofp_portp' to the new port's
* OpenFlow port number (if 'ofp_portp' is non-null). On failure,
* returns a positive errno value and sets '*ofp_portp' to OFPP_NONE (if
* 'ofp_portp' is non-null). */
int
ofproto_port_add(struct ofproto *ofproto, struct netdev *netdev,
ofp_port_t *ofp_portp)
{
ofp_port_t ofp_port = ofp_portp ? *ofp_portp : OFPP_NONE;
int error;
error = ofproto->ofproto_class->port_add(ofproto, netdev); ///ofproto_dpif_class(->port_add)
///...
port_add -> dpif_port_add:
///lib/dpif.c
int
dpif_port_add(struct dpif *dpif, struct netdev *netdev, odp_port_t *port_nop)
{
const char *netdev_name = netdev_get_name(netdev);
odp_port_t port_no = ODPP_NONE;
int error;
COVERAGE_INC(dpif_port_add);
if (port_nop) {
port_no = *port_nop;
}
error = dpif->dpif_class->port_add(dpif, netdev, &port_no); ///dpif_netlink_class
///...
lib/dpif-netlink.c: dpif_netlink_port_add -> dpif_netlink_port_add_compat -> dpif_netlink_port_add__:
static int
dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
enum ovs_vport_type type,
struct ofpbuf *options,
odp_port_t *port_nop)
OVS_REQ_WRLOCK(dpif->upcall_lock)
{
///...
dpif_netlink_vport_init(&request);
request.cmd = OVS_VPORT_CMD_NEW; ///new port
request.dp_ifindex = dpif->dp_ifindex;
request.type = type;
request.name = name;
request.port_no = *port_nop;
upcall_pids = vport_socksp_to_pids(socksp, dpif->n_handlers);
request.n_upcall_pids = socksp ? dpif->n_handlers : 1;
request.upcall_pids = upcall_pids;
if (options) {
request.options = options->data;
request.options_len = options->size;
}
error = dpif_netlink_vport_transact(&request, &reply, &buf);
///...
# ovs-vsctl add-port br-int vxlan1 -- set interface vxlan1 type=vxlan options:remote_ip=172.18.42.161
The datapath call chain:
ovs_vport_cmd_new
|-- new_vport
|-- ovs_vport_add
|-- vport_ops->create
ovs-ofctl add-flow br-int "in_port=2, nw_src=192.168.1.100, action=drop"
handle_openflow -> handle_openflow__ -> handle_flow_mod -> handle_flow_mod__ -> ofproto_flow_mod_start -> add_flow_start -> replace_rule_start -> ofproto_rule_insert__.
See Openvswitch原理与代码分析(7): 添加一条流表flow.
Whether the datapath is the in-kernel one or the DPDK-based userspace one, a flow table miss enters upcall processing. The upcall handler udpif_upcall_handler is initialized in udpif_start_threads, which also creates the udpif_revalidator threads:
recv_upcalls is the entry point of upcall processing:
recv_upcalls
|-- dpif_recv // (1) read packet from datapath
|-- upcall_receive // (2) associate packet with a ofproto
|-- process_upcall // (3) process packet by flow rule
|-- handle_upcalls // (4) add flow rule to datapath
process_upcall
|-- upcall_xlate
|-- xlate_actions
|-- rule_dpif_lookup_from_table
|-- do_xlate_actions
static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
struct xlate_ctx *ctx, bool is_last_action)
{
struct flow_wildcards *wc = ctx->wc;
struct flow *flow = &ctx->xin->flow;
const struct ofpact *a;
///...
///do each action
OFPACT_FOR_EACH (a, ofpacts, ofpacts_len) {
///...
switch (a->type) {
case OFPACT_OUTPUT: /// actions=output
xlate_output_action(ctx, ofpact_get_OUTPUT(a)->port,
ofpact_get_OUTPUT(a)->max_len, true, last,
false);
break;
case OFPACT_GROUP: /// actions=group
if (xlate_group_action(ctx, ofpact_get_GROUP(a)->group_id, last)) {
/* Group could not be found. */
/* XXX: Terminates action list translation, but does not
* terminate the pipeline. */
return;
}
break;
///...
case OFPACT_CT: /// actions=ct
compose_conntrack_action(ctx, ofpact_get_CT(a), last);
break;
The flow cache in OVS has two levels: the microflow cache and the megaflow cache. The former does exact matching using skb_buff->skb_hash; the latter does wildcard matching, implemented with the tuple space search (TSS) algorithm.
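The idea behind tuple space search can be sketched in a few lines of C (an illustrative simplification, not the OVS code: keys are fixed-size byte arrays, and each "tuple" is a mask plus the flows installed under it; real OVS hashes the masked key into a per-mask hash table instead of scanning linearly):

```c
#include <stdint.h>
#include <string.h>

#define KEY_LEN   4
#define MAX_FLOWS 16

struct flow  { uint8_t masked_key[KEY_LEN]; int action; };
struct tuple {
    uint8_t mask[KEY_LEN];          /* one mask = one "tuple" */
    struct flow flows[MAX_FLOWS];   /* flows sharing that mask */
    int n_flows;
};

/* Try each mask in turn: apply it to the packet key, then look for an
 * exact match among the flows installed under that mask. */
static const struct flow *tss_lookup(const struct tuple *tuples,
                                     int n_tuples,
                                     const uint8_t key[KEY_LEN]) {
    uint8_t masked[KEY_LEN];
    for (int t = 0; t < n_tuples; t++) {
        for (int i = 0; i < KEY_LEN; i++)
            masked[i] = key[i] & tuples[t].mask[i];   /* apply mask */
        for (int f = 0; f < tuples[t].n_flows; f++)   /* exact match */
            if (!memcmp(masked, tuples[t].flows[f].masked_key, KEY_LEN))
                return &tuples[t].flows[f];
    }
    return NULL;   /* miss -> upcall in OVS */
}
```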
OVS matches flows with sw_flow_key, which has quite a few fields, covering key information from L1 through L4:
So why did OVS choose TSS rather than some other lookup algorithm? Three reasons are given here:
///mask cache entry
struct mask_cache_entry {
u32 skb_hash;
u32 mask_index; /// mask index in flow_table->mask_array->masks
};
struct mask_array {
struct rcu_head rcu;
int count, max;
struct sw_flow_mask __rcu *masks[]; /// mask array
};
struct table_instance { /// hash table
struct flex_array *buckets; ///bucket array
unsigned int n_buckets;
struct rcu_head rcu;
int node_ver;
u32 hash_seed;
bool keep_flows;
};
struct flow_table {
struct table_instance __rcu *ti; ///hash table
struct table_instance __rcu *ufid_ti;
struct mask_cache_entry __percpu *mask_cache; ///microflow cache, find entry by skb_hash, 256 entries(MC_HASH_ENTRIES)
struct mask_array __rcu *mask_array; ///mask array
unsigned long last_rehash;
unsigned int count;
unsigned int ufid_count;
};
struct flow_table is the datapath's flow table. It has three main parts:
ti is the hash table backing the flow table, a typical hash-bucket + linked-list implementation. Its elements are sw_flow entries, each representing one flow:
/// flow table entry
struct sw_flow {
struct rcu_head rcu;
struct {
struct hlist_node node[2];
u32 hash;
} flow_table, ufid_table; /// hash table node
int stats_last_writer; /* CPU id of the last writer on
* 'stats[0]'.
*/
struct sw_flow_key key; ///key
struct sw_flow_id id;
struct cpumask cpu_used_mask;
struct sw_flow_mask *mask;
struct sw_flow_actions __rcu *sf_acts; ///action
struct flow_stats __rcu *stats[]; /* One for each CPU. First one
* is allocated at flow creation time,
* the rest are allocated on demand
* while holding the 'stats[0].lock'.
*/
};
mask_cache is the microflow cache, used for exact matching (EMC). It is a percpu array of 256 mask_cache_entry elements.
mask_cache_entry has two fields: skb_hash is compared against skb_buff->skb_hash, and mask_index is the index of the corresponding sw_flow_mask in the mask_array.
mask_array is an array of sw_flow_mask. Each sw_flow_mask represents a mask (Mask) indicating which fields of sw_flow_key must be matched:
Each bit of the mask will be set to 1 when a match is required on that bit position; otherwise, it will be 0.
struct sw_flow_key_range {
unsigned short int start;
unsigned short int end;
};
/// flow mask
struct sw_flow_mask {
int ref_count;
struct rcu_head rcu;
struct sw_flow_key_range range;
struct sw_flow_key key;
};
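A minimal sketch of how the mask and sw_flow_key_range are used, assuming the key is a flat byte buffer; the kernel's ovs_flow_mask_key similarly ANDs the key with mask->key, but only over [range.start, range.end), so hashing and comparison touch just the relevant bytes:

```python
def mask_key(key: bytes, mask: bytes, start: int, end: int) -> bytes:
    """AND the key with the mask, restricted to [start, end)."""
    return bytes(key[i] & mask[i] for i in range(start, end))

# Match only the first two bytes; the rest of the key is wildcarded.
mask = bytes([0xff, 0xff, 0x00, 0x00])
pkt1 = bytes([0x0a, 0x01, 0x11, 0x22])
pkt2 = bytes([0x0a, 0x01, 0x33, 0x44])

# Packets differing only in wildcarded bytes produce the same masked key,
# so they hash to the same bucket and match the same megaflow entry.
assert mask_key(pkt1, mask, 0, 4) == mask_key(pkt2, mask, 0, 4)
```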
The goal of a flow table lookup is to build a sw_flow_key from the skb_buff and find the matching flow entry (sw_flow) in the flow table.
ovs_flow_tbl_lookup_stats is the entry point for the lookup. Its parameters are the flow table, the sw_flow_key to match, and the skb_hash used for exact matching:
/*
* mask_cache maps a flow to its probable mask. This cache is not a tightly
* coupled cache, i.e. updates to the mask list can result in inconsistent
* entries in the mask cache.
* This is per cpu cache and is divided in MC_HASH_SEGS segments.
* In case of a hash collision the entry is hashed in next segment.
*/
struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
const struct sw_flow_key *key,
u32 skb_hash,
u32 *n_mask_hit)
{
struct mask_array *ma = rcu_dereference(tbl->mask_array);
struct table_instance *ti = rcu_dereference(tbl->ti);
struct mask_cache_entry *entries, *ce;
struct sw_flow *flow;
u32 hash;
int seg;
*n_mask_hit = 0;
if (unlikely(!skb_hash)) {
u32 mask_index = 0;
return flow_lookup(tbl, ti, ma, key, n_mask_hit, &mask_index);
}
/* Pre and post recirculation flows usually have the same skb_hash
* value. To avoid hash collisions, rehash the 'skb_hash' with
* 'recirc_id'. */
if (key->recirc_id)
skb_hash = jhash_1word(skb_hash, key->recirc_id);
ce = NULL;
hash = skb_hash;
entries = this_cpu_ptr(tbl->mask_cache);
/* Find the cache entry 'ce' to operate on. */
for (seg = 0; seg < MC_HASH_SEGS; seg++) { ///find in cache
int index = hash & (MC_HASH_ENTRIES - 1); ///skb_hash -> index
struct mask_cache_entry *e;
e = &entries[index];
if (e->skb_hash == skb_hash) {
flow = flow_lookup(tbl, ti, ma, key, n_mask_hit,
&e->mask_index);
if (!flow)
e->skb_hash = 0;
return flow;
}
if (!ce || e->skb_hash < ce->skb_hash)
ce = e; /* A better replacement cache candidate. */
hash >>= MC_HASH_SHIFT;
}
/* Cache miss, do full lookup. */
flow = flow_lookup(tbl, ti, ma, key, n_mask_hit, &ce->mask_index);
if (flow)
ce->skb_hash = skb_hash;
return flow;
}
As the code shows, ovs_flow_tbl_lookup_stats first tries to locate a mask_cache_entry by skb_hash. On a hit, it passes mask_index to flow_lookup, which directly tries the sw_flow_mask at that index of mask_array. On a miss, flow_lookup walks the whole mask_array, trying each mask in turn:
/* Flow lookup does full lookup on flow table. It starts with
* mask from index passed in *index.
*/
static struct sw_flow *flow_lookup(struct flow_table *tbl,
struct table_instance *ti,
const struct mask_array *ma,
const struct sw_flow_key *key,
u32 *n_mask_hit,
u32 *index)
{
struct sw_flow_mask *mask;
struct sw_flow *flow;
int i;
if (*index < ma->max) { /// get mask by index
mask = rcu_dereference_ovsl(ma->masks[*index]);
if (mask) {
flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
if (flow)
return flow;
}
}
for (i = 0; i < ma->max; i++) { /// try every mask in the array
if (i == *index)
continue;
mask = rcu_dereference_ovsl(ma->masks[i]);
if (!mask)
continue;
flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
if (flow) { /* Found */
*index = i;
return flow;
}
}
return NULL;
}
masked_flow_lookup applies the mask to the sw_flow_key, computes a hash from the masked key, finds the corresponding bucket, then walks the bucket's list comparing each sw_flow entry, returning the matching flow if found.
static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
const struct sw_flow_key *unmasked,
const struct sw_flow_mask *mask,
u32 *n_mask_hit)
{
struct sw_flow *flow;
struct hlist_head *head;
u32 hash;
struct sw_flow_key masked_key;
ovs_flow_mask_key(&masked_key, unmasked, false, mask);
hash = flow_hash(&masked_key, &mask->range); ///mask key -> hash
head = find_bucket(ti, hash); ///hash -> bucket
(*n_mask_hit)++;
hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver]) { /// list
if (flow->mask == mask && flow->flow_table.hash == hash &&
flow_cmp_masked_key(flow, &masked_key, &mask->range)) ///compare key
return flow;
}
return NULL;
}
On node2:
mkdir -p /tmp/www
echo "i am vm1" > /tmp/www/index.html
cd /tmp/www
ip netns exec vm1 python -m SimpleHTTPServer 8000
On node3:
mkdir -p /tmp/www
echo "i am vm2" > /tmp/www/index.html
cd /tmp/www
ip netns exec vm2 python -m SimpleHTTPServer 8000
First, configure the LB rule. The VIP 172.18.1.254 is an IP on the Physical Network. Run on node1:
uuid=`ovn-nbctl create load_balancer vips:172.18.1.254="192.168.100.10,192.168.100.11"`
echo $uuid
This creates a corresponding row in the load_balancer table of the Northbound DB:
[root@node1 ~]# ovn-nbctl list load_balancer
_uuid : 3760718f-294f-491a-bf63-3faf3573d44c
external_ids : {}
name : ""
protocol : []
vips : {"172.18.1.254"="192.168.100.10,192.168.100.11"}
Using the Gateway Router as the Load Balancer. On node1:
ovn-nbctl set logical_router gw1 load_balancer=$uuid
[root@node1 ~]# ovn-nbctl lb-list
UUID LB PROTO VIP IPs
3760718f-294f-491a-bf63-3faf3573d44c tcp/udp 172.18.1.254 192.168.100.10,192.168.100.11
[root@node1 ~]# ovn-nbctl lr-lb-list gw1
UUID LB PROTO VIP IPs
3760718f-294f-491a-bf63-3faf3573d44c tcp/udp 172.18.1.254 192.168.100.10,192.168.100.11
Access the LB (client -> vip):
[root@client ~]# curl http://172.18.1.254:8000
i am vm2
[root@client ~]# curl http://172.18.1.254:8000
i am vm1
The load balancer does not perform any sort of health checking. At present, the assumption is that health checks will be performed by an orchestration solution such as Kubernetes, but it is reasonable to assume that this feature will be added at some future point.
Delete the LB:
ovn-nbctl clear logical_router gw1 load_balancer
ovn-nbctl destroy load_balancer $uuid
Configuring a Load Balancer on a Logical Switch. For easier testing, create another logical switch, ls2 (192.168.101.0/24):
# create the logical switch
ovn-nbctl ls-add ls2
# create logical port
ovn-nbctl lsp-add ls2 ls2-vm3
ovn-nbctl lsp-set-addresses ls2-vm3 02:ac:10:ff:00:33
ovn-nbctl lsp-set-port-security ls2-vm3 02:ac:10:ff:00:33
# create logical port
ovn-nbctl lsp-add ls2 ls2-vm4
ovn-nbctl lsp-set-addresses ls2-vm4 02:ac:10:ff:00:44
ovn-nbctl lsp-set-port-security ls2-vm4 02:ac:10:ff:00:44
# create router port for the connection to 'ls2'
ovn-nbctl lrp-add router1 router1-ls2 02:ac:10:ff:01:01 192.168.101.1/24
# create the 'ls2' switch port for connection to 'router1'
ovn-nbctl lsp-add ls2 ls2-router1
ovn-nbctl lsp-set-type ls2-router1 router
ovn-nbctl lsp-set-addresses ls2-router1 02:ac:10:ff:01:01
ovn-nbctl lsp-set-options ls2-router1 router-port=router1-ls2
# ovn-nbctl show
switch b97eb754-ea59-41f6-b435-ea6ada4659d1 (ls2)
port ls2-vm3
addresses: ["02:ac:10:ff:00:33"]
port ls2-router1
type: router
addresses: ["02:ac:10:ff:01:01"]
router-port: router1-ls2
port ls2-vm4
addresses: ["02:ac:10:ff:00:44"]
Create vm3 on node2:
ip netns add vm3
ovs-vsctl add-port br-int vm3 -- set interface vm3 type=internal
ip link set vm3 netns vm3
ip netns exec vm3 ip link set vm3 address 02:ac:10:ff:00:33
ip netns exec vm3 ip addr add 192.168.101.10/24 dev vm3
ip netns exec vm3 ip link set vm3 up
ip netns exec vm3 ip route add default via 192.168.101.1 dev vm3
ovs-vsctl set Interface vm3 external_ids:iface-id=ls2-vm3
On node1:
uuid=`ovn-nbctl create load_balancer vips:10.254.10.10="192.168.100.10,192.168.100.11"`
echo $uuid
Set the LB for ls2:
ovn-nbctl set logical_switch ls2 load_balancer=$uuid
ovn-nbctl get logical_switch ls2 load_balancer
Result:
# ovn-nbctl ls-lb-list ls2
UUID LB PROTO VIP IPs
a19bece1-52bf-4555-89f4-257534c0b9d9 tcp/udp 10.254.10.10 192.168.100.10,192.168.100.11
Test (vm3 -> VIP):
[root@node2 ~]# ip netns exec vm3 curl 10.254.10.10:8000
i am vm2
[root@node2 ~]# ip netns exec vm3 curl 10.254.10.10:8000
i am vm1
Note that if the LB is instead set on ls1, vm3 cannot reach the VIP. This shows that the LB applies on the client's switch, not the server's switch.
This highlights the requirement that load balancing be applied on the client’s logical switch rather than the server’s logical switch.
Delete the LB:
ovn-nbctl clear logical_switch ls2 load_balancer
ovn-nbctl destroy load_balancer $uuid
vm3 -> 192.168.101.1:
ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=192.168.101.10,nw_dst=192.168.101.1
vm3 -> vm1:
ovs-appctl ofproto/trace br-int in_port=14,ip,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=192.168.101.10,nw_dst=192.168.100.10,nw_ttl=32
ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=192.168.101.10,nw_dst=192.168.100.10,nw_ttl=32
OVN's implementation relies on OVS NAT rules:
17. ct_state=+new+trk,ip,metadata=0xa,nw_dst=10.254.10.10, priority 110, cookie 0x30d9e9b5
group:1
ct(commit,table=18,zone=NXM_NX_REG13[0..15],nat(dst=192.168.100.10))
nat(dst=192.168.100.10)
-> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 18.
The installed flow:
table=17, n_packets=10, n_bytes=740, idle_age=516, priority=110,ct_state=+new+trk,ip,metadata=0xa,nw_dst=10.254.10.10 actions=group:1
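The group:1 action picks one backend bucket per connection. How a select group spreads connections can be sketched as below; this is an illustrative model only: the actual bucket choice is made inside OVS (e.g. via dp_hash), and the hash function here is an assumption:

```python
import hashlib

BACKENDS = ["192.168.100.10", "192.168.100.11"]  # the VIP's member IPs

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Hash the 5-tuple so one connection always lands on one backend."""
    five_tuple = f"{proto}:{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    h = int.from_bytes(hashlib.sha256(five_tuple).digest()[:4], "big")
    return BACKENDS[h % len(BACKENDS)]

# A stable choice matters: the ct(commit, ... nat(dst=...)) action records the
# chosen backend in conntrack, and reply packets are un-NATed against it.
b1 = pick_backend("172.18.1.10", 41000, "172.18.1.254", 8000)
b2 = pick_backend("172.18.1.10", 41000, "172.18.1.254", 8000)
assert b1 == b2 and b1 in BACKENDS
```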
After changing the MAC of ls2-vm3 with ovn-nbctl lsp-set-addresses ls2-vm3 02:ac:10:ff:00:33, vm3 could not ping 192.168.101.1. Debugging with ovs-appctl ofproto/trace:
# ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=192.168.101.10,nw_dst=192.168.101.1
...
14. ip,metadata=0x6, priority 0, cookie 0x986c8ad3
push:NXM_NX_REG0[]
push:NXM_NX_XXREG0[96..127]
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xc0a8650a
set_field:00:00:00:00:00:00->eth_dst
resubmit(,66)
66. reg0=0xc0a8650a,reg15=0x3,metadata=0x6, priority 100
set_field:02:ac:10:ff:00:13->eth_dst
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xc0a8650a
The table 66 rule sets the destination MAC to 02:ac:10:ff:00:13:
[root@node2 ovs]# ovs-ofctl dump-flows br-int | grep table=66
table=66, n_packets=0, n_bytes=0, idle_age=8580, priority=100,reg0=0xc0a8640b,reg15=0x1,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:22
table=66, n_packets=18, n_bytes=1764, idle_age=1085, priority=100,reg0=0xc0a8640a,reg15=0x1,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:11
table=66, n_packets=33, n_bytes=3234, idle_age=1283, priority=100,reg0=0xc0a8650a,reg15=0x3,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:13
table=66, n_packets=0, n_bytes=0, idle_age=8580, priority=100,reg0=0,reg1=0,reg2=0,reg3=0,reg15=0x3,metadata=0x6 actions=mod_dl_dst:00:00:00:00:00:00
Modify the third rule above:
[root@node2 ovs]# ovs-ofctl del-flows br-int "table=66,reg0=0xc0a8650a,reg15=0x3,metadata=0x6"
[root@node2 ovs]# ovs-ofctl add-flow br-int "table=66,reg0=0xc0a8650a,reg15=0x3,metadata=0x6,actions=mod_dl_dst:02:ac:10:ff:00:33"
This fixes the problem.
OVN is a subproject that the OVS community announced only in January 2015, yet it already supports many features:
Logical switches: logical switches, used for L2 forwarding.
L2/L3/L4 ACLs: ACLs at layers 2 through 4, controlling access by MAC address, IP address, and port number.
Logical routers: distributed logical routers, used for L3 forwarding.
Multiple tunnel overlays: several tunnel encapsulations are supported: Geneve, STT, and VXLAN.
TOR switch or software logical switch gateways: a hardware TOR switch or a software logical switch can serve as the gateway connecting the physical and virtual networks.
CMS
|
|
+-----------|-----------+
| | |
| OVN/CMS Plugin |
| | |
| | |
| OVN Northbound DB |
| | |
| | |
| ovn-northd |
| | |
+-----------|-----------+
|
|
+-------------------+
| OVN Southbound DB |
+-------------------+
|
|
+------------------+------------------+
| | |
HV 1 | | HV n |
+---------------|---------------+ . +---------------|---------------+
| | | . | | |
| ovn-controller | . | ovn-controller |
| | | | . | | | |
| | | | | | | |
| ovs-vswitchd ovsdb-server | | ovs-vswitchd ovsdb-server |
| | | |
+-------------------------------+ +-------------------------------+
The figure above shows OVN's overall architecture. At the top, the OVN/CMS plugin is the interface between the CMS (Cloud Management System) and OVN; it translates the CMS configuration into OVN's format and writes it into the Northbound DB.
For details, see ovn-architecture - Open Virtual Network architecture.
The Northbound DB stores logical data, largely independent of the physical network: logical switch, logical router, ACL, logical port, and so on, consistent with traditional network device concepts.
The Northbound DB process:
root 29981 1 0 Dec27 ? 00:00:00 ovsdb-server: monitoring pid 29982 (healthy)
root 29982 29981 2 Dec27 ? 00:35:53 ovsdb-server --detach --monitor -vconsole:off --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --remote=db:OVN_Northbound,NB_Global,connections --unixctl=ovnnb_db.ctl --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/openvswitch/ovnnb_db.db
Viewing ovn-northd's logical data (logical switch, logical port, logical router):
# ovn-nbctl show
switch 707dcb98-baa0-4ac5-8955-1ce4de2f780f (kube-master)
port stor-kube-master
type: router
addresses: ["00:00:00:BF:CC:B1"]
router-port: rtos-kube-master
port k8s-kube-master
addresses: ["66:ce:60:08:9e:cd 192.168.1.2"]
...
ovn-northd acts like a centralized controller: it translates the data in the Northbound DB and writes the result into the Southbound DB.
# start ovn northd
/usr/share/openvswitch/scripts/ovn-ctl start_northd
The ovn-northd process:
root 29996 1 0 Dec27 ? 00:00:00 ovn-northd: monitoring pid 29997 (healthy)
root 29997 29996 25 Dec27 ? 06:52:54 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=unix:/var/run/openvswitch/ovnnb_db.sock --ovnsb-db=unix:/var/run/openvswitch/ovnsb_db.sock --no-chdir --log-file=/var/log/openvswitch/ovn-northd.log --pidfile=/var/run/openvswitch/ovn-northd.pid --detach --monitor
The Southbound DB process:
root 32441 1 0 Dec28 ? 00:00:00 ovsdb-server: monitoring pid 32442 (healthy)
root 32442 32441 0 Dec28 ? 00:01:45 ovsdb-server --detach --monitor -vconsole:off --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --remote=db:OVN_Southbound,SB_Global,connections --unixctl=ovnsb_db.ctl --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/openvswitch/ovnsb_db.db
The data in the Southbound DB has entirely different semantics from the Northbound DB. It holds three kinds of data: physical network data, such as the IP address and tunnel encapsulation format of each HV (hypervisor); logical network data, such as how packets are forwarded within the logical network; and bindings between the two, such as which HV a logical port is attached to.
# ovn-sbctl show
Chassis "7f99371a-d51c-478c-8de2-facd70e2f739"
hostname: "kube-node2"
Encap vxlan
ip: "172.17.42.32"
options: {csum="true"}
Chassis "069367d8-8e07-4b81-b057-38cf6b21b2b7"
hostname: "kube-node3"
Encap vxlan
ip: "172.17.42.33"
options: {csum="true"}
ovn-controller is OVN's agent, similar to neutron's ovs-agent; it runs on every HV. Northbound, it writes physical network information into the Southbound DB; southbound, it translates data from the Southbound DB into OpenFlow flows and programs them into the local OVS tables to forward packets.
# start ovs
/usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
# start ovn-controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller
The ovn-controller process:
root 13423 1 0 Dec26 ? 00:00:00 ovn-controller: monitoring pid 13424 (healthy)
root 13424 13423 82 Dec26 ? 1-15:53:23 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controller.pid --detach --monitor
ovs-vswitchd and ovsdb-server are the two OVS processes.
The Northbound DB is the interface between OVN and the CMS. Almost everything in it is produced by the CMS; ovn-northd watches the database for changes, translates them, and saves the result into the Southbound DB.
The Northbound DB contains mainly the following tables:
Logical_Switch: each row represents a logical switch. There are two kinds: overlay logical switches correspond to neutron networks (networking-ovn adds a row here for every neutron network created), while bridged logical switches connect the physical and logical networks and are used by VTEP gateways. Logical_Switch records the logical ports it contains (pointing to the Logical_Switch_Port table) and the ACLs applied to it (pointing to the ACL table).
ovn-nbctl list Logical_Switch shows the rows of the Logical_Switch table:
# ovn-nbctl --db tcp:172.17.42.30:6641 list Logical_Switch
_uuid : 707dcb98-baa0-4ac5-8955-1ce4de2f780f
acls : []
dns_records : []
external_ids : {gateway_ip="192.168.1.1/24"}
load_balancer : [4522d0fa-9d46-4165-9524-51d20a35ea0a, 5842a5a9-6c8e-4a87-be3c-c8a0bc271626]
name : kube-master
other_config : {subnet="192.168.1.0/24"}
ports : [44222421-c811-4f38-8ea6-5504a35df703, ee5a5e97-c41d-4656-bd2a-8bc8ad180188]
qos_rules : []
...
Logical_Switch_Port: each row represents a logical port. networking-ovn adds a row here for every neutron port created. Each row records the port type (e.g. patch port, localnet port), the port's IP and MAC addresses, and the port's UP/Down state.
# ovn-nbctl --db tcp:172.17.42.30:6641 list Logical_Switch_Port
_uuid : 44222421-c811-4f38-8ea6-5504a35df703
addresses : ["00:00:00:BF:CC:B1"]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {}
name : stor-kube-master
options : {router-port=rtos-kube-master}
parent_name : []
port_security : []
tag : []
tag_request : []
type : router
up : false
...
ACL: each row represents an ACL rule applied to a logical switch. If none of the ports on a logical switch is configured with a security group, no ACL is applied to that switch. Each ACL rule contains the match, the direction, and the action.
Logical_Router: each row represents a logical router. networking-ovn adds a row here for every neutron router created; each row records the logical router ports the router contains.
# ovn-nbctl --db tcp:172.17.42.30:6641 list Logical_Router
_uuid : e12293ba-e61e-40bf-babc-8580d1121641
enabled : []
external_ids : {"k8s-cluster-router"=yes}
load_balancer : []
name : kube-master
nat : []
options : {}
ports : [2cbbbb8e-6b5d-44d5-9693-b4069ca9e12a, 3a046f60-161a-4fee-a1b3-d9d3043509d2, 40d3d95d-906b-483b-9b71-1fa6970de6e8, 840ab648-6436-4597-aeff-f84fbc44e3a9, b08758e5-7017-413f-b3db-ff68f49460a4]
static_routes : []
Logical_Router_Port: each row represents a logical router port. networking-ovn adds a row here for every router interface created; it mainly stores the port's IP and MAC.
# ovn-nbctl --db tcp:172.17.42.30:6641 list Logical_Router_Port
_uuid : 840ab648-6436-4597-aeff-f84fbc44e3a9
enabled : []
external_ids : {}
gateway_chassis : []
mac : "00:00:00:BF:CC:B1"
name : rtos-kube-master
networks : ["192.168.1.1/24"]
options : {}
peer : []
For more, see ovn-nb - OVN_Northbound database schema.
The Southbound DB sits at the center of the OVN architecture. It is a critical component that interacts with all the other OVN components.
The Southbound DB contains the following tables:
Chassis is a concept OVN introduces; OVS has no such notion. A Chassis can be an HV or a VTEP gateway.
Chassis: each row represents an HV or a VTEP gateway and is populated by ovn-controller/ovn-controller-vtep. It contains the chassis name and the encapsulations the chassis supports (pointing to the Encap table). If the chassis is a VTEP gateway, the OVN-related logical switches on the gateway are also recorded here.
# ovn-sbctl list Chassis
_uuid : 3dec4aa7-8f15-493d-89f4-4a260b510bbd
encaps : [bc324cd4-56f2-4f73-af9e-149b7401e0d2]
external_ids : {datapath-type="", iface-types="geneve,gre,internal,lisp,patch,stt,system,tap,vxlan", ovn-bridge-mappings=""}
hostname : "kube-node1"
name : "c7889c47-2d18-4dd4-a3b7-446d42b79f79"
nb_cfg : 34
vtep_logical_switches: []
...
Encap: stores the tunnel type and the tunnel endpoint IP address.
# ovn-sbctl list Encap
_uuid : bc324cd4-56f2-4f73-af9e-149b7401e0d2
chassis_name : "c7889c47-2d18-4dd4-a3b7-446d42b79f79"
ip : "172.17.42.31"
options : {csum="true"}
type : vxlan
...
Logical_Flow: each row represents a logical flow. ovn-northd derives this table from the L2/L3 topology and ACL information in the Northbound DB; ovn-controller converts these logical flows into OVS flows and programs them into the OVS tables on the HV. A logical flow mainly contains the match, the direction, the priority, the table ID, and the actions.
# ovn-sbctl lflow-list
Datapath: "kube-node1" (2c3caa57-6a58-4416-9bd2-3e2982d83cf1) Pipeline: ingress
table=0 (ls_in_port_sec_l2 ), priority=100 , match=(eth.src[40]), action=(drop;)
table=0 (ls_in_port_sec_l2 ), priority=100 , match=(vlan.present), action=(drop;)
table=0 (ls_in_port_sec_l2 ), priority=50 , match=(inport == "default_sshd-2"), action=(next;)
table=0 (ls_in_port_sec_l2 ), priority=50 , match=(inport == "k8s-kube-node1"), action=(next;)
table=0 (ls_in_port_sec_l2 ), priority=50 , match=(inport == "stor-kube-node1"), action=(next;)
....
Multicast_Group: each row represents a multicast group. Forwarding of multicast and broadcast packets is driven by this table. It records the datapath the group belongs to, the ports in the group, and the tunnel_key that represents the logical egress port.
Datapath_Binding: each row represents the binding between a datapath and the physical network, one row per logical switch and per logical router. It mainly records the tunnel_key that OVN assigns to the datapath as its logical datapath identifier.
Example:
# ovn-sbctl list Datapath_Binding
_uuid : 4cfe0e4c-1bbb-406a-9d85-e7bc24c818d0
external_ids : {logical-router="e12293ba-e61e-40bf-babc-8580d1121641", name=kube-master}
tunnel_key : 1
_uuid : 5ec4f962-77a8-44e8-ae01-5b7e46b6a286
external_ids : {logical-switch="e865aa50-7510-4b7f-9df4-b82801a8e92b", name=join}
tunnel_key : 2
_uuid : 2c3caa57-6a58-4416-9bd2-3e2982d83cf1
external_ids : {logical-switch="7c41601a-dcd5-4e77-b0e8-ca8692d7462b", name="kube-node1"}
tunnel_key : 4
The Port_Binding table determines which chassis a logical port resides on. Each row mainly contains the logical port's MAC and IP addresses, the port type, the datapath binding the port belongs to, the tunnel_key that represents the logical input/output port identifier, and the chassis the port resides on. The chassis column is set by ovn-controller/ovn-controller-vtep; the remaining columns are set by ovn-northd.
Example:
# ovn-sbctl list Port_Binding
_uuid : 5e5746d8-3533-45a8-8abe-5a7028c97afa
chassis : []
datapath : 2c3caa57-6a58-4416-9bd2-3e2982d83cf1
external_ids : {}
gateway_chassis : []
logical_port : "stor-kube-node1"
mac : ["00:00:00:18:22:18"]
nat_addresses : []
options : {peer="rtos-kube-node1"}
parent_port : []
tag : []
tunnel_key : 2
type : patch
The Chassis and Encap tables contain physical network data; the Logical_Flow and Multicast_Group tables contain logical network data; the Datapath_Binding and Port_Binding tables contain the bindings between the logical and physical networks.
OVN supports three tunnel types: Geneve, STT, and VXLAN. Traffic between HVs can use only Geneve and STT; traffic between an HV and a VTEP gateway can additionally use VXLAN, for compatibility with hardware VTEP gateways, since most of them support only VXLAN.
Although VXLAN is the tunnel technology commonly used in data centers, the VXLAN header is fixed and carries only a VNID (VXLAN network identifier); it cannot carry additional information through the tunnel. OVN therefore chose Geneve and STT: the Geneve header has an option field in TLV format that users can extend as needed, and the STT header can carry 64 bits of data, far more than VXLAN's 24 bits.
OVN tunnel encapsulation uses three identifiers:
Logical datapath identifier: datapath is an OVS concept; packets are handed to a datapath for processing. Each datapath corresponds to an OVN logical switch or logical router, so this identifier is similar to a tunnel ID. It is 24 bits, allocated by ovn-northd, globally unique, and stored in the tunnel_key column of the Southbound DB's Datapath_Binding table (see the earlier example).
Logical input port identifier: identifies the port through which a packet enters the logical datapath. It is 15 bits, allocated by ovn-northd, and unique within each datapath. The usable range is 1-32767; 0 is reserved for internal use. It is stored in the tunnel_key column of the Southbound DB's Port_Binding table.
Logical output port identifier: identifies the port through which a packet leaves the logical datapath. It is 16 bits; the range 0-32767 has the same meaning as for the logical input port identifier, and 32768-65535 is reserved for multicast groups. For each logical port, the input port identifier and the output port identifier are equal.
If the tunnel type is Geneve, the VNI field of the Geneve header carries the logical datapath identifier, and the option field carries the logical input port identifier and logical output port identifier: the TLV class is 0xffff, the type is 0, and the value is 0 (1 bit) + logical input port identifier (15 bits) + logical output port identifier (16 bits). For details, see Geneve: Generic Network Virtualization Encapsulation.
Geneve Option:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Class | Type |R|R|R| Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Variable Option Data |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
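The bit layout described above can be sketched as follows; the widths come from the text (24-bit VNI; option value = 1 reserved bit + 15-bit logical input port + 16-bit logical output port):

```python
def pack_ovn_metadata(datapath_key: int, in_port: int, out_port: int):
    """Pack the three OVN identifiers into the Geneve VNI + option value."""
    assert datapath_key < (1 << 24)  # logical datapath identifier: 24 bits
    assert in_port < (1 << 15)       # logical input port identifier: 15 bits
    assert out_port < (1 << 16)      # logical output port identifier: 16 bits
    vni = datapath_key
    option_value = (in_port << 16) | out_port  # top (reserved) bit stays 0
    return vni, option_value

def unpack_option(option_value: int):
    """Mirror the split done on ingress: bits 16..30 and bits 0..15."""
    in_port = (option_value >> 16) & 0x7fff
    out_port = option_value & 0xffff
    return in_port, out_port

vni, opt = pack_ovn_metadata(datapath_key=4, in_port=2, out_port=3)
assert vni == 4 and unpack_option(opt) == (2, 3)
```

The unpacking here corresponds to the ingress flow shown below, which moves TUN_METADATA0[16..30] into reg14 and TUN_METADATA0[0..15] into reg15.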
OVS performs tunnel encapsulation via OpenFlow flows, so ovn-controller writes these three identifiers into the local HV's OpenFlow flow table. Every packet entering br-int carries the three attributes: the logical datapath identifier and the logical input port identifier are assigned on ingress and stored in the OpenFlow metadata field and the OVS extension register reg14, respectively. After the packet has traversed the OVS pipeline, sending it out a particular port only requires writing the logical output port identifier into the OVS extension register reg15.
Example (port 6 is the Geneve tunnel interface):
table=0,in_port=6 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)
As shown, NXM_NX_TUN_ID carries the tunnel_key, reg14 is 15 bits, and reg15 is 16 bits.
The logical input port identifier and logical output port identifier carried in the OVN tunnel improve flow lookup efficiency: OVS flows can match on these two values directly, without parsing packet header fields.
The OVN Gateway connects the overlay network to the physical network. It supports two modes: layer 2, which bridges an OVN logical switch into a VLAN, and layer 3, which provides a routed connection between an OVN router and the physical network.
Unlike a distributed logical router (DLR), an OVN gateway router is centralized on a single host (chassis) so that it may provide services which cannot yet be distributed (NAT, load balancing, etc…). As of this publication there is a restriction that gateway routers may only connect to other routers via a logical switch, whereas DLRs may connect to one another directly via a peer link. Work is in progress to remove this restriction.
It should be noted that it is possible to have multiple gateway routers tied into an environment, which means that it is possible to perform ingress ECMP routing into logical space. However, it is worth mentioning that OVN currently does not support egress ECMP between gateway routers. Again, this is being looked at as a future enhancement.
Environment:
Network topology:
_________
| client | 172.18.1.10/16 Physical Network
---------
____|____
| switch | outside
---------
|
____|____
| router | gw1 port 'gw1-outside': 172.18.1.2/16
--------- port 'gw1-join': 192.168.255.1/24
____|____
| switch | join 192.168.255.0/24
---------
____|____
| router | router1 port 'router1-join': 192.168.255.2/24
--------- port 'router1-ls1': 192.168.100.1/24
|
____|____
| switch | ls1 192.168.100.0/24
---------
/ \
_______/_ _\_______
| vm1 | | vm2 |
--------- ---------
192.168.100.10 192.168.100.11
The switch connecting vm1/vm2:
[root@node1 ~]# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
port ls1-vm1
addresses: ["02:ac:10:ff:00:11"]
port ls1-vm2
addresses: ["02:ac:10:ff:00:22"]
[root@node1 ~]# ovn-sbctl show
Chassis "dc82b489-22b3-42dd-a28e-f25439316356"
hostname: "node1"
Encap geneve
ip: "172.17.42.160"
options: {csum="true"}
Chassis "598fec44-5787-452f-b527-2ef8c4adb942"
hostname: "node2"
Encap geneve
ip: "172.17.42.161"
options: {csum="true"}
Port_Binding "ls1-vm1"
Chassis "54292ae7-c91c-423b-a936-5b416d6bae9f"
hostname: "node3"
Encap geneve
ip: "172.17.42.162"
options: {csum="true"}
Port_Binding "ls1-vm2"
# add the router
ovn-nbctl lr-add router1
# create router port for the connection to 'ls1'
ovn-nbctl lrp-add router1 router1-ls1 02:ac:10:ff:00:01 192.168.100.1/24
# create the 'ls1' switch port for connection to 'router1'
ovn-nbctl lsp-add ls1 ls1-router1
ovn-nbctl lsp-set-type ls1-router1 router
ovn-nbctl lsp-set-addresses ls1-router1 02:ac:10:ff:00:01
ovn-nbctl lsp-set-options ls1-router1 router-port=router1-ls1
Logical network:
# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
port ls1-router1
type: router
addresses: ["02:ac:10:ff:00:01"]
router-port: router1-ls1
port ls1-vm1
addresses: ["02:ac:10:ff:00:11"]
port ls1-vm2
addresses: ["02:ac:10:ff:00:22"]
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
port router1-ls1
mac: "02:ac:10:ff:00:01"
networks: ["192.168.100.1/24"]
vm1 pings 192.168.100.1:
# ip netns exec vm1 ping -c 1 192.168.100.1
PING 192.168.100.1 (192.168.100.1) 56(84) bytes of data.
64 bytes from 192.168.100.1: icmp_seq=1 ttl=254 time=0.275 ms
Deploy the Gateway router on node1; when creating the router, this is specified via options:chassis={chassis_uuid}:
# create router 'gw1'
ovn-nbctl create Logical_Router name=gw1 options:chassis=dc82b489-22b3-42dd-a28e-f25439316356
# create a new logical switch for connecting the 'gw1' and 'router1' routers
ovn-nbctl ls-add join
# connect 'gw1' to the 'join' switch
ovn-nbctl lrp-add gw1 gw1-join 02:ac:10:ff:ff:01 192.168.255.1/24
ovn-nbctl lsp-add join join-gw1
ovn-nbctl lsp-set-type join-gw1 router
ovn-nbctl lsp-set-addresses join-gw1 02:ac:10:ff:ff:01
ovn-nbctl lsp-set-options join-gw1 router-port=gw1-join
# 'router1' to the 'join' switch
ovn-nbctl lrp-add router1 router1-join 02:ac:10:ff:ff:02 192.168.255.2/24
ovn-nbctl lsp-add join join-router1
ovn-nbctl lsp-set-type join-router1 router
ovn-nbctl lsp-set-addresses join-router1 02:ac:10:ff:ff:02
ovn-nbctl lsp-set-options join-router1 router-port=router1-join
# add static routes
ovn-nbctl lr-route-add gw1 "192.168.100.0/24" 192.168.255.2
ovn-nbctl lr-route-add router1 "0.0.0.0/0" 192.168.255.1
# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
port ls1-router1
type: router
addresses: ["02:ac:10:ff:00:01"]
router-port: router1-ls1
port ls1-vm1
addresses: ["02:ac:10:ff:00:11"]
port ls1-vm2
addresses: ["02:ac:10:ff:00:22"]
switch d4b119e9-0298-42ab-8cc7-480292231953 (join)
port join-gw1
type: router
addresses: ["02:ac:10:ff:ff:01"]
router-port: gw1-join
port join-router1
type: router
addresses: ["02:ac:10:ff:ff:02"]
router-port: router1-join
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
port router1-ls1
mac: "02:ac:10:ff:00:01"
networks: ["192.168.100.1/24"]
port router1-join
mac: "02:ac:10:ff:ff:02"
networks: ["192.168.255.2/24"]
router f29af2c3-e9d1-46f9-bff7-b9b8f0fd56df (gw1)
port gw1-join
mac: "02:ac:10:ff:ff:01"
networks: ["192.168.255.1/24"]
From vm1 on node2, access gw1:
[root@node2 ~]# ip netns exec vm1 ip route add default via 192.168.100.1
[root@node2 ~]# ip netns exec vm1 ping -c 1 192.168.255.1
PING 192.168.255.1 (192.168.255.1) 56(84) bytes of data.
64 bytes from 192.168.255.1: icmp_seq=1 ttl=253 time=2.18 ms
Here the physical network is assumed to be 172.18.0.0/16:
# create new port on router 'gw1'
ovn-nbctl lrp-add gw1 gw1-outside 02:0a:7f:18:01:02 172.18.1.2/16
# create new logical switch and connect it to 'gw1'
ovn-nbctl ls-add outside
ovn-nbctl lsp-add outside outside-gw1
ovn-nbctl lsp-set-type outside-gw1 router
ovn-nbctl lsp-set-addresses outside-gw1 02:0a:7f:18:01:02
ovn-nbctl lsp-set-options outside-gw1 router-port=gw1-outside
# create a bridge for eth1 (run on 'node1')
ovs-vsctl add-br br-eth1
# create bridge mapping for eth1. map network name "phyNet" to br-eth1 (run on 'node1')
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phyNet:br-eth1
# create localnet port on 'outside'. set the network name to "phyNet"
ovn-nbctl lsp-add outside outside-localnet
ovn-nbctl lsp-set-addresses outside-localnet unknown
ovn-nbctl lsp-set-type outside-localnet localnet
ovn-nbctl lsp-set-options outside-localnet network_name=phyNet
# connect eth1 to br-eth1 (run on 'node1')
ovs-vsctl add-port br-eth1 eth1
The complete logical network:
# ovn-nbctl show
switch d4b119e9-0298-42ab-8cc7-480292231953 (join)
port join-gw1
type: router
addresses: ["02:ac:10:ff:ff:01"]
router-port: gw1-join
port join-router1
type: router
addresses: ["02:ac:10:ff:ff:02"]
router-port: router1-join
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
port ls1-router1
type: router
addresses: ["02:ac:10:ff:00:01"]
router-port: router1-ls1
port ls1-vm1
addresses: ["02:ac:10:ff:00:11"]
port ls1-vm2
addresses: ["02:ac:10:ff:00:22"]
switch 64dc14b1-3e0f-4f68-b388-d76826a5c972 (outside)
port outside-gw1
type: router
addresses: ["02:0a:7f:18:01:02"]
router-port: gw1-outside
port outside-localnet
type: localnet
addresses: ["unknown"]
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
port router1-ls1
mac: "02:ac:10:ff:00:01"
networks: ["192.168.100.1/24"]
port router1-join
mac: "02:ac:10:ff:ff:02"
networks: ["192.168.255.2/24"]
router f29af2c3-e9d1-46f9-bff7-b9b8f0fd56df (gw1)
port gw1-outside
mac: "02:0a:7f:18:01:02"
networks: ["172.18.1.2/16"]
port gw1-join
mac: "02:ac:10:ff:ff:01"
networks: ["192.168.255.1/24"]
vm1 pings gw1-outside:
[root@node2 ~]# ip netns exec vm1 ping -c 1 172.18.1.2
PING 172.18.1.2 (172.18.1.2) 56(84) bytes of data.
64 bytes from 172.18.1.2: icmp_seq=1 ttl=253 time=1.00 ms
Nodes on the same L2 network as node1 can reach the overlay network by adding a route.
client (172.18.1.10) -> gw1:
# ip route show
172.18.0.0/16 dev eth1 proto kernel scope link src 172.18.1.10
# ping -c 1 172.18.1.2
PING 172.18.1.2 (172.18.1.2) 56(84) bytes of data.
64 bytes from 172.18.1.2: icmp_seq=1 ttl=254 time=0.438 ms
Add a route on client to 192.168.0.0/16:
ip route add 192.168.0.0/16 via 172.18.1.2 dev eth1
client (172.18.1.10) -> vm1 (192.168.100.10):
# ping -c 1 192.168.100.10
PING 192.168.100.10 (192.168.100.10) 56(84) bytes of data.
64 bytes from 192.168.100.10: icmp_seq=1 ttl=62 time=1.35 ms
Capture on vm1:
[root@node2 ~]# ip netns exec vm1 tcpdump -nnvv -i vm1
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
07:41:28.561299 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
172.18.1.10 > 192.168.100.10: ICMP echo request, id 8160, seq 1, length 64
07:41:28.561357 IP (tos 0x0, ttl 64, id 53879, offset 0, flags [none], proto ICMP (1), length 84)
192.168.100.10 > 172.18.1.10: ICMP echo reply, id 8160, seq 1, length 64
Access from the overlay to the physical network can be achieved with NAT.
Create the NAT rule:
# ovn-nbctl -- --id=@nat create nat type="snat" logical_ip=192.168.100.0/24 external_ip=172.18.1.2 -- add logical_router gw1 nat @nat
3243ffd3-8d77-4bd3-9f7e-49e74e87b4a7
# ovn-nbctl lr-nat-list gw1
TYPE EXTERNAL_IP LOGICAL_IP EXTERNAL_MAC LOGICAL_PORT
snat 172.18.1.2 192.168.100.0/24
# ovn-nbctl list NAT
_uuid : 3243ffd3-8d77-4bd3-9f7e-49e74e87b4a7
external_ip : "172.18.1.2"
external_mac : []
logical_ip : "192.168.100.0/24"
logical_port : []
type : snat
The NAT rule can be deleted with ovn-nbctl lr-nat-del gw1 snat 192.168.100.0/24.
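As a sketch of an equivalent shorthand (assuming an ovn-nbctl build that provides the lr-nat-add/lr-nat-del commands, as used elsewhere above), the same SNAT rule can be managed without the raw `create nat` transaction:

```shell
## add the same SNAT rule via the lr-nat-add shorthand
## (assumes ovn-nbctl with lr-nat-add support; syntax: lr-nat-add ROUTER TYPE EXTERNAL_IP LOGICAL_IP)
ovn-nbctl lr-nat-add gw1 snat 172.18.1.2 192.168.100.0/24
ovn-nbctl lr-nat-list gw1
## remove it again
ovn-nbctl lr-nat-del gw1 snat 192.168.100.0/24
```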
vm1 (192.168.100.10) -> client (172.18.1.10):
[root@node2 ~]# ip netns exec vm1 ping -c 1 172.18.1.10
PING 172.18.1.10 (172.18.1.10) 56(84) bytes of data.
64 bytes from 172.18.1.10: icmp_seq=1 ttl=62 time=1.63 ms
[root@client ~]# tcpdump icmp -nnvv -i eth1
[10894068.821880] device eth1 entered promiscuous mode
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
08:24:50.316495 IP (tos 0x0, ttl 61, id 31580, offset 0, flags [DF], proto ICMP (1), length 84)
172.18.1.2 > 172.18.1.10: ICMP echo request, id 5587, seq 1, length 64
08:24:50.316536 IP (tos 0x0, ttl 64, id 6129, offset 0, flags [none], proto ICMP (1), length 84)
172.18.1.10 > 172.18.1.2: ICMP echo reply, id 5587, seq 1, length 64
As shown above, the source IP seen on the client side is 172.18.1.2.
CENTRAL_IP=172.17.42.30
LOCAL_IP=172.17.42.30
ENCAP_TYPE=geneve
## start ovs
/usr/share/openvswitch/scripts/ovs-ctl start
## set ovn-remote and ovn-nb
ovs-vsctl set Open_vSwitch . external_ids:ovn-remote="tcp:$CENTRAL_IP:6642" external_ids:ovn-nb="tcp:$CENTRAL_IP:6641" external_ids:ovn-encap-ip=$LOCAL_IP external_ids:ovn-encap-type="$ENCAP_TYPE"
## set system_id
id_file=/etc/openvswitch/system-id.conf
test -e $id_file || uuidgen > $id_file
ovs-vsctl set Open_vSwitch . external_ids:system-id=$(cat $id_file)
## start ovn-controller and vtep
/usr/share/openvswitch/scripts/ovn-ctl start_controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller_vtep
# /usr/share/openvswitch/scripts/ovn-ctl start_northd
Starting ovn-northd [ OK ]
Open up TCP ports to access the OVN databases:
[root@kube-master ~]# ovn-nbctl set-connection ptcp:6641
[root@kube-master ~]# ovn-sbctl set-connection ptcp:6642
CENTRAL_IP=172.17.42.30
LOCAL_IP=172.17.42.31
ENCAP_TYPE=geneve
## start ovs
/usr/share/openvswitch/scripts/ovs-ctl start
## set ovn-remote and ovn-nb
ovs-vsctl set Open_vSwitch . external_ids:ovn-remote="tcp:$CENTRAL_IP:6642" external_ids:ovn-nb="tcp:$CENTRAL_IP:6641" external_ids:ovn-encap-ip=$LOCAL_IP external_ids:ovn-encap-type="$ENCAP_TYPE"
## set system_id
id_file=/etc/openvswitch/system-id.conf
test -e $id_file || uuidgen > $id_file
ovs-vsctl set Open_vSwitch . external_ids:system-id=$(cat $id_file)
## start ovn-controller and vtep
/usr/share/openvswitch/scripts/ovn-ctl start_controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller_vtep
Set the k8s API server address in the Open vSwitch database for the initialization scripts (and later daemons) to pick from.
# ovs-vsctl set Open_vSwitch . external_ids:k8s-api-server="127.0.0.1:8080"
git clone https://github.com/openvswitch/ovn-kubernetes
cd ovn-kubernetes
pip install .
ovn-k8s-overlay master-init \
--cluster-ip-subnet="192.168.0.0/16" \
--master-switch-subnet="192.168.1.0/24" \
--node-name="kube-master"
This creates the logical switch and router:
# ovn-nbctl show
switch d034f42f-6dd5-4ba9-bfdd-114ce17c9235 (kube-master)
    port k8s-kube-master
        addresses: ["ae:31:fa:c7:81:fc 192.168.1.2"]
    port stor-kube-master
        type: router
        addresses: ["00:00:00:B5:F1:57"]
        router-port: rtos-kube-master
switch 2680f36b-85c2-4064-b811-5c0bd91debdd (join)
    port jtor-kube-master
        type: router
        addresses: ["00:00:00:1A:E4:98"]
        router-port: rtoj-kube-master
router ce75b330-dbd3-43d2-aa4f-4e17af898532 (kube-master)
    port rtos-kube-master
        mac: "00:00:00:B5:F1:57"
        networks: ["192.168.1.1/24"]
    port rtoj-kube-master
        mac: "00:00:00:1A:E4:98"
        networks: ["100.64.1.1/24"]
kube-node1:
K8S_API_SERVER_IP=172.17.42.30
ovs-vsctl set Open_vSwitch . \
external_ids:k8s-api-server="$K8S_API_SERVER_IP:8080"
ovn-k8s-overlay minion-init \
--cluster-ip-subnet="192.168.0.0/16" \
--minion-switch-subnet="192.168.2.0/24" \
--node-name="kube-node1"
## for https, the CA certificate and token must be specified
ovs-vsctl set Open_vSwitch . \
external_ids:k8s-api-server="https://$K8S_API_SERVER_IP" \
external_ids:k8s-ca-certificate="/etc/kubernetes/certs/ca.crt" \
external_ids:k8s-api-token="YMMFKeD4XqLDakZKQbTCvueGlcdcdgBx"
This creates the corresponding logical switch and connects it to the logical router (kube-master):
# ovn-nbctl show
switch 0147b986-1dab-49a5-9c4e-57d9feae8416 (kube-node1)
    port k8s-kube-node1
        addresses: ["ba:2c:06:32:14:78 192.168.2.2"]
    port stor-kube-node1
        type: router
        addresses: ["00:00:00:C0:2E:C7"]
        router-port: rtos-kube-node1
...
router ce75b330-dbd3-43d2-aa4f-4e17af898532 (kube-master)
    port rtos-kube-node2
        mac: "00:00:00:D3:4B:AA"
        networks: ["192.168.3.1/24"]
    port rtos-kube-node1
        mac: "00:00:00:C0:2E:C7"
        networks: ["192.168.2.1/24"]
    port rtos-kube-master
        mac: "00:00:00:B5:F1:57"
        networks: ["192.168.1.1/24"]
    port rtoj-kube-master
        mac: "00:00:00:1A:E4:98"
        networks: ["100.64.1.1/24"]
kube-node2:
K8S_API_SERVER_IP=172.17.42.30
ovs-vsctl set Open_vSwitch . \
external_ids:k8s-api-server="$K8S_API_SERVER_IP:8080"
ovn-k8s-overlay minion-init \
--cluster-ip-subnet="192.168.0.0/16" \
--minion-switch-subnet="192.168.3.0/24" \
--node-name="kube-node2"
## attach eth0 to bridge breth0 and move IP/routes
ovn-k8s-util nics-to-bridge eth0
## initialize gateway
ovs-vsctl set Open_vSwitch . \
external_ids:k8s-api-server="$K8S_API_SERVER_IP:8080"
ovn-k8s-overlay gateway-init \
--cluster-ip-subnet="$CLUSTER_IP_SUBNET" \
--bridge-interface breth0 \
--physical-ip "$PHYSICAL_IP" \
--node-name="$NODE_NAME" \
--default-gw "$EXTERNAL_GATEWAY"
# Since you share a NIC for both mgmt and North-South connectivity, you will
# have to start a separate daemon to de-multiplex the traffic.
ovn-k8s-gateway-helper --physical-bridge=breth0 --physical-interface=eth0 \
--pidfile --detach
ovn-k8s-watcher \
--overlay \
--pidfile \
--log-file \
-vfile:info \
-vconsole:emer \
--detach
# ps -ef | grep ovn-k8s
root 28151 1 1 12:57 ? 00:00:00 /usr/bin/python /usr/bin/ovn-k8s-watcher --overlay --pidfile --log-file -vfile:info -vconsole:emer --detach
The corresponding log is at /var/log/openvswitch/ovn-k8s-watcher.log.
Create a Pod:
apiVersion: v1
kind: Pod
metadata:
  name: sshd-2
spec:
  containers:
  - name: sshd-2
    image: dbyin/sshd:1.0
The CNI executable is /opt/cni/bin/ovn_cni. The log from creating the container:
# tail -f /var/log/openvswitch/ovn-k8s-cni-overlay.log
2018-01-03T08:42:39.609Z | 0 | ovn-k8s-cni-overlay | DBG | plugin invoked with cni_command = ADD cni_container_id = a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32 cni_ifname = eth0 cni_netns = /proc/31180/ns/net cni_args = IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=sshd-2;K8S_POD_INFRA_CONTAINER_ID=a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.633Z | 1 | kubernetes | DBG | Annotations for pod sshd-2: {u'ovn': u'{"gateway_ip": "192.168.2.1", "ip_address": "192.168.2.3/24", "mac_address": "0a:00:00:00:00:01"}'}
2018-01-03T08:42:39.635Z | 2 | ovn-k8s-cni-overlay | DBG | Creating veth pair for container a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.662Z | 3 | ovn-k8s-cni-overlay | DBG | Bringing up veth outer interface a2f5796e82e9286
2018-01-03T08:42:39.769Z | 4 | ovn-k8s-cni-overlay | DBG | Create a link for container namespace
2018-01-03T08:42:39.781Z | 5 | ovn-k8s-cni-overlay | DBG | Adding veth inner interface to namespace for container a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.887Z | 6 | ovn-k8s-cni-overlay | DBG | Configuring and bringing up veth inner interface a2f5796e82e92_c. New name:'eth0',MAC address:'0a:00:00:00:00:01', MTU:'1400', IP:192.168.2.3/24
2018-01-03T08:42:44.960Z | 7 | ovn-k8s-cni-overlay | DBG | Setting gateway_ip 192.168.2.1 for container:a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:44.983Z | 8 | ovn-k8s-cni-overlay | DBG | output is {"gateway_ip": "192.168.2.1", "ip_address": "192.168.2.3/24", "mac_address": "0a:00:00:00:00:01"}
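The annotation logged above is what the CNI plugin consumes. As a small illustration (the annotation JSON is copied verbatim from the log; the sed extraction is just a sketch), its fields can be pulled out in plain shell:

```shell
# parse the pod's OVN annotation (JSON copied from the log above)
ann='{"gateway_ip": "192.168.2.1", "ip_address": "192.168.2.3/24", "mac_address": "0a:00:00:00:00:01"}'
# extract fields with sed (each key appears exactly once in the annotation)
ip=$(printf '%s' "$ann" | sed -n 's/.*"ip_address": "\([^"]*\)".*/\1/p')
gw=$(printf '%s' "$ann" | sed -n 's/.*"gateway_ip": "\([^"]*\)".*/\1/p')
echo "$ip"   # 192.168.2.3/24
echo "$gw"   # 192.168.2.1
```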
sshd-2 can be reached from kube-node2:
[root@kube-node2 ~]# ping -c 2 192.168.2.3
PING 192.168.2.3 (192.168.2.3) 56(84) bytes of data.
64 bytes from 192.168.2.3: icmp_seq=1 ttl=63 time=0.281 ms
64 bytes from 192.168.2.3: icmp_seq=2 ttl=63 time=0.304 ms
--- 192.168.2.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 0.281/0.292/0.304/0.020 ms
OVS on kube-node1:
# ovs-vsctl show
9b92e4fb-fc59-47ae-afa4-a95d1842e2bd
    Bridge br-int
        fail_mode: secure
        Port "ovn-069367-0"
            Interface "ovn-069367-0"
                type: vxlan
                options: {csum="true", key=flow, remote_ip="172.17.42.33"}
        Port br-int
            Interface br-int
                type: internal
        Port "k8s-kube-node1"
            Interface "k8s-kube-node1"
                type: internal
        Port "a2f5796e82e9286"
            Interface "a2f5796e82e9286"
        Port "ovn-7f9937-0"
            Interface "ovn-7f9937-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="172.17.42.32"}
        Port "ovn-0696ca-0"
            Interface "ovn-0696ca-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="172.17.42.30"}
    ovs_version: "2.8.1"
The port a2f5796e82e9286 is named after the first 15 characters of the network (infra) container's ID, since Linux interface names are limited to 15 characters.
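The truncation is easy to verify (the container ID below is copied from the CNI log earlier):

```shell
# the veth/port name is the container ID truncated to the 15-char
# Linux interface-name limit (IFNAMSIZ - 1)
cid=a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
printf '%.15s\n' "$cid"   # a2f5796e82e9286
```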
Let's trace the OpenFlow processing of a packet from 192.168.3.2 to 192.168.2.3:
[root@kube-node2 ~]# ovs-appctl ofproto/trace br-int in_port=9,ip,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=192.168.3.2,nw_dst=192.168.2.3,nw_ttl=32
bridge("br-int")
----------------
0. in_port=9, priority 100
set_field:0x1->reg13
set_field:0xa->reg11
set_field:0x6->reg12
set_field:0x5->metadata
set_field:0x2->reg14
resubmit(,8)
...
42. ip,reg0=0x1/0x1,metadata=0x5, priority 100, cookie 0x88177e0
ct(table=43,zone=NXM_NX_REG13[0..15])
drop
-> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 43.
Final flow: ip,reg0=0x1,reg11=0xa,reg12=0x6,reg13=0x1,reg14=0x2,reg15=0x1,metadata=0x5,in_port=9,vlan_tci=0x0000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=192.168.3.2,nw_dst=192.168.2.3,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32
Megaflow: recirc_id=0,ct_state=-new-est-rel-inv-trk,eth,ip,in_port=9,vlan_tci=0x0000/0x1000,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:d3:4b:aa,nw_dst=128.0.0.0/1,nw_frag=no
Datapath actions: ct(zone=1),recirc(0x24)
===============================================================================
recirc(0x24) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x24,ct_state=new|trk,eth,ip,reg0=0x1,reg11=0xa,reg12=0x6,reg13=0x1,reg14=0x2,reg15=0x1,metadata=0x5,in_port=9,vlan_tci=0x0000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=192.168.3.2,nw_dst=192.168.2.3,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32
bridge("br-int")
----------------
thaw
Resuming from table 43
...
65. reg15=0x1,metadata=0x5, priority 100
clone(ct_clear,set_field:0->reg11,set_field:0->reg12,set_field:0->reg13,set_field:0x4->reg11,set_field:0xb->reg12,set_field:0x1->metadata,set_field:0x4->reg14,set_field:0->reg10,set_field:0->reg15,set_field:0->reg0,set_field:0->reg1,set_field:0->reg2,set_field:0->reg3,set_field:0->reg4,set_field:0->reg5,set_field:0->reg6,set_field:0->reg7,set_field:0->reg8,set_field:0->reg9,set_field:0->in_port,resubmit(,8))
ct_clear
set_field:0->reg11
set_field:0->reg12
set_field:0->reg13
set_field:0x4->reg11
set_field:0xb->reg12
set_field:0x1->metadata
set_field:0x4->reg14
set_field:0->reg10
set_field:0->reg15
set_field:0->reg0
set_field:0->reg1
set_field:0->reg2
set_field:0->reg3
set_field:0->reg4
set_field:0->reg5
set_field:0->reg6
set_field:0->reg7
set_field:0->reg8
set_field:0->reg9
set_field:0->in_port
resubmit(,8)
...
13. ip,metadata=0x1,nw_dst=192.168.2.0/24, priority 49, cookie 0xc6501434
dec_ttl()
move:NXM_OF_IP_DST[]->NXM_NX_XXREG0[96..127]
-> NXM_NX_XXREG0[96..127] is now 0xc0a80203
load:0xc0a80201->NXM_NX_XXREG0[64..95]
set_field:00:00:00:c0:2e:c7->eth_src
set_field:0x3->reg15
load:0x1->NXM_NX_REG10[0]
resubmit(,14)
14. reg0=0xc0a80203,reg15=0x3,metadata=0x1, priority 100, cookie 0x3b957bac
set_field:0a:00:00:00:00:01->eth_dst
resubmit(,15)
...
64. reg10=0x1/0x1,reg15=0x3,metadata=0x1, priority 100
push:NXM_OF_IN_PORT[]
set_field:0->in_port
resubmit(,65)
65. reg15=0x3,metadata=0x1, priority 100
clone(ct_clear,set_field:0->reg11,set_field:0->reg12,set_field:0->reg13,set_field:0x5->reg11,set_field:0x9->reg12,set_field:0x4->metadata,set_field:0x1->reg14,set_field:0->reg10,set_field:0->reg15,set_field:0->reg0,set_field:0->reg1,set_field:0->reg2,set_field:0->reg3,set_field:0->reg4,set_field:0->reg5,set_field:0->reg6,set_field:0->reg7,set_field:0->reg8,set_field:0->reg9,set_field:0->in_port,resubmit(,8))
ct_clear
set_field:0->reg11
set_field:0->reg12
set_field:0->reg13
set_field:0x5->reg11
set_field:0x9->reg12
set_field:0x4->metadata
set_field:0x1->reg14
...
23. metadata=0x4,dl_dst=0a:00:00:00:00:01, priority 50, cookie 0x6c2597ec
set_field:0x3->reg15
resubmit(,32)
32. reg15=0x3,metadata=0x4, priority 100
load:0x4->NXM_NX_TUN_ID[0..23]
set_field:0x3->tun_metadata0
move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
-> NXM_NX_TUN_METADATA0[16..30] is now 0x1
output:7
-> output to kernel tunnel
pop:NXM_OF_IN_PORT[]
-> NXM_OF_IN_PORT[] is now 0
Final flow: unchanged
Megaflow: recirc_id=0x24,ct_state=+new-est-rel-inv+trk,eth,ip,tun_id=0/0xffffff,tun_metadata0=NP,in_port=9,vlan_tci=0x0000/0x1000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=192.168.3.2/31,nw_dst=192.168.2.3,nw_ecn=0,nw_ttl=32,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x4,dst=172.17.42.31,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10003}),flags(df|csum|key))),set(eth(src=00:00:00:c0:2e:c7,dst=0a:00:00:00:00:01)),2
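In the Datapath actions above, the Geneve option value 0x10003 (class=0x102, type=0x80) packs both logical ports, matching table 32: the logical ingress port (reg14) goes into bits 16..30 of tun_metadata0, and the logical egress port into bits 0..15. A quick decode:

```shell
# decode OVN's Geneve option tun_metadata0 = 0x10003:
# bits 16..30 carry the logical ingress port, bits 0..15 the logical egress port
tm0=$(( 0x10003 ))
echo "ingress port: $(( (tm0 >> 16) & 0x7fff ))"   # 1
echo "egress port:  $(( tm0 & 0xffff ))"           # 3
```

The receive-side trace below performs the inverse moves, restoring reg14 and reg15 from the tunnel metadata.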
A few points to note:
- In the ofproto/trace, dl_dst=00:00:00:d3:4b:aa is the MAC address of 192.168.3.1, the gateway of kube-node2 (i.e. the address of stor-kube-node2).
- At table 65, the metadata is changed from 0x5 (datapath/kube-node2) to 0x1 (router/kube-master).
- The rule at table 13 rewrites the Source MAC; the rule at table 14 rewrites the Dst MAC.
- At the second pass through table 65, the metadata is changed to the datapath of kube-node1 (0x4).
- The packet is finally output to port 7, i.e. the tunnel device:
 7(ovn-c7889c-0): addr:76:c2:2f:bb:06:5b
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
...
        Port "ovn-c7889c-0"
            Interface "ovn-c7889c-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="172.17.42.31"}
Processing on node kube-node1 after it receives the packet:
[root@kube-node1 ~]# ovs-appctl ofproto/trace br-int in_port=6,tun_id=0x4,tun_metadata0=0x3,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01
Flow: tun_id=0x4,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01,dl_type=0x0000
bridge("br-int")
----------------
0. in_port=6, priority 100
move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23]
-> OXM_OF_METADATA[0..23] is now 0x4
move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14]
-> NXM_NX_REG14[0..14] is now 0
move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15]
-> NXM_NX_REG15[0..15] is now 0x3
resubmit(,33)
...
48. reg15=0x3,metadata=0x4, priority 50, cookie 0x37e139d4
resubmit(,64)
64. priority 0
resubmit(,65)
65. reg15=0x3,metadata=0x4, priority 100
output:10
Final flow: reg11=0x5,reg12=0x8,reg13=0xa,reg15=0x3,tun_id=0x4,metadata=0x4,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01,dl_type=0x0000
Megaflow: recirc_id=0,ct_state=-new-est-rel-inv-trk,eth,tun_id=0x4/0xffffff,tun_metadata0=0x3/0x7fffffff,in_port=6,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00,dl_type=0x0000
Datapath actions: 5
The packet is finally forwarded to port 10, i.e. the port of the container:
 10(a2f5796e82e9286): addr:de:d3:83:cf:22:7c
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
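To map an OpenFlow port number (such as 7 or 10 above) back to an interface on a live node, one can grep the ofctl output or query the OVSDB directly; a sketch, assuming the commands run on the node itself:

```shell
## show the interface behind OpenFlow port 10 on br-int
ovs-ofctl show br-int | grep ' 10('
## or list the name/ofport mapping of all interfaces from the OVSDB
ovs-vsctl --columns=name,ofport list Interface
```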