Understanding the RoCE network protocol (2017-11-09, hustcat, http://hustcat.github.io/roce-protocol)

RoCE is short for RDMA over Converged Ethernet; it lets RDMA run over an Ethernet fabric. The other approach is RDMA over InfiniBand, so RoCE (strictly speaking, RoCEv1) is a link-layer protocol that corresponds to InfiniBand.

There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.

RoCEv1

For a RoCE fabric, the hardware side requires an L2 Ethernet switch that supports IEEE DCB, and the compute nodes need RoCE-capable NICs:

On the hardware side, basically you need an L2 Ethernet switch with IEEE DCB (Data Center Bridging, aka Converged Enhanced Ethernet) with support for priority flow control.

 On the compute or storage server end, you need an RoCE-capable network adapter.

The corresponding frame format is as follows:

The protocol specification is the InfiniBand™ Architecture Specification Release 1.2.1, Annex A16: RoCE.

Example:

RoCEv2

Because RoCEv1 frames carry no IP header, communication is limited to a single L2 subnet. RoCEv2 therefore extends RoCEv1 by replacing the GRH (Global Routing Header) with a UDP header plus an IP header:

RoCEv2 is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format.

Instead of the GRH, RoCEv2 packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.
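As an illustration, here is a rough C sketch of the resulting encapsulation (not taken from the spec text above; 4791 is the IANA-assigned RoCEv2 UDP destination port, and only a few BTH fields are shown):

#include <stdint.h>

/* Simplified on-the-wire layout of a RoCEv2 packet (sketch only; see the
 * IBTA spec for the authoritative header definitions):
 *
 *   RoCEv1: Ethernet | GRH                  | BTH | payload | ICRC | FCS
 *   RoCEv2: Ethernet | IP | UDP (dport 4791) | BTH | payload | ICRC | FCS
 */
#define ROCEV2_UDP_DPORT 4791   /* IANA-assigned UDP port for RoCEv2 */

struct ib_bth {                 /* InfiniBand Base Transport Header, 12 bytes */
	uint8_t  opcode;        /* e.g. RC SEND Only */
	uint8_t  flags;         /* SE/M/pad/transport header version */
	uint16_t pkey;          /* partition key */
	uint32_t dest_qp;       /* reserved byte + 24-bit destination QP */
	uint32_t psn;           /* A bit + 24-bit packet sequence number */
} __attribute__((packed));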

The frame format is as follows:

Example:

It is worth mentioning that kernel 4.9 added a software implementation of RoCEv2, namely Soft-RoCE.

Refs

Linux Soft-RoCE implementation (2017-11-08, hustcat, http://hustcat.github.io/linux-soft-roce-implementation)

Soft-RoCE, added in kernel 4.9, implements RoCEv2 in software.

Queue initialization

libRXE (user space library)

ibv_create_qp
|--- rxe_create_qp
    |--- ibv_cmd_create_qp
  • ibv_create_qp
LATEST_SYMVER_FUNC(ibv_create_qp, 1_1, "IBVERBS_1.1",
		   struct ibv_qp *,
		   struct ibv_pd *pd,
		   struct ibv_qp_init_attr *qp_init_attr)
{
	struct ibv_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); ///rxe_ctx_ops
///..
}
  • rxe_create_qp
static struct ibv_qp *rxe_create_qp(struct ibv_pd *pd,
				    struct ibv_qp_init_attr *attr)
{
	struct ibv_create_qp cmd;
	struct rxe_create_qp_resp resp;
	struct rxe_qp *qp;
	int ret;
////..
	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd,
				&resp.ibv_resp, sizeof resp); /// ibv_create_qp CMD, to kernel
///...
	qp->sq.max_sge = attr->cap.max_send_sge;
	qp->sq.max_inline = attr->cap.max_inline_data;
	qp->sq.queue = mmap(NULL, resp.sq_mi.size, PROT_READ | PROT_WRITE,
			    MAP_SHARED,
			    pd->context->cmd_fd, resp.sq_mi.offset); ///mmap, see rxe_mmap

ibv_context->cmd_fd refers to the corresponding ibv_device and is set up by ibv_open_device.

ibv_cmd_create_qp sends the IB_USER_VERBS_CMD_CREATE_QP command to the kernel through ibv_context->cmd_fd; see libibverbs@ibv_cmd_create_qp.

The corresponding kernel write handler is ib_uverbs_write:

///drivers/infiniband/core/uverbs_main.c
static const struct file_operations uverbs_fops = {
	.owner	 = THIS_MODULE,
	.write	 = ib_uverbs_write,
	.open	 = ib_uverbs_open,
	.release = ib_uverbs_close,
	.llseek	 = no_llseek,
};
  • ibv_open_device
///libibverbs/device.c
LATEST_SYMVER_FUNC(ibv_open_device, 1_1, "IBVERBS_1.1",
		   struct ibv_context *,
		   struct ibv_device *device)
{
	struct verbs_device *verbs_device = verbs_get_device(device);
	char *devpath;
	int cmd_fd, ret;
	struct ibv_context *context;
	struct verbs_context *context_ex;

	if (asprintf(&devpath, "/dev/infiniband/%s", device->dev_name) < 0)
		return NULL;

	/*
	 * We'll only be doing writes, but we need O_RDWR in case the
	 * provider needs to mmap() the file.
	 */
	cmd_fd = open(devpath, O_RDWR | O_CLOEXEC); /// /dev/infiniband/uverbs0
	free(devpath);

	if (cmd_fd < 0)
		return NULL;

	if (!verbs_device->ops->init_context) {
		context = verbs_device->ops->alloc_context(device, cmd_fd); ///rxe_alloc_context, rxe_dev_ops
		if (!context)
			goto err;
	}
///...
	context->device = device;
	context->cmd_fd = cmd_fd;
	pthread_mutex_init(&context->mutex, NULL);

	ibverbs_device_hold(device);

	return context;
///...
}

kernel (rdma_rxe module)

  • ib_uverbs_create_qp

The handler for IB_USER_VERBS_CMD_CREATE_QP is ib_uverbs_create_qp.

ib_uverbs_write
|--- ib_uverbs_create_qp
     |--- create_qp
	      |--- ib_device->create_qp
		       |--- rxe_create_qp

create_qp invokes ib_device->create_qp, which for RXE is rxe_create_qp; see rxe_register_device.

  • rxe_create_qp
rxe_create_qp
|--- rxe_qp_from_init
     |--- rxe_qp_init_req

rxe_qp_from_init initializes the send queue and the receive queue.

  • rxe_qp_init_req

rxe_qp_init_req mainly does the following:

  • creates the corresponding UDP socket;

  • calls rxe_queue_init to initialize the send queue;

  • initializes the corresponding tasklet.

  • rxe_queue_init

rxe_queue_init allocates the memory for the queue:

struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe,
				 int *num_elem,
				 unsigned int elem_size)
{
	struct rxe_queue *q;
	size_t buf_size;
	unsigned int num_slots;
///...
	buf_size = sizeof(struct rxe_queue_buf) + num_slots * elem_size;

	q->buf = vmalloc_user(buf_size);
///...
}

The buffer that rxe_queue->buf points to is mapped into user space by rxe_mmap; each queue element corresponds to a struct rxe_send_wqe.

When the libibverbs API ibv_post_send is called, the corresponding struct rxe_send_wqe is appended to this queue; see rdma-core@post_one_send.
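From the application's side this enqueue happens through the standard ibv_post_send verb. A minimal sketch of posting one send WR (qp, mr and buf are assumed to come from earlier setup):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a single signaled SEND; on rxe this becomes a rxe_send_wqe in the
 * mmap'ed send queue, and post_send_db() then issues IB_USER_VERBS_CMD_POST_SEND. */
static int post_one_send_example(struct ibv_qp *qp, struct ibv_mr *mr,
				 void *buf, uint32_t len)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.wr_id      = 1,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,  /* ask for a completion entry */
	};
	struct ibv_send_wr *bad_wr = NULL;

	return ibv_post_send(qp, &wr, &bad_wr);
}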

  • rxe_mmap
/**
 * rxe_mmap - create a new mmap region
 * @context: the IB user context of the process making the mmap() call
 * @vma: the VMA to be initialized
 * Return zero if the mmap is OK. Otherwise, return an errno.
 */
int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
{
	struct rxe_dev *rxe = to_rdev(context->device);
	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
	unsigned long size = vma->vm_end - vma->vm_start;
	struct rxe_mmap_info *ip, *pp;
///...
found_it:
	list_del_init(&ip->pending_mmaps);
	spin_unlock_bh(&rxe->pending_lock);

	ret = remap_vmalloc_range(vma, ip->obj, 0);
	if (ret) {
		pr_err("rxe: err %d from remap_vmalloc_range\n", ret);
		goto done;
	}

	vma->vm_ops = &rxe_vm_ops;
	vma->vm_private_data = ip;
	rxe_vma_open(vma);
///...
}

Sending data

libRXE

rxe_post_send converts each struct ibv_send_wr into a struct rxe_send_wqe, appends it to the send queue rxe_qp->sq, and then sends the IB_USER_VERBS_CMD_POST_SEND command to the RXE kernel module through cmd_fd:

///providers/rxe/rxe.c
/* this API does not make a distinction between
   restartable and non-restartable errors */
static int rxe_post_send(struct ibv_qp *ibqp,
			 struct ibv_send_wr *wr_list,
			 struct ibv_send_wr **bad_wr)
{
	int rc = 0;
	int err;
	struct rxe_qp *qp = to_rqp(ibqp);/// ibv_qp -> rxe_qp
	struct rxe_wq *sq = &qp->sq;

	if (!bad_wr)
		return EINVAL;

	*bad_wr = NULL;

	if (!sq || !wr_list || !sq->queue)
	 	return EINVAL;

	pthread_spin_lock(&sq->lock);

	while (wr_list) {
		rc = post_one_send(qp, sq, wr_list); /// ibv_send_wr -> rxe_send_wqe, enqueue
		if (rc) {
			*bad_wr = wr_list;
			break;
		}

		wr_list = wr_list->next;
	}

	pthread_spin_unlock(&sq->lock);

	err =  post_send_db(ibqp); /// IB_USER_VERBS_CMD_POST_SEND cmd
	return err ? err : rc;
}

kernel

The function that handles IB_USER_VERBS_CMD_POST_SEND is ib_uverbs_post_send:

ib_uverbs_post_send -> ib_device->post_send -> rxe_post_send -> rxe_requester -> ip_local_out

  • rxe_post_send
static int rxe_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
			 struct ib_send_wr **bad_wr)
{
	int err = 0;
	struct rxe_qp *qp = to_rqp(ibqp); ///ib_qp -> rxe_qp
///...
	/*
	 * Must sched in case of GSI QP because ib_send_mad() hold irq lock,
	 * and the requester call ip_local_out_sk() that takes spin_lock_bh.
	 */
	must_sched = (qp_type(qp) == IB_QPT_GSI) ||
			(queue_count(qp->sq.queue) > 1);

	rxe_run_task(&qp->req.task, must_sched); /// to rxe_requester

	return err;
}
  • rxe_requester

rxe_requester dequeues rxe_send_wqe entries from the rxe_qp queue, builds the corresponding sk_buff, and hands it down to the rxe_dev device:

///sw/rxe/rxe_req.c
int rxe_requester(void *arg)
{
	struct rxe_qp *qp = (struct rxe_qp *)arg;
	struct rxe_pkt_info pkt;
	struct sk_buff *skb;
	struct rxe_send_wqe *wqe;
///...
	wqe = req_next_wqe(qp); /// get rxe_send_wqe
///...
	/// rxe_send_wqe -> skb
	skb = init_req_packet(qp, wqe, opcode, payload, &pkt);
///...
	ret = rxe_xmit_packet(to_rdev(qp->ibqp.device), qp, &pkt, skb);
///...
}

static inline int rxe_xmit_packet(struct rxe_dev *rxe, struct rxe_qp *qp,
				  struct rxe_pkt_info *pkt, struct sk_buff *skb)
{
///...
	if (pkt->mask & RXE_LOOPBACK_MASK) {
		memcpy(SKB_TO_PKT(skb), pkt, sizeof(*pkt));
		err = rxe->ifc_ops->loopback(skb);
	} else {
		err = rxe->ifc_ops->send(rxe, pkt, skb);/// ifc_ops->send, send
	}
///...
}

ifc_ops->send eventually calls ip_local_out, which transmits the packet out of the corresponding physical NIC.

Refs

RDMA Programming - Based on linux-rdma (2017-11-08, hustcat, http://hustcat.github.io/rdma-programming)

linux-rdma is the user-space counterpart of the Linux kernel InfiniBand subsystem (drivers/infiniband); it provides the InfiniBand Verbs API and the RDMA Verbs API.

Basic concepts

  • Queue Pair(QP)

To perform RDMA operations, a connection must be established between the two endpoints; this is done through a Queue Pair (QP), which plays the role of a socket. Both ends of the communication have to initialize their QPs, and the Communication Manager (CM) exchanges QP information before the two sides actually establish the connection.

Once a QP is established, the verbs API can be used to perform RDMA reads, RDMA writes, and atomic operations. Serialized send/receive operations, which are similar to socket reads/writes, can be performed as well.

A QP corresponds to struct ibv_qp, and ibv_create_qp is used to create one.

/**
 * ibv_create_qp - Create a queue pair.
 */
struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
			     struct ibv_qp_init_attr *qp_init_attr);
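A minimal sketch of using it to create an RC QP (pd and cq are assumed to exist already; the queue depths are arbitrary example values):

#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,              /* a single CQ may serve both queues */
		.qp_type = IBV_QPT_RC,      /* reliable connected */
		.cap = {
			.max_send_wr  = 16, /* example queue depths */
			.max_recv_wr  = 16,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
	};
	struct ibv_qp *qp = ibv_create_qp(pd, &attr);

	if (!qp)
		perror("ibv_create_qp");
	return qp;
}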
  • Completion Queue(CQ)

A Completion Queue is an object which contains the completed work requests which were posted to the Work Queues (WQ). Every completion says that a specific WR was completed (both successfully completed WRs and unsuccessfully completed WRs). A Completion Queue is a mechanism to notify the application about information of ended Work Requests (status, opcode, size, source).

The corresponding data structure is struct ibv_cq, and ibv_create_cq creates a CQ:

/**
 * ibv_create_cq - Create a completion queue
 * @context - Context CQ will be attached to
 * @cqe - Minimum number of entries required for CQ
 * @cq_context - Consumer-supplied context returned for completion events
 * @channel - Completion channel where completion events will be queued.
 *     May be NULL if completion events will not be used.
 * @comp_vector - Completion vector used to signal completion events.
 *     Must be >= 0 and < context->num_comp_vectors.
 */
struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
			     void *cq_context,
			     struct ibv_comp_channel *channel,
			     int comp_vector);
  • Memory Registration (MR)

Memory Registration is a mechanism that allows an application to describe a set of virtually contiguous memory locations or a set of physically contiguous memory locations to the network adapter as a virtually contiguous buffer using Virtual Addresses.

The corresponding data structure is struct ibv_mr:

struct ibv_mr {
	struct ibv_context     *context;
	struct ibv_pd	       *pd;
	void		       *addr;
	size_t			length;
	uint32_t		handle;
	uint32_t		lkey;
	uint32_t		rkey;
};

Every MR has a remote and a local key (rkey, lkey).

Local keys are used by the local HCA to access local memory, such as during a receive operation.

Remote keys are given to the remote HCA to allow a remote process access to system memory during RDMA operations.

ibv_reg_mr registers a memory region (MR), associates it with a protection domain (PD), and assigns it local and remote keys (lkey, rkey).

/**
 * ibv_reg_mr - Register a memory region
 */
struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
			  size_t length, int access);
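A minimal sketch of registering a buffer (pd is assumed to exist; the access flags here are just one common combination):

#include <stdlib.h>
#include <infiniband/verbs.h>

static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t size)
{
	void *buf = calloc(1, size);   /* any valid memory in the process works */
	if (!buf)
		return NULL;

	/* Local access plus permission for the remote side to RDMA-write here. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		free(buf);
	return mr;    /* mr->lkey / mr->rkey are then used in work requests */
}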
  • Protection Domain (PD)

Object whose components can interact with only each other. These components can be AH, QP, MR, and SRQ.

A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows , as a means for enabling and controlling network adapter access to Host System memory.

struct ibv_pd is used to implement protection domains:

struct ibv_pd {
	struct ibv_context     *context;
	uint32_t		handle;
};

ibv_alloc_pd creates a protection domain (PD). PDs limit which memory regions can be accessed by which queue pairs (QP) providing a degree of protection from unauthorized access.

/**
 * ibv_alloc_pd - Allocate a protection domain
 */
struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);
  • Send Request (SR)

An SR defines how much data will be sent, from where, how and, with RDMA, to where. struct ibv_send_wr is used to implement SRs. See struct ibv_send_wr.

Example (IB Verbs API)

RDMA applications can be programmed against either the librdmacm or the libibverbs API; the former is a higher-level wrapper around the latter.

rc_pingpong is an example that programs directly against the libibverbs API.

In general, the basic flow when using the IB Verbs API is as follows:

  • (1) Get the device list

First you must retrieve the list of available IB devices on the local host. Every device in this list contains both a name and a GUID. For example the device names can be: mthca0, mlx4_1. See here.

IB devices correspond to struct ibv_device:

struct ibv_device {
	struct _ibv_device_ops	_ops;
	enum ibv_node_type	node_type;
	enum ibv_transport_type	transport_type;
	/* Name of underlying kernel IB device, eg "mthca0" */
	char			name[IBV_SYSFS_NAME_MAX];
	/* Name of uverbs device, eg "uverbs0" */
	char			dev_name[IBV_SYSFS_NAME_MAX];
	/* Path to infiniband_verbs class device in sysfs */
	char			dev_path[IBV_SYSFS_PATH_MAX];
	/* Path to infiniband class device in sysfs */
	char			ibdev_path[IBV_SYSFS_PATH_MAX];
};

Applications obtain the list of IB devices through the ibv_get_device_list API:

/**
 * ibv_get_device_list - Get list of IB devices currently available
 * @num_devices: optional.  if non-NULL, set to the number of devices
 * returned in the array.
 *
 * Return a NULL-terminated array of IB devices.  The array can be
 * released with ibv_free_device_list().
 */
struct ibv_device **ibv_get_device_list(int *num_devices);
  • (2) Open the requested device

Iterate over the device list, choose a device according to its GUID or name and open it. See here.

The application calls ibv_open_device to open the IB device:

/**
 * ibv_open_device - Initialize device for use
 */
struct ibv_context *ibv_open_device(struct ibv_device *device);

It returns an ibv_context object:

struct ibv_context {
	struct ibv_device      *device;
	struct ibv_context_ops	ops;
	int			cmd_fd;
	int			async_fd;
	int			num_comp_vectors;
	pthread_mutex_t		mutex;
	void		       *abi_compat;
};
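Putting steps (1) and (2) together, a minimal sketch that picks the first device and opens it:

#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_context *open_first_device(void)
{
	int num = 0;
	struct ibv_device **list = ibv_get_device_list(&num);
	struct ibv_context *ctx = NULL;

	if (!list || num == 0) {
		fprintf(stderr, "no IB devices found\n");
		return NULL;
	}
	printf("using device %s (uverbs: %s)\n",
	       ibv_get_device_name(list[0]), list[0]->dev_name);

	ctx = ibv_open_device(list[0]);   /* opens /dev/infiniband/uverbsN */
	ibv_free_device_list(list);       /* the open context stays valid */
	return ctx;
}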
  • (3) Allocate a Protection Domain

Allocate a PD; see here.

A Protection Domain (PD) allows the user to restrict which components can interact with only each other.

These components can be AH, QP, MR, MW, and SRQ.

  • (4) Register a memory region

Register an MR; see here.

Any memory buffer which is valid in the process’s virtual space can be registered.

During the registration process the user sets memory permissions and receives local and remote keys (lkey/rkey) which will later be used to refer to this memory buffer.

  • (5) Create a Completion Queue(CQ)

Create a CQ; see here.

A CQ contains completed work requests (WR). Each WR will generate a completion queue entry (CQE) that is placed on the CQ.

The CQE will specify if the WR was completed successfully or not.

  • (6) Create a Queue Pair(QP)

Create a QP; see here.

Creating a QP will also create an associated send queue and receive queue.

  • (7) Bring up a QP

Bring up the QP; see here.

A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).

This provides needed information used by the QP to be able send / receive data.

ibv_modify_qp changes the state of a QP:

/**
 * ibv_modify_qp - Modify a queue pair.
 */
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
		  int attr_mask);

For example, for a client/server pair, the QP must be brought to the RTS state; see rc_pingpong@pp_connect_ctx.

A QP goes through the following states (a minimal ibv_modify_qp sketch follows the list):

RESET   Newly created, queues empty.
INIT    Basic information set. Ready for posting to receive queue.
RTR     Ready to Receive. Remote address info set for connected QPs, QP may now receive packets.
RTS     Ready to Send. Timeout and retry parameters set, QP may now send packets.
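As mentioned above, here is a minimal sketch of the first transition (RESET to INIT); the later RTR and RTS transitions additionally need the peer's QPN/PSN and path information exchanged out of band, and the port number and access flags below are example values:

#include <infiniband/verbs.h>

static int qp_to_init(struct ibv_qp *qp)
{
	struct ibv_qp_attr attr = {
		.qp_state        = IBV_QPS_INIT,
		.pkey_index      = 0,
		.port_num        = 1,                       /* example HCA port */
		.qp_access_flags = IBV_ACCESS_REMOTE_WRITE, /* example rights */
	};

	return ibv_modify_qp(qp, &attr,
			     IBV_QP_STATE | IBV_QP_PKEY_INDEX |
			     IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
}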
  • (8) Post work requests and poll for completion

Use the created QP for communication operations.

See pp_post_send and ibv_poll_cq.

  • (9) Cleanup
Destroy objects in the reverse order you created them:
Delete QP
Delete CQ
Deregister MR
Deallocate PD
Close device

Testing

  • server
# ibv_rc_pingpong -d rxe0 -g 0 -s 128 -r 1 -n 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
  remote address: LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
256 bytes in 0.00 seconds = 11.38 Mbit/sec
1 iters in 0.00 seconds = 180.00 usec/iter
  • client
# ibv_rc_pingpong -d rxe0 -g 0 172.18.42.162 -s 128 -r 1 -n 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
  remote address: LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
256 bytes in 0.00 seconds = 16.13 Mbit/sec
1 iters in 0.00 seconds = 127.00 usec/iter

A packet capture shows the communication flow between the client and the server:

The first RC Send Only is the packet the client sends to the server (see here). The server then replies with an RC Ack and sends an RC Send Only of its own to the client (see here).

The preceding TCP packets carry the control information exchanged between the client and the server; see here.

Refs

Multiple queue and RSS in DPDK (2017-10-17, hustcat, http://hustcat.github.io/rss-in-dpdk)

RX queue

rte_eth_dev->data (struct rte_eth_dev_data) holds the device's RX/TX queue information:

struct rte_eth_dev_data {
	char name[RTE_ETH_NAME_MAX_LEN]; /**< Unique identifier name */

	void **rx_queues; /**< Array of pointers to RX queues. */
	void **tx_queues; /**< Array of pointers to TX queues. */
	uint16_t nb_rx_queues; /**< Number of RX queues. */
	uint16_t nb_tx_queues; /**< Number of TX queues. */
///...

rx_queues is an array of pointers to the receive queues, each pointing to one concrete RX queue. Taking the igb driver (drivers/net/e1000) as an example:

/**
 * Structure associated with each RX queue.
 */
struct igb_rx_queue {
	struct rte_mempool  *mb_pool;   /**< mbuf pool to populate RX ring. */
	volatile union e1000_adv_rx_desc *rx_ring; /**< RX ring virtual address. */
	uint64_t            rx_ring_phys_addr; /**< RX ring DMA address. */
	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
	struct igb_rx_entry *sw_ring;   /**< address of RX software ring. */
	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
	struct rte_mbuf *pkt_last_seg;  /**< Last segment of current packet. */
	uint16_t            nb_rx_desc; /**< number of RX descriptors. */
	uint16_t            rx_tail;    /**< current value of RDT register. */
	uint16_t            nb_rx_hold; /**< number of held free RX desc. */
	uint16_t            rx_free_thresh; /**< max free RX desc to hold. */
	uint16_t            queue_id;   /**< RX queue index. */
	uint16_t            reg_idx;    /**< RX queue register index. */
	uint8_t             port_id;    /**< Device port identifier. */
	uint8_t             pthresh;    /**< Prefetch threshold register. */
	uint8_t             hthresh;    /**< Host threshold register. */
	uint8_t             wthresh;    /**< Write-back threshold register. */
	uint8_t             crc_len;    /**< 0 if CRC stripped, 4 otherwise. */
	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
};

Each queue contains a hardware descriptor ring (rx_ring) and a software descriptor ring (sw_ring). rx_ring is used mainly by the driver and the hardware; sw_ring is essentially an array of mbuf pointers and is used mainly by the DPDK application.

  • e1000_adv_rx_desc

This is the hardware descriptor; all e1000_adv_rx_desc entries together form a circular DMA buffer. On receive, pkt_addr points at rte_mbuf->buf_physaddr, so when the NIC receives data it writes it directly into the data buffer backing the mbuf.

/* Receive Descriptor - Advanced */
union e1000_adv_rx_desc {
	struct {
		__le64 pkt_addr; /* Packet buffer address */
		__le64 hdr_addr; /* Header buffer address */
	} read; ///for receive
	struct {
		struct {
			union {
				__le32 data;
				struct {
					__le16 pkt_info; /*RSS type, Pkt type*/
					/* Split Header, header buffer len */
					__le16 hdr_info;
				} hs_rss;
			} lo_dword;
			union {
				__le32 rss; /* RSS Hash */
				struct {
					__le16 ip_id; /* IP id */
					__le16 csum; /* Packet Checksum */
				} csum_ip;
			} hi_dword;
		} lower;
		struct {
			__le32 status_error; /* ext status/error */
			__le16 length; /* Packet length */
			__le16 vlan; /* VLAN tag */
		} upper;
	} wb;  /* writeback */
};
  • igb_rx_entry

Every hardware descriptor has a corresponding software descriptor, which is the bridge for passing data between the DPDK application and the DPDK driver. It is essentially a pointer to an rte_mbuf: rte_mbuf->buf_physaddr is the physical DMA address used by the NIC hardware, while rte_mbuf->buf_addr is the virtual address of the buffer, used by the DPDK application.

/**
 * Structure associated with each descriptor of the RX ring of a RX queue.
 */
struct igb_rx_entry {
	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
};

/**
 * The generic rte_mbuf, containing a packet mbuf.
 */
struct rte_mbuf {
	MARKER cacheline0;

	void *buf_addr;           /**< Virtual address of segment buffer. */
	/**
	 * Physical address of segment buffer.
	 * Force alignment to 8-bytes, so as to ensure we have the exact
	 * same mbuf cacheline0 layout for 32-bit and 64-bit. This makes
	 * working on vector drivers easier.
	 */
	phys_addr_t buf_physaddr __rte_aligned(sizeof(phys_addr_t));
///...

Config queue

A DPDK application can call rte_eth_dev_configure to set the number of queues of a port:

		ret = rte_eth_dev_configure(portid, nb_rx_queue,
					(uint16_t)n_tx_queue, &port_conf);

rte_eth_dev_configure calls rte_eth_dev_rx_queue_config and rte_eth_dev_tx_queue_config to set up the RX and TX queues:

rte_eth_dev_configure
|---rte_eth_dev_rx_queue_config
|---rte_eth_dev_tx_queue_config
  • config rx queue
static int
rte_eth_dev_rx_queue_config(struct rte_eth_dev *dev, uint16_t nb_queues)
{
	uint16_t old_nb_queues = dev->data->nb_rx_queues;
	void **rxq;
	unsigned i;

	if (dev->data->rx_queues == NULL && nb_queues != 0) { /* first time configuration */
		dev->data->rx_queues = rte_zmalloc("ethdev->rx_queues",
				sizeof(dev->data->rx_queues[0]) * nb_queues,
				RTE_CACHE_LINE_SIZE);
		if (dev->data->rx_queues == NULL) {
			dev->data->nb_rx_queues = 0;
			return -(ENOMEM);
		}
	}
///...

Setup queue

  • rte_eth_rx_queue_setup

Every DPDK application calls rte_eth_rx_queue_setup to initialize its receive queues.

int
rte_eth_rx_queue_setup(uint8_t port_id, uint16_t rx_queue_id,
		       uint16_t nb_rx_desc, unsigned int socket_id,
		       const struct rte_eth_rxconf *rx_conf,
		       struct rte_mempool *mp)
{
///...
	ret = (*dev->dev_ops->rx_queue_setup)(dev, rx_queue_id, nb_rx_desc,
					      socket_id, rx_conf, mp); ///eth_igb_ops, eth_igb_rx_queue_setup
}

eth_igb_rx_queue_setup creates the receive queue (igb_rx_queue), and allocates the RX ring hardware descriptors (e1000_adv_rx_desc) and the software ring (igb_rx_entry):

int
eth_igb_rx_queue_setup(struct rte_eth_dev *dev,
			 uint16_t queue_idx,
			 uint16_t nb_desc,
			 unsigned int socket_id,
			 const struct rte_eth_rxconf *rx_conf,
			 struct rte_mempool *mp)
{
	const struct rte_memzone *rz;
	struct igb_rx_queue *rxq;
	struct e1000_hw     *hw;
	unsigned int size;

	hw = E1000_DEV_PRIVATE_TO_HW(dev->data->dev_private);
///...
	/* First allocate the RX queue data structure. */
	rxq = rte_zmalloc("ethdev RX queue", sizeof(struct igb_rx_queue),
			  RTE_CACHE_LINE_SIZE);
///...
	/*
	 *  Allocate RX ring hardware descriptors. A memzone large enough to
	 *  handle the maximum ring size is allocated in order to allow for
	 *  resizing in later calls to the queue setup function.
	 */
	size = sizeof(union e1000_adv_rx_desc) * E1000_MAX_RING_DESC;
	rz = rte_eth_dma_zone_reserve(dev, "rx_ring", queue_idx, size,
				      E1000_ALIGN, socket_id);
///...
	rxq->rdt_reg_addr = E1000_PCI_REG_ADDR(hw, E1000_RDT(rxq->reg_idx));
	rxq->rdh_reg_addr = E1000_PCI_REG_ADDR(hw, E1000_RDH(rxq->reg_idx));
	rxq->rx_ring_phys_addr = rte_mem_phy2mch(rz->memseg_id, rz->phys_addr);
	rxq->rx_ring = (union e1000_adv_rx_desc *) rz->addr;

	/* Allocate software ring. */
	rxq->sw_ring = rte_zmalloc("rxq->sw_ring",
				   sizeof(struct igb_rx_entry) * nb_desc,
				   RTE_CACHE_LINE_SIZE);
}

eth_igb_rx_queue_setup essentially completes the initialization of the DMA descriptor ring.
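On the application side the call is a one-liner per queue; a minimal sketch (the descriptor count and the mbuf pool are assumptions belonging to the application's own setup):

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Set up one RX queue of a port, backed by the given mbuf pool. */
static int setup_rx_queue(uint8_t port_id, uint16_t queue_id,
			  struct rte_mempool *mbuf_pool)
{
	const uint16_t nb_rxd = 128;   /* example ring size */

	/* Passing NULL for rx_conf selects the driver's default RX config. */
	return rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
				      rte_eth_dev_socket_id(port_id),
				      NULL, mbuf_pool);
}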

RSS

  • Configure RSS with DPDK

Setting rxmode.mq_mode = ETH_MQ_RX_RSS (passed to rte_eth_dev_configure) enables RSS on a port. Taking l3fwd as an example:

static struct rte_eth_conf port_conf = {
	.rxmode = {
		.mq_mode = ETH_MQ_RX_RSS,
		.max_rx_pkt_len = ETHER_MAX_LEN,
		.split_hdr_size = 0,
		.header_split   = 0, /**< Header Split disabled */
		.hw_ip_checksum = 1, /**< IP checksum offload enabled */
		.hw_vlan_filter = 0, /**< VLAN filtering disabled */
		.jumbo_frame    = 0, /**< Jumbo Frame Support disabled */
		.hw_strip_crc   = 1, /**< CRC stripped by hardware */
	},
	.rx_adv_conf = {
		.rss_conf = {
			.rss_key = NULL,
			.rss_hf = ETH_RSS_IP,
		},
	},
	.txmode = {
		.mq_mode = ETH_MQ_TX_NONE,
	},
};
  • Driver(igb) config RSS

eth_igb_start -> eth_igb_rx_init -> igb_dev_mq_rx_configure

//drivers/net/e1000/igb_rxtx.c
static int
igb_dev_mq_rx_configure(struct rte_eth_dev *dev)
{
	struct e1000_hw *hw =
		E1000_DEV_PRIVATE_TO_HW(dev->data->dev_private);
	uint32_t mrqc;

	if (RTE_ETH_DEV_SRIOV(dev).active == ETH_8_POOLS) {
		/*
		 * SRIOV active scheme
		 * FIXME if support RSS together with VMDq & SRIOV
		 */
		mrqc = E1000_MRQC_ENABLE_VMDQ;
		/* 011b Def_Q ignore, according to VT_CTL.DEF_PL */
		mrqc |= 0x3 << E1000_MRQC_DEF_Q_SHIFT;
		E1000_WRITE_REG(hw, E1000_MRQC, mrqc);
	} else if(RTE_ETH_DEV_SRIOV(dev).active == 0) { ///disable SRIOV
		/*
		 * SRIOV inactive scheme
		 */
		switch (dev->data->dev_conf.rxmode.mq_mode) {
			case ETH_MQ_RX_RSS:
				igb_rss_configure(dev); ///RSS
				break;
///...
}

static void
igb_rss_configure(struct rte_eth_dev *dev)
{
///...
	if (rss_conf.rss_key == NULL)
		rss_conf.rss_key = rss_intel_key; /* Default hash key */
	igb_hw_rss_hash_set(hw, &rss_conf);
}

Refs

KNI in DPDK (2017-10-11, hustcat, http://hustcat.github.io/kni-in-dpdk)

Introduction

The Kernel NIC Interface (KNI) is a DPDK control plane solution that allows userspace applications to exchange packets with the kernel networking stack. To accomplish this, DPDK userspace applications use an IOCTL call to request the creation of a KNI virtual device in the Linux* kernel. The IOCTL call provides interface information and the DPDK’s physical address space, which is re-mapped into the kernel address space by the KNI kernel loadable module that saves the information to a virtual device context. The DPDK creates FIFO queues for packet ingress and egress to the kernel module for each device allocated.

The KNI kernel loadable module is a standard net driver, which, upon receiving the IOCTL call, accesses the DPDK's FIFO queue to receive/transmit packets from/to the DPDK userspace application. The FIFO queues contain pointers to data packets in the DPDK. This:

  • Provides a faster mechanism to interface with the kernel net stack and eliminates system calls

  • Facilitates the DPDK using standard Linux* userspace net tools (tcpdump, ftp, and so on)

  • Eliminate the copy_to_user and copy_from_user operations on packets.

Testing

Load KNI kernel module:

# insmod /root/dpdk/x86/lib/modules/3.10.0-514.el7.x86_64/extra/dpdk/rte_kni.ko

Build KNI application:

# export RTE_SDK=/root/dpdk/x86/share/dpdk
# cd examples/kni
# make
  CC main.o
  LD kni
  INSTALL-APP kni
  INSTALL-MAP kni.map

Run KNI application:

# build/kni -c 0x0f -n 2 -- -P -p 0x3 --config="(0,0,1),(1,2,3)" 
EAL: Detected 4 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   probe driver: 8086:100e net_e1000_em
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   probe driver: 8086:100e net_e1000_em
EAL: PCI device 0000:00:07.0 on NUMA socket -1
EAL:   probe driver: 8086:100e net_e1000_em
APP: Initialising port 0 ...
KNI: pci: 00:06:00       8086:100e
APP: Initialising port 1 ...
KNI: pci: 00:07:00       8086:100e

Checking link status
.....done
Port 0 Link Up - speed 1000 Mbps - full-duplex
Port 1 Link Up - speed 1000 Mbps - full-duplex
APP: Lcore 1 is writing to port 0
APP: Lcore 2 is reading from port 1
APP: Lcore 0 is reading from port 0
APP: Lcore 3 is writing to port 1
...

where:

  • -c = core bitmask

  • -P = promiscuous mode

  • -p = port hex bitmask

  • --config="(port, lcore_rx, lcore_tx [,lcore_kthread, ...]) ..."

Note that each core can do either TX or RX for one port only.

[root@vm01 ~]# ip a
...
7: vEth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether ba:92:66:e5:2f:35 brd ff:ff:ff:ff:ff:ff
8: vEth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether b2:64:67:2f:32:4a brd ff:ff:ff:ff:ff:ff
[root@vm01 ~]# ip addr add 10.0.10.30/24 dev vEth0
[root@vm01 ~]# ip link set vEth0 up
[root@vm03 ~]# ping -c 3 10.0.10.30 
PING 10.0.10.30 (10.0.10.30) 56(84) bytes of data.
64 bytes from 10.0.10.30: icmp_seq=1 ttl=64 time=14.2 ms
64 bytes from 10.0.10.30: icmp_seq=2 ttl=64 time=2.96 ms
64 bytes from 10.0.10.30: icmp_seq=3 ttl=64 time=1.89 ms

Sending SIGUSR1 to the kni application process makes it print its statistics:

[root@vm01 ~]# pkill -10 kni

...
**KNI example application statistics**
======  ==============  ============  ============  ============  ============
 Port    Lcore(RX/TX)    rx_packets    rx_dropped    tx_packets    tx_dropped
------  --------------  ------------  ------------  ------------  ------------
      0          0/ 1            23             0             5             0
      1          2/ 3             1             0             0             0
======  ==============  ============  ============  ============  ============

Implementation

Related code:

The KNI example program lives in examples/kni, the KNI kernel module in lib/librte_eal/linuxapp/kni, and the KNI library in lib/librte_kni.

The overall implementation looks like this:

Receiving data

The application first calls rte_eth_rx_burst to read packets from the network interface, then calls rte_kni_tx_burst to pass them to the kernel module through the FIFO.

		/* Burst rx from eth */
		nb_rx = rte_eth_rx_burst(port_id, 0, pkts_burst, PKT_BURST_SZ);

		/* Burst tx to kni */
		num = rte_kni_tx_burst(p->kni[i], pkts_burst, nb_rx);

rte_kni_tx_burst:

///librte_kni
unsigned
rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned num)
{
	void *phy_mbufs[num];
	unsigned int ret;
	unsigned int i;

	for (i = 0; i < num; i++)
		phy_mbufs[i] = va2pa(mbufs[i]);

	ret = kni_fifo_put(kni->rx_q, phy_mbufs, num);

	/* Get mbufs from free_q and then free them */
	kni_free_mbufs(kni);

	return ret;
}

/**
 * Adds num elements into the fifo. Return the number actually written
 */
static inline unsigned
kni_fifo_put(struct rte_kni_fifo *fifo, void **data, unsigned num)
{
	unsigned i = 0;
	unsigned fifo_write = fifo->write;
	unsigned fifo_read = fifo->read;
	unsigned new_write = fifo_write;

	for (i = 0; i < num; i++) {
		new_write = (new_write + 1) & (fifo->len - 1);

		if (new_write == fifo_read)
			break;
		fifo->buffer[fifo_write] = data[i];
		fifo_write = new_write;
	}
	fifo->write = fifo_write;
	return i;
}
  • fifo

The DPDK application exchanges data with the kernel module through FIFOs; a FIFO is really a ring buffer in shared memory:

/*
 * Fifo struct mapped in a shared memory. It describes a circular buffer FIFO
 * Write and read should wrap around. Fifo is empty when write == read
 * Writing should never overwrite the read position
 */
struct rte_kni_fifo {
	volatile unsigned write;     /**< Next position to be written*/
	volatile unsigned read;      /**< Next position to be read */
	unsigned len;                /**< Circular buffer length */
	unsigned elem_size;          /**< Pointer size - for 32/64 bit OS */
	void *volatile buffer[];     /**< The buffer contains mbuf pointers */
};
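The dequeue side used by rte_kni_rx_burst and the kernel thread is the mirror image of kni_fifo_put shown above; a simplified sketch (not a verbatim copy of the DPDK sources):

/* Get up to num elements from the fifo; returns the number actually read. */
static inline unsigned
kni_fifo_get(struct rte_kni_fifo *fifo, void **data, unsigned num)
{
	unsigned i = 0;
	unsigned fifo_write = fifo->write;
	unsigned fifo_read = fifo->read;

	for (i = 0; i < num; i++) {
		if (fifo_read == fifo_write)
			break;                 /* empty when read == write */
		data[i] = fifo->buffer[fifo_read];
		fifo_read = (fifo_read + 1) & (fifo->len - 1);
	}
	fifo->read = fifo_read;
	return i;
}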

During initialization the DPDK application has to tell the KNI kernel module the addresses of the FIFO shared memory:

struct rte_kni *
rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
	      const struct rte_kni_conf *conf,
	      struct rte_kni_ops *ops)
{
///...
	/* TX RING */
	mz = slot->m_tx_q;
	ctx->tx_q = mz->addr;
	kni_fifo_init(ctx->tx_q, KNI_FIFO_COUNT_MAX);
	dev_info.tx_phys = mz->phys_addr;

	/* RX RING */
	mz = slot->m_rx_q;
	ctx->rx_q = mz->addr;
	kni_fifo_init(ctx->rx_q, KNI_FIFO_COUNT_MAX);
	dev_info.rx_phys = mz->phys_addr;
///...
	ret = ioctl(kni_fd, RTE_KNI_IOCTL_CREATE, &dev_info); ///to the KNI kernel module
    
}

KNI kernel module

static int
kni_ioctl(struct inode *inode, uint32_t ioctl_num, unsigned long ioctl_param)
{
///..
	case _IOC_NR(RTE_KNI_IOCTL_CREATE):
		ret = kni_ioctl_create(net, ioctl_num, ioctl_param);

kni_ioctl_create creates the corresponding network device (vEthX), sets up the FIFO shared memory, and starts the kernel thread:

static int
kni_ioctl_create(struct net *net, uint32_t ioctl_num,
		unsigned long ioctl_param)
{
///...
	net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
#ifdef NET_NAME_USER
							NET_NAME_USER,
#endif
							kni_net_init);
///...
	/* Translate user space info into kernel space info */
	kni->tx_q = phys_to_virt(dev_info.tx_phys);
	kni->rx_q = phys_to_virt(dev_info.rx_phys);
	kni->alloc_q = phys_to_virt(dev_info.alloc_phys);
	kni->free_q = phys_to_virt(dev_info.free_phys);
///...
	ret = kni_run_thread(knet, kni, dev_info.force_bind);
///...
}

kernel thread:

The KNI kernel thread keeps reading packets from the FIFO shared memory and hands them to the kernel network stack for further processing:

static int
kni_thread_single(void *data)
{
	struct kni_net *knet = data;
	int j;
	struct kni_dev *dev;

	while (!kthread_should_stop()) {
		down_read(&knet->kni_list_lock);
		for (j = 0; j < KNI_RX_LOOP_NUM; j++) {
			list_for_each_entry(dev, &knet->kni_list_head, list) {
				kni_net_rx(dev);
				kni_net_poll_resp(dev);
			}
		}
		up_read(&knet->kni_list_lock);
///...
}

/* rx interface */
void
kni_net_rx(struct kni_dev *kni)
{
	/**
	 * It doesn't need to check if it is NULL pointer,
	 * as it has a default value
	 */
	(*kni_net_rx_func)(kni); ///kni_net_rx_normal
}
  • kni_net_rx_func

kni_net_rx_func reads packets from the FIFO shared memory, allocates an skb, copies the data, and calls netif_rx_ni to inject the packet into the kernel network stack:

static void kni_net_rx_normal(struct kni_dev *kni);

/* kni rx function pointer, with default to normal rx */
static kni_net_rx_t kni_net_rx_func = kni_net_rx_normal;
/*
 * RX: normal working mode
 */
static void
kni_net_rx_normal(struct kni_dev *kni)
{
///...
	/* Calculate the number of entries to dequeue from rx_q */
	num_rx = min_t(uint32_t, num_fq, MBUF_BURST_SZ);

	/* Burst dequeue from rx_q */
	num_rx = kni_fifo_get(kni->rx_q, kni->pa, num_rx);
	if (num_rx == 0)
		return;
///...
	/* Transfer received packets to netif */
	for (i = 0; i < num_rx; i++) {
		kva = pa2kva(kni->pa[i]);
		len = kva->pkt_len;
		data_kva = kva2data_kva(kva);
		kni->va[i] = pa2va(kni->pa[i], kva);

		skb = dev_alloc_skb(len + 2);
///...
		skb->dev = dev;
		skb->protocol = eth_type_trans(skb, dev);
		skb->ip_summed = CHECKSUM_UNNECESSARY;

		/* Call netif interface */
		netif_rx_ni(skb); ///enter the kernel network stack
///...
	}
///...
}

Sending data

  • KNI kernel interface
/*
 * Transmit a packet (called by the kernel)
 */
static int
kni_net_tx(struct sk_buff *skb, struct net_device *dev)
{
///...
	/* dequeue a mbuf from alloc_q */
	ret = kni_fifo_get(kni->alloc_q, &pkt_pa, 1);
	if (likely(ret == 1)) {
		void *data_kva;

		pkt_kva = pa2kva(pkt_pa);
		data_kva = kva2data_kva(pkt_kva);
		pkt_va = pa2va(pkt_pa, pkt_kva);

		len = skb->len; /// data length
		memcpy(data_kva, skb->data, len); /// copy data
		if (unlikely(len < ETH_ZLEN)) {
			memset(data_kva + len, 0, ETH_ZLEN - len);
			len = ETH_ZLEN;
		}
		pkt_kva->pkt_len = len;
		pkt_kva->data_len = len;

		/* enqueue mbuf into tx_q */
		ret = kni_fifo_put(kni->tx_q, &pkt_va, 1);/// put tx_q
///...
  • DPDK app
		/* Burst rx from kni */
		num = rte_kni_rx_burst(p->kni[i], pkts_burst, PKT_BURST_SZ);

		/* Burst tx to eth */
		nb_tx = rte_eth_tx_burst(port_id, 0, pkts_burst, (uint16_t)num);

rte_kni_rx_burst dequeues mbufs from the tx_q queue:

unsigned
rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned num)
{
	unsigned ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);

	/* If buffers removed, allocate mbufs and then put them into alloc_q */
	if (ret)
		kni_allocate_mbufs(kni);

	return ret;
}

Refs

Introduction to the UIO (2017-10-10, hustcat, http://hustcat.github.io/introduction-to-uio)

UIO

Each UIO device can be accessed through a device file (/dev/uioX) and several sysfs attribute files.

The registers or RAM of a UIO device can be accessed by mmap()ing /dev/uioX.

Interrupts of a UIO device are obtained by reading /dev/uioX directly; the read() blocks and returns as soon as an interrupt occurs.

Each UIO device is accessed through a device file and several sysfs attribute files. The device file will be called /dev/uio0 for the first device, and /dev/uio1, /dev/uio2 and so on for subsequent devices.

/dev/uioX is used to access the address space of the card. Just use mmap() to access registers or RAM locations of your card.

Interrupts are handled by reading from /dev/uioX. A blocking read() from /dev/uioX will return as soon as an interrupt occurs. You can also use select() on /dev/uioX to wait for an interrupt. The integer value read from /dev/uioX represents the total interrupt count. You can use this number to figure out if you missed some interrupts.
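A minimal user-space sketch of that interrupt loop (the device path is an example; error handling trimmed):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/uio0", O_RDWR);
	uint32_t event_count;

	if (fd < 0) {
		perror("open /dev/uio0");
		return 1;
	}
	for (;;) {
		/* Blocks until the next interrupt; the value read is the
		 * total interrupt count for this device. */
		if (read(fd, &event_count, sizeof(event_count)) != sizeof(event_count))
			break;
		printf("interrupt, total count = %u\n", event_count);
	}
	close(fd);
	return 0;
}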

uio driver

uio_pci_generic

A UIO device needs the support of a UIO kernel driver; uio_pci_generic is a generic kernel driver for PCI UIO devices.

UIO does not completely eliminate the need for kernel-space code. A small module is required to set up the device, perhaps interface to the PCI bus, and register an interrupt handler. The last function (interrupt handling) is particularly important; much can be done in user space, but there needs to be an in-kernel interrupt handler which knows how to tell the device to stop crying for attention.

///drivers/uio/uio_pci_generic.c
static struct pci_driver driver = {
	.name = "uio_pci_generic",
	.id_table = NULL, /* only dynamic id's */
	.probe = probe,
	.remove = remove,
};

interrupt

When a UIO driver registers its device with uio_register_device, the interrupt handler uio_interrupt is registered as well:

/**
 * uio_register_device - register a new userspace IO device
 * @owner:	module that creates the new device
 * @parent:	parent device
 * @info:	UIO device capabilities
 *
 * returns zero on success or a negative error code.
 */
int __uio_register_device(struct module *owner,
			  struct device *parent,
			  struct uio_info *info)
{
///...
	if (info->irq && (info->irq != UIO_IRQ_CUSTOM)) {
		ret = request_irq(info->irq, uio_interrupt,
				  info->irq_flags, info->name, idev);
		if (ret)
			goto err_request_irq;
	}
}

/**
 * uio_interrupt - hardware interrupt handler
 * @irq: IRQ number, can be UIO_IRQ_CYCLIC for cyclic timer
 * @dev_id: Pointer to the devices uio_device structure
 */
static irqreturn_t uio_interrupt(int irq, void *dev_id)
{
	struct uio_device *idev = (struct uio_device *)dev_id;
	irqreturn_t ret = idev->info->handler(irq, idev->info); ///irqhandler

	if (ret == IRQ_HANDLED)
		uio_event_notify(idev->info); ///notify userspace

	return ret;
}
  • uio_event_notify

Every time an interrupt occurs on a UIO device, uio_device->event is incremented and the waiting processes are woken up:

/** notify userspace application
 * uio_event_notify - trigger an interrupt event
 * @info: UIO device capabilities
 */
void uio_event_notify(struct uio_info *info)
{
	struct uio_device *idev = info->uio_dev;

	atomic_inc(&idev->event);
	wake_up_interruptible(&idev->wait); ///wake up waiting process
	kill_fasync(&idev->async_queue, SIGIO, POLL_IN);
}
  • read /dev/uioX

Every time /dev/uioX is opened, the kernel creates a uio_listener and attaches it to struct file->private_data:

static int uio_open(struct inode *inode, struct file *filep)
{
	struct uio_device *idev;
	struct uio_listener *listener;
	int ret = 0;

	mutex_lock(&minor_lock);
	idev = idr_find(&uio_idr, iminor(inode));
	mutex_unlock(&minor_lock);
///...
	listener = kmalloc(sizeof(*listener), GFP_KERNEL);
	if (!listener) {
		ret = -ENOMEM;
		goto err_alloc_listener;
	}

	listener->dev = idev;
	listener->event_count = atomic_read(&idev->event);
	filep->private_data = listener;

Then, on every read(), if uio_listener->event_count differs from uio_device->event, an interrupt has occurred and the current count is returned; otherwise the current process is added to the uio_device->wait queue and put to sleep:

static ssize_t uio_read(struct file *filep, char __user *buf,
			size_t count, loff_t *ppos)
{
	struct uio_listener *listener = filep->private_data;
	struct uio_device *idev = listener->dev;
	DECLARE_WAITQUEUE(wait, current);
///...
	add_wait_queue(&idev->wait, &wait);

	do {
		set_current_state(TASK_INTERRUPTIBLE);

		event_count = atomic_read(&idev->event);
		if (event_count != listener->event_count) { ///irq happened
			if (copy_to_user(buf, &event_count, count))
				retval = -EFAULT;
			else {
				listener->event_count = event_count;
				retval = count;
			}
			break;
		}

		if (filep->f_flags & O_NONBLOCK) {
			retval = -EAGAIN;
			break;
		}

		if (signal_pending(current)) {
			retval = -ERESTARTSYS;
			break;
		}
		schedule();
	} while (1);

	__set_current_state(TASK_RUNNING);
	remove_wait_queue(&idev->wait, &wait);

	return retval;
}
  • uio_mmap

When the user-space driver mmap()s /dev/uioX, the kernel calls uio_mmap to map the UIO device's RAM:

static int uio_mmap(struct file *filep, struct vm_area_struct *vma)
{
	struct uio_listener *listener = filep->private_data;
	struct uio_device *idev = listener->dev;
	int mi;
	unsigned long requested_pages, actual_pages;
	int ret = 0;

	if (vma->vm_end < vma->vm_start)
		return -EINVAL;

	vma->vm_private_data = idev;

	mi = uio_find_mem_index(vma);
	if (mi < 0)
		return -EINVAL;

	requested_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
	actual_pages = ((idev->info->mem[mi].addr & ~PAGE_MASK)
			+ idev->info->mem[mi].size + PAGE_SIZE -1) >> PAGE_SHIFT;
	if (requested_pages > actual_pages)
		return -EINVAL;

	if (idev->info->mmap) {
		ret = idev->info->mmap(idev->info, vma);
		return ret;
	}

	switch (idev->info->mem[mi].memtype) {
		case UIO_MEM_PHYS:
			return uio_mmap_physical(vma);
		case UIO_MEM_LOGICAL:
		case UIO_MEM_VIRTUAL:
			return uio_mmap_logical(vma);
		default:
			return -EINVAL;
	}
}

For more on mmap, see Linux MMAP & Ioremap introduction.

userspace driver

fd = open("/dev/uio0", O_RDWR|O_SYNC);
/* Map device's registers into user memory */
/* fitting the memory area on pages */
offset = addr & ~PAGE_MASK;
addr = 0 /* region 0 */ * PAGE_SIZE;
size = (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
iomem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED,
             fd, addr);
iomem += offset;
/* Stop the counting */
*(u_char *)SH_TMU_TSTR(iomem) &= ~(TSTR_TSTR2);
...
/* Wait for an interrupt */
read(fd, &n_pending, sizeof(u_long));
val = *(u_int *)SH_TMU2_TCNT(iomem);
...
/* Stop the TMU */
*(u_char *)SH_TMU_TSTR(iomem) &= ~(TSTR_TSTR2);
munmap(iomem, size);
close(fd);

For details, see Using UIO in an embedded platform.

Refs

Checksum in Linux Kernel (2017-08-15, hustcat, http://hustcat.github.io/checksum-in-kernel)

Calculating the IP/TCP/UDP checksum

For how the kernel calculates IP/TCP/UDP checksums, see How to Calculate IP/TCP/UDP Checksum - Part 1 Theory.

In short, the data to be checksummed is summed in 16-bit units (with end-around carry) and the result is then inverted.
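A user-space version of that algorithm (the classic RFC 1071 one's-complement sum, not the kernel's optimized assembly) looks like this:

#include <stddef.h>
#include <stdint.h>

/* Internet checksum: add the data as 16-bit words, fold the carries back in
 * (one's-complement addition), then return the bitwise complement. */
static uint16_t inet_checksum(const void *data, size_t len)
{
	const uint16_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {
		sum += *p++;
		len -= 2;
	}
	if (len)                              /* odd trailing byte */
		sum += *(const uint8_t *)p;

	while (sum >> 16)                     /* fold carries */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}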

When TCP receives a packet, it verifies the checksum:

static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);

	if (skb->ip_summed == CHECKSUM_COMPLETE) {
		if (!tcp_v4_check(skb->len, iph->saddr, ///check TCP/UDP pseudo-header checksum
				  iph->daddr, skb->csum)) {
			skb->ip_summed = CHECKSUM_UNNECESSARY;
			return 0;
		}
	}

	skb->csum = csum_tcpudp_nofold(iph->saddr, iph->daddr,
				       skb->len, IPPROTO_TCP, 0); ///calc pseudo header checksum

	if (skb->len <= 76) {
		return __skb_checksum_complete(skb); /// compute the checksum of the whole packet on top of the pseudo-header sum
	}
	return 0;
}

csum_tcpudp_nofold computes the checksum of the pseudo-header; __skb_checksum_complete then computes the checksum of the whole skb on top of that pseudo-header sum (skb->csum).

net_device->features

The net_device->features field describes the device's capabilities. Some of its bits describe the hardware checksum offload capabilities:

#define NETIF_F_HW_CSUM		__NETIF_F(HW_CSUM)
#define NETIF_F_IP_CSUM		__NETIF_F(IP_CSUM) ///ipv4 + TCP/UDP
#define NETIF_F_IPV6_CSUM	__NETIF_F(IPV6_CSUM)

NETIF_F_IP_CSUM means the hardware can compute the L4 checksum, but only for TCP and UDP over IPv4; some devices extend this to VXLAN and NVGRE. NETIF_F_IP_CSUM is a protocol-aware way of offloading the checksum. Concretely, the upper layer provides two checksum parameters (csum_start and csum_offset).

NETIF_F_HW_CSUM is a protocol agnostic method to offload the transmit checksum. In this method the host provides checksum related parameters in a transmit descriptor for a packet. These parameters include the starting offset of data to checksum and the offset in the packet where the computed checksum is to be written. The length of data to checksum is implicitly the length of the packet minus the starting offset.

It is worth noting that igb/ixgbe use NETIF_F_IP_CSUM.

sk_buff

Depending on whether the skb is a received packet or a transmitted packet, the meanings of skb->csum and skb->ip_summed differ.

/*
 *	@csum: Checksum (must include start/offset pair)
 *	@csum_start: Offset from skb->head where checksumming should start
 *	@csum_offset: Offset from csum_start where checksum should be stored
 *	@ip_summed: Driver fed us an IP checksum
 */
struct sk_buff {
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};

	__u8			local_df:1,
				cloned:1,
				ip_summed:2,
				nohdr:1,
				nfctinfo:3;

The usual values of skb->ip_summed are:

/* Don't change this without changing skb_csum_unnecessary! */
#define CHECKSUM_NONE 0
#define CHECKSUM_UNNECESSARY 1 ///hardware verified the checksums
#define CHECKSUM_COMPLETE 2
#define CHECKSUM_PARTIAL 3 ///only compute IP header, not include data

CSUM on receive

For received packets, skb->csum may contain the L4 checksum, and skb->ip_summed describes the state of the L4 checksum:

  • (1) CHECKSUM_UNNECESSARY

CHECKSUM_UNNECESSARY means the underlying hardware has already verified the checksum. Taking the igb driver as an example:

igb_poll -> igb_clean_rx_irq -> igb_process_skb_fields -> igb_rx_checksum:

static inline void igb_rx_checksum(struct igb_ring *ring,
				   union e1000_adv_rx_desc *rx_desc,
				   struct sk_buff *skb)
{
///...
	/* Rx checksum disabled via ethtool */
	if (!(ring->netdev->features & NETIF_F_RXCSUM)) ///RXCSUM disabled
		return;

	/* TCP/UDP checksum error bit is set */
	if (igb_test_staterr(rx_desc,
			     E1000_RXDEXT_STATERR_TCPE |
			     E1000_RXDEXT_STATERR_IPE)) {
		/* work around errata with sctp packets where the TCPE aka
		 * L4E bit is set incorrectly on 64 byte (60 byte w/o crc)
		 * packets, (aka let the stack check the crc32c)
		 */
		if (!((skb->len == 60) &&
		      test_bit(IGB_RING_FLAG_RX_SCTP_CSUM, &ring->flags))) {
			u64_stats_update_begin(&ring->rx_syncp);
			ring->rx_stats.csum_err++;
			u64_stats_update_end(&ring->rx_syncp);
		}
		/* let the stack verify checksum errors (hand the packet to the stack for further csum verification) */
		return;
	}
	/* It must be a TCP or UDP packet with a valid checksum */
	if (igb_test_staterr(rx_desc, E1000_RXD_STAT_TCPCS |
				      E1000_RXD_STAT_UDPCS))
		skb->ip_summed = CHECKSUM_UNNECESSARY; ///stack don't needed verify
}

When the TCP layer receives the packet and finds skb->ip_summed set to CHECKSUM_UNNECESSARY, it does not verify the checksum again:

int tcp_v4_rcv(struct sk_buff *skb)
{
///...
	/* An explanation is required here, I think.
	 * Packet length and doff are validated by header prediction,
	 * provided case of th->doff==0 is eliminated.
	 * So, we defer the checks. */
	if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
		goto csum_error;
///...
}

static inline int skb_csum_unnecessary(const struct sk_buff *skb)
{
	return skb->ip_summed & CHECKSUM_UNNECESSARY;
}
  • (2) CHECKSUM_NONE

The checksum in csum is not valid; this can happen for several reasons:

The device does not support hardware checksumming;

The device computed the hardware checksum and found the frame to be corrupted. In that case the driver could simply drop the frame, but some drivers (e.g. e1000/igb/ixgbe) do not; instead they set ip_summed to CHECKSUM_NONE and let the upper protocol stack recompute the checksum and handle the error.

  • (3) CHECKSUM_COMPLETE

This indicates the NIC has already computed the checksum over the L4 header and payload and that skb->csum has been filled in; the L4 receiver only needs to add the pseudo-header and verify the result. Taking TCP as an example:

static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);

	if (skb->ip_summed == CHECKSUM_COMPLETE) {
		if (!tcp_v4_check(skb->len, iph->saddr, ///check TCP/UDP pseudo-header checksum
				  iph->daddr, skb->csum)) {
			skb->ip_summed = CHECKSUM_UNNECESSARY;
			return 0;
		}
	}
///...
}

It is worth noting that igb/ixgbe do not use CHECKSUM_COMPLETE; they use CHECKSUM_UNNECESSARY instead.

Note the difference between CHECKSUM_COMPLETE and CHECKSUM_UNNECESSARY: with the former, the upper layer still has to compute the pseudo-header checksum and then verify the result, see tcp_v4_check. In earlier kernel versions this value was called CHECKSUM_HW.

  • The veth bug

Here is an interesting problem: Linux kernel bug delivers corrupt TCP/IP data to Mesos, Kubernetes, Docker containers.

The veth device changes CHECKSUM_NONE into CHECKSUM_UNNECESSARY. As a result, when the hardware receives a corrupted frame and it is later forwarded to a veth device, it ends up marked CHECKSUM_UNNECESSARY, and the upper protocol stack (TCP) no longer verifies the packet's checksum.

static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
///...

	/* don't change ip_summed == CHECKSUM_PARTIAL, as that
	 * will cause bad checksum on forwarded packets
	 */
	if (skb->ip_summed == CHECKSUM_NONE &&
	    rcv->features & NETIF_F_RXCSUM)
		skb->ip_summed = CHECKSUM_UNNECESSARY;
}

veth was originally a device for local communication, and local frames are unlikely to be corrupted. On transmit, if the protocol stack has already computed the checksum, skb->ip_summed is set to CHECKSUM_NONE, so for local veth traffic there is no need for the receiver to verify the checksum again. In container/virtualization scenarios, however, packets arriving on a veth may come from the network, and keeping this behaviour allows corrupted frames to be delivered to the application layer.

CSUM on transmit

Similarly, for transmitted packets skb->ip_summed describes the state of the L4 checksum and tells the underlying NIC whether it still has to handle the checksum:

  • (1) CHECKSUM_NONE

Here CHECKSUM_NONE means the protocol stack has already computed the checksum and the device has nothing to do.

  • (2) CHECKSUM_PARTIAL

CHECKSUM_PARTIAL means hardware checksumming is used: the protocol stack has already computed the L4 pseudo-header checksum and stored it in the check field (e.g. uh->check), and the device only has to compute the checksum over the whole L4 segment (header plus payload).

int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t size)
{
///...

				/*
				 * Check whether we can use HW checksum.
				 */
				if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
					skb->ip_summed = CHECKSUM_PARTIAL;
}


static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
			    gfp_t gfp_mask)
{
///...
	icsk->icsk_af_ops->send_check(sk, skb); ///tcp_v4_send_check
}


static void __tcp_v4_send_check(struct sk_buff *skb,
				__be32 saddr, __be32 daddr)
{
	struct tcphdr *th = tcp_hdr(skb);

	if (skb->ip_summed == CHECKSUM_PARTIAL) { ///HW CSUM
		th->check = ~tcp_v4_check(skb->len, saddr, daddr, 0); ///add IPv4 pseudo header checksum
		skb->csum_start = skb_transport_header(skb) - skb->head;
		skb->csum_offset = offsetof(struct tcphdr, check);
	} else {
		th->check = tcp_v4_check(skb->len, saddr, daddr,
					 csum_partial(th,
						      th->doff << 2,
						      skb->csum)); ///ip_summed == CHECKSUM_NONE
	}
}

/* This routine computes an IPv4 TCP checksum. */
void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb)
{
	const struct inet_sock *inet = inet_sk(sk);

	__tcp_v4_send_check(skb, inet->inet_saddr, inet->inet_daddr);
}
  • dev_queue_xmit

Finally, if dev_queue_xmit finds at transmit time that the device does not support hardware checksumming, the checksum is computed in software (is this path actually taken?):

int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
			struct netdev_queue *txq)
{
///...
		if (netif_needs_gso(skb, features)) {
			if (unlikely(dev_gso_segment(skb, features))) ///GSO(software offload)
				goto out_kfree_skb;
			if (skb->next)
				goto gso;
		} else { ///hardware offload
			if (skb_needs_linearize(skb, features) &&
			    __skb_linearize(skb))
				goto out_kfree_skb;

			/* If packet is not checksummed and device does not
			 * support checksumming for this protocol, complete
			 * checksumming here.
			 */
			if (skb->ip_summed == CHECKSUM_PARTIAL) { ///only header csum is computed
				if (skb->encapsulation)
					skb_set_inner_transport_header(skb,
						skb_checksum_start_offset(skb));
				else
					skb_set_transport_header(skb,
						skb_checksum_start_offset(skb));
				if (!(features & NETIF_F_ALL_CSUM) && ///check hardware if support offload
				     skb_checksum_help(skb)) ///HW not support CSUM
					goto out_kfree_skb;
			}
		}
}

ip_summed == CHECKSUM_PARTIAL means the protocol stack has not finished computing the checksum: it has only computed the pseudo-header and leaves the transport-layer data to the hardware. If the underlying hardware does not support checksum offload, skb_checksum_help finishes the checksum computation.

Remote checksum

TODO:

Refs

Mount namespace and mount propagation (2017-03-10, hustcat, http://hustcat.github.io/mount-namespace-and-mount-propagation)

Mount namespace and problems

When a new mount namespace is created, it receives a copy of the mount point list replicated from the namespace of the caller of clone() or unshare().

create_new_namespaces -> copy_mnt_ns -> dup_mnt_ns:

/*
 * Allocate a new namespace structure and populate it with contents
 * copied from the namespace of the passed in task structure.
 */
static struct mnt_namespace *dup_mnt_ns(struct mnt_namespace *mnt_ns,
		struct fs_struct *fs)
{
	struct mnt_namespace *new_ns;
	struct vfsmount *rootmnt = NULL, *pwdmnt = NULL;
	struct vfsmount *p, *q;

	new_ns = alloc_mnt_ns();
	if (IS_ERR(new_ns))
		return new_ns;

	down_write(&namespace_sem);
	/* First pass: copy the tree topology */
	new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
					CL_COPY_ALL | CL_EXPIRE); ///copy all mounts of the original namespace
 ///...

Each mount namespace has its own independent view of the filesystem, but this isolation also brings problems. For example, when a new disk is added to the system, in the original implementation every namespace had to mount the disk separately before it became visible. Often we would like to mount it once and have it visible in all mount namespaces. For this, kernel 2.6.15 introduced the shared subtrees feature.

The key benefit of shared subtrees is to allow automatic, controlled propagation of mount and unmount events between namespaces. This means, for example, that mounting an optical disk in one mount namespace can trigger a mount of that disk in all other namespaces.

To support the shared subtrees feature, every mount point carries a propagation type, which determines whether creating/removing (child) mount points under this mount point propagates to other mount points.

propagation type

The kernel has the following propagation types; a minimal mount(2) sketch for setting them follows the list:

  • MS_SHARED

This mount point shares mount and unmount events with other mount points that are members of its “peer group”. When a mount point is added or removed under this mount point, this change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point.

  • MS_PRIVATE

This is the converse of a shared mount point. The mount point does not propagate events to any peers, and does not receive propagation events from any peers.

  • MS_SLAVE

This propagation type sits midway between shared and private. A slave mount has a master—a shared peer group whose members propagate mount and unmount events to the slave mount. However, the slave mount does not propagate events to the master peer group.

  • MS_UNBINDABLE

This mount point is unbindable. Like a private mount point, this mount point does not propagate events to or from peers. In addition, this mount point can’t be the source for a bind mount operation.
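As referenced above, these types are set programmatically with mount(2); a minimal sketch of what `mount --make-private /` and `mount --make-shared /X` do (requires root, and /X must already be a mount point; /X is just an example path):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Equivalent of `mount --make-private /` (recursively). */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		perror("make / private");

	/* Equivalent of `mount --make-shared /X`. */
	if (mount("none", "/X", NULL, MS_SHARED, NULL) < 0)
		perror("make /X shared");

	return 0;
}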

A few things to note:

(1) The propagation type is a per-mount-point setting.

(2) The propagation type governs the propagation of mount/umount events for mount points immediately under the given mount point. For example, if a new mount point Y is created under mount point X, Y propagates to X's peer group, but X's setting does not affect mount points created underneath Y.

(3)

Peer groups

A peer group is a set of mount points that propagate mount/umount events to one another. For a shared mount point, new members are added when a new mount namespace is created or when the mount point is used as the source of a bind mount. Both cases create a new mount point, and the new mount point joins the peer group of the original one. Conversely, when a mount namespace is torn down or a mount point is unmounted, it is removed from its peer group.

A peer group is a set of mount points that propagate mount and unmount events to one another. A peer group acquires new members when a mount point whose propagation type is shared is either replicated during the creation of a new namespace or is used as the source for a bind mount. In both cases, the new mount point is made a member of the same peer group as the existing mount point. Conversely, a mount point ceases to be a member of a peer group when it is unmounted, either explicitly, or implicitly when a mount namespace is torn down because the last member process terminates or moves to another namespace.

  • Example

Run in sh1: make / private and create two shared mount points:

sh1# mount --make-private / 
sh1# mount --make-shared /dev/sda3 /X 
sh1# mount --make-shared /dev/sda5 /Y 

Then run in sh2: create a new mount namespace:

sh2# unshare -m --propagation unchanged sh 

Then run in sh1 again: bind mount X onto Z:

sh1# mkdir /Z 
sh1# mount --bind /X /Z 

This produces two peer groups:

  • The first peer group contains X, X', and Z: X and X' because of the namespace creation, X and Z because of the bind mount.
  • The second peer group contains only Y and Y'.

Note that because / is private, the bind mount of Z does not propagate to the second namespace.

Here is the Docker code that uses private mounts:

// InitializeMountNamespace sets up the devices, mount points, and filesystems for use inside a
// new mount namespace.
func InitializeMountNamespace(rootfs, console string, sysReadonly bool, mountConfig *MountConfig) error {
	var (
		err  error
		flag = syscall.MS_PRIVATE
	)

	if mountConfig.NoPivotRoot {
		flag = syscall.MS_SLAVE   ///mount events in the container do not propagate to the init ns
	}

	if err := syscall.Mount("", "/", "", uintptr(flag|syscall.MS_REC), ""); err != nil { ///make / private, fully isolated from the init ns
		return fmt.Errorf("mounting / with flags %X %s", (flag | syscall.MS_REC), err)
	}

	if err := syscall.Mount(rootfs, rootfs, "bind", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("mouting %s as bind %s", rootfs, err)
	}
///...

Reference

TCP SYN cookies make the window size suddenly become smaller (2017-03-03, hustcat, http://hustcat.github.io/tcp_syn_cookies_and_window_size)

The problem

Recently, a service team reported a problem where the window of a TCP connection suddenly became abnormally small, making data transfer extremely slow, as shown below:

When the server replied with the SYN-ACK, the window size was still 144800; but when the server acknowledged the client's first data packet, the window suddenly dropped to 60, even though the data packet was only 86 bytes long.

After quite a bit of debugging together with several colleagues, we finally found that it was caused by TCP SYN cookies. Here is a brief summary for posterity.

TCP introduced SYN cookies to deal with the SYN flood problem.

SYN cookie is a technique used to resist SYN flood attacks.The technique’s primary inventor Daniel J. Bernstein defines SYN cookies as “particular choices of initial TCP sequence numbers by TCP servers.” In particular, the use of SYN cookies allows a server to avoid dropping connections when the SYN queue fills up. Instead, the server behaves as if the SYN queue had been enlarged. The server sends back the appropriate SYN+ACK response to the client but discards the SYN queue entry. If the server then receives a subsequent ACK response from the client, the server is able to reconstruct the SYN queue entry using information encoded in the TCP sequence number.

Kernel parameters

  • sysctl_max_syn_backlog

sysctl_max_syn_backlog controls the length of a listening socket's half-open (SYN_RECV) queue:

struct inet_connection_sock {

	struct request_sock_queue icsk_accept_queue; ////SYN_RECV sockets queue
}

int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
{
	struct inet_sock *inet = inet_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
///...
	sk->sk_state = TCP_LISTEN;
}

int reqsk_queue_alloc(struct request_sock_queue *queue,
		      unsigned int nr_table_entries)
{ 
///...
	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
	
	for (lopt->max_qlen_log = 3;
	     (1 << lopt->max_qlen_log) < nr_table_entries;
	     lopt->max_qlen_log++);
///...
}

static inline int reqsk_queue_is_full(const struct request_sock_queue *queue)
{
	return queue->listen_opt->qlen >> queue->listen_opt->max_qlen_log;
}

The kernel computes the length of the listening socket's SYN queue from the listen() backlog and sysctl_max_syn_backlog. When the queue is full, the following log message is printed:

TCP: TCP: Possible SYN flooding on port 6000. Dropping request.  Check SNMP counters.
  • sysctl_tcp_syncookies

Controls whether the TCP SYN cookies mechanism is enabled.

extern int sysctl_tcp_syncookies;

How TCP handles new connections

When the receiver gets the sender's SYN packet, it creates a request_sock, returns a SYN/ACK to the sender, and adds the request_sock to the LISTEN socket's SYN table:

tcp_v4_do_rcv(TCP_LISTEN) -> tcp_rcv_state_process -> tcp_v4_conn_request:

///ipv4_specific, LISTEN socket handle SYN packet
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
        struct request_sock *req;
///...
	/* TW buckets are converted to open requests without
	 * limitations, they conserve resources and peer is
	 * evidently real one.
	 */
	if (inet_csk_reqsk_queue_is_full(sk) && !isn) { ///SYN queue is full
		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
		if (!want_cookie) ///no tcp_syncookies, drop SKB
			goto drop;
	}

	/* Accept backlog is full. If we have already queued enough
	 * of warm entries in syn queue, drop request. It is better than
	 * clogging syn queue with openreqs with exponentially increasing
	 * timeout.
	 */
	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
		goto drop;
	}
	
	req = inet_reqsk_alloc(&tcp_request_sock_ops);
	if (!req)
		goto drop;
///...
	if (likely(!do_fastopen)) {
		int err;
		err = ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr, ///send SYN/ACK
		     ireq->rmt_addr, ireq->opt);
		err = net_xmit_eval(err);
		if (err || want_cookie) ///tcp_syncookies, don't add to SYN queue
			goto drop_and_free;

		tcp_rsk(req)->snt_synack = tcp_time_stamp;
		tcp_rsk(req)->listener = NULL;
		/* Add the request_sock to the SYN table */
		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); ///Add SYN table
		if (fastopen_cookie_present(&foc) && foc.len != 0)
			NET_INC_STATS_BH(sock_net(sk),
			    LINUX_MIB_TCPFASTOPENPASSIVEFAIL);
	} else if (tcp_v4_conn_req_fastopen(sk, skb, skb_synack, req)) ///fast open
		goto drop_and_free;
///...
}

When the receiver later gets the ACK packet from the sender, the kernel looks up the corresponding request_sock in the SYN table and creates the new socket in tcp_check_req; at this point the TCP connection is established (TCP_ESTABLISHED): tcp_v4_do_rcv(TCP_LISTEN) -> tcp_v4_hnd_req -> tcp_check_req:

static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
	struct tcphdr *th = tcp_hdr(skb);
	const struct iphdr *iph = ip_hdr(skb);
	struct sock *nsk;
	struct request_sock **prev;
	/* Find possible connection requests. */
	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
						       iph->saddr, iph->daddr); ///get request_sock from SYN table
	if (req)
		return tcp_check_req(sk, skb, req, prev, false); /// create new socket
///...
}

SYN cookies

When the tcp_syncookies option is disabled, the SKB is simply dropped once the listen socket's SYN queue is full:

///ipv4_specific, LISTEN socket handle SYN packet
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
///..
	if (inet_csk_reqsk_queue_is_full(sk) && !isn) { ///SYN queue is full
		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
		if (!want_cookie) ///no tcp_syncookies, drop SKB
			goto drop;
	}

With tcp_syncookies enabled, when the listen socket's SYN queue is full the kernel still creates a request_sock and returns the SYN/ACK to the peer, but instead of adding the request_sock to the SYN queue it frees it:

	if (likely(!do_fastopen)) {
		int err;
		err = ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr, ///send SYN/ACK
		     ireq->rmt_addr, ireq->opt);
		err = net_xmit_eval(err);
		if (err || want_cookie) ///tcp_syncookies, don't add to SYN queue
			goto drop_and_free;

As a result, when the ACK from the peer arrives, tcp_v4_hnd_req cannot find a matching request_sock in the SYN queue and falls into the syncookies handling path: tcp_v4_do_rcv -> tcp_v4_hnd_req:

static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
///...
#ifdef CONFIG_SYN_COOKIES
	if (!th->syn)
		sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
#endif
	return sk;
}

cookie_v4_check validates the cookie, creates a new request_sock, and then continues with the normal connection setup.

SYN cookies and TCP options

For connections that go through the SYN cookies path, the kernel keeps no state for the socket, so the TCP options carried in the SYN packet are lost.

  • MSS

When the receiver returns the cookie to the sender (in the SYN-ACK), it encodes the MSS value into the cookie; when the sender echoes the cookie back, the receiver recovers the MSS value in cookie_v4_check:

struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
			     struct ip_options *opt)
{
///...
	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
		goto out;

	if (tcp_synq_no_recent_overflow(sk) ||
	    (mss = cookie_check(skb, cookie)) == 0) { ///mss option value
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
		goto out;
	}

///...
	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
	if (!req)
		goto out;
///...
	/* Try to redo what tcp_v4_send_synack did. */
	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
	///initial window
	tcp_select_initial_window(tcp_full_space(sk), req->mss,
				  &req->rcv_wnd, &req->window_clamp,
				  ireq->wscale_ok, &rcv_wscale,
				  dst_metric(&rt->dst, RTAX_INITRWND));

	ireq->rcv_wscale  = rcv_wscale;
///...
}
  • wscale

However, other options such as wscale and SACK are lost. The kernel later used the timestamp option to carry wscale and then removed that again, see 1 and 2; for the details see Improving syncookies.

On the TCP SYN cookies path, the receiver recomputes wscale after it receives the peer's ACK instead of using the wscale negotiated in the SYN/SYN-ACK handshake. Since the wscale computation depends on parameters such as the receive buffer size, the value computed the second time may differ from the one negotiated earlier, leaving the sender and the receiver with inconsistent wscale values:

void tcp_select_initial_window(int __space, __u32 mss,
			       __u32 *rcv_wnd, __u32 *window_clamp,
			       int wscale_ok, __u8 *rcv_wscale,
			       __u32 init_rcv_wnd)
{
	unsigned int space = (__space < 0 ? 0 : __space); ///sk_rcvbuf size
///...
	(*rcv_wscale) = 0;
	if (wscale_ok) {
		/* Set window scaling on max possible window
		 * See RFC1323 for an explanation of the limit to 14
		 */
		space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
		space = min_t(u32, space, *window_clamp);
		while (space > 65535 && (*rcv_wscale) < 14) {
			space >>= 1;
			(*rcv_wscale)++;
		}
	}

And the advertised TCP window is scaled by wscale: the server right-shifts the window it advertises using the recomputed wscale, while the client scales it back up using the wscale it learned during the handshake, which is exactly what caused the problem described at the beginning.
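
A tiny sketch of the arithmetic behind this; the wscale values below are assumed purely for illustration:

#include <stdio.h>

int main(void)
{
	/* Assumed values, for illustration only. */
	unsigned int real_window   = 144800;	/* window the server actually wants to advertise */
	unsigned int synack_wscale = 7;		/* wscale carried in the SYN-ACK, remembered by the client */
	unsigned int cookie_wscale = 11;	/* wscale recomputed by cookie_v4_check on the server */

	unsigned int on_wire = real_window >> cookie_wscale;	/* server scales down with the new value */
	unsigned int seen    = on_wire << synack_wscale;	/* client scales up with the old value */

	printf("on wire: %u, client sees: %u\n", on_wire, seen);
	/* on wire: 70, client sees: 8960 -- the effective window silently shrank ~16x. */
	return 0;
}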

Summary

This was really a simple problem, but the debugging took quite a few detours: from the start we focused on the TCP window mechanism and tried to find the problem there, overlooking some key kernel log output. It proves the old point once again: problems that look complicated on the surface usually have very simple causes!

In any case, the kernel's current TCP SYN cookies mechanism has this flaw, so use it with care.

Reference

]]>
Dive deep into inotify and overlayfs 2017-01-06T11:00:30+00:00 hustcat http://hustcat.github.io/dive-into-inotify-and-overlayfs Introduction

Userspace can use the kernel's filesystem notification APIs to learn about changes in the filesystem, such as files or directories being opened, closed, created or deleted. The kernel first implemented dnotify in 2.4.0, but dnotify reused the fcntl system call and had quite a few problems, for example: (1) dnotify can only watch directories, not individual files; (2) it delivers events to the process via the SIGIO signal, but signals are asynchronous and may be lost, and they carry very little information, e.g. there is no way to tell which file inside the directory the event happened on.

Later, in 2.6.13, the kernel implemented inotify, which added several new system calls and solved dnotify's problems.

  • inotifywait

We can use inotifywait, shipped with inotify-tools, to watch events on a directory.

#inotifywait -rme modify,open,create,delete,close /root/dbyin/test/
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
/root/dbyin/test/ CREATE f1.txt
/root/dbyin/test/ OPEN f1.txt
/root/dbyin/test/ MODIFY f1.txt
/root/dbyin/test/ CLOSE_WRITE,CLOSE f1.txt
/root/dbyin/test/ DELETE f1.txt

Another terminal:

#echo hello > /root/dbyin/test/f1.txt
#rm /root/dbyin/test/f1.txt

For an example program, see here.
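
For reference, a minimal sketch that uses the inotify API directly might look like this (the watched path is the same test directory as above, assumed to exist):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	char buf[4096];
	ssize_t len;

	int fd = inotify_init1(0);	/* one inotify fd corresponds to one fsnotify_group in the kernel */
	if (fd < 0) {
		perror("inotify_init1");
		exit(1);
	}

	int wd = inotify_add_watch(fd, "/root/dbyin/test",
				   IN_OPEN | IN_MODIFY | IN_CREATE | IN_DELETE | IN_CLOSE);
	if (wd < 0) {
		perror("inotify_add_watch");
		exit(1);
	}

	while ((len = read(fd, buf, sizeof(buf))) > 0) {	/* blocks until events arrive */
		char *p = buf;
		while (p < buf + len) {
			struct inotify_event *ev = (struct inotify_event *)p;
			printf("wd=%d mask=0x%x name=%s\n", ev->wd, ev->mask, ev->len ? ev->name : "");
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}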

Implementation of inotify

Core data structures

  • fsnotify_group

fsnotify_group represents an inotify instance; one is created every time userspace calls inotify_init, and it keeps all the event information of that instance:

/*
 * A group is a "thing" that wants to receive notification about filesystem
 * events.  The mask holds the subset of event types this group cares about.
 * refcnt on a group is up to the implementor and at any moment if it goes 0
 * everything will be cleaned up.
 */
struct fsnotify_group {

	const struct fsnotify_ops *ops;	/* how this group handles things, inotify_fops */
	
	struct list_head notification_list;	/* list of event_holder this group needs to send to userspace, fsnotify_event list */
	wait_queue_head_t notification_waitq;	/* read() on the notification file blocks on this waitq */
	unsigned int q_len;			/* events on the queue */
	unsigned int max_events;		/* maximum events allowed on the list */

	struct list_head marks_list;	/* all inode marks for this group, struct fsnotify_mark list */

	struct fasync_struct    *fsn_fa;    /* async notification */

	/* groups can define private fields here or use the void *private */
	union {
		void *private;
#ifdef CONFIG_INOTIFY_USER
		struct inotify_group_private_data {
			spinlock_t	idr_lock;
			struct idr      idr;   ///id -> inotify_inode_mark*
			struct user_struct      *user;
		} inotify_data; ///for inotify
#endif
	}
}
  • fsnotify_mark

fsnotify_mark is the bridge between an fsnotify_group and an inode: fsnotify_group->marks_list is the list of fsnotify_marks, and fsnotify_mark.i.inode points to the inode of the watched file; conversely, inode->i_fsnotify_marks holds the marks of all inotify instances watching that inode.

struct inotify_inode_mark {
	struct fsnotify_mark fsn_mark;
	int wd; ///watch descriptor
};


struct fsnotify_mark {
	__u32 mask;			/* mask this mark is for */
	/* we hold ref for each i_list and g_list.  also one ref for each 'thing'
	 * in kernel that found and may be using this mark. */
	atomic_t refcnt;		/* active things looking at this mark */
	struct fsnotify_group *group;	/* group this mark is for */
	struct list_head g_list;	/* list of marks by group->i_fsnotify_marks */
	spinlock_t lock;		/* protect group and inode */
	union {
		struct fsnotify_inode_mark i;
		struct fsnotify_vfsmount_mark m;
	};
	__u32 ignored_mask;		/* events types to ignore */
#define FSNOTIFY_MARK_FLAG_INODE		0x01
#define FSNOTIFY_MARK_FLAG_VFSMOUNT		0x02
#define FSNOTIFY_MARK_FLAG_OBJECT_PINNED	0x04
#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x08
#define FSNOTIFY_MARK_FLAG_ALIVE		0x10
	unsigned int flags;		/* vfsmount or inode mark? */
	struct list_head destroy_list;
	void (*free_mark)(struct fsnotify_mark *mark); /* called on final put+free */
};

/*
 * Inode specific fields in an fsnotify_mark
 */
struct fsnotify_inode_mark {
	struct inode *inode;		/* inode this mark is associated with */
	struct hlist_node i_list;	/* list of marks by inode->i_fsnotify_marks */
	struct list_head free_i_list;	/* tmp list used when freeing this mark */
};
  • inode and file
/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_marks; ///struct fsnotify_inode_mark list, see fsnotify_inode_mark.i_list
#endif 
}


struct file {
    ///...
	
	void			*private_data; ///fsnotify_group*
}

Implementation of overlayfs

  • Data structures

The key data structures of overlayfs:

struct dentry {
	struct dentry *d_parent;	/* parent directory dentry object */
	struct qstr d_name;   ///name of this path component
	struct inode *d_inode;		/* inode对象, create by ovl_new_inode */
	
	const struct dentry_operations *d_op; /// == super_block->s_d_op == ovl_dentry_operations
	struct super_block *d_sb;	/* The root of the dentry tree */

	void *d_fsdata;			/* fs-specific data, struct ovl_entry */
}


/* private information held for every overlayfs dentry */
struct ovl_entry {
	struct dentry *__upperdentry; ///not NULL if got in upperdir
	struct ovl_dir_cache *cache;
	union {
		struct {
			u64 version;
			bool opaque;
		};
		struct rcu_head rcu;
	};
	unsigned numlower;
	struct path lowerstack[]; ///not NULL if got in lowdir
};


struct inode {
	const struct inode_operations	*i_op; ///ovl_dir_inode_operations
	struct super_block	*i_sb;
	
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops, ovl_dir_operations */
	
	void			*i_private; /* fs or device private pointer,  struct ovl_entry*/
};

dentry is the kernel's directory entry object; every directory (or file) has one. For overlayfs, the inode a dentry points to has no actual on-disk data; it is an in-memory inode created by ovl_new_inode. dentry->d_fsdata points to an ovl_entry, which in turn points to the real dentry of the underlying filesystem.

When walking a path in overlayfs, dentry->d_inode is of little use; in fact, in ovl_lookup the struct inode *dir parameter that represents the parent directory is not used at all. What really drives the lookup is the ovl_entry pointed to by dentry->d_fsdata: through the ovl_entry the lookup descends into the underlying filesystems.

///dir: parent directory inode object, dentry: dentry object for current finding dircotry entry
struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
			  unsigned int flags) ///called by lookup_real
{
	struct ovl_entry *oe;
	struct ovl_entry *poe = dentry->d_parent->d_fsdata; ///dentry->d_parent->d_inode == dir
	struct path *stack = NULL;
	struct dentry *upperdir, *upperdentry = NULL;
	unsigned int ctr = 0;
	struct inode *inode = NULL;
	bool upperopaque = false;
	struct dentry *this, *prev = NULL;
	unsigned int i;
	int err;

	upperdir = ovl_upperdentry_dereference(poe);
	if (upperdir) { ///(1)lookup in upperdir firstly
		this = ovl_lookup_real(upperdir, &dentry->d_name);
		err = PTR_ERR(this);
		if (IS_ERR(this))
			goto out;

		if (this) {///exist in upperdir
			if (unlikely(ovl_dentry_remote(this))) {
				dput(this);
				err = -EREMOTE;
				goto out;
			}
			if (ovl_is_whiteout(this)) {
				dput(this); ///whiteout file
				this = NULL;
				upperopaque = true;
			} else if (poe->numlower && ovl_is_opaquedir(this)) {
				upperopaque = true; ///opaque dir
			}
		}
		upperdentry = prev = this;
	}
	///(2)didn't find dentry in upperdir
	if (!upperopaque && poe->numlower) {
		err = -ENOMEM;
		stack = kcalloc(poe->numlower, sizeof(struct path), GFP_KERNEL);
		if (!stack)
			goto out_put_upper;
	}
	///(3)find dentry in lowdir
	for (i = 0; !upperopaque && i < poe->numlower; i++) {
		bool opaque = false;
		struct path lowerpath = poe->lowerstack[i];

		this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
		err = PTR_ERR(this);
		if (IS_ERR(this)) {
			/*
			 * If it's positive, then treat ENAMETOOLONG as ENOENT.
			 */
			if (err == -ENAMETOOLONG && (upperdentry || ctr))
				continue;
			goto out_put;
		}
		if (!this)
			continue;
		if (ovl_is_whiteout(this)) {
			dput(this);
			break;
		}
		/*
		 * Only makes sense to check opaque dir if this is not the
		 * lowermost layer.
		 */
		if (i < poe->numlower - 1 && ovl_is_opaquedir(this))
			opaque = true;

		if (prev && (!S_ISDIR(prev->d_inode->i_mode) ||
			     !S_ISDIR(this->d_inode->i_mode))) {
			/*
			 * FIXME: check for upper-opaqueness maybe better done
			 * in remove code.
			 */
			if (prev == upperdentry)
				upperopaque = true;
			dput(this);
			break;
		}
		/*
		 * If this is a non-directory then stop here.
		 */
		if (!S_ISDIR(this->d_inode->i_mode))
			opaque = true;

		stack[ctr].dentry = this;
		stack[ctr].mnt = lowerpath.mnt;
		ctr++;
		prev = this;
		if (opaque)
			break;
	}

	oe = ovl_alloc_entry(ctr); ///ovl_dentry for current finding dentry
	err = -ENOMEM;
	if (!oe)
		goto out_put;

	if (upperdentry || ctr) {///if got in upperdir, upperdentry != NULL; else if got in lowdir, ctr > 0
		struct dentry *realdentry;

		realdentry = upperdentry ? upperdentry : stack[0].dentry;
		///alloc overlayfs inode for current real inode
		err = -ENOMEM;
		inode = ovl_new_inode(dentry->d_sb, realdentry->d_inode->i_mode,
				      oe);
		if (!inode)
			goto out_free_oe;
		ovl_copyattr(realdentry->d_inode, inode);
	}

	oe->opaque = upperopaque;
	oe->__upperdentry = upperdentry;
	memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
	kfree(stack);
	dentry->d_fsdata = oe; ///ovl_entry
	d_add(dentry, inode);

	return NULL;

out_free_oe:
	kfree(oe);
out_put:
	for (i = 0; i < ctr; i++)
		dput(stack[i].dentry);
	kfree(stack);
out_put_upper:
	dput(upperdentry);
out:
	return ERR_PTR(err);
}
  • open and copy up

When overlayfs opens a file, it makes struct file->f_inode point to the real inode; moreover, if the open is going to modify the file and the file does not yet exist in upperdir, it is first copied up from lowerdir:

int vfs_open(const struct path *path, struct file *file,
            const struct cred *cred)
{
	struct dentry *dentry = path->dentry; ///overlayfs dentry
	struct inode *inode = dentry->d_inode; ///overlayfs inode

	file->f_path = *path;
	if (dentry->d_flags & DCACHE_OP_SELECT_INODE) {
		inode = dentry->d_op->d_select_inode(dentry, file->f_flags); ///get real inode, ovl_dentry_operations
		if (IS_ERR(inode))
			return PTR_ERR(inode);
	}

	return do_dentry_open(file, inode, NULL, cred); ///file->f_inode = inode
}

///return underlay fs inode
struct inode *ovl_d_select_inode(struct dentry *dentry, unsigned file_flags)
{
	int err;
	struct path realpath;
	enum ovl_path_type type;

	if (S_ISDIR(dentry->d_inode->i_mode))
		return dentry->d_inode;

	type = ovl_path_real(dentry, &realpath); ///real dentry
	if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) { ///need copy up
		err = ovl_want_write(dentry);
		if (err)
			return ERR_PTR(err);

		if (file_flags & O_TRUNC)
			err = ovl_copy_up_truncate(dentry);
		else
			err = ovl_copy_up(dentry); ///copy up
		ovl_drop_write(dentry);
		if (err)
			return ERR_PTR(err);

		ovl_path_upper(dentry, &realpath);
	}

	if (realpath.dentry->d_flags & DCACHE_OP_SELECT_INODE)
		return realpath.dentry->d_op->d_select_inode(realpath.dentry, file_flags);

	return realpath.dentry->d_inode; ///return real inode
}

Inotify and Overlayfs

inotify_add_watch uses the overlayfs inode:

SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
		u32, mask)
{

///...
	ret = inotify_find_inode(pathname, &path, flags); ///returns the overlayfs inode
	if (ret)
		goto fput_and_out;

	/* inode held in place by reference to path; group by fget on fd */
	inode = path.dentry->d_inode; ///monitored file(overlay inode)
	group = f.file->private_data; ///notify group

	/* create/update an inode mark */
	ret = inotify_update_watch(group, inode, mask);

///...
}

fsnotify_open, however, uses the underlay inode:

/*
 * fsnotify_open - file was opened
 */
static inline void fsnotify_open(struct file *file)
{
	struct path *path = &file->f_path;
	struct inode *inode = file_inode(file); ///for overlayfs , after vfs_open, f->f_inode == underlay inode
	__u32 mask = FS_OPEN;

	if (S_ISDIR(inode->i_mode))
		mask |= FS_ISDIR;

	fsnotify_parent(path, NULL, mask);
	fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}

In vfs_open, the kernel makes file->f_inode point to the underlay inode:

int vfs_open(const struct path *path, struct file *file,
            const struct cred *cred)
{
	struct dentry *dentry = path->dentry; ///overlayfs dentry
	struct inode *inode = dentry->d_inode; ///overlayfs inode

	file->f_path = *path;
	if (dentry->d_flags & DCACHE_OP_SELECT_INODE) {
		inode = dentry->d_op->d_select_inode(dentry, file->f_flags); ///get underlayfs inode, ovl_dentry_operations
		if (IS_ERR(inode))
			return PTR_ERR(inode);
	}

	return do_dentry_open(file, inode, NULL, cred); ///file->f_inode = inode
}

So when a single file is watched through overlayfs, the watch is attached to the overlayfs inode while the events are generated on the underlay inode, and no events are ever delivered.

Reference

]]>