An introduction to OVS architecture
2018-01-10 · hustcat (hustcat@gmail.com) · http://hustcat.github.io/an-introduction-to-ovs-architecture

The overall architecture of OVS:



          |   +-------------------+
          |   |    ovs-vswitchd   |<-->ovsdb-server
          |   +-------------------+
          |   |      ofproto      |<-->OpenFlow controllers
          |   +--------+-+--------+  _
          |   | netdev | |ofproto-|   |
userspace |   +--------+ |  dpif  |   |
          |   | netdev | +--------+   |
          |   |provider| |  dpif  |   |
          |   +---||---+ +--------+   |
          |       ||     |  dpif  |   | implementation of
          |       ||     |provider|   | ofproto provider
          |_      ||     +---||---+   |
                  ||         ||       |
           _  +---||-----+---||---+   |
          |   |          |datapath|   |
   kernel |   |          +--------+  _|
          |   |                   |
          |_  +--------||---------+
  • ovs-vswitchd

The main Open vSwitch userspace program, in vswitchd/. It reads the desired Open vSwitch configuration from the ovsdb-server program over an IPC channel and passes this configuration down to the “ofproto” library. It also passes certain status and statistical information from ofproto back into the database.

  • ofproto The Open vSwitch library, in ofproto/, that implements an OpenFlow switch. It talks to OpenFlow controllers over the network and to switch hardware or software through an “ofproto provider”, explained further below.

  • netdev The Open vSwitch library, in lib/netdev.c, that abstracts interacting with network devices, that is, Ethernet interfaces. The netdev library is a thin layer over “netdev provider” code, explained further below.


struct ofproto represents an OpenFlow switch:

/* An OpenFlow switch.
 * With few exceptions, ofproto implementations may look at these fields but
 * should not modify them. */
struct ofproto {
    struct hmap_node hmap_node; /* In global 'all_ofprotos' hmap. */
    const struct ofproto_class *ofproto_class; /// see ofproto_dpif_class 
    char *type;                 /* Datapath type. */
    char *name;                 /* Datapath name. */
    /* Datapath. */
    struct hmap ports;          /* Contains "struct ofport"s. */
    struct shash port_by_name;
    struct simap ofp_requests;  /* OpenFlow port number requests. */
    uint16_t alloc_port_no;     /* Last allocated OpenFlow port number. */
    uint16_t max_ports;         /* Max possible OpenFlow port num, plus one. */
    struct hmap ofport_usage;   /* Map ofport to last used time. */
    uint64_t change_seq;        /* Change sequence for netdev status. */

    /* Flow tables. */
    long long int eviction_group_timer; /* For rate limited reheapification. */
    struct oftable *tables;
    int n_tables;
    ovs_version_t tables_version;  /* Controls which rules are visible to
                                    * table lookups. */

struct ofproto contains the two most important pieces of information: port information (struct ofport) and flow table information (struct oftable):

  • struct ofport
/* An OpenFlow port within a "struct ofproto".
 * The port's name is netdev_get_name(port->netdev).
 * With few exceptions, ofproto implementations may look at these fields but
 * should not modify them. */
struct ofport {
    struct hmap_node hmap_node; /* In struct ofproto's "ports" hmap. */
    struct ofproto *ofproto;    /* The ofproto that contains this port. */
    struct netdev *netdev; /// network device
    struct ofputil_phy_port pp;
    ofp_port_t ofp_port;        /* OpenFlow port number. */
    uint64_t change_seq;
    long long int created;      /* Time created, in msec. */
    int mtu;
  • struct oftable
/* A flow table within a "struct ofproto". */
struct oftable {
    enum oftable_flags flags;
    struct classifier cls;      /* Contains "struct rule"s. */
    char *name;                 /* Table name exposed via OpenFlow, or NULL. */

struct oftable links all of its flow rules through struct classifier.

  • struct rule
struct rule {
    /* Where this rule resides in an OpenFlow switch.
     * These are immutable once the rule is constructed, hence 'const'. */
    struct ofproto *const ofproto; /* The ofproto that contains this rule. */
    const struct cls_rule cr;      /* In owning ofproto's classifier. */
    const uint8_t table_id;        /* Index in ofproto's 'tables' array. */

    enum rule_state state;
    /* OpenFlow actions.  See struct rule_actions for more thread-safety
     * notes. */
    const struct rule_actions * const actions;

struct rule_actions holds the action information associated with a rule:

/* A set of actions within a "struct rule". */
struct rule_actions {
    /* Flags.
     * 'has_meter' is true if 'ofpacts' contains an OFPACT_METER action.
     * 'has_learn_with_delete' is true if 'ofpacts' contains an OFPACT_LEARN
     * action whose flags include NX_LEARN_F_DELETE_LEARNED. */
    bool has_meter;
    bool has_learn_with_delete;
    bool has_groups;

    /* Actions. */
    uint32_t ofpacts_len;         /* Size of 'ofpacts', in bytes. */
    struct ofpact ofpacts[];      /* Sequence of "struct ofpacts". */
  • action
/* Header for an action.
 * Each action is a structure "struct ofpact_*" that begins with "struct
 * ofpact", usually followed by other data that describes the action.  Actions
 * are padded out to a multiple of OFPACT_ALIGNTO bytes in length. */
struct ofpact {
    /* We want the space advantage of an 8-bit type here on every
     * implementation, without giving up the advantage of having a useful type
     * on implementations that support packed enums. */
#ifdef HAVE_PACKED_ENUM
    enum ofpact_type type;      /* OFPACT_*. */
#else
    uint8_t type;               /* OFPACT_* */
#endif

    uint8_t raw;                /* Original type when added, if any. */
    uint16_t len;               /* Length of the action, in bytes, including
                                 * struct ofpact, excluding padding. */


/* OFPACT_OUTPUT.  Used for OFPAT10_OUTPUT. */
struct ofpact_output {
    struct ofpact ofpact;
    ofp_port_t port;            /* Output port. */
    uint16_t max_len;           /* Max send len, for port OFPP_CONTROLLER. */


  • struct ofproto_dpif

struct ofproto_dpif represents a bridge based on a dpif datapath:

/* A bridge based on a "dpif" datapath. */

struct ofproto_dpif {
    struct hmap_node all_ofproto_dpifs_node; /* In 'all_ofproto_dpifs'. */
    struct ofproto up;
    struct dpif_backer *backer;
  • struct dpif

The OVS datapath interface (ofproto_dpif->backer->dpif):

/* Open vSwitch datapath interface.
 * This structure should be treated as opaque by dpif implementations. */
struct dpif {
    const struct dpif_class *dpif_class;
    char *base_name;
    char *full_name;
    uint8_t netflow_engine_type;
    uint8_t netflow_engine_id;

struct dpif_class:

/* Datapath interface class structure, to be defined by each implementation of
 * a datapath interface.
 * These functions return 0 if successful or a positive errno value on failure,
 * except where otherwise noted.
 * These functions are expected to execute synchronously, that is, to block as
 * necessary to obtain a result.  Thus, they may not return EAGAIN or
 * EWOULDBLOCK or EINPROGRESS.  We may relax this requirement in the future if
 * and when we encounter performance problems. */
struct dpif_class { /// see dpif_netlink_class/dpif_netdev_class
    /* Type of dpif in this class, e.g. "system", "netdev", etc.
     * One of the providers should supply a "system" type, since this is
     * the type assumed if no type is specified when opening a dpif. */
    const char *type;

struct dpif_class corresponds to the dpif provider in the figure above:

The “dpif” library in turn delegates much of its functionality to a “dpif provider”.

struct dpif_class, in lib/dpif-provider.h, defines the interfaces required to implement a dpif provider for new hardware or software.

There are two existing dpif implementations that may serve as useful examples during a port:

  • lib/dpif-netlink.c is a Linux-specific dpif implementation that talks to an Open vSwitch-specific kernel module (whose sources are in the “datapath” directory). The kernel module performs all of the switching work, passing packets that do not match any flow table entry up to userspace. This dpif implementation is essentially a wrapper around calls into the kernel module.

  • lib/dpif-netdev.c is a generic dpif implementation that performs all switching internally. This is how the Open vSwitch userspace switch is implemented.

  • dpif_netdev_class
const struct dpif_class dpif_netdev_class = {


ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \

See Using Open vSwitch with DPDK.

  • dpif_netlink_class

The system datapath, a switch implemented in the Linux kernel:

const struct dpif_class dpif_netlink_class = {
    NULL,                       /* init */


  • struct netdev

struct netdev represents a network device:

/* A network device (e.g. an Ethernet device).
 * Network device implementations may read these members but should not modify
 * them. */
struct netdev {
    /* The following do not change during the lifetime of a struct netdev. */
    char *name;                         /* Name of network device. */
    const struct netdev_class *netdev_class; /* Functions to control
                                                this device. */
  • struct netdev_class
const struct netdev_class netdev_linux_class =

const struct netdev_class netdev_tap_class =

const struct netdev_class netdev_internal_class =
        NULL,                  /* get_features */



main(int argc, char *argv[])
  • bridge_run


|- bridge_init_ofproto ///Initialize the ofproto library
|  |- ofproto_init
|    |- ofproto_class->init
|- bridge_run__
|  |- ofproto_type_run
|  |- ofproto_run
|     |- ofproto_class->run
|     |- handle_openflow /// handle OpenFlow messages
|- bridge_reconfigure /// config bridge
  • netdev_run


/* Performs periodic work needed by all the various kinds of netdevs.
 * If your program opens any netdevs, it must call this function within its
 * main poll loop. */

    struct netdev_registered_class *rc;
    CMAP_FOR_EACH (rc, cmap_node, &netdev_classes) {
        if (rc->class->run) {


add port

/* Attempts to add 'netdev' as a port on 'ofproto'.  If 'ofp_portp' is
 * non-null and '*ofp_portp' is not OFPP_NONE, attempts to use that as
 * the port's OpenFlow port number.
 * If successful, returns 0 and sets '*ofp_portp' to the new port's
 * OpenFlow port number (if 'ofp_portp' is non-null).  On failure,
 * returns a positive errno value and sets '*ofp_portp' to OFPP_NONE (if
 * 'ofp_portp' is non-null). */
ofproto_port_add(struct ofproto *ofproto, struct netdev *netdev,
                 ofp_port_t *ofp_portp)
    ofp_port_t ofp_port = ofp_portp ? *ofp_portp : OFPP_NONE;
    int error;

    error = ofproto->ofproto_class->port_add(ofproto, netdev); ///ofproto_dpif_class(->port_add)
  • dpif add port

port_add -> dpif_port_add:

dpif_port_add(struct dpif *dpif, struct netdev *netdev, odp_port_t *port_nop)
    const char *netdev_name = netdev_get_name(netdev);
    odp_port_t port_no = ODPP_NONE;
    int error;


    if (port_nop) {
        port_no = *port_nop;

    error = dpif->dpif_class->port_add(dpif, netdev, &port_no); ///dpif_netlink_class
  • dpif-netlink

lib/dpif-netlink.c: dpif_netlink_port_add -> dpif_netlink_port_add_compat -> dpif_netlink_port_add__:

static int
dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
                        enum ovs_vport_type type,
                        struct ofpbuf *options,
                        odp_port_t *port_nop)
    request.cmd = OVS_VPORT_CMD_NEW; ///new port
    request.dp_ifindex = dpif->dp_ifindex;
    request.type = type;
    request.name = name;

    request.port_no = *port_nop;
    upcall_pids = vport_socksp_to_pids(socksp, dpif->n_handlers);
    request.n_upcall_pids = socksp ? dpif->n_handlers : 1;
    request.upcall_pids = upcall_pids;

    if (options) {
        request.options = options->data;
        request.options_len = options->size;

    error = dpif_netlink_vport_transact(&request, &reply, &buf);
  • datapath
# ovs-vsctl add-port br-int vxlan1 -- set interface vxlan1 type=vxlan options:remote_ip=


|-- new_vport
    |-- ovs_vport_add
        |-- vport_ops->create

add flow

ovs-ofctl add-flow br-int "in_port=2, nw_src=, action=drop"

handle_openflow -> handle_openflow__ -> handle_flow_mod -> handle_flow_mod__ -> ofproto_flow_mod_start -> add_flow_start -> replace_rule_start -> ofproto_rule_insert__.

See Openvswitch原理与代码分析(7): 添加一条流表flow (OVS principles and code analysis, part 7: adding a flow).


Whether the datapath is the in-kernel one or the DPDK-based userspace one, a packet that misses in the flow table goes through upcall processing. The upcall handler udpif_upcall_handler is initialized in udpif_start_threads, which also creates the udpif_revalidator threads:


|-- dpif_recv         // (1) read packet from datapath
|-- upcall_receive    // (2) associate packet with a ofproto
|-- process_upcall    // (3) process packet by flow rule
|-- handle_upcalls    // (4) add flow rule to datapath
  • process_upcall
|-- upcall_xlate
    |-- xlate_actions
        |-- rule_dpif_lookup_from_table
        |-- do_xlate_actions
  • do_xlate_actions
static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                 struct xlate_ctx *ctx, bool is_last_action)
    struct flow_wildcards *wc = ctx->wc;
    struct flow *flow = &ctx->xin->flow;
    const struct ofpact *a;
    ///do each action
    OFPACT_FOR_EACH (a, ofpacts, ofpacts_len) {
        switch (a->type) {
        case OFPACT_OUTPUT: /// actions=output
            xlate_output_action(ctx, ofpact_get_OUTPUT(a)->port,
                                ofpact_get_OUTPUT(a)->max_len, true, last,

        case OFPACT_GROUP: /// actions=group
            if (xlate_group_action(ctx, ofpact_get_GROUP(a)->group_id, last)) {
                /* Group could not be found. */

                /* XXX: Terminates action list translation, but does not
                 * terminate the pipeline. */
        case OFPACT_CT: /// actions=ct
            compose_conntrack_action(ctx, ofpact_get_CT(a), last);


OVS flow table implementation in datapath
2018-01-10 · hustcat · http://hustcat.github.io/ovs-flow-table-in-datapath

How the OVS flow table works

The flow cache in OVS has two levels: the microflow cache and the megaflow cache. The former does exact matching, keyed on the skb's hash (skb_hash); the latter does wildcard matching, implemented with the tuple space search (TSS) algorithm.




TSS is used for three reasons:

  • (1) In a virtualized data center, flows are added and deleted frequently, and TSS supports efficient, constant-time updates of table entries;
  • (2) TSS supports matching on arbitrary combinations of fields;
  • (3) TSS storage grows linearly with the number of flows.


///mask cache entry
struct mask_cache_entry {
	u32 skb_hash;
	u32 mask_index; /// mask index in flow_table->mask_array->masks

struct mask_array {
	struct rcu_head rcu;
	int count, max;
	struct sw_flow_mask __rcu *masks[]; /// mask array

struct table_instance { /// hash table
	struct flex_array *buckets; ///bucket array
	unsigned int n_buckets;
	struct rcu_head rcu;
	int node_ver;
	u32 hash_seed;
	bool keep_flows;

struct flow_table {
	struct table_instance __rcu *ti; ///hash table
	struct table_instance __rcu *ufid_ti;
	struct mask_cache_entry __percpu *mask_cache; ///microflow cache, find entry by skb_hash, 256 entries(MC_HASH_ENTRIES)
	struct mask_array __rcu *mask_array; ///mask array
	unsigned long last_rehash;
	unsigned int count;
	unsigned int ufid_count;

struct flow_table is the datapath's flow table; it consists of three main parts:

  • (1) ti


/// flow table entry
struct sw_flow {
	struct rcu_head rcu;
	struct {
		struct hlist_node node[2];
		u32 hash;
	} flow_table, ufid_table; /// hash table node
	int stats_last_writer;		/* CPU id of the last writer on
					 * 'stats[0]'.
	struct sw_flow_key key; ///key
	struct sw_flow_id id;
	struct cpumask cpu_used_mask;
	struct sw_flow_mask *mask;
	struct sw_flow_actions __rcu *sf_acts; ///action
	struct flow_stats __rcu *stats[]; /* One for each CPU.  First one
					   * is allocated at flow creation time,
					   * the rest are allocated on demand
					   * while holding the 'stats[0].lock'.
  • (2) mask_cache

mask_cache is the microflow cache, used for exact matching (EMC). It is a per-CPU array of 256 (MC_HASH_ENTRIES) mask_cache_entry elements.

mask_cache_entry has two fields: skb_hash is compared against the skb's hash, and mask_index is the index of the corresponding sw_flow_mask in the mask_array array.

  • (3) mask_array


Each bit of the mask will be set to 1 when a match is required on that bit position; otherwise, it will be 0.

struct sw_flow_key_range {
	unsigned short int start;
	unsigned short int end;
/// flow mask
struct sw_flow_mask {
	int ref_count;
	struct rcu_head rcu;
	struct sw_flow_key_range range;
	struct sw_flow_key key;




/* mask_cache maps flow to probable mask. This cache is not tightly
 * coupled cache, It means updates to  mask list can result in inconsistent
 * cache entry in mask cache.
 * This is per cpu cache and is divided in MC_HASH_SEGS segments.
 * In case of a hash collision the entry is hashed in next segment. */
struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
					  const struct sw_flow_key *key,
					  u32 skb_hash,
					  u32 *n_mask_hit)
	struct mask_array *ma = rcu_dereference(tbl->mask_array);
	struct table_instance *ti = rcu_dereference(tbl->ti);
	struct mask_cache_entry *entries, *ce;
	struct sw_flow *flow;
	u32 hash;
	int seg;

	*n_mask_hit = 0;
	if (unlikely(!skb_hash)) {
		u32 mask_index = 0;

		return flow_lookup(tbl, ti, ma, key, n_mask_hit, &mask_index);

	/* Pre and post recirculation flows usually have the same skb_hash
	 * value. To avoid hash collisions, rehash the 'skb_hash' with
	 * 'recirc_id'.  */
	if (key->recirc_id)
		skb_hash = jhash_1word(skb_hash, key->recirc_id);

	ce = NULL;
	hash = skb_hash;
	entries = this_cpu_ptr(tbl->mask_cache);

	/* Find the cache entry 'ce' to operate on. */
	for (seg = 0; seg < MC_HASH_SEGS; seg++) { ///find in cache
		int index = hash & (MC_HASH_ENTRIES - 1); ///skb_hash -> index
		struct mask_cache_entry *e;

		e = &entries[index];
		if (e->skb_hash == skb_hash) {
			flow = flow_lookup(tbl, ti, ma, key, n_mask_hit,
			if (!flow)
				e->skb_hash = 0;
			return flow;

		if (!ce || e->skb_hash < ce->skb_hash)
			ce = e;  /* A better replacement cache candidate. */

		hash >>= MC_HASH_SHIFT;

	/* Cache miss, do full lookup. */
	flow = flow_lookup(tbl, ti, ma, key, n_mask_hit, &ce->mask_index);
	if (flow)
		ce->skb_hash = skb_hash;

	return flow;


  • flow_lookup
/* Flow lookup does full lookup on flow table. It starts with
 * mask from index passed in *index. */
static struct sw_flow *flow_lookup(struct flow_table *tbl,
				   struct table_instance *ti,
				   const struct mask_array *ma,
				   const struct sw_flow_key *key,
				   u32 *n_mask_hit,
				   u32 *index)
	struct sw_flow_mask *mask;
	struct sw_flow *flow;
	int i;

	if (*index < ma->max) { /// get mask by index
		mask = rcu_dereference_ovsl(ma->masks[*index]);
		if (mask) {
			flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
			if (flow)
				return flow;

	for (i = 0; i < ma->max; i++)  { /// traverse all masks in the array

		if (i == *index)

		mask = rcu_dereference_ovsl(ma->masks[i]);
		if (!mask)

		flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
		if (flow) { /* Found */
			*index = i;
			return flow;

	return NULL;
  • masked_flow_lookup


static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
					  const struct sw_flow_key *unmasked,
					  const struct sw_flow_mask *mask,
					  u32 *n_mask_hit)
	struct sw_flow *flow;
	struct hlist_head *head;
	u32 hash;
	struct sw_flow_key masked_key;

	ovs_flow_mask_key(&masked_key, unmasked, false, mask);
	hash = flow_hash(&masked_key, &mask->range); ///mask key -> hash
	head = find_bucket(ti, hash); ///hash -> bucket
	hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver]) { /// list
		if (flow->mask == mask && flow->flow_table.hash == hash &&
		    flow_cmp_masked_key(flow, &masked_key, &mask->range)) ///compare key
			return flow;
	return NULL;


OVN load balancer practice
2018-01-05 · hustcat · http://hustcat.github.io/ovn-lb-practice

Web server


mkdir -p /tmp/www
echo "i am vm1" > /tmp/www/index.html
cd /tmp/www
ip netns exec vm1 python -m SimpleHTTPServer 8000


mkdir -p /tmp/www
echo "i am vm2" > /tmp/www/index.html
cd /tmp/www
ip netns exec vm2 python -m SimpleHTTPServer 8000

Configuring the Load Balancer Rules

First, configure the LB rules; the VIP is an IP in the VIP network.


uuid=`ovn-nbctl create load_balancer vips:","`
echo $uuid

This creates a corresponding row in the load_balancer table of the Northbound DB:

[root@node1 ~]# ovn-nbctl list load_balancer 
_uuid               : 3760718f-294f-491a-bf63-3faf3573d44c
external_ids        : {}
name                : ""
protocol            : []
vips                : {""=","}

Using the Gateway Router as the Load Balancer


ovn-nbctl set logical_router gw1 load_balancer=$uuid
[root@node1 ~]# ovn-nbctl lb-list                             
UUID                                    LB                  PROTO      VIP                      IPs
3760718f-294f-491a-bf63-3faf3573d44c                        tcp/udp   ,

[root@node1 ~]# ovn-nbctl lr-lb-list gw1
UUID                                    LB                  PROTO      VIP                      IPs
3760718f-294f-491a-bf63-3faf3573d44c                        tcp/udp   ,

Access the LB (client -> VIP):

[root@client ~]# curl
i am vm2
[root@client ~]# curl
i am vm1

The load balancer does not perform any sort of health checking. At present, the assumption is that health checks would be performed by an orchestration solution such as Kubernetes, but it is reasonable to assume that this feature will be added at some future point.


ovn-nbctl clear logical_router gw1 load_balancer
ovn-nbctl destroy load_balancer $uuid

Configuring the Load Balancer on a Logical Switch

For easier testing, create another logical switch, 'ls2':

# create the logical switch
ovn-nbctl ls-add ls2

# create logical port
ovn-nbctl lsp-add ls2 ls2-vm3
ovn-nbctl lsp-set-addresses ls2-vm3 02:ac:10:ff:00:33
ovn-nbctl lsp-set-port-security ls2-vm3 02:ac:10:ff:00:33

# create logical port
ovn-nbctl lsp-add ls2 ls2-vm4
ovn-nbctl lsp-set-addresses ls2-vm4 02:ac:10:ff:00:44
ovn-nbctl lsp-set-port-security ls2-vm4 02:ac:10:ff:00:44

# create router port for the connection to 'ls2'
ovn-nbctl lrp-add router1 router1-ls2 02:ac:10:ff:01:01

# create the 'ls2' switch port for connection to 'router1'
ovn-nbctl lsp-add ls2 ls2-router1
ovn-nbctl lsp-set-type ls2-router1 router
ovn-nbctl lsp-set-addresses ls2-router1 02:ac:10:ff:01:01
ovn-nbctl lsp-set-options ls2-router1 router-port=router1-ls2
# ovn-nbctl show
switch b97eb754-ea59-41f6-b435-ea6ada4659d1 (ls2)
    port ls2-vm3
        addresses: ["02:ac:10:ff:00:33"]
    port ls2-router1
        type: router
        addresses: ["02:ac:10:ff:01:01"]
        router-port: router1-ls2
    port ls2-vm4
        addresses: ["02:ac:10:ff:00:44"]


ip netns add vm3
ovs-vsctl add-port br-int vm3 -- set interface vm3 type=internal
ip link set vm3 netns vm3
ip netns exec vm3 ip link set vm3 address 02:ac:10:ff:00:33
ip netns exec vm3 ip addr add dev vm3
ip netns exec vm3 ip link set vm3 up
ip netns exec vm3 ip route add default via dev vm3
ovs-vsctl set Interface vm3 external_ids:iface-id=ls2-vm3


uuid=`ovn-nbctl create load_balancer vips:","`
echo $uuid

set LB for ls2:

ovn-nbctl set logical_switch ls2 load_balancer=$uuid
ovn-nbctl get logical_switch ls2 load_balancer


# ovn-nbctl ls-lb-list ls2
UUID                                    LB                  PROTO      VIP                      IPs
a19bece1-52bf-4555-89f4-257534c0b9d9                        tcp/udp   ,

Test (vm3 -> VIP):

[root@node2 ~]# ip netns exec vm3 curl
i am vm2
[root@node2 ~]# ip netns exec vm3 curl
i am vm1


This highlights the requirement that load balancing be applied on the client’s logical switch rather than the server’s logical switch.


ovn-nbctl clear logical_switch ls2 load_balancer
ovn-nbctl destroy load_balancer $uuid


  • vm3 ->
ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=,nw_dst= 
  • vm3 -> vm1:
ovs-appctl ofproto/trace br-int in_port=14,ip,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=,nw_dst=,nw_ttl=32

ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=,nw_dst=,nw_ttl=32



17. ct_state=+new+trk,ip,metadata=0xa,nw_dst=, priority 110, cookie 0x30d9e9b5
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 18.


table=17, n_packets=10, n_bytes=740, idle_age=516, priority=110,ct_state=+new+trk,ip,metadata=0xa,nw_dst= actions=group:1


After running ovn-nbctl lsp-set-addresses ls2-vm3 02:ac:10:ff:00:33 to change the MAC of ls2-vm3, vm3 can no longer ping; use ovs-appctl ofproto/trace to investigate:

# ovs-appctl ofproto/trace br-int in_port=14,icmp,icmp_type=0x8,dl_src=02:ac:10:ff:00:33,dl_dst=02:ac:10:ff:01:01,nw_src=,nw_dst= 
14. ip,metadata=0x6, priority 0, cookie 0x986c8ad3
     -> NXM_NX_REG0[] is now 0xc0a8650a
    66. reg0=0xc0a8650a,reg15=0x3,metadata=0x6, priority 100
     -> NXM_NX_REG0[] is now 0xc0a8650a


[root@node2 ovs]# ovs-ofctl dump-flows br-int | grep table=66
table=66, n_packets=0, n_bytes=0, idle_age=8580, priority=100,reg0=0xc0a8640b,reg15=0x1,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:22
table=66, n_packets=18, n_bytes=1764, idle_age=1085, priority=100,reg0=0xc0a8640a,reg15=0x1,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:11
table=66, n_packets=33, n_bytes=3234, idle_age=1283, priority=100,reg0=0xc0a8650a,reg15=0x3,metadata=0x6 actions=mod_dl_dst:02:ac:10:ff:00:13
table=66, n_packets=0, n_bytes=0, idle_age=8580, priority=100,reg0=0,reg1=0,reg2=0,reg3=0,reg15=0x3,metadata=0x6 actions=mod_dl_dst:00:00:00:00:00:00


[root@node2 ovs]# ovs-ofctl del-flows br-int "table=66,reg0=0xc0a8650a,reg15=0x3,metadata=0x6"
[root@node2 ovs]# ovs-ofctl add-flow br-int "table=66,reg0=0xc0a8650a,reg15=0x3,metadata=0x6,actions=mod_dl_dst:02:ac:10:ff:00:33"



An introduction to OVN architecture
2018-01-04 · hustcat · http://hustcat.github.io/an-introduction-to-ovn-architecture

OVN features


  • Logical switches: for L2 forwarding.

  • L2/L3/L4 ACLs: ACLs at layers 2 through 4 that control access based on a packet's MAC address, IP address, and port number.

  • Logical routers: distributed logical routers, for L3 forwarding.

  • Multiple tunnel overlays: supports several tunnel encapsulations: Geneve, STT, and VXLAN.

  • TOR switch or software logical switch gateways: supports using a hardware TOR switch or a software logical switch as a gateway connecting physical and virtual networks.


                                          CMS
                                          |
                                          |
                              +-----------|-----------+
                              |           |           |
                              |     OVN/CMS Plugin    |
                              |           |           |
                              |           |           |
                              |   OVN Northbound DB   |
                              |           |           |
                              |           |           |
                              |       ovn-northd      |
                              |           |           |
                              +-----------|-----------+
                                          |
                                          |
                                +-------------------+
                                | OVN Southbound DB |
                                +-------------------+
                                          |
                                          |
                       +------------------+------------------+
                       |                  |                  |
         HV 1          |                  |    HV n          |
       +---------------|---------------+  .  +---------------|---------------+
       |               |               |  .  |               |               |
       |        ovn-controller         |  .  |        ovn-controller         |
       |         |          |          |  .  |         |          |          |
       |         |          |          |     |         |          |          |
       |  ovs-vswitchd   ovsdb-server  |     |  ovs-vswitchd   ovsdb-server  |
       |                               |     |                               |
       +-------------------------------+     +-------------------------------+

The figure above shows the overall architecture of OVN. At the top, the OpenStack/CMS plugin is the interface between the CMS (Cloud Management System) and OVN: it translates the CMS configuration into OVN's format and writes it into the Northbound DB.

For details, see ovn-architecture - Open Virtual Network architecture.

  • Northbound DB

The Northbound DB stores logical data that is mostly unrelated to the physical network, such as logical switch, logical router, ACL, and logical port, concepts that correspond to those of traditional network devices.

The Northbound DB process:

root     29981     1  0 Dec27 ?        00:00:00 ovsdb-server: monitoring pid 29982 (healthy)
root     29982 29981  2 Dec27 ?        00:35:53 ovsdb-server --detach --monitor -vconsole:off --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --remote=db:OVN_Northbound,NB_Global,connections --unixctl=ovnnb_db.ctl --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/openvswitch/ovnnb_db.db

View ovn-northd's logical data (logical switches, logical ports, logical routers):

# ovn-nbctl show
switch 707dcb98-baa0-4ac5-8955-1ce4de2f780f (kube-master)
    port stor-kube-master
        type: router
        addresses: ["00:00:00:BF:CC:B1"]
        router-port: rtos-kube-master
    port k8s-kube-master
        addresses: ["66:ce:60:08:9e:cd"]
  • ovn-northd

ovn-northd acts as a centralized controller: it translates the data in the Northbound DB and writes the result into the Southbound DB.

# start ovn northd
/usr/share/openvswitch/scripts/ovn-ctl start_northd


root     29996     1  0 Dec27 ?        00:00:00 ovn-northd: monitoring pid 29997 (healthy)
root     29997 29996 25 Dec27 ?        06:52:54 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=unix:/var/run/openvswitch/ovnnb_db.sock --ovnsb-db=unix:/var/run/openvswitch/ovnsb_db.sock --no-chdir --log-file=/var/log/openvswitch/ovn-northd.log --pidfile=/var/run/openvswitch/ovn-northd.pid --detach --monitor
  • Southbound DB

The Southbound DB process:

root     32441     1  0 Dec28 ?        00:00:00 ovsdb-server: monitoring pid 32442 (healthy)
root     32442 32441  0 Dec28 ?        00:01:45 ovsdb-server --detach --monitor -vconsole:off --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --remote=db:OVN_Southbound,SB_Global,connections --unixctl=ovnsb_db.ctl --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/openvswitch/ovnsb_db.db

The data in the Southbound DB has completely different semantics from the Northbound DB. It mainly holds three kinds of data: physical network data, such as each hypervisor's (HV) IP address and tunnel encapsulation format; logical network data, such as how packets are forwarded in the logical network; and bindings between the physical and logical networks, such as which HV a logical port is bound to.

# ovn-sbctl show
Chassis "7f99371a-d51c-478c-8de2-facd70e2f739"
    hostname: "kube-node2"
    Encap vxlan
        ip: ""
        options: {csum="true"}
Chassis "069367d8-8e07-4b81-b057-38cf6b21b2b7"
    hostname: "kube-node3"
    Encap vxlan
        ip: ""
        options: {csum="true"}
  • ovn-controller

ovn-controller is the agent in OVN, similar to the ovs-agent in neutron, and it runs on every HV. In the northbound direction, it writes physical network information into the Southbound DB; in the southbound direction, it translates data from the Southbound DB into OpenFlow flows installed in the local OVS tables to implement packet forwarding.

# start ovs
/usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
# start ovn-controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller


root     13423     1  0 Dec26 ?        00:00:00 ovn-controller: monitoring pid 13424 (healthy)
root     13424 13423 82 Dec26 ?        1-15:53:23 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controller.pid --detach --monitor

ovs-vswitchd and ovsdb-server are the two processes of OVS.

OVN Northbound DB

The Northbound DB is the interface between OVN and the CMS. Almost all of its contents are produced by the CMS; ovn-northd watches this database for changes, translates them, and saves the result into the Southbound DB.

The Northbound DB mainly contains the following tables:

  • Logical_Switch

Each row represents a logical switch. There are two kinds: overlay logical switches correspond to neutron networks, and networking-ovn adds a row to this table for every neutron network created; bridged logical switches connect the physical and logical networks and are used by VTEP gateways. A Logical_Switch row stores the logical ports it contains (pointing to the Logical_Port table) and the ACLs applied to it (pointing to the ACL table).

ovn-nbctl list Logical_Switch shows the data in the Logical_Switch table:

# ovn-nbctl --db tcp: list Logical_Switch
_uuid               : 707dcb98-baa0-4ac5-8955-1ce4de2f780f
acls                : []
dns_records         : []
external_ids        : {gateway_ip=""}
load_balancer       : [4522d0fa-9d46-4165-9524-51d20a35ea0a, 5842a5a9-6c8e-4a87-be3c-c8a0bc271626]
name                : kube-master
other_config        : {subnet=""}
ports               : [44222421-c811-4f38-8ea6-5504a35df703, ee5a5e97-c41d-4656-bd2a-8bc8ad180188]
qos_rules           : []
  • Logical_Switch_Port

Each row represents a logical port. Each time a neutron port is created, networking-ovn adds a row to this table. Each row stores information such as the port type (for example patch port or localnet port), the port's IP and MAC addresses, and whether the port is UP or Down.

# ovn-nbctl --db tcp: list Logical_Switch_Port
_uuid               : 44222421-c811-4f38-8ea6-5504a35df703
addresses           : ["00:00:00:BF:CC:B1"]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
name                : stor-kube-master
options             : {router-port=rtos-kube-master}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : []
type                : router
up                  : false
  • ACL

Each row represents an ACL rule applied to a logical switch. If none of the ports on a logical switch have a security group configured, no ACL is applied to that switch. Each ACL rule contains the match condition, the direction, and the action.

  • Logical_Router

Each row represents a logical router. Each time a neutron router is created, networking-ovn adds a row to this table. Each row stores the logical router ports the router contains.

# ovn-nbctl --db tcp: list Logical_Router
_uuid               : e12293ba-e61e-40bf-babc-8580d1121641
enabled             : []
external_ids        : {"k8s-cluster-router"=yes}
load_balancer       : []
name                : kube-master
nat                 : []
options             : {}
ports               : [2cbbbb8e-6b5d-44d5-9693-b4069ca9e12a, 3a046f60-161a-4fee-a1b3-d9d3043509d2, 40d3d95d-906b-483b-9b71-1fa6970de6e8, 840ab648-6436-4597-aeff-f84fbc44e3a9, b08758e5-7017-413f-b3db-ff68f49460a4]
static_routes       : []
  • Logical_Router_Port

Each row represents a logical router port. Each time a router interface is created, networking-ovn adds a row to this table. It mainly stores the port's IP and MAC addresses.

# ovn-nbctl --db tcp: list Logical_Router_Port
_uuid               : 840ab648-6436-4597-aeff-f84fbc44e3a9
enabled             : []
external_ids        : {}
gateway_chassis     : []
mac                 : "00:00:00:BF:CC:B1"
name                : rtos-kube-master
networks            : [""]
options             : {}
peer                : []

For more details, see ovn-nb - OVN_Northbound database schema.

Southbound DB

The Southbound DB sits at the center of the OVN architecture. It is a crucial part of OVN and interacts with all of the other OVN components.

The Southbound DB contains the following tables:

  • Chassis

Chassis is a concept introduced by OVN that does not exist in OVS. A chassis can be an HV or a VTEP gateway.

Each row represents an HV or a VTEP gateway and is populated by ovn-controller/ovn-controller-vtep. It contains the chassis name and the encapsulation configuration the chassis supports (pointing to the Encap table). If the chassis is a VTEP gateway, the logical switches associated with OVN on that gateway are also stored in this table.

# ovn-sbctl list Chassis
_uuid               : 3dec4aa7-8f15-493d-89f4-4a260b510bbd
encaps              : [bc324cd4-56f2-4f73-af9e-149b7401e0d2]
external_ids        : {datapath-type="", iface-types="geneve,gre,internal,lisp,patch,stt,system,tap,vxlan", ovn-bridge-mappings=""}
hostname            : "kube-node1"
name                : "c7889c47-2d18-4dd4-a3b7-446d42b79f79"
nb_cfg              : 34
vtep_logical_switches: []
  • Encap

Stores the tunnel type and the tunnel endpoint IP address.

# ovn-sbctl list Encap
_uuid               : bc324cd4-56f2-4f73-af9e-149b7401e0d2
chassis_name        : "c7889c47-2d18-4dd4-a3b7-446d42b79f79"
ip                  : ""
options             : {csum="true"}
type                : vxlan
  • Logical_Flow

Each row represents a logical flow. ovn-northd generates this table from the L2/L3 topology and ACL information in the Northbound DB; ovn-controller translates these logical flows into OVS flows and programs them into the OVS tables on the HV. A flow mainly contains the match rule and direction, the priority, the table ID, and the actions.

# ovn-sbctl lflow-list
Datapath: "kube-node1" (2c3caa57-6a58-4416-9bd2-3e2982d83cf1)  Pipeline: ingress
  table=0 (ls_in_port_sec_l2  ), priority=100  , match=(eth.src[40]), action=(drop;)
  table=0 (ls_in_port_sec_l2  ), priority=100  , match=(vlan.present), action=(drop;)
  table=0 (ls_in_port_sec_l2  ), priority=50   , match=(inport == "default_sshd-2"), action=(next;)
  table=0 (ls_in_port_sec_l2  ), priority=50   , match=(inport == "k8s-kube-node1"), action=(next;)
  table=0 (ls_in_port_sec_l2  ), priority=50   , match=(inport == "stor-kube-node1"), action=(next;)
  • Multicast_Group

Each row represents a multicast group. Forwarding of multicast and broadcast packets is determined by this table. It stores the datapath the group belongs to, the ports in the group, and the tunnel_key representing the logical egress port.

  • Datapath_Binding

Each row represents the binding between a datapath and the physical network; there is one row per logical switch or logical router. It mainly stores the tunnel_key that OVN assigns to the datapath as the logical datapath identifier.


# ovn-sbctl list Datapath_Binding
_uuid               : 4cfe0e4c-1bbb-406a-9d85-e7bc24c818d0
external_ids        : {logical-router="e12293ba-e61e-40bf-babc-8580d1121641", name=kube-master}
tunnel_key          : 1

_uuid               : 5ec4f962-77a8-44e8-ae01-5b7e46b6a286
external_ids        : {logical-switch="e865aa50-7510-4b7f-9df4-b82801a8e92b", name=join}
tunnel_key          : 2

_uuid               : 2c3caa57-6a58-4416-9bd2-3e2982d83cf1
external_ids        : {logical-switch="7c41601a-dcd5-4e77-b0e8-ca8692d7462b", name="kube-node1"}
tunnel_key          : 4


  • Port_Binding

This table is mainly used to determine which chassis a logical port resides on. Each row mainly contains the logical port's MAC and IP addresses, the port type, the datapath binding the port belongs to, the tunnel_key representing the logical input/output port identifier, and the chassis the port resides on. The chassis column is filled in by ovn-controller/ovn-controller-vtep; the remaining columns are set by ovn-northd.


# ovn-sbctl list Port_Binding    
_uuid               : 5e5746d8-3533-45a8-8abe-5a7028c97afa
chassis             : []
datapath            : 2c3caa57-6a58-4416-9bd2-3e2982d83cf1
external_ids        : {}
gateway_chassis     : []
logical_port        : "stor-kube-node1"
mac                 : ["00:00:00:18:22:18"]
nat_addresses       : []
options             : {peer="rtos-kube-node1"}
parent_port         : []
tag                 : []
tunnel_key          : 2
type                : patch
  • 小结

The Chassis and Encap tables contain physical network data; the Logical_Flow and Multicast_Group tables contain logical network data; the Datapath_Binding and Port_Binding tables contain the bindings between the logical and physical networks.

OVN tunnel

OVN supports three tunnel types: Geneve, STT, and VXLAN. Traffic between HVs can only use Geneve or STT; traffic between an HV and a VTEP gateway can additionally use VXLAN, for compatibility with hardware VTEP gateways, since most of them only support VXLAN.

Although VXLAN is the tunnel technology commonly used in data centers, the VXLAN header is fixed and can only carry a VNID (VXLAN network identifier); it cannot carry any additional information in the tunnel. So OVN chose Geneve and STT: the Geneve header has an option field in TLV format that users can extend as needed, and the STT header can carry 64 bits of data, far more than VXLAN's 24 bits.

OVN tunnel encapsulation uses three pieces of data:

  • Logical datapath identifier

datapath is an OVS concept: packets are sent to a datapath for processing. Each datapath corresponds to an OVN logical switch or logical router, similar to a tunnel ID. This identifier is 24 bits, allocated by ovn-northd, globally unique, and stored in the tunnel_key column of the Datapath_Binding table in the Southbound DB (see the earlier example).

  • Logical input port identifier: identifies the port through which a packet enters the logical datapath. It is 15 bits, allocated by ovn-northd, and unique within each datapath. The usable range is 1-32767; 0 is reserved for internal use. It is stored in the tunnel_key column of the Port_Binding table in the Southbound DB.

  • Logical output port identifier

Identifies the port through which a packet leaves the logical datapath. It is 16 bits: the range 0-32767 has the same meaning as for the logical input port identifier, and the range 32768-65535 is used by multicast groups. For each logical port, the input port identifier and output port identifier are the same.

If the tunnel type is Geneve, the VNI field of the Geneve header carries the logical datapath identifier, and the option field carries the logical input port identifier and logical output port identifier: the TLV has class 0xffff, type 0, and value 0 (1 bit) + logical input port identifier (15 bits) + logical output port identifier (16 bits). For details, see Geneve: Generic Network Virtualization Encapsulation.

Geneve Option:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|          Option Class         |      Type     |R|R|R| Length  |
|                      Variable Option Data                     |
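The packing described above can be sketched in Python. This is an illustrative model, not OVN code: it packs the reserved bit, the 15-bit logical input port identifier, and the 16-bit logical output port identifier into the 32-bit option value.

```python
def pack_ovn_geneve_option(in_port: int, out_port: int) -> int:
    """Pack OVN's Geneve option value:
    1 reserved bit (0) + 15-bit input port id + 16-bit output port id."""
    assert 1 <= in_port <= 32767      # 0 is reserved for internal use
    assert 0 <= out_port <= 65535     # 32768-65535 identify multicast groups
    return (in_port << 16) | out_port

def unpack_ovn_geneve_option(value: int):
    """Recover (input port id, output port id) from the 32-bit option value."""
    return (value >> 16) & 0x7FFF, value & 0xFFFF

# input port 1, output port 3 -> 0x10003
print(hex(pack_ovn_geneve_option(1, 3)))
```

A value of this shape (0x10003) also shows up inside the `geneve({...})` datapath action of the ofproto/trace output later in this post.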

OVS tunnel encapsulation is performed by OpenFlow flows, so ovn-controller must write these three identifiers into the OpenFlow flow tables of the local HV. Every packet entering br-int carries these three attributes: the logical datapath identifier and logical input port identifier are assigned on ingress and stored in the OpenFlow metadata field and the OVS extension register reg14, respectively. After the packet has been processed by the OVS pipeline, if it needs to be sent out of a specific port, the logical output port identifier is written into the OVS extension register reg15.

Example (port 6 is the Geneve tunnel interface):

table=0,in_port=6 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)

As you can see, NXM_NX_TUN_ID is the tunnel_key; reg14 holds 15 bits and reg15 holds 16 bits.

The logical input port identifier and logical output port identifier carried in the OVN tunnel improve flow lookup efficiency: OVS flows can act on these two values directly, without parsing packet fields.


OVN gateway practice 2018-01-04T23:00:30+00:00 hustcat http://hustcat.github.io/ovn-gateway-practice The OVN gateway connects the overlay network to the physical network. It supports two modes, L2 and L3:

layer-2 which bridge an OVN logical switch into a VLAN, and layer-3 which provide a routed connection between an OVN router and the physical network.

Unlike a distributed logical router (DLR), an OVN gateway router is centralized on a single host (chassis) so that it may provide services which cannot yet be distributed (NAT, load balancing, etc…). As of this publication there is a restriction that gateway routers may only connect to other routers via a logical switch, whereas DLRs may connect to one another directly via a peer link. Work is in progress to remove this restriction.

It should be noted that it is possible to have multiple gateway routers tied into an environment, which means that it is possible to perform ingress ECMP routing into logical space. However, it is worth mentioning that OVN currently does not support egress ECMP between gateway routers. Again, this is being looked at as a future enhancement.



  • node1 – will serve as OVN Central and Gateway Node
  • node2 – will serve as an OVN Host
  • node3 – will serve as an OVN Host
  • client - physical network node


        |  client | Physical Network
        |  switch | outside
        |  router | gw1 port 'gw1-outside':
         ---------      port 'gw1-join':
        |  switch | join
        |  router | router1 port 'router1-join':
         ---------          port 'router1-ls1':
        |  switch | ls1
         /       \
 _______/_       _\_______  
|  vm1    |     |   vm2   |
 ---------       ---------


[root@node1 ~]# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
    port ls1-vm1
        addresses: ["02:ac:10:ff:00:11"]
    port ls1-vm2
        addresses: ["02:ac:10:ff:00:22"]

[root@node1 ~]# ovn-sbctl show
Chassis "dc82b489-22b3-42dd-a28e-f25439316356"
    hostname: "node1"
    Encap geneve
        ip: ""
        options: {csum="true"}
Chassis "598fec44-5787-452f-b527-2ef8c4adb942"
    hostname: "node2"
    Encap geneve
        ip: ""
        options: {csum="true"}
    Port_Binding "ls1-vm1"
Chassis "54292ae7-c91c-423b-a936-5b416d6bae9f"
    hostname: "node3"
    Encap geneve
        ip: ""
        options: {csum="true"}
    Port_Binding "ls1-vm2"


# add the router
ovn-nbctl lr-add router1

# create router port for the connection to 'ls1'
ovn-nbctl lrp-add router1 router1-ls1 02:ac:10:ff:00:01

# create the 'ls1' switch port for connection to 'router1'
ovn-nbctl lsp-add ls1 ls1-router1
ovn-nbctl lsp-set-type ls1-router1 router
ovn-nbctl lsp-set-addresses ls1-router1 02:ac:10:ff:00:01
ovn-nbctl lsp-set-options ls1-router1 router-port=router1-ls1

Logical network:

# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
    port ls1-router1
        type: router
        addresses: ["02:ac:10:ff:00:01"]
        router-port: router1-ls1
    port ls1-vm1
        addresses: ["02:ac:10:ff:00:11"]
    port ls1-vm2
        addresses: ["02:ac:10:ff:00:22"]
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
    port router1-ls1
        mac: "02:ac:10:ff:00:01"
        networks: [""]


# ip netns exec vm1 ping -c 1 
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=254 time=0.275 ms

Creating the gateway router

Pin the gateway router to node1: when creating the router, this is done with options:chassis={chassis_uuid}:

# create router 'gw1'
ovn-nbctl create Logical_Router name=gw1 options:chassis=dc82b489-22b3-42dd-a28e-f25439316356

# create a new logical switch for connecting the 'gw1' and 'router1' routers
ovn-nbctl ls-add join

# connect 'gw1' to the 'join' switch
ovn-nbctl lrp-add gw1 gw1-join 02:ac:10:ff:ff:01
ovn-nbctl lsp-add join join-gw1
ovn-nbctl lsp-set-type join-gw1 router
ovn-nbctl lsp-set-addresses join-gw1 02:ac:10:ff:ff:01
ovn-nbctl lsp-set-options join-gw1 router-port=gw1-join

# 'router1' to the 'join' switch
ovn-nbctl lrp-add router1 router1-join 02:ac:10:ff:ff:02
ovn-nbctl lsp-add join join-router1
ovn-nbctl lsp-set-type join-router1 router
ovn-nbctl lsp-set-addresses join-router1 02:ac:10:ff:ff:02
ovn-nbctl lsp-set-options join-router1 router-port=router1-join

# add static routes
ovn-nbctl lr-route-add gw1 ""
ovn-nbctl lr-route-add router1 ""
  • logical network
# ovn-nbctl show
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
    port ls1-router1
        type: router
        addresses: ["02:ac:10:ff:00:01"]
        router-port: router1-ls1
    port ls1-vm1
        addresses: ["02:ac:10:ff:00:11"]
    port ls1-vm2
        addresses: ["02:ac:10:ff:00:22"]
switch d4b119e9-0298-42ab-8cc7-480292231953 (join)
    port join-gw1
        type: router
        addresses: ["02:ac:10:ff:ff:01"]
        router-port: gw1-join
    port join-router1
        type: router
        addresses: ["02:ac:10:ff:ff:02"]
        router-port: router1-join
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
    port router1-ls1
        mac: "02:ac:10:ff:00:01"
        networks: [""]
    port router1-join
        mac: "02:ac:10:ff:ff:02"
        networks: [""]
router f29af2c3-e9d1-46f9-bff7-b9b8f0fd56df (gw1)
    port gw1-join
        mac: "02:ac:10:ff:ff:01"
        networks: [""]


[root@node2 ~]# ip netns exec vm1 ip route add default via

[root@node2 ~]# ip netns exec vm1 ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=253 time=2.18 ms

Connecting the overlay network to the physical network


# create new port on router 'gw1'
ovn-nbctl lrp-add gw1 gw1-outside 02:0a:7f:18:01:02

# create new logical switch and connect it to 'gw1'
ovn-nbctl ls-add outside
ovn-nbctl lsp-add outside outside-gw1
ovn-nbctl lsp-set-type outside-gw1 router
ovn-nbctl lsp-set-addresses outside-gw1 02:0a:7f:18:01:02
ovn-nbctl lsp-set-options outside-gw1 router-port=gw1-outside

# create a bridge for eth1 (run on 'node1')
ovs-vsctl add-br br-eth1

# create bridge mapping for eth1. map network name "phyNet" to br-eth1 (run on 'node1')
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phyNet:br-eth1

# create localnet port on 'outside'. set the network name to "phyNet"
ovn-nbctl lsp-add outside outside-localnet
ovn-nbctl lsp-set-addresses outside-localnet unknown
ovn-nbctl lsp-set-type outside-localnet localnet
ovn-nbctl lsp-set-options outside-localnet network_name=phyNet

# connect eth1 to br-eth1 (run on 'node1')
ovs-vsctl add-port br-eth1 eth1

The complete logical network:

# ovn-nbctl show
switch d4b119e9-0298-42ab-8cc7-480292231953 (join)
    port join-gw1
        type: router
        addresses: ["02:ac:10:ff:ff:01"]
        router-port: gw1-join
    port join-router1
        type: router
        addresses: ["02:ac:10:ff:ff:02"]
        router-port: router1-join
switch 0fab3ddd-6325-4219-aa1c-6dc9853b7069 (ls1)
    port ls1-router1
        type: router
        addresses: ["02:ac:10:ff:00:01"]
        router-port: router1-ls1
    port ls1-vm1
        addresses: ["02:ac:10:ff:00:11"]
    port ls1-vm2
        addresses: ["02:ac:10:ff:00:22"]
switch 64dc14b1-3e0f-4f68-b388-d76826a5c972 (outside)
    port outside-gw1
        type: router
        addresses: ["02:0a:7f:18:01:02"]
        router-port: gw1-outside
    port outside-localnet
        type: localnet
        addresses: ["unknown"]
router 6dec5e02-fa39-4f2c-8e1e-7a0182f110e6 (router1)
    port router1-ls1
        mac: "02:ac:10:ff:00:01"
        networks: [""]
    port router1-join
        mac: "02:ac:10:ff:ff:02"
        networks: [""]
router f29af2c3-e9d1-46f9-bff7-b9b8f0fd56df (gw1)
    port gw1-outside
        mac: "02:0a:7f:18:01:02"
        networks: [""]
    port gw1-join
        mac: "02:ac:10:ff:ff:01"
        networks: [""]


[root@node2 ~]# ip netns exec vm1 ping -c 1   
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=253 time=1.00 ms

Physical network to overlay (via direct routing)

Nodes on the same L2 network as node1 can reach the overlay network by adding a route:

client ( -> gw1:

# ip route show dev eth1  proto kernel  scope link  src 

# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=254 time=0.438 ms


ip route add via dev eth1

client ( -> vm1 (

# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=62 time=1.35 ms


[root@node2 ~]# ip netns exec vm1 tcpdump -nnvv -i vm1
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
07:41:28.561299 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 8160, seq 1, length 64
07:41:28.561357 IP (tos 0x0, ttl 64, id 53879, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 8160, seq 1, length 64

Overlay to physical network (via NAT)



# ovn-nbctl -- --id=@nat create nat type="snat" logical_ip= external_ip= -- add logical_router gw1 nat @nat

# ovn-nbctl lr-nat-list gw1
TYPE             EXTERNAL_IP        LOGICAL_IP            EXTERNAL_MAC         LOGICAL_PORT

# ovn-nbctl list NAT       
_uuid               : 3243ffd3-8d77-4bd3-9f7e-49e74e87b4a7
external_ip         : ""
external_mac        : []
logical_ip          : ""
logical_port        : []
type                : snat

The NAT rule can be deleted with ovn-nbctl lr-nat-del gw1 snat.

vm1 -> client

[root@node2 ~]# ip netns exec vm1 ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=62 time=1.63 ms

[root@client ~]# tcpdump icmp -nnvv -i eth1
[10894068.821880] device eth1 entered promiscuous mode
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
08:24:50.316495 IP (tos 0x0, ttl 61, id 31580, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 5587, seq 1, length 64
08:24:50.316536 IP (tos 0x0, ttl 64, id 6129, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 5587, seq 1, length 64



OVN-kubernetes practice 2018-01-03T23:00:30+00:00 hustcat http://hustcat.github.io/ovn-k8s-practice Environment:

  • kube-master:
  • kube-node1:
  • kube-node2:
  • kube-node3:

Start OVN daemon

Central node

  • Start OVS and controller

## start ovs
/usr/share/openvswitch/scripts/ovs-ctl start

## set ovn-remote and ovn-nb
ovs-vsctl set Open_vSwitch . external_ids:ovn-remote="tcp:$CENTRAL_IP:6642" external_ids:ovn-nb="tcp:$CENTRAL_IP:6641" external_ids:ovn-encap-ip=$LOCAL_IP external_ids:ovn-encap-type="$ENCAP_TYPE"

## set system_id
test -e $id_file || uuidgen > $id_file
ovs-vsctl set Open_vSwitch . external_ids:system-id=$(cat $id_file)

## start ovn-controller and vtep
/usr/share/openvswitch/scripts/ovn-ctl start_controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller_vtep
  • start ovn-northd
# /usr/share/openvswitch/scripts/ovn-ctl start_northd
Starting ovn-northd                                        [  OK  ]

Open up TCP ports to access the OVN databases:

[root@kube-master ~]# ovn-nbctl set-connection ptcp:6641
[root@kube-master ~]# ovn-sbctl set-connection ptcp:6642

Compute node


## start ovs
/usr/share/openvswitch/scripts/ovs-ctl start

## set ovn-remote and ovn-nb
ovs-vsctl set Open_vSwitch . external_ids:ovn-remote="tcp:$CENTRAL_IP:6642" external_ids:ovn-nb="tcp:$CENTRAL_IP:6641" external_ids:ovn-encap-ip=$LOCAL_IP external_ids:ovn-encap-type="$ENCAP_TYPE"

## set system_id
test -e $id_file || uuidgen > $id_file
ovs-vsctl set Open_vSwitch . external_ids:system-id=$(cat $id_file)

## start ovn-controller and vtep
/usr/share/openvswitch/scripts/ovn-ctl start_controller
/usr/share/openvswitch/scripts/ovn-ctl start_controller_vtep


k8s master node

  • master node initialization

Set the k8s API server address in the Open vSwitch database for the initialization scripts (and later daemons) to pick from.

# ovs-vsctl set Open_vSwitch . external_ids:k8s-api-server=""
git clone https://github.com/openvswitch/ovn-kubernetes
cd ovn-kubernetes
pip install .
  • master init
ovn-k8s-overlay master-init \
   --cluster-ip-subnet="" \
   --master-switch-subnet="" \

This creates the logical switches and router:

# ovn-nbctl show
switch d034f42f-6dd5-4ba9-bfdd-114ce17c9235 (kube-master)
    port k8s-kube-master
        addresses: ["ae:31:fa:c7:81:fc"]
    port stor-kube-master
        type: router
        addresses: ["00:00:00:B5:F1:57"]
        router-port: rtos-kube-master
switch 2680f36b-85c2-4064-b811-5c0bd91debdd (join)
    port jtor-kube-master
        type: router
        addresses: ["00:00:00:1A:E4:98"]
        router-port: rtoj-kube-master
router ce75b330-dbd3-43d2-aa4f-4e17af898532 (kube-master)
    port rtos-kube-master
        mac: "00:00:00:B5:F1:57"
        networks: [""]
    port rtoj-kube-master
        mac: "00:00:00:1A:E4:98"
        networks: [""]

k8s node


ovs-vsctl set Open_vSwitch . \

ovn-k8s-overlay minion-init \
  --cluster-ip-subnet="" \
  --minion-switch-subnet="" \

## for https, specify the CA certificate and token
ovs-vsctl set Open_vSwitch . \
  external_ids:k8s-api-server="https://$K8S_API_SERVER_IP" \
  external_ids:k8s-ca-certificate="/etc/kubernetes/certs/ca.crt" \

This creates the corresponding logical switch and connects it to the logical router (kube-master):

# ovn-nbctl show
switch 0147b986-1dab-49a5-9c4e-57d9feae8416 (kube-node1)
    port k8s-kube-node1
        addresses: ["ba:2c:06:32:14:78"]
    port stor-kube-node1
        type: router
        addresses: ["00:00:00:C0:2E:C7"]
        router-port: rtos-kube-node1
router ce75b330-dbd3-43d2-aa4f-4e17af898532 (kube-master)
    port rtos-kube-node2
        mac: "00:00:00:D3:4B:AA"
        networks: [""]
    port rtos-kube-node1
        mac: "00:00:00:C0:2E:C7"
        networks: [""]
    port rtos-kube-master
        mac: "00:00:00:B5:F1:57"
        networks: [""]
    port rtoj-kube-master
        mac: "00:00:00:1A:E4:98"
        networks: [""]


ovs-vsctl set Open_vSwitch . \

ovn-k8s-overlay minion-init \
  --cluster-ip-subnet="" \
  --minion-switch-subnet="" \

Gateway node

## attach eth0 to bridge breth0 and move IP/routes
ovn-k8s-util nics-to-bridge eth0

## initialize gateway

ovs-vsctl set Open_vSwitch . \

ovn-k8s-overlay gateway-init \
  --cluster-ip-subnet="$CLUSTER_IP_SUBNET" \
  --bridge-interface breth0 \
  --physical-ip "$PHYSICAL_IP" \
  --node-name="$NODE_NAME" \
  --default-gw "$EXTERNAL_GATEWAY"

# Since you share a NIC for both mgmt and North-South connectivity, you will 
# have to start a separate daemon to de-multiplex the traffic.
ovn-k8s-gateway-helper --physical-bridge=breth0 --physical-interface=eth0 \
    --pidfile --detach

Watchers on master node

ovn-k8s-watcher \
  --overlay \
  --pidfile \
  --log-file \
  -vfile:info \
  -vconsole:emer \

# ps -ef | grep ovn-k8s
root     28151     1  1 12:57 ?        00:00:00 /usr/bin/python /usr/bin/ovn-k8s-watcher --overlay --pidfile --log-file -vfile:info -vconsole:emer --detach




apiVersion: v1
kind: Pod
metadata:
  name: sshd-2
spec:
  containers:
  - name: sshd-2
    image: dbyin/sshd:1.0


# tail -f /var/log/openvswitch/ovn-k8s-cni-overlay.log
2018-01-03T08:42:39.609Z |  0  | ovn-k8s-cni-overlay | DBG | plugin invoked with cni_command = ADD cni_container_id = a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32 cni_ifname = eth0 cni_netns = /proc/31180/ns/net cni_args = IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=sshd-2;K8S_POD_INFRA_CONTAINER_ID=a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.633Z |  1  | kubernetes | DBG | Annotations for pod sshd-2: {u'ovn': u'{"gateway_ip": "", "ip_address": "", "mac_address": "0a:00:00:00:00:01"}'}
2018-01-03T08:42:39.635Z |  2  | ovn-k8s-cni-overlay | DBG | Creating veth pair for container a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.662Z |  3  | ovn-k8s-cni-overlay | DBG | Bringing up veth outer interface a2f5796e82e9286
2018-01-03T08:42:39.769Z |  4  | ovn-k8s-cni-overlay | DBG | Create a link for container namespace
2018-01-03T08:42:39.781Z |  5  | ovn-k8s-cni-overlay | DBG | Adding veth inner interface to namespace for container a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:39.887Z |  6  | ovn-k8s-cni-overlay | DBG | Configuring and bringing up veth inner interface a2f5796e82e92_c. New name:'eth0',MAC address:'0a:00:00:00:00:01', MTU:'1400', IP:
2018-01-03T08:42:44.960Z |  7  | ovn-k8s-cni-overlay | DBG | Setting gateway_ip for container:a2f5796e82e9286d7f56540585b6040b3a743093c46ea34364212cf1afd42a32
2018-01-03T08:42:44.983Z |  8  | ovn-k8s-cni-overlay | DBG | output is {"gateway_ip": "", "ip_address": "", "mac_address": "0a:00:00:00:00:01"}


[root@kube-node2 ~]# ping -c 2
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=63 time=0.281 ms
64 bytes from icmp_seq=2 ttl=63 time=0.304 ms

--- ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 0.281/0.292/0.304/0.020 ms

OVS on kube-node1:

# ovs-vsctl show                                      
    Bridge br-int
        fail_mode: secure
        Port "ovn-069367-0"
            Interface "ovn-069367-0"
                type: vxlan
                options: {csum="true", key=flow, remote_ip=""}
        Port br-int
            Interface br-int
                type: internal
        Port "k8s-kube-node1"
            Interface "k8s-kube-node1"
                type: internal
        Port "a2f5796e82e9286"
            Interface "a2f5796e82e9286"
        Port "ovn-7f9937-0"
            Interface "ovn-7f9937-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip=""}
        Port "ovn-0696ca-0"
            Interface "ovn-0696ca-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip=""}
    ovs_version: "2.8.1"



Let's look at how OpenFlow processes a packet from 192.168.3.2 to 192.168.2.3:

[root@kube-node2 ~]# ovs-appctl ofproto/trace br-int in_port=9,ip,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=,nw_dst=,nw_ttl=32
 0. in_port=9, priority 100
42. ip,reg0=0x1/0x1,metadata=0x5, priority 100, cookie 0x88177e0
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 43.

Final flow: ip,reg0=0x1,reg11=0xa,reg12=0x6,reg13=0x1,reg14=0x2,reg15=0x1,metadata=0x5,in_port=9,vlan_tci=0x0000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=,nw_dst=,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32
Megaflow: recirc_id=0,ct_state=-new-est-rel-inv-trk,eth,ip,in_port=9,vlan_tci=0x0000/0x1000,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:d3:4b:aa,nw_dst=,nw_frag=no
Datapath actions: ct(zone=1),recirc(0x24)

recirc(0x24) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)

Flow: recirc_id=0x24,ct_state=new|trk,eth,ip,reg0=0x1,reg11=0xa,reg12=0x6,reg13=0x1,reg14=0x2,reg15=0x1,metadata=0x5,in_port=9,vlan_tci=0x0000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=,nw_dst=,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32

        Resuming from table 43
65. reg15=0x1,metadata=0x5, priority 100
13. ip,metadata=0x1,nw_dst=, priority 49, cookie 0xc6501434
     -> NXM_NX_XXREG0[96..127] is now 0xc0a80203
14. reg0=0xc0a80203,reg15=0x3,metadata=0x1, priority 100, cookie 0x3b957bac
64. reg10=0x1/0x1,reg15=0x3,metadata=0x1, priority 100
    65. reg15=0x3,metadata=0x1, priority 100
        23. metadata=0x4,dl_dst=0a:00:00:00:00:01, priority 50, cookie 0x6c2597ec
        32. reg15=0x3,metadata=0x4, priority 100
             -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
             -> output to kernel tunnel
     -> NXM_OF_IN_PORT[] is now 0

Final flow: unchanged
Megaflow: recirc_id=0x24,ct_state=+new-est-rel-inv+trk,eth,ip,tun_id=0/0xffffff,tun_metadata0=NP,in_port=9,vlan_tci=0x0000/0x1000,dl_src=82:ff:e7:83:99:a9,dl_dst=00:00:00:d3:4b:aa,nw_src=,nw_dst=,nw_ecn=0,nw_ttl=32,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x4,dst=,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10003}),flags(df|csum|key))),set(eth(src=00:00:00:c0:2e:c7,dst=0a:00:00:00:00:01)),2


  • (1) dl_dst=00:00:00:d3:4b:aa in the ofproto/trace is the MAC address of kube-node2's gateway 192.168.2.1 (i.e. the address of stor-kube-node2).
  • (2) Rule 65 changes metadata from 0x5 (datapath/kube-node2) to 0x1 (router/kube-master).
  • (3) Rule 13 is the routing rule; it updates the TTL and rewrites the source MAC. Rule 14 rewrites the destination MAC.
  • (4) Rule 64 changes the datapath to kube-node1 (0x4).
  • (5) Rule 32 sets the packet's tun_id to 0x4 and tun_metadata0 to 0x3, then forwards the packet to port 7, the tunnel device:
 7(ovn-c7889c-0): addr:76:c2:2f:bb:06:5b
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max

        Port "ovn-c7889c-0"
            Interface "ovn-c7889c-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip=""}


[root@kube-node1 ~]# ovs-appctl ofproto/trace br-int in_port=6,tun_id=0x4,tun_metadata0=0x3,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01
Flow: tun_id=0x4,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01,dl_type=0x0000

 0. in_port=6, priority 100
     -> OXM_OF_METADATA[0..23] is now 0x4
     -> NXM_NX_REG14[0..14] is now 0
     -> NXM_NX_REG15[0..15] is now 0x3
48. reg15=0x3,metadata=0x4, priority 50, cookie 0x37e139d4
64. priority 0
65. reg15=0x3,metadata=0x4, priority 100

Final flow: reg11=0x5,reg12=0x8,reg13=0xa,reg15=0x3,tun_id=0x4,metadata=0x4,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:c0:2e:c7,dl_dst=0a:00:00:00:00:01,dl_type=0x0000
Megaflow: recirc_id=0,ct_state=-new-est-rel-inv-trk,eth,tun_id=0x4/0xffffff,tun_metadata0=0x3/0x7fffffff,in_port=6,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00,dl_type=0x0000
Datapath actions: 5

Finally, the packet is delivered to port 10, the port of the container:

 10(a2f5796e82e9286): addr:de:d3:83:cf:22:7c
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max


GPU computing -- GPU architecture and the CUDA programming model 2017-11-11T20:20:30+00:00 hustcat http://hustcat.github.io/gpu-architecture Architecture




(1) The CPU has powerful ALUs and a high clock frequency. (2) The CPU has fairly large caches, typically three levels (L1, L2, and L3); L3 can reach 8MB, and these caches take up a significant share of the die. (3) The CPU has complex control logic, such as deep pipelines, branch prediction, and out-of-order execution.



(1) The GPU has a large number of ALUs. (2) Its caches are small; unlike on the CPU, they are not meant to hold data for later reuse but to serve the threads. (3) There is no complex control logic: no branch prediction or similar components.



  • GPU internals

A "hypothetical" GPU core looks like this:

It contains 8 ALUs and 4 execution contexts, each holding 8 Ctx. A single core can thus run 4 instruction streams concurrently (interleaved) and 32 concurrent fragments.



  • Example

Take the NVIDIA GeForce GTX 580 as an example; each GPU core looks like this:

Each core has 64 CUDA cores (also called Stream Processors, SP), and each CUDA core can be thought of as a complete, full-featured ALU. These CUDA cores are divided into 2 groups of 32; each group shares the same fetch/decode units and is called a Stream Multiprocessor (SM).

Each core can run 1536 fragments concurrently, i.e. 1536 CUDA threads.

A GTX 580 GPU contains 16 cores, 1024 CUDA cores in total, and can run 24576 (1536*16) CUDA threads concurrently.
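A quick arithmetic check of these figures, using the numbers quoted above:

```python
cores = 16                 # GPU cores on a GTX 580
cuda_cores_per_core = 64   # CUDA cores (SPs) per core
threads_per_core = 1536    # concurrently resident CUDA threads per core

total_cuda_cores = cores * cuda_cores_per_core
total_threads = cores * threads_per_core
print(total_cuda_cores, total_threads)   # 1024 CUDA cores, 24576 threads
```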



In general, the bandwidth between the CPU and memory is only a few tens of GB/s. For example, the Intel Xeon E5-2699 v3 has a memory bandwidth of 68 GB/s ((2133 * 64 / 8) * 4 MB/s):

Max memory size (depends on memory type): 768 GB
Memory types: DDR4 1600/1866/2133
Max memory channels: 4
Max memory bandwidth: 68 GB/s
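The 68 GB/s figure can be reproduced from the numbers in the table; a back-of-the-envelope check in Python:

```python
transfer_rate = 2133   # DDR4-2133, mega-transfers per second (MT/s)
bus_width_bits = 64    # bits per memory channel
channels = 4

bandwidth_mb_s = transfer_rate * bus_width_bits / 8 * channels
print(bandwidth_mb_s / 1000)   # ~68 GB/s
```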





 ----------           ------------
|___DRAM___|         |___GDRAM____|
      |                    |
 ----------           ------------
|   CPU    |         |    GPU     |
|__________|         |____________|
      |                    |
  ---------            --------

A PCIe Gen3 x1 link has a theoretical bandwidth of about 1000MB/s, so Gen3 x32 peaks at ~32GB/s; due to the protocol's own overhead, the effective bandwidth is often only about two thirds of the theoretical value or less. The communication cost between the CPU and GPU is therefore considerable.
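Using the rule of thumb from the text (~1000 MB/s per Gen3 lane, with roughly two thirds of that usable in practice), the numbers work out as follows:

```python
per_lane_mb_s = 1000   # approximate PCIe Gen3 bandwidth per lane
lanes = 32

theoretical_gb_s = per_lane_mb_s * lanes / 1000
effective_gb_s = theoretical_gb_s * 2 / 3   # rough effective bandwidth
print(theoretical_gb_s, round(effective_gb_s, 1))
```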




A CUDA program consists of two parts: a host program that runs on the CPU and a device program that runs on the GPU, each with its own memory. A function that runs on the GPU is called a kernel function and is declared with the __global__ keyword, for example:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

When the host program invokes a device program, it can specify the number of CUDA threads that execute the kernel via the parameters in <<<...>>>. Each thread is assigned a thread ID when it executes the kernel function, which the kernel can access through the built-in threadIdx variable.

Thread Hierarchy

CUDA organizes threads in three levels: Grid, Block, and Thread. threadIdx is a 3-component vector, so threads can be identified by a one-, two-, or three-dimensional thread index; likewise, a thread block can itself be one-, two-, or three-dimensional.

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block.


There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads


Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks.The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.

  • Computing the thread index

The thread index is computed differently depending on the dimensionality of the thread, block, and grid. Here we consider two simple cases:

(1) 1-D grid of 1-D blocks

    int threadId = blockIdx.x * blockDim.x + threadIdx.x;

(2) 1-D grid of 2-D blocks

    int threadId = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;



For example, the following kernel uses a 2-D grid of 2-D blocks, one thread per matrix element, to add two matrices:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}



CUDA threads are ultimately executed by physical hardware compute units; this section looks at how threads map onto those units. First, a recap of a few concepts.

  • Basic concepts

Abbr.  Full name                 Notes
SM     Streaming Multiprocessor  Contains a number of SPs; the SM fetches, decodes, and issues instructions to its SPs. A GPU can be viewed as a set of SMs.
SP     Streaming Processor       Corresponds to a single CUDA core.
  • 映射关系


Thread -> SP
Block  -> SM
Grid   -> GPU

Note that although blocks map onto SMs, the mapping is not one-to-one. Blocks can be scheduled onto any number of SMs, in any order, and then execute independently of one another, in parallel or serially. This lets the hardware scale to any number of CUDA blocks.

Memory Hierarchy

During execution, CUDA threads can access several memory spaces. Each thread has its own private local memory; each block has a shared memory that all threads in the block can access; and finally, all threads can access global memory.


In terms of access speed: local memory > shared memory > global memory.

Memory allocated with cudaMalloc is global memory. Variables declared with __shared__ inside a kernel are shared memory. Ordinary variables defined inside a kernel use local memory.
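As a minimal illustrative sketch (not from the original post; the kernel name and sizes are made up), all three memory spaces can appear in a single kernel:

```cuda
// Hypothetical kernel: global, shared, and local memory in one place
__global__ void scale(float *gdata, int n, float factor)
{
    __shared__ float tile[256];        // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 'i': per-thread local
    if (i < n) {
        tile[threadIdx.x] = gdata[i];  // gdata points into global memory
        __syncthreads();
        gdata[i] = tile[threadIdx.x] * factor;
    }
}

int main()
{
    float *d_a;
    cudaMalloc(&d_a, 1024 * sizeof(float));  // global memory allocation
    scale<<<4, 256>>>(d_a, 1024, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```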


Understanding the RoCE network protocol 2017-11-09T15:20:30+00:00 hustcat http://hustcat.github.io/roce-protocol RoCE is short for RDMA over Converged Ethernet, which makes it possible to run RDMA over Ethernet. The alternative is RDMA over an InfiniBand fabric. RoCE (strictly speaking, RoCEv1) is thus a link-layer protocol that corresponds to InfiniBand.

There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.


For a RoCE fabric, the hardware side needs L2 Ethernet switches that support IEEE DCB, and the compute nodes need RoCE-capable NICs:

On the hardware side, basically you need an L2 Ethernet switch with IEEE DCB (Data Center Bridging, aka Converged Enhanced Ethernet) with support for priority flow control.

 On the compute or storage server end, you need an RoCE-capable network adapter.


For the protocol specification, see InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE.



Because RoCEv1 frames carry no IP header, they can only be delivered within an L2 subnet. RoCEv2 therefore extends RoCEv1 by replacing the GRH (Global Routing Header) with a UDP header plus an IP header:

RoCEv2 is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format.

Instead of the GRH, RoCEv2 packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.





Linux Soft-RoCE implementation 2017-11-08T23:20:30+00:00 hustcat http://hustcat.github.io/linux-soft-roce-implementation Soft-RoCE, merged in kernel 4.9, implements RoCEv2 in software.


libRXE (user space library)

|--- rxe_create_qp
    |--- ibv_cmd_create_qp
  • ibv_create_qp
LATEST_SYMVER_FUNC(ibv_create_qp, 1_1, "IBVERBS_1.1",
		   struct ibv_qp *,
		   struct ibv_pd *pd,
		   struct ibv_qp_init_attr *qp_init_attr)
	struct ibv_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); ///rxe_ctx_ops
  • rxe_create_qp
static struct ibv_qp *rxe_create_qp(struct ibv_pd *pd,
				    struct ibv_qp_init_attr *attr)
	struct ibv_create_qp cmd;
	struct rxe_create_qp_resp resp;
	struct rxe_qp *qp;
	int ret;
	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd,
				&resp.ibv_resp, sizeof resp); /// ibv_create_qp CMD, to kernel
	qp->sq.max_sge = attr->cap.max_send_sge;
	qp->sq.max_inline = attr->cap.max_inline_data;
	qp->sq.queue = mmap(NULL, resp.sq_mi.size, PROT_READ | PROT_WRITE,
			    pd->context->cmd_fd, resp.sq_mi.offset); /// mmap, see rxe_mmap




static const struct file_operations uverbs_fops = {
	.owner	 = THIS_MODULE,
	.write	 = ib_uverbs_write,
	.open	 = ib_uverbs_open,
	.release = ib_uverbs_close,
	.llseek	 = no_llseek,
};
  • ibv_open_device
LATEST_SYMVER_FUNC(ibv_open_device, 1_1, "IBVERBS_1.1",
		   struct ibv_context *,
		   struct ibv_device *device)
	struct verbs_device *verbs_device = verbs_get_device(device);
	char *devpath;
	int cmd_fd, ret;
	struct ibv_context *context;
	struct verbs_context *context_ex;

	if (asprintf(&devpath, "/dev/infiniband/%s", device->dev_name) < 0)
		return NULL;

	/* We'll only be doing writes, but we need O_RDWR in case the
	 * provider needs to mmap() the file. */
	cmd_fd = open(devpath, O_RDWR | O_CLOEXEC); /// /dev/infiniband/uverbs0

	if (cmd_fd < 0)
		return NULL;

	if (!verbs_device->ops->init_context) {
		context = verbs_device->ops->alloc_context(device, cmd_fd); ///rxe_alloc_context, rxe_dev_ops
		if (!context)
			goto err;
	context->device = device;
	context->cmd_fd = cmd_fd;
	pthread_mutex_init(&context->mutex, NULL);


	return context;

kernel (rdma_rxe module)

  • ib_uverbs_create_qp


|--- ib_uverbs_create_qp
     |--- create_qp
	      |--- ib_device->create_qp
		       |--- rxe_create_qp

create_qp calls ib_device->create_qp, which for RXE is rxe_create_qp; see rxe_register_device.

  • rxe_create_qp
|--- rxe_qp_from_init
     |--- rxe_qp_init_req


  • rxe_qp_init_req


Creates the corresponding UDP socket:



  • rxe_queue_init


struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe,
				 int *num_elem,
				 unsigned int elem_size)
	struct rxe_queue *q;
	size_t buf_size;
	unsigned int num_slots;
	buf_size = sizeof(struct rxe_queue_buf) + num_slots * elem_size;

	q->buf = vmalloc_user(buf_size);

The memory buffer pointed to by rxe_queue->buf is mapped into user space by rxe_mmap; the queue elements correspond to struct rxe_send_wqe.

When a libibverbs application calls ibv_post_send, the corresponding struct rxe_send_wqe is enqueued onto this queue; see post_one_send in rdma-core.

  • rxe_mmap
/*
 * rxe_mmap - create a new mmap region
 * @context: the IB user context of the process making the mmap() call
 * @vma: the VMA to be initialized
 * Return zero if the mmap is OK. Otherwise, return an errno.
 */
int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
	struct rxe_dev *rxe = to_rdev(context->device);
	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
	unsigned long size = vma->vm_end - vma->vm_start;
	struct rxe_mmap_info *ip, *pp;

	ret = remap_vmalloc_range(vma, ip->obj, 0);
	if (ret) {
		pr_err("rxe: err %d from remap_vmalloc_range\n", ret);
		goto done;

	vma->vm_ops = &rxe_vm_ops;
	vma->vm_private_data = ip;



rxe_post_send converts each struct ibv_send_wr into a struct rxe_send_wqe, appends it to the send queue rxe_qp->sq, and then issues the IB_USER_VERBS_CMD_POST_SEND command to the RXE kernel module via cmd_fd:

/* this API does not make a distinction between
   restartable and non-restartable errors */
static int rxe_post_send(struct ibv_qp *ibqp,
			 struct ibv_send_wr *wr_list,
			 struct ibv_send_wr **bad_wr)
	int rc = 0;
	int err;
	struct rxe_qp *qp = to_rqp(ibqp);/// ibv_qp -> rxe_qp
	struct rxe_wq *sq = &qp->sq;

	if (!bad_wr)
		return EINVAL;

	*bad_wr = NULL;

	if (!sq || !wr_list || !sq->queue)
	 	return EINVAL;


	while (wr_list) {
		rc = post_one_send(qp, sq, wr_list); /// ibv_send_wr -> rxe_send_wqe, enqueue
		if (rc) {
			*bad_wr = wr_list;

		wr_list = wr_list->next;


	err =  post_send_db(ibqp); /// IB_USER_VERBS_CMD_POST_SEND cmd
	return err ? err : rc;



ib_uverbs_post_send -> ib_device->post_send -> rxe_post_send -> rxe_requester -> ip_local_out

  • rxe_post_send
static int rxe_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
			 struct ib_send_wr **bad_wr)
	int err = 0;
	struct rxe_qp *qp = to_rqp(ibqp); ///ib_qp -> rxe_qp
	 * Must sched in case of GSI QP because ib_send_mad() hold irq lock,
	 * and the requester call ip_local_out_sk() that takes spin_lock_bh.
	must_sched = (qp_type(qp) == IB_QPT_GSI) ||
			(queue_count(qp->sq.queue) > 1);

	rxe_run_task(&qp->req.task, must_sched); /// to rxe_requester

	return err;
  • rxe_requester


int rxe_requester(void *arg)
	struct rxe_qp *qp = (struct rxe_qp *)arg;
	struct rxe_pkt_info pkt;
	struct sk_buff *skb;
	struct rxe_send_wqe *wqe;
	wqe = req_next_wqe(qp); /// get rxe_send_wqe
	/// rxe_send_wqe -> skb
	skb = init_req_packet(qp, wqe, opcode, payload, &pkt);
	ret = rxe_xmit_packet(to_rdev(qp->ibqp.device), qp, &pkt, skb);

static inline int rxe_xmit_packet(struct rxe_dev *rxe, struct rxe_qp *qp,
				  struct rxe_pkt_info *pkt, struct sk_buff *skb)
	if (pkt->mask & RXE_LOOPBACK_MASK) {
		memcpy(SKB_TO_PKT(skb), pkt, sizeof(*pkt));
		err = rxe->ifc_ops->loopback(skb);
	} else {
		err = rxe->ifc_ops->send(rxe, pkt, skb);/// ifc_ops->send, send



RDMA Programming - Base on linux-rdma 2017-11-08T23:00:30+00:00 hustcat http://hustcat.github.io/rdma-programming linux-rdma is the userspace library for the Linux kernel InfiniBand subsystem (drivers/infiniband); it provides the InfiniBand Verbs API and the RDMA Verbs API.


  • Queue Pair(QP)

To perform RDMA operations, a connection must be established between the two ends; this is done through a Queue Pair (QP), which plays a role similar to a socket. Both ends of the communication must initialize their QPs, and the Communication Manager (CM) exchanges QP information before the two sides actually establish a connection.

Once a QP is established, the verbs API can be used to perform RDMA reads, RDMA writes, and atomic operations. Serialized send/receive operations, which are similar to socket reads/writes, can be performed as well.

A QP corresponds to struct ibv_qp, and ibv_create_qp creates a QP:

 * ibv_create_qp - Create a queue pair.
struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
			     struct ibv_qp_init_attr *qp_init_attr);
  • Completion Queue(CQ)

A Completion Queue is an object which contains the completed work requests which were posted to the Work Queues (WQ). Every completion says that a specific WR was completed (both successfully completed WRs and unsuccessfully completed WRs). A Completion Queue is a mechanism to notify the application about information of ended Work Requests (status, opcode, size, source).

The corresponding data structure is struct ibv_cq; ibv_create_cq creates a CQ:

 * ibv_create_cq - Create a completion queue
 * @context - Context CQ will be attached to
 * @cqe - Minimum number of entries required for CQ
 * @cq_context - Consumer-supplied context returned for completion events
 * @channel - Completion channel where completion events will be queued.
 *     May be NULL if completion events will not be used.
 * @comp_vector - Completion vector used to signal completion events.
 *     Must be >= 0 and < context->num_comp_vectors.
struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
			     void *cq_context,
			     struct ibv_comp_channel *channel,
			     int comp_vector);
  • Memory Registration (MR)

Memory Registration is a mechanism that allows an application to describe a set of virtually contiguous memory locations or a set of physically contiguous memory locations to the network adapter as a virtually contiguous buffer using Virtual Addresses.

The corresponding data structure is struct ibv_mr:

struct ibv_mr {
	struct ibv_context     *context;
	struct ibv_pd	       *pd;
	void		       *addr;
	size_t			length;
	uint32_t		handle;
	uint32_t		lkey;
	uint32_t		rkey;
};

Every MR has a remote and a local key (rkey, lkey).

Local keys are used by the local HCA to access local memory, such as during a receive operation.

Remote keys are given to the remote HCA to allow a remote process access to system memory during RDMA operations.

ibv_reg_mr registers a memory region (MR), associates it with a protection domain (PD), and assigns it local and remote keys (lkey, rkey).

 * ibv_reg_mr - Register a memory region
struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
			  size_t length, int access);
  • Protection Domain (PD)

Object whose components can interact with only each other. These components can be AH, QP, MR, and SRQ.

A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows , as a means for enabling and controlling network adapter access to Host System memory.

struct ibv_pd is used to implement protection domains:

struct ibv_pd {
	struct ibv_context     *context;
	uint32_t		handle;
};

ibv_alloc_pd creates a protection domain (PD). PDs limit which memory regions can be accessed by which queue pairs (QP) providing a degree of protection from unauthorized access.

 * ibv_alloc_pd - Allocate a protection domain
struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);
  • Send Request (SR)

An SR defines how much data will be sent, from where, how and, with RDMA, to where. struct ibv_send_wr is used to implement SRs; see struct ibv_send_wr.

Example (IB Verbs API)

RDMA applications can be written against either the librdmacm or the libibverbs API; the former is a higher-level wrapper around the latter.

rc_pingpong is an example that uses the libibverbs API directly.

In general, the basic flow of using the IB Verbs API is as follows:

  • (1) Get the device list

First you must retrieve the list of available IB devices on the local host. Every device in this list contains both a name and a GUID. For example the device names can be: mthca0, mlx4_1. See here.

An IB device corresponds to struct ibv_device:

struct ibv_device {
	struct _ibv_device_ops	_ops;
	enum ibv_node_type	node_type;
	enum ibv_transport_type	transport_type;
	/* Name of underlying kernel IB device, eg "mthca0" */
	char			name[IBV_SYSFS_NAME_MAX];
	/* Name of uverbs device, eg "uverbs0" */
	char			dev_name[IBV_SYSFS_NAME_MAX];
	/* Path to infiniband_verbs class device in sysfs */
	char			dev_path[IBV_SYSFS_PATH_MAX];
	/* Path to infiniband class device in sysfs */
	char			ibdev_path[IBV_SYSFS_PATH_MAX];
};

An application obtains the list of IB devices through the ibv_get_device_list API:

 * ibv_get_device_list - Get list of IB devices currently available
 * @num_devices: optional.  if non-NULL, set to the number of devices
 * returned in the array.
 * Return a NULL-terminated array of IB devices.  The array can be
 * released with ibv_free_device_list().
struct ibv_device **ibv_get_device_list(int *num_devices);
  • (2) Open the requested device

Iterate over the device list, choose a device according to its GUID or name, and open it. See here.


 * ibv_open_device - Initialize device for use
struct ibv_context *ibv_open_device(struct ibv_device *device);


struct ibv_context {
	struct ibv_device      *device;
	struct ibv_context_ops	ops;
	int			cmd_fd;
	int			async_fd;
	int			num_comp_vectors;
	pthread_mutex_t		mutex;
	void		       *abi_compat;
};
  • (3) Allocate a Protection Domain


A Protection Domain (PD) allows the user to restrict which components can interact with only each other.

These components can be AH, QP, MR, MW, and SRQ.

  • (4) Register a memory region


Any memory buffer which is valid in the process’s virtual space can be registered.

During the registration process the user sets memory permissions and receives local and remote keys (lkey/rkey) which will later be used to refer to this memory buffer.

  • (5) Create a Completion Queue(CQ)


A CQ contains completed work requests (WR). Each WR will generate a completion queue entry (CQE) that is placed on the CQ.

The CQE will specify if the WR was completed successfully or not.

  • (6) Create a Queue Pair(QP)


Creating a QP will also create an associated send queue and receive queue.

  • (7) Bring up a QP


A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).

This provides needed information used by the QP to be able send / receive data.


 * ibv_modify_qp - Modify a queue pair.
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
		  int attr_mask);



RESET  Newly created, queues empty.
INIT   Basic information set. Ready for posting to receive queue.
RTR    Ready to Receive. Remote address info set for connected QPs; QP may now receive packets.
RTS    Ready to Send. Timeout and retry parameters set; QP may now send packets.
  • (8) Post work requests and poll for completion

Use the created QP for communication operations.


  • (9) Cleanup
Destroy objects in the reverse order you created them:
Delete QP
Delete CQ
Deregister MR
Deallocate PD
Close device
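The nine steps above can be sketched against libibverbs as follows. This is a hedged sketch rather than a working program: error handling is minimal, and steps (7) and (8), which need out-of-band exchange of QP attributes with a peer, are only outlined in comments:

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);       /* (1) */
    if (!list || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(list[0]);         /* (2) */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* (3) */

    char *buf = calloc(1, 4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,               /* (4) */
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);  /* (5) */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);               /* (6) */

    /* (7) Bring the QP up: RESET -> INIT -> RTR -> RTS with ibv_modify_qp,
     * after exchanging QPN/LID/GID with the peer out of band (omitted).
     * (8) Post work requests with ibv_post_send/ibv_post_recv and poll
     * completions with ibv_poll_cq (omitted). */

    /* (9) Cleanup, in reverse order of creation */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```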


  • server
# ibv_rc_pingpong -d rxe0 -g 0 -s 128 -r 1 -n 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
  remote address: LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
256 bytes in 0.00 seconds = 11.38 Mbit/sec
1 iters in 0.00 seconds = 180.00 usec/iter
  • client
# ibv_rc_pingpong -d rxe0 -g 0 -s 128 -r 1 -n 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
  remote address: LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
256 bytes in 0.00 seconds = 16.13 Mbit/sec
1 iters in 0.00 seconds = 127.00 usec/iter


In the capture, the first RC Send Only packet is the one the client sends to the server (see here). The server then replies with an RC Ack and sends an RC Send Only of its own to the client (see here).