Understanding the RoCE network protocol

Category: Network | Tags: RDMA, RoCE

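RoCE is short for RDMA over Converged Ethernet; it makes it possible to run RDMA over Ethernet. The alternative is RDMA over InfiniBand. RoCE (strictly speaking, RoCEv1) is therefore a link-layer protocol that serves as the Ethernet counterpart to InfiniBand.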

There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.

RoCEv1

A RoCE fabric needs L2 Ethernet switches that support IEEE DCB, and the compute nodes need RoCE-capable network adapters:

On the hardware side, basically you need an L2 Ethernet switch with IEEE DCB (Data Center Bridging, aka Converged Enhanced Ethernet) with support for priority flow control.

 On the compute or storage server end, you need an RoCE-capable network adapter.

The corresponding frame format is as follows:

For the protocol specification, see InfiniBand™ Architecture Specification Release 1.2.1, Annex A16: RoCE.
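The RoCEv1 layout can be sketched roughly in Python. This is a minimal illustration, not taken from the original post: it assumes an untagged Ethernet frame (no 802.1Q VLAN tag, although DCB deployments often use one), and the helper names `build_ethernet_header`, `is_roce_v1` and `roce_v1_overhead` are hypothetical. The EtherType 0x8915 and the header sizes (40-byte GRH, 12-byte BTH, 4-byte ICRC) are the standard values.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType that identifies RoCEv1 frames
ETH_HEADER_LEN = 14         # dst MAC (6) + src MAC (6) + EtherType (2), untagged
GRH_LEN = 40                # InfiniBand Global Routing Header
BTH_LEN = 12                # InfiniBand Base Transport Header
ICRC_LEN = 4                # invariant CRC that follows the payload


def build_ethernet_header(dst_mac: bytes, src_mac: bytes) -> bytes:
    """Pack an untagged Ethernet header carrying the RoCEv1 EtherType."""
    return struct.pack("!6s6sH", dst_mac, src_mac, ROCE_V1_ETHERTYPE)


def is_roce_v1(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame has the RoCEv1 EtherType."""
    if len(frame) < ETH_HEADER_LEN:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return ethertype == ROCE_V1_ETHERTYPE


def roce_v1_overhead() -> int:
    """Bytes of header/trailer around the payload, excluding the Ethernet FCS."""
    return ETH_HEADER_LEN + GRH_LEN + BTH_LEN + ICRC_LEN
```

Because the frame carries a GRH rather than an IP header, a check like `is_roce_v1` only needs to look at the EtherType field.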

Example:

RoCEv2

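Because RoCEv1 frames carry no IP header, they can only be exchanged within a single L2 subnet. RoCEv2 therefore extends RoCEv1 by replacing the GRH (Global Routing Header) with a UDP header plus an IP header: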

RoCEv2 is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format.

Instead of the GRH, RoCEv2 packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.

The frame format is as follows:
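A minimal Python sketch of how a RoCEv2 packet could be recognized on the wire is given here; it is not from the original post. It assumes an untagged Ethernet frame carrying IPv4 (RoCEv2 can also run over IPv6, which this sketch ignores), and the function names `is_roce_v2` and `sample_frame` are hypothetical. UDP destination port 4791 is the IANA-assigned RoCEv2 port; the BTH, payload and ICRC follow the UDP header.

```python
import struct

ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ETH_HEADER_LEN = 14      # untagged Ethernet header


def is_roce_v2(frame: bytes) -> bool:
    """Return True if an untagged Ethernet/IPv4 frame is UDP to port 4791."""
    if len(frame) < ETH_HEADER_LEN + 20 + 8:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    ihl = (frame[ETH_HEADER_LEN] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or frame[ETH_HEADER_LEN + 9] != IPPROTO_UDP:
        return False
    udp_offset = ETH_HEADER_LEN + ihl
    if len(frame) < udp_offset + 8:
        return False
    _src_port, dst_port = struct.unpack_from("!HH", frame, udp_offset)
    return dst_port == ROCE_V2_UDP_PORT


def sample_frame(dst_port: int) -> bytes:
    """Build a header-only Ethernet/IPv4/UDP frame for illustration.

    Addresses and ports are made up and checksums are left at zero.
    """
    eth = struct.pack("!6s6sH", b"\x02" * 6, b"\x04" * 6, ETHERTYPE_IPV4)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 28, 0, 0, 64, IPPROTO_UDP, 0,
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return eth + ip + udp
```

Since the RDMA transport now rides on ordinary IP/UDP, any L3 router can forward these packets, which is exactly what the GRH-based RoCEv1 frame could not offer.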

Example:

It is worth noting that the Linux kernel added a software implementation of RoCEv2, known as Soft-RoCE, in version 4.9.


