214 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			214 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
****************************
 | 
						|
RDMA Transport (RTRS)
 | 
						|
****************************
 | 
						|
 | 
						|
RTRS (RDMA Transport) is a reliable high speed transport library
 | 
						|
which provides support to establish optimal number of connections
 | 
						|
between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
 | 
						|
transport. It is optimized to transfer (read/write) IO blocks.
 | 
						|
 | 
						|
In its core interface it follows the BIO semantics of providing the
 | 
						|
possibility to either write data from an sg list to the remote side
 | 
						|
or to request ("read") data transfer from the remote side into a given
 | 
						|
sg list.
 | 
						|
 | 
						|
RTRS provides I/O fail-over and load-balancing capabilities by using
 | 
						|
multipath I/O (see "add_path" and "mp_policy" configuration entries in
 | 
						|
Documentation/ABI/testing/sysfs-class-rtrs-client).
 | 
						|
 | 
						|
RTRS is used by the RNBD (RDMA Network Block Device) modules.
 | 
						|
 | 
						|
==================
 | 
						|
Transport protocol
 | 
						|
==================
 | 
						|
 | 
						|
Overview
 | 
						|
--------
 | 
						|
An established connection between a client and a server is called rtrs
 | 
						|
session. A session is associated with a set of memory chunks reserved on the
 | 
						|
server side for a given client for rdma transfer. A session
 | 
						|
consists of multiple paths, each representing a separate physical link
 | 
						|
between client and server. Those are used for load balancing and failover.
 | 
						|
Each path consists of as many connections (QPs) as there are cpus on
 | 
						|
the client.
 | 
						|
 | 
						|
When processing an incoming write or read request, rtrs client uses memory
 | 
						|
chunks reserved for him on the server side. Their number, size and addresses
 | 
						|
need to be exchanged between client and server during the connection
 | 
						|
establishment phase. Apart from the memory related information client needs to
 | 
						|
inform the server about the session name and identify each path and connection
 | 
						|
individually.
 | 
						|
 | 
						|
On an established session client sends to server write or read messages.
 | 
						|
Server uses immediate field to tell the client which request is being
 | 
						|
acknowledged and for errno. Client uses immediate field to tell the server
 | 
						|
which of the memory chunks has been accessed and at which offset the message
 | 
						|
can be found.
 | 
						|
 | 
						|
Module parameter always_invalidate is introduced for the security problem
 | 
						|
discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
 | 
						|
invalidate each rdma buffer before we hand it over to RNBD server and
 | 
						|
then pass it to the block layer. A new rkey is generated and registered for the
 | 
						|
buffer after it returns back from the block layer and RNBD server.
 | 
						|
The new rkey is sent back to the client along with the IO result.
 | 
						|
The procedure is the default behaviour of the driver. This invalidation and
 | 
						|
registration on each IO causes performance drop of up to 20%. A user of the
 | 
						|
driver may choose to load the modules with this mechanism switched off
 | 
						|
(always_invalidate=N), if he understands and can take the risk of a malicious
 | 
						|
client being able to corrupt memory of a server it is connected to. This might
 | 
						|
be a reasonable option in a scenario where all the clients and all the servers
 | 
						|
are located within a secure datacenter.
 | 
						|
 | 
						|
 | 
						|
Connection establishment
 | 
						|
------------------------
 | 
						|
 | 
						|
1. Client starts establishing connections belonging to a path of a session one
 | 
						|
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
 | 
						|
Those include uuid of the session and uuid of the path to be
 | 
						|
established. They are used by the server to find a persisting session/path or
 | 
						|
to create a new one when necessary. The message also contains the protocol
 | 
						|
version and magic for compatibility, total number of connections per session
 | 
						|
(as many as cpus on the client), the id of the current connection and
 | 
						|
the reconnect counter, which is used to resolve the situations where
 | 
						|
client is trying to reconnect a path, while server is still destroying the old
 | 
						|
one.
 | 
						|
 | 
						|
2. Server accepts the connection requests one by one and attaches
 | 
						|
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and
 | 
						|
protocol version, the messages include error code, queue depth supported by
 | 
						|
the server (number of memory chunks which are going to be allocated for that
 | 
						|
session) and the maximum size of one io, RTRS_MSG_NEW_RKEY_F flags is set
 | 
						|
when always_invalidate=Y.
 | 
						|
 | 
						|
3. After all connections of a path are established client sends to server the
 | 
						|
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
 | 
						|
requests the address information from the server.
 | 
						|
 | 
						|
4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
 | 
						|
which contains the addresses and keys of the RDMA buffers allocated for that
 | 
						|
session.
 | 
						|
 | 
						|
5. Session becomes connected after all paths to be established are connected
 | 
						|
(i.e. steps 1-4 finished for all paths requested for a session)
 | 
						|
 | 
						|
6. Server and client exchange periodically heartbeat messages (empty rdma
 | 
						|
messages with an immediate field) which are used to detect a crash on remote
 | 
						|
side or network outage in an absence of IO.
 | 
						|
 | 
						|
7. On any RDMA related error or in the case of a heartbeat timeout, the
 | 
						|
corresponding path is disconnected, all the inflight IO are failed over to a
 | 
						|
healthy path, if any, and the reconnect mechanism is triggered.
 | 
						|
 | 
						|
CLT                                     SRV
 | 
						|
*for each connection belonging to a path and for each path:
 | 
						|
RTRS_MSG_CON_REQ  ------------------->
 | 
						|
                   <------------------- RTRS_MSG_CON_RSP
 | 
						|
...
 | 
						|
*after all connections are established:
 | 
						|
RTRS_MSG_INFO_REQ ------------------->
 | 
						|
                   <------------------- RTRS_MSG_INFO_RSP
 | 
						|
*heartbeat is started from both sides:
 | 
						|
                   -------------------> [RTRS_HB_MSG_IMM]
 | 
						|
[RTRS_HB_MSG_ACK] <-------------------
 | 
						|
[RTRS_HB_MSG_IMM] <-------------------
 | 
						|
                   -------------------> [RTRS_HB_MSG_ACK]
 | 
						|
 | 
						|
IO path
 | 
						|
-------
 | 
						|
 | 
						|
* Write (always_invalidate=N) *
 | 
						|
 | 
						|
1. When processing a write request client selects one of the memory chunks
 | 
						|
on the server side and rdma writes there the user data, user header and the
 | 
						|
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
 | 
						|
contains size of the user header. The client tells the server which chunk has
 | 
						|
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
 | 
						|
using the IMM field.
 | 
						|
 | 
						|
2. When confirming a write request server sends an "empty" rdma message with
 | 
						|
an immediate field. The 32 bit field is used to specify the outstanding
 | 
						|
inflight IO and for the error code.
 | 
						|
 | 
						|
CLT                                                          SRV
 | 
						|
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
 | 
						|
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
 | 
						|
 | 
						|
* Write (always_invalidate=Y) *
 | 
						|
 | 
						|
1. When processing a write request client selects one of the memory chunks
 | 
						|
on the server side and rdma writes there the user data, user header and the
 | 
						|
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
 | 
						|
contains size of the user header. The client tells the server which chunk has
 | 
						|
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
 | 
						|
using the IMM field, Server invalidate rkey associated to the memory chunks
 | 
						|
first, when it finishes, pass the IO to RNBD server module.
 | 
						|
 | 
						|
2. When confirming a write request server sends an "empty" rdma message with
 | 
						|
an immediate field. The 32 bit field is used to specify the outstanding
 | 
						|
inflight IO and for the error code. The new rkey is sent back using
 | 
						|
SEND_WITH_IMM WR, client When it recived new rkey message, it validates
 | 
						|
the message and finished IO after update rkey for the rbuffer, then post
 | 
						|
back the recv buffer for later use.
 | 
						|
 | 
						|
CLT                                                          SRV
 | 
						|
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
 | 
						|
[RTRS_MSG_RKEY_RSP]                     <----------------- (RTRS_MSG_RKEY_RSP)
 | 
						|
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
 | 
						|
 | 
						|
 | 
						|
* Read (always_invalidate=N)*
 | 
						|
 | 
						|
1. When processing a read request client selects one of the memory chunks
 | 
						|
on the server side and rdma writes there the user header and the
 | 
						|
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
 | 
						|
the user header, flags (specifying if memory invalidation is necessary) and the
 | 
						|
list of addresses along with keys for the data to be read into.
 | 
						|
 | 
						|
2. When confirming a read request server transfers the requested data first,
 | 
						|
attaches an invalidation message if requested and finally an "empty" rdma
 | 
						|
message with an immediate field. The 32 bit field is used to specify the
 | 
						|
outstanding inflight IO and the error code.
 | 
						|
 | 
						|
CLT                                           SRV
 | 
						|
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
 | 
						|
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
 | 
						|
or in case client requested invalidation:
 | 
						|
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
 | 
						|
 | 
						|
* Read (always_invalidate=Y)*
 | 
						|
 | 
						|
1. When processing a read request client selects one of the memory chunks
 | 
						|
on the server side and rdma writes there the user header and the
 | 
						|
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
 | 
						|
the user header, flags (specifying if memory invalidation is necessary) and the
 | 
						|
list of addresses along with keys for the data to be read into.
 | 
						|
Server invalidate rkey associated to the memory chunks first, when it finishes,
 | 
						|
passes the IO to RNBD server module.
 | 
						|
 | 
						|
2. When confirming a read request server transfers the requested data first,
 | 
						|
attaches an invalidation message if requested and finally an "empty" rdma
 | 
						|
message with an immediate field. The 32 bit field is used to specify the
 | 
						|
outstanding inflight IO and the error code. The new rkey is sent back using
 | 
						|
SEND_WITH_IMM WR, client When it recived new rkey message, it validates
 | 
						|
the message and finished IO after update rkey for the rbuffer, then post
 | 
						|
back the recv buffer for later use.
 | 
						|
 | 
						|
CLT                                           SRV
 | 
						|
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
 | 
						|
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
 | 
						|
[RTRS_MSG_RKEY_RSP]	     <----------------- (RTRS_MSG_RKEY_RSP)
 | 
						|
or in case client requested invalidation:
 | 
						|
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
 | 
						|
=========================================
 | 
						|
Contributors List(in alphabetical order)
 | 
						|
=========================================
 | 
						|
Danil Kipnis <danil.kipnis@profitbricks.com>
 | 
						|
Fabian Holler <mail@fholler.de>
 | 
						|
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
 | 
						|
Jack Wang <jinpu.wang@profitbricks.com>
 | 
						|
Kleber Souza <kleber.souza@profitbricks.com>
 | 
						|
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
 | 
						|
Milind Dumbare <Milind.dumbare@gmail.com>
 | 
						|
Roman Penyaev <roman.penyaev@profitbricks.com>
 |