SMART is a framework to deal with the complexity of building scalable applications for memory disaggregation, which integrated with the following techniques to resolve RDMA-related performance issues:
- Thread-aware resource allocation,
- Adaptive work request throttling, and
- Conflict avoidance technique.
To ease the programming, SMART provides a set of coroutine-based asynchronous APIs that are almost identical to the original RDMA verbs. We also design adaptive methods for adjusting configuration parameters that are important to the performance.
For more technical details, please refer to our paper:
Feng Ren, Mingxing Zhang, Kang Chen, Huaxia Xia, Zuoning Chen, Yongwei Wu. Scaling Up Memory Disaggregated Applications with SMART. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '24).
- SMART’s library (including techniques mentioned in this paper,
include/
andsmart/
); - Synthesis workloads (
test/
) - Applications: SMART-HT (derived from RACE,
hashtable/
), SMART-BT (derived from Sherman,btree/
) and SMART-DTX (derived from FORD,dtx/
). - Reproduce scripts (
ae/
)
-
Mellanox InfiniBand RNIC (ConnectX-6 in our testbed). Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) v5.3-1.0.0.1. Other OFED versions are under testing.
-
Optane DC PMEM (optional), configured as DEVDAX mode and mounted to
/dev/dax1.0
. Optane DC PMEM is used for evaulating distributed persistent transactions (dtx/
) in our paper. However, it is okay to perform these tests using DRAM only. We provide an optiondev_dax_path
in configuration fileconfig/backend.json
, that specifies the PMEM DEVDAX device (e.g./dev/dax1.0
) to be used. By default, the value ofdev_dax_path
is an empty string, which indicates that only DRAM is used. If PMEM exists, Run commandsudo chmod 777 /dev/dax1.0
that allows unprevileged users to manipulate PMEM data. -
Build Toolchain: GCC = 9.4.0 or newer, CMake = 3.16.3 or newer.
-
Other Software: libnuma, clustershell, Python 3 (with matplotlib, numpy, python-tk)
Evaluation setup in this paper
- CPU: Two-socket Intel Xeon Gold 6240R CPU (96 cores in total)
- Memory: 384GB DDR4 DRAM (2666MHz)
- Persistent Memory: 1.5TB (128GB*12) Intel Optane DC Persistent Memory (1st Gen) with DEVDAX mode
- RNIC: 200Gbps Mellanox ConnectX-6 InfiniBand RNIC. Each RNIC is connected to a 200 Gbps Mellanox InfiniBand switch
- RNIC Driver: Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) v5.x (v5.3-1.0.0.1 in this paper).
Execute the following command in your terminal:
bash ./deps.sh
bash ./build.sh
We provide a patched RDMA core library that unlimits the number of median-latency doorbell registers per RDMA context. However, if the installed version of MLNX_OFED is NOT v5.3-1.0.0.1, you must patch the library by yourself, and replace the build/libmlx5.so
file. The guide is provided here.
We provide a micro benchmark program test/test_rdma
in SMART. It can be used to evaluate the throughput of RDMA READs/WRITEs between two servers. Note that the optimizations of SMART are enabled by default.
- Set the server hostname, access granularity and type in
config/test_rdma.json
:{ "servers": [ "optane06" // the hostname of server side ], "port": 12345, // TCP port for connection establishation "block_size": 8, // access granularity in bytes "dump_file_path": "test_rdma.csv", // path of dump file "type": "read" // can be `read`, `write` or `atomic` }
- Set RDMA device and enabled optimizations in
config/smart_config.json
:{ "infiniband": { "name": "", // `mlx5_0` by default "port": 1, "gid_idx": 1 // Some RoCE RNICs needs to change it accordingly }, // available optimizations "use_thread_aware_alloc": true, "thread_aware_alloc": { "total_uuar": 100, // >= shared_uuar "shared_uuar": 96, // >= thread count "shared_cq": true }, "use_work_req_throt": true, "use_conflict_avoidance": true, "use_speculative_lookup": true, }
In server side (optane06
in the sample), run the following command:
cd /path/to/smart && cd build
LD_PRELOAD=libmlx5.so ./test/test_rdma
In another machine, run the following command:
cd /path/to/smart && cd build
LD_PRELOAD=libmlx5.so ./test/test_rdma \
[nr_thread] \
[outstanding_work_request_per_thread]
For example,
LD_PRELOAD=libmlx5.so ./test/test_rdma 96 8
After execution, the client displays the following information to stdout.
rdma-read: #threads=96, #depth=8, #block\_size=8, BW=848.217 MB/s, IOPS=111.177 M/s, conn establish time=1245.924 ms
It also append a line to test_rdma.csv
(specified by "dump_file_path"
):
rdma-read, 96, 8, 8, 848.217, 111.177, 1245.924
For the sake of illustration, we present the functional evaluation method based on the following cluster.
- Memory Blades: hostname
optane06
andoptane07
, with TCP port12345
, - Compute Blades: hostname
optane04
SMART and its applications rely on multiple configuration files. By default, they are located in the config/
subdirectory. You can use other files using environment variables SMART_CONFIG_PATH
and APP_CONFIG_PATH
.
Options of the SMART framework, which are appliable to all kinds of servers. See here in detail.
Options of backend servers running in memory blades.
{
"dev_dax_path": "", // empty string for DRAM, /dev/daxX.Y for NVM
"capacity": 16000, // amount of memory to use (MiB)
"tcp_port": 12345, // Listen TCP port
"nic_numa_node": 1 // Prefer bind socket
}
Options of hashtable/B+tree clients running in the compute blades. Use memory_servers
to specify hostnames and ports of memory blades to be connected. If you change the topology of the cluster, modify it accordingly.
{
"dataset": "ycsb-a", // see "include/util/ycsb.h" for available datasets
"dump_file_path": "datastructure.csv",
"insert_before_execution": true, // insert all keys before performing tests
"max_key": 100000000, // key range [0, max_key)
"key_length": 8, // key length in bytes
"value_length": 8, // value length in bytes
"rehash_key": false, // key/value randomly inserted
"duration": 15, // test duration in seconds
"zipfian_const": 0.99, // zipfian parameter
"memory_servers": [ // memory server hostnames and ports
{
"hostname": "optane06",
"port": 12345
},
{
"hostname": "optane07",
"port": 12345
}
]
}
Options of transaction processing clients running in the compute blades. Use memory_servers
to specify hostnames and ports of memory blades to be connected. If you change the topology of the cluster, modify it accordingly.
{
"memory_servers": [
{
"hostname": "optane06",
"port": 12345
},
{
"hostname": "optane07",
"port": 12345
}
],
"tpcc": { ... },
"smallbank": { ... },
"tatp": { ... },
"nr_transactions": 60000000,
"dump_file_path": "dtx.csv"
}
For each memory blade (optane06
and optane07
in our example), run one of the following programs:
# working path should be build/
# start hash table backend
LD_PRELOAD=libmlx5.so ./hashtable/hashtable_backend
# start B+Tree backend
LD_PRELOAD=libmlx5.so ./btree/btree_backend
# start SmallBank backend
LD_PRELOAD=libmlx5.so ./dtx/smallbank/smallbank_backend
# start TATP backend
LD_PRELOAD=libmlx5.so ./dtx/tatp/tatp_backend
In compute blades (optane04
in our example), run one of the benchmark program using the following command. It must match the backend started in the previous step.
# working path should be build/
# start hash table bench
# `coro` means `coroutine`
LD_PRELOAD=libmlx5.so ./hashtable/hashtable_bench [nr_thread] [nr_coro]
# start B+Tree bench
LD_PRELOAD=libmlx5.so ./btree/btree_bench [nr_thread] [nr_coro]
# start SmallBank bench
LD_PRELOAD=libmlx5.so ./dtx/smallbank/smallbank_bench [nr_thread] [nr_coro]
# start TATP bench
LD_PRELOAD=libmlx5.so ./dtx/tatp/tatp_bench [nr_thread] [nr_coro]
After execution, the benchmark program displays the throughput and latency to stdout. It also adds a line to the file specified by dump_file_path
.
We also provide scripts to reproduce all experiments in Section 3 and Section 6. In Section 3, there are 3 experiments (Figures 3, 4 and 5), each of them points out one of the scalability bottlenecks. In Section 6, there are 9 experiments (Figures 7–14, and Table 1) that compare SMART-refactorized applications (i.e. SMART-HT, SMART-BT and SMART-DTX) with the state-of-the-art disaggregated systems (i.e. RACE, Sherman and FORD) respectively.
Details are aviilable in ae/README.md
.
For any questions, please contact us at renfeng.chn AT outlook.com
.