§1 Resource Preparation

ASPLOS’23

Repo

Setting tiered memory systems with Intel DCPMM

Reconfigures a namespace with devdax mode.

sudo ndctl create-namespace -f -e namespace0.0 --mode=devdax
  • -f: Allow the operation to continue on enabled namespaces
  • -e: Reconfigure existing namespace configuration
  • --mode: Define the namespace mode - fsdax, devdax, sector, and raw

Reconfigures a dax device with system-ram mode (KMEM DAX).

sudo daxctl reconfigure-device dax0.0 --mode=system-ram
  • --mode: system-ram or devdax

NDCTL User Guide

The mode is the most important feature to get correct.

  • fsdax: Filesystem-DAX mode is the default mode of a namespace when specifying ndctl create-namespace with no options. It creates a block device /dev/pmemX[.Y] that supports the DAX capabilities of Linux filesystems (xfs and ext4 to date). (Is DAX a filesystem feature?)
    $ ls -li /dev | grep pmem
    836 brw-rw---- 1 root disk 259, 1 Aug 5 20:19 pmem0 # [b] means this is a blk device under fsdax mode.
    835 brw-rw---- 1 root disk 259, 0 Aug 5 20:19 pmem1
    DAX removes the page cache from the I/O path and allows mmap(2) to establish direct mappings to persistent memory media. The DAX capability enables workloads / working-sets that would exceed the capacity of the page cache to scale up to the capacity of persistent memory. Workloads that fit in page cache or perform bulk data transfers may not see benefit from DAX. When in doubt, pick this mode.
  • devdax: Device-DAX mode enables similar mmap(2) DAX mapping capabilities as Filesystem-DAX. However, instead of a block-device that can support a DAX-enabled filesystem, this mode emits a single character device file /dev/daxX.Y. Use this mode to assign persistent memory to a virtual-machine (like k8s device plugin?), register persistent memory for RDMA, or when gigantic mappings are needed.
  • raw: Raw mode is effectively just a memory disk that does not support DAX. Typically this indicates a namespace that was created by tooling or another operating system that did not know how to create a Linux fsdax or devdax mode namespace. This mode is compatible with other operating systems, but again, does not support DAX operation.

For more detailed explanations, please refer to pmem.io/UtilLibs/ndctl.

Memtis requires devdax, i.e., the NVDIMM is managed as a character device.

cgroup v2

“cgroup” stands for “control group” and is never capitalized.

cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner.

cgroups form a tree structure and every process in the system belongs to one and only one cgroup. All threads of a process belong to the same cgroup. On creation, all processes are put in the cgroup that the parent process belongs to at the time. A process can be migrated to another cgroup. Migration of a process doesn’t affect already existing descendant processes.
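
The "one and only one cgroup" rule is visible in /proc/<pid>/cgroup, where the cgroup v2 entry appears as a single line of the form 0::<path>. A minimal sketch that extracts that path; the function name cgroup_v2_path is hypothetical and takes the file to read as its argument:

```shell
# cgroup_v2_path FILE: print the cgroup v2 path recorded in a
# /proc/<pid>/cgroup-style file. The v2 entry has hierarchy ID 0 and an
# empty controller list, so its line looks like "0::/some/path".
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}

# Typical use on a live system (not run here):
#   cgroup_v2_path /proc/self/cgroup
```

On a pure cgroup v2 system the file has exactly one line, so the function prints exactly one path per process.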

For more info about Linux Kernel, go to https://docs.kernel.org

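Building the Memtis Kernel

For the build steps alone, skip ahead to the build procedure summarized after the pitfalls below.

Platform Configuration

The lab server runs kernel 6.5, which has to be replaced with the Memtis kernel.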

Linux pve 6.5.11-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-4 (2023-11-20T10:19Z) x86_64 GNU/Linux

Distributor ID: Debian
Description: Debian GNU/Linux 12 (bookworm)
Release: 12
Codename: bookworm

After installing the dependencies, copy the kernel config and set CONFIG_HTMM=y as Memtis requires.

Pitfalls

First build: make deb-pkg -j$(nproc) failed with:

cc1: all warnings being treated as errors

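A quick search shows that this error is common; it means -Werror is enabled. The offending warnings in the output are mostly use-after-free. I tried the following, in order: [1] [2]

  • Commenting out every -Werror in the top-level (Linux/) Makefile
  • make CFLAGS="-Wno-error=use-after-free", with the same value passed as KBUILD_CFLAGS
  • -w (inhibit all warning messages)

None of them worked. I emailed @LRL52 and got a reply the next day, suggesting that I avoid an overly new toolchain and build in an Ubuntu 20.04 image. That image, however, ships GLIBC 2.31 at most, while the build requires 2.33 or 2.34. [3] In the end Ubuntu 22.04 was able to build it.

Two containers can mount the same host directory with -v, so file changes are shared between them. This makes it possible to edit files in codeserver and build in the Ubuntu container.

Second build failed with: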

make[1]: *** [kernel/Makefile:146: kernel/kheaders_data.tar.xz] Error 127

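A package was missing: apt install cpio. [4] I then reran make -j$(nproc). The parallel build seems to be a problem: make failed somewhere I did not notice, with no error message that I could see. As a result, make modules_install reported the following error: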

sed: can't read modules.order: No such file or directory

The modules had in fact never been built. [5] [6]
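
Since a parallel make can fail without the error being noticed, it may help to check for the expected artifacts before running make modules_install. A minimal sketch, assuming an x86 build tree; the function name check_kernel_build is hypothetical, and the file list is simply the set of files that the errors in this post revolve around:

```shell
# check_kernel_build DIR: return 0 only if the files that the later
# install steps depend on exist in the kernel build tree DIR.
check_kernel_build() {
    dir=$1
    rc=0
    for f in vmlinux arch/x86/boot/bzImage modules.order; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Typical use from the top of the kernel tree (not run here):
#   make -j"$(nproc)" && check_kernel_build . && make modules_install
```

Chaining with && also means modules_install is skipped whenever make itself returns a non-zero status.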

Third build: a single-threaded make ran for more than three hours and then failed with:

  BTFIDS  vmlinux
FAILED: load BTF from vmlinux: Invalid argument
make: *** [Makefile:1189: vmlinux] Error 255
make: *** Deleting file 'vmlinux'

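When building inside a virtual machine, this error can be caused by running out of memory [7], with pahole (a tool that inspects struct layouts) eating all the RAM. The 2.5Gi in use below does look like pahole's footprint, but I had no permission to run dmesg to check, possibly because I was inside Docker over ssh?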

root@542c62015e14:/home/ubuntu/project/memtis-main/linux# free -h
total used free shared buff/cache available
Mem: 251Gi 2.5Gi 229Gi 33Mi 19Gi 247Gi
Swap: 0B 0B 0B

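Since this is Docker rather than a VM, memory is unlikely to be the cause: the host has more than 200 GiB, which is plenty.

The installed pahole is 1.25, so the next attempt is to downgrade it. [8] [9]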

$ apt-cache madison pahole
pahole | 1.25-0ubuntu1~22.04.2 | http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages
pahole | 1.22-8 | http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages
$ apt install pahole=1.22-8
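
To avoid discovering a too-new pahole three hours into a build, the version can be checked up front. A minimal sketch: version_le is a hypothetical helper built on GNU sort -V, the threshold 1.22 is just the version that worked here, and pahole --version is assumed to print something like v1.25:

```shell
# version_le A B: succeed if version A <= version B.
# Relies on the -V (version sort) option of GNU sort.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use (not run here; assumes pahole is installed):
#   v=$(pahole --version | tr -d 'v')
#   version_le "$v" 1.22 || echo "pahole $v may be too new for this tree"
```

Version sort matters because a plain string comparison would rank 1.9 above 1.22.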

Another three-hour make. I will check remotely after work whether it succeeded.

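Fourth build: success!

make modules_install installs the modules, i.e., copies the module files into the root filesystem (under /lib/modules/).

make install installs the kernel. The output excerpts below show what it does.

1. Copy bzImage and System.map to /boot/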

sh ./arch/x86/boot/install.sh 5.15.19-htmm \
arch/x86/boot/bzImage System.map "/boot"

2. Generate the initramfs at /boot/initrd.img-5.15.19-htmm

run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.15.19-htmm /boot/vmlinuz-5.15.19-htmm
update-initramfs: Generating /boot/initrd.img-5.15.19-htmm

3. Update the GRUB configuration, adding entries for the new kernel

Generating grub configuration file ...
...
Found linux image: /boot/vmlinuz-5.15.19-htmm
Found initrd image: /boot/initrd.img-5.15.19-htmm
Found memtest86+ 64bit EFI image: /boot/memtest86+x64.efi
Adding boot menu entry for UEFI Firmware Settings ...
done

/boot/grub/grub.cfg now contains the entries for the Memtis kernel.

Even so, this kernel still would not boot. GRUB failed while loading it:

loading initial ramdisk error
out of memory

The likely cause is an oversized initrd image. [10] Output of du -sh /boot/*:

...
346M /boot/initrd.img-5.15.19-htmm
58M /boot/initrd.img-6.5.11-4-pve # original kernel
...

346M looks too large. Rebuild, and strip the debug information from the modules when installing them.

$ make INSTALL_MOD_STRIP=1 modules_install

Fifth build: /boot/initrd.img is now around 50M.

Build Procedure

  1. Environment
    Host: Debian 12, Linux 6.5.11
    Memtis: Linux 5.15.19
    Build environment: Docker, Ubuntu 22.04
  2. Copy the .config from Debian 12 to memtis/Linux/
    $ cp -v /boot/config-$(uname -r) .config
  3. Install the dependencies [11], plus cpio
    $ sudo apt install git fakeroot build-essential ncurses-dev xz-utils libssl-dev bc flex libelf-dev bison dwarves zstd cpio
  4. Downgrade pahole (1.22-8 worked here)
  5. Edit .config and set CONFIG_SYSTEM_TRUSTED_KEYS=""
  6. The remaining steps are in this script:
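    # Run before every rebuild
    make clean
    make mrproper

    # The build container and the host are different environments, so first
    # run this on the host: cp -v /boot/config-$(uname -r) debian_config.txt
    cp debian_config.txt .config

    # Memtis config
    echo "CONFIG_HTMM=y" >> .config
    grep HTMM .config
    make oldconfig
    After editing .config, run make oldconfig so that the change is validated and saved. From there on, just keep pressing Enter to accept the defaults.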
  7. make
  8. make INSTALL_MOD_STRIP=1 modules_install && make install
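
Appending to .config with echo adds another copy of the line on every rerun. A minimal sketch of an idempotent alternative; set_kconfig is a hypothetical helper, not part of Memtis or the kernel tree (the kernel ships its own scripts/config tool that covers similar ground):

```shell
# set_kconfig FILE KEY VALUE: make FILE contain exactly one "KEY=VALUE" line.
# Any existing assignment or "# KEY is not set" marker is dropped first,
# so rerunning the function does not pile up duplicate lines.
set_kconfig() {
    file=$1 key=$2 value=$3
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # grep -v exits non-zero when nothing is left to print; that is fine here.
    grep -v -e "^$key=" -e "^# $key is not set\$" "$file" > "$tmp" || true
    echo "$key=$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Typical use for steps 5 and 6 (not run here):
#   set_kconfig .config CONFIG_SYSTEM_TRUSTED_KEYS '""'
#   set_kconfig .config CONFIG_HTMM y
```

make oldconfig is still needed afterwards so that Kconfig resolves any dependent options.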

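Uninstalling a manually built and installed kernel on Debian

Optane Configuration

Memtis supports two hardware platforms: Optane (Intel DCPMM) and emulated CXL (with a CPU-less NUMA node).

The following describes the Optane configuration. For the complete procedure, skip to the full configuration script at the end.

ndctl

ndctl is a utility library for managing NVDIMM devices. It works with the libnvdimm subsystem in Linux 4.x and later, and for now it mainly targets PMem. The NVDIMM standard covers several device types, and Optane is currently the only PMem product on the market. PMDK claims to be platform-neutral, but it is maintained by Intel after all, and its maintenance is promised only until the end of the PMem life cycle.

In a tiered memory system, Optane is used mainly as memory. Because of the gap in memory access speed, NUMA is a natural way to express the tiers, and this is what daxctl does: when Optane is configured in system-ram mode, it appears as a third, CPU-less NUMA node.

Prerequisites

The DIMMs should be in AppDirect Mode.

Treating NVM as a NUMA node is supported in kernel 5.1 and later. The following kernel build options must be enabled: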

CONFIG_MEMCG_KMEM=y
CONFIG_DEV_DAX_KMEM=m

I checked, and both options are already enabled.
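
One way to do this check is to grep the kernel config file for each option. A minimal sketch; kconfig_enabled is a hypothetical helper, and the path /boot/config-$(uname -r) in the usage comment assumes the distribution installs the config there:

```shell
# kconfig_enabled FILE KEY: succeed if KEY is built in (=y) or a module (=m).
kconfig_enabled() {
    grep -q -E "^$2=(y|m)\$" "$1"
}

# Typical use (not run here):
#   for opt in CONFIG_MEMCG_KMEM CONFIG_DEV_DAX_KMEM; do
#       kconfig_enabled "/boot/config-$(uname -r)" "$opt" || echo "$opt is off"
#   done
```

An option that is commented out as "# ... is not set", or absent altogether, counts as disabled.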

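Configuring as NUMA Memory

Before Optane is configured as RAM, free -h does not show its capacity. The namespaces are currently in fsdax mode, i.e., exposed as block devices and accessed through a PMem-aware filesystem. The first step is to expose them as character devices, either by reconfiguring the existing namespaces with ndctl or by destroying them and starting over. I took the second route. [12] [13]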

$ ndctl destroy-namespace --force all
$ ndctl create-namespace --mode=devdax

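The server originally had four namespaces: two in fsdax mode, each holding the capacity of four Optane DIMMs, and two empty raw namespaces. They were recreated in devdax mode.

Next, use daxctl to expose the device as RAM.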

If your Kernel supports and defaults to automatically online hotplug memory, you’ll see a message similar to the following:

# daxctl reconfigure-device --mode=system-ram all
dax1.0: error: kernel policy will auto-online memory, aborting
error reconfiguring devices: Device or resource busy
reconfigured 0 devices

Check your Kernel config to validate if it auto onlines hot plugged memory:

# grep ONLINE /boot/config-$(uname -r)
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

# cat /sys/devices/system/memory/auto_online_blocks
online

Temporarily disable the hotplug memory feature:

# echo offline > /sys/devices/system/memory/auto_online_blocks
# cat /sys/devices/system/memory/auto_online_blocks
offline

With automatic onlining of hot-plugged memory disabled, the daxctl reconfiguration succeeds:

[
  {
    "chardev":"dax1.0",
    "size":532708065280,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":248,
    "total_memblocks":248,
    "movable":true
  }
]
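
The size field is in bytes. A quick sanity check is to convert it to GiB and compare it with what free -h and numactl -H report. A minimal sketch; bytes_to_gib is a hypothetical helper that uses integer shell arithmetic and rounds down:

```shell
# bytes_to_gib N: integer GiB (rounded down) for a byte count N.
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

bytes_to_gib 532708065280   # size of dax1.0 from the output above; prints 496
```

496 GiB spread over the 248 online memory blocks works out to 2 GiB per block, which is consistent if this machine uses 2 GiB memory blocks (an assumption; /sys/devices/system/memory/block_size_bytes reports the actual value).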

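Checking memory now shows the Optane capacity. numactl -H shows Optane on a new NUMA node of its own, because each node should contain homogeneous memory and must not mix DRAM and DCPMM.

[Optional] Turn off swap: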

swapoff -a

Memtis does not state explicitly whether page migration should be disabled, but its benchmark scripts turn off NUMA balancing.

[?] echo 0 > /sys/kernel/mm/numa/demotion_enabled
[√] echo 0 > /proc/sys/kernel/numa_balancing

After a reboot, the daxctl mode reverts to devdax.

Full Configuration

#!/bin/bash

echo offline > /sys/devices/system/memory/auto_online_blocks
cat /sys/devices/system/memory/auto_online_blocks
# offline

# check pmem
ndctl list
daxctl list

# check numa
numactl -H

# configure pmem as numa node
daxctl reconfigure-device --mode=system-ram all
numactl -H

  1. linux - How to suppress all warnings being treated as errors for format-truncation - Unix & Linux Stack Exchange

  2. Warning Options (Using the GNU Compiler Collection (GCC))

  3. glibc - libc.so.6: version `GLIBC_2.14' not found - Ask Ubuntu

  4. [ERROR] [SOLVED] I cannot install the latest Linux kernel from source / Newbie Corner / Arch Linux Forums

  5. Gentoo Forums :: View topic - sed: can't read modules.order: No such file or directory

  6. sed - attempting to install new kernel, error modules.order & Makefile Error 2 - Stack Overflow

  7. ubuntu - "FAILED: load BTF from vmlinux: Unknown error -2 make: *** [Makefile:1162: vmlinux] Error 255", while compiling kernel-5.9.1 - Unix & Linux Stack Exchange

  8. kernel - failed: load btf from vmlinux: invalid argument make on CONFIG_DEBUG_INFO_BTF=y - Unix & Linux Stack Exchange

  9. pahole v1.24: FAILED: load BTF from vmlinux: Invalid argument

  10. Out of memory on "Loading initial ramdisk" after kernel upgrade (4.15 to 4.19) on Ubuntu 18 - Unix & Linux Stack Exchange

  11. A very detailed walkthrough: 在Ubuntu上编译安装linux内核详细过程 (compiling and installing the Linux kernel on Ubuntu, step by step) - robotech_erx - 博客园

  12. Linux服务器配置持久内存PM_daxctl (configuring persistent memory on a Linux server with daxctl) - CSDN博客

  13. Using Linux Kernel Memory Tiering