Nomad

Intro & Background

Multi-Tier Memory Management

Multi-tier memory: latency, capacity, power, and cost.

Central to tiered memory management is page management within operating systems (OS), including page allocation, placement, and migration.

Emphasizes that OS-managed tiering is the core of multi-tier memory management. Compared with dedicated hardware architectures it is more flexible (e.g., K-V-Separated Storage) while remaining transparent to user programs.

Assumption

A traditional memory hierarchy is composed of storage media whose performance differs by more than an order of magnitude. Between DRAM and disk, latency, bandwidth, and capacity all differ by 2–3 orders of magnitude, so the page management system only has to focus on moving hot pages into DRAM. Emerging memory media (e.g., Optane PMem, CXL memory), however, have narrowed the performance gap with DRAM to a few multiples. The old page-management assumptions may therefore no longer hold: if the migration cost is too high, promoting hot pages to the fast tier is no longer beneficial.
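
A rough back-of-the-envelope way to state this (my framing, not the paper's): promotion only pays off when the latency saved on future accesses exceeds the one-time migration cost,

    N_future × (t_slow − t_fast) > C_migrate

With DRAM vs. disk, t_slow − t_fast is on the order of microseconds to milliseconds, so almost any reuse justifies migration; with PMem/CXL it shrinks to a few hundred nanoseconds, so either the page must stay hot for much longer or C_migrate must be made far cheaper, which is the direction the paper takes.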

Structural difference: traditional storage media can only be accessed as block devices, whereas emerging memory devices are byte-addressable. ( Miya: we verified this in the article on Optane. )

As discussed before, this brings benefits: for example, PMem's DAX support allows storage to be mapped in place without going through the kernel page cache. For the CPU, it means the device can be accessed directly with LOAD and STORE instructions.
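
A minimal sketch of what this looks like from userspace, assuming a DAX-mounted filesystem at /mnt/pmem (the path, and MAP_SYNC support in the kernel/glibc, are assumptions):

```c
/* Minimal sketch: map a file on a DAX-mounted (PMem-aware) filesystem and
 * access it with ordinary loads/stores, bypassing the kernel page cache.
 * The path /mnt/pmem/demo is an assumption; MAP_SYNC needs a DAX-capable
 * fs (e.g., ext4/xfs mounted with -o dax) and a recent kernel/glibc. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/mnt/pmem/demo", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED_VALIDATE | MAP_SYNC asks for a true DAX mapping and
     * fails if the filesystem cannot guarantee it. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_SYNC)"); return 1; }

    strcpy(p, "hello, dax");   /* plain STORE instructions, no read()/write() */
    printf("%s\n", p);         /* plain LOAD instructions */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```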

Most importantly, while the performance of tiered memory remains hierarchical, the hardware is no longer hierarchical. Both the Optane PMem and CXL memory appear to the processor as a CPUless memory node and thus can be used by the OS as ordinary DRAM.

One caveat: when PMem is considered as a memory tier here, it is likewise treated as a CPU-less NUMA node, and its persistence is not taken into account; both CXL and NVM act purely as the capacity tier. In Optane's other modes, whether fsdax (a block device exposing DAX through a PMem-aware filesystem) or devdax (a character device exposing DAX directly on the raw device), the OS does not treat it as memory, and Optane allocation is not transparent to applications.
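
For the NUMA-node mode, a minimal sketch of handing the capacity tier out like ordinary DRAM via libnuma (node id 1 is an assumption; check `numactl -H`):

```c
/* Sketch: once PMem/CXL memory is onlined as a CPU-less NUMA node (e.g., via
 * the dax_kmem driver), the OS hands it out like ordinary DRAM.
 * Build with: gcc demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    size_t len = 64UL << 20;                 /* 64 MiB */
    void *buf = numa_alloc_onnode(len, 1);   /* bind to the capacity-tier node */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0xab, len);                  /* first touch: pages come from node 1 */
    printf("allocated %zu bytes on node 1\n", len);
    numa_free(buf, len);
    return 0;
}
```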

Prior Works

Nimble.

  • Transparent Huge Pages (THP)
  • Multi-threaded migration / concurrent migration ( Miya: what is the difference? To be investigated )

TPP.

  • Improved NUMA balancing
  • async demotion / sync promotion

Memtis & TMTS.

  • PEBS
  • A background kthread periodically performs promotion

Limitation

  • Exclusiveness (mutual exclusion between tiers; exclusive). This leads to:

    Exclusive memory tiering inevitably leads to excessive hot-cold page swapping or memory thrashing when the performance tier is not large enough to hold hot data.

    The paper attributes the repeated back-and-forth migration (thrashing) that occurs when the fast tier is full to this exclusiveness.

  • Lack of an effective page migration mechanism to support tiered-memory management. Migration involves updating page table entries (old and new) as well as copying data, and this is costly whether or not it is done asynchronously. Under memory thrashing, user-perceived bandwidth can drop by as much as 95% below peak memory bandwidth.

Contribution

※ Assume that page migrations only occur between two adjacent tiers if there are more than two memory tiers.

This paper advocates non-exclusive memory tiering that allows a subset of pages on the performance tier to have shadow copies on the capacity tier. Note that non-exclusive tiering is different from inclusive tiering which strictly uses the performance tier as a cache of the capacity tier. The most important benefit is that under memory pressure, page demotion is made less expensive by simply remapping a page if it is not dirty and its shadow copy exists on the capacity tier. This allows for smooth performance transition when memory demand exceeds the capacity of the performance tier.
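
A toy model (my reading of the idea, not the paper's actual kernel code) of why demotion gets cheap under non-exclusive tiering; all names below are illustrative:

```c
/* Toy model (not kernel code) of the demotion fast path under non-exclusive
 * tiering: a clean page that already has a shadow copy on the capacity tier
 * can be demoted by remapping alone, with no data copy. */
#include <stdbool.h>
#include <stdio.h>

struct page {
    bool dirty;        /* modified since the shadow copy was made? */
    bool has_shadow;   /* a copy already exists on the capacity tier */
};

static void remap_to_shadow(struct page *p) {
    /* In the real system: point the PTE at the capacity-tier copy and shoot
     * down stale TLB entries. No page copy. */
    (void)p;
    printf("fast demotion: remap only\n");
}

static void copy_and_demote(struct page *p) {
    /* Conventional demotion: unmap, copy the page down, remap. */
    printf("slow demotion: full page copy\n");
    p->has_shadow = true;
    p->dirty = false;
}

static void demote(struct page *p) {
    if (!p->dirty && p->has_shadow)
        remap_to_shadow(p);
    else
        copy_and_demote(p);
}

int main(void) {
    struct page clean = { .dirty = false, .has_shadow = true };
    struct page dirty = { .dirty = true,  .has_shadow = true };
    demote(&clean);   /* takes the remap-only fast path */
    demote(&dirty);   /* shadow copy is stale, full copy needed */
    return 0;
}
```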

Proposes transactional page migration, which allows the page to remain accessible while it is being migrated.
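
A toy model of the transactional idea as I understand it: copy the page while it stays mapped, then validate against the dirty state and retry if it was written mid-copy. The retry policy and all names are my own illustration, not the paper's implementation:

```c
/* Toy model of transactional page migration: copy the page while it stays
 * mapped and accessible, then check whether it was written during the copy;
 * if so, the "transaction" aborts and is retried. Single-threaded stand-in:
 * the dirty flag models the hardware dirty bit a concurrent writer would set. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   4096
#define MAX_RETRIES 3

struct page {
    char data[PAGE_SIZE];
    bool dirty;
};

static bool try_migrate(struct page *src, struct page *dst) {
    src->dirty = false;                        /* start of the "transaction" */
    memcpy(dst->data, src->data, PAGE_SIZE);   /* copy while src stays mapped */
    if (src->dirty)                            /* written during the copy? */
        return false;                          /* abort: dst would be stale */
    /* Commit: atomically switch the PTE to dst, then shoot down TLBs. */
    return true;
}

static bool migrate_transactional(struct page *src, struct page *dst) {
    for (int i = 0; i < MAX_RETRIES; i++)
        if (try_migrate(src, dst))
            return true;
    return false;   /* give up; fall back to conventional unmap-and-copy */
}

int main(void) {
    struct page src = {0}, dst = {0};
    strcpy(src.data, "hot page contents");
    printf("migrated: %s\n", migrate_transactional(&src, &dst) ? "yes" : "no");
    return 0;
}
```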

Summary

Overall this looks similar to HYBRID2, allowing both cached and tiered memory organizations at the same time.

Motivation

#1

Cache & Tiering. Splits memory into a performance tier and a capacity tier.

CXL.

More recently, compute express link (CXL), an open-standard interconnect technology based on PCI Express (PCIe), provides a memory-like, byte-addressable interface (i.e., via the CXL.mem protocol) for connecting diverse memory devices (e.g., DRAM, PM, GPUs, and smartNICs). Real-world CXL memory offers comparable memory access latency (<2x) and throughput (∼50%) to ordinary DRAM.

#2

Page-table-based access tracking

  • Page-fault-based: mark PTEs as no-access so that every reference traps into the kernel. Precise tracking, but very expensive ( need to revisit exactly how this works )
  • PTE scanning: periodically scan the PTE accessed (referenced) bits; see the sketch after this list
  • PEBS: event-based sampling with hardware performance counters
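
A minimal userspace approximation of accessed-bit scanning via Linux idle page tracking (requires root, CONFIG_IDLE_PAGE_TRACKING, and PFN access in /proc/self/pagemap; in-kernel tiering systems do the equivalent page-table/rmap walk themselves):

```c
/* Userspace approximation of accessed-bit scanning via Linux idle page
 * tracking. Needs root, CONFIG_IDLE_PAGE_TRACKING, and PFNs visible in
 * /proc/self/pagemap. Minimal error handling, one page only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint64_t vaddr_to_pfn(void *addr) {
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0, vpn = (uint64_t)addr / 4096;
    pread(fd, &entry, 8, vpn * 8);
    close(fd);
    /* bit 63 = present, bits 0-54 = PFN (zeroed without CAP_SYS_ADMIN) */
    return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
}

int main(void) {
    char *page = aligned_alloc(4096, 4096);
    *(volatile char *)page = 1;              /* fault the page in so it has a PFN */

    uint64_t pfn = vaddr_to_pfn(page);
    if (!pfn) { fprintf(stderr, "no PFN (need root?)\n"); return 1; }

    int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
    if (fd < 0) { perror("page_idle bitmap"); return 1; }

    /* 1. Mark the page idle; this also clears its PTE accessed bit. */
    uint64_t word = 1ULL << (pfn % 64);
    pwrite(fd, &word, 8, (pfn / 64) * 8);

    *(volatile char *)page = 2;              /* 2. the "workload" touches the page */

    /* 3. Read the bitmap back: the kernel rechecks the accessed bits, so a
     *    cleared idle bit means the page was referenced in the interval. */
    pread(fd, &word, 8, (pfn / 64) * 8);
    printf("page is %s\n", (word & (1ULL << (pfn % 64))) ? "idle" : "hot (accessed)");

    close(fd);
    free(page);
    return 0;
}
```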

Linux lazy PT scanning

Hardware Support

  • PEBS

( Miya: again, this is about the trade-off between sampling overhead and accuracy. )
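
For reference, a sketch of how PEBS-style access sampling is typically set up from userspace via perf_event_open; the raw event code 0x81d0 (MEM_INST_RETIRED.ALL_LOADS) is an Intel-specific assumption, and sample parsing is reduced to counting ring-buffer bytes:

```c
/* Sketch of PEBS-based access sampling via perf_event_open: a precise
 * (precise_ip > 0) load event sampled with PERF_SAMPLE_ADDR yields the data
 * addresses of sampled loads, which Memtis/TMTS-style designs aggregate into
 * per-page hotness. Needs a permissive perf_event_paranoid setting or root. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x81d0;          /* assumed: MEM_INST_RETIRED.ALL_LOADS */
    attr.sample_period = 10007;    /* roughly one sample per ~10k loads */
    attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;
    attr.precise_ip = 2;           /* request PEBS (precise sampling) */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */,
                     -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* Ring buffer: 1 metadata page + 2^4 data pages. A real profiler would
     * walk the sample records here and bump per-page access counters. */
    size_t len = (1 + 16) * 4096;
    void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }
    struct perf_event_mmap_page *meta = ring;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Generate some loads so samples arrive. */
    static volatile uint64_t buf[1 << 20];
    uint64_t sum = 0;
    for (int r = 0; r < 64; r++)
        for (size_t i = 0; i < (1 << 20); i++)
            sum += buf[i];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    printf("sum=%llu, %llu bytes of sample records collected\n",
           (unsigned long long)sum, (unsigned long long)meta->data_head);

    munmap(ring, len);
    close(fd);
    return 0;
}
```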

#3

The complete page migration process looks like this (a userspace-triggered example follows the list):

  • The system must trap to the kernel (e.g., via page faults) to handle migration;
  • The PTE of a migrating page must be locked to prevent others from accessing the page during migration and be unmapped from the page table;
  • A translation lookaside buffer (TLB) shootdown must be issued to each processor (via inter-processor interrupts (IPIs)) that may have cached copies of the stale PTE;
  • The content of the page is copied between tiers;
  • Finally, the PTE must be remapped to point to the new location.
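
The same machinery is reachable from userspace through move_pages(2), which migrates specific pages between NUMA nodes and internally performs exactly the steps above (target node 1 is an assumption):

```c
/* Userspace view of page migration: move_pages(2) asks the kernel to migrate
 * specific pages between NUMA nodes (here, "demoting" one page to an assumed
 * capacity-tier node 1); internally the kernel does the unmap / TLB shootdown
 * / copy / remap steps listed above. Build with: gcc demo.c -lnuma */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    void *page = aligned_alloc(4096, 4096);
    memset(page, 0x5a, 4096);        /* touch it so it is resident somewhere */

    void *pages[1]  = { page };
    int   nodes[1]  = { 1 };         /* assumed capacity-tier node id */
    int   status[1] = { -1 };        /* per-page result: node id or -errno */

    long rc = move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0) { perror("move_pages"); return 1; }
    printf("page now reported on node %d\n", status[0]);

    free(page);
    return 0;
}
```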

Another issue mentioned here: under TPP, a synchronous page promotion can generate up to 15 page faults. §3.1 gives the explanation: in Linux, memory tracking promotes pages from the inactive LRU to the active LRU in batches of 15 requests; because promotion is synchronous, the page movement and the page fault are handled on the same core, and those 15 requests may all be duplicate requests for the same page, causing up to 15 page faults.

( This mechanism needs a closer look. )

Design