CHAPTER 3

Memory Mapping

Listing 3-1. mmap_example.c

Memory-mapped file on Linux.

…

We first open() the file to get a file descriptor. Then we retrieve the file’s statistics with fstat() so that we can use its length:

if (fstat(fd /* file descriptor */, &stbuf) < 0)
	err(1, "stat %s", argv[1]);

Next, we map the file into the application’s address space so that the program can access its contents as if they were in memory. The second parameter is the length of the file, which asks Linux to map the entire file. We request both read and write access (PROT_READ|PROT_WRITE) and MAP_SHARED, so that our stores are carried through to the file and are visible to other processes that map the same file.

/*
 * Map the file into our address space for read 
 * & write. Use MAP_SHARED so stores are visible 
 * to other programs.
 */
if ((pmaddr = mmap(NULL, stbuf.st_size,
			PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, 0)) == MAP_FAILED)
	err(1, "mmap %s", argv[1]);
	
/* Don't need the fd anymore because the mapping
 * stays around */
close(fd);
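
The open, fstat, mmap, and close steps above can be gathered into one helper. The following is only a sketch: the name map_whole_file and its error convention are assumptions made for illustration and do not come from the listing.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Listing 3-1): open `path`, map the
 * whole file read/write with MAP_SHARED, and return the mapping.
 * On success the mapped length is stored in *len_out. On failure the
 * function returns MAP_FAILED and leaves errno from the failing call.
 */
void *map_whole_file(const char *path, size_t *len_out)
{
	struct stat stbuf;
	void *addr;
	int fd, saved;

	assert(path != NULL && len_out != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	/* fstat() supplies st_size, which becomes the mapping length. */
	if (fstat(fd, &stbuf) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return MAP_FAILED;
	}

	addr = mmap(NULL, (size_t)stbuf.st_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);

	/* The mapping stays valid after close(); keep mmap's errno. */
	saved = errno;
	close(fd);
	errno = saved;

	if (addr != MAP_FAILED)
		*len_out = (size_t)stbuf.st_size;
	return addr;
}
```

A caller would compare the result against MAP_FAILED and later release the mapping with munmap(addr, len).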

Now we can write data into the file by accessing it like byte-addressable memory.

strcpy(pmaddr, "This is new data written to the file");

Finally, we explicitly flush the data back to the storage device. MS_SYNC is set to make msync wait for the writeback to complete before returning.

/*
 * Simplest way to flush is to call msync().
 * The length needs to be rounded up to a 4k page.
*/
if (msync((void *)pmaddr, 4096, MS_SYNC) < 0)
	err(1, "msync");
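
The listing hard-codes 4096 as the flush length. A sketch of computing the rounded length instead is shown below; round_up_to_page and flush_prefix are hypothetical names, and the sketch assumes the mapping starts at a page-aligned address (as mmap() returns) and that len does not exceed the mapped length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to the next multiple of page (page must be a power of two). */
size_t round_up_to_page(size_t len, size_t page)
{
	assert(page != 0 && (page & (page - 1)) == 0);
	return (len + page - 1) & ~(page - 1);
}

/*
 * Hypothetical wrapper: synchronously flush the first `len` bytes of a
 * mapping that begins at `addr`, using the system's page size rather
 * than a hard-coded 4096.
 */
int flush_prefix(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0)
		return -1;
	return msync(addr, round_up_to_page(len, (size_t)page), MS_SYNC);
}
```

With a 4 KiB page, a 37-byte string rounds up to 4096 bytes, which matches the constant used in the listing.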

Figure 3-4. Memory-mapped files with storage


First, a portion of the limited kernel memory page cache in main memory is used to store a copy of the file.

Generated by GPT.

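This description refers to the fact that, when a file is memory mapped, the operating system uses the kernel memory page cache to hold a copy of part of the file.

Specifically, when a file is memory mapped into a process’s address space, the operating system does not immediately load the whole file into the process’s memory. Instead, it keeps part of the file in the kernel page cache, so that when the process accesses a particular part of the file, the operating system can quickly make the required data pages available in the process’s address space without reading from disk every time.

This approach has several benefits:

Fewer disk I/O operations: caching file data in kernel memory pages reduces frequent disk reads and writes and speeds up file access.
Better performance: accessing data in memory is much faster than accessing it on disk, so the application’s access to file data can be significantly faster.
Shared file access: if several processes map the same file, they can share the same physical memory pages, which saves system resources and improves efficiency.

In short, this description states the kernel’s strategy for memory-mapped files: through the page cache, part of the file’s data is kept in memory to improve the efficiency and responsiveness of file access.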

Summary

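Operations on a memory-mapped file still take place in the user-space address space, but the kernel holds some page cache pages for it.
A page fault presupposes a memory mapping (the virtual address is mapped, but the physical page is not yet loaded); only afterward does a page fault occur, which confirms that the page is actually being accessed. For block I/O devices, physical pages pass through the kernel page cache on both LOAD and STORE. The page fault handler first checks whether the page is already cached (that is, the physical page is already in use by another program and a copy already sits in the kernel); if not, it triggers I/O.
(Page faults are also one way to monitor memory accesses; scanning PTEs is another.)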
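
The "mapped virtually, loaded on first access" behavior can be observed from user space. The sketch below is an illustration under assumptions: it uses an anonymous mapping rather than a file mapping to keep the example self-contained, the function name faults_on_first_touch is hypothetical, and the exact fault count depends on the kernel and its configuration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Map `npages` anonymous pages, store one byte into each, and return
 * how many minor page faults the process took while touching them.
 * Returns -1 on error.
 */
long faults_on_first_touch(size_t npages)
{
	struct rusage before, after;
	long page = sysconf(_SC_PAGESIZE);
	volatile char *p;
	size_t i, len;
	int rc;

	if (page <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)page;

	/* mmap() only sets up the virtual range; pages are typically
	 * backed by physical memory on first access. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	rc = getrusage(RUSAGE_SELF, &before);
	for (i = 0; i < npages; i++)
		p[i * (size_t)page] = 1;	/* first touch of each page */
	rc |= getrusage(RUSAGE_SELF, &after);

	munmap((void *)p, len);
	if (rc != 0)
		return -1;

	assert(after.ru_minflt >= before.ru_minflt);
	return after.ru_minflt - before.ru_minflt;
}
```

On a typical Linux system the returned value is expected to be close to npages, since each first store should fault once, but this is not guaranteed.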

**File API vs. Memory Mapping**

Generated by GPT.

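File operation APIs (such as read and write) normally do not call mmap() or msync() directly. They are independent system calls, each with its own function:

read & write: perform reads and writes on a file descriptor; they operate on the file system directly, transferring data through the kernel’s file operation interfaces.
mmap: maps a file or device into the process’s address space so that the file contents can be accessed like memory. mmap() allows part or all of a file to be mapped into memory, and the program can then operate on the file through direct memory accesses.

In summary, read and write operate directly on a file descriptor, whereas mmap() and msync() provide a different mechanism for working with files. They can be combined; for example, after mapping a file with mmap(), memory accesses can replace read and write.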
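
To make the contrast concrete, the sketch below fetches the same leading bytes of a file in both ways. The function names are hypothetical, and the mmap variant assumes the caller passes an n that is greater than zero and no larger than the file size (touching a mapped page entirely beyond the end of the file raises SIGBUS).

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* File API: the kernel copies up to n bytes from the file into buf. */
ssize_t fetch_with_read(const char *path, char *buf, size_t n)
{
	ssize_t got;
	int fd;

	assert(path != NULL && buf != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	got = read(fd, buf, n);	/* may return fewer than n bytes */
	close(fd);
	return got;
}

/* Memory mapping: map the first n bytes, then copy with plain loads. */
ssize_t fetch_with_mmap(const char *path, char *buf, size_t n)
{
	void *p;
	int fd;

	assert(path != NULL && buf != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	p = mmap(NULL, n, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping outlives the descriptor */
	if (p == MAP_FAILED)
		return -1;
	memcpy(buf, p, n);	/* ordinary memory access, no read() call */
	munmap(p, n);
	return (ssize_t)n;
}
```

Both functions should leave identical bytes in buf; the difference is that the second reaches the data through loads on a mapping instead of an explicit read() system call.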

Persistent Memory Direct Access (DAX)

Persistent memory devices will appear as /dev/pmem* device special files.

Use ndctl and ipmctl to check (the output below is from our Inspur server).

Listing 3-3. Displaying persistent memory physical devices and regions.

root@pve:~/pmdk# ipmctl show -dimm
 DimmID | Capacity    | LockState | HealthState | FWVersion    
===============================================================
 0x0010 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x0110 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x0210 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x0310 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x1010 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x1110 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x1210 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553
 0x1310 | 126.742 GiB | Disabled  | Healthy     | 02.02.00.1553

root@pve:~/pmdk# ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity    | FreeCapacity | HealthState 
=================================================================================================
 0x0000   | 0x45c9c3d034fb8888 | AppDirect            | 504.000 GiB | 0.000 GiB    | Healthy
 0x0001   | 0xbb39c3d022d18888 | AppDirect            | 504.000 GiB | 0.000 GiB    | Healthy

Listing 3-4. Displaying persistent memory physical devices(D), regions(R), and namespaces(N).

ndctl list -DRN

Listing 3-5. Locating persistent memory.

root@pve:~/pmdk# df -h /dev/pmem*
Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev        # /pmem0
udev            126G     0  126G   0% /dev        # /pmem1

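With DAX, a user who creates and opens a file backed by persistent memory can still use memory-mapping APIs such as mmap(). The data in PMem is then mapped into memory in place (natively), with no need to cache files in kernel memory space or to perform I/O operations. Through the pointer returned by mmap(), data can be manipulated directly in PMem.

This brings two advantages: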

No kernel I/O operations are required.
The full file is mapped into the application’s memory.

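With PMem, data does not have to pass through the kernel page cache the way it does with msync(). The second advantage shows up in two ways: it avoids frequent I/O on large files, and it avoids the window in which a failure could occur between calling msync() and the point at which persistence is confirmed.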

It can manipulate large collections of data objects with higher and more consistent performance as compared to files on I/O-accessed storage.
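
On Linux, a program can ask the kernel at mmap() time whether it is really getting such a direct mapping. The sketch below is not from the chapter: it relies on the MAP_SYNC flag (documented in mmap(2) as available since Linux 4.15, only together with MAP_SHARED_VALIDATE, and only for files that support DAX), and the helper name map_prefer_sync is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of `path` read/write, asking
 * for MAP_SYNC first. *got_sync is set to 1 if the kernel accepted
 * MAP_SYNC (expected only for a file on a DAX file system) and to 0
 * if we fell back to an ordinary shared mapping.
 */
void *map_prefer_sync(const char *path, size_t len, int *got_sync)
{
	void *p = MAP_FAILED;
	int fd;

	assert(path != NULL && got_sync != NULL);
	*got_sync = 0;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED)
		*got_sync = 1;
#endif
	/* Not a DAX file, or the flags are unavailable: use a normal
	 * page-cache-backed shared mapping instead. */
	if (p == MAP_FAILED)
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	close(fd);
	return p;
}
```

When got_sync comes back as 0, the program should treat the mapping like any other file mapping and flush with msync().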

┌------┐   ┌--╤════════╤-----╤═════════════╤-┐   ┌------┐
|PMem  |   |  |PMem    |←---→|NVDIMM Driver|←┼--→|PMem  |
|Region|   |  |Aware   |     └-------------┘ |   |Module|
|in    |   |  |File    |                     |   |      |
|App   |   |  |System  |←--------------------┼--→|      |
|      |   |  ├--------┤                     |   |      |
|      |←--┼--┼- MMU  -┼---------------------┼--→|      |
|      |   |  |Mappings|          OS Kernel  |   |      |
└------┘   └--╧════════╧---------------------┘   └------┘

Use DAX

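Direct access to persistent memory is implemented through the file system. To use it, an administrator creates a file system on the PMem module and mounts it into the Linux file system tree.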

Listing 3-7. pmem_map_file.c

In this code, we use DAX to write a string directly into persistent memory.

This code uses libpmem, one of the persistent memory libraries in the Persistent Memory Development Kit (PMDK), which is available for both Linux and Windows, so the sample is portable across both platforms.

The pmem_map_file function handles opening the file and mapping it into our address space. Since the file resides on persistent memory, the operating system maps the persistent memory region directly into the application’s virtual address space.

/* Create a pmem file and memory map it. */
if ((pmemaddr = pmem_map_file(argv[1], PMEM_LEN, 
		PMEM_FILE_CREATE, 0666, &mapped_len, 
		&is_pmem)) == NULL) {
	perror("pmem_map_file");
	exit(1);
}

Notice the is_pmem flag: pmem_map_file() sets it to indicate whether the mapped file actually resides on persistent memory, that is, whether we get DAX access. The check is needed because pmem_map_file() can also memory-map disk-based files through the kernel page cache in main memory, as well as directly mapping persistent memory.

After we strcpy() a string into the mapped region, we must make sure it is persistent. If the file resides on persistent memory, pmem_persist() flushes the data through the CPU cache levels into the power-fail safe domain and ultimately to persistent memory. If the file resides on disk-based storage, msync() (here through the pmem_msync() wrapper) is used to flush the data to storage.

/* Flush our string to persistence. */
if (is_pmem)
	pmem_persist(pmemaddr, sizeof(s));
else
	pmem_msync(pmemaddr, sizeof(s));

/* Delete the mappings. */
pmem_unmap(pmemaddr, mapped_len);

Finally, we pmem_unmap() the persistent memory region.

Note that we can pass small sizes to pmem_persist() (even the size of a short string), whereas msync() requires flushing at page granularity.
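
The difference in granularity is easy to quantify. The sketch below computes how many bytes are covered once a range is widened to whole flush units; flushed_bytes is a hypothetical helper, and the 64-byte cache line and 4 KiB page sizes used in the note that follows are assumptions about a typical x86 system, not values taken from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes covered when [addr, addr + len) is widened to whole
 * `granule`-sized units. `granule` must be a power of two.
 */
size_t flushed_bytes(uintptr_t addr, size_t len, size_t granule)
{
	uintptr_t mask, start, end;

	assert(granule != 0 && (granule & (granule - 1)) == 0);
	if (len == 0)
		return 0;

	mask = ~(uintptr_t)(granule - 1);
	start = addr & mask;				/* round down */
	end = (addr + len + granule - 1) & mask;	/* round up */
	return (size_t)(end - start);
}
```

For a 37-byte string at a page-aligned address, flushed_bytes(addr, 37, 64) gives 64 while flushed_bytes(addr, 37, 4096) gives 4096, so under these assumptions the cache-line flush covers 64 times fewer bytes than the page-granular one.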

Additional Notes

DAX comes in several modes, including fsdax (the default) and devdax.

fsdax: Provides a block-device that can support a DAX-enabled filesystem
devdax: Emits a single character device file (/dev/daxX.Y).
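
A program that accepts a device path can tell the two node styles apart by name. The sketch below is a naming heuristic only, based on the /dev/pmem* and /dev/daxX.Y patterns mentioned in these notes; the enum and function names are hypothetical, and a /dev/pmemN node is not necessarily an fsdax namespace, so real code should confirm the mode with ndctl.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum pmem_node_kind {
	NODE_OTHER = 0,
	NODE_PMEM_BLOCK,	/* /dev/pmemN: block device, e.g. fsdax */
	NODE_DEVDAX_CHAR	/* /dev/daxX.Y: devdax character device */
};

/* Classify a device node purely by its path (no stat(), no ioctl). */
enum pmem_node_kind classify_pmem_node(const char *path)
{
	assert(path != NULL);

	if (strncmp(path, "/dev/pmem", 9) == 0 &&
	    isdigit((unsigned char)path[9]))
		return NODE_PMEM_BLOCK;

	if (strncmp(path, "/dev/dax", 8) == 0 &&
	    isdigit((unsigned char)path[8]) &&
	    strchr(path + 8, '.') != NULL)
		return NODE_DEVDAX_CHAR;

	return NODE_OTHER;
}
```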

Summary

Figure 3-6. Persistent memory programming interfaces

Figure 3-6 shows the complete view of the operating system support that this chapter describes. As we discussed, an application can use persistent memory as a fast SSD, more directly through a persistent memory-aware file system, or mapped directly into the application’s memory space with the DAX option. DAX leverages operating system services for memory-mapped files but takes advantage of the server hardware’s ability to map persistent memory directly into the application’s address space. This avoids the need to move data between main memory and storage.