Programmer's Notes
Memory-mapping files on Linux with mmap
Last time we talked about mapping files into memory with the WinAPI; today we will figure out how the same thing is done on *nix systems, specifically Linux and macOS. I was too lazy to test the code on FreeBSD, but in theory everything should work on that operating system as well. Once again, I am almost sure that many readers of this blog are already familiar with memory-mapped files, so this post is intended for everyone else.
Let's list the required includes and declare a FileMapping structure that holds the file descriptor, the file size, and a pointer to the mapped memory region:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>

struct FileMapping {
    int fd;
    size_t fsize;
    unsigned char* dataPtr;
};
Let's look at reading from a file using a mapping.
We open the file for reading:
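A minimal sketch of how this step presumably looks, with the file name passed in as fname (the same identifier used in the error message below):

int fd = open(fname, O_RDONLY);
if (fd == -1) {
    std::cerr << "fileMappingCreate - open failed, fname = " << fname << std::endl;
    return nullptr;
}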
We find out the file size:
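Presumably via fstat; a minimal sketch:

struct stat st;
if (fstat(fd, &st) < 0) {
    std::cerr << "fileMappingCreate - fstat failed, fname = " << fname << std::endl;
    close(fd);
    return nullptr;
}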
size_t fsize = (size_t)st.st_size;
With a call to mmap we create the mapping of the file into memory:
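A minimal sketch, assuming a read-only private mapping (the exact protection and flags in the original may differ):

unsigned char* dataPtr = (unsigned char*)mmap(nullptr, fsize, PROT_READ, MAP_PRIVATE, fd, 0);
if (dataPtr == MAP_FAILED) {
    std::cerr << "fileMappingCreate - mmap failed, fname = " << fname << std::endl;
    close(fd);
    return nullptr;
}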
Finally, we fill in the FileMapping structure and return a pointer to it as the result:
FileMapping* mapping = (FileMapping*)malloc(sizeof(FileMapping));
if (mapping == nullptr) {
    std::cerr << "fileMappingCreate - malloc failed, fname = "
              << fname << std::endl;
    munmap(dataPtr, fsize);
    close(fd);
    return nullptr;
}

mapping->fd = fd;
mapping->fsize = fsize;
mapping->dataPtr = dataPtr;

return mapping;
Now, at the address mapping->dataPtr, we can read mapping->fsize bytes of the file's contents.
As always, we must not forget to release the resources once they are no longer needed:
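A minimal sketch of the cleanup; the function name fileMappingDestroy is an assumption, not necessarily what the original code calls it:

void fileMappingDestroy(FileMapping* mapping) {
    munmap(mapping->dataPtr, mapping->fsize); // remove the mapping
    close(mapping->fd);                       // close the file descriptor
    free(mapping);                            // free the structure itself
}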
That's all! I'm sorry if you were expecting something more complicated 🙂 You can find the full version of the source code here.
Those who found this material too simple can do the following as homework:
- Take one of the *BSD systems and check whether the code works on it;
- Rewrite the example so that the file can not only be read but also written to (a rough sketch appears at the end of this post);
- Find out whether the size of a file mapped into memory can be changed;
- Find out what happens if you create a mapping of a file and then write something to the file with a regular write call;
- Google the use of mmap for IPC and write a corresponding example;
I admit that I have not done these exercises myself, since a task that needed them has not yet come my way. So I would be very grateful if you wrote about your results in the comments.
Addendum: also take a look at the mlock / munlock, msync, madvise, and mremap system calls. In certain kinds of applications (for example, DBMSs) they can be very, very useful!
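As an untested sketch only (not from the original post): the second homework item plus the addendum would roughly amount to opening the file O_RDWR, mapping it PROT_READ | PROT_WRITE with MAP_SHARED, and flushing changes with msync; fname and fsize are assumed to be known as above.

int fd = open(fname, O_RDWR);
unsigned char* dataPtr = (unsigned char*)mmap(nullptr, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (dataPtr != MAP_FAILED) {
    dataPtr[0] = 0xFF;              // the change becomes visible to other readers of the file
    msync(dataPtr, fsize, MS_SYNC); // force dirty pages to be written back to disk
    munmap(dataPtr, fsize);
}
close(fd);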
Is a boost memory-mapped file zeroed on Linux?
I'm re-learning C++, and I need to use memory-mapped files. I decided to use boost (since it seems to be a solid library).
I created a memory-mapped file mapped to an array of doubles, and wrote to the first double in this array. On disk, the file contained some data in the first four bytes, and the rest was zeroed. This was curious to me, because generally, if I obtain a pointer to a memory location in C++, in most cases I have to assume that it contains garbage.
Do I have any guarantees that newly created memory mapped files will be zeroed (at least on Linux)? I didn’t find any reference for that.
Here are the file contents:
EDIT
As @Zan pointed out, boost actually uses ftruncate to resize mmapped files, so zeroing is guaranteed (at least on Linux).
2 Answers
A memory mapped file contains whatever was in the file.
If it is a new file, it has been extended to the right size and the extension will contain zeros. Extending a file is usually done with the ftruncate function.
The ftruncate manpage says:
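"If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0')."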
So yes, zeros are guaranteed.
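A quick way to check this yourself (a hedged sketch, not part of the original answer): create an empty file, extend it with ftruncate, map it, and look at the bytes.

#include <cassert>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    assert(fd != -1);
    assert(ftruncate(fd, 4096) == 0);   // extend the empty file to 4096 bytes
    unsigned char* p = (unsigned char*)mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    assert(p[0] == 0 && p[4095] == 0);  // the extended region reads back as zeros
    munmap(p, 4096);
    close(fd);
    return 0;
}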
I think boost is zeroing the file for you in order to ensure that the mapped address space is really backed by disk space and not by a sparse file. This is slow, especially if you want to create a big address space up front which might never be fully used, just so that you can allocate many objects in that address space. They do this because there is no usable way on UNIX systems to handle running out of disk space while writing to memory-mapped sparse files (ignoring, for the moment, such sick solutions as setjmp/longjmp). But you still have the possibility that some other process truncates the file on disk, in which case the aforementioned problem rears its head again.
Unfortunately they are also doing this (allocating disk space matching the size of the address space instead of using a sparse file) on Windows, where structured exception handling exists.
C to C# (Mono) memory-mapped files/shared memory in Linux
I'm working on an embedded system that acquires about 20 megs of data per second. My lower-level acquisition, control, and processing layer converts most of it into a handful of important values, but it can also be useful for the end user to get a view of a window of the unprocessed data.
I'm working on an ASP.NET front end in C# with mod-mono. I would like the server-side part of the ASP.NET page to be able to easily request the last half-second or so worth of data. The C++ code has real-time constraints, so I can't use message passing to respond; it could easily get bogged down by too many clients or someone quickly refreshing. I would like it to be able to place the data somewhere where any number of C# readers can access it as needed.
I'm picturing an area of shared memory with a rolling buffer of the last 16 or 32 MB of data. The C++ code is constantly updating it, and the C# code can peek at it whenever it wants to. Is there a way to handle this? All the information I find on using memory-mapped files seems to focus on forking a child, rather than having two unrelated processes use it for IPC. Does it have to hit the disk (or the fs cache, etc.) before the C# application can see it, or does memory-mapping from the two programs actually make them share the same pages?
Is there a way of accessing POSIX shared memory objects in C#?
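Not from the thread, but to make the intended setup concrete, here is a hedged sketch of the C++ producer side using a POSIX shared memory object (the name /acq_ring and the 32 MB size are made up; error handling omitted). On Linux the object shows up as /dev/shm/acq_ring, a regular tmpfs file, so a Mono/C# reader should be able to map the same pages by opening that path with the usual memory-mapped-file APIs; both sides then share the same physical pages, and nothing has to hit a real disk.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const size_t bufSize = 32 * 1024 * 1024;                  // hypothetical 32 MB rolling buffer
    int fd = shm_open("/acq_ring", O_CREAT | O_RDWR, 0666);   // appears as /dev/shm/acq_ring; link with -lrt on older glibc
    ftruncate(fd, (off_t)bufSize);                            // size the shared object
    unsigned char* buf = (unsigned char*)mmap(nullptr, bufSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    // ... the real-time acquisition loop would keep writing samples into buf here ...
    munmap(buf, bufSize);
    close(fd);
    return 0;
}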
Questions about anonymous mapped memory in Linux
I'm playing around with the idea of using the virtual memory system to allow me to do transparent data conversion (e.g. int to float) for some numeric data stuff I've got. The basic idea is that the library I'm writing mmaps the data file you want, and at the same time mmaps an anonymous region of an appropriate size to hold the converted data, and this pointer is returned to the user.
The anonymous region is read/write protected, so that whenever the user goes to access data through the pointer, every new page will cause a segfault, which I can catch, then transparently convert data from the mmaped file and fix up the permissions allowing the access to continue. This part of the whole thing works great so far.
However, sometimes I mmap very large files (hundreds of gigabytes), and with the anonymous memory proxying access to it, pretty quickly you’ll start eating swap space as anonymous pages are dropped to disk. My thought was that if I could explicitly set the dirty bit on the anonymous pages to false after writing converted data to them, the OS would just drop them and zero-fill on demand later if they were re-accessed.
For this to work though, I think I’d have to set the dirty bit to false and convince the OS to set pages to be read protected when they’re swapped out, so I can re-catch the ensuing segfault and reconvert the data on demand. After doing some research I don’t think this is possible without kernel hacking, but I thought I’d ask and see if someone that knows more about the virtual memory system knows a way this might be achieved.
2 Answers
Here’s an idea (completely untested though): for the converted data, mmap and munmap individual pages as you need them. Since the pages are backed by anonymous memory they should be discarded when they are unmapped. Linux will coalesce adjacent mappings into a single VMA, so this might have acceptable overhead.
Of course, there needs to be a mechanism to trigger the unmapping. You could maintain an LRU structure and evict an older page when you need to bring a new one in, thus keeping the size of the mapped region constant.
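A hedged, untested sketch of the eviction step this suggests; mapping a fresh anonymous page over the old address with MAP_FIXED combines the munmap and the new mmap into one call, so the page's contents are discarded and the next access faults again:

#include <sys/mman.h>

// Drop one converted page (or page group) so that a later access segfaults again.
void evict_page(void* addr, size_t len) {
    // Replacing the mapping discards the old anonymous page and leaves the
    // address range inaccessible (PROT_NONE), ready to be faulted back in.
    mmap(addr, len, PROT_NONE, MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}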
Extending on a suggestion I mentioned in your earlier related question, I think the following (Linux-specific, definitely not portable) scheme should work quite reliably:
Set up a datagram socket pair using socketpair(AF_UNIX, SOCK_DGRAM, 0, &sv), and a signal handler for SIGSEGV. (You won't need to worry about SIGBUS, even if other processes might truncate the data file.)
The signal handler uses write() to write the faulting address (size_t addr = siginfo->si_addr;) to its end of the socket. The signal handler then read()s one byte from the socket it wrote into (blocking; this is basically just a reliable sleep(), so remember to handle EINTR), and returns.
Note that even if there are multiple threads faulting at or near the same time, there is no race condition. The signals just get reraised until the mappings are fixed.
If there is any kind of a problem with the socket communications, you can use sigaction() with .sa_handler = SIG_DFL to restore the default SIGSEGV signal handler, so that when the same signal is reraised the entire process dies as normal.
A separate thread reads the other end of the socket pair for addresses faulted with SIGSEGV, does all the mapping and file I/O necessary, and finally writes a zero byte to the same end of the socket pair to let the real signal handler know the mapping should be fixed now.
This is basically the "real" signal handler, without the drawbacks of an actual signal handler. Remember, the same thread will keep reraising the same signal until the mapping is fixed, so any race conditions between the separate thread and SIGSEGV signals are irrelevant.
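A hedged sketch of just the handler side of this scheme (the socket pair sv is assumed to have been created at startup and the servicing thread to be reading sv[1]; error handling is minimal):

#include <errno.h>
#include <signal.h>
#include <unistd.h>

static int sv[2]; // from socketpair(AF_UNIX, SOCK_DGRAM, 0, sv); the servicing thread uses sv[1]

static void segv_handler(int sig, siginfo_t* info, void* context) {
    size_t addr = (size_t)info->si_addr;                   // faulting address
    if (write(sv[0], &addr, sizeof addr) != sizeof addr) {
        // socket trouble: restore the default handler so the reraised signal kills the process
        struct sigaction dfl = {};
        dfl.sa_handler = SIG_DFL;
        sigaction(SIGSEGV, &dfl, nullptr);
        return;
    }
    char ack;
    while (read(sv[0], &ack, 1) == -1 && errno == EINTR) {
        // a reliable "sleep" until the servicing thread reports the mapping as fixed
    }
    // On return the faulting instruction is retried; if the mapping is still
    // broken, the same signal is simply raised again.
}

// Installed once at startup, roughly like this:
//   struct sigaction sa = {};
//   sa.sa_sigaction = segv_handler;
//   sa.sa_flags = SA_SIGINFO;
//   sigaction(SIGSEGV, &sa, nullptr);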
Have one PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE mapping matching the size of the original data file.
To reduce the cost in actual RAM (with MAP_NORESERVE you use neither RAM nor swap for the mapping, but for gigabytes of data the page table entries themselves require considerable RAM), you could try using MAP_HUGETLB too. It would use huge pages, and therefore significantly fewer entries, but I am unsure whether there are issues when normal page-sized holes are eventually punched into the mappings; you'd probably have to use huge pages all the way.
This is the "full" mapping that your "userspace" will use to access the data.
Have one PROT_READ or PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS mapping for pristine or dirty (respectively) converted data. If your "userspace" almost always modifies the data, you can always treat the converted data as "dirty", but otherwise you can avoid unnecessary writes of unmodified data by first mapping the converted data PROT_READ only; if it faults, mprotect() it PROT_READ | PROT_WRITE and mark it dirty (so it needs to be converted and saved back to the file). I'll call these two stages "clean" and "dirty" mappings respectively.
When the dedicated thread punches a hole into a "full" mapping for "clean" page(s), you first mmap(NULL, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, ...) a new memory area of suitable size, read() the data from the desired data file into it, convert the data, mprotect(..., PROT_READ) it if you separate "clean" and "dirty" mappings, and finally mremap(newly_mapped, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, new_ptr) it over the section of the "full" mapping.
Note that to avoid any accidents, you should use a global pthread_mutex_t, which is grabbed for the duration of these mremap()s and any mmap() calls elsewhere, to avoid having the kernel give the punched hole to the wrong thread by accident. The mutex will guard against any other thread getting in between. (Otherwise, the kernel might place a small map requested by another thread into the temporary hole.)
When discarding "clean" page(s), you call mmap(NULL, length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0) to get a new map of suitable length, then grab the global mutex mentioned above, and mremap() that new map over the "clean" page(s); the kernel does an implicit munmap(). Unlock the mutex.
When discarding "dirty" page(s), you call mmap(NULL, length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0) twice, to get two new maps of suitable length. You then grab the global mutex mentioned above, and mremap() the dirty data over the first of the new mappings. (Basically, it was only used to find out a suitable address to move the dirty data into.) Then, mremap() the second of the new mappings to where the dirty data used to reside. Unlock the mutex.
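To make the hole-punching step described above concrete, a hedged sketch follows (full_base, map_lock, and fill_converted are made-up names; error handling omitted; mremap() with MREMAP_FIXED needs _GNU_SOURCE):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sys/mman.h>

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER; // the global mutex described above
static unsigned char* full_base;                             // base of the PROT_NONE "full" mapping

// Placeholder: the real code would read() the raw records for [offset, offset+len)
// from the data file and convert them into dst.
static void fill_converted(unsigned char* dst, size_t offset, size_t len) {}

static void map_in(size_t offset, size_t len) {
    // Stage the converted data in a fresh private anonymous area.
    unsigned char* tmp = (unsigned char*)mmap(NULL, len, PROT_READ | PROT_WRITE,
                                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    fill_converted(tmp, offset, len);
    mprotect(tmp, len, PROT_READ);   // keep it "clean" until the first write faults

    // Move it over the hole in the "full" mapping; the mutex keeps other
    // mmap()/mremap() calls from grabbing the address range in between.
    pthread_mutex_lock(&map_lock);
    mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, full_base + offset);
    pthread_mutex_unlock(&map_lock);
}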
Using a separate thread to handle the fault conditions avoids all async-signal-safe function problems. read(), write(), and sigaction() are all async-signal safe.
You only need one global pthread_mutex_t to avoid the case where the kernel hands the recently-moved hole (mremap()ped from a memory area) to another thread; you can also use it to protect your internal data structure (pointer chain, if you support multiple concurrent file mappings).
There should be no race conditions (other than when other threads use mmap() or mremap(), which is handled by the mutex mentioned above). When a "dirty" page or page group is moved away, it becomes inaccessible to other threads before it is converted and saved; even perfectly concurrent access by another thread should be handled perfectly: the page will simply be re-read from the file and re-converted. (If that occurs often, you might wish to cache recently saved page groups.)
I do recommend using large page groups, say 2M or more, instead of single pages, to reduce the overhead. The optimal size depends on your application's access patterns, but the huge page size (if supported by your architecture) is a very good starting point.
If your data structures do not align to pages or page groups, you should cache the full converted first and last records (which are only partially within the page or page group). It usually makes the conversion back to storage format much easier.
If you know or can detect typical access patterns within the file, you probably should use posix_fadvise() to tell the kernel; POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED are most useful. It helps the kernel avoid keeping unnecessary pages of the actual data file in page cache.
Finally, you might consider adding a second special thread for converting and writing dirty records back to disk asynchronously. If you take care to make sure the two threads don’t get confused when the first thread wants to re-read back a record still being written to disk by the second thread, there should be no other issues there either — but asynchronous writing is likely to increase your throughput with most access patterns, unless you are I/O bound anyway, or really short on RAM (relatively speaking).
Why use read() and write() instead of another memory map? Because of the in-kernel overhead of the virtual memory structures needed.