Friday, August 10, 2012

Large-scale data processing

The output from running my modified Memcheck tool (part of Valgrind) on Opera (12.x, latest as of 2012-08-08) is 12 GB of data, which is about 500M lines in the following format:

('new', 16, 0x41c3028)
('store', 0x41c3028, 4)
('store', 0x41c3030, 4)
('store', 0x41c302c, 4)
('new', 16, 0x41c3068)
('store', 0x41c3068, 4)
('store', 0x41c3070, 4)
('store', 0x41c306c, 4)
('new', 16, 0x41c30a8)

Valid operations are new, free, store, load, modify.  Each new request is stored in a hash, to which subsequent memory access ops are mapped.  The collected data is then used to generate a C program with a fixed number of handles and a long list of lock/unlock/free/new calls into the C allocator being benchmarked, for comparison between cmalloc and other allocators.  The collector program (translate.py) evals each line and does some very simple processing on it.
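
For reference, the core of the collection loop looks roughly like this. It is a simplified sketch, not the actual translate.py; the exact tuple layout for free and the handle bookkeeping are my assumptions here.

# Simplified sketch of the translate.py collection loop (not the real script).
# Each trace line is a valid Python tuple literal, so eval() turns it into a tuple.
handles = {}        # start address -> (handle id, size)
next_handle = 0

def process(line):
    global next_handle
    op = eval(line)                    # e.g. ('new', 16, 0x41c3028)
    kind = op[0]
    if kind == 'new':
        size, addr = op[1], op[2]
        handles[addr] = (next_handle, size)
        next_handle += 1
    elif kind == 'free':               # assumed layout: ('free', addr)
        handles.pop(op[1], None)
    else:                              # 'load', 'store', 'modify'
        addr = op[1]                   # map addr to the containing area's handle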

Python too slow? Nope.

The problem is that Python is very slow and uses too much memory, more than my 4 GB Core i3 laptop can handle; translate.py works for small-ish outputs. This is because the list of handles is scanned for each memory access: a list of roughly 2,000 entries checked for each of ~500M accesses quickly becomes unusable.   I tried various approaches, such as moving the code out to Cython (formerly known as Pyrex), which translates the Python code into C and builds it as a Python extension module (a regular .so), but that alone did not speed things up.
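
The slow part is, in essence, a linear scan; something along these lines (hypothetical sketch, not the actual translate.py code):

# Hypothetical sketch of the slow lookup: a linear scan over all live
# allocations for every one of the ~500M loads/stores.
def find_handle(addr, live):
    # live is a list of (start, size, handle) for currently allocated areas
    for start, size, handle in live:
        if start <= addr < start + size:
            return handle
    return None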

Brute-force hashing!

Hashing on the start and end address was suggested by @bobf: each access address would be decremented towards a known start address and incremented towards a known end address, and when both probes hit an entry in the hash and the entries pointed at the same memory handle, the lookup was done.  That turned out to be even slower than iterating through 2,000 elements, because on average the hash has to be probed about once per allocated byte in the memory area.
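
As I understood the suggestion, it amounts to something like this sketch (my own reconstruction, with hypothetical dicts starts and ends keyed by the area boundaries):

# Sketch of the start/end probing idea: walk down to the nearest known start
# address and up to the nearest known end address, then check that they
# belong to the same handle.  Assumes addr lies inside some tracked area.
def find_handle(addr, starts, ends):
    lo, hi = addr, addr
    while lo not in starts:
        lo -= 1
    while hi not in ends:
        hi += 1
    if starts[lo] == ends[hi]:         # same handle on both sides
        return starts[lo]
    return None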

Finally, I came up with a brute-force solution: hash all addresses within the requested memory area, from start to end, mapping each address to the corresponding memory handle.  This proved to be fast, but blew up with a MemoryError at about 2 GB of data read (out of 12 GB in total). I was ready to investigate a key-value store (i.e. a hash) like Redis, but it's in-memory only.
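
In sketch form (hypothetical function names, same idea):

# Brute-force mapping: one dict entry per byte of every live allocation.
# Lookups become O(1), but the dict itself is what eventually blows up
# with a MemoryError.
addr_to_handle = {}

def on_new(handle, start, size):
    for a in range(start, start + size):
        addr_to_handle[a] = handle

def on_free(start, size):
    for a in range(start, start + size):
        addr_to_handle.pop(a, None)

def find_handle(addr):
    return addr_to_handle.get(addr)    # constant time per load/store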

Yup. Or maybe. Still need more RAM though.

My server with 8 GB RAM has swap enabled, but by default Ubuntu 10.04 LTS doesn't over-commit memory. Setting /proc/sys/vm/overcommit_memory to 1 effectively enables swap for application memory allocations, if I've understood it correctly. I've just restarted the application; it's chugging away at 2.3 GB data read, with 336 MB physical RAM free and 0 bytes of swap space used.

Or maybe, just maybe, 64-bit OS.

So, what I've realized is that the problem is, of course, that using a 32-bit system to allocate more than 4 GB of data doesn't work very well.  I installed a 64-bit Ubuntu LiveCD on a USB stick and did the post-processing from there.

However, it's not good enough. Calculating the handle mappings can be done in one pass, but also keeping all ops (mapped to handles instead of pointers) will not fit in memory. Therefore, my nice and handy post-processing script that does everything in one pass does not cut the mustard.   I'm splitting it up into more parts, where each one does one specific thing:
  • map addresses to handles and write out the ops (now referring to handles) to a file
  • read the ops file, prune duplicate ops (e.g. two or more successive L/M/S to the same handle) and write out the malloc C source (sketched below)
  • read the ops file and calculate handle lifetimes for a histogram
That's what it does for now.
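
For example, the duplicate-pruning pass could be as small as this sketch; the ops-file format ("op handle" per line, with L/M/S/N/F as op codes) is an assumption for illustration:

# Hypothetical second pass: drop repeated L/M/S ops on the same handle
# before generating the C benchmark source.
def prune(infile, outfile):
    last_access = None                 # handle of the previous L/M/S op
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            kind, handle = line.split()
            if kind in ('L', 'M', 'S'):
                if handle == last_access:
                    continue           # successive access to the same handle
                last_access = handle
            else:                      # N/F breaks the run of accesses
                last_access = None
            fout.write(line)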

More on lifetime

The lifetime calculation might be made more elaborate; for now it is fairly naive in that it only checks for really long-lived areas, but it could also be set up to scan for "sub-lifetimes", i.e. module-global areas.  My guess is that they would look like the histogram data above (spikes), but located in the middle.  Calculating that would mean the start and end points used for the lifetime would be sliding: end fixed while start moves towards end, and the other way around, start fixed while end moves towards start.  Storing each value takes up a lot of memory, and analyzing the end result by hand takes a very long time, since one would have to look at each histogram.

The current histogram is plotted for the lifetime measure already calculated. A plot showing ops per handle over time (a 3D graph: ops, handle, time) could possibly give useful information about the clustering of ops and handles, which in turn could be used for calculating new lifetimes.  If time allows for it; otherwise it's left as future work, since I'm not quite sure what to plot to give the most useful information, or how much it would affect real-life lock/unlock patterns.

Lifetime calculations too slow, alas

Lifetime is defined as the number of ops on a handle's own memory divided by the number of ops on all other handles during that handle's lifetime.  Each handle is mapped to a tuple (own, others), and for each operation either own or others is incremented, until the handle is freed, at which point it's moved to the set of inactive handles. This means going through all live handles for each operation, which for smaller datasets would be OK. Again, we don't have that luck, and for the Opera data set it's about 8 GB of data. Even after removing duplicates (two successive ops on the same handle), this quadratic O(m*n) approach (m = ops, n = live handles) takes too long.
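
In other words, the naive version does something like this for every op (sketch):

# Naive O(m*n) version: each op walks over every live handle.
live = {}                              # handle -> [own, others]

def on_op(handle):
    for h, counts in live.items():
        if h == handle:
            counts[0] += 1             # own
        else:
            counts[1] += 1             # others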

Instead, keep a counter of ops so far (ops_counter) and, for each handle, store the tuple (own, <value of ops_counter at new>), and only increase the "own" value for ops mapping to that handle. Then, at death (free), compute the "others" value as ops_counter - own - cached_ops_counter. Example, with ops counter, set of alive, set of dead:


20 | {(a 5 0) (b 2 5) (c 10 7) (d 3 17)} | {}, (death b) =>
20 | {(a 5 0) (c 10 7) (d 3 17)} | {(b 2 20-2-5=13)}, (death a) =>
20 | {(c 10 7) (d 3 17)} | {(b 2 13) (a 5 20-5-0=15)}, (death d) =>
20 | {(c 10 7)} | {(b 2 13) (a 5 15) (d 3 20-3-17=0)}, (new e, 5 ops on e) =>
25 | {(c 10 7) (e 5 20)} | {(b 2 13) (a 5 15) (d 3 0)}, (new f, 3 ops on f) =>
28 | {(c 10 7) (e 5 20) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0)}, (death e) =>
28 | {(c 10 7) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0) (e 5 28-5-20=3)}

At the end, any remaining live handles (due to missing frees) are moved to the dead set. This algorithm is O(m + n), instead of the previous O(m*n).
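
A sketch of the counter-based version, following the example above (function names are mine):

# Counter-based version: constant work per op, "others" is derived
# only when the handle dies.
ops_counter = 0
live = {}                              # handle -> [own, ops_counter at new]
dead = {}                              # handle -> (own, others)

def on_new(handle):
    live[handle] = [0, ops_counter]

def on_op(handle):
    global ops_counter
    ops_counter += 1
    live[handle][0] += 1               # only this handle's own count is touched

def on_free(handle):
    own, cached = live.pop(handle)
    dead[handle] = (own, ops_counter - own - cached)

def finish():
    for handle in list(live):          # missing frees: move the rest to dead
        on_free(handle)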


[Histogram of macro lifetime, 100'000 = full program]

Wednesday, August 8, 2012

Determining global variables

A global variable is one that is likely to be locked at all times, i.e. it is accessed throughout the application, often indirectly.

The type of any memory area, with regard to lifetime, can be determined in many ways.  The most obvious one is to measure the number of memory accesses (load, store), henceforth called "ops", between new and free, compared to the total number of operations within the entire program.  That will give us all memory areas that are allocated in the beginning and freed at the end, which is common for architectural-type objects that are visible from everywhere. They usually also have a fairly low number of own accesses compared to the total number of accesses within their lifetime. Calculating (own ops + other ops) / total app ops gives a number from 0 to 1, where 1 is an object allocated at the very beginning and freed at the very end.
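
As a small worked example of that ratio (the numbers are made up for illustration):

# Lifetime score: 1.0 means allocated at the very start and freed at the
# very end of the program; values here are made up for illustration.
def lifetime(own_ops, other_ops, total_ops):
    return float(own_ops + other_ops) / total_ops

print(lifetime(50, 99000, 100000))     # ~0.99: long-lived, architectural object
print(lifetime(200, 800, 100000))      # 0.01: short-lived object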

Looking at the distribution of lifetime among pointers in a histogram for different applications, we see a clustering pattern for long-lived vs short-lived objects:

[Histogram of pointer lifetimes: sort /etc/dictionaries-common/words]

[Histogram of pointer lifetimes: uniq /etc/dictionaries-common/words]

[Histogram of pointer lifetimes: soffice lorem-ipsum-10.odt]
The different apps show different behaviour.

Starting with sort: no objects have a really long lifetime, and lifetimes are spread approximately evenly throughout the run of the application. There's a cut-off point between about 30% and 15% where longer and shorter life spans separate, but here a better way of determining lifetime is needed.

The second example, uniq, is on the other hand a very non-malloc-intensive application, where a bunch of memory is allocated in the beginning and then no more. It is safe to assume that this memory will be locked throughout the program's lifetime.

The last example, Star Office, shows a more typical large-scale application's memory allocation behaviour, where a set of architectural objects is allocated in the beginning, with a lifetime of close to 100%, then a few allocations here and there, followed by a set of objects with a very low lifetime, presumably for the current document itself.   In this case, the first objects (i.e. long lifetime) would be locked all the time, whereas the later objects (i.e. short lifetime) would be locked just-in-time.

Well, how do you determine the type of variable, then?

As stated above in the example with sort, the application displays an interesting behaviour that could possibly give a better understanding of applications with similar memory allocation patterns. Because sort is a small program, I might do a manual analysis of its memory usage to see how else we can determine whether an object is "global" (over a piece of code) and should be locked or not.  This might be work for the future.