Relocatable Malloc: Large-scale data processing

The output from running my modified Memcheck (of Valgrind) tool on Opera (12.x, latest as of 2012-08-08) gives 12Gb of output, which is about 500M lines in the following format:

('new', 16, 0x41c3028)

('store', 0x41c3028, 4)

('store', 0x41c3030, 4)

('store', 0x41c302c, 4)

('new', 16, 0x41c3068)

('store', 0x41c3068, 4)

('store', 0x41c3070, 4)

('store', 0x41c306c, 4)

('new', 16, 0x41c30a8)

Valid operations are new, free, store, load, modify. Each new request is stored in a hash, which then subsequential memory access ops are mapped to. The data collected is then used to generate a C program with a fixed number of handles and a long list of lock/unlock/free/new calls into the C allocator being benchmarked. This for comparison between cmalloc and other allocators. The collector program (translate.py) evals each line and does processing on it, very simple.

Python too slow? Nope.

The problem is that Python is very slow and uses too much memory, which my 4GB Core i3 laptop can't handle - translate.py works for small-ish outputs. This because the list of handles is checked for each memory access, i.e. a 2'000 (approx) entries list for each memory access (~500M), quickly becomes unusable. I tried various approaches, such as moving out the code to Cython (formerly known as Pyrex), which translates the Python code into C and builds it as a Python extension module (a regular .so), but only doing that did not speed things up.

Brute-force hashing!

Hashing on the start and end address was given as a suggestion by @bobf, where each access address would be decremented towards start and incremented towards end, and when both values hit a hash, and the hash were the same (i.e. the memory handle), it'd be done. That was even slower than iterating through 2000 elements, because the hash has to be checked on average one lookup per allocated byte in the memory area.

Finally, I came up with a brute-force solution: hash all addresses within the requested memory area - from start to end, mapping each address to the corresponding memory handle. This proved to be fast, but blew up with a MemoryError at about 2 GB data read (out of 12 GB in total), and was ready to investigate a key-value pair store (i.e. hash) like Redis, but it's in-memory only.

Yup. Or maybe. Still need more RAM though.

My server with 8GB RAM has swap enabled, but by default Ubuntu 10.04 LTS doesn't over-commit memory. Setting /proc/sys/vm/overcommit_memory to 1 effectively enables swap for application memory allocation, if I've understood it correctly. I've just restarted the application, chugging away at 2,3G data read at 336M physical RAM free and 0 bytes swap space used.

Or maybe, just maybe, 64-bit OS.

So, what I've realized is that the problem is, of course, that using a 32-bit system to allocate data larger than 4GB doesn't work very well. Installed a 64-bit Ubuntu LiveCD on a USB stick and did post-processing from that end.

However, it's not good enough. Calculating the handle mappings can be done in one pass, but also including all ops (mapped to handles, instead of pointers) will not fit in memory. Therefore, my nice and handy post-processing script that does everything in one pass does not cut the mustard. Splitting it up into more parts, where each one does one specific thing:

map addresses to handles and write out ops (mapped to handle) to a file
read ops file, pruning duplicate ops (e.g. two or more successive L/M/S to the same handle) and write out malloc C source
read ops file, calculate handle lifetime for histogram

That's what it does for now.

More on lifetime

The lifetime might be more elaborate, for now the calculation is fairly naive in that it only checks for really long-lived areas, but it could also be setup to scan for "sub-lifetimes", i.e. module-global. My guess is that it would look like the histogram data above (spikes), but located in the middle. Calculating that would mean that start and end points for calculating lifetime would be sliding, such that end is fixed and start moves towards end, and the other way around, where start is fixed and end moves towards start. Storing each value takes up lots of memory and analyzing the end-result by hand takes a very long time since one'd have to look at each histogram.

Current histogram is plotted for lifetime which is already calculated. A plot showing ops per handle over time (3D graph: ops, handle, time) could possibly give useful information about the clustering of ops and handles, in turn being used for calculating new lifetimes. If time allows for it, otherwise left in future work, since I'm not quite sure on what to plot to give the most useful information, and how much it would affect real-life lock/unlock patterns.

Lifetime calculations too slow, alas

Lifetime is defined as number of ops on own handle divided by ops for all other handles, for each handle's lifetime. Each handle is mapped to a tuple (own, others), and for each operation either own or others is incremented, until the handle is freed, at which point it's moved to the set of inactive handles. This means going through all handles for each operation, which for smaller datasets would be OK. Again, we don't have that luck, and for the Opera data set it's about 8GB data. Even removing duplicates (two successive ops on the same handle) this quadratic O(m*n) (m = ops, n = live handles) takes too long time.

Instead, keep a counter of ops so far (ops_counter) and for each handle, store the tuple (own, <value of ops_counter at New>), and only increase the "own" value for ops mapping to a handle. Then, at death (free), calculate the "others" value by calculating ops_counter - own - cached_ops_counter. Example, with ops counter, set of alive, set of dead:

20 | {(a 5 0) (b 2 5) (c 10 7) (d 3 17)} | {}, (death b) =>

20 | {(a 5 0) (c 10 7) (d 3 17)} | {(b 2 20-5-2=13)}, (death a) =>

20 | {(c 10 7) (d 3 17)} | {(b 2 13) {a 5 20-5-0=15}, (death d) =>

20 | {(c 10 7) (d 3 17)} | {(b 2 13) (a 5 15) (d 3 20-17-3=0)}, (new e) =>

25 | {(c 10 7) (d 3 17) (e 5 20)} | {(b 2 13) (a 5 15) (d 3 0)}, (new f) =>

| {(c 10 7) (d 3 17) (e 5 20) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0)}, (death e) =>

  

28 | {(c 10 7) (d 3 17) (e 5 20) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0) (e 5 28-20-5=3}

At end, any remaining live handles (due to missing frees) are moved to the dead set. This algorithm is O(m) + m*n.


Histogram of macro lifetime, 100'000 = full program

Relocatable Malloc

Sidor

fredag 10 augusti 2012

Large-scale data processing