fredag 10 augusti 2012

Large-scale data processing

The output from running my modified Memcheck tool (part of Valgrind) on Opera (12.x, latest as of 2012-08-08) is 12 GB, which is about 500M lines in the following format:

('new', 16, 0x41c3028)
('store', 0x41c3028, 4)
('store', 0x41c3030, 4)
('store', 0x41c302c, 4)
('new', 16, 0x41c3068)
('store', 0x41c3068, 4)
('store', 0x41c3070, 4)
('store', 0x41c306c, 4)
('new', 16, 0x41c30a8)

Valid operations are new, free, store, load and modify.  Each new request is stored in a hash, which subsequent memory access ops are then mapped to.  The collected data is then used to generate a C program with a fixed number of handles and a long list of lock/unlock/free/new calls into the C allocator being benchmarked, for comparison between cmalloc and other allocators.  The collector program evals each line and does processing on it; very simple.
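The collector loop can be sketched roughly like this (function and variable names are mine, not the actual script's; ast.literal_eval stands in for the eval mentioned above, and the shape of the free tuple is an assumption):

```python
import ast

def collect(lines):
    handles = {}       # start address -> integer handle
    next_handle = 0
    ops = []
    for line in lines:
        op = ast.literal_eval(line)  # each trace line is a Python tuple literal
        kind = op[0]
        if kind == 'new':            # ('new', size, address)
            _, size, address = op
            handles[address] = next_handle
            ops.append(('new', next_handle, size))
            next_handle += 1
        elif kind == 'free':         # assumed shape: ('free', address)
            ops.append(('free', handles.pop(op[1])))
        else:                        # ('store'/'load'/'modify', address, size)
            _, address, size = op
            if address in handles:   # exact start addresses only; interior
                ops.append((kind, handles[address]))  # addresses need a range lookup
    return ops
```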

Python too slow? Nope.

The problem is that Python is very slow and uses too much memory, more than my 4GB Core i3 laptop can handle (it works for small-ish outputs). This is because the list of handles is checked for each memory access: a list of roughly 2,000 entries scanned for each of ~500M accesses quickly becomes unusable.   I tried various approaches, such as moving the code out to Cython (formerly known as Pyrex), which translates the Python code into C and builds it as a Python extension module (a regular .so), but that alone did not speed things up.

Brute-force hashing!

Hashing on the start and end addresses was suggested by @bobf: each access address would be decremented towards start and incremented towards end, and when both values hit a hash and the hashes were the same (i.e. the same memory handle), it'd be done.  That was even slower than iterating through 2,000 elements, because on average the hash has to be checked once per allocated byte in the memory area.

Finally, I came up with a brute-force solution: hash all addresses within the requested memory area, from start to end, mapping each address to the corresponding memory handle.  This proved to be fast, but blew up with a MemoryError at about 2 GB of data read (out of 12 GB in total). I was ready to investigate a key-value pair store (i.e. hash) like Redis, but it's in-memory only.
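A minimal sketch of that brute-force mapping (names are mine): every address inside an allocated area maps to its handle, making each access lookup O(1) at the cost of O(size) work and memory per allocation:

```python
addr_to_handle = {}   # every address inside a live area -> its handle

def on_new(handle, start, size):
    # hash all addresses within the requested area, start to end
    for address in range(start, start + size):
        addr_to_handle[address] = handle

def on_access(address):
    return addr_to_handle.get(address)   # O(1) per load/store/modify

def on_free(start, size):
    for address in range(start, start + size):
        del addr_to_handle[address]
```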

Yup. Or maybe. Still need more RAM though.

My server with 8GB RAM has swap enabled, but by default Ubuntu 10.04 LTS doesn't over-commit memory. Setting /proc/sys/vm/overcommit_memory to 1 effectively enables swap for application memory allocation, if I've understood it correctly. I've just restarted the application; it's chugging away at 2.3 GB of data read, with 336 MB physical RAM free and 0 bytes of swap space used.

Or maybe, just maybe, 64-bit OS.

So, what I've realized is that the problem is, of course, that allocating more than 4GB of data on a 32-bit system doesn't work very well.  I installed a 64-bit Ubuntu LiveCD on a USB stick and did the post-processing from there.

However, it's not good enough. Calculating the handle mappings can be done in one pass, but also keeping all ops (mapped to handles instead of pointers) will not fit in memory. Therefore, my nice and handy post-processing script that does everything in one pass does not cut the mustard.   So I split it into several parts, each doing one specific thing:
  • map addresses to handles and write out ops (mapped to handle) to a file
  • read ops file, pruning duplicate ops (e.g. two or more successive L/M/S to the same handle) and write out malloc C source
  • read ops file, calculate handle lifetime for histogram
That's what it does for now.  
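The duplicate pruning in the second pass could be sketched like this (my guess at the rule, assuming ops are (kind, handle) pairs): runs of successive accesses to the same handle collapse into one op:

```python
ACCESS = {'load', 'modify', 'store'}

def prune(ops):
    last = None                    # handle of the previous access, if any
    for kind, handle in ops:
        if kind in ACCESS:
            if handle == last:
                continue           # successive access to the same handle: drop
            last = handle
        else:
            last = None            # new/free breaks the run
        yield (kind, handle)
```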

More on lifetime

The lifetime calculation might be made more elaborate; for now it is fairly naive in that it only checks for really long-lived areas, but it could also be set up to scan for "sub-lifetimes", i.e. module-global areas.  My guess is that those would look like the histogram data above (spikes), but located in the middle.  Calculating that would mean sliding the start and end points of the lifetime calculation: end fixed with start moving towards end, and the other way around, start fixed with end moving towards start.  Storing each value takes up lots of memory, and analyzing the end result by hand takes a very long time since one would have to look at each histogram.

The current histogram is plotted for the lifetime as already calculated. A plot showing ops per handle over time (3D graph: ops, handle, time) could possibly give useful information about the clustering of ops and handles, which in turn could be used for calculating new lifetimes.  If time allows for it; otherwise it's left as future work, since I'm not quite sure what to plot to give the most useful information, nor how much it would affect real-life lock/unlock patterns.

Lifetime calculations too slow, alas

Lifetime is defined as the number of ops on a handle divided by the ops on all other handles during that handle's lifetime.  Each handle is mapped to a tuple (own, others), and for each operation either own or others is incremented, until the handle is freed, at which point it's moved to the set of inactive handles. This means going through all live handles for each operation, which for smaller datasets would be OK. Again, we don't have that luck: for the Opera data set it's about 8GB of data. Even after removing duplicates (two successive ops on the same handle), this quadratic O(m*n) approach (m = ops, n = live handles) takes too long.

Instead, keep a counter of ops so far (ops_counter) and for each handle store the tuple (own, <value of ops_counter at new>), incrementing only the "own" value for ops mapping to that handle. Then, at death (free), calculate the "others" value as ops_counter - cached_ops_counter - own. Example, with ops counter, set of alive, set of dead:

20 | {(a 5 0) (b 2 5) (c 10 7) (d 3 17)} | {}, (death b) =>
20 | {(a 5 0) (c 10 7) (d 3 17)} | {(b 2 20-5-2=13)}, (death a) =>
20 | {(c 10 7) (d 3 17)} | {(b 2 13) (a 5 20-0-5=15)}, (death d) =>
20 | {(c 10 7)} | {(b 2 13) (a 5 15) (d 3 20-17-3=0)}, (new e) =>
25 | {(c 10 7) (e 5 20)} | {(b 2 13) (a 5 15) (d 3 0)}, (new f) =>
28 | {(c 10 7) (e 5 20) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0)}, (death e) =>
28 | {(c 10 7) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0) (e 5 28-20-5=3)}

At the end, any remaining live handles (due to missing frees) are moved to the dead set. This algorithm is O(m + n) instead of O(m*n).
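The scheme above as a runnable sketch (alive maps handle -> [own, ops_counter at new]; only access ops tick the counter, which matches the worked example):

```python
def lifetimes(ops):
    counter = 0                 # ops so far (accesses only)
    alive, dead = {}, {}
    for kind, handle in ops:
        if kind == 'new':
            alive[handle] = [0, counter]          # (own, cached ops_counter)
        elif kind == 'free':
            own, cached = alive.pop(handle)
            dead[handle] = (own, counter - cached - own)
        else:                                     # load/modify/store
            alive[handle][0] += 1
            counter += 1
    for handle, (own, cached) in alive.items():   # missing frees
        dead[handle] = (own, counter - cached - own)
    return dead
```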

Histogram of macro lifetime, 100'000 = full program

onsdag 8 augusti 2012

Determining global variables

A global variable is one that is likely to be locked at all times, i.e. it's accessed throughout the application, and often indirectly.

The type of any memory area, with regard to lifetime, can be determined in many ways.  The most obvious one is to measure the number of memory accesses (load, store), henceforth called "ops", between new and free, compared to the total number of operations within the entire program.  That will give us all memory areas that are allocated in the beginning and freed at the end, which is common for architectural-type objects that are visible from everywhere. They usually also have a fairly low number of accesses compared to the total number of accesses within their lifetime. Calculating (own ops + other ops) / total app ops gives a number from 0 to 1, where 1 is an object allocated at the very beginning and freed at the very end.

Looking at the distribution of lifetime among pointers in a histogram for different applications, we see a clustering pattern for long-lived vs short-lived objects:

sort /etc/dictionaries-common/words

uniq /etc/dictionaries-common/words

soffice lorem-ipsum-10.odt
The different apps show different behaviour.

Starting with sort: no objects have a really long lifetime, and lifetimes are approximately evenly distributed throughout the lifetime of the application. There's a cut-off point between about 30% and 15% where there's a distinction between longer and shorter life spans, but here a better way of determining lifetime is needed.

The second example, uniq, is on the other hand a very non-malloc-intensive application, where a bunch of memory is allocated in the beginning and then no more. It is safe to assume that this memory will be locked throughout the program's lifetime.

The last example, Star Office, shows the memory allocation behaviour of a more typical large-scale application: a set of architectural objects is allocated in the beginning, with a lifetime of close to 100%, then a few allocations here and there, followed by a set of objects with a very low lifetime, presumably for the current document itself.   In this case, the first objects (long lifetime) would be locked all the time, whereas the later objects (short lifetime) would be locked just-in-time.

Well, how do you determine the type of variable, then?

As stated above in the example with sort, the application displays an interesting behaviour that could possibly give a better understanding of applications with similar memory allocation patterns. Because sort is a small program, I might do a manual analysis of its memory usage to see how else we can determine whether an object is "global" (over a piece of code) and should be locked or not.  This might be work for the future.

måndag 27 februari 2012

w9d1: Locked blocks and unrecoverable fragmentation

CompactLock1 seed 1330343837


Output that follows:
  1. First free block 0xb7cf0ff8, last free 0xb7cf0fe8, total free 58036
  2. First locked block 0xb7cf0fd8, last locked 0xb7cf0fc8
  3. First used block = 0xb7cf0fb8, last used block = 0xb7cf0fb8, found_used_block = 0
  4. Free and used blocks are not adjacent.
  5. no unlocked blocks
  1. 0xb7cf1008 locked block skipped
  2. 0xb7cf0ff8 - 0xb7cf0fe8 free
  3. 0xb7cf0fd8 - 0xb7cf0fc8 locked
  4. 0xb7cf0fb8 used too large
  5. stop.
It needs to look further for smaller used blocks to move (skipping over potential free blocks!), then continue again from the first free block and re-link the moved blocks. Relinking blocks is always messy.

måndag 13 februari 2012

w7d7: Free list became free block slot array

A bunch of commits relating to the free list. Since the top-most block can be locked, there has to be a fallback.  That's handled with a buddy-style slot array of free lists of the corresponding sizes.  Too-large blocks (created via merging) are moved to the correct slot by alloc(), but could in theory be handled completely by the compactor. Benchmark task!
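A sketch of the slot-array idea (the name mirrors the commit's g_free_block_slots; the details are my assumptions): slot k holds a free list of blocks of 2^k up to 2^(k+1) bytes, so any block in a slot above k is guaranteed to fit a request whose log2 is k:

```python
from math import log2

g_free_block_slots = {}   # k -> list of (address, size)

def slot_insert(address, size):
    g_free_block_slots.setdefault(int(log2(size)), []).append((address, size))

def slot_take(size):
    k = int(log2(size))
    # any block in a higher slot is guaranteed big enough
    for j in sorted(g_free_block_slots):
        if j > k and g_free_block_slots[j]:
            return g_free_block_slots[j].pop()
    # slot k itself may still hold a big-enough block, but must be scanned
    for i, (address, sz) in enumerate(g_free_block_slots.get(k, [])):
        if sz >= size:
            return g_free_block_slots[k].pop(i)
    return None
```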

commit 5d1f7237d0d8e800007be6b2960bb8c3cf33c01c
Author: Mikael Jansson <>
Date:   Fri Feb 10 18:33:57 2012 +0100
Replaced g_free_list with g_free_block_slots
Blocks are now stored in linked list indexed by log2(size) where blocks of 2^k - 2^(k+1) bytes can be found.
There can also exist larger blocks than 2^(k+1), in which case they must be moved (by alloc()) to the appropriate slot immediately. Then, continue with the next item in list.

commit 1992d2d09ad3b4647d17ff93afb34b2b7908f140
Author: Mikael Jansson <>
Date:   Sun Feb 12 12:51:49 2012 +0100
* move too large objects out of the way
* if no other blocks found, use what we did find: shrink, insert rest, return block.
commit c5921bc442996f1e522a9c5aaf7794aa801bd7c2
Author: Mikael Jansson <>
Date:   Sun Feb 12 16:01:50 2012 +0100

* new shrink layout [rest|block] instead of [block|rest] to avoid aliasing by code referring to block.

onsdag 8 februari 2012

w7d3: Current issues with compacting

The free list is stored as an unsorted singly-linked list to save a) space (minimum block size) and b) time (insertion).  Used blocks are linked to from the header list, also unsorted.
There are two ways of doing compacting: sliding and arbitrary.

Arbitrary compacting: blocks are moved at random to free spaces. Order of blocks is not preserved.
Sliding compacting: blocks are pushed together (moved by offset) to the free space. Order of blocks is preserved.

Preserving block order gives better cache locality, on the premise that related objects are often allocated close together in time.

Optimally, preserving block order is the better idea. However, there are two constraints that make this hard:

1. Memory is allocated from the top of the heap and no other location.
2. Compacting is allowed to take time, but it needs to be incremental so it can run in a given time slice.

(1) makes a perfect match for arbitrary compacting, where the top-most object (if unlocked) is moved to the bottom of the heap, freeing memory. However:

Issues with allocating objects at top of heap
Any object that is locked will effectively limit how much heap is available. Ponder the following allocation (_ = free, O = unlocked, X = locked)::

    [____OOOO__O___OO__X____]
In this situation, there will never be any larger block available than what's left at the right side of the X, regardless of how well-compacted the rest of the heap is.  Therefore, simply increasing the top pointer is not a good idea. All available memory needs to be utilized, regardless of compacting option.

Greedy sliding, to free up as much memory as possible::

    1. [_______OOOO__OOOO_XOOX_______]
    2. [_________OOOOOOOO_XOOX_______]
    3. [__________OOOOOOOOXOOX_______]

When choosing between segments 1 and 2 (3 is locked in between the Xs), pick the one that causes the largest continuous block of free memory after the move, i.e. move block 1 to the right.  Disregard any segment with one or more Xs in it. With a slightly different layout, it looks like this::

    1. [_______OOOO__OOOO___OO__X____]: g1 left
    2. [_________OOOOOOOO___OO__X____]: g3 left
    3. [_________OOOOOOOOOO_____X____]: done.

This all requires sorted lists of used and free blocks to find the largest continuous areas not interspersed with locked blocks.  Insertion can be done sorted, for O(log N) in malloc()/free(), or the lists can be sorted in compact() and re-sorted on each entry. On each subsequent compact step the list would have to be re-sorted, but it would also always be semi-sorted.  Thus, an algorithm can be picked that performs best when the list is already in near-order.
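A toy model of the greedy sliding choice, on layouts written like the ones above (my assumptions: each run of O:s slides right until it hits the next O or X, and the run whose move leaves the largest continuous free area wins):

```python
import re

def largest_free_run(heap):
    return max((len(r) for r in re.findall(r'_+', heap)), default=0)

def slide_right(heap, start, end):
    # push the run heap[start:end] right until it touches a non-free cell
    stop = end
    while stop < len(heap) and heap[stop] == '_':
        stop += 1
    cells = list(heap)
    length = end - start
    cells[start:end] = '_' * length
    cells[stop - length:stop] = 'O' * length
    return ''.join(cells)

def best_slide(heap):
    # runs of O never contain an X, so X-blocked groups simply cannot move
    moves = [slide_right(heap, m.start(), m.end())
             for m in re.finditer(r'O+', heap)]
    return max(moves, key=largest_free_run)
```

On the first example layout this reproduces the second step: sliding the left-most group right merges the two leading free areas into one.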

Start at bottom also problematic
It's not certain, however, that we will have an ideal situation; in fact, the first objects tend to live longer, i.e.:

    1. [X______OOOO__OOOO___OO__X____]: g1 left

which means we cannot allocate from the start.  Instead, we have to allocate from free blocks. Which blocks to choose? Just the first one, splitting it up into smaller blocks? Would it, possibly, be a better idea to look at the free block closest in size to the requested block?  Searching for a block inside an unordered list, however, is O(n). That is too slow.  So, keep a 2^k array (2^k ... heap_size) of linked lists where blocks of *at least* that size are linked in. Possibly sorted by order (binary tree? red-black? what techniques are there?).

Then, finding a suitable block would be heap[log2(size)]: grab the first block, split it between used and free, relink the list, done.  Similar to the buddy allocator.  However, since there will be a split in any event, why even bother looking for the smallest block? Might as well take just any block. No, the block has to fit. That's why they have to be stored in the 2^k array, but the internal order of each list doesn't matter.

fredag 3 februari 2012

w6d5: Header fragmentation

Continuously growing header list
Growing the header list is O(1) but will cause holes when headers are in state FREE_BLOCK (for later compacting) or UNUSED. Since a handle is mapped 1:1 with a header, they cannot be compacted.  Scanning the list to find a free header to use in malloc() would take O(n) time.

By wasting an extra integer value in the header structure, a monotonically increasing number can be assigned to each new handle and copied into the header.  When compacting, the handles will be sorted by that key.  Looking up a handle in lock/unlock() will be O(log n). Is that acceptable?
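Sketched with Python's bisect (illustrative only): keep the keys in a sorted list parallel to the headers after compacting, so lock/unlock does one binary search:

```python
from bisect import bisect_left

def find_header(keys, headers, key):
    """keys: handle keys sorted after compacting; headers: parallel list."""
    i = bisect_left(keys, key)           # O(log n)
    if i < len(keys) and keys[i] == key:
        return headers[i]
    return None
```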

According to, the overhead is 8-24 bytes per chunk. I'm up to 20 bytes right now.

Could the address of the header be cached in the handle for as long as compact hasn't happened yet? Then, when the compact happens, all handles are updated according to the new value.  Caching the address with the integer requires actual storage somewhere, and we're back to square one with actual lookup having to be done. Rats.

A hash function mapping <handle-integer> to <header> in O(1) that is rewritten on each compact? See for more on hashes -- has a thorough discussion. Also Paul Hsieh's quick hash used in WebKit:

To consider: to get rid of the extra integer, is it possible to combine the handle and memory block addresses and use that to address, updating the handle according to an invariant, e.g. that only one of them will ever be moved at a time, or similar?

Don't grow the header list, don't scan
Since we don't care about any particular order of the headers, only that the entire table remains compact, getting any header is fine.

Therefore, use the fact that UNUSED headers don't use their pointer field to keep track of memory blocks: maintain a header free list by using that pointer as the next pointer!  This requires two pointers, one for the start and one for the end of the free list.  New headers are grabbed from the start and appended to the end.  If the free list is empty, grow the header list as before.

At the compact() step, all FREE_BLOCK headers turn into UNUSED and are inserted into the free list.
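The free list threaded through the unused headers' own pointer fields might look like this (a sketch; the Header class and the two-pointer bookkeeping are my rendering of the idea):

```python
class Header:
    def __init__(self):
        self.state = 'UNUSED'
        self.pointer = None   # memory block when used, next free header when unused

free_head = free_tail = None

def put_header(h):
    """compact(): a FREE_BLOCK header turns UNUSED and goes on the free list."""
    global free_head, free_tail
    h.state, h.pointer = 'UNUSED', None
    if free_tail is None:
        free_head = free_tail = h
    else:
        free_tail.pointer = h   # the unused pointer field is the next link
        free_tail = h

def get_header():
    """malloc(): grab from the start in O(1), or grow the list as before."""
    global free_head, free_tail
    if free_head is None:
        return Header()         # no free list: grow
    h, free_head = free_head, free_head.pointer
    if free_head is None:
        free_tail = None
    h.pointer = None
    return h
```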

w6d4: Workings and growing header list

Memory layout and algorithm
memory layout:
- header list grows from top of stack and downwards
- objects grow from bottom of stack and upwards

header:
- lock type: unlocked = 0, locked = 1, weak = 2
- pointer to memory block
- size of memory block

free memory block:
- 4 bytes at N-4: header address, points back to this block
- 4 bytes at N-8: next free memory block.

handle:
- alias of header, opaque type; means the header block cannot be compacted!

header list:
- insert new at the first free location

- keep bitmap (64 bytes) to narrow down the search of first free header
- 64*8=512 bits, initially 8 headers per bit = 4096 entries.
- bitmap scaled to number of objects, when limit reached, double and scale down (b0=b0*b1, b2=b2*b3, ...).

Current issues
Headers are mapped 1:1 with handles. There must be a header for each block in the free list (for identity when merging with the previous block).  This means that between compacts, the header list will always grow, since there are by definition never any unused handles.

This means that in the current setup, the header list will never be fully compacted.

tisdag 31 januari 2012

w6d1: Cold.

No work done, save for thinking of the data model.

fredag 27 januari 2012

w5d5: balloc: compacting; related work

Related work

Similar to what I want, except I don't have hard real-time requirements, nor do I want to limit the amount of memory available or use size classes. It was presented around the same time as I was given the task. Their implementation, incl. future work, is similar to what I had thought out in the beginning of my thesis work (circa 2008), in particular storing the handle table in the free list to save memory.

Trivial moving compacting (which doesn't yet respect locked handles) that works only within the same k level. It scans for the first (i.e. lowest starting memory address) free slot of the globally largest size, checks for a chunk with a free and a used component, moves the used one to the free slot and merges the two parts of the chunk into a new (partial) chunk of size (k-1).

torsdag 19 januari 2012

w4d3: More alloc/free tests and handles.

Test to randomly allocate and free data, verifying that the internal free/used nibble count matches allocations.

Introduced handles.

måndag 16 januari 2012

w4d1: balloc() lower bounds

Allocation does what it's supposed to, according to the printout of the memory layout. Good!

Chunks and calculating memory usage
Writing a test case for how much RAM is available is not straightforward.  Simply counting blocks of the nibble size used for each requested size does not work: e.g. balloc(5000) = 8192-byte nibble used, subtract that from the total amount of memory available, and keep on until available memory < requested nibble size. For the following series (with 16M of starting RAM):

  1. allocated 5316418 bytes to nibble 0xa25e0008 (k 11)
  2. allocated 3142922 bytes to nibble 0xa2de0008 (k 10)
  3. allocated 479360 bytes to nibble 0xa31e0008 (k 7)
  4. allocated 1359988 bytes to nibble 0xa33e0008 (k 9)
  5. allocated 55739 bytes to nibble 0xa3260008 (k 4)
  6. allocated 232232 bytes to nibble 0xa32a0008 (k 6)
  7. allocated 160914 bytes to nibble 0xa32e0008 (k 6)
  8. allocated 23031 bytes to nibble 0xa3270008 (k 3)
  9. allocated 284332 bytes to nibble 0xa3360008 (k 7)
  10. allocated 208094 bytes to nibble 0xa3320008 (k 6)
  11. allocated 121497 bytes to nibble 0xa3280008 (k 5)
  12. allocated 4092 bytes to nibble 0xa3278008 (k 0)
  13. allocated 11350 bytes to nibble 0xa327c008 (k 2)
  14. allocated 2033 bytes to nibble 0xa3279008 (k 0)
  15. allocated 6196 bytes to nibble 0xa327a008 (k 1)
  16. allocated 431 bytes to nibble (nil) (k 0)
After #15, there are 2048 bytes left and a block of size k=0 is requested. However, as I write this, I realize that you must check for available memory >= minimum nibble size (4kb), or the allocation will, of course, fail. But that is not enough.  There are situations where there's enough memory available just from counting allocated blocks, but because of how the blocks are fitted into memory (greedily), there aren't any spare blocks of the appropriate size available. Knapsack problem.

There is an upper bound on how much spill memory (i.e. internal fragmentation) you can have before you can't allocate another minimum-nibble-sized block. My guess is that it's related not only to allocated memory vs. memory left, but also to the distribution of the earlier block allocations.  However, as this is still the buddy allocator prototype, I'll test allocation until a semi-arbitrary threshold of block size*2-1, i.e. available memory + maximum spill in one chunk.

And in the end, it all turned out to be my boundary checking: bytes = up2(size); balloc(bytes); without checking for bytes < MIN_CHUNK_SIZE. Test now passes.
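The fixed boundary check amounts to something like this (up2 and the 4 KB minimum spelled out as I understand them):

```python
MIN_CHUNK_SIZE = 4096   # minimum nibble size (4 KB)

def up2(size):
    """Smallest power of two >= size."""
    k = 1
    while k < size:
        k <<= 1
    return k

def request_bytes(size):
    # round the request up, but never below the minimum chunk size
    return max(up2(size), MIN_CHUNK_SIZE)
```

This matches the series above: balloc(5000) rounds up to an 8192-byte nibble, while a 431-byte request still needs a full minimum-sized nibble.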

Visualizing a flat image is tricky since earlier memory addresses are modified (= split) as smaller chunks are requested. The final memory layout can easily be dumped and drawn, but how useful will it be for performance evaluation?  It gives the layout at the end, but could miss interesting data in the middle.

torsdag 5 januari 2012

w3d3: Meeting and roadmap adjustment

Short-term roadmap adjustments
The prioritized items for the prototype are as follows:
  1. bmalloc()/bfree()
  2. compacting
  3. remove internal dependency on system malloc()/free()
bmalloc()/bfree() will be built using request_chunk() and merge_nibbles() and glue code.

Compacting requires the four operations bmalloc(), block(), bunlock(), bfree(). handle_t to be introduced.

The internal dependency is for the structures used for data storage. My supervisor pointed out that those are unimportant for the prototype and should therefore be left out.
No code will be stored so the actual solutions can be ugly/inefficient, as long as the key concepts are there. Which, in the case of the prototype buddy allocator, is the case.

There are different ways of measuring the efficiency of the allocator and there are many definitions of efficiency (with t=time, n=number of alloc/free so far):
  • fragmentation over n
  • n before OOM
  • n/t, i.e. how long time memory operations take
It's interesting to plot these in the same graph so that design choices that give trade-offs in time can be compared to e.g. n before OOM, if that is a sought-after goal.

w3d2: Merge nibbles test case

Calling free(foo) with malloc(sizeof(foo)) immediately after will give the two pointers the same address.

MergeOneLevel (for merge_nibbles) test case done.

måndag 2 januari 2012

w3d1: Merging nibbles

Chunks with nibbles or one nibble per chunk?
Block = info and free lists for chunks of size k
Chunk = two nibbles, a and b, each of size k
Nibble = a memory block of size k
k = size class; the size in bytes is MIN_BLOCK_SIZE*2^k

Having A and B inside a chunk was the easiest thing, instead of returning two chunks. However, when merging nibbles that come from two different chunks (i.e. merging a B | A' b'), the first and second nibbles' parents will be locked up as follows in scenario 1.

p->a => c1->a, c1->b
p->b => c2->a, c2->b
release(c1->b), release(c2->a) => new parent chunk p2 can be created

At this point, p->a will be used as c1->a and p->b will be used as c2->b. The new p2 contains a full k-sized chunk. There are now two (k-1) chunks and one k chunk.

Scenario 2, in which chunks only have one memory blob instead of A and B:

p1->mem => c1->mem, c2->mem
p2->mem => c3->mem, c4->mem
release(c2->mem), release(c3->mem) => new parent chunk p3 can be created
p1->mem blocked by c1
p2->mem blocked by c4

My fear that this would lead to bits of memory never being restored (blocked by the new parent chunk) was unfounded, since the same thing happens in both scenarios.  The advantage of not having A/B in one chunk is the somewhat easier management.

Merging nibbles to a new chunk
Until now there has been a relationship between parent and child chunks, i.e. p->a = c1->a, c1->b. However, as discussed in the scenario above, partially restoring the parent and creating a new one would look like this:

p->a => c1->a, c1->b
p->b => c2->a, c2->b
release(c1->b), release(c2->a) => new parent chunk p2 can be created
p->flags |= A_SECOND_HALF_USED
p->flags |= B_FIRST_HALF_USED

This is because p->a and p->b are only partially free (p->a's first half and p->b's second half) and cannot be used to create new child chunks; because of that, the flags are only useful if/when they're eventually combined, i.e. A_SECOND_HALF_USED|A_FIRST_HALF_USED. While possible, the occurrence is most likely too rare to justify the extra logic and memory for bookkeeping.  Instead, the relationship between the parent and the child is ignored, and the flags remain A_SPLIT, B_SPLIT, A_USED and B_USED. The parent chunk will continue to hold A_SPLIT and B_SPLIT, and the new parent chunk will have blank flags, ready for use.  Then, at an appropriate point in time (alloc? free? compact?), each chunk can be checked for children and, if needed, discarded.

In other words, a new parent will be created on each merge without respect to the previous parent of the merged nibbles, except when the nibbles come from the same chunk. Garbage collection happens at a later point, to keep the logic simpler.