Monday, 9 September 2013

Fine-grained locking not feasible due to time constraints

Calculating lock status per handle and op takes too long to be feasible.  It would give a better picture of how memory is actually used, but unfortunately it cannot be calculated in anything close to real time - even pushing it out to many workers on a fast CPU takes well over two days and yields an amount of data proportional to the number of ops.

Coming up with a better way to calculate "micro" lifetime is something for the future-work section.

Sunday, 8 September 2013

Memory block access locking heuristics

Locking and unlocking objects is the main reason for rmalloc's existence. Not applying locking in benchmarking will skew results favorably, since defragmentation will always be perfect.

Figuring out when an object should be locked within its lifetime ("micro lifetime") is tricky. What I've done so far with lifetime ("macro lifetime") is to compute a ratio of own ops vs. others' ops, producing a number between 0 and 1 that gives an average over the entire lifetime.  With a threshold at 50%, we can say that all objects below the threshold are never locked and all objects above it are always locked.

That's a gross simplification. What I want to do with the micro lifetime is to be able to lock/unlock the object several times within the object's lifetime. Doing so can be reduced to the following problem: in what intervals does an object T "dominate" over non-T objects, e.g. in a series of ops looking like this:


Using fixed-width windows, this is easy. But that's not what I want to do.  Thinking first about how to make the windows dynamic in length, I thought about letting the window "float" and "sink" depending on the current op.  After a bit of tinkering, I came up with:

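def dominance(xs):
    # NEEDLE, FLOAT_SPEED and SINK_SPEED are module-level constants
    life = 0
    dom = []
    for x in xs:
        if x == NEEDLE:
            life += FLOAT_SPEED   # own op: the window floats
        else:
            life -= SINK_SPEED    # other op: the window sinks
        if life < 0:
            life = 0

        if life > 0: dom.append(1)
        else: dom.append(0)

    return dom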

Applying this to a series of random values [0, 3], I get the following pretty graph:

Alas, this needs to be done for each needle (i.e. handle) during its lifetime. That's horribly slow (O(n^2)), and my previous optimization of caching the number of ops within an object's entire lifetime is not possible, since I want the data explicitly for each point in time, i.e. each operation.  Luckily, Python has a multiprocessing module that creates a pool of workers, to which I can hand each object's [start, end] period, i.e. from malloc to free. This makes things faster, and according to my preliminary results it should take about six hours on an Intel i7 quad-core to calculate the locking list for all objects. Workers save their output into one file per object, which I can later pick up and use for the locking benchmark.
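The worker-pool idea could be sketched roughly as follows. This is an illustration under assumptions, not the actual tool: the names lock_list and lock_lists are made up, the speed values are placeholders, dominance takes the needle as a parameter, and results are returned in memory instead of being written to one file per object.

```python
import multiprocessing

# Assumed placeholder values; the real speeds are not given in the text.
FLOAT_SPEED = 1
SINK_SPEED = 1


def dominance(xs, needle):
    """Return 1 for each op where `needle` is afloat (life > 0), else 0."""
    life = 0
    dom = []
    for x in xs:
        if x == needle:
            life += FLOAT_SPEED
        else:
            life -= SINK_SPEED
        life = max(life, 0)
        dom.append(1 if life > 0 else 0)
    return dom


def lock_list(job):
    """Worker (hypothetical name): lock list for one handle over its ops slice."""
    handle, ops_slice = job
    return handle, dominance(ops_slice, handle)


def lock_lists(ops, periods, workers=1):
    """periods maps handle -> (start, end), inclusive indices into ops."""
    jobs = [(h, ops[start:end + 1]) for h, (start, end) in periods.items()]
    if workers == 1:
        # Serial fallback, handy for small inputs and for debugging.
        return dict(map(lock_list, jobs))
    with multiprocessing.Pool(workers) as pool:
        return dict(pool.map(lock_list, jobs))
```

Each job carries its own slice of the ops list, which keeps the workers independent at the cost of copying data. For a trace of several GB, passing only the (start, end) indices and letting each worker read the ops file itself would probably be the more realistic choice.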

Friday, 6 September 2013

Making sense of histograms and fine-grained locking

The histograms are now properly generated. The main cause of the incorrect histograms was that matplotlib stores state in its convenience module pyplot.  Adding code to clear the drawing after each plot fixed the issue.  I can now visualize parts of histograms. This is a macro graph of coverage from 1%-100% (since 0% is a huge spike which skews the diagram):

I've shown this before.  Again, we clearly see a big divide between long-lived objects and short-lived objects. This is good! Zooming in on the even more short-lived objects between 0% and 0.1%:

Okay, there's a huge amount of extremely short-lived objects, pushing down all the others.  Just moving slightly past those, into 0.01% to 2%, we see a more nuanced view:

Still, most objects are clustered into <= ~20% or >= 80%, with not many being equally used, and a lot of them with essentially no lifetime at all. This makes it easier to reason about locked objects.

This leads us to the next problem.  While I can get the lifetime of a handle and determine whether it's locked or not, I'd prefer to get finer-grained locking information.  Specifically, for each point in time I need to store whether a handle is in a locked or unlocked state.  This'll be a boolean value, in the form of:

locking[handle] = [bool]*op_count_during_handle_lifetime

The bool will most likely be a threshold from the activity measurement (own ops / other ops, most likely), but I also hope to see a pattern where activity is clustered in <<50% and >>50%.  While I might not come up with something better than what I have now (locked during its entire lifetime), the structure will still serve me when doing rmalloc benchmarking. (Just with an array filled with True).
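A minimal sketch of how the coarse "locked during its entire lifetime" structure might be filled in from the macro lifetime. The function names are hypothetical, and normalising the ratio as own / (own + others) is an assumption made here so that the value stays between 0 and 1.

```python
THRESHOLD = 0.5  # the 50% cut-off used for macro lifetime


def macro_ratio(own, others):
    """Share of the ops during a handle's lifetime that hit the handle itself.

    Assumption: normalised as own / (own + others) to stay within [0, 1].
    """
    total = own + others
    return float(own) / total if total else 0.0


def build_locking(lifetimes):
    """lifetimes maps handle -> (own, others); returns handle -> [bool, ...]."""
    locking = {}
    for handle, (own, others) in lifetimes.items():
        locked = macro_ratio(own, others) > THRESHOLD
        # One entry per op during the handle's lifetime, all the same value.
        locking[handle] = [locked] * (own + others)
    return locking
```

A finer-grained heuristic would only need to replace the constant list with a per-op list of the same length, so the benchmark code reading locking[handle] would not have to change.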

Checking out after 6h58m,

Generating histograms of large data

Continuing from where I left off at yesterday's histogram generation.  It's now running on a large data set. Even though it's a lot quicker to run the application now (15 minutes vs probably 15 days on the Opera loading google.com dataset), it's still a bit of a chore to regenerate each time.

This matters especially since the data is going to be used as input for my measurements of rmalloc, specifically to decide when to insert lock/unlock operations.  Lifetime data is now cached as a JSON blob in a file to be used by other tools. The histogram tool checks the contents of that file, and if it can be loaded, it is used instead of the ops file.
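The cache-or-recompute behaviour could look something like the sketch below. The function name load_lifetimes and the compute callback are assumptions for illustration; the actual file names and layout of the JSON blob are not specified here.

```python
import json
import os


def load_lifetimes(cache_path, ops_path, compute):
    """Return lifetime data, preferring the JSON cache over the ops file.

    `compute` is a hypothetical callback that builds the lifetime data
    from the (much larger) ops file when no usable cache exists.
    """
    if os.path.exists(cache_path):
        try:
            with open(cache_path) as f:
                return json.load(f)
        except ValueError:
            pass  # cache could not be parsed: fall through and recompute
    lifetimes = compute(ops_path)
    with open(cache_path, "w") as f:
        json.dump(lifetimes, f)
    return lifetimes
```

One thing to keep in mind with JSON is that dictionary keys come back as strings and tuples come back as lists, so tools reading the cache have to convert handles and (own, others) pairs themselves.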

Checking out after 3h5m,

Wednesday, 4 September 2013

Generating histograms

Replaced the broken, but quick, handle lifetime generating code and got correct results from the ops file. Which is good, since I don't have to regenerate the ops file each time.

However, the histogram generation code is, as noted yesterday, naive and runs in O(n^2), which turns out to take a lot of time when processing 8 GB of data.  I modified the original code based on what I now had working, and got code that runs in O(n) instead, still generating the same results for the small dataset of LibreOffice opening an empty database file.

Checking out after 2h48m,

Tuesday, 3 September 2013

Three Week Sprint Starts Now

First up is cleaning up. Valgrind needs patching to build & run on the latest Ubuntu. The translators from the customized memcheck format need updating for the new format I switched to. The translators themselves need to be cleaned up and renamed.

It became a great mess because I had to do partial processing due to lack of available RAM and not being on a 64-bit system, and reinstalling my main computer's OS wasn't something I wanted to do. Processing the 8 GB Opera allocation memtrace (the output of instrumenting access/malloc/free in memcheck) turned out to require loads of RAM. Luckily, I now have access to a 64-bit system with lots of RAM for such tasks.

The remainder of today's goal is to fix histogram generation. There are two different versions: one that works but is slow, and one that is faster but does not yield the same results. I may discard the optimized version now that I have a quicker system to run the emulations on. After that, I'm back to a state where I can automatically regenerate all plots and graphs from a memtrace run.

Work so far is pushed to the repository as https://github.com/mikaelj/rmalloc/commit/ab98b32ce9a2a67cf21e51b62b352fc7a08d2d7e

Checking out after 8h15m,

Friday, 10 August 2012

Large-scale data processing

Running my modified Memcheck (of Valgrind) tool on Opera (12.x, latest as of 2012-08-08) gives 12 GB of output, which is about 500M lines in the following format:

('new', 16, 0x41c3028)
('store', 0x41c3028, 4)
('store', 0x41c3030, 4)
('store', 0x41c302c, 4)
('new', 16, 0x41c3068)
('store', 0x41c3068, 4)
('store', 0x41c3070, 4)
('store', 0x41c306c, 4)
('new', 16, 0x41c30a8)

Valid operations are new, free, store, load, modify.  Each new request is stored in a hash, to which subsequent memory access ops are then mapped.  The data collected is then used to generate a C program with a fixed number of handles and a long list of lock/unlock/free/new calls into the C allocator being benchmarked. This is for comparison between cmalloc and other allocators.  The collector program (translate.py) evals each line and does processing on it, very simple.
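As an illustration of reading one trace line, here is a small sketch. It swaps plain eval for ast.literal_eval, which only accepts literals, so it is not what translate.py does; the function name parse_op is made up.

```python
import ast


def parse_op(line):
    """Parse one trace line such as "('new', 16, 0x41c3028)" into a tuple.

    Uses ast.literal_eval instead of eval: it handles the tuple, string
    and hex integer literals in the trace but refuses arbitrary code.
    """
    op = ast.literal_eval(line.strip())
    if not isinstance(op, tuple) or not op:
        raise ValueError("not an op tuple: %r" % (line,))
    return op
```

Calling parse_op on the first sample line returns a tuple whose first element is the operation name, followed by the integers from the line, with the hex address already converted to an int.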

Python too slow? Nope.

The problem is that Python is very slow and uses too much memory, which my 4GB Core i3 laptop can't handle - translate.py only works for small-ish outputs. This is because the list of handles is checked for each memory access: scanning a list of approximately 2'000 entries for each of ~500M memory accesses quickly becomes unusable.   I tried various approaches, such as moving the code to Cython (a fork of Pyrex), which translates the Python code into C and builds it as a Python extension module (a regular .so), but doing only that did not speed things up.

Brute-force hashing!

Hashing on the start and end address was suggested by @bobf: each access address would be decremented towards start and incremented towards end, and when both values hit a hash entry and the entries were the same (i.e. the same memory handle), the lookup would be done.  That was even slower than iterating through 2000 elements, because the hash has to be checked on average once per allocated byte in the memory area.

Finally, I came up with a brute-force solution: hash all addresses within the requested memory area - from start to end - mapping each address to the corresponding memory handle.  This proved to be fast, but blew up with a MemoryError at about 2 GB of data read (out of 12 GB in total). I was ready to investigate a key-value store (i.e. hash) like Redis, but it's in-memory only.
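The brute-force mapping amounts to something like the following sketch (function names are hypothetical). Every byte address of a block gets its own dictionary entry, which makes lookups a single hash access but also explains the memory use.

```python
def register_block(address_to_handle, handle, start, size):
    """Map every byte address in [start, start + size) to `handle`."""
    for address in range(start, start + size):
        address_to_handle[address] = handle


def lookup(address_to_handle, address):
    """Return the handle owning `address`, or None if it is not tracked."""
    return address_to_handle.get(address)
```

For the 16-byte block from the sample trace, this creates 16 entries; a store to any address inside the block then resolves to the same handle in one lookup.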

Yup. Or maybe. Still need more RAM though.

My server with 8GB RAM has swap enabled, but by default Ubuntu 10.04 LTS doesn't over-commit memory. Setting /proc/sys/vm/overcommit_memory to 1 effectively enables swap for application memory allocation, if I've understood it correctly. I've just restarted the application, which is chugging away at 2.3G data read, with 336M physical RAM free and 0 bytes swap space used.

Or maybe, just maybe, 64-bit OS.

So, what I've realized is that the problem is, of course, that using a 32-bit system to allocate data larger than 4GB doesn't work very well.  Installed a 64-bit Ubuntu LiveCD on a USB stick and did post-processing from that end.

However, it's not good enough. Calculating the handle mappings can be done in one pass, but also including all ops (mapped to handles, instead of pointers) will not fit in memory. Therefore, my nice and handy post-processing script that does everything in one pass does not cut the mustard.   I'm splitting it up into several parts, where each one does one specific thing:
  • map addresses to handles and write out ops (mapped to handle) to a file
  • read ops file, pruning duplicate ops (e.g. two or more successive L/M/S to the same handle) and write out malloc C source
  • read ops file, calculate handle lifetime for histogram
That's what it does for now.  
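The pruning step in the second part could be sketched as below. The single-letter op codes (N, F, L, M, S), the (kind, handle) tuple layout and the choice to keep the first op of each run are assumptions for illustration.

```python
from itertools import groupby

ACCESS = {"L", "M", "S"}  # load, modify, store (assumed op codes)


def prune_successive(ops):
    """Collapse runs of successive access ops on the same handle into one op.

    ops is an iterable of (kind, handle) tuples.
    """
    key = lambda op: (op[1], op[0] in ACCESS)
    for (handle, is_access), run in groupby(ops, key=key):
        run = list(run)
        if is_access:
            yield run[0]       # keep only the first access of the run
        else:
            for op in run:     # new/free ops are never merged
                yield op
```

Since groupby only merges adjacent items, accesses to the same handle that are separated by an op on another handle are kept, which is what a lock/unlock sequence needs.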

More on lifetime

The lifetime calculation could be more elaborate. For now it is fairly naive in that it only checks for really long-lived areas, but it could also be set up to scan for "sub-lifetimes", i.e. module-global.  My guess is that it would look like the histogram data above (spikes), but located in the middle.  Calculating that would mean that start and end points for calculating lifetime would be sliding, such that end is fixed and start moves towards end, and the other way around, where start is fixed and end moves towards start.  Storing each value takes up lots of memory, and analyzing the end result by hand takes a very long time since one would have to look at each histogram.

The current histogram is plotted for lifetime, which is already calculated. A plot showing ops per handle over time (3D graph: ops, handle, time) could possibly give useful information about the clustering of ops and handles, which in turn could be used for calculating new lifetimes.  I'll do this if time allows; otherwise it is left for future work, since I'm not quite sure what to plot to give the most useful information, and how much it would affect real-life lock/unlock patterns.

Lifetime calculations too slow, alas

Lifetime is defined as the number of ops on a handle divided by the number of ops on all other handles during that handle's lifetime.  Each handle is mapped to a tuple (own, others), and for each operation either own or others is incremented, until the handle is freed, at which point it's moved to the set of inactive handles. This means going through all handles for each operation, which for smaller datasets would be OK. Again, we don't have that luck: the Opera data set is about 8GB of data. Even after removing duplicates (two successive ops on the same handle), this quadratic O(m*n) algorithm (m = ops, n = live handles) takes too long.

Instead, keep a counter of ops so far (ops_counter) and for each handle, store the tuple (own, <value of ops_counter at New>), and only increase the "own" value for ops mapping to a handle. Then, at death (free), calculate the "others" value by calculating ops_counter - own - cached_ops_counter. Example, with ops counter, set of alive, set of dead:

20 | {(a 5 0) (b 2 5) (c 10 7) (d 3 17)} | {}, (death b) =>
20 | {(a 5 0) (c 10 7) (d 3 17)} | {(b 2 20-5-2=13)}, (death a) =>
20 | {(c 10 7) (d 3 17)} | {(b 2 13) (a 5 20-5-0=15)}, (death d) =>
20 | {(c 10 7)} | {(b 2 13) (a 5 15) (d 3 20-17-3=0)}, (new e) =>
25 | {(c 10 7) (e 5 20)} | {(b 2 13) (a 5 15) (d 3 0)}, (new f) =>
28 | {(c 10 7) (e 5 20) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0)}, (death e) =>
28 | {(c 10 7) (f 3 25)} | {(b 2 13) (a 5 15) (d 3 0) (e 5 28-20-5=3)}

At the end, any remaining live handles (due to missing frees) are moved to the dead set. This algorithm is O(m + n): one pass over the m ops plus a final sweep over the remaining live handles, instead of O(m*n).
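The counter-based scheme might be written roughly like this. It is a sketch under assumptions: ops are (kind, handle) tuples, only access ops advance ops_counter, and the function name is made up.

```python
def macro_lifetimes(ops):
    """Return handle -> (own, others) in a single pass over the ops.

    ops is an iterable of (kind, handle), kind being 'new', 'free' or
    any access op. Assumption: only access ops advance the op counter.
    """
    ops_counter = 0
    alive = {}  # handle -> [own, ops_counter when the handle was created]
    dead = {}   # handle -> (own, others)
    for kind, handle in ops:
        if kind == "new":
            alive[handle] = [0, ops_counter]
        elif kind == "free":
            own, born = alive.pop(handle)
            dead[handle] = (own, ops_counter - own - born)
        else:
            ops_counter += 1
            alive[handle][0] += 1
    # Handles that were never freed are closed at the end of the trace.
    for handle, (own, born) in alive.items():
        dead[handle] = (own, ops_counter - own - born)
    return dead
```

Each op does a constant amount of dictionary work, so the per-op scan over all live handles disappears; the "others" value is only computed once per handle, at death.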

Histogram of macro lifetime, 100'000 = full program