from __future__ import *

Erlang Binary Performance

September 21, 2006 at 02:28 PM | categories: python, c, erlang

I was benchmarking egeoip today, which is my from-scratch Erlang geolocation library. It uses the MaxMind GeoLite City database, which has implementations in a bunch of other languages so it's great to compare with. The results were rather surprising to me, because I hadn't previously done any benchmarking of Erlang performance.

The test environment is a MacBook Pro 2ghz, Mac OS X 10.4.7, Erlang R11B-1 w/ HiPE enabled, Python 2.4.3 (using their GeoIP Python API, which is written in C). I do have other processes running (namely iTunes), but the benchmark is fair because the background load is consistent throughout the tests.

Erlang, BEAM:
~13k geolocations/sec
Python/C:
~18k geolocations/sec
Erlang, HiPE:
~44k geolocations/sec

As you can see, Erlang holds its own against Python w/ C extensions, and it can mop the floor with it when using the HiPE compiler. Erlang clearly kicks some serious ass at working with binaries, both in syntax and performance. The only work I had to do to make it faster was c(egeoip, [native]).

Note that I've only been using Erlang for a few weeks and have not done any profiling or performance tuning at all beyond what I assumed would be the fastest way given the documentation I had read.

Update

After looking at Shark results across the two implementations, it seems that the GeoIP API default settings are pessimistic for benchmarking purposes and that most of the time was spent in syscalls (Erlang looked like its time was spent in GC). A fair comparison would be using the memory cache option, which gets even better performance.

Python/C (Memory Cache):
~117k geolocations/sec

This is a lot more in line with what I expected, but I'm still impressed that Erlang w/ HiPE can get nearly 40% of the speed of C when scanning through a 25MB array of bytes. I'm pretty sure I can make some algorithmic improvements to the code (which the C implementation may or may not do), so we'll see how close I can get.

Update

After spending a while with eprof doing some profile-driven optimizations, I was able to speed up the Erlang code considerably. The biggest BEAM optimization was moving the giant tuples out of function bodies. Apparently BEAM is rather naive about that and, in certain cases, actually creates and garbage collects them on every call. Other optimizations went into the way it looks for null terminators, plus a hyper-optimized fast path for IPv4 string to long conversion.
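For readers who haven't written one, this is roughly what an IPv4 string to long conversion computes. It is a plain Python sketch, not the Erlang fast path from egeoip (which isn't shown here), and the function name is made up for illustration.

```python
def ipv4_to_long(address: str) -> int:
    """Convert a dotted-quad IPv4 string to an unsigned 32-bit integer.

    Hypothetical helper for illustration; not taken from egeoip.
    """
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four octets: {address!r}")
    value = 0
    for part in parts:
        # Accept plain ASCII digits only (no signs, spaces, or empty octets).
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"bad octet {part!r} in {address!r}")
        octet = int(part)
        if octet > 255:
            raise ValueError(f"octet out of range in {address!r}")
        # Shift the running value left one byte and append this octet.
        value = (value << 8) | octet
    return value


print(ipv4_to_long("127.0.0.1"))  # 2130706433
```

The resulting integer is the kind of key a geolocation database lookup would typically work from.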

Given the API I could cheat and parse out some of the data when the user asks for it, rather than at record fetch time. This would make the benchmark incredibly fast, but it would be an unfair comparison with the Python/C version. I'll probably end up doing that anyway, since I'm typically looking for just the country of an IP address.

I still haven't really done any algorithmic optimizations to the lookup, but here's the numbers:

Erlang, BEAM:
~44k geolocations/sec
Erlang, HiPE:
~64k geolocations/sec

This brings the BEAM performance up to about 38% of C/Python and HiPE up to 55%. Not bad!
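The percentages are just the ratio of each Erlang figure to the memory-cached Python/C figure. A quick Python check using only the numbers quoted in this post (the variable and function names are mine):

```python
# Throughput figures quoted above, in thousands of geolocations per second.
python_c_memory_cache = 117
erlang_beam = 44
erlang_hipe = 64


def percent_of(baseline: float, value: float) -> float:
    """Return value as a percentage of baseline."""
    return 100.0 * value / baseline


print(f"BEAM: {percent_of(python_c_memory_cache, erlang_beam):.1f}%")  # BEAM: 37.6%
print(f"HiPE: {percent_of(python_c_memory_cache, erlang_hipe):.1f}%")  # HiPE: 54.7%
```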


Universal Binaries with gcc and Xcode

June 11, 2005 at 11:16 AM | categories: universal binaries, c, macosx

Versions of Mac OS X prior to 10.4 did support universal binaries, even though the gcc that shipped with them could not build them, so it is possible to compile applications that will work on both 10.3 ppc (Panther) and 10.4 i386 (Tiger). To build Panther-compatible universal binaries, you'll need to use the following Custom Build Settings in Xcode 2.1 or later...

Per-architecture SDK Support for Universal Binaries in Xcode. This is according to documentation, but I couldn't get it to work:

SDKROOT_ppc = /Developer/SDKs/MacOSX10.3.9.sdk
SDKROOT_ppc64 = /Developer/SDKs/MacOSX10.4u.sdk
SDKROOT_i386 = /Developer/SDKs/MacOSX10.4u.sdk

I'm not sure how to translate per-architecture SDKs to gcc settings, but to build single-SDK Universal Binaries with GCC, there are some undocumented (not in the compiler man page, anyhow) flags that you can use. These flags are only available in GCC 4.0 and later, and probably only with Apple's toolchain.

Universal Binaries with GCC (in Makefile syntax):

SDK=/Developer/SDKs/MacOSX10.4u.sdk
CFLAGS= -isysroot ${SDK} -arch ppc -arch ppc64 -arch i386
LDFLAGS= -isysroot ${SDK} -Wl,-syslibroot,${SDK}

Using lipo(1) to do per-architecture SDK builds with gcc should be possible, but painful. Hopefully it's not the only way! However, that's exactly what Xcode 2.1 does.
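Whichever way the binary gets built, it's worth checking which architectures actually ended up in it. lipo -info will tell you, but the fat header is simple enough to read directly. The sketch below is my own illustration in Python, assuming the usual fat header layout (a big-endian magic number and count, followed by one 20-byte entry per architecture). The function name and the small CPU table are hypothetical and only cover the three architectures discussed here.

```python
import struct

FAT_MAGIC = 0xCAFEBABE       # magic number at the start of a fat (universal) file
CPU_ARCH_ABI64 = 0x01000000  # flag OR'ed into the CPU type for 64-bit variants

# Only the architectures mentioned in this post; anything else prints as a number.
CPU_NAMES = {
    7: "i386",
    18: "ppc",
    18 | CPU_ARCH_ABI64: "ppc64",
}


def fat_architectures(data: bytes) -> list:
    """Return architecture names from a fat header, or [] if data is not fat.

    Note that Java class files share the same magic number, so this is
    only a sanity check for files you already expect to be Mach-O.
    """
    if len(data) < 8:
        return []
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        return []
    names = []
    for index in range(count):
        # Each entry: cputype, cpusubtype, offset, size, align (all big-endian).
        cputype, _subtype, _offset, _size, _align = struct.unpack_from(
            ">iiIII", data, 8 + index * 20
        )
        names.append(CPU_NAMES.get(cputype, f"cputype {cputype}"))
    return names
```

To use it, read the first few hundred bytes of the built binary and pass them in, e.g. fat_architectures(open(path, "rb").read(512)).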


Python on Mac OS X (Intel)...

June 06, 2005 at 05:49 PM | categories: c, python, macosx, py2app, universal binaries, PyObjC

Python on Mac OS X for Intel is not going to be a seamless transition. Unlike Mathematica, there is going to be a lot more than 2 hours of effort involved. Why?

  • Python uses autoconf. Autoconf is not Xcode. It does not have a checkbox for universal binary compatibility. Autoconf is a PITA.
  • PyObjC depends on libffi. libffi doesn't know what the Mach-O calling convention for x86 is, so it doesn't work. libffi is very deep, magical, scary code. This has been mostly solved by Ronald over the past few days in the universal binaries lab.
  • mach_inject (which PyObjC's objc.inject uses) depends on injecting code into other processes. On an OS that can be running code for two different architectures, how the heck does that work? How do you know which pid is running which architecture? Anyway, injecting from x86-x86 isn't going to work because mach_inject doesn't know x86 yet.
  • py2app's macholib doesn't really support fat binaries yet. It understands them well enough to do the right thing with the PPC portion of the Mach-O header, but it ignores other architectures. It will need a semi-major refactoring in order to support this cleanly.
  • Bgen stuff breaks. Specifically, all of your Carbon code that deals with four character codes is going to be broken due to endianness issues. Yet another reason to stay the heck away from this stuff.
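To make the four character code problem concrete: a code like 'TEXT' is really a 32-bit integer, and the same four bytes mean different integers depending on byte order. This is a sketch of the general issue using Python's struct module, not of what the bgen-generated code actually does.

```python
import struct

code = b"TEXT"

# Interpreting the same four bytes with each byte order gives different integers.
as_big_endian = struct.unpack(">I", code)[0]
as_little_endian = struct.unpack("<I", code)[0]

print(hex(as_big_endian))     # 0x54455854
print(hex(as_little_endian))  # 0x54584554

# Going the other way: the integer a PPC build means by 'TEXT', written out
# in little-endian byte order, comes back with its characters reversed.
print(struct.pack("<I", as_big_endian))  # b'TXET'
```

Code that always converts with an explicit byte order (">I" here) behaves the same on both architectures; code that relies on the machine's native order does not.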

So, thanks Apple, for giving us weeks' worth of very hard and uninteresting work to do. The least you could've done is put out a nice new laptop that I could buy to do this work on :)

I'm also not terribly interested in renting an Intel development machine from Apple so that I can do work that helps them more than me. If they sold it, or stated that we would get a grand worth of credit towards a real Intel machine when they're available, then I wouldn't complain. But no, we get to rent a machine from them, that we can't really talk about, publish benchmarks of, move to a location other than the shipped address (?!), etc. They also say that they're under no obligation to fix it if it breaks, and there are no refunds. Sweet deal!


Mac OS X 10.4 (Tiger)'s copyfile()

May 03, 2005 at 08:40 AM | categories: c, macosx

All of the snazzy ACL-and-metadata-copying features in Darwin 8 / Tiger live in one function, copyfile(). Unfortunately, Apple forgot to include the header for it in the version of Xcode that ships with Mac OS X 10.4.

It is defined in libSystem, so you can use it if you know its prototype, and you can pick up copyfile.h (may not be the newest!) from the Libc package in the Darwin releases.
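Knowing the prototype is all it takes to call the function from anywhere that can load libSystem. Here is a hedged sketch using Python's ctypes. The prototype and flag values are what I understand copyfile.h to declare (int copyfile(const char *from, const char *to, copyfile_state_t state, copyfile_flags_t flags)), so check them against your copy of the header. The wrapper name is hypothetical.

```python
import ctypes
import os
import sys

# Flag values as I understand them from Darwin's copyfile.h (an assumption;
# verify against the header you actually have).
COPYFILE_ACL = 1 << 0
COPYFILE_STAT = 1 << 1
COPYFILE_XATTR = 1 << 2
COPYFILE_DATA = 1 << 3
COPYFILE_ALL = COPYFILE_ACL | COPYFILE_STAT | COPYFILE_XATTR | COPYFILE_DATA


def copy_with_metadata(src: str, dst: str, flags: int = COPYFILE_ALL) -> None:
    """Call libSystem's copyfile() directly; hypothetical wrapper name."""
    if sys.platform != "darwin":
        raise OSError("copyfile() is only available on Mac OS X / Darwin")
    libsystem = ctypes.CDLL("/usr/lib/libSystem.B.dylib", use_errno=True)
    copyfile = libsystem.copyfile
    # Declare the prototype by hand, since we are not using the header.
    copyfile.argtypes = [
        ctypes.c_char_p,   # const char *from
        ctypes.c_char_p,   # const char *to
        ctypes.c_void_p,   # copyfile_state_t state (NULL is allowed)
        ctypes.c_uint32,   # copyfile_flags_t flags
    ]
    copyfile.restype = ctypes.c_int
    if copyfile(os.fsencode(src), os.fsencode(dst), None, flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The same idea applies in C: declare the prototype yourself and link against libSystem as usual.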

Since this isn't a static library, and probably depends on New Stuff, using it in a backwards compatible way (i.e. to get resource-fork copying in Panther) probably isn't going to work. Oh well.


SWIG hate

March 11, 2005 at 04:28 PM | categories: python, c

SWIG should definitely win the prize for worst implementation of a good idea, ever. Automatic wrapping of code sounds great, right? Well, it could be, but it's most certainly not. To the point where I can only recommend SWIG to people that I really don't like.

Why must they break things in EVERY SINGLE CONSECUTIVE RELEASE?! Do they hate us? Are they secretly fighting to keep C as the first and foremost language in open source development?

I'm thankful that I have always had the common sense to avoid using SWIG for my own projects, but I've had more than my own share of headaches maintaining software that depends on the beast, like PyOpenGL. As you can probably guess, that's what I'm dealing with right now. Although gouging my eyes out with a toothpick would be more pleasurable. I hope the developers of SWIG don't put it on their resume, 'cause it sure wouldn't get them hired anywhere I work.


Altivec Kicks Ass

February 24, 2005 at 05:07 AM | categories: python, c, macosx, pygame

I started screwing around with enhancing Ryan Gordon's AltiVec patch for SDL last weekend. I ended up spending more time on it than I should've, but I think I've optimized all of the 32bit-32bit blits. Even though my AltiVec is probably pretty naive, I'm seeing a ~3-4x speed increase over the scalar code. And the scalar code is probably about as optimized as it's going to get, with duffs, etc.

Another thing I noticed is that my G5 really, really, really beats the snot out of my G4 powerbook for this stuff. For some of the blits, the 2ghz G5 runs about 5.65x faster than the 1ghz G4. That's nearly three times as much work per cycle! This is probably due to the bus speed and memory bandwidth more than anything else, but it's impressive nonetheless.

Now I really want a G5 powerbook. Even if it were clocked at 1ghz like my current G4, it seems like there would still be a large difference in performance.

Once I clean up the patch and do some more testing (read: hopefully this weekend), I'll do a release of pygame that will include this enhanced version of SDL. pygame is overdue for an update, anyway.


Advanced Debugging Techniques: ktrace

February 04, 2005 at 12:29 AM | categories: debugging, c, python, macosx, py2app, wxPython

I had spent the past few days on and off trying to help a py2app user with a very hairy problem: when wxPython 2.4.2.4 was bundled, the main menu didn't work. Running the application as an alias bundle or with pythonw worked just fine, but when built as standalone or semi-standalone, the menu no longer showed up.

Right off the bat I (correctly) assumed it was a problem with wxPython or his sample code, because py2app doesn't do anything that would cause this sort of behavior. It does link to Cocoa, but it doesn't call into any AppKit functionality unless it needs to display an error message. Somewhat trivially converting his source to work with wxPython 2.5 solved this issue, so I was rather stumped.

I had a hunch that perhaps there was something that wxPython 2.4.2.4 wants that didn't end up in the bundle, so I broke out ktrace. Using ktrace is rather simple:

% ktrace ./dist/test.app/Contents/MacOS/test

This will create a file in the current directory, ktrace.out, with a log of just about everything that happened. For efficiency, this file is in a binary format that you must process with kdump. The output of kdump is quite lengthy, but it looks like this:

15121 ktrace   CALL  execve(0xbffffd1b,0xbffffc84,0xbffffc8c)
15121 ktrace   NAMI  "dist/test.app/Contents/MacOS/test"
15121 ktrace   NAMI  "/usr/lib/dyld"
15121 test     RET   execve 0

CALL is the actual system call, NAMI is a name to inode translation that the system call used, and RET is the value returned to the application. If there was an error during the system call, kdump will gladly tell you everything you wanted to know:

15121 test     CALL  open(0xbfffe7f0,0,0x1b6)
15121 test     NAMI  "/Library/Preferences/org.pythonmac.unspecified.test.plist"
15121 test     RET   open -1 errno 2 No such file or directory

Since I suspected that wxPython was missing a file, I wanted to narrow down the output, so I naturally used grep on the kdump output to find errors:

% kdump | grep -B 2 errno | grep wx
15121 test     NAMI  "/Users/bob/Desktop/simple/dist/test.app/Contents/Resources/wxPython"
15121 test     NAMI  "/Users/bob/Desktop/simple/dist/test.app/Contents/Resources/wxPython.so"
15121 test     NAMI  "/Users/bob/Desktop/simple/dist/test.app/Contents/Resources/wxPythonmodule.so"
15121 test     NAMI  "/Users/bob/Desktop/simple/dist/test.app/Contents/Resources/wxPython.py"
....

Unfortunately, what we're looking at here (and for the next 150 or so lines) is primarily just Python doing its module import search. Since I knew that all of sys.path pointed to locations under Resources, I could just filter those out. It is extremely unlikely that wxPython wants anything from there:

% kdump | grep -B 2 errno | grep wx | grep -v Resources
15121 test     NAMI  "/Users/bob/Desktop/simple/dist/test.app/Contents/MacOS/../Frameworks/libwx_mac-2.4.0.rsrc"
15121 test     NAMI  "/libwx_mac-2.4.0.rsrc"

There it is! So now I just need to make the setup.py look like the following:

from distutils.core import setup
import py2app
setup(
    app = ['test.py'],
    data_files = [
        ('../Frameworks', ['/usr/local/lib/libwx_mac-2.4.0.rsrc']),
    ],
)

Now the application works as expected. py2app 0.1.9 will include a recipe for wxPython to make sure that this file ends up in the bundle automagically, among other new features.
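If you end up doing this often, the grep pipeline can be approximated in a few lines of Python. This is a sketch of my own, assuming kdump output shaped like the samples above (pid, command, record type, details). Unlike grep -B 2, it collects every NAMI line between a CALL and a failing RET, and it does not try to untangle records interleaved from several processes. The function name and parameters are hypothetical.

```python
import re

# pid, command name, record type, and the rest of the line.
RECORD = re.compile(r"^\s*(\d+)\s+(\S+)\s+(CALL|NAMI|RET)\s+(.*)$")


def failed_paths(kdump_lines, include="", exclude=None):
    """Return NAMI paths from system calls whose RET line reports an errno."""
    pending = []   # paths seen since the most recent CALL
    failures = []
    for line in kdump_lines:
        match = RECORD.match(line)
        if not match:
            continue  # skip record types this sketch does not handle
        kind, rest = match.group(3), match.group(4)
        if kind == "CALL":
            pending = []
        elif kind == "NAMI":
            pending.append(rest.strip().strip('"'))
        elif kind == "RET":
            if " errno " in rest:
                failures.extend(pending)
            pending = []
    return [
        path for path in failures
        if include in path and (exclude is None or exclude not in path)
    ]
```

With the output of kdump saved to a file, failed_paths(open("kdump.txt"), include="wx", exclude="Resources") plays the role of the last pipeline above.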


Disabling a CPU with the CHUD framework

January 30, 2005 at 02:14 PM | categories: debugging, c, macosx

Xcode Tools has an optional component, CHUD Tools (Computer Hardware Understanding Development Tools), that consists of some useful performance tools and low-level hardware facilities. My Dual 2ghz G5 has been having some serious stability problems lately, with what I believe is a dying CPU or logic board. When I saw errant CPU messages in the system log after experiencing inexplicable kernel panics and crashes, I decided to see what would happen if I toggled the second CPU off with the Hardware preference pane that ships with CHUD. It worked! My G5 is now usable (though I will of course still get it repaired, but it's not convenient to do so at this time).

Unfortunately, when I reboot the machine, this setting is lost and all bets are off as to whether I'll be able to disable the second CPU before the machine crashes, so I decided to look into what I could do. I opened up the Hardware preference pane nib with Interface Builder to see what message was sent to change the CPU count (unsurprisingly, setCPUCount:), then I used class-dump to find the implementation address of that message. I then did an otool disassembly of the Hardware preference pane (otool -tVv ...) so that I could see what the code looked like at that address. It called an inconspicuously named function, chudSetNumProcessors, from the CHUDCore.framework subframework of the umbrella CHUD.framework, which happens to ship with documented headers!

At first, I tried writing a simple C program that naively called right into chudSetNumProcessors, which returned an error code that I didn't expect (from the documentation): something about the kext not being loaded. I knew the kext was indeed loaded, because the Hardware preference pane works and I've used Shark recently, so I looked at the headers for initialization functions. Unsurprisingly, I needed to call chudInitialize before trying to talk to the CHUD kext, so I ended up with the following program:

/*
% cc -o setNumProcessors setNumProcessors.c -framework CHUD
*/

#include <stdio.h>
#include <CHUD/CHUD.h>

int main(int argc, char **argv) {
    int rval = 0;
    int status = chudInitialize();
    if (status != chudSuccess) {
        fprintf(stderr, "FATAL: Could not initialize chud, error %d\n", status);
        return -1;
    }
    if (argc == 2) {
        int cpuCount = 0; /* stays 0 (and is rejected below) if argv[1] is not a number */
        int curCPUCount = chudProcessorCount();
        int physCPUCount = chudPhysicalProcessorCount();
        sscanf(argv[1], "%d", &cpuCount);
        if (cpuCount < 1 || cpuCount > physCPUCount) {
            fprintf(stderr, "CPU count of %d not acceptable, expecting between 1 and %d\n", cpuCount, physCPUCount);
            rval = -1;
        } else {
            int res;
            res = chudSetNumProcessors(cpuCount);
            if (res != chudSuccess) {
                fprintf(stderr, "Could not change CPU count to %d, error %d\n", cpuCount, res);
                rval = -1;
            }
        }
    } else if (argc > 2) {
        fprintf(stderr, "Must take zero or one arguments\n");
        rval = -1;
    }
    printf("CPU Count: %d of %d\n", chudProcessorCount(), chudPhysicalProcessorCount());
    chudCleanup();
    return rval;
}

Now I can call this setNumProcessors application early on in the boot process and increase my odds of being able to use my computer on reboot!

UPDATE: rentzsch commented with a better solution. It's also possible to disable multiprocessing even earlier by twiddling a setting in Open Firmware (QA1141).
