from __future__ import *


2004-08-24

Forget Spreadsheet::ParseExcel!

Filed under: java, perl, python — bob @ 10:32 pm

I've been working on some automation scripts to take data out of Excel and do useful things with it, and I hit a big stumbling block with Spreadsheet::ParseExcel. Unicode SUCKS in Perl, and Spreadsheet::ParseExcel does nothing at all to help you with that. Each cell gets its own encoding: 'ucs2', '_native_' (which I haven't seen), or simply undef, which seems to mean latin-1. Anyway, it's completely bogus, so I started shopping around for another implementation.

Andy Khan's JExcelApi does the trick, and it is light-years more correct and faster than the alternatives I have tried (other than the time it takes a JVM to start). Not only that, but by default the jar does exactly what I want it to do. It gets the Unicode right, and everything worked perfectly the first time. My dealings with Excel files have been reduced to the following:

java -jar -Djxl.encoding=latin1 jxl.jar -xml EXCELFILE.xls
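If you would rather not shell out by hand, here is a minimal sketch of driving that same command from Python. It assumes `java` is on the PATH, that `jxl.jar` sits in the current directory, and that the XML is written to standard output. The helper names `build_jxl_command` and `excel_to_xml` are made up for illustration and are not part of JExcelApi.

```python
import subprocess


def build_jxl_command(xls_path, jar_path="jxl.jar", encoding="latin1"):
    """Build the argument list for the command line shown above.

    jar_path and encoding defaults are assumptions taken from that command.
    """
    return [
        "java",
        "-jar",
        "-Djxl.encoding=%s" % encoding,
        jar_path,
        "-xml",
        xls_path,
    ]


def excel_to_xml(xls_path, jar_path="jxl.jar", encoding="latin1"):
    """Run the jar and return whatever it wrote to stdout, as raw bytes.

    The bytes are returned undecoded so that the XML parser can work out
    the encoding itself. Raises CalledProcessError if java exits non-zero.
    """
    result = subprocess.run(
        build_jxl_command(xls_path, jar_path, encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

The returned bytes can be handed straight to `minidom.parseString`, which saves writing the XML to a temporary file first.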

And the Python code to parse the workbook XML document from jxl looks roughly like this:

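from xml.dom import minidom

def parseDocRows(doc):
    # Yield one list of cell strings per <row>, skipping rows with no <col>.
    for row in doc.getElementsByTagName(u'row'):
        rowdata = [
            # A cell's text may be split across several child nodes; join them.
            u''.join([x.nodeValue for x in col.childNodes])
            for col in row.getElementsByTagName(u'col')]
        if rowdata:
            yield rowdata

if __name__ == '__main__':
    import sys
    # minidom.parse accepts a filename or an open file object.
    for row in parseDocRows(minidom.parse(sys.argv[1])):
        print(row)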
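To see what the row generator produces without a spreadsheet at hand, here is a small self-contained sketch that runs the same logic over an inline document. The sample XML is invented for illustration: only the `row` and `col` tag names come from the code above, and the real jxl output may carry extra elements or attributes. The function is restated as `parse_doc_rows` so the block runs on its own.

```python
from xml.dom import minidom

# Hypothetical sample; the real jxl output may be shaped differently.
SAMPLE = u"""<workbook>
  <sheet>
    <row><col>name</col><col>city</col></row>
    <row><col>Zo\u00eb</col><col><![CDATA[K\u00f6ln]]></col></row>
    <row></row>
  </sheet>
</workbook>"""


def parse_doc_rows(doc):
    """Yield a list of cell strings for every <row> that has <col> children."""
    for row in doc.getElementsByTagName("row"):
        rowdata = [
            # Text and CDATA children both expose their content as nodeValue.
            "".join(node.nodeValue for node in col.childNodes)
            for col in row.getElementsByTagName("col")
        ]
        if rowdata:
            yield rowdata


# Parse from UTF-8 bytes so the parser handles the decoding.
rows = list(parse_doc_rows(minidom.parseString(SAMPLE.encode("utf-8"))))
```

Here `rows` ends up with two entries, the header row and the row with the accented names. The empty third row is dropped by the `if rowdata` check, and the non-ASCII characters come through as ordinary Unicode strings.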

Thanks!
