from __future__ import *

2005-04-04

PyObjC and unicode

Filed under: PyObjC, macosx, python — bob @ 8:53 pm

There's a constant battle in PyObjC about what to do with regular str instances, since Foundation doesn't have a data type that's unencoded bytes (NSData) and text (NSString) at the same time. Up to now, str instances were converted to NSString using Python's default encoding (sys.getdefaultencoding()), which is basically always ascii, so any str containing non-ASCII bytes raises an exception, which is really never what you want. I committed a change this week that will hopefully be a least-worst-of-both-worlds solution:

  • Added OC_PythonUnicode and OC_PythonString classes that preserve the identity of str and unicode objects across the bridge. Additionally, the bridge for str now uses the default encoding of NSString, rather than sys.getdefaultencoding() from Python. For Mac OS X, this is typically MacRoman. The reason for this is that not all Python str instances could cross the bridge at all previously. objc.setStrBridgeEnabled(False) will still trigger warnings if you are attempting to track down an encoding bug. However, the symptom of such a bug will now be incorrectly encoded text, not an exception.

This lets NSString decide what encoding to use (generally, MacRoman), so you still get garbage in garbage out, but:

  • It's lossless (garbage in, same garbage out)
  • 7-bit ASCII safe (usually when str is used for text, it is just ascii)
  • Doesn't raise exceptions in strange places
  • If it was data (not text) and you were just putting it in a container or something, it will still be data when you get it back
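The trade-off can be approximated in plain Python, without PyObjC. This is only a sketch: it uses Python 3, where bytes stands in for the str type discussed here, and it assumes Python's own mac_roman codec behaves like NSString's default encoding on Mac OS X. The function names are made up for illustration.

```python
# Python 3 sketch: `bytes` plays the role of the old `str` type.
raw = bytes(range(256))  # every possible byte value, i.e. arbitrary "data"


def crosses_as_ascii(data: bytes) -> bool:
    """Old behaviour (approximated): ASCII decoding fails on any byte >= 0x80."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def round_trip_mac_roman(data: bytes) -> bytes:
    """New behaviour (approximated): decode as MacRoman, then encode back."""
    return data.decode("mac_roman").encode("mac_roman")


ascii_ok = crosses_as_ascii(b"plain text")   # 7-bit text was always fine
binary_ok = crosses_as_ascii(raw)            # arbitrary bytes used to blow up
lossless = round_trip_mac_roman(raw) == raw  # garbage in, same garbage out
ascii_same = b"plain text".decode("mac_roman") == b"plain text".decode("ascii")
```

If the assumption holds, the only observable change for pure ASCII text is that nothing changes, while non-ASCII bytes survive the trip instead of raising.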

NeXT/Apple definitely did strings in a very nice way. Rather than deciding there should be one and only one way to do them, they made NSString a class cluster where you can have any concrete implementation you want. The only methods your concrete classes have to implement are:

-(unsigned)length
    Return the number of characters in the string
-(unichar)characterAtIndex:(unsigned)index
    Return the character at that index (or raise an exception)

There are, of course, additional methods that a concrete subclass of NSString can implement for efficiency.
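To make the class cluster idea concrete, here is a rough Python analogue. It is not how Foundation or PyObjC is implemented; AbstractString, RepeatedString, and the method names are hypothetical, chosen to mirror the two primitive methods. Everything else is derived from those two.

```python
from abc import ABC, abstractmethod


class AbstractString(ABC):
    """Hypothetical stand-in for an NSString-like abstract interface."""

    @abstractmethod
    def length(self) -> int:
        """Return the number of characters in the string."""

    @abstractmethod
    def character_at_index(self, index: int) -> str:
        """Return the character at `index`, or raise IndexError."""

    # The methods below rely only on the two primitives above.
    def to_python(self) -> str:
        return "".join(self.character_at_index(i) for i in range(self.length()))

    def has_prefix(self, prefix: str) -> bool:
        if len(prefix) > self.length():
            return False
        return all(self.character_at_index(i) == ch for i, ch in enumerate(prefix))


class RepeatedString(AbstractString):
    """Concrete subclass whose 'backing store' is one character and a count."""

    def __init__(self, char: str, count: int) -> None:
        self._char = char
        self._count = count

    def length(self) -> int:
        return self._count

    def character_at_index(self, index: int) -> str:
        if not 0 <= index < self._count:
            raise IndexError(index)
        return self._char


dashes = RepeatedString("-", 5)
```

A subclass is free to override to_python or has_prefix with something faster, which is the "additional methods for efficiency" part.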

Doing strings in this way has some nice advantages:

  • If they wanted to trade up unichar to 32 bits, it would require minimal changes to the source code
  • You can use whatever backing store you need to use, with whatever properties it needs to have (e.g. an mmap-backed store, a constant in the code that never gets freed, a length-prefixed string, whatever!)
  • You can use whatever encoding you want to use, and conversion doesn't take place until you do something with it
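The last two points can be sketched in Python as well. EncodedText below is a hypothetical class, not anything from PyObjC: it keeps a reference to an existing byte buffer plus a codec name, and only decodes the single character that is asked for. It assumes a single-byte encoding so that byte offsets and character indexes coincide.

```python
class EncodedText:
    """Hypothetical string whose backing store is raw bytes plus a codec name.

    Assumes a single-byte encoding, so byte index == character index.
    """

    def __init__(self, buffer, encoding: str) -> None:
        self._buffer = memoryview(buffer)  # references the bytes, no copy
        self._encoding = encoding

    def length(self) -> int:
        return len(self._buffer)

    def character_at_index(self, index: int) -> str:
        if not 0 <= index < len(self._buffer):
            raise IndexError(index)
        # Only the one requested byte is decoded here.
        return bytes(self._buffer[index:index + 1]).decode(self._encoding)


CONSTANT = b"caf\x8e"  # "café" in MacRoman (0x8E is e-acute)
text = EncodedText(CONSTANT, "mac_roman")
last = text.character_at_index(3)
```

Nothing is converted when the object is created; the cost is paid per access, and a different backing store (an mmap object, for instance) could be passed in without changing the class.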

In Python's case, OC_PythonUnicode is actually a zero-copy concrete subclass of NSString (if Python's unicode characters are the same size as unichar, anyway). Python can't do ANYTHING like this:

  • The backing store of a Python unicode object must be controlled by Python (e.g. you can't point to a constant string, you can't point to a slice of another unicode object, etc.). This throws zero-copy strategies out the window.
  • The encoding of a Python unicode object must be UCS-2 (or UCS-4, depending on configure time options)

So, that sucks... but at least Python does have good unicode support, unlike some other languages with similar heritage.
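For contrast, here is what sharing a backing store looks like for binary data in Python 3, using memoryview. This is a sketch about bytes only; the text type has no equivalent view, which is the limitation described above. The variable names are just for illustration.

```python
backing = bytearray(b"hello world")

copied = bytes(backing[6:])       # slicing allocates a new object and copies
shared = memoryview(backing)[6:]  # slicing a memoryview shares the same buffer

backing[6] = ord("W")             # in-place change to the backing store

copy_after = copied               # the copy does not see the change
view_after = shared.tobytes()     # the view does
shared.release()
```

The copy keeps its old contents while the view reflects the change, which shows that the view never had storage of its own.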

6 Comments »

  1. So, is it necessary or recommended to use only unicode strings in PyObjC? Should every string that deals with Cocoa classes be unicode?

    Comment by Florian Munz — 2005-04-05 @ 9:59 am

  2. Yes.

    Comment by Michael Hudson — 2005-04-05 @ 11:57 am

  3. Excellent. This sounds like a good solution to what has been a thorny problem for PyObjC.

    Comment by Donovan Preston — 2005-04-05 @ 12:04 pm

  4. Python’s main design goal was to be very efficient, not to be maximally flexible.

    I’m not a big fan of strategies that let you point into a slice of another string, since this can easily keep alive a large string just because a small substring is still used (extreme example: parsing a single word out of a file).

    Also, Python’s buffer API has some of the desirable properties (at least it lets you point at a fixed string) though not others (it’s a byte array in memory, although the ownership of the memory is flexible).

    Comment by Guido van Rossum — 2005-04-05 @ 11:47 pm

  5. Well, other times, pointing into another string is exactly what you want. Objective-C only does this (as far as I know) in its implementation of NSConstantString, which uses the executable’s data section as the backing store (which is mmaped in, of course).

    If you were writing some kind of high performance application using str-as-data, not str-as-text, it might be a big deal for mmap.mmap’s __getitem__ to give you a slice directly into that memory, instead of allocating new memory for the whole slice plus the str object, memcpy’ing it all in, and then freeing it sometime later..

    I’m not sure that the pointer indirection that you save in Python’s implementation is really that much of a win.. by making that (small constant time) optimization, a whole class of other optimizations that might be much more important are thrown out.

    I’d much rather see unicode become Python’s one and only one “text” data type.. but it will probably be a hard sell when the only way to back them is to use UCS-2 or UCS-4 (depending on the flavor of Python) :)

    Comment by Bob Ippolito — 2005-04-07 @ 8:28 pm

  6. So let’s turn this into a wish list for Python 3000… This would be the ideal situation (to me at least):

    1. str’s usage gets degraded to a “symbol” type (that is, unless Python 3000 wants to support unicode variable and attribute names, in which case str should be deprecated altogether)
    2. unicode shall be the only text type
    3. unicode should have a flexible backing store
    4. there should be a str-like binary data type (‘bytes’). Again, with a flexible backing store
    5. there could be mutable versions of unicode and bytes

    Comment by Just van Rossum — 2005-04-08 @ 3:03 am
