PyObjC and unicode
There's a constant battle in PyObjC about what to do with regular str instances, since Foundation doesn't have a data type that's both unencoded bytes (NSData) and text (NSString) at the same time. Up to now, str instances were converted to NSString using Python's default encoding (sys.getdefaultencoding()), which is basically always ascii and will raise an exception as soon as the str contains a non-ASCII byte, which is really never what you want. I committed a change this week that will hopefully be a least-worst-of-both-worlds solution:
- Added OC_PythonUnicode and OC_PythonString classes that preserve the identity of str and unicode objects across the bridge. Additionally, the bridge for str now uses the default encoding of NSString, rather than sys.getdefaultencoding() from Python. For Mac OS X, this is typically MacRoman. The reason for this is that previously not all Python str instances could cross the bridge at all. objc.setStrBridgeEnabled(False) will still trigger warnings if you are attempting to track down an encoding bug. However, the symptoms of the bug will be incorrectly encoded text, not an exception.
This lets NSString decide what encoding to use (generally, MacRoman), so you still get garbage in garbage out, but:
- It's lossless (garbage in, same garbage out)
- 7-bit ASCII safe (usually when str is used for text, it is just ascii)
- Doesn't raise exceptions in strange places
- If it was data (not text) and you were just putting it in a container or something, it will still be data when you get it back
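The difference between the old and new behavior can be sketched in plain Python, without PyObjC. This is only an illustration under some assumptions: it is written for Python 3, where bytes plays the role of the str discussed here, and it uses Python's own mac_roman codec as a stand-in for whatever NSString would pick as its default encoding.

```python
# Python 3 sketch: bytes stands in for the Python 2 str discussed above.
data = bytes(range(256))  # every possible byte value, text or not

# Old behavior: decoding with ascii fails on any byte >= 0x80.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# New behavior (assuming MacRoman is the default encoding): every byte
# maps to some character, and encoding again returns the same bytes.
text = data.decode("mac_roman")
round_tripped = text.encode("mac_roman")

print("ascii decode failed:", ascii_failed)
print("lossless round trip:", round_tripped == data)
print("7-bit ASCII unchanged:", b"hello".decode("mac_roman") == "hello")
```

The text may well be wrong if the bytes were really Latin-1 or UTF-8, but nothing is lost and nothing raises.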
NeXT/Apple definitely did strings in a very nice way. Rather than deciding there should be one and only one way to do them, they made NSString a class cluster where you can have any concrete implementation you want. The only methods your concrete classes have to implement are:
- -(unsigned)length
  - Return the number of characters in the string
- -(unichar)characterAtIndex:(unsigned)index
  - Return the character at that index (or raise an exception)
There are, of course, additional methods that a concrete subclass of NSString can implement for efficiency.
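To make the class cluster idea concrete, here is a rough Python 3 analogy. It is not how Foundation or PyObjC implements anything: AbstractString, StrBacked and MacRomanBacked are hypothetical names invented for this sketch. The point is only that everything else can be derived from the two primitive methods, so each concrete class is free to choose its own backing store and encoding.

```python
from abc import ABC, abstractmethod


class AbstractString(ABC):
    """Hypothetical stand-in for the abstract NSString interface."""

    @abstractmethod
    def length(self) -> int:
        """Return the number of characters in the string."""

    @abstractmethod
    def character_at_index(self, index: int) -> str:
        """Return the character at index, or raise IndexError."""

    # Derived methods, written only in terms of the two primitives.
    def to_python(self) -> str:
        return "".join(self.character_at_index(i) for i in range(self.length()))

    def has_prefix(self, prefix: str) -> bool:
        if len(prefix) > self.length():
            return False
        return all(self.character_at_index(i) == c for i, c in enumerate(prefix))


class StrBacked(AbstractString):
    """Backed directly by a Python str."""

    def __init__(self, text: str) -> None:
        self._text = text

    def length(self) -> int:
        return len(self._text)

    def character_at_index(self, index: int) -> str:
        if not 0 <= index < len(self._text):
            raise IndexError(index)
        return self._text[index]


class MacRomanBacked(AbstractString):
    """Backed by raw MacRoman bytes, decoded one character at a time."""

    def __init__(self, raw: bytes) -> None:
        self._raw = raw

    def length(self) -> int:
        return len(self._raw)  # MacRoman is one byte per character

    def character_at_index(self, index: int) -> str:
        if not 0 <= index < len(self._raw):
            raise IndexError(index)
        return self._raw[index:index + 1].decode("mac_roman")


a = StrBacked("café")
b = MacRomanBacked("café".encode("mac_roman"))
print(a.to_python() == b.to_python(), b.has_prefix("caf"))
```

Both objects behave the same to a caller even though one of them never builds a decoded copy of its data.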
Doing strings in this way has some nice advantages:
- If they wanted to trade up unichar to 32 bits, it would require minimal changes to the source code
- You can use whatever backing store you need to use, with whatever properties it needs to have (e.g. an mmap-backed store, a constant in the code that never gets freed, a length-prefixed string, whatever!)
- You can use whatever encoding you want to use, and conversion doesn't take place until you do something with it
In Python's case, OC_PythonUnicode is actually a zero-copy concrete subclass of NSString (if Python's unicode characters are the same size as unichar, anyway). Python's own unicode type can't do ANYTHING like this:
- The backing store of a Python unicode object must be controlled by Python (e.g. you can't point to a constant string, you can't point to a slice of another unicode object, etc.). This throws zero-copy strategies out the window.
- The encoding of a Python unicode object must be UCS-2 (or UCS-4, depending on configure time options)
So, that sucks... but at least Python does have good unicode support, unlike some other languages with similar heritage.
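If you want to know which of the two fixed encodings a given interpreter was built with, sys.maxunicode tells you. A small sketch follows; note that the narrow/wide distinction only exists on older interpreters, and Python 3.3 and later always report the wide value.

```python
import sys

# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build.
# Python 3.3 and later always report 0x10FFFF.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide else "narrow")
```

On a narrow build the unicode characters are the same size as unichar, which is the case where the zero-copy OC_PythonUnicode described above applies.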
So, is it necessary or recommended to use only unicode strings in PyObjC? Should every string that deals with Cocoa classes be unicode?
Comment by Florian Munz — 2005-04-05 @ 9:59 am
Yes.
Comment by Michael Hudson — 2005-04-05 @ 11:57 am
Excellent. This sounds like a good solution to what has been a thorny problem for PyObjC.
Comment by Donovan Preston — 2005-04-05 @ 12:04 pm
Python’s main design goal was to be very efficient, not to be maximally flexible.
I’m not a big fan of strategies that let you point into a slice of another string, since this can easily keep alive a large string just because a small substring is still used (extreme example: parsing a single word out of a file).
Also, Python’s buffer API has some of the desirable properties (at least it lets you point at a fixed string) though not others (it’s a byte array in memory, although the ownership of the memory is flexible).
Comment by Guido van Rossum — 2005-04-05 @ 11:47 pm
Well, other times, pointing into another string is exactly what you want. Objective-C only does this (as far as I know) in its implementation of NSConstantString, which uses the executable’s data section as the backing store (which is mmaped in, of course).
If you were writing some kind of high performance application using str-as-data, not str-as-text, it might be a big deal for mmap.mmap’s __getitem__ to give you a slice directly into that memory, instead of allocating new memory for the whole slice plus the str object, memcpy’ing it all in, and then freeing it sometime later..
I’m not sure that the pointer indirection that you save in Python’s implementation is really that much of a win.. by making that (small constant time) optimization, a whole class of other optimizations that might be much more important are thrown out.
I’d much rather see unicode become Python’s one and only one “text” data type.. but it will probably be a hard sell when the only way to back them is to use UCS-2 or UCS-4 (depending on the flavor of Python) :)
Comment by Bob Ippolito — 2005-04-07 @ 8:28 pm
So let’s turn this into a wish list for Python 3000… This would be the ideal situation (to me at least):
1. str’s usage gets degraded to a “symbol” type (that is, unless Python 3000 wants to support unicode variable and attribute names, in which case str should be deprecated altogether)
2. unicode shall be the only text type
3. unicode should have a flexible backing store
4. there should be a str-like binary data type (‘bytes’), again with a flexible backing store
5. there could be mutable versions of unicode and bytes
Comment by Just van Rossum — 2005-04-08 @ 3:03 am