Moin Jeremy,
Post by Jeremy White
Perhaps a stupid question - can 5.8.x be built without unicode
support to get some of that speed back?
Thank god, no. ASCII is so 1995...
:)
However, if you can, please post the code, so that we can run
benchmarks ourselves, and figure out what the difference is exactly
and whether one can do something about it.
Post by Jeremy White
Unfortunately the code is large and complicated - although I'm prepared
to find and create test cases and post those.
5.8.x fixes quite a few bugs of 5.6, and in some cases these fixes make
the code slower (due to more checks/conditions).
Post by Jeremy White
Indeed - the reason why I'm upgrading to 5.8.x is because of the bug fixes
- it was just a shock to see such a drop in performance.
I did have the idea last week to look at how Perl processes unicode data
and see if I can streamline that. However, good benchmarks/profiling must
come first.
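For a first profile, the stock Devel::DProf that ships with perl is
probably the easiest starting point; a sketch (the script name is of
course just a placeholder):

  perl -d:DProf your_script.pl    # writes profiling data to tmon.out
  dprofpp                         # reads tmon.out, prints time per sub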
Just some background:
In ASCII, all you have is one byte per character. With Unicode, there are
far more than 256 characters (over a million possible code points), so you
need more than one byte. This boils basically down to two choices:
* utf8 - each character is 1, 2, 3 or 4 bytes, depending on the character.
* utf32 - each character is exactly 4 bytes.
utf8 has the nice property of saving space, because ASCII characters take
1 byte and everything else 2, 3 or 4 bytes. utf32 has the advantage that
each character is exactly the same size, but it uses four times as much
space as ASCII.
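To illustrate the difference, here is a small sketch (the string is just
an example) showing the character length versus the byte length of the
same scalar:

  use strict;
  use warnings;

  my $s = "\x{20ac}uro";        # EURO SIGN + "uro": 4 characters
  print length($s), "\n";       # 4 (character semantics)
  {
      use bytes;                # byte semantics inside this block
      print length($s), "\n";   # 6 (the euro sign takes 3 bytes in utf8)
  }
  # In utf32, the same 4 characters would always take 4 * 4 = 16 bytes.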
In Perl, strings can now be stored internally as utf8. Whenever possible
they are kept as plain bytes, to speed things up. So, whether string
operations are the sole cause of the slowdown you see must be determined
first, of course :)
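You can check which representation a given scalar ended up with via
utf8::is_utf8 - note that it reports the internal flag, not whether the
data "is" Unicode. A sketch:

  use strict;
  use warnings;

  my $plain = "hello";             # stays in the byte representation
  my $wide  = "hello \x{263A}";    # the smiley forces the UTF8 flag

  print utf8::is_utf8($plain) ? "utf8\n" : "bytes\n";   # bytes
  print utf8::is_utf8($wide)  ? "utf8\n" : "bytes\n";   # utf8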
One potential slowdown is string length and substrings. For instance, if
you want to do substr($string, $x, $y), then Perl needs to do:
* for ASCII: calculate str_start + $x and str_start + $x + $y: O(1)
* for utf8: start at the beginning, advance $x characters (each of which
can be a variable number of bytes!), then advance another $y characters.
This is unfortunately O(N) - i.e. it depends on $x and $y.
For large strings, repeated operations etc, the time difference can grow
very large.
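One could measure exactly that with the core Benchmark module; a sketch
(sizes and offsets are arbitrary, and the actual numbers will depend on
the perl version and how well it caches utf8 positions):

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # Two strings with the same number of characters; the smiley in the
  # second one forces the internal utf8 representation.
  my $ascii = 'x' x 100_000;
  my $utf8  = "\x{263A}" . ('x' x 99_999);

  cmpthese(-2, {
      ascii => sub { substr($ascii, 90_000, 10) },
      utf8  => sub { substr($utf8,  90_000, 10) },
  });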
There are basically three things one can do:
* try to make utf8 operations faster,
* cache as much info as possible,
* use utf32 instead of utf8 (maybe a compile time option?)
The first two are medium work, and are probably already done to quite some
extent. The last one is actually a very big task. I am pretty sure that
there are many places where the code assumes utf8 encoding - but it would
be worth a try once we have reliable benchmarks.
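Until then, a workaround on the user side is sometimes possible: if a
string carries the UTF8 flag but actually contains only characters below
256, utf8::downgrade drops the flag and restores the O(1) byte offsets.
A sketch:

  use strict;
  use warnings;

  my $s = "plain ascii text";
  utf8::upgrade($s);        # simulate a module handing us a utf8 string

  # Safe only if every character is <= 255; otherwise downgrade dies
  # (pass a true second argument to make it fail quietly instead).
  utf8::downgrade($s);
  print utf8::is_utf8($s) ? "utf8\n" : "bytes\n";       # bytes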
Best wishes,
Tels
--
Signed on Fri Oct 14 20:20:03 2005 with key 0x93B84C15.
Visit my photo gallery at http://bloodgate.com/photos/
PGP key on http://bloodgate.com/tels.asc or per email.
"Yeah. Whatever."