T_WCHAR final
Thomas Busch
2006-11-29 18:55:46 UTC
Hi,

for those who are interested, this typemap for wchar_t
seems to work.

Thomas.

============================================================

INPUT
T_WCHAR
    {
        // Alloc memory for wide char string. This could be a bit more
        // than necessary.
        Newz(0, $var, SvLEN($arg), wchar_t);

        char* src = SvPV_nolen($arg);
        wchar_t* dst = (wchar_t*) $var;

        if (SvUTF8($arg)) {
            // UTF8 to wide char mapping
            STRLEN len;
            while (*src) {
                *dst++ = utf8_to_uvuni((U8*) src, &len);
                src += len;
            }
        } else {
            // char to wide char mapping
            while (*src) {
                *dst++ = (wchar_t) *src++;
            }
        }
        *dst = 0;
        SAVEFREEPV($var);
    }

OUTPUT
T_WCHAR
    {
        wchar_t* src = (wchar_t*) $var;
        U8* dst;
        U8* d;

        // Alloc memory for wide char string. This is clearly wider
        // than necessary in most cases but no choice.
        Newz(0, dst, 3 * wcslen(src), U8);

        d = dst;
        while (*src) {
            d = uvuni_to_utf8(d, *src++);
        }
        *d = 0;

        sv_setpv((SV*)$arg, (char*) dst);
        sv_utf8_decode($arg);

        Safefree(dst);
    }
Marvin Humphrey
2006-11-30 08:21:11 UTC
Post by Thomas Busch
// Alloc memory for wide char string. This is clearly wider
// than necessary in most cases but no choice.
Newz(0, dst, 3 * wcslen(src), U8);
I think you need to bump that allocation to 4 * wcslen(src) + 1,
otherwise you run the risk of a buffer overflow in the event that
your data has too many code points above the BMP. Alternately, you
can scan the input first and determine how much space you need to
allocate.
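To illustrate the sizing, here is a minimal standalone sketch in plain C rather than XS. The names encode_utf8 and wide_to_utf8_worst_case are hypothetical, malloc stands in for Newz, and the code assumes a 32-bit wchar_t holding values within the Unicode range. Two code points above the BMP need 8 bytes plus the terminator, which is more than 3 * wcslen(src) would provide.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Encode one code point as UTF-8 and return the number of bytes written.
 * Never writes more than 4 bytes; values above 0x10FFFF are not rejected
 * here, their high bits are simply masked off. */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | ((cp >> 18) & 0x07));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Convert a wide string to a malloc'd, NUL-terminated UTF-8 buffer sized
 * for the worst case: 4 bytes per character plus 1 for the terminator.
 * The caller frees the result. */
static unsigned char *wide_to_utf8_worst_case(const wchar_t *src, size_t *out_len)
{
    assert(src != NULL);
    size_t n = wcslen(src);
    unsigned char *buf = malloc(4 * n + 1);
    unsigned char *d = buf;

    if (buf == NULL)
        return NULL;
    while (*src)
        d += encode_utf8((unsigned long) *src++, d);
    *d = 0;
    if (out_len)
        *out_len = (size_t) (d - buf);
    return buf;
}
```

In the typemap itself the equivalent change would be to the Newz count only; the encoding loop would stay as posted.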
Post by Thomas Busch
while (*src) {
d = uvuni_to_utf8(d, *src++);
}
*d = 0;
I assume that uvuni_to_utf8 handles invalid input safely.

The crucial thing here is not to open a security hole. If a user can
supply input, assume that pathologically munged input is on its way.
Since this is typemap code, many functions are potentially affected.
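If one did not want to rely on that assumption, the wide string could be sanitized before encoding. The following is only a sketch in plain C under stated assumptions: sanitize_code_point and sanitize_wide_string are hypothetical helpers, wchar_t is assumed to be 32 bits wide, and replacing bad values with U+FFFD is one possible policy among several (croaking would be another).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define REPLACEMENT_CHAR 0xFFFDUL

/* Map anything that is not a Unicode scalar value to U+FFFD.
 * A negative wchar_t converts to a huge unsigned value and is caught
 * by the first range check. */
static unsigned long sanitize_code_point(wchar_t wc)
{
    unsigned long cp = (unsigned long) wc;

    if (cp > 0x10FFFFUL)
        return REPLACEMENT_CHAR;            /* beyond the Unicode range */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return REPLACEMENT_CHAR;            /* UTF-16 surrogate */
    return cp;
}

/* Sanitize a NUL-terminated wide string in place and return the number
 * of characters that were replaced. */
static size_t sanitize_wide_string(wchar_t *s)
{
    assert(s != NULL);
    size_t replaced = 0;

    for (; *s; s++) {
        unsigned long cp = sanitize_code_point(*s);
        if (cp != (unsigned long) *s) {
            *s = (wchar_t) cp;
            replaced++;
        }
    }
    return replaced;
}
```

After such a pass every character encodes to at most 4 UTF-8 bytes, so a 4 * wcslen(src) + 1 allocation holds regardless of what the caller supplied.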

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Thomas Busch
2006-11-30 12:42:37 UTC
Hi Marvin,

which do you think is faster: scanning the input or
allocating more memory? For European languages
the length should be between 1 and 2 times wcslen(src).

Also, where does the +1 come from?

Last but not least I just wanted to mention there is
a bug in the INPUT part as I use char instead of U8.

Here is the correct code:

INPUT
T_WCHAR
    {
        // Alloc memory for wide char string. This could be a bit more
        // than necessary.
        Newz(0, $var, SvLEN($arg), wchar_t);

        U8* src = (U8*) SvPV_nolen($arg);
        wchar_t* dst = (wchar_t*) $var;

        if (SvUTF8($arg)) {
            // UTF8 to wide char mapping
            STRLEN len;
            while (*src) {
                *dst++ = utf8_to_uvuni(src, &len);
                src += len;
            }
        } else {
            // char to wide char mapping
            while (*src) {
                *dst++ = (wchar_t) *src++;
            }
        }
        *dst = 0;
        SAVEFREEPV($var);
    }


Thomas.
Marvin Humphrey
2006-11-30 16:26:55 UTC
Post by Thomas Busch
which do you think is faster: scanning the input or
allocating more memory?
Benchmarking is the only way to know for sure. And overshooting on
allocation is a design tradeoff.

If you knew the string's length already, naive allocation would
probably be faster. Until you hit swap. ;)

But wcslen is doing a scan already, right? So replace that with your
own custom scan and see what happens.
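A custom scan along those lines might look like the sketch below. It is plain C, not XS; utf8_bytes_needed is a hypothetical name, and the code assumes a 32-bit wchar_t whose values stay within the Unicode range. One pass yields both the character count that wcslen would have returned and the exact UTF-8 byte count, so the allocation can be bytes + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Walk the wide string once. Returns the exact number of UTF-8 bytes the
 * string needs (excluding the trailing NUL) and, if char_count is not
 * NULL, stores the number of wide characters there. */
static size_t utf8_bytes_needed(const wchar_t *src, size_t *char_count)
{
    assert(src != NULL);
    size_t bytes = 0;
    size_t chars = 0;

    for (; *src; src++, chars++) {
        unsigned long cp = (unsigned long) *src;

        if (cp < 0x80)
            bytes += 1;     /* ASCII */
        else if (cp < 0x800)
            bytes += 2;     /* e.g. accented Latin letters */
        else if (cp < 0x10000)
            bytes += 3;     /* rest of the BMP */
        else
            bytes += 4;     /* above the BMP */
    }
    if (char_count)
        *char_count = chars;
    return bytes;
}
```

Whether this beats wcslen plus a worst-case allocation is exactly what a benchmark on representative data would have to show.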
Post by Thomas Busch
For European languages
the length should be between 1 and 2 times wcslen(src).
Yes. This is a classic problem. It's the reason my big patch
changing Java Lucene to use legal UTF-8 and a bytecount-based String
header causes a 20% performance hit.
(<https://issues.apache.org/jira/browse/LUCENE-510>) Java's internal
routines for precisely this task -- negotiating how much memory is
required when converting between two variable-length Unicode
encodings -- are to blame.

You're working on this because you want to manipulate CLucene string
data from perl-space, correct? You're starting down a long,
well-traveled road. ;)
Post by Thomas Busch
Also, where does the +1 come from?
Null termination. It should be there even though a Perl scalar knows
its own length and may contain null bytes.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
