Erland Sommarskog
2008-05-01 17:53:17 UTC
I have in my XS module a routine that converts the charcaters of an SV
from one code page to another. The conversion is done "in-place". That is,
I retrieve the text pointer, and then I rewrite the area pointed to. For
reference, the full code for the routine is included at the end of this
post.
The module is an API for communicating with MS SQL Server from Perl. One
feature is that the caller can request that data sent to/from the server
should be converted between different code pages. When sending data to the
server, my module needs to copy the caller's data, or else the caller
would find his variables to have changed.
The caller may pass data in a hash, where the hash keys corresponds to
columns. Metadata is also subject to conversion, thus also the hash keys
needs to be converted.
The actual code is a bit contrived, but here is a distilled script that
shows the essence of the problem:
use strict;
use Win32::SqlServer;
my $X = new Win32::SqlServer;
my $datarows = [{k => 1, 'ÅÄÖ' => 'åäö'},
{k => 2, 'ÅÄÖ' => 'çéü'},
{k => 3, 'ÅÄÖ' => '$£'},
{k => 4, 'ÅÄÖ' => '§§§'}];
my %copy;
foreach my $key (keys %{$$datarows[0]}) {
my $val = $$datarows[0]{$key};
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($key, '1252', '850');
$copy{$key} = $val;
}
# This should print the original array.
foreach my $row (@$datarows) {
foreach my $key (keys %$row) {
print "Key: $key, Value '$$row{$key}' !!! ";
}
print "\n";
}
This naïve implementation does not work on neither Perl 5.8 or 5.10.
The output is:
Key: k, Value '1' !!! Key: , Value 'åäö' !!!
Key: k, Value '2' !!! Key: , Value 'çéü' !!!
Key: k, Value '3' !!! Key: , Value '$£' !!!
Key: k, Value '4' !!! Key: , Value '§§§' !!!
All four keys are changed, not only the one for $$datarows[0]. Apparently
there is an optimisation in how hash keys are stored. But if I change the
loop to:
my $val = $$datarows[0]{$key};
my $keycopy = $key;
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($keycopy, '1252', '850');
$copy{$keycopy} = $val;
$$datarows survive the ordeal in Perl 5.8. But not so in 5.10! Apparently
there is one more optimisation, so that the copy share the area with the
original.
Now is my question, what is the best practice to deal with this? A simple
workaround is this change:
my $keycopy = $key . '';
But maybe it's my XS code that is bad? That is, changing the area directly
from C++ is bad thing that I should not do.
But in such case, what should I do? I cannot just allocate a new area and
forget the old, because then I have a memory leak. And if deallocate the
area, I will apparently pull the rug for other SVs that uses the same
area. Safefree? Does Safefree keep a reference count? The Perl
documentation is very terse, I essentially only find this: "The
XSUB-writer's interface to the C free function." And as far as I know,
free does not do reference counting.
Or should I decrease the refcount on this SV and then return a new one?
Certainly, changing the area in place is the most efficient, although
efficiency on this level does not matter that much in a DB API.
Below is the code for codepage_convert.
void codepage_convert(SV * olle_ptr,
SV * sv,
UINT from_cp,
UINT to_cp)
{ int widelen;
int ret;
DWORD err;
BSTR bstr;
STRLEN sv_len;
char * sv_text = (char *) SvPV(sv, sv_len);
STRLEN outlen;
if (sv_len > 0) {
// If the input string is UTF_8, we should ignore from_cp.
if (SvUTF8(sv)) {
from_cp = CP_UTF8;
}
// First find out how long the Unicode string will be, by calling
// MultiByteToWideChar without a buffer. Not that we always set flags to
// 0 here, since it works with all code pages.
widelen = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, NULL, 0);
if (widelen > 0) {
// Allocate Unicode string and convert to Unicode.
bstr = SysAllocStringLen(NULL, widelen);
ret = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, bstr, widelen);
}
else {
ret = 0;
}
// Check for errors.
if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed. Maybe you are using an non-existing code-page?",
from_cp);
}
else {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed with error %d",
from_cp, err);
}
}
// Now determine the length for the string in the receiving code page.
outlen = WideCharToMultiByte(to_cp, 0, bstr, widelen, NULL, 0, NULL, NULL);
if (outlen > 0) {
// Note that with some code pages the new string could be shorter or
// longer.
if (outlen > sv_len) {
sv_text = SvGROW(sv, outlen);
}
SvCUR_set(sv, outlen);
sv_text = (char *) SvPV(sv, sv_len);
// Convert to target.
ret = WideCharToMultiByte(to_cp, 0, bstr, widelen, sv_text, outlen, NULL, NULL);
}
else {
ret = 0;
}
if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed. Maybe you are using an non-existing code-page?",
to_cp);
}
else {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed with error %d",
to_cp, err);
}
}
// Get rid of the bstr.
SysFreeString(bstr);
// Set or unset the UTF8 flag depending on target charset.
if (to_cp == CP_UTF8) {
SvUTF8_on(sv);
}
else {
SvUTF8_off(sv);
}
}
}
from one code page to another. The conversion is done "in-place". That is,
I retrieve the text pointer, and then I rewrite the area pointed to. For
reference, the full code for the routine is included at the end of this
post.
The module is an API for communicating with MS SQL Server from Perl. One
feature is that the caller can request that data sent to/from the server
should be converted between different code pages. When sending data to the
server, my module needs to copy the caller's data, or else the caller
would find his variables to have changed.
The caller may pass data in a hash, where the hash keys corresponds to
columns. Metadata is also subject to conversion, thus also the hash keys
needs to be converted.
The actual code is a bit contrived, but here is a distilled script that
shows the essence of the problem:
use strict;
use Win32::SqlServer;
my $X = new Win32::SqlServer;
my $datarows = [{k => 1, 'ÅÄÖ' => 'åäö'},
{k => 2, 'ÅÄÖ' => 'çéü'},
{k => 3, 'ÅÄÖ' => '$£'},
{k => 4, 'ÅÄÖ' => '§§§'}];
my %copy;
foreach my $key (keys %{$$datarows[0]}) {
my $val = $$datarows[0]{$key};
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($key, '1252', '850');
$copy{$key} = $val;
}
# This should print the original array.
foreach my $row (@$datarows) {
foreach my $key (keys %$row) {
print "Key: $key, Value '$$row{$key}' !!! ";
}
print "\n";
}
This naïve implementation does not work on neither Perl 5.8 or 5.10.
The output is:
Key: k, Value '1' !!! Key: , Value 'åäö' !!!
Key: k, Value '2' !!! Key: , Value 'çéü' !!!
Key: k, Value '3' !!! Key: , Value '$£' !!!
Key: k, Value '4' !!! Key: , Value '§§§' !!!
All four keys are changed, not only the one for $$datarows[0]. Apparently
there is an optimisation in how hash keys are stored. But if I change the
loop to:
my $val = $$datarows[0]{$key};
my $keycopy = $key;
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($keycopy, '1252', '850');
$copy{$keycopy} = $val;
$$datarows survive the ordeal in Perl 5.8. But not so in 5.10! Apparently
there is one more optimisation, so that the copy share the area with the
original.
Now is my question, what is the best practice to deal with this? A simple
workaround is this change:
my $keycopy = $key . '';
But maybe it's my XS code that is bad? That is, changing the area directly
from C++ is bad thing that I should not do.
But in such case, what should I do? I cannot just allocate a new area and
forget the old, because then I have a memory leak. And if deallocate the
area, I will apparently pull the rug for other SVs that uses the same
area. Safefree? Does Safefree keep a reference count? The Perl
documentation is very terse, I essentially only find this: "The
XSUB-writer's interface to the C free function." And as far as I know,
free does not do reference counting.
Or should I decrease the refcount on this SV and then return a new one?
Certainly, changing the area in place is the most efficient, although
efficiency on this level does not matter that much in a DB API.
Below is the code for codepage_convert.
void codepage_convert(SV * olle_ptr,
SV * sv,
UINT from_cp,
UINT to_cp)
{ int widelen;
int ret;
DWORD err;
BSTR bstr;
STRLEN sv_len;
char * sv_text = (char *) SvPV(sv, sv_len);
STRLEN outlen;
if (sv_len > 0) {
// If the input string is UTF_8, we should ignore from_cp.
if (SvUTF8(sv)) {
from_cp = CP_UTF8;
}
// First find out how long the Unicode string will be, by calling
// MultiByteToWideChar without a buffer. Not that we always set flags to
// 0 here, since it works with all code pages.
widelen = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, NULL, 0);
if (widelen > 0) {
// Allocate Unicode string and convert to Unicode.
bstr = SysAllocStringLen(NULL, widelen);
ret = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, bstr, widelen);
}
else {
ret = 0;
}
// Check for errors.
if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed. Maybe you are using an non-existing code-page?",
from_cp);
}
else {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed with error %d",
from_cp, err);
}
}
// Now determine the length for the string in the receiving code page.
outlen = WideCharToMultiByte(to_cp, 0, bstr, widelen, NULL, 0, NULL, NULL);
if (outlen > 0) {
// Note that with some code pages the new string could be shorter or
// longer.
if (outlen > sv_len) {
sv_text = SvGROW(sv, outlen);
}
SvCUR_set(sv, outlen);
sv_text = (char *) SvPV(sv, sv_len);
// Convert to target.
ret = WideCharToMultiByte(to_cp, 0, bstr, widelen, sv_text, outlen, NULL, NULL);
}
else {
ret = 0;
}
if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed. Maybe you are using an non-existing code-page?",
to_cp);
}
else {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed with error %d",
to_cp, err);
}
}
// Get rid of the bstr.
SysFreeString(bstr);
// Set or unset the UTF8 flag depending on target charset.
if (to_cp == CP_UTF8) {
SvUTF8_on(sv);
}
else {
SvUTF8_off(sv);
}
}
}
--
Erland Sommarskog, Stockholm, ***@sommarskog.se
Erland Sommarskog, Stockholm, ***@sommarskog.se