Best practice for updating a string?

Discussion:

(too old to reply)

Erland Sommarskog

2008-05-01 17:53:17 UTC

I have in my XS module a routine that converts the charcaters of an SV
from one code page to another. The conversion is done "in-place". That is,
I retrieve the text pointer, and then I rewrite the area pointed to. For
reference, the full code for the routine is included at the end of this
post.

The module is an API for communicating with MS SQL Server from Perl. One
feature is that the caller can request that data sent to/from the server
should be converted between different code pages. When sending data to the
server, my module needs to copy the caller's data, or else the caller
would find his variables to have changed.

The caller may pass data in a hash, where the hash keys corresponds to
columns. Metadata is also subject to conversion, thus also the hash keys
needs to be converted.

The actual code is a bit contrived, but here is a distilled script that
shows the essence of the problem:

use strict;
use Win32::SqlServer;

my $X = new Win32::SqlServer;
my $datarows = [{k => 1, 'ÅÄÖ' => 'åäö'},
{k => 2, 'ÅÄÖ' => 'çéü'},
{k => 3, 'ÅÄÖ' => '$£'},
{k => 4, 'ÅÄÖ' => '§§§'}];

my %copy;
foreach my $key (keys %{$$datarows[0]}) {
my $val = $$datarows[0]{$key};
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($key, '1252', '850');
$copy{$key} = $val;
}

# This should print the original array.
foreach my $row (@$datarows) {
foreach my $key (keys %$row) {
print "Key: $key, Value '$$row{$key}' !!! ";
}
print "\n";
}

This naïve implementation does not work on neither Perl 5.8 or 5.10.
The output is:

Key: k, Value '1' !!! Key: , Value 'åäö' !!!
Key: k, Value '2' !!! Key: , Value 'çéü' !!!
Key: k, Value '3' !!! Key: , Value '$£' !!!
Key: k, Value '4' !!! Key: , Value '§§§' !!!

All four keys are changed, not only the one for $$datarows[0]. Apparently
there is an optimisation in how hash keys are stored. But if I change the
loop to:

my $val = $$datarows[0]{$key};
my $keycopy = $key;
$X->codepage_convert($val, '1252', '850');
$X->codepage_convert($keycopy, '1252', '850');
$copy{$keycopy} = $val;

$$datarows survive the ordeal in Perl 5.8. But not so in 5.10! Apparently
there is one more optimisation, so that the copy share the area with the
original.

Now is my question, what is the best practice to deal with this? A simple
workaround is this change:

my $keycopy = $key . '';

But maybe it's my XS code that is bad? That is, changing the area directly
from C++ is bad thing that I should not do.

But in such case, what should I do? I cannot just allocate a new area and
forget the old, because then I have a memory leak. And if deallocate the
area, I will apparently pull the rug for other SVs that uses the same
area. Safefree? Does Safefree keep a reference count? The Perl
documentation is very terse, I essentially only find this: "The
XSUB-writer's interface to the C free function." And as far as I know,
free does not do reference counting.

Or should I decrease the refcount on this SV and then return a new one?

Certainly, changing the area in place is the most efficient, although
efficiency on this level does not matter that much in a DB API.

Below is the code for codepage_convert.

void codepage_convert(SV * olle_ptr,
SV * sv,
UINT from_cp,
UINT to_cp)

{ int widelen;
int ret;
DWORD err;
BSTR bstr;
STRLEN sv_len;
char * sv_text = (char *) SvPV(sv, sv_len);
STRLEN outlen;

if (sv_len > 0) {
// If the input string is UTF_8, we should ignore from_cp.
if (SvUTF8(sv)) {
from_cp = CP_UTF8;
}

// First find out how long the Unicode string will be, by calling
// MultiByteToWideChar without a buffer. Not that we always set flags to
// 0 here, since it works with all code pages.
widelen = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, NULL, 0);

if (widelen > 0) {
// Allocate Unicode string and convert to Unicode.
bstr = SysAllocStringLen(NULL, widelen);
ret = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, bstr, widelen);
}
else {
ret = 0;
}

// Check for errors.
if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed. Maybe you are using an non-existing code-page?",
from_cp);
}
else {
olle_croak(olle_ptr,
"Conversion from codepage %d to Unicode failed with error %d",
from_cp, err);
}
}

// Now determine the length for the string in the receiving code page.
outlen = WideCharToMultiByte(to_cp, 0, bstr, widelen, NULL, 0, NULL, NULL);

if (outlen > 0) {
// Note that with some code pages the new string could be shorter or
// longer.
if (outlen > sv_len) {
sv_text = SvGROW(sv, outlen);
}
SvCUR_set(sv, outlen);
sv_text = (char *) SvPV(sv, sv_len);

// Convert to target.
ret = WideCharToMultiByte(to_cp, 0, bstr, widelen, sv_text, outlen, NULL, NULL);
}
else {
ret = 0;
}

if (ret == 0) {
err = GetLastError();
if (err == ERROR_INVALID_PARAMETER) {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed. Maybe you are using an non-existing code-page?",
to_cp);
}
else {
olle_croak(olle_ptr,
"Conversion to codepage %d from Unicode failed with error %d",
to_cp, err);
}
}

// Get rid of the bstr.
SysFreeString(bstr);

// Set or unset the UTF8 flag depending on target charset.
if (to_cp == CP_UTF8) {
SvUTF8_on(sv);
}
else {
SvUTF8_off(sv);
}
}
}

--
Erland Sommarskog, Stockholm, ***@sommarskog.se

Jan Dubois

2008-05-01 18:29:37 UTC

Permalink

Post by Erland Sommarskog
I have in my XS module a routine that converts the charcaters of an SV
from one code page to another. The conversion is done "in-place". That is,
I retrieve the text pointer, and then I rewrite the area pointed to. For
reference, the full code for the routine is included at the end of this
post.

In general, you should never change SVs that are marked as read-only.

if (SvREADONLY(sv))
croak("Cannot modify READONLY variable");

I'm not quite sure if it is worthwhile to also check for SvLEN()
being 0 before attempting to modify the SV, but I would think all
SVs that are SvPOK() and have SvLEN() being 0 should also have
the SvREADONLY() bit set. (SvLEN being zero means that the PV
is managed externally and not (necessarily) even allocated by
the Perl memory manager).

In particular, you cannot update hash keys in-place at all: besides
them sharing memory in a shared hash key table, you would also end
up with a different hash for the new key, so you would be breaking
hash lookup. If you want to change the hash key, you need to remove
the entry using the old key and add it back using the new key.

Cheers,
-Jan

Jan Dubois

2008-05-01 19:10:59 UTC

Permalink

Post by Jan Dubois

In general, you should never change SVs that are marked as read-only.
if (SvREADONLY(sv))
croak("Cannot modify READONLY variable");

I forgot to add:

Generally, you should not try to modify SVs that were passed in to your
XS code at all; always try to return results in new SVs.

Here is an example that most XS code doesn't support (assignment to magical
SVs):

sub foo { $_[0] = "foobar" }
$a = "abc123xyz";
foo( substr($a,3,3) );
# print "abcfoobarxyz"
print $a;

Try implementing foo() in XS and see if it works...

And trust me, if this is an external API, then people will eventually
pass things like lvalue substr() expressions, $1 regexp variables,
tied hash elements etc. to your code. Treating everything correctly
at the XS level is quite a bit of work, especially if you want to
support older Perl versions too.

Cheers,
-Jan

Erland Sommarskog

2008-05-01 21:52:32 UTC

Permalink

Post by Jan Dubois
In particular, you cannot update hash keys in-place at all: besides
them sharing memory in a shared hash key table, you would also end
up with a different hash for the new key, so you would be breaking
hash lookup. If you want to change the hash key, you need to remove
the entry using the old key and add it back using the new key.

Yes, that is clear to me. And in this I need to create a entirely new
hash, as the caller's hash should be unmolested.

Keep in mind that the first example I had was just a starting point.

Post by Jan Dubois
And trust me, if this is an external API, then people will eventually
pass things like lvalue substr() expressions, $1 regexp variables,
tied hash elements etc. to your code. Treating everything correctly
at the XS level is quite a bit of work, especially if you want to
support older Perl versions too.

My module requires 5.8.3, so far so good.

Good point about substr and the like. Nevertheless, I settled on a
solution, where I write the text in the new codepage into a new
buffer, and then I copy that text into the sv with sv_setpvn. I found
that in this case it works even if I pass $key directly.

No, it does not work with substr, nor does it work if I pass a constant -
in the latter case it's bye-bye. I will have to document this.

Thanks for your comments, Jan!

--
Erland Sommarskog, Stockholm, ***@sommarskog.se