Encoding errors with Romanian characters "ș" and "ț"

Report SqlBase bugs and possible workarounds.
Peter.Hugk
Germany
Posts: 362
Joined: 06 Mar 2017, 07:48
Location: Germany

Post by Peter.Hugk » 22 Sep 2011, 14:22

Since TD 5.2 SP3 or SP4, two Romanian characters are destroyed when a string is converted from code page 1250 (Central European) to ENC_ANSI.
A repro case is attached. We urgently need a patch for this.

Background: We handle multiple languages in an environment where users run Windows with different code page settings and SQLBase 11.6, and most of the content must be stored in non-Unicode columns. To do so, every input in a language-specific data field is converted to ENC_ANSI before it is saved in the DB. After reading data from a language-specific database column, TD's automatic conversion has to be undone. This is done by reverting the automatic conversion with ENC_ANSI and then converting with the appropriate code page. The principle is shown in the repro case. It works fine with all ANSI character sets (e.g. Greek), but not with the two Romanian characters "ș" and "ț".
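
For readers without the repro case, the storage scheme described above can be sketched roughly in Python rather than SAL. This is only an illustration under assumptions: the names `to_storage` and `from_storage` are hypothetical, ENC_ANSI is assumed to resolve to code page 1252, and Python's codecs stand in for the TD conversion functions.

```python
# Assumption: ENC_ANSI resolves to Windows code page 1252 on this machine.
ANSI_CODEPAGE = "cp1252"


def to_storage(text: str, language_codepage: str) -> str:
    """Encode text in its language code page, then reinterpret the bytes as ANSI.

    The result is the string that would be written to a non-Unicode column.
    """
    raw = text.encode(language_codepage)
    return raw.decode(ANSI_CODEPAGE)


def from_storage(stored: str, language_codepage: str) -> str:
    """Undo the ANSI interpretation, then decode with the language code page."""
    raw = stored.encode(ANSI_CODEPAGE)
    return raw.decode(language_codepage)


greek = "αβγ"
stored = to_storage(greek, "cp1253")   # cp1253 is the Windows Greek code page
print(stored)                          # the ANSI view of the same bytes
print(from_storage(stored, "cp1253") == greek)
```

Python's cp1252 codec leaves a few byte values undefined, so the sketch only covers characters whose bytes exist in both code pages. The point is that the scheme is lossless as long as every character can be encoded in the language code page in the first step.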

Post by Peter.Hugk » 04 Oct 2011, 09:23

Could someone from Unify please have a look at this? I need a bug number or, even better, a solution for this problem.

Post by Jeff Luther » 05 Oct 2011, 00:57

OK, I got a bug number for you: TD-16446

The bigger issue is what to do in the meantime (a workaround, if any). I have sent a question about this to the tech help group here. If I hear anything, I'll update this thread.

Post by Jeff Luther » 05 Oct 2011, 01:23

Peter - you may have already considered and thought about these issues, but here's what one developer wrote back to my query about your problem, FYI:

"ENC_ANSI should use the client machine setting for local code page so he needs to be sure Romanian is CP 1250 and that the machine language is set to Romanian for that to work ok.

There may also be some conversion at the database end depending on the column type he is inserting into and I believe the database code page."

Post by Peter.Hugk » 05 Oct 2011, 15:55

Thank you for the bug number.

Did the developer even have a look at the source code of the repro case? When I convert a string with the following four steps:

Code:

Call SalStrToMultiByte( s1, s2, 1250 )
Call SalStrToWideChar( s2, s2, ENC_ANSI )
Call SalStrToMultiByte( s2, s2, ENC_ANSI )
Call SalStrToWideChar( s2, s2, 1250 )

s2 should be equal to the original string s1, right? This worked until SP3 (or SP4) of TD 5.2. Since then it has been broken for the two characters mentioned.
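
The four steps can be approximated in Python to see when the round trip holds. This is a sketch, not the SAL implementation: it assumes ENC_ANSI is code page 1252, and it uses `errors="replace"` to imitate a conversion that substitutes "?" for characters the target code page cannot represent. The function name `round_trip` is hypothetical.

```python
ANSI = "cp1252"  # assumption: what ENC_ANSI resolves to on the client


def round_trip(s1: str) -> str:
    """Python analogue of the four SAL calls above."""
    b1 = s1.encode("cp1250", errors="replace")  # SalStrToMultiByte( s1, s2, 1250 )
    s2 = b1.decode(ANSI)                        # SalStrToWideChar( s2, s2, ENC_ANSI )
    b2 = s2.encode(ANSI)                        # SalStrToMultiByte( s2, s2, ENC_ANSI )
    return b2.decode("cp1250")                  # SalStrToWideChar( s2, s2, 1250 )


cedilla = "\u015f\u0163"      # s and t with cedilla, present in cp1250
comma_below = "\u0219\u021b"  # s and t with comma below, absent from cp1250

print(round_trip(cedilla) == cedilla)  # True
print(round_trip(comma_below))         # ??
```

In this sketch the information is already lost in the first step, before ENC_ANSI is involved, whenever the input contains a character that has no slot in code page 1250.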

Best regards,
Peter

Post by Jeff Luther » 05 Oct 2011, 16:45

I don't know what the developers did, Peter, just what they report back. Speaking of which, I received two further comments from two of the developers. They are collectively recommending a Unicode DB column:

Dev. 1:
"To handle multiple code pages they need to use Unicode data.

This code converts a Unicode string to CP1250, and then back to Unicode. Why convert at all? The original string is already Unicode."

Dev. 2:
"So what [Peter] really needs to do is use a Unicode column in the database and stop the conversions altogether."

The trouble with being in the middle is that I don't have all your answers, like "Why isn't he using a Unicode column in the first place?" etc. I am passing along these developers' comments so you can evaluate what you are doing in light of what they suggest.

And you wrote: "Since then this is defect for the two mentioned characters." -- yes, that's why I opened TD-16446.

Post by Jeff Luther » 05 Oct 2011, 17:28

P.S. some more info. from one of the developers:

Code:

When converting a non-Unicode string to Unicode, you need to specify the code page the non-Unicode string is using. If you specify the wrong code page, the converted Unicode characters' glyphs do not match the originals (see the following table). If they don't want to store a Unicode or UTF-8 string in the DB, they need to store the original string with its code page information, then convert to Unicode based on this code page information when getting the string from the DB.

Binary                       0xBA  0xFE
CP 1250 (Central European)   ş     ţ
CP 1251 (Cyrillic)           є     ю
CP 1252 (Western European)   º     þ
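
The table can be reproduced with a few lines of Python, using Python's built-in codecs as a stand-in for the Windows code page tables (an assumption, though the three code pages agree for these two byte values):

```python
# The same two bytes, interpreted under three different Windows code pages.
RAW = bytes([0xBA, 0xFE])

for codepage in ("cp1250", "cp1251", "cp1252"):
    chars = RAW.decode(codepage)
    # Show each glyph together with its Unicode code point.
    print(codepage, "  ".join(f"{c} (U+{ord(c):04X})" for c in chars))
```

Note that the code page 1250 row yields U+015F and U+0163, the cedilla forms of the two letters.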

Post by Peter.Hugk » 06 Oct 2011, 08:06

"When converting a non-Unicode string to Unicode, you need to specify the code page the non-Unicode string is using. If you specify the wrong code page, the converted Unicode characters' glyphs do not match the originals (see the following table). If they don't want to store a Unicode or UTF-8 string in the DB, they need to store the original string with its code page information, then convert to Unicode based on this code page information when getting the string from the DB.

Binary                       0xBA  0xFE
CP 1250 (Central European)   ş     ţ
CP 1252 (Western European)   º     þ"

Yes, this is basically what we have been doing for over 12 years now. The only difference is that we do not store the code page information with the string; instead, we store the string in a column that is programmatically linked to that code page.
The defect is not in the database or router. The defect is in these two lines (IMHO the first one):

Code:

Call SalStrToMultiByte( s1, s2, 1250 )
Call SalStrToWideChar( s2, s2, ENC_ANSI )

This shows the problem:
[Attached image: Romania.png]
On a system where Windows uses code page 1252, we should get "º þ" instead of "? ?".

"So what [Peter] really needs to do is use a Unicode column in the database and stop the conversions altogether."

But this is real life. I cannot convince 400 customers to buy a new database to fix a broken part of the program. And by the way, it was Unify developers who broke it.

Nevertheless thank you for your support, Jeff.

Post by Jeff Luther » 24 Apr 2012, 00:15

Peter: Development did look at TD-16446, and I got an update on the status, plus a comment and an image from the developer. Not good news for you, I'm afraid:

TD-16446 was closed by the developer as "Won't fix" with the following comment/reason, indicating that dev. could not do anything to fix this:

"[TD's SAL function] SalStrToMultiByte just calls the Win32 WideCharToMultiByte function for code conversion.

If you convert these letters encoded in cp1250 to Unicode, you get U+015F and U+0163. But this repro case [romania.apt that you provided] uses U+0219 and U+021B, so converting to cp1250 does not work."

He also attached this image:
[Attached image: screenshot-1.jpg]
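
The developer's point about the two pairs of code points can be checked in Python. The sketch below also shows one possible application-side workaround, which is an assumption and not something proposed in this thread: mapping the comma-below letters to their cedilla counterparts before converting to code page 1250. The table `COMMA_TO_CEDILLA` and the function `to_cp1250_safe` are hypothetical names.

```python
import unicodedata

# Hypothetical mapping: comma-below letters -> cedilla letters that cp1250 contains.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # capital S
    "\u0219": "\u015f",  # small s
    "\u021a": "\u0162",  # capital T
    "\u021b": "\u0163",  # small t
})


def to_cp1250_safe(text: str) -> bytes:
    """Replace comma-below forms with cedilla forms, then encode as cp1250."""
    return text.translate(COMMA_TO_CEDILLA).encode("cp1250")


# Print the official Unicode names of the four small letters involved.
for ch in "\u0219\u015f\u021b\u0163":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(to_cp1250_safe("\u0219\u021b").hex())  # bafe
```

The substitution is lossy: after a round trip the text contains the cedilla forms, not the comma-below forms that were typed, so whether it is acceptable depends on the application.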