Encoding errors with Romanian characters "ș" and "ț"
Since TD 5.2 SP3 or SP4, two Romanian characters are destroyed when a string is converted from code page 1250 (Central European) to ENC_ANSI.
A repro case is attached. We urgently need a patch for this.
Background: We handle multiple languages in an environment where users run Windows with different code page settings and SQLBase 11.6, and where most of the content must be saved in non-Unicode columns. To do so, every input in a language-specific data field is converted to ENC_ANSI before being saved in the DB. After reading the data from a language-specific database column, TD's automatic conversion has to be undone. This is done by reverting the autoconversion with ENC_ANSI and then converting with the appropriate code page. The principle is shown in the repro case. It works fine with all ANSI character sets (e.g. Greek) but not with the two Romanian characters "ș" and "ț".
Could someone from Unify please have a look at this? I need a bug number or even better a solution for this problem.
OK, I got a bug number for you: TD-16446
Bigger issue is what to do in the meantime (aka 'workaround' if any) and I have a question about this out to the tech help group here. If I hear of anything, I'll update this thread.
Peter - you may have already considered and thought about these issues, but here's what one developer wrote back to my query about your problem, FYI:
"ENC_ANSI should use the client machine setting for local code page so he needs to be sure Romanian is CP 1250 and that the machine language is set to Romanian for that to work ok.
There may also be some conversion at the database end depending on the column type he is inserting into and I believe the database code page."
Thank you for the bug number.
Did the developer even have a look at the source code of the repro case? When I convert a string with the following four steps:
Code:
Call SalStrToMultiByte( s1, s2, 1250 )
Call SalStrToWideChar( s2, s2, ENC_ANSI )
Call SalStrToMultiByte( s2, s2, ENC_ANSI )
Call SalStrToWideChar( s2, s2, 1250 )
s2 should be equal to the original string s1, right? This was the case until SP3 (or SP4) of TD 5.2. Since then it has been broken for the two characters mentioned.
Best regards,
Peter
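The expectation behind these four steps can be sketched outside TD. The following Python analogue is only an illustration, not the SAL implementation: it assumes that ENC_ANSI resolves to code page 1252 on the client machine, and the function name round_trip is made up for this example.

```python
def round_trip(s1: str, column_cp: str = "cp1250", ansi_cp: str = "cp1252") -> str:
    """Python analogue of the four SAL calls (assumed mapping, not TD code)."""
    raw = s1.encode(column_cp)            # 1. SalStrToMultiByte( s1, s2, 1250 )
    as_ansi = raw.decode(ansi_cp)         # 2. SalStrToWideChar( s2, s2, ENC_ANSI )
    raw_again = as_ansi.encode(ansi_cp)   # 3. SalStrToMultiByte( s2, s2, ENC_ANSI )
    return raw_again.decode(column_cp)    # 4. SalStrToWideChar( s2, s2, 1250 )


# Letters that exist in code page 1250 survive the round trip unchanged.
s1 = "\u0103\u00e2\u00ee"  # ă â î
s2 = round_trip(s1)
print(s1 == s2)  # True
```

In this sketch, steps 2 and 3 cancel each other out as long as every byte is defined in the assumed ANSI code page, so any loss would have to come from step 1 or step 4.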
I don't know what the developers did, Peter, just what they report back. Speaking of which, I received further comments from two of the developers. Collectively, they recommend a Unicode DB column:
Dev. 1: "To handle multiple code pages they need to use Unicode data. This code converts a Unicode string to CP1250, and then back to Unicode. Why is the conversion needed? The original string is already Unicode."
Dev. 2: "So what [Peter] really needs to do is use a Unicode column in the database and stop the conversions altogether."
Trouble with being in the middle is that I don't have all your answers, like "Why isn't he using a Unicode column in the first place?" etc. I am passing along these developers' comments so you can evaluate what you are doing in light of what they suggest.
And you wrote: "Since then this is defect for the two mentioned characters." -- yes, that's why I opened TD-16446.
P.S. some more info. from one of the developers:
"When converting a non-Unicode string to Unicode, you need to specify the code page the non-Unicode string is using. If you specify the wrong code page, the glyphs of the converted Unicode characters do not match the original (see the following table). If they don't want to store Unicode or UTF-8 strings in the DB, they need to store the original string together with its code page information, then convert to Unicode based on that code page information when reading the string from the DB."
Binary                        0xBA   0xFE
CP 1250 (Central European)    ş      ţ
CP 1251 (Cyrillic)            є      ю
CP 1252 (Western European)    º      þ
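The table can be reproduced with a few lines of Python, shown here only as an illustration of the developer's point. It relies on Python's built-in cp1250, cp1251 and cp1252 codecs, which are assumed to match the corresponding Windows code pages for these two byte values.

```python
# The same two bytes, interpreted under three different code pages.
raw = bytes([0xBA, 0xFE])

decoded = {cp: raw.decode(cp) for cp in ("cp1250", "cp1251", "cp1252")}

for cp, text in decoded.items():
    # Print each character together with its Unicode code point.
    print(cp, " ".join(f"{ch} (U+{ord(ch):04X})" for ch in text))
```

The bytes never change; only the code page chosen for decoding decides which characters come out, which is why the code page has to be known when the string is read back.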
Yes, this is basically what we have been doing for over 12 years now. The only difference is that we do not store the code page information with the string; instead we store the string in a column that is programmatically linked to that code page.
The defect is not in the database or router. The defect is in these two lines (IMHO the first one):
Code:
Call SalStrToMultiByte( s1, s2, 1250 )
Call SalStrToWideChar( s2, s2, ENC_ANSI )
"So what [Peter] really needs to do is use a Unicode column in the database and stop the conversions altogether."
But this is real life. I cannot convince 400 customers to buy a new database to fix a broken part of the program. And by the way, it was Unify developers who broke it.
Nevertheless thank you for your support, Jeff.
Peter: Development did look at TD-16446, and I got an update on the status, plus a comment and an image from the developer. Not good news for you, I'm afraid:
TD-16446 was closed by the developer as "Won't fix" with the following comment/reason, indicating that dev. could not do anything about fixing this:
"[TD's SAL function] SalStrToMultiByte just calls the Win32 WideCharToMultiByte function for code conversion.
If you convert these letters encoded in cp1250 to Unicode, you get U+015F and U+0163. But this repro case [romania.apt that you provided] uses U+0219 and U+021B, so converting to cp1250 does not work."
He also attached an image.
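The distinction the developer draws between the two pairs of code points can be checked with a short Python sketch. This is an illustration only: it uses Python's cp1250 codec as a stand-in for the Windows code page, and the helper name fits_cp1250 is hypothetical. Python reports an unmappable character as an exception; how TD surfaces the same situation is not shown here.

```python
import unicodedata

CEDILLA = "\u015f\u0163"      # ş ţ  (the letters present in cp1250)
COMMA_BELOW = "\u0219\u021b"  # ș ț  (the letters used in the repro case)


def fits_cp1250(text: str) -> bool:
    """Return True if every character of text has a byte in cp1250."""
    try:
        text.encode("cp1250")
        return True
    except UnicodeEncodeError:
        return False


for ch in CEDILLA + COMMA_BELOW:
    # Show code point, official Unicode name, and whether cp1250 can hold it.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), fits_cp1250(ch))
```

The cedilla letters report True and the comma-below letters report False, which matches the developer's statement that only U+015F and U+0163 have a place in code page 1250.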