Solved TD-4709: TD Source Code File Format --> UTF-8

ggd

TD-4709: TD Source Code File Format --> UTF-8

Post by ggd » 12 Apr 2012, 16:06

Hi!

It would be great if Team Developer could save its text source files in UTF-8 encoding.

Issue: TD-4709

Regards

Jeff Luther

Re: TD Source Code File Format --> UTF-8

Post by Jeff Luther » 24 Apr 2012, 22:29

Well, I think that won't happen. You referenced TD-4709, and I see that old defect was closed for v5.1 by development as "Won't Fix."

I see you have posted two threads about this before, as you listed. It sounds like you have a very specific issue with your source management program, so I'd suggest either getting that to work for you as is, or perhaps having a utility convert the source to UTF-8 before it is saved. My samples page: http://www.jeffluther.net/unify/#Code_Samples
Look in the Unicode section for the Save sample.

I didn't put together a read sample, though. So perhaps you'd write another utility that checks out the source and writes it in UTF-16 format (TD cannot read UTF-8)?

Dave Rabelink
Founder/Site Admin
Netherlands
Posts: 3384
Joined: 24 Feb 2017, 09:12
Location: Gouda, The Netherlands

Re: TD Source Code File Format --> UTF-8

Post by Dave Rabelink » 25 Apr 2012, 06:11

Well, if the version management system is the problem, just update it, IMHO.

We are using TortoiseSVN without problems using the TD Unicode files.
All tools work (like diff and merge).
TortoiseSVN.png
Regards,
Dave Rabelink


ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 25 Apr 2012, 11:24

I wrote everything down once already here: viewtopic.php?f=41&t=70196

Subversion issue
„UTF-16 is commonly used to encode files whose semantic content is textual in nature, but the encoding itself makes heavy use of bytes which are outside the typical ASCII character byte range. As such, Subversion will tend to classify such files as binary files, much to the chagrin of users who desire line-based differencing and merging, keyword substitution, and other behaviors for those files.”
http://svnbook.red-bean.com/en/1.7/svn-book.html
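
To see why UTF-16 sources trip this classification, here is a rough Python sketch of the kind of heuristic described in the Subversion FAQ: look at the first 1024 bytes and call the file binary if any byte is zero or if more than about 15% are not ASCII printing characters. The function name, the sample line, and the exact thresholds are assumptions for illustration; this is not Subversion's actual code.

```python
def looks_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Approximate a 'binary file' heuristic; not Subversion's real code."""
    sample = data[:sample_size]
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    # Treat tab, LF, CR, and 0x20..0x7E as "ASCII printing" characters.
    printable = sum(1 for b in sample if b in (9, 10, 13) or 32 <= b <= 126)
    return (len(sample) - printable) / len(sample) > 0.15


line = "Function: OnCreate\r\n"  # hypothetical source line
print(looks_binary(line.encode("utf-16")))  # True: each ASCII char carries a 00 byte
print(looks_binary(line.encode("utf-8")))   # False: plain ASCII bytes only
```

Under such a rule, any UTF-16 file containing ordinary ASCII text is flagged on the zero-byte test alone, regardless of how readable its content is.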

Therefore
I can't use some important svn properties
svn:eol-style=native
svn:keywords=Date Rev Author URL Id (Don't change this)
I can't use TortoiseSVN Blame...
I can't use svnlook diff
etc...

Source code size issue
For source code that is mostly ASCII, UTF-8 encoding yields files about 50% smaller than UTF-16.
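
The 50% figure holds for text that is almost entirely ASCII, which is typical of source code: UTF-16 spends two bytes on each such character and UTF-8 one. A quick check in Python, using made-up sample text (the saving shrinks as the share of non-ASCII characters grows):

```python
sample = "Set sName = 'Team Developer'\r\n" * 1000  # hypothetical ASCII-only source

utf8_size = len(sample.encode("utf-8"))
utf16_size = len(sample.encode("utf-16-le"))  # no BOM, 2 bytes per ASCII char

print(utf8_size, utf16_size, utf8_size / utf16_size)

# Accented Latin letters need 2 bytes in UTF-8 as well, so the gap narrows.
accented = "árvíztűrő tükörfúrógép"
print(len(accented.encode("utf-8")), len(accented.encode("utf-16-le")))
```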

Other opinion
„Why is it Important?
UTF-8 is an important encoding because of the following reasons:
• ASCII compatible
• easily supported
• compact and efficient for most scripts
• easily processed, unlike other multibyte encodings”
http://developers.sun.com/dev/gadc/tech ... /utf8.html

Solution
A Team Developer save option (a UTF-8 check box) would solve this problem, together with correct reading of UTF-8 files.
Anyone who uses the Subversion version control system seriously would be grateful for this.

best regards

ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 28 Apr 2012, 20:52

How Visual Studio solves this:
save_file_as_project.jpg
save_file_as_source.jpg
advanced_save_options.jpg

fakie

Re: TD Source Code File Format --> UTF-8

Post by fakie » 06 Jul 2012, 08:28

Our team is also running into problems with managing TD-Sources in our version control system (which is GIT).
An option to save the sources as UTF-8 instead of UTF-16 would be desirable!

Regards,
Patrick

ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 09 Jul 2012, 17:00

Unfortunately, fewer people are interested in this problem on this non-public forum than on the public 5.x one.
However, even if somebody uses a version control system, good or bad, only for archiving, a source code size that is 50 percent smaller should still count for something.

The total size of our source code is ca. 480 MB (ASCII), in ~50 main and ~50 custom modules.
Although low HDD prices are often mentioned, it is not negligible if the size approaches 1 GB. In other words, doubling the code size doubles the amount of data traffic. And once again, files handled as binary increase the size of the repository drastically.

Unfortunately, no support of any kind seems to be forthcoming from Unify (at least, the silence is deafening); they do not seem to understand even the problem itself. I guess so because they keep sending conversion routines, although the conversion routines themselves are not the problem. The question is how to build the conversion (in both directions) into the application life cycle management process.

juhosalo
Finland
Posts: 24
Joined: 27 Nov 2017, 16:06
Location: Finland

Re: TD Source Code File Format --> UTF-8

Post by juhosalo » 29 Oct 2012, 15:06

The only problems we have with using Tortoise SVN are these:
- Sometimes Tortoise SVN classifies the files as mime type binary - this is easily fixed by manually forcing the mime type to text file
- Because GUI settings of the IDE are saved in the source code files, these cause a lot of conflicts, and I have instructed people to remove these changes at the pre-commit check phase

We use Beyond Compare 3 to do diff and 3-way diff for merge and have built a language definition for Gupta in Beyond Compare 3 so we get syntax highlighting (well, up to a point anyways).

ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 29 Oct 2012, 17:26

Subversion treats UTF-16 files as binary, and TD writes its source files in this encoding.

This is our problem:
viewtopic.php?f=41&t=70196

juhosalo

Re: TD Source Code File Format --> UTF-8

Post by juhosalo » 30 Oct 2012, 16:37

When this happens for us it actually sets the svn property on the file to indicate that the file is binary. If we then manually remove this property it handles the file as text again.
Is this not how it works for you?

I might actually have misunderstood your issue, because I just checked one of our source code files that SVN has marked as binary and I could still show a diff and use Blame for the file. The problem only arises when merging changes between branches; it does not allow editing conflicts, for example.

I currently have TortoiseSVN 1.7.9 Build 23248 running on Windows 7 64bit.

ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 30 Oct 2012, 21:56

Please read carefully. The source is svnbook.red-bean (link below); the emphasis is mine:

„UTF-16 is commonly used to encode files whose semantic content is textual in nature, but the encoding itself makes heavy use of bytes which are outside the typical ASCII character byte range. As such, Subversion will tend to classify such files as binary files, much to the chagrin of users who desire line-based differencing and merging, keyword substitution, and other behaviors for those files.”
http://svnbook.red-bean.com/en/1.7/svn-book.html

The above still holds even if you delete the mime type. You can't force this manually.

I use the Subversion config file:
*.app = svn:mime-type=text/plain;svn:eol-style=native;svn:keywords=Date Rev Author URL Id
*.apl = svn:mime-type=text/plain;svn:eol-style=native;svn:keywords=Date Rev Author URL Id

eol-style cannot be used (no surprise there). The mime type is set to plain text (I also tried deleting it, with the same result). Keywords are not substituted.

Once more:
I can't use some important svn properties
svn:eol-style=native
svn:keywords=Date Rev Author URL Id (Don't change this)
I can't use TortoiseSVN Blame...
I can't use svnlook diff
etc...

[Windows 7 Enterprise x64 SP1; Subversion 1.7 (command line); TortoiseSVN 1.7.10; Beyond Compare 3]

juhosalo

Re: TD Source Code File Format --> UTF-8

Post by juhosalo » 31 Oct 2012, 16:02

OK, sorry I presumed that the problem was only about the effects caused by SVN changing the mime type. But I understand now that there are some other features that do not work for Gupta files even when the mime type is set to text explicitly.
I also did not realize you mentioned svnlook diff which is a server side tool. I have never tried SVN Blame but now that I did I can see that in fact it does not work for UTF-16 sources.
I also understood that the highlighted passage specifically refers to the changing of the mime type, which for us is irritating but not a big problem. But I was wrong, and in fact there are also other issues where SVN functions do not work because it thinks the files are binary.

You are correct. I agree it would be nice if the source code were converted to UTF-8 when saving to disk. That is, keep everything as UTF-16 internally but save to disk as UTF-8. It would also make it easier to merge changes between the Gupta 4.2 and Gupta 6.0 versions of our software.

No one should ever use any non-ASCII characters in source code string literals anyways. Localization resources should always be externalized. And of course all class names, variables, functions and so on should be in English. I really, really hate non-English code (and I'm Finnish).

So...
+1

ggd

Re: TD Source Code File Format --> UTF-8

Post by ggd » 31 Oct 2012, 17:19

Thank You! :D

Unicode itself is no problem, indeed. Source code must sometimes contain non-ASCII characters (I'm Hungarian, so the GUI is written in Hungarian too; this is the localization base). We are only talking about how a stream of characters is serialized. UTF-8 is the new ASCII; the world generally uses it.

Jeff Luther

Re: TD Source Code File Format --> UTF-8

Post by Jeff Luther » 01 Nov 2012, 00:47

This forum thread has gotten large and a number of you have replied to it. It seems to be an important issue, so I am asking the dev. group internally about the idea of an Enhancement for a future TD to support reading and writing source with UTF8 encoding as an option. When I have more info I'll let you know, and if development and Product Management consider it I will add a TD enhancement request for this.

In the meantime I wondered what it would take to write a TD example that could:
* read in TD text source with Unicode encoding --> write out as UTF8 encoding
and the reverse:
* read in UTF8 encoded text file --> write out as Unicode encoded

So, I put a basic example together to do just this. The ZIP file contains:
sample-v60_UTF16.app -- simple v6 app with a form
sample-v60_UTF8.app -- same text file content as above that I'd read in with Notepad and saved out Encoded as UTF8
example_ReadWrite_UTF8-ReadWriteOK.app -- example code in TD v6.0 format that reads in each of those "sample-v60" and saves each out in other encoding with different output name. (-UTF8 or -Unicode is appended to root file name)

The example should run as is if it can locate the 2 sample files. It also has 2 IsIn functions that check that the input file is in the correct format.
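
The attached ZIP is not reproduced here. As a rough illustration of the same two conversions, a minimal Python sketch follows. The function names are hypothetical, and it assumes the TD source is UTF-16 with a byte order mark and that the UTF-8 side is written with a BOM, as Notepad does. Whether a given TD version can read the UTF-8 result is a separate question this sketch does not address.

```python
import codecs
from pathlib import Path


def is_utf16(data: bytes) -> bool:
    """True if the data starts with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def is_utf8(data: bytes) -> bool:
    """True if the data decodes cleanly as UTF-8 (BOM optional)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf16_to_utf8(src: Path, dst: Path) -> None:
    data = src.read_bytes()
    if not is_utf16(data):
        raise ValueError(f"{src} does not start with a UTF-16 BOM")
    text = data.decode("utf-16")               # the BOM selects the byte order
    dst.write_bytes(text.encode("utf-8-sig"))  # writes EF BB BF first


def utf8_to_utf16(src: Path, dst: Path) -> None:
    data = src.read_bytes()
    if not is_utf8(data):
        raise ValueError(f"{src} is not valid UTF-8")
    text = data.decode("utf-8-sig")            # drops a leading BOM if present
    dst.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

Working on raw bytes rather than text-mode files keeps the CR/LF line endings exactly as they were, so a UTF-16 to UTF-8 to UTF-16 round trip should give back the original file byte for byte.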

How to tell what encoding a text file has:
** You can always use Notepad to read in a text file, then click File/Save As... That dialog has an Encoding: field that tells you what the file is encoded as.
** If you want to see the file content in hex -- that is, the hex chars heading each file that tell the OS and applications what the encoding is -- a good utility for this that I use is HexEdit: http://www.hexedit.com/
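
The same check can be scripted. Below is a minimal Python sketch that reads the leading bytes of a file and names the byte order mark if one is present. The function names are hypothetical, and a file without a BOM is reported as unknown rather than guessed.

```python
import codecs

# Longer marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(head: bytes) -> str:
    """Name the encoding indicated by a leading BOM, or 'unknown (no BOM)'."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return "unknown (no BOM)"


def describe(path: str) -> str:
    """Return the first bytes of a file in hex plus the BOM verdict."""
    with open(path, "rb") as f:
        head = f.read(4)
    return f"{head.hex(' ')} -> {sniff_bom(head)}"
```

For a file saved by Notepad as "Unicode", the leading bytes would be ff fe; for "UTF-8" they would be ef bb bf.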

If you needed to get some text file encoded to UTF8 (and later converted back to Unicode) this code should work.

NOTE -- as for other examples I've posted here and on my Samples page: http://www.jeffluther.net/unify/
and the bottom TERMS OF USE: http://www.jeffluther.net/unify/#mailin ... nformation

This example is provided on an AS-IS basis and I urge you to be sure to back up everything before testing with real code!! Especially if you enhance the UI for this example so it supports multiple files, etc.

juhosalo

Re: TD Source Code File Format --> UTF-8

Post by juhosalo » 01 Nov 2012, 12:16

Thanks for the sample, Jeff. But the real issue is not about being able to convert files between encodings. And I would not write such a tool with Gupta anyways. I would use .NET 4 and Parallel LINQ for super, super easy multithreading to take advantage of our workstations' SSD drives. The performance is just that much better.

I guess we could have a tool that could be run on a folder, check the encoding of all Gupta source code files and convert the ones that are UTF-16. But I would not use this without official support in Team Developer for reading UTF-8 source code (even if it seems to work now). The idea we have here would be to save the desired encoding in the Gupta source code metadata (but in clear text, none of that hex crap please). Then the IDE would make sure that when saving the file it would first convert it to the desired encoding.
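
A dry-run version of such a folder check might look like the following Python sketch. It only reports which files start with a UTF-16 byte order mark and converts nothing. The *.app and *.apl patterns are taken from the auto-props lines quoted earlier in the thread; the function name is hypothetical.

```python
import codecs
from pathlib import Path

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def find_utf16_sources(root: str, patterns=("*.app", "*.apl")) -> list:
    """Return source files under root that start with a UTF-16 BOM."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            with path.open("rb") as f:
                # Only the first two bytes are needed for the check.
                if f.read(2) in UTF16_BOMS:
                    hits.append(path)
    return sorted(hits)
```

Calling find_utf16_sources(".") from the root of a working copy lists the candidates, which can then be reviewed before any conversion step is wired in.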

But still... This is not a critical issue for us. What we need is to increase the stability of the platform, that is the biggest concern for us and of course unfortunately the hardest to fix because providing applications for repro is often impossible.


And when we talk about how to find out what encoding a file is in... If there are no BOM markers in the file or any other metadata about the encoding, then it is always a guess, and there is no clearly defined way for different applications to figure out the encoding. For UTF-16 you can make a guess because of the large number of 00 bytes in the file. For UTF-8 you can try to find a 2-byte (or longer) sequence. But for other legacy encodings it is impossible to tell which one is used just by looking at the data.
What I have found is that I need to tell our developers, when they are testing encoding-related issues, to always, always look at the raw data in a hex view to see the bytes.
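
As a rough illustration of the guessing described above, the Python sketch below handles data without a BOM: it looks for a high share of 00 bytes in every second position (suggesting UTF-16), then tries a clean UTF-8 decode. The function name and the 30% threshold are arbitrary assumptions, and the result is only ever a guess; legacy single-byte encodings all end up as "unknown".

```python
def guess_encoding(data: bytes) -> str:
    """Guess the encoding of BOM-less data; the answer is never certain."""
    if not data:
        return "unknown"
    # Mostly-ASCII text in UTF-16 has a 00 byte in every second position.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    half = max(len(data) // 2, 1)
    if odd_zeros / half > 0.3 and even_zeros == 0:
        return "probably UTF-16 LE"
    if even_zeros / half > 0.3 and odd_zeros == 0:
        return "probably UTF-16 BE"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown (some legacy 8-bit encoding?)"
    if all(b < 0x80 for b in data):
        return "ASCII (also valid UTF-8)"
    return "probably UTF-8"
```

Even when this returns a confident-looking answer, the hex view remains the authority: the heuristic can be fooled by short inputs or by text whose UTF-16 code units happen to contain few zero bytes.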
