Phonetic search for German language

forum.sourcecode (2000-2005) & forum.td.sourcecode (2005-2010)
Ingo Pohl

Post by Ingo Pohl » 06 Jul 2006, 15:30


Hello,

SALExtension contains a small function, "SalStrSoundex", which returns a
code for any given word. With this code you can search for duplicates
("Dubletten").

Unfortunately, the Soundex algorithm was designed for English speakers,
so I searched the Internet and found the following:

http://www.strohhalm.org/archiv/progr_server/Algorithmus_fuer_phonetische_Suche__215284.html

At the end of that article you can download C code, which I have attached
to this post:
http://www.heise.de/ct/ftp/99/25/252/

I would like to use this algorithm, but we have no C compiler here.

Would someone please be so kind as to build a DLL from it?
(The code is under the LGPL, so it is legal to distribute it for free.)

Thanks in advance

Ingo Pohl
--
Using TD 3.1 PTF4 WinXP SP2 (German)


Thomas Lauzi

Re: Phonetic search for German language

Post by Thomas Lauzi » 06 Jul 2006, 18:29


Hi Ingo,
IMHO Soundex also works satisfactorily for German.
I looked at the link, and they are right (in most cases).
But in German, too, the "H" can mostly be dropped, except in ph=f, and
that case can be handled by normalizing/synonyms.
The algorithm can be modified so that it works better for German.

(Our software includes two different external duplicate finders and a
simple internal one, so that the customer can choose.)

1. First normalize the strings:

Code:

ß=ss
ck=k
k=g
d=t
ao=au
tz=z
sch=ch
ay=ei
ai=ei
ey=ei
ä=ae
ü=ue
ph=f
strasse=str
weg=str
...
...
..
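
The normalizing step above could be sketched in Python roughly as follows. This is only an illustration, not the code used in the software mentioned in the post: the rule list is copied from the post, while the name `normalize`, the lower-casing of the input, and applying the rules one after another in the posted order are assumptions.

```python
# Replacement rules copied from the post. Applying them sequentially in
# exactly this order is an assumption (e.g. "ck" -> "k" runs before "k" -> "g").
NORMALIZE_RULES = [
    ("ß", "ss"), ("ck", "k"), ("k", "g"), ("d", "t"), ("ao", "au"),
    ("tz", "z"), ("sch", "ch"), ("ay", "ei"), ("ai", "ei"), ("ey", "ei"),
    ("ä", "ae"), ("ü", "ue"), ("ph", "f"), ("strasse", "str"), ("weg", "str"),
]


def normalize(text):
    """Lower-case the text and apply every rule in list order."""
    result = text.lower()
    for old, new in NORMALIZE_RULES:
        result = result.replace(old, new)
    return result


print(normalize("Lautzi"))       # lauzi
print(normalize("Laozi"))        # lauzi
print(normalize("Müler"))        # mueler
print(normalize("Hauptstraße"))  # hauptstr
```

Because each rule sees the output of the previous one, the order of the list matters; a different order can give different normalized strings.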
Then applying the (modified) Soundex gives useful results,
e.g.:

Code:

orig.      normalized    soundex
Lautzi  -> Lauzi      -> L520
Lanzi   -> Lanzi      -> L520
Lauzi   -> Lauzi      -> L520
Laozi   -> Lauzi      -> L520

Müller  -> Mueller    -> M546
Mueller -> Mueller    -> M546
Müler   -> Mueler     -> M546
The Soundex can be modified very easily, e.g. in code length, character
groups, ...
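
To illustrate what "code length" and "character groups" mean as parameters, here is a minimal Python sketch of the classic (English) Soundex with both made configurable. The function name, the parameter names and the default groups are assumptions for illustration. The modified variant used in the post is not shown there, so this sketch is not expected to reproduce the codes in the table above.

```python
# Classic Soundex character groups; pass a different dict to experiment.
DEFAULT_GROUPS = {
    "bfpv": "1",
    "cgjkqsxz": "2",
    "dt": "3",
    "l": "4",
    "mn": "5",
    "r": "6",
}


def soundex(word, length=4, groups=DEFAULT_GROUPS):
    """Return a Soundex-style code: first letter plus group digits."""
    codes = {ch: digit for letters, digit in groups.items() for ch in letters}
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # ungrouped letters (vowels, umlauts) reset prev
    return (result + "0" * length)[:length]


print(soundex("Robert"))             # R163
print(soundex("Mueller"))            # M460
print(soundex("Mueller", length=6))  # M46000
```

A German variant would mainly change the groups dictionary and possibly the code length; letters that appear in no group, such as umlauts, are treated like vowels here.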

Regards,
Thomas L.

Thomas Lauzi

Re: Phonetic search for German language

Post by Thomas Lauzi » 10 Jul 2006, 10:25


Hi Ingo,
normalizing just means replacing characters with equivalent characters or
synonyms.
"! In der ersten Schleife Ä und so weiter umwandeln" ("in the first loop,
convert Ä and so on") is your normalizing step.
In your third loop you replaced characters with their phonetic equivalents.
You hardcoded this, but it would be more elegant and more flexible to use a
text file that is read into an array at startup.
I don't know of a library for normalizing (synonyms); the only interesting
part is the list of words that are synonyms. If you find one, let me
know...
Our word list of synonyms is very specialized and small, because our
customers only have addresses from one particular line of business.

Regards,
Thomas L.

An example of this improvement:

Code:

synonyms.txt
Ä=A
Ö=O
.=
strasse = str
weg = str
Gesellschaft mbH = GmbH
GmbH & Co= GmbH
GmbH & Co KG= GmbH
Krankenhaus=Hospital
Ambulanz=Hospital
Sozietät=Kanzlei
...

phonetic.txt
QU=KF
CH=
PH=F
...
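
Reading such rule files into an array and applying them could look roughly like this in Python. It is a sketch under assumptions: the function names are made up, the sample rules are taken from the synonyms.txt example above, and applying longer keys first with case-insensitive matching is a design choice of this sketch, not something stated in the post.

```python
import re

# Sample rules taken from the synonyms.txt example in the post.
SAMPLE_SYNONYMS = """\
Ä=A
Ö=O
.=
strasse = str
weg = str
Gesellschaft mbH = GmbH
GmbH & Co= GmbH
GmbH & Co KG= GmbH
Krankenhaus=Hospital
"""


def parse_rules(lines):
    """Parse 'key=value' lines into (key, value) pairs, longest key first."""
    rules = []
    for line in lines:
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key:  # skip headings, "..." and empty lines
            rules.append((key, value))
    # Longest key first, so "GmbH & Co KG" is tried before "GmbH & Co".
    return sorted(rules, key=lambda rule: len(rule[0]), reverse=True)


def apply_rules(text, rules):
    """Replace every occurrence of each key, ignoring case."""
    for key, value in rules:
        text = re.sub(re.escape(key), lambda _m, v=value: v, text,
                      flags=re.IGNORECASE)
    return text


RULES = parse_rules(SAMPLE_SYNONYMS.splitlines())
print(apply_rules("Klinik GmbH & Co KG", RULES))  # Klinik GmbH
print(apply_rules("Hauptstrasse 5", RULES))       # Hauptstr 5
```

With a real file, `parse_rules(open(path, encoding="utf-8"))` would work the same way, since the function only iterates over lines. The same parser could read phonetic.txt; an empty right-hand side such as `CH=` simply deletes the key.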

Ingo Pohl

Re: Phonetic search for German language

Post by Ingo Pohl » 10 Jul 2006, 14:56


Hi there,

Here is a small app with:
SalStrSoundex (minor changes for the German language)
SalStrMetaphone (hardcoded German)
SalStrMetaphoneINI (phonetics and their equivalents can be stored in an INI
file)
SalStrLevenshtein (calculates the Levenshtein (edit) distance between two
given words)
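
For readers without access to the attachment, the Levenshtein distance mentioned in the list can be sketched in Python as below. This is the textbook dynamic-programming formulation, not the SAL code from the attached app; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("Müller", "Mueller"))  # 2
print(levenshtein("Mueller", "Mueler"))  # 1
```

The distance counts character edits only; it becomes a kind of phonetic distance when it is applied to normalized strings or phonetic codes rather than to the raw words.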

Maybe this would be something for SALExtension?

Is there anyone out there with a C compiler?

thanks Ingo



Ingo Pohl

Re: Phonetic search for German language

Post by Ingo Pohl » 28 Jul 2006, 14:37


Well, I have now programmed a Metaphone function for the German language.

Could someone still please be so kind as to compile the DLL for me?

Thanks, Ingo

@Thomas Lauzi
Is there a library for normalizing somewhere?
I guess there must be hundreds of characters, syllables and words
that have to be normalized...

