Sunday, August 21, 2011

6. How to Support more languages


As mentioned before, we use Enchant to support more languages, so there are five backends available for this. Take ISpell and MySpell as examples.

In the folder “abiword\msvc2008\Debug\” there are folders for hyphenation: ISpell and MySpell. There are also two folders for their dictionaries.

User interface to manage hyphenation


Users can now enable or disable the hyphenation function in the user interface (GUI).
- I have finished the GUI on Windows, Linux, and Cocoa.
- The strings have been translated into most languages for globalization.
Take the Windows GUI as an example: the user ticks or clears a checkbox to enable or disable the hyphenation function.
The Linux and Cocoa versions need more testing.
[Figure: Simple Checkbox for Hyphenation GUI]

4 Code Re-factor and debug


I have finished refactoring the code in both Enchant and AbiWord. The refactoring work covered:
1 cleaning up some ugly code
2 handling exceptions

3 Simple Implementation of Chinese Spell-Check in Enchant


After GSoC 2011, I would like to add Chinese spell-checking to Enchant. Chinese spell-checking is also an important issue for a word processor. I found some libraries that can support it, but since time was limited I only built a simple framework.
The main function:

2 Call the Hyphenation function in Abiword.


- Split the run to split the word while keeping the format
- Find the split info
- Deal with the user's operations (select, delete, cut, paste)

Main goal: call the hyphenation module of Enchant and display the hyphenation result in AbiWord. After each user operation (adding a new word, deleting a word, copying a word, cutting a word), refresh the hyphenation result accordingly.

The main code is added to the format function in LineBreaker.h(cpp)
// find the split point
while (pRunToBump && pLine->getNumRunsInLine() && (pLine->getLastRun() != m_pLastRunToKeep))
{
    UT_ASSERT(pRunToBump->getLine() == pLine);
    if (!pLine->removeRun(pRunToBump))
    {
        pRunToBump->setLine(NULL);
    }
    UT_ASSERT(pLine->getLastRun()->getType() != FPRUN_ENDOFPARAGRAPH);
    if (pLine->getLastRun()->getType() == FPRUN_ENDOFPARAGRAPH)
    {
        fp_Run * pNuke = pLine->getLastRun();
        pLine->removeRun(pNuke);
    }
    pRunToBump->printText();           // trace out a debug message (runs twice)
    pNextLine->insertRun(pRunToBump);  // called when a new line is created
    // get the word to split: only the last run that is bumped is examined
    if (!(pRunToBump->getPrevRun() && pLine->getNumRunsInLine() && (pLine->getLastRun() != m_pLastRunToKeep)))
    {
        pRunToSplit = pRunToBump;
        PD_StruxIterator text(pRunToBump->getBlock()->getStruxDocHandle(),
                              pRunToBump->getBlockOffset() + fl_BLOCK_STRUX_OFFSET);

        text.setUpperLimit(text.getPosition() + pRunToBump->getLength() - 1);
        UT_ASSERT_HARMLESS( text.getStatus() == UTIter_OK );
        UT_UTF8String sTmp;
        // collect the printable ASCII characters of the run
        while (text.getStatus() == UTIter_OK)
        {
            UT_UCS4Char c = text.getChar();
            UT_DEBUGMSG(("| %d |", c));
            if (c >= ' ' && c < 128)
                sTmp += static_cast<char>(c);
            ++text;
        }
        UT_DEBUGMSG(("The Split Text |%s| \n", sTmp.utf8_str()));
        if (sTmp.length() != 0)  // only keep a non-empty word
        {
            pWordToSplit = sTmp;
            UT_DEBUGMSG(("wordToSplit |%s| \n", pWordToSplit.utf8_str()));
        }
    }
    pRunToBump = pRunToBump->getPrevRun();
    UT_DEBUGMSG(("Next runToBump %p \n", pRunToBump));
}
}
// modify src/text/fmt/xp/fb_LineBreaker.cpp to place hyphenation points
// split the word
if (pWordToSplit.length() != 0)
{
    pWordHyphenationResult = pBlock->_hyphenateWord(pWordToSplit.ucs4_str().ucs4_str(), 0, 0);
    int tickLeft = pLine->getAvailableWidth();
    if (pWordHyphenationResult && *pWordHyphenationResult)
    {
        gchar *c = g_ucs4_to_utf8(pWordHyphenationResult, -1, NULL, NULL, NULL);
        // scan backwards through the result for hyphenation marks
        for (int index = g_utf8_strlen(c, -1); index >= 0; --index)
        {
            if (pWordHyphenationResult[index] == '-' && index < tickLeft)
            {
                pBreakPoint = index;
                fp_TextRun* textout = static_cast<fp_TextRun*>(pRunToSplit);
                textout->split(pBreakPoint);
            }
        }
        g_free(c);  // g_ucs4_to_utf8 returns a newly allocated string
    }
}
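To make the idea of "find the last hyphenation point that still fits" concrete, here is a minimal standalone sketch. It is not AbiWord code: the function name pickBreakPoint is hypothetical, and it measures the remaining space in characters, whereas the real line breaker works with layout widths.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of AbiWord): given a hyphenated word such as
// "hy-phen-ation" and the number of character columns still free on the line,
// return how many letters of the word stay on the line, or 0 if no
// hyphenation point fits. One column is reserved for the visible hyphen.
std::size_t pickBreakPoint(const std::string& hyphenated, std::size_t columnsLeft)
{
    std::size_t best = 0;     // letters kept on the current line
    std::size_t letters = 0;  // letters seen so far, hyphen marks excluded
    for (char ch : hyphenated)
    {
        if (ch == '-')
        {
            // letters + 1 accounts for the hyphen drawn at the end of the line
            if (letters > 0 && letters + 1 <= columnsLeft)
                best = letters;
        }
        else
        {
            ++letters;
        }
    }
    return best;
}
```

With 7 free columns, "hy-phen-ation" would break after "hyphen"; with only 2 free columns no break fits and the whole word moves to the next line.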

1.8 Test in Linux

I have tested the Enchant module on Red Hat. It works fine for me.

1.7 Deploying Enchant in AbiWord


I just copy the build output of Enchant to the right place in AbiWord:
enchant\bin\Debug\libenchant_myspell.dll ---->abiword\msvc2008\Debug\lib\enchant\libenchant_myspell.dll
enchant\bin\Debug\libenchant_ispell.dll ---->abiword\msvc2008\Debug\lib\enchant\libenchant_ispell.dll
enchant\bin\Debug\libenchant.dll ---->abiword\msvc2008\Debug\bin\libenchant.dll

1.2 Add five backends to support hyphenation


The five backends are ispell, myspell, zemberek, voikko, and uspell.
- Hunspell: uses a separate dictionary, such as hyph_en_us.dic; dictionaries can be downloaded from the internet
- Libhyphenation: the dictionary is provided by the author and is sometimes limited
- Zemberek: for Turkish
- Voikko: for Finnish

The changes:
1 deleted the unneeded connections, such as HSpell
2 added the hunspell (myspell) hyphenation code
3 implemented hyphenation using hunspell
4 implemented hyphenation using Zemberek

======1 deleted the unneeded connections, such as HSpell===========
Hebrew doesn't need any hyphenation
Yiddish doesn't need any hyphenation
=======2 Implement hyphenation using hunspell
In order to use libhyphenation, we need to add these files:
hyphen/hnjalloc.h
hyphen/hnjalloc.c
hyphen/hyph_en_US.dic
hyphen/hyphen.c
hyphen/hyphen.gyp
hyphen/hyphen.h
hyphen/hyphen.patch
hyphen/hyphen.tex

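========3 Implement hyphenation using Zemberek
This just uses dbus_g_proxy_call, the same way spell-checking does in Zemberek.
The hyphenation code is as follows:
char* Zemberek::hyphenate(const char* word)
{
    // The reply is declared as G_TYPE_STRV, so it is read into a
    // NULL-terminated string array rather than a single char*.
    char** syllables = NULL;
    GError *Error = NULL;
    if (!dbus_g_proxy_call (proxy, "hecele", &Error,
            G_TYPE_STRING, word, G_TYPE_INVALID,
            G_TYPE_STRV, &syllables, G_TYPE_INVALID)) {
        g_error_free (Error);
        return NULL;
    }
    // Assumption: the syllables are joined with '-' to form the result.
    char* result = g_strjoinv ("-", syllables);
    g_strfreev (syllables);
    return result;
}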

1.3 ISpell

I used Libhyphenation in ISpell. The simple code looks like this:
static char *
ispell_dict_hyphenate (EnchantDict * me, const char *const word)
{
    ISpellChecker * checker;

    checker = (ISpellChecker *) me->user_data;
    // use the dictionary's language tag when one is set
    if (me->tag && me->tag[0] != '\0')
        return checker->hyphenate (word, me->tag);
    return checker->hyphenate (word, "en_us");  // otherwise fall back to US English
}
The concrete code in ISpellChecker is:
char *
ISpellChecker::hyphenate(const char * const utf8Word, const char *const tag)
{
    // we must choose the right language tag
    char* param_value = enchant_broker_get_param (m_broker, "enchant.ispell.hyphenation.dictionary.path");
    if (languageMap[tag] != "")
    {
        string result = Hyphenator(RFC_3066::Language(languageMap[tag]), param_value).hyphenate(utf8Word).c_str();

        // +1 leaves room for the terminating '\0' written by strcpy
        char* temp = new char[result.length() + 1];
        strcpy(temp, result.c_str());
        return temp;  // the caller owns the buffer
    }
    return NULL;  // no hyphenation dictionary is mapped to this tag
}
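When a std::string result is handed back as a plain char*, the buffer needs one byte more than length() for the terminator. The following is a small sketch of that copy step on its own; copyToCString is a hypothetical helper name, not something that exists in Enchant.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: copy a std::string into a new[]-allocated C string.
// The caller owns the returned buffer and must release it with delete[].
char* copyToCString(const std::string& text)
{
    char* buffer = new char[text.length() + 1];  // +1 for the trailing '\0'
    // c_str() is null-terminated, so copying length() + 1 bytes
    // carries the terminator along with the characters.
    std::memcpy(buffer, text.c_str(), text.length() + 1);
    return buffer;
}
```

A caller would use it as `char* s = copyToCString(result);` and later `delete[] s;`.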

1.4 MySpell

I used Libhyphenate in MySpell. The simple code looks like this:
char*
MySpellChecker::hyphenate (const char* const word, size_t len, char* tag)
{
    if (len == (size_t) -1) len = strlen(word);  // -1 means "word is null-terminated"
    if (len > MAXWORDLEN
        || !g_iconv_is_valid(m_translate_in)
        || !g_iconv_is_valid(m_translate_out))
        return 0;
    char* result = 0;
    myspell->hyphenate(word, result, tag);
    return result;
}
The concrete code in MySpellChecker is:
void Hunspell::hyphenate( const char* const word, char*& result, char* tag )
{
    result = NULL;  // stays NULL on failure
    /* load the hyphenation dictionary */
    string filePath = "hyph_";
    filePath += tag;
    filePath += ".dic";
    HyphenDict *dict = hnj_hyphen_load(filePath.c_str());
    if (dict == NULL) {
        fprintf(stderr, "Couldn't find file %s\n", filePath.c_str());
        return;  // report the failure instead of exiting the host application
    }
    char *hyphens = new char[BUFSIZE + 1];
    char **rep = NULL;  // these three must start out as NULL
    int *pos = NULL;
    int *cut = NULL;
    int len = strlen(word);
    // pass the full word length
    if (hnj_hyphen_hyphenate2(dict, word, len, hyphens, NULL, &rep, &pos, &cut)) {
        delete[] hyphens;  // allocated with new[], so free() is not appropriate
        hnj_hyphen_free(dict);
        fprintf(stderr, "hyphenation error\n");
        return;
    }

    hnj_hyphen_free(dict);
    result = hyphens;  // passed by reference so the caller sees it; caller must delete[] it
}
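The hyphens buffer filled in by the hyphen library is a vector of digits rather than a ready-to-display word. As I understand the library's convention, an odd digit at position i means a hyphen may be inserted after character i. The sketch below turns such a vector into a hyphenated string; applyHyphenVector is a hypothetical helper, and it assumes a single-byte (ASCII) word so that byte index and character index match.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: combine a word with a digit vector in which an odd
// digit at index i marks a possible hyphen after word[i].
// Assumes single-byte characters.
std::string applyHyphenVector(const std::string& word, const std::string& hyphens)
{
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        out += word[i];
        const bool isLast = (i + 1 == word.size());
        // never emit a hyphen after the final character
        if (!isLast && i < hyphens.size() && ((hyphens[i] - '0') & 1))
            out += '-';
    }
    return out;
}
```

For example, the word "example" with the vector "0100000" would come out as "ex-ample".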

1.5 zemberek

Zemberek is wired up the same way as the two backends above:
static char*
zemberek_dict_hyphenate (EnchantDict * me, const char *const word)
{
    Zemberek *checker;
    checker = (Zemberek *) me->user_data;
    return checker->hyphenate (word);
}
The concrete implementation, however, is different from the other two: we use zemberek_service.
char* Zemberek::hyphenate(const char* word)
{
    // The reply is declared as G_TYPE_STRV, so it is read into a
    // NULL-terminated string array rather than a single char*.
    char** syllables = NULL;
    GError *Error = NULL;
    if (!dbus_g_proxy_call (proxy, "hecele", &Error,
            G_TYPE_STRING, word, G_TYPE_INVALID,
            G_TYPE_STRV, &syllables, G_TYPE_INVALID)) {
        g_error_free (Error);
        return NULL;
    }

    // Assumption: the syllables are joined with '-' to form the result.
    char* result = g_strjoinv ("-", syllables);
    g_strfreev (syllables);
    return result;
}
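The D-Bus call above declares its reply as a string array (G_TYPE_STRV), which suggests the service answers with a list of syllables. Under that assumption, turning the list into one hyphenated word is a simple join. The sketch below shows the join in plain C++ without GLib; joinSyllables is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: join syllables with '-' to form a hyphenated word.
std::string joinSyllables(const std::vector<std::string>& syllables)
{
    std::string out;
    for (std::size_t i = 0; i < syllables.size(); ++i)
    {
        if (i > 0)
            out += '-';  // separator goes between syllables only
        out += syllables[i];
    }
    return out;
}
```

Given the two syllables "ki" and "tap", the function returns "ki-tap"; an empty list gives an empty string.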

1.6 voikko

The hyphenation implementation in Voikko is easy since Voikko has its own hyphenation API.
static char **
voikko_dict_suggest (EnchantDict * me, const char *const word,
                     size_t len, size_t * out_n_suggs)
{
    char **sugg_arr;
    int voikko_handle;

    voikko_handle = (long) me->user_data;
    sugg_arr = voikko_suggest_cstr(voikko_handle, word);
    if (sugg_arr == NULL)
        return NULL;
    // count the suggestions in the NULL-terminated array
    for (*out_n_suggs = 0; sugg_arr[*out_n_suggs] != NULL; (*out_n_suggs)++);
    return sugg_arr;
}
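Voikko's hyphenation call returns a pattern string rather than a hyphenated word. As far as I know, the pattern has one mark per character: a space means no hyphenation point, '-' means a hyphen may be placed before that character, and '=' means the character itself is replaced by the hyphen. The sketch below applies such a pattern; applyVoikkoPattern is a hypothetical helper and it assumes single-byte characters, so real Finnish text in UTF-8 would need per-character handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: apply a Voikko-style hyphenation pattern to a word.
//   ' '  no hyphenation point at this character
//   '-'  hyphenation point before this character
//   '='  hyphenation point, and this character is replaced by the hyphen
// Assumes one byte per character.
std::string applyVoikkoPattern(const std::string& word, const std::string& pattern)
{
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        const char mark = (i < pattern.size()) ? pattern[i] : ' ';
        if (mark == '=')
        {
            out += '-';  // the character is dropped in favour of the hyphen
            continue;
        }
        if (mark == '-')
            out += '-';  // hyphen goes in front of this character
        out += word[i];
    }
    return out;
}
```

For the word "kissa" and the pattern "   - " (a hyphen mark at the fourth position), the result is "kis-sa".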

1. Hyphenation module in Enchant


1.1 Add hyphenation function in Enchant

First, I added a hyphenation method to Enchant:
================the code===========
I think we can combine hyphenation with spell-checking, which makes the code more flexible. In my opinion, the hyphenation functions should be defined as follows:
EnchantDict* enchant_broker_request_dict (EnchantBroker* broker, const char *const lang); //same as spell-checking
char *enchant_dict_hyphenate(EnchantDict *dict, const char *const word,size_t len);

To provide this function in the abstract layer, we need to add a hyphenation entry to EnchantDict, as a function pointer:
char* (*hyphenate) (struct str_enchant_dict * me,
                          const char *const word, size_t len,
                          size_t * out_n_suggs);

The function is implemented by the backend. Take “ispell” as an example:
static char * ispell_dict_hyphenate (EnchantDict * me, const char *const word,
                    size_t len, size_t * out_n_suggs)
{
       ISpellChecker * checker;
       checker = (ISpellChecker *) me->user_data;
       return checker->hyphenate (word, len, out_n_suggs);
}

Finally, we set up the connections:
dict->hyphenate = ispell_dict_hyphenate;
dict->hyphenate = hspell_dict_hyphenate;
dict->hyphenate = zemberek_dict_hyphenate;
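The pattern described in this section (a function pointer stored in the dictionary object, filled in by the backend and called through a front-end wrapper) can be shown in isolation. The sketch below is not Enchant code: DemoDict, demo_backend_hyphenate and demo_dict_hyphenate are hypothetical names, and the toy backend just puts a hyphen in the middle of the word.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for EnchantDict with only the fields this sketch needs.
struct DemoDict
{
    void* user_data;  // backend-specific state (unused by the toy backend)
    char* (*hyphenate)(DemoDict* me, const char* word, std::size_t len);
};

// Toy backend: inserts one hyphen in the middle of the word. A real backend
// would consult its hyphenation dictionary through me->user_data.
static char* demo_backend_hyphenate(DemoDict* /*me*/, const char* word, std::size_t len)
{
    const std::size_t mid = len / 2;
    char* out = static_cast<char*>(std::malloc(len + 2));  // hyphen + '\0'
    if (!out)
        return NULL;
    std::memcpy(out, word, mid);
    out[mid] = '-';
    std::memcpy(out + mid + 1, word + mid, len - mid);
    out[len + 1] = '\0';
    return out;  // caller releases with free()
}

// Front-end entry point: checks its arguments, then dispatches through the
// function pointer that the backend installed.
char* demo_dict_hyphenate(DemoDict* dict, const char* word, std::size_t len)
{
    if (!dict || !word || !dict->hyphenate)
        return NULL;
    return dict->hyphenate(dict, word, len);
}
```

A backend connects itself with `dict.hyphenate = demo_backend_hyphenate;`, after which callers only ever use `demo_dict_hyphenate`. A dictionary whose backend has no hyphenation support leaves the pointer NULL, and the wrapper then returns NULL.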