Description
Indic may be troubled by the length of the compressed codes used.
@theraysmith Can you explain a little more about this?
Activity
Shreeshrii commented on Jan 12, 2017
Devanagari script has a large set of ligature forms for consonant conjuncts. These are combinations of Consonant + Viraama + Consonant (CVC), or CVCVC, or, even rarer, CVCVCVC.
Currently the generated unicharset uses the combination of the conjunct ligatures followed by vowel maatraas as well as vowel modifiers as a recognition unit, leading to unicharset of 5000+ lines.
You may want to consider recognizing the conjunct cluster as one unit and the vowel maatraas and vowel modifiers separately. A special case is the i maatraa, which comes before (to the left of) the consonant in Devanagari.
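For concreteness, a conjunct cluster really is a flat sequence of consonants and viraamas at the codepoint level. A short Python check (illustrative only; the cluster chosen here is one example) makes the CVC structure visible:

```python
import unicodedata

# "क्त्व" (ktva): a CVCVC conjunct cluster --
# consonant + viraama + consonant + viraama + consonant
cluster = "\u0915\u094D\u0924\u094D\u0935"
for ch in cluster:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This prints five codepoints, two of which are DEVANAGARI SIGN VIRAMA, even though the cluster renders as a single ligature glyph.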
For a listing of orthographic syllables by frequency for Sanskrit, please see
http://www.sanskritweb.net/itrans/ortho2003.pdf
For a list of ligature sets for Hindi, please see
http://tdil-dc.in/tdildcMain/articles/82170Devanagari%20Script%20Behaviour%20for%20Hindi%20%20ver%201.4.10.pdf
Shreeshrii commented on Jan 20, 2017
Font Comparison Samples
http://sanskritlibrary.org/Sanskrit/pub/chars.pdf
http://www.sanskritweb.net/itrans/itmanual2003.pdf Pages 43-75
Attested Hindi Ligatures
theraysmith commented on Jan 20, 2017
The LSTM recognizer is currently trained to recognize the sequence of unicodes for Indic languages. This reduces the size of the output softmax of the network from the 5000+ elements in the unicharset to ~140. (There is an analogous process for Chinese, Japanese, and Korean, that doesn't use the unicode encoding, but it is a similar idea, and the codes are strictly limited in length.)
The unicharset is used as a filter in the beam search to allow only sensible grapheme/syllable combinations of unicodes, so it doesn't output complete garbage text.
The consequence of this recoding is that it runs a lot faster, but it has to learn to output a long sequence for each grapheme/syllable.
The recoding system that maps from unicharset elements to the sequence of unicodes currently only allows a maximum of 9 unicodes per grapheme/syllable, including any viramas.
I'm running a new training experiment this weekend to try a new coding scheme, in which
<virama><consonant>
pairs are mapped to a single code, allowing a long CVCVCVC string to be encoded as just CCCC, cutting it down from 7 codes to 4. This will probably increase the size of the output softmax to ~170, but reduce the length of the average code sequence by about 1/3, which might be easier for it to learn without slowing it down much. It will take a couple of weeks to tell whether it works, but if it does I will check in the code, upload new traineddata files, and close this issue. If it doesn't work, I will have to think again...
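The pairing idea above can be sketched in a few lines of Python. This is only an illustration of the encoding scheme, not Tesseract's actual recoder (which is implemented in C++ inside the training code); the `VIRAMA` constant and function name are mine:

```python
VIRAMA = "\u094D"  # Devanagari sign virama

def recode(cluster):
    """Collapse each <virama><consonant> pair into one combined code.

    Plain characters map to themselves; a virama followed by another
    character becomes a single two-character token. A CVCVCVC string
    of 7 unicodes therefore shrinks to 4 codes: C, VC, VC, VC.
    """
    codes = []
    i = 0
    while i < len(cluster):
        if cluster[i] == VIRAMA and i + 1 < len(cluster):
            codes.append(cluster[i] + cluster[i + 1])  # one combined code
            i += 2
        else:
            codes.append(cluster[i])
            i += 1
    return codes

# "क्त्व्य" (ktvya): 7 unicodes (CVCVCVC) -> 4 codes
print(len(recode("\u0915\u094D\u0924\u094D\u0935\u094D\u092F")))  # 4
```

Each distinct `<virama><consonant>` pair becomes one extra entry in the output softmax, which is why the alphabet grows by roughly the number of consonants (~30) while the sequences shrink.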
Shreeshrii commented on Jan 21, 2017
theraysmith commented on Jan 23, 2017
Shreeshrii commented on Jan 24, 2017
Ray,
Thank you for the info on corpus building.
I have added links for resources for bih and snd in the langdata repo just now. Please see
Bihari training text not representative langdata#39 (Bihari)
Sindhi Language resources for corpus (Arabic script) langdata#42 (Sindhi in Arabic script)
I also added a link to this discussion at #622 for support regarding Khmer.
I will forward your post to the tesseract-ocr group to reach other community members too.
Shreeshrii commented on Jan 25, 2017
@theraysmith
I tried creating training data for khmer and was able to create box/tiff pairs with khmer text. It is possible that the fonts directory you used did not have khmer fonts or for some reason 'latin' fonts were used instead of khmer fonts. I will post the files separately under an issue in langdata.
I used the --find_fonts option of text2image to find the fonts that covered 70% of the khmer training text.
It may be useful, as part of the training process, to check the given font list for coverage and emit an error or warning if it falls below a certain threshold, before going ahead with building the box/tiff pairs.
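Such a pre-flight check could be sketched as follows. This is a hypothetical helper, not text2image's internals: font charsets are represented here as plain Python sets of characters, and the names and 0.7 threshold are illustrative:

```python
def coverage(font_charset, training_text):
    """Fraction of the distinct codepoints in the training text
    that the font can render (font_charset is the set of
    codepoints the font supports)."""
    needed = set(training_text)
    if not needed:
        return 1.0
    return len(needed & font_charset) / len(needed)

def check_fonts(fonts, text, threshold=0.7):
    """Warn about fonts below the coverage threshold and
    return only the usable ones."""
    usable = []
    for name, charset in fonts.items():
        c = coverage(charset, text)
        if c < threshold:
            print(f"warning: {name} covers only {c:.0%} of the text")
        else:
            usable.append(name)
    return usable
```

Running the usable fonts (and only those) through box/tiff generation would avoid the failure mode above, where a font list without adequate Khmer coverage silently produces training data rendered in fallback Latin fonts.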
edit: --oem 0 works with the khm.traineddata, --oem 1 recognizes it incorrectly.
Shreeshrii commented on Jan 25, 2017
Commands similar to the above can be used to get a font list that can be plugged into language-specific.sh, ensuring that it calls fonts that are available on the system and have adequate coverage. Here is the output file from the above on my system.