
Q&A: Indic - length of the compressed codes #654

Shreeshrii (Collaborator) opened this issue, quoting #648 (comment):

> Indic may be troubled by the length of the compressed codes used.

@theraysmith Can you explain a little more about this?

Activity

Shreeshrii (Collaborator, Author) commented on Jan 12, 2017

Devanagari script has a large set of ligature forms for consonant conjuncts. These are combinations of Consonant + Viraama + Consonant (CVC), or CVCVC, or even the rarer CVCVCVC.

Currently the generated unicharset uses the combination of the conjunct ligatures followed by vowel maatraas as well as vowel modifiers as a recognition unit, leading to a unicharset of 5,000+ lines.

You may want to consider recognizing the conjunct cluster as a unit, and the vowel maatraas and vowel modifiers separately. A special case is the i maatraa, which comes before (to the left of) the consonant in Devanagari. A code-point breakdown of some conjunct clusters is sketched below.
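
To make the CVC structure concrete, here is a minimal Python sketch (not from Tesseract) that decomposes a few conjunct clusters into their unicode code points:

    import unicodedata

    # Each conjunct ligature is a Consonant + Viraama + Consonant sequence
    # at the code-point level, even though it renders as a single glyph.
    for cluster in ["क्त", "स्त्र", "क्त्र्य"]:  # CVC, CVCVC, CVCVCVC
        print(cluster)
        for ch in cluster:
            print(f"    U+{ord(ch):04X} {unicodedata.name(ch)}")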

For a listing of orthographic syllables by frequency for Sanskrit, please see
http://www.sanskritweb.net/itrans/ortho2003.pdf

For a list of ligature sets for Hindi, please see
http://tdil-dc.in/tdildcMain/articles/82170Devanagari%20Script%20Behaviour%20for%20Hindi%20%20ver%201.4.10.pdf

Shreeshrii (Collaborator, Author) commented on Jan 20, 2017

theraysmith (Contributor) commented on Jan 20, 2017

The LSTM recognizer is currently trained to recognize the sequence of unicodes for Indic languages. This reduces the size of the output softmax of the network from the 5000+ elements in the unicharset to ~140. (There is an analogous process for Chinese, Japanese, and Korean that doesn't use the unicode encoding, but it is a similar idea, and the codes are strictly limited in length.)
The unicharset is used as a filter in the beam search to allow only sensible grapheme/syllable combinations of unicodes, so it doesn't output complete garbage text; a toy illustration of this filtering follows.
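
As a toy illustration only (a hypothetical three-entry unicharset; Tesseract's real filter is built over the full 5000+ entries), the beam search can be restricted to code sequences that are prefixes of known unicharset entries:

    valid_entries = {"क्त", "स्त्र", "क्त्र्य"}  # hypothetical unicharset
    prefixes = {entry[:i] for entry in valid_entries for i in range(1, len(entry) + 1)}

    def allowed_next(emitted: str, code: str) -> bool:
        """True if emitting `code` can still lead to a valid unicharset entry."""
        return (emitted + code) in prefixes

    print(allowed_next("क", "\u094D"))  # True: "क्" can still extend to क्त
    print(allowed_next("क", "ख"))       # False: no entry begins with "कख"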

The consequence of this recoding is that it runs a lot faster, but it has to learn to output a long sequence for each grapheme/syllable.
The recoding system that maps from unicharset elements to the sequence of unicodes currently only allows a maximum of 9 unicodes per grapheme/syllable, including any viramas.
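
A rough sketch of the recoding idea (this mirrors the description above, not Tesseract's actual implementation):

    MAX_CODES = 9  # current limit per grapheme/syllable, including viramas

    def recode(unichar: str) -> list[int]:
        """Map a unicharset entry to its sequence of unicode code points."""
        codes = [ord(ch) for ch in unichar]
        assert len(codes) <= MAX_CODES, f"{unichar!r} exceeds the {MAX_CODES}-code limit"
        return codes

    # The network's output alphabet is the set of distinct codes across all
    # entries (~140 for Devanagari), not one class per unicharset entry.
    unicharset = ["क्त", "स्त्र", "क्त्र्य"]  # hypothetical tiny unicharset
    alphabet = {code for entry in unicharset for code in recode(entry)}
    print(f"{len(unicharset)} entries -> {len(alphabet)} distinct output codes")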

I'm running a new training experiment this weekend to try a new coding scheme, in which <virama><consonant> pairs are mapped to a single code, allowing a long CVCVCVC string to be encoded using just CCCC, cutting down from 7 codes to 4. This will probably increase the size of the output softmax to ~170, but reduce the length of the average code sequence by about 1/3, which might be easier for it to learn, without slowing it down much.
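
A sketch of the proposed pairing, under the assumption that it simply folds each <virama><consonant> pair into one output code:

    VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

    def recode_with_pairs(unichar: str) -> list[str]:
        """Fold each <virama><consonant> pair into a single code."""
        codes, i = [], 0
        while i < len(unichar):
            if unichar[i] == VIRAMA and i + 1 < len(unichar):
                codes.append(unichar[i : i + 2])  # one code for the pair
                i += 2
            else:
                codes.append(unichar[i])
                i += 1
        return codes

    cluster = "क्त्र्य"  # CVCVCVC: 7 code points
    print(len(cluster), "->", len(recode_with_pairs(cluster)))  # prints: 7 -> 4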

It will take a couple of weeks to tell whether it works; if it does, I will check in the code, upload new traineddata files, and close this issue. If it doesn't work, I will have to think again...

Shreeshrii (Collaborator, Author) commented on Jan 21, 2017

theraysmith (Contributor) commented on Jan 23, 2017

Shreeshrii (Collaborator, Author) commented on Jan 24, 2017

Ray,

Thank you for the info on corpus building.

I have just added links to resources for bih and snd in the langdata repo. Please see there.

I also added a link to this discussion at #622 for support regarding Khmer.

I will forward your post to the tesseract-ocr group to reach other community members too.

Shreeshrii (Collaborator, Author) commented on Jan 25, 2017

@theraysmith wrote:

> I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.

I tried creating training data for Khmer and was able to create box/tiff pairs with Khmer text. It is possible that the fonts directory you used did not have Khmer fonts, or that for some reason 'latin' fonts were used instead of Khmer fonts. I will post the files separately under an issue in langdata.

I used the --find_fonts function of text2image to find the fonts that covered 70% of the Khmer training text.

It may be useful for the training process to check the given font list for coverage and give an error or warning if it falls below a certain threshold, before going ahead with building the box/tiff pairs. A rough sketch of such a check follows.
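
For example, a coverage check along these lines could run before rendering (a hypothetical helper, not part of the Tesseract training scripts; assumes fontTools is installed, and the font path is illustrative):

    from fontTools.ttLib import TTFont

    MIN_COVERAGE = 0.8  # illustrative threshold

    def coverage(font_path: str, text: str) -> float:
        """Fraction of distinct code points in `text` the font maps to a glyph."""
        cmap = TTFont(font_path)["cmap"].getBestCmap()
        needed = {ord(ch) for ch in text if not ch.isspace()}
        return len(needed & cmap.keys()) / len(needed)

    text = open("langdata/khm/khm.training_text", encoding="utf-8").read()
    for path in ["/usr/share/fonts/truetype/KhmerOS.ttf"]:  # hypothetical path
        cov = coverage(path, text)
        if cov < MIN_COVERAGE:
            print(f"WARNING: {path} covers only {cov:.0%} of the training text")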

Edit: --oem 0 works with the khm.traineddata; --oem 1 recognizes it incorrectly.

Shreeshrii (Collaborator, Author) commented on Jan 25, 2017

# List fonts under /usr/share/fonts that cover at least 80% of the Arabic
# training text, then reshape the matching output lines into a quoted,
# backslash-continued font list.
text2image --find_fonts \
--fonts_dir /usr/share/fonts \
--text ./langdata/ara/ara.training_text \
--min_coverage .8 \
--outputbase ./langdata/ara/ara \
|& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/' > ./langdata/ara/fontslist.txt

Commands similar to the above can be used to get a font list that can be plugged into language-specific.sh, to ensure that it calls fonts that are available on the system and have adequate coverage. Here is the output file from the above command on my system.

  "Arial" \
  "Arial Bold" \
  "Courier New" \
  "Courier New Bold" \
  "DejaVu Sans" \
  "DejaVu Sans Bold" \
  "DejaVu Sans Mono" \
  "DejaVu Sans Mono Bold" \
  "FreeMono" \
  "FreeMono Bold" \
  "FreeSerif" \
  "FreeSerif Bold" \
  "Times New Roman," \
  "Times New Roman, Bold" \
