Noise characters recognized with bbox as the entire page #1192

Open
@TerryZH

Description

Environment

  • Tesseract Version: v4.00.00dev-692-gad5ee18 with Leptonica
  • Commit Number: ad5ee18
  • Platform: MAC OS 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Current Behavior:

Line 1, unexpected '__' recognized between 1941 and Ritter, with bbox as the entire page.

sample
Corresponding HOCR line:
GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO

Expected Behavior:

'__' is not supposed to be recognized in the first place. If the false positive recognition is inevitable, the bbox information should be accurate.

Suggested Fix:

n/a

Activity

changed the title [-]Noise character recognized with bbox as the entire page[/-] [+]Noise characters recognized with bbox as the entire page[/+] on Oct 29, 2017

napasa commented on Mar 26, 2018

When will this be fixed?

Shreeshrii commented on Mar 28, 2018
Collaborator

Comparing best and fast OCRed text

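The best model does not add __ for the vertical bar in the table to the OCRed text, whereas the fast model does.
The two models also differ on the , inside the number strings: in the output below, fast keeps the commas and best drops them.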

Best

&% 1 2261002 Oct.28,1941 Ritter 260 SFO
(2 2 2211378 Jan. 27,1942 Searle ZB 2%

Fast

GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO
CG (2 2,271,378 Jan. 27,1942 Searle Ke \ 22

@zdenop Please label: accuracy

devendrasr commented on Jun 27, 2018

@Shreeshrii How can we overcome this issue?

willaaam commented on Sep 3, 2018

This is much more than an accuracy error - even for completely accurate words the bounding boxes make no sense at all, whether using TSV or HOCR output. This is pretty big for me.

The weird part is that somewhere internally it seems to know the coordinates, as the word ids are assigned in left-to-right order and are consecutive.

Does anyone have any suggestion on where to start looking? I'm happy to hunt this down with some rusty C skills but I really need a pointer, completely unfamiliar with the tesseract codebase.

It happens with the LSTM engine. I haven't been able to test with the legacy engine, as tesseract won't recognize the retro traineddata file in the tessdata folder. I'll update this post when I get tesseract 4 --oem 0 working.

Still occurring on the very latest master build (e4b9cff)

Example of the error:

   <div class='ocr_carea' id='block_1_15' title="bbox 192 481 2422 1117">
    <p class='ocr_par' id='par_1_16' lang='ita' title="bbox 192 481 2422 1117">
     <span class='ocr_line' id='line_1_19' title="bbox 674 481 1494 532; baseline 0.002 -14; x_size 52; x_descenders 13; x_ascenders 11">
      <span class='ocrx_word' id='word_1_46' title='bbox 674 483 861 532; x_wconf 91'>Sottoposto</span>
      <span class='ocrx_word' id='word_1_47' title='bbox 0 0 2485 3508; x_wconf 96'>a</span>
      <span class='ocrx_word' id='word_1_48' title='bbox 863 481 1494 521; x_wconf 95'>condizione</span>
      <span class='ocrx_word' id='word_1_49' title='bbox 0 0 2485 3508; x_wconf 96'>risolutiva</span>

Also goes wrong when printing the character level info:

c 2485 0 2485 0 0
a 2485 0 2485 0 0
p 2485 0 2485 0 0
i 2485 0 2485 0 0
t 736 1948 742 1954 0
a 2485 0 2485 0 0
l 2485 0 2485 0 0
e 789 1916 795 1956 0
s 2485 0 2485 0 0
o 2485 0 2485 0 0
c 2485 0 2485 0 0
i 928 1950 932 1956 0
a 932 1950 934 1956 0
l 967 1917 969 1957 0
e 969 1917 973 1957 0
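
A minimal sketch for spotting the broken entries in a dump like the one above. It assumes each line follows the box-file layout "symbol left bottom right top page"; the column order is an assumption, as it is not stated here. CharBox, ParseBoxLine and IsDegenerate are hypothetical helper names, not part of the Tesseract API.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper type for one line of character-level output.
struct CharBox {
  std::string symbol;
  int left = 0, bottom = 0, right = 0, top = 0, page = 0;
};

// Parses "<symbol> <left> <bottom> <right> <top> <page>".
// Returns false if the line does not have that shape.
bool ParseBoxLine(const std::string& line, CharBox* out) {
  std::istringstream in(line);
  return static_cast<bool>(in >> out->symbol >> out->left >> out->bottom >>
                           out->right >> out->top >> out->page);
}

// A box with zero (or negative) width or height cannot enclose a glyph.
bool IsDegenerate(const CharBox& b) {
  return b.right <= b.left || b.top <= b.bottom;
}
```

Under that column order, a line such as "c 2485 0 2485 0 0" is flagged as degenerate, while "t 736 1948 742 1954 0" is not.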

Sintun commented on Sep 3, 2018
Contributor

@willaaam
I am not sure whether this is a problem in the tesseract console program or at the API level. If it is independent of the console program, it is probably a variation of #1712. I have also started looking at this problem. Here is a pointer for tracking it down:
After an OCR run, the result information can be extracted through a result iterator:
std::unique_ptr<tesseract::ResultIterator> ri( tess->GetIterator() );
The bounding box at each detail level (for example tesseract::RIL_PARA, tesseract::RIL_TEXTLINE, tesseract::RIL_WORD, tesseract::RIL_SYMBOL, i.e. paragraph, text line, word, character) can be obtained through

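// level is the granularity to iterate at (e.g. tesseract::RIL_WORD);
// higher_level is the enclosing element to stop at (e.g. tesseract::RIL_TEXTLINE).
if( ri )
{
  do
  {
    int x1, y1, x2, y2;
    ri->BoundingBox( level, &x1, &y1, &x2, &y2 );
    if( ri->IsAtFinalElement( higher_level, level ) )
      break;
  }
  while( ri->Next( level ) );
}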

Now the bounding boxes on all levels are consistent, in some cases consistently wrong. So I would start with a minimal failing image and track down where the information returned by BoundingBox originates and where it goes wrong.
https://tesseract-ocr.github.io/4.0.0/a02399.html#aae57ed588b6bffae18c15bc02fbe4f68

Doing that is also on my to-do list, but unfortunately I haven't found the time yet. Meanwhile our codebase "found" a temporary workaround that led to beautiful function names like

void tesseractBugFixingCharSizePlausibilityCheck();
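
A workaround of that kind could look roughly like the following sketch. It is not the code behind the function name above; Box, Contains, CoversPage and IsPlausibleWordBox are hypothetical names. The assumption is that a sane word box lies inside the box of its text line and does not span the whole page.

```cpp
// Hypothetical box type: left, top, right, bottom in pixels (hOCR bbox order).
struct Box {
  int x1, y1, x2, y2;
};

// True if inner lies completely within outer.
bool Contains(const Box& outer, const Box& inner) {
  return inner.x1 >= outer.x1 && inner.y1 >= outer.y1 &&
         inner.x2 <= outer.x2 && inner.y2 <= outer.y2;
}

// True if the box spans the full page of the given size.
bool CoversPage(const Box& b, int page_width, int page_height) {
  return b.x1 <= 0 && b.y1 <= 0 && b.x2 >= page_width && b.y2 >= page_height;
}

// A word box is considered plausible if it stays inside its line box
// and is not the whole page.
bool IsPlausibleWordBox(const Box& word, const Box& line,
                        int page_width, int page_height) {
  return !CoversPage(word, page_width, page_height) && Contains(line, word);
}
```

With the hOCR sample from earlier in this thread (line bbox 674 481 1494 532, page assumed to be 2485 x 3508), the box of "Sottoposto" passes and the box "0 0 2485 3508" reported for "a" is rejected.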

willaaam commented on Sep 6, 2018

Thanks, I appreciate the pointer, let's see if I can make some time to track this down, hopefully with a friend of mine. This bug breaks all analytics applications that come after tesseract.

And I agree, also crossposted in #1712 so we can nip this one in the bud.

willaaam commented on Sep 7, 2018

Quick update - we spent some time on this last night and the bug is definitely at the API level unfortunately.

Using the code below, we see that BoundingBoxInternal already returns whole-page coordinates for the boxes. (The code below was snapshotted an hour earlier and still uses BoundingBox, but the results are the same.)

Inside BoundingBoxInternal, at least for our sample code, cblob_it_ is always null, so that's where we are going to resume the hunt and check out BlobBox.

     case RIL_SYMBOL:
       if (cblob_it_ == NULL)
         box = it_->word()->box_word->BlobBox(blob_index_);
       else
         box = cblob_it_->data()->bounding_box();

Sample API test code below:

#include <cstdio>
#include <cstdlib>

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with Italian (LSTM engine only), without specifying tessdata path
    if (api->Init(NULL, "ita", tesseract::OcrEngineMode::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("/home/user/000001_nonconf.page.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read input image.\n");
        exit(1);
    }
    api->SetPageSegMode(tesseract::PSM_AUTO);
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Walk the result word by word and print each word's bounding box
    tesseract::ResultIterator* ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
    if (ri != NULL) {
        do {
            const char* word = ri->GetUTF8Text(level);
            float conf = ri->Confidence(level);
            int x1, y1, x2, y2;
            ri->BoundingBox(level, &x1, &y1, &x2, &y2);
            printf("word: '%s';  \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
                   word, conf, x1, y1, x2, y2);
            delete[] word;
        } while (ri->Next(level));
        delete ri;  // the iterator returned by GetIterator() is owned by the caller
    }

    // Destroy used object and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

Sintun commented on Sep 8, 2018
Contributor

Hey there,
I also followed the path and went down the following lane:
BoundingBox -> BoundingBoxInternal -> restricted_bounding_box -> true_bounding_box -> TBOX WERD::bounding_box
I tracked every usage of WERD::bounding_box and WERD::restricted_bounding_box.
I saw that the bounding boxes are fine until the code reaches
Tesseract::RetryWithLanguage
https://tesseract-ocr.github.io/4.0.0/a02479.html#a8952ab340e0f5e61992109e85cb1619c

Within this function the recognizer
(this->*recognizer)(word_data, in_word, &new_words);
(which uses LSTMRecognizeWord)
https://tesseract-ocr.github.io/4.0.0/a01743.html#ac50ad7dad904ed14e81cd29a3bfdb82d
https://tesseract-ocr.github.io/4.0.0/a02479.html#a0478ee100b826566b0b9ea048eee636e
is applied.
Before its application the word and character positions are good. (They are initialized before the LSTM runs, so that sane regions can be fed into the LSTM.)
Afterwards the resulting new word rectangles can have negative values and probably lose or gain characters at the word start or end.

Now there seem to be two possibilities:

  1. The LSTM itself produces crappy output based on the provided neural net.
  2. There is a post-processing issue in or near the LSTM code.

Considering that

  1. word_data.lang_words[ max_index + 1 ]->word->bounding_box();
    results in a bounding box whose points contain +/- 32767,
    which would yield a bounding box covering the whole image once it is cropped to the image borders,
  2. single character boxes are moved one position to the left or to the right,

I would hope that this is a +/- 1 error on some pointer in the LSTM bounding box / blob index post-processing.
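
To illustrate the first point, here is a minimal standalone sketch of how cropping such a box to the image borders produces a whole-page box. Box and ClipToImage are hypothetical names, not Tesseract code, and the corner values are an assumption based on the +/- 32767 observation above; the 2485 x 3508 page size is taken from the hOCR sample earlier in this thread.

```cpp
#include <algorithm>

// Hypothetical box type; not Tesseract's TBOX.
struct Box {
  int left, bottom, right, top;
};

// Limits v to the closed range [lo, hi].
int Clamp(int v, int lo, int hi) {
  return std::max(lo, std::min(v, hi));
}

// Crops every coordinate of the box to the image borders.
Box ClipToImage(const Box& b, int width, int height) {
  return Box{Clamp(b.left, 0, width), Clamp(b.bottom, 0, height),
             Clamp(b.right, 0, width), Clamp(b.top, 0, height)};
}
```

Clipping a box with corners (-32767, -32767) and (32767, 32767) to a 2485 x 3508 image gives 0 0 2485 3508, which is the whole page.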

Next time I will continue tracking the issue within LSTMRecognizeWord.

Update: I'm still closing in on this and have reached ExtractBestPathAsWords.

FrkBo commented on Sep 12, 2018

I don't know if the following is of any help... If you comment out lines 1311-1313 of ccstruct/pageres.cpp:
if (blob_it.at_first())
blob_it.set_to_list(next_word_blobs);
blob_end = (blob_box.right() + blob_it.data()->bounding_box().left()) / 2;

You would end up with a lot more lines where the bounding box is equal to the entire page. Maybe the previous if-statement >> if (!blob_it.at_first() || next_word_blobs != nullptr) << does not cover all applicable cases?
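
For reference, the blob_end line quoted above places the boundary halfway between the right edge of one blob and the left edge of the next. A tiny standalone sketch of just that arithmetic follows; Span and BoundaryBetween are hypothetical names, and this does not reproduce the surrounding logic in pageres.cpp.

```cpp
// Hypothetical horizontal extent of a blob, in pixels.
struct Span {
  int left, right;
};

// Midpoint between the right edge of the current blob and the left edge
// of the next one (integer division, as in the quoted line).
int BoundaryBetween(const Span& current, const Span& next) {
  return (current.right + next.left) / 2;
}
```

If the next blob comes from the wrong list, or there is no next blob at all, this midpoint has nothing sensible to refer to, which may be why the at_first() handling matters here.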

Update
Disabling the following line in pageres.cpp (line 1375) seems to 'solve' the issue or at least give better output for the incorrect bounding boxes, but previous 'correct' bounding boxes are changed (and not for the better...)
// Delete the fake blobs on the current word.
word_w->word->cblob_list()->clear();

Sintun commented on Sep 16, 2018
Contributor

Hey there,
I'm still working on this and have traced the issue down to the character positions computed from the LSTM output. Unfortunately it seems to be more than an off-by-one error.
For now I have reached the function
RecodeBeamSearch::ExtractBestPaths in recodebeam.cpp
Debug output from
RecodeBeamSearch::ExtractPathAsUnicharIds
shows that best_nodes[i]->duplicate and best_nodes[i]->unichar_id are wrong, and off by more than one.
Using the image
test_3
the letters u and s are both attributed to the position of u, and one of the Ls gets the position of both Ls.
Next time I will test the ExtractBestPaths function and follow the source of the wrong values; it shouldn't be far away. I hope to find the source of the bug before reaching the LSTM computations.
