tesseract - OpenCV for OCR: How to compute thresholding levels for gray image OCR

I'm trying to prepare images for OCR, and so far here is what I've done using the info from "Extracting text OpenCV".

From the resulting image I use the contours that have been filtered to make a mask as follows:

    // this is the mask of the text
    Mat maskf = Mat::zeros(rgb.rows, rgb.cols, CV_8UC1);
    // CV_FILLED fills the connected components found
    drawContours(maskf, letters, -1, Scalar(255), CV_FILLED);
    cv::imwrite("noise2-mask.png", maskf);

The resulting image is promising:

Considering the original image:

Unfortunately, running Tesseract on it yields some issues. I think the levels of gray that you see between the letters and words confuse Tesseract. So, you're thinking: yeah, let's do a binary transform. But that just misses the second half of the page. I also tried applying an Otsu threshold, but the text becomes pixelated and the characters lose their shape.

I tried calcBlockMeanVariance from "OpenCV Adaptive Threshold OCR", but it would not compile (and I'm not sure I understand it all, to be honest). The compiler chokes on

res=1.0-res; res=img+res;

Anyhow, if anyone has any suggestions I'll appreciate it! (Note that the fractions are poorly recognized by Tesseract, so I'm writing a new training set to improve the recognition rate.)

Enhancing dynamic range and normalizing illumination

The point is to normalize the background to a seamless color first. There are many methods to do this. Here is what I have tried for your image:

Create a paper/ink cell table for the image (in the same manner as in the linked answer). Select a grid cell size big enough to distinguish character features from the background. For your image I chose 8x8 pixels. Divide the image into squares and compute the average color and the absolute difference of color for each of them. Then mark the saturated ones (small absolute difference) and set them as paper or ink cells according to their average color in comparison to the whole-image average color.
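As a rough illustration of the cell table step, here is a simplified single-channel sketch. It is not the `picture` class code; `Cell`, `CellKind` and `build_cell_table` are hypothetical names, and it assumes dark ink on lighter paper, so a flat cell brighter than the whole-image average is treated as paper and a flat darker one as ink. The flatness limit is the average per-cell difference, matching the `dmax` idea described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper types for this sketch; they are not part of the picture class.
enum class CellKind { Mixed, Paper, Ink };

struct Cell {
    int avg = 0;                       // average intensity of the cell
    int da = 0;                        // mean absolute difference from avg
    CellKind kind = CellKind::Mixed;   // Mixed = cell contains both paper and ink
};

// img: row-major grayscale image of w*h pixels, sz: cell edge in pixels.
// Returns a row-major table with (h/sz) rows and (w/sz) columns.
std::vector<Cell> build_cell_table(const std::vector<int>& img, int w, int h, int sz) {
    assert(sz > 0 && w >= sz && h >= sz);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    const int txs = w / sz, tys = h / sz, n = sz * sz;
    std::vector<Cell> tab(static_cast<std::size_t>(txs) * static_cast<std::size_t>(tys));

    long long avg_sum = 0, da_sum = 0;
    for (int ty = 0; ty < tys; ++ty)
        for (int tx = 0; tx < txs; ++tx) {
            Cell& c = tab[static_cast<std::size_t>(ty) * txs + tx];
            // first pass: average intensity of the cell
            long long sum = 0;
            for (int y = ty * sz; y < (ty + 1) * sz; ++y)
                for (int x = tx * sz; x < (tx + 1) * sz; ++x)
                    sum += img[static_cast<std::size_t>(y) * w + x];
            c.avg = static_cast<int>(sum / n);
            // second pass: mean absolute difference from that average
            long long dev = 0;
            for (int y = ty * sz; y < (ty + 1) * sz; ++y)
                for (int x = tx * sz; x < (tx + 1) * sz; ++x)
                    dev += std::abs(img[static_cast<std::size_t>(y) * w + x] - c.avg);
            c.da = static_cast<int>(dev / n);
            avg_sum += c.avg;
            da_sum += c.da;
        }

    const long long cells = static_cast<long long>(tab.size());
    const int img_avg = static_cast<int>(avg_sum / cells);  // whole-image average
    const int dmax = static_cast<int>(da_sum / cells);      // flatness limit
    for (Cell& c : tab)
        if (c.da <= dmax)  // flat cell: classify by brightness (assumes dark ink)
            c.kind = (c.avg >= img_avg) ? CellKind::Paper : CellKind::Ink;
    return tab;
}
```

Calling `build_cell_table(img, w, h, 8)` would give the 8x8 grid mentioned above; cells left as `Mixed` are the ones that contain character edges.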

Now process all the lines of the image, and for each pixel obtain the left and right paper cells and linearly interpolate between their values. That should lead you to the actual background color of that pixel, so just subtract it from the image.
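The interpolation step for a single image line can be sketched in isolation like this. It is a simplified grayscale illustration under a few assumptions of mine: `subtract_background_row`, `cell_avg` and `is_paper` are hypothetical names, interpolation is anchored at the left edge of each paper cell, and pixels outside the span of paper cells simply reuse the nearest paper value. The output is the signed difference, so paper ends up near 0 and darker ink goes negative; any clamping to a pixel format is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// row:      intensities of one image line
// cell_avg: average intensity of each grid cell covering this line
// is_paper: true where the cell was classified as paper
// sz:       cell width in pixels
std::vector<int> subtract_background_row(const std::vector<int>& row,
                                         const std::vector<int>& cell_avg,
                                         const std::vector<bool>& is_paper,
                                         int sz) {
    assert(sz > 0 && cell_avg.size() == is_paper.size());
    const int xs = static_cast<int>(row.size());
    const int txs = static_cast<int>(cell_avg.size());
    std::vector<int> out(row);

    // indices of all paper cells on this line
    std::vector<int> paper;
    for (int t = 0; t < txs; ++t)
        if (is_paper[t]) paper.push_back(t);
    if (paper.empty()) return out;  // no background estimate, leave line as is

    std::size_t k = 0;  // paper[k]: last paper cell starting at or before x
    for (int x = 0; x < xs; ++x) {
        while (k + 1 < paper.size() && paper[k + 1] * sz <= x) ++k;
        const int x0 = paper[k] * sz;
        const int v0 = cell_avg[paper[k]];
        int bg = v0;  // before the first or after the last paper cell: hold value
        if (x >= x0 && k + 1 < paper.size()) {
            const int x1 = paper[k + 1] * sz;
            const int v1 = cell_avg[paper[k + 1]];
            bg = v0 + (v1 - v0) * (x - x0) / (x1 - x0);  // linear interpolation
        }
        out[x] = row[x] - bg;  // remove the estimated background
    }
    return out;
}
```

With a background ramping from 100 to 160 and one dark pixel on it, only that pixel survives as a non-zero (negative) value, which is the kind of flat background a later threshold needs.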

My C++ implementation for this looks like this:

    color picture::normalize(int sz,bool _recolor,bool _sbstract)
        {
        struct _cell
            {
            color col; int a[4],da,_paper;
            _cell(){};
            _cell(_cell& x){ *this=x; };
            ~_cell(){};
            _cell* operator = (const _cell *x) { *this=*x; return this; };
            /*_cell* operator = (const _cell &x) { ...copy... return this; };*/
            };
        int i,x,y,tx,ty,txs,tys,a0[4],a1[4],n,dmax;
        int x0,x1,y0,y1,q[4][4][2],qx[4],qy[4];
        color c;
        _cell **tab;
        // allocate grid table
        txs=xs/sz; tys=ys/sz; n=sz*sz; c.dd=0;
        if ((txs<2)||(tys<2)) return c;
        tab=new _cell*[tys]; for (ty=0;ty<tys;ty++) tab[ty]=new _cell[txs];
        // compute grid table
        for (y0=0,y1=sz,ty=0;ty<tys;ty++,y0=y1,y1+=sz)
         for (x0=0,x1=sz,tx=0;tx<txs;tx++,x0=x1,x1+=sz)
            {
            for (i=0;i<4;i++) a0[i]=0;
            for (y=y0;y<y1;y++)
             for (x=x0;x<x1;x++)
                {
                dec_color(a1,p[y][x],pf);
                for (i=0;i<4;i++) a0[i]+=a1[i];
                }
            for (i=0;i<4;i++) tab[ty][tx].a[i]=a0[i]/n;
            enc_color(tab[ty][tx].a,tab[ty][tx].col,pf);
            tab[ty][tx].da=0;
            for (i=0;i<4;i++) a0[i]=tab[ty][tx].a[i];
            for (y=y0;y<y1;y++)
             for (x=x0;x<x1;x++)
                {
                dec_color(a1,p[y][x],pf);
                for (i=0;i<4;i++) tab[ty][tx].da+=abs(a1[i]-a0[i]);
                }
            tab[ty][tx].da/=n;
            }
        // compute max safe delta dmax = avg(delta)
        for (dmax=0,ty=0;ty<tys;ty++)
         for (tx=0;tx<txs;tx++)
          dmax+=tab[ty][tx].da;
        dmax/=(txs*tys);
        // select paper cells and compute avg paper color
        for (i=0;i<4;i++) a0[i]=0; x0=0;
        for (ty=0;ty<tys;ty++)
         for (tx=0;tx<txs;tx++)
          if (tab[ty][tx].da<=dmax)
            {
            tab[ty][tx]._paper=1;
            for (i=0;i<4;i++) a0[i]+=tab[ty][tx].a[i];
            x0++;
            }
          else tab[ty][tx]._paper=0;
        if (x0) for (i=0;i<4;i++) a0[i]/=x0;
        enc_color(a0,c,pf);
        // remove saturated ink cells from paper (small .da but wrong .a[])
        for (ty=1;ty<tys-1;ty++)
         for (tx=1;tx<txs-1;tx++)
          if (tab[ty][tx]._paper==1)
           if ((tab[ty][tx-1]._paper==0)
             ||(tab[ty][tx+1]._paper==0)
             ||(tab[ty-1][tx]._paper==0)
             ||(tab[ty+1][tx]._paper==0))
            {
            x=0; for (i=0;i<4;i++) x+=abs(tab[ty][tx].a[i]-a0[i]);
            if (x>dmax) tab[ty][tx]._paper=2;
            }
        for (ty=0;ty<tys;ty++)
         for (tx=0;tx<txs;tx++)
          if (tab[ty][tx]._paper==2)
           tab[ty][tx]._paper=0;
        // piecewise linear interpolation along H-lines
        int ty0,ty1,tx0,tx1,d;
        if (_sbstract) for (i=0;i<4;i++) a0[i]=0;
        for (y=0;y<ys;y++)
            {
            ty=y/sz; if (ty>=tys) ty=tys-1;
            // first paper cell
            for (tx=0;(tx<txs)&&(!tab[ty][tx]._paper);tx++); tx1=tx;
            if (tx>=txs) continue; // no paper cell found
            for (;tx<txs;)
                {
                // next paper cell
                for (tx++;(tx<txs)&&(!tab[ty][tx]._paper);tx++);
                if (tx<txs)
                    {
                    tx0=tx1; x0=tx0*sz;
                    tx1=tx;  x1=tx1*sz;
                    d=x1-x0;
                    }
                else x1=xs;
                // interpolate
                for (x=x0;x<x1;x++)
                    {
                    dec_color(a1,p[y][x],pf);
                    for (i=0;i<4;i++) a1[i]-=tab[ty][tx0].a[i]+(((tab[ty][tx1].a[i]-tab[ty][tx0].a[i])*(x-x0))/d)-a0[i];
                    if (pf==_pf_s   ) for (i=0;i<1;i++) clamp_s32(a1[i]);
                    if (pf==_pf_u   ) for (i=0;i<1;i++) clamp_u32(a1[i]);
                    if (pf==_pf_ss  ) for (i=0;i<2;i++) clamp_s16(a1[i]);
                    if (pf==_pf_uu  ) for (i=0;i<2;i++) clamp_u16(a1[i]);
                    if (pf==_pf_rgba) for (i=0;i<4;i++) clamp_u8 (a1[i]);
                    enc_color(a1,p[y][x],pf);
                    }
                }
            }
        // recolor paper cells with avg color (remove noise)
        if (_recolor)
         for (y0=0,y1=sz,ty=0;ty<tys;ty++,y0=y1,y1+=sz)
          for (x0=0,x1=sz,tx=0;tx<txs;tx++,x0=x1,x1+=sz)
           if (tab[ty][tx]._paper)
            for (y=y0;y<y1;y++)
             for (x=x0;x<x1;x++)
              p[y][x]=c;
        // free grid table
        for (ty=0;ty<tys;ty++) delete[] tab[ty]; delete[] tab;
        return c;
        }

See the linked answer for more details. Here is the result for your input image after switching to gray-scale <0,765> and using pic1.normalize(8,false,true);
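For readers wondering about the <0,765> range: it matches what you get by summing three 8-bit channels (3*255 = 765). The sketch below assumes that this is the conversion meant here, which the text does not state explicitly; `gray765` and `gray8` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Sum of the three 8-bit channels: 0 (black) .. 765 (white).
int gray765(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return int(r) + int(g) + int(b);
}

// Map a summed intensity back to a single 8-bit gray level,
// e.g. when writing the result out as RGBA again.
std::uint8_t gray8(int i) {
    assert(i >= 0 && i <= 765);
    return static_cast<std::uint8_t>(i / 3);
}
```

Keeping the sum instead of dividing by 3 right away preserves a little more intensity resolution for the later steps.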

Binarize

I tried naive simple range thresholding first: if all the color channel values (R,G,B) are in the range <min,max>, the pixel is recolored to c1, else to c0:
```
void picture::treshold_and(int min,int max,int c0,int c1) // all channels thresholding: c1 <min,max>, c0 (-inf,min)+(max,+inf)
    {
    int x,y,i,a[4],e;
    for (y=0;y<ys;y++)
     for (x=0;x<xs;x++)
        {
        dec_color(a,p[y][x],pf);
        // e stays 1 only if all three color channels are inside <min,max>
        for (e=1,i=0;i<3;i++) if ((a[i]<min)||(a[i]>max)){ e=0; break; }
        if (e) for (i=0;i<4;i++) a[i]=c1;
         else  for (i=0;i<4;i++) a[i]=c0;
        enc_color(a,p[y][x],pf);
        }
    }
```
After applying pic1.treshold_and(0,127,765,0); and converting back to RGBA I got this result:

The gray noise is due to JPEG compression (PNG would be too big). As you can see, the result is more or less acceptable.

In case this is not enough, you can divide your image into segments. Compute the histogram for each segment (it should be bimodal), then find the color between the 2 maximums, which is your threshold value. The problem is that the background covers much more area, so the ink peak is relatively small and sometimes hard to spot in linear scales; see the full image histogram:
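One possible way to pick a threshold from a bimodal histogram is sketched below for 8-bit intensities. The peak search is my own simplification, not something taken from the text: the second peak is the tallest bin at least `min_sep` intensities away from the tallest one, and the threshold is the lowest bin between the two. `Histogram`, `make_histogram`, `bimodal_threshold` and `min_sep` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdlib>
#include <vector>

using Histogram = std::array<int, 256>;

// Count how often each 8-bit intensity occurs in one image segment.
Histogram make_histogram(const std::vector<int>& pixels) {
    Histogram h{};
    for (int v : pixels) {
        assert(v >= 0 && v < 256);
        ++h[v];
    }
    return h;
}

// Returns the intensity of the lowest bin between the two peaks,
// or -1 if no second peak at least min_sep away from the first exists.
int bimodal_threshold(const Histogram& h, int min_sep) {
    assert(min_sep > 0);
    // tallest bin: usually the background (paper) peak
    const int p1 = static_cast<int>(std::max_element(h.begin(), h.end()) - h.begin());
    // tallest non-empty bin far enough away: candidate ink peak
    int p2 = -1;
    for (int i = 0; i < 256; ++i)
        if (std::abs(i - p1) >= min_sep && h[i] > 0 && (p2 < 0 || h[i] > h[p2]))
            p2 = i;
    if (p2 < 0) return -1;
    const int lo = std::min(p1, p2), hi = std::max(p1, p2);
    // valley: first lowest bin in the closed range [lo, hi]
    return static_cast<int>(std::min_element(h.begin() + lo, h.begin() + hi + 1) - h.begin());
}
```

Usage would be along the lines of `bimodal_threshold(make_histogram(segment_pixels), 64)`. If `min_sep` is too small, the "second peak" may just be the shoulder of the first one, so it should be chosen with the expected paper/ink contrast in mind.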

When you do this for each segment it will be much better (as there will be much less background/text color bleeding around the thresholds), so the gap will be more visible. Also, do not forget to ignore the small gaps (missing vertical lines in the histogram), as they are just related to quantization/encoding/rounding (not all gray shades are present in the image). So you should filter out gaps smaller than a few intensities, replacing them with the average of the last and next valid histogram entry.
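The gap filtering can be sketched as follows. This is a minimal illustration, with `fill_small_gaps` and `max_gap` as hypothetical names: every run of empty bins that is at most `max_gap` wide and has a valid entry on both sides is replaced by the average of those two entries, while wider runs and runs touching either end of the histogram are left alone.

```cpp
#include <cassert>
#include <vector>

// Returns a copy of hist where short runs of empty bins are filled with
// the average of the last valid entry before and the next valid entry after.
std::vector<int> fill_small_gaps(std::vector<int> hist, int max_gap) {
    assert(max_gap >= 0);
    const int n = static_cast<int>(hist.size());
    int i = 0;
    while (i < n) {
        if (hist[i] != 0) { ++i; continue; }
        int j = i;
        while (j < n && hist[j] == 0) ++j;       // [i, j) is a run of empty bins
        const bool bounded = (i > 0) && (j < n); // valid entries on both sides
        if (bounded && (j - i) <= max_gap) {
            const int fill = (hist[i - 1] + hist[j]) / 2;
            for (int k = i; k < j; ++k) hist[k] = fill;
        }
        i = j;                                   // continue after the run
    }
    return hist;
}
```

Running this before searching for the valley between the two peaks keeps single missing intensities from being mistaken for the real gap between ink and paper.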