IFN/ENIT-database – DATABASE OF HANDWRITTEN ARABIC WORDS – http://www.ifnenit.com/

Database Specification

Overview
• More than 2200 form pages covering 937 Tunisian town/village names
• All data are digitized at a resolution of 300 dpi (b/w)
• Ground truth information available, e.g.
    o sequence of Arabic character shapes
    o baseline/reference line position
    o topline position for set_a
• A wide variety of writing styles; 411 different writers
• About 26000 handwritten Arabic Tunisian town/village names
• Approximately 212 000 Arabic characters and ligatures
• Divided into 4 sets (a-d)
• Images and ground truth documentation included

Set specification
The whole database is divided into 4 disjoint sets for training and testing Arabic OCR systems. We recommend using the specified sets so that results remain comparable with those of other groups.

SET    Number of words    Number of writers (OK+bad)
a      6537               88+14=102
b      6710               89+13=102
c      6477               88+15=103
d      6735               90+14=104
SUM    26459              355+56=411

Data formats

Image format
The form pages are stored in uncompressed TIFF file format (file extension .tif). All cropped Tunisian town/village names are provided as uncompressed TIFF images and as BMP images (file extension .bmp). Label information is stored in the TIFF header field “Image description”.

Ground Truth format
For each cropped Tunisian town/village name there is a truth file (file extension .tru). The truth file is an ASCII text file containing all available ground truth information. One example is shown below.
01: COM: IFN/ENIT-database truth (label) file
02: COM: http://www.ifnenit.com
03: COM: IfN, TU-BS
04: COM: di45_019.tif coming from pb377_6.tif
05: X_Y: 498 87
06: BDR: begin data record
07: LBL: ZIP:3032;AW1:ﻣﺮآﺰدروﻳﺶ;AW2:maB|raE|keB|zaE|daA|raA|waA|yaB|shE|;QUA:YB1;ADD:P6
08: CHA: 9
09: BLN: 56,42
10: TLN: 23,19
11: EDR: end of data record

lines 01-04:  comments
line 05:      image size in pixels (x,y)
line 06:      begin data record
line 07:      label
    ZIP:  Tunisian post code / ZIP code
    AW1:  Tunisian town/village name in Arabic Windows encoding
    AW2:  Arabic character shape sequence of the Tunisian town/village name in Latin code (see the Appendix for the lookup table)
    QUA:  baseline quality tag (B1 = OK; B2 = bad)
    ADD:  number of pieces of Arabic words (PAWs)
line 08:      number of characters
line 09:      baseline/reference line information Y1,Y2
line 10:      topline information Y1,Y2 (in data set_a only)
line 11:      end of data record

Arabic database -- IFN/ENIT-database for developing and testing recognition systems for handwritten Arabic words (Arabic OCR) (version 1.0p2)

Database Organisation
The IFN/ENIT-database has the following directory structure.

IFN/ENIT-database
    doc
    data
        forms
        set_a
            doc
            tru
            tif
            bmp
        ...
        set_d
            ...

doc
This directory contains the documentation in pdf and/or txt format.

data
This directory contains all available data. Each of the four sets has its own directory (set_?).

set_?
These directories contain all data in tif and bmp file format. The ground truth information is in the directory tru. The subdirectory doc under data/set_? contains one pdf file per writer showing all words and baselines.

File name convention
The following file name convention is used.

SWww_NNN.EXT
• S := set (a,b,c,d)
• Www := writerID; W=(e,f,i,j,m,q), w=(0..9)
• NNN := word_number; N=(0..9)

For example, di45_019.tif is word 019 of writer i45 in set d.

Ordering Information
IFN/ENIT-database is made available for non-commercial use.
The data is supplied with no guarantee of accuracy or usability. We cannot guarantee to maintain the IFN/ENIT-database, but we would be interested in hearing any comments or results that you have.
Upon request we make the data available on the Internet for free download. If the database has to be shipped on CD-ROM, production and shipping costs will be charged. In both cases please contact us.

BUGS
While working with the IFN/ENIT-database you may discover bugs concerning the labels or the extraction of the data. Please report any bugs you find back to us so that we can improve the quality of the data over time.

Reporting results
We kindly invite you to publish results obtained with the IFN/ENIT-database. To keep recognition results comparable, we suggest reporting results as shown in the following example:

Database version: IFN/ENIT-database v1.0p2

Test    Training set(s)    Test set    Recognition result(*)
1       a,b,c              d           23.7%
2       c,b,d              a           25.4%
…

(*) Percentage of correctly recognised words in the specified test set.

We recommend using the ground truth / label category “ZIP:” as the reference for the recognition result. If a lexicon is needed, please use the complete set of 937 different Tunisian town/village names as the lexicon in each test.

Contact
Please feel free to contact contact@ifnenit.com.

References
Mario Pechwitz, Samia Snoussi Maddouri, Volker Märgner, Noureddine Ellouze, Hamid Amiri; IFN/ENIT-database of handwritten Arabic words, In Proceedings of CIFED’02, Hammamet, Tunisia, 21.-23.10.2002, p.
129-136

Appendix
Label legend / lookup table: Arabic label to Latin label, and statistics of occurrence in the database

Arabic label    Latin label    Quantity
0_A             0A             342
1_A             1A             279
2_A             2A             384
6_A             6A             311
7_A             7A             354
8_A             8A             284
9_A             9A             341
_ءA             hhA            520
_ﺁA             amA            544
_أA             aeA            1660
_إA             ahA            631
_أE_لB          aeElaB         360
_إE_لB          ahElaB         122
_ئM             alM            355
_اA             aaA            20251
_اE             aaE            13308
_اE_لB          aaElaB         799
_اE_لM          aaElaM         1076
_بA             baA            331
_بB             baB            5636
_بE             baE            344
_بM             baM            3407
_ةA             teA            2182
_تA             taA            356
_تB             taB            2324
_ةE             teE            7259
_تE             taE            357
_تM             taM            1045
_ثA             thA            338
_ثB             thB            353
_ثM             thM            327
_جA             jaA            504
_جB             jaB            981
_جE             jaE            346
_جM             jaM            1218
_جM_لB          jaMlaB         539
_حA             haA            314
_حB             haB            2483
_حE             haE            295
_حM             haM            1804
_حM_لB          haMlaB         365
_حM_مM_لB       haMmaMlaB      64
_حM_نB          haMnaB         100
_خA             khA            341
_خB             khB            863
_خE             khE            321
_خM             khM            425
_خM_لB          khMlaB         310
_دA             daA            2718
_دE             daE            4883
_ذA             dhA            353
_ذE             dhE            703
_رA             raA            6252
_رE             raE            9369
_زA             zaA            1764
_زE             zaE            2718
_سA             seA            873
_سB             seB            4218
_سE             seE            811
_سM             seM            1110
_شA             shA            475
_شB             shB            1617
_شE             shE            351
_شM             shM            1277
_صA             saA            355
_صB             saB            956
_صE             saE            357
_صM             saM            1096
_ضA             deA            351
_ضB             deB            730
_ضE             deE            343
_ضM             deM            328
_طA             toA            343
_طB             toB            359
_طE             toE            350
_طM             toM            1258
_ظB             zaB            339
_ظM             zaM            690
_عA             ayA            915
_عB             ayB            1650
_عE             ayE            347
_عM             ayM            1990
_غB             ghB            326
_غM             ghM            600
_فA             faA            319
_فB             faB            898
_فE             faE            316
_فM             faM            1647
_قA             kaA            397
_قB             kaB            2608
_قE             kaE            348
_قM             kaM            1307
_كB             keB            1221
_كE             keE            335
_كM             keM            980
_لA             laA            1485
_لB             laB            14340
_لE             laE            1056
_لM             laM            2594
_مA             maA            890
_مB             maB            3886
_مE             maE            536
_مM             maM            4626
_مM_لB          maMlaB         458
_نA             naA            1267
_نB             naB            3723
_نE             naE            2119
_نM             naM            2912
_ﻩA             heA            696
_ﻩB             heB            1924
_ﻩE             heE            351
_ﻩM             heM            347
_وA             waA            3511
_وE             waE            6529
_ىA             eeA            322
_يA             yaA            2932
_يB             yaB            4383
_ىE             eeE            350
_يE             yaE            2167
_يM             yaM            7759

Note:
• “llL” is often added and means ligature “chadda”.
• The character shape indicators (A,B,M,E) are sometimes supplemented with a “1” or a “2”, as in “baA1”. In this case a point error was detected. Take care: this feature is not consistent over the whole database. We recommend ignoring these kinds of label supplements.
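
The truth-file layout described under "Ground Truth format", the file name convention and the recommendation to ignore the "1"/"2" label supplements can be sketched in Python as follows. This is only an illustrative sketch and not part of the database distribution. It assumes that the "01:" to "11:" prefixes in the printed example are documentation line numbers that do not occur in the files, that the label fields on the LBL line are separated by ";", and that the shape labels in AW2 are separated by "|". The function names, the dictionary keys and the SAMPLE text are hypothetical. The "llL" marker is not handled.

```python
import re

# Line tags as they appear in the example truth file.
TAG_RE = re.compile(r"^(COM|X_Y|BDR|LBL|CHA|BLN|TLN|EDR):\s*(.*)$")
# File name convention SWww_NNN.EXT.
NAME_RE = re.compile(r"^([a-d])([efijmq]\d{2})_(\d{3})\.(\w+)$")

# Example record; AW1 is left empty here, real files carry the name
# in Arabic Windows encoding (assumption: no "01:" style numbering).
SAMPLE = """COM: IFN/ENIT-database truth (label) file
COM: http://www.ifnenit.com
COM: IfN, TU-BS
COM: di45_019.tif coming from pb377_6.tif
X_Y: 498 87
BDR: begin data record
LBL: ZIP:3032;AW1:;AW2:maB|raE|keB|zaE|daA|raA|waA|yaB|shE|;QUA:YB1;ADD:P6
CHA: 9
BLN: 56,42
TLN: 23,19
EDR: end of data record
"""


def parse_name(filename):
    """Split a file name such as 'di45_019.tif' into its parts."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {"set": m.group(1), "writer": m.group(2),
            "word": m.group(3), "ext": m.group(4)}


def strip_supplement(shape_label):
    """Drop a trailing '1' or '2' after the shape indicator (baA1 -> baA)."""
    return re.sub(r"(?<=[ABME])[12]$", "", shape_label)


def parse_truth(text):
    """Parse the text of one truth file into a dictionary."""
    record = {"comments": []}
    for line in text.splitlines():
        m = TAG_RE.match(line.strip())
        if m is None:
            continue  # skip blank or unrecognised lines
        tag, value = m.groups()
        if tag == "COM":
            record["comments"].append(value)
        elif tag == "X_Y":
            record["size"] = tuple(int(v) for v in value.split())
        elif tag == "LBL":
            # Fields are 'KEY:value' pairs separated by ';'.
            fields = dict(f.split(":", 1) for f in value.split(";") if ":" in f)
            record["label"] = fields
            record["shapes"] = [strip_supplement(s)
                                for s in fields.get("AW2", "").split("|") if s]
        elif tag == "CHA":
            record["n_chars"] = int(value)
        elif tag in ("BLN", "TLN"):
            key = "baseline" if tag == "BLN" else "topline"
            record[key] = tuple(int(v) for v in value.split(","))
    return record


if __name__ == "__main__":
    rec = parse_truth(SAMPLE)
    print(rec["label"]["ZIP"], rec["shapes"], rec["baseline"])
    print(parse_name("di45_019.tif"))
```

When reading real .tru files, opening them with encoding="cp1256" (Python's codec for the Arabic Windows code page) is a reasonable first guess for the AW1 field, but this should be checked against the data. The "topline" key is only set when a TLN line is present.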