Specification of ETL-9

ETL-9G

Format

  • Fixed Record Length without Control Words
  • 8 bits per byte, 8199 bytes per record
  • Big endian
Bytes      # bytes  Type     Contents
1-2        2        Integer  Serial Sheet Number
3-4        2        Binary   JIS Character Code (JIS X 0208)
5-12       8        ASCII    JIS Typical Reading (e.g. 'AI.MEDER')
13-16      4        Integer  Serial Data Number
17         1        Integer  Quality Evaluation of Individual Character Image
18         1        Integer  Quality Evaluation of Character Group
19         1        Integer  Gender of Writer (1:male, 2:female) (JIS X 0303)
20         1        Integer  Age of Writer
21-22      2        Integer  Industry Classification Code (JIS X 0403)
23-24      2        Integer  Occupation Classification Code (JIS X 0404)
25-26      2        Integer  Date of Collection (19YYMM)
27-28      2        Integer  Date of Scan (19YYMM)
29         1        Integer  X-Coordinate of Sample on Sheet
30         1        Integer  Y-Coordinate of Sample on Sheet
31-64      34                (undefined)
65-8192    8128     Packed   16-graylevel (4-bit) image of 128 x 127 = 16256 pixels
8193-8199  7                 (uncertain)

Files

  • One data set contains 3036 characters written by one writer; each file holds 4 data sets, hence 12144 = 4 * 3036 records per file
  • 20 sheets per writer: sheets 1-20 belong to the first writer, 21-40 to the second writer, etc.
filename  # records  # categories  # data sets  data set indices  # sheets
ETL9G_01  12144      3036          4            1-4               80
ETL9G_02  12144      3036          4            5-8               80
...
ETL9G_50  12144      3036          4            197-200           80
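
The file layout above reduces to integer arithmetic. The following Python sketch maps a data set index to a file name and a writer to a sheet range. It assumes that the files not listed in the table follow the same pattern of four data sets per file; the function names are made up for illustration.

```python
DATA_SETS_PER_FILE = 4   # from the table: 4 data sets per ETL9G file
SHEETS_PER_WRITER = 20   # from the list above: 20 sheets per writer
TOTAL_DATA_SETS = 200    # last data set index listed for ETL9G_50


def etl9g_file_for_data_set(index):
    """Return the ETL9G file name assumed to hold data set `index` (1-200)."""
    if not 1 <= index <= TOTAL_DATA_SETS:
        raise ValueError('data set index out of range: %r' % (index,))
    # Assumption: files are numbered consecutively, 4 data sets each.
    return 'ETL9G_%02d' % ((index - 1) // DATA_SETS_PER_FILE + 1)


def sheets_for_writer(writer):
    """Return the (first, last) sheet numbers of the 1-based `writer`."""
    if writer < 1:
        raise ValueError('writer number must be >= 1: %r' % (writer,))
    first = (writer - 1) * SHEETS_PER_WRITER + 1
    return first, first + SHEETS_PER_WRITER - 1
```

For example, `etl9g_file_for_data_set(197)` gives `'ETL9G_50'`, matching the last row of the table, and `sheets_for_writer(2)` gives `(21, 40)`.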

Samples

filename  record  metadata and JIS code in hex                                            image
ETL9G_01  1       (1, 12321, 'A.TSUGU ', 1, 0, 0, 0, 0, 0, 0, 8212, 8310, 0, 0) 0x3021    ETL9G_1_3021
ETL9G_11  101     (1, 12580, 'IN.HIBI ', 101, 0, 0, 0, 0, 0, 0, 8212, 8311, 4, 6) 0x3124  ETL9G_1_3124
ETL9G_21  201     (2, 12839, 'OU.OKI ', 49, 0, 0, 0, 0, 0, 0, 8212, 8406, 0, 3) 0x3227    ETL9G_2_3227
ETL9G_31  301     (2, 13101, 'KAI.BAI ', 301, 0, 0, 0, 0, 0, 0, 8212, 8405, 4, 9) 0x332d  ETL9G_2_332d
ETL9G_41  401     (3, 13360, 'KAN.MA ', 401, 0, 0, 0, 0, 0, 0, 8212, 8403, 0, 6) 0x3430   ETL9G_3_3430

Sample Python code for retrieving the first record in ETL9G_01 (tested with Python 2.7.5).
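
A minimal sketch of such a reader, written from the record layout above rather than taken from the original listing. It uses only the standard `struct` module and has not been run against the actual file. The function names are illustrative, and the sketch assumes that the 4-bit pixels are packed row by row with the high nibble first, which the format table does not state.

```python
import struct

RECORD_SIZE = 8199
# Big endian: 2 x uint16, 8-byte reading, uint32, 4 x uint8, 4 x uint16,
# 2 x uint8, 34 undefined bytes, 8128 image bytes, 7 trailing bytes.
RECORD_FORMAT = '>2H8sI4B4H2B34x8128s7x'
WIDTH, HEIGHT = 128, 127


def parse_etl9g_record(raw):
    """Split one 8199-byte record into (metadata tuple, rows of gray levels)."""
    if len(raw) != RECORD_SIZE:
        raise ValueError('expected %d bytes, got %d' % (RECORD_SIZE, len(raw)))
    fields = struct.unpack(RECORD_FORMAT, raw)
    reading = fields[2].decode('ascii', 'replace')
    meta = fields[:2] + (reading,) + fields[3:-1]
    pixels = []
    for byte in bytearray(fields[-1]):
        pixels.append(byte >> 4)     # assumption: high nibble is the left pixel
        pixels.append(byte & 0x0F)
    rows = [pixels[y * WIDTH:(y + 1) * WIDTH] for y in range(HEIGHT)]
    return meta, rows


def read_first_etl9g_record(path):
    """Read and parse the first record of an ETL9G file at `path`."""
    with open(path, 'rb') as f:
        return parse_etl9g_record(f.read(RECORD_SIZE))
```

If the assumptions hold, `read_first_etl9g_record('ETL9G_01')` should return metadata whose second field is 12321 (0x3021), as in the first row of the sample table, together with 127 rows of 128 gray levels in the range 0-15.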

ETL-9B

ETL-9B is generated from ETL-9G by binarization. The threshold is determined by T = λh + (1-λ)μ, where h is Otsu's threshold [4] and μ is the average of all intensity levels in ETL-9G [5]. For ETL-9B, λ = 0.4 [1][2].
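
The threshold formula can be illustrated with a short Python sketch. It is not the procedure used to produce ETL-9B: it assumes that μ is the mean gray level of the single image being binarized, that h is the level maximizing the between-class variance of the 16-level histogram, and that pixels above T become 1. It also does not attempt the size reduction from 128 x 127 to 64 x 63 pixels. The function names are illustrative.

```python
def otsu_threshold(hist):
    """Return the level t that best separates {<= t} from {> t} (Otsu)."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t, count in enumerate(hist[:-1]):
        w0 += count                  # pixels at or below t
        sum0 += t * count
        w1 = total - w0              # pixels above t
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / float(w0)
        m1 = (sum_all - sum0) / float(w1)
        between = w0 * w1 * (m0 - m1) ** 2   # proportional to the variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(rows, lam=0.4, levels=16):
    """Apply T = lam*h + (1-lam)*mu to rows of gray levels (0..levels-1)."""
    hist = [0] * levels
    for row in rows:
        for value in row:
            hist[value] += 1
    h = otsu_threshold(hist)
    mu = sum(l * c for l, c in enumerate(hist)) / float(sum(hist))
    t = lam * h + (1 - lam) * mu
    # Assumption: levels above T are stroke pixels (1), the rest are 0.
    return t, [[1 if value > t else 0 for value in row] for row in rows]
```

With λ = 0.4 the result is weighted toward the mean μ rather than toward Otsu's threshold h.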

Format

  • Fixed Record Length without Control Words
  • 8 bits per byte, 576 bytes per record
  • Big endian
Bytes    # bytes  Type     Contents
1-2      2        Integer  Serial Sheet Number
3-4      2        Binary   JIS Kanji Code (JIS X 0208)
5-8      4        ASCII    JIS Typical Reading (e.g. 'AI.M')
9-512    504      Packed   Binary image of 64 x 63 = 4032 pixels
513-576  64                (uncertain)

Files

  • One data set contains 3036 characters written by one writer; each file holds 40 data sets, hence 121440 = 40 * 3036 records per file
  • 20 sheets per writer: sheets 1-20 belong to the first writer, 21-40 to the second writer, etc.
  • The first record of each file is a dummy record filled with zeros
  • The last data set (3036 records) of ETL9B_5 is the model presented to the examinees
filename  # records    # data sets  data set index  # sheets
ETL9B_1   121440       40           1-40            800
ETL9B_2   121440       40           41-80           800
ETL9B_3   121440       40           81-120          800
ETL9B_4   121440       40           121-160         800
ETL9B_5   121440+3036  40+1         161-200         800+20

Samples

filename  record index (dummy record as 0)  metadata and JIS code in hex  image
ETL9B_1   1                                 (1, 9250, 'A.HI') 0x2422      1_9250
ETL9B_2   100                               (801, 12349, 'AYA.') 0x303d   801_12349
ETL9B_3   200                               (1601, 12611, 'EI.A') 0x3143  1601_12611
ETL9B_4   300                               (2402, 12873, 'KA.Y') 0x3249  2402_12873
ETL9B_5   400                               (3203, 13135, 'KAKU') 0x334f  3203_13135

Sample Python code for retrieving the first record skipping the dummy record in ETL9B_1 (tested with Python 2.7.5).
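
A minimal sketch of such a reader, written from the ETL-9B record layout above rather than taken from the original listing, and not run against the actual file. It skips the dummy record by seeking one record length into the file. The function names are illustrative, and the sketch assumes that the 1-bit pixels are packed row by row with the most significant bit first, which the format table does not state.

```python
import struct

RECORD_SIZE = 576
# Big endian: 2 x uint16, 4-byte reading, 504 image bytes, 64 trailing bytes.
RECORD_FORMAT = '>2H4s504s64x'
WIDTH, HEIGHT = 64, 63


def parse_etl9b_record(raw):
    """Split one 576-byte record into (sheet, JIS code, reading, bit rows)."""
    if len(raw) != RECORD_SIZE:
        raise ValueError('expected %d bytes, got %d' % (RECORD_SIZE, len(raw)))
    sheet, jis, reading, packed = struct.unpack(RECORD_FORMAT, raw)
    bits = []
    for byte in bytearray(packed):
        for shift in range(7, -1, -1):   # assumption: MSB is the left pixel
            bits.append((byte >> shift) & 1)
    rows = [bits[y * WIDTH:(y + 1) * WIDTH] for y in range(HEIGHT)]
    return sheet, jis, reading.decode('ascii', 'replace'), rows


def read_etl9b_record(path, index=1):
    """Read record `index` of an ETL9B file; index 0 is the dummy record."""
    with open(path, 'rb') as f:
        f.seek(index * RECORD_SIZE)
        return parse_etl9b_record(f.read(RECORD_SIZE))
```

If the assumptions hold, `read_etl9b_record('ETL9B_1')` should return sheet number 1, JIS code 9250 (0x2422) and reading 'A.HI', as in the first row of the sample table.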

References

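  1. T. Saito, H. Yamada, K. Yamamoto: "The JIS level-1 handwritten kanji database ETL9 and its analysis" (in Japanese), Trans. IECE Japan (D), Special Issue on Image Processing, Vol. J68-D, No. 4, pp. 757–764 (1985-04).
  2. T. Saito, H. Yamada, K. Yamamoto: "An analysis of handwritten character databases (VIII): evaluation of the JIS level-1 handwritten kanji database ETL9 by the directional pattern matching method" (in Japanese), Bulletin of the Electrotechnical Laboratory, Vol. 49, No. 7, pp. 487–525 (1985-07).
  3. T. Saito, K. Yamamoto, H. Yamada: "An analysis of handwritten character databases (IX): on the database ETL9 and its model characters" (in Japanese), Bulletin of the Electrotechnical Laboratory, Vol. 50, No. 4, pp. 259–263 (1986-04).
  4. N. Otsu: "An automatic threshold selection method based on discriminant and least squares criteria" (in Japanese), Trans. IECE Japan (D), Vol. 63-D, No. 4, pp. 349–356 (1980-04).
  5. T. Saito, H. Yamada: "An improvement of the discriminant threshold selection method" (in Japanese), Transactions of the Information Processing Society of Japan, Vol. 22, No. 6, pp. 596–599 (1981-11).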