Specification of ETL-2

Data Collection

OCR Sheets

  • Paper: B4, 90 kg per 1000 sheets
  • Samples: e2shta01, e2shtb01, e2shtc01, e2shtd01

Characters

  • 2184 characters in CO-59 codeset
  • Hiragana, katakana, roman alphabet, symbols, kanji
  • 8-point metal-type fonts used for newspaper printing
  • 9-point offset-printing fonts used for published patent applications

Scanner

  • ITV Camera Scanner 240×240
  • Sampling interval: 54 μm × 54 μm
  • Spot size: 54 μm
  • Intensity levels: 64 (6 bits)
  • Number of pixels: 60 × 60 = 3600
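
The scanner figures above imply both the size of the scanned window and the raw storage each image needs. The following is a quick arithmetic sketch; the variable names are illustrative, and the window size assumes that samples sit edge to edge at the stated interval.

```python
# Figures taken from the scanner specification above.
SAMPLING_INTERVAL_UM = 54   # micrometres between adjacent samples
PIXELS_PER_SIDE = 60        # each image is 60 x 60 pixels
BITS_PER_PIXEL = 6          # 64 intensity levels

# Assumption: the window is simply pixels x interval (no gaps or overlap).
window_mm = PIXELS_PER_SIDE * SAMPLING_INTERVAL_UM / 1000

# Raw image payload, before any record header is added.
image_bits = PIXELS_PER_SIDE ** 2 * BITS_PER_PIXEL
image_octets = image_bits // 8

print(f"scanned window: {window_mm} mm per side")
print(f"image payload: {image_bits} bits = {image_octets} octets")
```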

Compile

  • Source of Collection: Dai Nippon Printing Co., Ltd.; The Mainichi Newspapers Co., Ltd.
  • Total samples: 52796
  • Scanning: Toshiba
  • Computer: TOSBAC-40C TOSPICS
  • Date of Collection: October 1973
  • Date of Scanning: October 1973

Format

  • Fixed record length without control words
  • Big endian
  • 6 bits per byte
  • 3660 bytes = 2745 octets per record

bytes     # bytes  type      contents
1-6       6        Integer   Serial Index
7         1        T56Code   Source (‘A’: Mincho Newspaper, ‘B’: Gothic Newspaper, ‘C’: Mincho Patent, ‘D’: Gothic Patent)
8-12      5        T56Code   Spaces
13-18     6        T56Code   Class (‘KANJI’: kanji, ‘EIJI’: roman alphabets, ‘HRKANA’: hiragana, ‘KTKANA’: katakana, ‘KIGO’: special characters, ‘SUUJI’: numerals)
19-24     6        T56Code   Font (‘MINCHO’, ‘GOTHIC’)
25-28     4                  Zeros
29-30     2        Integer   CO-59 Code
31-60     30                 (undefined)
61-3660   3600     Packed    6-bit-depth image of 60 × 60 = 3600 pixels
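
The record layout above can be read with the standard library alone. The following is a minimal sketch, not a reference implementation: the function names `unpack_6bit` and `parse_record` are hypothetical, the serial index is assumed to be an unsigned 36-bit big-endian integer, and the image is assumed to be stored row by row. T56Code fields and the CO-59 code are returned as raw 6-bit values, since neither code table is given in this section.

```python
RECORD_OCTETS = 2745   # 3660 six-bit bytes * 6 bits / 8
RECORD_VALUES = 3660   # six-bit bytes per record

def unpack_6bit(record):
    """Split one 2745-octet record into 3660 six-bit values (big endian)."""
    if len(record) != RECORD_OCTETS:
        raise ValueError(f"expected {RECORD_OCTETS} octets, got {len(record)}")
    bits = int.from_bytes(record, "big")
    # Value i occupies bits [6*i, 6*i + 6), counted from the most significant end.
    return [(bits >> (6 * (RECORD_VALUES - 1 - i))) & 0x3F
            for i in range(RECORD_VALUES)]

def parse_record(record):
    """Slice a record into its fields (1-based byte ranges from the table)."""
    v = unpack_6bit(record)
    serial = 0
    for value in v[0:6]:            # bytes 1-6, assumed unsigned big endian
        serial = (serial << 6) | value
    return {
        "serial": serial,
        "source": v[6],             # byte 7, raw T56 code
        "class": v[12:18],          # bytes 13-18, raw T56 codes
        "font": v[18:24],           # bytes 19-24, raw T56 codes
        "co59": (v[28], v[29]),     # bytes 29-30, raw pair
        # bytes 61-3660: 60 rows of 60 pixels, row-major order assumed
        "pixels": [v[60 + 60 * r: 120 + 60 * r] for r in range(60)],
    }
```

Each pixel is an intensity from 0 to 63. To read record k (zero-based) from a file, seek to `k * RECORD_OCTETS` and read `RECORD_OCTETS` octets, assuming the file holds nothing but consecutive records.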

Files

filename  # records  serial numbers  # categories  # sheets  source  font    original files
ETL2-1    9056       1-11520         1136          24        A       MINCHO  KPSM1-KPSM4
ETL2-2    10480      11521-23040     1048          24        A       MINCHO  KPSM5-KPSM8
ETL2-3    11360      28801-40320     1136          24        C       MINCHO  KPTM1-KPTM4
ETL2-4    10480      40321-51840     1048          24        C       MINCHO  KPTM5-KPTM8
ETL2-5    11420      23041-28800     571           24        B       GOTHIC  KPSG1-KPSG2
                     51841-57600                             D       GOTHIC  KPTG1-KPTG2
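
The record counts in this table can be checked against the total of 52796 samples, and they also give a way to sanity-check a downloaded file. The sketch below assumes each file contains only back-to-back 2745-octet records with no header or trailer, which fits "fixed record length without control words" but is not stated outright. The helper names are hypothetical.

```python
RECORD_OCTETS = 2745  # 3660 six-bit bytes per record

# Record counts copied from the table above.
RECORDS = {
    "ETL2-1": 9056,
    "ETL2-2": 10480,
    "ETL2-3": 11360,
    "ETL2-4": 10480,
    "ETL2-5": 11420,
}

def expected_size(name):
    """File size in octets, assuming the file holds only records."""
    return RECORDS[name] * RECORD_OCTETS

def record_offset(index):
    """Octet offset of the record at a zero-based position in a file."""
    return index * RECORD_OCTETS

total_records = sum(RECORDS.values())
print(total_records)  # should equal the total samples listed under Compile
```

Comparing `expected_size(name)` with the size reported by `os.path.getsize` is a cheap way to detect a truncated download.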

Samples

filename  record  metadata
ETL2_1    1       1 A KANJI MINCHO 上
ETL2_2    101     11621 A KANJI MINCHO 浴 (see note)
ETL2_3    201     29001 C KANJI MINCHO 内
ETL2_4    301     40621 C KANJI MINCHO 淡
ETL2_5    401     23441 B KANJI GOTHIC 切

Note: in ETL2_2 record 101, the stored character code is wrong. It is the code adjacent to the true one, and the character is shown here as stored.

Python code:

This code assumes that the file ‘co59-utf8.txt’ is in the same directory.