Specification of ETL-2

Data Collection

OCR Sheets

  • Paper: B4, 90 kg per 1000 sheets
  • Samples: e2shta01, e2shtb01, e2shtc01, e2shtd01

Characters

  • 2184 characters in CO-59 codeset
  • Hiragana, katakana, roman alphabet, symbols, kanji
  • 8-point fonts from metal-type printing for newspapers
  • 9-point fonts from offset printing for the publication of patent applications

Scanner

  • ITV Camera Scanner 240×240
  • Sampling interval: 54μm x 54μm
  • Spot size: 54μm
  • Intensity levels: 64 (6 bits)
  • Number of pixels: 60 x 60 = 3600
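
The scanner figures above imply a few derived quantities. The following is a minimal sketch of that arithmetic; it assumes the sampling interval is the pixel pitch, so the side length of the scanned area is an inference rather than a stated specification value. The variable names are illustrative.

```python
# Arithmetic implied by the scanner parameters listed above.
SAMPLING_INTERVAL_UM = 54   # sampling interval per pixel, in micrometres
PIXELS_PER_SIDE = 60        # 60 x 60 pixels per character image
INTENSITY_LEVELS = 64       # grey levels per pixel

# Assumption: pixel pitch equals the sampling interval.
side_mm = SAMPLING_INTERVAL_UM * PIXELS_PER_SIDE / 1000

bits_per_pixel = INTENSITY_LEVELS.bit_length() - 1    # 64 levels -> 6 bits
image_bits = PIXELS_PER_SIDE ** 2 * bits_per_pixel    # bits per character image

print(side_mm, bits_per_pixel, image_bits)
```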

Compile

  • Source of Collection: Dai Nippon Printing Co., Ltd.; The Mainichi Newspapers Co., Ltd.
  • Total samples: 52796
  • Scanning: Toshiba
  • Computer: TOSBAC-40C TOSPICS
  • Date of Collection: October 1973
  • Date of Scanning: October 1973

Format

  • Fixed record length without control words
  • Big endian
  • 6 bits per byte
  • 3660 bytes = 2745 octets per record

bytes     # bytes  type      contents
1-6       6        Integer   Serial Index
7         1        T56Code   Source (‘A’: Mincho Newspaper, ‘B’: Gothic Newspaper, ‘C’: Mincho Patent, ‘D’: Gothic Patent)
8-12      5        T56Code   Spaces
13-18     6        T56Code   Class (‘KANJI’: kanji, ‘EIJI’: roman alphabets, ‘HRKANA’: hiragana, ‘KTKANA’: katakana, ‘KIGO’: special characters, ‘SUUJI’: numerals)
19-24     6        T56Code   Font (‘MINCHO’, ‘GOTHIC’)
25-28     4                  Zeros
29-30     2        Integer   CO-59 Code
31-60     30                 (undefined)
61-3660   3600     Packed    6-bit-depth image of 60 x 60 = 3600 pixels
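
The record layout above can also be decoded with the standard library alone. The sketch below is one possible approach, not the reference implementation: the names `sixbit_bytes` and `parse_record` are hypothetical, the serial index is treated as an unsigned 36-bit integer, and the T56 and CO-59 fields are returned as raw 6-bit values without being translated to characters.

```python
RECORD_OCTETS = 3660 * 6 // 8   # 3660 six-bit bytes = 2745 octets

def sixbit_bytes(record):
    """Split a record into its six-bit bytes (big endian, 3 octets -> 4 values)."""
    values = []
    for i in range(0, len(record), 3):
        word = int.from_bytes(record[i:i + 3], 'big')   # 24 bits
        values.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return values

def parse_record(record):
    """Return the fields of one 2745-octet record as raw six-bit values."""
    v = sixbit_bytes(record)
    serial = 0
    for x in v[0:6]:                 # bytes 1-6: serial index
        serial = (serial << 6) | x
    return {
        'serial': serial,
        'source': v[6],              # byte 7 (T56 code)
        'class': v[12:18],           # bytes 13-18 (T56 codes)
        'font': v[18:24],            # bytes 19-24 (T56 codes)
        'co59': (v[28], v[29]),      # bytes 29-30
        'pixels': v[60:3660],        # bytes 61-3660, values 0-63 in stored order
    }
```

Because 2745 is a multiple of 3, every record splits cleanly into 24-bit groups of four six-bit bytes, which avoids bit-level stream handling.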

Files

filename  # records  serial numbers  # categories  # sheets  source  font    original files
ETL2_1    9056       1-11520         1136          24        A       MINCHO  KPSM1-KPSM4
ETL2_2    10480      11521-23040     1048          24        A       MINCHO  KPSM5-KPSM8
ETL2_3    11360      28801-40320     1136          24        C       MINCHO  KPTM1-KPTM4
ETL2_4    10480      40321-51840     1048          24        C       MINCHO  KPTM5-KPTM8
ETL2_5    11420      23041-28800     571           24        B       GOTHIC  KPSG1-KPSG2
                     51841-57600                             D       GOTHIC  KPTG1-KPTG2
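
Since records are fixed length with no control words, the record counts above determine the expected size of each file. The following sketch computes those sizes as a sanity check for downloaded data. It uses the underscore file names that appear in the sample table and the Python code, and it assumes the files contain nothing besides the records.

```python
RECORD_OCTETS = 2745   # 3660 six-bit bytes per record

# Record counts copied from the table above.
RECORDS = {
    'ETL2_1': 9056,
    'ETL2_2': 10480,
    'ETL2_3': 11360,
    'ETL2_4': 10480,
    'ETL2_5': 11420,
}

# Expected size in octets, assuming no header or trailer in the files.
expected_size = {name: count * RECORD_OCTETS for name, count in RECORDS.items()}
total_records = sum(RECORDS.values())

for name, size in expected_size.items():
    print(name, size)
print('total records:', total_records)
```

The total agrees with the 52796 samples stated in the Compile section.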

Samples

filename  record  metadata                  image
ETL2_1    1       1 A KANJI MINCHO 上       1
ETL2_2    101     11621 A KANJI MINCHO 浴   11621
ETL2_3    201     29001 C KANJI MINCHO 内   29001
ETL2_4    301     40621 C KANJI MINCHO 淡   40621
ETL2_5    401     23441 B KANJI GOTHIC 切   23441

Note on ETL2_2, record 101: the stored character code is wrong (it is adjacent to the true code), but it is shown here as stored.
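
The record numbers in the samples above appear to count from 1 within each file; this is an inference from the serial numbers, not something the specification states. Under that assumption, a record can be fetched directly by seeking to its offset. The helper name `read_record` is hypothetical, and the demonstration uses an in-memory stand-in rather than a real data file.

```python
import io

RECORD_OCTETS = 2745   # 3660 six-bit bytes per record

def read_record(stream, record_number):
    """Return the raw octets of a 1-based record from a binary stream."""
    stream.seek((record_number - 1) * RECORD_OCTETS)
    data = stream.read(RECORD_OCTETS)
    if len(data) != RECORD_OCTETS:
        raise EOFError('record %d is past the end of the stream' % record_number)
    return data

# In-memory stand-in for a data file: three dummy records filled with 1, 2, 3.
fake_file = io.BytesIO(bytes([1]) * RECORD_OCTETS +
                       bytes([2]) * RECORD_OCTETS +
                       bytes([3]) * RECORD_OCTETS)
second = read_record(fake_file, 2)
print(second[0], len(second))
```

With a real file, the stream would come from `open(path, 'rb')` instead of `io.BytesIO`.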

Python code:

import codecs
import bitstring
from PIL import Image, ImageEnhance

# T56 code table: the index is the 6-bit code value, the entry is its character.
t56s = '0123456789[#@:>? ABCDEFGHI&.](<  JKLMNOPQR-$*);\'|/STUVWXYZ ,%="!'
def T56(c):
    return t56s[c]

# Load the CO-59 table: whitespace-separated entries of the form 'char:n1,n2'.
with codecs.open('co59-utf8.txt', 'r', 'utf-8') as co59f:
    co59t = co59f.read()
co59l = co59t.split()
CO59 = {}
for c in co59l:
    ch = c.split(':')
    co = ch[1].split(',')
    CO59[(int(co[0]),int(co[1]))] = ch[0]

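filename = 'ETL2/ETL2_1'
skip = 0    # number of records to skip before the one to read

f = bitstring.ConstBitStream(filename=filename)
f.pos = skip * 6 * 3660    # bit offset: 3660 six-bit bytes per record
# Fields: serial, source, (spaces), class, font, (zeros), CO-59 code, (undefined), image
r = f.readlist('int:36,uint:6,pad:30,6*uint:6,6*uint:6,pad:24,2*uint:6,pad:180,bytes:2700')
print(r[0], T56(r[1]), "".join(map(T56, r[2:8])), "".join(map(T56, r[8:14])), CO59[tuple(r[14:16])])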
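# Unpack the 6-bit pixels into a 60 x 60 image.
iF = Image.frombytes('F', (60,60), r[16], 'bit', 6)
iP = iF.convert('P')
fn = '{:d}.png'.format(r[0])
#iP.save(fn, 'PNG', bits=6)
# Pixel values are 0-63, so scale brightness by 4 to fill the 8-bit range.
enhancer = ImageEnhance.Brightness(iP)
iE = enhancer.enhance(4)
iE.save(fn, 'PNG')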

This code assumes that the file ‘co59-utf8.txt’ is in the current working directory and that the data file is at ‘ETL2/ETL2_1’ relative to it.
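
The layout of ‘co59-utf8.txt’ is not described in this specification. Judging only from how the code above parses it, the file holds whitespace-separated entries of the form `char:n1,n2`. The sketch below wraps that parsing in a function, hypothetically named `parse_co59`, and runs it on made-up entries; the code pairs shown are illustrative and are not real CO-59 codes.

```python
def parse_co59(text):
    """Map (n1, n2) code pairs to characters from 'char:n1,n2' entries."""
    table = {}
    for entry in text.split():
        # Split at the last colon so the code part is always 'n1,n2'.
        char, _, code = entry.rpartition(':')
        first, second = code.split(',')
        table[(int(first), int(second))] = char
    return table

# Made-up entries for illustration only; these are not real CO-59 codes.
sample_text = 'あ:1,2 ア:3,4\n上:5,6'
table = parse_co59(sample_text)
print(table[(5, 6)])
```

Keying the dictionary by a tuple matches the lookup `CO59[tuple(r[14:16])]` used in the code above.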