Specifications of ETL4

Summary

ETL4 is a dataset of handwritten images of 51 Hiragana characters. The images were made from OCR sheets collected at Nagoya University, which were scanned at the Electrotechnical Laboratory with the TOSBAC-3400 scanning system in fiscal year 1974 (S49).


Data Collection

OCR Sheet (same as ETL1)

  • Sheet: B5, 90kg per 1000 sheets
  • Dropout color: No. 26 Violet 50% Screen (DNP)
  • Frame size (width x height): 5mm x 7mm
  • Frame pitch (width x height): 7.62mm x 12.7mm
  • Number of frames: 10 x 12 = 120

Characters

  • Hiragana: 51(あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやいゆえよらりるれろわゐうゑをん)
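
Each record identifies its character with a JIS X 0201 code (see the Format section), and JIS X 0201 represents kana as half-width katakana. The sketch below shows one way to turn such a code into the corresponding hiragana. The function name `jis_x0201_to_hiragana` is hypothetical, and the accepted range (0xA6 and 0xB1 to 0xDD) is an assumption. JIS X 0201 has no code for ゐ or ゑ, and this document does not say how ETL4 labels them, so the sketch does not cover those two characters.

```python
import unicodedata

def jis_x0201_to_hiragana(code: int) -> str:
    """Map a JIS X 0201 katakana code to the matching hiragana (sketch)."""
    # Assumption: only plain kana letters are expected (no small kana, no marks).
    if not (code == 0xA6 or 0xB1 <= code <= 0xDD):
        raise ValueError("code 0x%02x is outside the handled kana range" % code)
    # The shift_jis codec decodes single bytes 0xA1-0xDF as half-width katakana.
    halfwidth = bytes([code]).decode("shift_jis")
    # NFKC normalization turns half-width katakana into full-width katakana.
    fullwidth = unicodedata.normalize("NFKC", halfwidth)
    # In Unicode, each hiragana sits 0x60 code points below its katakana.
    return chr(ord(fullwidth) - 0x60)

print(jis_x0201_to_hiragana(0xB1))  # あ
```

The value 0xB1 is the JIS code shown in the sample output for the first record.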

Data Collection

Scanning System

  • Scanner: Flying Spot Scanner (FSS) with a Flying Spot Cathode Ray Tube 5CNP16 and a Photomultiplier Tube 7696
  • Interval: 0.133mm x 0.133mm
  • Spot diameter: 0.1333mm
  • Intensity levels: 16 (4bit)
  • Number of pixels: 72 x 76 = 5,472 pixels

Compilation

  • Location: Electrotechnical Laboratory (ETL)
  • Computer: TOSBAC-3400/41
  • Software: FSSTOMT
  • Date of Compilation: Dec. 1974
  • Date of Scanning: Dec. 1974

Format

  • C-Type Data Format (ETL3, ETL4, ETL5)
  • Fixed Record Length without Control Words
  • Logical record length is 3936 bytes (6 bits / byte) or 2952 octets (1 octet = 8 bits)
  • Big endian
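
The record length arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the constant and function names are hypothetical, and `record_count` assumes the file consists only of whole fixed-length records, which follows from "Fixed Record Length without Control Words" but is not verified here.

```python
SIXBIT_BYTES_PER_RECORD = 3936                       # logical record length
BITS_PER_RECORD = SIXBIT_BYTES_PER_RECORD * 6        # 23616 bits
OCTETS_PER_RECORD = BITS_PER_RECORD // 8             # 2952 octets

# The image field starts at 6-bit byte 289, i.e. after 288 six-bit bytes.
IMAGE_OCTET_OFFSET = 288 * 6 // 8                    # 216 octets into the record

def record_offset(index: int) -> int:
    """Octet offset of record `index` (0-based) from the start of the file."""
    return index * OCTETS_PER_RECORD

def record_count(file_size: int) -> int:
    """Number of whole records in a file of `file_size` octets."""
    return file_size // OCTETS_PER_RECORD

print(OCTETS_PER_RECORD, record_offset(1), IMAGE_OCTET_OFFSET)
```

With a real file, `record_count(os.path.getsize(path))` gives the number of records to iterate over.
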

Record layout (byte positions are 1-based and count 6-bit bytes):

  Byte range   # bytes  Type     Contents
  1 – 6        6        Integer  Serial Data Number
  7 – 12       6        Integer  Serial Sheet Number
  13 – 18      6        Binary   JIS Code (Effective bits = Left 8 bits) (JIS X 0201)
  19 – 24      6        Binary   EBCDIC Code (Effective bits = Left 8 bits)
  25 – 28      4        T56Code  4 Character Code (e.g. “N  0”, “A  A”, “S  +”, “K KA”)
  29 – 30      2        T56Code  Spaces
  31 – 36      6        Integer  Evaluation of Individual Character Image (0=clean, 1, 2, 3)
  37 – 42      6        Integer  Evaluation of Character Group (0=clean, 1, 2)
  43 – 48      6        Integer  Sample Position Y on Sheet
  49 – 54      6        Integer  Sample Position X on Sheet
  55 – 60      6        Integer  Male-Female (Gender) Code (1=male, 2=female) (JIS X 0303)
  61 – 66      6        Integer  Age of Writer
  67 – 72      6        Integer  Industry Classification Code (JIS X 0403)
  73 – 78      6        Integer  Occupation Classification Code (JIS X 0404)
  79 – 84      6        Integer  Sheet Gathering Date
  85 – 90      6        Integer  Scanning Date
  91 – 96      6        Integer  Number of X-Axis Sampling Points (image width)
  97 – 102     6        Integer  Number of Y-Axis Sampling Points (image height)
  103 – 108    6        Integer  Number of Levels of Pixel
  109 – 114    6        Integer  Magnification of Scanning Lens
  115 – 120    6        Integer  Serial Data Number (old)
  121 – 288    168      -        (unused)
  289 – 3936   3648     Packed   Image data of 72 x 76 (width x height) = 5472 pixels with 16 gray levels (4 bits / pixel)
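
Because the fields are counted in 6-bit bytes, they do not line up with octet boundaries. One way to read them without a bit-stream library is to treat the whole big-endian record as a single integer and shift out each field. The following is a minimal sketch, not the reference decoder: the names `make_field_reader` and `parse_header` are hypothetical, only a subset of the fields is extracted, and the T56 table is copied from the sample code in the Sample section.

```python
# T56 character set, indexed by 6-bit code (taken from the sample code).
T56 = '0123456789[#@:>? ABCDEFGHI&.](<  JKLMNOPQR-$*);\'|/STUVWXYZ ,%="!'
RECORD_OCTETS = 2952

def make_field_reader(record: bytes):
    """Return a function field(first, last) over 1-based 6-bit byte positions."""
    if len(record) != RECORD_OCTETS:
        raise ValueError("expected a record of %d octets" % RECORD_OCTETS)
    value = int.from_bytes(record, "big")   # the record is big endian
    total_bits = RECORD_OCTETS * 8

    def field(first: int, last: int) -> int:
        width = (last - first + 1) * 6      # field width in bits
        shift = total_bits - last * 6       # bits to the right of the field
        return (value >> shift) & ((1 << width) - 1)

    return field

def parse_header(record: bytes) -> dict:
    """Extract a few header fields from one record."""
    field = make_field_reader(record)
    return {
        "serial_data_number": field(1, 6),
        "serial_sheet_number": field(7, 12),
        # Only the left 8 of the 36 bits are effective for the two codes.
        "jis_code": field(13, 18) >> 28,
        "ebcdic_code": field(19, 24) >> 28,
        "char_code": "".join(T56[field(i, i)] for i in range(25, 29)),
        "width": field(91, 96),
        "height": field(97, 102),
        "levels": field(103, 108),
    }
```

Converting the record to one integer keeps the sketch short. For bulk processing, a library such as bitstring or a NumPy-based unpacker is likely a better fit.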

Sample

Code for unpacking ETL4 in Python 3 (requires the bitstring, Pillow and matplotlib packages).

# ETL-4
import bitstring
from PIL import Image, ImageEnhance
from matplotlib import pyplot as plt

# T56 character set, indexed by 6-bit code
t56s = '0123456789[#@:>? ABCDEFGHI&.](<  JKLMNOPQR-$*);\'|/STUVWXYZ ,%="!'

def read_record_ETL4(filename, pos=0):
    # Each record is 2952 octets long; seek to record number `pos`.
    f = bitstring.ConstBitStream(filename=filename)
    f.bytepos = pos * 2952
    # Field widths are in bits. The image is 72 x 76 pixels x 4 bits = 21888 bits.
    r = f.readlist('2*uint:36,uint:8,pad:28,uint:8,pad:28,4*uint:6,pad:12,15*uint:36,pad:1008,bits:21888')
    print('Serial Data Number:', r[0])
    print('Serial Sheet Number:', r[1])
    print('JIS Code:', hex(r[2]))
    print('EBCDIC Code:', hex(r[3]))
    print('4 Character Code:', ''.join([t56s[c] for c in r[4:8]]))
    print('Evaluation of Individual Character Image:', r[8])
    print('Evaluation of Character Group:', r[9])
    print('Sample Position Y on Sheet:', r[10])
    print('Sample Position X on Sheet:', r[11])
    print('Male-Female Code:', r[12])
    print('Age of Writer:', r[13])
    print('Industry Classification Code:', r[14])
    print('Occupation Classification Code:', r[15])
    print('Sheet Gathering Date:', r[16])
    print('Scanning Date:', r[17])
    print('Number of X-Axis Sampling Points:', r[18])
    print('Number of Y-Axis Sampling Points:', r[19])
    print('Number of Levels of Pixel:', r[20])
    print('Magnification of Scanning Lens:', r[21])
    print('Serial Data Number (old):', r[22])
    return r

filename = 'ETL4/ETL4C'  # specify the ETL4 filename here
r = read_record_ETL4(filename)

# Decode the packed 4-bit pixels into a float image with values 0..15.
iF = Image.frombytes('F', (r[18], r[19]), r[-1].tobytes(), 'bit', 4)
# Convert to 8-bit grayscale, then scale the 16 gray levels up for display.
iL = iF.convert('L')
enhancer = ImageEnhance.Brightness(iL)
iE = enhancer.enhance(r[20])
plt.imshow(iE, cmap='gray')
plt.show()

Sample output for record 0 (metadata; character image omitted)

Serial Data Number: 500100
Serial Sheet Number: 5001
JIS Code: 0xb1
EBCDIC Code: 0x81
4 Character Code: H A
Evaluation of Individual Character Image: 0
Evaluation of Character Group: 0
Sample Position Y on Sheet: 1
Sample Position X on Sheet: 0
Male-Female Code: 1
Age of Writer: 23
Industry Classification Code: 9144
Occupation Classification Code: 11
Sheet Gathering Date: 741202
Scanning Date: 741216
Number of X-Axis Sampling Points: 72
Number of Y-Axis Sampling Points: 76
Number of Levels of Pixel: 16
Magnification of Scanning Lens: 133
Serial Data Number (old): 0
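
The image field starts at 6-bit byte 289, that is after 288 x 6 = 1728 bits = 216 octets, so the packed pixels are octet-aligned and can also be unpacked without Pillow. The sketch below assumes that each octet holds two pixels with the high nibble first, which is consistent with the big-endian layout. The names `unpack_pixels` and `read_pixels` are hypothetical.

```python
RECORD_OCTETS = 2952
IMAGE_OFFSET = 288 * 6 // 8        # 216 octets of header and unused bytes
IMAGE_OCTETS = 72 * 76 * 4 // 8    # 2736 octets of packed pixels

def unpack_pixels(packed: bytes, width: int = 72, height: int = 76):
    """Unpack 4-bit pixels into a list of rows with values 0..15."""
    if len(packed) * 2 != width * height:
        raise ValueError("packed data does not match the image size")
    flat = []
    for octet in packed:
        flat.append(octet >> 4)     # first pixel: high nibble (assumed order)
        flat.append(octet & 0x0F)   # second pixel: low nibble
    return [flat[row * width:(row + 1) * width] for row in range(height)]

def read_pixels(path: str, index: int = 0):
    """Read the image of record `index` (0-based) from an ETL4 file."""
    with open(path, "rb") as fh:
        fh.seek(index * RECORD_OCTETS + IMAGE_OFFSET)
        return unpack_pixels(fh.read(IMAGE_OCTETS))
```

For example, `read_pixels('ETL4/ETL4C', 0)` would return 76 rows of 72 gray levels for the first record, which can be passed to `plt.imshow(rows, cmap='gray', vmin=0, vmax=15)`.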