Data Collection
OCR Sheets
- Paper: B4, 90 kg per 1000 sheets
- Samples:
Characters
- 2184 characters in the CO-59 code set
- CO-59 (六社協定新聞社用) is a Japanese character code standardized by six major Japanese newspaper companies in 1959. Its code table is called 漢テレファックス符号及び文字配列表. A function for converting the codes to UTF-8 is included in the sample script.
- Hiragana, katakana, Roman alphabet, symbols, and kanji
- 8-point fonts from metal-type printing for newspapers
- 9-point fonts from offset printing for the publication of patent applications
Scanner
- ITV Camera Scanner 240×240
- Sampling interval: 54 μm × 54 μm
- Spot size: 54 μm
- Intensity levels: 64 (6 bits)
- Number of pixels: 60 × 60 = 3600
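The scanner figures above imply the physical size of the scanned frame and the raw size of one image. A minimal sketch of that arithmetic follows; the variable names are illustrative only, and it assumes each pixel is stored as exactly one 6-bit value.

```python
# Scanner parameters taken from the list above.
PIXELS_PER_SIDE = 60        # 60 x 60 pixel frame
SAMPLING_INTERVAL_UM = 54   # micrometres between samples
BITS_PER_PIXEL = 6          # 64 intensity levels

# Side length of the scanned frame, in micrometres (3240 um = 3.24 mm).
frame_side_um = PIXELS_PER_SIDE * SAMPLING_INTERVAL_UM

# Raw image size, assuming one 6-bit value per pixel.
image_pixels = PIXELS_PER_SIDE * PIXELS_PER_SIDE
image_bits = image_pixels * BITS_PER_PIXEL
image_octets = image_bits // 8

print(frame_side_um, image_pixels, image_bits, image_octets)
```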
Compile
- Source of Collection: Dai Nippon Printing Co., Ltd., The Mainichi Newspapers Co., Ltd.
- Total samples: 52796
- Scanning: Toshiba
- Computer: TOSBAC-40C TOSPICS
- Date of Collection: October 1973
- Date of Scanning: October 1973
Format
- Fixed record length without control words
- Big endian
- 6 bits per byte
- 3660 six-bit bytes = 2745 octets per record (3660 × 6 = 2745 × 8 = 21960 bits)
- File Formats and Sample Script
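Because each byte is 6 bits wide, a record read from disk as 2745 octets has to be regrouped into 3660 six-bit values before use. The following is a minimal sketch of that regrouping, not the sample script itself: the function name `unpack_sixbit` is hypothetical, and it assumes the six-bit values are packed back to back in big-endian bit order, so that every 3 octets hold 4 values. It does not interpret the fields inside a record.

```python
RECORD_OCTETS = 2745      # record length in 8-bit octets
RECORD_SIXBIT = 3660      # record length in 6-bit bytes


def unpack_sixbit(record: bytes) -> list[int]:
    """Regroup one record of octets into 6-bit values (0..63).

    Assumes big-endian, back-to-back packing: 3 octets -> 4 values.
    """
    if len(record) != RECORD_OCTETS:
        raise ValueError(f"expected {RECORD_OCTETS} octets, got {len(record)}")
    values = []
    for i in range(0, RECORD_OCTETS, 3):
        word = int.from_bytes(record[i:i + 3], "big")  # 24 bits
        values.append((word >> 18) & 0x3F)
        values.append((word >> 12) & 0x3F)
        values.append((word >> 6) & 0x3F)
        values.append(word & 0x3F)
    return values


# Synthetic record: the 6-bit values 1, 2, 3, 4 repeated 915 times.
# 000001 000010 000011 000100 -> octets 0x04 0x20 0xC4
demo_record = bytes([0x04, 0x20, 0xC4]) * (RECORD_OCTETS // 3)
demo_values = unpack_sixbit(demo_record)
print(len(demo_values), demo_values[:8])
```

If the 3600 pixels are stored at one six-bit byte each, 60 six-bit bytes per record would remain for non-image fields; the file format description gives the actual layout.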
Files
filename | # records | serial numbers | # categories | # sheets | source | font | original files |
ETL2-1 | 9056 | 1-11520 | 1136 | 24 | A | MINCHO | KPSM1-KPSM4 |
ETL2-2 | 10480 | 11521-23040 | 1048 | 24 | A | MINCHO | KPSM5-KPSM8 |
ETL2-3 | 11360 | 28801-40320 | 1136 | 24 | C | MINCHO | KPTM1-KPTM4 |
ETL2-4 | 10480 | 40321-51840 | 1048 | 24 | C | MINCHO | KPTM5-KPTM8 |
ETL2-5 | 11420 | 23041-28800 51841-57600 | 571 | 24 | B D | GOTHIC GOTHIC | KPSG1-KPSG2 KPTG1-KPTG2 |
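The per-file record counts can be cross-checked against the total of 52796 samples given under Compile. A short sketch of that check follows, with the numbers copied from the table; the dictionary names are illustrative only.

```python
# Record and category counts copied from the Files table.
records = {
    "ETL2-1": 9056,
    "ETL2-2": 10480,
    "ETL2-3": 11360,
    "ETL2-4": 10480,
    "ETL2-5": 11420,
}
categories = {
    "ETL2-1": 1136,
    "ETL2-2": 1048,
    "ETL2-3": 1136,
    "ETL2-4": 1048,
    "ETL2-5": 571,
}

# Sum of all files; should match the "Total samples" figure.
total_records = sum(records.values())

# Average number of records per category in each file.
per_category = {name: records[name] / categories[name] for name in records}

print(total_records)
for name, ratio in per_category.items():
    print(name, round(ratio, 2))
```

The ratio is a whole number for ETL2-2 through ETL2-5 but not for ETL2-1, so a reader should not assume every category in ETL2-1 has the same number of samples.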