Anyone who wants to obtain copies of Database must agree the Terms of Use.
Terms of Use
- [Definition] The ETL Character Database (hereinafter referred to as “Database”) is composed of 9 datasets of scanned images of handwritten and printed characters supplied formerly by the Electrotechnical Laboratory (ETL) and currently by its reorganized successor the National Institute of Advanced Industrial Science and Technology (AIST).
- [Copyright] AIST retain the copyright of Database.
- [Usage] Use of Database is allowed for free.
- [Reference] Publication referring to and including any part of Database should be indicated by the term ETL Character Database (or ETLn Character Database for specific dataset where n stands for the relevant dataset number) with the reference: Electrotechnical Laboratory, Japanese Technical Committee for Optical Character Recognition, ETL Character Database, 1973-1984.
- [Distribution] Distribution of Database should only be made through this web site. Unauthorized distribution and publication of the data itself beyond the scope of quotation and/or direct URL links to data files are prohibited.
- [Privacy Policy] The information of users is utilized for notification about Database and analysis of the usage of Database. A request for disclosure of information gathered by this site may be made in accordance with Japanese laws and regulations. In the event of such a request, no personal information will be disclosed to third party unless the individual concerned has agreed to such disclosure or other special circumstances apply.
- [Responsibility] AIST accepts no responsibility for any adverse effects as a result from downloading files of Database.
- [Revision Date] The terms of use for Database were revised on 2025-03-28.
File Formats
Each dataset is segmented into multiple data files and packed into a zip file including an additional plain text file (ETLnINFO) describing the number of records contained in each data file.
All data is stored in binary format. The format is derived from the magnetic tape on which the data was originally recorded. All records in a file have the same fixed length without control sequence. The storage unit is 8 bits per byte except for ETL2-5 whose storage unit is 6 bits per byte. Bit order is the big endian. There are 7 different formats with several character encodings depending on the dataset.
- M-type (ETL1, ETL6, ETL7), encodings: JIS X 0201, extended EBCDIC
- K-type (ETL2), encodings: CO-59, T56
- C-type (ETL3, ETL4, ETL5), encodings: JIS X 0201, extended EBCDIC, T56
- B-type (ETL8B), encodings: JIS X 0208
- G-type (ETL8G), encodings: JIS X 0208
- B-type (ETL9B), encodings: JIS X 0208
- G-type (ETL9G), encodings: JIS X 0208
Unpacking the Database
For unpacking the contents, download and unzip
unpack_etlcdb.zip (51 downloads )
. Install the required packages using
pip install -r requirements.txt
For example for unpacking the file ETL1/ETL1_01
, run
python unpack.py ETL1/ETL1_01
which will generate a folder named ETL1/ETL1_01_unpack
containing image files and a metadata CSV file. The image filename is numbered (e.g. 00000.png
), which is equal to the line index of the CSV file. The input file path needs to be adjusted as needed for your environment.
You can check the usage instructions with
python unpack.py --help
Use --fields
to specify the fields to be written in the CSV file.
This script should work with Python 3.9 or later.
There is a user project called etlcdb-image-extractor.