The following paper is being made available by the authors for comment.

	Any comments readers wish to make can be sent to the authors at the address
	below, or can be directed by email to Jack Halpern c/- Jeffrey Friedl
	(jfriedl@omron.co.jp)

	Jim Breen
	Monash University
	February 1994

----------------------------------------------------------------------------

            BUILDING OF COMPREHENSIVE DATABASE FOR THE COMPILATION OF
                  INTEGRATED  KANJI DICTIONARIES AND TOOLS 
           
           -------------------------------------------------------
           |Summary of paper to be presented at Euralex '94, an  |
           |international congress for lexicographers to be held |
           |at the Free University Amsterdam from August 30      |
           |to September 3, 1994, plus appendixes for reference. |
           -------------------------------------------------------

            Showa Women's University, Institute of Modern Culture
                     KANJI  DICTIONARY  PUBLISHING SOCIETY  
                          漢 英 字 典 刊 行 会
                         1-3-502 3-Chome Niiza
                       Niiza-shi, Saitama 352 JAPAN
                    	  FAX: +81-48-479-1323

                              JACK HALPERN
                   Research Fellow at Institute of Modern Culture
                 Editor in Chief of Kanji Integrated Tools Project
             Editor in Chief of New Japanese-English Character Dictionary

                            MASAAKI NOMURA
                         Professor of Japanese
                 Center for Japanese Language, Waseda University

                           ATSUSHI FUKADA  
                 Assistant Professor of Applied Linguistics
           Center for Linguistic and Cultural Research, Nagoya University
           
           
                            A B S T R A C T

 The New Japanese-English Character Dictionary was designed to provide an in-
depth understanding of how kanji are used in contemporary Japanese. One aim of
this project is to use NJECD to build a comprehensive database with detailed
information on how Chinese characters are used in Chinese, Japanese and Korean,
including printed/calligraphic forms, in-depth semantics, phonemics, encoding
methods, indexing schemes, synonyms, homophones, and voluminous reference data.
A second aim is to use this database to compile about forty applications and
spinoff products for pedagogical and research purposes, including learner's
dictionaries, reference manuals, and CALL software by integrating lexical
semantics and combinatorics with computational lexicography.

1. BACKGROUND 

 Although Japanese has been the subject of various linguistic studies, little
attention has been given to the systematic analysis of its writing system. 
Kanji (Chinese characters as used in Japanese) are combined with each other to
generate countless compound words, and function as a network of interrelated
parts. Though this is vaguely recognized by educators, it has been largely
disregarded in the compilation of character dictionaries. The demand for
effective tools for mastering the Japanese script has been growing at an
unprecedented pace. Learners are in urgent need of dictionaries that
systematically address the special problems of non-Japanese students.

 The New Japanese-English Character Dictionary (NJECD) (Halpern 1990, 1993) was
compiled with the aim of creating a lookup tool that provides an in-depth
understanding of the meanings and functions of high-frequency characters in
contemporary Japanese. The dictionary departs from traditional kanji
lexicography in several ways: (1) the *core meaning* defines the dominant
character sense; (2) detailed meanings show how single-character morphemes
generate numerous compounds; (3) psychologistic ordering reveals the
logical/hierarchical interrelatedness between senses; (4) the System of Kanji
Indexing by Patterns (SKIP) offers a new method for rapid retrieval of entries;
and (5) precise distinctions are drawn between synonyms, homophones, and
orthographic variants (for further details, see Halpern 1990, EURALEX '90
Proceedings).

2. PROJECT AIMS

 This project aims to contribute to Sino-Japanese studies in general, and to
Japanese language studies in particular, in the following four areas:

1. To use NJECD as a basis for creating a comprehensive kanji information 
   database system, which will be referred to as DESK (Database System for 
   Kanji). This database contains detailed information on the use of Chinese 
   characters in Chinese, Japanese and Korean (CJK languages).
2. To use DESK as a basis for compiling about forty applications and spinoff 
   products for pedagogical and research purposes.
3. To provide a comprehensive source of reference data on Chinese characters 
   for pedagogical, linguistic and lexicological research. Some of these data 
   will be made available on the Internet, with certain restrictions to avoid 
   copyright violations.
4. To promote basic research on computational lexicography by establishing 
   methodology for building integrated dictionary databases, especially 
   multilingual databases for storing lexicographic data in a CJK environment.

3. PROJECT OUTLINE 

 To achieve these aims, the Kanji Dictionary Publishing Society was established
in late 1993 as a part of the Institute of Modern Culture at Showa Women's
University. The Society is directed by the Editorial Committee, which consists
of renowned experts in Japanese linguistics, and is financed by the University
and various foundations (1994 budget about US$250,000).

 The DESK database is being used for compiling about forty computer-edited
applications and spinoff products, including teaching and learning aids such as
learner's dictionaries and reference manuals, foreign-language editions such
as a German edition of NJECD, software packages such as CAI/CAL courseware,
electronic books and learning machines, and so on. This series of products will
be referred to as KIT, which stands for Kanji Integrated Tools.

During the initial phase of the project, which will be completed in mid-1994,
the framework and principal components of DESK will be created, and the
electronic book (EB) edition of NJECD will be published. Concurrently, a pilot
system for a pocket edition of NJECD is being built; it too will be completed
in mid-1994.

The following KIT applications will be either published or finalized for
publication over a period of two to three years:

 1. New Japanese-English Character Dictionary: Electronic Book Edition
 2. New Kanji-English Pocket Dictionary 
 3. New Kanji-English Learner's Dictionary
 4. Kanji Input System Based on System of Kanji Indexing by Patterns
 5. Comparative Study of Sino-Japanese Lexical Items
 6. Kanji Cards 
 7. Japanese-English Dictionary of Kanji Synonyms
 8. Japanese-English Dictionary of Kanji Usage

 The EB edition of NJECD is scheduled for publication in the summer of 1994 in
time for presentation at Euralex '94. It incorporates all the features of
NJECD, including core meanings, independent words, homophone/synonym
discrimination, compounds, radicals, a kanji thesaurus, and much more. A
hierarchical menu system enables the user to easily retrieve information by
specifying single or multiple keywords, such as readings, radicals, core
meanings, SKIP patterns and stroke counts, in normal or word-end searches. This,
combined with a comprehensive cross-reference network, provides the user with
multiple search paths to access information with maximum speed and facility.
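
 The multi-key retrieval described above can be sketched roughly as follows.
This is a minimal Python illustration, not the software of the EB edition: the
`Entry` fields, the `search` function and its `word_end` flag are hypothetical
names, and the three sample records (romanized in the style used in this
paper) are included only for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One illustrative dictionary entry (field names are assumptions)."""
    kanji: str
    readings: tuple   # romanized on/kun readings
    strokes: int      # total stroke count
    skip: str         # SKIP pattern code


# Sample records for demonstration only.
ENTRIES = [
    Entry("暖", ("dan", "atatakai"), 13, "1-4-9"),
    Entry("温", ("on", "atatakai"), 12, "1-3-9"),
    Entry("謡", ("yoo", "utai"), 16, "1-7-9"),
]


def search(entries, reading=None, strokes=None, skip=None, word_end=False):
    """Return entries matching every supplied key; keys left as None are ignored."""
    def reading_ok(entry):
        if reading is None:
            return True
        if word_end:
            # Word-end search: the keyword only has to match the end of a reading.
            return any(r.endswith(reading) for r in entry.readings)
        return reading in entry.readings

    return [
        e for e in entries
        if reading_ok(e)
        and (strokes is None or e.strokes == strokes)
        and (skip is None or e.skip == skip)
    ]


# Single key, multiple keys, and a word-end search.
print([e.kanji for e in search(ENTRIES, reading="atatakai")])
print([e.kanji for e in search(ENTRIES, reading="atatakai", strokes=13)])
print([e.kanji for e in search(ENTRIES, reading="tai", word_end=True)])
```

 Combining keys narrows the result set, which is the sense in which several
search paths can lead to the same entry.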

4. LEXICAL SEMANTICS AND COMBINATORICS

 The principal semantic component of DESK was compiled by submitting single-
character morphemes to an exhaustive semantic analysis. The meanings were
analyzed by such techniques as componential analysis and an in-depth
examination of the differences and similarities between near-synonyms, which
served as a powerful technique for establishing precise character meanings.

 Each meaning was analyzed into its single senses, and its relationships to
other members of the same synonym group were examined and compared. That is,
the denotation, connotation, and range of application of each sense were
carefully studied in contrast with those of their near-synonym counterparts,
with emphasis on how the single senses of word-forming elements are influenced
not only by normal syntagmatic relations, but also by often subtle
semantic/functional distinctions dependent on the morphophonemic context. For
example, whereas the Chinese-derived (*on*) bound morpheme 謡 yoo means 'popular
song' in such compounds as 民謡 minyoo 'folk song', the native Japanese (*kun*)
form 謡 utai refers to the chanting of a noh text.
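
 The dependence of sense on reading in this example can be pictured as a lookup
keyed by character and reading type. The sketch below is only an illustration
in Python; the `Sense` record and the `senses_of` helper are hypothetical names
and do not reflect the actual layout of DESK. The data are taken from the 謡
example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    kanji: str
    reading: str   # romanized reading
    kind: str      # "on" (Chinese-derived) or "kun" (native Japanese)
    gloss: str


SENSES = [
    Sense("謡", "yoo", "on", "popular song"),
    Sense("謡", "utai", "kun", "chanting of a noh text"),
]


def senses_of(kanji, kind):
    """Return the glosses recorded for a character under one reading type."""
    return [s.gloss for s in SENSES if s.kanji == kanji and s.kind == kind]


print(senses_of("謡", "on"))
print(senses_of("謡", "kun"))
```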

5. COMPUTATIONAL LEXICOGRAPHY
 
 Although every phase of the compilation and editing of NJECD was computerized,
we faced great difficulties in the initial stages. MS-DOS and database
management systems were not yet in widespread use, and the level of PC
technology was hardly up to the task. Nevertheless, the lack of funds and
technical expertise led us to select Fujitsu's FACOM-9450 series, the most
advanced PC on the market at the time, rather than mini-computers.
 
 To compile, process, and proofread the data for NJECD, we wrote about 700
programs in BASIC and used spreadsheets and other software packages from the
mid-eighties, and had to resort to a series of ingenious tricks to force the
hardware and software to perform tasks they were not designed for. An
inevitable consequence of this was data files of complex structure, quite
unlike the logically organized relational database files of today.

 To produce KIT applications in a short period with maximum efficiency, it was
essential to integrate state-of-the-art computer technology with such
disciplines as computational lexicography and lexical semantics to restructure
the data into a rationally-organized database system (DESK), and to write
software for developing applications drawing data from the database. The work
of building the database and application development is outlined below.

6. DATA AND CODE CONVERSION

 The character set of the computers used to compile NJECD, Fujitsu's now
obsolete FACOM-9450 series, supported only Level 1 characters of JIS C 6226-
1978. Since hundreds of characters were missing from the latter, we were forced
to customize it by creating hundreds of user-defined characters and remapping
hundreds of JIS Level 2 characters to JIS Level 1 codes. This resulted in a
character set basically incompatible with current character set standards,
national or corporate.

 To ensure easy portability to a wide range of hardware and software platforms,
we converted the data to the Shift-JIS code system and updated it to JIS X
0208-1990. In addition, we restored the remapped codes and either recreated or
remapped user-defined characters not present in JIS X 0208-1990, if necessary
by mapping into the supplemental character set JIS X 0212-1990, or the ISO
10646/Unicode character set, in that order. This approach, although complex,
yielded excellent results by keeping user-defined characters to a bare minimum
and ensuring maximum portability. It was suggested by Ken Lunde, an expert on
Japanese encoding methods, who has written a definitive work on the subject
(Lunde 1993).
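
 The fallback order described above (JIS X 0208, then JIS X 0212, then ISO
10646/Unicode) can be sketched as follows. This is not the project's conversion
software, whose details are not given here; it is a minimal Python illustration
which assumes that the standard library's `shift_jis` codec approximates the
JIS X 0208 repertoire and that its `euc_jp` codec writes JIS X 0212 characters
as a three-byte sequence beginning with 0x8F. The function name `classify` is
hypothetical.

```python
def classify(ch: str) -> str:
    """Name the first character set, in priority order, that contains ch."""
    # Step 1: JIS X 0208, approximated by what the shift_jis codec accepts.
    # (That codec also accepts ASCII and JIS X 0201 kana.)
    try:
        ch.encode("shift_jis")
        return "JIS X 0208"
    except UnicodeEncodeError:
        pass

    # Step 2: JIS X 0212, which euc_jp marks with the single-shift byte 0x8F.
    try:
        if ch.encode("euc_jp").startswith(b"\x8f"):
            return "JIS X 0212"
    except UnicodeEncodeError:
        pass

    # Step 3: neither JIS set has the character; only Unicode remains.
    return "Unicode only"


for ch in "暖丂𠀀":
    print(ch, f"U+{ord(ch):04X}", classify(ch))
```

 Checking the sets in this order keeps each character in the most widely
supported set that contains it, which is what holds user-defined characters to
a minimum.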

7. SYSTEM ANALYSIS AND DATABASE DESIGN

 Each entry character is associated with numerous attributes, such as a core
meaning, various readings, multiple senses for each reading, and stylistic
labels, and is also a member of various cross-reference networks. For example, 
暖 and 温 share the *kun* reading *atatakai* but have slightly different
connotations when used as free morphemes. On the other hand, 煖 and 暖 share
the same meanings and *on* reading *dan* as word elements, e.g. as a verb 'to
warm', but the free form 煖かい *atatakai* 'warm' is not normally used.

 The entry characters and their attributes thus form an inherently complex
network of semantic, orthographic and phonologic relations and subrelations
often interrelated in highly complex hierarchical structures that do not easily
lend themselves to representation by traditional one-to-many and many-to-many
relations. Expressing such intricate interrelations in a manner conducive to
their effective extraction and analysis approaches the limits of relational
databases and would ideally call for a network database design. To do so within
the limits of RDB systems requires a thorough analysis aimed at discovering the
most effective constructs that will, on the one hand, capture and represent the
relations between entry characters, compounds, and their respective attributes,
and, on the other, allow easy manipulation of the data with a view to
efficiently generating a wide range of applications.

 In spite of these limitations, we have chosen to adopt dBASE IV, a relational
database management system, for a number of reasons, especially its universal
availability, ease of manipulating data and developing applications using the
Xbase language, and easy portability to other systems. We are also using Perl,
a powerful language for text processing and string manipulation.
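
 The relations discussed in this section can be approximated with ordinary
one-to-many and many-to-many tables. The sketch below uses Python's built-in
`sqlite3` module purely for illustration; DESK itself is built on dBASE IV, and
the table and column names here are assumptions rather than the project's
actual structures. The sample rows come from the 暖/温/煖 example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kanji   (id INTEGER PRIMARY KEY, glyph TEXT UNIQUE);
    CREATE TABLE reading (id INTEGER PRIMARY KEY,
                          kind TEXT CHECK (kind IN ('on', 'kun')),
                          romaji TEXT,
                          UNIQUE (kind, romaji));
    -- Many-to-many link: a character has several readings,
    -- and a reading is shared by several characters.
    CREATE TABLE kanji_reading (
        kanji_id   INTEGER REFERENCES kanji(id),
        reading_id INTEGER REFERENCES reading(id),
        free_form  INTEGER,   -- 1 if normally used as a free morpheme
        PRIMARY KEY (kanji_id, reading_id));
""")

rows = [
    ("暖", "kun", "atatakai", 1),
    ("温", "kun", "atatakai", 1),
    ("煖", "kun", "atatakai", 0),   # free form not normally used
    ("暖", "on", "dan", 0),         # word element
    ("煖", "on", "dan", 0),         # word element
]
for glyph, kind, romaji, free in rows:
    conn.execute("INSERT OR IGNORE INTO kanji (glyph) VALUES (?)", (glyph,))
    conn.execute("INSERT OR IGNORE INTO reading (kind, romaji) VALUES (?, ?)",
                 (kind, romaji))
    conn.execute("""
        INSERT INTO kanji_reading
        SELECT k.id, r.id, ? FROM kanji k, reading r
        WHERE k.glyph = ? AND r.kind = ? AND r.romaji = ?""",
                 (free, glyph, kind, romaji))


def sharing(kind, romaji, free_only=False):
    """List the characters that share one reading, optionally free forms only."""
    sql = """
        SELECT k.glyph FROM kanji k
        JOIN kanji_reading kr ON kr.kanji_id = k.id
        JOIN reading r ON r.id = kr.reading_id
        WHERE r.kind = ? AND r.romaji = ? AND kr.free_form >= ?
        ORDER BY k.id"""
    return [g for (g,) in conn.execute(sql, (kind, romaji, int(free_only)))]


print(sharing("kun", "atatakai"))
print(sharing("kun", "atatakai", free_only=True))
print(sharing("on", "dan"))
```

 Each cross-reference network becomes a link table of this kind; deeper
hierarchies, such as multiple senses under each reading, would need further
tables keyed to the link rows.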

8. DEVELOPMENT OF DATABASE SYSTEM 

 The DESK database contains (or will contain) detailed information on every
important aspect of Chinese characters as used in CJK languages and the
principal Chinese dialects. This includes printed and calligraphic forms, in-
depth semantics, phonemics, encoding methods, indexing schemes, synonyms and
homophones, character etymology (based on Halpern 1987) and a wealth of other
reference data.

The development of software for building the DESK database and the feeding of
data to the system is being implemented in six stages.

1. Developing software for restructuring the old format of NJECD's data to a 
   rationally-structured relational database system on a dBASE platform.

2. Defining structures and developing software for building a system that is
   (a) sufficiently flexible to integrate the NJECD database into the broader
   framework of a comprehensive CJK database system (DESK) and (b) sufficiently
   open-ended to accommodate large-scale expansion.

3. Developing software and a menu-driven user interface for querying,
   searching, sorting, and otherwise manipulating the database system.

4. Thorough testing, revision, and maintenance of the system.

5. Building a pilot system for generating data for the New Kanji-English Pocket
   Dictionary in order to verify that the system is sufficiently robust to cope
   with dictionary compilation under field conditions.

6. Feeding large volumes of data to the database from various sources,
   including NJECD and its German edition, character meanings, compounds and
   their equivalents, frequency statistics, CJK character readings, character
   codes, calligraphic styles, etymology, stroke-order diagrams, etc. The
   system will grow organically through the addition of data from new sources,
   the compilation of new dictionaries, and the expansion of existing ones.

9. DEVELOPMENT OF KIT APPLICATIONS

 The development and compilation of KIT applications and products is being
carried out in three stages: (1) designing the system for each application by
(a) performing an in-depth analysis of its special features, such as the range
of coverage, ordering scheme, entry layout, appendixes and indexes, and by (b)
drawing up software specifications for each application; (2) building a system
for each application by developing application-specific software; and (3)
thorough testing, revision, and maintenance of software.

 The production of KIT printed products is being carried out in four stages:
(1) adding new data (such as German core meanings), (2) editing the data
generated by each application-specific system, and repeatedly checking the data
until it is error-free, (3) developing software to process the data prior to
computerized photocomposition; and (4) preparing camera-ready mechanicals by
DTP and/or computerized photocomposition, to be followed by printing and
binding.

 * * * * *

 Lexicography is not yet a recognized discipline in Japan. By building a
comprehensive CJK database and using it for compiling numerous lexicographic
works, this project will make a significant contribution to the advancement and
eventual establishment of lexicography as a branch of learning in Japan, and to
the promotion of the study and research of CJK languages.

REFERENCES

HALPERN, Jack (1987): 漢字の再発見 (Kanji no Saihakken) 'Rediscovering Chinese 
 Characters'. Tokyo: Shodensha
HALPERN, Jack (1990): New Japanese-English Character Dictionary. Tokyo: 
 Kenkyusha
HALPERN, Jack (1990): New Japanese-English Character Dictionary: A Semantic 
 Approach to Kanji Lexicography. EURALEX '90 Proceedings: Actas del IV 
 Congreso Internacional, 157-166. Benalmadena (Malaga): Bibliograph
HALPERN, Jack (1993): NTC's New Japanese-English Character Dictionary. 
 Chicago: National Textbook Company 
LUNDE, Ken (1993): Understanding Japanese Information Processing. Sebastopol, 
 CA: O'Reilly & Associates

----------------------------------------------------------------------------

 APPENDIX A: LIST OF KIT APPLICATIONS

Below is a list of the principal dictionaries, reference works and learning 
tools (KIT applications) that could be compiled on the basis of the DESK 
database. (The asterisk indicates that more detailed information is available 
for that item.)

1. GENERAL CHARACTER DICTIONARIES 一般漢英字典 
 
 *1. NTC's New Japanese-English Character Dictionary (NTC, 1993) 
 *2. New Kanji-English Pocket Dictionary 新漢英小字典
 *3. New Kanji-English Learner's Dictionary 新漢英学習字典
 *4. Japanese-English Dictionary of Kanji Synonyms 類義漢字和英辞典
  5. Pocket Kanji Thesaurus 類義漢字和英小辞典
 *6. Japanese-English Dictionary of Kanji Usage 同訓使い分け和英辞典
  7. Japanese-English Kanji Compounds Dictionary 実用漢英熟語字典・一般編
 *8. New Japanese-German Character Dictionary 新漢独字典
  9. New Japanese-Spanish Character Dictionary 新漢西字典
 10. New Japanese-French Character Dictionary 新漢仏字典

2. SPECIAL-PURPOSE DICTIONARIES/REFERENCE WORKS 特殊漢字字典・参考書 

   1. Introduction to Kanji 漢字入門
 * 2. Kanji-English Dictionary for Business and Economics 
	 実用漢英熟語字典・経済編
   3. Kanji-English Dictionary for the Arts and Humanities 
	 実用漢英熟語字典・文化編 
   4. Kanji-English Dictionary for Science and Technology 
	 実用漢英熟語字典・科学技術編
   5. Introduction to Kanji Compound Formation 漢字熟語成立ち入門
   6. Japanese-English Dictionary of Prefixes and Suffixes 漢字接辞和英辞典 
   7. Japanese-English Dictionary for Counters and Units 単位・助数詞和英辞典
   8. Kanji Reference Handbook 漢英参考情報便覧 
   9. Japanese-English Dictionary of Character Etymology 漢英字源字典
  10. Introduction to the Radical System 漢字部首入門 
  11. Introduction to Written Japanese 日本語書き方入門 
 *12. Comparative Study of Sino-Japanese Lexical Items 漢語語彙比較研究 

3. ELECTRONIC DICTIONARIES, OTHERS 電子字典・その他

   1. Kanji Learner's Electronic Dictionary 電子漢字学習機
   2. Kanji Learner's Courseware 漢字学習コースウェア
 * 3. Kanji Input System Based on System of Kanji Indexing by Patterns
 	字型検字法による漢字入力方式
   4. Kanji Games Software Kit 漢字学習ゲームソフト
   5. JIS Kanji Index Based on System of Kanji Indexing by Patterns
	 字型検字法によるＪＩＳ漢字索引
 * 6. New Japanese-English Character Dictionary: Electronic Book Edition
	 新漢英字典電子ブック版
   7. New Japanese-English Character Dictionary: CD-ROM Edition
	 新漢英字典ＣＤ－ＲＯＭ版
   8. Kanji Learner's Wall Chart 漢字学習貼紙表
 * 9. Kanji Cards 漢字学習カード 
  10. Introduction to Kanji: Video Edition 漢字学習ビデオ 
  11. Train and Subway Kanji Guide 電車・列車漢字案内
  12. Restaurant Kanji Guide レストラン漢字案内 

4. DICTIONARIES AND AIDS FOR JAPANESE USERS 日本人対象の字典・教材

  1. Dictionary of Kanji Synonyms 類義漢字辞典
  2. Pocket Kanji Thesaurus 類義漢字小辞典
  3. Dictionary of Kanji Usage 同訓使い分け辞典
  4. Kanji Learner's Dictionary for Elementary Schoolchildren 
	  小学生用漢字学習字典
  5. Dictionary of Kanji Compound Formation 漢字熟語構成辞典
  6. Kanji Learner's Courseware 漢字学習コースウェア
  7. Kanji Learner's Dictionary: Electronic Book Edition
	  漢字学習字典電子ブック版
  8. Introduction to Kanji Compound Formation 漢字熟語成立ち入門 
  9. Kanji Learner's Graded Wall Chart 学年別漢字学習貼紙表

----------------------------------------------------------------------------

APPENDIX B: EDITORIAL COMMITTEE OF KANJI DICTIONARY PUBLISHING SOCIETY

KUSUO HITOMI President of Showa Women's University
 Director General and President of KDPS
 Chairman of KDPS Editorial Committee

OKI HAYASHI President of the Society for Teaching Japanese as a 
 Foreign Language 
 formerly President of the National Language Research 
 Institute
 Consultant to KDPS Editorial Committee

OSAMU MIZUTANI Director General of the National Language Research Institute
 Councilor of the Society for Teaching Japanese as a Foreign 
 Language
 Consultant to KDPS Editorial Committee

SHIGEHIKO TOYAMA Professor at the Graduate School of Literature, Showa 
 Women's University
 Member of KDPS Editorial Committee

TAKASHI TAKAMIZAWA Professor/Director of the Course of Japanese Literature, 
 Showa Women's University
 Member of KDPS Editorial Committee

CHIKASADA HARADA Professor of Japanese Literature, Showa Women's University
 Member of KDPS Editorial Committee

TOMOKO KANEKO Professor of English and American Literature, Showa 
 Women's University
 Member of KDPS Editorial Committee

KEN LUNDE Project Manager of Japanese Font Production at Adobe 
 Systems, Inc.
 Technical Consultant to KDPS 

YOSHIAKI TAKEBE formerly Professor at Waseda University
 Member of KDPS Editorial Committee

MASAAKI NOMURA Professor of Japanese at Center for Japanese Language, 
 Waseda University
 Member of KDPS Editorial Committee

ATSUSHI FUKADA Assistant Professor of Applied Linguistics at Center for 
 Linguistic and Cultural Research, Nagoya University
 Member of KDPS Editorial Committee

YOICHIRO YAMAMURA President of Brain Brigade Systems, Ltd.
 Production and Marketing Consultant to KDPS

JACK HALPERN Research Fellow at Institute of Modern Culture, Showa 
 Women's University
 Editor in Chief of New Japanese-English Character 
 Dictionary
 Editor in Chief of Kanji Integrated Tools Project

----------------------------------------------------------------------------

APPENDIX C: OVERVIEW OF PRINCIPAL FEATURES 

Listed below are the principal features of DESK-KIT applications and products.
The presence or absence of a specific feature depends on the item in question.
For more information, see the individual descriptions for each project
(available on request) and *Features of This Dictionary* on page 61 of NJECD.

>> *Core meaning* -- a concise keyword that defines the most dominant sense
    of each character to provide an instant grasp of its fundamental concept.

>> *Psychologistic ordering* of character meanings, clustered around the core
    meaning in a manner that allows them to be conceived as a
    logically-structured, integrated unit.

>> *Complete and accurate character meanings* clearly show how a few thousand
    building blocks are combined to generate countless compound words.

>> Numerous *high-frequency compounds* provide maximally useful examples 
   of each character sense and clearly show how these contribute to the meaning
   of each compound.

>> *Compound formation articles* describe the etymology of compounds and
    explain how their constituent characters contribute to their meanings.

>> *Synonym articles* provide full guidance on the differences and 
    similarities between closely related characters.

>> *Detailed usage notes* help you understand the fine distinctions
    between *kun* homophones.

>> System of Kanji Indexing by Patterns -- a totally new method for looking up
   characters as quickly as in alphabetical dictionaries.

>> Six lookup methods and three indexes allow even a complete beginner to
   locate entries with great speed and little effort.

>> A *system of labels* provides useful information on the temporal status,
   etymology, orthography, style, function, level of formality, etc., of
   character senses.

>> The *degree of importance* of each character sense is indicated by various
   typographical differences and status labels for four levels of study.

>> Quick access to a valuable source of *supplementary reference data,* such as
   the principles of stroke order, frequency lists, historical tables, rules
   for okurigana, kana charts, and a list of kanji synonyms.

>> A user-friendly format ensures a visually attractive layout and maximum 
   ease of use.