The following paper is being made available by the authors for comment.
Any comments readers wish to make can be sent to the authors at the address
below, or can be directed by email to Jack Halpern c/- Jeffrey Friedl
(jfriedl@omron.co.jp)

Jim Breen
Monash University
February 1994
----------------------------------------------------------------------------

BUILDING OF COMPREHENSIVE DATABASE FOR THE COMPILATION OF
INTEGRATED KANJI DICTIONARIES AND TOOLS

------------------------------------------------------
|Summary of paper to be presented at Euralex '94, an  |
|international congress for lexicographers to be held |
|at the Free University Amsterdam from August 30      |
|to September 3, 1994, plus appendixes for reference. |
------------------------------------------------------

Showa Women's University, Institute of Modern Culture
KANJI DICTIONARY PUBLISHING SOCIETY
漢 英 字 典 刊 行 会
1-3-502 3-Chome Niiza
Niiza-shi, Saitama 352 JAPAN
FAX: +81-48-479-1323

JACK HALPERN
Research Fellow at Institute of Modern Culture
Editor in Chief of Kanji Integrated Tools Project
Editor in Chief of New Japanese-English Character Dictionary

MASAAKI NOMURA
Professor of Japanese
Center for Japanese Language, Waseda University

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics
Center for Linguistic and Cultural Research, Nagoya University

A B S T R A C T

The New Japanese-English Character Dictionary (NJECD) was designed to
provide an in-depth understanding of how kanji are used in contemporary
Japanese. One aim of this project is to use NJECD to build a comprehensive
database with detailed information on how Chinese characters are used in
Chinese, Japanese and Korean, including printed/calligraphic forms, in-depth
semantics, phonemics, encoding methods, indexing schemes, synonyms,
homophones, and voluminous reference data.
A second aim is to use this database to compile about forty applications and
spinoff products for pedagogical and research purposes, including learner's
dictionaries, reference manuals, and CALL software, by integrating lexical
semantics and combinatorics with computational lexicography.

1. BACKGROUND

Although Japanese has been the subject of various linguistic studies, little
attention has been given to the systematic analysis of its writing system.
Kanji (Chinese characters as used in Japanese) are combined with each other
to generate countless compound words, and function as a network of
interrelated parts. Though this is vaguely recognized by educators, it has
been largely disregarded in the compilation of character dictionaries.

The demand for effective tools for mastering the Japanese script has been
growing at an unprecedented pace. Learners are in urgent need of
dictionaries that systematically address the special problems of
non-Japanese students.

The New Japanese-English Character Dictionary (NJECD) (Halpern 1990, 1993)
was compiled with the aim of creating a lookup tool that provides an
in-depth understanding of the meanings and functions of high-frequency
characters in contemporary Japanese. The dictionary departs from traditional
kanji lexicography in several ways: (1) the *core meaning* defines the
dominant character sense; (2) detailed meanings show how single-character
morphemes generate numerous compounds; (3) psychologistic ordering reveals
the logical/hierarchical interrelatedness between senses; (4) the System of
Kanji Indexing by Patterns (SKIP) offers a new method for rapid retrieval of
entries; and (5) synonyms, homophones, and orthographic variants are
precisely distinguished (for further details, see Halpern 1990, EURALEX '90
Proceedings).

2. PROJECT AIMS

This project aims to contribute to Sino-Japanese studies in general, and to
Japanese language studies in particular, in the following four areas:

1.
To use NJECD as a basis for creating a comprehensive kanji information
database system, which will be referred to as DESK (Database System for
Kanji). This database contains detailed information on the use of Chinese
characters in Chinese, Japanese and Korean (CJK languages).

2. To use DESK as a basis for compiling about forty applications and spinoff
products for pedagogical and research purposes.

3. To provide a comprehensive source of reference data on Chinese characters
for pedagogical, linguistic and lexicological research. Some of these data
will be made available on the Internet, with certain restrictions to avoid
copyright violations.

4. To promote basic research on computational lexicography by establishing
methodology for building integrated dictionary databases, especially
multilingual databases for storing lexicographic data in a CJK environment.

3. PROJECT OUTLINE

To achieve these aims, the Kanji Dictionary Publishing Society was
established in late 1993 as a part of the Institute of Modern Culture at
Showa Women's University. The Society is directed by the Editorial
Committee, which consists of renowned experts in Japanese linguistics, and
is financed by the University and various foundations (1994 budget about
US$250,000).

The DESK database is being used for compiling about forty computer-edited
applications and spinoff products, including teaching and learning aids such
as learner's dictionaries and reference manuals, foreign-language editions
such as a German edition of NJECD, software packages such as CAI/CAL
courseware, electronic books and learning machines, and so on. This series
of products will be referred to as KIT, which stands for Kanji Integrated
Tools.

During the initial phase of the project, which will be completed in
mid-1994, the framework and principal components of DESK will be created,
and the electronic book (EB) edition of NJECD will be published.
Concurrently, a pilot system for a pocket edition of NJECD is being built;
it will also be completed in mid-1994.

The following KIT applications will be either published or finalized for
publication over a period of two to three years:

1. New Japanese-English Character Dictionary: Electronic Book Edition
2. New Kanji-English Pocket Dictionary
3. New Kanji-English Learner's Dictionary
4. Kanji Input System Based on System of Kanji Indexing by Patterns
5. Comparative Study of Sino-Japanese Lexical Items
6. Kanji Cards
7. Japanese-English Dictionary of Kanji Synonyms
8. Japanese-English Dictionary of Kanji Usage

The EB edition of NJECD is scheduled for publication in the summer of 1994
in time for presentation at Euralex '94. It incorporates all the features of
NJECD, including core meanings, independent words, homophone/synonym
discrimination, compounds, radicals, a kanji thesaurus, and much more. A
hierarchical menu system enables the user to easily retrieve information by
specifying single or multiple keywords, such as readings, radicals, core
meanings, SKIP patterns and stroke counts, in normal or word-end searches.
This, combined with a comprehensive cross-reference network, provides the
user with multiple search paths to access information with maximum speed
and facility.

4. LEXICAL SEMANTICS AND COMBINATORICS

The principal semantic component of DESK was compiled by submitting
single-character morphemes to an exhaustive semantic analysis. The meanings
were analyzed by such techniques as componential analysis and an in-depth
examination of the differences and similarities between near-synonyms, which
served as a powerful technique for establishing precise character meanings.
Each meaning was analyzed into its single senses, and its relationships to
other members of the same synonym group were examined and compared.
That is, the denotation, connotation, and range of application of each sense
were carefully studied in contrast with those of their near-synonym
counterparts, with emphasis on how the single senses of word-forming
elements are influenced not only by normal syntagmatic relations, but also
by often subtle semantic/functional distinctions dependent on the
morphophonemic context. For example, whereas the Chinese-derived (*on*)
bound morpheme 謡 yoo means 'popular song' in such compounds as 民謡 minyoo
'folk song', the native Japanese (*kun*) form 謡 utai refers to the chanting
of a noh text.

5. COMPUTATIONAL LEXICOGRAPHY

Although every phase of the compilation and editing of NJECD was
computerized, we faced great difficulties in the initial stages. MS-DOS and
database management systems were not yet in widespread use, and the level of
PC technology was hardly up to the task. Nevertheless, the lack of funds and
technical expertise led us to select Fujitsu's FACOM-9450 series, the most
advanced PC on the market at the time, rather than mini-computers.

To compile, process, and proofread the data for NJECD, we wrote about 700
programs in BASIC and used spreadsheets and other software packages from the
mid-eighties, and had to resort to a series of ingenious tricks to force the
hardware and software to perform tasks they were not designed for. An
inevitable consequence of this was data files of complex structure, quite
unlike the logically organized relational database files of today.

To produce KIT applications in a short period with maximum efficiency, it
was essential to integrate state-of-the-art computer technology with such
disciplines as computational lexicography and lexical semantics to
restructure the data into a rationally-organized database system (DESK), and
to write software for developing applications drawing data from the
database. The work of building the database and application development is
outlined below.

6.
DATA AND CODE CONVERSION

The character set of the computers used to compile NJECD, Fujitsu's now
obsolete FACOM-9450 series, supported only Level 1 characters of
JIS C 6226-1978. Since hundreds of characters were missing from this set, we
were forced to customize it by creating hundreds of user-defined characters
and remapping hundreds of JIS Level 2 characters to JIS Level 1 codes. This
resulted in a character set basically incompatible with current character
set standards, national or corporate.

To ensure easy portability to a wide range of hardware and software
platforms, we converted the data to the Shift-JIS code system and updated it
to JIS X 0208-1990. In addition, we restored the remapped codes and either
recreated or remapped user-defined characters not present in
JIS X 0208-1990, if necessary by mapping into the supplemental character set
JIS X 0212-1990, or the ISO 10646/Unicode character set, in that order. This
approach, although complex, yielded excellent results by keeping
user-defined characters to a bare minimum and ensuring maximum portability.
It was suggested by Ken Lunde, an expert on Japanese encoding methods, who
has written a definitive work on the subject (Lunde 1993).

7. SYSTEM ANALYSIS AND DATABASE DESIGN

Each entry character is associated with numerous attributes, such as a core
meaning, various readings, multiple senses for each reading, and stylistic
labels, and is also a member of various cross-reference networks. For
example, 暖 and 温 share the *kun* reading *atatakai* but have slightly
different connotations when used as free morphemes. On the other hand, 煖 and
暖 share the same meanings and *on* reading *dan* as word elements, e.g. as
a verb 'to warm', but the free form 煖かい *atatakai* 'warm' is not normally
used.
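The cross-reference relations in the example above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not
the DESK schema: the class names, the `free_form_in_use` flag and the
`reading_index` helper are hypothetical, and the data are limited to what the
example itself states about 暖, 温 and 煖.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Reading:
    kind: str               # "on" (Chinese-derived) or "kun" (native Japanese)
    romaji: str
    gloss: str
    free_form_in_use: bool  # hypothetical flag: is the free (word) form normally used?


@dataclass
class Entry:
    char: str
    readings: List[Reading] = field(default_factory=list)


def reading_index(entries: List[Entry],
                  free_only: bool = False) -> Dict[Tuple[str, str], Set[str]]:
    """Map (kind, romaji) to the set of characters that share that reading."""
    index = defaultdict(set)
    for entry in entries:
        for r in entry.readings:
            if free_only and not r.free_form_in_use:
                continue  # skip bound elements and free forms not normally used
            index[(r.kind, r.romaji)].add(entry.char)
    return dict(index)


# Illustrative data taken only from the example in the text.
# The *on* reading "dan" is a bound word element, so it is never a free form.
ENTRIES = [
    Entry("暖", [Reading("kun", "atatakai", "warm", True),
                 Reading("on", "dan", "to warm", False)]),
    Entry("温", [Reading("kun", "atatakai", "warm", True)]),
    Entry("煖", [Reading("kun", "atatakai", "warm", False),
                 Reading("on", "dan", "to warm", False)]),
]

if __name__ == "__main__":
    for key, chars in reading_index(ENTRIES).items():
        print(key, sorted(chars))
```

Calling `reading_index(ENTRIES, free_only=True)` keeps only 暖 and 温 under
*atatakai*, which mirrors the point that 煖かい is not normally used as a free
form. Even this small case needs an attribute attached to the pairing of a
character with a reading rather than to the character alone.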
The entry characters and their attributes thus form an inherently complex
network of semantic, orthographic and phonologic relations and subrelations,
often interrelated in highly complex hierarchical structures that do not
easily lend themselves to representation by traditional one-to-many and
many-to-many relations.

Expressing such intricate interrelations in a manner conducive to their
effective extraction and analysis approaches the limits of relational
databases, and ideally calls for a network database design. To do so within
the limits of RDB systems requires a thorough analysis aimed at discovering
the most effective constructs that will, on the one hand, capture and
represent the relations between entry characters, compounds, and their
respective attributes, and, on the other, allow easy manipulation of the
data with a view to efficiently generating a wide range of applications.

In spite of these limitations, we have chosen to adopt dBASE IV, a
relational database management system, for a number of reasons, especially
its universal availability, ease of manipulating data and developing
applications using the Xbase language, and easy portability to other
systems. We are also using Perl, a powerful language for text processing and
string manipulation.

8. DEVELOPMENT OF DATABASE SYSTEM

The DESK database contains (or will contain) detailed information on every
important aspect of Chinese characters as used in CJK languages and the
principal Chinese dialects. This includes printed and calligraphic forms,
in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms
and homophones, character etymology (based on Halpern 1987) and a wealth of
other reference data.

The development of software for building the DESK database and the feeding
of data to the system is being implemented in six stages:

1. Developing software for restructuring the old format of NJECD's data to a
rationally-structured relational database system on a dBASE platform.

2.
Defining structures and developing software for building a system that is
(a) sufficiently flexible to integrate the NJECD database into the broader
framework of a comprehensive CJK database system (DESK) and (b) sufficiently
open-ended to accommodate large-scale expansion.

3. Developing software and a menu-driven user interface for querying,
searching, sorting, and otherwise manipulating the database system.

4. Thorough testing, revision, and maintenance of the system.

5. Building a pilot system for generating data for the New Kanji-English
Pocket Dictionary in order to verify that the system is sufficiently robust
to cope with dictionary compilation under field conditions.

6. Feeding large volumes of data to the database from various sources,
including NJECD and its German edition, character meanings, compounds and
their equivalents, frequency statistics, CJK character readings, character
codes, calligraphic styles, etymology, stroke-order diagrams, etc.

The system will grow organically through the addition of data from new
sources, the compilation of new dictionaries, and the expansion of existing
ones.

9. DEVELOPMENT OF KIT APPLICATIONS

The development and compilation of KIT applications and products is being
carried out in three stages: (1) designing the system for each application
by (a) performing an in-depth analysis of its special features, such as the
range of coverage, ordering scheme, entry layout, appendixes and indexes,
and by (b) drawing up software specifications for each application;
(2) building a system for each application by developing
application-specific software; and (3) thorough testing, revision, and
maintenance of software.
The production of KIT printed products is being carried out in four stages:
(1) adding new data (such as German core meanings); (2) editing the data
generated by each application-specific system, and repeatedly checking the
data until it is error-free; (3) developing software to process the data
prior to computerized photocomposition; and (4) preparing camera-ready
mechanicals by DTP and/or computerized photocomposition, to be followed by
printing and binding.

                              * * * * *

Lexicography is not yet a recognized discipline in Japan. By building a
comprehensive CJK database and using it for compiling numerous lexicographic
works, this project will make a significant contribution to the advancement
and eventual establishment of lexicography as a branch of learning in Japan,
and to the promotion of the study and research of CJK languages.

REFERENCES

HALPERN, Jack (1987): 漢字の再発見 (Kanji no Saihakken) 'Rediscovering
Chinese Characters'. Tokyo: Shodensha

HALPERN, Jack (1990): New Japanese-English Character Dictionary. Tokyo:
Kenkyusha

HALPERN, Jack (1990): New Japanese-English Character Dictionary: A Semantic
Approach to Kanji Lexicography. EURALEX '90 Proceedings: Actas del IV
Congreso Internacional, 157-166. Benalmadena (Malaga): Bibliograph

HALPERN, Jack (1993): NTC's New Japanese-English Character Dictionary.
Chicago: National Textbook Company

LUNDE, Ken (1993): Understanding Japanese Information Processing.
Sebastopol, CA: O'Reilly & Associates

----------------------------------------------------------------------------

APPENDIX A: LIST OF KIT APPLICATIONS

Below is a list of the principal dictionaries, reference works and learning
tools (DESK applications) that could be compiled on the basis of the DESK
database. (The asterisk indicates that more detailed information is
available for that item.)

1. GENERAL CHARACTER DICTIONARIES 一般漢英字典

*1. NTC's New Japanese-English Character Dictionary (NTC, 1993)
*2. New Kanji-English Pocket Dictionary 新漢英小字典
*3.
New Kanji-English Learner's Dictionary 新漢英学習字典
*4. Japanese-English Dictionary of Kanji Synonyms 類義漢字和英辞典
 5. Pocket Kanji Thesaurus 類義漢字和英小辞典
*6. Japanese-English Dictionary of Kanji Usage 同訓使い分け和英辞典
 7. Japanese-English Kanji Compounds Dictionary 実用漢英熟語字典・一般編
*8. New Japanese-German Character Dictionary 新漢独字典
 9. New Japanese-Spanish Character Dictionary 新漢西字典
 10. New Japanese-French Character Dictionary 新漢仏字典

2. SPECIAL-PURPOSE DICTIONARIES/REFERENCE WORKS 特殊漢字字典・参考書

 1. Introduction to Kanji 漢字入門
*2. Kanji-English Dictionary for Business and Economics
    実用漢英熟語字典・経済編
 3. Kanji-English Dictionary for the Arts and Humanities
    実用漢英熟語字典・文化編
 4. Kanji-English Dictionary for Science and Technology
    実用漢英熟語字典・科学技術編
 5. Introduction to Kanji Compound Formation 漢字熟語成立ち入門
 6. Japanese-English Dictionary of Prefixes and Suffixes 漢字接辞和英辞典
 7. Japanese-English Dictionary for Counters and Units 単位・助数詞和英辞典
 8. Kanji Reference Handbook 漢英参考情報便覧
 9. Japanese-English Dictionary of Character Etymology 漢英字源字典
 10. Introduction to the Radical System 漢字部首入門
 11. Introduction to Written Japanese 日本語書き方入門
*12. Comparative Study of Sino-Japanese Lexical Items 漢語語彙比較研究

3. ELECTRONIC DICTIONARIES, OTHERS 電子字典・その他

 1. Kanji Learner's Electronic Dictionary 電子漢字学習機
 2. Kanji Learner's Courseware 漢字学習コースウェア
*3. Kanji Input System Based on System of Kanji Indexing by Patterns
    字型検字法による漢字入力方式
 4. Kanji Games Software Kit 漢字学習ゲームソフト
 5. JIS Kanji Index Based on System of Kanji Indexing by Patterns
    字型検字法によるJIS漢字索引
*6. New Japanese-English Character Dictionary: Electronic Book Edition
    新漢英字典電子ブック版
 7. New Japanese-English Character Dictionary: CD-ROM Edition
    新漢英字典CD−ROM版
 8. Kanji Learner's Wall Chart 漢字学習貼紙表
*9. Kanji Cards 漢字学習カード
 10. Introduction to Kanji: Video Edition 漢字学習ビデオ
 11. Train and Subway Kanji Guide 電車・列車漢字案内
 12. Restaurant Kanji Guide レストラン漢字案内

4. DICTIONARIES AND AIDS FOR JAPANESE USERS 日本人対象の字典・教材

 1. Dictionary of Kanji Synonyms 類義漢字辞典
 2. Pocket Kanji Thesaurus 類義漢字小辞典
 3.
Dictionary of Kanji Usage 同訓使い分け辞典
 4. Kanji Learner's Dictionary for Elementary Schoolchildren
    小学生用漢字学習字典
 5. Dictionary of Kanji Compound Formation 漢字熟語構成辞典
 6. Kanji Learner's Courseware 漢字学習コースウェア
 7. Kanji Learner's Dictionary: Electronic Book Edition
    漢字学習字典電子ブック版
 8. Introduction to Kanji Compound Formation 漢字熟語成立ち入門
 9. Kanji Learner's Graded Wall Chart 学年別漢字学習貼紙表

----------------------------------------------------------------------------

APPENDIX B: EDITORIAL COMMITTEE OF KANJI DICTIONARY PUBLISHING SOCIETY

KUSUO HITOMI
President of Showa Women's University
Director General and President of KDPS
Chairman of KDPS Editorial Committee

OKI HAYASHI
President of the Society for Teaching Japanese as a Foreign Language
formerly President of the National Language Research Institute
Consultant to KDPS Editorial Committee

OSAMU MIZUTANI
Director General of the National Language Research Institute
Councilor of the Society for Teaching Japanese as a Foreign Language
Consultant to KDPS Editorial Committee

SHIGEHIKO TOYAMA
Professor at the Graduate School of Literature, Showa Women's University
Member of KDPS Editorial Committee

TAKASHI TAKAMIZAWA
Professor/Director of the Course of Japanese Literature,
Showa Women's University
Member of KDPS Editorial Committee

CHIKASADA HARADA
Professor of Japanese Literature, Showa Women's University
Member of KDPS Editorial Committee

TOMOKO KANEKO
Professor of English and American Literature, Showa Women's University
Member of KDPS Editorial Committee

KEN LUNDE
Project Manager of Japanese Font Production at Adobe Systems, Inc.
Technical Consultant to KDPS

YOSHIAKI TAKEBE
formerly Professor at Waseda University
Member of KDPS Editorial Committee

MASAAKI NOMURA
Professor of Japanese at Center for Japanese Language, Waseda University
Member of KDPS Editorial Committee

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics at Center for Linguistic and
Cultural Research, Nagoya University
Member of KDPS Editorial Committee

YOICHIRO YAMAMURA
President of Brain Brigade Systems, Ltd.
Production and Marketing Consultant to KDPS

JACK HALPERN
Research Fellow at Institute of Modern Culture, Showa Women's University
Editor in Chief of New Japanese-English Character Dictionary
Editor in Chief of Kanji Integrated Tools Project

----------------------------------------------------------------------------

APPENDIX C: OVERVIEW OF PRINCIPAL FEATURES

Listed below are the principal features of DESK-KIT applications and
products. The presence or absence of a specific feature depends on the item
in question. For more information, see the individual descriptions for each
project (available on request) and *Features of This Dictionary* on page 61
of NJECD.

>> *Core meaning* -- a concise keyword that defines the most dominant sense
   of each character to provide an instant grasp of its fundamental concept.

>> *Psychologistic ordering* of character meanings, clustered around the
   core meaning in a manner that allows them to be conceived as a
   logically-structured, integrated unit.

>> *Complete and accurate character meanings* clearly show how a few
   thousand building blocks are combined to generate countless compound
   words.

>> Numerous *high-frequency compounds* provide maximally useful examples of
   each character sense and clearly show how these contribute to the meaning
   of each compound.

>> *Compound formation articles* describe the etymology of compounds and
   explain how their constituent characters contribute to their meanings.

>> *Synonym articles* provide full guidance on the differences and
   similarities between closely related characters.

>> *Detailed usage notes* help you understand the fine distinctions between
   *kun* homophones.

>> System of Kanji Indexing by Patterns -- a totally new method for looking
   up characters as quickly as in alphabetical dictionaries.

>> Six lookup methods and three indexes allow even a complete beginner to
   locate entries with great speed and little effort.

>> A *system of labels* provides useful information on the temporal status,
   etymology, orthography, style, function, level of formality, etc., of
   character senses.

>> The *degree of importance* of each character sense is indicated by
   various typographical differences and status labels for four levels of
   study.

>> Quick access to a valuable source of *supplementary reference data,* such
   as the principles of stroke order, frequency lists, historical tables,
   rules for okurigana, kana charts, and a list of kanji synonyms.

>> A user-friendly format ensures a visually attractive layout and maximum
   ease of use.