Utilities

This chapter deals with the JWPce Utilities.

  • Introduction
  • JINDEX -- Dictionary Index Utility
  • KINFO -- Character Information Utility
  • RINFO -- Radical Lookup Database Utility
  • UINFO -- Unicode Conversion Utility
  • WINFO -- Kana->Kanji Conversion Utility
  • Configuration File

    Introduction

    The Utilities are a number of small programs that handle specific tasks related to JWPce. Most people will never need them, but some may find them of interest. All utilities are run from the command line. Currently there are five utilities:

    Utility   Function
    JINDEX    Generates index files for EUC and UTF-8 dictionaries.
    KINFO     Generates KANJINFO.DAT from Jim Breen’s KANJIDIC.
    RINFO     Manipulates the radical lookup databases.
    UINFO     Generates UNICODE conversion tables.
    WINFO     Generates the kana->kanji conversion database.


    JINDEX -- Dictionary Index Utility

    The JINDEX utility generates index files for EUC and UTF-8 dictionaries (mixed-mode dictionaries are not currently supported). The command format is:

    JINDEX dictionary_file [ flags ]

    The index is written to the same location as dictionary_file, but with the extension .JDX.

    The flags parameter may contain any number of the following flags:

    Flag         Function
    ALLKANA      Includes every kana in the index.
    ANYKANA      Includes every kana string of 2 or more characters in the index, regardless of the location of the string (i.e. not just at the beginning of the word).
    SHORTKANA    Includes short kana words (1 kana). Note that these must be full words of a single character.
    ALLASCII     Includes every ASCII character in the index.
    ANYASCII     Includes every ASCII string of 3 or more characters in the index, regardless of the location of the string (i.e. not just at the beginning of the word).
    SHORTASCII   Includes short ASCII words (1 or 2 characters). Note that these must be full words of 1 or 2 ASCII characters.
    ASCII2       Changes the minimum ASCII string length accepted for indexing from 3 characters to 2 characters.
    SKIPNOTES    Does not generate index entries for characters located in parentheses. This excludes dictionary ID keys and parenthetical notes.
    UTF          Changes the dictionary encoding from EUC to UTF-8.
    TEST         Scans the actual file, but does not create the index. This can be used to determine the size of an index file without actually generating the index, which is much faster.
    NOWARN       Suppresses warning messages.
    1250         Changes the code page from 1252 (American/Western European) to Eastern European.
    1251         Changes the code page from 1252 (American/Western European) to Cyrillic.
    1253         Changes the code page from 1252 (American/Western European) to Greek.
    1254         Changes the code page from 1252 (American/Western European) to Turkish.
    1255         Changes the code page from 1252 (American/Western European) to Hebrew.
    1256         Changes the code page from 1252 (American/Western European) to Arabic.
    1257         Changes the code page from 1252 (American/Western European) to Baltic.
    1258         Changes the code page from 1252 (American/Western European) to Vietnamese.

    Dictionary files can be encoded in EUC or UTF-8. The index file for EUC dictionaries does not depend on the code page. The index for UTF-8 dictionaries does (at least at the current time). By default the code page is 1252 (American/Western European), but if you intend to use the index on some other system you must indicate the code page so the correct UNICODE conversion table can be used.
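    As a rough illustration of how the flags combine on the command line, the sketch below assembles a JINDEX argument list and rejects flag names that are not in the table above. The helper name jindex_command and the dictionary file name are hypothetical, and the sketch assumes flags must be spelled exactly as listed. It only builds the argument list and does not run the utility.

```python
# Flag names are copied from the flag table; everything else is illustrative.
INDEX_FLAGS = {"ALLKANA", "ANYKANA", "SHORTKANA", "ALLASCII", "ANYASCII",
               "SHORTASCII", "ASCII2", "SKIPNOTES"}
OTHER_FLAGS = {"UTF", "TEST", "NOWARN"}
CODE_PAGES = {"1250", "1251", "1253", "1254", "1255", "1256", "1257", "1258"}


def jindex_command(dictionary_file, flags=()):
    """Return the argument list for a JINDEX run, rejecting unknown flags."""
    known = INDEX_FLAGS | OTHER_FLAGS | CODE_PAGES
    unknown = [flag for flag in flags if flag not in known]
    if unknown:
        raise ValueError("unknown JINDEX flag(s): " + ", ".join(unknown))
    return ["JINDEX", dictionary_file, *flags]


# Hypothetical case: a UTF-8 dictionary whose index is meant for a
# system using the Cyrillic code page (1251).
print(" ".join(jindex_command("mydict.utf", ["UTF", "1251"])))
```

    For an EUC dictionary the code page flags can be left out, since the EUC index does not depend on the code page.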

    Indexing every character in the dictionary would generate an exceptionally large index file. To reduce the size of the index file, some limits are normally placed on which sequences are indexed. The following table shows the default indexing conditions:

    Kanji     Every kanji in the file is indexed.
    Symbols   Almost every symbol in the file is indexed. There are not many of these, so this does not increase the size of the index file much.
    Kana      Kana sequences of 2 or more kana occurring at the beginning of a word are indexed.
    ASCII     ASCII sequences of 3 or more characters occurring at the beginning of a word are indexed.
    Numbers   Numerical sequences occurring at the beginning of a word are indexed. The number of these is small, so the size increase in the index is small.

    Many of these indexing conditions can be changed using the flags. All of the indexing flags, except for SKIPNOTES, will increase the size of the index file.

    It is important to understand the ALL, ANY, and SHORT flags. The easiest way to see what these do is to consider how they index some kana words. Consider indexing the words and , with various flags:
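    One possible reading of the kana flag descriptions can be sketched in code. The function below returns the offsets within a kana-only word that would receive an index entry. The function name, the two sample words, and the exact interpretation of each flag are assumptions made for illustration. They are not taken from the JINDEX source.

```python
def kana_index_positions(word, allkana=False, anykana=False, shortkana=False):
    """Return the offsets in a kana-only word that would get an index entry.

    This is an interpretation of the flag table, not the JINDEX code.
    """
    n = len(word)
    if allkana:
        return list(range(n))            # ALLKANA: every kana is indexed
    positions = set()
    if n >= 2:
        positions.add(0)                 # default: 2+ kana at the word start
    if anykana:
        positions.update(range(n - 1))   # ANYKANA: any offset with 2+ kana left
    if shortkana and n == 1:
        positions.add(0)                 # SHORTKANA: a full word of one kana
    return sorted(positions)


# Hypothetical sample words: one of three kana and one of a single kana.
for word in ("さくら", "き"):
    print(word,
          "default:", kana_index_positions(word),
          "ANYKANA:", kana_index_positions(word, anykana=True),
          "SHORTKANA:", kana_index_positions(word, shortkana=True),
          "ALLKANA:", kana_index_positions(word, allkana=True))
```

    Under this reading, the ASCII flags behave the same way with a minimum length of 3 (or 2 with ASCII2) instead of 2.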


    WARNING! This utility must sort the index into order. For a large index, this can take some time.


    KINFO -- Character Information Utility

    It is not convenient for JWPce to use Jim Breen’s KANJIDIC file directly. KANJIDIC is a basic text file that is relatively large and difficult to search without loading all the information into memory. Instead, JWPce uses the KANJINFO.DAT file, which contains the same information in a more compact format and can be searched quickly.

    This utility converts Jim Breen’s KANJIDIC into a binary format used by JWPce (KANJINFO.DAT). The format of this command is:

    KINFO [EUC] [UTF8] [STATS] [IN=filename]

    If the file name is not specified, KINFO will assume KANJIDIC. This utility normally assumes the dictionary is in EUC, but will also support UTF-8. The STATS flag causes information about the ranges and number of kanji covered by the different indexes to be reported. I use this information to make modifications to KANJINFO.DAT.

    This utility will write a number of files:

    KANJINFO.DAT     Large form of KANJINFO.DAT. This file contains all the information in KANJIDIC.
    KANJINFO.MED     Medium form of KANJINFO.DAT. This file does not contain nanori, pinyin, or Korean entries.
    KANJINFO.SML     Small form of KANJINFO.DAT. Reduced file that contains only the fixed size data (bushu, strokes, grade, SKIP, Halpern, Nelson, and Haig), meanings, on-yomi, and kun-yomi.
    JWP_UNIC.DAT     Contains UNICODE information for the kanji. This file was never used. JWPce actually uses the UNICODE conversion tables from the UNICODE Consortium (see UINFO below).
    KANJISRK.DAT     Contains stroke information for the kanji. This file is used by the RINFO utility to generate radical lookup data.
    KANJI_FREQ.EUC   Obsolete file, no longer generated. Contained the kanji-by-frequency index using Jack Halpern’s frequency data listed in KANJIDIC.


    RINFO -- Radical Lookup Database Utility

    This utility processes files used for the radical lookup feature. The utility takes no parameters, but reads a number of files:

    kanjisrk.dat   Kanji stroke data extracted from Jim Breen’s KANJIDIC.
    radkanji.idx   Index file for radical data. This data was first compiled by Michael Raine and Derc Yamaski.
    radkanji.dat   Radical data file compiled by Michael Raine and Derc Yamaski.

    The files stroknji.idx and stroknji.dat can be read, but these stroke files compiled by Michael Raine and Derc Yamaski are no longer used.

    The utility will write the following files:

    stroke.euc    EUC file containing the kanji by stroke count.
    radical.euc   EUC file containing the kanji by radical.
    stroke.dat    Stroke count database used by JWPce for radical lookup.
    radical.dat   Radical database used by JWPce for radical lookup.


    UINFO -- Unicode Conversion Utility

    This utility generates the UNICODE conversion tables used by JWPce. These tables are stored as C code that is actually compiled into JWPce. This utility takes no parameters and reads the file JIS0208.TXT. This file is produced by the UNICODE Consortium. The utility writes the following files:

    jwp_ukan.dat     Conversion table for JIS kanji.
    jwp_umis.dat     Conversion table for symbols.
    jwp_cp1250.dat   Conversion table for Eastern Europe extended ASCII.
    jwp_cp1251.dat   Conversion table for Cyrillic extended ASCII.
    jwp_cp1252.dat   Conversion table for USA, West Europe extended ASCII.
    jwp_cp1253.dat   Conversion table for Greek extended ASCII.
    jwp_cp1254.dat   Conversion table for Turkish extended ASCII.
    jwp_cp1255.dat   Conversion table for Hebrew extended ASCII.
    jwp_cp1256.dat   Conversion table for Arabic extended ASCII.
    jwp_cp1257.dat   Conversion table for Baltic extended ASCII.
    jwp_cp1258.dat   Conversion table for Vietnamese extended ASCII.


    WINFO -- Kana->Kanji Conversion Utility

    This utility builds the kana->kanji conversion database used by JWPce. A number of different sources can go into the construction of this table. The syntax for calling the utility is:

    WINFO filename [ alloc ]

    The alloc parameter determines the maximum number of conversions allocated. This must be more than the number of conversions you expect in the final database, because there are usually duplicates that have to be removed. By default this parameter is set to 500,000.
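    To see why alloc has to cover more than the final number of conversions, consider a sort-then-deduplicate pass like the sketch below. It is an illustration only, with made-up sample entries and a hypothetical function name. It assumes that every raw entry is held in memory before duplicates are dropped, which is what the description above suggests but is not confirmed against the WINFO source.

```python
def remove_duplicates(conversions):
    """Sort the conversions, then keep the first of each run of equal entries."""
    unique = []
    for entry in sorted(conversions):
        if not unique or unique[-1] != entry:
            unique.append(entry)
    return unique


# Hypothetical (kana, kanji) pairs read from two overlapping sources.
raw = [("かんじ", "漢字"), ("かんじ", "感じ"), ("かんじ", "漢字")]
unique = remove_duplicates(raw)

# The table must hold every raw entry, although fewer survive.
print(len(raw), "entries read,", len(unique), "kept")
```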

    If you compile this utility make sure the stack space is set quite high. The utility uses a quicksort algorithm to order the list. This can use a substantial amount of stack space. MS VC++ allocates a 1 MB stack by default. This is not enough to run the standard configuration. I normally allocate 20 MB, just to be safe. If the utility runs out of stack space you will get a system crash!

    The filename parameter must specify a configuration file to read. The standard configuration file is called STANDARD.EUC.

    The utility will write the files WNN.DAT and WNN.DIX. These are the kana->kanji conversion database and its index file. The utility can also write the older format conversion database that was used by JWP, but this has been disabled since those files are no longer used. For debugging purposes a number of other files will be written:

    test1.euc   Raw data read from all sources.
    test2.euc   Sorted data read from all sources.
    test3.euc   Filtered data read from all sources. Duplicates and unwanted entries are removed.
    test4.euc   Merged final data.


    Configuration File

    The configuration file is an EUC file containing a number of different commands. Each line should contain a single command. Blank lines are allowed, and any line beginning with a # is treated as a comment line. The following commands are supported:

    DIC     Extracts kana->kanji conversions from a dictionary in EDICT format. EUC, UTF-8, and mixed dictionaries are supported.
    END     Ends the file. This command must be in the file.
    WNN     Extracts data from a WNN file. These files are normally produced by the WNN consortium. Older versions of these files (as are distributed with JWPce) are freely distributed. Newer versions are not.
    WLINE   Contains a single kana->kanji conversion in the WNN format, but entered on a single line. I used to use these to make additions to the conversion database, but I have moved all of them into ROSENTHAL.U.
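    The rules above (one command per line, blank lines allowed, # comments, mandatory END) can be sketched as a small reader. This is a minimal illustration working on already decoded text lines, not the parser used by WINFO. The function name and the file names in the sample are hypothetical, and splitting the command from its arguments on whitespace is an assumption.

```python
COMMANDS = {"DIC", "END", "WNN", "WLINE"}


def read_config(lines):
    """Return (command, argument_text) pairs up to and including END."""
    commands = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or comment line
        parts = line.strip().split(None, 1)
        name = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if name not in COMMANDS:
            raise ValueError(f"line {number}: unknown command {name!r}")
        commands.append((name, rest))
        if name == "END":
            return commands               # END closes the file
    raise ValueError("configuration file has no END command")


# Hypothetical configuration; the file names are made up.
sample = """\
# conversions from a dictionary and one WNN file
DIC PRIORITY mydict.euc

WNN mywords.wnn
END
""".splitlines()

for command in read_config(sample):
    print(command)
```

    To read a real EUC configuration file in Python, opening it with encoding="euc_jp" should yield the decoded lines this sketch expects.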

    DIC Entry

    A DIC entry specifies a dictionary in EDICT format. All or some of the valid kana->kanji conversions will be extracted from the file. Conversions with a priority marking are assigned value 1; conversions without one are given value 0.

    The format of the line is:

    DIC ( ALL | PRIORITY ) filename

    The ALL option extracts all entries from the dictionary. The PRIORITY option extracts only priority entries. Such entries must end with /(P)/.

    Entries that do not contain kanji are skipped automatically, as are certain entries that mix character formats.
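    The priority rule can be sketched as follows. The code checks only for the trailing /(P)/ marker described above and assigns the value 1 or 0. It does not attempt the kanji and mixed-format checks, and the function names and sample EDICT lines are hypothetical.

```python
def conversion_value(edict_line):
    """Return 1 for an entry ending with the priority marker, otherwise 0."""
    return 1 if edict_line.rstrip().endswith("/(P)/") else 0


def select_entries(edict_lines, mode):
    """Apply the ALL or PRIORITY option of the DIC command."""
    if mode not in ("ALL", "PRIORITY"):
        raise ValueError("mode must be ALL or PRIORITY")
    return [line for line in edict_lines
            if mode == "ALL" or conversion_value(line) == 1]


# Hypothetical EDICT-style lines, for illustration only.
lines = [
    "漢字 [かんじ] /kanji/Chinese character/(P)/",
    "感じ [かんじ] /feeling/",
]
print(len(select_entries(lines, "ALL")), "entries with ALL")
print(len(select_entries(lines, "PRIORITY")), "entries with PRIORITY")
```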

    END Entry

    Each configuration file must terminate in an END command. The format of the command is:

    END

    WNN Entry

    Extracts the kana->kanji conversions from a WNN formatted file. The format of the command is:

    WNN  filename

    Most of these data files were compiled by the WNN consortium and are under copyright of the Kyoto University Research Institute for Mathematical Sciences, although I have also created some.

    The basic format of these files is as follows:

    \total 
    _blank_
    entries

    Each of the entries has the form:

    kana  kanji  part_of_speech  value

    You can examine the files or check in the WINFO code to determine the details. The basics of each field are:

    kana             Kana for the conversion. Verb and adjective endings are not included.
    kanji            Kanji for the conversion. Verb and adjective endings are not included.
    part_of_speech   Indicates the part of speech. This is used to determine verb endings. The important parts of speech are:
    value            Indicates the priority of the conversion. Higher priorities are listed earlier in the list.
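    As a rough sketch of the entry layout, the code below splits each entry into its four fields and orders the results so that higher values come first. It assumes the fields are separated by whitespace, that value is an integer, and that header lines begin with a backslash. The sample entries and the part-of-speech label are made up, so check the real files or the WINFO code for the exact details.

```python
from typing import NamedTuple


class WnnEntry(NamedTuple):
    kana: str
    kanji: str
    part_of_speech: str
    value: int


def parse_wnn(lines):
    """Parse WNN-style entries and return them with higher values first."""
    entries = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("\\"):
            continue          # blank line or a header line such as \total
        kana, kanji, part_of_speech, value = text.split()
        entries.append(WnnEntry(kana, kanji, part_of_speech, int(value)))
    # sorted() is stable, so entries with equal values keep their file order.
    return sorted(entries, key=lambda entry: -entry.value)


# Hypothetical file contents, for illustration only.
sample = ["\\total", "", "かんじ 感じ 名詞 3", "かんじ 漢字 名詞 7"]
for entry in parse_wnn(sample):
    print(entry.kana, entry.kanji, entry.part_of_speech, entry.value)
```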

