A few words of explanation of the "wordfreq" file.

In 1998 I was contacted by Alexandre Girardi, who was at that time
a graduate student at NAIST (Shikano-lab). He had been using my EDICT
file and also had access to online copies of the previous 4 years of the
Mainichi Shimbun. He had processed the newspaper material (all or part
of it, I am not sure which) through a morphological analyser and produced
a word-list with Part-of-Speech and frequency attached. Initially he was
interested in getting relative frequency data into the EDICT file.

Alexandre sent me a copy of the file with about 300,000 words marked. He
asked me just to use it myself until he could get clearance to release it.
He didn't get back to me.

Some years later I tracked him down (he was working in Belgium) and asked
him if it was OK to release the file. He replied: "Yes, you can release it.
In fact you can use the whole data, because it is already in the public
domain."

So here it is, exactly as Alexandre produced it. 

The format appears to be:

	word+pos[TAB]frequency

The pos number relates to the POS category in the "wordtype" file.

Everything is in EUC coding.
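
As a rough illustration of reading the format described above, here is a
minimal Python sketch. It assumes that the "+" between word and pos is a
literal character, that the pos and frequency are plain integers, and that
"EUC coding" means EUC-JP (Python's "euc_jp" codec). The function names are
made up for this example and are not part of the distribution.

```python
def parse_wordfreq_line(line):
    """Split one 'word+pos<TAB>frequency' line into (word, pos, frequency).

    Assumption: the last '+' before the tab separates the word from its
    pos number, so a word that itself contains '+' is still handled.
    """
    entry, freq = line.rstrip("\r\n").split("\t")
    word, _, pos = entry.rpartition("+")
    return word, int(pos), int(freq)


def read_wordfreq(path, encoding="euc_jp"):
    """Yield (word, pos, frequency) tuples from a wordfreq-style file."""
    with open(path, encoding=encoding) as f:
        for line in f:
            if line.strip():  # skip any blank lines
                yield parse_wordfreq_line(line)
```

If the real file turns out to differ from these assumptions (for example a
non-numeric pos field), the int() calls will raise ValueError, which is a
useful early warning.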

I have also added a file called "word_freq_k" which has the kanji-only
words which appear 16 or more times. It has ~38,000 lines. In this file
the fields are tab-delimited.
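
A selection like this (kanji-only words occurring 16 or more times) could be
made along the following lines. This is only a sketch and not necessarily
how "word_freq_k" was actually built. It assumes entries are already
available as (word, pos, frequency) tuples, and it treats "kanji" as the
Unicode range U+4E00 to U+9FFF plus the repetition mark U+3005, which is a
simplification.

```python
def is_kanji_only(word):
    """True if every character is a kanji (simplified definition).

    Assumption: kanji means U+4E00..U+9FFF, plus the iteration mark U+3005.
    """
    return bool(word) and all(
        "\u4e00" <= ch <= "\u9fff" or ch == "\u3005" for ch in word
    )


def kanji_only_frequent(entries, min_count=16):
    """Keep (word, pos, frequency) tuples for kanji-only words that
    occur at least min_count times."""
    return [
        (word, pos, freq)
        for (word, pos, freq) in entries
        if freq >= min_count and is_kanji_only(word)
    ]
```

The threshold is a parameter so that a smaller or larger list can be
produced from the same data.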

The words in the file show the bias you would expect from newspaper text,
i.e. words relating to politics, etc. occur quite often. They also reflect
the time period of the newspaper articles, with words like "Clinton"
occurring more often than is the case now.

Jim Breen
May 2003