Derived by Lane McDonald, February 2010.
Encoding="UTF-8"

Methodology + Rant:

I needed a frequency-sorted dictionary for a Japanese project of mine and came across Jim Breen's EDICT project. EDICT was exactly what I was looking for, except that it wasn't ordered by frequency. So I set about using wget to recursively copy the Yomiuri and Mainichi newspaper websites. Using somewhat primitive HTML mining tools, I was able to extract ~550MB of text from news, opinion, editorial, forum, review and other types of articles, which yielded upward of 79 million terms and 60 million kanji (see the other file).

I started by sorting EDICT alphabetically so I could group like terms, because there are several instances of terms that share a spelling and I had no meaningful way of telling which was intended in the text (this trimmed the list from around 175,000 terms to around 167,000). Then I sorted by term length so that longer patterns were searched for before shorter ones. Each time a pattern was found, its instances were overwritten with asterisks, to avoid cases where a word like "schoolbus" would also count as +1 for "school" and +1 for "bus". After that it was a simple matter of counting occurrences and ranking terms.

There is a caveat: newspaper-specific or headline-related terms may appear to occur more often than they actually do, as a result of the type of text extraction I used.

I used the same methodology to rank the 6355 kanji in Jim Breen's KanjiDic project by frequency. These did not need to be sorted by length, as they're all 3 bytes wide (after encoding to UTF-8).

The formatting is as follows (a Python shell will make quick work of this!):
[cumulative occurrences]\t[instance occurrences]\t[instance percentage of total]\t[cumulative percentage of total]\t[terms and definitions delimited by pipes]

4378450	4378450	0.055417739157	0.055417739157	の|(prt) indicates possessive among other uses (for full details and examples see the main entry (linked))|
6887209	2508759	0.0317531893409	0.0871709284979	を|(prt) (1) indicates direct object of action|(2) indicates subject of causative expression|(3) indicates an area traversed|(4) indicates time (period) over which action takes place|(5) indicates point of departure or separation of action|(6) indicates object of desire, like, hate, etc.|(P)|
9309086	2421877	0.0306535298693	0.117824458367	が|(prt) (1) indicates sentence subject (occasionally object)|(2) indicates possessive (esp. in literary expressions)|(prt,conj) (3) but|however|still|and|(P)|
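
To read a line in the format above from a Python shell, something like the following should work. It is only a sketch: the names `Entry` and `parse_line` are made up for illustration, and it assumes the fields are separated by real tab characters and that the pipe-delimited field ends with a trailing pipe, as in the rows shown. Note that the two "percentage" columns in those rows look like fractions of the total (0.0554 rather than 5.54), so they are kept as plain floats.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One row of the frequency list (hypothetical container, not part of the data)."""
    cumulative_count: int
    count: int
    share: float
    cumulative_share: float
    terms: List[str]  # spelling(s) first, then the definition pieces


def parse_line(line: str) -> Entry:
    # Split on the first four tabs only, so any tab inside the
    # definitions (unlikely, but possible) stays in the last field.
    cum, count, share, cum_share, rest = line.rstrip("\n").split("\t", 4)
    # The trailing pipe leaves an empty final piece; drop empty pieces.
    terms = [piece for piece in rest.split("|") if piece]
    return Entry(int(cum), int(count), float(share), float(cum_share), terms)


sample = (
    "4378450\t4378450\t0.055417739157\t0.055417739157\t"
    "の|(prt) indicates possessive among other uses "
    "(for full details and examples see the main entry (linked))|"
)
entry = parse_line(sample)
print(entry.terms[0], entry.count)
```

Because the pipe is used both between a term and its definitions and inside the definitions themselves (see "but|however|still|and" in the が row), this sketch does not try to separate spellings from glosses; it just returns all the pieces in order.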
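
The counting step described in the methodology (search for longer terms first, then overwrite each match with asterisks) can be sketched roughly as below. This is not the original script: `count_terms` is a hypothetical name, length is measured in characters rather than bytes, and it assumes no term contains an asterisk. It is also far too slow as written for ~167,000 terms over ~550MB of text; it is only meant to show the idea.

```python
from collections import Counter
from typing import Iterable


def count_terms(text: str, terms: Iterable[str]) -> Counter:
    """Count term occurrences, longest terms first, masking each match."""
    counts: Counter = Counter()
    # dict.fromkeys removes duplicate terms while keeping their order.
    unique_terms = [t for t in dict.fromkeys(terms) if t]
    for term in sorted(unique_terms, key=len, reverse=True):
        n = text.count(term)  # non-overlapping occurrences
        if n:
            counts[term] = n
            # Mask the matches so shorter terms cannot match inside them.
            text = text.replace(term, "*" * len(term))
    return counts


counts = count_terms(
    "the schoolbus left the school",
    ["school", "bus", "schoolbus"],
)
for term, n in counts.most_common():
    print(term, n)
```

In this toy example "schoolbus" is counted once and masked, so "school" is counted only for its standalone occurrence and "bus" is not counted at all.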