A Japanese version of this README is available here: README.ja.md
A Japanese morphological analyzer implemented as a pure JavaScript ES module. This is a modern port of the original Kuromoji, designed for simplicity and compatibility with current web standards.
Import kuromoji from kuromoji-es and call the async createTokenizer function to get a tokenizer instance. The dictionary is loaded from a remote CDN by default.
import { kuromoji } from "https://code4fukui.github.io/kuromoji-es/kuromoji.js";
// Asynchronously load the dictionary and create a tokenizer
const tokenizer = await kuromoji.createTokenizer();
// Tokenize a sentence
const tokens = tokenizer.tokenize("すもももももももものうち");
console.log(tokens);
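A common next step is to turn the token array into space-separated text. The following is a minimal sketch, not part of the library: `segment` is a hypothetical helper, and `sampleTokens` is hand-written data in the shape of the `tokenize()` output (it is not real analyzer output), so the snippet runs without loading the dictionary.

```javascript
// Hand-written sample shaped like tokenizer.tokenize() output.
// Only the fields used below are included; this is not real analyzer output.
const sampleTokens = [
  { surface_form: "黒文字", pos: "名詞" },
  { surface_form: "の", pos: "助詞" },
  { surface_form: "木", pos: "名詞" },
];

// Hypothetical helper: join the surface forms of all tokens with a separator.
function segment(tokens, separator = " ") {
  return tokens.map((token) => token.surface_form).join(separator);
}

console.log(segment(sampleTokens)); // 黒文字 の 木
```

With a real tokenizer, the same helper would be called as `segment(tokenizer.tokenize(text))`.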
kuromoji.createTokenizer(): Asynchronously loads the dictionary files and returns a Promise that resolves with a tokenizer instance.
tokenizer.tokenize(text): Takes a string of Japanese text and returns an array of token objects, each containing detailed morphological information.
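Because createTokenizer() loads dictionary files, an application may prefer to create the tokenizer once and share it. The sketch below shows one way to do that by caching the Promise. `once` and `fakeCreateTokenizer` are hypothetical names introduced for illustration; the stand-in factory replaces the real call so the snippet runs without network access.

```javascript
// Hypothetical helper: wrap an async factory so that it runs at most once.
// Every call returns the same Promise.
function once(factory) {
  let promise = null;
  return function () {
    if (promise === null) {
      promise = factory();
    }
    return promise;
  };
}

// Stand-in for kuromoji.createTokenizer(), used only to keep this sketch
// self-contained. In real use: once(() => kuromoji.createTokenizer())
let calls = 0;
const fakeCreateTokenizer = async () => {
  calls += 1;
  return { tokenize: (text) => [{ surface_form: text }] };
};

const getTokenizer = once(fakeCreateTokenizer);
const first = getTokenizer();
const second = getTokenizer();

console.log(first === second, calls); // true 1
```

Note that this simple version also caches a rejected Promise, so a failed dictionary load would not be retried unless the cache is reset.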
The tokenize() method returns an array of objects with the following structure:
[
  {
    "word_id": 509800,
    "word_type": "KNOWN",
    "word_position": 1,
    "surface_form": "黒文字",
    "pos": "名詞",
    "pos_detail_1": "一般",
    "pos_detail_2": "*",
    "pos_detail_3": "*",
    "conjugated_type": "*",
    "conjugated_form": "*",
    "basic_form": "黒文字",
    "reading": "クロモジ",
    "pronunciation": "クロモジ"
  }
]
Fields:
- word_id: ID of the word in the dictionary.
- word_type: Type of word (KNOWN if in dictionary, UNKNOWN otherwise).
- word_position: 1-based starting position of the word in the input text.
- surface_form: The word as it appears in the text.
- pos: Part of speech (e.g., 名詞 for noun).
- pos_detail_1, pos_detail_2, pos_detail_3: Sub-classifications for the part of speech.
- conjugated_type: Conjugation type (e.g., 五段・ラ行 for Godan verb).
- conjugated_form: Conjugation form (e.g., 基本形 for basic form).
- basic_form: The base form (lemma) of the word.
- reading: The reading of the word in Katakana.
- pronunciation: The pronunciation of the word in Katakana.

For more details on the dictionary fields, please refer to the original kuromoji.js JSDoc page.
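The fields above can be combined for simple post-processing, such as picking out nouns or building a Katakana reading of a sentence. The sketch below uses hand-written sample tokens rather than real analyzer output. `nounsOf` and `readingOf` are hypothetical helpers, and the fallback to surface_form assumes that a token without a dictionary reading has `reading` either missing or set to "*".

```javascript
// Hand-written sample tokens shaped like the tokenize() output (not real output).
// The last token imitates an UNKNOWN word with no reading (an assumption).
const sampleTokens = [
  { surface_form: "黒文字", pos: "名詞", word_type: "KNOWN", reading: "クロモジ" },
  { surface_form: "を", pos: "助詞", word_type: "KNOWN", reading: "ヲ" },
  { surface_form: "ABC", pos: "名詞", word_type: "UNKNOWN" },
];

// Hypothetical helper: surface forms of all tokens whose part of speech is noun.
function nounsOf(tokens) {
  return tokens
    .filter((token) => token.pos === "名詞")
    .map((token) => token.surface_form);
}

// Hypothetical helper: concatenate readings, falling back to the surface
// form when the reading is missing or "*".
function readingOf(tokens) {
  return tokens
    .map((token) =>
      token.reading && token.reading !== "*" ? token.reading : token.surface_form
    )
    .join("");
}

console.log(nounsOf(sampleTokens));   // [ '黒文字', 'ABC' ]
console.log(readingOf(sampleTokens)); // クロモジヲABC
```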
This project is a port of kuromoji.js by Takuya Asano, which is a JavaScript port of the original Kuromoji project by Atilika Inc.
This library is licensed under the Apache License, Version 2.0.