LLM-jp

News

Release of llm-jp-modernbert

We are pleased to announce the release of llm-jp-modernbert, a BERT model trained on a large-scale Japanese corpus and supporting a maximum sequence length of 8192 tokens.
This model is based on ModernBERT, which incorporates techniques used in recent LLMs such as RoPE and FlashAttention. It uses llm-jp-tokenizer v3 as its tokenizer. The training data consists of the Japanese subset (approximately 0.69T tokens) of llm-jp-corpus-v4, developed by LLMC. Note that llm-jp-corpus-v4 will be released in the future.
Although the model did not outperform existing models on the JGLUE subtasks, it showed strong performance on cloze tests, likely because llm-jp-corpus-v4 includes recent content. These results suggest that the model effectively captures up-to-date knowledge and common sense.
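As an illustration of this kind of cloze-style probing, the sketch below runs a fill-mask query with Hugging Face transformers. The repository id llm-jp/llm-jp-modernbert-base and the example sentence are assumptions for illustration; please check the official model card for the exact name.

```python
# Minimal cloze (fill-mask) probe with Hugging Face transformers.
# The repository id below is an assumption for illustration.
from transformers import AutoTokenizer, pipeline

model_id = "llm-jp/llm-jp-modernbert-base"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
fill_mask = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)

# Use the tokenizer's own mask token rather than hard-coding it.
text = f"日本の首都は{tokenizer.mask_token}です。"
for prediction in fill_mask(text, top_k=5):
    print(f"{prediction['token_str']}\t{prediction['score']:.3f}")
```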
We also evaluated the effect of context length extension using pseudo-perplexity, following the method in the NeoBERT paper, and analyzed how sentence embeddings evolve during training using intermediate checkpoints.
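For readers unfamiliar with pseudo-perplexity, the sketch below shows the standard masked-LM formulation (mask each position in turn, score the true token, and exponentiate the mean negative log-likelihood). It is a simplified illustration of the general technique, not the exact evaluation protocol of our paper or the NeoBERT paper, and the model id and example sentence are again assumptions.

```python
# A minimal sketch of masked-LM pseudo-perplexity: mask one position at a
# time, score the true token, and exponentiate the mean negative log-likelihood.
# The repository id is assumed for illustration; batching and truncation of
# long inputs are omitted here.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "llm-jp/llm-jp-modernbert-base"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def pseudo_perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    # Skip the special tokens at the sequence boundaries.
    for pos in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        true_id = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[true_id].item())
    return float(torch.exp(torch.tensor(nlls).mean()))

print(pseudo_perplexity("大規模言語モデルの研究開発が進んでいます。"))
```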
For details on training methods and evaluation analysis, please refer to our technical paper on arXiv.

Resources:

References

  • Warner et al., Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663
  • Le Breton et al., NeoBERT: A Next-Generation BERT, https://arxiv.org/abs/2502.19587