Chat2Find Releases 255M+ Token Sri Lankan Trilingual AI Corpus via Hugging Face and Lanka Data

COLOMBO, Sri Lanka: LankaData has announced the public release of the Chat2Find Corpus, a major trilingual conversational dataset, marking a key milestone in Sri Lanka’s growing AI ecosystem.

The Chat2Find Corpus consists of over 255 million tokens across 279,248 records in Sinhala, Tamil, and English, including code-mixed language such as Singlish and Tanglish. Designed for training and fine-tuning Large Language Models (LLMs), the dataset supports multilingual AI development in low-resource environments.

Developed by Chat2Find and released as open source under the MIT License, the dataset gives researchers, developers, and institutions access to high-quality, locally relevant data, the lack of which has long limited AI innovation in Sri Lanka. It is suitable for continual pre-training (CPT) and supervised fine-tuning (SFT).

The corpus captures authentic language use and cultural context, making it especially valuable for modern natural language processing tasks. Alongside the dataset, Chat2Find has also announced upcoming AI models, including a base model and fine-tuned models optimized for trilingual understanding and reasoning.

Access the dataset:

Hugging Face: Link
LankaData: Link

This release positions LankaData as a key hub for open AI resources in Sri Lanka, supporting the next wave of locally grounded AI development.
