Chat2Find Releases 255M+ Token Sri Lankan Trilingual AI Corpus via Hugging Face and LankaData

COLOMBO, Sri Lanka - LankaData has announced the public release of the Chat2Find Corpus, a major trilingual conversational dataset, marking a key milestone in Sri Lanka’s growing AI ecosystem.

The Chat2Find Corpus consists of over 255 million tokens across 279,248 records in Sinhala, Tamil, and English, including code-mixed language such as Singlish and Tanglish. Designed for training and fine-tuning Large Language Models (LLMs), the dataset supports multilingual AI development in low-resource environments.

Developed by Chat2Find and released as open source under the MIT License, the dataset gives researchers, developers, and institutions access to high-quality, locally relevant data, the scarcity of which has long limited AI innovation in Sri Lanka. It is suitable for continual pre-training (CPT) and supervised fine-tuning (SFT).

The corpus captures authentic language use and cultural context, making it especially valuable for modern natural language processing tasks. Alongside the dataset, Chat2Find has also announced upcoming AI models, including a base model and fine-tuned models optimized for trilingual understanding and reasoning.

Access the dataset via Hugging Face and LankaData.

This release positions LankaData as a key hub for open AI resources in Sri Lanka, supporting the next wave of locally grounded AI development.