Chat2Find Unveils “Base” AI Model Built on Sri Lanka’s Largest Trilingual Dataset

Colombo: The emerging AI platform Chat2Find has taken a major step forward in multilingual artificial intelligence with the release of its foundational Chat2Find Base model, now available via Hugging Face.

The Base model forms the backbone of Chat2Find’s upcoming open-weight model suite and is designed to support Sinhala, Tamil, and English, including naturally code-mixed variations such as Singlish and Tanglish. The release marks a significant milestone for locally grounded AI development, particularly in low-resource language ecosystems.

Built on a 255M+ Token Corpus

The Chat2Find Base model is trained through continual pre-training (CPT) on the recently released Chat2Find Corpus, a large-scale conversational dataset containing over 255 million tokens across nearly 280,000 records.

Unlike many global datasets, the corpus is derived from real user interactions, capturing authentic linguistic patterns, cultural nuances, and regional knowledge specific to South Asia. This gives the Base model a strong advantage in understanding local context, informal speech, and multilingual switching, areas where traditional models often struggle.

Foundation for Advanced AI Models

Industry observers note that “base models” represent the pretrained core of AI systems, which can later be fine-tuned for tasks such as instruction-following or reasoning.

Chat2Find has confirmed that the Base model will be followed by specialized variants, including:

Chat2Find Instruct – optimized for task execution and prompts
Chat2Find Reasoning – focused on complex problem-solving

These models are expected to expand the capabilities of AI tools in education, business, and public services across Sri Lanka and the wider region.

Access the Base model: Hugging Face and Lanka Data Net (local repository)

Boost for Sri Lanka’s AI Ecosystem

The release is being seen as a breakthrough for Sri Lanka’s AI landscape, where access to high-quality, locally relevant training data has historically been limited. By open-sourcing both the dataset and model components under permissive licensing, Chat2Find is enabling researchers, startups, and institutions to build next-generation applications tailored to regional needs.

Analysts say the initiative could position Sri Lanka as a regional hub for multilingual AI innovation, particularly in South Asian language technologies.

Looking Ahead

With the Base model now accessible to developers worldwide, attention is shifting to real-world deployments and fine-tuned applications. As global AI development increasingly moves toward open ecosystems, Chat2Find’s approach highlights the growing importance of localized data and inclusive language representation in shaping the future of artificial intelligence.
