India's AI Playbook: Blending Global Models with Local Innovation

Livemint

Microsoft’s recent withdrawal of cloud services from Russia-backed Indian refiner Nayara Energy underscored a critical vulnerability: the risk of over-reliance on foreign technology infrastructure. The incident has amplified India’s strategic push to develop its own foundational artificial intelligence capabilities, an endeavor that could serve as a blueprint for other nations in the Global South.

India faces a unique challenge in AI development due to its profound linguistic diversity, encompassing 22 official languages and hundreds of spoken dialects. Building AI systems capable of navigating this multilingual landscape is a monumental task. Yet, a pragmatic dual strategy is emerging, where Indian startups are simultaneously fine-tuning global open-source models for immediate applications while painstakingly building indigenous foundational models from the ground up.

At Google’s I/O Connect event in Bengaluru, this layered approach was on display. Startups like Sarvam AI showcased Sarvam-Translate, a multilingual model fine-tuned from Google’s open-source large language model (LLM), Gemma. Similarly, CoRover demonstrated BharatGPT, a chatbot that delivers public services, including for the Indian Railway Catering and Tourism Corporation (IRCTC), and is likewise built on a fine-tuned model. These Google-backed efforts might seem paradoxical, given that Sarvam, Soket AI, and Gnani are also among the four startups tasked with developing India’s sovereign LLMs under the ₹10,300 crore IndiaAI Mission.

The rationale for this dual approach is rooted in necessity. Developing competitive AI models from scratch is resource-intensive, demanding vast datasets, advanced compute infrastructure, and extensive research. India, with its evolving tech ecosystem and urgent market demands, cannot afford to build in isolation. Instead, fine-tuning existing large language models—specializing them with focused, local data—offers a pragmatic path to address real-world problems today. This allows startups to bootstrap initial deployments, gather user feedback, and develop domain-specific expertise while concurrently building the data pipelines and infrastructure needed for truly independent models.
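To make the fine-tuning route concrete, here is a minimal sketch of what “specializing an existing model with focused, local data” can look like in code: a small open Gemma checkpoint adapted with LoRA adapters on a local-language instruction file, using the Hugging Face Transformers, Datasets, and PEFT libraries. The file name, hyperparameters, and output paths are illustrative assumptions, not the actual pipelines of Sarvam, CoRover, or any IndiaAI Mission startup, and access to the gated Gemma weights on Hugging Face is assumed.

```python
# Illustrative sketch only: LoRA fine-tuning of an open Gemma checkpoint
# on a hypothetical local-language instruction file.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "google/gemma-2b"  # small open checkpoint; licence acceptance assumed
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA freezes the base weights and trains small adapter matrices,
# so adaptation fits on modest GPU budgets.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

# Hypothetical JSONL file of {"prompt": ..., "response": ...} pairs
# in a local language, e.g. Hindi railway-helpdesk queries.
data = load_dataset("json", data_files="local_instructions.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-local-ft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=data,
    # Pads batches and derives labels for causal language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("gemma-local-ft")  # stores only the adapter weights
```

Because the base weights stay frozen and the local data pipeline remains the startup’s own, the same recipe can in principle be rerun against a sovereign base model later, which is part of what makes the “bootstrap now, switch later” approach technically plausible.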

Project EKA, an open-source initiative led by Soket AI in partnership with leading Indian institutes like IIT Gandhinagar and IISc Bangalore, exemplifies the sovereign ambition. Designed from the ground up with entirely India-sourced code, infrastructure, and data pipelines, EKA aims to deliver a 7 billion-parameter model within months, with a larger 120 billion-parameter model planned. This initiative focuses on critical domains such as agriculture, law, education, and defence, ensuring training happens on India’s GPU cloud and the resulting models are open-sourced. Yet, Soket AI co-founder Abhishek Upperwal clarifies that using Gemma for initial deployments is a temporary measure, a way to “bootstrap and switch to sovereign stacks when ready,” rather than a long-term dependency. CoRover’s BharatGPT follows a similar trajectory, leveraging fine-tuned models for current government applications while also developing its own foundational LLM with Indian datasets, treating current deployments as avenues for both service delivery and dataset creation.

For India, developing its own AI capabilities transcends national pride; it is about solving problems that foreign models often cannot adequately address. Imagine a migrant worker in rural Maharashtra who understands only Hindi trying to follow a doctor’s AI-assisted explanation of an X-ray, delivered in English and grounded in Western medical assumptions. Such scenarios highlight a fundamental mismatch in cultural, physiological, and contextual grounding. India requires AI tools that understand local medical terms in Maithili, provide crop advisories aligned with state-specific irrigation schedules, and process citizen queries across 15 languages with regional variations. These are high-impact, everyday use cases where errors can directly affect livelihoods, public services, and health outcomes. Fine-tuning open models provides an immediate answer to these urgent needs while simultaneously laying the groundwork for a truly sovereign AI stack.

The IndiaAI Mission is a strategic response to a burgeoning geopolitical concern. As AI systems become integral to governance, education, agriculture, and defence, reliance on foreign platforms poses risks of data exposure and loss of control, as demonstrated by the Nayara Energy incident. Furthermore, most global AI models are trained on English-dominant, Western datasets, rendering them ill-equipped to handle India’s linguistic diversity or the intricacies of its legal judgments and agricultural practices.

While complete self-sufficiency in AI is infeasible for any nation, including global powers, India’s approach is about maximizing choice and reducing dependencies. Amlan Mohanty, a technology policy expert, emphasizes that sovereignty lies in controlling infrastructure and setting terms. He notes that the Indian government’s pragmatic, technology-agnostic stance is shaped by constraints: high-quality Indic datasets, compute capacity, and open-source alternatives tailored for India are all in short supply.

Indeed, the lack of high-quality training data, particularly in Indian languages, remains a significant hurdle. Google DeepMind India’s Manish Gupta points out that 72 Indian languages with over 100,000 speakers have virtually no digital presence. Initiatives like Google’s Project Vaani, in collaboration with the Indian Institute of Science (IISc), aim to bridge this gap by collecting voice samples at scale across hundreds of Indian districts, including for languages that previously had no digital datasets. This data, coupled with Google’s cross-lingual transfer capabilities, improves performance in lower-resource languages and feeds into models like Gemma, which Indian startups build upon.
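Cross-lingual transfer, in this context, means that a model adapted on data in one Indian language often retains some usable capability in related, lower-resource languages, because the tokenizer and pretraining corpus already cover their shared scripts and vocabulary. The rough probe below illustrates the idea under the assumptions of the earlier sketch: it loads the saved adapter and prompts it with a Maithili-style agricultural query. The adapter directory and the prompt are hypothetical, and this is not a Project Vaani or DeepMind workflow.

```python
# Illustrative probe of cross-lingual transfer using the adapter saved
# in the earlier sketch (paths and prompt are hypothetical).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, "gemma-local-ft")  # hypothetical adapter directory

# A Maithili-style agricultural query in Devanagari; shared script and
# subword vocabulary are what let some Hindi-trained capability carry over.
prompt = "गहूम में सिंचाई कहिया करबाक चाही?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```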

India’s layered strategy offers a compelling roadmap for other nations in the Global South grappling with similar constraints. It provides a blueprint for building AI systems that reflect local languages, contexts, and values without requiring immense compute budgets or mature data ecosystems from the outset. By 2026, as India’s sovereign LLMs like EKA are expected to be production-ready, this dual track is projected to converge, with homegrown systems gradually replacing bootstrapped models.

However, even as Indian startups build on open tools from global tech giants, the question of long-term dependency persists. Control over architecture, training techniques, and infrastructure support still largely rests with Big Tech. While Google has open-sourced datasets and partnered with IndiaAI Mission startups, the terms of such openness may not always be symmetrical. India’s sovereign ambitions ultimately hinge on its ability to outgrow these open models. The critical question for India and other Global South nations is whether they can convert this borrowed support into a complete, sovereign AI infrastructure before the terms of access shift or the window of opportunity closes.