India's Dual AI Path: Open Source Fine-Tuning & Indigenous Development
India is charting a distinctive course to establish itself as an artificial intelligence powerhouse, employing a pragmatic two-pronged strategy that could serve as a blueprint for other nations in the Global South. The approach balances immediate utility with long-term technological sovereignty. The stakes were underscored last month when Microsoft abruptly withdrew services from the Russia-backed Indian refiner Nayara Energy, exposing the vulnerability of relying on foreign digital infrastructure.
The core of India’s AI ambition was prominently displayed at Google’s I/O Connect event in Bengaluru this July, where the emphasis was firmly on developing AI capabilities tailored to India’s profound linguistic diversity. With 22 official languages and hundreds of spoken dialects, creating AI systems that can effectively navigate this multilingual landscape presents a formidable challenge. Startups like Sarvam AI showcased Sarvam-Translate, a multilingual model fine-tuned on Google’s open-source large language model (LLM), Gemma, to address this. Similarly, CoRover demonstrated BharatGPT, a chatbot designed for public services, including the Indian Railway Catering and Tourism Corporation (IRCTC). Google also announced collaborations with Sarvam, Soket AI, and Gnani, all of which are leveraging Gemma to build next-generation Indian AI models.
This reliance on a foreign-developed model like Gemma might seem paradoxical, especially since three of these startups are also designated to build India’s foundational large language models from scratch under the ₹10,300 crore IndiaAI Mission. This government initiative aims to foster homegrown models trained on Indian data, languages, and values. However, the decision to use existing open-source models is rooted in pragmatism. Developing competitive models from the ground up is resource-intensive and time-consuming. Given India’s evolving compute infrastructure, limited high-quality training datasets, and pressing market demands, a layered approach proves more viable. Startups are fine-tuning open-source models to solve immediate, real-world problems while concurrently building the data pipelines, user feedback loops, and domain-specific expertise required to cultivate truly indigenous and independent models over time. Fine-tuning involves adapting a pre-trained general LLM to specialize in specific, often local, datasets, thereby enhancing its performance in particular contexts.
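To make the fine-tuning idea concrete, here is a deliberately tiny, self-contained sketch in Python: a "pretrained" one-parameter model is kept frozen while a small learned correction is trained on local data, loosely analogous to adapter-style fine-tuning of an LLM. The model, data, and numbers are invented purely for illustration.

```python
# Toy sketch of fine-tuning: freeze a "pretrained" parameter and train
# only a small correction (delta) on local data. All values are illustrative.

# Pretrained "model": y = w_base * x, kept frozen during fine-tuning.
w_base = 1.0

# Local domain data where the true mapping is y = 1.5 * x, standing in
# for a local-language corpus the base model predicts poorly.
data = [(0.5, 0.75), (1.0, 1.5), (2.0, 3.0), (3.0, 4.5)]

delta = 0.0   # the only trainable parameter, initialized at zero
lr = 0.05     # learning rate

def loss(d):
    """Mean squared error of the adapted model (w_base + d) on the data."""
    return sum(((w_base + d) * x - y) ** 2 for x, y in data) / len(data)

before = loss(delta)
for _ in range(200):
    # Exact gradient of the mean squared error with respect to delta.
    grad = sum(2 * ((w_base + delta) * x - y) * x for x, y in data) / len(data)
    delta -= lr * grad
after = loss(delta)

print(round(delta, 3), after < before)  # → 0.5 True
```

In real deployments the frozen part is a multi-billion-parameter model and the correction is a set of adapter weights (as in LoRA-style methods), but the division of labor is the same: the base stays fixed while a small, cheap-to-train component absorbs the local specialization.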
This dual strategy is exemplified by initiatives such as Project EKA, an open-source, community-driven effort led by Soket AI in collaboration with IIT Gandhinagar, IIT Roorkee, and IISc Bangalore. EKA is being built from scratch, with its code, infrastructure, and data pipelines entirely sourced within India. A 7 billion-parameter model is anticipated within four to five months, followed by a 120 billion-parameter model within ten months. Abhishek Upperwal, co-founder of Soket AI, noted that the project focuses on four critical domains: agriculture, law, education, and defense, each with a defined dataset strategy drawing from government advisories and public-sector use cases. A key feature of EKA is its complete independence from foreign infrastructure, with training conducted on India’s GPU cloud and the resulting models being open-sourced. Yet, in a pragmatic move, Soket has used Gemma for initial deployments, with Upperwal clarifying that the goal is to bootstrap now and transition to a sovereign stack when ready.
CoRover’s BharatGPT mirrors this dual approach. It currently operates on a fine-tuned model, providing conversational AI services in multiple Indian languages to government clients like IRCTC and Life Insurance Corporation. Founder Ankush Sabharwal emphasized the need for a quickly fine-tunable base model for critical applications in public health, railways, and space, while also confirming the development of their own foundational LLM using Indian datasets. These deployments serve not only as service delivery mechanisms but also as crucial data creation avenues, improving accessibility today while building a bridge to future sovereign systems. Sabharwal explained that the process begins with an open-source model, which is then fine-tuned, enhanced for language understanding and domain relevance, and eventually replaced by a proprietary sovereign model.
Amlan Mohanty, a technology policy expert, describes India’s strategy as an “experiment in trade-offs”—leveraging models like Gemma for rapid deployment without abandoning the long-term objective of autonomy. This approach aims to reduce dependency on potentially adversarial nations, ensure cultural representation, and test the reliability of partnerships with allies.
The drive for indigenous AI in India extends beyond national pride; it is about addressing unique problems that foreign models often fail to comprehend. Consider a migrant in rural Maharashtra seeking medical advice. A foreign AI tool, trained on Western data, might provide explanations in English with a Cupertino accent, using medical assumptions that don’t align with Indian body types or local medical terminology. Such a mismatch highlights the critical need for AI that understands local languages, cultural nuances, and physiological contexts—whether for a health worker in Bihar needing an AI tool that understands Maithili medical terms, or a farmer in Maharashtra requiring crop advisories aligned with state-specific irrigation schedules. These are high-impact, everyday scenarios where errors can directly affect livelihoods, public services, and health outcomes. Fine-tuning open models provides a crucial immediate solution while simultaneously building the essential datasets, domain knowledge, and infrastructure for a truly sovereign AI stack.
This dual-track strategy is seen as one of the quickest paths forward, using open tools to organically build sovereign capacity. Abhishek Upperwal of Soket AI views these as parallel yet separate threads: one focused on immediate utility, the other on long-term independence, with an ultimate convergence in sight.
The IndiaAI Mission is a national response to a growing geopolitical concern. As AI systems become indispensable for education, agriculture, defense, and governance, over-reliance on foreign platforms increases the risks of data exposure and loss of control. The Nayara Energy incident, where Microsoft cut off services due to sanctions, served as a stark warning, illustrating how foreign tech providers can become geopolitical leverage points. Similarly, shifts in trade policies, like past tariff increases, underscore the intertwined nature of trade and technology.
Beyond reducing dependence, sovereign AI systems are vital for India’s critical sectors to accurately reflect local values, regulatory frameworks, and linguistic diversity. Most global AI models, predominantly trained on English and Western datasets, are ill-equipped to handle India’s multilingual population or the complexities of its localized systems, such as interpreting Indian legal judgments or accounting for specific crop cycles and farming practices. Mohanty emphasizes that AI sovereignty is not about isolation but about control over infrastructure and terms of access. He notes that complete “full-stack” independence, from chips to models, is infeasible for any nation, including India, with even global powers balancing domestic development with strategic partnerships. India’s government, therefore, maintains a pragmatic, agnostic stance on foundational AI elements, driven by constraints like the lack of Indic data, compute capacity, and readily available open-source alternatives tailored for India.
Despite the momentum, a fundamental roadblock remains the scarcity of high-quality training data, particularly in Indian languages. While India boasts immense linguistic diversity, this has not translated into sufficient digital data for AI systems to learn from. Manish Gupta, director of engineering at Google DeepMind India, cited internal assessments revealing that 72 Indian languages with over 100,000 speakers had virtually no digital presence. To address this, Google launched Project Vaani in collaboration with the Indian Institute of Science (IISc), aiming to collect voice samples across hundreds of Indian districts. The first phase gathered over 14,000 hours of speech data from 80 districts, covering 59 languages, 15 of which previously lacked digital datasets. Subsequent phases are expanding this coverage across India. Gupta also highlighted the challenges of data cleaning and quality, and Google’s efforts to integrate these local language capabilities into its large models, leveraging cross-lingual transfer from widely spoken languages like English and Hindi to improve performance in lower-resource languages. Google’s Gemma LLM incorporates these Indian language capabilities, and its collaborations with IndiaAI Mission startups include technical guidance and making collected datasets publicly available, driven by both commercial and research imperatives. India is seen as a global testbed for multilingual and low-resource AI development, with solutions potentially scaling to other linguistically complex regions.
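The cross-lingual transfer Gupta describes can be illustrated with a toy example: if translation-equivalent words land near each other in a shared embedding space, a classifier trained only on high-resource-language labels can handle low-resource-language inputs it never saw during training. The vocabulary, vectors, and nearest-centroid classifier below are invented for illustration; real systems learn embeddings from data at vastly larger scale.

```python
# Toy illustration of cross-lingual transfer via a shared embedding space.
# Hypothetical words and hand-picked vectors; real embeddings are learned.

# Translation-equivalent words sit close together in the shared space.
embedding = {
    "good":   (0.90, 0.10),
    "accha":  (0.85, 0.15),  # Hindi for "good"
    "bad":    (0.10, 0.90),
    "kharab": (0.12, 0.88),  # Hindi for "bad"
}

# Labeled training data exists only for English (the high-resource language).
train = [("good", 1), ("bad", 0)]

def centroid(label):
    """Average embedding of the training words carrying this label."""
    vecs = [embedding[w] for w, l in train if l == label]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

centroids = {label: centroid(label) for label in (0, 1)}

def classify(word):
    """Assign the label whose centroid is nearest in embedding space."""
    v = embedding[word]
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Zero-shot predictions on Hindi words that had no labeled examples.
print(classify("accha"), classify("kharab"))  # → 1 0
```

The Hindi words are classified correctly because their vectors sit near their English equivalents, which is the same mechanism, at a vastly smaller scale, by which capabilities learned in English and Hindi can transfer to lower-resource Indian languages.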
For India’s sovereign AI builders, the absence of readily available, high-quality Indic datasets means that model development and dataset creation must proceed in parallel. India’s layered strategy—using open models now while concurrently building sovereign ones—offers a valuable roadmap for other countries grappling with similar constraints, particularly in the Global South. It provides a blueprint for nations seeking to develop AI systems that reflect local languages, contexts, and values without the luxury of vast compute budgets or mature data ecosystems. For these countries, fine-tuned open models offer a bridge to capability, inclusion, and control.
As Soket AI’s Upperwal puts it, “Full-stack sovereignty in AI is a marathon, not a sprint. You don’t build a 120 billion model in a vacuum. You get there by deploying fast, learning fast and shifting when ready.” Countries like Singapore, Vietnam, and Thailand are already exploring similar methods, using Gemma to kickstart their local LLM efforts. By 2026, when India’s sovereign LLMs, including EKA, are expected to be production-ready, this dual track is projected to converge, with homegrown systems gradually replacing bootstrapped models.
However, a lingering question of dependency persists. Even with open-source models from global tech giants like Meta’s Llama or Google’s Gemma, control over architecture, training techniques, and infrastructure support still heavily rests with these major players. While Google has open-sourced speech datasets and partnered with Indian startups, the terms of such openness are not always symmetrical. India’s sovereign aspirations ultimately depend on outgrowing these open models. As Mohanty cautions, if a foreign government were to direct a tech giant to alter access or pricing, the impact on Indian initiatives could be significant, jeopardizing digital sovereignty. The years ahead will test whether India and other Global South nations can transform this borrowed support into complete, sovereign AI infrastructure before the terms of access shift or the window to act closes.