DataPelago's Spark Accelerator Boosts Performance on Modern Cloud

Apache Spark remains a dominant engine for large-scale data processing, yet its architecture, developed when cloud infrastructure primarily relied on CPUs, faces challenges in today’s diverse computing environments. Modern cloud platforms increasingly incorporate GPUs, FPGAs, and other specialized hardware. Many open-source data systems, however, have not evolved to leverage these advancements, leading organizations to incur higher compute costs without achieving expected performance gains.

Addressing this disparity, DataPelago has launched its new Spark Accelerator. This solution integrates native execution with CPU vectorization and GPU support, built upon the company’s Universal Data Processing Engine. DataPelago aims to enable organizations to run analytics, ETL (Extract, Transform, Load), and GenAI (Generative AI) workloads across modern compute infrastructures without the need to rewrite existing code or data pipelines.

The Spark Accelerator operates within existing Spark clusters and, according to the company, requires no reconfiguration. It analyzes workloads during execution and selects a processor for each task component, whether a CPU, GPU, or FPGA. DataPelago states this approach can accelerate Spark jobs by up to 10 times while reducing compute costs by as much as 80 percent.

Rajan Goyal, Founder and CEO of DataPelago, elaborated on the Accelerator in an exclusive interview, describing it as a direct response to the growing gap between traditional data systems and contemporary infrastructure. “If you look at the servers in the public cloud today, they are not CPU-only servers. They are all CPU plus something,” Goyal explained. “But many of the data stacks written last decade were built for single software environments, usually Java-based or C++-based, and only using CPU.”

The DataPelago Accelerator for Spark connects to existing Spark clusters using standard configuration hooks and functions as a complementary component. Once activated, it analyzes query plans as they are generated and determines where each part of the workload should execute: on a CPU, a GPU, or another accelerator.

These decisions are made at runtime, based on the available hardware and the specific characteristics of the job. “We’re not replacing Spark. We extend it,” Goyal clarified. “Our system acts as a sidecar. It hooks into Spark clusters as a plugin and optimizes what happens under the hood without any change to how users write code.”

Goyal emphasized that this runtime flexibility is crucial for delivering performance without introducing new complexities for users. “There is no one silver bullet,” he stated. “All of them have different performance points or performance per dollar points. In our workload, there are different characteristics that you need.” By adapting to the hardware present in each environment, the system can more effectively utilize modern infrastructure without forcing users to re-architect their pipelines.
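To make the idea of runtime placement concrete, the following is a minimal Python sketch of a cost-based device selector. It is not DataPelago's algorithm, which the article does not describe. The `Operator` fields, the `DEVICE_PROFILES` numbers, and the cost formula are all hypothetical, chosen only to show how "available hardware" and "job characteristics" could jointly drive a per-operator choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """One step of a query plan (hypothetical, simplified)."""
    name: str
    rows: int                 # estimated input rows
    parallel_fraction: float  # share of work that is data-parallel, 0..1


# Hypothetical device parameters: relative throughput on data-parallel work
# and a fixed startup/transfer overhead. Real values would be measured.
DEVICE_PROFILES = {
    "cpu": {"throughput": 1.0, "startup": 0.0},
    "gpu": {"throughput": 8.0, "startup": 5.0},
    "fpga": {"throughput": 4.0, "startup": 2.0},
}


def estimated_cost(op: Operator, device: str) -> float:
    """Illustrative cost: fixed overhead + accelerated part + serial part."""
    profile = DEVICE_PROFILES[device]
    work = op.rows / 1_000_000  # arbitrary work units
    parallel = work * op.parallel_fraction / profile["throughput"]
    serial = work * (1.0 - op.parallel_fraction)
    return profile["startup"] + parallel + serial


def place(plan: list[Operator], available: set[str]) -> dict[str, str]:
    """Pick the cheapest available device for each operator in the plan."""
    devices = [d for d in DEVICE_PROFILES if d in available]
    if not devices:
        raise ValueError("no known device available")
    return {
        op.name: min(devices, key=lambda d: estimated_cost(op, d))
        for op in plan
    }


plan = [
    Operator("scan_filter", rows=500_000_000, parallel_fraction=0.95),
    Operator("small_lookup", rows=1_000_000, parallel_fraction=0.5),
]
print(place(plan, {"cpu", "gpu"}))
print(place(plan, {"cpu"}))
```

Under these made-up numbers, the large, highly parallel scan is worth the GPU's fixed overhead, while the small lookup stays on the CPU. On a CPU-only cluster, the same plan runs unchanged on CPUs, which mirrors the article's point that the user's code does not need to change.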

This adaptability has already yielded significant benefits for early adopters. A Fortune 100 company managing petabyte-scale ETL pipelines reported a 3-4x improvement in job speed and a reduction in data processing costs of up to 70 percent. While results may vary by workload, Goyal affirmed the tangible nature of these savings. “Here is the cost reduction. That $100 will become either $60 or $40,” he noted, highlighting the direct financial advantage for enterprises.
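The figures above are simple to work through. The short Python sketch below converts a percentage reduction into a resulting cost and a speedup factor into a resulting runtime. The $100 baseline comes from Goyal's quote. The 12-hour baseline runtime is an assumption for illustration only and is not a figure from the article.

```python
def cost_after_reduction(baseline: float, reduction_pct: float) -> float:
    """Cost remaining after cutting `reduction_pct` percent from `baseline`."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline * (100 - reduction_pct) / 100


def runtime_after_speedup(baseline_hours: float, speedup: float) -> float:
    """Runtime remaining after a job becomes `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


# Goyal's "$100 will become either $60 or $40" is a 40 to 60 percent cut.
for pct in (40, 60, 70):
    print(f"{pct}% reduction: $100 -> ${cost_after_reduction(100, pct):.0f}")

# Hypothetical 12-hour job at the reported 3-4x speedup.
for factor in (3, 4):
    print(f"{factor}x speedup: 12 h -> {runtime_after_speedup(12, factor):.0f} h")
```

Read this way, the $60 and $40 outcomes in the quote correspond to savings of 40 and 60 percent, and the "up to 70 percent" reported by the Fortune 100 customer would leave $30 of every $100 spent.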

Other early customers have observed similar gains. RevSure, a prominent e-commerce firm, deployed the Accelerator in just 48 hours and reported measurable enhancements across its ETL pipeline, which processes hundreds of terabytes of data. ShareChat, one of India’s largest social media platforms with over 350 million users, experienced a doubling of job speeds and a 50 percent decrease in infrastructure costs after implementing the Accelerator in production.

The Accelerator’s adaptive capabilities are also attracting broader industry attention. Orri Erling, co-founder of the Velox project, views DataPelago’s work as a natural progression of advancements made by open-source systems on CPUs. “Since its inception, Velox has been deeply focused on accelerating analytical workloads. To date, this acceleration has been oriented around CPUs, and we’ve seen the impact that lower latency and improved resource utilization have on businesses’ data management efforts,” Erling commented. “DataPelago’s Accelerator for Spark, leveraging Nucleus for GPU architectures, introduces the potential for even greater speed and efficiency gains for organizations’ most demanding data processing tasks.”

The new Spark Accelerator builds directly on the foundational technology DataPelago introduced when it emerged from stealth in late 2024 with its Universal Data Processing Engine. At that time, the company described a virtualization layer designed to route data workloads to the most suitable processor without requiring code modifications. This initial vision now underpins the performance improvements reported by customers using the Spark Accelerator.

The Accelerator is currently available on both Amazon Web Services (AWS) and Google Cloud Platform (GCP), and can also be accessed via the Google Cloud Marketplace. DataPelago states that deployment typically takes minutes, not weeks, and does not require rewriting applications, swapping data connectors, or adjusting security policies. The Accelerator integrates with Spark’s existing authentication and encryption protocols and includes built-in observability tools for real-time performance monitoring. This combination of visibility and plug-and-play integration facilitates customer adoption without disrupting ongoing operations.

While initially focused on analytics and ETL, Goyal indicated a growing demand for the Accelerator within AI and GenAI pipelines. “The compute footprint for these models is only going up,” he observed. “Our goal is to help teams unlock that performance affordably without reinventing their infrastructure.”

In a move to support its next phase of growth, DataPelago recently appointed John “JG” Chirapurath, a former SAP and Microsoft executive, as its President. Chirapurath previously served as Executive Vice President and Chief Marketing & Solutions Officer at SAP, and as Vice President of Azure at Microsoft. His appointment signals DataPelago’s strategic push to scale adoption and deepen industry partnerships.
