If you’re building a Big Data application in 2022, there is a high chance you will weigh up the choice between a traditional cloud architecture based on containers and Serverless. In this blog, K&C’s Big Data & AI consulting and development team looks at why Serverless is becoming so popular for apps and platforms that work with Big Data.
Both Serverless and containers offer huge cost advantages compared to what went before them. Before the era of cost-effective, scalable public cloud platforms (it’s easy to forget AWS only launched in 2006 and took several years to become ubiquitous), the insights afforded by Big Data analysis were practically limited to large enterprises because of the infrastructure overheads involved. There was also far less data in general: the connected devices that now produce so much of it are themselves a consequence of the cloud computing revolution.
Cloud development gave rise to containers, which allowed applications to be broken down into the smallest independent modules that made sense, improving speed, efficiency and reliability. In Big Data projects, containers running in VMs also had an advantage over the plain VMs previously relied on: they allowed Big Data applications to be hardware and GPU agnostic and improved GPU-sharing efficiency.
Serverless takes things one step further and provides Containers-as-a-Service, removing the job of setting up and managing the underlying VM infrastructure. The software development team only needs to interact with the container.
K&C’s experts believe that, even if a Serverless architecture is not yet always the right solution for Big Data-powered AI applications, the technology is moving in that direction.
In this article, we examine why Serverless is so well suited to Big Data processing, and why Serverless architecture is becoming such a popular approach to app development more generally.
In 2012, Forbes published a guest post on the rise of Big Data, written by John Bantleman, CEO of database software company Rainstor. The ‘age of Big Data’ was announced. Bantleman wrote:
“We’ve entered the age of Big Data where new business opportunities are discovered every day because innovative data management technologies now enable organizations to analyze all types of data”.
However, Bantleman quickly moved on to warn that the business opportunities Big Data was opening up would come with costs not yet appreciated. Collecting, storing, processing and using AI/machine learning algorithms to analyse the huge volumes of semi-structured and unstructured data being generated requires huge computing resources.
The infrastructure of any application built to process Big Data must be able to meet those storage, processing and cost challenges.
Before the rise of Cloud and Serverless offered storage and computing resource as a utility service, processing Big Data meant building and maintaining the server infrastructure to do so. That had to be large enough to accommodate peaks of data flow even if they were only occasional.
It is precisely the anomalies such as peaks and troughs in data flow that often offer the most valuable scientific or commercial insights. However, an application being able to handle occasional peaks meant paying for and maintaining expensive infrastructure that spent most of its time redundant.
First Hadoop and then Cloud computing changed that. By distributing Big Data sets across many cheap ‘commodity server’ nodes, which combine into a computational resource capable of storing and handling huge data sets, Hadoop significantly lowered the cost of scalable bare metal infrastructure.
That 2012 Bantleman Forbes article estimated that a Hadoop cluster and distribution facility for Big Data cost around $1 million, compared to the $10 million to hundreds of millions of dollars for enterprise data warehouses. But of course, $1 million is still not pocket change and maintained a barrier to entry that kept most organisations out of Big Data applications.
Next came Cloud computing and containers. Cloud providers such as AWS turned computing power into a service – removing the requirement for major upfront investment in hardware infrastructure. The pay-as-you-go and fluidly scalable model of public cloud platforms like AWS opened the door for the experimentation and innovation that led to a rich open source development ecosystem.
It has also allowed many young companies using Big Data, which would otherwise have faced much tougher barriers to entry, to grow and flourish.
Cloud Computing meant no upfront investment and only paying for the processing power needed for irregular data flow peaks when they occurred.
Cloud computing’s democratisation of Big Data can be credited as the catalyst for a new technology revolution. One that spans digital technology and biotechnology. Revolutions gathering pace in medicine, pharmaceuticals, finance, commerce, agriculture, food technology and pretty much any other sector you may care to mention are happening because start-ups and SMEs can now afford to build and run Big Data applications.
Machine Learning, zeroing in on patterns that were previously undetectable, is suddenly turbocharging new discoveries.
Within a few short decades the world we live in will be unrecognisable. Yes, technology has advanced quickly over the decades before. But what Cloud-powered Big Data and AI will achieve over the next several decades will be a paradigm shift.
We’ll be able to cure quite possibly a majority of previously incurable diseases and conditions. The human genome and those of other forms of life will be mapped. Autonomous vehicles will reshape the economy and our lifestyles more than most imagine today. Ecommerce will be a truly personalised experience. The list goes on.
But as much as cloud computing has knocked down barriers to entry for Big Data and the AI that feeds on it, there is still a bottleneck. Cloud has hugely cut costs and container orchestrators such as Kubernetes have helped make apps more efficient and flexible.
But setting up and maintaining the cloud infrastructure that containerised Big Data applications run on is still very difficult. The main gains to be had are in velocity and the time required to maintain the architecture. But containers require specialists with a very specific skill set that isn’t easy to acquire. Those specialists are expensive, either to hire ‘off-the-shelf’ or as an investment in further training, and in short supply.
The explosion of IoT across pretty much every sector imaginable means huge demand for the cloud architects and DevOps engineers able to set up container infrastructure. Everyone is fishing in the same shallow pool, and the difficulty and expense of hiring the skills needed to manage container infrastructure have become a strategic problem.
The main advantage of Serverless computing over traditional cloud-based or server-centric infrastructure is that freeing developers from the need to deal with purchasing, provisioning and managing backend servers can lead to a quicker time to release and less ongoing maintenance. That reduces development overheads, and Serverless can often, under the right circumstances, also lower cloud costs, because charging is based entirely on resources actually used, with no overhead for maintaining unused capacity.
But setting cloud costs aside, Serverless architecture allows for quicker app deployments and updates because code doesn’t need to be deployed to servers, or backends reconfigured, to release a working version. And because the application is a collection of functions provisioned by the Serverless vendor, rather than a monolithic stack, developers can either release code all at once or one function at a time.
This makes it possible to quickly update, fix or add new features to a live application by making changes to or adding functions.
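To make that concrete, the sketch below (in Python, using the standard boto3 AWS SDK) shows what updating a single function in a live application might look like. The function name and the path to the deployment package are hypothetical, purely for illustration.

```python
# A minimal sketch of redeploying one function in isolation.
# Assumes an existing AWS Lambda function called "enrich-order-events" (hypothetical)
# and a freshly built deployment package at build/enrich_order_events.zip.
import boto3

lambda_client = boto3.client("lambda")

with open("build/enrich_order_events.zip", "rb") as package:
    response = lambda_client.update_function_code(
        FunctionName="enrich-order-events",  # hypothetical function name
        ZipFile=package.read(),
    )

# Only this one function is redeployed; the rest of the application keeps running untouched.
print(response["LastModified"], response["CodeSha256"])
```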
Cloud development introduced pay-per-use and spinning up only as many instances as required, then deleting them when a job was completed so they no longer had to be paid for. But traditional cloud services still involve a user manually spinning up the virtual machines needed, meaning infrastructure specialists must always be available to maintain instances.
The Functions-as-a-Service (FaaS) offerings of Serverless providers, such as AWS Lambda, Google Cloud Functions and Azure Functions, automate scaling by using events to trigger actions. They can also perform data processing much like traditional Hadoop without having to involve the Hadoop framework.
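As a rough illustration of the event-driven model, here is a minimal Python AWS Lambda handler that runs whenever a new object lands in an S3 bucket and performs a trivial processing step over it. The bucket, object key and ‘processing’ logic are placeholders, and the S3 event notification that triggers the function is assumed to be configured separately.

```python
# A hedged sketch of an event-triggered FaaS function: AWS Lambda invoked by an S3 event.
# No servers are provisioned or managed; scaling happens per incoming event.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    results = []
    for record in event["Records"]:  # one record per uploaded object
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        line_count = len(body.decode("utf-8").splitlines())
        results.append({"key": key, "rows": line_count})  # trivial stand-in for real processing
    return {"statusCode": 200, "body": json.dumps(results)}
```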
Other Serverless services are combined with FaaS to build a data pipeline consisting of data collection, real-time streaming and processing, and storage.
All of the main Serverless providers, AWS, Azure and GCP, have their own services for each stage of a Big Data process.
Some Big Data apps are fed data that has already been collected and/or formatted and structured. Others will integrate live data being generated in real time by things like IoT devices, or data generated by an application itself, such as logs or Change Data Capture (CDC).
Data collection services offered by the main Serverless providers include:
Amazon Web Services – AWS Glue, AWS IoT, AWS Data Pipeline
Microsoft Azure – Azure Data Factory, Azure IoT Hub
Google Cloud Platform – Google Cloud Dataflow
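For illustration, the collection step might look something like the following Python sketch, in which a device or gateway publishes readings to AWS IoT Core via the boto3 iot-data client. The MQTT topic, payload fields and device identifier are hypothetical, credentials and endpoint configuration are assumed to be in place, and an AWS IoT rule would then route the messages onwards into the pipeline.

```python
# A rough sketch of the data collection step: publishing a device reading to AWS IoT Core.
# Topic name, payload fields and device identifier are hypothetical.
import json
import time
import boto3

iot = boto3.client("iot-data")  # assumes AWS credentials and region are already configured

reading = {
    "device_id": "sensor-042",        # hypothetical device identifier
    "temperature_c": 21.4,
    "recorded_at": int(time.time()),
}

iot.publish(
    topic="factory/line-1/telemetry",  # hypothetical MQTT topic
    qos=1,
    payload=json.dumps(reading),
)
```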
Streaming data flows into a Serverless application in a way that allows for real-time processing and analysis. A Serverless architecture needs to include real-time storage that can scale up and down as the volume of data being collected fluctuates. There are also services for real-time data processing.
Data streams and processing services provided by the main Serverless providers include:
Amazon Web Services – Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis Data Streams for real-time data processing.
Microsoft Azure – Azure Event Hubs and Azure Stream Analytics for real-time data processing.
Google Cloud Platform – Cloud Pub/Sub and Google Cloud Dataflow for real-time data processing.
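To sketch what the processing side of a stream can look like, here is a minimal Python AWS Lambda handler attached to an Amazon Kinesis stream. Kinesis hands the function small batches of base64-encoded records; the JSON payload shape and the toy aggregation are assumptions for illustration only.

```python
# A minimal sketch of real-time stream processing in a FaaS function.
# Kinesis delivers records base64-encoded inside the Lambda event.
import base64
import json

def handler(event, context):
    readings = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        readings.append(json.loads(payload))  # assumes JSON payloads (an assumption)

    # Toy aggregation over the micro-batch handed to this invocation.
    values = [r["value"] for r in readings if "value" in r]
    if values:
        print(f"batch size={len(values)} avg={sum(values) / len(values):.2f}")
    return {"processed": len(readings)}
```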
At the storage layer, Serverless offers BaaS fully managed databases that automate the scaling of tables to adjust capacity and maintain performance, with built-in availability and fault tolerance. Serverless databases also allow compute and storage nodes to be decoupled.
The main Serverless database services across the three main providers are:
Amazon Web Services – Amazon DynamoDB, Amazon Aurora, Amazon Athena
Microsoft Azure – Azure Cosmos DB, Azure SQL Database Serverless
Google Cloud Platform – Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore.
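As a simple illustration of the storage layer, the Python sketch below writes a record to an Amazon DynamoDB table created with on-demand (pay-per-request) capacity, so table scaling and capacity management are left to the service. The table name and attribute schema are hypothetical.

```python
# A hedged sketch of writing to a Serverless database (Amazon DynamoDB) via boto3.
# Table name and key schema are hypothetical; the table is assumed to use on-demand capacity.
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sensor-readings")  # hypothetical table name

table.put_item(
    Item={
        "device_id": "sensor-042",      # partition key (assumed schema)
        "timestamp": int(time.time()),  # sort key (assumed schema)
        "temperature_c": 21,            # DynamoDB numbers: ints/Decimals, not floats
    }
)
```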
An early adopter of the Serverless approach, K&C (Krusche & Company) has established a reputation over more than 20 years as one of the most trusted IT services providers in Munich, Germany and across Europe. We operate tech talent centres in Krakow, Poland, Kyiv, Ukraine, and Minsk, Belarus, combining German management and company presence with cost-efficient access to nearshore IT talent.
If you are considering a Serverless approach for your next Big Data application and would like an expert opinion on the suitability of a Serverless architecture, or a dedicated team to build it end-to-end, please do get in touch!