A Blow-by-Blow Guide to the Technical Decisions, Challenges and AWS Tools and Technologies Involved in Building a Cutting-Edge SSO System
Digital businesses with multiple anonymous and registered user entry points need an efficient way to give different systems and apps access to a unified database of users.
Large digital business ecosystems most often have multiple conversion points where users enter the ‘funnel’, including landing pages, websites, newsletters, apps, e-commerce sites and more. Some of these users are registered, while others are anonymous. Adding further complexity, some users register directly via their email address, while others register through social media or Google accounts. Anonymous users have simply agreed to cookies.
User identification across various channels like mobile apps, content sites, shops and more is one of the central challenges for companies trying to obtain a complete picture of their customers.
A unified database of all users that different systems and applications can access via a Single Sign On (SSO) solution is key to efficiently extracting the maximum potential value from each and every user.
This case study details how the team extension provided to Valiton by K&C achieved that, working alongside their in-house peers and using a combination of AWS-native and compatible open source tools and technologies.
The Client
Valiton – IT Services Provider To Major Digital Media & eCommerce Holding
The Project
Single Sign On (SSO) Unified User Database Built For The AWS Cloud
“Using Harbourmaster enables you to target your customers with an experience as unique as they are”.
Valiton was creating a new SSO service, Harbourmaster, to allow different systems and subsidiaries to access a single, huge database of users. As well as storing the user data, the service needed to give each user an identifier. Subsidiaries and their systems all needed access to the same users, but for different reasons.
Harbourmaster provides touchpoints with clients/providers. The system itself is administrative. “Events”, such as subscribing to a newsletter or special offers, encourage users to sign up. Harbourmaster’s role is then to store those user databases and make them conveniently available to a wider ecosystem, allowing for GDPR-compliant filtering.
A media subsidiary might want to push news notifications via mobile messaging or email. Another may need to send physical marketing materials by post and so needs to pull name and address data. A third subsidiary or system may send surveys or promotional emails to users.
The same system needed to offer management insight into when and why user data requests were made by which systems and subsidiaries. And for GDPR, users themselves have to be able to edit their profiles and adjust permission settings, or delete their profile, all in one place. Processed data needed to be fetchable by identifiers and through specific filters.
The majority of data requests are machine-to-machine, requiring 100% availability of the central database service. This meant we needed to create a variety of services in different domains, aggregating them through an API documentation merge, with access to all client systems provided through a single API Gateway.
The result of this merging is a single page of API documentation covering the services a specific user might need – and a cost-efficient SSO system.
Key Requirements of the SSO System
- automated deployment.
- automated scalability.
- automated rate limits for API calls.
- improved monitoring of data flows and optimised use of resources.
- fast deployment to another AWS account (on demand).
- extremely fast HTTP responses, requiring an event-driven design.
Key Features
- Easy to install on-premises and in the cloud
- High performance caching
- API-driven
- Cross-domain login
- Social Network logins
- Delivered via the industry-standard Docker Hub
- Thunder and Drupal compatible
- Compatible with e-commerce and paywalls
Why K&C & Why AWS?
Having worked with Valiton on previous projects, providing them with specialist team extensions, the company turned to us here due to our significant experience in working with the combination of AWS, Kubernetes and Terraform – all technologies that would be part of realising this new SSO database solution.
We confirmed our agreement that the new service’s infrastructure should be deployed on AWS. The decision was taken on the basis of a cost analysis of the major public cloud providers, the specific AWS functionalities and technologies that would be leveraged, and the fact that Valiton already ran a range of services on AWS infrastructure.
The Tech Stack
AWS Native
- S3
- SQS
- SNS
- RDS
- CloudFront
- EKS (Kubernetes)
- Lambda
Plus
- Kafka
- Terraform
- Terragrunt
- Helm
- Istio
The Result – Fully Automated Deployment
The K&C team extension, together with Valiton’s in-house team, achieved fully automated deployment of the new service. Once automated deployment was set up, the developers were able to apply their full focus to building the service’s business logic. The infrastructure is mirrored across developers’ local machines and the production environment.
That means every change made to the service is tested in exactly the same environment as it will run on in production.
Whenever a feature is added or a bug fixed, the developer simply pushes the changes to the remote repository, where the automated CI/CD pipeline creates the updated build of the service, tests it, and deploys it to a testing environment. There the updated features or bug fixes are tested again. If everything checks out, they are then rolled out to the production environment, where they may be tested further.
Fully automated scaling of services was achieved through Kubernetes, with service instances scaled up during peak demand and then scaled back down to normal usage levels.
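As an illustration of that autoscaling behaviour, the sketch below uses the official Kubernetes Python client to attach a HorizontalPodAutoscaler to a deployment. The deployment name, namespace and thresholds are hypothetical placeholders rather than the project’s actual values, which were managed declaratively.

```python
# Minimal sketch: attach a HorizontalPodAutoscaler so Kubernetes scales service
# instances up under peak demand and back down afterwards.
# Assumes a "sso-api" deployment in the "default" namespace (hypothetical names).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="sso-api"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="sso-api"
        ),
        min_replicas=2,                         # baseline capacity for normal usage
        max_replicas=10,                        # ceiling during peak demand
        target_cpu_utilization_percentage=70,   # scale out above 70% average CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```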
Infrastructure-as-Code – Terraform – AWS
Infrastructure-as-Code (IaC) was decided upon as the approach to building Valiton’s Single Sign On (SSO) service in a way that would meet the key requirements and incorporate needed functionalities.
Hashicorp’s Terraform was selected as our IaC tool. Terraform’s strength is that it allows for the creation of fully manageable, module-based solutions. It is also cloud agnostic, so perfectly compatible with our chosen AWS environment.
Terraform can also be paired with Terragrunt, a thin wrapper tool that would allow us to avoid:
- large numbers of folders and inconvenient folder structure
- long files with variables and complex configurations
Terragrunt also allowed us to maintain:
- a single, consistent state per environment and for every developer.
- that state locked while any changes are being introduced by individual developers.
Terragrunt meant we could create a few environment folders with different resources, available on demand.
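As a purely illustrative example (the folder and module names below are hypothetical, not the project’s actual repository), the kind of per-environment layout Terragrunt enables looks roughly like this:

```
live/
├── terragrunt.hcl            # shared remote state and provider configuration
├── feature/
│   ├── env.hcl               # environment-specific variables
│   └── sqs/terragrunt.hcl    # thin wrapper referencing a shared "sqs" module
├── staging/
│   ├── env.hcl
│   └── sqs/terragrunt.hcl
└── production/
    ├── env.hcl
    └── sqs/terragrunt.hcl
```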
Terraform/Terragrunt Allowed For Precise Control Of AWS Resources
We used Terraform/Terragrunt to control the following AWS resources:
- creating SQS queues,
- creating S3 buckets and adding tags to them,
- updating internal services and AWS resources, e.g. removing old messages,
- provisioning SNS, ElastiCache, RDS and more.
Buggy terraform_helm Provider Required An Alternative Kubernetes Solution
We ran into some issues creating and manipulating Kubernetes infrastructure with the Terraform/Terragrunt combo, due to a buggy terraform_helm provider. It also didn’t allow us to use the latest version of Helm, which comes with new features we wanted to exploit.
Our Kubernetes setup, which contained all of our applications, services, cron jobs etc., required an alternative solution, and we opted for Helm Charts.
Helm Charts
We created Helm Charts – templates that look like Kubernetes resource manifests but can be changed at any time by Helm, based on the variables we pass to it. The Helm tool allowed us to achieve the same result we had originally planned to use Terraform and Terragrunt in combination with Kubernetes for.
And the Helm solution ultimately resulted in better, faster performance.
Why Was AWS Selected Ahead Of Other Major Public Cloud Platform Vendors?
As one of the most mature public cloud solutions (if not the most mature), AWS is always under consideration for any architecture that involves a public cloud resource. But three qualities of AWS closed the case for this particular project:
- infrastructure and SDK
- pricing structure
- innovations in the AWS infrastructure
The most influential factor in the decision was the AWS infrastructure and SDK. Why?
How We Used AWS Services And Found Solutions To Complexities
- S3 – the same object storage service, accessed through a web service interface, that Amazon.com uses to run its global e-commerce network. We used S3 for huge analytics datasets, static resources for websites, and much more.
- SQS – AWS’s distributed message queuing service was used for message delivery policies rather than custom-building a solution. SQS allowed us to manage messages securely and, in cases where something goes wrong, automatically re-route messages to another queue – the dead letter queue.
- SNS – allowed us to publish messages and subscribe to them using different resources, such as email, another SQS queue and so on. SNS comes to the rescue in the following scenarios (a minimal publishing sketch follows after this list):
- a service crashes – an alert is sent to the developer(s) responsible.
- suspicious activity is recorded. For example, too many requests from a single IP address. Alerts can be sent via email or even messaging services such as Slack.
- unexpected surge in user activity. For example, we didn’t expect as many requests as came through over some holiday periods. But SNS sent an alert, so the surge could be managed by allocating more resources.
- RDS – we wanted to use PostgreSQL as the database for some sets of data, with full access to manage the data, make dumps and track certain events inside the DB. But we didn’t want to manage the database itself (patching, scaling, securing, etc.). All we wanted was to “just use” it, which Amazon RDS allowed us to do.
- CloudFront – we had a lot of websites and apps to share with our client, their development teams, and internally. Also, we had a few environments to test or run all the services within: feature, integration, staging, production.
Each service required a web link in different environments, which meant a mess of links. CloudFront allowed us to manage all these resources (including cache and path rules) using one tool.
Terraform modules and the Terragrunt-style layout meant we could manage all the resources easily. For example, we had a service called “link-manager”. When we wanted to have it in the feature environment, we had “link-manager.feature.domain.com”. In the case of staging, we had “link-manager.staging.domain.com”. Clean and simple.
- AWS EKS – the EKS Managed Kubernetes Service was used to manage our Kubernetes clusters. We spent a significant amount of time comparing the price models of Google Cloud, AWS, and a few other public cloud options but chose AWS on pricing, infrastructure, and services.
Because we were going to manage all the resources on a single cloud platform, rather than sharing data in a multi cloud architecture, EKS was an ideal Kubernetes management service.
- Lambda – used in cases where we wanted to run something just once or twice a day. But the resource is limited. We wanted to create a number of long-running processes and other data-crunching jobs, and ran into difficulties with Lambda. Why?
- Lambda has an invocation time limit of 15 minutes (https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html). We wanted services that make complicated SQL or NoSQL requests to databases, and some of those requests took more than 15 minutes, so Lambda would sometimes simply cut our requests off.
- Lambda limits the volume of data that can be processed (memory). We wanted to fetch data from different resources, aggregate it, process it and only then stream it to S3 or to another application. Because of memory limitations, it wasn’t possible to use a single Lambda (too much data) or even a chain of Lambdas orchestrated with Step Functions.
Also, in our case it wasn’t possible to split the requests to other services or databases for fetching data, because the data was highly dynamic and might change from one minute to the next. But we wanted to be able to analyse data snapshots, and to do so we needed to send data from one place to another or store it elsewhere.
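Coming back to the SNS scenarios above, here is a minimal boto3 sketch of publishing an alert to an SNS topic. The region, topic ARN and message contents are placeholders for illustration only:

```python
# Minimal sketch (hypothetical topic ARN and payload): publish an operational
# alert to an SNS topic; subscribers (email, an SQS queue, a Slack bridge, etc.)
# receive the fanned-out message.
import json
import boto3

sns = boto3.client("sns", region_name="eu-central-1")

sns.publish(
    TopicArn="arn:aws:sns:eu-central-1:123456789012:ops-alerts",
    Subject="Suspicious activity detected",
    Message=json.dumps({
        "service": "link-manager",
        "reason": "too many requests from a single IP address",
        "source_ip": "203.0.113.10",
    }),
)
```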
Solving The Lambda Problem
To solve this issue, we built a small application, created a Docker image of the same service and pushed it to our ECR repository on the same cloud. We then used ECS with CloudWatch Events (like a cron job) to run our container at the instructed time of day.
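A rough boto3 sketch of that workaround – a CloudWatch Events rule that runs the container image from ECR as a scheduled ECS task. All ARNs, names and the cron expression are placeholders, and in the project itself this wiring was managed as infrastructure code rather than ad-hoc API calls:

```python
# Minimal sketch (placeholder ARNs/names): schedule an ECS task to run once a day,
# replacing a Lambda that would exceed the 15-minute invocation limit.
import boto3

events = boto3.client("events", region_name="eu-central-1")

# Cron rule: every day at 03:00 UTC
events.put_rule(
    Name="nightly-aggregation",
    ScheduleExpression="cron(0 3 * * ? *)",
    State="ENABLED",
)

# Target the ECS cluster and the task definition built from the image pushed to ECR
events.put_targets(
    Rule="nightly-aggregation",
    Targets=[{
        "Id": "nightly-aggregation-task",
        "Arn": "arn:aws:ecs:eu-central-1:123456789012:cluster/analytics",
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:eu-central-1:123456789012:task-definition/aggregator:1",
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "DISABLED",
                }
            },
        },
    }],
)
```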
Leveraging Of AWS Services And Tools
- We wanted a new subdomain for every new service, which was possible using Route53. Terraform then created a new subdomain every time a new service was deployed, allowing for full automation and cutting out the need for any manual work (an illustrative sketch of the record creation follows after this list).
- When we needed a service that could deliver messages to other services several times in succession, SQS made this possible.
- We decided to run our own Kafka cluster and found that AWS provides this too. The price of the service was also incredibly low in comparison to other providers. AWS additionally provided us with CloudWatch charts for Kafka that integrated nicely with other charts we were already using.
- CloudWatch Insights – a tool for showing charts on every service we wanted to monitor. The tool is similar to Grafana, but has a different approach to managing data:
- Grafana uses a pull-approach, which means if you want to show some data, a request must be made.
- CloudWatch Insights uses a push-approach, which means if you want to show some data, it is sent to CloudWatch.
- We had to integrate several external resources where we stored and managed data, and also fetched data from. CloudWatch Insights’ push approach didn’t allow us to track those resources. But we wanted to track them all, and additionally some data flows inside them. That was only possible with a pull approach.
That left us with a major decision: either use Grafana or create a Lambda-like function for pulling data from the external resources into CloudWatch. We decided to use Grafana, as it was created automatically by our Istio installation.
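As flagged in the Route53 point above, this is roughly what the per-service subdomain record creation looks like through the AWS SDK. In the project itself Terraform created these records automatically on each deployment; the hosted zone ID, domain and CloudFront target below are placeholders:

```python
# Minimal sketch (placeholder zone ID, domain and target): create/update a
# per-service subdomain such as link-manager.feature.domain.com.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Comment": "Subdomain for the link-manager service in the feature environment",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "link-manager.feature.domain.com",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": "d1234abcd.cloudfront.net"}],
            },
        }],
    },
)
```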
Additional Advantages To Using AWS
In the past, we made standard use of on-demand EC2 instances for our services. But more recently we moved to EC2 Spot Instances, which achieve what on-demand instances did but more cost-effectively in scenarios where instances are used less frequently than anticipated.
Amazon MSK – a Kafka service recently added to the AWS tool kit that offers the same services as alternatives by other vendors, but with the additional advantages of:
- AWS integration – simple connection to other AWS services
- pricing- much cheaper than using other vendors
- automatic ZooKeeper scaling
- metric diagrams (automated collection of required data)
Another strength of AWS as our cloud platform here is the SQS DLQ (dead letter queue) – a technique overlaid on a standard SQS queue by ticking a few checkboxes in the user interface. The technique makes it much easier to debug certain applications or isolate problems, without messages vanishing.
For instance, a production application gets stuck: messages arrive in SQS and pass on to the jammed application. Eventually, the application crashes and the process has to start again, with the messages lost forever. The SQS DLQ stores these messages for further analysis, which can often help in resolving the problem.
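A minimal boto3 sketch of that DLQ setup. The queue names and retry threshold are illustrative; the project configured the equivalent through the console checkboxes and infrastructure code rather than this exact script:

```python
# Minimal sketch (illustrative names/limits): attach a dead letter queue to a
# standard SQS queue so messages that fail repeatedly are kept for analysis
# instead of being lost when the consuming application crashes.
import json
import boto3

sqs = boto3.client("sqs", region_name="eu-central-1")

dlq_url = sqs.create_queue(QueueName="user-events-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 5 failed receives, SQS moves the message to the DLQ automatically.
sqs.create_queue(
    QueueName="user-events",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",
        })
    },
)
```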
Why Kafka?
We wanted to have an event-driven architecture and made the decision to use Kafka to resolve the following complexities:
- SQS – can be slow when delivering messages; certain messages might be sent more times than required and others delayed.
- SNS – the main problem here is delivery. If one’s service is down when a message is sent, the message is lost. The logic is ‘if no service is listening out for the message, the message is unnecessary’. Even if a new instance is run, the message will not be delivered.
- ElastiCache Redis – has a Pub/Sub mechanism that is similar to Kafka’s. But the main problem there was the data schema. We wanted data arrays and nested objects, which was only possible in Redis by stringifying our data. That meant we would have to parse and stringify our data every time we wanted to make a change – time-consuming, when we wanted an automated solution.
- Kinesis – was also considered but rejected for its lack of flexibility and higher costs. We expected to store groups of messages every millisecond, which would have been expensive, so we looked for a more cost-effective solution.
Kafka was that solution. And it was a nice surprise to discover that AWS provides its own solution for managing Kafka clusters.
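To illustrate the point about structured payloads (the arrays and nested objects that Redis Pub/Sub would have forced us to stringify and re-parse by hand), here is a hedged kafka-python sketch. The broker addresses, topic names and event shape are placeholders; with Amazon MSK the bootstrap brokers come from the cluster configuration:

```python
# Minimal sketch (placeholder brokers/topic/event): publish and consume
# user-profile events with nested objects, serialised as JSON.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["b-1.example.kafka.eu-central-1.amazonaws.com:9092"]

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Nested objects and arrays travel as-is; consumers get them back as dicts/lists.
producer.send("user-profile-updated", value={
    "user_id": "example-user-id",
    "consents": ["newsletter", "special-offers"],
    "address": {"city": "Munich", "postcode": "80331"},
})
producer.flush()

consumer = KafkaConsumer(
    "user-profile-updated",
    bootstrap_servers=BROKERS,
    group_id="postal-marketing",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value["address"])  # e.g. route name/address data to a subsidiary
```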
Cost Modelling & Features Comparison Behind The Kafka Decision
We reached the conclusion that AWS’s Amazon MSK managed Kafka service was the best option for us. We would need to tweak some configurations to keep a wider retention period, as the default configuration retains messages for 7 days. More information can be found in AWS’s documentation on custom MSK configurations.
Amazon MSK makes it easy to scale up disk size and to add new nodes to the cluster. The cluster described would be able to handle 4MB/s. Here is a small example to hint at our needs, based on a client with 20 million create/update changes per month. Even with a generous payload size of 1KB, a throughput of only around 8KB/s would be required – so we have roughly 512x more capacity than needed.
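As a quick sanity check, the back-of-the-envelope arithmetic behind those figures (numbers rounded):

```python
# Back-of-the-envelope check of the throughput figures quoted above.
changes_per_month = 20_000_000          # create/update changes per client per month
payload_bytes = 1_024                   # generous 1KB payload per change
seconds_per_month = 30 * 24 * 60 * 60   # ~2.59 million seconds

required_bps = changes_per_month * payload_bytes / seconds_per_month
print(f"required throughput ~ {required_bps / 1024:.1f} KB/s")   # ~7.7 KB/s

cluster_bps = 4 * 1024 * 1024           # the ~4MB/s the described cluster handles
print(f"headroom ~ {cluster_bps / required_bps:.0f}x")           # ~531x, in line with the ~512x quoted
```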
Kafka management solutions features and price comparison
What Did We Learn? AWS Case Study Takeaways
In order to reach the CloudWatch/Grafana decision described above, we really had to dig deeply into researching both tools.
Another important takeaway from this project was the difference in pricing structures between public cloud providers for managing Kafka. All the main providers offered similar features, but AWS’s pricing was far more attractive.