In this case study we’ll look at MongoDB sharding, “a database architecture that partitions data by key ranges and distributes the data among two or more database instances. Sharding enables horizontal scaling”.
More specifically, we’ll show step-by-step how our nearshored DevOps team configured an open source MongoDB sharding solution for a scalable, high availability distributed database as an alternative to an expensive enterprise license. Spoiler – the net result for our client was a minimum MongoDB enterprise license saving of €100k a year. Every year.
As well as walking you through the context of our decision to use open source MongoDB (our client SolarLog is a leading provider of PV energy management systems), we’ll detail the configuration process step-by-step.
MongoDB – In Pursuit of The Holy Grail of an Automated, Self-Healing & Scalable Database Solution
How effectively companies and organisations leverage data is, as much as anything else, the foundation of their sustainable success and competitiveness in the contemporary world. And that comes down to three things:
- Data capture
- Data storage and availability
- Data analysis
Database technologies such as MongoDB focus on number 2. Software engineers have grappled with the challenges around creating autonomous, self-recoverable and scalable database systems since the 1970s and the advent of SQL relational databases (first developed by IBM as the ‘System R’ prototype and commercialised by Oracle in 1979).
By the 1990s, the object-oriented databases that led to NoSQL implementations had appeared, and it is this format that now dominates bulk data storage on the web. But the original problem remains. The advent of cloud computing and the proliferation of IoT hardware mean database technologies and systems are more important than ever. Data as currency is the economics of the digital economy.
The main categories of NoSQL databases are:
- Key-Value Stores
- Wide-Column Stores
- Document Stores
- Graph Databases
- and Search Engines
Our focus here is Document Store databases, which hold data in the form of JSON documents. Document stores do not require a set schema, and data structure can vary between documents. Document databases use the same document-model format used in their application code, which makes it easier for software developers to store and query the data in a database.
As explained by AWS:
“The flexible, semi-structured, and hierarchical nature of documents and document databases allows them to evolve with applications’ needs. The document model works well with use cases such as catalogs, user profiles, and content management systems where each document is unique and evolves over time. Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents”.
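To make that schema flexibility concrete, here is a minimal sketch in the mongo shell: two documents with different structures stored in the same collection (the collection and field names are hypothetical):

db.devices.insertOne( { name: "inverter-01", firmware: "2.1" } )
db.devices.insertOne( { name: "meter-07", firmware: "1.4", location: { lat: 48.1, lon: 9.8 } } )
// Both documents live in the same collection despite having different fields;
// no schema migration is needed to add the "location" sub-document.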
MongoDB, an open source project also available as an enterprise solution, has become a leading Document Store database technology, with particularly common application in an IoT environment.
Open source alternatives to MongoDB include, but are not limited to, CouchDB, Apache Cassandra and PostgreSQL, as well as cloud vendor-native choices like AWS’s DynamoDB and Amazon DocumentDB (which offers MongoDB compatibility in an AWS environment).
MongoDB Shards And Sharded Clusters
In MongoDB, sharding splits large data sets into smaller subsets held across multiple instances. These smaller data sets are termed shards, which work together as one large cluster-based data set.
From MongoDB 3.6 on (the most recent release is MongoDB 4.4), each shard is deployed as a replica set, which offers the redundancy and high availability crucial to contemporary database technologies. Shards are only connected to directly by users or applications for local maintenance and administration operations.
A query to a single shard only returns the subset of data held on that particular shard. For cluster-level operations such as reads or writes, the user or application connects to the mongos. MongoDB does not guarantee that any two contiguous chunks of data reside on the same shard.
Sharding Broken Down
Let’s take a closer look at the components of MongoDB’s sharding architecture:
Primary Shard
Each individual database in a sharded cluster centres around a primary shard, which holds all of that database’s un-sharded collections. A database’s primary shard is not related to the primary in a replica set.
[Diagram omitted. Source: MongoDB]
Mongos
When a new database is created, the mongos selects as its primary shard the shard holding the least amount of data in the cluster, identified using the totalSize field returned by the listDatabases command.
The interface between the client applications and the sharded cluster is provided by the mongos instances, which route queries and write operations to the shards. As far as an application is concerned, a mongos instance behaves in the exact same way as any other MongoDB instance.
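For example, an application’s connection string simply points at a mongos instead of a single mongod (the hostname and database name below are placeholders):

mongodb://mongos-host:27017/mydatabase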
Changing The Primary Shard In A Database
It is possible to change the shard assigned as a database’s primary shard by running the movePrimary command, though the migration can take some time, during which collections associated with the database should not be accessed.
Before migrating to a new primary shard, keep in mind that transferring larger data collections can impact a cluster’s operations. Take this, and the resulting network load, into account.
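A minimal sketch of the command, run via a mongos connection (the database and shard names here are hypothetical):

use admin
db.adminCommand( { movePrimary: "telemetry", to: "rs2" } )
// "telemetry" and "rs2" are placeholders; substitute your own database and target shard.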
When deploying a new sharded cluster containing shards previously employed as replica sets, existing databases continue to reside on their original replica sets. Databases created at a later date can sit on any of the cluster’s shards.
Shard Status
For an overview of a cluster, use the sh.status() method in the mongo shell. The report shows each database’s primary shard as well as how data chunks are distributed across the shards comprising the cluster.
Sharded Cluster Security
Intra-cluster security is maintained, and unauthorised access by non-member components blocked, through internal/membership authentication. Internal authentication can only be enforced if each mongod in the cluster is started with the appropriate security settings.
Shard Local Users
Role-Based Access Control (RBAC) is supported on individual shards and restricts unauthorised access to the data residing on a shard as well as its operations. To enforce RBAC, each mongod in the replica set should be started with the --auth option.
Each shard has its own shard-local users, which cannot be used on other shards or to connect to the cluster via a mongos.
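For illustration, a shard-local user is created by connecting directly to the shard’s primary rather than through a mongos (the user name, password and role below are assumptions for the sketch):

use admin
db.createUser(
  {
    user: "shardLocalAdmin",
    pwd: "changeMe",
    roles: [ { role: "clusterAdmin", db: "admin" } ]
  }
)
// This user exists only on this shard and cannot authenticate via a mongos.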
Why Open Source MongoDB With Sharding Was Our Choice Of Database Solution
Our client SolarLog is a global leader in the innovation of software and IoT devices for monitoring and managing photovoltaic plants. The company’s solutions are used by 327,313 PV plants worldwide across 138 countries. It operates a huge number of IoT devices, which gather the data from the solar energy plants whose output its products and solutions are designed to optimise.
After partial processing, those large and growing data flows needed to be efficiently stored and easily and reliably accessible. Scalability was a requirement, as were self-healing qualities. The load had to be distributed between cluster nodes to ensure high availability and recoverability.
Our team elected to develop the database using open source MongoDB sharding. The easy alternative was a MongoDB enterprise license. However, that would have come at a cost of €100,000 annually before server costs.
Confident that a configuration using MongoDB’s open source code and sharding for effective distribution across node clusters could be built to meet the project’s requirements in full, that is what we set about implementing.
Problem Solving – What We Learned In The Process Of Configuring Our MongoDB Sharding Solution
No project involving relatively complex custom architecture and development is executed without issues, and this particular database solution was no exception. Initially, we took the wrong approach to creating backups. In the final version of the solution, the scheme was adjusted to take snapshots of the file system, with database writes blocked at the moment each snapshot was captured.
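A minimal per-node sketch of that final scheme, assuming dbPath sits on an LVM volume (the volume names and snapshot size are placeholders):

mongo --eval 'db.fsyncLock()'    # flush pending writes to disk and block new writes
lvcreate --snapshot --name mongo-snap --size 10G /dev/vg0/mongo    # capture the filesystem snapshot
mongo --eval 'db.fsyncUnlock()'    # resume writes
# For a full sharded cluster, the balancer should also be stopped for the duration of the backup.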
MongoDB Sharded Cluster Configuration: Step-by-Step Set-Up
In our configuration, we will use the following scheme: a config server replica set of three members (mongos-cfg1, mongos-cfg2, mongos-cfg3), one or more three-member shard replica sets (rs1, rs2, …), and a mongos router in front of them.
The starting point is installing the mongodb daemons on the config servers.
The mongod config of the first CFG replica set server:
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
storage:
  dbPath: /var/lib/mongo
  journal:
    enabled: true
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
  timeZoneInfo: /usr/share/zoneinfo
net:
  port: 27017
  bindIp: mongos-cfg1
replication:
  oplogSizeMB: 10240
  replSetName: "replconfig01"
sharding:
  clusterRole: configsvr
For the other two config servers everything is similar; we only change bindIp.
If you use short names for cluster members, specify them in /etc/hosts.
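For example (the addresses below are placeholders; adjust them to your network):

10.0.0.11  mongos-cfg1
10.0.0.12  mongos-cfg2
10.0.0.13  mongos-cfg3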
We connect to our first server, mongos-cfg1, and create the replica set:
mongo --host mongos-cfg1 --port 27017
and run the command:
rs.initiate(
  {
    _id: "replconfig01",
    configsvr: true,
    members: [
      { _id: 0, host: "mongos-cfg1:27017" },
      { _id: 1, host: "mongos-cfg2:27017" },
      { _id: 2, host: "mongos-cfg3:27017" }
    ]
  }
)
“ok” : 1 in the output means our previous command executed successfully.
We check the status of all three servers; there should be one Primary and two Secondaries. The commands for checking the status:
rs.isMaster()
rs.status()
Next we configure our shards for MongoDB. The mongod config of the first shard:
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
storage:
  dbPath: /opt/mongo
  journal:
    enabled: true
processManagement:
  fork: true    # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid    # location of pidfile
  timeZoneInfo: /usr/share/zoneinfo
net:
  port: 27017
  bindIp: 0.0.0.0
security:
  authorization: enabled
  transitionToAuth: false
  keyFile: /etc/mongo.keyfile
replication:
  replSetName: "rs1"
sharding:
  clusterRole: shardsvr
Generate keyfile:
openssl rand -base64 756 > /etc/mongo.keyfile
chmod 400 /etc/mongo.keyfile
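Internal authentication only works if every mongod and mongos in the cluster shares an identical keyfile, so copy this file to every member rather than regenerating it, e.g. (hostnames are the placeholders used throughout this walkthrough):

scp /etc/mongo.keyfile root@ip_mongo_shard_1_2:/etc/mongo.keyfile
scp /etc/mongo.keyfile root@ip_mongo_shard_1_3:/etc/mongo.keyfile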
For the mongo data storage directory specified by dbPath, set the correct owner:
chown -R mongod:mongod /opt/mongo/
We start the services and check that the daemons we need are listening on port 27017.
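For example:

systemctl start mongod
netstat -ntupl | grep mongod    # each member should show mongod listening on 27017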
We execute the commands to initialise the first replica set, the first shard:
rs.initiate(
  {
    _id: "rs1",
    members: [
      { _id: 0, host: "ip_mongo_shard_1_1:27017" },
      { _id: 1, host: "ip_mongo_shard_1_2:27017" },
      { _id: 2, host: "ip_mongo_shard_1_3:27017" }
    ]
  }
)
The last and most interesting part is setting up mongos – the router for all requests to our future cluster.
After the mongod package is installed on the future router, we create a service file so that mongos starts automatically.
!!! The mongod service itself must be stopped and excluded from autorun !!!
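The standard systemd commands for this:

systemctl stop mongod
systemctl disable mongod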
nano /etc/systemd/system/mongos.service

[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
Type=forking
User=root
Group=root
ExecStart=/usr/bin/mongos -f /etc/mongos.conf
PIDFile=/var/run/mongos.pid
LimitFSIZE=infinity
LimitCPU=infinity
LimitAS=infinity
LimitNOFILE=64000
LimitNPROC=64000
LimitMEMLOCK=infinity
TasksMax=infinity
TasksAccounting=false

[Install]
WantedBy=multi-user.target
Run the commands to update the list of services and enable mongos at boot:
systemctl daemon-reload
systemctl enable mongos
Our config file for mongos /etc/mongos.conf will be as follows:
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongos.log
processManagement:
  fork: true
  pidFilePath: /var/run/mongos.pid
  timeZoneInfo: /usr/share/zoneinfo
net:
  port: 27017
  bindIp: 0.0.0.0
  maxIncomingConnections: 20000
  ipv6: false
sharding:
  configDB: "replconfig01/mongos-cfg1:27017,mongos-cfg2:27017,mongos-cfg3:27017"
security:
  keyFile: /etc/mongo.keyfile
On the mongos server, do not generate a new keyfile; copy the same /etc/mongo.keyfile used by the rest of the cluster (every mongod and mongos must share an identical key) and apply the same permissions.
Start the mongos service:
systemctl start mongos
Check:
netstat -ntupl | grep mongos
tcp   0   0 0.0.0.0:27017   0.0.0.0:*   LISTEN   3570/mongos
There should also be lines in the log /var/log/mongodb/mongos.log showing a successful connection to our mongo config replica set:
Successfully connected to mongos-cfg1:27017
Successfully connected to mongos-cfg2:27017
Successfully connected to mongos-cfg3:27017
The final touch is registering our mongo shard servers in the mongos router.
On the mongos server, connect to our cluster:
mongo --host mongos --port 27017
and add our shard servers:
sh.addShard("rs1/ ip_mongo_shard_1_1:27017, ip_mongo_shard_1_2:27017, ip_mongo_shard_1_3:27017")
and add additional shards if you have created them:
sh.addShard("rs2/ ip_mongo_shard_2_1:27017, ip_mongo_shard_2_2:27017, ip_mongo_shard_2_3:27017") sh.addShard("rs3/ ip_mongo_shard_3_1:27017, ip_mongo_shard_3_2:27017, ip_mongo_shard_3_3:27017") …
MongoDB Sharding Case Study Takeaways
Our solution for SolarLog is not groundbreaking in the sense that we didn’t do anything with MongoDB’s open source code that hasn’t been done before. However, our sharding-based approach to MongoDB clusters both fully answered the requirements of the project and highlights how the often weighty expense of enterprise licenses can be avoided with a little thought and innovation.
Paying for enterprise solutions can be a convenient option if the cash is available and the business case supports the overhead. But, especially in the case of mature, well-developed and well-resourced open source technologies, it is not always necessary. An open source alternative, especially one that can lead to cost savings of as much as €1 million over the lifespan of a database architecture, is almost always worth considering and exploring.