I had to setup a MongoDB cluster at my job recently and i thought it would be useful to share the research i did for the same. As you might know MongoDB is a NoSQL document DB that scales well horizontally.
MongoDB achieves scaling through a technique called "sharding". Sharding is the process of writing data across different servers to distribute read & write loads and also manage data storage.
Sharding Topology
- Config Server Main responsibility is to determine which shard the data gets stored in. Best practices recommend using a minimum of 3 config servers to ensure redundancy and availability.
- Query Routers The clients connect to the query routers and they are responsible for communicating to the config servers to figure out where the requested data is stored. The router then accesses it and returns the data from the appropriate shard. Best practices recommend using a minimum of 1 Query Router.
- Shard Servers Shards are responsible for the actual data storage. Best practices recommend using a minimum of 2 shard servers.
Here is an diagram showing the MongoDB cluster architecture.
 
            Shard Key
Once you have determined the topology the next task is to determine the Shard Key. In the case of my employer we were storing global suppression list and the data was similar to this
{
	"_id" : ObjectId("553e79233738d570a1d9000001"),
	"email" : "test@example.com",
	"event_type" : "aa",
	"event_source" : "bb",
	"event_reason" : "abuse",
	"event_date" : ISODate("2015-04-27T00:00:00Z"),
	"processed" : true,
	"remarks" : "campaign_id=8450f5b87d;list_id=cbb7896aa1"
}
            It is recommended to use a shard key that's guarenteed to be evenly distributed. We were anticipating an initial load of 3 million global suppression emails and projected to grow by about 100,000 every year. We felt that email should be by and large distributed evently and decided to use the email as the shard key.
Cluster Setup
I am going to assume that mongodb has already been setup on the target machines. The target machines will be referred to as config 1 thru 3, Router 1 and 2 & Shard 1 and 2 respectively. Here are the instructions to setup the Cluster.
mkdir /apps/mongodb/logs
cat > /etc/mongodb.conf <<-EOF
port = 27017
fork = true
pidfilepath = /apps/mongodb/mongodb.pid
logpath = /apps/mongodb/logs/mongodb.log
dbpath =/apps/mongodb
journal = true
EOF
            Each Router runs the mongos command to specify the config databases. Once that is done connect to the server as admin as issue the following commands
mongos --configdb config1, config2, config3
mongo localhost/admin
            
db.runCommand({"addShard" : "HOSTNAME_FOR_SHARD1:27017"});
db.runCommand({"addShard" : "HOSTNAME_FOR_SHARD2:27017"});
// Now enable sharding
db.adminCommand({"enableSharding" : "DATABASE_NAME"});
// Set the shard key
db.adminCommand({"shardCollection" : "DATABASE_NAME.COLLECTION_NAME", key : {"email" : 1}})
            Create the configuration file for the shard server just like the config servers and you are all set.
mkdir /apps/mongodb/logs
cat > /etc/mongodb.conf <<-EOF
port = 27017
fork = true
pidfilepath = /apps/mongodb/mongodb.pid
logpath = /apps/mongodb/logs/mongodb.log
dbpath =/apps/mongodb
journal = true
EOF
            The command to start or stop the mongodb depends on the version of OS hosting the mongodb Cluster. On CentOS linux servers it would be as follows
sudo /etc/init.d/mongod start|stop