My MongoDB upgrade, or not -ish
2019-09-01
So, I have been using MongoDB for years now, since I started on V3 of my (Unofficial) BnS API back in 2017, and it has served me well. It was my first foray into databases in general and for my needs at the time it suited me perfectly: I could dump JSON documents in and pull them out as I required. It also served me well for my GW2 API parser and most things I have done up to now.
My hardware didn't really change in those two years. My host is Hetzner, which has good pricing and power. I used a CX11 for my frontend and a CX21 for my backend/DB. Initially I ran all scripts on my frontend, but as time progressed I added more storage to my backend and moved some of the heavy processing to it. It also hosts a MySQL database (that WordPress and a few other things use).
That brings me up to about a month ago. By this point I had been running MongoDB as a standalone for a few years and it was pretty simple to work with, but I had been nearing the limits of my database server for a while and started looking into upgrading. The new plan was to move to a sharded cluster, with each shard being a replica set, spread over the following servers:
- CX11s:
  - 1x frontend
  - 3x script processing servers
  - 1x mongos instance
  - 3x config servers
- CX21s:
  - 2x for the first shard
  - 2x for the second shard
The process I was going to follow was:
- Upgrade my existing DB to v4.2
- Convert it into a replica set (rs0)
- Turn rs0 into my first shard
- Provision rs1 and start it as a shard
- Set up the mongos and config servers
- Tie everything together
- Balance out the data between the shards
- Test everything
1 – This I had no issues with; I was already using a config file, which makes it so much easier.
2 – This is where my issues started; of course I forgot to do the blood sacrifice and Murphy popped in for coffee.
Setting up the replica set was no issue; the problem was the replication lag.
When the secondaries completed the initial sync they had to apply all the changes that had happened during the sync, and before they could catch up they had fallen outside of the oplog window. Even rescaling them to CX51s didn't make any difference.
This took quite a bit of testing and in the end I couldn't get it working. I am still not sure what was causing the problems in the replication.
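To make the oplog problem concrete: the window can be checked (and the oplog grown on a live member) roughly like this. A minimal pymongo sketch, not my actual tooling; the host and the size are placeholder values.

```python
# Sketch: check the oplog window and grow the oplog on a live replica-set member.
# Host and size below are placeholders, not my actual settings.
from pymongo import MongoClient

member = MongoClient("mongodb://localhost:27017")

# The oplog window is the time span between the oldest and newest oplog entries.
# A secondary that needs longer than this to finish syncing can never catch up.
oplog = member.local["oplog.rs"]
first = oplog.find_one(sort=[("$natural", 1)])
last = oplog.find_one(sort=[("$natural", -1)])
print(f"oplog window: {(last['ts'].time - first['ts'].time) / 3600:.1f} hours")

# Since MongoDB 3.6 the oplog can be resized without a restart (size is in MB).
member.admin.command("replSetResizeOplog", size=16384)
```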
3 – Turning it (the lone primary) into a shard was actually pretty easy: literally just adding a line to the config and restarting mongod.
4 – This was also an easy task; by now I had scripted the setup process and it went smoothly.
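The setup script essentially boiled down to initiating the new replica set from one of its members. In pymongo terms it looks roughly like this; the hostnames are placeholders, and it assumes each mongod was already started with replication.replSetName set to rs1 in its config.

```python
# Sketch: initiate the second replica set (rs1) from one of its members.
# Assumes each mongod was started with replication.replSetName: rs1.
# Hostnames are placeholders for illustration only.
from pymongo import MongoClient

member = MongoClient("mongodb://shard1a.example:27017", directConnection=True)

# Two members per shard, mirroring the plan above.
member.admin.command("replSetInitiate", {
    "_id": "rs1",
    "members": [
        {"_id": 0, "host": "shard1a.example:27017"},
        {"_id": 1, "host": "shard1b.example:27017"},
    ],
})
```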
5 – Again, this was pretty easy; a mongos is in essence a mongod that uses other mongods as its storage.
6 – This was also pretty easy (you know what is coming next, right?). I set up my two main collections to be sharded using a hashed ID so they balance better, since the IDs of the items are ever increasing.
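In sketch form, the tie-in step against the mongos looked roughly like this; the hostnames, database and collection names are stand-ins, not my real ones.

```python
# Sketch: register both replica sets as shards and shard the two big collections
# on a hashed _id. Hostnames, database and collection names are placeholders.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example:27017")

mongos.admin.command("addShard", "rs0/shard0a.example:27017,shard0b.example:27017")
mongos.admin.command("addShard", "rs1/shard1a.example:27017,shard1b.example:27017")

# Hashing the ever-increasing item IDs spreads writes across both shards
# instead of piling them all onto the chunk with the highest range.
mongos.admin.command("enableSharding", "gw2")
mongos.admin.command("shardCollection", "gw2.items", key={"_id": "hashed"})
mongos.admin.command("shardCollection", "gw2.listings", key={"_id": "hashed"})
```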
7 – My unlucky number, Murphy brought friends.
Balancing was slow AF between the two CX21s, so I resized them to CX51s until the data had balanced out.
This took two days to balance ~30GB in ~70,000,000 documents.
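Most of my "monitoring" was just checking how the chunks were spread across the shards; something like this sketch against the mongos, not my actual tooling.

```python
# Sketch: watch the balancer by counting chunks per shard in the config database.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example:27017")

# Chunk counts converge as the balancer migrates data between the shards.
for row in mongos.config.chunks.aggregate([
    {"$group": {"_id": "$shard", "chunks": {"$sum": 1}}}
]):
    print(row["_id"], row["chunks"])

# And whether the balancer is currently running a migration round.
print(mongos.admin.command("balancerStatus"))
```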
By this point the upgrade had taken far longer than planned.
When done I moved the two database servers back down to the CX21 level.
8 – This is exactly the point where things got worse.
In my testing I found that using the two CX21 servers in a sharded setup gave worse performance than the previous standalone CX21.
I tested this by running a script and logging its runtime; this script previously ran every 15 min.
On the standalone it took roughly 6 min to complete.
With the two shards it took 38 min.
sighs
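For the record, the benchmark was nothing fancy, basically a stopwatch around the existing job; something along these lines, where run_parser() is a stand-in for the real script.

```python
# Sketch: time one full run of the parser job. run_parser() is a stand-in
# for the real script that normally runs every 15 minutes.
import time


def run_parser() -> None:
    ...  # placeholder for the actual GW2 API parsing / DB updates


start = time.monotonic()
run_parser()
print(f"run took {(time.monotonic() - start) / 60:.1f} min")
```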
Originally I had set up the servers in different datacenters for redundancy, so let's see if latency was causing any of the issues. Moving them into the same datacenter took a few days to sort out: 30 min.
groans
Getting closer, what else could I do? Add another shard! This took another day to balance out: 20 min.
screams
Upgraded all 3 servers to CX51s: 10 min.
breaks down into a gibbering mess..
What have I learnt? That it is probably best to scale MongoDB vertically before horizontally, and quite a bit about server setups and scaling in general.
That brings me to the current stage.
I am in the process of removing the shards and going back to a standalone DB, but with upgraded resources. I may keep it at the CX51 level or go one lower to a CX41; either one will be able to hold a large chunk of the DB in RAM.
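A quick way to sanity-check the "fits in RAM" idea is to compare data plus index size against the configured WiredTiger cache; a sketch, with the database name as a placeholder.

```python
# Sketch: compare data + index size against the configured WiredTiger cache.
# The database name is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

stats = client["gw2"].command("dbStats")  # sizes are reported in bytes
hot_gb = (stats["dataSize"] + stats["indexSize"]) / 1024 ** 3

cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
cache_gb = cache["maximum bytes configured"] / 1024 ** 3
print(f"data+indexes: {hot_gb:.1f} GiB, WiredTiger cache: {cache_gb:.1f} GiB")
```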
When I get to that point I will test its speed again, and hopefully it is back down to ~6 min. If it's not, then I will know that the previous speed came from that script running on the same server as the DB.
But that's a while off; I started the shard removal over 24 hours ago and it is about a quarter done now.
What are the consequences?
My hair changed from #C0C0C0 to #FFFFFFuuuccccckkkkkkk from the stress of it all.
It subsequently changed to #F00bar due to repeated head_meets_desk.sh
My main GW2 API has been in a half-active state for over a week now, something that does not sit well with me.
Is that good enough, Greaka? Satisfied now?
Well, I am off to bed now; the sun is beginning to rise.
Downs remaining coffee, posts this, and collapses on keyboard…