We run MongoDB 2.6.0 on a production system configured with a replica set of 3 members:
- Primary
- Secondary
- Arbiter
About 3 weeks ago we had to move the data files from one data centre to another due to disk size problems. The database size had reached 15.6TB. We therefore copied only the data from the PRIMARY node to the new data centre and started the SECONDARY node completely empty so that it would recover via full replication. You can read more about the procedure in this blog post:
Due to the large size of the database we had to check the recovery replication process daily. During the last 3 days the status of the replica was, as usual, STARTUP2, and the logs showed that it was building its indexes.
In the meantime, Nagios monitoring showed an increase in the replication lag.
This made us quite skeptical until today, when during the daily check we found that replication had stopped with an error. What a coincidence!
We also ran rs.status() on the PRIMARY to check the state of the replica set members.
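The replication lag that monitoring reports can also be derived from the status document itself. Below is a minimal sketch, assuming a document shaped like the `replSetGetStatus` result (the command behind rs.status()), with `members`, `name`, `stateStr` and `optimeDate` fields. The host names and timestamps are hypothetical, not taken from our system.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: time behind the PRIMARY} for data-bearing members."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lags = {}
    for member in members:
        # The PRIMARY is the reference point; an arbiter holds no data.
        if member["stateStr"] in ("PRIMARY", "ARBITER"):
            continue
        lags[member["name"]] = primary["optimeDate"] - member["optimeDate"]
    return lags


# Hypothetical status document with made-up hosts and times.
sample_status = {
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2014, 7, 20, 12, 0, 0)},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 7, 20, 10, 0, 0)},
        {"name": "db3.example.net:27017", "stateStr": "ARBITER"},
    ]
}

for name, lag in replication_lag(sample_status).items():
    print(name, lag)
```

With a live connection, the same kind of document could probably be fetched through pymongo with `client.admin.command("replSetGetStatus")`, but that part is not shown here.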
After searching the internet we found issue SERVER-14523, where someone had reported the same problem as ours. We also reread the official documentation on how to Resync a Member of a Replica Set. The possible solutions to our problem were the following:
- Resize the oplog to a large value and restart initial sync replication by deleting all files in the SECONDARY node.
- Copy all data files from the PRIMARY to the SECONDARY node and restart both of them.
We rejected resizing the oplog because the database grows by 9GB every day, so the oplog would have had to be at least 81GB (9GB × 9 days) in order to contain the changes of the last 9 days. That left us with the second option.
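The sizing estimate is simple arithmetic and can be sketched as follows. This is only a rough estimate under the same assumption we made above, namely that every GB of database growth needs one GB of oplog. The function name is made up for illustration.

```python
def required_oplog_gb(daily_growth_gb, days_to_cover):
    """Rough oplog size needed to retain `days_to_cover` days of changes.

    Assumes oplog volume equals database growth, which is only an approximation.
    """
    return daily_growth_gb * days_to_cover


# Our numbers: 9GB of growth per day, 9 days of changes to retain.
print(required_oplog_gb(9, 9))  # 81
```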
That’s it! I hope it helped you!