MongoDB replica error RS102 too stale to catch up during full recovery

We run MongoDB 2.6.0 on a production system configured with a replica set of 3 members:

  • Primary
  • Secondary
  • Arbiter

About 3 weeks ago we had to move the data files from one data centre to another due to disk size problems; the database had reached 15.6TB. We therefore copied only the data from the PRIMARY node to the new data centre and started the SECONDARY node completely empty, in order to recover via a full initial sync. You can read more about the procedure in this blog post.

Due to the large size of the database we had to check the recovery process daily. During the last 3 days the status of the replica was STARTUP2 as usual, and the logs showed that it was building its indexes:

2017-11-20T08:53:55.491+0200 [rsSync] Index Build: 62017200/69819295 88%
2017-11-21T09:10:01.192+0200 [rsSync] Index Build: 66154400/69819295 94%
2017-11-22T08:33:52.992+0200 [rsSync] Index: (2/3) BTree Bottom Up Progress: 33284400/53472142 62%
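
Instead of grepping the mongod log, the same progress can also be read from db.currentOp() in the mongo shell on the SECONDARY. This is only a rough sketch: the exact wording of the msg field for index builds in 2.6, and whether db.currentOp(true) is needed to see the rsSync operation, are assumptions on our part.

// Connect to the SECONDARY and list operations that look like index builds.
// Assumption: 2.6 reports the build via the "msg"/"progress" fields; pass
// true to currentOp() so system operations (such as rsSync) are included.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.msg && op.msg.indexOf("Index Build") !== -1) {
        var pct = op.progress ? Math.round(100 * op.progress.done / op.progress.total) : "?";
        print(op.opid + "  " + op.msg + "  (" + pct + "%)");
    }
});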

In the meantime, Nagios monitoring showed an increase in the replication lag:

2017-11-20T08:53:55.491+0200 Replication lag: 586379 seconds (6.79 days)
2017-11-21T09:10:01.192+0200 Replication lag: 673901 seconds (7.8 days)
2017-11-22T08:33:52.992+0200 Nagios - Replication lag: 758020 seconds (8.77 days)
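
The lag figure Nagios reports can be reproduced roughly in the mongo shell by comparing the optimeDate of the PRIMARY with that of the syncing member in rs.status(); the member names below are the ones from our set. rs.printSlaveReplicationInfo() prints similar per-member information.

// Rough sketch: compute the replication lag of 192.168.10.5 from rs.status().
var status    = rs.status();
var primary   = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
var secondary = status.members.filter(function (m) { return m.name === "192.168.10.5:27017"; })[0];
var lagSeconds = (primary.optimeDate - secondary.optimeDate) / 1000;
print("Replication lag: " + lagSeconds + " seconds (" + (lagSeconds / 86400).toFixed(2) + " days)");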

This made us quite skeptical, until today, when during the daily check we stumbled upon an error. What a coincidence! Replication had stopped with the following error:

2017-11-23T08:54:25.937+0200 [rsBackgroundSync] replSet not trying to sync from 192.168.10.4:27017, it is vetoed for 8 more seconds
2017-11-23T08:54:25.937+0200 [rsBackgroundSync] replSet not trying to sync from 192.168.10.4:27017, it is vetoed for 8 more seconds
2017-11-23T08:54:30.000+0200 [rsBackgroundSync] replSet not trying to sync from 192.168.10.4:27017, it is vetoed for 3 more seconds
2017-11-23T08:54:30.000+0200 [rsBackgroundSync] replSet not trying to sync from 192.168.10.4:27017, it is vetoed for 3 more seconds
2017-11-23T08:54:34.062+0200 [rsBackgroundSync] replSet syncing to: 192.168.10.4:27017
2017-11-23T08:54:34.109+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from 192.168.10.4:27017
2017-11-23T08:54:34.109+0200 [rsBackgroundSync] replSet our last optime : Nov 13 14:06:29 5a098ac5:6
2017-11-23T08:54:34.109+0200 [rsBackgroundSync] replSet oldest at 192.168.10.4:27017 : Nov 16 10:30:19 5a0d4c9b:6
2017-11-23T08:54:34.109+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember

Also, when we ran the rs.status() command on the PRIMARY, it gave us the following output:

{
    "set" : "panamapapers-rs",
    "date" : ISODate("2017-11-23T07:12:11.000Z"),
    "myState" : 1,
    "members" : [ 
        {
            "_id" : 0,
            "name" : "192.168.10.4:27017",
            "health" : 1.0,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 1380243,
            "optime" : Timestamp(1511421121, 7),
            "optimeDate" : ISODate("2017-11-23T07:12:01.000Z"),
            "electionTime" : Timestamp(1510041044, 1),
            "electionDate" : ISODate("2017-11-07T07:50:44.000Z"),
            "self" : true
        }, 
        {
            "_id" : 1,
            "name" : "192.168.10.5:27017",
            "health" : 1.0,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 836950,
            "optime" : Timestamp(1510574789, 6),
            "optimeDate" : ISODate("2017-11-13T12:06:29.000Z"),
            "lastHeartbeat" : ISODate("2017-11-23T07:12:11.000Z"),
            "lastHeartbeatRecv" : ISODate("2017-11-23T07:12:10.000Z"),
            "pingMs" : 0,
            "lastHeartbeatMessage" : "error RS102 too stale to catch up"
        }, 
        {
            "_id" : 2,
            "name" : "192.168.10.9:27017",
            "health" : 1.0,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 1380095,
            "lastHeartbeat" : ISODate("2017-11-23T07:12:10.000Z"),
            "lastHeartbeatRecv" : ISODate("2017-11-23T07:12:10.000Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1.0
}

After searching the internet we found issue SERVER-14523, where someone had the same problem as ours. We also re-read the official documentation on how to Resync a Member of a Replica Set. The error means that the SECONDARY’s last applied operation (Nov 13) was already older than the oldest entry still present in the PRIMARY’s oplog (Nov 16), so it could no longer catch up from the oplog. The possible solutions to our problem were the following:

  1. Resize the oplog to a larger value and restart the initial sync by deleting all data files on the SECONDARY node (a rough sketch of the resize step follows this list).
  2. Copy all data files from the PRIMARY to the SECONDARY node and restart both of them.
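
For reference, on the 2.6 / MMAPv1 line resizing the oplog is a manual procedure (see the "Change the Size of the Oplog" tutorial): the member is restarted as a standalone mongod, the newest oplog entry is saved, the oplog is recreated as a larger capped collection, and the entry is put back. The sketch below is only an outline of that procedure; the 100GB size is a placeholder, not a recommendation.

// Run in the mongo shell against the member whose oplog is being resized,
// after restarting it as a standalone mongod (no --replSet).
use local

// Keep the newest oplog entry so the member knows where it stopped.
db.temp.drop()
db.temp.save(db.oplog.rs.find({}, { ts: 1, h: 1 }).sort({ $natural: -1 }).limit(1).next())

// Recreate the oplog as a larger capped collection and restore that entry.
db.oplog.rs.drop()
db.runCommand({ create: "oplog.rs", capped: true, size: 100 * 1024 * 1024 * 1024 })
db.oplog.rs.save(db.temp.findOne())

// Finally, restart mongod with its usual --replSet options.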

We rejected resizing the oplog because the database grows by about 9GB every day, so the oplog would have to be at least 81GB in order to contain the changes of the aforementioned last 9 days.
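
As a sanity check for this kind of sizing reasoning, the PRIMARY can report the configured oplog size and the time window the oplog currently covers:

// Run on the PRIMARY: prints "configured oplog size" and "log length start
// to end", i.e. how far back a syncing member may fall before hitting RS102.
db.printReplicationInfo()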

That’s it! I hope this helped you!
