Hello Guys. Another day, another issue.
During a Test Recovery, a user saw the recovery fail immediately with the error "A generic error occurred in the vSphere Replication Management Server. Exception details: 'com.vmware.hms.replication.sync.DeltaAbortedException'".
Once again, my initial thought was that there must be a network connectivity issue between SRM and vSphere Replication. When I checked, all the required communication was working as described in the VMware KB article.
My next thought was that the hbr service might be having some issue. But "systemctl status hbrsrv.service" showed the service as active and healthy, with no errors reported.
Even /var/log/vmware/hbrsrv.log showed no errors other than the one mentioned above.
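To see quickly whether that exception is the only thing showing up in the log, a small grep helper is enough. This is just a sketch: count_log_matches is a name I made up for illustration, and it simply counts the lines containing a fixed string.

```shell
# Sketch only: count_log_matches is an illustrative helper name, not a VMware tool.
# Print how many lines of a log file contain the given text.
# -F treats the text as a fixed string rather than a regular expression.
count_log_matches() {
    local pattern="$1" logfile="$2"
    grep -c -F -- "$pattern" "$logfile"
}
```

On the appliance this would be called as: count_log_matches DeltaAbortedException /var/log/vmware/hbrsrv.log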
After some digging and talking to internal folks, I understood that this could be caused by a large number of incoming or outgoing event entries in the vSphere Replication database.
So I checked the database with the commands below and confirmed that the counts were very high.
/opt/vmware/vpostgres/current/bin/psql -U vrmsdb -c 'select count(*) from incomingeventlogentity;'
/opt/vmware/vpostgres/current/bin/psql -U vrmsdb -c 'select count(*) from outgoingeventlogentity;'
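The two checks can be wrapped in a small helper so the counts are easy to compare before and after the cleanup. This is a minimal sketch, not something VMware ships: the names count_rows, report_event_counts and the PSQL variable are my own, and it assumes a count(*) query run with psql's -t and -A options so that only the number is printed.

```shell
# Sketch only: count_rows, report_event_counts and PSQL are illustrative names.
PSQL="${PSQL:-/opt/vmware/vpostgres/current/bin/psql}"

# Print the row count of one table in the vrmsdb database.
# -t drops headers and footers, -A drops column alignment.
count_rows() {
    local table="$1"
    "$PSQL" -U vrmsdb -t -A -c "select count(*) from ${table};"
}

# Print "<table>: <count>" for both event log tables.
report_event_counts() {
    local table
    for table in incomingeventlogentity outgoingeventlogentity; do
        printf '%s: %s\n' "$table" "$(count_rows "$table")"
    done
}
```

Calling report_event_counts on the appliance at each site prints one line per table.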
With the cause confirmed, the next step is to clear both tables to fix the issue.
The steps are below.
1. Stop hms service using systemctl stop hms.service at both sites (Primary and DR)
2. Run below commands on both sites HMS Servers.
/opt/vmware/vpostgres/current/bin/psql -U vrmsdb -c 'delete from incomingeventlogentity;'
/opt/vmware/vpostgres/current/bin/psql -U vrmsdb -c 'delete from outgoingeventlogentity;'
3. Start hms service using systemctl start hms.service at both sites
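The three steps above can be sketched as three small shell functions, one per step, so the order across both sites stays the same as in the list. Treat this as a rough sketch under my own assumptions: the function names and the PSQL and SYSTEMCTL variables are made up for illustration, and nothing here comes from a VMware script.

```shell
# Sketch only: function and variable names are illustrative, not VMware tooling.
PSQL="${PSQL:-/opt/vmware/vpostgres/current/bin/psql}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Step 1: stop the hms service on this appliance.
stop_hms() {
    "$SYSTEMCTL" stop hms.service
}

# Step 2: empty both event log tables.
# Returns at the first failing delete with that command's exit status.
clear_event_logs() {
    local table
    for table in incomingeventlogentity outgoingeventlogentity; do
        "$PSQL" -U vrmsdb -c "delete from ${table};" || return
    done
}

# Step 3: start the hms service again.
start_hms() {
    "$SYSTEMCTL" start hms.service
}
```

Run stop_hms on both sites first, then clear_event_logs on both, and only then start_hms on both.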
Once the services have started successfully, run the Test Recovery again and check the results.