Use this checklist to record your progress while administering a LogRhythm High Availability deployment.
System Monitoring and Status Verification
Regular Status Checks
- [ ] Open LifeKeeper GUI as administrator
- [ ] Verify all resources show "In Service" on active node
- [ ] Verify all resources show "Standby" or "Mirroring" on passive node
- [ ] Confirm no resources show "Failed" status on either node
- [ ] Check DataKeeper mirror status for each replicated volume
- [ ] Verify proper functioning of LogRhythm services
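The status checks above reduce to a simple rule per node, which the sketch below encodes. It is an illustration only: it assumes the resource states have been transcribed by hand from the LifeKeeper GUI into dictionaries, and the function name and resource tags in the usage note are hypothetical. It does not query LifeKeeper itself.

```python
from typing import Dict, List

# Acceptable states per node, taken from the checklist wording.
ACTIVE_OK = {"In Service"}
PASSIVE_OK = {"Standby", "Mirroring"}


def find_status_problems(active: Dict[str, str], passive: Dict[str, str]) -> List[str]:
    """Return readable problems; an empty list means every check passed.

    `active` and `passive` map a resource tag to the state shown in the GUI.
    Any state outside the accepted sets (including "Failed") is reported.
    """
    problems = []
    for name, state in sorted(active.items()):
        if state not in ACTIVE_OK:
            problems.append(f"active node: {name} is '{state}', expected In Service")
    for name, state in sorted(passive.items()):
        if state not in PASSIVE_OK:
            problems.append(
                f"passive node: {name} is '{state}', expected Standby or Mirroring"
            )
    return problems
```

For example, `find_status_problems({"SQL_ResTag": "In Service"}, {"SQL_ResTag": "Standby"})` returns an empty list, while a resource recorded as "Failed" on either node produces one entry.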
Physical Configuration Verification
- [ ] Confirm public network connection (NIC1) is operational on both nodes
- [ ] Verify private network connection (replication NIC) is operational between nodes
- [ ] Check that network latency between nodes is below 15 ms for dual-site deployments
- [ ] Ensure sufficient network bandwidth for replication requirements
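One way to spot-check the 15 ms latency limit is to time a few TCP handshakes to the peer node and compare the median against the threshold. This is a rough sketch, not a LogRhythm tool: the host and port are whatever reachable endpoint you choose on the other node, handshake time only approximates round-trip latency, and using the median is an assumption made here to dampen outliers.

```python
import socket
import statistics
import time
from typing import Iterable

MAX_LATENCY_MS = 15.0  # limit from the checklist for dual-site deployments


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed immediately; only the handshake is timed
    return (time.perf_counter() - start) * 1000.0


def latency_within_limit(
    samples_ms: Iterable[float], limit_ms: float = MAX_LATENCY_MS
) -> bool:
    """True when the median of the samples is strictly below the limit."""
    values = list(samples_ms)
    if not values:
        raise ValueError("no latency samples collected")
    return statistics.median(values) < limit_ms
```

A typical use is to collect several samples with `tcp_connect_ms` against the peer node's replication address and pass the list to `latency_within_limit`.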
Service Management
Protected Service Operations
- [ ] To stop a protected service:
  - [ ] Select service in LifeKeeper GUI
  - [ ] Right-click and select "Out of Service"
  - [ ] Confirm at prompt
- [ ] To start a protected service:
  - [ ] Select service in LifeKeeper GUI
  - [ ] Right-click and select "In Service"
  - [ ] Confirm at prompt
- [ ] To activate services on other node:
  - [ ] Select Resource Hierarchy on Standby node
  - [ ] Right-click and select "In Service"
  - [ ] Confirm at prompt
Service Status Verification
- [ ] Verify all protected services are set to Manual Startup type
- [ ] Confirm only LifeKeeper GUI is used to manage services
- [ ] Never use Windows Services MMC or other tools to manage protected services
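The Manual startup type can be confirmed without opening any management tool by reading the output of the Windows `sc qc` command, which reports `START_TYPE : 3 DEMAND_START` for a Manual service. The sketch below only reads configuration and never starts, stops, or reconfigures a service, so it stays within the rule above. The function names are illustrative, and the service names you pass in are your own.

```python
import re
import subprocess

START_TYPE_RE = re.compile(r"START_TYPE\s*:\s*(\d+)")
MANUAL = 3  # numeric code that sc reports for DEMAND_START (Manual)


def start_type_from_sc_output(text: str) -> int:
    """Extract the numeric START_TYPE from `sc qc` output."""
    match = START_TYPE_RE.search(text)
    if match is None:
        raise ValueError("START_TYPE not found in sc qc output")
    return int(match.group(1))


def is_manual_startup(service_name: str) -> bool:
    """Run `sc qc <service>` (Windows only) and report whether startup is Manual."""
    result = subprocess.run(
        ["sc", "qc", service_name], capture_output=True, text=True, check=True
    )
    return start_type_from_sc_output(result.stdout) == MANUAL
```

Keeping the parser separate from the command call makes it possible to check saved `sc qc` output from either node without running anything on the node itself.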
Volume Management
Volume Status Verification
- [ ] Check status of replicated volumes:
  - [ ] D: (Data volume)
  - [ ] L: (SQL Logs volume)
- [ ] Verify replicated volumes are not directly accessible on Standby node
- [ ] Confirm only replication engine writes to target volumes
SQL Management
SQL Component Verification
- [ ] Verify SQL Server (MSSQLSERVER) service is protected and running
- [ ] Confirm SQL Server Agent (MSSQLSERVER) service is protected and running
- [ ] Check that Distributed Transaction Coordinator is protected and running
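The "running" half of these checks can be read from `sc query` output, which includes a line such as `STATE : 4 RUNNING`. The sketch below parses captured output for several services and lists any that are not running. It is a read-only aid: whether a service is protected still has to be confirmed in the LifeKeeper GUI. The dictionary keys in the usage note are assumed service names for a default SQL instance and may differ in your deployment.

```python
import re
from typing import Dict, List

STATE_RE = re.compile(r"STATE\s*:\s*\d+\s+(\w+)")


def service_state(sc_query_output: str) -> str:
    """Return the state word (for example RUNNING or STOPPED) from `sc query` output."""
    match = STATE_RE.search(sc_query_output)
    if match is None:
        raise ValueError("STATE not found in sc query output")
    return match.group(1)


def not_running(outputs: Dict[str, str]) -> List[str]:
    """Given service name -> captured `sc query` output, list services not RUNNING."""
    return sorted(
        name for name, text in outputs.items() if service_state(text) != "RUNNING"
    )
```

For instance, passing captured output keyed by `MSSQLSERVER`, `SQLSERVERAGENT`, and `MSDTC` returns an empty list when all three report RUNNING.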
SQL Privileged Account Management
- [ ] Update SQL account credentials in LifeKeeper (when needed):
  - [ ] Select SQL_ResTag Resource Hierarchy in LifeKeeper GUI
  - [ ] Select Properties
  - [ ] Select Admin Actions
  - [ ] Select Manage User
  - [ ] Select Change Password or Change User and Password
  - [ ] Enter new credentials
  - [ ] Click Done
IP and Name Management
IP Address Configuration
- [ ] Verify shared IP address is properly configured on active node
- [ ] Confirm all LogRhythm services use shared IP address (not node-specific IP)
- [ ] Check that shared IP address switches properly during failover
Name Resolution
- [ ] Verify shared Windows Name resolution works correctly
- [ ] Confirm DNS name resolution (if configured) functions properly
- [ ] Ensure all services reference shared Windows Name
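A quick way to confirm that the shared name resolves to the shared IP address is a small lookup from a client machine. The sketch below is a minimal example under assumptions: the name and address in the usage note are placeholders, and the lookup uses whatever resolution order the operating system applies, so it does not distinguish DNS from other name services.

```python
import socket
from typing import Callable


def resolves_to_shared_ip(
    shared_name: str,
    shared_ip: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """True when shared_name resolves to exactly shared_ip (IPv4).

    `resolver` is injectable so the check can be exercised without a network.
    A failed lookup is treated as "does not resolve" rather than an error.
    """
    try:
        return resolver(shared_name) == shared_ip
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Called as `resolves_to_shared_ip("lr-ha", "10.0.0.50")` with your own shared name and address, it returns `False` if the name still points at a node-specific IP.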
Planned Maintenance Procedures
Controlled Switchover Procedure
- [ ] Notify all stakeholders before planned switchover
- [ ] Select Resource Hierarchy on Standby node
- [ ] Right-click and select "In Service"
- [ ] Confirm at prompt
- [ ] Wait for all services to start on new active node
- [ ] Verify functionality after switchover
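The "wait for all services to start" step is easier to apply consistently with an explicit timeout. The helper below is a generic polling sketch; the default timeout and interval are arbitrary assumptions, and the `check` callable stands in for whatever readiness test you use (for example, a service-state query on the new active node).

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool], timeout_s: float = 300.0, interval_s: float = 5.0
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True as soon as the check passes, False if the deadline is reached.
    The check always runs at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

If the function returns `False`, treat the switchover as incomplete and move to the troubleshooting steps rather than continuing with functional verification.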
Software Updates
- [ ] Apply OS updates to both nodes individually
- [ ] Install SQL updates on both nodes
- [ ] Apply LogRhythm updates to both nodes
- [ ] Always test functionality after updates
Recovery Procedures
Handling Unplanned Failovers
- [ ] After unplanned failover, verify services running on new active node
- [ ] Reboot system that experienced failure to ensure proper cluster rejoining
- [ ] Verify Resource Hierarchies show Standby or Mirroring state after reboot
- [ ] Contact LogRhythm Support if any resources show Failed status after reboot
Backup Management
- [ ] Perform regular backups of LogRhythm Archives
- [ ] Back up LogRhythm EMDB regularly
- [ ] Store backups securely off-site
- [ ] Test backup restoration procedures periodically
System Monitor Configuration for HA
Configure Syslog for HA Operations
- [ ] Launch LogRhythm Console
- [ ] Navigate to Deployment Manager > System Monitor Agents tab
- [ ] Double-click System Monitor Agent on HA system
- [ ] Click Advanced in bottom-left corner
- [ ] Verify SyslogServerNIC value is set to shared IP address
- [ ] Update value if necessary and save changes
Configure NetFlow for HA Operations
- [ ] Launch LogRhythm Console
- [ ] Navigate to Deployment Manager > System Monitor Agents tab
- [ ] Double-click System Monitor Agent on HA system
- [ ] Click Advanced in bottom-left corner
- [ ] Verify NetflowServerNIC value is set to shared IP address
- [ ] Update value if necessary and save changes
Configure sFlow for HA Operations
- [ ] Launch LogRhythm Console
- [ ] Navigate to Deployment Manager > System Monitor Agents tab
- [ ] Double-click System Monitor Agent on HA system
- [ ] Click Advanced in bottom-left corner
- [ ] Verify sFlowServerNIC value is set to shared IP address
- [ ] Update value if necessary and save changes
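The Syslog, NetFlow, and sFlow sections all verify the same thing: that a named Advanced property equals the shared IP address. The sketch below checks the three properties in one pass. It assumes the values have been copied by hand from the Advanced dialog into a dictionary. It does not read from or write to the LogRhythm configuration, and the function name is illustrative.

```python
from typing import Dict, List

# Property names as they appear in the checklist sections above.
NIC_SETTINGS = ("SyslogServerNIC", "NetflowServerNIC", "sFlowServerNIC")


def mismatched_nic_settings(advanced: Dict[str, str], shared_ip: str) -> List[str]:
    """Return the property names whose value is missing or not the shared IP.

    Surrounding whitespace in a recorded value is ignored.
    """
    return [
        key for key in NIC_SETTINGS if advanced.get(key, "").strip() != shared_ip
    ]
```

An empty result means all three values match. Any name returned is a property to update and save in the agent's Advanced settings.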
Configure SNMP for HA Operations
- [ ] Launch LogRhythm Console
- [ ] Navigate to Deployment Manager > System Monitor Agents tab
- [ ] Double-click System Monitor Agent on HA system
- [ ] Select SNMP Trap Receiver tab
- [ ] Verify "Enable SNMP Trap Receiver" checkbox is selected
- [ ] Confirm Address is set to shared IP address
- [ ] Update address if necessary and save changes
Troubleshooting Procedures
Common Issues Resolution
- [ ] For service startup failures:
  - [ ] Check LifeKeeper GUI for resource dependencies
  - [ ] Verify all required services are running
  - [ ] Check Windows Event Log for errors
- [ ] For replication issues:
  - [ ] Verify network connectivity between nodes
  - [ ] Check disk space on target volumes
  - [ ] Review DataKeeper logs
- [ ] For SQL connectivity problems:
  - [ ] Verify SQL services are running
  - [ ] Check SQL error logs
  - [ ] Confirm service account permissions
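For the disk space step, a short script can report free space on a volume where the file system is readable. This is a sketch with caveats: the 10% threshold is an arbitrary assumption rather than a LogRhythm or DataKeeper requirement, and because replicated volumes are not directly accessible on the Standby node, this approach may only work on the node where the volume is currently in service.

```python
import shutil
from typing import Iterable, List


def free_percent(path: str) -> float:
    """Percentage of free space on the volume that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100.0


def low_space_volumes(
    paths: Iterable[str], min_free_percent: float = 10.0
) -> List[str]:
    """Return the paths whose volume has less free space than the threshold."""
    return [p for p in paths if free_percent(p) < min_free_percent]
```

On a node where both replicated volumes are accessible, `low_space_volumes(["D:\\", "L:\\"])` lists any volume below the threshold.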
When to Contact Support
- [ ] Resource shows "Failed" status even after system reboot
- [ ] Replication fails to synchronize after network interruption
- [ ] Services fail to start on either node
- [ ] Any resource dependencies appear broken in LifeKeeper GUI
- [ ] SQL authentication or permission issues persist after credential updates