High Availability Administration Checklist
Use this checklist to record your progress while administering a LogRhythm High Availability deployment.
System Monitoring and Status Verification
Regular Status Checks
[ ] Open LifeKeeper GUI as administrator
[ ] Verify all resources show "In Service" on active node
[ ] Verify all resources show "Standby" or "Mirroring" on passive node
[ ] Confirm no resources show "Failed" status on either node
[ ] Check DataKeeper mirror status for each replicated volume
[ ] Verify proper functioning of LogRhythm services
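The status rules above can be sketched as a small check over states recorded by hand from the LifeKeeper GUI. This is only an illustration: the function name and the resource names are hypothetical, and no LifeKeeper API is assumed or called.

```python
# Hypothetical helper: states are copied by hand from the LifeKeeper GUI.
# The allowed states below are the ones named in the checklist.
ACTIVE_OK = {"In Service"}
PASSIVE_OK = {"Standby", "Mirroring"}


def find_status_problems(active, passive):
    """Return a sorted list of problems for two {resource: state} maps."""
    problems = []
    checks = (("active", active, ACTIVE_OK), ("passive", passive, PASSIVE_OK))
    for node, states, allowed in checks:
        for resource, state in states.items():
            if state not in allowed:
                problems.append(f"{node}: {resource} is '{state}'")
    return sorted(problems)
```

An empty list means every recorded resource is in an expected state for its node. Any "Failed" entry appears in the result because it is not in either allowed set.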
Physical Configuration Verification
[ ] Confirm public network connection (NIC1) is operational on both nodes
[ ] Verify private network connection (replication NIC) is operational between nodes
[ ] Check that network latency between nodes is below 15 ms for dual-site deployments
[ ] Ensure sufficient network bandwidth for replication requirements
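One rough way to check the 15 ms figure is to time a few TCP handshakes to the peer node and compare the median against the limit. The sketch below makes several assumptions: TCP connect time is only an approximation of network latency, the peer must have a reachable listening port, and the function names are hypothetical.

```python
import socket
import statistics
import time

LATENCY_LIMIT_MS = 15.0  # threshold from the checklist for dual-site deployments


def tcp_connect_ms(host, port, timeout=2.0):
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed as soon as the handshake completes
    return (time.perf_counter() - start) * 1000.0


def latency_within_limit(samples_ms, limit_ms=LATENCY_LIMIT_MS):
    """True when the median of the samples is strictly below the limit."""
    return statistics.median(samples_ms) < limit_ms
```

For example, collecting ten samples with `tcp_connect_ms(peer_address, peer_port)` and passing the list to `latency_within_limit` gives a single pass/fail answer; `peer_address` and `peer_port` are placeholders for values from your own deployment. The median is used so that one slow sample does not fail the check.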
Service Management
Protected Service Operations
[ ] To stop a protected service:
  [ ] Select service in LifeKeeper GUI
  [ ] Right-click and select "Out of Service"
  [ ] Confirm at prompt
[ ] To start a protected service:
  [ ] Select service in LifeKeeper GUI
  [ ] Right-click and select "In Service"
  [ ] Confirm at prompt
[ ] To activate services on other node:
  [ ] Select Resource Hierarchy on Standby node
  [ ] Right-click and select "In Service"
  [ ] Confirm at prompt
Service Status Verification
[ ] Verify all protected services are set to Manual Startup type
[ ] Confirm only LifeKeeper GUI is used to manage services
[ ] Never use Windows Services MMC or other tools to manage protected services
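The Manual startup check can be done read-only by inspecting the text that the Windows command `sc qc <service name>` prints, where a Manual service is reported as `DEMAND_START`. The sketch below only parses text that has already been captured. It does not start, stop, or reconfigure anything, so it stays within the rule that protected services are managed only through the LifeKeeper GUI. The function names are hypothetical.

```python
import re


def start_type_from_sc_qc(output):
    """Extract the START_TYPE name (e.g. DEMAND_START) from `sc qc` text, or None."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", output)
    return match.group(1) if match else None


def is_manual_startup(output):
    """True when the captured `sc qc` text reports a Manual (demand) start type."""
    return start_type_from_sc_qc(output) == "DEMAND_START"
```

Any service for which `is_manual_startup` returns False should be reviewed before the next failover.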
Volume Management
Volume Status Verification
[ ] Check status of replicated volumes:
  [ ] D: (Data volume)
  [ ] L: (SQL Logs volume)
[ ] Verify replicated volumes are not directly accessible on Standby node
[ ] Confirm only replication engine writes to target volumes
SQL Management
SQL Component Verification
[ ] Verify SQL Server (MSSQLSERVER) service is protected and running
[ ] Confirm SQL Server Agent (MSSQLSERVER) service is protected and running
[ ] Check Distributed Transaction Coordinator is protected and running
SQL Privileged Account Management
[ ] Update SQL account credentials in LifeKeeper (when needed):
  [ ] Select SQL_ResTag Resource Hierarchy in LifeKeeper GUI
  [ ] Select Properties
  [ ] Select Admin Actions
  [ ] Select Manage User
  [ ] Select Change Password or Change User and Password
  [ ] Enter new credentials
  [ ] Click Done
IP and Name Management
IP Address Configuration
[ ] Verify shared IP address is properly configured on active node
[ ] Confirm all LogRhythm services use shared IP address (not node-specific IP)
[ ] Check that shared IP address switches properly during failover
Name Resolution
[ ] Verify shared Windows Name resolution works correctly
[ ] Confirm DNS name resolution (if configured) functions properly
[ ] Ensure all services reference shared Windows Name
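As a minimal sketch of the name resolution check, the function below asks the system resolver for the shared name and confirms that the shared IP address is among the results. The function name, and any name and address passed to it, are hypothetical. The `resolver` parameter exists only so the logic can be exercised without a live DNS lookup.

```python
import socket


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """True when expected_ip is among the IPv4 addresses the name resolves to."""
    try:
        _hostname, _aliases, addresses = resolver(name)
    except OSError:
        # Covers resolution failures such as an unknown host.
        return False
    return expected_ip in addresses
```

Running the check from a client machine as well as from both nodes helps confirm that the shared name resolves consistently across the deployment.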
Planned Maintenance Procedures
Controlled Switchover Procedure
[ ] Notify all stakeholders before planned switchover
[ ] Select Resource Hierarchy on Standby node
[ ] Right-click and select "In Service"
[ ] Confirm at prompt
[ ] Wait for all services to start on new active node
[ ] Verify functionality after switchover
Software Updates
[ ] Apply OS updates to both nodes individually
[ ] Install SQL updates on both nodes
[ ] Apply LogRhythm updates to both nodes
[ ] Always test functionality after updates
Recovery Procedures
Handling Unplanned Failovers
[ ] After unplanned failover, verify services running on new active node
[ ] Reboot system that experienced failure so that it rejoins the cluster properly
[ ] Verify Resource Hierarchies show Standby or Mirroring state after reboot
[ ] Contact LogRhythm Support if any resources show Failed status after reboot
Backup Management
[ ] Perform regular backups of LogRhythm Archives
[ ] Back up LogRhythm EMDB regularly
[ ] Store backups securely off-site
[ ] Test backup restoration procedures periodically
System Monitor Configuration for HA
Configure Syslog for HA Operations
[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify SyslogServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes
Configure NetFlow for HA Operations
[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify NetflowServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes
Configure sFlow for HA Operations
[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify sFlowServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes
Configure SNMP for HA Operations
[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Select SNMP Trap Receiver tab
[ ] Verify "Enable SNMP Trap Receiver" checkbox is selected
[ ] Confirm Address is set to shared IP address
[ ] Update address if necessary and save changes
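The four procedures above share one rule: each listener setting must hold the shared IP address. A small sketch of that comparison follows, assuming the values are recorded by hand from the LogRhythm Console. The function name is hypothetical and no LogRhythm API is involved.

```python
def mismatched_settings(recorded, shared_ip):
    """Return the recorded settings whose value differs from the shared IP."""
    return {
        name: value
        for name, value in recorded.items()
        if value.strip() != shared_ip  # ignore stray whitespace from copy and paste
    }
```

Passing a dictionary keyed by `SyslogServerNIC`, `NetflowServerNIC`, `sFlowServerNIC`, and the SNMP Trap Receiver address returns only the entries that still need to be updated.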
Troubleshooting Procedures
Common Issues Resolution
[ ] For service startup failures:
  [ ] Check LifeKeeper GUI for resource dependencies
  [ ] Verify all required services are running
  [ ] Check Windows Event Log for errors
[ ] For replication issues:
  [ ] Verify network connectivity between nodes
  [ ] Check disk space on target volumes
  [ ] Review DataKeeper logs
[ ] For SQL connectivity problems:
  [ ] Verify SQL services are running
  [ ] Check SQL error logs
  [ ] Confirm service account permissions
When to Contact Support
[ ] Resource shows "Failed" status even after system reboot
[ ] Replication fails to synchronize after network interruption
[ ] Services fail to start on either node
[ ] Any resource dependencies appear broken in LifeKeeper GUI
[ ] SQL authentication or permission issues persist after credential updates