High Availability Administration Checklist

Use this checklist to record your progress while administering a LogRhythm High Availability deployment.

System Monitoring and Status Verification

Regular Status Checks

[ ] Open LifeKeeper GUI as administrator
[ ] Verify all resources show "In Service" on active node
[ ] Verify all resources show "Standby" or "Mirroring" on passive node
[ ] Confirm no resources show "Failed" status on either node
[ ] Check DataKeeper mirror status for each replicated volume
[ ] Verify proper functioning of LogRhythm services
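
The status rules above can be written down as a small check. This is only a sketch: it assumes the resource states have been copied by hand from the LifeKeeper GUI into two dictionaries, and it does not call any LifeKeeper interface. The function name and the resource names in the usage note are hypothetical.

```python
# Expected states taken from the checklist items above.
ACTIVE_OK = {"In Service"}
PASSIVE_OK = {"Standby", "Mirroring"}


def find_status_problems(active, passive):
    """Return a list of readable problems; an empty list means every check passed.

    `active` and `passive` map a resource name to the state shown in the GUI.
    """
    problems = []
    for name, state in active.items():
        if state not in ACTIVE_OK:
            problems.append(
                f"active node: {name} is '{state}', expected In Service"
            )
    for name, state in passive.items():
        if state not in PASSIVE_OK:
            problems.append(
                f"passive node: {name} is '{state}', expected Standby or Mirroring"
            )
    return problems
```

For example, `find_status_problems({"SQL": "Failed"}, {"Vol.D": "Mirroring"})` would report one problem for the active node. A "Failed" state is flagged on either node because it is in neither set of accepted states.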

Physical Configuration Verification

[ ] Confirm public network connection (NIC1) is operational on both nodes
[ ] Verify private network connection (replication NIC) is operational between nodes
[ ] Check that network latency between nodes is below 15 ms for dual-site deployments
[ ] Ensure sufficient network bandwidth for replication requirements
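
One way to spot-check the latency item is to time a few TCP handshakes from one node to the other and compare them with the 15 ms limit. The sketch below makes two assumptions that are not in the checklist: a TCP connect time is treated as a rough stand-in for one network round trip, and the check passes only if every sample is below the limit. The helper names are hypothetical, and the host and port to probe are left to the administrator.

```python
import socket
import time

LATENCY_LIMIT_MS = 15.0  # limit from the checklist item for dual-site deployments


def measure_tcp_connect_ms(host, port, timeout=2.0):
    """Time one TCP handshake to host:port and return it in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed again as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def latency_ok(samples_ms, limit_ms=LATENCY_LIMIT_MS):
    """True only if there is at least one sample and all samples are below the limit."""
    return bool(samples_ms) and max(samples_ms) < limit_ms
```

Collecting several samples with `measure_tcp_connect_ms` and passing the list to `latency_ok` gives a quick pass or fail. A dedicated network tool is still the better source when a result is borderline.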

Service Management

Protected Service Operations

[ ] To stop a protected service:
- [ ] Select service in LifeKeeper GUI
- [ ] Right-click and select "Out of Service"
- [ ] Confirm at prompt
[ ] To start a protected service:
- [ ] Select service in LifeKeeper GUI
- [ ] Right-click and select "In Service"
- [ ] Confirm at prompt
[ ] To activate services on other node:
- [ ] Select Resource Hierarchy on Standby node
- [ ] Right-click and select "In Service"
- [ ] Confirm at prompt

Service Status Verification

[ ] Verify all protected services are set to Manual Startup type
[ ] Confirm only LifeKeeper GUI is used to manage services
[ ] Never use Windows Services MMC or other tools to manage protected services
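
The Manual startup type can be confirmed without changing anything by reading a service's configuration with the Windows `sc qc` command, which only queries and does not start or stop the service. The sketch below parses that command's text output. It assumes the usual English output layout, in which a Manual service is reported as `START_TYPE : 3 DEMAND_START`; the function names are hypothetical.

```python
import re


def start_type(sc_qc_output):
    """Return the START_TYPE name from `sc qc` output (e.g. 'DEMAND_START'), or None."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def is_manual(sc_qc_output):
    """True if the service is configured for Manual (demand) start."""
    return start_type(sc_qc_output) == "DEMAND_START"
```

On a node, the text could be captured with `subprocess.run(["sc", "qc", "MSSQLSERVER"], capture_output=True, text=True).stdout` and passed to `is_manual`. Any service that is not reported as Manual should be reviewed before the next failover.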

Volume Management

Volume Status Verification

[ ] Check status of replicated volumes:
- [ ] D: (Data volume)
- [ ] L: (SQL Logs volume)
[ ] Verify replicated volumes are not directly accessible on Standby node
[ ] Confirm only replication engine writes to target volumes

SQL Management

SQL Component Verification

[ ] Verify SQL Server (MSSQLSERVER) service is protected and running
[ ] Confirm SQL Server Agent (MSSQLSERVER) service is protected and running
[ ] Check Distributed Transaction Coordinator is protected and running

SQL Privileged Account Management

[ ] Update SQL account credentials in LifeKeeper (when needed):
1. [ ] Select SQL_ResTag Resource Hierarchy in LifeKeeper GUI
2. [ ] Select Properties
3. [ ] Select Admin Actions
4. [ ] Select Manage User
5. [ ] Select Change Password or Change User and Password
6. [ ] Enter new credentials
7. [ ] Click Done

IP and Name Management

IP Address Configuration

[ ] Verify shared IP address is properly configured on active node
[ ] Confirm all LogRhythm services use shared IP address (not node-specific IP)
[ ] Check that shared IP address switches properly during failover

Name Resolution

[ ] Verify shared Windows Name resolution works correctly
[ ] Confirm DNS name resolution (if configured) functions properly
[ ] Ensure all services reference shared Windows Name
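
The name resolution checks can be scripted from any machine that should reach the deployment through the shared name. This is a minimal sketch that assumes the shared name is expected to resolve to the shared IPv4 address; the function name, and the host name and address in the usage note, are hypothetical placeholders.

```python
import socket


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """True if `name` resolves and `expected_ip` is among its IPv4 addresses.

    `resolver` can be replaced in tests; by default the system resolver is used.
    """
    try:
        _, _, addresses = resolver(name)
    except OSError:  # lookup failures (socket.gaierror, socket.herror) land here
        return False
    return expected_ip in addresses
```

For example, `resolves_to("lr-ha", "10.0.0.50")` would return False if the name does not resolve at all or if it resolves to a node-specific address instead of the shared one.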

Planned Maintenance Procedures

Controlled Switchover Procedure

[ ] Notify all stakeholders before planned switchover
[ ] Select Resource Hierarchy on Standby node
[ ] Right-click and select "In Service"
[ ] Confirm at prompt
[ ] Wait for all services to start on new active node
[ ] Verify functionality after switchover

Software Updates

[ ] Apply OS updates to both nodes individually
[ ] Install SQL updates on both nodes
[ ] Apply LogRhythm updates to both nodes
[ ] Always test functionality after updates

Recovery Procedures

Handling Unplanned Failovers

[ ] After unplanned failover, verify services running on new active node
[ ] Reboot system that experienced failure to ensure proper cluster rejoining
[ ] Verify Resource Hierarchies show Standby or Mirroring state after reboot
[ ] Contact LogRhythm Support if any resources show Failed status after reboot

Backup Management

[ ] Perform regular backups of LogRhythm Archives
[ ] Back up LogRhythm EMDB regularly
[ ] Store backups securely off-site
[ ] Test backup restoration procedures periodically

System Monitor Configuration for HA

Configure Syslog for HA Operations

[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify SyslogServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes

Configure NetFlow for HA Operations

[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify NetflowServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes

Configure sFlow for HA Operations

[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Click Advanced in bottom-left corner
[ ] Verify sFlowServerNIC value is set to shared IP address
[ ] Update value if necessary and save changes
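
The Syslog, NetFlow, and sFlow checks above all compare one advanced property with the shared IP address, so they can be recorded and checked together. The sketch below assumes the three values have been copied by hand from the agent's Advanced properties into a dictionary; it does not read them from LogRhythm. The function name is hypothetical.

```python
# Property names as they appear in the checklist items above.
NIC_SETTINGS = ("SyslogServerNIC", "NetflowServerNIC", "sFlowServerNIC")


def mismatched_settings(agent_settings, shared_ip):
    """Return {setting: current value} for each NIC setting that is not the shared IP.

    A setting missing from `agent_settings` is reported with the value None.
    """
    return {
        name: agent_settings.get(name)
        for name in NIC_SETTINGS
        if agent_settings.get(name) != shared_ip
    }
```

An empty result means all three properties already point at the shared IP address. Anything returned should be updated in the console and saved.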

Configure SNMP for HA Operations

[ ] Launch LogRhythm Console
[ ] Navigate to Deployment Manager > System Monitor Agents tab
[ ] Double-click System Monitor Agent on HA system
[ ] Select SNMP Trap Receiver tab
[ ] Verify "Enable SNMP Trap Receiver" checkbox is selected
[ ] Confirm Address is set to shared IP address
[ ] Update address if necessary and save changes

Troubleshooting Procedures

Common Issues Resolution

[ ] For service startup failures:
- [ ] Check LifeKeeper GUI for resource dependencies
- [ ] Verify all required services are running
- [ ] Check Windows Event Log for errors
[ ] For replication issues:
- [ ] Verify network connectivity between nodes
- [ ] Check disk space on target volumes
- [ ] Review DataKeeper logs
[ ] For SQL connectivity problems:
- [ ] Verify SQL services are running
- [ ] Check SQL error logs
- [ ] Confirm service account permissions

When to Contact Support

[ ] Resource shows "Failed" status even after system reboot
[ ] Replication fails to synchronize after network interruption
[ ] Services fail to start on either node
[ ] Any resource dependencies appear broken in LifeKeeper GUI
[ ] SQL authentication or permission issues persist after credential updates
