High Availability Administration Checklist

Use this checklist to record your progress while administering a LogRhythm High Availability deployment.

System Monitoring and Status Verification

Regular Status Checks

  • [ ] Open LifeKeeper GUI as administrator

  • [ ] Verify all resources show "In Service" on active node

  • [ ] Verify all resources show "Standby" or "Mirroring" on passive node

  • [ ] Confirm no resources show "Failed" status on either node

  • [ ] Check DataKeeper mirror status for each replicated volume

  • [ ] Verify proper functioning of LogRhythm services

Physical Configuration Verification

  • [ ] Confirm public network connection (NIC1) is operational on both nodes

  • [ ] Verify private network connection (replication NIC) is operational between nodes

  • [ ] Check that network latency between nodes is below 15 ms for dual-site deployments

  • [ ] Ensure sufficient network bandwidth for replication requirements
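
The latency item above can be spot-checked with a short script. The following is a minimal sketch, not a LogRhythm or LifeKeeper tool: it assumes that the time of a TCP handshake is an acceptable stand-in for network round-trip time, and that the peer node accepts TCP connections on the port you choose. The function names and the use of a median are illustrative choices; only the 15 ms limit comes from the checklist.

```python
import socket
import statistics
import time

LATENCY_LIMIT_MS = 15.0  # limit stated in the checklist for dual-site deployments


def tcp_connect_ms(host, port, timeout=2.0):
    """Time one TCP handshake to host:port, in milliseconds.

    Assumption: handshake time approximates the round-trip time between nodes.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed immediately; only the timing matters
    return (time.perf_counter() - start) * 1000.0


def latency_ok(samples_ms, limit_ms=LATENCY_LIMIT_MS):
    """Return True when the median of the samples is strictly below the limit."""
    return statistics.median(samples_ms) < limit_ms
```

For example, collecting five samples with tcp_connect_ms against the peer node's replication address and passing them to latency_ok gives a quick pass or fail. The median is used so that a single slow handshake does not fail the check; use max() instead if you prefer a stricter test.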

Service Management

Protected Service Operations

  • [ ] To stop a protected service:

    • [ ] Select service in LifeKeeper GUI

    • [ ] Right-click and select "Out of Service"

    • [ ] Confirm at prompt

  • [ ] To start a protected service:

    • [ ] Select service in LifeKeeper GUI

    • [ ] Right-click and select "In Service"

    • [ ] Confirm at prompt

  • [ ] To activate services on other node:

    • [ ] Select Resource Hierarchy on Standby node

    • [ ] Right-click and select "In Service"

    • [ ] Confirm at prompt

Service Status Verification

  • [ ] Verify all protected services are set to Manual Startup type

  • [ ] Confirm only LifeKeeper GUI is used to manage services

  • [ ] Never use Windows Services MMC or other tools to manage protected services
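
The Manual startup type can also be confirmed from a script without changing anything. The sketch below assumes an English-language Windows installation, where the built-in `sc qc` command reports a manual service as DEMAND_START. It only reads configuration, so it does not conflict with the rule that protected services are managed through the LifeKeeper GUI. The function names are illustrative, and the service names you pass in are your own.

```python
import re
import subprocess


def parse_start_type(sc_qc_output):
    """Extract the start type name (for example 'DEMAND_START') from `sc qc` output."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def is_manual(service_name):
    """Run `sc qc` (read-only) and report whether the service is set to Manual."""
    result = subprocess.run(
        ["sc.exe", "qc", service_name], capture_output=True, text=True
    )
    return parse_start_type(result.stdout) == "DEMAND_START"
```

Calling is_manual with the name of each protected service on both nodes should return True; any other result means the startup type needs to be reviewed.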

Volume Management

Volume Status Verification

  • [ ] Check status of replicated volumes:

    • [ ] D: (Data volume)

    • [ ] L: (SQL Logs volume)

  • [ ] Verify replicated volumes are not directly accessible on Standby node

  • [ ] Confirm only replication engine writes to target volumes

SQL Management

SQL Component Verification

  • [ ] Verify SQL Server (MSSQLSERVER) service is protected and running

  • [ ] Confirm SQL Server Agent (MSSQLSERVER) service is protected and running

  • [ ] Check Distributed Transaction Coordinator is protected and running
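
Whether the three SQL-related services are running can be read with the built-in `sc query` command. This is a sketch under stated assumptions: the service names below are the Windows names for a default SQL Server instance (a named instance uses different names), the output is from an English-language system, and the script is run on the active node, since these services would normally be stopped on the standby node. It confirms the running state only; whether a service is protected still has to be checked in the LifeKeeper GUI.

```python
import re
import subprocess

# Windows service names for a default SQL Server instance (verify for your deployment).
SQL_SERVICES = ["MSSQLSERVER", "SQLSERVERAGENT", "MSDTC"]


def parse_state(sc_query_output):
    """Extract the state name (for example 'RUNNING') from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_query_output)
    return match.group(1) if match else None


def running_report(services=SQL_SERVICES):
    """Return {service name: True if RUNNING} using read-only `sc query` calls."""
    report = {}
    for name in services:
        result = subprocess.run(
            ["sc.exe", "query", name], capture_output=True, text=True
        )
        report[name] = parse_state(result.stdout) == "RUNNING"
    return report
```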

SQL Privileged Account Management

  • [ ] Update SQL account credentials in LifeKeeper (when needed):

    1. [ ] Select SQL_ResTag Resource Hierarchy in LifeKeeper GUI

    2. [ ] Select Properties

    3. [ ] Select Admin Actions

    4. [ ] Select Manage User

    5. [ ] Select Change Password or Change User and Password

    6. [ ] Enter new credentials

    7. [ ] Click Done

IP and Name Management

IP Address Configuration

  • [ ] Verify shared IP address is properly configured on active node

  • [ ] Confirm all LogRhythm services use shared IP address (not node-specific IP)

  • [ ] Check that shared IP address switches properly during failover

Name Resolution

  • [ ] Verify shared Windows Name resolution works correctly

  • [ ] Confirm DNS name resolution (if configured) functions properly

  • [ ] Ensure all services reference shared Windows Name
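
Name resolution can be verified from any client that should reach the deployment. The following sketch uses Python's standard resolver to check that the shared name maps to the shared IP address. The function name and the injectable resolver parameter are illustrative, and the name and address in the usage note are placeholders, not values from this guide.

```python
import socket


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True when expected_ip is among the IPv4 addresses that name resolves to."""
    try:
        _hostname, _aliases, addresses = resolver(name)
    except OSError:
        # Covers socket.gaierror and socket.herror: the name did not resolve.
        return False
    return expected_ip in addresses
```

For example, resolves_to("lr-ha-shared", "192.0.2.50") should return True from each client after the shared name is configured. The system resolver is used, so the result reflects whatever resolution order the operating system applies.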

Planned Maintenance Procedures

Controlled Switchover Procedure

  • [ ] Notify all stakeholders before planned switchover

  • [ ] Select Resource Hierarchy on Standby node

  • [ ] Right-click and select "In Service"

  • [ ] Confirm at prompt

  • [ ] Wait for all services to start on new active node

  • [ ] Verify functionality after switchover

Software Updates

  • [ ] Apply OS updates to both nodes individually

  • [ ] Install SQL updates on both nodes

  • [ ] Apply LogRhythm updates to both nodes

  • [ ] Always test functionality after updates

Recovery Procedures

Handling Unplanned Failovers

  • [ ] After unplanned failover, verify services running on new active node

  • [ ] Reboot the system that experienced the failure so that it rejoins the cluster properly

  • [ ] Verify Resource Hierarchies show Standby or Mirroring state after reboot

  • [ ] Contact LogRhythm Support if any resources show Failed status after reboot

Backup Management

  • [ ] Perform regular backups of LogRhythm Archives

  • [ ] Back up the LogRhythm EMDB regularly

  • [ ] Store backups securely off-site

  • [ ] Test backup restoration procedures periodically

System Monitor Configuration for HA

Configure Syslog for HA Operations

  • [ ] Launch LogRhythm Console

  • [ ] Navigate to Deployment Manager > System Monitor Agents tab

  • [ ] Double-click System Monitor Agent on HA system

  • [ ] Click Advanced in bottom-left corner

  • [ ] Verify SyslogServerNIC value is set to shared IP address

  • [ ] Update value if necessary and save changes

Configure NetFlow for HA Operations

  • [ ] Launch LogRhythm Console

  • [ ] Navigate to Deployment Manager > System Monitor Agents tab

  • [ ] Double-click System Monitor Agent on HA system

  • [ ] Click Advanced in bottom-left corner

  • [ ] Verify NetflowServerNIC value is set to shared IP address

  • [ ] Update value if necessary and save changes

Configure sFlow for HA Operations

  • [ ] Launch LogRhythm Console

  • [ ] Navigate to Deployment Manager > System Monitor Agents tab

  • [ ] Double-click System Monitor Agent on HA system

  • [ ] Click Advanced in bottom-left corner

  • [ ] Verify sFlowServerNIC value is set to shared IP address

  • [ ] Update value if necessary and save changes

Configure SNMP for HA Operations

  • [ ] Launch LogRhythm Console

  • [ ] Navigate to Deployment Manager > System Monitor Agents tab

  • [ ] Double-click System Monitor Agent on HA system

  • [ ] Select SNMP Trap Receiver tab

  • [ ] Verify "Enable SNMP Trap Receiver" checkbox is selected

  • [ ] Confirm Address is set to shared IP address

  • [ ] Update address if necessary and save changes
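
The four procedures above all end with the same comparison: a recorded value must equal the shared IP address. If you note the values from the Console while working through the checklist, a small helper can flag the ones that differ. This is only a sketch; the dictionary is filled in by hand, the helper does not read anything from LogRhythm, and its name is illustrative.

```python
def mismatched_settings(recorded, shared_ip):
    """Return the sorted names of settings whose recorded value differs from shared_ip."""
    return sorted(
        name for name, value in recorded.items() if value.strip() != shared_ip
    )
```

For example, passing {"SyslogServerNIC": "192.0.2.50", "NetflowServerNIC": "192.0.2.11"} with a shared IP address of "192.0.2.50" returns ["NetflowServerNIC"], which is the setting to correct in the Console.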

Troubleshooting Procedures

Common Issues Resolution

  • [ ] For service startup failures:

    • [ ] Check LifeKeeper GUI for resource dependencies

    • [ ] Verify all required services are running

    • [ ] Check Windows Event Log for errors

  • [ ] For replication issues:

    • [ ] Verify network connectivity between nodes

    • [ ] Check disk space on target volumes

    • [ ] Review DataKeeper logs

  • [ ] For SQL connectivity problems:

    • [ ] Verify SQL services are running

    • [ ] Check SQL error logs

    • [ ] Confirm service account permissions

When to Contact Support

  • [ ] Resource shows "Failed" status even after system reboot

  • [ ] Replication fails to synchronize after network interruption

  • [ ] Services fail to start on either node

  • [ ] Any resource dependencies appear broken in LifeKeeper GUI

  • [ ] SQL authentication or permission issues persist after credential updates
