During the deployment of your LogRhythm solution, LogRhythm provides configuration and tuning to ensure that your solution starts at an optimal configuration for your log processing needs. The health and maintenance of your LogRhythm solution is crucial for its optimal performance.
Maintaining an effective, efficient, healthy LogRhythm deployment requires regular system maintenance and monitoring. While this does not require a lot of work from the end user, it is important to understand how to monitor the health of your solution and maintain a healthy deployment.
Monitor LogRhythm Health
There are many ways to keep your system healthy, and various monitoring tools can give you a status of your deployment's health:
- The Deployment Monitor provides a quick snapshot of the health of your system, including details such as the status of your hosts, host performance, database utilization, Data Processor metrics, and log volume statistics. It can be accessed from the Client Console by a LogRhythm Administrator.
- The LogRhythm Performance Counters provide visibility into the performance of the various LogRhythm components.
- The Log Processing Reports provide input into system performance and the efficiency of rules being used in log processing.
- The LogRhythm Diagnostics Tool is a standalone application that collects log and data files from LogRhythm components, runs Platform Manager database queries, and performs health, capacity, and oversubscription analysis on a LogRhythm deployment. The data is consolidated into a local .zip file for subsequent evaluation, capacity analysis and planning, and troubleshooting.
- LogRhythm Echo is a standalone Windows application that simulates a LogRhythm System Monitor Agent and allows users to replay native raw logs and PCAPs into LogRhythm to quickly build, demo, validate, verify, and tear down security use cases. Echo comes with more than 30 use cases ready for replay, and users can create, modify, and share use cases using the web interface or a text editor.
- Centralized Service Metrics captures all your performance counters from all your servers in the LogRhythm deployment and gathers them into one database that can be utilized to create dashboards showing the overall health of your deployment.
- LogRhythm diagnostic alarm rules provide alarms to notify you of errors or warnings related to the LogRhythm components. These alarm rules are imported with the Knowledge Base and are required for all deployments. The Alarm Rule Group is LogRhythm Diagnostics. These alarms should be enabled and configured to notify Global Administrators, who are the only users who can view these alarms. They are managed from the Alarm Rules tab within Deployment Manager.
For more detailed information, you can review the log files. All LogRhythm components have log files that provide specific details of the component's state and current processing. The log files are stored in the logs folder in the location where the component files were installed (for example, C:\Program Files\LogRhythm\LogRhythm Job Manager\logs) and are also available from the component's Local Configuration Manager. The components log at a level of detail as specified in the Log Level value within the properties of the component, which is configured within the Client Console. Log Level is set to ‘info’ by default, but can be updated as necessary to provide more or less detail.
Understand and Manage Oversubscription
LogRhythm Oversubscription means that the number of logs being processed in your system, given how your system is currently configured, has surpassed your deployment's size and processing capacity.
LogRhythm Appliances are sized to provide different levels of processing capability. They are also configured to support optimal processing. As you start processing more logs, and depending on the configuration of your Mediator and System Monitor Agent advanced properties, your deployment may need to be reconfigured or expanded to support these needs.
The Mediator and System Monitor Agent have advanced property values such as max memory, queue size, max queue size, max logs cached, and flush batch that are configured at your initial deployment to best serve your platform. If these values change, they could impact the performance of your solution and cause oversubscription to occur.
To fully understand oversubscription, you need to understand state and suspense handling in the System Monitor Agent and Mediator, as well as database capacity. These are described in the following sections.
Understand State Handling in System Monitor and Mediator
The System Monitor and the Mediator manage logs through queues to prevent log data loss. When these components shut down, they take whatever data they have in memory and write it to disk. Likewise, when the Agent or the Mediator starts up, the data on the disk is read back into memory to continue processing.
- On service shutdown, the System Monitor Agent writes its Message Queue to the state/processedlogs directory.
- On service startup, the System Monitor Agent reads its Message Queue from the state/processedlogs directory.
- On service shutdown, the Mediator writes its queues (unprocessed, processed, archive, loginsert, eventinsert, ldsengine) to the appropriate state directory.
- On service shutdown, the Mediator writes final queue state/suspense counts to the scmedsvr.log (INFO level).
- On service startup, the Mediator reads its state data file back into the queues.
Understand Suspense Handling in System Monitor and Mediator
A suspense state is reached when one of the following conditions is met:
- ArchiveQueue size > QueueSize
- Available state drive disk space < 10GB (that is, disk space for suspense and state spool files)
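The two conditions above can be expressed as a single check. The following is a minimal sketch, not LogRhythm code: the function name and parameters are hypothetical, and it assumes that "10GB" means 10 GiB measured in bytes.

```python
GIB = 1024 ** 3  # assumption: the 10GB threshold is treated as 10 GiB


def in_suspense(archive_queue_size: int,
                queue_size: int,
                free_state_bytes: int,
                min_free_bytes: int = 10 * GIB) -> bool:
    """Return True if either documented suspense condition holds.

    archive_queue_size / queue_size mirror the ArchiveQueue size and
    QueueSize values named above; free_state_bytes is the free space
    on the drive that holds the suspense and state spool files.
    """
    queue_overflow = archive_queue_size > queue_size
    low_disk = free_state_bytes < min_free_bytes
    return queue_overflow or low_disk
```

For example, `in_suspense(5000, 4000, 50 * GIB)` is true because the archive queue has outgrown the configured queue size, even though disk space is plentiful.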
In a suspense state, the Mediator disconnects all Agents and does not allow connections until resources have stabilized. Each service spools incoming logs to disk to maintain the processing level. The component systematically reads the spooled log data files back into the queues and processes them as soon as the log rate has decreased back to its specified operating rate. The spooled suspense files live in the component’s state directory.
When a System Monitor receives more data than its current configuration can handle, it begins removing data from memory and persisting it to disk. When the Agent is no longer under the unsustainable load, it gradually reads the disk persisted data back in to continue processing.
When a Mediator receives more data than its current configuration can handle, it begins removing data from memory and persisting it to disk. When the Mediator is no longer under the unsustainable load, it gradually reads the disk persisted data back to continue processing.
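The spill-to-disk behavior described above can be illustrated with a small bounded queue. This is only a conceptual sketch under stated assumptions: the class name, the one-file-per-log spool format, and the file naming are all hypothetical and do not reflect how the System Monitor Agent or Mediator actually store spool data.

```python
import json
import tempfile
from collections import deque
from pathlib import Path


class SpoolingQueue:
    """Keep up to max_in_memory logs in memory; spill the rest to disk."""

    def __init__(self, max_in_memory: int, spool_dir: Path):
        self.max_in_memory = max_in_memory  # assumed to be at least 1
        self.spool_dir = spool_dir
        self.memory = deque()
        self.spool_files = deque()  # oldest spool file first
        self._next_id = 0

    def put(self, log: str) -> None:
        # Once anything is spooled, keep spooling so arrival order is preserved.
        if self.spool_files or len(self.memory) >= self.max_in_memory:
            path = self.spool_dir / f"spool_{self._next_id:08d}.json"
            self._next_id += 1
            path.write_text(json.dumps(log))
            self.spool_files.append(path)
        else:
            self.memory.append(log)

    def get(self) -> str:
        log = self.memory.popleft()  # raises IndexError when the queue is empty
        # As memory frees up, gradually read spooled data back in.
        while self.spool_files and len(self.memory) < self.max_in_memory:
            path = self.spool_files.popleft()
            self.memory.append(json.loads(path.read_text()))
            path.unlink()
        return log


def demo() -> list:
    """Push four logs through a queue that holds two in memory."""
    with tempfile.TemporaryDirectory() as tmp:
        queue = SpoolingQueue(max_in_memory=2, spool_dir=Path(tmp))
        for log in ["a", "b", "c", "d"]:
            queue.put(log)
        return [queue.get() for _ in range(4)]
```

The point of the sketch is that no log is dropped while the load is unsustainable: logs that do not fit in memory wait on disk and are processed in their original order once capacity returns.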
Understand Database Capacity
Five databases exist within the LogRhythm Solution:
- Platform Manager Database. LogRhythmEMDB
- Alarms Database. LogRhythm_Alarms
- Case Management Database. LogRhythm_CMDB
- Events Database. LogRhythm_Events
- LogMart Database. LogRhythm_LogMart
The LogMart and Events databases have a time to live (TTL) setting within the Global Data Management Settings. This value determines the number of days the data for that database should be stored online before being removed by the maintenance process. Setting these values high takes up more capacity in the database.
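To see why a higher TTL consumes more capacity, a rough estimate multiplies the daily data growth by the number of days retained. The sketch below is illustrative arithmetic only: the function name is hypothetical, the growth figure is made up, and it assumes steady daily growth while ignoring indexes and other overhead.

```python
def estimate_online_size_gb(daily_growth_gb: float, ttl_days: int) -> float:
    """Approximate steady-state online size for a database with a TTL.

    Assumes constant daily growth; real sizes also depend on indexes,
    maintenance timing, and other overhead not modeled here.
    """
    return daily_growth_gb * ttl_days


# Hypothetical example: 2 GB of new event data per day.
short_ttl = estimate_online_size_gb(2.0, 30)
long_ttl = estimate_online_size_gb(2.0, 90)
print(f"30-day TTL: ~{short_ttl} GB, 90-day TTL: ~{long_ttl} GB")
```

Under these assumptions, tripling the TTL triples the online data that the maintenance process leaves in place.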
Changing the default SQL Server Collation from “SQL_Latin1_General_CP1_CI_AS” is not supported and can leave your deployment in a non-working state.
Platform Manager (EMDB), Alarms, CMDB, Events, and LogMart Databases
The EMDB, Alarms, CMDB, Events, and LogMart databases are set for auto growth, so they continue to grow until the host system runs out of disk space.
If disk space is not available, the LogRhythm components issue errors in the log files to indicate that there is a problem. You may see some of the following logs:
- Mediator. ***WARNING*** Suspend condition detected: The minimum disk space available for log data spool files is less than 1 GB
- Agent. CanLoadSpoolFiles=False : Reason=Agent run state not RUNNING or MaxLogQueueMemory exceeded
- Agent. ***WARNING*** Received Data Processor unavailable message from mediator 1
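A simple way to spot these messages is to search component log files for the key phrases. The following is a minimal sketch: the function name is hypothetical, the indicator strings are taken from the example messages above, and the exact wording may differ between LogRhythm versions.

```python
# Substrings taken from the example log messages listed above.
INDICATORS = (
    "Suspend condition detected",
    "CanLoadSpoolFiles=False",
    "Received Data Processor unavailable message",
)


def find_capacity_warnings(lines):
    """Return (line_number, indicator) for each line that matches an indicator."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for indicator in INDICATORS:
            if indicator in line:
                hits.append((number, indicator))
                break  # report each line at most once
    return hits
```

To scan a real file, pass its lines to the function, for example `find_capacity_warnings(open(path, errors="replace"))`, where `path` points to a log file in the component's logs folder.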
Enable all of the Alarm Rules in the Alarm Rule Group LogRhythm Diagnostics and configure them to notify the appropriate personnel to ensure you receive alerts.
As with any production server, disk space must be monitored. You can monitor it with LogRhythm itself, using alarms, or with third-party monitoring tools that may already be in the environment.
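If no monitoring tool is in place yet, a basic free-space check is easy to script. The sketch below uses Python's standard `shutil.disk_usage`; the function name, the threshold, and the drive path in the usage note are hypothetical choices, not LogRhythm recommendations.

```python
import shutil

GIB = 1024 ** 3


def check_free_space(path: str, min_free_gib: float) -> dict:
    """Report free space on the volume containing path.

    Returns the free space in GiB and whether it meets the threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gib = usage.free / GIB
    return {
        "path": path,
        "free_gib": round(free_gib, 2),
        "ok": free_gib >= min_free_gib,
    }
```

A scheduled task could call, for example, `check_free_space("D:\\", 50)` for each database or state drive and raise an alert whenever `ok` is false.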
LogRhythm AI Engine Data File Processing
Each AI Engine Communication Manager has local, persistent storage for the log data files it receives from Mediators. The AI Engine reads the log data files, processes them, and then deletes the data files from the file system based on the configurable parameter MaxLogDataSize, which is the maximum amount of log data to keep on disk.
If the size of the data files exceeds the configurable amount, the AI Communication Manager begins to delete the oldest data files while continuing to write the newest logs to new data files on the file system. The AI Communication Manager writes logs to the Windows Event Log indicating that the deletions are occurring.
The value is configured in the AIEEngine.ini configuration file on the AI Engine System:
# Maximum amount of log data from Data Processors to keep on disk (in MB).
# If this amount is exceeded, the AI Engine Communication Manager will begin to delete the
# oldest data files until the data file size is less than the specified maximum.
# Values: 100-1000000 (100 MB - 1000 GB)
# Default: 2000 (2 GB)
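The oldest-first deletion described above can be sketched as follows. This is an illustration of the general technique only, not the AI Engine Communication Manager's implementation: the function names are hypothetical, file age is assumed to be the file modification time, and the limit is given in bytes rather than MB.

```python
import os
import tempfile
from pathlib import Path


def prune_oldest(data_dir: Path, max_bytes: int) -> list:
    """Delete the oldest files in data_dir until the total size fits max_bytes.

    Returns the names of the deleted files, oldest first.
    """
    files = sorted((p for p in data_dir.iterdir() if p.is_file()),
                   key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    deleted = []
    for path in files:
        if total <= max_bytes:
            break  # limit no longer exceeded
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path.name)
    return deleted


def demo() -> list:
    """Create three 100-byte files of different ages and prune to 150 bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for age, name in enumerate(["old.dat", "mid.dat", "new.dat"]):
            path = root / name
            path.write_bytes(b"x" * 100)
            os.utime(path, (1_000 + age, 1_000 + age))  # set explicit mtimes
        return prune_oldest(root, max_bytes=150)
```

In the demo, the two oldest files are removed and only the newest survives, which mirrors why an undersized limit causes older log data to be lost before it is processed.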
On the Mediator side, the AI Engine data is stored in memory until the limit is reached. The data is then written out to spool files until it can be transmitted to AI Engine. When the storage limit is reached, the oldest spool file is deleted.
The parameters are configured by accessing the LogRhythm Data Processor Advanced Properties window and filtering the Name field for MaxDataQueueSize or MaxSpoolStorage. For more information on configuring these parameters, see Configure Data Processor Properties.
As with any production server, disk space must be monitored. You can monitor it with LogRhythm itself, using alarms, or with third-party monitoring tools that may already be in the environment.
LogRhythm has several maintenance jobs that age data from the databases and rebuild indexes to maintain efficient search functions. If the maintenance jobs do not run, the databases can fill to capacity and suspense conditions can occur.
The database maintenance jobs are implemented as SQL Server Agent jobs. Two jobs are in place on any LogRhythm database server:
- LogRhythm Weekday Maintenance. Runs Monday-Friday at a default time of 12:15 AM.
- Platform Manager Sunday Maintenance. Runs Sunday at a default time of 12:15 AM.
Two additional jobs appear on Platform Managers only:
- LogRhythm Backup. Called by the Platform Manager Sunday Maintenance Job.
- Delete Old Database Backups. This job, disabled by default, will clean up old database backup files that are older than the value for Database Backup Days to Keep, set in [LogRhythmEMDB].[dbo].[SCMaint].
Each job comprises individual steps that perform a specific maintenance function.
It is critical that the SQL Server Agent service be running on all LogRhythm database servers to ensure that all maintenance is performed on schedule.
To verify this, open the Services application, find SQL Server Agent (MSSQLSERVER) in the list, and confirm that the Startup Type is set to Automatic and that the Status is Started.
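The same check can be scripted with the Windows `sc query` command. The sketch below makes several assumptions: the service name `SQLSERVERAGENT` applies to a default SQL Server instance (named instances use a different service name), the output labels are in English, and the helper names are hypothetical. It reports only the running state, not the Startup Type.

```python
import re
import subprocess
from typing import Optional

# Matches lines such as "        STATE              : 4  RUNNING"
STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_sc_state(sc_output: str) -> Optional[str]:
    """Extract the state name (for example RUNNING) from `sc query` output."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def sql_agent_state(service_name: str = "SQLSERVERAGENT") -> Optional[str]:
    """Run `sc query` for the service (Windows only) and return its state."""
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True, text=True, check=False,
    )
    return parse_sc_state(result.stdout)
```

On a database server, `sql_agent_state()` returning anything other than `"RUNNING"` means the maintenance jobs are not being run on schedule and the service should be investigated.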
LogRhythm Health Checks and Tune-Up Services
LogRhythm is designed to run with a minimum of required maintenance. However, as with any complex system, a LogRhythm Deployment can benefit from periodic assessment, maintenance, and tuning.
LogRhythm’s Health Check and Tune-Up Services are designed to regularly assess a customer’s deployment and ensure it is fully operational and functioning.
You can read more about these services on the LogRhythm website. To learn more or to sign up for these services, contact your Customer Relationship Manager.