Wednesday, November 26, 2014

System Monitoring with WSO2 Bigdata Platfrom


Monitoring is a critical factor for any cloud deployment. There are many monitoring tools deployed in the cloud. Each of these monitoring tools expose important statistics such as CPU Load, memory usage, disk usage, network usage, service status, etc. These tools also provide separate dashboards for the DevOps administrators. In order to maximize the benefits gained by these monitoring tools, it is important to have an unified view of the whole system through the statistics. With such unification, it is easy to process the information in order to correlate events, detect outliers, find anomalies and generate useful information such as failure patterns, failure/resource exhaustion predictions, etc. In such an event administrators also need to be alerted via notification mechanisms such as email and sms.

Such a solution needs to be extensible, dynamic, easy to deploy and easy to configure. It is easy to build such a system using WSO2 BigData Platform. This article explains how we built such a platform. In our solution, we used WSO2 Enterprise Service Bus as an agent that handles data collection and data publication, WSO2 Complex Event Processor as a real time stream processing component, WSO2 Business Activity Monitor as a historical data analysis component and WSO2 User Engagement Server as the unified dashboard provider.

System Monitoring Solution

Figure 1 : Service clusters system health status dashboard

WSO2 Cloud Monitoring solution consists with a service clusters summary health status dashboard, service cluster health status dashboard, node disk space exhaustion prediction mechanism and real time system statistics analysis and alerting mechanism using email and SMS.

Figure 1 shows the dashboard of the WSO2 Cloud Monitor. It summarizes the health status of the service clusters of the WSO2 Cloud and Figure 2 illustrates the individual service cluster system health status with service tests summary. As in Figure 1, dashboard has some features like highlighting the service cluster graph if the service tests are failing and changing the color of the line into red when the values are exceeding 90% usage.

Figure 2 : Service cluster system health status with service tests

Solution Approach

Monitoring statistics are collected by an agent and published to the analytical components of the system periodically. In order to achieve this WSO2 ESB and the scheduled task feature is used. WSO2 CEP conducts real time stream processing to generate useful patterns in order to identify sudden failure patterns. WSO2 BAM conducts historical data processing to identify patterns that develop with time. The cloud administrators are notified [via email and sms] when a notifiable event pattern occurs on either of the analytical components. WSO2 UES is used as the dashboard provider for the unified system.

System Architecture

System overview - Monitoring WSO2 cloud using CEP and BAM(2).jpg
Figure 3 : System architecture overview [link]

Inside WSO2 ESB - Data Collector and Publisher

WSO2 ESB has been used as the system statistics collector and the event publisher.

As Figure 3 illustrates a sequence has been defined in the WSO2 ESB which is triggered by a scheduled task to collect data from monitoring tools. You can find more information about using the scheduled tasks inside WSO2 ESB at [1].

When scheduled task is triggered clone mediator runs four class mediators parallely to collect data from monitoring tools. These class mediators act as adapters specific to each cloud monitoring tool. The task of these adapters is to connect to the cloud monitoring tool and acquire the data they publish. There is no intelligent part in these adapters to process data but they are only responsible of collecting and sending data to BAM mediator. BAM mediator receives data from adapters in four separate streams and is responsible of publishing these data streams to BAM/CEP.

The components in the sequence serve following purposes.

Clone mediator
Allows the parallel execution of the class mediators to collect data.

Class mediators
These class mediators act as the main components that collect data which are developed specific to each monitoring tool.

For an example lets consider a popular monitoring tool such as Ganglia. Ganglia exposes its statistics as an XML dump over a TCP socket. In order to collect the statistics from Ganglia, a class mediator needs to be developed so it operates within the ESB. An example class mediator has been attached below. These class mediators do not conduct any intelligent data processing. They simply contact the monitoring tool, acquire all the data the tool is ready to provide, transform the data into a XML format and insert the XML into the sequence as a SOAP message [Figure 4].

It is also easy to parse needed configurations to the class mediator as parameters within the synapse configuration [Figure 5].

<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="">
     <ns:WSO2_CLOUD_MON VERSION="1.0.0" xmlns:ns="">
       <ns:HOST NAME="">
       <ns:METRIC NAME="reported" VALUE="2014-06-09 23:36:21"/>
       <ns:METRIC NAME="host_address" VALUE=""/>
       <ns:METRIC NAME="disk_usage" VALUE="11"/>
       <ns:METRIC NAME="load_average" VALUE="OK - load average: 0.00, 0.01, 0.09"/>
       <ns:METRIC NAME="hostgroup_name" VALUE="debian-servers"/>
   - - - - - - -
Figure 4 : Generic soap format used in ESB for mediating cloud monitor messages

<class name="">
    <property name="port" value="8651"/>
    <property name="gmetadHost" value=""/>
Figure 5 : Synapse configuration of a class mediator with parameters

Due to the use of class mediators it is easy to extend the system when a new monitoring tool is added to the WSO2 cloud. It only requires to develop a new specific class mediator which handles data collection for the new monitoring tool. With a simple ESB configuration this class mediator can be introduced to the system and the collected data will be published to the analytical components through the sequence without any additional change. By using WSO2 ESB we give the user the advantage of altering and configuring the system as they require and it has the added benefit of built in extensibility.

<?xml version="1.0" encoding="UTF-8"?>
<definitions xmlns="">
  <registry provider="org.wso2.carbon.mediation.registry.WSO2Registry">
     <parameter name="cachableDuration">15000</parameter>
  <sequence name="fault">
     <log level="full">
        <property name="MESSAGE" value="Executing default 'fault' sequence"/>
        <property name="ERROR_CODE" expression="get-property('ERROR_CODE')"/>
        <property name="ERROR_MESSAGE" expression="get-property('ERROR_MESSAGE')"/>
  <sequence name="main">
        <clone id="cln">
                 <class name="">
                    <property name="port" value="161"/>
                    <property name="puppetDBPort" value="8080"/>
                    <property name="communityString" value="public"/>
                    <property name="puppetDBHost" value=""/>
                 <iterate xmlns:ns="http://org.apache.synapse/xsd"
                    <target sequence="conf:/snmp_publisher_seq"/>
                 <class name="">
                    <property name="port" value="8651"/>
                    <property name="gmetadHost" value=""/>
                 <log level="full">
                    <property name="MESSAGE" value="#####----GANGLIA----#####"/>
                 <iterate xmlns:ns="http://org.apache.synapse/xsd"
                    <target sequence="conf:/ganglia_publisher_seq"/>
     <description>The main sequence for the message mediation</description>
  <task name="cloudmon_task"
     <trigger interval="60"/>
     <property xmlns:task=""
     <property xmlns:task="" name="message">
        <ns:WSO2_CLOUD_MON xmlns:ns="" VERSION="1.0.0">

Figure 6: Example ESB synapse configuration with class mediators

BAM mediator
BAM Mediator is used as the data publishing component of the system. The collected data need to be published to the analytical tools in order to acquire meaningful information. This is done through a Thrift transport session which publishes event streams to WSO2 CEP and BAM.

A modified version of BAM mediator 4.2.2 has been used here to publish data. The mediator has been modified to publish a value of “-1” for not available/garbage value filtering. This is useful when handling the values in analytical components. We can easily filter -1 values during analysis.
For each and every monitoring tool there is a stream defined to it and there are two BAM profiles defined; one for WSO2 BAM and one for WSO2 CEP. Filters have been used since the namespace specific elements of the SOAP message cannot be extracted at the BAM mediator, therefore as the Figure 7 illustrates filter mediators have been used here to insert those values to properties and insert the BAM mediator.

<?xml version="1.0" encoding="UTF-8"?>
<sequence xmlns="">
 <filter regex="true" source="boolean(//nsp:METRIC[ @NAME = 'cpu_speed' ])" xmlns:ns="http://org.apache.synapse/xsd" xmlns:nsp="">
  <property expression="//nsp:METRIC/@VALUE[ ../@NAME = 'cpu_speed' ]" name="CPU_SPEED" scope="default" type="STRING"/>
   <property name="CPU_SPEED" scope="default" type="STRING" value="N/A"/>
 - - - - - -  
  <serverProfile name="cloudmon_bam_publisher_profile">
   <streamConfig name="ganglia.stats" version="1.0.0"/>
  <serverProfile name="cloudmon_cep_publisher_profile">
   <streamConfig name="ganglia.stats" version="1.0.0"/>
Figure 7 : Ganglia monitoring data collector publisher sequence.

You can configure the BAM profile in ESB by going to configuration -> bam server profiles. Figure 8 shows the configuration in XML and Figure 9 and Figure 10 shows how you should configure the bam profile to insert the extracted values to the event stream at the BAM mediator.

<serverProfile xmlns="">
 <connection authPort="7711" ip="" loadbalancer="false" receiverPort="7611" secure="true" urlSet=""/>
 - - - -
  <stream description="Cloud Monitor Ganglia Stats" name="ganglia.stats" nickName="Cloud Monitor Ganglia Stats" version="1.0.0">
    <property isExpression="true" name="host_name" type="STRING" value="get-property('HOST_NAME')"/>
    <property isExpression="true" name="disk_free" type="DOUBLE" value="get-property('DISK_FREE')"/>
   - - - - - 
  - - - - - -
Figure 8 : WSO2 Cloud Monitor BAM publisher profile - in xml

Screenshot from 2014-06-27 09:20:23_profile_2.png
Figure 9 : WSO2 Cloud Monitor BAM publisher profile - in WSO2 ESB UI

Screenshot from 2014-06-27 09:20:33_stream_2.png

Figure 10: WSO2 Cloud Monitor BAM publisher profile - stream definition from UI

Inside WSO2 BAM - Historical data analyzer

WSO2 BAM is used as the historical data analyzing component of the project. As an example disk space exhaustion prediction mechanism is explained below.

Logic behind disk space exhaustion prediction

This analysis uses Monte Carlo simulation[2] to generate the values for empirical - distribution of days left. In monte carlo simulation “daily used disk space” is sampled until the sum is above the free disk space. By repeatedly sampling the gathered results, we get a population of “number of days left till exhaustion”. Next it takes the empirical cumulative distribution[3] of the population. From that distribution we can get empirical cumulative probability estimate that we will reach the disk exhaustion before a given date.

Sometimes if the empirical distribution assumption (independent identically distribution of random variables) is violated, it gives deviation 0 exception, in that case we only consider the primary monte carlo simulation only for the probability calculations.

These calculations are done in BAM with hive scripts, hive UDFs(user defined functions). After analysis BAM will publish notifications as events to CEP in the deployment, as a stream and required parties will be notified by sending email/sms via CEP.
Summarize aggregated system statistics using BAM

System statistics collected from monitoring tools can be aggregated in BAM using hive scripts and summarized data can be pushed to a RDBMS. This RDBMS can be used later by the dashboard when displaying graphs. Figure 11 illustrates the hive script used to summarize data.

CREATE EXTERNAL TABLE IF NOT EXISTS SNMPAllDataStore ( messageID STRING, hostName STRING, reported BIGINT, loadAvgOne DOUBLE, loadAvgFive DOUBLE, loadAvgFifteen DOUBLE, loggedInUsers INT, memAvailableReal BIGINT, memTotalSwap BIGINT, memAvailableSwap BIGINT, memTotalFree BIGINT,
memCached BIGINT, processes INT, incomingTraffic BIGINT, outgoingTraffic BIGINT, incomingTraficEth0 BIGINT, outgoingTrafficEth0 BIGINT, cpuUsers INT, cpuSystem INT, cpuIdlePercentage DOUBLE, cpuRawIdle BIGINT)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    "" = "" ,
    "cassandra.port" = "9160" ,
    "" = "EVENT_KS" ,
    "cassandra.ks.username" = "admin" ,
    "cassandra.ks.password" = "admin" ,
    "" = "snmp_stats" ,
    "cassandra.columns.mapping" =     ":key,payload_host_name, payload_timestamp, payload_load_average_one, payload_load_average_five, payload_load_average_fifteen, payload_logged_in_users, payload_mem_avail_real, payload_mem_total_swap, payload_mem_avail_swap,
payload_mem_total_free, payload_mem_cached, payload_hr_system_processes, payload_incoming_traffic,
payload_outgoing_traffic, payload_incoming_traffic_eth0, payload_outgoing_traffic_eth0,
payload_cpu_users, payload_cpu_system, payload_cpu_idle_percentage, payload_cpu_raw_idle" );

CREATE EXTERNAL TABLE IF NOT EXISTS snmp_hosts_data( id STRING, hostName STRING, reported BIGINT, timeString STRING, loadAvgOne DOUBLE,cpuIdlePercentage DOUBLE, processes INT, memTotalFree BIGINT, memTotalAvail BIGINT, memUsage BIGINT, incomingTraffic BIGINT, outgoingTraffic BIGINT,
incomingTraficEth0 BIGINT, outgoingTrafficEth0 BIGINT)STORED BY ''
    'hive.jdbc.update.on.duplicate' = 'true' ,
    'hive.jdbc.primary.key.fields' = 'id' ,
    'hive.jdbc.table.create.query' =
    'CREATE TABLE snmp_hosts_data (
    hostName VARCHAR(100), reported BIGINT,
    timeString VARCHAR(30), loadAvgOne DOUBLE,
    cpuIdlePercentage DOUBLE, processes INT,
    memTotalFree BIGINT, memTotalAvail BIGINT,
    memUsage BIGINT, incomingTraffic BIGINT,
    outgoingTraffic BIGINT, incomingTraficEth0 BIGINT,
    outgoingTrafficEth0 BIGINT)' );
@Incremental(name="mainFeaturesAnalysis", tables="SNMPDataStoreCasTable", bufferTime="20")
insert overwrite table snmp_hosts_data 
    SELECT messageID, hostName, reported, from_unixtime(cast(reported/1000 as BIGINT),'HH:mm:ss' ) as timeString, loadAvgOne, 
    cpuIdlePercentage, processes, memTotalFree, memAvailableReal, (memAvailableReal-memTotalFree) as memUsage , incomingTraffic, outgoingTraffic, incomingTraficEth0, outgoingTrafficEth0  
    FROM SNMPAllDataStore where loadAvgOne >= 0 and cpuIdlePercentage >= 0 and processes >= 0 and memTotalFree >= 0 and memAvailableReal >= 0;
Figure 11 : Hive script used to summarize statistics

Inside WSO2 CEP - Real time stream processor

WSO2 CEP is used as the real time analyzing component of the project. An example task, the CEP could be used to analyse is described here.
  • Alert on high load average on a host
In order to conduct the analyses, CEP utilizes event streams published to it by the WSO2 ESB.

Logic behind alerting on load average

Load average plays a large role in identifying the usage of hosts that reside in the cloud. Therefore it is important to be able to detect useful high load average patterns. To alert when the load average of a particular host is higher than a threshold value the following siddhi query can be used. 

1. Filter based on reported time and the published time (if the reported time is within 5 minutes of the published time consider that data, otherwise discard the data)
from ganglia_stream[timestamp - (reported*1000) < 300000]
select reported, host_name, load_one, load_five, load_fifteen, cpu_num
insert into ganglia_timefiltered;

It is important to make sure that the data we use for analysis are within a reasonable time  window, to ensure that we compare the reported timestamp of the data with the current timestamp.

2. Filter hosts with high load 15 average based on a threshold value (take number of cpu cores into consideration as well)

from ganglia_timefiltered[cpu_num != -1 and load_fifteen != -1 and load_fifteen > (1.7 * cpu_num)]
insert into load15_high_hosts;

from ganglia_timefiltered[cpu_num == -1 and load_fifteen != -1 and load_fifteen > 1.7]
insert into load15_high_hosts;

Assumption : If the number of cpu cores is not available/returned treat it as a single core cpu.

3. Alert about the hosts who report high load averages frequently (larger than a threshold count) within a given time period (within a threshold time window).

from load15_high_hosts#window.timeBatch(10 min)
select reported, host_name, avg(load_one) as load_one_average, avg(load_five) as load_five_average, avg(load_fifteen) as load_fifteen_average, cpu_num, count(*) as frequency
group by host_name having frequency > 5
insert into load15_high_hosts_toalert;

Note: Use of timeBatch window ensures that alerts are only generated within the time window instead of alerting every time the load average is high (So no notification spamming). Threshold values should be set as preferred. These threshold values need to be experimented to best suit the requirement.

When these analyses create events that need to be notified to the administrators, the output event adapters of the WSO2 CEP are used to send email and sms notifications.

In addition to above analyses, CEP also listens for the events from the analyses done at WSO2 BAM [the results of this analysis are published to CEP as events] and notifies administrators once an event arrives.

Notification mechanism

WSO2 CEP is used as the notification mechanism for the alerts generated by the analyses. As mentioned above it alerts the WSO2 Cloud Administrators on 3 events.
Email notification
In order to send email notifications the Output Email Event Adapter[8] of the WSO2 CEP is used.

Example email notification is as follows.

Subject: Warning : Logged in user count is high!

  Logged in user count on {{host_name}} is {{logged_in_users}}.
  Stats on logged in user count for the last 10 minutes:
  Host name : {{host_name}}
  Maximum user count : {{max_logged_in_users}}
  Minimum user count : {{min_logged_in_users}}
  Current user count : {{logged_in_users}}
  Please take actions accordingly.

-WSO2 Cloud Monitoring bot.

SMS notification
In order to send sms notifications the Output HTTP Event adapter [9] of the WSO2 CEP is used to send HTTP requests to an HTTP API of a suitable sms service [10]. If the sms service exposes other APIs such as an SMTP API, Output Email event adapter can be configured to send sms messages as well. This is upon preference of the user.

System dashboard
System dashboard is used to show collected data graphically to make it easier for a user to analyse data. For an example aggregated network monitoring (packets in, packets out, etc) per host, and aggregated system statistics monitoring (cpu_load, memory and etc) per host can be shown in the dashboard using WSO2 UES as shown in Figure 12.

Screenshot from 2014-06-23 14:27:10.png
Figure 12: WSO2 cloud monitor dashboard

UES uses data in RDBMS which is pushed by BAM to draw the charts. When drawing charts using a database it is necessary to configure the datasource in master-datasources.xml where you can find inside <UES_HOME>/repository/conf/datasources/ directory. A sample configuration is shown below in Figure 13.

   <description>The datasource used for analyzer data</description>
   <definition type="RDBMS">
    <validationQuery>SELECT 1</validationQuery>
Figure 13: Customize the graph and preview

After adding the data source to the master-datasources.xml file start the UES server and login to UES portal. Then you can create an empty dashboard from the options given in UES portal. More details on how to create an empty dashboard can be found here [11]. In order to add a chart which uses a data source, select “Create new Gadget” icon in one of the gadgets and you will get a wizard as shown in Figure 14. Figure 14-18 show the step by step guide to create a gadget using RDBMS database.

Figure 14 : Insert RDBMS data source

Figure 15 : configure data source and validate connection

Figure 16 : Insert SQL statement and test the query

Figure 17 : Select gadget template

Figure 18 : Customize the graph and preview

In the example scenario two gadgets are used to display system data and one gadget is used to select the node which is a customized gadget. This customized gadget access the RDBMS database every minute to delete the old records in the database and update the hosts list in the drop down menu. Since data is shown on demand per node it is necessary to enable inter communication between gadgets using publisher subscriber model. To find more information about inter gadgets communication you can refer [12].

WSO2 UES provides the facility to set the refresh time of the graphs and this feature can be used to refresh the graphs in the system dashboard with data in BAM. The customized gadget should also be configured to refresh along with the graphs which can be done by setting a timeout in javascript.


As we have mentioned above WSO2 ESB can be combined with CEP, BAM and UES to successfully develop an intelligent analyzing tool. The results that can be obtained from this setup is only limited to the requirements of the user and statistics published by the monitoring tools. With the extensibility and configurability of the ESB, high analytical power of CEP and BAM, and the UES dashboards it is possible to generate the ultimate monitoring tool.