Monitoring Large Scale Environments

Requirement:

A relatively “large scale” environment (500-1000 servers) needs to be monitored, with specific KPIs tracked and trended at regular intervals. This needs to be done without adding extra packages to the current production servers, which may have only basic unix tools/commands available.

There can be no impact…zero…none to the performance of the servers, as some of them already carry a high workload.

The collected data needs to be relatively easy to ingest, so that time and effort aren't wasted. Ideally, data collection should be automated and require zero human interaction.

Setup

I plan to use Elasticsearch for this task. I already know it well and envision this working, at a high level, in the stages below. (Note: I am not using any Beats for this task. Even though they are extremely lightweight, I would need a proven soak period before deploying them, and Metricbeat in particular tends to create large indices, which for me is excessive since I can boil the operational needs down to no more than two dozen KPIs per server.)

  1. A cronjob on a jumpbox kicks off a wrapper script at regular intervals, which runs the collection script over ssh to gather the required KPIs. The output is written to specific log files on the jumpbox, named something like hostname-date.log. Neither the script nor the output remain on the production server.
  2. filebeat will run on the jumpbox, collecting the output as soon as it’s available and sending it to logstash for ingest.
  3. Ingest on the logstash node will be relatively straightforward as the collected logs will always be in the same format (pipe-delimited). Any missing values will need to be skipped and recorded as a null.

The two areas requiring the most effort will be the KPI collection script itself and the grok pattern to parse the logs (as it needs to be resilient to any nulls).

 

Prep Script

This is where you need to decide how you want to collect the data. Personally, I prefer something like counters/values in a predictable format that I can synthesize later. I’ve done this kind of collection using internal DBs (pdb) and SNMP, as well as grep’ing and awk’ing through logs. For availability and simplicity, I’ll focus on local SNMP-based collection. I’m collecting locally because, for security reasons, I won’t allow external SNMP polling access (blocked via firewalld).

If it isn’t already running, net-snmp needs to be installed and enabled, which the below accomplishes (otherwise I am approaching this under the assumption that it already is):

yum -y install net-snmp net-snmp-utils; systemctl enable snmpd; systemctl start snmpd
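
To sanity check that snmpd is up and answering locally (a quick check, assuming the stock CentOS/RHEL config with its default "public" community), a walk of the system tree is enough:

# quick local check that snmpd responds; "public" is the default community in the stock config
snmpwalk -v2c -c public localhost system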

Next, we need to expose the relevant OIDs for data collection. Considering I have a high number (hundreds) of these to set up, I am scripting the snmpd.conf update to pull from a list of IPs/Hostnames:

The script / series of commands in this example simply adds the required OID views to snmpd.conf and reloads snmpd.service.

#!/bin/bash

HOSTNAME=`hostname`

echo ""
echo "##########################################"
echo "backing up snmpd.conf and updating OIDs..."
cp /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.`date +"%Y_%m_%d-%H:%M:%S"`.bk


echo "" >> /etc/snmp/snmpd.conf
echo "########################" >> /etc/snmp/snmpd.conf
echo "# added by Operations #" >> /etc/snmp/snmpd.conf
echo "########################" >> /etc/snmp/snmpd.conf
echo "" >> /etc/snmp/snmpd.conf
echo "#IF-MIB::interfaces" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.2.1.2 >> /etc/snmp/snmpd.conf
echo "#IP-MIB::ip" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.2.1.4 >> /etc/snmp/snmpd.conf
echo "#TCP-MIB::tcp" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.2.1.6 >> /etc/snmp/snmpd.conf
echo "#UDP-MIB::udp" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.2.1.7 >> /etc/snmp/snmpd.conf
echo "#HOST-RESOURCES-MIB::" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.2.1.25 >> /etc/snmp/snmpd.conf
echo "#UCD-SNMP-MIB::" >> /etc/snmp/snmpd.conf
echo view systemview included .1.3.6.1.4.1.2021 >> /etc/snmp/snmpd.conf

echo ""
echo "...reloading snmpd and checking snmpd status..."
systemctl reload snmpd
echo ""
systemctl status snmpd |grep Active |awk '{print $2}' 
echo ""
echo "SNMP setup is complete for $HOSTNAME!"
echo "##########################################"
echo ""

To launch this script I’m using the below, which reads from /var/tmp/server-list.txt, a long list of server IPs. The awk will skip any blank or commented lines. ssh is relying on ssh keys which are already set up for this user.

#!/bin/bash

i=1
cat /var/tmp/server-list.txt |awk '!/^ *#/ && NF' |while read serverList
do
array[ $i ]="$serverList"
(( i++ ))
ssh user@$serverList "bash -s" < /var/tmp/snmp-setup.sh >>/var/tmp/SNMP-Setup_`date +"%Y-%m-%d_%H:%M"`.log
done

echo "Finished checking every server in the list!"

Now for a test run:

[user@jumpbox tmp]# ./snmp-kickoff.sh 
Finished checking every server in the list!

[user@jumpbox tmp]# cat SNMP-Setup_2019-10-15_16:11.log

##########################################
backing up snmpd.conf and updating OIDs...

...reloading snmpd and checking snmpd status...

active

SNMP setup is complete for server1.example.net!
##########################################


##########################################
backing up snmpd.conf and updating OIDs...

...reloading snmpd and checking snmpd status...

active

SNMP setup is complete for server2.example.net!
##########################################


##########################################
backing up snmpd.conf and updating OIDs...

...reloading snmpd and checking snmpd status...

active

SNMP setup is complete for server3.example.net!
##########################################

 

Data Collection

The data collection script will use the same server list as the snmp-kickoff script, which preps each server with the correct snmpd configuration.

The logic is that the script only needs to execute a minimal number of commands (4 snmpwalks plus a cat of /proc/meminfo, as it's the most accurate source for memory). The output is then grep'd for the KPIs I want. I can add more KPIs if need be, without adding any additional command execution.

#!/bin/bash

#dberry was here

###Commands reference###
HOSTNAME=`hostname`

###Reference for CPU load averages###
CPULOADAVG=`snmpwalk -cpublic -v2c localhost UCD-SNMP-MIB::laTable`
#1 min load average
CPU1MINAVG=`echo "$CPULOADAVG" |grep laLoad.1 |awk '{print $4}'`
#5 min load average
CPU5MINAVG=`echo "$CPULOADAVG" |grep laLoad.2 |awk '{print $4}'`
#15 min load average
CPU15MINAVG=`echo "$CPULOADAVG" |grep laLoad.3 |awk '{print $4}'`

###Reference for CPU system stats table###
CPUSYSTEMSTATS=`snmpwalk -cpublic -v2c localhost UCD-SNMP-MIB::systemStats`
#CPU user and system utilization percent
CPUUSERPERCENT=`echo "$CPUSYSTEMSTATS" |grep ssCpuUser.0 |awk '{print $4}'`
CPUSYSTEMPERCENT=`echo "$CPUSYSTEMSTATS" |grep ssCpuSystem.0 |awk '{print $4}'`
#CPU idle percent
CPUIDLEPERCENT=`echo "$CPUSYSTEMSTATS" |grep ssCpuIdle.0 |awk '{print $4}'`

###Reference for Memory Stats###
MEMORYSTATS=`cat /proc/meminfo`
#MemTotal and Available
MEMTOTAL=`echo "$MEMORYSTATS" |grep MemTotal |awk '{print $2}'`
MEMAVAILABLE=`echo "$MEMORYSTATS" |grep MemAvailable |awk '{print $2}'`
MEMACTIVE=`echo "$MEMORYSTATS" |grep Active |egrep -v '(anon|file)' |awk '{print $2}'`
#SwapTotal and Available
SWAPTOTAL=`echo "$MEMORYSTATS" |grep SwapTotal |awk '{print $2}'`
SWAPFREE=`echo "$MEMORYSTATS" |grep SwapFree |awk '{print $2}'`

#Filesystem and Disk
#HOST-RESOURCES-MIB::hrStorageAllocationUnits.31 = INTEGER: 4096 Bytes <- This is important
ROOTPARTITIONSIZE=`snmpget -cpublic -v2c localhost hrStorageSize.31 |awk '{print $4}'`
ROOTPARTITIONUSED=`snmpget -cpublic -v2c localhost hrStorageUsed.31 |awk '{print $4}'`

###Reference for disk IO stats###
DISKIOSTATS=`snmpwalk -cpublic -v2c localhost UCD-DISKIO-MIB::diskIOTable`
#1,5 and 15 mins disk load averages
ROOT1MINLOADAVG=`echo "$DISKIOSTATS" |grep diskIOLA1.1 |awk '{print $4}'`
ROOT5MINLOADAVG=`echo "$DISKIOSTATS" |grep diskIOLA5.1 |awk '{print $4}'`
ROOT15MINLOADAVG=`echo "$DISKIOSTATS" |grep diskIOLA15.1 |awk '{print $4}'`

###Reference for Interface stats from IF-MIB::ifTable###
INTERFACESTATS=`snmpwalk -cpublic -v2c localhost IF-MIB::ifTable`
ETH0IFINOCTETS=`echo "$INTERFACESTATS" |grep ifInOctets.2 |awk '{print $4}'`
ETH0IFOUTOCTETS=`echo "$INTERFACESTATS" |grep ifOutOctets.2 |awk '{print $4}'`
ETH0INDISCARDS=`echo "$INTERFACESTATS" |grep ifInDiscards.2 |awk '{print $4}'`
ETH0OUTDISCARDS=`echo "$INTERFACESTATS" |grep ifOutDiscards.2 |awk '{print $4}'`
ETH0INERRORS=`echo "$INTERFACESTATS" |grep ifInErrors.2 |awk '{print $4}'`
ETH0OUTERRORS=`echo "$INTERFACESTATS" |grep ifOutErrors.2 |awk '{print $4}'`

echo "HOSTNAME:$HOSTNAME|CPU1MINAVG:$CPU1MINAVG|CPU5MINAVG:$CPU5MINAVG|CPU15MINAVG:$CPU15MINAVG|CPUUSERPERCENT:$CPUUSERPERCENT|CPUSYSTEMPERCENT:$CPUSYSTEMPERCENT|CPUIDLEPERCENT:$CPUIDLEPERCENT|MEMTOTAL:$MEMTOTAL|MEMAVAILABLE:$MEMAVAILABLE|MEMACTIVE:$MEMACTIVE|SWAPTOTAL:$SWAPTOTAL|SWAPFREE:$SWAPFREE|ROOTPARTITIONSIZE:$ROOTPARTITIONSIZE|ROOTPARTITIONUSED:$ROOTPARTITIONUSED|ROOT1MINLOADAVG:$ROOT1MINLOADAVG|ROOT5MINLOADAVG:$ROOT5MINLOADAVG|ROOT15MINLOADAVG:$ROOT15MINLOADAVG|ETH0IFINOCTETS:$ETH0IFINOCTETS|ETH0IFOUTOCTETS:$ETH0IFOUTOCTETS|ETH0INDISCARDS:$ETH0INDISCARDS|ETH0OUTDISCARDS:$ETH0OUTDISCARDS|ETH0INERRORS:$ETH0INERRORS|ETH0OUTERRORS:$ETH0OUTERRORS"

The result is then echo'd out in a pipe-delimited format, which I plan to parse in Logstash for use in Elastic.

Example:

[user@dhcp tmp]# ./kpiCollection.sh 
HOSTNAME:dhcp.example.net|CPU1MINAVG:0.00|CPU5MINAVG:0.01|CPU15MINAVG:0.05|CPUUSERPERCENT:0|CPUSYSTEMPERCENT:0|CPUIDLEPERCENT:98|MEMTOTAL:500180|MEMAVAILABLE:323708|MEMACTIVE:176056|SWAPTOTAL:421884|SWAPFREE:420056|ROOTPARTITIONSIZE:677376|ROOTPARTITIONUSED:403109|ROOT1MINLOADAVG:0|ROOT5MINLOADAVG:0|ROOT15MINLOADAVG:0|ETH0IFINOCTETS:3043965952|ETH0IFOUTOCTETS:9373054|ETH0INDISCARDS:1|ETH0OUTDISCARDS:0|ETH0INERRORS:0|ETH0OUTERRORS:0
[user@dhcp tmp]#

Now, let’s work this into a kickoff script that will read through the server list so I can launch the script from one point (jumpbox or centralized log ingest node) and collect everything over ssh.

#!/bin/bash

# if you need debugging, uncomment the below
#set -x

i=1
cat /var/tmp/server-list.txt |awk '!/^ *#/ && NF' |while read serverList
do
array[ $i ]="$serverList"
(( i++ ))
ssh root@$serverList "bash -s" < /var/tmp/kpiCollection.sh >>/var/tmp/KPI_`date +"%Y-%m-%d_%H:%M"`.log
done

echo "Finished KPI collection for every server in the list!"

Example:

[user@puppetmaster tmp]# ./kpi-kickoff.sh 
Finished KPI collection for every server in the list!
[user@puppetmaster tmp]#
[user@puppetmaster tmp]# ls -l KPI*
-rw-r--r--. 1 user user 1304 Oct 17 15:09 KPI_2019-10-17_15:09.log
[user@puppetmaster tmp]#
[user@puppetmaster tmp]# tail KPI_2019-10-17_15:09.log
HOSTNAME:ns1.example.net|CPU1MINAVG:0.49|CPU5MINAVG:0.47|CPU15MINAVG:0.27|CPUUSERPERCENT:0|CPUSYSTEMPERCENT:0|CPUIDLEPERCENT:99|MEMTOTAL:500180|MEMAVAILABLE:322484|MEMACTIVE:196420|SWAPTOTAL:421884|SWAPFREE:410332|ROOTPARTITIONSIZE:677376|ROOTPARTITIONUSED:438912|ROOT1MINLOADAVG:0|ROOT5MINLOADAVG:0|ROOT15MINLOADAVG:0|ETH0IFINOCTETS:3315204668|ETH0IFOUTOCTETS:600135194|ETH0INDISCARDS:1|ETH0OUTDISCARDS:0|ETH0INERRORS:0|ETH0OUTERRORS:0
HOSTNAME:dhcp.example.net|CPU1MINAVG:0.00|CPU5MINAVG:0.01|CPU15MINAVG:0.05|CPUUSERPERCENT:0|CPUSYSTEMPERCENT:0|CPUIDLEPERCENT:99|MEMTOTAL:500180|MEMAVAILABLE:321700|MEMACTIVE:177664|SWAPTOTAL:421884|SWAPFREE:420056|ROOTPARTITIONSIZE:677376|ROOTPARTITIONUSED:402735|ROOT1MINLOADAVG:0|ROOT5MINLOADAVG:0|ROOT15MINLOADAVG:0|ETH0IFINOCTETS:3044963814|ETH0IFOUTOCTETS:9379091|ETH0INDISCARDS:1|ETH0OUTDISCARDS:0|ETH0INERRORS:0|ETH0OUTERRORS:0
HOSTNAME:lamp.example.net|CPU1MINAVG:0.00|CPU5MINAVG:0.01|CPU15MINAVG:0.05|CPUUSERPERCENT:0|CPUSYSTEMPERCENT:0|CPUIDLEPERCENT:99|MEMTOTAL:500180|MEMAVAILABLE:203228|MEMACTIVE:111816|SWAPTOTAL:946172|SWAPFREE:909404|ROOTPARTITIONSIZE:1857024|ROOTPARTITIONUSED:1230849|ROOT1MINLOADAVG:0|ROOT5MINLOADAVG:0|ROOT15MINLOADAVG:0|ETH0IFINOCTETS:667351072|ETH0IFOUTOCTETS:5467449|ETH0INDISCARDS:0|ETH0OUTDISCARDS:0|ETH0INERRORS:0|ETH0OUTERRORS:0
[user@puppetmaster tmp]#

Scheduled in crontab. Hourly will work for now.

0 * * * * /var/tmp/kpi-kickoff.sh
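
One small safeguard worth considering (a sketch, assuming the util-linux flock command is available on the jumpbox): if a collection run ever takes longer than the interval, wrapping the job in flock keeps two runs from overlapping.

# -n makes flock give up immediately if the previous run still holds the lock
0 * * * * /usr/bin/flock -n /var/tmp/kpi-kickoff.lock /var/tmp/kpi-kickoff.sh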

Now that we have regular data to ingest, let’s parse this in Logstash using grok and regex:

Parsing pipe-delimited lines

For this task you can either use https://grokdebug.herokuapp.com/ or the Grok Debugger under the Dev Tools in Kibana.

One thing to keep in mind while building up a grok filter is how it will behave if a value is missing/null.

First, place a single line in the sample data. I’m using the one below:

HOSTNAME:lamp.example.net|CPU1MINAVG:0.00|CPU5MINAVG:0.01|CPU15MINAVG:0.05|CPUUSERPERCENT:0|CPUSYSTEMPERCENT:0|CPUIDLEPERCENT:99|MEMTOTAL:500180|MEMAVAILABLE:203228|MEMACTIVE:111816|SWAPTOTAL:946172|SWAPFREE:909404|ROOTPARTITIONSIZE:1857024|ROOTPARTITIONUSED:1230849|ROOT1MINLOADAVG:0|ROOT5MINLOADAVG:0|ROOT15MINLOADAVG:0|ETH0IFINOCTETS:667351072|ETH0IFOUTOCTETS:5467449|ETH0INDISCARDS:0|ETH0OUTDISCARDS:0|ETH0INERRORS:0|ETH0OUTERRORS:0

Then build up your pattern.

grok-pattern-building

If you hit a snag and things don’t parse, use GREEDYDATA to see what is coming next (this can be very helpful if you are working with output that has missing/null fields).
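
For example, a throwaway debugging pattern like the below (not part of the final filter) captures everything after the first couple of fields into a "rest" field, so you can see exactly where your match stopped against the sample line above:

(%{HOSTNAME:Hostname})?\|CPU1MINAVG:(?<CPU1MINAVG>%{BASE10NUM})?\|%{GREEDYDATA:rest}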

grok-greedydata

For the load averages, I’ll match one or more digits and decimal points:

(?<value_name>[\d\.]+)

or:

(?<value_name>%{BASE10NUM})

Any parenthesis-enclosed pattern can be made optional by placing a ? after it. This keeps the whole line from becoming a grok parse failure if there are nulls.

(?<value_name>[\d\.]+)?
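
For example, if the collection script ever comes back with an empty CPU1MINAVG, the line would look like the first sample below (a made-up fragment); because the capture group is optional, the pattern underneath still matches and the field is simply absent from the event rather than the whole line failing with a grok parse error:

CPU1MINAVG:|CPU5MINAVG:0.01

CPU1MINAVG:(?<CPU1MINAVG>%{BASE10NUM})?\|CPU5MINAVG:(?<CPU5MINAVG>%{BASE10NUM})?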

grok-and-regex

Here is one possible grok pattern that matches the example output. (I switched the CPU load averages to the BASE10NUM grok pattern, since they would never end up as something like 10.12.233, which the looser regex [\d\.]+ would happily match.)

(%{HOSTNAME:Hostname})?\|CPU1MINAVG:(?<CPU1MINAVG>%{BASE10NUM})?\|CPU5MINAVG:(?<CPU5MINAVG>%{BASE10NUM})?\|CPU15MINAVG:(?<CPU15MINAVG>%{BASE10NUM})?\|CPUUSERPERCENT:(?<CPUUSERPERCENT>%{NUMBER})?\|CPUSYSTEMPERCENT:(?<CPUSYSTEMPERCENT>%{NUMBER})?\|CPUIDLEPERCENT:(?<CPUIDLEPERCENT>%{NUMBER})?\|MEMTOTAL:(?<MEMTOTAL>%{NUMBER})?\|MEMAVAILABLE:(?<MEMAVAILABLE>%{NUMBER})?\|MEMACTIVE:(?<MEMACTIVE>%{NUMBER})?\|SWAPTOTAL:(?<SWAPTOTAL>%{NUMBER})?\|SWAPFREE:(?<SWAPFREE>%{NUMBER})?\|ROOTPARTITIONSIZE:(?<ROOTPARTITIONSIZE>%{NUMBER})?\|ROOTPARTITIONUSED:(?<ROOTPARTITIONUSED>%{NUMBER})?\|ROOT1MINLOADAVG:(?<ROOT1MINLOADAVG>%{NUMBER})?\|ROOT5MINLOADAVG:(?<ROOT5MINLOADAVG>%{NUMBER})?\|ROOT15MINLOADAVG:(?<ROOT15MINLOADAVG>%{NUMBER})?\|ETH0IFINOCTETS:(?<ETH0IFINOCTETS>%{NUMBER})?\|ETH0IFOUTOCTETS:(?<ETH0IFOUTOCTETS>%{NUMBER})?\|ETH0INDISCARDS:(?<ETH0INDISCARDS>%{NUMBER})?\|ETH0OUTDISCARDS:(?<ETH0OUTDISCARDS>%{NUMBER})?\|ETH0INERRORS:(?<ETH0INERRORS>%{NUMBER})?\|ETH0OUTERRORS:(?<ETH0OUTERRORS>%{NUMBER})?

grok-full-parse

Other useful tricks:

Spaces

If you have any spaces between your values, %{SPACE} will account for that gap whether it’s one space or ten.

Example:

HOSTNAME:lamp.example.net|  |CPU1MINAVG:0.00|

is parsed by:

(%{HOSTNAME:Hostname})?\|%{SPACE}\|CPU1MINAVG:(?<CPU1MINAVG>%{BASE10NUM})?\

 

Multiple Possible Words and Versions

Parsing something like “CentOS Linux release 7.6.1810” is possible with the below grok:

%{WORD}%{SPACE}%{WORD}%{SPACE}%{WORD}%{SPACE}[\d\.]+

However, what if the output isn’t exactly the same? If this were a Red Hat box, you might get the below (from /etc/redhat-release) instead:

“Red Hat Enterprise Linux Server release 6.5 (Santiago)”

In which case your grok needs to actually be something like:

%{WORD}%{SPACE}%{WORD}%{SPACE}%{WORD}%{SPACE}%{WORD}%{SPACE}%{WORD}%{SPACE}%{WORD}%{SPACE}[\d\.]+%{SPACE}%{GREEDYDATA}

So how can we make our grok robust enough to account for either output? By enclosing each grok pattern in parentheses followed by a question mark, while leaving the [\d\.]+ for the release version as-is so it’s required, we’ll match any number of words up to the total possible before the digits.

Example:

(%{WORD})?(%{SPACE})?(%{WORD})?(%{SPACE})?(%{WORD})?(%{SPACE})?(%{WORD})?(%{SPACE})?(%{WORD})?(%{SPACE})?(%{WORD})?(%{SPACE})?[\d\.]+(%{SPACE})?(%{GREEDYDATA})?

grok-centos

grok-redhat

For reference here is a full list of all the grok patterns:

https://github.com/logstash-plugins/logstash-patterns-core/blob/master/patterns/grok-patterns

These are built on Oniguruma Regular Expressions:

https://github.com/kkos/oniguruma/blob/master/doc/RE

For regex, here’s a handy quick reference:

https://www.rexegg.com/regex-quickstart.html

Sending data to Logstash

Now that my jumpbox is collecting KPI’s from the list of servers provided, I need a reliable way to send them to Logstash for ingest. For this I’ll use filebeat. Here’s the configuration, it’s very simple:

filebeat.inputs:
- type: log
  enabled: true
  index: "logfiles"
  tail_files: true
  paths:
    - /var/tmp/KPI*.log

processors:
- drop_fields:
    fields: ["prospector", "offset"]

output.logstash:
  hosts: ["192.168.xxx.xx:5000"]

On the Logstash side, here is the pipeline:

input {
  beats {
    port => 5000
    type => "kpis"
  }
}

filter {
  if [type] == "kpis" {
    grok {
      match => {
        "message" => [
          "(%{HOSTNAME:Hostname})?\|CPU1MINAVG:(?<CPU1MINAVG>%{BASE10NUM})?\|CPU5MINAVG:(?<CPU5MINAVG>%{BASE10NUM})?\|CPU15MINAVG:(?<CPU15MINAVG>%{BASE10NUM})?\|CPUUSERPERCENT:(?<CPUUSERPERCENT>%{NUMBER})?\|CPUSYSTEMPERCENT:(?<CPUSYSTEMPERCENT>%{NUMBER})?\|CPUIDLEPERCENT:(?<CPUIDLEPERCENT>%{NUMBER})?\|MEMTOTAL:(?<MEMTOTAL>%{NUMBER})?\|MEMAVAILABLE:(?<MEMAVAILABLE>%{NUMBER})?\|MEMACTIVE:(?<MEMACTIVE>%{NUMBER})?\|SWAPTOTAL:(?<SWAPTOTAL>%{NUMBER})?\|SWAPFREE:(?<SWAPFREE>%{NUMBER})?\|ROOTPARTITIONSIZE:(?<ROOTPARTITIONSIZE>%{NUMBER})?\|ROOTPARTITIONUSED:(?<ROOTPARTITIONUSED>%{NUMBER})?\|ROOT1MINLOADAVG:(?<ROOT1MINLOADAVG>%{NUMBER})?\|ROOT5MINLOADAVG:(?<ROOT5MINLOADAVG>%{NUMBER})?\|ROOT15MINLOADAVG:(?<ROOT15MINLOADAVG>%{NUMBER})?\|ETH0IFINOCTETS:(?<ETH0IFINOCTETS>%{NUMBER})?\|ETH0IFOUTOCTETS:(?<ETH0IFOUTOCTETS>%{NUMBER})?\|ETH0INDISCARDS:(?<ETH0INDISCARDS>%{NUMBER})?\|ETH0OUTDISCARDS:(?<ETH0OUTDISCARDS>%{NUMBER})?\|ETH0INERRORS:(?<ETH0INERRORS>%{NUMBER})?\|ETH0OUTERRORS:(?<ETH0OUTERRORS>%{NUMBER})?"
        ]
      }
    }
  }
  mutate {
    remove_field => ["message", "tags"]
  }
}

output {
  elasticsearch {
    hosts => [ "192.168.xxx.xx:9200" ]
    index => "kpis-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}

journalctl -f -u logstash can be used to confirm the data is received and parsed:

Oct 17 22:00:11 Logstash.example.net logstash[2025]: "ETH0INDISCARDS" => "0",
Oct 17 22:00:11 Logstash.example.net logstash[2025]: "CPUIDLEPERCENT" => "99",
Oct 17 22:00:11 Logstash.example.net logstash[2025]: "SWAPFREE" => "909404",
Oct 17 22:00:11 Logstash.example.net logstash[2025]: "ETH0IFINOCTETS" => "693246286",
Oct 17 22:00:11 Logstash.example.net logstash[2025]: "ETH0OUTERRORS" => "0",
Oct 17 22:00:11 Logstash.example.net logstash[2025]: "type" => "KPIs"
Oct 17 22:00:11 Logstash.example.net logstash[2025]: }

The data should now be discoverable in Kibana after you create an index pattern to match “kpis*”:

 

Let’s Visualize

Before we go any further, we have to figure out how we want to work with the data collected. With a raw data ingest such as this, unless we set up a mapping template, the values will be stored as text. You can still search for strings using the foobar.keyword syntax, but you can’t plot averages and counts.

cpu1minavg-string

What we need to do is create a mapping template for the kpis* index pattern that converts these values to the correct numeric equivalent.

Consider the raw data. We can see this from discovering it via the index pattern itself:

cpu1minavg-discovered

Converting this to a float type would make sense. For reference, here are the types: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
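
As an aside, the conversion can also be handled at ingest time in Logstash with a mutate filter, so the values arrive as numbers and get dynamically mapped accordingly. A minimal sketch for the three load-average fields (I’m sticking with the mapping template approach below):

mutate {
  convert => {
    "CPU1MINAVG" => "float"
    "CPU5MINAVG" => "float"
    "CPU15MINAVG" => "float"
  }
}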

Luckily, in the dev tools of kibana, it’s super easy to create an index mapping template. You can view your current templates with GET _cat/templates and examine individual templates with something like GET _template/logstash.

Using the below, I am now converting the CPU load average values to floats:

PUT _template/kpis
{
  "index_patterns": [
    "kpis*"
  ],
  "mappings": {
    "properties": {
      "CPU15MINAVG": {
        "type": "float"
      },
      "CPU5MINAVG": {
        "type": "float"
      },
      "CPU1MINAVG": {
        "type": "float"
      }
    }
  }
}

Once this is done, you’ll need to re-create the index pattern and drop any old indices, because once data has been indexed a certain way, it stays that way.

cpu-la-avg-floats

Here’s an example of an index template that would work for the above:

{
  "kpis" : {
    "order" : 0,
    "index_patterns" : [
      "kpis*"
    ],
    "settings" : { },
    "mappings" : {
      "properties" : {
        "CPUIDLEPERCENT" : {
          "type" : "integer"
        },
        "ETH0INDISCARDS" : {
          "type" : "integer"
        },
        "ROOT1MINLOADAVG" : {
          "type" : "float"
        },
        "SWAPTOTAL" : {
          "type" : "integer"
        },
        "ETH0OUTDISCARDS" : {
          "type" : "integer"
        },
        "ETH0INERRORS" : {
          "type" : "integer"
        },
        "ETH0OUTERRORS" : {
          "type" : "integer"
        },
        "ROOT15MINLOADAVG" : {
          "type" : "float"
        },
        "CPUUSERPERCENT" : {
          "type" : "integer"
        },
        "ETH0IFOUTOCTETS" : {
          "type" : "long"
        },
        "SWAPFREE" : {
          "type" : "integer"
        },
        "CPUSYSTEMPERCENT" : {
          "type" : "integer"
        },
        "MEMACTIVE" : {
          "type" : "integer"
        },
        "ROOT5MINLOADAVG" : {
          "type" : "float"
        },
        "CPU15MINAVG" : {
          "type" : "float"
        },
        "CPU1MINAVG" : {
          "type" : "float"
        },
        "MEMAVAILABLE" : {
          "type" : "integer"
        },
        "ROOTPARTITIONSIZE" : {
          "type" : "integer"
        },
        "CPU5MINAVG" : {
          "type" : "float"
        },
        "ETH0IFINOCTETS" : {
          "type" : "long"
        },
        "ROOTPARTITIONUSED" : {
          "type" : "integer"
        },
        "MEMTOTAL" : {
          "type" : "integer"
        }
      }
    },
    "aliases" : { }
  }
}

That’s better:

cpula-number-discover

 

I also pointed something out in the data collection script that we need to consider when visualizing:

#HOST-RESOURCES-MIB::hrStorageAllocationUnits.31 = INTEGER: 4096 Bytes

If the KPI uses an allocation unit where one “count” equals “x” bits or bytes, we need to add a calculation and set an appropriate y-axis unit to visualize it correctly.

Example in timelion:

.es(index=kpis*, q='Hostname.keyword:ns1.example.net', metric=avg:ROOTPARTITIONSIZE).multiply(4096).yaxis(units=bytes).lines(width=1,fill=0.5).color(#AED6F1).label("Root Partition Size"),.es(index=kpis*, q='Hostname.keyword:ns1.example.net', metric=avg:ROOTPARTITIONUSED).multiply(4096).yaxis(units=bytes).lines(width=1,fill=0.5).color(#DAF7A6).label("Root Partition Used")
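
As a quick sanity check of that 4096-byte allocation unit, multiplying the sample values collected earlier gives sensible sizes (rough arithmetic, done here with bash):

# hrStorageSize.31 and hrStorageUsed.31 are counts of 4096-byte allocation units
echo $((677376 * 4096))   # 2774532096 bytes, roughly a 2.6 GiB root partition
echo $((403109 * 4096))   # 1651134464 bytes, roughly 1.5 GiB used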

When it comes to interface tx and rx rates, we also need to perform some calculations to get things “looking” right. I am NOT polling at a very granular rate at the moment (every 15 minutes) so if there are bursts or drops in traffic during that interval, it will get averaged out. Once the concept is proven, I would definitely increase the frequency the data is collected.

Anyway, since ifIn/OutOctets (or their 64-bit equivalents, ifHCIn/OutOctets, which I’d recommend if available) are counters (32-bit in this case), they are ever increasing. To determine the interface rate, you subtract the previous counter value from the latest one over a set time interval, multiply by 8 (because each counter counts octets/8 bits), and divide by that interval.

Example:

(IntervalValue2 – IntervalValue1) * 8 / timeBetweenIntervals

[user@ns1 dberry]# while true; do date; snmpwalk -ctest -v2c localhost ifInOctets.2; sleep 60; done
Sun Oct 20 21:41:25 EDT 2019
IF-MIB::ifInOctets.2 = Counter32: 3613633761
Sun Oct 20 21:42:25 EDT 2019
IF-MIB::ifInOctets.2 = Counter32: 3613699361

[user@ns1 dberry]# interval1Value=3613633761
[user@ns1 dberry]# interval2Value=3613699361
[user@ns1 dberry]# timeElapsed=60
[user@ns1 dberry]# echo $((interval2Value-interval1Value))
65600
[user@ns1 dberry]# echo $((65600*8))
524800
[user@ns1 dberry]# echo $((524800/timeElapsed))
8746

From this it appears the rx estimate on that interface was about 8.7 Kbps.

Running nload at the same time was pretty close:

Curr: 8.70 kBit/s
Avg: 11.02 kBit/s
Min: 6.29 kBit/s
Max: 78.49 kBit/s
Ttl: 3.37 GByte
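
To automate that spot check, something like the below could live on the jumpbox (a minimal sketch; ifrate is just a hypothetical helper name, and it assumes a 32-bit counter that wraps at most once per sample interval):

#!/bin/bash
# Estimate an interface rate in bits/s from two 32-bit octet counter samples.
# Usage: ifrate <previous_octets> <latest_octets> <seconds_between_samples>
ifrate() {
  local prev=$1 latest=$2 secs=$3
  local delta=$(( latest - prev ))
  # a 32-bit counter wraps at 2^32; assume at most one wrap per interval
  (( delta < 0 )) && delta=$(( delta + 4294967296 ))
  echo $(( delta * 8 / secs ))
}

ifrate 3613633761 3613699361 60   # prints 8746, matching the manual calculation above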

So how do we tackle that in a Kibana visualization? After much experimentation I settled on the below for now. If/when I increase the frequency of the data collection, I’d likely need to adjust the moving average (mvavg) value to match:

.es(index=kpis*, q='Hostname.keyword:ns1.example.net', metric=sum:ETH0IFINOCTETS).multiply(8).mvavg(15m).derivative().yaxis(units=bits/s).lines(width=1,fill=0.5).color(#AED6F1).label("Eth0 InOctets").scale_interval(1s),.es(index=kpis*, q='Hostname.keyword:ns1.example.net', metric=sum:ETH0IFOUTOCTETS).multiply(8).mvavg(15m).derivative().yaxis(units=bits/s).lines(width=1,fill=0.5).color(#DAF7A6).label("Eth0 OutOctets").scale_interval(1s)

(The 90Kbps spike is when I did a yum install for nload)

ns1-eth0-rates

Maximum Shards in ES 7

Introduction

So everything is working fine when suddenly all your pretty graphs just hit a brick wall. Discovering your data in the index patterns also shows the same thing:

sept-18-stop

A quick check of filebeat shows logs being sent, and logstash is still happily parsing away. So what gives?

Diagnosis

However, there is also a curious warning log:

[2019-09-21T03:11:05,419][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"syslog-2019.09.21", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x7b904a4f>], :response=>{"index"=>{"_index"=>"syslog-2019.09.21", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [999]/[1000] maximum shards open;"}}}}

The key here is that I’m hitting the limit of 1000 shards. Google-fu tells me this is a new default limit (cluster.max_shards_per_node) in ES7. That might be why I never encountered this in ES6, but also, that system indexed monthly, not daily, as I am doing in my home lab.

(For reference, this behavior is defined in the output section of your logstash pipeline)

index => "%{type}-%{+YYYY.MM.dd}"

A quick check of the node stats confirms I’m at the limit:

GET /_stats

{
  "_shards" : {
    "total" : 999,
    "successful" : 499,
    "failed" : 0
  },

An interesting note on the above: the successful count is about half of the total. More on why that is later…

Workaround

There are two things you could do in this situation.

  1. Delete all your old indices if you don’t need them. The drawback is that you will encounter the issue again at some point if you don’t figure out why you hit the limit to begin with. Should you index monthly instead of daily? Reduce the number of shards and replica shards? It depends on your needs.
  2. Increase the “max_shards_per_node” setting above 1000. The drawback is potentially a performance hit during replication between nodes in a cluster, as you are increasing the amount of work required to keep the cluster healthy. You could still end up hitting the newly increased upper limit, or end up with a train wreck of a cluster suffering from poor performance and replication failures.

Since this is a home lab and I wasn’t concerned with keeping data from three months ago, I dropped old indices. As soon as I did that things started working again.
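
For reference, dropping a batch of old indices is a one-liner against the delete index API (the index names/dates below are just examples):

# delete all daily syslog indices from June 2019 (example pattern only)
curl -X DELETE "localhost:9200/syslog-2019.06.*"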

As this is a single node and not a cluster, the above concerns with increasing “max_shards_per_node” aren’t relevant, so I bumped this value up to 2000. Apparently this can be defined in elasticsearch.yml, but in ES7 there is a bug where the “cluster.max_shards_per_node” setting in elasticsearch.yml is not read. I suspect this is because that same value exists as a default in the logstash and kibana settings.

To work around that, setting this via the cluster settings API works (note that I used a transient setting here, which won’t survive a full cluster restart; a persistent setting would):

[root@elastic elasticsearch]# curl -X PUT localhost:9200/_cluster/settings -H 'Content-type: application/json' --data-binary $'{"transient":{"cluster.max_shards_per_node":2000}}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"max_shards_per_node":"2000"}}}
[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/settings?pretty"
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "max_shards_per_node" : "2000"
    }
  }
}

Looking at the node stats, they are better. However, I still see the “successful” shards as about half the total…

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_stats?pretty"

{
  "_shards" : {
    "total" : 289,
    "successful" : 146,
    "failed" : 0
  },

Resolution

Looking at the overall health, it’s a bit clearer why not all the shards are successful. Half of them are actually unassigned:

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elastic1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 146,
  "active_shards" : 146,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 143,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.51903114186851
}

Why would a shard not be assigned? A replica won’t be assigned when another node doesn’t exist (this is a single-node setup) and the primary shard already lives on the same node; a replica on the same node would just be duplicated data taking up space.

[root@elastic elasticsearch]# curl -X GET "localhost:9200/*/_settings?pretty" |egrep '("number_of_shards"|"number_of_replicas")' |sort |uniq -c

3 "number_of_replicas" : "0",
105 "number_of_replicas" : "1",
89 "number_of_shards" : "1",
19 "number_of_shards" : "3",
[root@elastic elasticsearch]#
[root@elastic elasticsearch]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED

elastiflow-3.5.0-2019.09.16 1 r UNASSIGNED INDEX_CREATED
elastiflow-3.5.0-2019.09.16 2 r UNASSIGNED INDEX_CREATED
elastiflow-3.5.0-2019.09.16 0 r UNASSIGNED INDEX_CREATED
dnsmasq-2019.09.07 0 r UNASSIGNED INDEX_CREATED
syslog-2019.09.15 0 r UNASSIGNED INDEX_CREATED
polling-2019.09.12 0 r UNASSIGNED INDEX_CREATED
snmp-polling-2019.09.08 0 r UNASSIGNED INDEX_CREATED
syslog-2019.09.07 0 r UNASSIGNED INDEX_CREATED

 

Here is the verification we need:

[root@elastic elasticsearch]# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "dnsmasq-2019.09.06",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2019-09-06T11:44:58.531Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "5qG9f4jLTYim3x0V1US7SQ",
      "node_name" : "elastic.berry.net",
      "transport_address" : "192.168.88.21:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8201482240",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[dnsmasq-2019.09.06][0], node[5qG9f4jLTYim3x0V1US7SQ], [P], s[STARTED], a[id=Pa92lOgGSjiGDVpNQetebw]]"
        }
      ]
    }
  ]
}

Elasticsearch defaults to one replica per shard (and some indices set auto_expand_replicas). Since I do not need any replicas on a single node, I need to set this to zero. The below will do just that (run it in the console under Dev Tools in Kibana):

PUT /*/_settings
{
  "index" : {
    "number_of_replicas": 0,
    "auto_expand_replicas": false
  }
}

Anyway, health-wise my node is now in better shape. No more unassigned shards:

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elastic1",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 146,
  "active_shards" : 146,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

 

Conclusion

Obviously, this was a bit of a “break/fix” scenario. To prevent it from happening again as I approach the new 2000 shard limit, I’ll need to monitor the number of shards regularly (see the quick check after this list) and take one of the actions below:

  • Manually drop old data
  • Index monthly instead of daily
  • Implement curator to trim old data automatically
  • Look into compression and archival of older data if I really wish to keep it
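
For the monitoring piece, checking the current shard count against the configured limit could be as simple as the below (a sketch; the warning threshold and any notification mechanism are placeholders):

#!/bin/bash
# Warn when the number of open shards approaches the configured per-node limit.
LIMIT=2000        # matches the cluster.max_shards_per_node value set earlier
THRESHOLD=1800    # arbitrary warning threshold

SHARDS=$(curl -s "localhost:9200/_cat/shards?h=index" | wc -l)
if (( SHARDS >= THRESHOLD )); then
  echo "WARNING: $SHARDS shards open (limit $LIMIT)"
fi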

I “could” install Metricbeat on my Elastic node and track its performance using the elasticsearch module. I do wonder if that would also contribute to more shard usage (I could index those monthly, I suppose, though that index would still be large since Metricbeat tends to pull in a lot of data).

Another option would be a monitoring script…might go that route instead for now. More on that next post.

 

Tracking Brute Force Attempts in Elastic

Part I

I noticed continual failed login attempts from random users and IPs. I’d already disabled root login and enabled two-factor authentication for specific users, so I wasn’t too concerned. I have to say, whoever is behind this is very determined.

root@junos% tail -f messages
Sep 1 20:51:21 junos sshd[75801]: Failed password for neel from 190.98.xx.xx port 60586 ssh2
Sep 1 20:51:21 junos sshd[75802]: Received disconnect from 190.98.xx.xx: 11: Bye Bye
Sep 1 20:51:29 junos sshd[75803]: Failed password for switch from 115.159.xx.xx port 35762 ssh2
Sep 1 20:51:29 junos sshd[75804]: Received disconnect from 115.159.xx.xx: 11: Bye Bye
Sep 1 20:51:29 junos sshd[75805]: Failed password for user from 105.73.xx.xx port 15458 ssh2
Sep 1 20:51:29 junos sshd[75806]: Received disconnect from 105.73.xx.xx: 11: Bye Bye
Sep 1 20:51:41 junos sshd[75807]: Failed password for nagios from 58.227.xx.xx port 12952 ssh2
Sep 1 20:51:41 junos sshd[75808]: Received disconnect from 58.227.xx.xx: 11: Bye Bye

Since I’m mildly curious whether the IPs these originate from have any correlation (likely not, since using jumps/proxies to hide one’s origin has been par for the course since forever), and I’m wondering what pattern there may be to the usernames used, I’m going to track these in Elastic.

Remote Syslog in Juniper

This is relatively straightforward.

Send all logs/facilities at all severities (probably overkill and I’ll trim this back later):

set system syslog host 192.168.xx.xx any any

Specify the port:

set system syslog host 192.168.xx.xx port 1514

Once the change is committed, packet sniff on the receiver to confirm packets arrive.

[user@logstash ]# tcpdump -ni any -s0 -c10 -vv port 1514

...

22:52:56.432485 IP (tos 0x0, ttl 64, id 9122, offset 0, flags [none], proto UDP (17), length 126)
192.168.xx.xxx.syslog > 192.168.xx.xx.fujitsu-dtcns: [udp sum ok] SYSLOG, length: 98
Facility auth (4), Severity info (6)
Msg: Sep 1 22:52:56 sshd[77131]: Failed password for webmaster from 212.112.xxx.xx port 58044 ssh2

Syslog Pipeline

This is fairly straightforward. There are generally two types of sshd message formats.

The failed or successful login attempt:

Sep 1 20:51:21 junos sshd[75801]: Failed password for neel from 190.98.xx.xx port 60586 ssh2

Sep 1 20:52:03 junos sshd[75809]: Accepted publickey for xxxxxxxx from 192.168.xx.xx port 64316 ssh2

And, the disconnect:

Sep 1 20:51:21 junos sshd[75802]: Received disconnect from 190.98.xx.xx: 11: Bye Bye

One easy thing to do is start with a grok filter that captures the whole message, just to see how it looks once it arrives. The above is straight from /var/log/messages; however, it arrives looking like the below, which is almost the same minus the hostname:

<38>Sep 1 21:32:54 sshd[76288]: Failed password for eas from 103.228.112.45 port 58930 ssh2

The below grok patterns worked fine for my purposes:

#Login attempts

%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} for %{USER:ident} from %{IPORHOST:clientip} port %{POSINT:port} %{WORD:protocol}

#Disconnects

%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} from %{IPORHOST:clientip}: %{POSINT:session}: (?<Response>%{WORD} %{WORD})

Note: I’ve left the message field in for now, but to clean things up I’ll use the below mutate after the grok pattern match:

mutate {
  remove_field => ["message"]
}

Here is the entire syslog pipeline config:

input {
  tcp {
    port => 1514
    type => syslog
  }
  udp {
    port => 1514
    type => syslog
  }
}

filter {
  if [type] == "syslog" {
    grok {
      match => {
        "message" => [
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} for %{USER:ident} from %{IPORHOST:clientip} port %{POSINT:port} %{WORD:protocol}",
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} from %{IPORHOST:clientip}: %{POSINT:session}: (?<Response>%{WORD} %{WORD})"
        ]
      }
    }
    date {
      match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{type}-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}

 

journalctl -f -u logstash

Sep 01 22:30:42 logstash.xx.net logstash[12107]: "message" => "<38>Sep 1 22:30:42 sshd[76895]: Failed password for user from 81.174.xx.xx port 33786 ssh2",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "ident" => "user",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "type" => "syslog",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "clientip" => "81.174.xx.xx",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "protocol" => "ssh2",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "host" => "192.168.xx.xxx",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "syslog_pid" => "76895",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "syslog_timestamp" => "Sep 1 22:30:42",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "syslog_program" => "sshd",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "port" => "33786",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "@version" => "1",
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "@timestamp" => 2019-09-02T02:30:42.000Z,
Sep 01 22:30:42 logstash.xx.net logstash[12107]: "syslog_message" => "Failed password"
Sep 01 22:30:42 logstash.xx.net logstash[12107]: }

 

I’ll update this further once I’ve collected data over some time and have some visualizations to share.

Part II

I realize that this data is going to be much more interesting if I map the external IPs (clientip value after parsing) to a geoip map.

This is pretty straightforward with the below addition to the end of the filter section of the pipeline, after downloading and adding the GeoIP database:

geoip {
  source => "clientip"
  database => "/etc/logstash/geoipdbs/GeoLite2-City.mmdb"
}

For reference, here’s the whole pipeline:

input {
  tcp {
    port => 1514
    type => syslog
  }
  udp {
    port => 1514
    type => syslog
  }
}

filter {
  if [type] == "syslog" {
    grok {
      match => {
        "message" => [
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} for %{USER:ident} from %{IPORHOST:clientip} port %{POSINT:port} %{WORD:protocol}",
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message} from %{IPORHOST:clientip}: %{POSINT:session}: (?<Response>%{WORD} %{WORD})"
        ]
      }
    }
    date {
      match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
  geoip {
    source => "clientip"
    database => "/etc/logstash/geoipdbs/GeoLite2-City.mmdb"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{type}-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}

After restarting logstash, we get some nice detail on the location of these IPs:

Sep 02 20:30:28 logstash.xx.net logstash[15553]: "clientip" => "139.59.xx.xxx",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "ident" => "team4",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "port" => "51958",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "@version" => "1",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "syslog_program" => "sshd",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "type" => "syslog",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "protocol" => "ssh2",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "syslog_pid" => "89840",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "geoip" => {
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "latitude" => 12.9833,
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "country_code2" => "IN",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "country_code3" => "IN",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "timezone" => "Asia/Kolkata",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "continent_code" => "AS",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "region_name" => "Karnataka",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "location" => {
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "lon" => 77.5833,
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "lat" => 12.9833
Sep 02 20:30:28 logstash.xx.net logstash[15553]: },
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "longitude" => 77.5833,
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "postal_code" => "560100",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "region_code" => "KA",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "country_name" => "India",
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "ip" => "139.59.xx.xxx"
Sep 02 20:30:28 logstash.xx.net logstash[15553]: },

Too early to say if there’s a pattern, and these are likely compromised servers/proxies anyway.

[user@logstash ]# journalctl -f -u logstash |grep city_name
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:57 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:36:58 logstash.xx.net logstash[15553]: "city_name" => "Moscow",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:37:55 logstash.xx.net logstash[15553]: "city_name" => "San Francisco",
Sep 02 20:39:59 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:39:59 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:39:59 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:39:59 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:40:00 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:40:00 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:40:00 logstash.xx.net logstash[15553]: "city_name" => "Dhaka",
Sep 02 20:41:59 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:28 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:28 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:28 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:28 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:29 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:29 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:29 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",
Sep 02 20:44:29 logstash.xx.net logstash[15553]: "city_name" => "Bengaluru",

Next, let’s visualize these…

If we tried in this current state, we’d see the below error stating there is no geo_point field in our index:

geo_point

In order to do that, we’ll need to update our index mapping to create a geo_point: a field combining the geoip longitude and latitude, i.e.:

Sep 02 20:30:28 logstash.xx.net logstash[15553]: "geoip" => {
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "latitude" => 12.9833,

...

Sep 02 20:30:28 logstash.xx.net logstash[15553]: "longitude" => 77.5833,

...

Sep 02 20:30:28 logstash.xx.net logstash[15553]: "location" => {
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "lon" => 77.5833,
Sep 02 20:30:28 logstash.xx.net logstash[15553]: "lat" => 12.9833
Sep 02 20:30:28 logstash.xx.net logstash[15553]: },

What is convenient about the fields automatically created when we use the geoip plugin with clientip as the source is that the location field already contains both the lon and lat floats. All we really need to do is map geoip.location to the geo_point type.

This is the easiest way. Create a new template for the index pattern, with the mapping we need (note: as of Elasticsearch 7.x, there is no longer a type field in templates, so if you try and define one you’ll get an error):

PUT _template/syslog
{
  "index_patterns": [
    "syslog*"
  ],
  "mappings": {
    "properties": {
      "geoip.location": {
        "type": "geo_point"
      }
    }
  }
}

Since this would only be applied at index creation, you’ll need to delete the old indices after applying the template. Once done, refresh the index pattern and you should see that the geoip.location field is a geo_point type.
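
A quick way to confirm the mapping took effect on a freshly-created index (the index name below is just an example) is the field mapping API in Dev Tools:

GET syslog-2019.09.22/_mapping/field/geoip.location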

geo_point_field

Now, we can visualize all the geoip locations:

geoip_map

I “seem” to be very popular in China. Over 500 hits in the past hour:

geoip_china

Part III

In summary, after collecting data for only 5 days, here are some totals.

Although these attempts are globally distributed (typical botnet), there is definitely a higher, continual attempt rate from China. Over 25k total over 5 days:

geoip-map-7days

login-attempt-by-country-2

Trying to log in as root is definitely priority #1.

login-attempt-top-10-with-country

 

Solution:

I considered setting up a honeypot just to see what someone would do if they gained access; however, I felt the risk of being blamed for anything that left the honeypot (such as a DNS amplification attack) was too great. Also, the only safe way to do this is to ensure the honeypot can’t reach anything else on my local network, yet to review the captures and logs afterwards they’d need to be transferred somewhere. Transferring whatever malicious code ends up on it could also be misinterpreted as originating from me.

Instead, I’ll move ssh to a port other than 22 and limit access to only certain subnets.

If this were a Linux system, changing the ssh port in sshd_config and setting hosts.allow to permit only local IPs and/or limited subnets would be a start (a sketch follows below). Fail2Ban would be a good second line of defense.
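
A rough sketch of that on a Linux host (the port number and subnet are placeholders, and the hosts.allow/hosts.deny entries assume sshd was built with TCP wrappers support, which older distros have but newer ones have dropped):

# /etc/ssh/sshd_config - move sshd off the default port (pick any unused port)
Port 2222

# /etc/hosts.allow - only permit ssh from the local subnet
sshd : 192.168.88.

# /etc/hosts.deny - reject everything else
sshd : ALL

# apply the sshd change (on SELinux systems the new port also needs to be allowed,
# e.g. semanage port -a -t ssh_port_t -p tcp 2222)
systemctl restart sshd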

In Juniper/Junos, I’ll create a firewall filter and then apply it to whichever interface(s) are applicable. In the case of ssh, filtering on the loopback interface (lo0) will keep these ssh attempts from reaching the control plane.

  1. create the accept filter:
    set firewall filter RESTRICT-SSH term SSH-ACCEPT from source-address 192.168.xx.x/24
    
    set firewall filter RESTRICT-SSH term SSH-ACCEPT from protocol tcp
    
    set firewall filter RESTRICT-SSH term SSH-ACCEPT from destination-port ssh
    
    set firewall filter RESTRICT-SSH term SSH-ACCEPT then accept
  2. create the reject filter:
    set firewall filter RESTRICT-SSH term SSH-REJECT then reject
    
    set firewall filter RESTRICT-SSH term SSH-REJECT then count MALICIOUS-ATTEMPT
    
    set firewall filter RESTRICT-SSH term SSH-REJECT then log
    
    set firewall filter RESTRICT-SSH term SSH-REJECT then syslog
  3. Apply the filter to lo0:
    set interfaces lo0 unit 0 family inet filter input RESTRICT-SSH
  4. Commit confirmed, and test from another device on a network NOT in the SSH-ACCEPT source-address list. Once the test is complete, issue another commit or the config will roll back.
    user@junos# commit confirmed 2
    commit confirmed will be automatically rolled back in 2 minutes unless confirmed
    commit complete
    
    # commit confirmed will be rolled back in 5 minutes
    [edit firewall]
    user@junos# commit 
    commit complete
    
    [edit firewall]

Confirmation:

[someuser@elastiflow ]# ssh someuser@192.168.xx.xxx
ssh: connect to host 192.168.xx.xxx port 22: No route to host

Since I chose to log the attempts, next I’ll parse these in logstash and see how long it takes for these attempts to stop:

<158>Sep 7 21:39:35 junos1 PFE_FW_SYSLOG_IP: FW: fe-0/0/0.0 R tcp 49.88.xxx.xxx xx.xx.xx.xx 32160 22 (1 packets)
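
A rough grok sketch for that filter-log format might look like the below (untested against live data, the field names are my own choices, and the masked IPs above won't match %{IP} until they're real addresses):

%{SYSLOGTIMESTAMP:syslog_timestamp} %{WORD:device} PFE_FW_SYSLOG_IP: FW: %{DATA:interface} %{WORD:action} %{WORD:protocol} %{IP:src_ip} %{IP:dst_ip} %{INT:src_port} %{INT:dst_port} \(%{INT:packet_count} packets\)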

 

Note: after applying this I remembered that snmp polling also needs to reach the control plane of my Juniper SRX, so I had to amend the accept filter to allow udp and snmp as well:

set firewall filter RESTRICT-SSH term SSH-ACCEPT from source-address 192.168.xx.x/24
set firewall filter RESTRICT-SSH term SSH-ACCEPT from protocol tcp
set firewall filter RESTRICT-SSH term SSH-ACCEPT from protocol udp
set firewall filter RESTRICT-SSH term SSH-ACCEPT from destination-port ssh
set firewall filter RESTRICT-SSH term SSH-ACCEPT from destination-port snmp
set firewall filter RESTRICT-SSH term SSH-ACCEPT then accept

 

Juniper MIB import into Logstash

From my previous post where we configure SNMP polling for SRX KPIs, we can see the Juniper MIBs need to be imported into Logstash for proper conversion.

MIBs are publicly available from juniper at the below link:

https://www.juniper.net/documentation/en_US/release-independent/junos/mibs/mibs.html

However, this old SRX is running an ancient (10.0) version. I managed to locate a download for all the MIBs here:

http://www.juniper.net/techpubs/software/junos/junos100/juniper-mibs-10.0R4.7.tgz

The method to import / convert the MIBs for the SNMP input plugin in Logstash, differs from the ruby import used for the SNMP trap plugin (https://discuss.elastic.co/t/mib-oid-translation/29710). Apparently we need these to be .dic file types.

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-snmp.html#plugins-inputs-snmp-import-mibs

After downloading and extracting the mib archive, I found (as expected) these OEM specific MIBs have requirements/imports from other MIBs.

[dberry@Logstash JuniperMibs]$ smidump --level=1 -k -f python mib-jnx-chassis-fwdd.txt > mib-jnx-chassis-fwdd.dic 
mib-jnx-chassis-fwdd.txt:16: failed to locate MIB module `JUNIPER-SMI'
mib-jnx-chassis-fwdd.txt:31: unknown object identifier label `jnxMibs'
smidump: module `mib-jnx-chassis-fwdd.txt' contains errors, expect flawed output

These imports are listed in the MIB itself and easy enough to make a list out of:

[dberry@Logstash mibs]$ grep FROM JuniperMibs/mib-jnx-chassis-fwdd.txt |awk -F FROM '{print $2}' |tr -d ';' |awk '{print $1}'
SNMPv2-SMI
JUNIPER-SMI
[dberry@Logstash mibs]$

Here’s an idea. If I can place these in a readable list (like an array), I could pass them through the smidump command. I could do this with a bash script that takes the MIB we want to convert as user input and feeds the array of MIB dependencies to smidump…

First, get these MIB dependencies into an array as a list:

Example:

[root@Logstash mibs]# pwd
/usr/share/snmp/mibs
[root@Logstash mibs]# grep FROM DISMAN-EVENT-MIB.txt
Gauge32, mib-2, zeroDotZero FROM SNMPv2-SMI
TruthValue FROM SNMPv2-TC
NOTIFICATION-GROUP FROM SNMPv2-CONF
sysUpTime FROM SNMPv2-MIB
SnmpTagValue FROM SNMP-TARGET-MIB
SnmpAdminString FROM SNMP-FRAMEWORK-MIB;

The columns aren’t consistent, but I could use awk with FROM as the delimiter to get a clean list:

[root@Logstash mibs]# grep FROM DISMAN-EVENT-MIB.txt |awk -F FROM '{print $2}' |tr -d ';' |awk '{print $1}'
SNMPv2-SMI
SNMPv2-TC
SNMPv2-CONF
SNMPv2-MIB
SNMP-TARGET-MIB
SNMP-FRAMEWORK-MIB

We need to be able to determine the full path for these in order to successfully pass them through the smidump command. To do that I’ll use find on the MIBs in the array, and put them into a single line:

  1. Test the array:
    [root@Logstash mibs]# i=1
    
    [root@Logstash mibs]# grep FROM DISMAN-EVENT-MIB.txt |awk -F FROM '{print $2}' |tr -d ';' |awk '{print $1}' |while read MIBdependency
    
    > do array[ $i ]="$MIBdependency"
    
    > (( i++ ))
    
    > echo $MIBdependency
    
    > done
    
    SNMPv2-SMI.txt
    
    SNMPv2-TC.txt
    
    SNMPv2-CONF.txt
    
    SNMPv2-MIB.txt
    
    SNMP-TARGET-MIB.txt
    
    SNMP-FRAMEWORK-MIB.txt
    
    [root@Logstash mibs]#
  2. Same as the above but called by find to get the full path:
    [root@Logstash mibs]# grep FROM DISMAN-EVENT-MIB.txt |awk -F FROM '{print $2}' |tr -d ';' |awk '{print $1}' |while read MIBdependency; do array[ $i ]="$MIBdependency"; (( i++ )); 
    
    > find / -name $MIBdependency >>MIBdependency.tmp
    
    > done
    
    [root@Logstash mibs]# cat MIBdependency.tmp 
    
    /usr/share/mibs/ietf/SNMPv2-SMI
    
    /usr/share/mibs/ietf/SNMPv2-TC
    
    /usr/share/mibs/ietf/SNMPv2-CONF
    
    /usr/share/mibs/ietf/SNMPv2-MIB
    
    /usr/share/mibs/ietf/SNMP-TARGET-MIB
    
    /usr/share/mibs/ietf/SNMP-FRAMEWORK-MIB
  3. Use the preload option (-p) in front of each one and process it as a single line for use in smidump:
    [root@Logstash mibs]# sed 's/^/-p /' MIBdependency.tmp |tr '\n' ' ' 
    
    -p /usr/share/mibs/ietf/SNMPv2-SMI -p /usr/share/mibs/ietf/SNMPv2-TC -p /usr/share/mibs/ietf/SNMPv2-CONF -p /usr/share/mibs/ietf/SNMPv2-MIB -p /usr/share/mibs/ietf/SNMP-TARGET-MIB -p /usr/share/mibs/ietf/SNMP-FRAMEWORK-MIB [root@Logstash mibs]#

Here’s what I eventually settled on:

#!/bin/bash
#Version 1.0 
#This script will convert MIBs into a dic format

#allow backspace to work normally
stty erase ^H

#enable debugging#
#set -x

#Identify the target MIB

read -p "Please Enter the MIB to convert: " MIB2CONVERT

echo "Checking IETF MIB dependencies for $MIB2CONVERT..."

#find the other MIB dependencies, and determine their full path to be used in the smidump command
i=1
grep FROM $MIB2CONVERT |awk -F FROM '{print $2}' |tr -d ';' |awk '{print $1}' |while read MIBdependency
do
array[ $i ]="$MIBdependency"
(( i++ ))
find / -name $MIBdependency >>MIBdependency.tmp
done

#add -p to preload the modules/MIB dependencies
sed 's/^/-p /' MIBdependency.tmp >MIBdependency2.tmp

#make this a single line file for use with the smidump command
MIBdependencyCmd=`tr '\n' ' ' < MIBdependency2.tmp`

#run the smidump command to convert the MIB to the dic (python dict) format
timeout 10 smidump -k -f python $MIBdependencyCmd $MIB2CONVERT >$MIB2CONVERT.dic

#clean up after ourselves
rm -f MIBdependency*.tmp

echo "All Done!"

However, one thing I noticed was errors, despite the dependencies resolved:

[dberry@Logstash mibs]$ sudo ./convertMIB.sh 
Please Enter the MIB to convert: JuniperMibs/mib-jnx-chassis-fwdd.txt
Checking IETF MIB dependencies for JuniperMibs/mib-jnx-chassis-fwdd.txt...
smidump: module `JuniperMibs/mib-jnx-chassis-fwdd.txt' contains errors, expect flawed output
All Done!
[dberry@Logstash mibs]$

 

I had a suspicion that smidump was looking for a MIB file whose name matched the module definition, based on the earlier error “failed to locate MIB module `JUNIPER-SMI'”. Once I copied one of the dependencies (mib-jnx-smi.txt) to the name smidump was expecting (JUNIPER-SMI.txt), I was able to complete the conversion error-free.

[dberry@Logstash mibs]$ cp JuniperMibs/mib-jnx-smi.txt JuniperMibs/JUNIPER-SMI.txt
[dberry@Logstash mibs]$ sudo ./convertMIB.sh 
Please Enter the MIB to convert: JuniperMibs/mib-jnx-chassis-fwdd.txt
Checking IETF MIB dependencies for JuniperMibs/mib-jnx-chassis-fwdd.txt...
All Done!
[dberry@Logstash mibs]$ ls -lthr JuniperMibs/ |grep chassis
-rw-r--r--. 1 dberry dberry 57K Aug 22 2010 mib-jnx-chassis.txt
-rw-r--r--. 1 dberry dberry 2.2K Aug 22 2010 mib-jnx-chassis-fwdd.txt
-rw-r--r--. 1 dberry dberry 4.2K Aug 22 2010 mib-jnx-chassis-alarm.txt
-rw-r--r--. 1 dberry dberry 9.3K Aug 22 2010 mib-jnx-virtualchassis.txt
-rw-r--r--. 1 root root 4.3K Aug 31 00:35 mib-jnx-chassis-fwdd.txt.dic
[dberry@Logstash mibs]$
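Rather than renaming these one by one as each IMPORT fails, a small loop can pre-stage a copy of every Juniper MIB under the module name smidump looks for. This is just a sketch and not part of convertMIB.sh; it assumes the module name and "DEFINITIONS ::= BEGIN" share a line, which is how the Juniper MIB files are laid out:

#!/bin/bash
#Hypothetical helper: copy each Juniper MIB file to its SMI module name
#so smidump can resolve IMPORTS such as JUNIPER-SMI by file name
for mib in JuniperMibs/mib-jnx-*.txt; do
  #the module name is the first token on the "... DEFINITIONS ::= BEGIN" line
  module=$(awk '/DEFINITIONS[[:space:]]*::=[[:space:]]*BEGIN/ {print $1; exit}' "$mib")
  #-n avoids clobbering anything already named correctly
  [ -n "$module" ] && cp -n "$mib" "JuniperMibs/${module}.txt"
done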

Another test:

[dberry@Logstash mibs]# sudo ./convertMIB.sh
Please Enter the MIB to convert: JuniperMibs/mib-jnx-chassis.txt
Checking IETF MIB dependencies for JuniperMibs/mib-jnx-chassis.txt...
All Done!

[dberry@Logstash mibs]# ls -lthr JuniperMibs/
...
-rw-r--r--. 1 root root 132K Sep 1 14:58 mib-jnx-chassis.txt.dic

Now we need to point the snmp input plugin at the path containing the MIBs in .dic format.

According to the docs, you can use whatever path you decide for your converted MIBs. However, I ran into the following error when doing that from my home directory.

"logstash SnmpMibError: file or directory path expected"

I solved this by placing the converted MIBs under /etc/logstash/mibs/ instead.

mib_paths => ["/etc/logstash/mibs"]
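
As a sketch, staging the converted files there is just a copy plus a chown (assuming the standard RPM layout, where Logstash runs as the logstash user):

sudo mkdir -p /etc/logstash/mibs
sudo cp JuniperMibs/*.dic /etc/logstash/mibs/
sudo chown -R logstash:logstash /etc/logstash/mibs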

OIDs are now decoded.

Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.jnxBoxAnatomy.jnxOperatingTable.jnxOperatingEntry.jnxOperatingBuffer.9.1.0.0" => 51,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.3.1.1.4.510" => 63208,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddHeapUsage.0" => 32,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddDmaMemUsage.0" => 1,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.39.1.12.1.1.1.5.0" => 32,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "type" => "polling",
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.3.1.1.1.518" => 512,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddMicroKernelCPUUsage.0" => 19,
Sep 01 20:22:49 logstash.xx.net logstash[10666]: "iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddRtThreadsCPUUsage.0" => 1

Lastly, you’ll of course need to refresh your index pattern and update the visualizations/Timelion queries.

 

Example:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddMicroKernelCPUUsage.0).yaxis(units=percent).divide(100).label("FWD Engine Microkernel CPU").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddRtThreadsCPUUsage.0).yaxis(units=percent).divide(100).label("Real-time threads CPU").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddHeapUsage.0).yaxis(units=percent).divide(100).label("Heap Percentage").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.jnxFwdd.jnxFwddProcess.jnxFwddDmaMemUsage.0).yaxis(units=percent).divide(100).label("Buffer").lines(fill=1,width=2),

Juniper SRX Performance Monitoring with the Elastic Stack

Intro, and a few things to note.

I am writing this under a couple of assumptions:

  1. You already have an Elasticsearch instance running and listening on port 9200 (if an alternate port is in use, you will need to specify it in the output section of the Logstash pipeline)
  2. Kibana is also already set up.
  3. You can either add an additional pipeline for these purposes or spin up an additional Logstash instance. In the below example, I’ll use qemu/virt-install.

These steps are not dependent on any paid features, so I won’t be going into watcher and x-pack configurations.

Virtual Machine deployment in KVM

First, spin up a new Logstash instance using the command below. Using an additional instance instead of a new pipeline in an already configured Logstash deployment is a matter of preference, but one I prefer: the Logstash restarts needed during this implementation would impact all other data ingest, whereas doing this on a new instance eliminates that impact.

Note: The command below is sanitized, so you’ll of course need to customize hostnames, paths and sizing to what you feel you need. I’ve also set the total RAM at 2GB, which means Logstash’s default JVM heap of 1GB lands at the recommended 50% of total RAM. If this instance were handling more than just this test scenario, I’d definitely allocate more resources (including vCPUs).

Additionally, you’ll need to have the relevant CentOS ISO already downloaded, or supply the path to a mirror instead.

Lastly, the bridge used here was set up beforehand to bridge the virtual machines across a physical interface, allowing other network elements to reach them and vice versa.

After running the command below, connect to the newly created virtual machine/domain with “virsh console <domain name>” and complete the quickstart steps. Once this is done, the installation will report “Domain has shutdown” and continue.

[root@kvm~]# virt-install --virt-type=kvm --name logstash.local.net --ram 2048 --vcpus=1 --os-type linux --os-variant=centos7.0 --network=bridge=br1 --graphics vnc --console pty,target_type=serial --disk path=/data/images/elastic/logstash.img,size=50 --location=/data/images/elastic/CentOS-7-x86_64-Minimal-1810.iso --extra-args 'console=ttyS0,115200n8 serial'


Starting install...
Retrieving file .treeinfo... | 354 B 00:00:00 
Retrieving file vmlinuz... | 6.3 MB 00:00:00 
Retrieving file initrd.img... | 50 MB 00:00:00 
Allocating 'logstash.img' | 50 GB 00:00:00 
Domain installation still in progress. Waiting for installation to complete.

Domain has shutdown. Continuing.
Domain creation completed.
Restarting guest.
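
For reference, the console step mentioned above is just the following, using the domain name passed to --name:

[root@kvm~]# virsh console logstash.local.net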

Installing and Configuring Logstash

To minimize the risk of any dependency headaches, running a “yum update” post-install to bring all included packages up to date is preferable, though not required here.

One requirement is installing Java. There’s no point in repeating already available documentation for these steps, so be sure to follow the official install guide.
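
As a rough sketch of those prerequisites on CentOS 7 (the OpenJDK package and the 7.x repository below are assumptions; substitute whatever the official guide specifies for your Elastic Stack version):

sudo yum -y install java-1.8.0-openjdk
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
cat <<'EOF' | sudo tee /etc/yum.repos.d/logstash.repo
[logstash-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF
sudo yum -y install logstash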

The below will need to be updated/defined in logstash.yml; all other Logstash settings are left at their default values (the queue.type does not NEED to be persisted, but it’s a personal preference):

[dberry@Logstash ~]$ grep -v '^#' /etc/logstash/logstash.yml |awk NF
node.name: logstash-test 
path.data: /var/lib/logstash
pipeline.id: test 
path.config: /etc/logstash/conf.d/*.conf
queue.type: persisted 
path.logs: /var/log/logstash
[dberry@Logstash ~]$

Next we’ll need to configure the pipeline in logstash using the snmp input plugin to poll the SNMP OIDs we are interested in. First, we need to determine what those OIDs are…

Juniper SRX Configuration Requirements

Configuring the SNMP agent

Follow the kb here to configure the SRX SNMP agent so that we can poll it from Logstash. For reference, this is what I’ve used. Anywhere you see the community (-c) in the snmpwalks in this article, ensure it matches the “community” set here. Name it something unique and don’t leave it as the default “public”:

dberry@junos> show configuration snmp 
location somelocation;
contact "dberry@example.com";
community changeMe {
    authorization read-only;
    clients {
        192.168.xx.xx/24;    <- Logstash / Elastic IPs
        0.0.0.0/0 restrict;
    }
}
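
Once that’s committed, a quick poll of sysDescr (.1.3.6.1.2.1.1.1.0) from the Logstash host confirms the agent is answering; the community and address are the same placeholders used throughout this article:

[dberry@Logstash ~]$ snmpget -v2c -cchangeMe 192.168.xx.xxx .1.3.6.1.2.1.1.1.0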

Gathering Juniper SRX OIDs

There are a few OIDs (such as jnxOperatingCPU) recommended by Juniper to monitor on the SRX platform.

SRX SNMP Monitoring Guide_v1.1

If you have imported the Junos MIBs into Logstash, you can poll the OIDs by their names; otherwise you will need to poll the numeric OIDs. (Importing them is also worthwhile for SNMP trap OID -> String conversion.)

First, confirm you are looking at the right OIDs. You should be able to line these up with CLI output. From the example below, the User, Kernel and Interrupt processes add up to 54% (36 + 17 + 1), which corresponds to the “jnxOperatingCPU.9.1.0.0” OID:

dberry@junos> show chassis routing-engine 
Routing Engine status:
Temperature 56 degrees C / 132 degrees F
Total memory 1024 MB Max 522 MB used ( 51 percent)
Control plane memory 560 MB Max 375 MB used ( 67 percent)
Data plane memory 464 MB Max 144 MB used ( 31 percent)
CPU utilization:
User 36 percent
Background 0 percent
Kernel 17 percent
Interrupt 1 percent
Idle 46 percent
Model RE-SRX100H
Serial ID AT2811AF0443
Start time 2019-08-09 22:54:20 EDT
Uptime 20 days, 16 hours, 8 minutes, 46 seconds
Last reboot reason 0x200:chassis control reset 
Load averages: 1 minute 5 minute 15 minute
0.81 0.72 0.63
dberry@junos> show snmp mib walk jnxOperatingCPU 
jnxOperatingCPU.1.1.0.0 = 0
jnxOperatingCPU.2.1.0.0 = 0
jnxOperatingCPU.7.1.0.0 = 0
jnxOperatingCPU.8.1.1.0 = 0
jnxOperatingCPU.9.1.0.0 = 54 
jnxOperatingCPU.9.1.1.0 = 0

Next, convert the OIDs to their numeric equivalent. The jnx MIBs are built on and dependent upon other IETF MIBs such as the SNMPv2-SMI MIB. We can simply walk through them and then convert them if we don’t already have them imported into Logstash:

dberry@junos> show snmp mib walk .1.3.6.1.4.1.2636.3.1.13.1.8 
jnxOperatingCPU.1.1.0.0 = 0
jnxOperatingCPU.2.1.0.0 = 0
jnxOperatingCPU.7.1.0.0 = 0
jnxOperatingCPU.8.1.1.0 = 0
jnxOperatingCPU.9.1.0.0 = 44
jnxOperatingCPU.9.1.1.0 = 0
[dberry@Logstash ~]$ snmpwalk -cchangeMe -v2c 192.168.xx.xxx .1.3.6.1.4.1.2636.3.1.13.1.8
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.1.1.0.0 = Gauge32: 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.2.1.0.0 = Gauge32: 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.7.1.0.0 = Gauge32: 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.8.1.1.0 = Gauge32: 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.9.1.0.0 = Gauge32: 44
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.9.1.1.0 = Gauge32: 0

So now we know the OID for the Routing-Engine CPU I am interested in is:

“.1.3.6.1.4.1.2636.3.1.13.1.8.9.1.0.0”

Similarly, for the Forwarding Engine:

dberry@junos> show chassis forwarding 
FWDD status:
State Online 
Microkernel CPU utilization 22 percent
Real-time threads CPU utilization 1 percent
Heap utilization 31 percent
Buffer utilization 1 percent
Uptime: 36 days, 1 hour, 17 minutes, 22 seconds
dberry@junos> show snmp mib walk jnxFwddProcess 
jnxFwddMicroKernelCPUUsage.0 = 20
jnxFwddRtThreadsCPUUsage.0 = 2
jnxFwddHeapUsage.0 = 31
jnxFwddDmaMemUsage.0 = 1
jnxFwddUpTime.0 = 3115160
[dberry@Logstash ~]# snmpwalk -cchangeMe -v2c 192.168.xx.xxx .1.3.6.1.4.1.2636.3.34.1
SNMPv2-SMI::enterprises.2636.3.34.1.1.0 = Gauge32: 22
SNMPv2-SMI::enterprises.2636.3.34.1.2.0 = Gauge32: 1
SNMPv2-SMI::enterprises.2636.3.34.1.3.0 = Gauge32: 31
SNMPv2-SMI::enterprises.2636.3.34.1.4.0 = Gauge32: 1
SNMPv2-SMI::enterprises.2636.3.34.1.5.0 = INTEGER: 3115042

While we are at it, let’s gather some interface rates as well. If you start by looking at the CLI output of “show interfaces…” you can find the individual index for each interface. This helps identify and confirm them from the OIDs. We can compare these to the results of an SNMP walk through the IF-MIB.

Example:

dberry@junos> show interfaces fe-0/0/0 |grep SNMP 
Interface index: 133, SNMP ifIndex: 510
Interface flags: SNMP-Traps Internal: 0x0
Logical interface fe-0/0/0.0 (Index 71) (SNMP ifIndex 511) 
Flags: SNMP-Traps Encapsulation: ENET2
vrrp dhcp snmp ssh
[dberry@Logstash ~]$ snmpwalk -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.1 |grep 510
IF-MIB::ifName.510 = STRING: fe-0/0/0

Alternatively, you could look for them here:

root@junos% cat /var/db/dcd.snmp_ix | grep fe
510 "fe-0/0/0" 0 0 1;
511 "fe-0/0/0" 0 1 1;
512 "fe-0/0/1" 0 0 1;
513 "fe-0/0/1" 0 1 1;
514 "fe-0/0/2" 0 0 1;
515 "fe-0/0/2" 0 1 1;
516 "fe-0/0/3" 0 0 1;
517 "fe-0/0/3" 0 1 1;
518 "fe-0/0/4" 0 0 1;
519 "fe-0/0/4" 0 1 1;
520 "fe-0/0/5" 0 0 1;
521 "fe-0/0/5" 0 1 1;
522 "fe-0/0/6" 0 0 1;
523 "fe-0/0/6" 0 1 1;
524 "fe-0/0/7" 0 0 1;
525 "fe-0/0/7" 0 1 1;
root@junos%

Convert the OID string to a numeric format:

[dberry@Logstash ~]$ snmpwalk -On -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.1 |grep 510
.1.3.6.1.2.1.31.1.1.1.1.510 = STRING: fe-0/0/0
[dberry@elastic conf.d]$

From the IF-MIB, we are going to poll Octets in and out:

[dberry@Logstash ~]$ snmpwalk -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.10.510
IF-MIB::ifHCOutOctets.510 = Counter64: 14494313342

Convert to a numeric OID:

[dberry@Logstash ~]$ snmpwalk -On -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.10.510
.1.3.6.1.2.1.31.1.1.1.10.510 = Counter64: 16525645640
[dberry@Logstash ~]$ snmpwalk -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.6.510
IF-MIB::ifHCInOctets.510 = Counter64: 158449944769

Convert to a numeric OID:

[dberry@Logstash ~]$ snmpwalk -On -cchangeMe -v2c 192.168.xx.xxx 1.3.6.1.2.1.31.1.1.1.6.510
.1.3.6.1.2.1.31.1.1.1.6.510 = Counter64: 152704880077
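
As a sanity check on these counters, you can approximate a rate by hand: sample the counter twice and divide the delta by the interval. A rough sketch using the same placeholder address/community (the 10 second window is arbitrary):

#!/bin/bash
#approximate ifHCInOctets rate for ifIndex 510 over a 10 second window
OID=".1.3.6.1.2.1.31.1.1.1.6.510"
C1=$(snmpget -Oqv -v2c -cchangeMe 192.168.xx.xxx "$OID")
sleep 10
C2=$(snmpget -Oqv -v2c -cchangeMe 192.168.xx.xxx "$OID")
#octets -> bits, averaged over the 10 second interval
echo "$(( (C2 - C1) * 8 / 10 )) bits/s"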

Configuring the Logstash pipeline

Here is what my initial configuration looked like. Whatever you name your pipeline, make sure it’s located under /etc/logstash/conf.d/ and ends in .conf:

input {
  snmp {
    id => "Juniper_SRX_Polling"
    type => "polling"
    walk => [".1.3.6.1.4.1.2636.3.34.1"]
    get => [".1.3.6.1.2.1.31.1.1.1.6.510",".1.3.6.1.2.1.31.1.1.1.10.510",".1.3.6.1.4.1.2636.3.3.1.1.1.510",".1.3.6.1.4.1.2636.3.3.1.1.4.510",".1.3.6.1.4.1.2636.3.3.1.1.1.518",".1.3.6.1.4.1.2636.3.3.1.1.4.518",".1.3.6.1.4.1.2636.3.3.1.1.1.522",".1.3.6.1.4.1.2636.3.3.1.1.4.522",".1.3.6.1.4.1.2636.3.1.13.1.8.9.1.0.0",".1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0",".1.3.6.1.4.1.2636.3.1.13.1.11.9.1.0.0",".1.3.6.1.4.1.2636.3.1.13.1.7.9.1.0.0"]
    hosts => [{host => "udp:192.168.xx.xxx/161" community => "changeMe" version => "2c" retries => 2 timeout => 1000 interval => 10}]
    add_field => {host => "%{[@metadata][host_protocol]}:%{[@metadata][host_address]}/%{[@metadata][host_port]},%{[@metadata][host_community]}"}
  }
}

output {
  if [type] == "polling" {
    elasticsearch {
      hosts => [ "192.168.xx.xx:9200" ]
      index => "snmp-polling-%{+YYYY.MM.dd}"
    }
  }
}
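
Before restarting Logstash, you can optionally have it validate the pipeline syntax first (the path assumes the default RPM layout):

[dberry@Logstash ~]$ sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit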

After restarting logstash, you should have an index created in ES, and be able to discover the polled values after creating the index pattern:

snmp-polling-discover
snmp-polling-discover-2
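
A quick way to confirm the index landed (host/port match the elasticsearch output section above) is the _cat API:

[dberry@Logstash ~]$ curl -s 'http://192.168.xx.xx:9200/_cat/indices/snmp-polling-*?v'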

At this point, I’m collecting the following KPIs:

Routing-Engine

  • CPU
  • Routing-Engine and Data Plane Memory
  • Temperature

Forwarding-Engine

  • microkernel CPU
  • Real time CPU threads
  • Heap Utilization
  • Buffer Utilization

Interface Tx/Rx Rates

Visualizing the Data in Kibana

Timelion seems to be the easiest way to do this, considering some of these values need to be converted into a format that makes sense to the human eye. Here is what I came up with:

Forwarding Engine:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.34.1.1.0).yaxis(units=percent).divide(100).label("FWD Engine Microkernel CPU").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.34.1.2.0).yaxis(units=percent).divide(100).label("Real-time threads CPU").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.34.1.3.0).yaxis(units=percent).divide(100).label("Heap Percentage").lines(fill=1,width=2),.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.34.1.4.0).yaxis(units=percent).divide(100).label("Buffer").lines(fill=1,width=2),

Routing-Engine CPU:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.1.13.1.8.9.1.0.0).yaxis(units=percent).divide(100).label("Routing Engine CPU").lines(fill=1,width=2)

Routing-Engine Total Memory:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.1.13.1.11.9.1.0.0).yaxis(units=percent,max=1).divide(100).label("Routing Engine - Total Memory").lines(fill=1,width=2)

Routing-Engine Data Plane Memory:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.39.1.12.1.1.1.5.0).yaxis(units=percent,max=1).divide(100).label("Data Plane Memory").lines(fill=1,width=2)

Chassis Temperature:

.es(index=snmp-polling*,metric=avg:iso.org.dod.internet.private.enterprises.2636.3.1.13.1.7.9.1.0.0).yaxis(units=custom::-Degrees-Celsius,max=100).label("Temp").lines(fill=1,width=2)

Put together in a dashboard…

SRX-KPI-Dashboard

Here are the Timelion queries for the Interface Tx/Rx. Since the data we are polling are octet counters, they are cumulative. This means we have to convert octets to bits (multiply(8)) and average them over the polling interval (using mvavg(10s) below). Using scale_interval helps keep the data logical for the other 9 seconds between polling events:

.es(index=snmp-polling*,metric=sum:iso.org.dod.internet.private.enterprises.2636.3.3.1.1.1.510).multiply(8).mvavg(10s).yaxis(units=bits/s,max=100000000).label("Rx-fe-0/0/0.0 - Download").lines(fill=1,width=2).scale_interval(1s),
.es(index=snmp-polling*,metric=sum:iso.org.dod.internet.private.enterprises.2636.3.3.1.1.4.510).multiply(8).mvavg(10s).yaxis(units=bits/s).label("Tx-fe-0/0/0.0 - Upload").lines(fill=1,width=1).scale_interval(1s)

(Note: I’ve labelled download and upload relative to the traffic direction: InOctets on a WAN-facing interface counts as “download”, while InOctets on a LAN-facing interface would be “upload”.)

.es(index=snmp-polling*,metric=sum:iso.org.dod.internet.private.enterprises.2636.3.3.1.1.1.522).multiply(8).mvavg(10s).yaxis(units=bits/s,max=100000000).label("Rx-fe-0/0/5.0 - Upload").lines(fill=1,width=2).scale_interval(1s),
.es(index=snmp-polling*,metric=sum:iso.org.dod.internet.private.enterprises.2636.3.3.1.1.4.522).multiply(8).mvavg(10s).yaxis(units=bits/s).label("Tx-fe-0/0/5.0 - Download").lines(fill=1,width=1).scale_interval(1s)

Put together in a dashboard…

SRX-INT-Dashboard