Adding cluster nodes, expanding storage - Elasticsearch

This is a copy/paste/censor of our wiki doc (that I wrote, and have permission to publish; the censoring might make stuff look off, sorry).  I left in the generic stuff because maybe we're doing something horribly wrong and a nice person will point it out.  Or maybe we're doing something that someone else will find inspiration in.  Who knows!  Gives some context anyways.  Hope this helps someone...

Overview of adding a new node to the ES cluster.
1.    Deploy the current CentOS template
2.    Assign an ip address from IPadmin
3.    Create the A record in AD DNS now
4.    Add a 500GB disk, located on one of the VMFS-ES-x datastores
5.    Change the networking:
      - /etc/sysconfig/network-scripts/ifcfg-eth0 (change IP)
      - /etc/sysconfig/network (change hostname)
6.    Run updates: yum update
7.    Reboot
ES Big Disk config
If VM is already in place
Rescanning the SCSI hosts makes the new disk show up without rebooting:
echo "- - -" > /sys/class/scsi_host/host0/scan
echo "- - -" > /sys/class/scsi_host/host1/scan
echo "- - -" > /sys/class/scsi_host/host2/scan
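The host numbers vary from VM to VM, so looping over whatever exists under /sys/class/scsi_host saves guessing. A minimal sketch, to be run as root; the function name rescan_scsi_hosts and its optional base-directory argument are my own additions (the argument is only there so it can be dry-run against a scratch directory):

```shell
# Rescan every SCSI host so a hot-added disk shows up without a reboot.
# rescan_scsi_hosts is a hypothetical helper name; the optional argument
# lets you point it at a scratch directory instead of /sys for a dry run.
rescan_scsi_hosts() {
    local base="${1:-/sys/class/scsi_host}"
    local scan
    for scan in "$base"/host*/scan; do
        [ -e "$scan" ] || continue      # skip if the glob matched nothing
        echo "- - -" > "$scan"          # wildcards for channel, target, LUN
        echo "rescanned ${scan%/scan}"
    done
}
```

Call rescan_scsi_hosts with no arguments, then check fdisk -l for the new disk.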
Continue the disk setup
fdisk -l
fdisk /dev/sdb
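*** in fdisk: n (new partition), p (primary), 1 (partition number), Enter twice to accept the default start/end, then w (write)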
pvcreate /dev/sdb1
vgextend vg_name /dev/sdb1
lvcreate -L 490G -n lv_elasticsearch vg_name
vi /etc/fstab
*** copy the root line, then change the device to /dev/vg_name/lv_elasticsearch and the mount point to /elasticsearch
mkdir /elasticsearch
mkfs.ext4 -m 0 /dev/vg_name/lv_elasticsearch
mount -a
chown -R elasticsearch:elasticsearch /elasticsearch/
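# If the node was already active, stop Elasticsearch first so the data isn't changing mid-copy, then copy it over...
service elasticsearch stop
rsync -va --progress /srv/elasticsearch/ /elasticsearch/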
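For the fstab step, the new entry ends up looking like the output of the sketch below. Assumptions on my part: vg_name is the same placeholder used above, the mount options are plain defaults, and the last field (fsck pass) is 2 rather than the 1 you'll find on the root line. fstab_line is just an illustrative helper, not something from our wiki.

```shell
# Print an fstab entry for an ext4 logical volume.
# fstab_line is a hypothetical helper; arguments are VG name, LV name, mount point.
# Fields: device, mount point, fs type, options, dump flag, fsck pass.
fstab_line() {
    local vg="$1" lv="$2" mnt="$3"
    printf '/dev/%s/%s %s ext4 defaults 1 2\n' "$vg" "$lv" "$mnt"
}
```

fstab_line vg_name lv_elasticsearch /elasticsearch prints "/dev/vg_name/lv_elasticsearch /elasticsearch ext4 defaults 1 2". After mount -a, df -h /elasticsearch should show the new volume.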

Elasticsearch setup
# Add the repo and install. Note: all cluster nodes should be on the same version (or as close as you can manage)...
vi /etc/elasticsearch/elasticsearch.yml
*** change path.data from /srv/elasticsearch/data
*** to /elasticsearch/data
service elasticsearch start
tail -f /var/log/elasticsearch/site.elk.elasticsearch.log
# ES config: cluster.name site.elk.elasticsearch, node.name "ESNODE04", path.data /elasticsearch/data
New node - add plugins
/usr/share/elasticsearch/bin/plugin -install karmi/elasticsearch-paramedic
/usr/share/elasticsearch/bin/plugin -install royrusso/elasticsearch-HQ
/usr/share/elasticsearch/bin/plugin -install lmenezes/elasticsearch-kopf
/usr/share/elasticsearch/bin/plugin -install lukas-vlcek/bigdesk
/usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head
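If you'd rather not paste five lines per node, the same installs can be wrapped in a loop. This is only a sketch: install_site_plugins is a name I made up, it uses the same plugin -install syntax as above (the 1.x form), and the optional argument swaps in a different binary so you can dry-run it with echo.

```shell
# Install the five site plugins listed above, one at a time.
# install_site_plugins is a hypothetical wrapper; pass "echo" as the first
# argument to print the commands instead of running them.
install_site_plugins() {
    local plugin_bin="${1:-/usr/share/elasticsearch/bin/plugin}"
    local p
    for p in karmi/elasticsearch-paramedic royrusso/elasticsearch-HQ \
             lmenezes/elasticsearch-kopf lukas-vlcek/bigdesk \
             mobz/elasticsearch-head; do
        # Keep going if one install fails, but say which one it was.
        "$plugin_bin" -install "$p" || echo "failed: $p" >&2
    done
}
```

Run install_site_plugins echo first to see what it would do, then install_site_plugins for real.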

Set the heap size
vi /etc/sysconfig/elasticsearch
*** set the ES_HEAP_SIZE here
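For picking the value: our nodes run a 4g heap on 8GB of RAM, which is the usual "half of RAM" rule of thumb. A rough sketch of that arithmetic, assuming the commonly quoted cap of about 31g for bigger boxes; suggest_heap_mb is a hypothetical helper and takes total memory in kB, the unit /proc/meminfo reports:

```shell
# Suggest an ES_HEAP_SIZE value: half of total RAM, capped at 31g.
# suggest_heap_mb is a hypothetical helper; argument is total memory in kB.
suggest_heap_mb() {
    local mem_kb="$1"
    local half_mb=$(( mem_kb / 1024 / 2 ))
    local cap_mb=$(( 31 * 1024 ))
    if [ "$half_mb" -gt "$cap_mb" ]; then
        half_mb=$cap_mb
    fi
    echo "${half_mb}m"
}
```

On a live box: suggest_heap_mb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)". MemTotal comes in a bit under the nominal RAM size, so expect something slightly below the round number; rounding to 4g by hand is fine.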

Almost there...
chkconfig elasticsearch on
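# restart rather than start: it was already started above, and the heap size is only read at startup
service elasticsearch restart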
tail -f /var/log/elasticsearch/site.elk.elasticsearch.log
At this point, just watch for errors; it should join the cluster and be happy.
[2014-12-19 07:29:17,353][INFO ][node                     ] [ESNODE04] version[1.3.7], pid[2232], build[3042293/2014-12-16T13:59:32Z]
[2014-12-19 07:29:17,353][INFO ][node                     ] [ESNODE04] initializing ...
[2014-12-19 07:29:17,361][INFO ][plugins                  ] [ESNODE04] loaded [], sites [head, bigdesk, HQ, kopf, paramedic]
[2014-12-19 07:29:21,058][INFO ][node                     ] [ESNODE04] initialized
[2014-12-19 07:29:21,058][INFO ][node                     ] [ESNODE04] starting ...
[2014-12-19 07:29:21,232][INFO ][transport                ] [ESNODE04] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/]}
[2014-12-19 07:29:21,268][INFO ][discovery                ] [ESNODE04] site.elk.elasticsearch/gz8pDlu9TTymibM1LToetA
[2014-12-19 07:29:24,590][INFO ][cluster.service          ] [ESNODE04] detected_master [ESNODE02][YDtom9MJSGiveG_MDS4QqQ][][inet[/]], added {[ESNODE03][1xEs3izRRxinKwCqMKjO8A][][inet[/]],[][0P3kPIyyR42pH-jypmaViw][][inet[/]]{client=true, data=false},[ESNODE02][YDtom9MJSGiveG_MDS4QqQ][][inet[/]],[ESNODE01][S9nv_0MEQMaHH2Jbw5jEGA][][inet[/]],}, reason: zen-disco-receive(from master [[ESNODE02][YDtom9MJSGiveG_MDS4QqQ][][inet[/]]])
[2014-12-19 07:29:24,931][INFO ][http                     ] [ESNODE04] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/]}
[2014-12-19 07:29:24,931][INFO ][node                     ] [ESNODE04] started

Issues w. plugins loading
If you get timeouts loading the plugins but the cluster health is OK, it's most likely a client-side thing (e.g. browser cache). I had some problems with Chrome that magically cleared up...
# Verify cluster health...
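One way to check health from the shell is the _cluster/health API. A small sketch, assuming ES answers HTTP on localhost:9200 (the port in the http bound_address log line above); cluster_status is a hypothetical helper that pulls the status field out with sed rather than a proper JSON parser:

```shell
# Read _cluster/health JSON on stdin and print the status (green/yellow/red).
# cluster_status is a hypothetical helper; it handles both compact and
# ?pretty output.
cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
```

Usage: curl -s 'http://localhost:9200/_cluster/health' | cluster_status. The same response also carries number_of_nodes, which should go up by one once the new node has joined.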


  1. As an update to this, we've just added our fifth ES node. If anyone is curious, our nodes are virtual machines - 8GB RAM (4g heap), 2vCPU, 500GB ES data disk.

    On my new set of goals:
    1. What is the best practice for index creation/data separation?
    2. What are the recommended optimization/cleanup scripts/tools that should be running? (beyond curator)

