Ruan Bekker's Blog

From a Curious mind to Posts on Github

SQL Inner Joins Examples With SQLite

sqlite

Overview

In this tutorial we will get started with sqlite and use inner joins to query data from multiple tables to answer specific use case needs.

Connecting to your Sqlite Database

Connecting to your database uses the argument to the target database. You can use additional flags to set the properties that you want to enable:

1
$ sqlite3 -header -column mydatabase.db

or you can specify the additional options to your config:

1
2
3
4
cat > ~/.sqliterc << EOF
.mode column
.headers on
EOF

Then connecting to your database:

1
2
3
4
5
$ sqlite3 mydatabase.db
-- Loading resources from /Users/ruan/.sqliterc
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
sqlite>

Create the Tables

Create the users table:

1
2
3
4
sqlite> create table users (
   ...> id INT(20), name VARCHAR(20), surname VARCHAR(20), city VARCHAR(20),
   ...> age INT(2), credit_card_num VARCHAR(20), job_position VARCHAR(20)
   ...> );

Create the transactions table:

1
2
3
4
sqlite> create table transactions (
   ...> credit_card_num VARCHAR(20), transaction_id INT(20), shop_name VARCHAR(20),
   ...> product_name VARCHAR(20), spent_total DECIMAL(6,2), purchase_option VARCHAR(20)
   ...> );

You can view the tables using .tables:

1
2
sqlite> .tables
transactions  users

And view the schema of the tables using .schema <table-name>

1
2
3
4
5
sqlite> .schema users
CREATE TABLE users (
id INT(20), name VARCHAR(20), surname VARCHAR(20), city VARCHAR(20),
age INT(2), credit_card_num VARCHAR(20), job_position VARCHAR(20)
);

Write to Sqlite Database

Now we will populate data to our tables. Insert a record to our users table:

1
sqlite> insert into users values(1, 'ruan', 'bekker', 'cape town', 31, '2345-8970-6712-4352', 'devops');

Insert a record to our transactions table:

1
sqlite> insert into transactions values('2345-8970-6712-4352', 981623, 'spaza01', 'burger', 101.02, 'credit_card');

Read from the Sqlite Database

Read the data from the users table:

1
2
3
4
sqlite> select * from users;
id          name        surname     city        age         credit_card_num      job_position
----------  ----------  ----------  ----------  ----------  -------------------  ------------
1           ruan        bekker      cape town   31          2345-8970-6712-4352  devops

Read the data from the transactions table:

1
2
3
4
sqlite> select * from transactions;
credit_card_num      transaction_id  shop_name   product_name  spent_total  purchase_option
-------------------  --------------  ----------  ------------  -----------  ---------------
2345-8970-6712-4352  981623          spaza01     burger        101.02       credit_card

Inner Joins with Sqlite

This is where stuff gets interesting.

Let’s say you want to join data from 2 tables, in this example we have a table with our userdata and a table with transaction data.

Say we want to see the user’s name, transaction id, the product they purchased and the total amount spent, we will make use of inner joins.

Example looks like this:

1
2
3
4
5
6
7
8
sqlite> select a.name, b.transaction_id, b.product_name, b.spent_total
   ...> from users
   ...> as a inner join transactions
   ...> as b on a.credit_card_num = b.credit_card_num
   ...> where a.credit_card_num = '2345-8970-6712-4352';
name        transaction_id  product_name  spent_total
----------  --------------  ------------  -----------
ruan        981623          burger         101.02

Let’s say you dont know the credit_card number but you would like to do a lookup the credit card number via the user’s id, then pass the value to the where statement:

1
2
3
4
5
6
7
8
sqlite> select a.name, b.transaction_id, b.product_name, b.spent_total
   ...> from users
   ...> as a inner join transactions
   ...> as b on a.credit_card_num = b.credit_card_num
   ...> where a.credit_card_num = (select credit_card_num from users where id = 1);
name        transaction_id  product_name  spent_total
----------  --------------  ------------  -----------
ruan        981623          burger         101.02

Let’s create another table called products:

1
2
3
4
sqlite> create table products (
   ...> product_id INTEGER(18), product_name VARCHAR(20),
   ...> product_category VARCHAR(20), product_price DECIMAL(6,2)
   ...> );

Write a record with product data to the table:

1
sqlite> insert into products values(0231, 'burger', 'fast foods', 101.02);

Now, lets say the question will be that we need to display the users name, credit card number, product name as well as the product category and products price, by only giving you the credit card number

1
2
3
4
5
6
7
8
9
sqlite> select a.name, b.credit_card_num, c.product_name, c.product_category, c.product_price
   ...> from users
   ...> as a inner join transactions
   ...> as b on a.credit_card_num = b.credit_card_num inner join products
   ...> as c on b.product_name = c.product_name
   ...> where a.credit_card_num = '2345-8970-6712-4352' and c.product_name = 'burger';
name        credit_card_num      product_name  product_category   product_price
----------  -------------------  ------------  -----------------  -------------
ruan        2345-8970-6712-4352  burger        fast foods         101.02

Install Elasticsearch With Ansible Tutorial

In this tutorial we will install a elasticsearch cluster with ansible (well rather a node)

Our inventory:

1
2
3
4
5
6
$ cat inventory.ini
[newes]
esnewnode

[newes:vars]
ansible_python_interpreter=/usr/bin/python3

Our playbook to bootstrap our nodes with Python:

1
2
3
4
5
6
7
8
$ cat bootstrap-python.yml
---
- hosts: newes
  gather_facts: False

  tasks:
  - name: install python
    raw: test -e /usr/bin/python || ( apt update && apt install python -y )

Our playbook to provision users:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
$ cat provision-users.yml
---
# Provisions User on Nodes
# Setup Passwordless SSH from Jumpbox
# Install Packages using APT
- name: bootstrap python
  hosts: newes
  roles:
    - bootstrap-python

- name: setup pre-requisites
  hosts: newes
  roles:
    - create-sudo-user
    - install-modules
    - configure-hosts-file

#- name: setup ruan user on the nodes
#  become: yes
#  become_user: ruan
#  hosts: admin
#  roles:
#    - configure-admin

- name: copy public key to nodes
  become: yes
  become_user: ruan
  hosts: newes
  roles:
    - copy-keys

- name: install elasticsearch
  hosts: newes
  roles:
    - elasticsearch

Our roles that will be included in our playbooks from above:

1
2
3
4
5
6
7
8
9
10
11
12
13
$ cat roles/create-sudo-user/tasks/main.yml
---
- name: Create Sudo User
  user: name=ruan
        groups=sudo
        shell=/bin/bash
        generate_ssh_key=no
        state=present

- name: Set Passwordless SSH Access for ruan user
  copy: src=sudoers
        dest=/etc/sudoers.d
        mode=0440

Sudoers file for the create sudo role:

1
2
$ cat roles/create-sudo-user/files/sudoers
ruan ALL=(ALL) NOPASSWD:ALL

The role to install all the apt packages:

1
2
3
4
5
6
7
8
9
10
11
12
$ cat roles/install-modules/tasks/main.yml
---
- name: Install Packages
  apt: name= state=latest update_cache=yes
  with_items:
    - apt-transport-https
    - ntp
    - python
    - tcpdump
    - wget
    - openssl
    - curl

Role to configure hosts file:

1
2
3
4
5
6
$ cat roles/configure-hosts-file/tasks/main.yml
---
- name: Configure Hosts File
  lineinfile: path=/etc/hosts regexp='.*$' line=" " state=present
  when: hostvars[item].ansible_default_ipv4.address is defined
  with_items: ""

The role to copy the ssh keys:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ cat roles/copy-keys/tasks/main.yml
---
- name: Copy Public Key to Other Hosts
  become: true
  become_user: ruan
  copy:
    src: /tmp/id_rsa.pub
    dest: /tmp/id_rsa.pub
    mode: 0644
- name: Append Public key in authorized_keys file
  authorized_key:
    user: ruan
    state: present
    key: ""

The role to install elasticsearch:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
$ cat roles/elasticsearch/tasks/main.yml
---
- name: get apt repo key
  apt_key:
    url: https://artifacts.elastic.co/GPG-KEY-elasticsearch
    state: present

- name: install apt repo
  apt_repository:
    repo: deb https://artifacts.elastic.co/packages/6.x/apt stable main
    state: present
    filename: elastic-6.x.list
    update_cache: yes

- name: install java
  apt:
    name: openjdk-8-jre
    state: present
    update_cache: yes

- name: install elasticsearch
  apt:
    name: elasticsearch
    state: present
    update_cache: yes

- name: reload systemd config
  systemd: daemon_reload=yes

- name: enable service elasticsearch and ensure it is not masked
  systemd:
    name: elasticsearch
    enabled: yes
    masked: no

- name: ensure elasticsearch is running
  systemd: state=started name=elasticsearch

Deploy Elasticsearch

Bootstrap python then deploy elasticsearch:

1
2
$ ansible-playbook -i inventory.ini -u root bootstrap-python.yml
$ ansible-playbook -i inventory.ini -u root provision-users.yml

Test out elasticsearch:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ curl http://127.0.0.1:9200/
{
  "name" : "Z52AEZ7",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "fUiYVjsSQpCbo9QKEiuvaA",
  "version" : {
    "number" : "6.3.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "424e937",
    "build_date" : "2018-06-11T23:38:03.357887Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Elasticsearch Templates Tutorial

elasticsearch

Elasticsearch Index templates allow you to define templates that will automatically be applied on index creation time. The templates can include both settings and mappings..

What are we doing?

We want to create a template on how we would a target index to look like. It should consist of 1 primary shard and 2 replica shards and we want to update the mapping that we can make use of text and keyword string fields.

So then whenever we create an index which matches our template, the template will be applied on index creation.

String Fields

We will make use of the following string fields in our mappings which will be included in our templates:

Text:

A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field

Keyword":

A field to index structured content such as email addresses, hostnames, status codes, zip codes or tags.

They are typically used for filtering (Find me all blog posts where status is published), for sorting, and for aggregations. Keyword fields are only searchable by their exact value

Note about templates:

Couple of things to keep in mind:

1
2
1. Templates gets referenced on index creation and does not affect existing indexes
2. When you update a template, you need to specify the exact template, the payload overwrites the whole template

View your current templates in your cluster:

1
2
3
4
$ curl -XGET http://localhost:9200/_cat/templates?v
name                          index_patterns             order      version
.monitoring-kibana            [.monitoring-kibana-6-*]   0          6020099
filebeat-6.3.1                [filebeat-6.3.1-*]         1

Create the template foobar_docs which will match any indexes matching foo-* and bar-* which will inherit index settings of 1 primary shards and 2 replica shards and also apply a mapping template shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_template/foobar_docs -d '
{
  "index_patterns": [
    "foo-*", "bar-*"
  ], 
  "settings": {
    "number_of_shards": 1, 
    "number_of_replicas": 2
  }, 
  "mappings": {
    "type1": {
      "_source": {"enabled": true}, 
      "properties": {"created_at": {"type": "date"}, 
      "title": {"type": "text"}, 
      "status": {"type": "keyword"}, 
      "content": {"type":"text"}, 
      "first_name": {"type": "keyword"}, 
      "last_name": {"type": "keyword"}, 
      "age": {"type":"integer"}, 
      "registered": {"type": "boolean"}
      }
    }
  }
}'
{"acknowledged":true}

View the template from the api:

1
2
3
$ curl -XGET http://localhost:9200/_cat/templates/foobar_docs?v
name        index_patterns order version
foobar_docs [foo-*, bar-*] 0

Create a index that will match the templates definition:

1
2
$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/test-2018.07.20
{"acknowledged":true,"shards_acknowledged":true,"index":"test-2018.07.20"}

Verify that the index has been created:

1
2
3
$ curl -XGET http://localhost:9200/_cat/indices/test-2018.07.20?v
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test-2018.07.20 -5XOfl0GTEGeHycTwL51vQ   5   1          0            0        2kb          1.1kb

We can also inspect the template like shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
$ curl -XGET http://localhost:9200/_template/foobar_docs?pretty
{
  "foobar_docs" : {
    "order" : 0,
    "index_patterns" : [
      "foo-*",
      "bar-*"
    ],
    "settings" : {
      "index" : {
        "number_of_shards" : "1",
        "number_of_replicas" : "2"
      }
    },
    "mappings" : {
      "type1" : {
        "_source" : {
          "enabled" : true
        },
        "properties" : {
          "created_at" : {
            "type" : "date"
          },
          "title" : {
            "type" : "text"
          },
          "status" : {
            "type" : "keyword"
          },
          "content" : {
            "type" : "text"
          },
          "first_name" : {
            "type" : "keyword"
          },
          "last_name" : {
            "type" : "keyword"
          },
          "age" : {
            "type" : "integer"
          },
          "registered" : {
            "type" : "boolean"
          }
        }
      }
    },
    "aliases" : { }
  }
}

Ingest a document to your index:

1
2
3
4
5
6
7
8
9
10
$ curl -H 'Content-Type: application/json' -XPOST http://localhost:9200/foo-2018.07.20/type1/ -d '
{
  "title": "this is a post", 
  "status": "active", 
  "content": "introduction post", 
  "first_name": "ruan", 
  "last_name": "bekker", 
  "age": "31", 
  "registered": "true"
}'

Run a search against your elasticsearch index to view the data:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$ curl -XGET http://localhost:9200/foo-2018.07.20/_search?pretty
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "ZYfotmQB9mQGWzJT7W2y",
        "_score" : 1.0,
        "_source" : {
          "title" : "this is a post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "ruan",
          "last_name" : "bekker",
          "age" : "31",
          "registered" : "true"
        }
      }
    ]
  }
}

Create another document:

1
2
3
4
5
6
7
8
9
10
11
$ curl -H 'Content-Type: application/json' -XPOST http://localhost:9200/foo-2018.07.20/type1/ -d '
{
  "created_at": 1532077144, 
  "title": "this is a another post", 
  "status": "ae", 
  "content": "introduction post", 
  "first_name": "stefan", 
  "last_name": "bester", 
  "age": 34, 
  "registered": "true"
}'

As you guessed, executing another search against elasticsearch shows us both documents:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
$ curl -XGET http://localhost:9200/foo-2018.07.20/_search?pretty
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "ZYfotmQB9mQGWzJT7W2y",
        "_score" : 1.0,
        "_source" : {
          "title" : "this is a post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "ruan",
          "last_name" : "bekker",
          "age" : "31",
          "registered" : "true"
        }
      },
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "rofrtmQB9mQGWzJTxnvp",
        "_score" : 1.0,
        "_source" : {
          "created_at" : 1532077144,
          "title" : "this is a another post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "stefan",
          "last_name" : "bester",
          "age" : 34,
          "registered" : "true"
        }
      }
    ]
  }
}

Let’s run a search query for any documents matching people with the age between 30 and 40:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
$ curl -H 'Content-Type: application/json' -XGET http://localhost:9200/foo-2018.07.20/_search?pretty -d '{"query": {"range": {"age": {"gte": 30, "lte": 40}}}}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "ZYfotmQB9mQGWzJT7W2y",
        "_score" : 1.0,
        "_source" : {
          "title" : "this is a post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "ruan",
          "last_name" : "bekker",
          "age" : "31",
          "registered" : "true"
        }
      },
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "rofrtmQB9mQGWzJTxnvp",
        "_score" : 1.0,
        "_source" : {
          "created_at" : 1532077144,
          "title" : "this is a another post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "stefan",
          "last_name" : "bester",
          "age" : 34,
          "registered" : "true"
        }
      }
    ]
  }
}

Search for people with the age between 32 and 40:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
$ curl -H 'Content-Type: application/json' -XGET http://localhost:9200/foo-2018.07.20/_search?pretty -d '{"query": {"range": {"age": {"gte": 32, "lte": 40}}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo-2018.07.20",
        "_type" : "type1",
        "_id" : "rofrtmQB9mQGWzJTxnvp",
        "_score" : 1.0,
        "_source" : {
          "created_at" : 1532077144,
          "title" : "this is a another post",
          "status" : "active",
          "content" : "introduction post",
          "first_name" : "stefan",
          "last_name" : "bester",
          "age" : 34,
          "registered" : "true"
        }
      }
    ]
  }
}

Let’s say we want to update our template with refresh_interval, primary shards of 2 and replicas of 1 settings:

1
2
3
4
5
$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_template/foobar_docs -d '
{
  "index_patterns": ["foo-*", "bar-*"], 
  "settings": {"number_of_shards": 2, "number_of_replicas": 1, "refresh_interval": "15s"}
}'

View the template, as you can see the target template will look exactly like the data body that we are posting to the template api:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ curl -XGET http://localhost:9200/_template/foobar_docs?pretty
{
  "foobar_docs" : {
    "order" : 0,
    "index_patterns" : [
      "foo-*",
      "bar-*"
    ],
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "number_of_replicas" : "1",
        "refresh_interval" : "15s"
      }
    },
    "mappings" : { },
    "aliases" : { }
  }
}

View our current index, as you can see the index is unaffected of the template change as only new indexes will retrieve the update of the template:

1
2
3
$ curl -XGET http://localhost:9200/_cat/indices/foo-2018.07.20?v
health status index          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   foo-2018.07.20 ol1pGugrQCKd0xES4R6oFg   1   2          2            0     20.4kb         10.2kb

Create a new index to verify that the template’s config is pulled into the new index:

1
$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/foo-2018.07.20-new

View the elasticsearch indexes to verify the behavior:

1
2
3
4
$ curl -XGET http://localhost:9200/_cat/indices/foo-2018.07.*?v
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   foo-2018.07.20     ol1pGugrQCKd0xES4R6oFg   1   2          2            0     20.4kb         10.2kb
green  open   foo-2018.07.20-new g6Ii8jtKRFa1zDVB2IsDBQ   2   1          0            0       920b           460b

Delete the indexes:

1
2
$ curl -XDELETE http://localhost:9200/foo-*
{"acknowledged":true}

Delete the templates:

1
2
$ curl -XDELETE 'http://localhost:9200/_template/foobar_docs'
{"acknowledged":true}

Verify that the templates are gone:

1
2
$ curl -XGET http://localhost:9200/_cat/templates/foobar_docs?v
name index_patterns order version

Resources:

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html https://www.elastic.co/guide/en/elasticsearch/reference/6.3/mapping-types.html https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html

Reindex Your Elasticsearch Indexes Tutorial

At times you may find that the indexes in your cluster are not queried that often but you still want them around. But you also want to reduce the resource footprint by reducing the number of shards, and perhaps increase the refresh interval.

For refresh interval, if new data comes in and we dont care to have it available near real time, we can set the refresh interval for example to 60 seconds, so the index will only have the data available every 60 seconds. (default: 1s)

Reindexing Elasticsearch Indexes

In this example we will use the scenario where we have daily indexes with 5 primary shards and 1 set of replicas and we would like to create a weekly index with 1 primary shard, 1 replica and the refresh interval of 60 seconds, and reindex the previous weeks data into our weekly index.

Create the target weekly index with the mentioned configuration:

1
2
3
4
5
6
7
8
9
$ curl -H "Content-Type: application/json" -XPUT 'http://127.0.0.1:9200/my-index-2019.01.01-07' -d '
{
  "settings": {
      "number_of_shards": "1",
      "number_of_replicas": "1",
      "refresh_interval" : "60s"
  }
}
'

Ensure the index exist:

1
2
3
4
5
6
$ curl -s -XGET 'http://127.0.0.1:9200/_cat/indices/my-index-2019.01.01*?v'
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   my-index-2019.01.01      wbFEJCApSpSlbOXzb1Tjxw   5   1      22007            0      6.6mb          3.2mb
green  open   my-index-2019.01.02      cbDmJR7pbpRT3O2x46fj20   5   1      28031            0      7.2mb          3.4mb
..
green  open   my-index-2019.01.01-07   mJR7pJ9O4T3O9jzyI943ca   1   1          0            0       466b           233b

Create the reindex job, specify the source indexes and the destination index where the data must be reindexed to:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ curl -s -H 'Content-Type: application/json' -XPOST 'http://127.0.0.1:9200/_reindex' -d '
{
  "source": {
      "index": [
          "my-index-2019.01.01",
          "my-index-2019.01.02",
          "my-index-2019.01.03",
          "my-index-2019.01.04",
          "my-index-2019.01.05",
          "my-index-2019.01.06",
          "my-index-2019.01.07"
      ]
  },
  "dest": {
      "index": "my-index-2019.01.01-07"
  }
}
'

You can use the tasks api to monitor the progress:

1
2
3
$ curl -s -XGET 'http://127.0.0.1:9200/_cat/tasks?'
indices:data/write/bulk        -3MIFskURPKxd1tg8P2j0w:912621270 -                                transport 1538459598188 22:53:18 3.1ms       x.x.x.x -3MIFsk
indices:data/write/bulk[s]     -3MIFskURPKxd1tg8P2j0w:912621271 -3MIFskURPKxd1tg8P2j0w:816648230 transport 1538459598188 22:53:18 3.1ms       x.x.x.x -3MIFsk

You manipulate the output of the tasks api by either fetching specific actions:

1
$ curl -s -XGET 'http://127.0.0.1:9200/_tasks?actions=*data/write/reindex&detailed&pretty'

Or viewing detailed output:

1
2
$ curl -s -XGET 'http://127.0.0.1:9200/_cat/tasks?detailed' | grep 'indices:data/write/reindex'
indices:data/write/reindex     IvoqWoUqSgGCQ0ELG21nhg:740560815 -                                transport 1538462294714 23:38:14 1.7m        x.x.x.x IvoqWoU reindex from [my-index-2019.01.01] to [my-index-2019.01.01-07]

Or you could get the json response:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
$ curl -s -XGET 'http://127.0.0.1:9200/_tasks?actions=*data/write/reindex&detailed&pretty'
{
  "nodes" : {
    "xx" : {
      "name" : "xx",
      "roles" : [ "data", "ingest" ],
      "tasks" : {
        "xx:876452606" : {
          "node" : "xx",
          "id" : 776452606,
          "type" : "transport",
          "action" : "indices:data/write/reindex",
          "status" : {
            "total" : 4785475,
            "updated" : 0,
            "created" : 234000,
            "deleted" : 0,
            "batches" : 235,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries" : {
              "bulk" : 0,
              "search" : 0
            },
            "throttled_millis" : 0,
            "requests_per_second" : -1.0,
            "throttled_until_millis" : 0
          },
          "description" : "reindex from [my-index-2019.01.07] to [my-index-2019.01.01-07]",
          "start_time_in_millis" : 1538462901120,
          "running_time_in_nanos" : 64654161339,
          "cancellable" : true
        }
      }
    }
  }
}

Anyways, moving along. Reindex jobs will always be listed as a data/write/reindex action, so we can count the output:

1
$ curl -s -XGET 'http://127.0.0.1:9200/_cat/tasks?'  | grep 'data/write/reindex' | wc -l

If the response is 0 then all the tasks completed and we can have a look at our index again:

1
2
3
4
5
6
$ curl -s -XGET 'http://127.0.0.1:9200/_cat/indices/my-index-2019.01.0*?v'
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   my-index-2019.01.01      wbFEJCApSpSlbOXzb1Tjxw   5   1      22007            0      6.6mb          3.2mb
green  open   my-index-2019.01.02      cbDmJR7pbpRT3O2x46fj20   5   1      28031            0      7.2mb          3.4mb
..
green  open   my-index-2019.01.01-07   mJR7pJ9O4T3O9jzyI943ca   1   1     322007            0     45.9mb         22.9mb

Now that we can verify that the reindex tasks finished and we can see the aggregated result in our target index, we can delete our source indexes:

1
$ curl -XDELETE 'http://127.0.0.1:9200/my-index-2019.01.01,my-index-2019.01.02,my-index-2019.01.03,my-index-2019.01.04,my-index-2019.01.05,my-index-2019.01.06,my-index-2019.01.07'

Shrink Your Elasticsearch Index by Reducing the Shard Count With the Shards API

elasticsearch

Resize your Elasticsearch Index with fewer Primary Shards by using the Shrink API.

In Elasticsearch, every index consists of multiple shards and every shard in your elasticsearch cluster contributes to the usage of your cpu, memory, file descriptors etc. This definitely helps for performance in parallel processing. As for an example with time series data, you would write and read a lot to an index with ie the current date.

If that index drops in requests and only read from the index every now and then, we dont need that many shards anymore and if we have multiple indexes, they may build up and take up unessacary compute power.

For a scenario where we want to reduce the size of our indexes, we can use the Shrink API to reduce the number of primary shards.

The Shrink API

The shrink index API allows you to shrink an existing index into a new index with fewer primary shards. The requested number of primary shards in the target index must be a factor of the number of shards in the source index. For example an index with 8 primary shards can be shrunk into 4, 2 or 1 primary shards or an index with 15 primary shards can be shrunk into 5, 3 or 1. If the number of shards in the index is a prime number it can only be shrunk into a single primary shard. Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node.

Steps on Shrinking:

Create the target index with the same definition as the source index, but with a smaller number of primary shards. Then it hard-links segments from the source index into the target index. Finally, it recovers the target index as though it were a closed index which had just been re-opened.

Reduce the Primary Shards of an Index.

As you may know, you can only set the Primary Shards on Index Creation time and Replica Shards you can set on the fly.

In this example we have a source index: my-index-2019.01.10 with 5 primary shards and 1 replica shard, which gives us 10 shards for that index, that we would like to shrink to an index named: archive_my-index-2019.01.10 with 1 primary shard and 1 replica shard, which will give us 2 shards for that index.

Have a look at your index:

1
2
3
4
5
$ curl -XGET "http://127.0.0.1:9200/_cat/indices/my-index-2019.01.*?v"
health status index                                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   my-index-2019.01.10                       xAijUTSevXirdyTZTN3cuA   5   1   80795533            0      5.9gb          2.9gb
green  open   my-index-2019.01.11                       yb8Cjy9eQwqde8mJhR_vlw   5   5   80590481            0      5.7gb          2.8gb
...

And have a look at the nodes, as we will relocate the shards to a specific node:

1
2
3
4
$ curl http://127.0.0.1:9200/_cat/nodes?v
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
x.x.x.x             8          98   0    0.04    0.03     0.01 m         -      3E9yp60
x.x.x.x            65          99   4    0.43    0.23     0.36 di        -      znFrs18

In this demonstration we only have 2 nodes with a replication factor of 1, which means a index’s shards will always reside on both nodes. In a case with more nodes, we need to ensure that we choose a node where a primary index reside on.

Look at the shards api, by passing the index name to get the index to shard allocation:

1
2
3
4
5
6
7
8
9
10
11
12
$ curl http://127.0.0.1:9200/_cat/shards/my-index-2019.01.10?v'
index               shard prirep state   docs  store ip       node
my-index-2019.01.10 2     p      STARTED  193  101mb x.x.x.x  Lq9P7eP
my-index-2019.01.10 2     r      STARTED  193  101mb x.x.x.x  F5edOwK
my-index-2019.01.10 4     p      STARTED  197  101mb x.x.x.x  Lq9P7eP
my-index-2019.01.10 4     r      STARTED  197  101mb x.x.x.x  F5edOwK
my-index-2019.01.10 3     r      STARTED  184  101mb x.x.x.x  Lq9P7eP
my-index-2019.01.10 3     p      STARTED  184  101mb x.x.x.x  F5edOwK
my-index-2019.01.10 1     r      STARTED  180  101mb x.x.x.x  Lq9P7eP
my-index-2019.01.10 1     p      STARTED  180  101mb x.x.x.x  F5edOwK
my-index-2019.01.10 0     p      STARTED  187  101mb x.x.x.x  Lq9P7eP
my-index-2019.01.10 0     r      STARTED  187  101mb x.x.x.x  F5edOwK

Create the target index:

1
2
3
4
5
6
7
8
$ curl -XPUT -H 'Content-Type: application/json' http://127.0.0.1:9200/archive_my-index-2019.01.10 -d '
{
  "settings": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
  }
}
'

Set the index as read only and relocate every copy of shard to node we indentified in a previous step:

1
2
3
4
5
6
7
8
$ curl -XPUT -H 'Content-Type: application/json' http://127.0.0.1:9200/my-index-2019.01.10/_settings -d '
{
  "settings": {
      "index.routing.allocation.require._name": "Lq9P7eP",
      "index.blocks.write": true
  }
}
'

Now shrink the source index (my-index-2019.01.10) to the target index (archive_my-index-2019.01.10):

1
$ curl -XPOST -H 'Content-Type: application/json' http://127.0.0.1:9200/my-index-2019.01.10/_shrink/archive_my-index-2019.01.10

You can monitor the progress by using the Recovery API:

1
2
3
4
5
6
$ curl -s -XGET "http://127.0.0.1:9200/_cat/recovery/my-index-2019.01.10?human&detailed=true"
my-index-2019.01.10 0 23.3s peer done x.x.x.x  F5edOwK x.x.x.x Lq9P7eP n/a n/a 15 15 100.0% 15 635836677 635836677 100.0% 635836677 0 0 100.0%
my-index-2019.01.10 1 22s   peer done x.x.x.x  Lq9P7eP x.x.x.x Lq9P7eP n/a n/a 15 15 100.0% 15 636392649 636392649 100.0% 636392649 0 0 100.0%
my-index-2019.01.10 2 19.6s peer done x.x.x.x  F5edOwK x.x.x.x Lq9P7eP n/a n/a 15 15 100.0% 15 636809671 636809671 100.0% 636809671 0 0 100.0%
my-index-2019.01.10 3 21.5s peer done x.x.x.x  Lq9P7eP x.x.x.x Lq9P7eP n/a n/a 15 15 100.0% 15 636378870 636378870 100.0% 636378870 0 0 100.0%
my-index-2019.01.10 4 23.3s peer done x.x.x.x F5edOwK- x.x.x.x Lq9P7eP n/a n/a 15 15 100.0% 15 636545756 636545756 100.0% 636545756 0 0 100.0%

You can also pass aliases as your table columns for output:

1
2
3
4
$ curl -s -XGET "http://127.0.0.1:9200/_cat/recovery/my-index-2019.01.10?v&detailed=true&h=index,shard,time,ty,st,shost,thost,f,fp,b,bp"
index                            shard time  ty   st   shost         thost        f  fp     b         bp
my-index-2019.01.10              0     23.3s peer done x.x.x.x x.x.x.x 15 100.0% 635836677 100.0%
...

When the job is done, have a look at your indexes:

1
2
3
4
$ curl -XGET "http://127.0.0.1:9200/_cat/indices/*my-index-2019.01.10?v"
health status index                                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   archive_my-index-2019.01.10               PAijUTSeRvirdyTZTN3cuA   1   1   80795533            0      5.9gb          2.9gb
green  open   my-index-2019.01.10                       Cb8Cjy9CQwqde8mJhR_vlw   5   1   80795533            0      2.9gb          2.9gb

Remove the block on your old index in order to make it writable:

1
2
3
4
5
6
7
8
$ curl -XPUT -H 'Content-Type: application/json' http://127.0.0.1:9200/my-index-2019.01.10/_settings" -d '
{
  "settings": {
      "index.routing.allocation.require._name": null,
      "index.blocks.write": null
  }
}
'

Delete the old index:

1
$ curl -XDELETE -H 'Content-Type: application/json' http://127.0.0.1:9200/my-index-2019.01.10

Note:, On AWS Elasticsearch Service, if you dont remove the block and you trigger a redeployment, you will end up with something like this. Shard may still be constraint to a node.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ curl -s -XGET ${ES_HOST/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host          ip  node
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  ap9Mx1R
     1        3.6gb    54.9gb    952.8gb   1007.8gb            5 x.x.x.x  x.x.x.x  PqmoQpN   <-----------
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  5p7x4Lc
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  c8kniP3
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  jPwlwsD
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  ljos4mu
   481      904.1gb   990.3gb    521.3gb      1.4tb           65 x.x.x.x  x.x.x.x  qAF-gIU
   481      820.2gb   903.6gb    608.1gb      1.4tb           59 x.x.x.x  x.x.x.x  dR3sNwA
   481      824.6gb   909.1gb    602.6gb      1.4tb           60 x.x.x.x  x.x.x.x  fvL4A9X
   481      792.7gb   876.5gb    635.2gb      1.4tb           57 x.x.x.x  x.x.x.x  lk4svht
   481      779.2gb   864.4gb    647.3gb      1.4tb           57 x.x.x.x  x.x.x.x  uLsej9m
     0           0b    51.2gb    956.5gb   1007.8gb            5 x.x.x.x  x.x.x.x  yM4Ka9l

Resources:

Self Hosted Git and CICD Platform With Gitea and Drone on Docker

Both gitea and drone is built on golang runs on multiple platforms including a raspberry pi and its super lightweight. Oh yes, and its awesome!

In this tutorial we will see how we can implement our own git service and cicd platform by setting up gitea and drone on docker and commit a python flask application to gitea and build a pipeline on drone.

Some definition

Gitea: will be your self hosted git server

Drone Server: will be server being responsible for the web service, repositories, secrets, users, etc.

Drone Agent: will be the workers that runs your builds, jobs etc.

Server Confguration

We will change our host’s ssh port to something else as our git server’s ssh method will be listening on port 22 and we would like to add our ssh key to gitea so that we can commit to our git server via ssh.

Change your server’s ssh port to 2222, by opening /etc/ssh/sshd_config and edit the port to 2222 then restart sshd with:

1
$ /etc/init.d/sshd restart

Your ssh connection will still be established, but you can exit and ssh to your server by specifying the new port:

1
$ ssh -p 2222 user@host

Pre-Requirements

Make sure you have docker and docker-compose installed

Deploy Gitea and Drone

Below is the docker-compose file for our deployment. Note that we are running a postgres database which gitea will be configured on, you can also use other databases like mysql, sqlite etc. Visit their documentation for more info.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
version: "2"

services:
  gitea-app:
    image: gitea/gitea:latest
    container_name: gitea-app
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - ROOT_URL=http://gitea:3000
      - SSH_DOMAIN=mydomain.com
    restart: always
    volumes:
      - ./volumes/gitea_app:/data
    ports:
      - "3000:3000"
      - "22:22"
    networks:
      - appnet

  gitea-db:
    image: postgres:alpine
    container_name: gitea-db
    ports:
      - 5440:5432
    restart: always
    volumes:
      - ./volumes/gitea_db:/var/lib/postgresql/data
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=gitea
    networks:
      - appnet

  drone-server:
    image: drone/drone:0.8
    container_name: drone-server
    ports:
      - 80:8000
      - 9000
    volumes:
      - ./volumes/drone:/var/lib/drone/
    restart: always
    depends_on:
      - gitea
    environment:
      - DRONE_OPEN=true
      - DRONE_HOST=http://drone-server:8000
      - DRONE_GITEA=true
      - DRONE_GITEA_URL=http://gitea:3000
      - DRONE_SECRET=secret
      - DRONE_NETWORK=appnet
    networks:
      - appnet

  drone-agent:
    image: drone/agent:0.8
    container_name: drone-agent
    command: agent
    restart: always
    depends_on:
      - drone-server
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DRONE_SERVER=drone-server:9000
      - DRONE_SECRET=secret
    networks:
      - appnet

volumes:
  gitea-app:
  gitea-db:

networks:
  appnet:
    external: true

Create the volumes path:

1
$ mkdir volumes

Create the external network:

1
$ docker network create appnet

Some key configuration,

Our SSH_DOMAIN will be the domain that gets used for things like cloning a repository. When you register your gitea account, you will use the same credentials to logon to drone.

Deploy your stack:

1
$ docker-compose up -d

Register on Gitea

When your deployment is done, access your gitea server which should be available on http://your-docker-ip:3000/ complete the registration, if you decide to go with postgres your username/password will be postgres and your host will be gitea-db:5432.

Make sure to complete the administrator account to get your admin credentials.

Setup SSH Key and Repo

Generate a ssh key that you will use for communicating to git over ssh. If you have already have an ssh key you can skip this step.

1
2
$ ssh-keygen -t rsa
# use the defaults

Login on gitea, once you are logged in, head over to profile / settings / ssh keys: http://your-docker-ip:3000/user/settings/keys.

Add a new ssh key, go back to your terminal and copy the public key which we will provide to gitea:

1
2
$ cat ~/.ssh/id_rsa.pub
<copy the contents to your clipboard>

Paste your public key and provide a descriptive title.

Head back to your dashboard and create your first repository:

image

It should look more or less like this:

image

Enable Repo on Drone

Head over to drone and select the repositores on the right hand side http://:80/account/repos and enable your repository by toggline the selector, it should look more or less like this:

image

Once its enabled head back to gitea.

Clone the repository

On your repository select ssh and copy the ssh link for your repository:

image

Then from your terminal add your private ssh key to your ssh agent and clone the repository:

1
2
3
$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_rsa
$ git clone git@your-docker-ip:your-user/your-repo.git

Add Example Python Flask app to git

I will use a basic python flask application with some tests.

Let’s first add our pipeline definition for drone, so that drone understands how the pipeline should be run when gitea receives a commit:

1
$ touch .drone.yml

Our pipeline:

.drone.yml
1
2
3
4
5
6
7
8
9
10
pipeline:
  build:
    image: python:3.5.1-alpine
    commands:
      - pip install --upgrade pip setuptools wheel
      - pip wheel -r requirements.txt --wheel-dir=wheeldir --find-links=wheeldir
      - pip install --no-index --find-links=wheeldir -r requirements.txt
      - flake8 app
      - mkdir -p coverage
      - nosetests -v tests/

Lets add that to git:

.drone.yml
1
2
3
$ git add .drone.yml
$ git commit -m "initial add of our pipeline"
$ git push origin master

Our drone file should be in git, now our requirements dependency file for python:

requirements.txt
1
2
3
flask
nose
flake8

Our blank file to make our application a module:

requirements.txt
1
2
$ mkdir app
$ touch app/__init__.py

And our flask app:

app/app.py
1
2
3
4
5
6
7
from flask import Flask
app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello, World!"

Our tests directory and our python init file:

app/app.py
1
2
$ mkdir tests
$ touch tests/__init__.py

Now that we have all our files ready, commit and push to git:

1
2
3
$ git add .
$ git commit -m "add python app"
$ git push origin master

Look at Drone Running

Head over to drone and look at the builds, you should see your build running at http://<docker-ip>:80/<user>/<repo-name>:

image

If everything ran as expected you should see that it passed.

Build Status Badges

You can also include the build status badges from drone which will look like:

image

You can use the drone api: http://drone-ip:80/api/badges/<your-user>/<your-repo>/status.svg

1
2
3
4
5
[![Build Status](http://your-ip/api/badges/your-user/your-repo/status.svg)
[![](https://images.microbadger.com/badges/image/gitea/gitea.svg)](https://microbadger.com/images/gitea/gitea "Get your own image badge on microbadger.com")
[![GitHub release](https://img.shields.io/github/release/go-gitea/gitea.svg)](https://github.com/go-gitea/gitea/releases/latest)
[![Help Contribute to Open Source](https://www.codetriage.com/go-gitea/gitea/badges/users.svg)](https://www.codetriage.com/go-gitea/gitea)
[![Become a backer/sponsor of gitea](https://opencollective.com/gitea/tiers/backer/badge.svg?label=backer&color=brightgreen)](https://opencollective.com/gitea)

Overall gitea and drone is really amazing and quite impressed with it, especially from the low memory footprint and that its so easy to work with.

Resources

Have a look at this for more resources:

Docs:

Examples:

Build Small Golang Docker Containers

In this tutorial I will show you how to build really small docker containers for golang applications. And I mean the difference between 310MB down to 2MB

But Alpine..

So we thinking lets go with alpine right? Yeah sure lets build a small, app running on go with alpine.

Our application code:

app.go
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
package main

import (
  "fmt"
  "math/rand"
  "time"
)

func main() {
  lekkewords := []string{
    "dog", "cat", "fish", "giraffe",
    "moo", "spider", "lion", "apple",
    "tree", "moon", "snake", "mountain lion",
    "trooper", "burger", "nasa", "yes",
  }

  rand.Seed(time.Now().UnixNano())
  var zelength int = len(lekkewords)
  var indexnum int = rand.Intn(zelength-1)
  word := lekkewords[indexnum]

  fmt.Println("Number of words:", zelength)
  fmt.Println("Selected index number:", indexnum)
  fmt.Println("Selected word is:", word)
}

Our Dockerfile:

Dockerfile
1
2
3
4
5
6
7
FROM golang:alpine

WORKDIR $GOPATH/src/mylekkepackage/mylekkeapp/
COPY app.go .
RUN go build -o /go/app

CMD ["/go/app"]

Let’s package our app to an image:

Dockerfile
1
❯ docker build -t mygolangapp:using-alpine .

Inspect the size of our image, as you can see it being 310MB

Dockerfile
1
2
3
❯ docker images "mygolangapp:*"
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
mygolangapp         using-alpine        eea1d7bde218        About a minute ago   310MB

Just make sure it actually works:

Dockerfile
1
2
3
4
❯ docker run mygolangapp:using-alpine
Number of words: 16
Selected index number: 11
Selected word is: mountain lion

But for something just returning random selected text, 310MB is a bit crazy.

Multi Stage Builds

As Go binaries are self-contained, we can make use of docker’s multi stage builds, where we can build our application on alpine and use the binary on a scratch image:

Our multi stage Dockerfile:

Dockerfile.mult
1
2
3
4
5
6
7
8
9
10
11
FROM golang:alpine AS builder

WORKDIR $GOPATH/src/mylekkepackage/mylekkeapp/
COPY app.go .
RUN go build -o /go/app

FROM scratch

COPY --from=builder /go/app /go/app

CMD ["/go/app"]

Build it:

Dockerfile.mult
1
❯ docker build -t mygolangapp:using-multistage -f Dockerfile.multi .

Notice that the image is only 2.01MB, say w000t!

Dockerfile.mult
1
2
3
4
❯ docker images "mygolangapp:*"
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
mygolangapp         using-multistage    31474c61ba5b        15 seconds ago      2.01MB
mygolangapp         using-alpine        eea1d7bde218        2 minutes ago       310MB

Run the app:

Dockerfile.mult
1
2
3
4
❯ docker run mygolangapp:using-multistage
Number of words: 16
Selected index number: 5
Selected word is: spider

Resources

Source code for this demonstration can be found at github.com/ruanbekker/golang-build-small-images

Secure Your Elasticsearch Cluster With Basic Auth Using Nginx and SSL From Letsencrypt

In this tutorial we will setup a reverse proxy using nginx to translate and load balance traffic through to our elasticsearch nodes. We will also protect our elasticsearch cluster with basic auth and use letsencrypt to retrieve free ssl certificates.

We want to allow certain requests to be bypassed from authentication such as getting status from the cluster and certain requests we want to enforce authentication, such as indexing and deleting data.

Install Nginx:

Install nginx and the dependency package to create basic auth:

1
$ apt install nginx apache2-utils -y

Configure Nginx for Reverse Proxy

We want to access our nginx proxy on port 80: 0.0.0.0:80 and the requests should be proxied through to elasticsearch private addresses: 10.0.0.10:9200 and 10.0.0.11:9200. Traffic will be load balanced between our 2 nodes.

Edit the main nginx configuration:

1
$ vim /etc/nginx/nginx.conf

and populate the information as shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
user www-data;
worker_processes auto;
pid /run/nginx.pid;

events {
  worker_connections 1024;
  # multi_accept on;
}

http {

  # basic Settings
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 65;
  types_hash_max_size 2048;
  include /etc/nginx/mime.types;
  default_type application/octet-stream;

  # ssl settings
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
  ssl_prefer_server_ciphers on;

  # logging settings
  access_log /var/log/nginx/access.log;
  error_log /var/log/nginx/error.log;

  # gzip settings
  gzip on;
  gzip_disable "msie6";

  # virtual host configs
  include /etc/nginx/conf.d/*.conf;
}

Next, edit the virtual host config:

1
$ vim /etc/nginx/conf.d/elasticsearch.conf

And populate the following config:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# https://gist.github.com/sahilsk/b16cb51387847e6c3329

upstream elasticsearch {
    # define your es nodes
    server 10.0.0.10:9200;
    server 10.0.0.11:9200;
    # persistent http connections
    # https://www.elastic.co/blog/playing-http-tricks-nginx
    keepalive 15;
}

server {
  listen 80;
  server_name elasticsearch.domain.com;

  auth_basic "server auth";
  auth_basic_user_file /etc/nginx/passwords;

  location / {

    # deny node shutdown api
    if ($request_filename ~ "_shutdown") {
      return 403;
      break;
    }

    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_redirect off;
  }

  location = / {
    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_redirect off;
    auth_basic "off";
  }

  location ~* ^(/_cluster/health|/_cat/health) {
    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_redirect off;
    auth_basic "off";
  }
}

Set your username and password to protect your endpoint:

1
$ htpasswd -c /etc/nginx/passwords admin

Enable nginx on boot and restart the process:

1
2
$ systemctl enable nginx
$ systemctl restart nginx

Test it

Now make requests to elasticsearch via your nginx reverse proxy:

1
2
3
$ curl -H 'Content-Type: application/json' -u 'admin:admin' http://myproxy.domain.com/_cat/indices?v
health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   first-index 1o6yM7tCSqagqoeihKM7_g   5   1          3            0     40.6kb         20.3kb

Letsencrypt SSL Certificates

Add free SSL Certificates to your reverse proxy. Install certbot:

1
2
3
4
5
6
$ apt-get update
$ apt-get install software-properties-common -y
$ add-apt-repository universe
$ add-apt-repository ppa:certbot/certbot
$ apt-get update
$ apt-get install certbot python-certbot-nginx -y

Request a Certificate for your domain:

1
2
3
4
5
$ certbot --manual certonly -d myproxy.domain.com -m my@email.com --preferred-challenges dns --agree-tos

Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for myproxy.domain.com

You will be prompted to make a dns change, since we requested the dns challenge. While this screen is here, we can go our dns provider and make the TXT record change as shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Please deploy a DNS TXT record under the name
_acme-challenge.myproxy.domain.com with the following value:

xLP4y_YJvdAK7_aZMJ50gkudTDeIC3rX0x83aNJctGw

Before continuing, verify the record is deployed.
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/myproxy.domain.com/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/myproxy.domain.com/privkey.pem
   Your cert will expire on 2019-07-01. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

Update Nginx Config

Now that we have our ssl certificates, we need to update our nginx config to enable ssl, redirect http to https and point the ssl certificates and ssl private keys to the certificates that we retrieved from letsencrypt.

Open up the virtual host nginx configuration:

1
$ vim /etc/nginx/conf.d/elasticsearch.conf

Update the config like the one below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
upstream elasticsearch {
    server 10.0.0.10:9200;
    server 10.0.0.11:9200;
    keepalive 15;
}

server {
  listen 80;
  server_name myproxy.domain.com;
  return 301 https://$host$request_uri;
}

server {
  listen 443 ssl;
  server_name myproxy.domain.com;

  ssl_certificate /etc/letsencrypt/live/myproxy.domain.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/myproxy.domain.com/privkey.pem;

  auth_basic "server auth";
  auth_basic_user_file /etc/nginx/passwords;

  location ^~ /.well-known/acme-challenge/ {
    auth_basic off;
  }

  location / {

    # deny node shutdown api
    if ($request_filename ~ "_shutdown") {
      return 403;
      break;
    }

    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $http_host;
    proxy_redirect off;
  }

  location = / {
    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    auth_basic "off";
  }

  location ~* ^(/_cluster/health|/_cat/health) {
    proxy_pass http://elasticsearch;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    auth_basic "off";
  }
}

Restart the nginx process:

1
$ systemctl restart nginx

Test the Nginx Proxy with SSL

Test the proxy with HTTP so that we can see that our nginx config redirects us to HTTPS:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ curl -iL -u 'admin:admin' http://myproxy.domain.com/_cat/nodes?v
HTTP/1.1 301 Moved Permanently
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 21:40:09 GMT
Content-Type: text/html
Content-Length: 194
Connection: keep-alive
Location: https://myproxy.domain.com/_cat/nodes?v

HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 21:40:10 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 276
Connection: keep-alive

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.0.10               40          97   3    0.15    0.10     0.08 mdi       -      Lq9P7eP
10.0.0.11               44          96   3    0.21    0.10     0.09 mdi       *      F5edOwK

Test the proxy with HTTPS:

1
2
3
4
5
6
7
8
9
10
11
$ curl -i -u 'admin:admin' https://myproxy.domain.com/_cat/nodes?v
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 21:40:22 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 276
Connection: keep-alive

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.0.10               44          96   4    0.18    0.10     0.09 mdi       *      F5edOwK
10.0.0.11               39          97   5    0.13    0.09     0.08 mdi       -      Lq9P7eP

Setup a cronjob to auto renew the certificates:

1
$ crontab -e

Populate the following line:

1
6 1,13 * * * /usr/bin/certbot renew --post-hook "systemctl restart nginx" --quiet

Resources:

Setup Kibana Dashboards for Nginx Log Data to Understand the Behavior

kibana

In this tutorial we will setup a Basic Kibana Dashboard for a Web Server that is running a Blog on Nginx.

What do we want to achieve?

We will setup common visualizations to give us an idea on how our blog/website is doing.

In some situations we need to create visualizations to understand the behavior of our log data in order to answer these type of questions:

Number Scenario
1 Geographical map to see where people are connecting from
2 Piechart that represents the percentage of cities accessing my blog
3 Top 10 Most Accessed Pages
4 Top 5 HTTP Status Codes
5 Top 10 Pages that returned 404 Responses
6 The Top 10 User Agents
7 Timeseries: Status Codes Over Time
8 Timeseries: Successfull Website Hits over time
9 Counter with Website Hits
10 Average Bytes Returned
11 Tag Cloud with the City Names that Accessed my Blog

Pre-Requirements

I am consuming my nginx access logs with filebeat and shipping them to elasticsearch. You can check out this blogpost to set that up.

The GeoIP Processor plugin is installed on elasticsearch to enrich our data with geographical information. You can check out this blogpost to setup geoip.

You can setup Kibana and Elasticsearch on Docker or setup a 5 Node Elasticsearch Cluster

Setup Kibana Visulizations

Head over to Kibana, make sure that you have added the filebeat-* index patterns. If not, head over to Management -> Index Patterns -> Create Index -> Enter filebeat-* as you Index Pattern, select Next, select your @timestamp as your timestamp field, select create.

Now from the visualization section we will add 11 Visualizations. Everytime that you create a visualization, make sure that you select filebeat as your pattern (thats if you are using filebeat).

When configuring your visualization it will look like the configuration box from below:

image

Geomap: Map to see where people are connecting from

kibana geomap

Select New Visualization: Coordinate Map

1
2
3
  -> Metrics, Value: Count. 
     Buckets, Geo Coordinates, Aggregation: Geohash, 
     Field: nginx.access.geoip.location. 

Save the visualization, in my case Nginx:GeoMap:Filebeat

Piechart: Cities

This can give us a quick overview on the percentage of people interested in our website grouped per city.

image

Select New Visualization, Pie

1
2
3
4
5
 -> Metrics: Slice Size, Aggregation: Count
 -> Buckets: Split Slices, 
    Aggregation: Terms, Field: nginx.access.geoip.city_name, 
    Order by: metric: count, 
    Order: Descending, Size: 20

Save Visualization.

Top 10 Accessed Pages

Great for seeing which page is popular, and Kibana makes it easy to see which page is doing good over a specific time.

image

New Visualization: Vertical

1
2
3
4
5
  -> Metrics: Y-Axis, Aggregation: Count
  -> Buckets: X-Axis, Aggregation: Terms, 
     Field: nginx.access.url, 
     Ordery by: Metric count, 
     Order: Descending, Size 10

I would like to remove /rss and / from my results, so in the search box:

1
NOT (nginx.access.url:"/" OR nginx.access.url:"/rss/" OR nginx.access.url:"/subscribe/" OR nginx.access.url:*.txt)

Save Visualization.

Top 5 HTTP Status Codes

A Grouping of Status Codes (You should see more 200’s) but its quick to identify when 404’s spike etc.

image

Select new visualization: Vertical Bar

1
2
3
4
5
  -> Metrics: Y-Axis, Aggregation: Count
  -> Buckets: X-Axis, Aggregation: Terms, 
     Field: nginx.access.response_code, 
     Ordery by: Metric count, 
     Order: Descending, Size 5

Save Visualization

Top 404 Pages

So when people are requesting pages that does not exist, it could most probably be bots trying to attack your site, or trying to gain access etc. This is a great view to see which ones are they trying and then you can handle it from there.

image

1
2
3
4
5
  -> Metrics: Y-Axis, Aggregation: Count
  -> Buckets: X-Axis, Aggregation: Terms, 
     Terms, Field: nginx.access.url, 
     Order by: Metric count, 
     Order: Descending, Size 20

Top 10 User Agents

Some insights to see the top 10 browsers.

image

New Visualization: Data Table

1
2
3
4
5
6
  -> Metrics: Y-Axis, Aggregation: Count
  -> Buckets: Split Rows, 
     Aggregation: Terms, 
     Field: nginx.access.user_agent.name, 
     Ordery by: Metric count, 
     Order: Descending, Size 10

Save Visualization

Timeseries: Status Codes over Time

With timeseries data its great to see when there was a spike in status codes, when you identify the time, you can further investigate why that happened.

New Visualization: Timelion

1
.es(index=filebeat*, timefield='@timestamp', q=nginx.access.response_code:200).label('OK'), .es(index=filebeat*, timefield='@timestamp', q=nginx.access.response_code:404).label('Page Not Found')

Timeseries: Successfull Website Hits over Time

This is a good view to see how your website is serving traffic over time.

image

New Visualization: Timelion

1
.es(index=filebeat*, timefield='@timestamp', q=nginx.access.response_code:200).label('200')

Count Metric: Website Hits

A counter to see the number of website hits over time.

image

New Visualization: Metric

1
2
  -> Search Query: fields.blog_name:sysadmins AND nginx.access.response_code:200
  -> Metrics: Y-Axis, Aggregation: Count

Average Bytes Transferred

Line chart with the amount of bandwidth being transferred.

image

1
2
3
4
New Visualization: Line

-> Metrics: Y-Axis, Aggregation: Average, Field: nginx.access.body_sent.bytes
-> Buckets: X-Axis, Aggregation: Date Histogram, Field: @timestamp

Tag Cloud with Most Popular Cities

I’ve used cities here, but its a nice looking visualization to group the most accessed fields. With server logs you can use this for the usernames failed in ssh attempts for example.

image

1
2
3
4
5
6
  -> Metrics: Tag size, Aggregation: Count
  -> Buckets: Tags, 
     Aggregation: Terms, 
     Field: nginx.access.geoip.city_name, 
     Ordery by: Metric count, 
     Order: Descending, Size 10

Create the Dashboard

Now that we have all our visualizations, lets build the dashboard that hosts all our visualizations.

Select Dashboard -> Create New Dashboard -> Add -> Select your visualizations -> Reorder and Save

The visualizations in my dashboard looks like this:

image

This is a basic dashboard but its just enough so that you can get your hands dirty and build some awesome visualizations.

Setup a 5 Node Highly Available Elasticsearch Cluster

elasticsearch

This is post 1 of my big collection of elasticsearch-tutorials which includes, setup, index, management, searching, etc. More details at the bottom.

In this tutorial we will setup a 5 node highly available elasticsearch cluster that will consist of 3 Elasticsearch Master Nodes and 2 Elasticsearch Data Nodes.

“Three master nodes is the way to start, but only if you’re building a full cluster, which at minimum is 3 master nodes plus at least 2 data nodes.” - https://discuss.elastic.co/t/should-dedicated-master-nodes-and-data-nodes-be-considered-separately/75093/14

The Overview:

In short the responsibilites of the node types:

Master Nodes: Master nodes are responsible for Cluster related tasks, creating / deleting indexes, tracking of nodes, allocate shards to nodes, etc.

Data Nodes: Data nodes are responsible for hosting the actual shards that has the indexed data also handles data related operations like CRUD, search, and aggregations.

For more concepts of Elasticsearch, have a look at their basic-concepts documentation.

Our Inventory will consist of:

Master Nodes:

1
2
3
Hostname: es-master-1, Private IP: 172.31.0.77
Hostname: es-master-2, Private IP: 172.31.0.45
Hostname: es-master-3, Private IP: 172.31.1.31

Data Nodes:

1
2
Hostname: es-data-1, Private IP:172.31.2.30
Hostname: es-data-2, Private IP:172.31.0.83

Reserved Volumes for Data Nodes:

1
2
es-data-1: 10GB assigned to /dev/vdb
es-data-2: 10GB assigned to /dev/vdb

Authentication:

Note that I have configured the bind address for elasticsearch to 0.0.0.0 using network.host: 0.0.0.0 for this demonstration, but this means that if your server has a public ip address with no firewall rules or no auth, that anyone will be able to interact with your cluster.

This address will also be reachable for all nodes to see each other.

It’s advisable do protect your endpoint, either with basic auth using nginx which can be found in the embedded link, or using firewall rules to protect communication from the outside (depending on your setup)

Setup the Elasticsearch Master Nodes

The setup below how to provision a elasticsearch master node. Repeat this on node: es-master-1, es-master-2, es-master-3

Set your hosts file for name resolution (if you don’t have private dns in place):

1
2
3
4
5
6
7
8
$ cat > /etc/hosts << EOF
127.0.0.1 localhost
172.31.0.77 es-master-1
172.31.0.45 es-master-2
172.31.1.31 es-master-3
172.31.2.30 es-data-1
172.31.0.83 es-data-2
EOF

Get the elasticsearch repositories, install the java development kit dependency and install elasticsearch:

1
2
3
4
5
6
7
$ apt update && apt upgrade -y
$ apt install software-properties-common python-software-properties apt-transport-https -y
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
$ apt update
$ apt install default-jdk -y
$ apt install elasticsearch -y

The elasticsearch config, before we get to the full example config, I just want to show a snippet of how you could split up logs and data.

Note that you can seperate your logging between data/logs like this:

1
2
3
4
5
6
# example of log splitting:
...
path:
  logs: /var/log/elasticsearch
  data: /var/data/elasticsearch
...

Also, your data can be divided between paths:

1
2
3
4
5
6
7
8
# example of data paths:
...
path:
  data:
    - /mnt/elasticsearch_1
    - /mnt/elasticsearch_2
    - /mnt/elasticsearch_3
...

Bootstrap the elasticsearch config with a cluster name (all the nodes should have the same cluster name), set the nodes as master node.master: true disable the node.data and specify that the cluster should at least have a minimum of 2 master nodes before it stops. This is used to prevent split brain.

To avoid a split brain, this setting should be set to a quorum of master-eligible nodes: (master_eligible_nodes / 2) + 1

The full example config:

1
2
3
4
5
6
7
8
9
10
11
$ cat > /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: es-cluster
node.name: \${HOSTNAME}
node.master: true
node.data: false
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
EOF

Important settings for your elasticsearch cluster is described on their docs:

1
2
3
4
5
$ cat > /etc/default/elasticsearch << EOF
ES_STARTUP_SLEEP_TIME=5
MAX_OPEN_FILES=65536
MAX_LOCKED_MEMORY=unlimited
EOF

Ensure that pages are not swapped out to disk by requesting the JVM to lock the heap in memory by setting LimitMEMLOCK=infinity. Set the maxiumim file descriptor number for this process: LimitNOFILE and increase the number of threads using LimitNPROC:

1
2
3
4
5
6
7
$ vim /usr/lib/systemd/system/elasticsearch.service

[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65535
LimitNPROC=4096
...

Increase the limit on the number of open files descriptors to user elasticsearch of 65536 or higher

1
2
3
4
$ cat > /etc/security/limits.conf << EOF
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
EOF

Increase the value of the mmap counts as elasticsearch uses mmapfs directory to store its indices:

1
$ sysctl -w vm.max_map_count=262144

For a permanent setting, update vm.max_map_count in /etc/sysctl.conf and run

1
$ sysctl -p /etc/sysctl.conf 

Prepare the directories and set the ownership to elasticsearch:

1
2
$ mkdir /usr/share/elasticsearch/data
$ chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data

Reload the systemd daemon, enable and start elasticsearch

1
2
3
$ systemctl daemon-reload
$ systemctl enable elasticsearch
$ systemctl restart elasticsearch

Once all 3 elasticsearch masters has been started, verify that they are listening: netstat -tulpn | grep 9200 then look at the cluster health:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ curl http://127.0.0.1:9200/_cluster/health?pretty
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 0,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Have a look at the nodes, you will see that the node.role for now shows mi:

1
2
3
4
5
$ curl http://127.0.0.1:9200/_cat/nodes?v
ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.163.68.8           11          80  18    0.28    0.14     0.09 mi        -      es-master-2
10.163.68.5           14          80  14    0.27    0.18     0.11 mi        *      es-master-1
10.163.68.4           15          79   6    0.62    0.47     0.18 mi        -      es-master-3

Setup the Elasticsearch Data Nodes

Now that we have our 3 elasticsearch master nodes running, its time to provision the 2 elasticsearch data nodes. This setup needs to be repeated on both es-data-1 and es-data-2.

Configure the hosts file for name resolution:

1
2
3
4
5
6
7
8
$ cat > /etc/hosts << EOF
127.0.0.1 localhost
172.31.0.77 es-master-1
172.31.0.45 es-master-2
172.31.1.31 es-master-3
172.31.2.30 es-data-1
172.31.0.83 es-data-2
EOF

Get the elasticsearch repositories, install the java development kit dependency and install elasticsearch:

1
2
3
4
5
6
7
$ apt update && apt upgrade -y
$ apt install software-properties-common python-software-properties apt-transport-https -y
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
$ apt update
$ apt install default-jdk -y
$ apt install elasticsearch -y

Since we attached an extra disk to our data nodes, verify that you can see the disk:

1
2
3
4
5
$ lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda    253:0    0  25G  0 disk
└─vda1 253:1    0  25G  0 part /
vdb    253:16   0  10G  0 disk             <----

Provision the block device with xfs or anything else that you prefer, create the directory where elasticsearch data will reside, change the ownership that elasticsearch has permission to write/read, set the device on startup and mount the disk:

1
2
3
4
5
6
7
$ mkfs.xfs /dev/vdb
$ mkdir /data
$ mkdir /data/nodes
$ chown -R elasticsearch:elasticsearch /data
$ chown -R elasticsearch:elasticsearch /data/nodes
$ echo '/dev/vdb /data xfs defaults 0 0' >> /etc/fstab
$ mount -a

Verify that the disk is mounted:

1
2
3
4
5
6
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            994M     0  994M   0% /dev
tmpfs           201M  3.1M  197M   2% /run
/dev/vda1        25G  1.8G   23G   8% /
/dev/vdb         10G   33M   10G   1% /data

Bootstrap the elasticsearch config with a cluster name, set the node.name to an identifier, in this case I will use the servers hostname, set the node.master to false as this will be data nodes, also enable these nodes as data nodes: node.data: true, configure the path.data: /data to the path that we configured, etc:

1
2
3
4
5
6
7
8
9
10
11
12
$ cat > /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: es-cluster
node.name: \${HOSTNAME}
node.master: false
node.data: true
path.data: /data
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
EOF

Set a couple of important settings for your elasticsearch cluster is described on their docs:

1
2
3
4
5
$ cat > /etc/default/elasticsearch << EOF
ES_STARTUP_SLEEP_TIME=5
MAX_OPEN_FILES=65536
MAX_LOCKED_MEMORY=unlimited
EOF

Disable swapping, increase the file descriptors and increase the maximum number of threads:

1
2
3
4
5
$ vim /usr/lib/systemd/system/elasticsearch.service
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65535
LimitNPROC=4096

Also update them via limits.conf:

1
2
3
4
$ cat > /etc/security/limits.conf << EOF
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
EOF

Reload the systemd daemon, enable and start elasticsearch. Allow it to start and check if the ports are listening with netstat -tulpn | grep 9200, then:

1
2
3
$ systemctl daemon-reload
$ systemctl enable elasticsearch
$ systemctl restart elasticsearch

Verify that everything works as expected, look at the cluster health and look at the status and number of nodes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ curl http://127.0.0.1:9200/_cluster/health?pretty
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Look at the nodes api and you will see that we now have the extra 2 nodes showing up on node.role as di:

1
2
3
4
5
6
7
$ curl http://127.0.0.1:9200/_cat/nodes?v
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.163.68.7             9          96   6    0.12    0.11     0.03 di        -      es-data-2
10.163.68.5            10          80   2    0.20    0.09     0.08 mi        *      es-master-1
10.163.68.11           12          96   9    0.12    0.09     0.03 di        -      es-data-1
10.163.68.4            10          79   0    0.00    0.12     0.11 mi        -      es-master-3
10.163.68.8            12          79   1    0.05    0.06     0.07 mi        -      es-master-2

Interact with Elasticsearch

Let’s interact with elasticsearch, the overview:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ curl http://127.0.0.1:9200
{
  "name" : "es-data-1",
  "cluster_name" : "es-cluster",
  "cluster_uuid" : "5BLs4sxsSEK-4OxlGnmlmw",
  "version" : {
    "number" : "6.7.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "8453f77",
    "build_date" : "2019-03-21T15:32:29.844721Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Let’s look at the Health API:

1
2
3
$ curl http://127.0.0.1:9200/_cat/health?v
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1554154652 21:37:32  es-cluster green           5         2     10   5    0    0        0             0                  -                100.0%

Let’s ingest some data into elasticsearch, we will create an index named first-index with some dummy data about people, username, name, surname, location and hobbies:

1
2
3
4
5
$ curl -H 'Content-Type: application/json' -XPOST http://127.0.0.1:9200/first-index/docs/ -d '{"username": "mikes", "name": "mike", "surname": "steyn", "location": {"country": "south africa", "city": "cape town"}, "hobbies": ["sport", "coffee"]}'

$ curl -H 'Content-Type: application/json' -XPOST http://127.0.0.1:9200/first-index/docs/ -d '{"username": "clarissas", "name": "clarissa", "surname": "smith", "location": {"country": "ireland", "city": "dublin"}, "hobbies": ["shopping", "reading", "chess"]}'

$ curl -H 'Content-Type: application/json' -XPOST http://127.0.0.1:9200/first-index/docs/ -d '{"username": "franka", "name": "frank", "surname": "adams", "location": {"country": "new zealand", "city": "auckland"}, "hobbies": ["programming", "swimming", "rugby"]}'

Now that we ingested our data into elasticsearch, lets have a look at the Indices API, where the number of documents, size etc should reflect:

1
2
3
$ curl http://127.0.0.1:9200/_cat/indices?v
health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   first-index 1o6yM7tCSqagqoeihKM7_g   5   1          3            0     40.6kb         20.3kb

Now lets request a search, which will give you by default 10 returned documents:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
$ curl http://127.0.0.1:9200/first-index/_search?pretty
{
  "took" : 116,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "first-index",
        "_type" : "docs",
        "_id" : "-NTO2mkB8pugP4aC2jtZ",
        "_score" : 1.0,
        "_source" : {
          "username" : "mikes",
          "name" : "mike",
          "surname" : "steyn",
          "location" : {
            "country" : "south africa",
            "city" : "cape town"
          },
          "hobbies" : [
            "sport",
            "coffee"
          ]
        }
      },
      {
        "_index" : "first-index",
        "_type" : "docs",
        "_id" : "-tTR2mkB8pugP4aCAzvG",
        "_score" : 1.0,
        "_source" : {
          "username" : "franka",
          "name" : "frank",
          "surname" : "adams",
          "location" : {
            "country" : "new zealand",
            "city" : "auckland"
          },
          "hobbies" : [
            "programming",
            "swimming",
            "rugby"
          ]
        }
      },
      {
        "_index" : "first-index",
        "_type" : "docs",
        "_id" : "-dTP2mkB8pugP4aC1ztI",
        "_score" : 1.0,
        "_source" : {
          "username" : "clarissas",
          "name" : "clarissa",
          "surname" : "smith",
          "location" : {
            "country" : "ireland",
            "city" : "dublin"
          },
          "hobbies" : [
            "shopping",
            "reading",
            "chess"
          ]
        }
      }
    ]
  }
}

Let’s have a look at our shards using the Shards API, you will also see where each document is assigned to a specific shard, and also if its a primary or replica shard:

1
2
3
4
5
6
7
8
9
10
11
12
$ curl http://127.0.0.1:9200/_cat/shards?v
index       shard prirep state   docs store ip           node
first-index 4     p      STARTED    0  230b 10.163.68.7  es-data-2
first-index 4     r      STARTED    0  230b 10.163.68.11 es-data-1
first-index 2     p      STARTED    0  230b 10.163.68.7  es-data-2
first-index 2     r      STARTED    0  230b 10.163.68.11 es-data-1
first-index 3     r      STARTED    1 6.6kb 10.163.68.7  es-data-2
first-index 3     p      STARTED    1 6.6kb 10.163.68.11 es-data-1
first-index 1     r      STARTED    2  13kb 10.163.68.7  es-data-2
first-index 1     p      STARTED    2  13kb 10.163.68.11 es-data-1
first-index 0     p      STARTED    0  230b 10.163.68.7  es-data-2
first-index 0     r      STARTED    0  230b 10.163.68.11 es-data-1

Then we can also use the Allocation API to see the size of our indices, disk space per node:

1
2
3
4
$ curl http://127.0.0.1:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
     5       20.3kb    32.4mb      9.9gb      9.9gb            0 10.163.68.11 10.163.68.11 es-data-1
     5       20.3kb    32.4mb      9.9gb      9.9gb            0 10.163.68.7  10.163.68.7  es-data-2

Let’s search for anyone with the surname smith:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
$ curl -s http://127.0.0.1:9200/first-index/_search?q=surname=smith | jq .
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "first-index",
        "_type": "docs",
        "_id": "-dTP2mkB8pugP4aC1ztI",
        "_score": 0.2876821,
        "_source": {
          "username": "clarissas",
          "name": "clarissa",
          "surname": "smith",
          "location": {
            "country": "ireland",
            "city": "dublin"
          },
          "hobbies": [
            "shopping",
            "reading",
            "chess"
          ]
        }
      }
    ]
  }
}

Let’s search for anyone with rugby as one of their hobbies:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
$ curl -s http://127.0.0.1:9200/first-index/_search?q=hobbies=rugby | jq .
{
  "took": 23,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.64072424,
    "hits": [
      {
        "_index": "first-index",
        "_type": "docs",
        "_id": "-tTR2mkB8pugP4aCAzvG",
        "_score": 0.64072424,
        "_source": {
          "username": "franka",
          "name": "frank",
          "surname": "adams",
          "location": {
            "country": "new zealand",
            "city": "auckland"
          },
          "hobbies": [
            "programming",
            "swimming",
            "rugby"
          ]
        }
      }
    ]
  }
}

More on Elasticsearch

I am planning to write up elasticsearch articles on the following topics:

  • Setting up a 5 Node HA Elasticsearch Cluster
  • Indexes / Replicas
  • Search Queries
  • Delete Queries
  • Elasticsearch Snapshots and Restores on S3
  • Mapping Templates
  • Resizing Index Shards
  • Dealing with Old Timeseries Data
  • Elasticsearch Percentiles
  • Managing Yellow and Red Status Clusters
  • Managing High JVM Memory Pressure
  • and more

As I finish up the writing of these posts they will be published under the #elasticsearch-tutorials category on my blog and for any other elasticsearch tutorials, you can find them under the #elasticsearch category.

Oke byyyyyye :D

Resources