Set up the ELK stack with Index Lifecycle Management

Neo Ver
Apr 9, 2021

I was in my last week at my company, and the Spring Boot based microservices we had were now producing tons of logs, because hey, the application worked! ;)

Apparently, grepping the logs wasn’t the best use of our time. So I thought I’d gift my team the ELK stack for debugging logs before I moved on.

I looked for the easiest and most stable way to set up ELK and came across an already packaged Docker image of ELK. You can find its documentation here and the codebase here.

This setup is great; it just works once you have set up Filebeat and started writing logs from your respective containers. As I wanted minimal changes to my existing services to start shipping logs into ELK, I decided to set up Filebeat as a sidecar that reads the mounted log files produced by those services.

I deployed one pod running the ELK Docker image mentioned above and one for Filebeat, and then tweaked the Logback configuration of each service to start producing Logstash-encoded logs in JSON format in a file. If this sounds overwhelming, don’t worry, details follow.

I followed these steps:

  1. Setting up ELK
  2. Setting up Filebeat to read logs and emit to Logstash
  3. Setting up service to emit formatted logs for Filebeat
  4. Fixing the ILM policy

Setting up ELK

  1. Created a Dockerfile for ELK, using sebp/elk:oss-7.11.2 as the base image:
# Dockerfile - ELK
FROM sebp/elk:oss-7.11.2
ADD certs/logstash-beats.crt /etc/pki/tls/certs/logstash-beats.crt
ADD certs/logstash-beats.key /etc/pki/tls/private/logstash-beats.key
ADD configs/heartbeat-30-output.conf /etc/logstash/conf.d/30-output.conf
EXPOSE 5601 9200 5044
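
To build the image, I ran a standard docker build from the directory holding this Dockerfile along with the certs/ and configs/ folders; something like the following, where the heartbeat-elk tag matches the image name used in the run command further below:

# build and tag the ELK image
docker build -t heartbeat-elk .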

2. Created my own cert and key to replace the default ones, following the instructions from Certificates on Filebeat, and swapped out the default .crt and .key provided in the image, as can be seen in the two ADD lines above.

openssl req -x509 -batch -nodes -subj "/CN=heartbeat-elk/" \
-days 3650 -newkey rsa:2048 \
-keyout logstash-beats.key -out logstash-beats.crt

3. The default ELK setup uses the Elasticsearch output plugin, which creates indices named in the format “filebeat-YYYY.MM.DD”, as that is the index pattern set in the 30-output.conf file shipped with the image. We changed this to use ILM, because Index Lifecycle Management cannot be applied to indices whose names do not end in digits, such as 000001 (Elasticsearch enforces this check).

Default 30-output.conf:

output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
  }
}

30-output.conf after our change:

output {
  elasticsearch {
    hosts => ["localhost"]
    ilm_enabled => "true"
  }
}

Now, once the ILM flag is enabled, Logstash creates the rollover alias “logstash”, automatically creates indices of the form “logstash-{now/d}-000001”, and applies the default “logstash-policy” ILM policy, which rolls the index over after 50GB or 30 days.

My use case was to roll the index over after 50GB or 1 day and delete it 3 days after rollover.

4. Docker command to run the ELK container locally. To persist Elasticsearch data between restarts, we mounted a host directory (locally at `/opt/mountedELKDirectory/heartbeat-elk`) to the container’s data location `/var/lib/elasticsearch`:

docker run --rm \
-e RUN_ENVIRONMENT=test \
-e ES_CONNECT_RETRY=60 \
-v /opt/mountedELKDirectory/heartbeat-elk:/var/lib/elasticsearch \
-p 5601:5601 -p 9200:9200 -p 5044:5044 \
-t heartbeat-elk
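
Once the container is up, a quick sanity check is to hit Elasticsearch and Kibana on the exposed ports:

# Elasticsearch cluster health on 9200
curl http://localhost:9200/_cluster/health?pretty
# Kibana status API on 5601
curl http://localhost:5601/api/status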

Setting up Filebeat

  1. Created a Dockerfile for the Filebeat container as below. Notice the same .crt file being copied here as well.
# Dockerfile - Filebeat
FROM docker.elastic.co/beats/filebeat:7.11.2
COPY yamls/filebeat.yml /usr/share/filebeat/filebeat.yml
ADD certs/logstash-beats.crt /etc/pki/tls/certs/logstash-beats.crt
USER root
RUN chown -R root /usr/share/filebeat/
RUN chmod -R go-w /usr/share/filebeat/
ENTRYPOINT bash -c 'export PATH=$PATH:/usr/share/filebeat && /usr/local/bin/docker-entrypoint -e'

2. Updated the paths in our filebeat.yml as shown below to start harvesting logs:

# filebeat.yml
filebeat.inputs:
- type: log
  json.keys_under_root: true
  # JSON key whose value contains the sub JSON document produced by our application's console appender
  json.message_key: log
  enabled: true
  encoding: utf-8
  paths:
    # Location of all our Docker log files (mapped volume in docker-compose.yml)
    - '/opt/mountedLogDirectory/heartbeat-logs/*.log.json'
  # Filebeat YAML version, just some additional metadata that can be used to track different Filebeat configurations
  tags: ["fb-v1"]

processors:
  # decode the log field (sub JSON document) if JSON encoded, then map its fields to Elasticsearch fields
  - decode_json_fields:
      fields: ["log"]
      target: ""
      # overwrite existing target Elasticsearch fields while decoding JSON fields
      overwrite_keys: true
  - add_docker_metadata: ~

output.logstash:
  enabled: true
  hosts: ["heartbeat-elk:5044"]
  ssl.enabled: true
  max_retries: 3

# Write Filebeat's own logs only to a file, to avoid Filebeat harvesting its own output from the Docker log files
logging.to_files: true
logging.to_syslog: false

3. And finally, the command to run the Filebeat container, which emits logs to Logstash on “heartbeat-elk:5044”:

docker run --rm \
-v /opt/mountedLogDirectory/:/opt/mountedLogDirectory/ \
--net=host \
--add-host heartbeat-elk:0.0.0.0 \
-t heartbeat-filebeat

Setting up service

  1. The only thing required on the service side was to update its Logback configuration to start producing logs encoded in the format Logstash expects. I referenced this to understand the setup and created the logback.xml below; it writes to STDOUT in the usual pattern while also producing a JSON log file using the Logstash encoder library.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level [%X{mdc.token.key}] %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Daily rollover -->
            <fileNamePattern>/opt/mountedLogDirectory/heartbeat-logs/%d{dd-MM-yyyy}.log.json</fileNamePattern>
            <!-- Keep 2 days' worth of history -->
            <maxHistory>2</maxHistory>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <timestampPattern>yyyy-MM-dd'T'HH:mm:ss.SSS</timestampPattern>
            <shortenedLoggerNameLength>36</shortenedLoggerNameLength>
            <includeMdcKeyName>mdc.token.key</includeMdcKeyName>
            <mdcKeyFieldName>mdc.token.key=X-correlation-id</mdcKeyFieldName>
            <customFields>{"service":"${SERVICE_NAME}", "env":"${RUN_ENVIRONMENT}"}</customFields>
        </encoder>
    </appender>

    <springProfile name="test">
        <root level="DEBUG">
            <appender-ref ref="STDOUT" />
            <appender-ref ref="JSON_FILE" />
        </root>
    </springProfile>
</configuration>
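
For reference, each line the LogstashEncoder writes to the .log.json file is a single JSON document. With the settings above, it looks roughly like this (the values here are made up for illustration; exact fields depend on your encoder settings):

{"@timestamp":"2021-04-09T10:15:30.123","@version":"1","message":"Heartbeat received","logger_name":"c.e.h.HeartbeatController","thread_name":"http-nio-8080-exec-1","level":"INFO","level_value":20000,"X-correlation-id":"3f2a1c9d","service":"heartbeat","env":"test"}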

2. To make this work, I had to add the following dependencies to my pom.xml:

<!-- in <properties> -->
<logstash.encoder.version>6.6</logstash.encoder.version>
<logback.version>1.2.3</logback.version>

<!-- in <dependencies> -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>${logstash.encoder.version}</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-core</artifactId>
    <version>${logback.version}</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>${logback.version}</version>
</dependency>

My general advice would be to use beefier nodes for the ELK deployment, with at least 6–10GB of RAM.

I also had to create VMs with a few tweaked kernel parameters on Azure, particularly increasing vm.max_map_count as mentioned here:

{
  "sysctls": {
    "vmMaxMapCount": 262144
  }
}
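
For reference, outside AKS the same kernel setting is applied directly on the host with sysctl, using the value Elasticsearch recommends:

# apply immediately (lost on reboot)
sudo sysctl -w vm.max_map_count=262144
# persist across reboots
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf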

Command to create a new node pool named “himemorypool” of Standard_D4s_v3 nodes, using the tweaked kernel parameter saved in a JSON file (linuxOsConfig.json), with custom labels and tags attached for node affinity:

az aks nodepool add \
--name himemorypool \
--cluster-name your-aks \
--resource-group your-resource-group \
--node-vm-size standard_d4s_v3 \
--labels osType=configured \
--tags osType=configured \
--mode User \
--linux-os-config <path-to-file>/linuxOsConfig.json \
--node-count 3

Now, once the new node pool with better resources was created, I needed my ELK containers to be scheduled only on those nodes. I achieved this with the following deployment YAML, which mandates the node affinity via `requiredDuringSchedulingIgnoredDuringExecution`. More can be found here:

# ELK Deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: heartbeat-elk
  namespace: your-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      name: heartbeat-elk
  template:
    metadata:
      labels:
        name: heartbeat-elk
    spec:
      containers:
      - name: heartbeat-elk
        image: registry.azurecr.io/heartbeat-elk:${VERSION_DEPLOY}
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 2000m
            memory: 6Gi
          limits:
            cpu: 3000m
            memory: 10Gi
        ports:
        - name: kibana-port
          containerPort: 5601
        - name: logstash-port
          containerPort: 5044
        env:
        - name: ES_CONNECT_RETRY
          value: "60"
        volumeMounts:
        - mountPath: /var/lib/elasticsearch
          name: heartbeatvolume-test-elk
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: osType
                operator: In
                values:
                - configured
      volumes:
      - name: heartbeatvolume-test-elk
        azureFile:
          secretName: your-secrets
          shareName: heartbeatvolume-test-elk
          readOnly: false

YAML for Filebeat, mounted to the same directory where the logs are being produced:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: heartbeat-filebeat
  namespace: your-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      name: heartbeat-filebeat
  template:
    metadata:
      labels:
        name: heartbeat-filebeat
    spec:
      containers:
      - name: heartbeat-filebeat
        image: registry.azurecr.io/heartbeat-filebeat:${VERSION_DEPLOY}
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 500m
            memory: 2Gi
        env:
        - name: RUN_ENVIRONMENT
          value: "test"
        volumeMounts:
        - mountPath: /opt/mountedLogDirectory/
          name: heartbeatvolume-test
      volumes:
      - name: heartbeatvolume-test
        azureFile:
          secretName: your-secrets
          shareName: heartbeatvolume-test
          readOnly: true

Fixing the ILM policy

  1. Once all of these were in place, I checked our Kibana dashboard at our host IP on port 5601, then moved to the Stack Management section via the hamburger icon at the top left > Management > Stack Management.

2. Created an index pattern named “logstash-*”.

3. After step #2, I found an already existing index in the format “logstash-YYYY.MM.DD-000001”, which reassured me that everything was hooked up correctly.

4. It had the default lifecycle policy named “logstash-policy” and the rollover alias “logstash” attached to it.

5. Now I needed to remove those lifecycles temporarily. Why? Because even if you edit an ILM policy, its properties won’t be applied to indices created prior to the update. So we remove the lifecycle, edit the policy, and then add the lifecycle back so that it also applies to the originally created index. Removal was done by selecting the index and then going to Manage (bottom right corner) > Remove Policy.

6. I edited the “logstash-policy” by going to Data > Index Lifecycle Policies. For testing I changed the rollover period to 1 minute and deletion to 3 minutes after rollover, and once I had verified it worked, I changed the minutes to days.

7. Once this was done, I went to Dev Tools under Management and ran HTTP requests to see the ILM policy details being applied to my indices, as sketched below.
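
Requests along the following lines show those details (Dev Tools console syntax; “logstash” is the default alias/policy name created by the image):

# show the current logstash-policy definition
GET _ilm/policy/logstash-policy
# show which lifecycle phase each logstash-* index is currently in
GET logstash-*/_ilm/explain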

8. Elasticsearch can take up to 10 minutes to check and run the lifecycle stages (the ILM poll interval defaults to 10 minutes), so don’t worry if you don’t see indices being rolled over or deleted right after the configured periods. For example, if you have configured rollover after 1 minute, it might still take up to 10 minutes for the check to run and initiate the rollover.

9. Lastly, I edited the policy to roll over after 1 day and delete the index 3 days after rollover, which corresponds roughly to the policy sketched below.
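
Expressed through the ILM API, that end state looks something like this (a sketch, not a dump of the exact policy from my cluster):

# roll over after 50GB or 1 day, delete 3 days after rollover
PUT _ilm/policy/logstash-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "3d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}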
