Upgrade an Elasticsearch cluster from 2.4.4 to 6.5.1

Chen
3 min read · Dec 2, 2018

--

1. Set up an ES 6.5.1 cluster

Some config fields and plugins have been changed or deprecated.

in 2.4.4

es_plugins:
  - plugin: cloud-aws
  - plugin: delete-by-query
  - plugin: lmenezes/elasticsearch-kopf

should now be changed to:

in 6.5.1

es_plugins:
  - plugin: discovery-ec2

bootstrap.mlockall: true should be changed to bootstrap.memory_lock: true, and threadpool.bulk.queue_size: 100 to thread_pool.bulk.queue_size: 100 (yes, really, the only difference is the underscore).

Remove

discovery.zen.ping.multicast.enabled: false

as multicast discovery has been removed entirely, and unknown settings now prevent the node from starting.
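For reference, the same renames in a bare elasticsearch.yml for 6.5.1 would look like this (the values and unicast hosts are only examples):

```yaml
# elasticsearch.yml on 6.5.1 -- renamed/removed settings only
bootstrap.memory_lock: true           # was: bootstrap.mlockall
thread_pool.bulk.queue_size: 100      # was: threadpool.bulk.queue_size
# discovery.zen.ping.multicast.enabled is gone; multicast discovery was
# removed, so list unicast hosts instead:
discovery.zen.ping.unicast.hosts: ["10.0.3.51", "10.0.3.41"]
```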

We are using Ansible (https://github.com/elastic/ansible-elasticsearch), so the config becomes:

roles:
  - { role: elastic.elasticsearch,
      es_enable_xpack: false,
      es_api_host: "{{ ansible_default_ipv4.address }}",
      es_instance_name: "{{ es_cluster_name }}_{{ ansible_default_ipv4.address }}",
      es_heap_size: "{{ heap_size }}",
      es_data_dirs: "{{ data_dirs }}",
      es_config: {
        cluster.name: "{{ es_cluster_name }}",
        discovery.zen.ping.unicast.hosts: "{{ es_hosts | join(',') }}",
        network.host: "{{ ansible_default_ipv4.address }}",
        http.port: "{{ es_http_port }}",
        http.max_content_length: "500m",
        thread_pool.bulk.queue_size: 100,
        bootstrap.memory_lock: true,
        transport.tcp.port: "{{ es_tcp_port }}",
        node.data: true,
        cluster.routing.allocation.disk.threshold_enabled: true,
        cluster.routing.allocation.disk.watermark.low: "93%",
        cluster.routing.allocation.disk.watermark.high: "95%",
        reindex.remote.whitelist: "10.0.3.51:9200,10.0.3.41:9200,10.0.3.26:9200,10.0.3.175:9200,10.0.3.169:9200" } }

Remember to add reindex.remote.whitelist, as it allows the new cluster to reindex from your old one. bootstrap.memory_lock is also important.
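As a sketch of why the whitelist matters: the reindex request you send to the new cluster points source.remote.host at one of the whitelisted old nodes (the host below is one of the example IPs above; the index names are hypothetical):

```python
import json

# Reindex-from-remote request body for the 6.5.1 cluster. The remote host
# must appear in reindex.remote.whitelist, or Elasticsearch rejects the call.
reindex_body = {
    "source": {
        "remote": {"host": "http://10.0.3.51:9200"},  # a whitelisted old node
        "index": "myindex",                           # hypothetical index name
    },
    "dest": {"index": "myindex_v6"},                  # hypothetical new index
}

print(json.dumps(reindex_body, indent=2))
```

You would POST this body to _reindex on the new cluster.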

2. In order to reindex, we need to fix the schema

You can first dump your old schema to a file by running a simple curl. For example:

curl http://${yourhost}:9200/myindex/_settings,_mappings > $DIR/myindex.json

Some critical changes:

  • There is no string type anymore; it was split into keyword/text. For every not_analyzed string field, you can change the type to "keyword"; analyzed string fields become "text".
  • The _default_ mapping is deprecated. You can just remove it.
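The two bullet points above can be mechanized. A minimal sketch (the helper names and the exact layout of your dumped schema are assumptions) that rewrites a 2.4 mapping for 6.5: not_analyzed strings become keyword, other strings become text, and _default_ is dropped:

```python
def convert_properties(props):
    """Recursively rewrite a 2.x 'properties' dict for 6.x."""
    out = {}
    for name, field in props.items():
        field = dict(field)
        if field.get("type") == "string":
            # In 2.x, index: not_analyzed marked exact-match strings.
            if field.pop("index", None) == "not_analyzed":
                field["type"] = "keyword"
            else:
                field["type"] = "text"
        if "properties" in field:  # nested objects
            field["properties"] = convert_properties(field["properties"])
        out[name] = field
    return out

def convert_mappings(mappings):
    """Drop the deprecated _default_ mapping and convert each type."""
    return {
        t: {"properties": convert_properties(m.get("properties", {}))}
        for t, m in mappings.items() if t != "_default_"
    }
```

This only covers the string/keyword/text and _default_ changes; other deprecated mapping options would need their own handling.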

3. Once we have the schema, you can create the new index in the 6.5.1 cluster

4. Write your reindex script.

One big breaking change is that Elasticsearch 6.5.1 no longer accepts 0/1 as boolean values, so a simple reindex will fail. There are two ways to get around this issue. One is to use a pipeline (copied from https://discuss.elastic.co/t/reindex-from-2-4-to-6-5-failed-on-boolean-type/158846/2, not tried myself):

“The way I solved it was to create a pipeline 1 with a processor that transformed the problematic integer values to proper booleans and then used this pipeline in a reindex operation. When all indices had been reindexed I could upgrade my cluster to ES 6 without further problems.

Both pipelines and the reindex API are available in Elasticsearch 2.4 but I have no experience using them in that old version; my upgrade was done from version 5.6 so if you struggle using pipelines in the reindexing step you could first try to upgrade from 2.4 to 5.6, without changing the boolean field values, and from there to 6.5 using the aforementioned pipeline to fix the integer values.”

Another way is to add a script to your reindex request. For example, suppose clean is a boolean field on user:

"script": {
  "source": "def c = ctx._source['user'].get('clean'); ctx._source['user']['clean'] = c instanceof Boolean ? c : (c instanceof Integer ? '1' == c.toString() : false);"
}

If possible, it is also a good idea to explicitly list out attributes to avoid unexpected errors. I use a simple Python script to just collect the attributes (put here for you to copy easily):

#!/usr/bin/python
import json

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch(['yourhost:9200'],
                   sniff_on_connection_fail=True)

query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"slug": "testslug"}}
            ]
        }
    }
}

# Scroll through the matching documents and collect every attribute name.
scroll = helpers.scan(es, query=query, index='your_index', scroll='5m')
attr_dict = {}
attr_array = []
for res in scroll:
    source = res['_source']
    for attr in source:
        if attr == 'attribute_wanna_skip':
            continue
        if attr not in attr_dict:
            if attr == 'embedded_object':
                # Flatten one level of a nested object into dotted names.
                for sub in source[attr]:
                    sub_attr = attr + '.' + sub
                    if sub_attr not in attr_dict:
                        attr_array.append(sub_attr)
                        attr_dict[sub_attr] = True
            else:
                attr_array.append(attr)
                attr_dict[attr] = True

print(json.dumps(attr_array))

Then you can just copy the printed list into the "_source" field of the reindex request's source section.

You can also add "size" to your source to control the batch size. Note that it does not honor http.max_content_length (see https://discuss.elastic.co/t/reindex-does-not-honor-http-max-content-length-500m/158869), so I had to walk it down from 500 to 400 to 300 … all the way to 50. (Yuck!)
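Putting step 4's pieces together, the full reindex body ends up looking roughly like this (hosts, index names, and fields reuse the examples above; attr_array would normally be the list printed by the attribute-collection script):

```python
import json

# Normally produced by the attribute-collection script; hard-coded here
# purely for illustration.
attr_array = ["slug", "user.clean", "created_at"]

reindex_body = {
    "source": {
        "remote": {"host": "http://10.0.3.51:9200"},  # whitelisted old node
        "index": "myindex",                           # hypothetical names
        "size": 50,             # batch size -- had to shrink it from 500
        "_source": attr_array,  # explicit attribute list
        "query": {"term": {"slug": "testslug"}},
    },
    "dest": {"index": "myindex_v6"},
    "script": {
        # Coerce 0/1 integers into proper booleans during the copy.
        "source": (
            "def c = ctx._source['user'].get('clean'); "
            "ctx._source['user']['clean'] = c instanceof Boolean "
            "? c : (c instanceof Integer ? '1' == c.toString() : false);"
        )
    },
}

print(json.dumps(reindex_body, indent=2))
```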

5. Fix your queries.

5.1) ignore_unmapped no longer works in sorts. Change it to "unmapped_type": "your data type".

5.2) You cannot pass 0/1 to your bool queries anymore. You can, however, continue to use true/false.

5.3) minimum_should_match: 1 no longer works inside a bool that also has must and must_not clauses. I had to move it into a nested bool with the should clauses.
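For 5.3, a sketch of the rewrite: instead of putting minimum_should_match next to must/must_not, wrap the should clauses in their own bool inside must (field names are made up):

```python
# Before (2.4): minimum_should_match sat beside must/must_not and was honored.
# After (6.5): nest the should clauses in their own bool inside must.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "active"}},
                {
                    "bool": {
                        "should": [
                            {"term": {"tag": "a"}},
                            {"term": {"tag": "b"}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
            ],
            "must_not": [{"term": {"deleted": True}}],
        }
    }
}
```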
