Python
teobon, 2014-10-28 12:06:38

How do I properly set up parallel indexing in Elasticsearch from Python?

There is a task: build an index in Elasticsearch as quickly as possible.
The input is data prepared in Python.
The output is an index built from scratch.
The infrastructure is as simple as it gets for now: one ES node, one index, one shard, no replicas.
There is not much data, about 100K documents, but it includes geodata, and ES indexes only a handful of documents per second.
ES is fed via elasticutils in a loop (i.e. synchronously) in bulk mode.
The problem is that building the index takes 10 hours; I would like to cut that several-fold (the target is 1 hour).
At the same time, the infrastructure is not bottlenecked by CPU, memory, or IO.
Of the 8 cores currently allocated to this server, 2 are loaded on average.
IO is not strained, the load is around 0.
18GB of memory is allocated, 8GB is free on average, so that is fine too.
In other words, the problem is not the infrastructure but the configuration of the whole indexing pipeline.
Potential areas for improvement:
1) feed ES asynchronously / in parallel from python (for example, using celery); a rough sketch of this option is below
2) optimize the index storage structure in ES (many shards, etc)
3) optimize ES settings (no idea what can be improved here, the thread pools are already more than sufficient)
What do you think, in what directions is it best to dig?
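
For reference, the kind of thing I have in mind for option 1, as a rough sketch: bulk batches pushed from a thread pool with the official elasticsearch-py client instead of celery. The index name, doc type and load_prepared_docs() are placeholders, and the thread/chunk counts are untuned guesses.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

def generate_actions(docs):
    # Wrap the prepared Python dicts into bulk actions.
    for doc in docs:
        yield {
            '_index': 'my_index',   # placeholder index name
            '_type': 'my_type',     # placeholder doc type
            '_source': doc,
        }

docs = load_prepared_docs()  # hypothetical: the 100K prepared documents

# parallel_bulk splits the action stream into chunks and sends them from a
# thread pool; the generator must be consumed for the requests to be sent.
for ok, result in helpers.parallel_bulk(es, generate_actions(docs),
                                        thread_count=4, chunk_size=1000):
    if not ok:
        print(result)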


3 answer(s)
Eternalko, 2014-10-28
@Eternalko

Have you looked at index_concurrency? In theory it should be set to 8, and all 8 cores should then be loaded.
index.merge.scheduler.max_thread_count: if IO is not your bottleneck, you can raise it as well.
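
Roughly, both can be set in the index settings when the index is created, something like this (a sketch with the elasticsearch-py client; 'my_index' and the values are illustrative, not tuned):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])
es.indices.create(index='my_index', body={
    'settings': {
        'index.index_concurrency': 8,                  # match the 8 available cores
        'index.merge.scheduler.max_thread_count': 3,   # raise only if IO is not the bottleneck
    }
})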
By the way, that is strange. My ES once received (by mistake) 1-2k log documents per second; the server was weak but handled the load without problems. True, the index was rebuilt once a day, so I did not really compare before and after. I simply dropped everything unnecessary without sorting through it.

teobon, 2014-12-18
@teobon

The only thing that somehow helped load all the cores was increasing the number of shards (there was one, now there are 4).
A larger number of shards does not give any further gain, and a smaller number does not load all the cores.
I still do not fully understand why. In principle one could try to explain it by shards being locked for writes during indexing, but according to the docs there should not be such behavior.
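
For completeness, a sketch of what recreating the index with 4 shards and no replicas looks like (elasticsearch-py client; 'my_index' is a placeholder name):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])
es.indices.delete(index='my_index', ignore=[404])  # drop the old single-shard index
es.indices.create(index='my_index', body={
    'settings': {
        'number_of_shards': 4,
        'number_of_replicas': 0,
    }
})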

un1t, 2015-02-08
@un1t

100 thousand documents is not much at all.
I have indexed up to 250 thousand; indexing takes only a few minutes on a weak VDS (I have no experience with geodata, though).
What is your average document size? Do you index all fields?
When creating an index, I specify

es.create_index(index, {
    'index': {
        'refresh_interval': -1,
    }
})

I send documents through bulk_index 10 thousand at a time, and at the end I trigger an index refresh:
es.refresh(index)
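
Put together, the loop looks roughly like this (same pyelasticsearch-style client as in the snippet above; docs, index and doc_type are placeholders):

def chunks(seq, size):
    # Split the document list into batches of the given size.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

for chunk in chunks(docs, 10000):
    es.bulk_index(index, doc_type, chunk)

es.refresh(index)  # make the newly indexed documents searchable again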
