Big Data: ElasticSearch Datastore in Action

This blog post will help get you on your way to managing your very own ElasticSearch datastore. ElasticSearch is an open-source, distributed, RESTful, search engine. ES (ElasticSearch) uses JVM and is built on top of Apache Lucene. ES is great for indexing large amounts of data, sifting through a large result set, and analyzing data.



INTRODUCTION

ES can store up to 2.1 billion documents (or 274 billion distinct terms) per shard. This is awesome; however, there are some important things to be aware of before you start importing records (known as “documents” in ES). One of those things is that the number of primary shards must be set when the index is created and cannot be changed afterward.

Unfortunately, I learned this lesson the hard way! One of my ES indexes contained 1.6 billion documents and was starting to cause issues. So, if you're importing billions of records, please plan accordingly and create the index with more primary shards than the default.
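For example, the primary-shard count goes in the index settings at creation time. Here is a minimal sketch (the shard count, replica count, and index name books are illustrative) that builds the request body you would send with curl -XPUT 'localhost:9200/books':

```python
import json

def index_settings(primary_shards, replicas=1):
    """Settings body for a new index; number_of_shards is fixed
    at creation time and cannot be raised later."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }

# e.g. curl -XPUT 'localhost:9200/books' -H 'Content-Type: application/json' -d "$body"
body = json.dumps(index_settings(10))
print(body)
```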

Note: The great thing about ElasticSearch is that it is JSON over HTTP. This allows applications written in many different programming languages to easily talk to an ElasticSearch datastore.

In this blog post I will demonstrate how to import documents via the _bulk API module. I will also show you how to communicate with your ElasticSearch datastore in Python using the elasticsearch library. Afterward, you should be able to create a valid ES JSON file, import a large set of documents, and manipulate them in Python.

EXAMPLE 1: IMPORTING RECORDS VIA THE _BULK API MODULE

Document importing is relatively fast with the _bulk API module. Let’s say you have security books that you want to index. In this example, I will demonstrate how to import these documents via the _bulk API.

{"index":{"_index":"books","_type":"security","_id":"0"}}
{"Publisher Name":"O’Reilly","Year":"2017", "Author": "Lee Brotherston and Amanda Berlin", "Title": "Defensive Security Handbook: Best Practices for Securing Infrastructure"}
{"index":{"_index":"books","_type":"security","_id":"1"}}
{"Publisher Name":"CreateSpace","Year":"2017", "Author": "Alan J White and Ben Clark", "Title": "Blue Team Field Manual"}
{"index":{"_index":"books","_type":"security","_id":"2"}}
{"Publisher Name":"CreateSpace","Year":"2014", "Author": "Ben Clark", "Title": "Red Team Field Manual"}
{"index":{"_index":"books","_type":"security","_id":"3"}}
{"Publisher Name":"No Starch Press","Year":"2014", "Author": "Justin Seitz", "Title": "Black Hat Python: Programming for Hackers and Pentesters"}
{"index":{"_index":"books","_type":"security","_id":"4"}}
{"Publisher Name":"No Starch Press","Year":"2014", "Author": "Georgia Weidman", "Title": "Penetration Testing: A Hands-On Introduction to Hacking"}

Parse your data into this format. ES is most efficient when all documents in an index share the same fields.
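Building that format by hand is error-prone; here is a small sketch of how you might generate the action/source line pairs from a list of Python dicts (the helper name to_bulk is my own):

```python
import json

def to_bulk(index, doc_type, docs):
    """Interleave _bulk action lines and document source lines
    in NDJSON format, assigning sequential _id values."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": str(i)}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

books = [
    {"Publisher Name": "CreateSpace", "Year": "2014",
     "Author": "Ben Clark", "Title": "Red Team Field Manual"},
]
print(to_bulk("books", "security", books))
```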

Then use the curl command to import the file into ES:

curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/_bulk' --data-binary @books.json


If the import is successful, you will get “result”: “created” for each document. Each JSON file should have between 2 and 80,000 lines, and should not exceed 10MB; larger files will be rejected.

Bash is an excellent tool for importing multiple files into ES. If you have one massive JSON file, split it into parts, then import them with this bash code:

for file in books_{1..33}; do
    curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/_bulk' --data-binary @$file > /dev/null
    echo $file
done

Redirecting to /dev/null throws the output away rather than printing it on the screen, which makes importing much faster. Be careful with this approach, as you may miss errors!
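The splitting can also be done in code. A sketch, assuming each document is an action line followed by a source line, so the chunk size must be even to keep pairs together (split_bulk is a hypothetical helper):

```python
def split_bulk(lines, max_lines=80000):
    """Split bulk NDJSON lines into chunks of at most max_lines,
    keeping each action/source pair together (max_lines must be even)."""
    assert max_lines % 2 == 0
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]

# Six documents -> twelve lines -> three chunks of four lines each
lines = []
for i in range(6):
    lines.append('{"index":{"_id":"%d"}}' % i)
    lines.append('{"Title":"book %d"}' % i)

for n, chunk in enumerate(split_bulk(lines, max_lines=4)):
    print("books_%d: %d lines" % (n + 1, len(chunk)))
```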

Visit the index in the browser to verify everything has been imported.

http://localhost:9200/books/security/_count

The _count API will show how many documents are currently in the books/security index.

EXAMPLE 2: WORKING WITH ES IN PYTHON

In this example, I will demonstrate how to use Python to search our ES datastore.

Use pip install elasticsearch to install the Python ES library.

#!/usr/bin/env python
 
# Author: Amanda Szampias
 
from elasticsearch import Elasticsearch
 
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
 
def get_author(author, maxsize=50, doc_type=None):
  return es.search(index='books', doc_type=doc_type, size=maxsize, q='Author:"' + author + '"')
 
def get_year(year, maxsize=50, doc_type=None):
  return es.search(index='books', doc_type=doc_type, size=maxsize, q='Year:"' + year + '"')
 
def main():
  print(get_author('Justin Seitz', 100))
  print('-----------------------------')
  print(get_year('2017', 100))
 
if __name__ == "__main__":
  main()

If you run this code it will return:

Amandas-MacBook-Pro:ESSearch aszampias$ ./searchES.py
{u'hits': {u'hits': [{u'_score': 0.51623213, u'_type': u'security', u'_id': u'3', u'_source': {u'Author': u'Justin Seitz', u'Year': u'2014', u'Title': u'Black Hat Python: Programming for Hackers and Pentesters', u'Publisher Name': u'No Starch Press'}, u'_index': u'books'}], u'total': 1, u'max_score': 0.51623213}, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}, u'took': 2, u'timed_out': False}
-----------------------------
{u'hits': {u'hits': [{u'_score': 0.2876821, u'_type': u'security', u'_id': u'0', u'_source': {u'Author': u'Lee Brotherston and Amanda Berlin', u'Year': u'2017', u'Title': u'Defensive Security Handbook: Best Practices for Securing Infrastructure', u'Publisher Name': u'O\u2019Reilly'}, u'_index': u'books'}, {u'_score': 0.2876821, u'_type': u'security', u'_id': u'1', u'_source': {u'Author': u'Alan J White and Ben Clark', u'Year': u'2017', u'Title': u'Blue Team Field Manual', u'Publisher Name': u'CreateSpace'}, u'_index': u'books'}], u'total': 2, u'max_score': 0.2876821}, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}, u'took': 2, u'timed_out': False}

The search query “Justin Seitz” returned one document, and the search query “2017” returned two documents. Since I didn’t specify a doc_type (ours is security), it searched every type in the books index. Anything you index is searchable in ElasticSearch.
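In practice you rarely want the raw response dict; pulling out fields is just a matter of walking hits.hits. A small sketch (the helper name titles is my own) against the same response shape shown above:

```python
def titles(response):
    """Collect the Title field from every hit in an ES search response."""
    return [hit["_source"]["Title"] for hit in response["hits"]["hits"]]

# Trimmed-down copy of the get_author('Justin Seitz') response above
sample = {"hits": {"total": 1, "hits": [
    {"_id": "3",
     "_source": {"Author": "Justin Seitz",
                 "Title": "Black Hat Python: Programming for Hackers and Pentesters"}},
]}}
print(titles(sample))
```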


TROUBLESHOOTING

If you're new to ES (especially Amazon Web Services ES) you may run into a few errors.

If you see an error like this:

"cluster_block_exception","reason":"blocked by: [FORBIDDEN/8/index write (api)];"}}},{"index":{"_index":"emailbreach","_type":"exploitin0antipublic","_id":"825014698","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/8/index write (api)];"}}

It could be because the ES cluster is out of disk space. Free up or expand disk space and/or add more nodes.

If you're using Amazon AWS ES and you see this error:

Error (403, '{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet on resource: hl-esstore"}')

Your IP is not allowed to communicate with ES; whitelist it in the domain's access policy in Amazon.
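As a sketch, the whitelist lives in the domain's resource-based access policy. It looks something like the following; the account ID, region, and source IP are placeholders for your own values (hl-esstore is the domain name from the error above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/hl-esstore/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": ["203.0.113.25/32"] }
      }
    }
  ]
}
```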

IN CONCLUSION

Now you know how to create valid ES JSON files, import a large number of documents into ES using the _bulk API, and use Python to manipulate your ES datastore. This blog post should have you well on your way to managing your very own ElasticSearch datastore.

ElasticSearch is a great free product for storing and analyzing big data. It has a vibrant community and is still being improved daily. You can also upgrade ES to a newer version without losing your data. It is highly scalable and manageable.