How to index json dataset in Elasticsearch

How to index json dataset in Elasticsearch

Elasticsearch_json_datasets_indexing

Here in this article we will see how we can index a json dataset consisting of documents in elasticsearch. We will be setting up elasticsearch cluster and use a python script to extract the documents from json dataset and send it to elasticsearch for indexing.

Test Environment

Fedora workstation 37
Docker
Docker Compose

If you are intereseted in watching the video. Here is the YouTube video on the same step by step procedure outlined below.

Procedure

Step1: Download json datasets

For this article i am downloading a sample dataset available at the following github respository – awesome-json-datasets.

wget http://api.nobelprize.org/v1/prize.json

Step2: Install jq

As a part of second step we are installing jq tool. jq is a command line json processor which can be used to slice, filter, map and transform with the json formatted data. We can use this tool to inspect json data in a readable format.

sudo dnf install jq

Step3: Inspect json data

In this step we will using jq processor to inspect our prizes.json dataset and verify what kind of documents are present in this dataset and what are different types of data present in each of these fields.

File: prizes.json

[admin@fedser elastic-kibana]$ cat prize.json | jq
{
  "prizes": [
    {
      "year": "2022",
      "category": "chemistry",
      "laureates": [
        {
          "id": "1015",
          "firstname": "Carolyn",
          "surname": "Bertozzi",
          "motivation": "\"for the development of click chemistry and bioorthogonal chemistry\"",
          "share": "3"
        },
        {
          "id": "1016",
          "firstname": "Morten",
          "surname": "Meldal",
          "motivation": "\"for the development of click chemistry and bioorthogonal chemistry\"",
          "share": "3"
        },
        {
          "id": "743",
          "firstname": "Barry",
          "surname": "Sharpless",
          "motivation": "\"for the development of click chemistry and bioorthogonal chemistry\"",
          "share": "3"
        }
      ]
    },
...

Step4: Ensure Docker service running

Now that we are having our dataset ready, let us ensure docker and docker compose is installed on our system and is running.

sudo systemctl start docker.service
sudo systemctl enable docker.service
sudo systemctl status docker.service

Step5: Ensure Elasticsearch and Kibana service Cluster running

Now let’s start up the elasticsearch cluster using the docker-compose.yml file. You can check out my article on “How to enable audit logging in Elasticsearch” were in we have setup the elasticsearch cluster using docker-compose file.

docker-compose ps
docker-compose up -d

Output:

         Name                        Command                  State                             Ports                       
----------------------------------------------------------------------------------------------------------------------------
elastic-kibana_es01_1     /bin/tini -- /usr/local/bi ...   Up (healthy)   0.0.0.0:9200->9200/tcp,:::9200->9200/tcp, 9300/tcp
elastic-kibana_es02_1     /bin/tini -- /usr/local/bi ...   Up (healthy)   9200/tcp, 9300/tcp                                
elastic-kibana_es03_1     /bin/tini -- /usr/local/bi ...   Up (healthy)   9200/tcp, 9300/tcp                                
elastic-kibana_kibana_1   /bin/tini -- /usr/local/bi ...   Up (healthy)   0.0.0.0:5601->5601/tcp,:::5601->5601/tcp          
elastic-kibana_setup_1    /bin/tini -- /usr/local/bi ...   Exit 0 

Copy the ca certificate from any elasticsearch docker container as shown below. This is required for us to connect to the elasticsearch cluster and send REST API requests in a secure channel.

docker cp elastic-kibana_es01_1:/usr/share/elasticsearch/config/certs/ca/ca.crt .

Step6: Index Document

Now that we have our elasticsearch cluster up and running. Let’s look at the below python script which we will use to extract the json documents from the prize.json dataset and send it using a POST REST API request to elasticsearch search for indexing.

File: indexDocuments.py

import json
import requests
import sys

indexname = sys.argv[1]
print(indexname)
url = "https://localhost:9200/{}/_doc".format(indexname)
cafile = 'ca.crt'
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
username = "elastic"
password = "admin@1234"

# Opening JSON file
f = open('prize.json')
  
# returns JSON object as 
# a dictionary
data = json.load(f)

# print json data
#print(data['prizes'][0])

# print number of documents in the json
count = len(data['prizes'])
print(count)

# index each document
for each in range(count):
    print(each)
    prizedoc = data['prizes'][each]
    r = requests.post(url, data=json.dumps(prizedoc), auth=(username, password), headers=headers, verify=cafile)
    print(r.status_code)

# Closing file
f.close()
python indexDocuments.py prizes

Step7: Validate the Index

Once our indexing is done we can navigate to Stack Management – Data View and Create our Data View based on the index source that got generated. Here as you can see there is a prizes index created with each document indexed within that index. But if we closely look at the document field types, every field is of type text. But from our initial inspection of json dataset using jq tool we saw that some of the field values are with numeric data.

These fields can be mapped to their respective types so we have the correct mapping of the data. Let us see what options we have to map these field types correctly in the document in our next article.

Hope you enjoyed reading this article. Thank you..