Demystifying Elasticsearch: Understanding nodes, clusters, shards and indices — Part 2
This is the 2nd post in a series about learning Elasticsearch.
Elasticsearch is a distributed, open-source search and analytics engine designed for scalability and speed. Understanding its architecture is crucial for effectively deploying and managing Elasticsearch clusters. In this article, we’ll delve into the basics of Elasticsearch architecture, covering nodes and their types, clusters, shards, and the role of indices.
Nodes in Elasticsearch
What is a Node?
An Elasticsearch cluster is made up of nodes, which are individual instances of the Elasticsearch server. Nodes can be deployed on separate machines or run on a single machine for development purposes.
Types of Nodes
Elasticsearch nodes can be categorized into several types based on their roles and responsibilities within the cluster. Here are the main types of nodes in Elasticsearch:
Master Node:
- The master node is responsible for cluster-wide activities such as creating or deleting an index, adding or removing nodes, and managing the cluster state.
- There can only be one elected master node in a cluster at a time, but multiple nodes can be configured as master-eligible, so that another one can take over if the master fails.
- It is recommended to have dedicated master-eligible nodes to ensure stability in larger clusters.
Data Node:
- Data nodes are responsible for storing and indexing data.
- They handle tasks such as CRUD (Create, Read, Update, Delete) operations and search requests.
- Each data node holds a portion of the cluster’s data, in the form of shards. Spreading the shards across several data nodes is what gives scalability and fault tolerance.
- In a large cluster, you can have multiple data nodes to distribute the data load.
Ingest Node:
- Ingest nodes are responsible for pre-processing documents before they are indexed.
- They can apply transformations to incoming documents using pipelines, which may involve enriching data, modifying the structure, or dropping certain fields.
- Ingest nodes can be configured to perform specific tasks, enhancing the flexibility of data processing in Elasticsearch.
Client Node:
- Client node is the name older versions of Elasticsearch used for a node that handles client requests but holds no data and is not master-eligible. Current versions call this a coordinating-only node.
- It acts as a load balancer, distributing incoming requests across data nodes to balance the workload.
- Client nodes are useful in scenarios where you want to separate the handling of client requests from the data and master node responsibilities.
Coordinating Node:
- A coordinating node receives a search or index request, routes it to the data nodes that hold the relevant shards, and aggregates the results.
- Every node can coordinate requests. A coordinating-only node is one that has no other role, so it holds no data.
- Coordinating-only nodes help in distributing the search and query processing load across the cluster.
Machine Learning Node:
- A machine learning node runs machine learning jobs, such as anomaly detection.
- Machine learning nodes are not deprecated. What version 7.9 deprecated is the older way of configuring them: the separate node.ml setting gave way to listing the ml role in node.roles, alongside the other roles.
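The pipelines described under Ingest Node can be pictured as a list of processors applied in order to each incoming document. The sketch below is a plain-Python imitation, not the Elasticsearch implementation. Its three processors are loosely modeled on the built-in set, rename, and remove processors, with error handling left out: a missing field is simply skipped. The pipeline contents and field names are invented.

```python
# Sketch of an ingest pipeline: processors run in order on a copy of the
# document before it is indexed. Field names are hypothetical.

def set_field(field, value):
    def processor(doc):
        doc[field] = value                      # enrich: add or overwrite a field
        return doc
    return processor

def rename_field(field, target_field):
    def processor(doc):
        if field in doc:
            doc[target_field] = doc.pop(field)  # modify the structure
        return doc
    return processor

def remove_field(field):
    def processor(doc):
        doc.pop(field, None)                    # drop the field if present
        return doc
    return processor

def run_pipeline(processors, doc):
    result = dict(doc)                          # leave the caller's document untouched
    for processor in processors:
        result = processor(result)
    return result

pipeline = [
    set_field("environment", "production"),
    rename_field("msg", "message"),
    remove_field("password"),
]

incoming = {"msg": "user logged in", "password": "secret"}
print(run_pipeline(pipeline, incoming))
# {'environment': 'production', 'message': 'user logged in'}
```

On a real cluster the pipeline is defined once and the node with the ingest role runs it; the client only names the pipeline when it sends the document.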
Node Roles
Nodes can have multiple roles, and a node can serve as both a master and a data node simultaneously. Understanding and assigning appropriate roles to nodes is essential for optimizing the cluster’s performance and resilience.
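Roles are assigned per node through the node.roles setting in elasticsearch.yml, available since version 7.9. The sketch below renders that line for the node types discussed above. The role names are the real ones. The node type labels and the helper function exist only for this illustration, and a real cluster would often combine roles rather than isolate each one.

```python
# Sketch: render the node.roles line for several kinds of node.

ROLES_BY_NODE_TYPE = {
    "dedicated master": ["master"],
    "data": ["data"],
    "ingest": ["ingest"],
    "machine learning": ["ml"],        # ml remains a valid role in current versions
    "coordinating-only": [],           # no roles; called a client node in older versions
    "master and data": ["master", "data"],
}

def node_roles_line(roles):
    if not roles:
        return "node.roles: [ ]"
    return "node.roles: [ " + ", ".join(roles) + " ]"

for node_type, roles in ROLES_BY_NODE_TYPE.items():
    print(f"# {node_type} node")
    print(node_roles_line(roles))
```

If node.roles is left out altogether, the node takes on all roles, which is convenient for a single development machine.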
Clusters in Elasticsearch
What is a Cluster?
A cluster is a collection of nodes that work together to store data and provide search and indexing capabilities. It provides horizontal scalability, allowing you to add or remove nodes easily to accommodate changing requirements.
Cluster State
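- The cluster state is the cluster’s shared metadata. It records which nodes belong to the cluster, the settings and mappings of every index, and where each shard is allocated. Only the elected master node changes it, and it publishes every new version to the other nodes. Understanding the cluster state is essential for monitoring and troubleshooting Elasticsearch clusters.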
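To make the idea concrete, here is a minimal sketch of a cluster state as a versioned, read-only record. The field names are simplified for illustration and do not match the real API response, and the real protocol tracks more than a single version number.

```python
# Sketch of a cluster state: versioned metadata that only the master changes.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ClusterState:
    version: int
    master_node: str
    nodes: tuple            # names of the nodes currently in the cluster
    indices: tuple          # names of the existing indices
    routing_table: tuple    # (index, shard number, "p" or "r", node name)

def master_update(state, **changes):
    # Every change yields a new state carrying the next version number.
    return replace(state, version=state.version + 1, **changes)

def apply_on_node(current, published):
    # A node keeps whichever state is newer and ignores stale ones.
    return published if published.version > current.version else current

v1 = ClusterState(1, "node-1", ("node-1", "node-2", "node-3"), (), ())
v2 = master_update(
    v1,
    indices=("articles",),
    routing_table=(("articles", 0, "p", "node-2"), ("articles", 0, "r", "node-3")),
)
print(v2)
```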
Discovery and Communication
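- Nodes in a cluster use a discovery mechanism to find each other. In current versions a node starts from a list of seed hosts, which can come from the settings, from a file, or from a cloud discovery plugin. Multicast discovery belonged to early versions and is no longer available.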
- The communication between nodes involves exchanging information about the cluster state, performing health checks, and coordinating actions like shard allocation.
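Seed-based discovery can be sketched as a walk over peers: probe the seed addresses, ask each node that answers for the peers it knows, and continue until nothing new turns up. The addresses and the in-memory network below are hypothetical. The real process also handles retries and master elections, which are left out here.

```python
# Sketch of seed-based discovery over a made-up, in-memory network.
from collections import deque

KNOWN_PEERS = {   # what each running node would answer when probed
    "10.0.0.1": ["10.0.0.2"],
    "10.0.0.2": ["10.0.0.1", "10.0.0.3"],
    "10.0.0.3": ["10.0.0.2"],
}

def discover(seed_hosts, network):
    found = set()
    to_probe = deque(seed_hosts)
    while to_probe:
        address = to_probe.popleft()
        if address in found or address not in network:
            continue                        # already known, or not reachable
        found.add(address)
        to_probe.extend(network[address])   # peers reported by that node
    return found

# 10.0.0.9 is a seed that is not running; it is skipped.
print(sorted(discover(["10.0.0.1", "10.0.0.9"], KNOWN_PEERS)))
# ['10.0.0.1', '10.0.0.2', '10.0.0.3']
```

A single reachable seed is enough to find the whole cluster, which is why the seed list does not have to name every node.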
Cluster Health
- Elasticsearch uses a color-coded health system to indicate the overall status of a cluster. The colors are green, yellow, and red, representing good health, reduced redundancy, and unavailable data, respectively.
Green: All primary and replica shards are active.
Yellow: All primary shards are active, but some or all replica shards are not allocated.
Red: Some or all primary shards are not active.
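The three rules translate directly into a small function. This is a sketch of the logic only: the input format, a list of (is_primary, is_active) pairs, is an assumption made for the example.

```python
# Sketch: derive the health color from the state of every shard copy.
# Each entry is (is_primary, is_active).

def cluster_health(shard_copies):
    if any(is_primary and not active for is_primary, active in shard_copies):
        return "red"        # at least one primary shard is not active
    if any(not active for _, active in shard_copies):
        return "yellow"     # all primaries are active, some replica is not
    return "green"          # every primary and replica is active

print(cluster_health([(True, True), (False, True)]))     # green
print(cluster_health([(True, True), (False, False)]))    # yellow
print(cluster_health([(True, False), (False, False)]))   # red
```

On a live cluster the same value is the status field returned by GET _cluster/health.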
Shards in Elasticsearch
What is a Shard?
In Elasticsearch, a shard is a basic unit of data storage and search. Elasticsearch uses a distributed architecture to handle large amounts of data and provide scalable and efficient search capabilities. Sharding is the process of breaking down the index into smaller, more manageable pieces called shards. Understanding shards is crucial for designing and optimizing Elasticsearch clusters.
Each index is divided into one or more primary shards, and each primary shard can have zero or more replica shards.
Types of Shards
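Primary Shards: Every document belongs to exactly one primary shard. A write goes to that primary shard first and is then copied to its replicas. The number of primary shards is set when creating an index and cannot be changed on that index afterwards. Getting a different number means creating a new index, by splitting, shrinking, or reindexing.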
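Replica Shards: Replica shards are copies of the primary shards, serving as failover mechanisms. They improve system resilience and enable parallel search and retrieval operations. A replica is never placed on the same node as its primary, and unlike the number of primary shards, the number of replicas can be changed at any time.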
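The fixed number of primary shards follows from how documents are routed. Elasticsearch hashes the routing value, which defaults to the document ID, and takes the remainder after dividing by the number of primary shards. The sketch below uses that simplified form; the documented formula has an extra factor that supports index splitting. Elasticsearch hashes with Murmur3, and CRC32 stands in here so the example runs with the standard library alone, so the shard numbers will not match a real cluster. The document IDs are made up.

```python
# Sketch of document routing with a stand-in hash function (CRC32).
import zlib

def shard_for_hash(hash_value, number_of_primary_shards):
    return hash_value % number_of_primary_shards

def shard_for(doc_id, number_of_primary_shards):
    hash_value = zlib.crc32(doc_id.encode("utf-8"))
    return shard_for_hash(hash_value, number_of_primary_shards)

doc_ids = [f"doc-{n}" for n in range(1, 11)]

with_3 = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
moved = [doc_id for doc_id in doc_ids if with_3[doc_id] != with_4[doc_id]]

print("3 primary shards:", with_3)
print("4 primary shards:", with_4)
print("looked up on a different shard after the change:", moved)
```

Any document in the last list would be searched for on a shard that does not hold it, which is why the count stays fixed for the life of the index.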
Let’s consider a cluster comprising 2 dedicated master-eligible nodes and 2 data nodes. Additionally, there is an index with 3 primary shards and 1 replica. The diagram below illustrates how Elasticsearch distributes the 6 shards (3 primary shards and 1 replica of each) across the data nodes in the cluster:
In this scenario, each data node hosts both primary and replica shards, ensuring that the index’s data is distributed across the available nodes for fault tolerance. Elasticsearch will automatically handle failover scenarios, utilizing the replicas in case of node failures to maintain data availability.
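The layout for this scenario can also be worked out in a few lines. The round-robin placement below is a simplification and not the real allocator. The one rule it does respect is that two copies of the same shard never share a node. Node names are hypothetical.

```python
# Sketch: place 3 primary shards with 1 replica each on 2 data nodes.

def allocate(number_of_shards, number_of_replicas, data_nodes):
    layout = {node: [] for node in data_nodes}
    unassigned = []
    for shard in range(number_of_shards):
        for copy in range(number_of_replicas + 1):    # copy 0 is the primary
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            if copy < len(data_nodes):
                node = data_nodes[(shard + copy) % len(data_nodes)]
                layout[node].append(label)
            else:
                unassigned.append(label)              # every node already has a copy
    return layout, unassigned

layout, unassigned = allocate(3, 1, ["data-node-1", "data-node-2"])
for node, shards in layout.items():
    print(node, shards)
print("unassigned:", unassigned)
# data-node-1 ['P0', 'R1', 'P2']
# data-node-2 ['R0', 'P1', 'R2']
# unassigned: []
```

With a single data node the three replicas would stay unassigned, which is the yellow health state.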
Distributed Storage
- Elasticsearch is designed to be a distributed system, meaning that it can efficiently distribute and store data across multiple nodes in a cluster.
- Shards are placed on different nodes, which spreads both the storage and the processing load.
Scaling and Performance
- Sharding enables horizontal scaling, allowing Elasticsearch to handle large amounts of data and traffic by adding more nodes to the cluster.
- Each shard can be independently processed by a node, allowing for parallelism and improved search performance.
Search and Query Distribution
- When a search query is executed, the node that receives it sends the query in parallel to one copy, primary or replica, of every relevant shard.
- This parallelism is possible because each shard holds a subset of the data. The receiving node then merges the per-shard results into the final result set.
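This pattern is often called scatter-gather. The sketch below imitates it with threads: each shard returns its own top hits, and the results are merged into one ranked list. Shard contents and scores are made up, and real scoring and paging are far more involved.

```python
# Sketch of scatter-gather search over three in-memory "shards".
import heapq
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    0: [("doc-3", 1.2), ("doc-6", 0.4)],
    1: [("doc-1", 2.5), ("doc-4", 0.9)],
    2: [("doc-2", 1.7), ("doc-5", 0.1)],
}

def search_shard(hits, size):
    # Each shard ranks only the documents it holds.
    return heapq.nlargest(size, hits, key=lambda hit: hit[1])

def search(shards, size):
    with ThreadPoolExecutor() as pool:                  # scatter
        per_shard = list(pool.map(lambda hits: search_shard(hits, size), shards.values()))
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(size, merged, key=lambda hit: hit[1])   # gather

print(search(SHARDS, 3))
# [('doc-1', 2.5), ('doc-2', 1.7), ('doc-3', 1.2)]
```

Each shard has to return a full page of hits, because any one of them might hold all the best matches.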
Rebalancing
- Elasticsearch automatically redistributes shards across nodes to ensure a balanced distribution of data and workload.
- This process is known as shard rebalancing and is triggered when new nodes are added to the cluster or when nodes are removed.
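A rough sketch of what happens when a node joins: shard copies move from the fullest node to the emptiest one until the counts are within one of each other, and two copies of the same shard never end up together. The real balancer weighs more factors than the shard count, so treat this as an illustration only. Node names and the starting layout are hypothetical.

```python
# Sketch of rebalancing by shard count after a new node joins.

def shard_number(label):
    return label[1:]                      # "P0" and "R0" are copies of shard 0

def rebalance(layout):
    layout = {node: list(shards) for node, shards in layout.items()}
    while True:
        fullest = max(layout, key=lambda n: len(layout[n]))
        emptiest = min(layout, key=lambda n: len(layout[n]))
        if len(layout[fullest]) - len(layout[emptiest]) <= 1:
            return layout                 # balanced enough
        held = {shard_number(s) for s in layout[emptiest]}
        movable = [s for s in layout[fullest] if shard_number(s) not in held]
        if not movable:
            return layout                 # nothing can move without breaking the rule
        layout[fullest].remove(movable[0])
        layout[emptiest].append(movable[0])

before = {
    "data-node-1": ["P0", "R1", "P2"],
    "data-node-2": ["R0", "P1", "R2"],
    "data-node-3": [],                    # newly added node
}
after = rebalance(before)
for node, shards in after.items():
    print(node, shards)
```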
Failure Tolerance
- Having at least one replica of every shard provides fault tolerance. If a node containing a primary shard fails, one of that shard’s replicas on another node is promoted to the primary role, ensuring continued availability of the data.
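Replica promotion can be sketched as follows. The layout and node names are hypothetical, and the logic only shows the promotion step, not how the lost copies are rebuilt afterwards.

```python
# Sketch of failover: when a node is lost, every shard whose primary lived
# there gets a surviving replica promoted to primary.

def fail_node(layout, failed_node):
    survivors = {n: list(s) for n, s in layout.items() if n != failed_node}
    for lost in layout[failed_node]:
        if not lost.startswith("P"):
            continue                                  # a lost replica needs no promotion
        replica = "R" + lost[1:]
        for shards in survivors.values():
            if replica in shards:
                shards[shards.index(replica)] = lost  # promote in place
                break
    return survivors

layout = {
    "data-node-1": ["P0", "R1", "P2"],
    "data-node-2": ["R0", "P1", "R2"],
}
print(fail_node(layout, "data-node-1"))
# {'data-node-2': ['P0', 'P1', 'P2']}
```

All data is still served after the failure, but with one data node left the replicas have nowhere to go, so the cluster stays yellow until another data node joins.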
Indices in Elasticsearch
What is an Index?
An index is a collection of documents sharing similar characteristics. It serves as the primary unit for organizing and managing data within Elasticsearch. Each document within an index is uniquely identified by a document ID. Indices are analogous to tables in a relational database.
Index Settings and Mappings
Index settings define the configuration of an index, such as the number of primary shards and replica shards. Mappings define the data types and properties for fields within documents, influencing how Elasticsearch indexes and searches the data.
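As an illustration, the snippet below builds the body of a create-index request, in the shape accepted by version 7 and later, and prints it. Nothing is sent to a cluster. The index name articles and its fields are hypothetical.

```python
# Sketch: settings and mappings in the body of a create-index request.
import json

index_name = "articles"
create_index_body = {
    "settings": {
        "number_of_shards": 3,        # primary shards, fixed once the index exists
        "number_of_replicas": 1,      # replicas per primary, can be changed later
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},          # analyzed for full-text search
            "author": {"type": "keyword"},      # exact values, for filters and sorting
            "published_at": {"type": "date"},
            "views": {"type": "integer"},
        }
    },
}

print(f"PUT /{index_name}")
print(json.dumps(create_index_body, indent=2))
```

The choice between text and keyword is a typical mapping decision: the first is broken into terms for matching, the second is stored as a single exact value.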
Documents
In Elasticsearch, documents are the basic units of information that are stored and indexed. They are essentially JSON objects that contain the actual data you want to store and search.
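A short sketch of what a document looks like on the wire. The helper only formats the request that would store the document under a given ID and sends nothing. The index name, the ID, and the fields are hypothetical.

```python
# Sketch: a document is a JSON object, stored in an index under an ID.
import json

def index_request(index_name, doc_id, source):
    request_line = f"PUT /{index_name}/_doc/{doc_id}"
    body = json.dumps(source)
    return request_line, body

document = {
    "title": "Nodes and clusters",
    "author": "jdoe",
    "views": 120,
    "tags": ["architecture", "basics"],
}

request_line, body = index_request("articles", "1", document)
print(request_line)
print(body)
```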
Conclusion
In summary, understanding the basics of Elasticsearch architecture is fundamental for efficiently deploying and managing Elasticsearch clusters. Nodes, clusters, shards, and indices are essential components that collectively contribute to the scalability, performance, and resilience of Elasticsearch in various use cases. Regular monitoring, proper configuration, and careful planning are key to maintaining a healthy Elasticsearch environment.