Demystifying Elasticsearch: Indices and mapping — Part 3
This is the 3rd post in a series about learning Elasticsearch.
Elasticsearch is a powerful and flexible open-source search and analytics engine that allows you to store, search, and analyze large volumes of data quickly and in near real-time. To effectively utilize Elasticsearch, you need to understand how to index data. This process involves defining and creating indices, mapping types and fields, and performing bulk indexing.
Defining and Creating Indices:
An index in Elasticsearch is similar to a table in traditional relational databases. It is a collection of documents that share a similar structure and are stored in a similar way. Each index has a unique name that is used to identify and reference it.
- Index Naming Conventions: Choose meaningful names for your indices based on the type of data they will store. Avoid using uppercase letters or special characters, and consider using lowercase letters with underscores or hyphens.
- Index Settings: Elasticsearch allows you to configure various index settings. In the example provided below, we set the number of primary shards and replicas. Other settings include analysis settings for text fields, refresh interval, and more.
- Dynamic Index Creation: Elasticsearch can dynamically create an index when you index a document into it. This is useful when dealing with diverse datasets.
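The naming rules above can be captured in a small helper. The function below is a hypothetical sketch, not part of any Elasticsearch client; it checks the most common constraints (lowercase only, no forbidden characters, no leading `-`, `_`, or `+`, and a 255-byte length limit):

```python
# Hypothetical helper: checks common Elasticsearch index-name rules.
FORBIDDEN = set('\\/*?"<>| ,#:')

def is_valid_index_name(name: str) -> bool:
    if not name or name in {".", ".."}:
        return False
    if len(name.encode("utf-8")) > 255:   # length limit is in bytes
        return False
    if name[0] in "-_+":                  # cannot start with -, _ or +
        return False
    if name != name.lower():              # lowercase only
        return False
    return not any(ch in FORBIDDEN for ch in name)
```

For instance, `my_index` and `logs-2024-01` pass, while `MyIndex` or `bad*name` do not.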
Creating an Index:
To create an index, you can use the create index API. For example:
PUT <url_elasticsearch_server:port>/my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
This example creates an index named my_index with 3 primary shards and 2 replicas.
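The same request can be issued from any HTTP client. The sketch below builds the request body in Python; the actual call (shown commented out, using the well-known `requests` library and an assumed local cluster at `localhost:9200`) is left out so the snippet stays self-contained:

```python
import json

# Index settings matching the example above: 3 primary shards, 2 replicas.
settings_body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    }
}

payload = json.dumps(settings_body)

# With the `requests` library (assumed installed) the call would look like:
# import requests
# requests.put("http://localhost:9200/my_index",
#              data=payload,
#              headers={"Content-Type": "application/json"})
print(payload)
```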
Mapping Types and Fields
In Elasticsearch, mapping defines how documents and their fields are stored and indexed. It is similar to the concept of a schema in relational databases. The mapping types and fields in Elasticsearch are crucial for understanding how data is structured and how it can be queried. Let’s explore the key concepts:
Mapping Types
Core Data Types:
- Text: Used for full-text search. Analyzed and tokenized.
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
- Keyword: Used for exact-value search. Not analyzed; good for filtering, sorting, and aggregations.
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "keyword"
      }
    }
  }
}
- Date: For date and time values.
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
- Numeric: For numeric values (integer, long, float, double).
{
  "mappings": {
    "properties": {
      "quantity": {
        "type": "integer"
      }
    }
  }
}
- Boolean: For Boolean values (true/false).
{
  "mappings": {
    "properties": {
      "is_published": {
        "type": "boolean"
      }
    }
  }
}
Complex Data Types:
- Object: Used for nested JSON objects.
{
  "mappings": {
    "properties": {
      "user": {
        "type": "object",
        "properties": {
          "name": { "type": "text" },
          "age": { "type": "integer" }
        }
      }
    }
  }
}
- Array: Elasticsearch has no dedicated array data type; any field can hold one or more values of its mapped type. For example, the keyword and text fields below can each store arrays of values:
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "keyword"
      },
      "comments": {
        "type": "text"
      }
    }
  }
}
- Nested: Similar to the object type, but each object in an array is indexed as a separate hidden document, preserving the relationship between the fields of each nested object so that queries do not match across sibling objects.
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "user": { "type": "text" },
          "comment_text": { "type": "text" }
        }
      }
    }
  }
}
Specialized Data Types:
- Geo-point: For geographical coordinates (latitude and longitude).
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
- IP: For IPv4 and IPv6 addresses.
{
  "mappings": {
    "properties": {
      "ip_address": {
        "type": "ip"
      }
    }
  }
}
- Attachment: For full-text search on binary data such as PDFs or Word documents. The attachment type is deprecated in recent versions; the Ingest Attachment processor plugin (built on Apache Tika) is recommended instead:
{
  "mappings": {
    "properties": {
      "document": {
        "type": "attachment"
      }
    }
  }
}
Fields
Field Data Types:
- Each field in Elasticsearch must be assigned a specific data type.
- For example, a field can be of type text, keyword, date, integer, etc.
Field Indexing Options:
- Index: Determines whether the field is searchable or not.
- Analyzer: Defines how the field’s text should be analyzed during indexing.
In Elasticsearch, an analyzer is a crucial component of the text analysis process. It plays a vital role in breaking down textual data into terms that can be efficiently indexed and searched. Analyzers are used during both indexing and querying phases to ensure that the text is processed consistently and meaningfully.
Here are some key components and concepts related to analyzers in Elasticsearch:
- Character Filters: These are used to preprocess the input text before it undergoes tokenization. Character filters can be used to remove HTML tags, replace specific characters, or perform other transformations.
- Tokenizer: This is responsible for breaking the input text into individual terms, or tokens. Elasticsearch provides various tokenizers, such as the standard tokenizer, whitespace tokenizer, and more. The choice of tokenizer depends on the specific requirements of the data.
- Token Filters: After tokenization, token filters are applied to modify or filter the tokens. Common use cases include stemming (reducing words to their root form), lowercase conversion, synonym expansion, and stopword removal.
Elasticsearch supports different types of analyzers, including:
- Standard Analyzer: This is the default analyzer. It uses the standard tokenizer and lowercase filter, with an optional stopword filter that is disabled by default.
- Simple Analyzer: It divides the text into tokens at any non-letter character and lowercases them, with no additional processing.
- Whitespace Analyzer: This analyzer breaks the text into terms whenever it encounters whitespace characters.
- Custom Analyzers: Users can create their own custom analyzers by combining different character filters, tokenizers, and token filters to meet specific requirements.
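To make the three-stage pipeline concrete, here is a toy re-implementation in Python. This is an illustration of the concept only, not Elasticsearch’s actual code; the HTML-stripping filter, regex tokenizer, and stopword list are simplified assumptions:

```python
import re

# Toy analyzer: character filter -> tokenizer -> token filters.
STOPWORDS = {"the", "a", "an", "and", "or", "of"}  # simplified stopword list

def char_filter(text: str) -> str:
    """Character filter: strip HTML tags before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text: str) -> list:
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens: list) -> list:
    """Token filters: lowercase, then remove stopwords."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text: str) -> list:
    return token_filters(tokenize(char_filter(text)))

print(analyze("<p>The Quick Brown Fox</p>"))  # -> ['quick', 'brown', 'fox']
```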
Field Properties:
- Store: Determines whether the actual field value should be stored or not.
- Doc Values: Used for efficient sorting and aggregations.
- Index Options: Configures the level of detail stored in the inverted index.
Elasticsearch Data Structure
Elasticsearch utilizes an inverted index data structure to enable fast full-text searches. The inverted index is a fundamental concept in information retrieval.
Here’s how it works:
- Tokenization: During the indexing process, Elasticsearch analyzes the text in documents and breaks it down into individual terms or tokens. These tokens are typically words, but they can also be things like numbers or other meaningful units of text.
- Inverted Index: For each unique term identified in the documents, Elasticsearch creates an inverted index. This index is like a lookup table that maps each term to a list of document IDs where that term appears. So, instead of scanning every document to find a term, Elasticsearch can simply look up the term in the inverted index to find the associated documents.
- Fast Search: When a user performs a search query, Elasticsearch refers to the inverted index to quickly identify the documents that contain the specified terms. This allows Elasticsearch to retrieve relevant documents much more efficiently than scanning through every document.
- Scoring: Elasticsearch also calculates a relevance score for each document based on factors like term frequency, field length, and other relevance algorithms. This helps rank the search results by relevance.
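The steps above can be sketched in a few lines of Python. This toy index maps each term to the set of document IDs that contain it; a real Lucene index stores far more (positions, frequencies, and the statistics used for scoring):

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is a search engine",
    2: "the inverted index enables fast search",
}

# Build the inverted index: term -> set of document IDs containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():  # trivial whitespace tokenizer
        inverted[term].add(doc_id)

# Query: look the term up directly instead of scanning every document.
print(sorted(inverted["search"]))  # -> [1, 2]
```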
While Elasticsearch uses an inverted index primarily for full-text search, it employs other data structures for different types of fields.
Here are some common field types and their associated data structures:
Keyword Fields: For fields that are not analyzed and treated as single terms, such as identifiers, tags, or exact matches, Elasticsearch stores the values in a sorted term dictionary (backed in Lucene by a finite-state transducer, a trie-like structure). This allows for efficient exact-match lookups without the need for tokenization or stemming.
Numeric Fields: Fields containing numerical data, such as integers or floating-point numbers, use specialized data structures like BKD trees (Block-KD trees) or other tree structures. These structures allow for efficient range queries and aggregations on numeric values.
Date Fields: Similar to numeric fields, date fields often use specialized tree structures like BKD trees to support efficient range queries and date-based aggregations.
Geo-Point Fields: For geographic coordinates, Elasticsearch uses spatial data structures (BKD trees in modern versions; earlier releases used geohash-based prefix trees). These structures enable efficient spatial queries and calculations, supporting tasks like proximity searches and distance calculations.
When to Use Different Mapping Types?
Text vs. Keyword:
- Use text for full-text search (e.g., article content).
- Use keyword for exact matching, sorting, and aggregations (e.g., tags, usernames).
Date and Numeric Types:
- Use date for date and time values.
- Choose the appropriate numeric type based on precision and range requirements.
Object and Nested Types:
- Use object for JSON objects that don’t require separate querying.
- Use nested when you need to maintain the relationship between nested objects.
Geo-point and IP Types:
- Use geo_point for geographical data.
- Use ip for IP addresses.
Analyzer Selection:
- Choose an appropriate analyzer based on the language and search requirements for text fields.
Indexing and Searching Considerations:
- Decide whether a field should be indexed or not based on search requirements.
- Configure store options based on whether you need to retrieve the original field value.
Sorting and Aggregations:
- Use doc_values for fields involved in sorting and aggregations.
Dynamic vs. Explicit Mapping:
- Decide whether to use dynamic mapping (automatic mapping based on data) or explicit mapping (predefined mapping).
Dynamic Mapping
Dynamic mapping in Elasticsearch refers to the automatic creation of field mappings based on the data that is indexed. When you index a document with new fields, Elasticsearch dynamically generates the mappings for those fields. This flexibility is particularly useful in scenarios where the structure of your data is evolving or when dealing with diverse datasets.
For example, if you index a document with a new field like “location” or “timestamp,” Elasticsearch will automatically create the appropriate mapping for these fields based on the data types and values encountered. This dynamic approach simplifies the initial setup, as you don’t have to define mappings explicitly for every field.
However, dynamic mapping also comes with challenges. It might not always capture your intended data types or configurations, and incorrect mappings can lead to undesired search results. To address this, explicit mapping can be employed to provide more control over the indexing process.
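Dynamic mapping’s type detection can be approximated in Python. The rules below are a simplified sketch of how Elasticsearch guesses a type from a JSON value, not the real implementation (actual detection is configurable and also covers numeric strings, custom date formats, and more):

```python
import datetime

def guess_field_type(value):
    """Simplified sketch of Elasticsearch's dynamic type detection."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.datetime.fromisoformat(value)  # crude date detection
            return "date"
        except ValueError:
            return "text"  # plain strings default to text
    return "unknown"
```

For example, indexing `{"timestamp": "2024-01-20T12:00:00"}` would yield a date field, while `{"quantity": 42}` would yield a long.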
Explicit Mapping
Explicit mapping involves defining the data types and configurations for fields explicitly before indexing data into Elasticsearch. By specifying the mapping in advance, you have more control over how Elasticsearch indexes your data. This can help in ensuring that the mappings align with your application’s requirements and prevent unintended data type conversions.
To use explicit mapping, you can create an index with a predefined mapping, specifying the field types, formats, and other configurations. This approach is beneficial when dealing with structured data or when you want to enforce specific data constraints. It provides a clear and predictable schema for your data, making it easier to maintain and query.
Explicit mapping is particularly useful when dealing with complex data models or when you want to optimize search performance by customizing the way fields are analyzed and indexed.
Defining Mapping:
You can define mapping when creating an index or update it later using the put mapping API. For example:
PUT <url_elasticsearch_server:port>/my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text"
    },
    "timestamp": {
      "type": "date"
    }
  }
}
This example defines mapping for the title field as text and the timestamp field as a date.
Bulk Indexing
Bulk indexing is a way to efficiently index multiple documents in a single request. This is more performant than indexing each document individually.
- Bulk API Format: The Bulk API request is a newline-delimited JSON format. Each action (index, delete, update) is specified in a single line. The request consists of metadata, such as the index and document ID, followed by the actual document data.
- Error Handling: The Bulk API allows you to process multiple actions in a single request, but it also returns detailed information about each action, including any errors encountered during the process.
- Optimizing Bulk Indexing: To optimize bulk indexing performance, consider tuning parameters such as the number of bulk actions in a single request, the size of each bulk request, and the refresh interval.
For example:
POST <url_elasticsearch_server:port>/my_index/_bulk
{"index":{"_id":"1"}}
{"title":"Document 1","timestamp":"2024-01-20T12:00:00"}
{"index":{"_id":"2"}}
{"title":"Document 2","timestamp":"2024-01-20T13:00:00"}
In this example, two documents are indexed into the my_index index.
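Because the bulk body is newline-delimited JSON rather than a single JSON document, it is easy to get the formatting wrong by hand. Below is a small helper (a sketch using only the standard library; `build_bulk_body` is a hypothetical name) that builds a bulk body from a list of documents:

```python
import json

def build_bulk_body(docs):
    """Build a newline-delimited bulk body: one action line, then one
    source line per document. The Bulk API also requires the body to
    end with a trailing newline."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("1", {"title": "Document 1", "timestamp": "2024-01-20T12:00:00"}),
    ("2", {"title": "Document 2", "timestamp": "2024-01-20T13:00:00"}),
])
print(body)
```

The resulting string can be sent as the request body of a POST to the _bulk endpoint with the Content-Type header set to application/x-ndjson.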
Managing Mappings Over Time
As your application evolves, so does your data. Elasticsearch provides mechanisms to manage mappings over time, allowing you to adapt to changing requirements without compromising on data consistency and integrity.
One approach is to use index templates, which enable you to define mappings and settings that are applied when creating new indices. This ensures consistency across multiple indices and simplifies the process of handling evolving data structures.
Additionally, index aliases can be employed to switch between different indices seamlessly. This is useful when you need to reindex data with updated mappings without affecting the search experience.
Regularly reviewing and updating mappings is essential to maintaining a healthy Elasticsearch setup. Tools like the Index Management API and the Index Lifecycle Management (ILM) feature can be utilized to automate the process of managing mappings, making it easier to handle data changes efficiently.
Conclusion:
Mastering the process of indexing data in Elasticsearch involves a comprehensive understanding of index creation, mapping definition, and bulk indexing strategies. Consideration of best practices, such as naming conventions, dynamic mapping, and optimization techniques, ensures a robust and efficient Elasticsearch setup. Regularly refer to Elasticsearch documentation for updates and advanced features as you continue to explore and implement indexing strategies in your projects.