[Day 15] Exploring Elasticsearch Step by Step - Mapping parameters

2023 iThome Ironman Contest

DAY 15

Software Development

Exploring Elasticsearch Step by Step, from Basic Syntax to the Underlying Principles: part 15 of the series


15th Ironman Contest #elasticsearch #mapping


2023-09-17 12:29:59


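Today I'll introduce the parameters you can set in a mapping, including a few pitfalls to watch out for.
The first half covers the parameters themselves and how to set them.
The second half explains how to optimize a mapping.
Let's get started.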

properties:
Used when a field, such as an object or nested field, contains sub-fields.

"field_name": {
        "type": "nested",
        "properties": { 
          "age":  { "type": "integer" },
          "name": { "type": "text"  }
        }
      }

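When a query or aggregation needs an inner field, reference it with dot notation. Note that for a field of type nested, as in the mapping above, the match clause must additionally be wrapped in a nested query whose path is the field name; for a plain object field the query below works as is.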

"query": {
    "match": {
      "field_name.inner_field_name": ""
    }
}
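
To make the nested case concrete, here is a minimal sketch that builds such a search body as a plain Python dict. The helper name nested_match_query and the sample value "alice" are made up for illustration; only the shape of the nested query follows the Elasticsearch query DSL.

```python
import json


def nested_match_query(path, inner_field, text):
    """Build a search body that matches `text` in `path.inner_field` of a nested field."""
    return {
        "query": {
            "nested": {
                "path": path,  # name of the nested field
                "query": {
                    # the inner field is still addressed with dot notation
                    "match": {f"{path}.{inner_field}": text}
                },
            }
        }
    }


# Hypothetical usage: search the "name" sub-field of "field_name".
body = nested_match_query("field_name", "name", "alice")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /<index>/_search request.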

format:
In a document, date-related fields usually arrive as strings.
Setting format tells ES how to parse those strings into the date type.

{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

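Date fields support date math in queries; for example, now+1h means one hour from now.
The official documentation lists many built-in formats that can be used directly. Most of them also have a strict variant: prefix the format name (not the field name) with strict_, and ES applies stricter rules when deciding whether a value matches the format, for example requiring exactly 4, 2, and 2 digits for the year, month, and day.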

PUT /test_date
{
  "mappings": {
    "properties": { 
      "my_date_field": {
        "type": "date",
        "format": "strict_date"
      }
    }
  }
}

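To accept several date formats on one field, separate them with || inside a single string. (When I tried passing an array to format, I always got a mapper_parsing_exception. The array examples in the official documentation belong to the dynamic_date_formats setting, not to the field-level format parameter, so this is something to watch out for.)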

"format": "yyyy/MM||dd/MM/yyyy"
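
The "try each format in turn" behaviour can be sketched in plain Python. This is only an illustration, not how ES is implemented: ES uses Java date patterns, and the PATTERNS table below is my own assumed translation of the two patterns above into strptime directives.

```python
from datetime import datetime

# Assumed strptime equivalents of the two Java-style patterns used above.
PATTERNS = {
    "yyyy/MM": "%Y/%m",
    "dd/MM/yyyy": "%d/%m/%Y",
}


def parse_date(value, fmt):
    """Try every pattern in a '||'-separated format string, in order."""
    for pattern in fmt.split("||"):
        try:
            return datetime.strptime(value, PATTERNS[pattern])
        except ValueError:
            continue  # this pattern did not match, try the next one
    raise ValueError(f"{value!r} matches none of: {fmt}")


fmt = "yyyy/MM||dd/MM/yyyy"
first = parse_date("2023/09", fmt)      # matched by yyyy/MM
second = parse_date("17/09/2023", fmt)  # matched by dd/MM/yyyy
print(first, second)
```

A value such as "2023-09-17" matches neither pattern and is rejected, which mirrors the parsing error ES returns for a document whose date does not fit any configured format.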

coerce:
Before the explanation, let's run a quick test.

Create a document and let ES dynamically map the id field,
then look at the mapping.

PUT /test_coerce/_doc/1
{
  "id": 1
}

GET /test_coerce/_mapping

The type is long. Now let's index another document.

PUT /test_coerce/_doc/2
{
  "id": "2"
}

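It worked, even though we sent a string.
This is coerce: once the field is mapped as long,
ES still converts dirty data like this in incoming documents to the type the mapping expects.

If you look at _source, though, the value is still a string, because _source keeps the original data.
Also note that this conversion does not influence how the mapping is inferred when the field is first created.
So if the very first document I store is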

PUT /test_coerce/_doc/1
{
  "id": "1"
}

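then the id field is dynamically mapped as text (with a keyword sub-field) instead of long.

To turn coercion off (coerce applies to numeric fields):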

PUT /test_coerce
{
  "mappings": {
    "properties": {
      "id": {
        "type": "long",
        "coerce": false
      }
    }
  }
}
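
As a rough mental model of what the switch does, here is a toy Python function. It is a simplification I wrote for illustration, not ES code: it only covers integers and integer-looking strings, and the name coerce_long is hypothetical.

```python
def coerce_long(value, coerce=True):
    """Return the integer a long field would index for `value`, or raise ValueError.

    Toy model: handles only ints and integer-looking strings such as "2".
    """
    if isinstance(value, bool):
        raise ValueError("a boolean is not a number")
    if isinstance(value, int):
        return value  # already the right type, nothing to coerce
    if isinstance(value, str):
        if not coerce:
            raise ValueError(f"coerce is disabled, rejecting string {value!r}")
        return int(value)  # "2" -> 2
    raise ValueError(f"unsupported value {value!r}")


print(coerce_long(1))    # clean data is indexed as is
print(coerce_long("2"))  # dirty data is converted while coerce is on
```

With coerce=False the second call raises instead of converting, which corresponds to ES rejecting the document.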

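index:

Indexing a field makes searching it fast, so a field that never needs to be queried can be set to index: false.
Fields of type numeric, date, boolean, ip, geo_point, or keyword can still be queried with index: false as long as doc_values are kept, although such queries are slower. Other field types cannot be queried once index is false.
In short, index: false makes a field unqueryable in most cases, but the field can still be used in aggregations.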

PUT /test_index
{
  "mappings": {
    "properties": {
      "test_score": {
        "type": "integer",
        "index": false
      }
    }
  }
}

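index_options:

Only text and keyword fields support this setting, which controls what information is stored in the inverted index.
There are four levels, in order: docs, freqs, positions (the default for text fields), and offsets.
Each level includes everything before it. For example, positions also includes docs and freqs.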

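Parameter	Description
docs	Stores only the doc number, i.e. whether the term appears in the field of a given document
freqs	In addition to the above, stores how often the term appears. Higher frequency means a higher relevance score
positions	In addition to the above, stores the position (order) of each term. Useful for proximity and phrase queries (covered later)
offsets	In addition to the above, stores the start and end character offsets of each term, which helps highlighting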

Below is how to set it, together with a demonstration of offsets:

PUT /test_offset
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "index_options": "offsets" // set index_options
      }
    }
  }
}

// Index a document
PUT /test_offset/_doc/1
{
  "text": "test offset function"
}

// Search for the document
GET /test_offset/_search
{
  "query": {
    "match": {
      "text": "test"
    }
  },
  "highlight": { 
    "fields": {
      "text": {} 
    }
  }
}

In the response, the term we searched for is highlighted.
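
To get a feel for what each index_options level adds, here is a small Python sketch of a postings structure. It is a toy model of the idea, not Lucene's actual on-disk format, and the function name build_postings is made up.

```python
import re

LEVELS = ["docs", "freqs", "positions", "offsets"]


def build_postings(docs, index_options="positions"):
    """Map term -> {doc_id: info}, where info holds only what the level keeps."""
    depth = LEVELS.index(index_options)
    postings = {}
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            entry = postings.setdefault(match.group(), {}).setdefault(doc_id, {})
            if depth >= 1:  # freqs: how many times the term occurs
                entry["freq"] = entry.get("freq", 0) + 1
            if depth >= 2:  # positions: which token it is (0-based)
                entry.setdefault("positions", []).append(position)
            if depth >= 3:  # offsets: character span in the original text
                entry.setdefault("offsets", []).append((match.start(), match.end()))
    return postings


docs = {1: "test offset function"}
docs_only = build_postings(docs, "docs")
with_offsets = build_postings(docs, "offsets")
print(docs_only["offset"])     # {1: {}}
print(with_offsets["offset"])  # {1: {'freq': 1, 'positions': [1], 'offsets': [(5, 11)]}}
```

With the character span (5, 11) at hand, a highlighter can wrap docs[1][5:11] directly instead of analyzing the text again, which is why offsets help highlighting.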

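doc_values:

doc_values is an on-disk data structure different from the inverted index. A normal search goes from a term to the documents that contain it; doc_values go the other way, from a document to the value its field holds.
It is a column-oriented layout, which suits sorting and aggregations better. For example, everyone's math score is stored together in one column, rather than storing one student's Chinese, English, and math scores together (row-oriented).
As mentioned in the index section, for certain field types you can disable index and keep only doc_values, trading some query speed for less storage.
When index is disabled on such a field, it becomes a doc-value-only field by default, because doc_values stay enabled.
Set doc_values to false if you do not want doc values stored either.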

PUT /test_doc_value
{
  "mappings": {
    "properties": {
      "score": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}
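
The difference between the two structures can be sketched in a few lines of Python. The sample scores are invented, and the dict and list here only stand in for the real on-disk structures.

```python
from collections import defaultdict

# Hypothetical documents: doc id -> math score.
scores = {1: 90, 2: 75, 3: 90}

# Inverted index: value -> doc ids. Answers "which documents hold 90?" directly.
inverted = defaultdict(list)
for doc_id, score in scores.items():
    inverted[score].append(doc_id)

# Doc values: one column ordered by doc id. Answers "what does doc 2 hold?" directly.
doc_ids = sorted(scores)
column = [scores[d] for d in doc_ids]


def search_with_index(value):
    """Single lookup in the inverted index."""
    return inverted.get(value, [])


def search_with_doc_values(value):
    """Without an index, every row of the column has to be checked."""
    return [d for d, v in zip(doc_ids, column) if v == value]


def average():
    """Aggregations only need to walk the column."""
    return sum(column) / len(column)


print(search_with_index(90), search_with_doc_values(90), average())
```

Both searches return the same documents, but the doc-values version scans the whole column, which is the query-speed cost of a doc-value-only field. The aggregation never touches the inverted index at all.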

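norms:

Norms store normalization factors, such as field length, that are used at query time to compute relevance scores.
They also take up a fair amount of disk space (typically on the order of one byte per document per field), so if you do not need relevance scoring on a field you can disable them.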

"norms": false

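null_value:

The null_value parameter replaces an explicit null with a value you specify so that it can be searched, because a null value normally cannot be indexed or searched.
An empty array contains no explicit null, so a field set to [] is not replaced and still cannot be found.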

PUT /test_null
{
  "mappings": {
    "properties": {
      "status_code": {
        "type":       "keyword",
        "null_value": "NULL" 
      }
    }
  }
}
// Query
GET /test_null/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}
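
The replacement rule can be summed up with a small Python sketch. It is a toy model with a made-up function name, where None plays the role of a JSON null.

```python
def indexed_values(field_value, null_value=None):
    """Return the values that end up searchable for one field of one document."""
    values = field_value if isinstance(field_value, list) else [field_value]
    result = []
    for value in values:
        if value is None:
            # an explicit null is replaced only when null_value is configured
            if null_value is not None:
                result.append(null_value)
        else:
            result.append(value)
    return result


print(indexed_values(None, "NULL"))    # explicit null -> ['NULL']
print(indexed_values([None], "NULL"))  # array holding an explicit null -> ['NULL']
print(indexed_values([], "NULL"))      # empty array, nothing to replace -> []
```

Only the first two documents would be returned by the term query on "NULL" above.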

Next, a summary of recommended mapping settings: