2024 Ironman Contest Day 7: Text Analyzer II

2024 iThome Ironman Contest

Self-Challenge Group

Restarting Elasticsearch series, part 6

16th Ironman Contest

kimcheng

2024-09-22 20:21:09

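* Typing "elasticsearch" over and over is a lot of keystrokes, and since writing does not come easily to me, I will abbreviate it as ES from here on.

The previous post mentioned that when the analyzers ES ships with do not fit your needs, you can build your own from the parts that make up an analyzer. This post shows how to do that.

Before starting, let me introduce a handy tool: the analyze API (the _analyze endpoint).

This API lets you check whether an analyzer really behaves the way you expect. It is also useful when you want a more concrete picture of what a built-in analyzer does.

Taking the built-in whitespace analyzer as an example, we can see what a piece of text looks like after passing through the analyzer: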

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "a Custom Elasticsearch anAlyzer."
}

You will get:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "Custom",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "Elasticsearch",
      "start_offset": 9,
      "end_offset": 22,
      "type": "word",
      "position": 2
    },
    {
      "token": "anAlyzer.",
      "start_offset": 23,
      "end_offset": 32,
      "type": "word",
      "position": 3
    }
  ]
}

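The contents of tokens are the tokens that ES would put into the index after running the text through the whitespace analyzer. As you can see, the whitespace analyzer only splits on whitespace and does no other processing: case is preserved, and the trailing period stays attached to anAlyzer.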
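To make the fields in that response more concrete, here is a minimal Python sketch that imitates whitespace tokenization locally. It is not how ES implements the analyzer; the function name whitespace_tokens is made up for illustration, and the sketch only assumes that a token is any run of non-whitespace characters, with start_offset and end_offset as character positions in the original text.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace and record offsets and positions.

    Hypothetical helper that only imitates the shape of the _analyze
    response; it does not call Elasticsearch.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),         # text is kept exactly as typed
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "word",
            "position": position,           # ordinal of the token in the stream
        })
    return tokens


for tok in whitespace_tokens("a Custom Elasticsearch anAlyzer."):
    print(tok)
```

Comparing the printed dictionaries with the response above shows that end_offset is exclusive and that the period counts as part of the last token.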

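This analyzer is clearly not very satisfying. First, I want everything converted to lowercase; second, indexing the word a seems pointless. After looking through the official documentation, I found that the lowercase tokenizer combined with the stop token filter does what I have in mind, so let's test that first: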

POST _analyze
{
  "tokenizer": "lowercase",
  "filter": ["stop"],
  "text":     "a Custom Elasticsearch anAlyzer."
}

Result:

{
  "tokens": [
    {
      "token": "custom",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "elasticsearch",
      "start_offset": 9,
      "end_offset": 22,
      "type": "word",
      "position": 2
    },
    {
      "token": "analyzer",
      "start_offset": 23,
      "end_offset": 31,
      "type": "word",
      "position": 3
    }
  ]
}
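Two details in this result are easy to miss: the period is gone from analyzer, and the first token has position 1 rather than 0. The Python sketch below imitates the two steps locally to show where both come from. It is an approximation under stated assumptions, not ES's implementation: the helper names are made up, the tokenizer is modeled as "split on anything that is not a letter, then lowercase", and the stop word set is the default English list as I recall it.

```python
import re

# Assumption: the stop filter's default English list, written from memory.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def lowercase_tokenizer(text):
    """Imitate the lowercase tokenizer: letter runs only, lowercased."""
    tokens = []
    # [^\W\d_]+ matches runs of letters (word characters minus digits and "_")
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


def stop_filter(tokens, stop_words=ENGLISH_STOP_WORDS):
    """Drop stop words; surviving tokens keep their original positions."""
    return [tok for tok in tokens if tok["token"] not in stop_words]


for tok in stop_filter(lowercase_tokenizer("a Custom Elasticsearch anAlyzer.")):
    print(tok)
```

Because the filter runs after the tokenizer has already assigned positions, removing a leaves a gap at position 0 instead of renumbering the remaining tokens.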

The lowercase tokenizer plus the stop filter gives the result I want, so let's write the analyzer into the index settings:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "demo_analyzer": { 
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [
            "stop"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "demo_analyzer" 
      }
    }
  }
}

Can an analyzer that has already been written into an index be tested with the analyze API?

Yes. Just add the index name to the path of the analyze API, like this:

GET my-index-000001/_analyze 
{
  "analyzer": "demo_analyzer", 
  "text":     "a Custom Elasticsearch anAlyzer."
}
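Outside of the Kibana console, the same call is an ordinary HTTP request. The sketch below builds it with Python's standard library so the path structure is visible; it assumes a cluster reachable at http://localhost:9200 without authentication, and the helper name build_analyze_request is made up for illustration. It uses POST, which the analyze API accepts just like GET.

```python
import json
import urllib.request


def build_analyze_request(index, analyzer, text,
                          base_url="http://localhost:9200"):
    """Build (but do not send) an index-scoped _analyze request.

    base_url is an assumption; adjust it and add authentication as needed.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",  # index name goes before _analyze
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request(
    "my-index-000001", "demo_analyzer", "a Custom Elasticsearch anAlyzer."
)
print(req.get_method(), req.full_url)

# Sending it requires a running cluster, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```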

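Most of the built-in analyzer components in ES have parameters you can adjust. For example, the stop token filter, which removes stopwords, lets you specify which words count as stopwords and should not be indexed. Adjusting a component this way creates a customized analyzer component.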
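As a rough illustration of such a customized component, the Python sketch below builds an index-creation body in which the stop filter is given its own stopwords list and the analyzer refers to that filter by name. The filter name demo_stop, the helper name, and the example word list are hypothetical; the type and stopwords keys are the stop filter's documented settings.

```python
import json


def index_body_with_custom_stop(stopwords):
    """Return a create-index body whose analyzer uses a customized stop filter.

    "demo_stop" is a made-up name for the customized component.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # a stop filter configured with our own word list
                    "demo_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "demo_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        # reference the customized filter instead of plain "stop"
                        "filter": ["demo_stop"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "my_text": {"type": "text", "analyzer": "demo_analyzer"}
            }
        },
    }


body = index_body_with_custom_stop(["a", "custom"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a PUT request for a new index, in the same way as the earlier index-creation example.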

The next post will cover the two kinds of analyzer: the index analyzer and the search analyzer.
