StellarSQL 12: Lexical Scanner Case Study

2019鐵人賽 stellarsql dbmd

微中子

團隊微中子的小宇宙

2018-10-27 23:56:26

1171 瀏覽

分享至

12: Lexical Scanner Case Study

2018/10/28

It's recommemded to read on Gitbook

I found it becomes more and more difficult to programing some and record the process on this series. The reason is that it costs huge time for studying and thinking about the next steps. I expected I could write some code every day, but it seems that I would take two or three days for studying on a topic and then implementing functions.

So, let's talked about my studying today. I spend some time studying the implementation of TiDB that I talked about yesterday. I have some reflection on its methods.

It uses "trie" to identify tokens. Trie is a data structure. It is good at searching words in a dictionary. So, in this case, TiDB searches tokens in a symbols dictionary.

I am thinking why it uses trie, but I still don't understand why. As far as I am concerned, I would rather use hash. The searching complexity of trie and hash are both O(N). In this case, we only need to look up tokens, so no need to considering inserting and deleting data. In the condition that with a large amount of words, trie would use less space. However, in our case, symbols are only about a hundred, and adopting hash is much easy and straightforward for me.

No matter MySQL or TiDB, they both use Yacc, and their scanners are just the interface for Yacc. I would not use Yacc, on one hand it is written in C++ (I want StellarSQL is pure Rust. Though in fact, Yacc is run to create codes before the runtime of the program), and the other hand I would like to doing any parts by myself (less third party modules as possible), so the term, "from scratch", would make sense.

Further, considering the performance, many DBMS implement their own bytes or strings handler for the scanner. I have not designed the protocol of the message yet, and I am just thinking about converting the string to bytes as protocol to make it simple. Anyway, handling bytes by well designed protocol would optimize the performance, but I would just process strings and using the Rust standard library for now. Maybe I will rewrite and refactor this part in the future, but now it's fine to "keep it simple, keep it stupid".

Quick Link
1.StellarSQL Repository
2.Gitbook of this series

Author
Liu, An-Chi (劉安齊). A software engineer, who loves writing code and promoting CS to people. Welcome to follow me at Facebook Page. More information on Personal Site and Github.