Day 6 - Types: Characters and Booleans

16th Ironman Contest rust javascript

blueye

2024-09-20 02:53:09


Today continues with Rust's primitive data types, covering the remaining scalar types.

Characters

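Rust's char type is 4 bytes in size and represents a single Unicode scalar value, so one char can hold any valid Unicode character on its own, whether Chinese, Korean, Japanese, or even an emoji made of a single code point.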
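
As a quick check of the size claim, here is a minimal sketch. The helper name utf8_len is made up for this illustration; it only wraps the standard char::len_utf8 method. It shows that a char always takes 4 bytes in memory, while the same character takes 1 to 4 bytes once it is stored as UTF-8 inside a string.

```rust
use std::mem::size_of;

/// Hypothetical helper: bytes needed to store `c` as UTF-8 inside a string.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // A `char` always occupies 4 bytes in memory, whatever it holds.
    println!("size_of::<char>() = {}", size_of::<char>());

    // Inside a string, the same characters are stored as 1 to 4 bytes of UTF-8.
    for c in ['a', '中', '한', 'あ', '😀'] {
        println!("{} -> U+{:04X}, {} byte(s) in UTF-8", c, c as u32, utf8_len(c));
    }
}
```

The fixed 4-byte size is what lets a single char hold any scalar value; strings avoid that cost by using the variable-width UTF-8 encoding instead.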

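&str and String, by contrast, are the two most commonly used string types in Rust, and neither is a scalar type.
They differ in some important ways:

&str is a reference (a borrowed string slice) to string data that already exists; the data cannot be modified through it, and the slice it points to has a fixed length.
String is an owning type; it is mutable and can grow or shrink its contents dynamically.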
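
To make the difference concrete, here is a minimal sketch. The function greet is a made-up example, not something from the standard library. It borrows a &str and builds an owned String that grows in place.

```rust
/// Hypothetical example: builds an owned, growable `String` from a borrowed `&str`.
fn greet(name: &str) -> String {
    let mut message = String::from("Hello, "); // owned, heap-allocated buffer
    message.push_str(name); // grows in place
    message.push('!'); // a single `char` can be appended too
    message
}

fn main() {
    let borrowed: &str = "apple"; // points at data stored in the program binary
    let owned: String = greet(borrowed);
    println!("{}", owned);

    // A `String` can be lent out as `&str` without copying the data.
    let view: &str = &owned;
    println!("{} ({} bytes)", view, view.len());
}
```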

For now, a rough impression of strings is enough.

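Here is how Rust and JavaScript differ in handling characters and strings:
Rust: a char literal is written with single quotes and a string literal with double quotes. The two mean different things and cannot be used interchangeably!

An example that compiles and runs: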

fn main() {
    let c1: char = 'a'; // this is a character
    let c2: &str = "apple"; // this is a string, made up of characters
    println!("{}", c1);
    println!("{}", c2);
}

Replace 'a' with "a" and, because the variable is annotated as char, the compiler reports an error.

fn main() {
    let c1: char = "a";
    let c2: &str = "apple";
    println!("{}", c1);
    println!("{}", c2);
}

$ cargo run
error[E0308]: mismatched types
 --> src/main.rs:2:20
  |
2 |     let c1: char = "a";
  |             ----   ^^^ expected `char`, found `&str`
  |             |
  |             expected due to this
  |
help: if you meant to write a `char` literal, use single quotes
  |
2 |     let c1: char = 'a';
  |                    ~~~

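Swapping the quotes the other way, that is, writing the string in single quotes, is also a compile-time error.

JavaScript has no separate character type, so a single character is just a string. It accepts both single and double quotes; which one to use is only a matter of lint convention.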

const c1 = 'a';
const c2 = "a";
const s1 = 'apple';
const s2 = "apple";
console.log(c1);
console.log(c2);
console.log(s1);
console.log(s2);

Rust's char can hold any valid Unicode character on its own, including special characters and emoji.

fn main() {
    let greet = "Hi😀"; // 😀 lies outside the BMP
    for ch in greet.chars() {
        println!("Character: {}; code point: {}", ch, ch as u32);
    }
}

$ cargo run
Character: H; code point: 72
Character: i; code point: 105
Character: 😀; code point: 128512

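JavaScript strings use UTF-16 encoding. A character inside the BMP (Basic Multilingual Plane) is represented by one 16-bit code unit, while a character outside the BMP needs two 16-bit code units, known as a surrogate pair. Characters outside the BMP therefore need care: their length and the way they display when indexed differ from BMP characters.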

const str = "Hi😀";

for (let i = 0; i < str.length; i += 1) {
    console.log(`Character: ${str[i]}, code unit: ${str.charCodeAt(i)}`);
}

$ node index.js
Character: H, code unit: 72
Character: i, code unit: 105
Character: �, code unit: 55357
Character: �, code unit: 56832

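Here we can see that 😀 is made up of two 16-bit code units, and neither can be displayed on its own. Printing the two units together shows the complete character, while str.length still counts 4: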

const str = "Hi😀";
console.log(`Character: ${str[0]}, code unit: ${str.charCodeAt(0)}`);
console.log(`Character: ${str[1]}, code unit: ${str.charCodeAt(1)}`);
console.log(`Character: ${str[2]}${str[3]}, code units: ${str.charCodeAt(2)}, ${str.charCodeAt(3)}`);
console.log(`Length of str: ${str.length}`)

$ node index.js
Character: H, code unit: 72
Character: i, code unit: 105
Character: 😀, code units: 55357, 56832
Length of str: 4
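
For comparison, the same string can be measured from the Rust side. This is a minimal sketch in which the helper lengths is a made-up name. It counts UTF-8 bytes, UTF-16 code units, and chars, and prints the code units so they can be compared with the charCodeAt() values above.

```rust
/// Hypothetical helper: (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    let s = "Hi😀";
    let (bytes, utf16_units, chars) = lengths(s);
    println!("bytes: {}, UTF-16 units: {}, chars: {}", bytes, utf16_units, chars);

    // The same surrogate pair that JavaScript exposes through charCodeAt().
    let units: Vec<u16> = s.encode_utf16().collect();
    println!("{:?}", units);
}
```

Note that str::len() in Rust counts UTF-8 bytes, so it differs from both the JavaScript length and the number of chars.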

To get whole characters, use Array.from() or for...of, which iterate by code point rather than by code unit, unlike the index-based loop above.

const str = "Hi😀";

Array.from(str).forEach((char, index) => {
    console.log(`Character ${index}: ${char}`);
})

$ node index.js
Character 0: H
Character 1: i
Character 2: 😀

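Supplement:
Unicode scalar values are all valid Unicode code points except the surrogate code points. Surrogate code points are used in UTF-16 to encode characters outside the BMP, but on their own they are not valid characters.
Unicode code points range from 0x0000 to 0x10FFFF (inclusive), and the surrogate range is 0xD800 to 0xDFFF.
Every char in Rust must be a valid Unicode scalar value, so a char cannot be constructed from a surrogate code point or from a value beyond the valid range. Safe conversions such as char::from_u32 return None for such values; forcing one through unsafe code such as char::from_u32_unchecked is undefined behavior.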
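
The rule can be observed with the safe conversion char::from_u32 from the standard library. In this minimal sketch the wrapper to_char is a made-up name that exists only to keep the example short.

```rust
/// Hypothetical wrapper: `None` when `code` is not a Unicode scalar value.
fn to_char(code: u32) -> Option<char> {
    char::from_u32(code)
}

fn main() {
    // One valid emoji, both ends of the surrogate range,
    // the last valid code point, and the first value past it.
    for code in [0x1F600, 0xD800, 0xDFFF, 0x10FFFF, 0x110000] {
        match to_char(code) {
            Some(c) => println!("U+{:04X} -> {:?}", code, c),
            None => println!("U+{:04X} is not a valid char", code),
        }
    }
}
```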

Booleans (bool)

A boolean is one byte in size and, as in most programming languages, has exactly two values, true and false. In Rust the boolean type is written bool.

fn main() {
    let t = true;
    let f: bool = false; // with an explicit type annotation
}
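
The one-byte size can be checked with std::mem::size_of. This minimal sketch also shows that turning a bool into a number has to be written out with an explicit cast; the helper as_number is a made-up name.

```rust
use std::mem::size_of;

/// Hypothetical helper: explicit bool-to-integer conversion via `as`.
fn as_number(b: bool) -> u8 {
    b as u8
}

fn main() {
    println!("size_of::<bool>() = {}", size_of::<bool>());
    println!(
        "true as u8 = {}, false as u8 = {}",
        as_number(true),
        as_number(false)
    );
}
```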

Booleans are mostly used in if expressions, which work much the same as in other languages.

fn main() {
    let result = true;
    if result {
        println!("result is true");
    } else {
        println!("result is not true");
    }
}

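If you come from JavaScript, one notable difference is that the condition of a Rust if expression must be of type bool. In JavaScript, code like the following is common because the language coerces values to booleans, which is convenient but sometimes easy to misjudge.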

const types = [0, 1, 'str', true, {}, [], undefined];

types.forEach((e) => {
    if (e) {
        console.log(e, ' means true');
    } else {
        console.log(e, ' means false');
    }
});

$ node index.js
0  means false
1  means true
str  means true
true  means true
{}  means true
[]  means true
undefined  means false

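In Rust, anything that is not a bool is rejected at compile time; the compiler will not convert types for you. For example, if result in the earlier program held an integer instead of true, the build would fail: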

$ cargo run
error[E0308]: mismatched types
 --> src/main.rs:3:8
  |
3 |     if result {
  |        ^^^^^^ expected `bool`, found integer

For more information about this error, try `rustc --explain E0308`.
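
One way to get a similar effect to the JavaScript checks is to write the comparison explicitly. This is a minimal sketch rather than a fixed convention: the helpers is_nonzero and has_text are made-up names, and treating None as a rough counterpart of undefined is an assumption made only for this comparison.

```rust
/// Hypothetical helper: numbers are compared against zero explicitly.
fn is_nonzero(n: i32) -> bool {
    n != 0
}

/// Hypothetical helper: strings are checked for emptiness explicitly.
fn has_text(s: &str) -> bool {
    !s.is_empty()
}

fn main() {
    let count = 0;
    let name = "str";
    let missing: Option<i32> = None; // loosely comparable to `undefined`

    if is_nonzero(count) {
        println!("{} means true", count);
    } else {
        println!("{} means false", count);
    }

    if has_text(name) {
        println!("{} means true", name);
    }

    // `is_none()` already returns a bool, so it can be used directly.
    if missing.is_none() {
        println!("None means false");
    }
}
```

Because every condition states what is being tested, there is no table of truthy and falsy values to memorize.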