[Day 30] LeetCode 75 - 692. Top K Frequent Words 以及完賽感言

14th鐵人賽 c leetcode

eric1864

2022-10-12 11:48:47

1423 瀏覽

分享至

LeetCode 75 Level 1 - Day 15 Heap

題目連結
難易度：Medium

692. Top K Frequent Words

題目敘述

Given an array of strings words and an integer k, return the k most frequent strings.
Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

預設函數

/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
char ** topKFrequent(char ** words, int wordsSize, int k, int* returnSize){

}

題目限制

1 <= words.length <= 500
1 <= words[i].length <= 10
words[i] consists of lowercase English letters.
k is in the range [1, The number of unique words[i]]

解題過程及程式碼

本題輸入給一個二維陣列，內容由一個一個字串組成，裡面有許多字串是完全相同並重複出現的，題目再給了一個整數k，請輸出k個最高頻出現的字串 (也是需要回傳二維陣列)，如果字串出現次數相同則以 lexicographical order (字典順)來排序。

首先我們先解決 Lexicographical Order 這件事，依維基百科的說明可以寫出程式碼：「在英文字典中，排列單詞的順序是先按照第一個字母以升序排列（即a、b、c……z 的順序）；如果第一個字母一樣，那麼比較第二個、第三個乃至後面的字母。如果比到最後兩個單詞不一樣長（比如，sigh 和 sight），那麼把短者排在前。」

Lexicographic order From Wikipedia
Lexicographical Order程式碼

/* 回傳 1: 字母依序比較其中a[i] > b[i]，或是依序比較後每個字母相同但是a比較長 */
/* 回傳 0: 依序比較每個字母都相同，而且a b的字串長度相同 */
/* 回傳 -1: 字母依序比較其中a[i] < b[i]，或是依序比較後每個字母相同但是a比較短 */
int lexicographicOrder(char *a, char *b) {
    int i = 0;

    while ((a[i] != '\0') && (b[i] != '\0')) {
        if (a[i] > b[i]) {
            return 1;
        } else if (a[i] < b[i]){
            return -1;
        }
        i++;
    }

    if ((a[i] == '\0') && (b[i] == '\0')) {
        return 0;
    } else if (a[i] == '\0') {
        return -1;
    } else {
        return 1;
    }
}

參考網路上的做法，用Binary Heap (二元堆積)以實現Max-Priority Queue，priority的判斷首先是字串出現的頻率，再來是字串的字典順序。

[LeetCode] Top K Frequent Words 前K个高频词

二元堆積是Complete Binary Tree (完全二元樹)，依維基百科的說明：「在一顆二元樹中，若除最後一層外的其餘層都是滿的，並且最後一層要麼是滿的，要麼在右邊缺少連續若干節點，則此二元樹為完全二元樹（Complete Binary Tree）」。這個特性使得我們可以用一個陣列來存放並操作這個二元樹，如下圖 (來源：大话数据结构)是一個完全二元樹，元素數量是n (n = 9)，我們可以將它的元素放到陣列裡，會滿足以下條件：

節點i = 1是樹的root (根節點)，i > 1時每個節點的父節點是i/2。
節點i的左子階是2i (若2i < n，2i > n時i沒有左子階)。
節點i的右子階是2i+1 (若2i+1 < n，2i+1 > n時i沒有右子階)。

至於Max Heap由維基百科的原文說明：「若母節點的值恆大於等於子節點的值，此堆積稱為最大堆積（max heap）」。

接著我們來看元素放入Max-Priority-Queue的過程，首先第一個元素沒有比較對象，直接丟到i = 1位置，接下來的元素放進來前都要跟它預期位置的父節點 (i / 2)做比較

若新元素較大則新元素要放在i / 2的位置，原來i / 2位置的元素要放到i = 2的位置。
若新元素較小則直接放在i = 2位置。
依規則我們可以實作出放入Heap的push程式，其中的heapOrder函數在本題裡我們同時考慮了字串出現的頻率和字典順序。

/* insert item into a max heap of current size *n */
void push(Sqlist *L, element item) {
    int i;
    if (HEAP_FULL(L->length)) {
        return;
    }
    i = ++(L->length);
    while ((i != 1) && (heapOrder(&item, &(L->r[i / 2])) > 0)) {
        L->r[i] = L->r[i / 2];
        i /= 2;
    }
    L->r[i] = item;
}

在元素都放入Heap後接下來就要取出priority最高的元素，這裡我們將root與最後一個元素交換，將元素數量減1，再將root依規則重新排列，如下圖

整理後我們可以實作出取出Heap的pop程式，其中的heapAdjust函數依priority來調整Heap的順序

element pop(Sqlist *L) {
    element item = L->r[1];
    swap(L, 1, L->length);
    heapAdjust(L, 1, --(L->length));
    return item;
}

有了push和pop程式我們就可以將字串的index放入Heap裡，priority同時考慮出現頻率和字典順序，最後再取出priority最大的root節點。

程式碼上傳

#define MAXSIZE 100
#define MAX_STR_SIZE 20
#define HEAP_FULL(n) (n == MAXSIZE - 1)

typedef struct {
    char data[MAX_STR_SIZE];
    int freq;
    /* other fields */
} element;

typedef struct {
    element r[MAXSIZE+1];
    int length;
} Sqlist;

element pop(Sqlist *);
void push(Sqlist *, element);
void heapAdjust(Sqlist *, int, int);
int heapOrder(element *, element *);
int lexicographicOrder(char *, char *);

char ** topKFrequent(char ** words, int wordsSize, int k, int* returnSize){
    int counter;
    Sqlist lst1;
    lst1.length = 0;

    int *dynArr = (int *)calloc(wordsSize, sizeof(int));  /* 動態記憶體配置初始化為0 */
    char **dynArr2D = (char **)malloc(sizeof(char *) * k);  /* 動態記憶體配置 */
    for (int i=0; i<k; i++) {
        dynArr2D[i] = (char *)malloc(sizeof(char) * MAX_STR_SIZE);  /* 動態記憶體配置 */
        memset(dynArr2D[i], '\0', 1 * sizeof(char));  /* 記憶體區塊設定 -> 用來初始化為空字串 */
    }

    /* 找出字串重複的次數並記錄 */
    for (int i=0; i<wordsSize; i++) {
        if (dynArr[i] != -1) {
            counter = 1;
        } else {
            continue;
        }

        for (int j=i+1; j<wordsSize; j++) {
            if (strcmp(words[i], words[j]) == 0) {
                counter++;
                dynArr[j] = -1;
            }
        }
        dynArr[i] = counter;
    }

    /* 將字串和出現頻率一起丟到Heap裡，頻率相同的元素依字典順序排序 */
    for (int i=0; i<wordsSize; i++) {
        if (dynArr[i] > 0) {
            element item;
            strcpy(item.data, words[i]);
            item.freq = dynArr[i];
            push(&lst1, item);
        }
    }

    /* 取出題目要求的前幾名高頻字串 */
    for (int i=0; i<k; i++) {
        strcpy(dynArr2D[i], pop(&lst1).data);
    }
    *returnSize = k;
    return dynArr2D;
}

/* 元素內容交換 */
void swap(Sqlist *L, int i, int j) {
    element temp = L->r[i];
    L->r[i] = L->r[j];
    L->r[j] = temp;
}

/* delete element with the highest key from the heap */
element pop(Sqlist *L) {
    element item = L->r[1];
    swap(L, 1, L->length);
    heapAdjust(L, 1, --(L->length));
    return item;
}

/* Heap調整方式 */
void heapAdjust(Sqlist *L, int s, int m) {
    element temp;
    int j;
    temp = L->r[s];
    for (int j=2*s; j<=m; j*=2) {
        if ((j < m) && (heapOrder(&(L->r[j]), &(L->r[j+1])) < 0)) {
            j++;
        }
        if (heapOrder(&temp, &(L->r[j])) >= 0) {
            break;
        }
        L->r[s] = L->r[j];
        s = j;
    }
    L->r[s] = temp;
}

/* insert item into a max heap of current size *n */
void push(Sqlist *L, element item) {
    int i;
    if (HEAP_FULL(L->length)) {
        return;
    }
    i = ++(L->length);
    while ((i != 1) && (heapOrder(&item, &(L->r[i / 2])) > 0)) {
        L->r[i] = L->r[i / 2];
        i /= 2;
    }
    L->r[i] = item;
}

/* 回傳 1: a > b，a的freq 大於 b的freq或是在 a的freq 等於 b的freq情況下，a的字典順序較靠前*/
/* 回傳 0: a = b */
/* 回傳 -1: a = b */
int heapOrder(element *a, element *b) {
    if (a->freq > b->freq) {
        return 1;
    } else if (a->freq < b->freq) {
        return -1;
    } else {
        if (lexicographicOrder(a->data, b->data) < 0) {
            return 1;
        } else if (lexicographicOrder(a->data, b->data) > 0) {
            return -1;
        } else {
            return 0;
        }
    }
}

/* 回傳 1: 字母依序比較其中a[i] > b[i]，或是依序比較後每個字母相同但是a比較長 */
/* 回傳 0: 依序比較每個字母都相同，而且a b的字串長度相同 */
/* 回傳 -1: 字母依序比較其中a[i] < b[i]，或是依序比較後每個字母相同但是a比較短 */
int lexicographicOrder(char *a, char *b) {
    int i = 0;

    while ((a[i] != '\0') && (b[i] != '\0')) {
        if (a[i] > b[i]) {
            return 1;
        } else if (a[i] < b[i]){
            return -1;
        }
        i++;
    }

    if ((a[i] == '\0') && (b[i] == '\0')) {
        return 0;
    } else if (a[i] == '\0') {
        return -1;
    } else {
        return 1;
    }
}

參考資料

30天完賽感言

30天的旅程到今天暫時告一段落，這是第一次參加鐵人賽，選擇了和工作有點關係自己又有興趣的主題，過程中又要上班又要每天發文真的不容易，連到了連假都會忘記到晚上10:00才想起來「今天還沒發文!」，能完賽真覺得蠻幸運的，希望明年還能繼續參加，題目可能會是LeetCode 75 Level 2，到時就會有更多的Medium甚至是Hard難度題，本系列任何內容有想討論的，都歡迎在該篇文章下方留言，謝謝大家看到這邊。

Best Wishes,

Eric Hsu
2022-10-12