数据结构复习-第三部分

本系列文章为作者复习数据结构过程中的知识点总结、速通，欢迎补充

Written by SJTU-XHW

Reference: 张同珍老师 PPT | UNIkeEN

注：以下所有用到的键值对定义为：

template <class KEY, class VAL>
struct pair {
    KEY key;
    VAL value;
};

Chapter 7 集合、查找表

7.1 集合的零碎概念总结

键值对（C++：pair）：由一对键、值构成的结构体，值可以有多个、可以相同；但键可以相互比较，在单重集中具有唯一性；
集合的特征：无序性（集合中元素间没有前驱后继等关系）、确定性、互异性（多重集中没有此要求，但本部分也不涉及）；
集合的建议存储方式：散列存储；
集合的基本运算形式：创、清、增（插入）删查（集合中最重要、最基本的运算）；
查找：确定指定关键字值的数据元素在查找表中是否存在；
查找表：用于查找的数据结构都称之为~；
静态查找表、动态查找表
- 一般采用数组/线性结构存储、无需改变增删元素的查找表；
- 一般采用树形结构 / 散列结构存储、需要实时增删元素的查找表；
内部查找、外部查找
- 被查找元素全部存放于内存中的查找操作，称内部查找，否则称外部查找；
- 内部查找以比较次数为衡量时间性能的标准；外部查找（访问速率 << 比较速率）以外存访问次数为衡量时间性能的标准

7.2 静态查找表

存储实现：一个数组 / 顺序表【表头指针、长度、规模】

小特性：可以从后向前查找成功返回下标，失败返回-1
运算实现：静态查找表因为其特性，仅关心查找运算
- 无序表的查找运算：顺序查找$O(n)$
甚至为了减少比较次数，牺牲list[0]作为“哨兵”，每次查找前将target放入list[0]，这样一定能找到，并且如果停止在0上，则未找到；停止在其他序列上就找到了
- 有序表的查找运算：顺序查找、二分查找（$O(log\space n)$）、插值查找、分块查找；
  - 分块查找一定注意概念：先查找索引表（含有块内最大关键字、块起始地址），再在块内使用顺序 / 二分查找（取决于是不是有序的）；
提示：本部分对代码考察较少，一般考察查找次数分析、时间复杂度分析；
例题观赏
- 证明：分块查找的块大小为 $m=\sqrt{n}$ 时（n为总元素个数），分块查找的平均时间性能最佳为$O(\sqrt{n})$；
  $\begin{aligned} &假设表长为\space n,\space 被分为\space m\space 块,平均每块有\space \dfrac{n}{m}\space个元素;\\ &再假定索引、块内均采用顺序查找方法;\\ &则平均情况下查找次数=\dfrac{1}{m}\sum\limits_{i=1}^{m}{i}+\dfrac{1}{\cfrac{n}{m}}\sum\limits_{j=1}^{\cfrac{n}{m}}j=\dfrac{1}{2}(m+\dfrac{n}{m})+1\\ &由基本不等式可知,\space当且仅当\space m=\sqrt{n}\space时取最小值\sqrt{n}+1,\space故时间复杂度O(\sqrt{n}) \end{aligned}$
- 对于一个长度为3的顺序表查找，目标是第一个元素的概率是$\dfrac{1}{2}$，目标是第二个元素的概率是$\dfrac{1}{3}$，目标是第三个元素的概率是$\dfrac{1}{6}$，问顺序查找（从前往后）的平均查找长度；若该表同时满足有序性，问二分查找的平均查找长度；
  $\begin{aligned} &\overline{sLen}_{seq}=1\times\dfrac{1}{2}+2\times\dfrac{1}{3}+3\times\dfrac{1}{6}=\dfrac{5}{3}\\ &\overline{sLen}_{bin}=1\times\dfrac{1}{3}+2\times(\dfrac{1}{2}+\dfrac{1}{6})=\dfrac{5}{3}\\ \end{aligned}$

7.3 动态查找表

考虑所有的线性结构均不合适：顺序存储=>插入删除不利；链表=>查找只能顺序；

解决方案：

查找树：处理动态查找表的树型结构；

散列表：专用于集合查找的数据结构；

7.3.1 二叉查找树

定义：或者是一棵空树，或者是一棵同时满足以下条件的二叉树：

左子树不空时，左子树上所有元素的键值小于根结点的键值（右子树同理）；
它的左右子树也都是二叉查找树；

引申结论：一棵二叉查找树的中序序列是按键值递增排序的序列（所以二叉查找树又称二叉排序树）

⚠极易错误的点：是左右子树上所有的点都满足键值大小关系！！！不仅仅是左右结点满足这个关系。

例如：“设计程序检查二叉树是否为一个二叉查找树”，应该利用中序序列的特性，初始化一个最小数据值，中序遍历同时判断；而不应该只是简单地判断当前结点、左右子结点的值关系

template <class T>
bool isBSTree(node<T>* cur) {
    static T minData;
    if (!cur) return true;
    if (!isBSTree(cur->left) || cur->data < minData)
        return false;
    minData = cur->data;
    return isBSTree(cur->right);
}

存储实现：二叉树的标准存储法【根结点】

template <class KEY, class VAL>
class binSearchTree {
private:
    struct node {
        pair<KEY, VAL> data;
        node* left;
        node* right;
        node(): left(0), right(0) {}
        node(pair<KEY, VAL> p, node* L=0, node* R=0)
            : data(p), left(L), right(R) {}
    };
    node* root;
    void clear(node*& target);
public:
    binSearchTree();
    ~binSearchTree();
    void insert(const pair<KEY, VAL> dt);
    void remove(const KEY& k);
    pair<KEY, VAL>* find(const KEY& k) const;
};

运算实现

创建删除运算

template <class KEY, class VAL>
binSearchTree<KEY, VAL>::binSearchTree() { root = nullptr; }

template <class KEY, class VAL>
binSearchTree<KEY, VAL>::clear(node*& target) {
    if (!target) return;
    clear(target->left);
    clear(target->right);
    delete target; target = nullptr;
}

template <class KEY, class VAL>
binSearchTree<KEY, VAL>::~binSearchTree() { clear(root); }

查找运算 $O(log\space n)-O(log\space n)-O(n)$，推导平均情况：（有n种形态：左0右n-1、左2右n-2、……、左n-1右0，最后包括根结点）

$\begin{aligned} P(n)&=\dfrac{1}{n^2}\sum\limits_{i=0}^{n-1}[1+(P(i)+1)\times i+(P(n-i-1)+1)\times(n-i-1)]\\ &\le2(1+\dfrac{1}{n})\space ln\space n\space\sim1.38\space log\space n\\ \end{aligned}$

// 非递归既省时又省空间，用递归系统栈的开销就大啦
template <class KEY, class VAL>
pair<KEY, VAL>* binSearchTree<KEY, VAL>::find(const KEY& k) const {
    node* cur = root;
    while (cur && cur->data.key != k) {
        if (cur->data.key > k) cur = cur->left;
        else cur = cur->right;
    }
    return (cur ? &(cur->data) : nullptr);
}

插入运算 $O(log\space n)，实际相当于查找时间复杂度$

// 递归法
template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::insert(const pair<KEY, VAL>& dt, node*& target) {
    if (!target) target = new node(dt);
    else if (target->data.key < dt.key)
        insert(dt, target->right);
    else if (target->data.key > dt.key)
        insert(dt, target->left);
}

template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::insert(const pair<KEY, VAL>& dt) {
    insert(dt, root);
}
// --------------------------------------------------------
// 非递归法1，这里二阶指针不能用引用代替！否则会把root值改变！
template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::insert(const pair<KEY, VAL>& dt) {
    node** cur = &root;
    while (*cur && (*cur)->data.key != dt.key) {
        if ((*cur)->data.key > dt.key) cur = &((*cur)->left);
        else cur = &((*cur)->right);
    }
    if (!(*cur)) *cur = new node(dt);
}
// 非递归法2，除了用自身二阶指针保存外，更易懂的还可以用前后指针保存
template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::insert(const pair<KEY, VAL>& dt) {
    if (!root) { root = new node(dt); return; }
    node* pre, cur = root;
    while (cur && cur->data.key != dt.key) {
        if (cur->data.key > dt.key) cur = cur->left;
        else cur = cur->right;
        pre = cur;
    }
    if (!cur) {
        if (pre->left == cur) pre->left = new node(dt);
        else pre->right = new node(dt);
    }
}

删除运算 $O(log\space n)，实际相当于查找时间复杂度$，注意需要在右子树中找最小 / 左子树中找最大结点作为替身再删除；

// 仅展示递归法，非递归法可由插入运算同理得
template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::remove(const KEY& k, node*& target) {
    if (!target) return;
    if (target->data.key < k) remove(k, target->right);
    else if (target->data.key > k) remove(k, target->left);
    else if (target->left && target->right) {
        node* sub = target->right;
        while (sub->left) sub = sub->left;
        target->data = sub->data;
        remove(sub->data.key, target->right);
    }
    else {    // 思考为何度为1和度为0的结点情况可以合并
        node* tmp = target;
        target = (target->left ? target->left : target->right);
        delete tmp;
    }
}

template <class KEY, class VAL>
void binSearchTree<KEY, VAL>::remove(const KEY& k) {
    remove(k, root);
}

7.3.2 AVL树：平衡的二叉查找树

平衡条件左右子树高度差不大于 1（也即平衡因子不大于1）
- 对二叉查找树应用该平衡条件的原因：限制二叉查找树高度在对数级，避免退化为链表；
- AVL 树的平衡性大于普通二叉查找树，小于完全二叉树；

了解即可：AVL树对数级高度的理论依据
$\begin{aligned} &设\space S_H\space是高度为\space H\space的最小的AVL树规模，则:\\ &S_H=\begin{equation}\left\{\begin{aligned}&\qquad\quad1,&H=1\\&\qquad\quad2,&H=2\\&S_{H-1}+S_{H-2}+1,\space&H\ge3\\\end{aligned}\right.\end{equation} \end{aligned} \Longrightarrow S_H\approx\dfrac{\sqrt{5}}{5}(\dfrac{1+\sqrt{5}}{2})^{H+2}-1\le N$ $故:\space H\le1.44\space log_2(N+1)-0.328$

AVL树存储实现：二叉树的标准存储法 + 结点类高度字段【根结点】

template <class KEY, class VAL>
class AVLTree {
private:
    struct node {
        pair<KEY, VAL> data;
        node* left;
        node* right;
        int height;
        node(): left(0), right(0), height(1) {}
        node(const pair<KEY, VAL>& dt, node* L=0, node* R=0, int h=1)
            : data(dt), left(L), right(R), height(h) {}
    };
    node* root;
    void clear(node*& target);
    int height(node* target) { return (target ? target->height : 0); }
    void RR(node*& target);
    void LL(node*& target);
    void RL(node*& target);
    void LR(node*& target);
public:
    AVLTree();
    ~AVLTree();
    pair<KEY, VAL>* find(const KEY& k) const;
    void insert(const pair<KEY, VAL>& dt);
    void remove(const KEY& k);
};

AVL树失衡分析

AVL树与二叉查找树不同的是，在插入、删除后需要修正树的平衡性，下面讨论这两种操作如何修正平衡性；

AVL树运算实现

创建、删除操作

template <class KEY, class VAL>
AVLTree<KEY, VAL>::AVLTree() { root = nullptr; }

template <class KEY, class VAL>
void AVLTree<KEY, VAL>::clear(node*& target) {
    if (!target) return;
    clear(target->left);
    clear(target->left);
    delete target; target = nullptr;
}

template <class KEY, class VAL>
AVLTree<KEY, VAL>::~AVLTree() { clear(root); }

查找操作 $O(log\space n)$，写法同二叉查找树，但最坏时间性能好于二叉查找树；

template <class KEY, class VAL>
pair<KEY, VAL>* AVLTree<KEY, VAL>::find(const KEY& k) const {
    pair<KEY, VAL>* cur = root;
    while (cur && cur->data != k) {
        if (cur->data.key > k) cur = cur->left;
        else cur = cur->right;
    }
    return (cur ? &(cur->data) : nullptr);
}

[PRIVATE] 调整操作 $O(1)$

template <class KEY, class VAL>
void AVLTree<KEY, VAL>::LL(node*& target) {
    node* post = target->left;
    target->left = post->right;
    post->right = target;
    int hLeft = height(target->left), hRight = height(target->right);
    target->height = (hLeft > hRight ? hLeft : hRight) + 1;
    hLeft = height(post->left);
    post->height = (hLeft > target->height ? hLeft : target->height) + 1;
    target = post;
}

template <class KEY, class VAL>
void AVLTree<KEY, VAL>::RR(node*& target) {
    node* post = target->right;
    target->right = post->left;
    post->left = target;
    int hLeft = height(target->left), hRight = height(target->right);
    target->height = (hLeft > hRight ? hLeft : hRight) + 1;
    hRight = height(post->right);
    post->height = (target->height > hRight ? target->height : hRight) + 1;
    target = post;
}

template <class KEY, class VAL>
void AVLTree<KEY, VAL>::LR(node*& target) {
    RR(target->left); LL(target);
}

template <class KEY, class VAL>
void AVLTree<KEY, VAL>::RL(node*& target) {
    LL(target->right); RR(target);
}

插入操作 $O(log\space n)$，插入前部分与二叉查找树相同，另需回溯维持平衡；

// 非递归法，需要使用到前章节的栈类seqStack；显然可以知道递归方法怎么写，略
template <class KEY, class VAL>
void AVLTree<KEY, VAL>::insert(const pair<KEY, VAL>& dt) {
    if (!root) { root = new node(dt); return; }
    seqStack<node**> hist; node** tmp = NULL;
    hist.push(&root);
    while (true) {
        tmp = hist.pop(); hist.push(tmp); if (!(*tmp)) break;
        if ((*tmp)->data.key > dt.key) hist.push(&((*tmp)->left));
        else hist.push(&((*tmp)->right));
    }
    tmp = hist.pop(); *tmp = new node(dt);
    while (!hist.isempty()) {
        node** cur = hist.pop();
        int hLeft = height((*cur)->left),
            hRight = height((*cur)->right);
        if (hLeft - hRight > 1) {
            if (height((*cur)->left->left) > height(*cur)->left->right)
                { LL(*cur); return; }
            else { LR(*cur); return; }
        }
        else if (hRight - hLeft > 1) {
            if (height((*cur)->right->right) > height((*cur)->right->left))
                { RR(*cur); return; }
            else { RL(*cur); return; }
        }
        else (*cur)->height = (hLeft > hRight ? hLeft : hRight) + 1;    
    }
}

删除操作 $O(log\space n)$，与二叉查找树的删除前部分相同，后部分需要调整平衡；

// 非递归法，需要使用到前章节的栈类seqStack；显然可以知道递归方法怎么写，略
template <class KEY, class VAL>
void AVLTree<KEY, VAL>::remove(const KEY& k) {
    if (!root) return;
    seqStack<node**> hist; KEY target = k;
    node** tmp = &root;
    while (true) {
        hist.push(tmp); if (!(*tmp)) return;
        if ((*tmp)->data.key < target) tmp = &((*tmp)->right);
        else if ((*tmp)->data.key > target) tmp = &((*tmp)->left);
        else {    // 找到要删除的结点
            if ((*tmp)->left && (*tmp)->right) { // 要删除结点的度为2
                node* sub = (*tmp)->right;
                while (sub) sub = sub->left;
                (*tmp)->data = sub->data; target = sub->data.key;
                tmp = &((*tmp)->right);
            }
            else {    // 要删除结点的度小于2
                node* rn = *tmp;
                *tmp = ((*tmp)->left ? (*tmp)->left : (*tmp)->right);
                delete rn; hist.pop(); break;
            }
        }
    }
    while (!hist.isempty()) {
        tmp = hist.pop();
        int newLH = height((*tmp)->left), newRH = height((*tmp)->right);
        if (newLH - newRH > 1) {    // 高度改变且失衡
            if (height((*tmp)->left->left) > height((*tmp)->left->right)) LR(*tmp);
            else LL(*tmp);
        }
        else if (newRH - newLH > 1) {
            if (height((*tmp)->right->right) < height((*tmp)->right->left)) RL(*tmp);
            else RR(*tmp);
        }    // 高度未改变且未失衡
        else if (newRH - newLH == 1 || newLH - newRH == 1) return;
        // 剩下情况是高度改变但未失衡，或者失衡调整后：需要更新高度
        (*tmp)->height = (height((*tmp)->left) > height((*tmp)->right)
                          ? height((*tmp)->left) : height((*tmp)->right)) + 1;
    }
}

7.3.3 散列表

思路：将关键字对应的数据元素存储在指定位置，这个位置由关键字值运算（hash函数）映射出来。这样意味着一般情况下具有线性的查找时间；

散列表的存储实现：闭散列表—数组【表头指针、规模】；开散列表—链表数组【指针数组头、规模】；

主要问题：散列函数设计尽量避免碰撞、碰撞问题解决；
散列函数设计
- 直接定址法：线性映射，适用于数据本身分布均匀，且在同一数量级中；
  
  $H(x)=ax+b$
- 除留取余法：取余函数；
  
  $H(x)=a\space mod\space M.\space经验表明，M为素数时，分布较为均匀$
- 数字分析法：根据数字实际特性确定哪些位作为散列函数的参数来映射；
- 平方取中法：适用于关键字数据各个位数上分布均匀，且关键字值域大于数组规模，可以用关键字的平方，再取中间几位（和每一位数都有关系）；
- 折叠法：适用于关键字长 >> 散列表规模的情况，按每N位折叠再相加生成；
碰撞解决方案
- 闭散列表（迟删除={空，活动，删除}）：
  
  ↪ 线性探测法（向后+1移动至空位置/已删除位置）；
  
  ↪ 二次探测法（探测 $H+i^2$ 单位，数学上可以处理为前一个索引基础上 + (2i - 1)，能解决线性探测的初始聚集问题）；
  
  ↪ 再散列法使用两个散列函数H1、H2，H1计算应该处于的起始位置，H2计算碰撞后下一个探测的位置的步长：H1(x)、(H1(x)+H2(x)) mod M、……、(H1(x)+k H2(x)) mod M；
- 开散列表：拉链法，天然不存在碰撞问题；
闭散列表可能存在的问题和优化建议
- 如果选择除留取余作为hash函数，需要尽量保证数组长度（哪怕扩容）是素数；
- 负载因子（非空单元比例）过高可能会导致查找等操作时间性能下降，需要及时检查、清理迟删除的元素、进行扩容（涉及重散列）；

实现示例（看看就好）：以除留取余作为hash函数、闭散列表针对上面2个问题已优化；

// Hash method: dividing the residual method.
// The solution to Hash Collision: linear detecting.
template <class KEY, class VAL>
class closeHashTable {
private:
    struct hashNode {
        pair<KEY, VAL> data;
        // 0: empty, 1: active; 2: deleted
        int state;
        hashNode(): state(0) {}
        hashNode(const pair<KEY, VAL>& dt, int s=0)
            : data(dt), state(0) {}
    };

    int size;
    int empty;
    int deleted;
    hashNode* array;
    int (*key2int)(const KEY& key);

    static int defaultK2int(const KEY& k) { return k; }
    static int minPrime(int min);

    // Check the rate of the deleted nodes.
    // Return true if over half of the nodes is in deleted state.
    bool checkDeleted() const { return (deleted > size / 2); }
    // Check the rate of the active nodes.
    // Return true if the load factor is greater than 0.5.
    bool checkActive() const { return (empty + deleted < size / 2); }

    // Rehash the hash table to improve its performance.
    void reHash(bool expand=0);

public:
    closeHashTable(int capacity=13, int (*k2int)(const KEY&)=defaultK2int);
    ~closeHashTable();
    pair<KEY, VAL>* find(const KEY& key) const;
    void insert(const pair<KEY, VAL>& s);
    void remove(const KEY& key);
};

template <class KEY, class VAL>
int closeHashTable<KEY, VAL>::minPrime(int min) {
    if (min % 2 == 0) ++min;
    for (int i = min;; i += 2) {
        bool flag = 1;
        for (int j = 3; j * j <= min; j += 2)
            if (i % j == 0) { flag = 0; break; }
        if (flag) return i;
    }
}


template <class KEY, class VAL>
closeHashTable<KEY, VAL>::closeHashTable(int capacity, int (*k2int)(const KEY&)) {
    size = minPrime(capacity);
    array = new hashNode[size];
    key2int = k2int;
    empty = size; deleted = 0;
}

template <class KEY, class VAL>
closeHashTable<KEY, VAL>::~closeHashTable() {
    if (array) delete[] array;
}

template <class KEY, class VAL>
pair<KEY, VAL>* closeHashTable<KEY, VAL>::find(const KEY& key) const {
    // Why we need initPos: when the array is fully occupied, we need
    // to know if we have traversed the array.
    int initPos, pos;
    initPos = pos = key2int(key) % size;

    do {
        if (array[pos].state == 0) return 0;
        if (array[pos].state == 1 && array[pos].data.key == key)
            return &(array[pos].data);
        // Attention: state == -1 means delay delete,
        // which means keeping searching.
        pos = (pos + 1) % size;
    } while (initPos != pos);
    // The array is full and we can't find the data with 'key'.
    return 0;
}

template <class KEY, class VAL>
void closeHashTable<KEY, VAL>::insert(const pair<KEY, VAL>& s) {
    if (checkActive()) reHash(true);

    int initPos, pos;
    initPos = pos = key2int(s.key) % size;

    do {
        // Attention: we can overwrite the deleted entry(state=2).
        if (array[pos].state != 1) {
            // (array[pos].state ? --deleted : --empty);
            if (array[pos].state) --deleted;
            else --empty;
            array[pos].data = s;
            array[pos].state = 1;
            return;
        }
        pos = (pos + 1) % size;
    } while (initPos != pos);
    // Warning: the array is full.
}

template <class KEY, class VAL>
void closeHashTable<KEY, VAL>::remove(const KEY& key) {
    if (checkDeleted()) reHash();

    int initPos, pos;
    initPos = pos = key2int(key) % size;

    do {
        // Save time.
        if (array[pos].state == 0) return;
        if (array[pos].state == 1 && array[pos].data.key == key) {
            array[pos].state = 2; ++deleted; return;
        }
        pos = (pos + 1) % size;
    } while (initPos != pos);
}

template <class KEY, class VAL>
void closeHashTable<KEY, VAL>::reHash(bool expand) {
    hashNode* tmp = array;
    int oldSize = size;
    size = (expand ? minPrime(2*size) : size);
    array = new hashNode[size];
    empty = size; deleted = 0;

    for (int i = 0; i < oldSize; ++i) {
        if (tmp[i].state == 1) insert(tmp[i].data);
    }
    delete[] tmp;
}

template <class KEY, class VAL>
class openHashTable {
private:
    struct hashNode {
        pair<KEY, VAL> data;
        hashNode* next;

        hashNode(const pair<KEY, VAL>& s, hashNode* n=0)
            : data(s), next(n) {}
    };

    int size;
    hashNode** hashTable;
    int (*key2int)(const KEY& key);

    static int defaultK2int(const KEY& key) { return key; }

public:
    openHashTable(int initSize=101, int (*k2int)(const KEY& k)=defaultK2int);
    ~openHashTable();

    void insert(const pair<KEY, VAL>& x);
    void remove(const KEY& key);
    pair<KEY, VAL>* find(const KEY& key) const;
};

template <class KEY, class VAL>
openHashTable<KEY, VAL>::openHashTable(int initSize, int (*k2int)(const KEY& k)) {
    size = initSize;
    hashTable = new hashNode*[size] {0};
    key2int = k2int;   
}

template <class KEY, class VAL>
openHashTable<KEY, VAL>::~openHashTable() {
    for (int i = 0; i < size; ++i) {
        while (hashTable[i]) {
            hashNode* tmp = hashTable[i];
            hashTable[i] = hashTable[i]->next;
            delete tmp;
        }
    }
    delete[] hashTable;
}

template <class KEY, class VAL>
void openHashTable<KEY, VAL>::insert(const pair<KEY, VAL>& x) {
    if (find(x.key)) return;
    int idx = key2int(x.key) % size;
    hashTable[idx] = new hashNode(x, hashTable[idx]);
}

template <class KEY, class VAL>
void openHashTable<KEY, VAL>::remove(const KEY& key) {
    int idx = key2int(key) % size;
    hashNode* cur = hashTable[idx];
    if (!cur) return;
    while (cur->next && cur->next->data.key != key)
        cur = cur->next;
    if (!(cur->next)) return;
    hashNode* tmp = cur->next;
    cur->next = cur->next->next;
    delete tmp;
}

template <class KEY, class VAL>
pair<KEY, VAL>* openHashTable<KEY, VAL>::find(const KEY& key) const {
    int idx = key2int(key) % size;
    hashNode* cur = hashTable[idx];
    while (cur && cur->data.key != key)
        cur = cur->next;
    if (!cur) return nullptr;
    else return &(cur->data);
}

Chapter 8 排序

8.1 零碎概念集合

内排序和外排序：数据规模、数据存放位置、算法优化要求不一样；
排序算法的稳定性：集合中相同关键字值的元素在排序后，这些元素的相对次序不变，则称这种算法是稳定的；反之是不稳定的；（从现在开始，除非标明不稳定，其余都是稳定的）
常见应用：集合排序（便于查找）、查找序列中的重复元素；

注：本章排序算法以递增序列为目标；

8.2 插入排序

8.2.1 直接插入排序

每个外围循环将 a[i] 插入 a[0]~a[i-1] 的序列中，从后向前比较交换

如果数据是一个一个输入的（不同时间），那么直接插入排序是一个极好的选择

时间复杂度分析：

最坏：a[j] 比较 j 次、a[j] 交换 j + 2 次，故$O(n^2)$
最好$O(n)$、平均$O(n^2)$

// 为了简化逻辑结构、方便应用，代码中将直接用键KEY来替换pair键值对的排序，下同
template <class KEY>
void simpleInsertSort(KEY seq[], int size) {
    for (int i = 1; i < size; ++i) {
        KEY pre = seq[i];
        for (int j = i - 1; j >= 0; --j) {
            if (seq[j] > pre) seq[j + 1] = seq[j];
            else { seq[j + 1] = pre; break; }
        }
    }
}

8.2.2 二分插入排序

将直接插入的比较改为二分查找，并没有改变时间复杂度；

8.2.3 希尔排序（不稳定）

选定增量序列 {h1=1, h2, …, ht}，对待排序列进行 ht-排序、……、h2-排序、h1-排序；

每个 hk-排序使用直接插入排序；

shell 他本人建议增量序列取定：N/2 -> N/4 -> …… -> 1；

理论依据：hk-有序数组经过 h(k-1)-排序后仍然有序；

时间复杂度分析：

对于增量序列 $\{\dfrac{N}{2},\space\dfrac{N}{4},\space…,\space1\}$（或者间距序列为 $H= \{ 2^k-1\mid k=1,2,\ldots,\lfloor\log_2 n\rfloor \}$(从大到小）)，时间复杂度为$O(n^{\tfrac{3}{2}})$ ；
对于间距序列$H= \{ k=2^p\cdot 3^q\mid p,q\in \mathbb N,k\le n \}$（从大到小），时间复杂度$O(n\space log^2 n)$
有兴趣参阅证明过程，请移步：希尔排序 - OI Wiki (oi-wiki.org)
总体的移动次数（对所有增量序列）：$n^{1.25}\sim 1.6n^{1.25}\Longrightarrow O(n^{1.25})$；

template <class KEY>
void shellSort(KEY seq[], int size) {
    for (int step = size / 2; step >= 1; step /= 2) {
        // 对每个子序列进行直接插入排序
        for (int i = step; i < size; ++i) {
            pre = seq[i];
            for (int j = i - 1; j >= 0; --j) {
                if (seq[j] > pre) seq[j + 1] = seq[j];
                else { seq[j + 1] = pre; break; }
            }
        }
    }
}

8.3 选择排序

8.3.1 直接选择排序（特定条件下不稳定）

从未排序部分顺序查找出一个最大/最小元素，与未排序部分的末尾交换，以增长排序部分；

不稳定的、平方时间复杂度的简单算法，导致它仅常用于一些小规模的排序中；

时间复杂度分析：最好、最坏、平均：$O(n^2)$；

template <class KEY>
void simpleSelectSort(KEY seq[], int size) {
    for (int i = 0; i < size - 1; ++i) {
        int minIdx = i; KEY minVal = seq[i];
        for (int j = i + 1; j < size; ++j) {
            if (seq[j] < minVal) {
                minIdx = j; minVal = seq[j];
            }
        }
        seq[minIdx] = seq[i];
        seq[i] = minVal;
    }
}

8.3.2 堆排序（不稳定）

利用之前优先级队列中，二叉堆的特性进行排序；

出队后，从数组末尾向前摆放，所以要递增排序就要构建最大化堆（反之同理）；

通常情况0号位存有数据，所以 i 结点的左右儿子分别为 2i + 1、2i + 2，父结点 floor((i-1)/2)；

时间复杂度分析：出队N次，$O(n\space log\space n)$；

template <class KEY>
void percolateDown(KEY seq[], int size, int idx) {
    int next = idx * 2 + 1; KEY tmp = seq[idx];
    while (next < size) {
        if (next + 1 < size && seq[next] < seq[next + 1]) ++next;
        if (tmp < seq[next]) seq[idx] = tmp;
        else break;
        idx = next; next = 2 * idx + 1;
    }
    seq[idx] = tmp;
}

template <class KEY>
void heapSort(KEY seq[], int size) {
    // O(n)
    for (int i = (size - 1) / 2; i >= 0; --i)
        percolateDown(seq, size, i);
    // O(nlogn)
    for (int i = size - 1; i > 0; --i) {
        KEY tmp = seq[0]; seq[0] = seq[i]; set[i] = tmp;
        percolateDown(seq, i, 0);
    }
}

8.4 交换排序

8.4.1 冒泡排序

第 i 次循环（从前向后）安排好倒数第 i 个位置；

如果一趟起泡没有发生任何交换，则序列一定有序，可以中止；

时间复杂度分析：最好 $O(n)$，平均、最坏均为 $O(n^2)$；

template <class KEY>
void bubbleSort(KEY seq[], int size) {
    for (int i = 0; i < size - 1; ++i) {
        bool flag = true;
        for (int j = i; j < size; ++j) {
            if (seq[j + 1] < seq[j]) {
                KEY tmp = seq[j]; seq[j] = seq[j + 1];
                seq[j + 1] = tmp; flag = false;
            }
        }
        if (flag) break;
    }
}

8.4.2 快速排序（不稳定）

选标准元素 -> 划分 -> 递归（或不用递归，用栈保存划分的起始、中止序号）

时间复杂度分析：

最好：$T(N)=2T(\dfrac{N}{2})+cN\Longrightarrow T(N)=O(N\space log\space N)$；
最坏：$T(N)=T(N-1)+cN\Longrightarrow T(N)=O(N^2)$；
平均：$T(N)=2\dfrac{1}{N}\sum\limits_{i=0}^{N-1}{T(i)}+cN\Longrightarrow T(N)=O(N\space log\space N)$；

空间复杂度分析：和第一章说的一样，系统栈算辅助空间

最好：$O(log\space n)$；最坏：$O(n)$；

template <class KEY>
int divide(KEY seq[], int left, int right) {
    KEY pivot = seq[left];
    while (left < right) {
        while (left < right && pivot < seq[right]) --right;
        if (left < right) seq[left] = seq[right];
        while (left < right && seq[left] < pivot) ++left;
        if (left < right) seq[right] = seq[left];
    }
    seq[left] = pivot; return left;
}

template <class KEY>    // 可以添加一个包装函数，或者改成非递归
void quickSort(KEY seq[], int left, int right) {
    if (left >= right) return;
    int mid = divide(seq, left, right);
    quickSort(seq, left, mid - 1);
    quickSort(seq, mid + 1, right);
}

8.5 归并排序

以二路归并为例，先得到2份有序序列（应用递归），另启空间分别按大小加入

考虑到C++int自动向0取整（索引偏左），故merge在分化的时候应该多留一个给左边，防止死循环；

时间复杂度分析：时间复杂度与数据分布无关！始终为 $O(n\space log\space n)$；

空间复杂度分析：$O(n)$；

template <class KEY>
void merge(KEY seq[], int left, int right, int mid) {
    KEY* tmp = new KEY[right - left + 1];
    int leftIdx = 0, rightIdx = mid, idx = 0;
    while (leftIdx < mid && rightIdx <= right) {
        if (seq[leftIdx] > seq[rightIdx])
            tmp[idx++] = seq[rightIdx++];
        else tmp[idx++] = seq[leftIdx++];
    }
    while (leftIdx < mid) tmp[idx++] = seq[leftIdx++];
    while (rightIdx <= right) tmp[idx++] = seq[rightIdx++];
    for (int i = 0; i < right - left + 1; ++i)
        seq[i] = tmp[i];
    delete[] tmp;
}

template <class KEY>    // 可以添加一个包装函数
void mergeSort(KEY seq[], int left, int right) {
    if (left >= right) return;
    int mid = (left + right) / 2;
    mergeSort(seq, left, mid);
    mergeSort(seq, mid + 1, right);
    merge(seq, left, right， mid + 1);
}

8.6 基数排序

从最低位到最高位，分别放进 base（基数）个口袋中，再按口袋位置”倒出“，重复至所有数的最高位（不足的补0），排序完成；

时间复杂度分析：$O(maxLen\times n)$；

空间复杂度分析：$O(1)$；

MSD & LSD：从最高位还是最低位开始；

8-EX 常用排序总结

更多排序算法在算法设计/高级数据结构中进一步阐明；

⚠易错警示：算法的最好、最坏情况，与数据分布有关（除了堆【出队始终慢于入队】、归并、基数排序），但不与其完全对应（即“不一定最好情况是有序的时候”）！

当输入序列有序时：

直接插入排序 $O(n)$ 最好、常数个增量元素序列的希尔排序 $O(n)$ 最好（因为 k-排序对应的直接插入排序线性时间复杂度）、快速排序 $O(n^2)$ 最坏、冒泡排序 $O(n)$ 最好；

当输入序列逆序时：

直接插入排序 $O(n^2)$ 最坏、希尔排序 $O(n^2)$ 最坏（因为每两个就要比较一次）、快速排序 $O(n^2)$ 最坏（因为基准元素依然在两端）、冒泡排序 $O(n^2)$ 最坏；