CST LAB2

HashFun

策略实现 (396字)

`good_hashing`

int good_hashing::operator()(char* str, int N){
    unsigned int seed = 131;
    unsigned int hash = 0;
    for (int i = 0; i < strlen(str); i++) {
        hash = hash * seed + str[i];
        hash %= N;
    }
    return hash;
}

此处参考了 BKDR hash 函数, 使用 seed = 131 作为种子, 可实现较为均匀的 hash 映射.

`poor_hashing`

poor_hashing 函数定义为: $poor\_hashing(str)=89\{\sum_{i = 0}^{len(str)-1} str[i]\}\%N.$

由于 ascii 的大小为 0~127, 且字符串长度受限制, 生成的 hash 值最多不会超过 200000, 这导致了不均匀的到 TABLE_SIZE 的映射; 同时对字符串直接求和的操作很容易导致 collision.

`quadratic_probe`

定义 int dist 为当前双向试探的试探长度, bool direction 为当前双向试探的试探方向. 每次执行 init() 时, 初始化 dist = 0; direction = false;.

int quadratic_probe::operator()(hash_entry* Table, int table_size, int last_choice){
    dist++;
    int offset, pos;
    if(direction) {
        direction = false;
        offset = ((long long)(dist * dist) + 1 >> 1) % table_size;
        pos = last_choice - offset;
        while (pos < 0) pos += table_size;
        return pos;
    }
	...
}

每次冲突时, 依据 direction 的值决定偏移方向, 依据 dist 的大小决定偏移量.

`overflow_probe`

if (dynamic_cast<overflow_probe*>(my_collision))
    table_size = 400031;

此处修改了测试框架, 若 my_collision 可成功动态转换为 overflow_probe* 指针, 此时使用公共溢出区策略, 将 table_size 修改为 400031.

定义 int overflow_head 为当前缓冲区的起始 index. 每次执行 init() 时, 初始化 overflow_head = 400031.

int overflow_probe::operator()(hash_entry* Table, int table_size, int last_choice){
    return overflow_head++;
}

每次冲突时, 顺次遍历缓冲区直到返回一个可供插入词条的空位置.

测试数据 (222字)

generator.cpp 作为数据生成器.

/* generator.cpp */
srand((int)time(0));
num = rand() % 500000 + 1;

使用当前时间作为随机数种子, 产生 1~500000 内的随机数作为行号, 将 poj.txt 的该行信息作为输入数据;

使用 ReadLine(filename, line) 读入 poj.txt 的指定行信息;

使用 GenerateInsert(srt) 与 GenerateQuery(srt) , 通过 string 的 substr() 与 find() 方法从该行字符串中构造符合格式的输入数据;

shuffle_pattern 决定是否对插入和删除操作进行打乱 (即随机化).

数据 01.in 进行了 40000 次插入, 40000 次查询, 遵循先插入后查询的次序;
数据 02.in 进行了 100000 次插入, 100000 次查询, 遵循先插入后查询的次序;
数据 03.in 进行了 100000 次插入, 100000 次查询, 遵循边插入边查询的次序.

生成数据的参数附在 parameters.txt 中.

分析结果 (276字)

“好” 哈希函数的性能更佳. 因为 “坏” 哈希函数将字符串不均匀地映射到哈希表中, 容易导致hash聚集, 增大了冲突可能, 在处理冲突时会消耗更多时间.
双向平方试探性能更佳. 当数据规模较大时, 使用双向平方试探移动出冲突区域的效率较高; 同时, 比起使用 “坏” 哈希函数, 使用 “好” 哈希函数时这种移动的效率会更高.
使用 “好” 哈希函数时封闭散列性能略占优, 使用 “坏” 哈希函数时开放散列性能明显占优, 总体上开放散列性能占优. 在数据装填因子大、哈希函数均匀的情况下适合使用封闭散列.
这会导致字符串分布出现一定的规律性, 映射后的 hash 冲突较为严重, 使得性能出现下降.
模拟 vector 类的扩容, 在装填因子较大或较小时, 重新动态分配 hash_entry* new_table, 迁移原表数据, 并释放原指针即可.