LevelDB Secondary Index Implementation, by 姚凯文 (kevinyao0901) and 姜嘉祺

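Lab Report: Design and Implementation of a Secondary Index in LevelDB

Objective

Design and implement secondary-index support on top of LevelDB to speed up queries on a specific field. With this feature, users can retrieve records by field value without scanning the entire database.

Approach

1. What a secondary index is

A secondary index is an auxiliary data structure that speeds up queries on a particular field. LevelDB stores data as key:value pairs. To build a secondary index, we map each value of the target field to the original key and store that mapping in a separate index database, which enables fast lookups by field value.

For example, given the following original data: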

k_1 : name:Customer#000000001|address:IVhzIApeRb|phone:25-989-741-2988
k_2 : name:Customer#000000002|address:XSTf4,NCwDVaW|phone:23-768-687-3665
k_3 : name:Customer#000000001|address:MG9kdTD2WBHm|phone:11-719-748-3364

After an index is created on the name field, the index database contains these entries:

name:Customer#000000001-k_1 : k_1
name:Customer#000000001-k_3 : k_3
name:Customer#000000002-k_2 : k_2
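
The key layout above can be sketched in a few lines. This is only an illustration: MakeIndexKey and BuildSampleIndex are hypothetical names that do not appear in the project, and a std::map stands in for the sorted index database.

```cpp
#include <map>
#include <string>

// Hypothetical helper: encodes one index key as "field:value-primaryKey".
std::string MakeIndexKey(const std::string& field,
                         const std::string& value,
                         const std::string& primary_key) {
    return field + ":" + value + "-" + primary_key;
}

// Builds the index entries for the three sample records.
// std::map keeps keys sorted, as an assumption-level stand-in for the index DB.
std::map<std::string, std::string> BuildSampleIndex() {
    std::map<std::string, std::string> index;
    index[MakeIndexKey("name", "Customer#000000001", "k_1")] = "k_1";
    index[MakeIndexKey("name", "Customer#000000002", "k_2")] = "k_2";
    index[MakeIndexKey("name", "Customer#000000001", "k_3")] = "k_3";
    return index;
}
```

Appending the original key keeps entries unique when several records share a field value, and the sorted order places all entries for one value next to each other.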

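2. Design goals

Create index: scan every record in the database, extract the value of the specified field, encode the field value together with the original key, and write the result to the secondary index database indexDb_.
Query index: locate the original keys for a given field value in the secondary index database.
Delete index: remove all entries that belong to the target field from the secondary index database.

Implementation

1. Design of the DBImpl class

Secondary-index support is added to DBImpl, LevelDB's core class, through two members:

Indexed-field bookkeeping: the member fieldWithIndex_ holds the names of all fields that currently have an index.
Index database: the member indexDb_ holds the secondary index database.

class DBImpl : public DB {
private:
    std::vector<std::string> fieldWithIndex_; // fields that currently have an index
    leveldb::DB* indexDb_;                    // database that stores the secondary index
};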

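2. Creating a secondary index

DBImpl implements CreateIndexOnField, which builds a secondary index on the given field:

Iterate over every record in the primary database.
Parse the value of the target field.
Write an entry to the index database whose key is "fieldName:field_value-key" and whose value is the key of the original record.

Core code: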

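Status DBImpl::CreateIndexOnField(const std::string& fieldName) {
    // Reject the request if the field already has an index.
    for (const auto& field : fieldWithIndex_) {
        if (field == fieldName) {
            return Status::InvalidArgument("Index already exists for this field");
        }
    }

    // Scan the primary database, parse the field value of each record,
    // and write one entry per record to the index database.
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it = this->NewIterator(read_options);

    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        std::string key = it->key().ToString();
        std::string value = it->value().ToString();

        // Locate "fieldName:" and read up to the next '|' (or the end of the value).
        size_t field_pos = value.find(fieldName + ":");
        if (field_pos != std::string::npos) {
            size_t value_start = field_pos + fieldName.size() + 1;
            size_t value_end = value.find("|", value_start);
            if (value_end == std::string::npos) value_end = value.size();

            std::string field_value = value.substr(value_start, value_end - value_start);
            // Append the original key so that records sharing a field value get
            // distinct index keys of the form "fieldName:field_value-key".
            std::string index_key = fieldName + ":" + field_value + "-" + key;

            // Write the entry to the index database.
            leveldb::Status s = indexDb_->Put(WriteOptions(), Slice(index_key), Slice(key));
            if (!s.ok()) {
                delete it;
                return s;
            }
        }
    }

    // Surface any error the iterator hit during the scan.
    Status scan_status = it->status();
    delete it;
    if (!scan_status.ok()) return scan_status;

    // Register the field only after the index has been fully built.
    fieldWithIndex_.push_back(fieldName);
    return Status::OK();
}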
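
One detail of the parsing step is worth illustrating: searching the whole value for "name:" would also match inside a longer field name such as "nickname:". A minimal sketch of a stricter parser, assuming records always use '|' between fields and ':' after the field name, could look like this. ExtractField is a hypothetical helper, not part of the project.

```cpp
#include <string>

// Hypothetical helper: finds `field` in a record such as
// "name:A|address:B|phone:C" and stores its value in *out.
// Returns false when the record has no such field.
bool ExtractField(const std::string& record, const std::string& field,
                  std::string* out) {
    const std::string wanted = field + ":";
    std::string::size_type start = 0;
    while (start <= record.size()) {
        // Each segment spans [start, end) and is delimited by '|'.
        std::string::size_type end = record.find('|', start);
        if (end == std::string::npos) end = record.size();
        // Match only when the segment itself begins with "field:".
        if (end - start >= wanted.size() &&
            record.compare(start, wanted.size(), wanted) == 0) {
            *out = record.substr(start + wanted.size(),
                                 end - start - wanted.size());
            return true;
        }
        start = end + 1;
    }
    return false;
}
```

Matching at segment boundaries costs little and avoids indexing the wrong field when one field name is a suffix of another.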

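3. Querying a secondary index

DBImpl implements QueryByIndex, which looks up the original keys for a target field value:

Iterate over the entries in the index database whose keys start with fieldName:field_value.
Collect the results and return them.

Core code: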

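std::vector<std::string> DBImpl::QueryByIndex(const std::string& fieldName) {
    // Despite its name, fieldName carries the full "fieldName:field_value"
    // prefix, for example "name:Customer#000000001".
    std::vector<std::string> results;
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it = indexDb_->NewIterator(read_options);

    // Seek to the first key >= prefix and stop at the first key that no longer
    // starts with it; otherwise the scan would run to the end of the index.
    for (it->Seek(fieldName);
         it->Valid() && it->key().starts_with(fieldName);
         it->Next()) {
        std::string value = it->value().ToString();
        if (!value.empty()) {
            results.push_back(value);
        }
    }

    delete it;
    return results;
}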
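
The seek-then-scan pattern can be tried without LevelDB. The sketch below uses std::map::lower_bound as a stand-in for Iterator::Seek; PrefixScan is a hypothetical name, and the trailing '-' in the usage is an assumption based on the "fieldName:field_value-key" layout, added so that one value does not match another value that merely begins with it.

```cpp
#include <map>
#include <string>
#include <vector>

// Returns the values of all entries whose key starts with `prefix`.
// std::map stands in for the sorted index database.
std::vector<std::string> PrefixScan(
    const std::map<std::string, std::string>& index,
    const std::string& prefix) {
    std::vector<std::string> primary_keys;
    // lower_bound plays the role of Seek: it finds the first key >= prefix.
    for (auto it = index.lower_bound(prefix);
         it != index.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        primary_keys.push_back(it->second);
    }
    return primary_keys;
}
```

For the sample index, PrefixScan(index, "name:Customer#000000001-") yields k_1 and k_3, and the loop ends as soon as it reaches the entry for Customer#000000002.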

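4. Deleting a secondary index

DBImpl implements DeleteIndex, which removes every index entry that belongs to the given field:

Remove the field from fieldWithIndex_.
Iterate over the index database and delete every entry whose key starts with fieldName:.

Core code: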

Status DBImpl::DeleteIndex(const std::string& fieldName) {
    // std::find requires <algorithm>.
    auto it = std::find(fieldWithIndex_.begin(), fieldWithIndex_.end(), fieldName);
    if (it == fieldWithIndex_.end()) {
        return Status::NotFound("Index not found for this field");
    }

    // Remove the field from the list of indexed fields.
    fieldWithIndex_.erase(it);

    // Keys are sorted, so all entries for this field are contiguous:
    // seek to the first one and stop at the first key with another prefix.
    const std::string prefix = fieldName + ":";
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it_index = indexDb_->NewIterator(read_options);

    for (it_index->Seek(prefix);
         it_index->Valid() && it_index->key().starts_with(prefix);
         it_index->Next()) {
        std::string index_key = it_index->key().ToString();
        Status s = indexDb_->Delete(WriteOptions(), Slice(index_key));
        if (!s.ok()) {
            delete it_index;
            return s;
        }
    }

    delete it_index;
    return Status::OK();
}
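
Because index keys are sorted, all entries of one field form a single contiguous range. The sketch below shows that idea on a std::map, which again stands in for the index database; ErasePrefix is a hypothetical helper, not the project's code.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Removes every entry whose key starts with `prefix` and returns how many
// entries were removed. Only the matching range is visited.
std::size_t ErasePrefix(std::map<std::string, std::string>& index,
                        const std::string& prefix) {
    auto first = index.lower_bound(prefix);  // first key >= prefix
    auto last = first;
    std::size_t removed = 0;
    // Advance to the first key that no longer carries the prefix.
    while (last != index.end() &&
           last->first.compare(0, prefix.size(), prefix) == 0) {
        ++last;
        ++removed;
    }
    index.erase(first, last);
    return removed;
}
```

Calling ErasePrefix(index, "name:") drops the name entries and leaves the entries of other indexed fields untouched.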

Example workflow

Insert the original data:

k_1 : name:Customer#000000001|address:IVhzIApeRb|phone:25-989-741-2988
k_2 : name:Customer#000000002|address:XSTf4,NCwDVaW|phone:23-768-687-3665

Create the index:
- Call CreateIndexOnField("name"). The index database now contains:
```
name:Customer#000000001-k_1 : k_1
name:Customer#000000002-k_2 : k_2
```
Query the index:
- Call QueryByIndex("name:Customer#000000001"), which returns ["k_1"].
Delete the index:
- Call DeleteIndex("name"), which removes every index entry whose key starts with name:.

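Test results:

Benchmark results and analysis:

Insertion time (Insertion time for 100001 entries: 516356 microseconds)

Inserting 100001 records took 516356 microseconds (about 516 ms), or roughly 5.2 microseconds per record. The insert path does no extra computation beyond the write itself, so this figure mainly reflects LevelDB's write performance on the test machine.

Query time without an index (Time without index: 106719 microseconds)

Without an index, the query took 106719 microseconds (about 107 ms). The query has to scan every record and inspect its value to find matches, so the cost grows with the size of the database.

Index creation time (Time to create index: 596677 microseconds)

Creating the index took 596677 microseconds (about 597 ms), or roughly 6.0 microseconds per record. Index creation reads every record in the primary database and writes one entry per record to the index database, so its cost grows with the amount of data and is comparable to the original insertion.

Query time with an index (Time with index: 68 microseconds)

With the index, the same query took 68 microseconds, about 1,570 times faster than the 106719 microseconds measured without it. This is the expected result: the index lookup seeks directly to the matching entries instead of scanning the whole database.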
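
The benchmark source is not reproduced in this report. As a rough illustration of how the two query paths can be compared, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: std::map stands in for both databases, name is taken to be the first field of every record, and ScanByName, LookupByName and ElapsedMicros are hypothetical names.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Table = std::map<std::string, std::string>;

// Query without an index: inspect the value of every record.
std::vector<std::string> ScanByName(const Table& primary, const std::string& name) {
    const std::string needle = "name:" + name + "|";
    std::vector<std::string> keys;
    for (const auto& kv : primary) {
        if (kv.second.compare(0, needle.size(), needle) == 0) keys.push_back(kv.first);
    }
    return keys;
}

// Query with an index: seek to the prefix and read only the matching entries.
std::vector<std::string> LookupByName(const Table& index, const std::string& name) {
    const std::string prefix = "name:" + name + "-";
    std::vector<std::string> keys;
    for (auto it = index.lower_bound(prefix);
         it != index.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        keys.push_back(it->second);
    }
    return keys;
}

// Runs fn once and returns the elapsed wall-clock time in microseconds.
template <typename Fn>
long long ElapsedMicros(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A caller would wrap each query in ElapsedMicros and check that both paths return the same keys before comparing the two timings.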

Query result (Found 1 keys with index)

The indexed query found 1 key. This is the expected result: the query name=Customer#10000 should match exactly one record.

Database statistics (Database stats)

Compactions
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
--------------------------------------------------
  0        2        6         0        0         7
  1        5        8         0       16         7

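This table reports LevelDB's compaction statistics per level. After the run, Level 0 holds 2 files (6 MB) and Level 1 holds 5 files (8 MB); compaction into Level 1 read 16 MB and wrote 7 MB. These figures are unremarkable and show that the database performed ordinary background I/O and file reorganization while handling the workload.

Index deletion time (Time to delete index on field 'name': 605850 microseconds)

Deleting the index took 605850 microseconds (about 606 ms), slightly longer than creating it. This is plausible given the implementation: DeleteIndex iterates over the index database and issues one Delete per entry, and in LevelDB each Delete is itself a write that records a deletion marker, so removing about 100001 entries costs about as much as writing them.

Summary of the benchmark run:

Overall, the output is as expected:

Insertion and index creation: both take several hundred milliseconds, which is reasonable for 100001 records because each involves one write per record.
Query with an index: the index reduces the query time to 68 microseconds, down from 106719 microseconds without it.
Index deletion: deletion takes slightly longer than creation, because every index entry is removed with its own write.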
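
To put these timings in perspective, the following sketch computes the speedup and the number of queries after which building the index pays for itself. The three constants are copied from the benchmark output above; the function names are hypothetical, and the estimate assumes a single benchmark run and ignores any cost of keeping the index up to date on later writes.

```cpp
// Figures from the benchmark output, in microseconds.
constexpr double kScanQueryMicros = 106719.0;    // query without index
constexpr double kIndexQueryMicros = 68.0;       // query with index
constexpr double kCreateIndexMicros = 596677.0;  // one-time index creation

// How many times faster the indexed query is.
double Speedup() { return kScanQueryMicros / kIndexQueryMicros; }

// Number of queries after which the time saved equals the creation cost.
double BreakEvenQueries() {
    return kCreateIndexMicros / (kScanQueryMicros - kIndexQueryMicros);
}
```

Under these assumptions the speedup is about 1,570 times and the break-even point is about 5.6 queries, so the index pays for itself from the sixth query on this field.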

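Conclusion

This lab integrates index management into DBImpl and implements creation, querying, and deletion of secondary indexes. The index data is stored in a separate database, indexDb_, and its key-to-key mapping makes queries by field value far more efficient than a full scan.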