LevelDB Secondary Index Implementation, by 姚凯文 (kevinyao0901) and 姜嘉祺


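Lab Report: Design and Implementation of a Secondary Index in LevelDB

Objective

Design and implement secondary index support on top of LevelDB to speed up queries on a specific field. With this feature, users can retrieve the records that match a given field value without scanning the entire database.

Approach

1. The concept of a secondary index

A secondary index is an auxiliary data structure that accelerates queries on a particular field. LevelDB stores data as key:value pairs and looks records up by key. To build a secondary index, we map each value of the target field back to the original key and store that mapping in a separate index database, which enables fast lookups by field value.

For example, given the following original data: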

k_1 : name:Customer#000000001|address:IVhzIApeRb|phone:25-989-741-2988
k_2 : name:Customer#000000002|address:XSTf4,NCwDVaW|phone:23-768-687-3665
k_3 : name:Customer#000000001|address:MG9kdTD2WBHm|phone:11-719-748-3364

After an index is created on the name field, the index database contains these entries:

name:Customer#000000001-k_1 : k_1
name:Customer#000000001-k_3 : k_3
name:Customer#000000002-k_2 : k_2
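
The key layout shown above can be captured in a small helper. The sketch below is only an illustration: MakeIndexKey and MakeIndexPrefix are hypothetical names that do not appear in the project code, and the "-" separator is taken from the example entries.

```cpp
#include <string>

// Hypothetical helper (not from the project code): builds an index key of the
// form "<field>:<fieldValue>-<primaryKey>", matching the entries listed above.
std::string MakeIndexKey(const std::string& field,
                         const std::string& field_value,
                         const std::string& primary_key) {
  return field + ":" + field_value + "-" + primary_key;
}

// Hypothetical helper: the prefix shared by every index entry that belongs to
// one (field, value) pair. A sorted scan over this prefix finds all matches.
std::string MakeIndexPrefix(const std::string& field,
                            const std::string& field_value) {
  return field + ":" + field_value + "-";
}
```

Because the primary key is part of the index key, k_1 and k_3 produce two distinct entries even though they share the same name, and the two entries sort next to each other.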

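2. Design goals

Create an index: scan every record in the database, extract the value of the specified field, encode the field value together with the original key, and write the result to the secondary index database indexDb_.
Query an index: quickly locate, in the secondary index database, the original keys that correspond to a field value.
Delete an index: remove every entry for the target field from the secondary index database.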

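Implementation

1. Design of the DBImpl class

DBImpl, the core class of LevelDB, is extended with secondary index support:

Indexed field management: the member variable fieldWithIndex_ holds the names of all fields that currently have an index.
Index database: the member variable indexDb_ manages the secondary index database.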

class DBImpl : public DB {
private:
    std::vector<std::string> fieldWithIndex_; // fields that currently have an index
    leveldb::DB* indexDb_;                    // database that stores the secondary index
};

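2. Creating a secondary index

The CreateIndexOnField method in DBImpl creates a secondary index on the specified field:

Iterate over every record in the main database.
Parse the value of the target field.
Write an entry to the index database whose key is "fieldName:field_value-key" and whose value is the key of the original record.

Core code: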

Status DBImpl::CreateIndexOnField(const std::string& fieldName) {
    // Reject the request if the field is already indexed
    for (const auto& field : fieldWithIndex_) {
        if (field == fieldName) {
            return Status::InvalidArgument("Index already exists for this field");
        }
    }

    // Add the field to the list of indexed fields
    fieldWithIndex_.push_back(fieldName);

    // Scan the main database, parse the field value, and write index entries
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it = this->NewIterator(read_options);

    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        std::string key = it->key().ToString();
        std::string value = it->value().ToString();

        // Extract the field value, which ends at the next '|' or at the end of the record
        size_t field_pos = value.find(fieldName + ":");
        if (field_pos != std::string::npos) {
            size_t value_start = field_pos + fieldName.size() + 1;
            size_t value_end = value.find("|", value_start);
            if (value_end == std::string::npos) value_end = value.size();

            std::string field_value = value.substr(value_start, value_end - value_start);
            // Append the original key so that records sharing a field value
            // get distinct index keys ("fieldName:field_value-key")
            std::string index_key = fieldName + ":" + field_value + "-" + key;

            // Create the entry in the index database
            leveldb::Status s = indexDb_->Put(WriteOptions(), Slice(index_key), Slice(key));
            if (!s.ok()) {
                delete it;
                return s;
            }
        }
    }

    // Report any error the iterator encountered during the scan
    Status scan_status = it->status();
    delete it;
    return scan_status;
}
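
The parsing step in the loop relies on value.find(fieldName + ":"), which would also match the tag in the middle of another field (for example name: inside nickname:). Below is a minimal sketch of a stricter parser, assuming records always use the f1:v1|f2:v2 layout shown earlier and that field names contain no '|'. ExtractFieldValue is a hypothetical helper, not part of the project code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: looks up `field` in a record encoded as
// "f1:v1|f2:v2|...". On success it stores the value in *out and returns true.
// The tag is only accepted at the start of a '|'-delimited segment.
bool ExtractFieldValue(const std::string& record, const std::string& field,
                       std::string* out) {
  const std::string tag = field + ":";
  std::size_t start = 0;
  while (start <= record.size()) {
    std::size_t end = record.find('|', start);
    if (end == std::string::npos) end = record.size();
    // Compare the beginning of this segment with "<field>:".
    if (record.compare(start, tag.size(), tag) == 0) {
      *out = record.substr(start + tag.size(), end - (start + tag.size()));
      return true;
    }
    start = end + 1;  // move on to the next segment
  }
  return false;
}
```

Returning a bool together with an output parameter mirrors the style of LevelDB's own Get, and it lets callers tell a missing field apart from a field with an empty value.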

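3. Querying a secondary index

The QueryByIndex method in DBImpl finds the original keys for a given field value:

Iterate over the index database entries whose keys start with fieldName:field_value.
Collect the results and return them.

Core code: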

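std::vector<std::string> DBImpl::QueryByIndex(const std::string& fieldName) {
    // fieldName carries the full search prefix, e.g. "name:Customer#000000001"
    std::vector<std::string> results;
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it = indexDb_->NewIterator(read_options);

    // Seek to the first candidate and stop once the keys no longer share the prefix;
    // without this check the loop would return every entry up to the end of the index
    for (it->Seek(fieldName); it->Valid() && it->key().starts_with(fieldName); it->Next()) {
        std::string value = it->value().ToString();
        if (!value.empty()) {
            results.push_back(value);
        }
    }

    delete it;
    return results;
}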
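
The seek-then-scan pattern behind an index query can be tried out without LevelDB. The sketch below uses std::map as a stand-in for the sorted index database, which is an assumption made purely for illustration. IndexModel and PrefixScan are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the index database: std::map keeps its keys sorted, as LevelDB does.
using IndexModel = std::map<std::string, std::string>;

// Returns the values (original keys) of all entries whose index key starts
// with `prefix`, in key order.
std::vector<std::string> PrefixScan(const IndexModel& index,
                                    const std::string& prefix) {
  std::vector<std::string> results;
  // lower_bound plays the role of Iterator::Seek(prefix).
  for (auto it = index.lower_bound(prefix); it != index.end(); ++it) {
    // Stop at the first key that no longer carries the prefix.
    if (it->first.compare(0, prefix.size(), prefix) != 0) break;
    results.push_back(it->second);
  }
  return results;
}
```

Ending the prefix with the "-" separator helps keep a query for Customer#000000001 from also matching a longer value that merely begins with the same characters.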

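4. Deleting a secondary index

The DeleteIndex method in DBImpl removes every index entry for the given field name:

Remove the field from fieldWithIndex_.
Iterate over the index database and delete every entry whose key starts with fieldName:.

Core code: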

Status DBImpl::DeleteIndex(const std::string& fieldName) {
    // std::find requires <algorithm>
    auto it = std::find(fieldWithIndex_.begin(), fieldWithIndex_.end(), fieldName);
    if (it == fieldWithIndex_.end()) {
        return Status::NotFound("Index not found for this field");
    }

    // Remove the field from the list of indexed fields
    fieldWithIndex_.erase(it);

    // Walk the index database and delete the entries that belong to this field
    const std::string prefix = fieldName + ":";
    leveldb::ReadOptions read_options;
    leveldb::Iterator* it_index = indexDb_->NewIterator(read_options);

    // Keys are sorted, so the entries for this field form one contiguous run:
    // seek to its start and stop at the first key without the prefix
    for (it_index->Seek(prefix); it_index->Valid() && it_index->key().starts_with(prefix); it_index->Next()) {
        std::string index_key = it_index->key().ToString();
        Status s = indexDb_->Delete(WriteOptions(), Slice(index_key));
        if (!s.ok()) {
            delete it_index;
            return s;
        }
    }

    delete it_index;
    return Status::OK();
}
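
Since all entries of one field share the prefix fieldName: and are stored in sorted order, dropping an index amounts to removing one contiguous key range. The sketch below shows that idea on a std::map stand-in (an assumption for illustration, not the LevelDB API). PrefixSuccessor and ErasePrefix are hypothetical names.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Stand-in for the index database: a sorted key/value map.
using IndexModel = std::map<std::string, std::string>;

// Returns the smallest string that sorts after every string starting with
// `prefix`, or an empty string if no such bound exists.
std::string PrefixSuccessor(std::string prefix) {
  while (!prefix.empty()) {
    const unsigned char last = static_cast<unsigned char>(prefix.back());
    if (last != 0xFF) {
      prefix.back() = static_cast<char>(last + 1);
      return prefix;
    }
    prefix.pop_back();  // 0xFF cannot be incremented; carry into the previous byte
  }
  return prefix;  // empty: the range extends to the end of the map
}

// Erases every entry whose key starts with `prefix` and returns how many were removed.
std::size_t ErasePrefix(IndexModel* index, const std::string& prefix) {
  const auto first = index->lower_bound(prefix);
  const std::string limit = PrefixSuccessor(prefix);
  const auto last = limit.empty() ? index->end() : index->lower_bound(limit);
  const std::size_t removed = static_cast<std::size_t>(std::distance(first, last));
  index->erase(first, last);
  return removed;
}
```

For the prefix name: the exclusive upper bound is name; because ';' is the byte that follows ':'. Entries of other fields, such as phone:, lie outside the range and are left alone.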

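5. Updates to the `Put` and `Delete` methods

To keep the secondary index in sync with Put and Delete operations, we extended both methods as follows:

Put method

Put now checks for indexed fields and updates their index entries:

Extract and check field values
- Iterate over the list of indexed fields (fieldWithIndex_).
- Check whether the value being inserted (val) contains the current field.
- If it does, extract the value of that field (fieldValue).
Build the index key and insert it into the index database
- Combine the field name, the field value, and the original key into an index key (field:fieldValue-key).
- Write the index key, with the original key (key) as its value, to the secondary index database indexDb_.
- If the write fails, return the error status immediately.

This logic keeps the index relation up to date in Put for every field in fieldWithIndex_.

Delete method

Delete now checks for indexed fields and removes their index entries:

Extract and check field values
- Iterate over the list of indexed fields (fieldWithIndex_).
- Check whether the record stored under the key being deleted (key) contains the current field.
- If it does, extract the value of that field (fieldValue).
Build the index key and delete the index entry
- Combine the field name, the field value, and the original key into an index key (field:fieldValue-key).
- Delete that index key from the secondary index database indexDb_.
- If the delete fails, return the error status immediately.

This logic ensures that Delete removes the secondary index entries of the deleted record.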
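
The bookkeeping described for Put and Delete can be modeled in a few lines. The class below is a toy model built on two std::map instances, not the DBImpl code. IndexedStoreModel and its members are hypothetical names, and it assumes that overwriting a key should first drop the index entries of the old value.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model (illustration only): main_ stands in for the main database and
// index_ for indexDb_. Index keys use the "field:fieldValue-key" layout.
class IndexedStoreModel {
 public:
  explicit IndexedStoreModel(std::vector<std::string> fields)
      : fields_(std::move(fields)) {}

  void Put(const std::string& key, const std::string& value) {
    Delete(key);  // drop index entries left over from a previous version
    main_[key] = value;
    for (const auto& field : fields_) {
      std::string field_value;
      if (FindField(value, field, &field_value)) {
        index_[field + ":" + field_value + "-" + key] = key;
      }
    }
  }

  void Delete(const std::string& key) {
    auto it = main_.find(key);
    if (it == main_.end()) return;
    // The fields live in the stored value, so it is read before removal.
    for (const auto& field : fields_) {
      std::string field_value;
      if (FindField(it->second, field, &field_value)) {
        index_.erase(field + ":" + field_value + "-" + key);
      }
    }
    main_.erase(it);
  }

  bool HasIndexEntry(const std::string& index_key) const {
    return index_.count(index_key) > 0;
  }
  std::size_t IndexSize() const { return index_.size(); }

 private:
  // Finds "<field>:" at the start of the record or right after a '|'.
  static bool FindField(const std::string& record, const std::string& field,
                        std::string* out) {
    const std::string tag = field + ":";
    std::size_t pos = 0;
    if (record.compare(0, tag.size(), tag) != 0) {
      pos = record.find("|" + tag);
      if (pos == std::string::npos) return false;
      pos += 1;  // skip the '|'
    }
    const std::size_t begin = pos + tag.size();
    std::size_t end = record.find('|', begin);
    if (end == std::string::npos) end = record.size();
    *out = record.substr(begin, end - begin);
    return true;
  }

  std::vector<std::string> fields_;
  std::map<std::string, std::string> main_;
  std::map<std::string, std::string> index_;
};
```

In this model, updating k_1 from Customer#000000001 to Customer#000000002 leaves exactly one index entry for k_1, and deleting k_1 leaves none.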

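6. Atomic insertion and deletion

To keep the main database (DBImpl) and the secondary index database (indexDb_) consistent, Put and Delete group their writes in a WriteBatch so that they are applied atomically. The details are as follows:

Inserting data

Write the main record: Put first adds the main write (key and val) to the WriteBatch.
Parse field values and update the index: iterate over fieldWithIndex_ (the fields that need an index) and extract each field's value from val. If the value is non-empty, build an index entry such as fieldName:fieldValue-key -> key and queue the index write.
Commit: the WriteBatch carries the main write and the index updates together. Submitting it through this->Write makes the commit atomic: either every write is applied or none is.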

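WriteBatch batch; // collects all writes of this Put
batch.Put(key, val); // main record

for (const auto& field : fieldWithIndex_) {
    // ExtractFieldValue is a placeholder name for the field-parsing step
    std::string fieldValue = ExtractFieldValue(val, field);
    if (!fieldValue.empty()) {
        std::string indexKey = field + ":" + fieldValue + "-" + key.ToString(); // fieldName:fieldValue-key
        std::string indexValue = key.ToString(); // the entry points back to the original key
        batch.Put(indexKey, indexValue); // queue the index write
    }
}

Status s = this->Write(o, &batch); // commit the batch atomically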

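Deleting data

Delete the main record: Delete adds the removal of the main record to the batch.
Parse field values and delete index entries: iterate over fieldWithIndex_, use key to obtain the record's field values, and build the corresponding index keys. For example, obtain fieldValue for key, build the index key fieldName:fieldValue-key, and queue its deletion.
Commit: the main delete and the index deletes are merged into one batch and submitted together through this->Write.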

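WriteBatch batch; // collects all writes of this Delete
batch.Delete(key); // main record

for (const auto& field : fieldWithIndex_) {
    // ExtractFieldValue is a placeholder name for obtaining the field value
    // of the record identified by key
    std::string fieldValue = ExtractFieldValue(key, field);
    if (!fieldValue.empty()) {
        std::string indexKey = field + ":" + fieldValue + "-" + key.ToString(); // fieldName:fieldValue-key
        batch.Delete(indexKey); // queue the index delete
    }
}

Status s = this->Write(options, &batch); // commit the batch atomically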

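Advantages of batching

Atomicity: a WriteBatch bundles the main-record update and the index inserts or deletes into one atomic write, which avoids inconsistencies caused by a crash between the individual steps. The guarantee covers the writes contained in the batch that is passed to this->Write.
Simpler code: a multi-step operation is folded into a single commit, which reduces code complexity.
Consistency: a batch is applied in full or not at all, so a failed write leaves no partially updated state behind.

With this design, the main database and the secondary index are updated together, which keeps the data consistent across insertions and deletions.

Example workflow

Insert the original data: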

k_1 : name:Customer#000000001|address:IVhzIApeRb|phone:25-989-741-2988
k_2 : name:Customer#000000002|address:XSTf4,NCwDVaW|phone:23-768-687-3665

Create the index:
- Call CreateIndexOnField("name"); the index database then contains:
```
name:Customer#000000001-k_1 : k_1
name:Customer#000000002-k_2 : k_2
```
Query the index:
- Call QueryByIndex("name:Customer#000000001"), which returns ["k_1"].
Delete the index:
- Call DeleteIndex("name") to remove every index entry that starts with name:.

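Test results:

Benchmark results and analysis:

Insertion time (Insertion time for 100001 entries: 516356 microseconds)

Inserting 100001 entries took 516356 microseconds (about 516 ms), or roughly 5.2 microseconds per entry. This is plausible for plain insertions with no expensive per-record computation; hardware performance and other environmental factors can shift it.

Query time without an index (Time without index: 106719 microseconds)

This is the time to answer the query without an index. It took 106719 microseconds (about 107 ms), because without an index, finding the matching records requires scanning the whole database.

Index creation time (Time to create index: 596677 microseconds)

Creating the index took 596677 microseconds (about 597 ms). This is expected after a large insert: index creation scans every record and writes one index entry per match, so its cost grows with the amount of data.

Query time with an index (Time with index: 68 microseconds)

With the index, the query took 68 microseconds, about 1,570 times faster than the 106719 microseconds measured without it. This is the expected effect of an index: the query seeks directly to the matching entries instead of scanning the whole database.

Query result (Found 1 keys with index)

The indexed query found 1 key. This is correct: name=Customer#10000 should match exactly 1 record.

Database statistics (Database stats)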

Compactions
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
--------------------------------------------------
  0        2        6         0        0         7
  1        5        8         0       16         7

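These statistics describe the compaction activity of the database. Levels 0 and 1 list their file counts and sizes. The figures look normal and show that the database performed some I/O and file reorganization while processing the data.

Index deletion time (Time to delete index on field 'name': 605850 microseconds)

Deleting the index took 605850 microseconds (about 606 ms), slightly longer than creating it (596677 microseconds). This is plausible: DeleteIndex iterates over the index database and issues one Delete per entry, so its cost also scales with the number of index entries.

Summary of the benchmark results:

Overall, the output is as expected:

Insertion and index creation: both take comparatively long, but the times are reasonable given the data volume and the work of generating the index.
Query with an index: the index cuts the query time to 68 microseconds, an excellent result.
Index deletion: deleting the index takes slightly longer than creating it, which is not unusual.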
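
The benchmark code itself is not shown in this report. One common way to collect microsecond timings like the ones above is std::chrono::steady_clock. The following is a minimal sketch under that assumption, and MeasureMicros is a hypothetical helper name.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `fn` once and returns the elapsed wall-clock
// time in microseconds, measured with a monotonic clock.
template <typename Fn>
std::int64_t MeasureMicros(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
      .count();
}
```

A call such as MeasureMicros([&] { results = db->QueryByIndex("name:Customer#10000"); }) would time a single indexed query, assuming db and results are set up by the surrounding benchmark. For very short operations like the 68-microsecond query, averaging over repeated runs gives a steadier figure than a single measurement.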

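Conclusion

This lab integrated index management into DBImpl and implemented creation, querying, and deletion of secondary indexes. The index data is stored in a separate indexDb_, and its mapping from field values to keys makes queries by field value much faster.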
