### **Experiment Plan Report: Implementing and Evaluating `LevelDB` with an `embedded_secondary-index`**
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
#### **1. Background**
|
|
|
|
|
|
|
|
|
|
LevelDB is a high-performance persistent key-value storage engine that exposes a simple `API` for efficient reads and writes. However, stock `LevelDB` only supports fast lookups by primary key and cannot directly serve queries on secondary attributes. Many scenarios, such as search systems or complex indexing systems, require efficient secondary-index queries.
|
|
|
|
|
|
|
|
|
|
This experiment plan extends `LevelDB` with an `embedded_secondary-index` design: secondary-index queries are served through Bloom filters embedded in the storage format, and a Top-K query capability is added to make secondary-attribute queries more practical and efficient.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
#### **2. Objectives**
|
|
|
|
|
|
|
|
|
|
- Implement an extended version of `LevelDB` that supports secondary-index queries.

- Validate that the embedded secondary-index design is superior in read/write performance and query efficiency.

- Measure the performance of the Top-K query feature in the extended database.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
#### **3. System Design**
|
|
|
|
|
|
|
|
|
|
This experiment adopts the **`embedded_secondary-index`** approach, which embeds the secondary index into `LevelDB`'s existing data structures. The core design is as follows:
|
|
|
|
|
|
|
|
|
|
##### **3.1 Data Structure Design**
|
|
|
|
|
|
|
|
|
|
1. **`MemTable`**:

   - Maintains the in-memory mapping between primary keys and their secondary attributes.

   - Builds a Bloom filter over the secondary attributes to speed up lookups.

2. **`SSTable`**:

   - Each `SSTable` contains data blocks (the key/value pairs), a metadata block (index information), and Bloom filter blocks (one set for primary keys, one for secondary attributes).

   - When data is flushed to disk, the Bloom filters are embedded into the `SSTable` itself, so no separate index file is needed.

3. **Bloom filters**:

   - A Bloom filter bit string is computed over the secondary attributes of each data block.

   - Filters loaded in memory quickly rule out blocks that cannot contain the target data, reducing disk I/O.
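The per-block filter in step 3 can be sketched as follows. This is a minimal illustration of the idea, not LevelDB's actual filter code (which lives in `util/bloom.cc`); the class name, bit count, and double-hashing scheme are assumptions made for exposition:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: one Bloom filter per data block, built over the
// secondary attribute of every entry in that block.
class SecondaryBloom {
 public:
  explicit SecondaryBloom(size_t nbits) : bits_(nbits, false) {}

  void Add(const std::string& attr) { Probe(attr, /*set=*/true); }

  // May return a false positive, never a false negative.
  bool MayContain(const std::string& attr) { return Probe(attr, /*set=*/false); }

 private:
  bool Probe(const std::string& attr, bool set) {
    uint64_t h = std::hash<std::string>{}(attr);
    uint64_t h1 = h & 0xffffffffu;
    uint64_t h2 = (h >> 32) | 1;  // odd step, never zero
    for (int i = 0; i < kProbes; ++i) {
      size_t pos = (h1 + i * h2) % bits_.size();
      if (set) bits_[pos] = true;
      else if (!bits_[pos]) return false;
    }
    return true;
  }
  static constexpr int kProbes = 4;
  std::vector<bool> bits_;  // serialized into the SSTable's filter block
};
```

At query time, a `MayContain` miss on a block's filter lets the reader skip that block entirely, which is where the disk-I/O savings come from.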
|
|
|
|
|
|
|
|
|
|
##### **3.2 Query Algorithm Design**
|
|
|
|
|
|
|
|
|
|
1. **Top-K queries**:

   - A query first uses the Bloom filters to narrow the candidates down to the `SSTable`s and data blocks that may match.

   - Matching records are collected in a min-heap ordered by `sequence_number` (insertion order), so the K most recent records are returned.

2. **Hierarchical lookup flow**:

   - Query the `MemTable` first;

   - On a miss, walk the `SSTable`s level by level.
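The Top-K collection step above can be sketched with a size-capped min-heap on `sequence_number`. Names such as `Entry` and `TopK` are illustrative, not part of the actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

struct Entry {
  uint64_t seq;             // LevelDB sequence number (insertion order)
  std::string primary_key;  // payload carried alongside the match
};

// Keep a min-heap capped at k: the smallest (oldest) sequence number sits on
// top and is evicted first, so the k most recent matches survive.
std::vector<Entry> TopK(const std::vector<Entry>& matches, size_t k) {
  auto older = [](const Entry& a, const Entry& b) { return a.seq > b.seq; };
  std::priority_queue<Entry, std::vector<Entry>, decltype(older)> heap(older);
  for (const Entry& e : matches) {
    heap.push(e);
    if (heap.size() > k) heap.pop();  // drop the oldest candidate
  }
  std::vector<Entry> out;
  while (!heap.empty()) {
    out.push_back(heap.top());
    heap.pop();
  }
  std::reverse(out.begin(), out.end());  // newest first
  return out;
}
```

Because the heap never holds more than K entries, the cost is O(n log K) over n candidate matches rather than a full O(n log n) sort.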
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
#### **4. Experiment Procedure**
|
|
|
|
|
|
|
|
|
|
##### **4.1 System Implementation**
|
|
|
|
|
|
|
|
|
|
1. Modify the `LevelDB` source to embed the secondary index:

   - Extend the `SSTable` data-block layout with Bloom filter support;

   - Modify the `Write` and `Flush` paths to embed the secondary-index information.

2. Extend the database `API`:

   - Implement the secondary-index query interfaces (`RangeLookUp` and `Top-K LookUp`).

3. Write unit tests with Google Test to verify correctness.
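The shape of the extended interfaces might look like the sketch below. The method signatures are assumptions based on the plan (stock `leveldb::DB` has no such methods), and a toy in-memory index stands in for the real storage engine:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the extended API; real versions would sit on leveldb::DB
// and consult the embedded Bloom filters instead of an in-memory multimap.
class ToySecondaryIndex {
 public:
  void Put(const std::string& primary, const std::string& secondary) {
    index_.emplace(secondary, IndexEntry{next_seq_++, primary});
  }

  // All primary keys whose secondary attribute lies in [lo, hi].
  std::vector<std::string> RangeLookUp(const std::string& lo,
                                       const std::string& hi) const {
    std::vector<std::string> out;
    for (auto it = index_.lower_bound(lo);
         it != index_.end() && it->first <= hi; ++it) {
      out.push_back(it->second.primary);
    }
    return out;
  }

  // Up to k primary keys for this secondary attribute, newest first.
  std::vector<std::string> TopKLookUp(const std::string& secondary,
                                      size_t k) const {
    std::vector<std::string> out;
    auto range = index_.equal_range(secondary);
    for (auto it = range.first; it != range.second; ++it) {
      out.push_back(it->second.primary);
    }
    // Equal keys are stored in insertion order; newest entries are at the back.
    std::reverse(out.begin(), out.end());
    if (out.size() > k) out.resize(k);
    return out;
  }

 private:
  struct IndexEntry { uint64_t seq; std::string primary; };
  std::multimap<std::string, IndexEntry> index_;
  uint64_t next_seq_ = 0;
};
```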
|
|
|
|
|
|
|
|
|
|
##### **4.2 Planned Performance Tests**
|
|
|
|
|
|
|
|
|
|
|
|
|
1. **Data preparation**:
|
|
|
|
|
|
|
|
|
|
   - Generate a synthetic dataset containing primary keys and secondary attributes.
|
|
|
|
|
|
|
|
|
|
   - Example record format:
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
{ |
|
|
|
"primary_key": "id12345", |
|
|
|
"secondary_key": "tag123", |
|
|
|
"value": "This is a test record." |
|
|
|
} |
|
|
|
``` |
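One plausible way to map such a record onto a flat LevelDB key/value pair is to prefix the value with the secondary attribute, so the write/flush path can parse it back out when building the per-block Bloom filters. The NUL delimiter and helper names below are assumptions, not the plan's actual encoding, and the scheme assumes the secondary key contains no NUL byte:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding: value = secondary_key + '\0' + payload.
// Assumes the secondary key itself never contains a NUL byte.
std::string EncodeValue(const std::string& secondary,
                        const std::string& payload) {
  return secondary + '\0' + payload;
}

// Returns {secondary_key, payload}; the write path would feed the secondary
// key into the block's secondary-attribute Bloom filter.
std::pair<std::string, std::string> DecodeValue(const std::string& v) {
  size_t sep = v.find('\0');
  return {v.substr(0, sep), v.substr(sep + 1)};
}
```

With this scheme the example record becomes `Put("id12345", EncodeValue("tag123", "This is a test record."))`.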
|
|
|
|
|
|
|
|
|
|
2. **Metrics**:
|
|
|
|
|
|
|
|
|
|
   - Write throughput (`QPS`).

   - Secondary-attribute query performance:

     - latency of a single query;

     - query performance under different Top-K parameters;

     - the embedded secondary index versus a conventional external index.
|
|
|
|
|
|
|
|
|
|
3. **Tooling**:

   We plan to use a benchmark tool to measure the database's throughput and latency.
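Latency and QPS can be measured with a simple timing loop. This is a generic sketch of the measurement method, not any specific benchmark tool; the struct and function names are illustrative:

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

struct BenchResult {
  double micros_per_op;  // mean latency per operation
  double qps;            // operations per second
};

// Runs op() n times against a steady clock and reports mean latency and QPS.
BenchResult Measure(size_t n, const std::function<void()>& op) {
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < n; ++i) op();
  std::chrono::duration<double, std::micro> elapsed =
      std::chrono::steady_clock::now() - start;
  double per_op = elapsed.count() / static_cast<double>(n);
  return {per_op, 1e6 / per_op};
}
```

In the planned tests, `op` would wrap a single `Put`, `RangeLookUp`, or Top-K call against the extended database.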
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
#### **5. Appendix: System Diagrams**
|
|
|
|
|
|
|
|
|
|
The following suggested diagrams illustrate the design and implementation of the **`embedded_secondary-index`** approach and are intended to accompany the experiment report:
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
### **1. Overall System Architecture Diagram**
|
|
|
|
|
|
|
|
|
|
**Diagram contents**

Shows the overall `embedded_secondary-index` design: how primary keys and secondary attributes are stored, and how the Bloom filters are embedded in the `SSTable`s.
|
|
|
|
|
|
|
|
|
|
**Diagram structure**
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
- Key points:

  1. The secondary index and its Bloom filters are embedded in the `SSTable` metadata block, avoiding the overhead of an external index file.

  2. At query time, the Bloom filters quickly rule out irrelevant `SSTable`s, so only blocks that may match are read.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
### **2. Data Write Flow Diagram**
|
|
|
|
|
|
|
|
|
|
**Diagram contents**

Describes how a write parses the primary key and secondary attribute and then updates the Bloom filters and `SSTable`s.
|
|
|
|
|
|
|
|
|
|
**Diagram structure**
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
- **Key points**:

  During a write, the primary key and secondary attribute are parsed automatically and the Bloom filters are updated in real time, so the write path stays efficient.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
### **3. Data Query Flow Diagram**
|
|
|
|
|
|
|
|
|
|
**Diagram contents**

Shows the steps of a query on a secondary attribute: Bloom filter screening, block access, and result return.
|
|
|
|
|
|
|
|
|
|
**Diagram structure**
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
- **Key points**:

  Bloom filters screen out non-matching `SSTable`s, and a min-heap collects and orders the Top-K records, keeping the query efficient.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
### **4. `SSTable` Layout Diagram**
|
|
|
|
|
|
|
|
|
|
**Diagram contents**

Shows how primary keys, secondary attributes, and Bloom filters are organized inside an `SSTable`.
|
|
|
|
|
|
|
|
|
|
**Diagram structure**
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
- **Key points:**

  1. Each `SSTable` contains data blocks (Data Blocks), a metadata block (Meta Block), and Bloom filter blocks (Bloom Filter Blocks).

  2. The Bloom filters for secondary attributes and for primary keys are stored separately, providing fast filtering along both dimensions.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|
|
|
|
|
|
|
|
### **5. Top-K Query Heap Ordering Diagram**
|
|
|
|
|
|
|
|
|
|
**Diagram contents**

Centered on the min-heap, shows how query results are ordered by insertion time (`sequence_number`).
|
|
|
|
|
|
|
|
|
|
**Diagram structure**
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
- **Key points**:

  The query maintains a fixed-size min-heap that keeps only the K most recent records, which greatly reduces the sorting cost.
|
|
|
|
|
|
|
|
|
|
------ |
|
|
|