10215300402 朱维清 10222140408 谷杰

leveldb
=======

_Jeff Dean, Sanjay Ghemawat_

The leveldb library provides a persistent key value store. Keys and values are
arbitrary byte arrays. The keys are ordered within the key value store
according to a user-specified comparator function.

## Opening A Database

A leveldb database has a name which corresponds to a file system directory. All
of the contents of the database are stored in this directory. The following
example shows how to open a database, creating it if necessary:

```c++
#include <cassert>
#include "leveldb/db.h"

leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());
...
```

If you want to raise an error if the database already exists, add the following
line before the `leveldb::DB::Open` call:

```c++
options.error_if_exists = true;
```
## Status

You may have noticed the `leveldb::Status` type above. Values of this type are
returned by most functions in leveldb that may encounter an error. You can check
if such a result is ok, and also print an associated error message:

```c++
leveldb::Status s = ...;
if (!s.ok()) std::cerr << s.ToString() << std::endl;
```
## Closing A Database

When you are done with a database, just delete the database object. Example:

```c++
... open the db as described above ...
... do something with db ...
delete db;
```

## Reads And Writes

The database provides Put, Delete, and Get methods to modify/query the database.
For example, the following code moves the value stored under key1 to key2.

```c++
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);
```
## Atomic Updates

Note that if the process dies after the Put of key2 but before the delete of
key1, the same value may be left stored under multiple keys. Such problems can
be avoided by using the `WriteBatch` class to atomically apply a set of updates:

```c++
#include "leveldb/write_batch.h"
...
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) {
  leveldb::WriteBatch batch;
  batch.Delete(key1);
  batch.Put(key2, value);
  s = db->Write(leveldb::WriteOptions(), &batch);
}
```

The `WriteBatch` holds a sequence of edits to be made to the database, and these
edits within the batch are applied in order. Note that we called Delete before
Put so that if key1 is identical to key2, we do not end up erroneously dropping
the value entirely.
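
To see why the order of edits matters, the following sketch replays the two edits against an in-memory `std::map` that stands in for the database. `FakeDb`, `ApplyMove`, and the `delete_first` flag are hypothetical names used only for illustration; no leveldb code is involved.

```c++
#include <map>
#include <string>

// Stand-in for the database: an ordered in-memory map (an assumption made
// for illustration only).
using FakeDb = std::map<std::string, std::string>;

// Replays the "move key1 to key2" edits in the chosen order.
// Returns true if key2 holds the moved value afterwards.
bool ApplyMove(FakeDb& db, const std::string& key1, const std::string& key2,
               bool delete_first) {
  auto it = db.find(key1);
  if (it == db.end()) return false;
  const std::string value = it->second;  // Copy before mutating the map.
  if (delete_first) {
    db.erase(key1);
    db[key2] = value;
  } else {
    db[key2] = value;
    db.erase(key1);  // Also removes key2 when key1 == key2.
  }
  return db.count(key2) == 1 && db[key2] == value;
}
```

With `key1 == key2`, deleting first leaves the value in place, while putting first and then deleting removes it.
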

Apart from its atomicity benefits, `WriteBatch` may also be used to speed up
bulk updates by placing lots of individual mutations into the same batch.

## Synchronous Writes

By default, each write to leveldb is asynchronous: it returns after pushing the
write from the process into the operating system. The transfer from operating
system memory to the underlying persistent storage happens asynchronously. The
sync flag can be turned on for a particular write to make the write operation
not return until the data being written has been pushed all the way to
persistent storage. (On Posix systems, this is implemented by calling either
`fsync(...)` or `fdatasync(...)` or `msync(..., MS_SYNC)` before the write
operation returns.)

```c++
leveldb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);
```

Asynchronous writes are often more than a thousand times as fast as synchronous
writes. The downside of asynchronous writes is that a crash of the machine may
cause the last few updates to be lost. Note that a crash of just the writing
process (i.e., not a reboot) will not cause any loss since even when sync is
false, an update is pushed from the process memory into the operating system
before it is considered done.

Asynchronous writes can often be used safely. For example, when loading a large
amount of data into the database you can handle lost updates by restarting the
bulk load after a crash. A hybrid scheme is also possible where every Nth write
is synchronous, and in the event of a crash, the bulk load is restarted just
after the last synchronous write finished by the previous run. (The synchronous
write can update a marker that describes where to restart on a crash.)
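
The bookkeeping behind the hybrid scheme can be sketched with two small helpers. Both names (`IsSyncWrite`, `RestartIndex`) and the zero-based numbering of writes are assumptions made for illustration; they are not part of leveldb.

```c++
#include <cstddef>

// Hypothetical helper: true when the write at zero-based position `index`
// should set sync = true, given that every n-th write is synchronous (n > 0).
bool IsSyncWrite(std::size_t index, std::size_t n) {
  return (index + 1) % n == 0;
}

// Hypothetical helper: the position a restart marker would point to if
// `completed` writes had been issued before the crash. Everything before
// this position was covered by a synchronous write.
std::size_t RestartIndex(std::size_t completed, std::size_t n) {
  return (completed / n) * n;
}
```

For example, with every 1000th write synchronous and a crash after 2500 writes, the load would resume at position 2000, redoing at most 999 writes.
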

`WriteBatch` provides an alternative to asynchronous writes. Multiple updates
may be placed in the same WriteBatch and applied together using a synchronous
write (i.e., `write_options.sync` is set to true). The extra cost of the
synchronous write will be amortized across all of the writes in the batch.

## Concurrency

A database may only be opened by one process at a time. The leveldb
implementation acquires a lock from the operating system to prevent misuse.
Within a single process, the same `leveldb::DB` object may be safely shared by
multiple concurrent threads. I.e., different threads may write into or fetch
iterators or call Get on the same database without any external synchronization
(the leveldb implementation will automatically do the required synchronization).
However other objects (like Iterator and `WriteBatch`) may require external
synchronization. If two threads share such an object, they must protect access
to it using their own locking protocol. More details are available in the public
header files.

## Iteration

The following example demonstrates how to print all key,value pairs in a
database.

```c++
leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
}
assert(it->status().ok());  // Check for any errors found during the scan
delete it;
```

The following variation shows how to process just the keys in the range
[start,limit):

```c++
for (it->Seek(start);
     it->Valid() && it->key().ToString() < limit;
     it->Next()) {
  ...
}
```

You can also process entries in reverse order. (Caveat: reverse iteration may be
somewhat slower than forward iteration.)

```c++
for (it->SeekToLast(); it->Valid(); it->Prev()) {
  ...
}
```
## Snapshots

Snapshots provide consistent read-only views over the entire state of the
key-value store. `ReadOptions::snapshot` may be non-NULL to indicate that a
read should operate on a particular version of the DB state. If
`ReadOptions::snapshot` is NULL, the read will operate on an implicit snapshot
of the current state.

Snapshots are created by the `DB::GetSnapshot()` method:

```c++
leveldb::ReadOptions options;
options.snapshot = db->GetSnapshot();
... apply some updates to db ...
leveldb::Iterator* iter = db->NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db->ReleaseSnapshot(options.snapshot);
```

Note that when a snapshot is no longer needed, it should be released using the
`DB::ReleaseSnapshot` interface. This allows the implementation to get rid of
state that was being maintained just to support reading as of that snapshot.
## Slice

The return values of the `it->key()` and `it->value()` calls above are instances
of the `leveldb::Slice` type. Slice is a simple structure that contains a length
and a pointer to an external byte array. Returning a Slice is a cheaper
alternative to returning a `std::string` since we do not need to copy
potentially large keys and values. In addition, leveldb methods do not return
null-terminated C-style strings since leveldb keys and values are allowed to
contain `'\0'` bytes.

C++ strings and null-terminated C-style strings can be easily converted to a
Slice:

```c++
leveldb::Slice s1 = "hello";

std::string str("world");
leveldb::Slice s2 = str;
```

A Slice can be easily converted back to a C++ string:

```c++
std::string str = s1.ToString();
assert(str == std::string("hello"));
```

Be careful when using Slices since it is up to the caller to ensure that the
external byte array into which the Slice points remains live while the Slice is
in use. For example, the following is buggy:

```c++
leveldb::Slice slice;
if (...) {
  std::string str = ...;
  slice = str;
}
Use(slice);
```

When the if statement goes out of scope, str will be destroyed and the backing
storage for slice will disappear.
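
One way to avoid this bug is to declare the string in the same scope as the Slice, so that the backing storage outlives every use. The sketch below uses `std::string_view` (C++17) as a stand-in for `leveldb::Slice`, since both are non-owning pointer-plus-length views; `PickGreeting` is a hypothetical function invented for this example.

```c++
#include <string>
#include <string_view>

std::string PickGreeting(bool formal) {
  std::string str;         // Declared outside the if: outlives `view`.
  std::string_view view;   // Non-owning, like leveldb::Slice.
  if (formal) {
    str = "good morning";
    view = str;            // Points into str, which is still alive below.
  }
  return std::string(view);  // Copy the bytes out while str is alive.
}
```

The same restructuring applies to the buggy snippet above: hoisting `str` out of the `if` keeps the bytes alive for the call to `Use(slice)`.
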
## Comparators

The preceding examples used the default ordering function for keys, which orders
bytes lexicographically. You can however supply a custom comparator when opening
a database. For example, suppose each database key consists of two numbers and
we should sort by the first number, breaking ties by the second number. First,
define a proper subclass of `leveldb::Comparator` that expresses these rules:

```c++
class TwoPartComparator : public leveldb::Comparator {
 public:
  // Three-way comparison function:
  //   if a < b: negative result
  //   if a > b: positive result
  //   else: zero result
  int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const {
    int a1, a2, b1, b2;
    ParseKey(a, &a1, &a2);
    ParseKey(b, &b1, &b2);
    if (a1 < b1) return -1;
    if (a1 > b1) return +1;
    if (a2 < b2) return -1;
    if (a2 > b2) return +1;
    return 0;
  }

  // Ignore the following methods for now:
  const char* Name() const { return "TwoPartComparator"; }
  void FindShortestSeparator(std::string*, const leveldb::Slice&) const {}
  void FindShortSuccessor(std::string*) const {}
};
```
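
The comparator above relies on a `ParseKey` helper that this document does not define. A minimal sketch follows, assuming purely for illustration that each key is ASCII text of the form `first:second` (for example `12:7`). It takes a `std::string_view` so that it compiles without leveldb; with a `leveldb::Slice` one could pass a view built from the Slice's `data()` and `size()`.

```c++
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical key format: two decimal integers separated by ':'.
// Returns false (leaving the outputs untouched) if the key is malformed.
bool ParseKey(std::string_view key, int* first, int* second) {
  const std::size_t colon = key.find(':');
  if (colon == std::string_view::npos) return false;
  const char* begin = key.data();
  const char* mid = begin + colon;
  const char* end = begin + key.size();
  int a = 0, b = 0;
  auto r1 = std::from_chars(begin, mid, a);
  if (r1.ec != std::errc() || r1.ptr != mid) return false;
  auto r2 = std::from_chars(mid + 1, end, b);
  if (r2.ec != std::errc() || r2.ptr != end) return false;
  *first = a;
  *second = b;
  return true;
}
```

A real key format would more likely use a fixed-width binary encoding; the text format here is only chosen to keep the example short.
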

Now create a database using this custom comparator:

```c++
TwoPartComparator cmp;
leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
options.comparator = &cmp;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
...
```
### Backwards compatibility

The result of the comparator's Name method is attached to the database when it
is created, and is checked on every subsequent database open. If the name
changes, the `leveldb::DB::Open` call will fail. Therefore, change the name if
and only if the new key format and comparison function are incompatible with
existing databases, and it is ok to discard the contents of all existing
databases.

You can however still gradually evolve your key format over time with a little
bit of pre-planning. For example, you could store a version number at the end of
each key (one byte should suffice for most uses). When you wish to switch to a
new key format (e.g., adding an optional third part to the keys processed by
`TwoPartComparator`), (a) keep the same comparator name, (b) increment the
version number for new keys, and (c) change the comparator function so it uses
the version numbers found in the keys to decide how to interpret them.
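
Step (c) can be sketched as follows. The key layout is an assumption made only for this example: ASCII fields separated by `:` followed by one version byte, where version `'1'` keys have two parts and version `'2'` keys have three. `ParsedKey`, `ParseVersionedKey`, and `CompareVersioned` are hypothetical names, and `std::string_view` stands in for `leveldb::Slice`.

```c++
#include <charconv>
#include <string_view>
#include <system_error>

struct ParsedKey {
  int part[3] = {0, 0, 0};
  int count = 2;  // Number of parts actually present in the key.
};

// Reads the trailing version byte and parses the matching number of parts.
bool ParseVersionedKey(std::string_view key, ParsedKey* out) {
  if (key.empty()) return false;
  const char version = key.back();
  key.remove_suffix(1);
  const int fields = (version == '1') ? 2 : (version == '2') ? 3 : 0;
  if (fields == 0) return false;
  ParsedKey parsed;
  parsed.count = fields;
  const char* p = key.data();
  const char* end = p + key.size();
  for (int i = 0; i < fields; i++) {
    auto r = std::from_chars(p, end, parsed.part[i]);
    if (r.ec != std::errc()) return false;
    p = r.ptr;
    const bool last = (i == fields - 1);
    if (last ? (p != end) : (p == end || *p != ':')) return false;
    if (!last) ++p;  // Skip the ':' separator.
  }
  *out = parsed;
  return true;
}

// Three-way comparison. Keys that tie on the first two parts are ordered
// with two-part keys before three-part keys. Malformed keys are treated as
// the two-part key 0:0 in this sketch.
int CompareVersioned(std::string_view a, std::string_view b) {
  ParsedKey ka, kb;
  ParseVersionedKey(a, &ka);
  ParseVersionedKey(b, &kb);
  for (int i = 0; i < 2; i++) {
    if (ka.part[i] < kb.part[i]) return -1;
    if (ka.part[i] > kb.part[i]) return +1;
  }
  if (ka.count != kb.count) return ka.count < kb.count ? -1 : +1;
  if (ka.count == 3 && ka.part[2] != kb.part[2]) {
    return ka.part[2] < kb.part[2] ? -1 : +1;
  }
  return 0;
}
```

Because old two-part keys keep their relative order under the new function, the comparator name can stay the same and existing databases remain readable.
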
## Performance

Performance can be tuned by changing the default values of the types defined in
`include/leveldb/options.h`.

### Block size

leveldb groups adjacent keys together into the same block and such a block is
the unit of transfer to and from persistent storage. The default block size is
approximately 4096 uncompressed bytes. Applications that mostly do bulk scans
over the contents of the database may wish to increase this size. Applications
that do a lot of point reads of small values may wish to switch to a smaller
block size if performance measurements indicate an improvement. There isn't much
benefit in using blocks smaller than one kilobyte, or larger than a few
megabytes. Also note that compression will be more effective with larger block
sizes.

### Compression

Each block is individually compressed before being written to persistent
storage. Compression is on by default since the default compression method is
very fast, and is automatically disabled for uncompressible data. In rare cases,
applications may want to disable compression entirely, but should only do so if
benchmarks show a performance improvement:

```c++
leveldb::Options options;
options.compression = leveldb::kNoCompression;
... leveldb::DB::Open(options, name, ...) ....
```
### Cache

The contents of the database are stored in a set of files in the filesystem and
each file stores a sequence of compressed blocks. If `options.block_cache` is
non-NULL, it is used to cache frequently used uncompressed block contents.

```c++
#include "leveldb/cache.h"

leveldb::Options options;
options.block_cache = leveldb::NewLRUCache(100 * 1048576);  // 100MB cache
leveldb::DB* db;
leveldb::DB::Open(options, name, &db);
... use the db ...
delete db;
delete options.block_cache;
```

Note that the cache holds uncompressed data, and therefore it should be sized
according to application level data sizes, without any reduction from
compression. (Caching of compressed blocks is left to the operating system
buffer cache, or any custom Env implementation provided by the client.)

When performing a bulk read, the application may wish to disable caching so that
the data processed by the bulk read does not end up displacing most of the
cached contents. A per-iterator option can be used to achieve this:

```c++
leveldb::ReadOptions options;
options.fill_cache = false;
leveldb::Iterator* it = db->NewIterator(options);
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  ...
}
```
### Key Layout

Note that the unit of disk transfer and caching is a block. Adjacent keys
(according to the database sort order) will usually be placed in the same block.
Therefore the application can improve its performance by placing keys that are
accessed together near each other and placing infrequently used keys in a
separate region of the key space.

For example, suppose we are implementing a simple file system on top of leveldb.
The types of entries we might wish to store are:

    filename -> permission-bits, length, list of file_block_ids
    file_block_id -> data

We might want to prefix filename keys with one letter (say '/') and the
`file_block_id` keys with a different letter (say '0') so that scans over just
the metadata do not force us to fetch and cache bulky file contents.
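
A sketch of key builders for this layout is shown below. The function names and the zero-padded decimal encoding of block ids are assumptions made for illustration; only the two prefixes come from the text above.

```c++
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical builder: metadata keys start with '/'.
std::string MetadataKey(const std::string& filename) {
  return "/" + filename;
}

// Hypothetical builder: block keys start with '0'. The id is zero-padded to
// a fixed width so that bytewise order matches numeric order under the
// default comparator.
std::string BlockKey(std::uint64_t file_block_id) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "0%020llu",
                static_cast<unsigned long long>(file_block_id));
  return buf;
}
```

Since '/' sorts before '0' in ASCII, all metadata keys form one contiguous region ahead of the block keys, so a scan over the metadata never touches blocks that hold file contents.
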
### Filters

Because of the way leveldb data is organized on disk, a single `Get()` call may
involve multiple reads from disk. The optional FilterPolicy mechanism can be
used to reduce the number of disk reads substantially.

```c++
#include "leveldb/filter_policy.h"

leveldb::Options options;
options.filter_policy = leveldb::NewBloomFilterPolicy(10);
leveldb::DB* db;
leveldb::DB::Open(options, "/tmp/testdb", &db);
... use the database ...
delete db;
delete options.filter_policy;
```

The preceding code associates a Bloom filter based filtering policy with the
database. Bloom filter based filtering relies on keeping some number of bits of
data in memory per key (in this case 10 bits per key since that is the argument
we passed to `NewBloomFilterPolicy`). This filter will reduce the number of
unnecessary disk reads needed for Get() calls by a factor of approximately 100.
Increasing the bits per key will lead to a larger reduction at the cost of more
memory usage. We recommend that applications whose working set does not fit in
memory and that do a lot of random reads set a filter policy.
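
The factor of approximately 100 is consistent with the textbook estimate of a Bloom filter's false positive rate. The sketch below evaluates that estimate. The function name is hypothetical, and the probe count of 6 used in the example (roughly 10 × ln 2, rounded down) is an assumption about the filter rather than something stated in this document.

```c++
#include <cmath>

// Textbook approximation of a Bloom filter's false positive rate with
// `bits_per_key` bits and `num_probes` hash probes per key:
//   (1 - e^(-k / b))^k
double EstimatedFalsePositiveRate(double bits_per_key, int num_probes) {
  const double k = static_cast<double>(num_probes);
  return std::pow(1.0 - std::exp(-k / bits_per_key), k);
}
```

With 10 bits per key and 6 probes the estimate is a little under 1%, meaning roughly one unnecessary read survives out of every hundred. Doubling the bits per key at the same probe count lowers the estimate further.
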

If you are using a custom comparator, you should ensure that the filter policy
you are using is compatible with your comparator. For example, consider a
comparator that ignores trailing spaces when comparing keys.
`NewBloomFilterPolicy` must not be used with such a comparator. Instead, the
application should provide a custom filter policy that also ignores trailing
spaces. For example:

```c++
class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  const leveldb::FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(leveldb::NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const leveldb::Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces.
    // RemoveTrailingSpaces is an application-supplied helper, not part of leveldb.
    std::vector<leveldb::Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(trimmed.data(), n, dst);
  }

  bool KeyMayMatch(const leveldb::Slice& key, const leveldb::Slice& filter) const {
    // Trim the lookup key the same way so it matches what CreateFilter stored.
    return builtin_policy_->KeyMayMatch(RemoveTrailingSpaces(key), filter);
  }
};
```
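
The filter policy above calls a `RemoveTrailingSpaces` helper that the application has to supply. A minimal sketch is shown here on `std::string_view` (C++17) so that it compiles without leveldb. A version for `leveldb::Slice` could compute the trimmed length the same way and return `leveldb::Slice(key.data(), trimmed_length)`.

```c++
#include <string_view>

// Returns a view of `key` without its trailing ' ' characters. No bytes are
// copied, so the result is only valid while the original buffer is alive.
std::string_view RemoveTrailingSpaces(std::string_view key) {
  while (!key.empty() && key.back() == ' ') {
    key.remove_suffix(1);
  }
  return key;
}
```

Interior spaces are kept, and a key made up only of spaces trims to the empty key.
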

Advanced applications may provide a filter policy that does not use a Bloom
filter but uses some other mechanism for summarizing a set of keys. See
`leveldb/filter_policy.h` for details.

## Checksums

leveldb associates checksums with all data it stores in the file system. There
are two separate controls provided over how aggressively these checksums are
verified:

`ReadOptions::verify_checksums` may be set to true to force checksum
verification of all data that is read from the file system on behalf of a
particular read. By default, no such verification is done.

`Options::paranoid_checks` may be set to true before opening a database to make
the database implementation raise an error as soon as it detects an internal
corruption. Depending on which portion of the database has been corrupted, the
error may be raised when the database is opened, or later by another database
operation. By default, paranoid checking is off so that the database can be used
even if parts of its persistent storage have been corrupted.

If a database is corrupted (perhaps it cannot be opened when paranoid checking
is turned on), the `leveldb::RepairDB` function may be used to recover as much
of the data as possible.

## Approximate Sizes

The `GetApproximateSizes` method can be used to get the approximate number of
bytes of file system space used by one or more key ranges.

```c++
leveldb::Range ranges[2];
ranges[0] = leveldb::Range("a", "c");
ranges[1] = leveldb::Range("x", "z");
uint64_t sizes[2];
db->GetApproximateSizes(ranges, 2, sizes);
```

The preceding call will set `sizes[0]` to the approximate number of bytes of
file system space used by the key range `[a..c)` and `sizes[1]` to the
approximate number of bytes used by the key range `[x..z)`.
## Environment

All file operations (and other operating system calls) issued by the leveldb
implementation are routed through a `leveldb::Env` object. Sophisticated clients
may wish to provide their own Env implementation to get better control.
For example, an application may introduce artificial delays in the file IO
paths to limit the impact of leveldb on other activities in the system.

```c++
class SlowEnv : public leveldb::Env {
  ... implementation of the Env interface ...
};

SlowEnv env;
leveldb::Options options;
options.env = &env;
leveldb::Status s = leveldb::DB::Open(options, ...);
```

## Porting

leveldb may be ported to a new platform by providing platform specific
implementations of the types/methods/functions exported by
`leveldb/port/port.h`. See `leveldb/port/port_example.h` for more details.

In addition, the new platform may need a new default `leveldb::Env`
implementation. See `leveldb/util/env_posix.h` for an example.

## Other Information

Details about the leveldb implementation may be found in the following
documents:

1. [Implementation notes](impl.md)
2. [Format of an immutable Table file](table_format.md)
3. [Format of a log file](log_format.md)