- <!DOCTYPE html>
- <html>
- <head>
- <link rel="stylesheet" type="text/css" href="doc.css" />
- <title>Leveldb file layout and compactions</title>
- </head>
-
- <body>
-
- <h1>Files</h1>
-
- The implementation of leveldb is similar in spirit to the
- representation of a single
- <a href="http://labs.google.com/papers/bigtable.html">
- Bigtable tablet (section 5.3)</a>.
- However, the organization of the files that make up the representation
- is somewhat different and is explained below.
-
- <p>
- Each database is represented by a set of files stored in a directory.
- There are several different types of files as documented below:
- <p>
- <h2>Log files</h2>
- <p>
- A log file (*.log) stores a sequence of recent updates. Each update
- is appended to the current log file. When the log file reaches a
- pre-determined size (approximately 1MB by default), it is converted
- to a sorted table (see below) and a new log file is created for future
- updates.
- <p>
- A copy of the current log file is kept in an in-memory structure (the
- <code>memtable</code>). This copy is consulted on every read so that read
- operations reflect all logged updates.
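- <p>
- As a rough sketch (the class and method names below are hypothetical,
- not leveldb's actual API), the write path appends each update to the
- current log file and then mirrors it in the memtable:
- <pre>
- #include &lt;cstdio>
- #include &lt;map>
- #include &lt;string>
-
- // Toy write path: append the update to the current log file, then
- // mirror it in the in-memory table so reads see all logged updates.
- class ToyDB {
-  public:
-   explicit ToyDB(const std::string& log_name)
-       : log_(std::fopen(log_name.c_str(), "ab")) {}
-   ~ToyDB() { if (log_ != NULL) std::fclose(log_); }
-
-   void Put(const std::string& key, const std::string& value) {
-     // 1. Durability: record the update in the current *.log file.
-     std::fprintf(log_, "%zu %s %zu %s\n", key.size(), key.c_str(),
-                  value.size(), value.c_str());
-     std::fflush(log_);
-     // 2. Visibility: mirror the update in the memtable.
-     memtable_[key] = value;
-   }
-
-   bool Get(const std::string& key, std::string* value) const {
-     std::map&lt;std::string, std::string>::const_iterator it = memtable_.find(key);
-     if (it == memtable_.end()) return false;  // would fall back to the sorted tables
-     *value = it->second;
-     return true;
-   }
-
-  private:
-   std::FILE* log_;                                // current *.log file
-   std::map&lt;std::string, std::string> memtable_;   // sorted in-memory copy
- };
- </pre>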
- <p>
- <h2>Sorted tables</h2>
- <p>
- A sorted table (*.sst) stores a sequence of entries sorted by key.
- Each entry is either a value for the key, or a deletion marker for the
- key. (Deletion markers are kept around to hide obsolete values
- present in older sorted tables).
- <p>
- The set of sorted tables is organized into a sequence of levels. The
- sorted table generated from a log file is placed in a special <code>young</code>
- level (also called level-0). When the number of young files exceeds a
- certain threshold (currently four), all of the young files are merged
- together with all of the overlapping level-1 files to produce a
- sequence of new level-1 files (we create a new level-1 file for every
- 2MB of data).
- <p>
- Files in the young level may contain overlapping keys. However, files
- in other levels have distinct non-overlapping key ranges. Consider
- level number L where L >= 1. When the combined size of files in
- level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2,
- ...), one file in level-L, and all of the overlapping files in
- level-(L+1) are merged to form a set of new files for level-(L+1).
- These merges have the effect of gradually migrating new updates from
- the young level to the largest level using only bulk reads and writes
- (i.e., minimizing expensive seeks).
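- <p>
- The size rule above is simple enough to write down directly; the helper
- below is an illustrative sketch of that rule rather than leveldb's
- actual interface.
- <pre>
- #include &lt;cstdint>
- #include &lt;cstdio>
-
- // Size threshold described above: level-L may hold up to (10^L) MB of
- // data for L >= 1.  (Level-0 is triggered by file count, not total size.)
- static int64_t MaxBytesForLevel(int level) {
-   int64_t result = 10 * 1048576;  // 10MB for level-1
-   while (level > 1) {
-     result *= 10;
-     level--;
-   }
-   return result;
- }
-
- int main() {
-   for (int level = 1; level &lt;= 4; level++) {
-     std::printf("level-%d limit: %lld MB\n", level,
-                 static_cast&lt;long long>(MaxBytesForLevel(level) / 1048576));
-   }
- }
- </pre>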
-
- <h2>Large value files</h2>
- <p>
- Each large value (greater than 64KB by default) is placed in a large
- value file (*.val) of its own. An entry is maintained in the log
- and/or sorted tables that maps from the corresponding key to the
- name of this large value file. The name of the large value file
- is derived from a SHA1 hash of the value and its length so that
- identical values share the same file.
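- <p>
- A sketch of the naming scheme (std::hash stands in for SHA1 below purely
- to keep the example dependency-free; the point is only that the file name
- is a deterministic function of the value and its length):
- <pre>
- #include &lt;cstdio>
- #include &lt;functional>
- #include &lt;string>
-
- // Derive the large value file name from a hash of the value and its
- // length, so identical values always map to the same *.val file.
- // std::hash is a stand-in for SHA1 here.
- static std::string LargeValueFileName(const std::string& value) {
-   size_t h = std::hash&lt;std::string>()(value + "/" + std::to_string(value.size()));
-   char buf[64];
-   std::snprintf(buf, sizeof(buf), "%016zx-%zu.val", h, value.size());
-   return buf;
- }
-
- int main() {
-   std::string big(100000, 'x');  // larger than the 64KB threshold
-   std::printf("%s\n", LargeValueFileName(big).c_str());
- }
- </pre>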
- <p>
- <h2>Manifest</h2>
- <p>
- A MANIFEST file lists the set of sorted tables that make up each
- level, the corresponding key ranges, and other important metadata.
- A new MANIFEST file (with a new number embedded in the file name)
- is created whenever the database is reopened. The MANIFEST file is
- formatted as a log, and changes made to the serving state (as files
- are added or removed) are appended to this log.
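- <p>
- The information carried by one such appended record might look roughly
- like the following sketch (a hypothetical struct; the real record is a
- binary log entry):
- <pre>
- #include &lt;cstdint>
- #include &lt;string>
- #include &lt;utility>
- #include &lt;vector>
-
- // Hypothetical shape of one record appended to the MANIFEST log whenever
- // the serving state changes (e.g., after a compaction or a log switch).
- struct FileInfo {
-   uint64_t number;           // table file number
-   uint64_t size;             // file size in bytes
-   std::string smallest_key;  // key range covered by the file
-   std::string largest_key;
- };
-
- struct VersionEditSketch {
-   std::vector&lt;std::pair&lt;int, FileInfo> > added_files;    // (level, file)
-   std::vector&lt;std::pair&lt;int, uint64_t> > deleted_files;  // (level, file number)
-   uint64_t log_number;      // current log file, so recovery knows where to resume
-   uint64_t last_sequence;   // largest sequence number used so far
- };
- </pre>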
- <p>
- <h2>Current</h2>
- <p>
- CURRENT is a simple text file that contains the name of the latest
- MANIFEST file.
- <p>
- <h2>Info logs</h2>
- <p>
- Informational messages are printed to files named LOG and LOG.old.
- <p>
- <h2>Others</h2>
- <p>
- Other files used for miscellaneous purposes may also be present
- (LOCK, *.dbtmp).
-
- <h1>Level 0</h1>
- When the log file grows above a certain size (1MB by default):
- <ul>
- <li>Write the contents of the current memtable to an sstable
- <li>Replace the current memtable by a brand new empty memtable
- <li>Switch to a new log file
- <li>Delete the old log file and the old memtable
- </ul>
- Experimental measurements show that generating an sstable from a 1MB
- log file takes ~12ms, which seems like an acceptable latency hiccup to
- add infrequently to a log write.
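- <p>
- The log-switch sequence above can be sketched as follows (hypothetical
- helper functions; the real code lives elsewhere in the implementation):
- <pre>
- #include &lt;map>
- #include &lt;string>
-
- typedef std::map&lt;std::string, std::string> MemTable;
-
- // Stubs standing in for the real file I/O:
- void WriteLevel0Table(const MemTable& mem) { /* dump sorted contents to NNNN.sst */ }
- void SwitchToNewLogFile() { /* create a fresh *.log and point writes at it */ }
- void DeleteOldLogFile() { /* the old log's contents now live in the sstable */ }
-
- // The four steps listed above, in order.
- void CompactMemTable(MemTable* mem) {
-   WriteLevel0Table(*mem);   // 1. persist the memtable as a level-0 sstable
-   mem->clear();             // 2. replace it with a brand new empty memtable
-   SwitchToNewLogFile();     // 3. direct future updates to a new log file
-   DeleteOldLogFile();       // 4. the old log and old memtable are discarded
- }
- </pre>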
-
- <p>
- The new sstable is added to a special level-0. Level-0 contains
- a set of files (up to 4 by default). However, unlike other levels,
- these files do not cover disjoint ranges, but may overlap each other.
-
- <h1>Compactions</h1>
-
- <p>
- When the size of level L exceeds its limit, we compact it in a
- background thread. The compaction picks a file from level L and all
- overlapping files from the next level L+1. Note that if a level-L
- file overlaps only part of a level-(L+1) file, the entire file at
- level-(L+1) is used as an input to the compaction and will be
- discarded after the compaction. Aside: because level-0 is special
- (files in it may overlap each other), we treat compactions from
- level-0 to level-1 specially: a level-0 compaction may pick more than
- one level-0 file in case some of these files overlap each other.
-
- <p>
- A compaction merges the contents of the picked files to produce a
- sequence of level-(L+1) files. We switch to producing a new
- level-(L+1) file after the current output file has reached the target
- file size (2MB). The old files are discarded and the new files are
- added to the serving state.
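- <p>
- The output-rolling rule can be sketched like this (the inputs are modeled
- as already-merged key/value pairs; the real code drives a merging
- iterator over the picked files):
- <pre>
- #include &lt;cstdint>
- #include &lt;map>
- #include &lt;string>
- #include &lt;utility>
- #include &lt;vector>
-
- // Roll over to a new level-(L+1) output file once the current one
- // reaches the target size (2MB).
- static const uint64_t kTargetFileSize = 2 * 1048576;
-
- struct OutputFile {
-   std::vector&lt;std::pair&lt;std::string, std::string> > entries;
-   uint64_t bytes;
-   OutputFile() : bytes(0) {}
- };
-
- std::vector&lt;OutputFile> CompactToOutputs(
-     const std::map&lt;std::string, std::string>& merged_inputs) {
-   std::vector&lt;OutputFile> outputs(1);
-   for (std::map&lt;std::string, std::string>::const_iterator it = merged_inputs.begin();
-        it != merged_inputs.end(); ++it) {
-     if (outputs.back().bytes >= kTargetFileSize) {
-       outputs.push_back(OutputFile());   // switch to a new output file
-     }
-     outputs.back().entries.push_back(*it);
-     outputs.back().bytes += it->first.size() + it->second.size();
-   }
-   return outputs;
- }
- </pre>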
-
- <p>
- Compactions for a particular level rotate through the key space. In
- more detail, for each level L, we remember the ending key of the last
- compaction at level L. The next compaction for level L will pick the
- first file that starts after this key (wrapping around to the
- beginning of the key space if there is no such file).
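- <p>
- A sketch of the rotation rule (illustrative names; the remembered ending
- key of the last compaction is called compact_pointer here):
- <pre>
- #include &lt;string>
- #include &lt;vector>
-
- struct FileRange {
-   std::string smallest;   // first key covered by the file
-   std::string largest;    // last key covered by the file
- };
-
- // Pick the first file in this level whose range starts after the ending
- // key of the previous compaction, wrapping around if there is none.
- int PickCompactionFile(const std::vector&lt;FileRange>& level_files,
-                        const std::string& compact_pointer) {
-   for (size_t i = 0; i &lt; level_files.size(); i++) {
-     if (level_files[i].smallest > compact_pointer) {
-       return static_cast&lt;int>(i);
-     }
-   }
-   return level_files.empty() ? -1 : 0;  // wrap to the start of the key space
- }
- </pre>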
-
- <p>
- Compactions drop overwritten values. They also drop deletion markers
- if there are no higher numbered levels that contain a file whose range
- overlaps the current key.
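- <p>
- A sketch of the deletion-marker check (illustrative structures; the real
- check consults the per-level file metadata):
- <pre>
- #include &lt;string>
- #include &lt;vector>
-
- struct SSTable {
-   std::string smallest;
-   std::string largest;
- };
-
- // A deletion marker for `key`, seen while compacting into level
- // `output_level`, may be dropped only if no higher numbered level has a
- // file whose key range could still contain an older value for `key`.
- bool CanDropDeletionMarker(const std::vector&lt;std::vector&lt;SSTable> >& levels,
-                            int output_level, const std::string& key) {
-   for (size_t level = output_level + 1; level &lt; levels.size(); level++) {
-     for (size_t i = 0; i &lt; levels[level].size(); i++) {
-       const SSTable& f = levels[level][i];
-       if (key >= f.smallest && key &lt;= f.largest) {
-         return false;  // an obsolete value for key may still live down here
-       }
-     }
-   }
-   return true;  // nothing below can hide behind the marker: drop it
- }
- </pre>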
-
- <h2>Timing</h2>
-
- Level-0 compactions will read up to four 1MB files from level-0, and
- at worst all the level-1 files (10MB). I.e., we will read 14MB and
- write 14MB.
-
- <p>
- Other than the special level-0 compactions, we will pick one 2MB file
- from level L. In the worst case, this will overlap ~12 files from
- level-(L+1) (10 because level-(L+1) is ten times the size of level-L,
- and another two at the boundaries since the file ranges at level-L
- will usually not be aligned with those at level-(L+1)). The
- compaction will therefore read 26MB and write 26MB. Assuming a disk
- IO rate of 100MB/s (a ballpark figure for modern drives), the worst
- compaction cost will be approximately 0.5 seconds.
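- <p>
- The arithmetic, spelled out (default sizes assumed):
- <pre>
- #include &lt;cstdio>
-
- int main() {
-   const double from_level_L = 2;           // one 2MB file from level-L
-   const double from_next_level = 12 * 2;   // ~12 overlapping 2MB files from level-(L+1)
-   const double read_mb = from_level_L + from_next_level;   // 26MB read
-   const double write_mb = read_mb;                         // 26MB written back out
-   const double disk_rate_mb_per_s = 100;                   // ballpark figure
-   std::printf("worst case: %.0fMB read + %.0fMB write at %.0fMB/s = %.2fs\n",
-               read_mb, write_mb, disk_rate_mb_per_s,
-               (read_mb + write_mb) / disk_rate_mb_per_s);
- }
- </pre>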
-
- <p>
- If we throttle the background writing to something small, say 10% of
- the full 100MB/s speed, a compaction may take up to 5 seconds. If the
- user is writing at 10MB/s, we might build up lots of level-0 files
- (~50 to hold the 5*10MB). This may significantly increase the cost of
- reads due to the overhead of merging more files together on every
- read.
-
- <p>
- Solution 1: To reduce this problem, we might want to increase the log
- switching threshold when the number of level-0 files is large. The
- downside is that the larger this threshold, the longer the latency
- hiccup we add to a write that triggers a log switch.
-
- <p>
- Solution 2: We might want to decrease write rate artificially when the
- number of level-0 files goes up.
-
- <p>
- Solution 3: We could work on reducing the cost of very wide merges.
- Perhaps most of the level-0 files will have their blocks sitting
- uncompressed in the cache and we will only need to worry about the
- O(N) complexity in the merging iterator.
-
- <h2>Number of files</h2>
-
- Instead of always making 2MB files, we could make larger files for
- larger levels to reduce the total file count, though at the expense of
- more bursty compactions. Alternatively, we could shard the set of
- files into multiple directories.
-
- <p>
- An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows
- the following timings to do 100K file opens in directories with
- varying number of files:
- <table class="datatable">
- <tr><th>Files in directory</th><th>Microseconds to open a file</th></tr>
- <tr><td>1000</td><td>9</td></tr>
- <tr><td>10000</td><td>10</td></tr>
- <tr><td>100000</td><td>16</td></tr>
- </table>
- So maybe even the sharding is not necessary on modern filesystems?
-
- <h1>Recovery</h1>
-
- <ul>
- <li> Read CURRENT to find the name of the latest committed MANIFEST
- <li> Read the named MANIFEST file
- <li> Clean up stale files
- <li> We could open all sstables here, but it is probably better to be lazy...
- <li> Convert log chunk to a new level-0 sstable
- <li> Start directing new writes to a new log file with recovered sequence#
- </ul>
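- <p>
- The first two steps can be sketched as follows (hypothetical helpers):
- <pre>
- #include &lt;fstream>
- #include &lt;string>
-
- // CURRENT holds the name of the latest committed MANIFEST.
- std::string ReadCurrentFile(const std::string& dbname) {
-   std::ifstream in((dbname + "/CURRENT").c_str());
-   std::string manifest_name;
-   std::getline(in, manifest_name);   // e.g. "MANIFEST-000123"
-   return manifest_name;
- }
-
- void Recover(const std::string& dbname) {
-   std::string manifest = dbname + "/" + ReadCurrentFile(dbname);
-   // Replay every record in `manifest` to rebuild which sorted tables
-   // belong to which level, then: clean up stale files, convert any
-   // leftover log chunk into a new level-0 sstable, and open a fresh log
-   // file that continues from the recovered sequence number.
- }
- </pre>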
-
- <h1>Garbage collection of files</h1>
-
- <code>DeleteObsoleteFiles()</code> is called at the end of every
- compaction and at the end of recovery. It finds the names of all
- files in the database. It deletes all log files that are not the
- current log file. It deletes all table files that are not referenced
- from some level and are not the output of an active compaction. It
- deletes all large value files that are not referenced from any live
- table or log file.
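- <p>
- A sketch of that rule (illustrative signatures; the real routine also
- derives the live set from the current serving state and any pending
- compactions):
- <pre>
- #include &lt;set>
- #include &lt;string>
- #include &lt;vector>
-
- // `live` holds the table and large value file names referenced by some
- // level or by an active compaction; `current_log` is the log in use.
- void DeleteObsoleteFiles(const std::vector&lt;std::string>& all_files,
-                          const std::string& current_log,
-                          const std::set&lt;std::string>& live) {
-   for (size_t i = 0; i &lt; all_files.size(); i++) {
-     const std::string& name = all_files[i];
-     bool keep;
-     if (name.size() > 4 && name.compare(name.size() - 4, 4, ".log") == 0) {
-       keep = (name == current_log);       // only the current log file survives
-     } else if (name == "CURRENT" || name.rfind("MANIFEST", 0) == 0 ||
-                name == "LOCK" || name == "LOG" || name == "LOG.old") {
-       keep = true;                        // metadata files are kept in this sketch
-     } else {
-       keep = (live.count(name) > 0);      // *.sst / *.val must be referenced
-     }
-     if (!keep) {
-       // delete dbname + "/" + name via the environment in the real code
-     }
-   }
- }
- </pre>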
-
- </body>
- </html>