|  | <!DOCTYPE html> | 
						
						
							|  | <html> | 
						
						
							|  | <head> | 
						
						
							|  | <link rel="stylesheet" type="text/css" href="doc.css" /> | 
						
						
							|  | <title>Leveldb file layout and compactions</title> | 
						
						
							|  | </head> | 
						
						
							|  | 
 | 
						
						
							|  | <body> | 
						
						
							|  | 
 | 
						
						
							|  | <h1>Files</h1> | 
						
						
							|  | 
 | 
						
						
							|  | The implementation of leveldb is similar in spirit to the | 
						
						
							|  | representation of a single | 
						
						
							|  | <a href="http://labs.google.com/papers/bigtable.html"> | 
						
						
							|  | Bigtable tablet (section 5.3)</a>. | 
						
						
							|  | However the organization of the files that make up the representation | 
						
						
							|  | is somewhat different and is explained below. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Each database is represented by a set of file stored in a directory. | 
						
						
							|  | There are several different types of files as documented below: | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Log files</h2> | 
						
						
							|  | <p> | 
						
						
							|  | A log file (*.log) stores a sequence of recent updates.  Each update | 
						
						
							|  | is appended to the current log file.  When the log file reaches a | 
						
						
							|  | pre-determined size (approximately 1MB by default), it is converted | 
						
						
							|  | to a sorted table (see below) and a new log file is created for future | 
						
						
							|  | updates. | 
						
						
							|  | <p> | 
						
						
							|  | A copy of the current log file is kept in an in-memory structure (the | 
						
						
							|  | <code>memtable</code>).  This copy is consulted on every read so that read | 
						
						
							|  | operations reflect all logged updates. | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Sorted tables</h2> | 
						
						
							|  | <p> | 
						
						
							|  | A sorted table (*.sst) stores a sequence of entries sorted by key. | 
						
						
							|  | Each entry is either a value for the key, or a deletion marker for the | 
						
						
							|  | key.  (Deletion markers are kept around to hide obsolete values | 
						
						
							|  | present in older sorted tables). | 
						
						
							|  | <p> | 
						
						
							|  | The set of sorted tables are organized into a sequence of levels.  The | 
						
						
							|  | sorted table generated from a log file is placed in a special <code>young</code> | 
						
						
							|  | level (also called level-0).  When the number of young files exceeds a | 
						
						
							|  | certain threshold (currently four), all of the young files are merged | 
						
						
							|  | together with all of the overlapping level-1 files to produce a | 
						
						
							|  | sequence of new level-1 files (we create a new level-1 file for every | 
						
						
							|  | 2MB of data.) | 
						
						
							|  | <p> | 
						
						
							|  | Files in the young level may contain overlapping keys.  However files | 
						
						
							|  | in other levels have distinct non-overlapping key ranges.  Consider | 
						
						
							|  | level number L where L >= 1.  When the combined size of files in | 
						
						
							|  | level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, | 
						
						
							|  | ...), one file in level-L, and all of the overlapping files in | 
						
						
							|  | level-(L+1) are merged to form a set of new files for level-(L+1). | 
						
						
							|  | These merges have the effect of gradually migrating new updates from | 
						
						
							|  | the young level to the largest level using only bulk reads and writes | 
						
						
							|  | (i.e., minimizing expensive seeks). | 
						
						
							|  | 
 | 
						
						
							|  | <h2>Large value files</h2> | 
						
						
							|  | <p> | 
						
						
							|  | Each large value (greater than 64KB by default) is placed in a large | 
						
						
							|  | value file (*.val) of its own.  An entry is maintained in the log | 
						
						
							|  | and/or sorted tables that maps from the corresponding key to the | 
						
						
							|  | name of this large value file.  The name of the large value file | 
						
						
							|  | is derived from a SHA1 hash of the value and its length so that | 
						
						
							|  | identical values share the same file. | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Manifest</h2> | 
						
						
							|  | <p> | 
						
						
							|  | A MANIFEST file lists the set of sorted tables that make up each | 
						
						
							|  | level, the corresponding key ranges, and other important metadata. | 
						
						
							|  | A new MANIFEST file (with a new number embedded in the file name) | 
						
						
							|  | is created whenever the database is reopened.  The MANIFEST file is | 
						
						
							|  | formatted as a log, and changes made to the serving state (as files | 
						
						
							|  | are added or removed) are appended to this log. | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Current</h2> | 
						
						
							|  | <p> | 
						
						
							|  | CURRENT is a simple text file that contains the name of the latest | 
						
						
							|  | MANIFEST file. | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Info logs</h2> | 
						
						
							|  | <p> | 
						
						
							|  | Informational messages are printed to files named LOG and LOG.old. | 
						
						
							|  | <p> | 
						
						
							|  | <h2>Others</h2> | 
						
						
							|  | <p> | 
						
						
							|  | Other files used for miscellaneous purposes may also be present | 
						
						
							|  | (LOCK, *.dbtmp). | 
						
						
							|  | 
 | 
						
						
							|  | <h1>Level 0</h1> | 
						
						
							|  | When the log file grows above a certain size (1MB by default): | 
						
						
							|  | <ul> | 
						
						
							|  | <li>Write the contents of the current memtable to an sstable | 
						
						
							|  | <li>Replace the current memtable by a brand new empty memtable | 
						
						
							|  | <li>Switch to a new log file | 
						
						
							|  | <li>Delete the old log file and the old memtable | 
						
						
							|  | </ul> | 
						
						
							|  | Experimental measurements show that generating an sstable from a 1MB | 
						
						
							|  | log file takes ~12ms, which seems like an acceptable latency hiccup to | 
						
						
							|  | add infrequently to a log write. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | The new sstable is added to a special level-0 level.  level-0 contains | 
						
						
							|  | a set of files (up to 4 by default).  However unlike other levels, | 
						
						
							|  | these files do not cover disjoint ranges, but may overlap each other. | 
						
						
							|  | 
 | 
						
						
							|  | <h1>Compactions</h1> | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | When the size of level L exceeds its limit, we compact it in a | 
						
						
							|  | background thread.  The compaction picks a file from level L and all | 
						
						
							|  | overlapping files from the next level L+1.  Note that if a level-L | 
						
						
							|  | file overlaps only part of a level-(L+1) file, the entire file at | 
						
						
							|  | level-(L+1) is used as an input to the compaction and will be | 
						
						
							|  | discarded after the compaction.  Aside: because level-0 is special | 
						
						
							|  | (files in it may overlap each other), we treat compactions from | 
						
						
							|  | level-0 to level-1 specially: a level-0 compaction may pick more than | 
						
						
							|  | one level-0 file in case some of these files overlap each other. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | A compaction merges the contents of the picked files to produce a | 
						
						
							|  | sequence of level-(L+1) files.  We switch to producing a new | 
						
						
							|  | level-(L+1) file after the current output file has reached the target | 
						
						
							|  | file size (2MB).  The old files are discarded and the new files are | 
						
						
							|  | added to the serving state. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Compactions for a particular level rotate through the key space.  In | 
						
						
							|  | more detail, for each level L, we remember the ending key of the last | 
						
						
							|  | compaction at level L.  The next compaction for level L will pick the | 
						
						
							|  | first file that starts after this key (wrapping around to the | 
						
						
							|  | beginning of the key space if there is no such file). | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Compactions drop overwritten values.  They also drop deletion markers | 
						
						
							|  | if there are no higher numbered levels that contain a file whose range | 
						
						
							|  | overlaps the current key. | 
						
						
							|  | 
 | 
						
						
							|  | <h2>Timing</h2> | 
						
						
							|  | 
 | 
						
						
							|  | Level-0 compactions will read up to four 1MB files from level-0, and | 
						
						
							|  | at worst all the level-1 files (10MB).  I.e., we will read 14MB and | 
						
						
							|  | write 14MB. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Other than the special level-0 compactions, we will pick one 2MB file | 
						
						
							|  | from level L.  In the worst case, this will overlap ~ 12 files from | 
						
						
							|  | level L+1 (10 because level-(L+1) is ten times the size of level-L, | 
						
						
							|  | and another two at the boundaries since the file ranges at level-L | 
						
						
							|  | will usually not be aligned with the file ranges at level-L+1).  The | 
						
						
							|  | compaction will therefore read 26MB and write 26MB.  Assuming a disk | 
						
						
							|  | IO rate of 100MB/s (ballpark range for modern drives), the worst | 
						
						
							|  | compaction cost will be approximately 0.5 second. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | If we throttle the background writing to something small, say 10% of | 
						
						
							|  | the full 100MB/s speed, a compaction may take up to 5 seconds.  If the | 
						
						
							|  | user is writing at 10MB/s, we might build up lots of level-0 files | 
						
						
							|  | (~50 to hold the 5*10MB).  This may signficantly increase the cost of | 
						
						
							|  | reads due to the overhead of merging more files together on every | 
						
						
							|  | read. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Solution 1: To reduce this problem, we might want to increase the log | 
						
						
							|  | switching threshold when the number of level-0 files is large.  Though | 
						
						
							|  | the downside is that the larger this threshold, the larger the delay | 
						
						
							|  | that we will add to write latency when a write triggers a log switch. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Solution 2: We might want to decrease write rate artificially when the | 
						
						
							|  | number of level-0 files goes up. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | Solution 3: We work on reducing the cost of very wide merges. | 
						
						
							|  | Perhaps most of the level-0 files will have their blocks sitting | 
						
						
							|  | uncompressed in the cache and we will only need to worry about the | 
						
						
							|  | O(N) complexity in the merging iterator. | 
						
						
							|  | 
 | 
						
						
							|  | <h2>Number of files</h2> | 
						
						
							|  | 
 | 
						
						
							|  | Instead of always making 2MB files, we could make larger files for | 
						
						
							|  | larger levels to reduce the total file count, though at the expense of | 
						
						
							|  | more bursty compactions.  Alternatively, we could shard the set of | 
						
						
							|  | files into multiple directories. | 
						
						
							|  | 
 | 
						
						
							|  | <p> | 
						
						
							|  | An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows | 
						
						
							|  | the following timings to do 100K file opens in directories with | 
						
						
							|  | varying number of files: | 
						
						
							|  | <table class="datatable"> | 
						
						
							|  | <tr><th>Files in directory</th><th>Microseconds to open a file</th></tr> | 
						
						
							|  | <tr><td>1000</td><td>9</td> | 
						
						
							|  | <tr><td>10000</td><td>10</td> | 
						
						
							|  | <tr><td>100000</td><td>16</td> | 
						
						
							|  | </table> | 
						
						
							|  | So maybe even the sharding is not necessary on modern filesystems? | 
						
						
							|  | 
 | 
						
						
							|  | <h1>Recovery</h1> | 
						
						
							|  | 
 | 
						
						
							|  | <ul> | 
						
						
							|  | <li> Read CURRENT to find name of the latest committed MANIFEST | 
						
						
							|  | <li> Read the named MANIFEST file | 
						
						
							|  | <li> Clean up stale files | 
						
						
							|  | <li> We could open all sstables here, but it is probably better to be lazy... | 
						
						
							|  | <li> Convert log chunk to a new level-0 sstable | 
						
						
							|  | <li> Start directing new writes to a new log file with recovered sequence# | 
						
						
							|  | </ul> | 
						
						
							|  | 
 | 
						
						
							|  | <h1>Garbage collection of files</h1> | 
						
						
							|  | 
 | 
						
						
							|  | <code>DeleteObsoleteFiles()</code> is called at the end of every | 
						
						
							|  | compaction and at the end of recovery.  It finds the names of all | 
						
						
							|  | files in the database.  It deletes all log files that are not the | 
						
						
							|  | current log file.  It deletes all table files that are not referenced | 
						
						
							|  | from some level and are not the output of an active compaction.  It | 
						
						
							|  | deletes all large value files that are not referenced from any live | 
						
						
							|  | table or log file. | 
						
						
							|  | 
 | 
						
						
							|  | </body> | 
						
						
							|  | </html>
 |