WO2006064498A2

WO2006064498A2 - Transactional flash file system for microcontrollers and embedded systems

Info

Publication number: WO2006064498A2
Application number: PCT/IL2005/001342
Authority: WO
Inventors: Eran Gal; Michal Spivak; Sivan Avraham Toledo
Original assignee: Ramot At Tel-Aviv University Ltd.
Priority date: 2004-12-16
Filing date: 2005-12-13
Publication date: 2006-06-22
Also published as: WO2006064498A3

Abstract

A transactional file system for a flash memory is based on persistent data structures that preferably are versloned trees. Only two copies of each tree are maintained: a read-write current version and a read-only previous version. Each internal node of each tree includes a spare pointer. Internal nodes are modifies by changing their spare pointers. Tentative changes to the current versions are marked using commit andabort flags in the internal nodes, and corresponding indications are logged. The tentative changes are committed or aborted in accordance with the logged indications. Awitness marker in the log and a witness heqder eleswhare in the flash memory are used at boot time to identify the log.

Description

TRANSACTIONAL FLASH FILE SYSTEM FOR MICROCONTROLLERS

AND EMBEDDED SYSTEMS

FIELD AND BACKGROUND OF THE INVENTION The present invention relates to a file system for flash memories and, more particularly, to a transactional flash file system for microcontrollers and for small embedded systems.

Flash memory is a type of electrically erasable programmable read-only memory (EEPROM). Flash memory is nonvolatile (retains its content without power), so flash memory is used to store files and other persistent objects in workstations and servers (for the BIOS), in handheld computers and mobile phones, in digital cameras, and in portable music players.

The read/write/erase behavior of flash memory is radically different than that of other programmable memories, such as volatile RAM and magnetic disks. Perhaps more importantly, memory cells in a flash device (as well as in other types of EEPROMs) can be written to only a limited number of times, between 10,000 and 1,000,000, after which they wear out and become unreliable.

Flash memories come in three forms: on-chip memories in system-on-a-chip microcontrollers, standalone chips for board-level integration and removable memory devices (USB sticks, SmartMedia cards, CompactFlash cards, and so on). The file system of the present invention is designed for system-on-a-chip microcontrollers that include flash memories and for on-board stand-alone chips. The file system is particularly suited for devices with very little RAM (system-on-a-chip microcontrollers often include only 1-2 KB of RAM).

Flash memories also come in several flavors with respect to how reads and writes are performed. The two main categories are NOR flash, which behaves much like a conventional EEPROM device, and NAND flash, which behaves like a block device. But even within each category there are many flavors, especially with respect to how writes are performed.

Because flash memories, especially NOR flash, have very different performance characteristics than magnetic disks, file systems designed for disks are usually not appropriate for flash. Flash memories are random access devices that do not benefit at all from access patterns with temporal locality (rapid repeated access to the same location). NOR flash memories do not benefit from spatial locality in read and write accesses; some benefit from sequential access to blocks of a few bytes. Spatial locality is important in erasures — performance is best when much of the data in a block becomes obsolete roughly at the same time. The only disk-oriented file systems that addresses at least some of these issues are log- structured file systems, which indeed have been used on flash devices, often with flash-specific adaptations [S, P]. But even log-structured file systems ignore some of the features of flash devices, such as the ability to quickly write a small chunk of data anywhere in the file system. As demonstrated below, by exploiting flash-specific features we obtain a much more efficient file system.

File systems designed for small embedded systems must contend with another challenge, extreme scarcity of resources, especially RAM. Many of the existing flash file systems need large amounts of RAM, usually at least tens of kilobytes, so these file systems are not suitable for small microcontrollers

The remainder of this section provides background information on flash memories. We also cover the basic principles of flash-storage management, but this section does not survey flash file systems, data structure and algorithms; relevant citations are given below.

Flash memories are a type of electrically-erasable programmable read-only memory (EEPROM). EEPROM devices store information using modified mosfet transistors with an additional floating gate. This gate is electrically isolated from the rest of the circuit, but the gate nonetheless can be charged and discharged using a tunneling and/or a hot-electron- injection effect.

Traditional EEPROM devices support three types of operations. The device is memory mapped, so reading is performed using the processor's memory-access instructions, at normal memory-access times (tens of nanoseconds). Writing is performed using a special on-chip controller, not using the processor's memory-access instructions. This operation is usually called programming, and takes much longer than reading; usually a millisecond or more.

Programming can only clear set bits in a word (flip bits from 'I' to '0'), but not vice versa. Traditional EEPROM devices support reprogramming, wherein an already programmed word is programmed again. Reprogramming can only clear additional bits in a word. To set bits, the word must be erased, an operation that is also carried out by the on-chip controller. Erasures often take much longer than even programming, often half a second or more. The word size of traditional EEPROM, which controls the program and erase granularities, is usually one byte. Flash memories, or more precisely, flash-erasable EEPROMs, were invented to circumvent the long erase times of traditional EEPROM, and to achieve denser layouts. Both goals are achieved by replacing the byte-erase feature by a block-erase feature, which operates at roughly the same speed as the byte-erase, about 0.5—1.5 seconds. That is, flash memories also erase slowly, but each erasure erases more bits. Block sizes in flash memories can range from as low as 128 bytes to 64 KB. Herein, these erasure blocks are called erase units.

Many flash devices, and in particular the devices for which the file system of the present invention is designed, differ from traditional EEPROMs only in the size of erase units. That is, these flash devices are memory mapped, these flash devices support fine-granularity programming, and these flash devices support reprogramming. Many of the flash devices that use the so-called NOR organization support all of these features, but some NOR devices do not support reprogramming. In general, devices that store more than one bit per transistor (MLC devices) rarely support reprogramming, while single-bit per transistor devices often support reprogramming, but other factors may also affect reprogrammability.

Flash-device manufacturers offer a large variety of different devices with different features. Some devices support additional programming operations that program several words at a time; some devices have uniform-size erase blocks, but some devices have blocks of several sizes and/or multiple banks, each with a different block size. Devices with NAND organization are essentially block devices — read, write, and erase operations are performed by a controller on fixed-length blocks. A single file-system design is unlikely to suit all of these devices. The file system of the present invention is designed for the type of flash memories that is most commonly used in system-on-a-chip microcontrollers and low-cost standalone flash chips: reprogrammable NOR devices. Storage cells in EEPROM devices wear out. After a certain number of erase-program cycles, a cell can no longer reliably store information. The number of reliable erase-program cycles is random, but device manufacturers specify a guaranteed lower bound. Due to wear, the life of a flash device is greatly influenced by how the device is managed by software: if the software evens out the wear (number of erasures) of different erase units, the device lasts longer until one of the erase units wears out.

Like other memory devices, flash devices are prone to errors. Before the device wears out, most errors are due to manufacturing defects, to external influences, such as heat and voltage fluctuations, and to the program disturb phenomenon. Program disturb refers to charge loss (and hence the loss of a bit) in one memory cell that is caused by the high voltages used to program a nearby cell. In a worn-out device, memory errors are also caused by the device's inability to reliably retain data.

Error detection and correction mechanisms can be built into the hardware of some memory devices, but not into reprogrammable NOR devices. A hardware mechanism uses extra memory cells in each error protection (detection/correction) block. Whenever the contents of a block are written, the hardware mechanism writes an error protection code into the extra cells. Whenever the contents of a block are read, the hardware also reads the error protection code and uses the code to correct errors in the data or to signal a memory error to the rest of the system. For example, some DRAM devices offer error correction on bytes or words, and most disks offer error detection on each 512-byte sector. In reprogrammable NOR devices, such a mechanism cannot work. The problem is that the error protection bits must be programmed when data are first written to the protected block. Subsequent reprogramming (clearing of bits) of the protected block changes the data, so the protection code must change as well. Unless the new error protection code can be stored by only clearing bits in the previous code, which is unlikely, the new code cannot be stored until the next erasure.

Flash devices are used to store data objects. If the size of the data objects matches the size of erase units, then managing the device is fairly simple. A unit is allocated to an object when it is created. When the data object is modified, the modified version is first programmed into an erased unit, and then the previous copy is erased. A mapping structure must be maintained in RAM and/or in the flash memory itself to map application objects to erase units. This organization is used mostly in flash devices that simulate magnetic disks — such devices are often designed with 512-byte erase units that each stores a disk sector.

When data objects are smaller than erase units, a more sophisticated mechanism is required to reclaim space. When an object is modified, the new version is programmed into a not-yet-programmed area in some erase unit. Then the previous version is marked as obsolete. When the system runs out of space, the system finds an erase unit with some obsolete objects, copies the still-valid objects of that unit to free space in other units, and erases the unit. The process of moving the valid data to other units, modifying the object mapping structures, and erasing a unit is called reclamation. The data objects stored in an erase unit can be of uniform size, or of variable size.

SUMMARY OF THE INVENTION

Hereinafter, the file system of the present invention is referred to as "TFFS", which stands for "Transactional Flash File System". TFFS is an innovative file system for flash memories. TFFS is efficient, ensures long device life, and supports general transactions. The design of TFFS is unique. In particular, TFFS uses an innovative data structure, pruned versioned trees, that we developed specifically for flash file systems. General transactions have not been supported heretofore in flash- specific file systems. Although transactions may seem like a luxury for small embedded systems, and we believe that transaction support by the file system can simplify many applications and contribute to the reliability of embedded systems.

According to the present invention there is provided a data structure for storing a plurality of records in a memory of a computer system and for retrieving the records, the data structure including: (a) a current version of a tree, each of whose leaves includes a data object having a retrieval key for one of the records; and (b) a single previous version of the tree.

According to the present invention there is provided a method of storing a plurality of records in a memory of a computer system and of retrieving the records, including the steps of: (a) providing at most two versions of a tree, each of whose leaves includes a data object having a retrieval key for one of the records, a first version of the tree being a current version of the tree and a second version of the tree being a previous version of the tree; and (b) updating the tree by steps including: (i) changing only the current version of the tree.

According to the present invention there is provided a computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of records in a memory of a computer system and for retrieving the records, the computer-readable code including: (a) program code for providing at most two versions of a tree, each of whose leaves includes a data object having a retrieval key for one of the records, a first version of the tree being a current version of the tree and a second version of the tree being a previous version of the tree; and (b) program code for updating the tree by steps including changing only the current version of the tree.

According to the present invention there is provided a persistent data structure for storing a plurality of records in a flash memory of a computer system and for retrieving the records. According to the present invention there is provided a method of storing a plurality of records in a flash memory of a computer system and of retrieving the records, including the steps of: (a) providing a persistent tree that includes a plurality of leaves and a plurality of internal nodes, each leaf including a data object having a retrieval key for one of the records, each internal node including at least one spare pointer; and (b) changing at least one the internal node by modifying one of the at least one spare pointer of each at least one internal node.

According to the present invention there is provided a computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of records in a flash memory of a computer system and for retrieving the records, the computer-readable code including: (a) program code for providing a persistent tree that includes a plurality of leaves and a plurality of internal nodes, each leaf including a data object having a retrieval key for one of the records, each internal node including at least one spare pointer; and (b) program code for changing at least one the internal node by modifying one of the at least one spare pointer of each at least one internal node.

According to the present invention there is provided a file system, for storing a plurality of files in a memory of a computer system and for retrieving the files, including: (a) for each file, a corresponding data structure including a mechanism for recording tentative changes to the data structure by transactions of the file system; and (b) a log of indications of the tentative changes.

According to the present invention there is provided a method of storing a plurality of files in a nonvolatile memory of a computer system and of retrieving the files, including the steps of: (a) providing, for each file, a corresponding data structure including a mechanism for recording tentative changes to the data structure; and (b) logging indications of the tentative changes.

According to the present invention there is provided a computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of files in a nonvolatile memory of a computer system and for retrieving the files, the computer code including: (a) program code for providing, for each file, a corresponding data structure including a mechanism for recording tentative changes to the data structure; and (b) program code for logging indications of the tentative changes.

According to the present invention, in a computer system that includes a nonvolatile memory wherein is stored a data structure having a root, there is provided a method of managing the data structure so that the root of the data structure is found unambiguously at boot time, including the steps of: (a) storing, in the nonvolatile memory, a witness record corresponding to the data structure; (b) storing, in a header of the root, a witness marker corresponding to the witness record; and (c) upon booting the computer system: upon finding two candidate headers, each candidate header including a witness marker: accepting, as the header of the root, the candidate header whose witness marker corresponds to the witness record.

According to the present invention there is provided a computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for managing a data structure, having a root, that is stored in a nonvolatile memory of a computer system, so that the root of the data structure is found unambiguously at boot time, the computer-readable code including: (a) program code for storing, in the nonvolatile memory, a witness record corresponding to the data structure; (b) program code for storing, in a header of the root, a witness marker corresponding to the witness record; and (c) program code for: upon booting the computer system: upon finding two candidate headers, each candidate header including a witness marker: accepting, as the header of the root, the candidate header whose witness marker corresponds to the witness record.

A first data structure of the present invention, for storing a plurality of records in a memory of a computer system and for retrieving the records from the memory, includes exactly two versions of a tree: a current version and a single previous version. This is in contrast to prior art persistent trees, that have no upper limit on the allowed number of previous versions. Each leaf of the tree includes a data object that has a retrieval key for one of the records. Preferably, the current version of the tree is a read- write version and the previous version of the tree is a read-only version.

Preferably, each internal node of the tree includes at least one, and most preferably only one, spare pointer. (An "internal node" of a tree is understood herein to be a node of the tree that is not a leaf.) Also most preferably, each internal node of the tree also includes a commit flag. The scope of the present invention also includes a file system that includes one or more such data structures, for storing a plurality of files in a memory of a computer system and for retrieving the files from the memory. Preferably, the file system stores and retrieves the files via a plurality of directories, and the file system includes one such data structure for each file and one such data structure for each directory. Most preferably, the file system also includes one such data structure for mapping names of the directories to metadata records of the directories, and/or one such data structure for mapping globally unique identifiers of the files and of the directories to the data structures of the files and the directories, and/or one such data structure for storing and retrieving transaction identifiers of the data structures of the files and for storing and retrieving transaction identifiers of the data structures of the directories. Corresponding to the first data structure of the present invention is a method of storing a plurality of records in a memory of a computer system and retrieving the records from the memory. According to the basic method, at most two versions of a tree, a first, current version and a second, previous version, are provided. Each leaf of the tree includes a data object that has a retrieval key for one of the records. Preferably, the current version of the tree is a read- write version and the pervious version of the tree is a read-only version. The tree is updated by steps including changing only the current version of the tree.

Preferably, the updating of the tree also includes committing the changes of the current version; substituting the committed current version for the previous version (so that the committed current version becomes the previous version) and substituting a copy of the committed current version for the current version (so that the copy becomes the current version).

Most preferably, each internal node of the tree is provided with a spare pointer that includes a commit flag. As part of the committing of the changes of the current version, the commit flags are cleared. It is important to note that the process of putting the commit flags in a state that indicates that the changes to the current version of the tree have been committed is referred to herein as "clearing" the commit flags rather than as "setting" the commit flags because of the primary intended use of this and other methods of the present invention: managing files that are stored in a flash memory. As discussed above, a freshly erased block of a flash memory stores only ' 1 ' bits. If a bit of such a memory is used as a commit flag, then changing the bit, as part of showing that the changes to the current version of the tree have been committed, changes the bit from a T to a O', i.e., clears the bit.

A second data structure of the present invention is a persistent data structure for storing a plurality of records in a flash memory of a computer system and for retrieving the records from the memory. As far as we are aware, even though flash memories have been used in embedded systems since 1992 and even though persistent data structures have been known in the art since 1989, persistent data structures have not been used heretofore for storing and retrieving data in flash memories, despite the many benefits described herein that derive from such use of persistent data structures.

Preferably, the persistent data structure includes a persistent tree. Preferably the tree includes a plurality of leaves and/or a plurality of internal nodes. Each leaf includes a data object that has a retrieval key for one of the records. More preferably, each internal node includes at least one, and most preferably only one, spare pointer. Also most preferably, each spare pointer includes a commit flag and/or an abort flag.

Corresponding to the second data structure of the present invention is a method of storing a plurality of records in a flash memory of a computer system and of retrieving the records from the flash memory. According to the basic method, a persistent tree is provided that includes a plurality of leaves and a plurality of internal nodes. Each leaf includes a data object that has a retrieval key for one of the records. Each internal node includes at least one spare pointer. At least one internal node of the tree is changed by modifying one of the internal node's spare pointers, preferably to point to a new child node. Preferably, the method includes providing each spare pointer with a commit flag and committing the change(s) by steps including clearing the commit flag(s) of the spare pointer(s) of the changed node(s). Alternatively or additionally, the method includes providing each spare pointer with an abort flag and canceling the change(s) by steps including clearing the abort flag(s) of the spare pointer(s) of the changed nodes. As in the case of the commit flags, the process of putting the abort flags in a state that indicates that the changes to the current version of the tree have been canceled is referred to herein as "clearing" the abort flags because the primary intended use of this and other methods of the present invention: managing files that are stored in a flash memory. As discussed above, a freshly erased block of a flash memory stores only ' 1 ' bits. If a bit of such a memory is used as an abort flag, then changing the bit, as part of showing that the changes to the current version of the tree have been canceled, changes the bit from a T to a '0', i.e., clears the bit.

A file system of the present invention, for storing a plurality of files in a memory of a computer system and for retrieving the files from the memory, includes, for each file, a data structure having a mechanism for recording tentative changes to the data structure by transactions of the file system. The file system also includes a log of indications of the tentative changes. The log is used in conjunction with the data structures at boot time to complete or cancel changes that were interrupted by an unexpected shutdown of the computer system. This is in contrast to similar prior art file systems, whose logs store the changes themselves rather than indications of changes that are stored in other data structures of the file system. Preferably, the data structures are trees, and the mechanisms for recording tentative changes include each internal node of each tree having a commit flag that is cleared in accordance with the log when the tentative changes are committed and an abort flag that is cleared in accordance with the log when the tentative changes are aborted. Most preferably, the trees are the trees of the first data structure of the present invention. Corresponding to the file system of the present invention is a method of storing a plurality of files in a nonvolatile memory of a computer system and of retrieving the files from the memory. According to the basic method, for each file there is provided a corresponding data structure with a mechanism for recording tentative changes to that data structure. Indications of the tentative changes are logged.

Preferably, the method includes committing and/or aborting the tentative changes in accordance with the indications. Preferably, logging the indications includes recording the indications in a log in the nonvolatile memory. A witness record corresponding to the log also is stored in the nonvolatile memory. A witness marker that points to the witness record is stored in the log. When the computer system is booted, the witness marker and the witness record are used to verify the identity of the log, to make sure that a data object in the nonvolatile memory that appears to be the log really is the log.

Such a use of a witness record and a witness marker to verify the identity of a root of a general data structure stored in a nonvolatile memory of a computer system is a method of the present invention in its own right. Specifically, the method is a method of managing the data structure so that the root is found unambiguously at boot time. A witness record corresponding to the root is stored in the nonvolatile memory. A witness marker corresponding to the witness record is stored in a header of the root. When the computer system is booted, if two candidate headers, each with a witness marker, are found, then the candidate header whose witness marker corresponds to the witness record is accepted as the actual header of the root of the data structure. Preferably, the witness marker that corresponds to the witness record includes a pointer to the witness record.

Preferably, the trees of the present invention are B-trees.

The scope of the present invention also includes computer-readable storage media having embedded thereon computer-readable code for implementing the file systems and the methods of the present invention.

The data structures that TFFS uses, particularly the pruned versioned B-trees, are a significant innovation. This design borrows ideas from both research on persistent data structures and from earlier flash file systems. We have adapted prior art persistent search trees to our needs: our trees can cluster many operations on a single tree into a single version, and our algorithms prune old versions from the trees. Spare pointers are related to replacement pointers that were used in the notoriously-inefficient Microsoft Flash File System [/_., 2, 3, 4], and to replacement block maps in the Flash Translation Layer [5, fr\. But again, we have adapted these ideas: in Microsoft's flash file system, paths of replacement pointers grew and grew; in TFFS, spare pointers never increase the length of paths. The replacement blocks in the prior art Flash Translation Layer are designed for patching elements in a table, whereas our replacement pointers are designed for a pointer-based data structure.

The file system of the present invention is designed for NOR flash, and in particular, for devices that are memory mapped and allow reprogramming. That is, the file system of the present invention assumes that four flash operations are available: Reading data through random-access memory instructions. Erasing a block of storage; in flash memories and EEPROM, erasing a block sets all the bits in the block to a logical T. Clearing one or more bits in a word (usually 1-4 bytes) initially consisting of all ones. This is called programming. Clearing one or more bits in a word with some zero bits. This is called reprogramming the word.

Virtually all NOR devices support the first three operations, and many support all four, but some do not support reprogramming.

File systems for NOR flash can write very efficiently small amounts of data, even if these writes are performed atomically and committed immediately. Writes to NOR flash are not performed in large blocks, so the time to perform a write operation to the file system can be roughly proportional to the amount of data written, even for small amounts. This observation is not entirely new: proportional write mechanisms have been used in Microsoft's ffs2 [7_., 2, 5, 4] and in other linked-list based flash file systems [7]. But these file systems suffered from performance problems and were not widely used. More recent file systems tend to be block based, both in order to borrow ideas from disk file systems and in order to support NAND flash, in which writes are performed in blocks. Using the file system of the present invention to manage files stored on NOR flash, it is possible to benefit from cheap small writes, without incurring the performance penalties of linked-list based designs.

As discussed below, RAM is a limited resource in microcontrollers and in embedded systems. One of the uses of RAM in prior art file systems for managing nonvolatile memories such as hard disks and flash memories is to accumulate changes to the file systems pending a mass transfer of these changes to the nonvolatile memories. This is because any changes, even small changes, to data stored in a nonvolatile memory tend to have a fixed cost. The preferred data structures of the present invention, along with the log of the present invention, when stored in a flash memory rather than in RAM and when used as part of a file system for the flash memory, allow the file system to be kept up-to-date with only small changes and at relatively low cost.

BRIEF DESCRIPTION OF THE DRAWINGS The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:

FIG. 1 illustrates the API of the file system of the present invention; FIG. 2 illustrates the partitioning of an erase unit;

FIG. 3 shows the translation of a logical unit/sector pointer to a physical pointer; FIG. 4 illustrates tree modification by path copying;

FIG. 5 illustrates the use of spare pointers to modify a tree; FIG. 6 shows the results of storage overhead simulations; FIGs. 7 and 8 shows the results of file system endurance simulations; FIGs. 9 and 10 show the result of reclamation efficiency simulations; FIG. 11 is a simplified partial block diagram of a microprocessor with a flash file system of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is of a transactional flash file system which can be used to manage files stored in a flash memory. Specifically, the present invention can be used to manage files stored in the NOR flash memory of an embedded microcontroller that uses such an on-chip or on-board NOR flash memory as a persistent file store.

The principles and operation of a transactional flash file system according to the present invention may be better understood with reference to the drawings and the accompanying description.

Design Goals

TFFS is designed to meet the requirements of small embedded systems that need a general-purpose file system for NOR flash devices. Our design goals, roughly in order of importance, were: Supporting the construction of highly reliable embedded applications

Efficiency in terms of RAM usage, flash storage utilization, speed and code size

High endurance

Optional error detection (at an additional cost) Embedded applications must contend with sudden power loss. In any system consisting of both volatile and nonvolatile storage, loss of power may leave the file system itself in an inconsistent state, or the application's files in an inconsistent state from the application's own viewpoint. TFFS performs all file operations atomically, and the file system always recovers to a consistent state after a crash. Furthermore, TFFS' API supports explicit and concurrent transactions. Without transactions, all but the simplest applications would need to implement an application-specific recovery mechanism to ensure reliability. TFFS takes over that responsibility. The support for concurrent transactions allows multiple concurrent applications on the same system to utilize this recovery mechanism. Some embedded systems ignore the power loss issue, and as a consequence are simply unreliable. For example, the ECI Telecom B-FOCuS 270/400PR router/ ADSL modem presents to the user a dialog box that reads "[The save button] saves the current configuration to the flash memory. Do not turn off the power before the next page is displayed, or else the unit will he damaged!" . Similarly, the manual of the Olympus C-725 digital camera warns the user that losing power while the flash-access lamp is blinking could destroy stored pictures.

Efficiency, as always, is a multifaceted issue. Many microcontrollers only have 1 —2 KB of RAM, and in such systems, RAM is often the most constrained resource. As in any file system, maximizing the effective storage capacity is important; this usually entails minimizing internal fragmentation and the storage overhead of the file system's data structures. In many NOR flash devices, programming (writing) is much slower than reading, and erasing blocks is even slower. Erasure times of more than half a second are common. Therefore, speed is heavily influenced by the number of erasures, and also by overhead writes (writes other than the data writes indicated by the API). Finally, storage for code is also constrained in embedded systems, so embedded file systems need to fit into small footprints. In TFFS, RAM usage is the most important efficiency metric. In particular, TFFS never buffers writes, in order not to use buffer space. The RAM usage of TFFS is independent of the number of resources in use, such as open files. This design decision trades off speed for RAM. For small embedded systems, this is usually the right choice. For example, recently-designed sensor-network nodes have only 0.5—4 KB of RAM, so file systems designed for them must contend, like TFFS, with severe RAM constraints. We do not believe that RAM-limited systems will disappear anytime soon, due to power issues and to mass-production costs; even tiny 8-bit microcontrollers are still widely used. The API of the file system of the present invention is nonstandard. Referring now to the drawings, Figure 1 presents the API in a slightly simplified form. In Figure 1, the types TI D and FD stand for transaction identifier and file descriptor (handle), respectively. We do not show the type of arguments when the type is clear from the name (e.g., char* for buffer, FD for file, and so on). We also do not show a few utility functions, such as a function to retrieve a file's properties. This API is designed to meet two main goals: support for transactions, and efficiency. The efficiency concerns are addressed by API features such as support for integer file names and for variable-length record files. In many embedded applications, file names are never presented to the user in a string form. Some systems do not have a textual user interface at all. Some systems do have a textual user interface, but present the files as nameless objects, such as appointments, faxes, or messages. Variable-length record files allow applications to efficiently change the length of a portion of a file without rewriting the entire file.

We deliberately excluded some common file-system features that we felt were not essential for embedded system, and which would have complicated the design or would have made the file system less efficient. The most important among these are directory traversals, file truncation, and changing the attributes of a file (e.g., it's the file's name). TFFS does not support these features. Directory traversals are helpful for human users; but embedded file systems are used by embedded applications, so the file names are embedded in the applications, and the applications know which files exist.

One consequence of the exclusion of these features is that the file system cannot be easily integrated into some operating systems, such as Linux and Windows CE. Even though these operating systems are increasingly used in small embedded systems, such as residential gateways and PDAs, we felt that the penalty in efficiency and code size to support general file- system semantics would be unacceptable for smaller devices.

On the other hand, the support for transactions does allow applications to reliably support features such as long file names. An application that needs long names for files can keep a long-names file with a record per file. This file maintains the association between the integer name of a file and its long file name, and by creating the file and adding a record to the naming file in a transaction, this application data structure always remains consistent.

The endurance issue is unique to flash file systems. Because each block can only be erased a limited number of times, uneven wear of the blocks leads to early loss of storage capacity (in systems that can detect and not use worn-out blocks), or to an untimely death of the entire system (if the system cannot function with some bad blocks).

The inclusion of an error-detection mechanism in TFFS is driven by several considerations. First, in a system designed for reprogrammable NOR devices, we cannot rely on hardware error detection/correction. Because we intend to exploit reprogrammability, we also cannot rely on error protection by the device driver. Second, even though NOR devices are considered reliable (in particular, more reliable than NAND devices), NOR devices are not completely free of memory errors. Flash devices differ in their error rates, partially because there is a tradeoff between performance and low program-disturb error rates. This implies that some devices suffer from high error rates because these devices are designed for high- performance. Third, when a device wears out, it can no longer reliably retain data. This is the meaning of being worn out. At this stage of the life of the device, we must expect memory errors even if the device is otherwise reliable. If we wish to use the device beyond the manufacturer's guaranteed number of erasures, or if we wish to use the device outside the guaranteed physical conditions (temperature, voltage), then we must expect errors. Fourth, in highly-reliable computer systems all the failures must be expected ones. Memory errors, if detected, are expected, so presumably the system reacts to them in an appropriate way. Undetected memory errors, on the other hand, can lead to all sorts of unexpected failures, such as infinite loops and memory corruptions. Memory errors in internal file-system structures can lead to such unexpected failures even if all sensitive user data is protected by error detection codes.

These considerations led us to certain assumptions that guided our design. First, we assume that memory errors are caused by two stochastic sources. One source causes independent bit flippings at a low probability throughout the life of the device. This source models errors caused by program-disturb, extreme temperatures, and so on. The second source causes errors at a high probability in specific erase units, and is active at certain times, but not all the time. We assume that this source can only be turned on in a unit when the unit is erased. We also assume that the errors caused by this source are uniformly distributed within the affected erase units. This source models both manufacturing defects and wearing out of erase units.

There is another flash-specific problem that we need to address. In many persistent memory devices, interrupting an erasure or program operation leaves the affected memory cells in an undefined state. This is certainly true in flash devices; nothing should be assumed about the contents of such cells. In disk-based systems, this problem can be addressed by logging the write operation to a log (journal) stored in a fixed location. Suppose that a write operation is interrupted by a power failure. When the system boots after the failure, the system reads the log and replays the failed write, or at least clears the undefined disk block. In flash devices, storing the log in a fixed location can accelerate the wear of that area. Therefore, the log is moved around. When the system boots, the system searches for the log. Corrupt data left from a failed program or erase operation may look like the log, or may otherwise interfere with the search. Therefore, the last design goal of TFFS is the ability to reliably find its log even after interrupted program or erase operations. It turns out that the error-detection mechanism of TFFS enhances the reliability of our solution to this problem.

We show below that TFFS does meet these objectives. We show that the support for transactions is correct, and we show experimentally that TFFS is efficient and avoids early wear.

The Design of the File System This section describes the design of TFFS. Because the design is quite complex, this section is long and is broken into several subsection. Subsection A describes low-level memory management technique that we use and the structure and interpretation of pointers. Subsection B climbs one level up. It describes the main data structure that TFFS uses, called pruned versioned trees. These trees are a form of B-trees that are augmented to support transactions. These trees are used in TFFS to represent files, the file-system's name spaces, and private file- system objects. Trees can cause trouble in small embedded systems, because recursive tree algorithms can cause stack overflows. Subsection C explains how we avoid stack overflows in such algorithms.

Subsection D climbs another level up and explains how the name spaces of TFFS are represented.

The next four subsections address the issue of atomicity. Subsection E shows how to support transactions on pruned versioned trees. TFFS identifies transactions using short integers, for space efficiency. Because these integers are short, they are in short supply. Subsection F describes how TFFS allocates and deallocates transaction identifiers. Subsection G explains how TFFS performs atomic operations that are not part of transactions, and what gains are achieved by not performing the same operations as part of transactions. Subsection H describes the structure of the log (journal), which is the key to atomicity. We use the log not only to ensure atomicity, but also to find other key data structures at boot time. Finding the log is one of the first things TFFS does when it boots, and if TFFS cannot identify the log, it cannot boot. An interrupted erasure can leave the partially-erased unit looking like a log. Subsection I explains how we identify the true log. The last subsection of the section describes the error-detection mechanisms that we built into TFFS. When the optional error detection mechanisms are used, error detection is ubiquitous: every file-system structure contains one or more error-detection codes. To avoid cluttering the presentation, we do not discuss these codes and the space allocated to them until Subsection J. A. Logical Pointer and the Structure of Erase Units

The memory space of flash devices is partitioned into erase units, which are the smallest blocks of memory that can be erased. TFFS assumes that all erase units have the same size. (In some flash devices, especially devices that are intended to serve as boot devices, some erase units are smaller than others; in some cases the irregularity can be hidden from TFFS by the flash device driver, which can cluster several small units into a single standard-size one, or TFFS can ignore the irregular units.) TFFS reserves one unit for the log, which allows TFFS to perform transactions atomically. The structure of this erase unit is simple: it is treated as an array of fixed-size records, which TFFS always fills in order. The other erase units are all used by TFFS' memory allocator, which uses them to allocate variable-sized blocks of memory that we call sectors.

The on-flash data structure that the memory allocator uses is designed to achieve one primary goal. Suppose that an erase unit that still contains valid data is selected for erasure, perhaps because it contains the largest amount of obsolete data. The valid data must be copied to another erase unit prior to the first unit's erasure. If there are pointers to the physical location of the valid data, these pointers must be updated to reflect the new location of the data. Pointer modification poses two problems. First, the pointers to be modified must be found. Second, if these pointers are themselves stored on the flash device, they cannot be modified in place, so the sectors that contain them must be rewritten elsewhere, and pointers to them must now be modified as well. The memory allocator's data structure is designed so that pointer modification is never needed.

TFFS avoids pointer modification by using logical pointers to sectors rather than physical pointers; pointers to addresses within sectors are not stored at all. A logical pointer is an unsigned integer (usually a 16-bit integer) partitioned into two fields: a logical erase unit number and a sector number. When valid data in an erase unit are moved to another erase unit prior to erasure, the new erase unit receives the logical number of the unit to be erased, and each valid sector retains its sector number in the new erase unit.

A table, indexed by logical erase-unit number, stores logical-to-physical erase-unit mapping. In the preferred implementation the table is stored in RAM, but the table alternatively can be stored in a sector on the flash device itself, to save RAM.

Erase units that contain sectors (rather than the log) are divided into four parts, as shown in Figure 2 that illustrates the partitioning of an erase unit 10. The top of unit 10 (lowest addresses) stores a small header, which is immediately followed by an array of sector descriptors. The bottom of unit 10 contains sectors, which are stored contiguously. The area between the last sector descriptor and the last sector is free. The sector area grows upwards, and the array of sector descriptors grows downwards, but the sector area and the array of sector descriptors never collide. Area is not reserved for sectors or descriptors; when sectors are small more area is used by the descriptors than when sectors are large. A sector descriptor contains the erase-unit offset of first address in the sector, as well as a "valid" Boolean flag and an "obsolete" Boolean flag. Normally, these and other similar Boolean flags are single bits; but when the optional error detection is used these flags are represented by two physical bits. Clearing the "valid" flag indicates that the offset field has been written completely (this ensures that sector descriptors are created atomically). Clearing the "obsolete" flag indicates that the sector that the descriptor refers to is now obsolete.

A logical pointer is translated to a physical pointer as shown in Figure 3. .The logical pointer consists of a logical erase unit number 3 and a sector number 4. Logical erase-unit number 3 within the logical pointer is used as an index into the logical-to-physical erase-unit table. This provides the number 11 of the physical erase unit 10 that contains the sector. Then sector number 4 within the logical pointer is used to index into the sector-descriptors array on that physical erase unit 10. This returns a sector descriptor. The offset in that descriptor is added to the address of physical erase unit 10 to yield the physical address of the sector.

Before an erase unit is erased, the valid sectors on the erase unit are copied to another erase unit. Logical pointers to these sectors remain valid only if the sectors retain their sector number on the new erase unit. For example, sector number 6, which is referred to by the seventh sector descriptor in the sector-descriptors array, must be referred to by the seventh sector descriptor in the new erase unit. The offset of the sector within the erase unit can change when the sector is copied to a new erase unit, but the sector descriptor must retain its position. Because all the valid sectors in the erase unit to be erased must be copied to the same new erase unit, and because specific sector numbers must be available in that new erase unit, TFFS always copies sectors to a fresh erase unit that is completely empty prior to erasure of another unit. Also, TFFS always compacts the sectors that it copies in order to create a large contiguous free area in the new erase unit.

TFFS allocates a new sector in two steps. First, TFFS finds an erase unit with a large- enough free area to accommodate the size of the new sector. The preferred implementation uses a unit-selection policy that combines a limited-search best-fit approach with a classification of sectors into frequently changed sectors and infrequently changed sectors. The policy attempts to cluster infrequently-modified sectors together in order to improve the efficiency of erase-unit reclamation (the fraction of the obsolete data on the erase unit just prior to erasure). Next, TFFS finds, on the unit that has been selected to receive the new sector, an empty sector descriptor to refer to the sector. Empty descriptors are represented by a bit pattern of all l's, the erased state of the flash. If all the descriptors are used, TFFS allocates a new descriptor at the bottom of the descriptors array. (TFFS knows whether all the descriptors are used in an erase unit; if all the descriptors are indeed in use, the best-fit search ensures that the selected unit has space for both the new sector and for a new descriptor).

The size of the sector-descriptors array of an erase unit is not represented explicitly. When an erase unit is selected for erasure, TFFS determines the size of the erase unit's sector- descriptor array using a linear downwards traversal of the array, while maintaining the minimal sector offset that a descriptor refers to. When the traversal reaches that location minimal sector offset, the traversal is terminated. The size of sectors is not represented explicitly, either, but that size is needed in order to copy valid sectors to the new erase unit during reclamations. The same downward traversal is also used by TFFS to determine the size of each sector. The traversal algorithm exploits the following invariant properties of the erase-unit structure. Sectors and their descriptors belong to two categories: reclaimed sectors, which are copied into the unit during the reclamation of another unit, and new sectors, allocated later. Within each category, sectors with consecutive descriptors are adjacent to each other. That is, if descriptors i andj>i are both reclaimed or both new, and if descriptors i+\,...j-l all belong to the other category, then sector i immediately precedes sector j. This important invariant holds because (1) we copy reclaimed sectors from lower-numbered descriptors to higher numbered descriptors, (2) we always allocate the lowest-numbered free descriptor in a unit for a new sector, and (3) we allocate the sectors themselves from the top down (from right to left in Figure 2). The algorithm keeps track of two descriptor indices, r, the reclaimed descriptor, and n, the last new descriptor. When the algorithm examines a new descriptor i, the algorithm first determines whether the descriptor is free (all l's), new or reclaimed. If the descriptor is free, the algorithm proceeds to the next descriptor. Otherwise, if the sector lies to the right of the last-reclaimed mark stored in the erase unit's header, then the sector is a reclaimed sector, otherwise the sector is new. Suppose that sector i is new; sector i starts at the address given by its header, and sector i ends at the last address before n, or the end of the erase unit if i is the first new sector encountered so far. The case of reclaimed sectors is completely symmetric. Note that the traversal processes both valid and obsolete sectors. As mentioned above, each erase unit starts with a header. The header indicates whether the erase unit is free, is used for the log, or is used for storing sectors. The header contains the number of the logical unit that the physical unit represents (this field is not used in the log unit), and an erase counter. The header also stores the highest (leftmost) sector offset of sectors copied as part of another erase unit's reclamation process; this field allows us to determine the size of sectors efficiently. Finally, the header indicates whether the erase unit is used for storing frequently- or infrequently-modified data; this helps cluster related data to improve the efficiency of reclamation. In a file system that uses n physical units of m bytes each, and with an erase counter bounded by g, the size of the erase-unit header in bits is 3+riog₂πl+riog₂7«l+riog₂g]. Flash devices are typically guaranteed for up to one million erasures per unit (and often less, around 100,000), so an erase counter of 24 bits allows accurate counting even if the actual endurance is 16 million erasures. This implies that the size of the header is roughly 27+log₂(«m), which is approximately 46 bits for a 512 KB device and 56 bits for a 512 MB device.

The erase-unit headers represent an on-flash storage overhead that is proportional to the number of erase units. The size of the logical-to-physical erase-unit mapping table is also proportional to the number of erase units. Therefore, a large number of erase units causes a large storage overhead. In devices with small erase units, it may be advantageous to use a flash device driver that aggregates several physical units into larger units, so that TFFS uses a smaller number of larger units. B. Pruned Versioned Search Trees

TFFS uses an innovative data structure that we call pruned versioned search trees to support efficient atomic file-system operations. This data structure is a derivative of persistent search trees, but this data structure is specifically tailored to the needs of file systems. In TFFS, each node of a tree is stored in a variable-sized sector.

Trees are widely-used in file systems. For example, Unix file systems use a tree of indirect blocks, whose root is the inode, to represent files, and many file systems use search trees to represent directories. When the file system changes from one state to another, a search tree may need to change. One way to implement atomic operations is to use a versioned search tree. Abstractly, a versioned tree is a sequence of versions of the tree. Queries specify the version that they need to search. Operations that modify the tree always operate on the most recent version. When a sequence of modifications is complete, an explicit commit operation freezes the most recent version, which becomes read-only, and creates a new read- write version. When versioned search trees are used to implement file systems, usually only the read- write version (the last one) and the read-only version that precedes it are accessed. The read- write version represents the state of the search tree while processing a transaction, and the readonly version represents the most-recently committed version. The read-only versions satisfy all the data-structure invariants; but the read- write version may not satisfy all the data- structure invariants. Because old read-only versions are not used, these versions can be pruned from the search tree, thereby saving space. We call versioned trees that restrict read access to the most recently-committed version pruned versioned trees.

The simplest technique to implement versioned trees is called path copying, illustrated in Figure 4. When a tree node is modified, the modified version cannot overwrite the existing node, because the existing node participates in the last committed version. Instead, the modified node is written elsewhere in memory. This requires a modification in the parent node as well, to point to the new node, so a new copy of the parent node is created as well. This always continues until the root. If a node is modified twice or more before the new version is committed, the new node can be modified in place, or a new copy of the node can be created in each modification.

In the example illustrated in Figure 4, replacing the data associated with the leaf whose key is 31 in the binary tree on the left creates a new leaf, as shown in the binary tree on the right. The leaf replacement propagates up to the root. The new root represents a new version of the tree. Data structure items that are created as a result of the leaf modification are indicated by dark shading.

If the new node is stored in RAM, the new node is usually modified in place, but when the new node is stored on a difficult to modify memory, such as flash memory, a new copy of the node is created. The log-structured file system, for example, represents each file as search tree whose root is an inode, and uses path copying to modify files atomically. WAFL, a prior art file system that is used by Network Appliance, Inc. of Sunnyvale CA, USA in its data storage products. WAFL supports snapshots, and represents the entire file-system as a single search tree, which is modified in discrete write episodes; WAFL maintains several read-only versions of the file-system search tree to provide users with access to historical states of the file system.

An innovative technique that we call node copying can often prevent the copying of the path from a node to the root when the node is modified, as shown in Figure 5. This technique relies on spare pointers in tree nodes, and on nodes that can be physically modified in place. To implement node copying, nodes are allocated with one or more spare pointers, which are initially empty. When a child pointer in a node needs to be updated, the system determines whether the node still contains an empty spare pointer. If the node does still contain an empty spare pointer, the spare pointer is modified instead. The modified spare pointer points to the new child, and the modified spare pointer contains an indication of which original pointer it replaces.

Each spare pointer also includes a commit flag, to indicate whether the pointer has been created in the current read- write version or in a previous version. If the commit flag is set, then tree accesses in both the read-only version and the read-write version should traverse the spare pointer, not the original pointer that it replaces. If the commit flag is not yet set, then tree access in the read-write version should traverse the spare pointer but tree access in the read-only version should traverse the original pointer. Spare pointers also have an abort flag; if the abort flag is set, then the spare pointer is simply ignored.

In B-trees, in which nodes have a variable number of child pointers, the spare pointer can also be used to add a new child pointer. This serves two purposes. First, it allows us to allocate variable-size nodes, containing only enough child pointers for the number of children the node has at creation time. Second, it allows us to store original child pointers without commit flags.

In principle, using more than one spare pointer per node can further reduce the number of node creations, at the expense of more storage overhead. However, even with only one spare pointer per node, it can be shown that the amortized cost of a single tree update/insert/delete operation is constant. Therefore, TFFS preferably uses only one spare pointer per node. C. Tree Traversals with a Bounded Stack

Small embedded systems often have very limited stack space. Some systems do not use a stack at all: static compiler analysis maps automatic variables to static RAM locations. To support systems with a bounded or no stack, TFFS never uses explicit recursion. TFFS does use recursive algorithms, but the recursion uses a statically-allocated stack and the depth of the stack is configured at compile time.

Algorithms that traverse trees from the root to the leaves can be easily implemented iteratively. On the other hand, recursion is the most natural way to implement algorithms that descend from the root to a leaf and climb back up, and algorithms that enumerate all the nodes in a tree.

TFFS avoids explicit recursion using three mechanisms. The first mechanism is a simulated stack. The recursive algorithm is implemented using a loop, and an array of logical pointers replaces the automatic variable that points to the current node. Different iterations of the loop, which correspond to different recursion depths, use different elements of this array. The size of these arrays is fixed at compile time, so this mechanism can only support trees or subtrees up to a certain depth limit.

When the depth limit of a simulated recursive algorithm is reached, the algorithm cannot use the array of logical pointer to locate the parent of the current node. Instead, the second mechanism is initiated. The algorithm starts a top-down search for the terminal leaf of the path from the root, and uses the array to remember the last nodes visited in this top-down traversal. When the search finds the current node, the array is ready with pointers to the immediate ancestors of the current node, and the simulated recursion continues.

Asymptotically, the top-down searches to locate the immediate ancestors of a node turn Θ(log«)-time Θ(logrc)-space B-tree operations, such as insert (where n is the number of nodes) into Θ(log²/?)-time Θ(l)-space operations. The simulated stack allows us to configure the constants.

The third mechanism uses the log, which we describe later. This mechanism is used to construct new trees. New trees for record files are constructed layer by layer, starting from the leaves. Every iteration of the construction creates one layer of the tree. The first iteration allocates nodes for the leaves. Pointers to these nodes are written in the log. Each subsequent iteration creates internal nodes that point to the nodes created in the previous iteration. Because the pointers are written in the log there is no need to save them in RAM. The process ends when the last layer has only one node, which becomes the root of the new tree. E. Mapping Files and File Names

TFFS uses pruned versioned search trees for mapping files and file names. Most of the search trees represent files and directories, one search tree per file/directory. These search trees are versioned. In the case of record files, each record of a file is stored in a separate sector, and the file's search tree maps record numbers to the logical addresses of sectors. In the case of binary files, extents of contiguous data are stored on individual sectors, and the file's search tree maps file offsets to sectors. The extent of a binary file is created when data are appended to the file. Preferably, TFFS does not change this initial partitioning. TFFS supports two naming mechanisms for the open system call. One mechanism is a hierarchical name space of directories and files, as in most file systems. However, in TFFS directory entries are short unsigned integers, not strings, in order to avoid string comparisons in directory searches. The second mechanism is a flat namespace consisting of unique strings. A file or directory can be part of one name space or part of both namespaces. The hierarchical name space does not allow long names. Therefore, directory search trees are indexed by the short integer entry names. The leaves of directory search trees are the metadata records of the files. The metadata record contains the internal file identifier ("Globally Unique Identifier", or GUID) of the directory entry, as well as the file type (record/binary/directory), the optional long name, permissions, and so on. In TFFS, the metadata is immutable.

TFFS assumes that long names are globally unique. We use a hash function to map these string names to 16-bit integers, which are perhaps not unique. TFFS maps a directory name to its metadata record using a search tree indexed by hash values. The leaves of this search tree are either the metadata records themselves (if a hash value maps into a single directory name), or arrays of logical pointers to metadata records (if the names of several directories map into the same hash value).

A second search tree maps GUIDs to file/directory trees. This search tree is indexed by GUID and the leaves of the search tree are logical pointers to the roots of file/directory search trees. The open system call comes in two versions: one version returns a GUID given a directory name, and the other returns a GUID given a directory GUID and a 16-bit file identifier within that directory. The first version computes the hash value of the given name and uses the hash value to search the directory-names search tree. When the first version of the open system call reaches a leaf, that version verifies the directory name if the leaf is the metadata of a directory, or searches for a metadata record with the appropriate name if the leaf is an array of pointers to metadata records. The second version of the open system call searches the GUID search tree for the given GUID of the directory. The leaf that this search returns is a logical pointer to the root of the directory search tree. The open system call then searches this directory search tree for the file with the given identifier; the leaf that is found is a logical pointer to the metadata record of the sought-after file. That metadata record contains the GUID of the file.

In file-access system calls, the file is specified by a GUID. These system calls find the root of the file's search tree using the GUID tree.

E. Transactions on Pruned Versioned Search Trees

The main data structures of TFFS are pruned versioned search trees. We now explain how transactions interact with these trees. By transactions we mean not only explicit user-level transactions, but also implicit transactions that perform a single file-system modification atomically.

Each transaction receives a Transaction IDentifier (TID). These identifiers are integers that are allocated in order when the transaction starts, so they also represent discrete time stamps. A transaction with a low TID time stamp started before a transaction with a higher TID. The file system can commit transactions out of order, but the linearization order of the transactions always corresponds to their TID: a transaction with TID t can observe all the side effects of committed transactions M and lower, and cannot observe any of the side effects of transactions t+1 and higher.

When a transaction modifies a search tree, the transaction creates a new version of the search tree. That version remains active, in read-write mode, until the transaction either commits or aborts. If the transaction commits, all the spare pointers that the transaction created are marked as committed. In addition, if the transaction created a new root for the file, the new root becomes active (the pointer to the root of the file's search tree, somewhere else in the file system, is updated). If the transaction aborts, all the spare pointers that the transaction created are marked as aborted by a special flag. Aborted spare pointers are not valid and are never dereferenced.

Therefore, a search tree can be in one of two states: either with uncommitted and unaborted spare pointers (or an uncommitted root), or with only committed or aborted spare pointers. A search tree in the first state is being modified by a transaction that is not yet committed or aborted. Suppose that a search tree is being modified by transaction t, and that the last committed transaction that modified the search tree is transaction s. The read-only version of the search tree, consisting of all the original child pointers and all the committed spare pointers, represents the state of the tree at discrete times r for s≤ r<t. We do not know what the state of the tree was at times smaller than s: perhaps some of the committed spare pointers represent changes made earlier, but we cannot determine when, so we do not know whether to follow those committed spare pointers or not. Also, some of the nodes that existed at times before s may cease to exist at time s. The read-write version of the search tree represents the state of the search tree at time t, but only if transaction t will commit. If transaction t will abort, then the state of the search tree at time t is the same state as at time s. If transaction t will commit, we still do not know what the state of the search tree will be at time t+1, because transaction t may continue to modify the search tree. Hence, the file system allows transactions with TID r, for s<r<t, access to the read-only version of the search tree, and to transaction t access to the read-write version of the search tree. All other access attempts, including, for example, read and write requests from transactions r<s, cause the accessing transaction to abort. In principle, instead of aborting transactions later than t, TFFS could block these transactions, but we assume that the operating system's scheduler cannot block a request.

If a search tree is in the second state, with only committed or aborted spares, we must keep track not only of the last modification time s of the search tree, but also of the latest transaction u≥ s that read the search tree. The file system admits read requests from any transaction r for r>s, and write requests from a transaction f> u. As before, read requests from a transaction r<s causes transaction r to abort. A write request from a transaction t<u causes t to abort, because the write request might affect the state of the search tree that u already observed.

To enforce these, access rules, we associate three TIDs with each versioned search tree of a file or of a directory: the last committed modification TID, the last read-access TID, and the TID, if any that currently modifies the search tree. These TIDs are kept in a TID search tree, indexed by the internal identifiers of the versioned search trees. The file system never accesses the read-only version of the TID search tree. Therefore, although the TID search tree is implemented as a pruned versioned search tree, the file system treats the TID search tree as a normal mutable search tree. The next subsection presents an optimization that allows the file system not to store the TIDs associated with a search tree. F. Using Bounded Transaction Identifiers

To allow TFFS to represent TIDs in a bounded number of bits, and also to save RAM, TFFS represents TIDs modulo a small number. In essence, this allows TFFS to store information only on transactions in a small window of time. Older transactions than this window permits must be either committed or aborted.

The TID allocator consists of three simple data structures that are kept in RAM: next TID to allocate, the oldest TID in the TID search tree, and a bit vector with one bit per TID within the current TID window. The bit vector stores, for each TID that might be represented in the system, whether the TID is still active or whether the TID has already been aborted or committed. When a new TID needs to be allocated, the allocator first determines whether the next sequential TID, in cyclic order, represents an active transaction. If it does represent an active transaction, the allocation simply fails. No new transactions can be started until the oldest one in the system either commits or aborts. If the next TID is not active and not in the TID search tree, the next TID is allocated and the next-TID variable is incremented (modulo the window size). If the next TID is not active but the next TID is in the TID search tree, then the TID search tree is first cleaned, and then the next TID is allocated.

Before cleaning the TID tree, the file system determines how many TIDs can be cleaned. Cleaning is expensive, so TFFS cleans on demand, and when it cleans, it cleans as many TIDs as possible. The number of TIDs that can be cleaned is the number of consecutive inactive TIDs in the oldest part of the TID window. After determining this number, TFFS traverses the entire TID tree and invalidates the appropriate TIDs. An invalid TID in the tree represents a time before the current window; transactions that old can never abort a transaction, so the exact TID is irrelevant. We cannot search for TIDs to be invalidated because the TID search tree is indexed by the identifiers of the file/directory search trees, not by TID. The file system often can avoid cleaning the TID search tree. Whenever no transaction is active, the file system deletes the entire TID search tree. Therefore, if long chains of concurrent transactions are rare, tree cleanup is rare or not performed at all. The cost of TID cleanups can also be reduced by using a large TID window size, at the expense of slight storage inefficiency.

G. Atomic Non-transactional Operations To improve performance and to avoid running out of TIDs, TFFS supports non- transactional operations. Most requests to TFFS specify a TID as an argument. If no TID is passed to a system call (the TID argument is 0), the requested operation is performed atomically, but without any serializability guarantees. That is, the operation will either succeed completely, or will fail completely, but the operation may break the serializability of concurrent transactions.

TFFS allows an atomic operation to modify a file or directory only if no outstanding transaction has already modified the file's or the directory's search tree. But this still does not guarantee serializability. Consider a sequence of operations in which a transaction reads a file that is subsequently modified by an atomic operation and then is read or modified again by the transaction. It is not possible to serialize the atomic operation and the transaction.

Therefore, it is best to use atomic operations only on files/directories that do not participate in outstanding transactions. An easy way to ensure this is to access a particular file either only in transactions or only in atomic operations.

Atomic operations are more efficient than single-operation transactions in two ways. First, during an atomic operation the TID search tree is read, to ensure that the file is not being modified by an outstanding transaction, but the TID search tree is not modified. Second, a large number of small transactions can cause the file system to run out of TIDs, if an old transaction remains outstanding; atomic operations avoid this possibility, because atomic operations do not use TIDs at all. H. The Log

TFFS uses a log to implement transactions, atomic operations, and atomic maintenance operations. As explained above, the log is stored in a single erase unit as an array of fixed-size records that grows downwards. The erase unit containing the log is marked as such in its header.

Each log entry contains up to four items: a valid/obsolete flag, an entry-type identifier, a transaction identifier, and a logical pointer. The first two items are present in each log entry; the last two remain unused in some entry types. Some log-entry types use the memory allocated for some of these items for storing other data. For example, the Erase Marker, which spans two log entries, stores the physical erase-unit number in the memory allocated for the logical pointer in one of the log entries and the erase counter in the memory allocated for the logical pointer in the other log entry.

TFFS uses the following log-entry types:

(a) New Sector and New Tree Node. These entry types allow the system to undo a sector allocation by marking the pointed-to sector as obsolete. The

New-Sector entry is ignored when a transaction is committed. The New- Tree-Node entry causes the file system to mark the spare pointer in the node, if used, as committed. This ensures that a node that is created in a transaction and modified in the same transaction is marked correctly.

(b) Obsolete Sector. Sectors are marked as obsolete only when the transaction that obsoleted them is committed. This entry is ignored at abort time, and clears the obsolete flag of the sector at commit time.

(c) Modified Spare Pointer. Points to a node whose spare pointer has been set. Clears the spare pointer's commit flag at commit time or it's the spare pointer's abort flag at abort time.

(d) New File. Points to the root of a file search tree that was created in a transaction. At commit time, this entry causes the file to be added to the

GUID search tree and to the containing directory. Ignored at abort time.

(e) File Root. Points to the root of a file search tree, if the transaction created a new root. At commit time, the entry is used to modify the file's entry in the GUID search tree. Ignored at abort time. (f) Commit Marker. Ensures that the transaction is redone at boot time.

(g) Reclaim Marker. Signifies that an erase unit is being reclaimed. The entry contains a physical erase-unit number of the source and a physical erase- unit number of the destination. The entry does not contain a sector pointer or a TID. If the log contains a non-obsolete reclaim marker at boot time, the reclamation is repeated; this completes an interrupted reclamation. Note that reclamation is idempotent.

(h) Erase Marker. Signifies that an erase unit is about to be erased. The entry contains a physical erase-unit number and an erase count, but does not contain a sector pointer or a TID. This entry typically uses two fixed-size record slots. If the log contains a non-obsolete erase-marker record at boot time, the physical unit is erased again; this completes an interrupted erasure.

(i) Witness Marker. The entry points to a special sector descriptor that points back to the physical erase unit that contains the log. The next subsection explains how this marker is used.

(j) GUID-Search-Tree Pointer, TID-Search-Tree Pointer, and Directory- Hash-Search-Tree Pointer. These records are written to the log when the root of one of these search trees moves, to allow the file system to find these search trees at boot time.

File search trees are modified during transactions and so is the TID search tree. The GUID and directory-hash search trees, and directory search trees, however, are modified only during commits. We cannot modify the GUID and directory-hash search trees during transactions because our versioned search trees support only one outstanding transaction. Delaying the search tree modification until commit time allows multiple outstanding transactions to modify a single directory, and allows multiple transactions to create files and directories (these operations affect the GUID search tree and the directory-hash search tree). TFFS does not allow file and directory deletions to be part of explicit transactions because that would have complicated the file/directory creation system calls.

The delayed operations are logged but not actually performed on the trees. After the commit system call is invoked, but before the commit marker is written to the log, the delayed operations are performed. When a transaction accesses a search tree whose modification by the same transaction may have been delayed, the search tree access must scan the log to determine the actual state of the search tree, from the viewpoint of that transaction. Many of these log scans are performed in order to find the roots of files that were created by the transaction or whose root was moved by the transaction. To locate the roots of these file trees more quickly, the file system keeps a cache of file roots that were modified by the transaction. If a file that is accesses is marked in the TID search tree as being modified by the transaction, the access routine first checks this cache. If the cache contains a pointer to the file's root, the search in the log is avoided; otherwise, the log is scanned for a non-obsolete file-root entry. L Finding the Log at Boot Time The first action of TFFS at boot time is to find the erase unit that contains the log. Once the log is found, the log governs the rest of the boot sequence. But finding the log is not trivial.

During normal operation, exactly one erase unit serves as a log. This erase unit's header marks the erase unit as containing a log. If TFFS find only one erase unit whose header marks it as containing a log, then this erase unit is the one that contains the log. But two events may cause the headers of two erase units to be marked as logs. One event, which we have already mentioned in the "Design goals" Section, is the interrupted erasure of an erase unit. An interrupted erasure can leave an erase unit in any state, including a state in which its header looks like the header of the log. The probability of such an event is low, especially if error detection is used (because the header of the partially-erased unit must also pass the error- detection tests for the erase unit to be considered a potential log unit), but this probability is not zero. The other event is the reclamation of the log unit. When the erase unit containing the log is reclaimed, the valid log entries are copied to another erase unit. At certain times during this process, both erase units have valid headers that identify them as containing the log.

We first explain how to handle the easier case of an interrupted log reclamation. When this happens, TFFS finds two erase units i and/ with log headers. One of them, say i, ends with a reclamation marker stating that i is being copied into /. Unit / cannot contain any valid reclamation marker, because log-reclamation markers are not copied during the reclamation of the log. This allows TFFS to unambiguously decide the direction of the reclamation, from / to/. If the log-reclamation marker at the end of the log in unit i is already obsolete, then / contains all the relevant log entries. TFFS has found the log: it is in unity. If the log-reclamation marker at the end of the log in / is still valid, then the reclamation has not been completed and it is simply repeated. In this case, unit i still serves as the log. The more difficult case is an interrupted erasure. After an interrupted erasure, the genuine log on unit / ends with an erase marker for some other unit, say/. In principle,/ might appear to contain a log and might pass error-detection tests. In such a case, we call/ a forged log and / the genuine log. According to the prior art, the only totally reliable way to distinguish between the forged log and the genuine log is to inspect the rest of the file system, to see which of the two candidate logs is consistent with the state of the file system. In general, this might not always be possible, and even when it is possible, these consistency tests can be complex or computationally expensive.

We have designed a simple consistency test that uses a special witness record stored elsewhere in the file system. The principle of the witness record is to add to the state of the file system a special record that can be easily located and that can be consistent with only one of the two log candidates. In other words, the witness record ensures that there is always a cheap way to determine the consistency of a log candidate with respect to the rest of the file system.

The witness record that TFFS uses is a dummy sector descriptor in one of the other erase units. Each sector descriptor contains one flag that, if set, designates it as a witness. A valid witness record also contains the physical erase-unit number of the current log. The witness for a new log unit is allocated and written before the reclamation of the previous log unit. Now the old log is reclaimed into the new log. The first log entry in the new log (or in any log) is a witness marker, which contains a logical pointer to the new log's witness record. Before the old log is erased, its witness record is invalidated.

We now explain how the witness record is used at boot time. Suppose that TFFS boots and finds two erase units, i andy, that both appear to contain logs, and that each erase unit ends with an erase marker claiming to erase the other unit. Both alleged logs must start with a witness marker (if one does not, it is forged and we are done). We inspect the two memory locations pointed to by the witness markers. Only one of them can be a valid witness record, and it must point back to the erase unit claiming it as its witness. The uniqueness of the witness is not obvious; it requires a proof. The proof uses an invariant, which states that at all times, TFFS is in exactly one of three possible states:

1. There is exactly one valid witness record on the flash memory.

2. An erase unit i with a log header ends with an erase marker that points to an erase unity containing all the valid witness records except for exactly one witness record; the one witness record that is not on erase unit y is consistent with the log on erase unit i (the log on erase unit i points to it and it points to the log on erase unit /).

3. There are exactly two valid witness records on the flash memory, and exactly one of the valid witness records is consistent with a log that ends with a log-reclamation marker.

Because witness records are marked by an otherwise unused flag in sector descriptors, witness records can only be created explicitly during the creation of a new log, or by the arbitrary bit pattern left in a unit after an interrupted erasure. State 1 is typical for the normal operation of TFFS. State 1 can cease to hold only under one of two conditions: either TFFS creates explicitly another witness record, or an interrupted erasure creates forged witness records. The first transition moves the file system from State 1 to State 3, because TFFS creates the new witness after writing the reclamation marker to the old log, and because nothing ever follows a log-reclamation marker in a log. The second transition either leaves TFFS in the first state (if the partially erased unit contains no valid witnesses) or in the second state. From State 3, TFFS can only move back to State 1, because TFFS will redo the reclamation until the reclamation succeeds. When the reclamation succeeds, it invalidates the old witness before doing anything else. The difficult part of the proof involves transitions from State 2. TFFS enters State 2 when an interrupted erasure creates forged witness records on the partially-erased unity. We claim that if TFFS boots in State 2, its next action is always to erase j again. If only unit i appears to contain a log, TFFS will find that it ends with an erase marker, which will cause TFFS to erase unity again. If unity also appears to be a log, the witness marker of the forged log on/ cannot point to a valid witness record. The only witness record outside i and/ is the witness record of i, which points to i. Because both i and/ appear to be logs, TFFS does not associate a logical unit number with them, so TFFS never reads a witness record from them. Therefore, even if j contains forged witness records, TFFS never uses these records, so TFFS always detects that/ is a forged log and that / is the genuine log. Therefore, the only possible transition from State 2 is to State 1. This concludes the correctness proof for the log-finding protocol.

This technique is useful not only for TFFS and other journaled file systems for flash devices, but for many other flash data structures as well. Most data structures designed for flash devices use some kind of erase-unit header to find, at boot time, the root of the data structure. By analogy to the case of trees, we refer herein to the logical starting element of any data structure as the "root" of the structure. For example, the "root" of a linked list is the first element of the list, and the "root" of a contiguous array is the first element of the array. The root can be a log, a log header, the superblock of a file system, or the root of some mapping structure such as the Flash Transition Layer [5] or the NAND Flash Translation Layer [6]. In all of these data structures, an interrupted erasure can leave the data structure with two apparent roots. Appropriate adaptation of our witness technique can eliminate this problem.

A simpler but incomplete solution to the problem would be to identify the log using a fixed and long signature that is unlikely to be created by an interrupted erasure. Bit patterns with only ones or only zeros are likely to be created by interrupted erasures, so they should not be used for identifying the log. Other bit patterns are fairly likely as well, such as patterns with mostly ones or mostly zeros and patterns in which all the zeros follow all the ones or vice versa. But a randomly selected bit pattern of k bits is likely to appear in a fixed location of a partially- erased unit with probability around 2^" . For a sufficiently large Jc, such a signature can be a fairly reliable way of finding the genuine log. It appears that most prior art flash systems rely on such signatures to detect their root unit.

The present invention uses a witness instead for two reasons. First, there is no general and proven methodology to choose such signatures. As we mentioned, some bit patterns make poor signatures, but it is not completely clear which patterns should be avoided. Second, we wanted to completely avoid cases in which TFFS boots and finds two apparent logs, even if the probability of such an event is low. In particular, in an information system that employs many TFFS devices, possibly millions of such devices, the probability of a forged signature turning up in one of the devices can be high enough to be of concern. J. Error Detection

TFFS implements error detection for all the structures and data stored on a NOR flash memory. We use two different error-detection mechanisms. We use two mechanisms rather than one mechanism because different parts of TFFS are accessed in different ways, and because of our assumptions on the two stochastic sources that cause bit errors. Two factors determine whether a particular error detection mechanism is appropriate for a given situation. One factor is the distribution of bit errors. A high error probability requires a stronger detection mechanism than a low error probability. Stronger mechanisms need more error-detection bits and more computation. The other factor is the length of the data that needs to be protected. Protecting a large chunk of data with one code is more efficient than partitioning the data and protecting each part separately with a code.

From this perspective, there are three different error-detection situations in TFFS. One situation represents all the errors that the high-probability source that models wear and manufacturing defects. The second situation represents low-probability errors in user data, which sometimes consists of long chunks of data. The third is low-probability errors in the data structures of TFFS itself, which consists of small records.

Before we explain how TFFS handles each situation, we mention an important principle that we follow in the computation of error-detection codes. When TFFS writes data protected by error-detection-code to flash, the code is computed from the copy of the data in RAM. In other words, TFFS never computes the code using as input the data just stored on flash, because that data is still unprotected; what we read is not necessarily what we wrote. Applying this principle is tricky when TFFS copies data from one location on flash to another. We explain below how TFFS addresses this issue. J-I: Protecting against wear and defects.

We detect errors that the wear-and-defects source causes by protecting erase-unit headers with a 16-bit error detection code (CRC 16). The code is computed over the entire header, except for the code itself. Once the header and its error-detection code are written to the flash memory, we read the header and its code from the flash memory and verify that they pass the error-detection test. If the wear-and-defects source is active for this erase unit, the verification should detect it, because CRC 16 is strong enough to detect multiple bit errors that this source might cause in the header. If the wear-and-defects source is presently inactive for this unit, we assume that it will remain inactive until the next erasure, so we can use the unit safely until then without worrying about this source. We also verify this source at other times (whenever we access the header) for extra protection. But because some headers are not accessed often, especially the header of the log unit, TFFS cannot always detect errors from this source if this source becomes active between erasures, contrary to our assumption. J-2: Protecting user data. Each file record or data extent is protected by a CRC 16 or CRC32 code. File records and data extents of up to 4 KB are protected by a CRC 16 code and longer records/extents by CRC32. When a record/extent is appended or updated by a user application, the data are provided in RAM, so TFFS computes the error-detection code on this in-RAM copy of the data. When a user application requests the data in a particular file record or data extent, TFFS reads the data from the flash memory into the user's buffer in RAM and verifies the error-detection code at the same time. In two situation, the computation and verification require a more complex procedure.

When a partial data extent is copied from one location on the flash memory to another location on the flash memory, TFFS does not store all the data it reads in a single buffer in RAM. Such copies happen when a user application updates only part of an extent in a binary file. TFFS performs the operation by creating a new extent covering the same byte range in the file. The new extent replaces the old extent. Some of the data for the new extent come from the buffer that the user code provides, but some of the data come from the old extent on flash. The copying from the old extent to the new extent is done byte by byte. TFFS copies the data, verifies the error-detection on the old extent, and computes the error detection for the new extent in a single pass over the extent's byte range. This single loop essentially fuses three loops: one loop to verify the old extent, one loop to compute the code for the new extent, and one loop to copy some of the data from the old extent to the new extent. In each iteration of this fused loop, TFFS reads a byte from the old extent and feeds the byte into the error-detection computation of the old extent. If that byte is destined to the new extent, that byte is written to the flash memory and also is fed into the computation of the new error-detection code. Otherwise, the appropriate byte of data is read from the user's buffer in RAM, is written to the flash memory, and is fed into the computation of the error detection code. These loops must be fused; we could not simplify the algorithm by first verifying the old extent and then copying it, because a memory error after the verification of the old extent and before the copying to the new location would have gone unnoticed. In such a case, the new error-detection code would be computed from the corrupt data, giving it an unwarranted stamp of approval. The other complex situation is similar. When a user application requests only part of an extent, the user's buffer is not large enough to hold the entire extent, so TFFS cannot verify a RAM image of the extent. Instead, we use the technique described above: the loop that verifies the entire extent is fused with the loop that reads part of the extent into a RAM buffer.

Our design, which uses a single error-detection code for each extent, is a compromise between two extreme designs. One extreme possibility is to use a single code for an entire file. This would require less storage for error-detection codes, but it would also force TFFS to read the entire file whenever only part of the file is read (to verify it). The other extreme possibility is to use a code for each byte or word, say a parity code. This would waste more space but would not require reading and verifying an entire extent when only part of it is read or copied. J-3: Protecting file-system structures.

The data structures of TFFS consist of small records, usually several bytes long. These structures are protected by parity bits that are verified whenever a field or several fields from a record are read. Protecting these records is not completely straightforward, however, because most of these records are updated once or more after they are first written and before they are discarded. The different states that a record goes through are characterized by programming additional fields that were originally left in the erased state. This is the natural way to implement state changes in NOR flash devices that allow reprogramming, the devices for which TFFS is designed. If a record goes through several states, a single parity bit cannot detect errors in all of these states, because the parity bit cannot change more than once (once cleared, a parity bit cannot be set until the erase unit in which it appears is erased).

To address this problem, we assign separate parity bits to each group of fields that are programmed together. The ratio of parity bits to data bits in a groups is at least 1 to 8. Usually, most of the record is protected by one set of parity bits, and one or more Boolean flags (valid, obsolete, committed, aborted, etc.) are each protected by a dedicated parity bit. We mention that that there is a more sophisticated way to address the same issue, using an updating technique invented for write-once memories. In this technique, k bits of abstract state that can change arbitrarily up to / times are encoded by more than k bits but fewer than tk bits. The encoding is such that any sequence of t different

states can be reprogrammed into a write-once medium. In principle, we could encode the error-detection code in such an encoding. This would have allowed the abstract error-detection code to arbitrarily change t times, to reflect up to M changes in the data that it protects. We did not attempt to use such an encoding. The parity function that we use is flash-specific. The parity bit is T if and only if the number of 'O's in the protected data is even. This ensures that an area in the erased state (all Ts) always passes the error-detection test with a parity bit that is also in the erased state.

Implementation and Performance

This section describes the implementation of TFFS and its performance. The performance evaluation is based on detailed simulations that we performed using several simulated workloads. The simulations measure the performance of TFFS, its storage overheads, its endurance, and the cost of leveling the flash memory device's wear.

We performed the experiments using simulations of two real-world flash devices. The first device is an 8 MB stand-alone flash-memory chip, the M29DW640D from STMicroelectronics. This device consists of 126 erase units of 64 KB each (and several smaller ones, which our file system does not use), with read access times of about 90 nanoseconds, program times of about 10 microseconds, and block-erase times of about 0.8 seconds.

The second device is a 16-bit microcontroller with on-chip flash memory, the

ST10F280, also from STMicroelectronics. This chip comes with two banks of RAM, one containing 2 KB and the other 16 KB, and 512 KB of flash memory. The flash memory contains

7 erase units of 64 KB each (again, with several smaller units that we do not use). The flash access times are 50 nanoseconds for reads, 16 microseconds for writes, and 1.5 seconds for erases. The small number of erase units in this chip hurts TFFS' performance; to measure the effect, we also ran simulations using this device but with smaller erase units ranging from 2 to 32 KB.

Both devices are guaranteed for 100,000 erase cycles per erase unit. We configured the non-hardware-related parameters of the simulated embodiment of TFFS as follows. The file system is configured to support up to 32 concurrent transactions, B- tree nodes have either 2—4 children or 7—14 children, 10 simulated recursive-call levels, and a RAM cache of 3 file roots. This configuration requires 466 bytes of RAM for the 8 MB flash and 109 bytes for the 0.5 MB flash.

We used 3 workloads typical of flash-containing embedded systems to evaluate the file system. The first workload simulates a fax machine. This workload is typical not only of fax machines, but of other devices that store fairly large files, such as answering machines, dictating devices, music players, and so on. The workload also exercises the transactions capability of the file system. This workload contains:

A parameter file with 30 variable length records, ranging from 4 to 32 bytes (representing the fax's configuration). This file is created, filed, and is never touched again. A phonebook file with 50 fixed-size records, 32 bytes each. This file is also created and filed but never accessed again.

Two history files consisting of 200 cyclic fixed-size records each. These files record the last 200 faxes sent and 200 last faxes received. These files are changed whenever a fax page is sent or received.

Each arriving fax consists of 4 pages, 51,300 bytes each. Each page is stored in a separate file and the pages of each fax are kept in a separate directory that is created when the fax arrives. The arrival of a fax triggers a transaction that creates a new record in the history file and creates a new directory for the file. The arrival of every new fax page adds changes the fax's record in the history file and creates a new file. Data are written to fax- page files in blocks of 1024 bytes. The simulation does not include sending faxes. The second workload simulates a cellular phone. This simulation represents workloads that mostly store small files or small records, such as beepers, text-messages, and so on. This workload consists of the following files and activities:

Three 20-record cyclic files with 15-byte records: one file for the last dialed numbers, one file for received calls, and one file for sent calls. Two SMS files: one file for incoming messages and one file for outgoing messages. Each variable-length record in these files stores one message. An appointments file, consisting of variable-length records. An address book file, consisting of variable-length records. The simulation starts by adding to the phone 150 appointments and 50 address book entries.

During the simulation, the phone receives and sends 3 SMS messages per day (3 in each direction), receives 10 calls and dials 10 calls, misses 5 calls, adds 5 new appointments and deletes the oldest 5 appointments. The third workload simulates an event recorder, such as a security or automotive "black box", a disconnected remote sensor, and so on. The simulation represents workloads with a few event-log files, some of which record frequent events and some of which record rare events (or perhaps just the extreme events from a high-frequency event stream). This simulation consists of three files:

One file records every event. This is a cyclic file with 32-byte records.

Another file records one event per 10 full cycles through the first file. This file, too, is cyclic with 32-byte records.

A configuration file with 30 variable-size records ranging from 4 to 32 bytes. These files are filled when the simulation starts and never accessed again.

Figure 6 presents the results of experiments intended to measure the storage overhead of

TFFS. In these simulations, we initialize the file system and then add data until the file system runs out of storage. In the fax workload, we add 4-page faxes to the file system until it fills. In the phone workload, we do not erase SMS messages. In the event-recording simulation, we replace the cyclic files by non-cyclic files.

For each workload, seven bars are shown: three bars for an 8MB flash with 64 KB erase units (denoted by "8192/64"), and four bars for a 448 KB flash with either 64KB or 2KB erase units ("448/64" and "448/2"). Two bars are for file systems whose B-trees have 7 — 14 children. The rest are for B-trees with 2 — 4 children. The scenarios denoted NSP describe a file system that does not use spare pointers.

The graph shows the amount of user data written to the file system before it ran out of flash storage, as a percentage of the total capacity of the device. For example, if 129,432 bytes of data were written to a flash file system that uses a 266,144 bytes flash, the capacity is 49%.

The groups of bars in the graph represent different device and file-system configurations: an 8 MB device with 64 KB erase units, a 448/64 KB device, and a 448/2 KB device; file systems with 2—4 children per tree node and file systems with 7—14 children; file systems with spare pointers and file systems with no spare pointers.

Clearly, storing large data extents, as in the fax workload, reduces storage overheads compared to storing small file records or data extents. Wider tree nodes reduce overheads when the leaves are small. The performance- and endurance-oriented experiments that are presented below, however, indicate that wider nodes degrade performance and endurance. A small number of erase units leads to high overheads. Small erase units reduce overheads except in the fax workload, in which the 1 KB data extents fragment the 2 KB erase units. The next set of experiments measures both endurance and performance. All of these experiments run until one of the erase units reaches an erasure count of 100,000; at that point, we consider the device worn out. We measure endurance by the amount of user data written to the file system as a percentage of the theoretical endurance limit of the device. For example, a value of 68% means that the file system was able to write 68% of the data that can be written on the device if wear is completely even and if only user data are written to the device.

We performed two groups of endurance experiment. The first group assesses the impact of file-system fullness and data life spans on TFFS' behavior. In particular, we wanted to understand how TFFS copes with a file system that is almost full and with a file system that contains a significant amount of static data. This group consists of three scenarios: a first scenario in which the file system remains mostly empty; a second scenario in which the file system is mostly full, with half the data never deleted or updated, and the other half updated cyclically; and a third scenario in which the file system is mostly full, most of the data are never updated, and a small portion of the data is updated cyclically. The results of these endurance experiments are shown in Figure 7.

The second group of endurance experiments assesses the impact of device characteristics and file-system configuration on TFFS' performance. This group includes the same device/file- system configurations as in the capacity experiments, but the devices were kept roughly two- thirds full, with half of the data static and the other half changing cyclically. The results of this group of endurance experiments are shown in Figure 8.

Each graph in Figures 7 and 8 shows the endurance of a file system that is always almost empty, of a file system that is almost always full and half of whose data are static, and of a full file system with almost only static data.

Figures 7 and 8 show that on the fax workload, endurance is good, almost always above 75% and sometimes above 90%. On the two other workloads endurance is not as good, never reaching 50%. This is caused not by early wear of a particular block, but by a large amount of file-system structures written to the device (because writes are performed in small chunks). The endurance of the fax workload on the device with 2 KB erase units is relatively poor because fragmentation forces TFFS to erase units that are almost half empty. The other significant fact that emerges from Figures 7 and 8 is that the use of spare pointers significantly improves endurance. As noted below, the use of spare pointers also significantly improves performance.

The next set of experiments was designed to measure the performance of TFFS. We measured several performance metrics under the different content scenarios (empty, full-half- static, and full-mostly-static file systems) and the different device/file- system configuration scenarios.

The first metric we measured was the average number of erasures per unit of user-data written. That is, on a device with 64 KB erase units, the number of erasures per 64 KB of user data written. The results were almost exactly the inverse of the endurance ratios (to within

0.5%). This implies that the TFFS wears out the devices almost completely evenly. When the file system performs few erases per unit of user data written, both performance and endurance are good. When the file system erases many units per unit of user data written, both metrics degrade. Furthermore, we have observed no cases where uneven wear leads to low endurance; low endurance is always correlated with many erasures per unit of user data written.

The second metric we measured was the efficiency of reclamations. We define this metric as the ratio of user data to the total amount of data written in block- write operations. The total amount includes writing of data to sectors, both when a sector is first created and when the sector is copied during reclamation, and copying of valid log entries during reclamation of the log. The denominator does not include writing of sector descriptors, erase-unit headers, and modifications of fields within sectors (fields such as spare pointers). A ratio close to 100% implies that little data is copied during reclamations, whereas a low ratio indicates that much valid data is copied during reclamation. The two graphs presenting this metric, Figures 9 and 10, show that the factors that affect reclamation efficiency are primarily fullness of the file system, the amount of static data, and the size of user data items. The results again show that spare pointers contribute significantly to high performance.

We also measured two other metrics: the number of programming operations per system call and the number of flash-read instructions per system call. These metrics do not count programming and read operations performed in the context of copying blocks; these are counted by the reclamation-efficiency metric. These metrics did not reveal any interesting behavior rather than to show (again) that spare pointers improve performance. Spare pointers improve these metrics by more than a factor of 2.

Hardware Example

Figure 11 is a simplified partial block diagram of a microcontroller 20 of the present invention. Microcontroller 20 includes a CPU 22, a RAM 24 and a NOR flash memory 26, that communicate with each other via a bus 28. Microcontroller 20 includes other components, such as ports for communicating with the device in which microcontroller 20 is embedded, that for simplicity are not shown in Figure 11. NOR flash memory 26 includes a boot block for booting microcontroller 20. NOR flash memory 26 also has stored therein executable code for TFFS. In one implementation of microcontroller 20, CPU 22 executes the TFFS code in-place to manage the files that are stored in NOR flash memory 26. In another implementation of microcontroller 20, CPU 22 copies the TFFS code from NOR flash memory 26 to RAM 24 and executes the TFFS code from RAM 24 to manage the files that are stored in NOR flash memory 26. NOR flash memory 26 is an example of a computer-readable code storage medium in which is embedded computer readable code for TFFS.

[1] P. L. Barrett, S. D. Quinn, and R. A. Lipe, "System for updating data stored on a flash- erasable, programmable, read-only memory (FEPROM) based upon predetermined bit value of indicating pointers," U.S. Patent 5 392 427, Feb. 21, 1995.

[2] W. J. Krueger and S. Rajagopalan, "Method and system for file system management using a flash-erasable, programmable, read-only memory," U.S. Patent 5 634 050, May

27, 1999.

[3] W. J. Krueger and S. Rajagopalan, "Method and system for file system management using a flash-erasable, programmable, read-only memory," U.S. Patent 5 898 868, Apr.

27, 1999. [4] W. J. Krueger and S. Rajagopalan, "Method and system for file system management using a flash-erasable, programmable, read-only memory," U.S. Patent 6 256 642, July

3, 2001. [5] A. Ban, "Flash file system," U.S. Patent 5 404 485, Apr. 4, 1995.

[6] A. Ban, "Flash file system optimized for page-mode flash technologies," U.S. Patent

5 937 425, Aug. 10, 1999. [7] N. Daberko, "Operating system including improved file management for use in devices utilizing flash memory as main memory," U.S. Patent 5 787 445, July 28, 1998. [8] A. Kawaguchi, S. Nishioka, and H. Motoda, "A flash-memory based file system," in

Proceedings of the USENIX 1995 Technical Conference, New Orleans, Louisiana, Jan.

1995, pp. 155-164. [9] H. Dai, M. Neufeld, and R. Han, "ELF: An efficient log- structured flash file system for wireless micro sensor nodes," in Proceedings of the 2nd ACM Conference on Embedded Networked Sensor Systems (SenSys), Nov. 2004, pp. 176-187. While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention may be made.

Claims

WHAT IS CLAIMED IS:

1. A data structure for storing a plurality of records in a memory of a computer system and for retrieving the records, the data structure comprising:

(a) a current version of a tree, each of whose leaves includes a data object having a retrieval key for one of the records; and

(b) a single previous version of said tree.

2. The data structure of claim 1, wherein said current version is a read-write version and said previous version is a read-only version.

3. The data structure of claim 1, wherein each internal node of each said tree includes at least one spare pointer.

4. The data structure of claim 3, wherein each said internal node includes only one said spare pointer.

5. The data structure of claim 3, wherein each said internal node includes a commit flag.

6. The data structure of claim 1 , wherein said tree is a B-tree.

7. A file system comprising at least one data structure of claim 1 for storing a plurality of files in a memory of a computer system and for retrieving the files.

8. The file system of claim 7, for storing and retrieving said files via a plurality of directories, and comprising one said data structure for each said file and one said data structure for each said directory.

9. The file system of claim 8, further comprising one said data structure for mapping names of said directories to metadata records of said directories.

10. The file system of claim 8, further comprising one said data structure for mapping globally unique identifiers of said files and of said directories to said data structures of said files and of said directories.

11. The file system of claim 8, further comprising one said data structure for storing and retrieving transaction identifiers of said data structures of said files and for storing and retrieving transaction identifiers of said data structures of said directories.

12. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer readable code comprising program code for the file system of claim 7.

13. A method of storing a plurality of records in a memory of a computer system and of retrieving the records, comprising the steps of:

(a) providing at most two versions of a tree, each of whose leaves includes a data object having a retrieval key for one of the records, a first version of said tree being a current version of said tree and a second version of said tree being a previous version of said tree; and

(b) updating said tree by steps including:

(i) changing only said current version of said tree.

14. The method of claim 13, wherein said current version is a read- write version and said previous version is a read-only version.

15 The method of claim 13 , wherein said updating further includes:

(ii) committing said changes of said current version;

(iii) substituting said committed current version for said previous version; and

(iv) substituting a copy of said committed current version for said current version.

16. The method of claim 15, further comprising the step of:

(c) providing each internal node of said tree with a spare pointer that includes a commit flag; and wherein said committing of said changes to said current version includes clearing said commit flags of said current version.

17. The method of claim 13, wherein said tree is a B-tree.

18. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of records in a memory of a computer system and for retrieving the records, the computer-readable code comprising:

(a) program code for providing at most two versions of a tree, each of whose leaves includes a data object having a retrieval key for one of the records, a first version of said tree being a current version of said tree and a second version of said tree being a previous version of said tree; and

(b) program code for updating said tree by steps including changing only said current version of said tree.

19. A persistent data structure for storing a plurality of records in a flash memory of a computer system and for retrieving the records.

20. The persistent data structure of claim 19, comprising a persistent tree.

21. The persistent data structure of claim 20, wherein said persistent tree is a B- tree.

22. The persistent data structure of claim 20, wherein said persistent tree includes a plurality of leaves, each said leaf including a data object having a retrieval key for one of the records.

23. The persistent data structure of claim 20, wherein said persistent tree includes a plurality of internal nodes, each said internal node including at least one spare pointer.

24. The persistent data structure of claim 23, wherein each said internal node includes only one said spare pointer.

25. The persistent data structure of claim 23 wherein each said spare pointer includes a commit flag.

26. The persistent data structure of claim 23, wherein each said spare pointer includes an abort flag.

27. The persistent data structure of claim 19, comprising:

(b) a single previous version of said tree.

28. A file system comprising at least one persistent data structure of claim 19 for storing a plurality of files in a flash memory of a computer system and for retrieving the files.

29. The file system of claim 28, comprising:

(a) for each file, a corresponding data structure of claim 19 that includes a mechanism for recording tentative changes to said data structure by transactions of the file system; and

(b) a log of indications of said tentative changes.

30. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code comprising program code for the file system of claim 28.

31. A method of storing a plurality of files in a nonvolatile memory of a computer system and of retrieving the files, comprising the steps of:

(a) providing, for each file, a corresponding data structure of claim 19 that includes a mechanism for recording tentative changes to said data structure;

(b) recording said tentative changes in a log in the nonvolatile memory;

(c) storing, in the nonvolatile memory, a witness record corresponding to said log;

(d) storing, in said log, a witness marker that points to said non-volatile memory; and (e) upon booting the computer system, using said witness marker and said witness record to verify an identity of said log.

32. A method of storing a plurality of records in a flash memory of a computer system and of retrieving the records, comprising the steps of:

(a) providing a persistent tree that includes a plurality of leaves and a plurality of internal nodes, each said leaf including a data object having a retrieval key for one of the records, each said internal node including at least one spare pointer; and

(b) changing at least one said internal node by modifying one of said at least one spare pointer of each said at least one internal node.

33. The method of claim 32, wherein said modifying of said spare pointer of one of said at least one internal nodes includes modifying said spare pointer to point to a new child node.

34. The method of claim 32, wherein each said internal node includes only one said spare pointer.

35. The method of claim 32, further comprising the steps of:

(c) providing each said spare pointer with a commit flag; and

(d) committing said at least one change by steps including clearing said commit flag of said modified spare pointer of each said at least one changed internal node.

36. The method of claim 32, further comprising the steps of:

(c) providing each said spare pointer with an abort flag; and

(d) canceling said at least one change by steps including clearing said abort flag of said modified spare pointer of each said at least one changed internal node.

37. The method of claim 32, wherein said tree is a B-tree.

38. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of records in a flash memory of a computer system and for retrieving the records, the computer-readable code comprising:

(a) program code for providing a persistent tree that includes a plurality of leaves and a plurality of internal nodes, each said leaf including a data object having a retrieval key for one of the records, each said internal node including at least one spare pointer; and

(b) program code for changing at least one said internal node by modifying one of said at least one spare pointer of each said at least one internal node.

39. A file system, for storing a plurality of files in a memory of a computer system and for retrieving the files, comprising:

(a) for each file, a corresponding data structure including a mechanism for recording tentative changes to said data structure by transactions of the file system; and

(b) a log of indications of said tentative changes.

40. The file system of claim 39, wherein said data structures are trees, and wherein said mechanism of each said tree includes, for each internal node of said each tree, a spare pointer that includes a commit flag that is cleared in accordance with said log when said tentative changes are committed and an abort flag that is cleared in accordance with said log when said tentative changes are aborted.

41. The file system of claim 39, wherein said trees are pruned version trees.

42. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code comprising program code for the file system of claim 39.

43. A method of storing a plurality of files in a nonvolatile memory of a computer system and of retrieving the files, comprising the steps of: (a) providing, for each file, a corresponding data structure including a mechanism for recording tentative changes to said data structure; and

(b) logging indications of said tentative changes.

44. The method of claim 43, further comprising the step of:

(c) committing said tentative changes in accordance with said indications.

45. The method of claim 43, further comprising the step of:

(c) aborting said tentative changes in accordance with said indications.

46. The method of claim 43, wherein said logging includes recording said indications in a log in the nonvolatile memory, the method further comprising the steps of:

(c) storing, in the non- volatile memory, a witness record corresponding to said log;

(d) storing, in said log, a witness marker that points to said non-volatile memory; and

(e) upon booting the computer system, using said witness marker and said witness record to verify an identity of said log.

47. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for storing a plurality of files in a nonvolatile memory of a computer system and for retrieving the files, the computer code comprising:

(a) program code for providing, for each file, a corresponding data structure including a mechanism for recording tentative changes to said data structure; and

(b) program code for logging indications of said tentative changes.

48. In a computer system that includes a nonvolatile memory wherein is stored a data structure having a root, a method of managing the data structure so that the root of the data structure is found unambiguously at boot time, comprising the steps of:

(a) storing, in the nonvolatile memory, a witness record corresponding to the data structure; (b) storing, in a header of the root, a witness marker corresponding to said witness record; and

(c) upon booting the computer system: upon finding two candidate headers, each said candidate header including a witness marker: accepting, as said header of the root, said candidate header whose witness marker corresponds to said witness record.

49. The method of claim 48, wherein said witness marker that corresponds to said witness record includes a pointer to said witness record.

50. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code for managing a data structure, having a root, that is stored in a nonvolatile memory of a computer system, so that the root of the data structure is found unambiguously at boot time, the computer-readable code comprising:

(a) program code for storing, in the nonvolatile memory, a witness record corresponding to the data structure;

(b) program code for storing, in a header of the root, a witness marker corresponding to said witness record; and

(c) program code for: upon booting the computer system: upon finding two candidate headers, each said candidate header including a witness marker: accepting, as said header of the root, said candidate header whose witness marker corresponds to said witness record.