Lazy Writes and Journaling Checkpoints: What Are Journaled File Systems?

The Present and Future of Journaling

There are many definitions of journaling file systems, but let's put it in terms everyone can understand: journaling is for people who are tired of the fsck boot-time checker. It is also for people who like the idea of failure-tolerant systems. If the power is cut off incorrectly on an ordinary system with no journaling, the OS detects this at the next boot and runs the fsck disk-integrity utility. The utility scans the file system and tries to fix problems without harming the data. The check can take quite a long time. Sometimes the file system is damaged so badly that the OS boots only into single-user mode and asks the user to carry out further recovery.

Say fsck

What is worse, the operating system may run fsck automatically when mounting a file system, simply to confirm that the metadata is correct (even if there was no corruption at all). Eliminating these unnecessary integrity checks is therefore an obvious area for improvement.

So now you know who needs journaling file systems; why do such systems not need fsck checks? In short, because they maintain a special journal. The journal is a file organized as a ring buffer in which every change to the file system is recorded. Periodically these changes are applied to the file system itself. After a failure, the journal serves as a starting point for recovering unsaved data and preventing metadata corruption.
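
To make the idea concrete, here is a minimal, purely illustrative sketch in C (not the code of any real file system; the record format is invented): change records are appended to a ring-buffer journal, a commit marker seals the transaction, and only afterwards are the changes applied in place.

    /* Minimal write-ahead journal sketch: a ring buffer of change records.
     * Purely illustrative; real journals (JBD, the XFS log, NTFS $LogFile)
     * are far more elaborate. */
    #include <stdio.h>

    #define JOURNAL_SLOTS 8          /* ring buffer capacity */

    struct record {
        int  txid;                   /* transaction this record belongs to */
        int  commit;                 /* 1 = commit marker, 0 = change record */
        char descr[64];              /* description of the metadata change */
    };

    static struct record journal[JOURNAL_SLOTS];
    static int head;                 /* next free slot (wraps around) */

    static void journal_append(int txid, int commit, const char *descr)
    {
        struct record *r = &journal[head];
        head = (head + 1) % JOURNAL_SLOTS;   /* ring buffer: old slots get reused */
        r->txid = txid;
        r->commit = commit;
        snprintf(r->descr, sizeof r->descr, "%s", descr);
        /* a real file system would write this slot to the journal area on disk
         * and wait for the write to complete before going any further */
    }

    int main(void)
    {
        /* one logical operation = one transaction */
        journal_append(1, 0, "inode 12: link count 1 -> 2");
        journal_append(1, 0, "dir 7: add entry 'newlink' -> inode 12");
        journal_append(1, 1, "commit");          /* transaction 1 is now durable */

        /* only after the commit record is safe are the in-place structures
         * updated; a crash before this point is harmless, because recovery
         * simply replays or discards transaction 1 */
        for (int i = 0; i < head; i++)
            printf("tx %d %s: %s\n", journal[i].txid,
                   journal[i].commit ? "COMMIT" : "change", journal[i].descr);
        return 0;
    }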

To summarize, a journaled file system is a crash-tolerant file system in which modifications are logged before they are carried out, thereby avoiding metadata corruption (see Figure 1). As usual with Linux, there are many variations of such systems. Let's take a short look at the history of these file systems, and then review the ones available today and their differences.

What is metadata?

Metadata are the bookkeeping structures needed to store the actual data. Operations such as creating and deleting files and directories, growing a file, truncating a file, and so on all affect metadata.

Figure 1. A typical journaled file system.

History of Linux Journaled File Systems

IBM® was the first to develop a journaled file system, called JFS (Journaled File System). The first version of JFS appeared in 1990; the modern version supported on Linux, JFS2, was developed later. In 1994, Silicon Graphics introduced the high-performance XFS file system for the IRIX OS; in 2001 XFS was ported to Linux. In 1998, the Smart File System (SFS) was developed for Amiga systems; it was later released under the GNU Lesser General Public License (LGPL) and gained Linux support in 2005. The most widely used journaled file system is ext3fs (third extended file system), an extension of ext2 with journaling added; ext3fs support appeared in Linux in 2001. Finally, the widely used journaled file system ReiserFS opened up many new paths and development opportunities, but its development slowed because of the legal problems of its author.

Types of Logging

All journaled file systems keep a journal to buffer file system changes (it is also what makes crash recovery possible), but they differ in what is journaled and when. The three most common strategies are writeback mode, ordered mode, and data mode.

In writeback mode, only metadata is journaled and data blocks are written directly to disk. This preserves the file system structure and protects it against corruption, but the data itself can still be damaged (for example, if the system crashes after the metadata has been journaled but before the data block has been written). Ordered mode solves this problem: again only metadata is journaled, but the data is written to disk before the metadata is committed, which keeps the data consistent after recovery. Finally, in data mode both metadata and data are journaled. This mode gives the highest resistance to corruption and data loss, but it suffers from lower performance, because all data is written twice (first to the journal, then to disk).
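
As an illustration of what ordered mode buys you, here is a hedged sketch from "user space" (the file names and the record text are invented for the example): the data block is forced to disk first, and only then is the metadata change committed to the journal.

    /* Sketch of the ordering that ordered-mode journaling imposes: data
     * reaches stable storage before the metadata record is committed.
     * File names and the record format are invented for this example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char data[] = "new contents of block 42";
        char meta[80];

        int datafd = open("datafile.img", O_WRONLY | O_CREAT, 0644);
        int logfd  = open("journal.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (datafd < 0 || logfd < 0) { perror("open"); return 1; }

        /* 1. write the data itself ... */
        if (write(datafd, data, sizeof data - 1) < 0) perror("write data");
        /* 2. ... and force it to stable storage */
        fsync(datafd);

        /* 3. only now commit the metadata change to the journal */
        snprintf(meta, sizeof meta, "inode 5: size -> %zu, block 42 allocated\n",
                 sizeof data - 1);
        if (write(logfd, meta, strlen(meta)) < 0) perror("write journal");
        fsync(logfd);

        /* A crash between steps 2 and 3 leaves old metadata plus orphaned data,
         * but never metadata that points at garbage - that is ordered mode.
         * Writeback mode skips the ordering; data mode would journal the data
         * block in step 3 as well, at the cost of writing it twice. */
        close(datafd);
        close(logfd);
        return 0;
    }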

The rules for applying the changes recorded in the journal also differ between implementations. For example, when should changes be applied: when the journal is full, or when a certain timeout expires?

Journaled File Systems Today

Today, several journaling file systems are actively used, each of which has its own advantages and disadvantages. Below are the four most popular journaling file systems.

JFS2

JFS2 (also known as the enhanced journaled file system) is the first journaled file system; it was used for a long time on IBM's AIX® operating system before being ported to Linux. JFS2 is a 64-bit file system that, while rooted in the original JFS, has been significantly improved in scalability and support for multiprocessor architectures.

JFS2 supports ordered journaling, high performance, and sub-second recovery times. To improve performance it uses extent-based file allocation: a file is placed in a few contiguous regions instead of many individual blocks. Because they are contiguous, these regions are faster to read and write. An additional benefit of extents is lower metadata overhead: with block-based allocation, metadata is recorded for every block, whereas with extents, metadata is recorded only per extent, and each extent typically spans many blocks.
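
The difference in metadata volume is easy to see in a sketch (the structure layouts below are invented for illustration and are not JFS2's on-disk format): a block map needs one entry per block, while an extent descriptor covers an entire contiguous run.

    /* Illustration of block-mapped vs extent-based placement; these
     * structures are invented for the example, not JFS2's on-disk format. */
    #include <stdio.h>
    #include <stdint.h>

    /* block mapping: one entry per file block */
    struct block_map_entry {
        uint64_t file_block;     /* logical block number inside the file */
        uint64_t disk_block;     /* where that single block lives on disk */
    };

    /* extent: one entry per contiguous run of blocks */
    struct extent {
        uint64_t file_block;     /* first logical block covered */
        uint64_t disk_block;     /* first disk block of the run */
        uint32_t length;         /* number of contiguous blocks */
    };

    int main(void)
    {
        /* a 1000-block file stored in just two contiguous runs */
        struct extent map[] = {
            { 0,   400,  600 },  /* blocks 0..599   at disk blocks 400..999   */
            { 600, 8530, 400 },  /* blocks 600..999 at disk blocks 8530..8929 */
        };

        printf("extent map: %zu entries instead of 1000 block-map entries\n",
               sizeof map / sizeof map[0]);
        printf("one block-map entry: %zu bytes, one extent: %zu bytes\n",
               sizeof(struct block_map_entry), sizeof(struct extent));
        return 0;
    }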

JFS2 also uses B+ trees, both for fast directory lookups and for managing extent descriptors. JFS2 has no policy of its own for flushing changes to disk; instead it relies on the timeout of the kupdate daemon.

XFS

XFS is another early journaling file system, originally developed by Silicon Graphics in 1995 for the IRIX OS. When XFS was ported to Linux in 2001, it was already a mature, well-designed, and reliable file system.

XFS uses full 64-bit addressing and delivers very high performance thanks to B+ trees used for laying out directories and files. XFS stores data as extents and supports variable extent sizes (from 512 bytes to 64 kilobytes). Alongside extents, XFS uses lazy allocation, which postpones allocating blocks until it is time to write them to disk. This makes it more likely that several consecutive disk blocks can be filled, because the number of blocks required is already known at write time.

Other interesting properties of XFS are guaranteed-rate I/O, in which file system users are allocated reserved I/O bandwidth, and direct I/O, which copies data directly between the disk and the application buffer (instead of passing it through several intermediate buffers). Journaling in XFS is done in writeback mode.
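
A small sketch of direct I/O from an application's point of view: with the Linux-specific O_DIRECT flag the buffer must be suitably aligned (the 4096-byte alignment below is an assumption about the device's block size), and data moves between the user buffer and the disk without passing through the page cache.

    /* Direct I/O sketch: O_DIRECT bypasses the page cache, so buffer,
     * offset and length must be aligned; 4096 bytes is assumed here and
     * may differ on your device.  Linux-specific. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096, len = 4096;
        void *buf = NULL;

        if (posix_memalign(&buf, align, len) != 0) return 1;
        memset(buf, 'x', len);

        int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open(O_DIRECT)"); free(buf); return 1; }

        /* the kernel copies straight from this buffer to the device,
         * without staging the data in the page cache */
        if (pwrite(fd, buf, len, 0) < 0) perror("pwrite");

        close(fd);
        free(buf);
        return 0;
    }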

Third extended file system (ext3fs)

The third extended file system (ext3fs) is the most popular journaled file system; it arose as an evolution of the well-known ext2 file system. It is in fact compatible with ext2, since it operates on identical structures with just a journal added. It is even possible to mount an ext3 partition as ext2, or to convert ext2 to ext3 with the tune2fs utility.

ext3fs supports all three journaling strategies (writeback, ordered, and data mode), with ordered mode as the default. The policy for committing the journal to disk is configurable; by default, a commit happens either when the journal is one quarter full or when one of the commit timers expires.
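
The mode is chosen with the data= mount option. Here is a sketch using the mount(2) system call (the device and mount point are placeholders, and root privileges are required; in practice the same option usually goes into /etc/fstab or onto the mount command line).

    /* Selecting the ext3 journaling mode with the data= mount option.
     * /dev/sdb1 and /mnt/test are placeholders; requires root.
     * Roughly equivalent to: mount -t ext3 -o data=journal /dev/sdb1 /mnt/test */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* possible values: "data=writeback", "data=ordered" (the default),
         * "data=journal" */
        if (mount("/dev/sdb1", "/mnt/test", "ext3", 0, "data=journal") != 0) {
            perror("mount");
            return 1;
        }
        printf("mounted with full data journaling\n");
        return 0;
    }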

One of the main drawbacks of ext3fs stems from the fact that it was not originally designed as a journaling file system. Because it is based on ext2fs, it lacks many of the advanced features found in other file systems (such as extents). It also generally performs worse than ReiserFS, JFS, and XFS, but it is less CPU- and memory-hungry than many of them.

ReiserFS

What is tail packing?

It often happens that a file is smaller than a logical block. Instead of allocating a whole block for each such file and leaving part of it unoccupied (this unused part is called the tail), several files can be packed into one block. This technique yields roughly 5% more usable space than other file systems, but it has a negative impact on performance.

The ReiserFS file system was designed from the very beginning as a journaling file system. In 2001 it was merged into the mainline 2.4 kernel and became the first journaling file system to appear in Linux. Its primary journaling mode is ordered journaling. On-the-fly expansion of the file system is supported. ReiserFS also supports tail packing to dynamically reduce fragmentation, which lets it outperform ext3fs when working with small files.

ReiserFS (also called ReiserFS v3) uses many modern techniques, such as B+ trees. The file system format is based on a single B+ tree, which makes search operations particularly fast and scalable. The policy for committing the journal to disk depends on the journal size and is driven by the number of blocks awaiting commit.

The reputation of ReiserFS has been damaged more than once, most recently by the author's legal troubles.

The Future of Journaled File Systems

Having seen journaled file systems of the present and past, let's look at what the future holds (and what doesn't).

Reiser4

After ReiserFS was successfully merged into the kernel and adopted by many Linux distributions, Namesys (the company behind ReiserFS) began work on a new journaled file system, Reiser4, built entirely from scratch and incorporating many advanced features.

Improved journaling in Reiser4 is achieved through wandering logs and by deferring block allocation until journal data is committed (as XFS does). The Reiser4 architecture included flexible plug-in support (to add compression or encryption, for example), but this idea was rejected by the Linux community, which felt that such features belonged in the virtual file system (VFS) layer.

After the indictment of the owner of Namesys, who was also the author of ReiserFS, all commercial activity around Reiser4 was suspended.

Fourth extended file system

The fourth extended file system (ext4fs) is a further development of ext3fs. Ext4fs was designed as a replacement for ext3fs, forward and backward compatible with it but including many improvements (some of which break that compatibility). In practice, you can mount an ext4 partition as ext3 and vice versa.

First, ext4fs is a 64-bit file system with support for huge volumes (up to 1 exabyte). It can also use extents, although doing so breaks compatibility with ext3fs. Like XFS and Reiser4, ext4fs delays block allocation, performing it only when actually needed (which reduces fragmentation). The journal also stores checksums of its contents for greater reliability. Instead of B+ or B* trees, a special kind of B-tree called an HTree is used for directories, which allows them to be much larger (in ext3 a directory is limited to roughly 32,000 subdirectories).

Although lazy allocation reduces fragmentation, a large file system still fragments over time. To address this, the e4defrag utility was developed; it can defragment individual files or an entire file system.

Another interesting difference between ext4fs and ext3fs is the precision of file timestamps. In ext3, timestamp resolution is one second. Ext4fs looks to the future: as processor and interface speeds keep growing, finer resolution is needed, so the timestamp resolution was set to one nanosecond.
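
The difference is visible directly through stat(2): on a file system with one-second timestamps the nanosecond field is always zero, while ext4 can fill it in. A small sketch (the path is just whatever you pass on the command line):

    /* Reading a file's modification time with nanosecond precision.
     * On a file system with one-second timestamps tv_nsec stays 0;
     * ext4 can store the nanosecond part as well. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        struct stat st;

        if (stat(path, &st) != 0) { perror("stat"); return 1; }

        printf("%s mtime: %ld.%09ld\n", path,
               (long)st.st_mtim.tv_sec, (long)st.st_mtim.tv_nsec);
        return 0;
    }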

Although ext4fs was merged into the Linux kernel in version 2.6.19, it is still under active development. It is the starting point for the journaled file system of the future on Linux.

Moving on

Journaled file systems provide reliability and protection against data corruption in the event of a system crash or power loss. In addition, recovery in such systems is much faster than in traditional file systems (those that rely on fsck). The development of new journaling methods draws both on past experience from JFS and XFS and on the search for new algorithms and structures. It is not entirely clear how journaled file systems will evolve, but their usefulness is obvious, and they have already become the new standard for file systems.

Journaling in NTFS

NTFS is a fault-tolerant system that can restore itself to a correct state after almost any real failure. Any modern file system is built around the concept of a transaction: an action that is either performed completely and correctly or not performed at all. NTFS simply has no intermediate (erroneous or inconsistent) states; a unit of change cannot be split by a failure into a "before" and an "after" that bring destruction and confusion - it is either committed or rolled back.

Example 1: data is being written to disk. Suddenly it turns out that writing to the place chosen for the next portion of data is impossible - physical damage to the surface. NTFS behaves quite logically in this case: the write transaction is rolled back in its entirety, so the system knows the write did not happen; the location is marked as bad, the data is written to another spot, and a new transaction begins.

Example 2: a more complicated case - data is being written to disk when the power suddenly goes off and the system reboots. At what stage did the write stop; where is there data and where is there not? Here another mechanism comes to the rescue: the transaction log. Before writing to disk, the system records its intention in the $LogFile metafile. On reboot, this file is examined for unfinished transactions that were interrupted by the failure and whose outcome is unpredictable; all such transactions are rolled back: the space being written to is marked as free again, indexes and MFT entries are returned to the state they were in before the failure, and the system as a whole stays consistent. And what if an error occurs while writing to the log itself? That is also fine: the transaction either has not started yet (only the intention to carry it out was being recorded) or has already finished (what was being written was the note that the transaction had in fact completed). In the latter case, at the next boot the system will work out that everything was actually written correctly anyway and will ignore the "unfinished" log entry.

Still, remember that journaling is not an absolute panacea, only a way to reduce the number of errors and failures considerably.

Experience shows that NTFS recovers to a fully correct state even after failures at moments of heavy disk activity. You can even defragment the disk and hit reset in the middle of the process; the chance of data loss even then is very small. It is important to understand, however, that NTFS recovery guarantees the correctness of the file system, not of your data. If you were writing to disk when the crash happened, your data may simply never have been written, even though the file system structures themselves remain intact.

Compression

NTFS files have one rather useful attribute: "compressed". NTFS has built-in support for disk compression, the kind of thing you previously needed Stacker or DoubleSpace for. Any file or directory can individually be stored on disk in compressed form, and the process is completely transparent to applications. Compression is very fast and has only one large drawback: the enormous virtual fragmentation of compressed files, which, admittedly, does not really bother anyone. Compression is performed in blocks of 16 clusters and uses so-called "virtual clusters" - again an extremely flexible solution that allows interesting effects, for example half of a file can be compressed while the other half is not. This is possible because storing information about which fragments are compressed looks very much like ordinary file fragmentation. For example, here is a typical record of the physical layout of a real, uncompressed file:

file clusters from 1 to 43 are stored in disk clusters starting from 400

file clusters from 44 to 52 are stored in disk clusters starting from 8530...

Physical layout of a typical compressed file:

file clusters 1 to 9 are stored in disk clusters starting from 400

file clusters 10 to 16 are not stored anywhere

file clusters from 17 to 18 are stored in disk clusters starting from 409

file clusters from 19 to 36 are not stored anywhere

Clearly, a compressed file has "virtual" clusters that contain no real information. As soon as the system sees such virtual clusters, it knows that the data of the preceding block (a multiple of 16 clusters) must be decompressed, and the resulting data will fill exactly those virtual clusters - that, in essence, is the whole algorithm.
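
The same layout can be sketched as a run list in code. This is a conceptual model only, not the real NTFS on-disk format: runs that map to real disk clusters alternate with runs that are "not stored anywhere" and are reconstructed by decompressing the preceding 16-cluster block.

    /* Conceptual model of a compressed file's run list; NOT the real NTFS
     * on-disk format, just an illustration of "virtual" runs. */
    #include <stdio.h>
    #include <stdint.h>

    #define NOT_STORED UINT64_MAX    /* marks a run with no physical clusters */

    struct run {
        uint64_t first_vcn;          /* first virtual (file) cluster of the run */
        uint64_t length;             /* number of clusters in the run */
        uint64_t first_lcn;          /* first disk cluster, or NOT_STORED */
    };

    int main(void)
    {
        /* the compressed-file example from the text */
        struct run file[] = {
            { 0,  9,  400 },         /* file clusters 1..9 at disk cluster 400   */
            { 9,  7,  NOT_STORED },  /* clusters 10..16: produced by decompression */
            { 16, 2,  409 },         /* file clusters 17..18 at disk cluster 409 */
            { 18, 18, NOT_STORED },  /* clusters 19..36: produced by decompression */
        };

        for (size_t i = 0; i < sizeof file / sizeof file[0]; i++) {
            if (file[i].first_lcn == NOT_STORED)
                printf("run %zu: %llu virtual clusters, decompress previous block\n",
                       i, (unsigned long long)file[i].length);
            else
                printf("run %zu: %llu clusters at disk cluster %llu\n",
                       i, (unsigned long long)file[i].length,
                       (unsigned long long)file[i].first_lcn);
        }
        return 0;
    }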

During normal operation of a file system, all changes are usually made to disk immediately (more precisely, to the OS disk cache, but that is not important in this context).


Many operations require simultaneous changes to several file system structures (metadata). A simple example: when creating a hard link, you must simultaneously increment the inode's link count and change the contents of the directory in which the link is created. Doing only one of these would leave the file system in an incorrect state.
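
Seen from user space it is a single call, but underneath it are two metadata updates that must happen together. A small sketch (the file names are arbitrary examples):

    /* Creating a hard link: one system call, two metadata updates that the
     * file system must make atomically - the inode's link count and the
     * directory contents.  File names here are arbitrary examples. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("original.txt", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);

        /* one call, two metadata changes: inode link count 1 -> 2 plus a new
         * entry in the directory; a journaled file system wraps both in a
         * single transaction so a crash cannot leave just one of them */
        if (link("original.txt", "hardlink.txt") != 0) { perror("link"); return 1; }

        struct stat st;
        if (stat("original.txt", &st) == 0)
            printf("link count is now %lu\n", (unsigned long)st.st_nlink);
        return 0;
    }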


During normal file system operation such a compound operation is always carried out in full, unless the file system implementation contains critical bugs. But with an abnormal reboot or a hardware failure, a half-finished operation is entirely possible.


Since after the reboot we do not know which operations were in progress and which were left unfinished - we only know that the disk was not cleanly unmounted (the so-called dirty flag tells us this) - we have to analyze the file system across the entire disk in order to find and correct every inconsistency. Naturally, this cannot always be done automatically (nobody has yet managed to teach software clairvoyance), so the same fsck.ext2 may require manual intervention after an abnormal reboot.


Anyone who has run fsck on a 100-200 GB partition (hardly uncommon these days) knows there is little pleasure in it. Administrators of multi-terabyte arrays, for whom an extra minute of downtime can cost them their heads, reach for the valerian at the very word fsck, or ask you not to use such language in their presence.


To solve this problem, a brilliant idea was invented long ago (if anyone knows when and by whom, please tell me): first write a description of the planned operation to disk, and only then perform it. Then there is no need to check the entire disk for correctness; it is enough to look through the journal and, if an operation was not completed, roll it back. No fsck run is needed for this - the file system driver does it itself.
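
Here is a hedged sketch of what the driver does at mount time instead of a full fsck (the record layout is invented; real implementations such as ext3's JBD are more involved): scan the journal, apply the transactions that reached their commit record, and throw the rest away.

    /* Journal replay at mount time, sketched: committed transactions are
     * re-applied, uncommitted ones are discarded.  The record layout is
     * invented for the example. */
    #include <stdio.h>

    struct record { int txid; int commit; const char *descr; };

    int main(void)
    {
        /* journal contents found on disk after a crash */
        struct record journal[] = {
            { 1, 0, "inode 12: link count 1 -> 2" },
            { 1, 0, "dir 7: add entry 'newlink'" },
            { 1, 1, "commit" },                    /* tx 1 completed: replay it  */
            { 2, 0, "inode 30: size 0 -> 4096" },  /* tx 2 has no commit: discard */
        };
        const int n = sizeof journal / sizeof journal[0];

        for (int tx = 1; tx <= 2; tx++) {
            int committed = 0;
            for (int i = 0; i < n; i++)
                if (journal[i].txid == tx && journal[i].commit) committed = 1;

            for (int i = 0; i < n; i++)
                if (journal[i].txid == tx && !journal[i].commit)
                    printf("tx %d %s: %s\n", tx,
                           committed ? "REPLAY" : "DISCARD", journal[i].descr);
        }
        return 0;
    }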


To summarize: the only thing a journaled file system can and must do is save you the time spent on fsck. What it guarantees is the consistency of the file system metadata - no more, no less.


The price of this pleasure: a small disk area (usually tens of megabytes) takes the brunt of the load, so peak performance, measured in I/O operations per second, drops. And, of course, a little disk space is spent, which in an era of disk prices below $1 per gigabyte bothers nobody.

Data Logging

As you may have noticed, it is usually metadata operations that are written to the journal. However, the same can be done with data.


As far as I know, only ext3 with the data=journal parameter can perform data logging on Linux.


Of course, data journaling reduces performance somewhat in many cases (but not in all: the IBM website has test results showing that data journaling on file systems hosting databases can even improve performance).


This tool does not guarantee data safety either; however, in my personal experience ext3 with data=journal is the most reliable file system.

Performance

An attentive reader will have noticed that using a journal creates an uneven load on the disk: one small area (compared to the overall size of the file system) receives a disproportionate share of the operations.


There are two very interesting solutions.
Firstly, you can move the journal to a separate disk (most file systems allow this); the result is that we effectively double performance by adding just one disk. It looks especially nice when the performance of a huge RAID array is boosted in such a simple and cheap way.


Secondly, you can use special non-volatile memory cards (for example, UMEM, which, alas, I have not seen on sale in Russia), which are also noticeably faster than ordinary disks (but small in capacity).


There is also a completely extravagant solution that I have not yet tried: putting the journal on a block device located in memory. Of course, after a reboot such a file system has to be recreated, but for temporary data this can give an interesting and noticeable performance boost, especially when journaling data rather than just metadata.

Tricks

As you have already seen, the journal can also give a speed boost. There are a few more ingenious tricks a journaled file system can use to squeeze out even more performance:

  • delayed file creation (when a file is created, do not immediately create the directory entry, but keep it only in the journal for a while; the file may be temporary and about to be deleted anyway);
  • delayed allocation (do not physically allocate space, even for the first block of a file, until at least one block actually needs to be written); it is quite possible that the user will first set the file size and only then start writing data, which reduces fragmentation if programs use this pattern (a small sketch of that pattern follows the list).
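
A sketch of that access pattern from the application side (the file name is arbitrary): the final size is declared up front and the data is written afterwards, so a file system with delayed allocation already knows how many blocks it will need when it finally places them.

    /* The access pattern that benefits from delayed allocation: the program
     * sets the final size first and writes the data afterwards, so when the
     * blocks are finally allocated their total number is already known and
     * they can be placed contiguously.  The file name is arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char chunk[] = "some data\n";

        int fd = open("lazy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* 1. declare the final size up front ... */
        if (ftruncate(fd, 1024 * 1024) != 0) { perror("ftruncate"); return 1; }

        /* 2. ... then fill it in; a file system with delayed allocation can
         *    postpone picking disk blocks until the dirty pages are flushed */
        for (int i = 0; i < 1000; i++)
            if (pwrite(fd, chunk, sizeof chunk - 1, (off_t)i * (sizeof chunk - 1)) < 0) {
                perror("pwrite");
                break;
            }

        close(fd);
        return 0;
    }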

These are the simplest; there are many more small tricks that allow a journaled file system to work faster than a regular one while remaining more reliable.

Flaws

As I have already said, the journal is not a panacea and does not save your data at all. Yet many people get a false sense of security from journaled file systems: after all, you can reboot the machine with the reset button and it will not even complain during boot!


True, it will not complain, and from the point of view of fsck the file system will be perfectly correct. Only scraps of your data may be all that remains.


Say, reiserfs in such situations may well leave garbage in the files being modified (arbitrary data that happened to be in the block allocated for the file), which essentially means a very likely accidental leak of information.


XFS behaves more correctly: it writes such blocks as zeros, which often shocks users - especially fans of reiserfs, which does not write zeros.


As a result, reiserfs is more likely to preserve your modifications, while XFS does its best to avoid garbage in files and data leaks; they are simply slightly different strategies. The outcome is the same: the data may be lost, and you will not even know about it until you stumble on a file that nobody has touched for a year (it sat in an archive) and that suddenly turns out to be full of garbage or zeros.


ext3 with data journaling enabled does not suffer from these quirks, but it noticeably loses in performance.


Ideally, all these problems can (and should) be avoided simply by buying a UPS, and journaling is better treated as an additional layer of reliability and a way to improve performance.

Bottom line

A journaled file system makes administration a little easier, but it is not a magic cure for data loss caused by abnormal reboots. So if you do not use a UPS and do not make backups, sooner or later your data will go down the drain, which I sincerely do NOT wish on you. And, if you like, you can use journaled file systems as a way to improve performance.


Whoever buys a UPS and makes backups always has their data intact.


(C) Denis Smirnov 5 Nov 2004
Posting this document on other Internet resources, or in printed publications, is not permitted.

Journaled file systems are a class of file systems whose characteristic feature is a journal storing a list of changes, which to one degree or another helps preserve the integrity of the file system.

Running a check such as fsck on large file systems can take a long time, which is very bad for today's fast systems. The usual cause of lost integrity is an incorrect unmount, for example if the disk was being written to at the moment of shutdown. Applications may have been updating the data held in files, and the system may have been updating the file system metadata, which is "data about the file system's data": information about which blocks belong to which files, which files live in which directories, and so on. Errors (loss of integrity) in data files are bad, but errors in file system metadata are much worse, since they can lead to lost files and other serious problems.

To minimize integrity issues and minimize system restart time, a journaled file system maintains a list of changes it will make to the file system before actually writing the changes. These records are stored in a separate part of the file system called a “journal” or “log”. Once file system changes are safely journaled, the journaled file system applies those changes to the files or metadata and then removes those entries from the journal. Log entries are organized into sets of related file system changes, much like the way changes added to a database are organized into transactions.

Having a journal increases the likelihood of preserving file system integrity, because journal entries are made before the actual changes and are kept until they have been fully and safely applied. When the computer restarts, the mount program can ensure the integrity of a journaled file system simply by checking the journal for changes that were expected but not made and then writing them to the file system. Thus, with a journal, in most cases the system does not need to check file system integrity at all, which means the computer is available almost immediately after a reboot. The chances of losing data because of file system problems are correspondingly much lower.

There are several journaling file systems available on Linux. The most famous of them:

    XFS, a journaling file system developed by Silicon Graphics but now released as open source;

    ReiserFS, a journaling file system designed specifically for Linux;

    JFS, a journaling file system originally developed by IBM but now released as open source;

    ext3, a journaled extension of the ext2 file system used on most versions of GNU/Linux. A unique feature of ext3 is the ability to switch to it from ext2 without reformatting the disk. It was developed by Dr. Stephen Tweedie.

    In the Microsoft Windows family, the journaled file system is NTFS; in Mac OS X, it is HFS+.