A buffer manager manages data transfers between shared memory and persistent storage and can have a significant impact on the performance of the DBMS. The PostgreSQL buffer manager works very efficiently.
In this chapter, the PostgreSQL buffer manager is described. The first section provides an overview and the subsequent sections describe the following topics:
This section introduces key concepts required to facilitate descriptions in the subsequent sections.
The PostgreSQL buffer manager comprises a buffer table, buffer descriptors, and buffer pool, which are described in the next section. The buffer pool layer stores data file pages, such as tables and indexes, as well as freespace maps and visibility maps. The buffer pool is an array, i.e., each slot stores one page of a data file. Indices of a buffer pool array are referred to as buffer_ids.
Sections 8.2 and 8.3 describe the details of the buffer manager internals.
In PostgreSQL, each page of all data files can be assigned a unique tag, i.e. a buffer tag. When the buffer manager receives a request, PostgreSQL uses the buffer_tag of the desired page.
The buffer_tag comprises three values: the RelFileNode and the fork number of the relation to which its page belongs, and the block number of its page. The fork numbers of tables, freespace maps and visibility maps are defined in 0, 1 and 2, respectively.
For example, the buffer_tag '{(16821, 16384, 37721), 0, 7}' identifies the page that is in the seventh block whose relation's OID and fork number are 37721 and 0, respectively; the relation is contained in the database whose OID is 16384 under the tablespace whose OID is 16821. Similarly, the buffer_tag '{(16821, 16384, 37721), 1, 3}' identifies the page that is in the third block of the freespace map whose OID and fork number are 37721 and 1, respectively.
This subsection describes how a backend process reads a page from the buffer manager (Fig. 8.2).
Fig. 8.2. How a backend reads a page from the buffer manager.When a backend process modifies a page in the buffer pool (e.g., by inserting tuples), the modified page, which has not yet been flushed to storage, is referred to as a dirty page.
Section 8.4 describes how buffer manager works.
When all buffer pool slots are occupied but the requested page is not stored, the buffer manager must select one page in the buffer pool that will be replaced by the requested page. Typically, in the field of computer science, page selection algorithms are called page replacement algorithms and the selected page is referred to as a victim page.
Research on page replacement algorithms has been ongoing since the advent of computer science; thus, many replacement algorithms have been proposed previously. Since version 8.1, PostgreSQL has used clock sweep because it is simpler and more efficient than the LRU algorithm used in previous versions.
Section 8.4.4 describes the details of clock-sweep.
Dirty pages should eventually be flushed to storage; however, the buffer manager requires help to perform this task. In PostgreSQL, two background processes, checkpointer and background writer, are responsible for this task.
Section 8.6 describes the checkpointer and background writer.
PostgreSQL does not support direct I/O, though sometimes it has been discussed. If you want to know more details, refer to this discussion on the pgsql-ML and this article.
The PostgreSQL buffer manager comprises three layers, i.e. the buffer table, buffer descriptors, and buffer pool (Fig. 8.3):
These layers are described in detail in the following subsections.
A buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries (Fig. 8.4).
The built-in hash function maps buffer_tags to the hash bucket slots. Even though the number of hash bucket slots is greater than the number of the buffer pool slots, collisions may occur. Therefore, the buffer table uses a separate chaining with linked lists method to resolve collisions. When data entries are mapped to the same bucket slot, this method stores the entries in the same linked list, as shown in Fig. 8.4.
A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page's metadata. For example, a data entry ‘Tag_A, id=1’ means that the buffer descriptor with buffer_id 1 stores metadata of the page tagged with Tag_A.
The hash function is a composite function of calc_bucket() and hash(). The following is its representation as a pseudo-function.
uint32 bucket_slot = calc_bucket(unsigned hash(BufferTag buffer_tag), uint32 bucket_size)
Note that basic operations (look up, insertion, and deletion of data entries) are not explained here. These are very common operations and are explained in the following sections.
The structure of buffer descriptor is described in this subsection, and the buffer descriptors layer in the next subsection.
Buffer descriptor holds the metadata of the stored page in the corresponding buffer pool slot. The buffer descriptor structure is defined by the structure BufferDesc. While this structure has many fields, mainly ones are shown in the following:
The structure BufferDesc is defined in src/include/storage/buf_internals.h.
Each descriptor will have one of the above states. The descriptor state changes relative to particular conditions, which are described in the next subsection.
In the following figures, buffer descriptors’ states are represented by coloured boxes.
In addition, a dirty page is denoted as ‘X’. For example, an unpinned dirty descriptor is represented by X .
A collection of buffer descriptors forms an array. In this document, the array is referred to as the buffer descriptors layer.
When the PostgreSQL server starts, the state of all buffer descriptors is empty. In PostgreSQL, those descriptors comprise a linked list called freelist (Fig. 8.5).
Please note that the freelist in PostgreSQL is completely different concept from the freelists in Oracle. PostgreSQL's freelist is only linked list of empty buffer descriptors. In PostgreSQL freespace maps, which are described in Section 5.3.4, act as the same role of the freelists in Oracle.
Figure 8.6 shows that how the first page is loaded.
The second and subsequent pages are loaded in a similar manner. Additional details are provided in Section 8.4.2.
Fig. 8.6. Loading the first page.Descriptors that have been retrieved from the freelist always hold page's metadata. In other words, non-empty descriptors continue to be used do not return to the freelist. However, related descriptors are added to the freelist again and the descriptor state becomes ‘empty’ when one of the following occurs:
The reason why the freelist be made is to get the first descriptor immediately. This is a usual practice for dynamic memory resource allocation. Refer to this description.
The buffer descriptors layer contains an unsigned 32-bit integer variable, i.e. nextVictimBuffer. This variable is used in the page replacement algorithm described in Section 8.4.4.
The buffer pool is a simple array that stores data file pages, such as tables and indexes. Indices of the buffer pool array are referred to as buffer_ids.
The buffer pool slot size is 8 KB, which is equal to the size of a page. Thus, each slot can store an entire page.
The buffer manager uses many locks for many different purposes. This section describes the locks necessary for the explanations in the subsequent sections.
Please note that the locks described in this section are parts of a synchronization mechanism for the buffer manager; they do not relate to any SQL statements and SQL options.
BufMappingLock protects the data integrity of the entire buffer table. It is a light-weight lock that can be used in both shared and exclusive modes. When searching an entry in the buffer table, a backend process holds a shared BufMappingLock. When inserting or deleting entries, a backend process holds an exclusive lock.
The BufMappingLock is split into partitions to reduce the contention in the buffer table (the default is 128 partitions). Each BufMappingLock partition guards the portion of the corresponding hash bucket slots.
Figure 8.7 shows a typical example of the effect of splitting BufMappingLock. Two backend processes can simultaneously hold respective BufMappingLock partitions in exclusive mode in order to insert new data entries. If the BufMappingLock is a single system-wide lock, both processes should wait for the processing of another process, depending on which started processing.
The buffer table requires many other locks. For example, the buffer table internally uses a spin lock to delete an entry. However, descriptions of these other locks are omitted because they are not required in this document.
The BufMappingLock had been split into 16 separate locks by default until version 9.4.
Each buffer descriptor uses two light-weight locks, content_lock and io_in_progress_lock, to control access to the stored page in the corresponding buffer pool slot. When the values of own fields are checked or changed, a spinlock is used.
The content_lock is a typical lock that enforces access limits. It can be used in shared and exclusive modes.
When reading a page, a backend process acquires a shared content_lock of the buffer descriptor that stores the page.
However, an exclusive content_lock is acquired when doing one of the following:
The official README file shows more details.
The io_in_progress lock is used to wait for I/O on a buffer to complete. When a PostgreSQL process loads/writes page data from/to storage, the process holds an exclusive io_in_progress lock of the corresponding descriptor while accessing the storage.
When the flags or other fields (e.g. refcount and usage_count) are checked or changed, a spinlock is used. Two specific examples of spinlock usage are given below:
LockBufHdr(bufferdesc); /* Acquire a spinlock */ bufferdesc->refcont++; bufferdesc->usage_count++; UnlockBufHdr(bufferdesc); /* Release the spinlock */
#define BM_DIRTY (1 << 0) /* data needs writing */ #define BM_VALID (1 << 1) /* data is valid */ #define BM_TAG_VALID (1 << 2) /* tag is assigned */ #define BM_IO_IN_PROGRESS (1 << 3) /* read or write in progress */ #define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */ LockBufHdr(bufferdesc); bufferdesc->flags |= BM_DIRTY; UnlockBufHdr(bufferdesc);
Changing other bits is performed in the same manner.
In version 9.6, the spinlocks of buffer manager will be replaced to atomic operations. See this result of commitfest. If you want to know the details, refer to this discussion.
This section describes how the buffer manager works. When a backend process wants to access a desired page, it calls the ReadBufferExtended function.
The behavior of the ReadBufferExtended function depends on three logical cases. Each case is described in the following subsections. In addition, the PostgreSQL clock sweep page replacement algorithm is described in the final subsection.
First, the simplest case is described, i.e. the desired page is already stored in the buffer pool. In this case, the buffer manager performs the following steps:
Then, when reading rows from the page in the buffer pool slot, the PostgreSQL process acquires the shared content_lock of the corresponding buffer descriptor. Thus, buffer pool slots can be read by multiple processes simultaneously.
When inserting (and updating or deleting) rows to the page, a Postgres process acquires the exclusive content_lock of the corresponding buffer descriptor (note that the dirty bit of the page must be set to '1').
After accessing the pages, the refcount values of the corresponding buffer descriptors are decreased by 1.
In this second case, assume that the desired page is not in the buffer pool and the freelist has free elements (empty descriptors). In this case, the buffer manager performs the following steps:
In this case, assume that all buffer pool slots are occupied by pages but the desired page is not stored. The buffer manager performs the following steps:
The rest of this section describes the clock-sweep algorithm. This algorithm is a variant of NFU (Not Frequently Used) with low overhead; it selects less frequently used pages efficiently.
Imagine buffer descriptors as a circular list (Fig. 8.12). The nextVictimBuffer, an unsigned 32-bit integer, is always pointing to one of the buffer descriptors and rotates clockwise. The pseudocode and description of the algorithm are follows:
WHILE true
(1) Obtain the candidate buffer descriptor pointed by the nextVictimBuffer
(2) IF the candidate descriptor is unpinned THEN
(3) IF the candidate descriptor's usage_count == 0 THEN
BREAK WHILE LOOP /* the corresponding slot of this descriptor is victim slot. */
ELSE
Decrease the candidate descriptpor's usage_count by 1
END IF
END IF
(4) Advance nextVictimBuffer to the next one
END WHILE
(5) RETURN buffer_id of the victim
A specific example is shown in Fig. 8.12. The buffer descriptors are shown as blue or cyan boxes, and the numbers in the boxes show the usage_count of each descriptor.
Fig. 8.12. Clock Sweep.Whenever the nextVictimBuffer sweeps an unpinned descriptor, its usage_count is decreased by 1. Therefore, if unpinned descripters exist in the buffer pool, this algorithm can always find a victim, whose usage_count is 0, by rotating the nextVictimBuffer.
When reading or writing a huge table, PostgreSQL uses a ring buffer rather than the buffer pool. The ring buffer is a small and temporary buffer area. When any condition listed below is met, a ring buffer is allocated to shared memory:
When a relation whose size exceeds one-quarter of the buffer pool size (shared_buffers/4) is scanned. In this case, the ring buffer size is 256 KB.
When the SQL commands listed below are executed. In this case, the ring buffer size is 16 MB.
The allocated ring buffer is released immediately after use.
The benefit of the ring buffer is obvious. If a backend process reads a huge table without using a ring buffer, all stored pages in the buffer pool are removed (kicked out); therefore, the cache hit ratio decreases. The ring buffer avoids this issue.
Why 256 KB? The answer is explained in the README located under the buffer manager's source directory.
For sequential scans, a 256 KB ring is used. That's small enough to fit in L2 cache, which makes transferring pages from OS cache to shared buffer cache efficient. Even less would often be enough, but the ring must be big enough to accommodate all pages in the scan that are pinned concurrently. (snip)
In addition to replacing victim pages, the checkpointer and background writer processes flush dirty pages to storage. Both processes have the same function (flushing dirty pages); however, they have different roles and behaviours.
The checkpointer process writes a checkpoint record to the WAL segment file and flushes dirty pages whenever checkpointing starts. Section 9.7 describes checkpointing and when it begins.
The role of the background writer is to reduce the influence of the intensive writing of checkpointing. The background writer continues to flush dirty pages little by little with minimal impact on database activity. By default, the background writer wakes every 200 msec (defined by bgwriter_delay) and flushes bgwriter_lru_maxpages (the default is 100 pages) at maximum.
In version 9.1 or earlier, background writer had regularly done the checkpoint processing. In version 9.2, the checkpointer process has been separated from the background writer process. Since the reason is described in the proposal whose title is "Separating bgwriter and checkpointer", the sentences from it are shown in the following.
Currently(in 2011) the bgwriter process performs both background writing, checkpointing and some other duties. This means that we can't perform the final checkpoint fsync without stopping background writing, so there is a negative performance effect from doing both things in one process.
Additionally, our aim in 9.2 is to replace polling loops with latches for power reduction. The complexity of the bgwriter loops is high and it seems unlikely to come up with a clean approach using latches.
(snip)











