Chapter 5. Storage Engines and Data Types - Part 3

by Michael Kruckenberg and Jay Pipes

From Pro MySQL, Berkeley, Apress, July 2005.

The InnoDB Storage Engine

The InnoDB storage engine³ addresses some of the drawbacks to the MyISAM storage engine. Namely, it provides enforcement of foreign key constraints and full ACID-compliant transaction processing abilities (see Chapter 3).

Much of InnoDB’s power is derived from its implementation of row-level locking through multiversion concurrency control (MVCC). Through MVCC, InnoDB has support for a number of transaction isolation levels, giving you control over how your transactions are processed. In the following sections, we’ll examine these transaction-processing capabilities, as well as InnoDB’s doublewrite log system, file and record formats, and buffers.

Enforcement of Foreign Key Relationships

InnoDB enforces the referential integrity of foreign key relationships at the database level. When a CREATE TABLE statement is issued with the FOREIGN KEY … REFERENCES clause, the parent table (REFERENCES table) is checked to verify the existence of a key when a record in the child table is inserted.

A common example of this parent-child relationship, as we discussed in Chapter 1, is the Customer to CustomerOrder to CustomerOrderItem scenario. A customer can place zero or more orders. An order can contain one or more order details. In order to enforce the relationship, we would issue the statements in Listing 5-1. Note that the parent tables must be created first, before any foreign keys reference them, and the parent tables must have a unique index containing the columns referenced in the FOREIGN KEY clause. Additionally, all data types must be identical on both sides of the relationship.

mysql> CREATE TABLE customer (
      > id INT NOT NULL AUTO_INCREMENT,
      > name VARCHAR(30) NOT NULL,
      > address VARCHAR(100) NOT NULL,
      > PRIMARY KEY (id)) ENGINE = INNODB;
mysql> CREATE TABLE customer_order (
      > id INT NOT NULL AUTO_INCREMENT,
      > customer INT NOT NULL,
      > date_ordered INT NOT NULL,
      > PRIMARY KEY (id),
      > FOREIGN KEY (customer) REFERENCES customer (id)) ENGINE = INNODB;
mysql> CREATE TABLE customer_order_item (
      > id INT NOT NULL AUTO_INCREMENT,
      > order INT NOT NULL,
      > product VARCHAR(30) NOT NULL,
      > PRIMARY KEY (id),
      > FOREIGN KEY (order) REFERENCES customer_order (id)) ENGINE = INNODB;

Listing 5-1. Creating an InnoDB Table with a Foreign Key Constraint

Tip: You can use the ON UPDATE CASCADE and ON UPDATE DELETE options in order to force InnoDB to automatically handle updates and deletes on the parent record. Refer to the manual for detailed instructions on these options. See http://dev.mysql.com/doc/mysql/en/create-table.html and also http://dev.mysql.com/doc/mysql/en/ansi-diff-foreign-keys.html.

InnoDB Row-Level Locking

Although InnoDB does implement table-level locking (you can order it to use table-level locks using the LOCK TABLES statement), the default lock granularity is at the row level. While tablelevel lock granularity is more efficient from a memory perspective, a row-level lock is crucial for applications that have a high read and write rate where updates to the data are common.

You might wonder how table-level locking could be more efficient, since it locks a larger block of records. During table-level locking, MyISAM places a lock on the table information structure that is shared by threads reading and writing. The lock is applied to this single shared resource and is held for a relatively short period of time (usually nanoseconds). In row-level locking, an array of locks must be maintained for the rows active in transactions. So, while on the surface, table-level locking may seem inefficient because it holds on to a large logical block, the implementation of row-level locking is more CPU- and memory-intensive because of the number of locks that must be tracked.

InnoDB’s implementation of row-level locking uses an internal table that contains lock information for the keys. This internal format is a memory-efficient, compressed hash lookup of the primary keys in the table. (This is, by the way, the reason you cannot have an InnoDB table without a PRIMARY KEY assigned to it; see the discussion of clustered versus non-clustered data and index organization in Chapter 2 for details.)

That said, there are situations in which the level of lock granularity becomes more of a player than the resources needed to maintain the actual locks. For systems where there are a large number of concurrent users issuing both UPDATE and SELECT statements on the same data — typically in a mixed OLTP/OLAP⁴ environment — situations arise where there are too many requests for write locks, which inhibit, or block, the read requests until the write has completed. For table-level lock granularity, these read requests must wait until the write request has released the table-level lock in order to read any of the data records in the table.

Row-level locking solves this dilemma by allowing update requests to only block read (or other write) requests to the data records that are affected by the update. Other read requests — ones that do not need to be read from the segment being written by the write request — are not held up. InnoDB implements this granularity of locking. This is one of the reasons that the InnoDB storage engine is an excellent candidate for systems having high read and write requests.

Like MyISAM, InnoDB implements a mechanism to allow insertions that occur at the end of the data file — which, in the case of InnoDB, is always the end of the clustered index — to happen concurrently without issuing any exclusive locks on the table.

ACID-Compliant Multistatement Transaction Control

If you have an absolute requirement that certain sets of statements run against your database tables must be completed inside an ACID-compliant transaction, InnoDB is your storage engine of choice. As noted earlier, InnoDB accomplishes transaction control through MVCC.

The default isolation level in which InnoDB runs multistatement transactions is REPEATABLE READ and, for most situations, this isolation level is sufficient.⁵ However, in certain circumstances, you may need a higher level of isolation. In these cases, InnoDB offers a SERIALIZABLE isolation level that can be set using the SET TRANSACTION ISOLATION LEVEL statement before issuing any commands in the connection thread. See Chapter 2 for a detailed discussion of isolation levels and MVCC, to determine situations where you may need to set a specific isolation level.

The InnoDB File and Directory Layout

The InnoDB storage engine file organization is different from the MyISAM arrangement. While the MySQL server maintains an .frm file for each InnoDB table, similar to MyISAM tables, InnoDB also keeps its own store of meta information about InnoDB tables. Because of this, it is not currently possible to simply transfer InnoDB databases from one server to another by copying the table files.

By default, the storage engine manages all InnoDB tables in what’s called a tablespace, which is modeled after the Oracle concept of the same name. The tablespace is composed of multiple files, which can grow to the size limitation of the operating system. These files are named based on what is in your configuration file. By default, these files begin with ibdata and then a number. In your my.cnf file (discussed in Chapter 14), you will see a section similar to the following:

innodb_data_home_dir = /usr/local/var/
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend

The ibdata files contain both the table and index data for all InnoDB tables. These ibdata files will be in innodb_data_home_dir, while the .frm file will be in the schema’s directory under the main MySQL data_dir directory. All the ibdata files are concatenated by InnoDB to form the InnoDB tablespace. The tablespace can contain any number of files, and the autoextend functionality ensures that the tablespace files can grow with the database. This also means that file system size limitations (for instance, 2GB on most Linux distributions) can be overcome, since the tablespace can contain multiple files, unlike with the MyISAM .MYD storage.

Within the tablespace, two internal files (called segments) maintain each InnoDB table (these segments aren’t visible to you, however). One segment is used for the clustered index data pages, and another segment is used for any secondary indexes built on that clustering key. The reason this is done this way is so that records may be added sequentially in large blocks, to both the data and secondary index pages of the table.

To implement InnoDB’s transaction processing system, a series of log files will also be created. In your my.cnf file, you will find something like the following two lines:

innodb_log_group_home_dir = /usr/local/var/
innodb_log_arch_dir = /usr/local/var/

These are the directories where the main log files and archive log files are stored. The default naming convention for these log files is ib_logfile and then a number representing the log segment. You will have a number of log files equal to the innodb_log_files_in_group configuration variable (with a default of two log files). We’ll take a closer look at the log system a little later in the chapter, in the “InnoDB Doublewrite Buffer and Log Format” section.

Optionally, as of version 4.1.1, you can elect to have InnoDB organize its files in a pertable format, similar to the MyISAM file organization. To enable this file layout, insert the innodb_file_per_table configuration option under the mysqld section of your my.cnf file. Keep in mind, however, that enabling this option does not remove the ibdata files, nor allow you to transfer InnoDB schema to another machine by simply copying the .ibd files, as you can with the MyISAM storage engine’s files.

Note: Currently, the tables cannot be manually assigned to the multiple ibdata files. Therefore, it is not possible to have InnoDB store separate tables on separate disks or devices.

InnoDB Data Page Organization

The InnoDB storage engine stores (both on disk and in-memory) record and index data in 16KB pages. These pages are organized within the ibdata files as extents of 64 consecutive pages. The reason InnoDB does this is to allocate large spaces of memory and disk space at once, to ensure that data is as sequential on the hard disk as possible. This is a proactive stance at maintaining as defragmented a system as possible.

Each extent stores data related to a single index, with one exception. One extent contains a page directory, or catalog, which contains the master list of data pages as a linked tree of pointers to all extents in the tablespace.

Clustered Index Page Format

Since the storage engine uses a clustered index organization, the leaf pages of the index contain the actual record data. Secondary B-tree indexes are built on the clustered index data pages.

A clustered index data page in the InnoDB storage engine is a complex beast. It consists of seven distinct elements:

Fil header: This 34-byte header contains the directory information about the page within the segment. Important directory information includes an identifier for the page, the previous and next page’s identifiers,⁶ and the log serial number (LSN) for the latest log record for the page. We’ll discuss the importance of the log serial number in the upcoming section on the InnoDB log format.
Page header: This 50-byte header contains meta information about the data page itself. Important elements of this section include pointers to the first record on the page, the first free record, and the last inserted record. Also of interest are an identifier for the index to which the data page belongs and the number of records on the page.
Infimum and Supremum records: These are two fixed-size records placed in the header. These records are used to prevent the next-previous link relationship to go beyond the index bounds and as a space to put dummy lock information.
User records: After the Infimum and Supremum records come one or more user records. The format of the user record is detailed in the next section.
Free space: After the user records is free space available for InnoDB to insert new records. This is the “fill factor” space for InnoDB, which attempts to keep data pages at 15/16 filled.
Page directory: Following the free space, the page directory contains a variably sized set of pointers to each record, paired with the record’s clustering key. In this way, queries can use the page directory’s smaller size to do very fast lookups and range queries for records on the page.
Fil trailer: Finally, this section contains a checksum of the page’s data, along with the page log sequence number, for use in error-checking the contents of the page.

InnoDB Record Format

InnoDB records have a very different format from MyISAM records. The record is composed of three parts:

One- or two-byte field start offsets contain the position of the next field in the record, relative to the start address of the record. There will be n field offsets, where n is the number of fields in the table. The field offsets will be 1 byte if the total record size is 127 bytes or less; otherwise, each field offset will be 2 bytes long.
A fixed-size 48-bit (6-byte) “extra bytes” information section contains meta information about the record. This meta information includes the following important elements:

One bit denoting if the record is deleted. In this case, a value of 1 means the record is deleted (the opposite of MyISAM).
Ten bits detailing the number of fields in the record.
Thirteen bits identifying the record within the data page (the heap number).
One bit telling InnoDB whether the field offsets mentioned previously are 1 or 2 bytes long.
Sixteen-bit (2-byte) pointer to the next-key record in the page.

The field contents compose the remainder of the record, with no NULL value separating the field contents, because the field offsets enable the storage engine to navigate to the beginning of each field.

The most important aspect of the InnoDB record structure is the two parts of the “extra bytes” section that contain the 13-bit heap number and the 16-bit next-key pointer.

Remember that InnoDB tables follow a clustered data organization where the data page is clustered, or ordered, based on the primary key value. Would it then surprise you to know that InnoDB does not actually store records in the order of the primary key?

“But wait!” you say. “How is it possible that a clustered data organization can be built on index pages without those records being laid out in primary key order?” The answer lies in the storage engine’s use of next-key pointers in the data records.

The designers of InnoDB knew that maintaining clustered index data pages in sort order of the primary key would be a performance problem. When records were inserted, the storage engine would need to find where the record “fit” into the appropriate data page, then move records around within the file in order to sort correctly. Updating a record would likewise cause problems. Additionally, the designers knew that inserting records on a heap structure (with no regard to the order of the records) would be faster, since multiple insertions could be serialized to go into contiguous blocks on the data page. Therefore, the developers came up with a mechanism whereby records can be inserted into the data page in no particular order (a heap), but be affixed with a pointer to the record that had the next primary key value.

The InnoDB storage engine inserts a record wherever the first available free space is located. It gets this free record space address from the page header section. To determine the next-key pointer, it uses the small, condensed page directory trailing section of the data page to locate the appropriate place to insert the primary key value for the inserted record. In this way, only the small page directory set of key values and pointers must be rearranged. Note also that the next-key pointers are a one-way (forward-only) list.

Note: The InnoDB page and record source code files are in the /innobase/page/and /innobase/rem/ directories of your source distribution. rem stands for record manager.

Footnotes

³ InnoDB was originally developed by Heikki Tuuri and is now developed and maintained by Innobase Oy (http://www.innodb.com/).

⁴ OLTP stands for online transaction processing, and these systems typically have high write requests. OLAP stands for online analytical processing, and these systems typically have high read requests.

⁵ A few folks will insist that this isolation level is indeed more than sufficient for normal default operations. Oracle and SQL Server both default to the READ COMMITTED isolation level. See the InnoDB manual for a discussion on its isolation levels: http://dev.mysql.com/doc/mysql/en/innodb-transaction-model.html and follow the links to the various subsections.

⁶ The next and previous page identifiers provide a mechanism for InnoDB to perform fast range query and scan operations by providing a linking relationship between the index data pages. This linking relationship is a major difference between the implementation of the B-tree index structure in InnoDB versus MyISAM. This type of B-tree algorithm is commonly called a B+ tree (B-plus tree) and is useful for clustered data organizations.

Michael Kruckenberg started his career with web technologies more than 10 years ago. His first major undertaking was bringing a small mail-order company online (using MySQL). After hopping around a bit during the 1990s Internet boom and spending time in the Internet startup world, Mike put his feet down at his current gig, senior programmer at Tufts University. He is now the technical manager for the Apache/Perl/MySQL-driven Tufts University Sciences Knowledgebase (TUSK), a content repository for faculty and students. Mike likes to spend his elusive free time with his wife and kids on New England adventures, has a healthy addiction to music, and likes to dabble in the arts (photography, video, and literature).

For the past 10 years, Jay Pipes has worked with all kinds of companies, large and small, to identify the value of the information they collect and help them build software that best addresses the needs of their businesses. From e-commerce to work-order management systems, Jay has been involved in projects with both Microsoft and open-source technologies. Passionate about programming and all things technical, Jay now runs his own consulting business, based in Columbus, Ohio. When not being bothered by his two cats, two dogs, and a constantly ringing phone, you can find him, headphones pulled across his ears, happily coding away at home.

Contributors : Michael Kruckenberg, Jay Pipes
Last modified 2006-05-12 05:01 PM

DBAzine.com

Sections

Personal tools

Menu