B-Trees and why they are used for indexing

Summarizing the awesome explanation by Abdul Bari

Kumar Swapnil
5 min readJan 4, 2021

Indexes are well known when it comes to databases and plays a crucial role in performance tuning of databases. In this post we are going to cover how indexing improves the performance of databases and why it makes sense to choose B-Trees for indexing.

Let’s start with the logical representation of disk structure. A disk consists of sectors and tracks. The intersection of sector and track forms a block. Any location on the disk can be addressed with Track number and Sector number. So, the block address of the yellow block (refer figure) can be given by (track 1, sector 0). Consider each block has size of 512 bytes (it can be of any size). When we read or write, to and from the disk, we always read and write in terms of blocks.

A 512 Bytes block

Now, if we take a block, then each byte has it’s own address, which is called as offset. So, we can refer any one byte on the disk in terms of block address and offset. Now, let’s move on the next thing, how data is stored on the disk in the form of database.

Consider the example database table above with, Say, 100 records and assume that the block size is 512 bytes. The size taken by the columns are 10 bytes, 50 bytes, 10 bytes, 8 bytes and 50 bytes, respectively. Together each record makes 128 bytes. So, the number of records per block would be 512 / 128 = 4. It means we can store 4 rows of the above table in one block.

As we have 100 records, number of blocks required would be 100 / 4 = 25. Now for any read query, it would require 25 block access (worst case).

Can we reduce this time? Yes.

Now, we will prepare an index for this table. In index we will store two columns, the first column would store eid and second column would store pointer to the corresponding record.

Creating Index for fast read

This index would also be stored on the disk. So, how much space this index will take on the disk. Consider the record pointer uses 6 bytes of data, then each row will consume 16 bytes (eid = 10 bytes). This means for 100 records, we will need additional 1600 bytes (3.125 blocks, or for simplicity, assume 4 blocks) on the disk. Now, whenever we want to search, we would require at most 4 indices and once we got a particular key, we can access the record in single disk access via the record pointer which makes it at most 5 block access.

Now, let’s talk about multi level index. Consider, the 100 records have grown to 1000 rows. This will result in increased index size which would slow down searching in the index itself. So, we can prepare an index for the already created index.

In the previous index, one block contains (512/16) 32 records. Now, for one block we will be maintaining one key (refer figure). This is a sparse index. It won’t have one entry for each entry in the first index. So, roughly 2 blocks would be required to store this high level index. So, search in the high level index, once you get an entry access the low level index, once you get an entry access the record. So, total block access for 1000 records would be 2 + 1 + 1 = 4. So, by adding one more level of index, we are reducing the number of block access. This is the basic idea originating for B trees and B+ trees.

Let’s extend the above concept and see what a multi level index would look like when there are more than 2 levels.

We can go on adding multi level index if the size of table keeps growing. A multi level index will reduce the number of block access and hence we would be able to perform much efficient read operation even for millions of records. So, if we just turn the above diagram in clockwise diagram, you will get an idea of we are headed at.

as you can see that, this multi level index looks like a tree. Now, what is the requirement of multi level index. Shall we add/delete the indices manually ? No. We want it to be self balancing and that’s what a B-tree is. As per Wikipedia,

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children.

Hope you get an idea why B-tree is used for indexing and why indexing plays a pivotal role in database performance. I’ll continue from here in the next post since it only covers the first half of the video.

--

--

Kumar Swapnil
Kumar Swapnil

Written by Kumar Swapnil

Wallflower | Tsundoku | ❣️ anime and crypto

No responses yet