Saturday, February 13, 2010

Infobright ICE sorter algorithms

The current version of ICE, 3.3.1, has four sorting algorithms, known internally as:
  • SorterLimit
  • SorterCounting
  • SorterOnePass
  • SorterMultiPass
Which one is used depends on the Infobright memory settings and the type of sort to be done.
Before going into depth about the different algorithms, we need to understand how Infobright determines the size of the keys to order by. The method is actually pretty smart: from the Knowledge Grid, Infobright can determine the largest value in a column, and that value is used to determine the key size. For example, a column can be declared as an integer, which is 4 bytes, but if the column only contains values up to 255, the key only has to be 1 byte long; up to 65535, 2 bytes; and so on.
The size of a row is calculated the same way as the key length, meaning that we get the smallest possible buffer.
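
To make the key-size rule concrete, here is a minimal sketch in Python. It only illustrates the arithmetic described above; the function name key_size_bytes is made up for this example, and the real Infobright encoding may differ (for instance for negative values or NULLs).

```python
def key_size_bytes(max_value: int) -> int:
    """Smallest number of whole bytes that can hold max_value (unsigned).

    Hypothetical helper: it mirrors the rule in the text, not Infobright's code.
    """
    if max_value < 0:
        raise ValueError("this sketch only handles non-negative values")
    # bit_length() is 0 for the value 0, so always reserve at least one byte.
    return max(1, (max_value.bit_length() + 7) // 8)


print(key_size_bytes(200))    # 1
print(key_size_bytes(65535))  # 2
print(key_size_bytes(65536))  # 3
```

So an integer column declared as 4 bytes, whose largest value is 200, would be sorted with a 1-byte key.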

The memory used by the Infobright sorter is determined by ServerMainHeapSize in the brighthouse.ini configuration file. The memory allocated to sorting is defined by the following heap size intervals.

ServerMainHeapSize    Memory available for sorting
Less than 0.5 GB      64 MB
0.5 - 1.2 GB          128 MB
1.2 - 2.5 GB          256 MB
2.5 - 5 GB            512 MB
5 - 10 GB             1 GB
More than 10 GB       2 GB
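
The table can be read as a simple lookup. The sketch below is only an illustration: the function name sort_memory_mb is made up, the heap size is passed in GB for readability, and how Infobright treats a value that sits exactly on a boundary (such as 0.5 GB) is an assumption, since the intervals above do not say.

```python
# (upper bound of ServerMainHeapSize in GB, sort memory in MB), from the table above.
SORT_MEMORY_TABLE = [(0.5, 64), (1.2, 128), (2.5, 256), (5, 512), (10, 1024)]


def sort_memory_mb(heap_size_gb: float) -> int:
    """Return the memory available for sorting, in MB, for a given heap size."""
    for upper_bound_gb, memory_mb in SORT_MEMORY_TABLE:
        # Assumption: a value exactly on a boundary belongs to the next interval.
        if heap_size_gb < upper_bound_gb:
            return memory_mb
    return 2048  # the top row of the table: 2 GB


print(sort_memory_mb(0.4))  # 64
print(sort_memory_mb(3))    # 512
print(sort_memory_mb(16))   # 2048
```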

SorterLimit
This algorithm uses a single memory buffer.
The criteria for using this algorithm are as follows:
  • The number of rows retrieved must be less than a third of the total number of rows.
  • The number of rows retrieved must be less than a third of the maximum number of rows that fit in memory.
To fulfill these requirements you have to define a LIMIT clause.
This algorithm sorts on the fly: values are sorted as soon as they are loaded into the sort buffer, whereas the other algorithms sort when the values are retrieved. This means that a much smaller buffer is used, but because the sorting occurs on the fly, it takes longer to sort many rows. It therefore makes sense that this algorithm is chosen only for small datasets.
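
The idea of sorting on the fly with a small buffer can be sketched as follows. This is not Infobright's code: the function name sort_with_limit and the use of a sorted list with binary insertion are assumptions made purely to illustrate why the buffer never needs to hold more than LIMIT rows.

```python
import bisect


def sort_with_limit(values, limit):
    """Return the `limit` smallest values in ascending order.

    The buffer is kept sorted at all times and never grows beyond `limit`
    entries, so each incoming value pays an insertion cost right away.
    """
    if limit <= 0:
        return []
    buffer = []
    for value in values:
        if len(buffer) < limit:
            bisect.insort(buffer, value)   # insert at the sorted position
        elif value < buffer[-1]:
            bisect.insort(buffer, value)
            buffer.pop()                   # drop the largest, keep `limit` rows
    return buffer


print(sort_with_limit([5, 1, 4, 2, 3], 2))  # [1, 2]
```

Every insertion shifts part of the buffer, which is cheap for a handful of rows but adds up quickly as the LIMIT grows.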

SorterCounting
This is a Counting Sort algorithm, which uses two memory buffers.
The criteria for using this algorithm are as follows:
  • The memory must be able to hold twice as many rows as the table contains.
  • The key size must be less than 3 bytes long.
  • If the key is 1 byte long, the total number of rows must be above 1024; if it is 2 bytes long, the number of rows must be above 256000.
As the Counting Sort algorithm is impractical for large key ranges, it is only used for low-cardinality keys.
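
For readers who have not met Counting Sort before, here is a textbook version in Python. It is a generic sketch, not the ICE implementation: the function name counting_sort and the (key, payload) row layout are assumptions for illustration. It does show why a separate output buffer is needed and why the key range has to stay small, since the counts array has one slot per possible key value.

```python
def counting_sort(rows, key_size=1):
    """Stable counting sort of (key, payload) rows.

    Keys are assumed to be unsigned integers that fit in `key_size` bytes.
    """
    num_keys = 256 ** key_size          # 256 slots for 1 byte, 65536 for 2
    counts = [0] * (num_keys + 1)
    for key, _payload in rows:
        counts[key + 1] += 1
    # Prefix sums: counts[k] becomes the output index of the first row with key k.
    for k in range(num_keys):
        counts[k + 1] += counts[k]
    output = [None] * len(rows)         # second buffer, same size as the input
    for row in rows:
        key = row[0]
        output[counts[key]] = row
        counts[key] += 1
    return output


rows = [(3, "a"), (1, "b"), (3, "c"), (0, "d")]
print(counting_sort(rows))  # [(0, 'd'), (1, 'b'), (3, 'a'), (3, 'c')]
```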

SorterOnePass
This algorithm uses either Bubble Sort, Quick Sort or a combination of both, and uses a single memory buffer.
The criterion for using this algorithm is as follows:
  • The memory must be able to hold all the rows in the table.
If fewer than 20 values are being ordered, Bubble Sort is used. If more than 20 values are being ordered, most of them are sorted using Quick Sort, but depending on the distribution of values in the table, smaller intervals are sorted using Bubble Sort.
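
A common way to combine the two is to let Quick Sort partition the data and hand small intervals over to Bubble Sort. The sketch below shows that pattern; it is an illustration under assumptions, not the ICE source. The names hybrid_sort and SMALL_INTERVAL, the middle-element pivot, and the treatment of exactly 20 values (sent to Bubble Sort here) are all choices made for this example.

```python
SMALL_INTERVAL = 20  # threshold taken from the text; handling of exactly 20 is assumed


def bubble_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for end in range(hi, lo, -1):
        swapped = False
        for i in range(lo, end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            break  # already sorted, stop early


def hybrid_sort(a, lo=0, hi=None):
    """Quick Sort for large intervals, Bubble Sort for small ones, in place."""
    if hi is None:
        hi = len(a) - 1
    while hi - lo + 1 > SMALL_INTERVAL:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                      # partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse into the smaller side and loop on the larger one.
        if j - lo < hi - i:
            hybrid_sort(a, lo, j)
            lo = i
        else:
            hybrid_sort(a, i, hi)
            hi = j
    bubble_sort(a, lo, hi)


data = [(i * 7919) % 1000 for i in range(1000)]  # a shuffled 0..999
hybrid_sort(data)
print(data[:5])  # [0, 1, 2, 3, 4]
```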

SorterMultiPass
This uses the same sorting algorithm as SorterOnePass, but uses multiple buffers.
The criterion for using this algorithm is as follows:
  • The memory cannot hold all the rows in the table.
The number of buffers used is determined by the available memory. Each buffer has the size of the available memory, so if three times the available memory is needed, three buffers are created. When a buffer has been filled, it is sorted and saved to disk. When a value is retrieved, the current buffer is sorted, and then the values, both from disk and from memory, are returned.
Because of the multiple buffers and sort passes, this is the most costly sorting algorithm.
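
The fill, sort, save and read-back cycle can be sketched like this. It is a simplified illustration, not Infobright's code: the function name multipass_sort is made up, Python's built-in list.sort() stands in for the SorterOnePass algorithm, temporary text files stand in for the on-disk buffers, and combining the sorted pieces with heapq.merge is an assumption about how values from disk and memory are returned in order.

```python
import heapq
import tempfile


def _write_run(sorted_values):
    """Save one sorted buffer to a temporary file, one integer per line."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{v}\n" for v in sorted_values)
    f.seek(0)
    return f


def _read_run(f):
    for line in f:
        yield int(line)


def multipass_sort(values, rows_in_memory):
    """Yield integers in ascending order using a buffer of `rows_in_memory` rows."""
    runs = []
    buffer = []
    for v in values:
        buffer.append(v)
        if len(buffer) == rows_in_memory:   # buffer full: sort it and save to disk
            buffer.sort()
            runs.append(_write_run(buffer))
            buffer = []
    buffer.sort()                           # the current buffer stays in memory
    try:
        # Return values from disk and memory together, smallest first.
        yield from heapq.merge(buffer, *(_read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()


print(list(multipass_sort([5, 3, 8, 1, 9, 2, 7], rows_in_memory=3)))
# [1, 2, 3, 5, 7, 8, 9]
```

With seven rows and room for three, two buffers are written to disk and the third stays in memory, which is where the extra I/O cost comes from.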

How to determine which algorithm is used?
It is pretty easy to determine which algorithm is going to be used. Just look at the key length, the row size, the number of rows in the table, the number of rows to get and the available memory. If ControlMessages has been enabled in the brighthouse.ini configuration file, you can also look in the bh.err log file for the Sorter initialized line to get the number of rows, the key size and the total row size. The line could look like this:
2010-02-12 20:05:33 [1] Sorter initialized for 28 rows, 1+8 bytes each.
This means that the table contains 28 rows, the key is 1 byte long and the total row size is 1+8 bytes. With this information in hand, look at the criteria above to figure out which algorithm is chosen.
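
Putting the criteria together, the check can be sketched in a few lines of Python. Everything here is a reading aid built from the rules in this post, not Infobright's actual selection code: the function names parse_sorter_line and choose_sorter are made up, and the order in which the algorithms are tried is an assumption, since the criteria do not say which one wins when several apply.

```python
import re

LOG_PATTERN = re.compile(r"Sorter initialized for (\d+) rows, (\d+)\+(\d+) bytes each")


def parse_sorter_line(line):
    """Return (rows, key_bytes, row_bytes) from a 'Sorter initialized' line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    rows, key_bytes, other_bytes = map(int, match.groups())
    return rows, key_bytes, key_bytes + other_bytes


def choose_sorter(total_rows, key_bytes, row_bytes, sort_memory_bytes, limit=None):
    """Apply the criteria from the sections above (check order is assumed)."""
    max_rows_in_memory = sort_memory_bytes // row_bytes
    if limit is not None and limit < total_rows / 3 and limit < max_rows_in_memory / 3:
        return "SorterLimit"
    enough_rows = (key_bytes == 1 and total_rows > 1024) or \
                  (key_bytes == 2 and total_rows > 256000)
    if max_rows_in_memory >= 2 * total_rows and key_bytes < 3 and enough_rows:
        return "SorterCounting"
    if max_rows_in_memory >= total_rows:
        return "SorterOnePass"
    return "SorterMultiPass"


line = "2010-02-12 20:05:33 [1] Sorter initialized for 28 rows, 1+8 bytes each."
rows, key_bytes, row_bytes = parse_sorter_line(line)
print(rows, key_bytes, row_bytes)                               # 28 1 9
print(choose_sorter(rows, key_bytes, row_bytes, 64 * 1024 ** 2))  # SorterOnePass
```

With 64 MB of sort memory, the 28-row example fits in memory but has too few rows for SorterCounting, so under these assumptions it would end up in SorterOnePass.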
