Wednesday, February 17, 2010

Infobright join algorithms

After looking through the different sorter algorithms here, I decided to take a look at the join algorithms. They're a bit more complicated, but it was a very informative process.
First we need to understand the difference between a simple and a complex join.

Simple Join
A simple join is a join which uses only two tables and doesn't involve a BETWEEN operation.

Complex Join
A complex join is a join which uses more than two tables, or involves a BETWEEN operation.

Now that we know the difference, we can take a closer look at the 3 different join algorithms.
  • Hash join
  • Sort join
  • General join
Hash join
This algorithm can only be used for a simple join whose condition is an equality comparison. It uses a temporary table to store the hashes of all key values, together with the matching tuples, for the first table.
If enough memory is available, all the hash values are placed in the temporary table; otherwise only the values which fit in memory are placed in it. The values in the second table are then iterated to find matching keys.
If not all key values from the first table fit in the hash table, the next portion of keys is placed in the table and the second table is iterated again. This is repeated until all keys have been processed.
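
The portion-by-portion behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Infobright's actual code: the function name, the dict-based rows and the max_rows_in_memory parameter (standing in for "whatever fits in memory") are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import islice


def chunked_hash_join(left, right, left_key, right_key, max_rows_in_memory):
    """Equi-join two lists of dict rows, hashing `left` one portion at a time."""
    results = []
    left_iter = iter(left)
    while True:
        # Take the next portion of the first table that "fits in memory".
        portion = list(islice(left_iter, max_rows_in_memory))
        if not portion:
            break
        # Build the hash table: key value -> rows of this portion with that key.
        table = defaultdict(list)
        for row in portion:
            table[row[left_key]].append(row)
        # Iterate the whole second table once per portion, probing the hash.
        for right_row in right:
            for left_row in table.get(right_row[right_key], ()):
                results.append((left_row, right_row))
    return results
```

With a small max_rows_in_memory the second table is scanned several times, which is the cost of not being able to hash the whole first table at once.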

Sort join
This algorithm can only be used for a simple join whose condition is <=, <, > or >=.
It works by inserting all keys and dimensions into two sorters, one for each table. Irrelevant values are removed using the Knowledge Grid, and then the keys are sorted. Both sorters are then traversed in parallel and matched. The final step is to check any additional constraints before the rows are committed to the output.

General join
This is the general joiner algorithm, and it is used when no other algorithm can be used. It is also the slowest algorithm available: it iterates through all the join dimensions and removes the tuples which don't match the defined join conditions.
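
Conceptually this is a filter over every combination of rows. The Python sketch below shows that idea only; it is an assumption about the general shape of such an algorithm, not a description of Infobright's internals, and general_join and its arguments are hypothetical names.

```python
from itertools import product


def general_join(tables, conditions):
    """Keep every combination of rows (one per table) that meets all conditions.

    tables:     list of row lists, one list per joined table (dimension)
    conditions: list of predicates, each taking a tuple with one row per table
    """
    return [
        combo
        for combo in product(*tables)             # iterate all dimensions
        if all(cond(combo) for cond in conditions)  # drop non-matching tuples
    ]


# Example: a BETWEEN-style condition over three tables of plain numbers.
values = [1, 5, 10]
lows = [0, 4]
highs = [6, 12]
matches = general_join(
    [values, lows, highs],
    [lambda c: c[1] <= c[0] <= c[2]],  # value BETWEEN low AND high
)
```

The number of combinations examined is the product of the table sizes, which is why this approach is the slowest and only worth using as a fallback.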

How to determine which algorithm is used?
If you enable ControlMessage, you can easily see which algorithm is used. There will be a line like this:
Tuples after inner/outer join noOfdimensions sort/mix/hash/loop noOfTuplesAfterJoin
Here sort is the sort join, mix is a mixed algorithm (one algorithm failed and the general join was used instead), hash is the hash join and loop is the general join.
