Thursday, February 25, 2010

Infobright ICE ini settings

The brighthouse.ini file contains all Infobright specific settings. But what settings are actually available, and what do they do? When I looked at the source code, I saw settings I had no idea existed.
The list below applies to ICE 3.3.1.

BufferingLevel
Default value: 2
Used when the loader heap can't hold the all the data, and defines the size of the buffer. The size can be calculated like this LoaderMainHeapSize / BufferingLevel.

CacheFolder
Default value: "cache"
As the name implies this defines the location of the Infobright cache folder.

CachingLevel
Default value: 1
Used to define the type of caching level used on the uncompressed heap. If the value are 1 or 2, Infobright removes data packs which aren't used from the heap, if it runs out of space. This is not done if the value are anything else than 1 or 2. If this isn't enough or another value are defined, it tries to release memory and it tries to compress the data and move it to the compressed heap.

LicenseFile
Default value: ""
This points to the Infobright license file, and are only used in the IEE evaluation edition.

ClusterSize
Default value: 2000
The maximum size of a data file, in MB.

ControlMessages
Default value: 0
Enables control messages, which usually needed when investigating performance. The value 2 enables it, values above 2, enables it and also produces timestamps.

HugeFileDir
Default value: ""
Only used on non-Windows machines. Used to specify the location of a memory mapped file containing the server main heap. If the setting aren't defined, no memory mapped file are used, instead it is placed directly in memory.

InternalMessages
Default value: false
Used to print FET - Function Execution Times - to a log file called development.log. Only used if compiled with --with-FET.

KNFolder
Default value: BH_RSI_Repository
The folder containing the Knowledge Grid data.

KNLevel
Default value: 99
If the value are 0, Knowledge Grid aren't used. values above 0 enables the Knowledge Grid, but doesn't seem to do anything else in ICE.

LoaderMainHeapSize
Default value: 320
Size of the memory heap in the loader process, in MB.

LoaderSaveThreadNumber
Default value: 16
Number of parallel load threads. Only used in IEE.

PushDown
Default value: true
Enables or disables Engine Condition Pushdown. Should not be disabled.

ServerCompressedHeapSize
Default value: 256
Size of the compressed memory heap in the server process, in MB.

ServerMainHeapSize
Default value: 600
Size of the uncompressed memory heap in the server process, in MB.

UseCharSet
Default value: false
Used to enable or disable UTF8 support. As this isn't fully implemented in the current version, 3.3.1, it shouldn't be enabled.

AllowMySQLQueryPath
Default value: false
Allows queries to fallback to standard MySQL execution if necessary.

Wednesday, February 17, 2010

Infobright join algorithms

After I looked through the different sorter algorithms here, I've decided to take a look at the joining algorithms. They're a bit more complicated, but I was a very informative process.
First we need to understand the difference between a simple and a complex join.


Simple Join
A simple join is a join which only uses two tables, and isn't a BETWEEN operation.

Complex Join
Is a join which uses more than two tables, or a BETWEEN operation.

Now that we know the difference, we can take a closer look at the 3 different join algorithms.
  • Hash join
  • Sort join
  • General join
Hash join
This algorithm can only be used if it is a simple join and the condition is a equal operation. It uses a temporary table to store the hash for all key values and the matching tuples for the first table.
If enough memory are available all the hash values are placed in the temporary table, otherwise only the values which fit in memory are placed in the table. Then the values in the second table are iterate to find matching key.
If all key values from the first table weren't placed in the hash table, the next portion of keys are placed in the table, and the second table are iterated again, this is done until all keys have been processed.

Sort join
This algorithm can only be used if it is a simple join and the condition are <=, <, > or >=.
It works by inserting all keys and dimensions, into two sorters, one for each table. Using the Knowledge Grid irrelevant values are removed, and then the keys are sorted. Both sorters are then traversed in parallel and matched. The final step is to check additional constraints before the rows are committed to the output.

General join
This is the general joiner algorithm, and it is used when no other algorithm can be used. It is also the slowest algorithm available, it iterates through all the join dimensions and remove tuples not matching the defined join conditions.

How to determine to algorithm used?
If you enable ControlMessage, you can easily see which algorithm are used. There will be a line like this:
Tuples after inner/outer join noOfdimensions sort/mix/hash/loop noOfTuplesAfterJoin
Sort is the sort join, mix is mixed algorithm (if one fails, and general are used instead), hash is the hash join and loop is the general join.

Infobright ICE importer IGNORE LINES patch

The ICE importer in any of the current versions of ICE, newest 3.3.1, doesn't support IGNORE LINES in the LOAD DATA statement. For example, this this data file would fail, as we can't tell it to skip the first line:

Test data
1;2;3
4;5;6

It is possible to issue the LOAD DATA statement with the IGNORE LINES option, but it is ignored by the loader.
It is very easy to implement in the loader, and the attached patched does exactly this.

The patch is tested in ICE 3.3.1, but should also work in earlier versions. It must be applied in the ICE source code root.

I've created a bug report here.

Patch file:
ignoreLines331.patch (4 KB)

Tuesday, February 16, 2010

Infobright ICE MySQL Query Cache patch

Infobright ICE doesn't currently use the MySQL Query Cache for queries executed by the Infobright engine.
The attached patch enables the MySQL Query Cache for Infobright queries also.

There are currently a bug report about this here.

Patch file for ICE 3.3.1, should also work for earlier version:
queryCache331.patch (1 KB)

Infobright ICE importer default escape character patch

In the current, 3.3.1, and earlier version of ICE there is a problem with the importer. It doesn't handle escape characters in strings correctly. This is because the default escape character in the importer is nothing (ASCII code: 0). This patch sets the default escape character to '\', which is the default in MySQL.
You'll only run into this problem if you haven't define an escape character in the LOAD DATA statement, and try to load a string containing a ".

Patch file for ICE 3.3.1, but should also work in earlier version, has to be applied in the source code root:
escapedCharacter331.patch

Infobright ICE RENAME TABLE support

Currently no version of ICE supports the RENAME TABLE statement, but I'm about to change this.
At least from version 3.3.0, the RENAME TABLE statement has been almost fully implemented, to only thing missing is connecting the rename function, to the statement. I have know idea why this hasn't been done a long time ago.
I've created a forum post, trying to answer this question, it can be found here.
The patch enables you to give ICE the following statement:
RENAME TABLE oldName TO newName
The attached patch are tested in version 3.3.1, but should also work in 3.3.0, and should be applied in the root ICE source directory.

Patch file:
renameTable331.patch (1 KB)

UPDATE:
If you issue the RENAME TABLE command in the current implementation, the table file structure are corrupted. Therefore do not try to rename a Infobright table without applying this patch. I've create a bug report here.

UPDATE 2:
I got a response to the forum post I created. The answer to why it wasn't implemented was that renaming tables in ICE aren't officially supported, and therefore this wasn't a bug. See the answer here.

ASP.NET LoadControl overloads

Yesterday, one my friends asked me a ASP.NET question about LoadControl. According to MSDN, http://msdn.microsoft.com/en-us/library/system.web.ui.templatecontrol.loadcontrol.aspx, the LoadControl method are overloaded.
He had created a user control, with a constructor he would like to initialize, he also had a empty constructor. I've created my own code which describes the situation. The user control looked something like this:
UserControl.asx
<%@ Control Language="C#" AutoEventWireup="true" CodeBehind="UserControl.ascx.cs" Inherits="LoadControl.UserControl1" %>
<p>TextBox:</p>
<asp:TextBox runat="server" ID="TextBox1" />
UserControl.asx.cs
public partial class UserControl1 : System.Web.UI.UserControl
   private string text = "TextBox";
   public UserControl1()
   public UserControl1(string text) {
      this.text = text;
   }
}
The page look something like this:
Default.aspx
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="LoadControl._Default" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
   <head runat="server">
      <title></title>
   </head>
   <body>
      <form id="form1" runat="server">
         <div>
            <asp:Panel ID="Panel1" runat="server" />
         </div>
      </form>
  </body>
</html>
Default.aspx.cs
public partial class _Default : System.Web.UI.Page {
   protected void Page_Load(object sender, EventArgs e)  {
      Panel1.Controls.Add(LoadControl("~/UserControl.ascx"));
      Panel1.Controls.Add(LoadControl(typeof(UserControl1), new object[] { "Test TextBox"}));
   }
}
When the page was accessed it surprisingly look like this:
The second instance of the user control, was never added to the page. The reason for this is the way ASP.NET handles user controls. When you use LoadControl(string), .NET processes the acsx file. When you use LoadControl(Type, Object[]), .NET does not process the ascx file, instead it uses the controls Render method. If this method doesn't exists nothing are rendered. To get code above to work like it was intended a Render method have to be added, the method could look something like this:
protected override void Render(HtmlTextWriter writer) {
   writer.Write("<p>" + text + "</p>");
   writer.Write("<input type='text' />");
}
The important lesson here is that user controls only uses the acsx markup file, if the markup file are specified to load. If only the type are used, the user control knows nothing about the ascx file.

Infobright ICE importer FIELDS ESCAPED BY patch

None of the Infobright ICE versions support NULLs escaped anything other than the default escape character '\'. This is probably because the default way to handle NULLs in the ICE exporter and importer are an empty field. In this example fields are terminated by ';', and NULLs are indicated the ICE default way.
1;;2
2;2;2
The problem is that ICE importer also supports NULLs defined the default MySQL way:
1;\N;2
2;2;2
But not with any other escape character, like this:
1;~N;2
2;2;2
In ICE you can define a custom NULL indication by setting the BH_NULL session variable, but for some reason this is not handled in the importer.

There are currently a bug report concerning this here.

I've created a patch to solve this "problem", which means that you can define another escape character like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest FIELDS TERMINATED BY '~';

The patch has to be applied in the root of the ICE source code directory.
The patch is tested in ICE 3.3.1, but should also work in any earlier versions.

Patch file:
escapedBy331.patch (2 KB)

Saturday, February 13, 2010

Infobright ICE sorter algorithms

In the current version of ICE, 3.3.1, there are 4 types of sorting algorithms, with the internal names:
  • SorterLimit
  • SorterCounting
  • SorterOnePass
  • SorterMultiPass
Which one that are used depends on the Infobright memory settings and type of sort to be done.
Before going in depth about the different algorithms, we need to understand how Infobright determines the size of the keys to order by. The method is actually pretty smart, from the Knowledge Grid Infobright can determine the biggest value of a column, and that value are used to determine the key size. For example a column can be declared as a integer which is 4 bytes, but if the column only contains values less than 255 bytes, the key only has to be 1 byte long, less than 65535, 2 bytes, etc.
The size of a row are calculate the same way as the key length, meaning that we get the smallest possible buffer.

The memory used by the Infobright sorter, are defined by the ServerMainHeapSize in the brighthouse.ini configuration file. The memory allocated to sorting are defined in the following heap size intervals.

ServerMainHeapSize Memory available for sorting
Less than 0.5 GB 64 MB
0.5 - 1.2 GB 128 MB
1.2 - 2.5 GB 256 MB
2.5 - 5 GB 512 MB
5 - 10 GB 1 GB
More than 10 GB 2 GB

SorterLimit
This algorithm uses a single memory buffer.
The criterias for using this algorithm criteria are as follows:
  • The number of rows retrieved must be less than a third of the total number of rows.
  • The number of rows retrieved must be less than a third of the maximum rows allowed in the memory.
To fulfill these requirements you have to define a LIMIT clause.
This algorithm does the sorting on-the-fly, meaning that when the values are loaded into the sort buffer, it is sorted right away, the other algorithms sorts when the values are retrieved. This means that a much smaller buffer are used, but because the sorting occures on-the-fly, it takes longer to sort many rows. Therefore it makes sense that this algorithm are chosen only on small datasets.

SorterCounting
This is a Counting Sort algorithm, which uses two memory buffers.
The criterias for using this algorithm criteria are as follows:
  • The memory must be able to hold twice as many rows as the table contains.
  • The key size must be less than 3 bytes long.
  • If the key is 1 byte long, the total number of rows must be above 1024. And if its 2 bytes long, the number of rows must be above 256000.
As the Counting Sort algorithm are impractical for large ranges, it is only for low-cardinality keys.

SorterOnePass
This algorithm uses either a Bubble SortQuick Sort or a combination of both, and uses a single memory buffer.
The criteria for using this algorithm criteria are as follows:
  • The memory must be able to hold the all the rows in the table.
If we are ordering less than 20 values are Bubble Sort are used. If we are ordering more than 20 values, most of the values, are sorted using Quick Sort, but depending on the distribution of values in the table smaller intervals are sorted using Bubble Sort.

SorterMultiPass
This is uses same sorting algorithm as SorterOnePass, but uses multiple buffers.
The criteria for using this algorithm criteria are as follows:
  • The memory cannot hold all the rows in the memory.
The number of buffers used are defined by the available memory, buffer has the size of the available memory, so if three times the memory are needed, three buffers are created. When a buffer has been filled, it is sorted and save to disk. When a value are retrieved the current buffer are sorted, and then the values, both from disk and memory, are returned.
Because of the multiple buffers and sort passes, this is the most costly sorting algorithm.

How to determine which algorithm are used?
It is pretty easy to determine which algorithm that are going to be used. Just look at the key length, row size, the number of rows in the table, the number of rows to get and the available memory. If ControlMessages have been enabled in the brighthouse.ini configuration file. You can also look at the contents of bh.err log file, and look for the Sorter initialized line, to get the number of rows, the key size and total row size. The line could look like this:
2010-02-12 20:05:33 [1] Sorter initialized for 28 rows, 1+8 bytes each.
Meaning that the table contains 28 rows, the key is 1 byte long and the total row size are 1+8 bytes long. With the information in hand, look at the criterias above to figure which algorithm are chosen.

Friday, February 12, 2010

Infobright ICE 3.3.1 released follow up - The biggest little-known secret

Before updating my Infobright patches, I decided to to a closer a look at what actually changed from 3.3.0 to 3.3.1, and to my surprisem, a new very big feature has been implemented, namely UTF-8 support.

Why would they implement something like this, even partially, without telling the public?
Even a tool called CharSetMigrationTool have been included in the source distribution, don't know about the binary distributions. The tool seems to update the Knowledge Grid to support UTF-8.

Just by looking at the source code, it seems like UTF-8 support has been fully implemented with a few exceptions, like CMAPs for UTF8. CMAPs are explained in more detail in this forum post.

On the ICE forum I posted the question, has it been implemented or not, the post can be found here.

Other than that, no untold changes seems to be hiding in 3.3.1 release.

UPDATE: One of the Infobright developers replied to my forum post, and he confirms that UTF-8 support has been partially implemented in ICE 3.3.1. The next release of ICE should contain UTF-8 support for the ICE loader process. See the reply here.

Thursday, February 4, 2010

Infobright ICE 3.3.1 released

Today Infobright 3.3.1 has been released.
The most important change are that MySQL have been upgraded from 5.1.14 to 5.1.40 GA, which fixes a bunch of different bugs, both performance, query syntax, etc., in the MySQL engine.
Because of the MySQL upgrade it is necessary to run the MySQL Updater program, as the data file structure has been changed in MySQL 5.1.40 GA. Upgrade instructions are available here:
Windows Upgrade Instructions
Linux Upgrade Instructions
Tar Upgrade Instructions
There are also a known issue with the 3.3.1 version.
Please note: There is a known issue with 3.3.1. A server crash may occur for queries of the form:

SELECT distinct(a1), max(a2) FROM t1 GROUP BY a3;

Where a1, a2, a3 may or may not be the same. It is caused by the use of max and min and the distinct clause, and may be easily worked around by removing the distinct and adding an additional clause to the GROUP BY.

This issue has been resolved and will be included in the next release. A full list of known and fixed issues is included at the end of this document.
All the information about the new release can be found here and can be downloaded here.
The patches previously posted on this blog, will be updated to work in 3.3.1. The updated patches will be added to the old post.

Debugging .NET application startup

When you start a .NET application where it can't find all assembly dependencies, you'll get an error. But sometimes it isn't a very describing one, for example the last time, I got error placing kernel32.dll as the fault module, no more than this. This error doesn't say anything about the actual error, what to do?
I fired up WinDbg, with the SOS extension. Then I went to File -> Open Executable, and chose the application to start. As WinDbg breaks before the program are actually run, I pressed F5, to continue the application. Then WinDbg breaks later on because of an exception, with an output something like this:
ModLoad: 77650000 77716000   C:\Windows\system32\ADVAPI32.dll
ModLoad: 777a0000 77863000   C:\Windows\system32\RPCRT4.dll
ModLoad: 72d70000 72dd6000   C:\Windows\Microsoft.NET\Framework\v4.0.30109\mscoreei.dll
ModLoad: 765c0000 76619000   C:\Windows\system32\SHLWAPI.dll
ModLoad: 77cc0000 77d0b000   C:\Windows\system32\GDI32.dll
ModLoad: 76620000 766bd000   C:\Windows\system32\USER32.dll
ModLoad: 77870000 7791a000   C:\Windows\system32\msvcrt.dll
ModLoad: 766c0000 766de000   C:\Windows\system32\IMM32.DLL
ModLoad: 76410000 764d8000   C:\Windows\system32\MSCTF.dll
ModLoad: 76350000 76359000   C:\Windows\system32\LPK.DLL
ModLoad: 76390000 7640d000   C:\Windows\system32\USP10.dll
ModLoad: 750e0000 7527e000   C:\Windows\WinSxS\x86_microsoft.windows.common-controls_6595b64144ccf1df_6.0.6002.18005_none_5cb72f96088b0de0\comctl32.dll
ModLoad: 72250000 727e0000   C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
ModLoad: 73e90000 73f2b000   C:\Windows\WinSxS\x86_microsoft.vc80.crt_1fc8b3b9a1e18e3b_8.0.50727.4053_none_d08d7da0442a985d\MSVCR80.dll
ModLoad: 768b0000 773c0000   C:\Windows\system32\shell32.dll
ModLoad: 773c0000 77505000   C:\Windows\system32\ole32.dll
ModLoad: 70c50000 71748000   C:\Windows\assembly\NativeImages_v2.0.50727_32\mscorlib\894183c0c47bd4772fbfad4c1a7e3b71\mscorlib.ni.dll
ModLoad: 60340000 60348000   C:\Windows\Microsoft.NET\Framework\v2.0.50727\culture.dll
ModLoad: 72ea0000 72eec000   image72ea0000
ModLoad: 00a70000 00abc000   image00a70000
ModLoad: 72ea0000 72eec000   C:\Windows\assembly\GAC_MSIL\mscorlib.resources\2.0.0.0_da_b77a5c561934e089\mscorlib.resources.dll
(1090.1160): C++ EH exception - code e06d7363 (first chance)
(1090.1160): C++ EH exception - code e06d7363 (first chance)
(1090.1160): C++ EH exception - code e06d7363 (first chance)
(1090.1160): C++ EH exception - code e06d7363 (first chance)
(1090.1160): C++ EH exception - code e06d7363 (first chance)
(1090.1160): CLR exception - code e0434f4d (first chance)
(1090.1160): CLR exception - code e0434f4d (!!! second chance !!!)
eax=002cfb50 ebx=e0434f4d ecx=00000001 edx=00000000 esi=002cfbd8 edi=000f79a8
eip=7651fbae esp=002cfb50 ebp=002cfba0 iopl=0         nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000216
KERNEL32!RaiseException+0x58:
7651fbae c9              leave
I start by looking at the stack of the thread throwing the exception, using kb:
0:000> kb
ChildEBP RetAddr  Args to Child              
002cfba0 722ae774 e0434f4d 00000001 00000001 KERNEL32!RaiseException+0x58
002cfc00 7230dc9c 01f91444 00000000 00000000 mscorwks!RaiseTheExceptionInternalOnly+0x2a8
002cfc38 723e5a54 ffffffff 00126d28 cdf37c13 mscorwks!UnwindAndContinueRethrowHelperAfterCatch+0x70
002cfc84 72d76210 722c3d8f 002cfca0 72e57f16 mscorwks!_CorExeMain+0x1ac
002cfc90 72e57f16 00000000 72d70000 002cfcb4 mscoreei!_CorExeMain+0x38
002cfca0 72e54de3 00000000 7652d0e9 7ffda000 mscoree!ShellShim__CorExeMain+0x99
002cfca8 7652d0e9 7ffda000 002cfcf4 77af19bb mscoree!_CorExeMain_Exported+0x8
002cfcb4 77af19bb 7ffda000 72824509 00000000 KERNEL32!BaseThreadInitThunk+0xe
002cfcf4 77af198e 72e54ddb 7ffda000 00000000 ntdll!__RtlUserThreadStart+0x23
002cfd0c 00000000 72e54ddb 7ffda000 00000000 ntdll!_RtlUserThreadStart+0x1b
Here I can see that it is a .NET exception cause the application crash. This means that I can use the SOS command !pe, to look close at the exception. First I load SOS, using .load sos, and then the !pe command.
0:000> .load sos
0:000> !pe
Exception object: 01f91444
Exception type: System.IO.FileNotFoundException
Message: The file or assembly 'xxx.dll', Version=1.0.3672.38321, Culture=neutral, PublicKeyToken=cfb1f46daa2d0af2' or one of its dependencies could not be found. The file could not be found.
InnerException: 
StackTrace (generated):

StackTraceString: 
HResult: 80070002
Now the real error message appears, and the problem can be solved.
I'm pretty sure that the error message are supposed to be shown in the Windows Event Log, and that it has before, but for some reason this was not the case in the situation above.

Wednesday, February 3, 2010

Infobright ICE importer LINES TERMINATED BY patch

As a continuation of my previous post, Infobright also doesn't allow importing with a user defined line terminator.There are currently a bug report here.
This patch fixes this, please note that the patch from this post, http://kimsnetblog.blogspot.com/2010/02/infobright-exporter-lines-terminated-by.html, have to be applied first.

This patch allows you to use the Infobright importer while defining a line terminator, like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest LINES TERMINATED BY '\t'
By default the Infobright importer supports \r\n and \n as line terminators, and figures it out automatically, but if another line terminator are used, importing the file will fail.

The patch has to be applied in the Infobright source root.

Patch file:
importLinesTerminated.patch (4 KB)

UPDATE:
Patch file for ICE 3.3.1:
importLinesTerminatedBy331.patch (4 KB)

Infobright ICE exporter LINES TERMINATED BY patch

In all current versions, newest 3.3.0, of Infobright you cannot define LINES TERMINATED BY when exporting data to a file. Well, you can define it but it isn't used when the Infobright exporter are used.
The default line terminator are \r\n, which works great, but sometimes you want to use another line terminator, which are not supported in any of the current versions of Infobright.

There are currently two bug reports concerning this, http://bugs.infobright.org/ticket/1235 and http://bugs.infobright.org/ticket/1072.

This patch fixes this, it works in Infobright 3.3.0 and allows you to define LINES TERMINATED BY like this:
SELECT * FROM ibtest INTO OUTFILE '/tmp/outfile.csv' LINES TERMINATED BY '\n'
The patch has to be applied at the root of the Infobright source.

Patch file:
exporterLinesTerminated.patch (8 KB)

UPDATE:
Patch file for ICE 3.3.1:
exporterLinesTerminated331.patch (8 KB)