Thursday, March 31, 2011

SSIS/VB.NET Equivalent of SQL IN (anonymous array.Contains())

I've got some SQL which performs complex logic on combinations of GL account numbers and cost centers like this:

WHEN (@IntGLAcct In (
    882001, 882025, 83000154, 83000155, 83000120, 83000130, 
    83000140, 83000157, 83000010, 83000159, 83000160, 83000161,
83000162, 83000011, 83000166, 83000168, 83000169, 82504000, 
82504003, 82504005, 82504008, 82504029, 82530003, 82530004,
83000000, 83000100, 83000101, 83000102, 83000103, 83000104, 
83000105, 83000106, 83000107, 83000108, 83000109, 83000110, 
83000111, 83000112, 83000113, 83100005, 83100010, 83100015, 
82518001, 82552004, 884424,   82550072, 82552000, 82552001, 
82552002, 82552003, 82552005, 82552012, 82552015, 884433,   
884450,   884501,   82504025, 82508010, 82508011, 82508012,
83016003, 82552014, 81000021, 80002222, 82506001, 82506005, 
82532001, 82550000, 82500009, 82532000))

Overall, the whole thing is poorly performing in a UDF, especially when it's all nested and the order of the steps is important etc. I can't make it table-driven just yet, because the business logic is so terribly convoluted.

So I'm doing a little exploratory work in moving it into SSIS to see about doing it in a little bit of a different way. Inside my script task, however, I've got to use VB.NET, so I'm looking for an alternative to this:

Select Case IntGLAcct = 882001 OR IntGLAcct = 882025 OR ...

Which is obviously a lot more verbose, and would make it terribly hard to port the process.

Even something like ({90605, 90607, 90610} AS List(Of Integer)).Contains(IntGLAcct) would be easier to port, but I can't get the initializer to give me an anonymous array like that. And there are so many of these little collections, I'm not sure I can create them all in advance.

It really all NEEDS to be in one place. The business changes this logic regularly. My strategy was to use the udf to mirror their old "include" file, but performance has been poor. Now each of the functions takes just 2 or three parameters. It turns out that in a dark corner of the existing system they actually build a multi-million row table of all these results - even though the pre-calced table is not used much.

So my new experiment is to (since I'm still building the massive cross join table to reconcile that part of the process) go ahead and use the table instead of the code, but go ahead and populate this table during an SSIS phase instead of calling the udf 12 million times - because my udf version just basically stopped working within a reasonable time frame and the DBAs are not of much help right now. Yet, I know that SSIS can process these rows pretty efficiently - because each month I bring in the known good results dozens of multi-million row tables from the legacy system in minutes AND run queries to reconcile that there are no differences with the new versions.

The SSIS code would theoretically become the keeper of the business logic, and the efficient table would be built from that (based on all known parameter combinations). Of course, if I can simplify the logic down to a real logic table, that would be the ultimate design - but that's not really foreseeable at this point.

From stackoverflow
  • What if you used a conditional split transform on your incoming data set and then used expressions or something similar (I'm not sure if your GL Accounts are fixed or if you're going to dynamically pass them in) to apply to the results? You can then take the resulting data from that and process as necessary.

    Cade Roux : I know where you're coming from, but one of these UDFs is 600 lines with 30 CASE statements and another is 1000 lines. We already have to resync them each month to the original Focus include file to ensure our results are accurate until the old system is replaced.
    Dillie-O : ugh, yeah. That would be quite nasty. I see in the other comments that the dynamic array generation is working, so Woo Hoo for that!
  • Try this:

    Array.IndexOf(New Integer() {90605, 90607, 90610}, IntGLAcct) >-1
    
    Joel Coehoorn : Can you share the error?
    Cade Roux : The little squiggle went away, I'm guessing maybe this will work.
    Cade Roux : It was saying something like Expression Required on the opening parenthesis.
    Cade Roux : Looks like this is working.
    Michael : This might be rather inefficient - it creates this array for every row you process. Instead of anonymous array, I would create a member variable and initialize it once.
    Cade Roux : Good point. Perhaps I will look into these arrays and see if they can be meaningfully named - another failing of the system I am porting, it is extremely poorly written. We shouldn't even talk about the C++ side of the system....
    Cade Roux : This worked great, and I've actually put them in Dictionary objects for a massive boost in speed.

0 comments:

Post a Comment