After experimenting with some "improved" stream/branch disable functionality, my big data transformation now suffers from congestion and comes to a halt! The transformation reads a single file and writes statistics to fact tables. The file has more than 22'000'000 rows in total, which means everything has to run smoothly and fast, or rows start to pile up.
Some sub-streams should be disabled depending on arguments given at start. The straightforward way to do this in Spoon is to:
1) Add a "Get Variables" step. Add the variable which decides whether the branch should be disabled or not.
2) Add a "Filter" step and filter on the stream field set above. True: continue the stream. False: disable the stream.
However, this process means the same constant field is added 22'000'000 times in step 1) above, and the logical comparison is then done 22'000'000 times in step 2) above. That's 21'999'999 times more than necessary!
So I tried to write my own Java code to do the test only once (code at the end of this post).
This works excellently with small data. The sub-branch to be disabled is immediately green/finished after receiving the first row. But the transformation freezes with big data. Why? What am I missing? It seems like the upstream steps are still trying to send rows somehow. Why would they try to send rows when this step has already executed "setOutputDone()" and returned false?
I wish the filter step would accept variables!
Code:
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null)
    {
        // No more input rows: signal end of output and stop this step
        setOutputDone();
        return false;
    }

    if (first)
    {
        first = false;

        // Disable the stream / branch if ENABLE_BRANCH variable is not 'Y' (yes)
        String enable = getVariable("ENABLE_BRANCH", "NULL");
        if (!enable.equals("Y"))
        {
            setOutputDone();
            return false;
        }
    }

    // Branch is enabled: pass the row on unchanged
    r = createOutputRow(r, data.outputRowMeta.size());
    putRow(data.outputRowMeta, r);
    return true;
}
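To illustrate what I suspect is going on (not confirmed), here is a small plain-Java sketch that does not use the Kettle API at all. It assumes the hop between two steps behaves like a bounded buffer: if the reading step stops after the first row, the writing step can only push rows until the buffer is full and then waits. That would explain why small data works and big data freezes. The class name, method names and buffer sizes are made up for the illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {

    // Upstream writes into a bounded buffer that nobody reads any more.
    // Returns how many rows were accepted before the writer got stuck.
    public static int rowsAcceptedWithoutReader(int totalRows, int bufferSize) {
        BlockingQueue<Object[]> buffer = new ArrayBlockingQueue<>(bufferSize);
        int accepted = 0;
        try {
            for (int i = 0; i < totalRows; i++) {
                // offer() with a timeout instead of put(), so the sketch
                // terminates instead of hanging like the real transformation
                if (!buffer.offer(new Object[] { i }, 50, TimeUnit.MILLISECONDS)) {
                    break; // buffer full and no reader: the writer is stuck
                }
                accepted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted;
    }

    // Same writer, but the reader keeps taking rows and throws them away.
    public static int rowsAcceptedWithDrainingReader(int totalRows, int bufferSize) {
        BlockingQueue<Object[]> buffer = new ArrayBlockingQueue<>(bufferSize);
        final Object[] end = new Object[0]; // marker for "no more rows"
        Thread reader = new Thread(() -> {
            try {
                while (buffer.take() != end) {
                    // discard the row: the branch is disabled
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        int accepted = 0;
        try {
            for (int i = 0; i < totalRows; i++) {
                if (!buffer.offer(new Object[] { i }, 1, TimeUnit.SECONDS)) {
                    break;
                }
                accepted++;
            }
            buffer.put(end);
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(rowsAcceptedWithoutReader(5, 10));
        System.out.println(rowsAcceptedWithoutReader(100, 10));
        System.out.println(rowsAcceptedWithDrainingReader(100, 10));
    }
}
```

If this guess is right, a disabled branch would still have to keep calling getRow() and discard the rows rather than return false after the first one, which of course brings back a per-row cost, although a much smaller one than adding a field and comparing it.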