First of all, I have already read these pages, which I found by searching this forum:
http://blog.spec-india.com/a-guide-t...aho-kettle-pdi
http://www.timbert.net/doku.php?id=t...ansonceperfile
http://diethardsteiner.blogspot.com/...designing.html
I have a directory of 2000+ large XML files (daily). I want to process those XML files in as many concurrent XML Input Stream (StAX) steps as possible. Therefore, I do not want to use the "execute for every input row" job option that is often mentioned together with the Get File Names step. It makes no sense to process 2000 files one at a time on one server. Most examples use CSV File Input, which can read a whole list of files at once. The XML Input Stream (StAX) step cannot.
1) In the first link above, there is a screenshot showing the XML step taking a variable for the filename. I've tried that, but it doesn't seem to work: with Get Variables -> XML Input Stream (StAX), the XML step never receives the content of that variable.
2) What is the best way to accomplish the parallelization? Since XML Input Stream (StAX) processes one file at a time, what would the job and the transformations have to look like to get XML Input Stream (StAX) running as Cx4, for example, on a cluster? I assume the master would have to send a list of files to each slave, but which preceding step would pass each filename to the XML Input step?
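To make the distribution in (2) concrete, here is a minimal sketch written outside of PDI, in plain Python. It is not Kettle code and does not use any Kettle API. It only illustrates the idea of splitting the file list round-robin so that each parallel reader gets its own subset. The function name `partition_round_robin` and the `workers` parameter are hypothetical names made up for this illustration.

```python
def partition_round_robin(filenames, workers):
    """Split a list of file names into `workers` buckets, round-robin.

    Hypothetical helper for illustration only: each bucket stands for
    the list of files one parallel reader would process.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    buckets = [[] for _ in range(workers)]
    # Sort first so the assignment is repeatable from run to run.
    for index, name in enumerate(sorted(filenames)):
        buckets[index % workers].append(name)
    return buckets


if __name__ == "__main__":
    # Made-up file names; in practice this would be the directory listing.
    files = ["c.xml", "a.xml", "b.xml", "d.xml", "e.xml"]
    for worker_id, bucket in enumerate(partition_round_robin(files, 2)):
        print(worker_id, bucket)
```

With 2000 files and, say, 8 readers in total, each bucket would hold 250 files. What I am missing is the step that would do this hand-off inside a clustered transformation.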