I am looking to read the contents of multiple tar.gz files. My first approach was to unzip the files via the "Unzip file" step, but that did not work. After reading through forum posts (http://forums.pentaho.com/showthread...-Gzipped-files) I have a better understanding of why.
Based on that, I read through the examples provided in the Advanced Users FAQ (http://wiki.pentaho.com/display/EAI/...ressedfiles%3F) and was able to successfully read one of the files inside a single tar.gz file via the "Text file input" step. Now I am looking to see how I might be able to access multiple tar.gz files that each contain the same set of files. Is this possible?
For example I might have the following source tar.gz files that I need to read:
bla_2014-12-15.tar.gz
bla_2014-12-14.tar.gz
foo_2014-12-15.tar.gz
foo_2014-12-14.tar.gz
Within each of these tar.gz files there is a bar.txt file. I tried reading the files from the "Text file input" step with wildcards like so...
tar:gz:/path/to/files/.*.tar.gz!/.*.tar!/bar.txt
...but I was not able to get this to work.
As mentioned above, I am able to read the bar.txt file if I specify the full path like so (tar:gz:/path/to/files/bla_2014-12-15.tar.gz!/bla_2014-12-15.tar!/bar.txt), but I need to be able to handle the possibility of multiple tar.gz files. So I tried another approach: building the full VFS string using "Get File Names" and a few other steps. I am able to build the full VFS string correctly and then pass it into the "Text file input" step with "Accept file names from previous step" turned on. When I run this, though, it fails, saying it can't open the following file:
tar:gz:file:////path/to/files/bla_2014-12-15.tar.gz!/bla_2014-12-15.tar!/bar.txt
It looks like the "file://" is causing the issue. Is there a way to tell it to not include "file://"? Does anyone have any ideas of how to get around all of this? Or a better way to handle this situation?
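To make the string I am aiming for concrete, here is a minimal Python sketch of the transformation (outside of Pentaho, purely as an illustration). The function name build_vfs_uri and its parameters are hypothetical, it only handles Unix-style paths like the ones in my examples (not Windows drive letters), and I have not confirmed that producing this string inside the transformation resolves the error.

```python
import posixpath


def build_vfs_uri(archive_uri: str, inner_file: str = "bar.txt") -> str:
    """Build a nested tar:gz: VFS string for one .tar.gz archive.

    Hypothetical helper: strips a leading "file://" scheme, then nests
    the archive path, the inner .tar name, and the target file name.
    """
    path = archive_uri
    if path.startswith("file://"):
        # Drop the scheme and collapse the remaining leading slashes to one.
        path = "/" + path[len("file://"):].lstrip("/")

    # The inner tar is assumed to be the archive name without ".gz".
    tar_name = posixpath.basename(path)
    if tar_name.endswith(".gz"):
        tar_name = tar_name[: -len(".gz")]

    return f"tar:gz:{path}!/{tar_name}!/{inner_file}"


if __name__ == "__main__":
    print(build_vfs_uri("file:////path/to/files/bla_2014-12-15.tar.gz"))
```

For the failing input above, this yields the same form as the full path that does work for me when entered by hand.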
EDIT: I also want to mention that this is being done on a Windows computer, with a Windows server in mind for running the Pentaho process. I know I could install gzip packages on both, but I'm trying to avoid that path if possible.