Hi there,
I have a bunch of flat log files that I want to process for reporting purposes; they currently sit on an HDFS cluster.
The intention is to ingest the files, extract their fields (via regex or whatever other method is available) and perform some basic reporting, like the number of hits to a particular page, top visitors, etc.
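To illustrate the kind of extraction and reporting I mean, here is a rough Python sketch (the combined access-log format and the field names are just assumptions; the real logs may differ):

```python
import re
from collections import Counter

# Assumed: Apache "combined"-style access log lines; adjust the pattern for the real format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

def parse_lines(lines):
    """Yield a dict of named fields for every line the regex matches."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield m.groupdict()

if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
        '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1045',
        '203.0.113.7 - - [10/Oct/2023:13:56:02 +0000] "GET /index.html HTTP/1.1" 200 2326',
    ]
    records = list(parse_lines(sample))
    hits_per_page = Counter(r["path"] for r in records)  # hits per page
    top_visitors = Counter(r["ip"] for r in records)     # hits per client IP
    print(hits_per_page.most_common(3))
    print(top_visitors.most_common(3))
```

The question is really about where this kind of work would run, not how to write it.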
Prior to looking into PDI I played a bit with Splunk's Hunk. That software splits its jobs across cluster nodes using YARN, meaning I can run a reasonably lightweight Hunk server that harvests computing power from the nodes to complete its jobs.
However, after reading the PDI documentation I came away with the impression that, while reading the files from HDFS should be trivial, the data would be read from the HDFS cluster but the processing and reporting would run on the Pentaho platform rather than on the cluster itself (unless I design the job as a MapReduce job).
Is this understanding correct?