Hello, this is my first post on the forum - I've only been using Spoon for a few weeks. I'm an experienced Java developer working in Germany.
My question: I have a job that performs ETL on Excel sheets, extracting data and saving it to a PostgreSQL DB. So far so good. Some of the Excel sheets need extensive transformations first, such as deleting columns and empty rows, renaming and inserting header fields, and filling in some missing values. This was formerly done via an Excel macro; I've written a Java class that uses Apache POI to do the job (much faster than the macro).
What's the best way of integrating this with my job? The Java code needs to read in the whole file to process it, so I'm not sure if a Step is the right way to go. Or should I write a custom plugin for the job?
The workflow I have in mind is to give the job/step a directory; it then reads all the .xlsx/.xls files there and performs the heavy transformations before moving on to the "easier" ETL steps. I've read as much as I could find about UDJC (User Defined Java Class), but I'm really unsure whether that's what I need, and if so, how I would implement processRow when I need to read in all the rows first.
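To make the workflow concrete, here is a rough sketch of the pre-processing stage as I picture it, independent of how it ends up being wired into Spoon. The names `ExcelBatchPreprocessor` and `WorkbookCleaner` are made up for this post; the cleaner is only a placeholder for my existing POI class, which loads the whole workbook before changing anything. It uses nothing but the JDK and is not meant as a finished step or plugin.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

public class ExcelBatchPreprocessor {

    /** Hypothetical hook: stands in for the POI-based class that reads a whole workbook. */
    public interface WorkbookCleaner {
        void clean(Path excelFile) throws IOException;
    }

    /** Returns the .xls/.xlsx files directly inside the directory, sorted by path. */
    public static List<Path> listExcelFiles(Path directory) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> entries = Files.list(directory)) {
            entries.filter(Files::isRegularFile)
                   .filter(p -> {
                       // Compare case-insensitively so "REPORT.XLS" is picked up too.
                       String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                       return name.endsWith(".xlsx") || name.endsWith(".xls");
                   })
                   .sorted()
                   .forEach(result::add);
        }
        return result;
    }

    /** Runs the cleaner once per Excel file and returns how many files were handled. */
    public static int preprocessAll(Path directory, WorkbookCleaner cleaner) throws IOException {
        List<Path> files = listExcelFiles(directory);
        for (Path file : files) {
            cleaner.clean(file); // whole-file processing happens here, one file at a time
        }
        return files.size();
    }
}
```

The point of the sketch is that the unit of work is a file, not a row, which is why I'm unsure a row-oriented processRow is the right place for it.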
Thanks in advance,
John