Channel: Pentaho Community Forums

Pentaho Runs Out of Memory (Heap space, or Stack Overflow) when looping job entries

Hi there--

Yes, I mentioned this before. I thought it was something recursive in my own job entries and transformations, but now I'm thinking it's native to Pentaho itself.

I've seen several bug reports in Pentaho --- one even showed a 'dummy' loop: a job with a Start step, then Dummy 1, then Dummy 2, then back to the Start step. It literally did nothing but bounce back and forth.

This crashes very quickly (we're talking minutes).

The same thing happens to me --- after about 80 loops of whatever steps, the job crashes with a heap space error. I even had a spare dummy step in my transformation as a visual aid for the loop; after taking that out, it gets to about 120 loops before crashing. That much memory overhead that never gets cleared, just from a dummy step? What gives?


Oddly enough, I have a similar transformation that has never crashed (it's only ever needed 250 loops, though), so it's possible the architecture of the job itself slows down this "memory bloat" as it accumulates. I'm going to run a couple of memory tools soon to see what I can figure out.


Has anyone else experienced this? The massive number of loops is only needed for the initial data load: the job checks whether a page is the "last page", and if not, it loads the next page.

After the initial load, it would only really loop 4-6 times each time it was run.

The only workarounds I can see are: A). Hard-code a maximum of 50 loops into the job. The initial data load (and any other big loads) would then have to be run 6-7 times manually.

B). Push the looping out to the operating system --- instead of Pentaho looping, the job could output a command line that runs it again if a certain condition is met at the end. Not sure if this is a good idea.

C). Somehow optimize the job to consume less memory, or give Pentaho more memory to work with. This is fine, but any initial load, reload, or other large looping run remains a problem waiting to happen.
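Option B can be sketched as a small OS-level wrapper: each Kettle run starts a fresh JVM, so heap never accumulates across iterations. Everything here is hypothetical --- `run_job` is a placeholder for the real `kitchen.sh -file=...` call, and the flag-file protocol (the job touches a file when it sees the last page) is an assumption:

```shell
#!/bin/sh
# Hypothetical sketch of option B: the OS loops instead of Kettle, so
# every iteration gets a fresh JVM heap. run_job stands in for
# something like: ./kitchen.sh -file=/jobs/load_pages.kjb
FLAG=/tmp/last_page.$$

run_job() {
    # Placeholder: pretend the 3rd page is the last one and signal it
    # by touching the flag file, as the real job would.
    [ "$1" -ge 3 ] && touch "$FLAG"
    return 0
}

rm -f "$FLAG"
i=0
# Stop when the job signals the last page, or after a safety cap.
while [ ! -f "$FLAG" ] && [ "$i" -lt 500 ]; do
    i=$((i + 1))
    run_job "$i"
done
rm -f "$FLAG"
echo "iterations: $i"
```

The upside is that leaked memory dies with each process; the downside is JVM start-up cost on every iteration.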

BayesNetGenerator returns 'All' for numeric values

I've used the tutorial at https://weka.wikispaces.com/Programmatic+Use to build a Bayesian network, and I write the network to an XML file. This is my code:
Code:

        // Imports needed for this snippet (Weka 3.6-era API):
        //   import weka.core.Attribute;
        //   import weka.core.FastVector;
        //   import weka.core.Instance;
        //   import weka.core.Instances;
        //   import weka.classifiers.bayes.BayesNet;
        //   import java.io.PrintWriter;

        // Declare two numeric attributes
        Attribute Attribute1 = new Attribute("firstNumeric");
        Attribute Attribute2 = new Attribute("secondNumeric");

        // Declare a nominal attribute along with its values
        FastVector fvNominalVal = new FastVector(3);
        fvNominalVal.addElement("blue");
        fvNominalVal.addElement("gray");
        fvNominalVal.addElement("black");
        Attribute Attribute3 = new Attribute("aNominal", fvNominalVal);

        // Declare the class attribute along with its values
        FastVector fvClassVal = new FastVector(2);
        fvClassVal.addElement("positive");
        fvClassVal.addElement("negative");
        Attribute ClassAttribute = new Attribute("theClass", fvClassVal);

        // Declare the feature vector
        FastVector fvWekaAttributes = new FastVector(4);
        fvWekaAttributes.addElement(Attribute1);
        fvWekaAttributes.addElement(Attribute2);
        fvWekaAttributes.addElement(Attribute3);
        fvWekaAttributes.addElement(ClassAttribute);
        // Create an empty training set
        Instances isTrainingSet = new Instances("Rel", fvWekaAttributes, 10);
        // Set class index
        isTrainingSet.setClassIndex(3);
        // Create the instance
        Instance iExample = new Instance(4);
        iExample.setValue((Attribute)fvWekaAttributes.elementAt(0), 1.0);
        iExample.setValue((Attribute)fvWekaAttributes.elementAt(1), 0.5);
        iExample.setValue((Attribute)fvWekaAttributes.elementAt(2), "gray");
        iExample.setValue((Attribute)fvWekaAttributes.elementAt(3), "positive");

        // add the instance
        isTrainingSet.add(iExample);
        //Create bayesnet
        BayesNet nb = new BayesNet();
        nb.buildClassifier(isTrainingSet);
        String str = nb.toXMLBIF03();
        PrintWriter writer = new PrintWriter("blub.xml", "UTF-8");
        writer.println(str);
        writer.close();

Then I use BayesNetGenerator's ability to produce artificial data from the Bayesian network in the XML file:
Code:

weka.classifiers.bayes.net.BayesNetGenerator -F blub.xml -M 10
This works for nominal values, but for numeric values it gives me:
Code:

@relation Rel-weka.filters.supervised.attribute.Discretize-Rfirst-last

@attribute firstNumeric {'\'All\''}
@attribute secondNumeric {'\'All\''}
@attribute aNominal {blue,gray,black}
@attribute theClass {positive,negative}

@data
'\'All\'','\'All\'',blue,negative
'\'All\'','\'All\'',gray,positive
'\'All\'','\'All\'',gray,negative
'\'All\'','\'All\'',gray,positive
'\'All\'','\'All\'',blue,positive
'\'All\'','\'All\'',blue,positive
'\'All\'','\'All\'',blue,positive
'\'All\'','\'All\'',black,positive
'\'All\'','\'All\'',black,negative
'\'All\'','\'All\'',black,positive
'\'All\'','\'All\'',gray,positive

'All' is also what appears in the XML file for the numeric attributes. How do I generate numeric values? Should I discretize the numeric values myself first?

Thanks.

Table Output - JSON to PostgreSQL JSONB field

Hi,

Can Kettle send data to a PostgreSQL JSONB field? Whenever I try to output JSON data to a PostgreSQL JSONB column I get an error; Kettle only seems to send JSON data to PostgreSQL text fields. Is there any way to make this work, or has Kettle simply not caught up with the newer PostgreSQL JSON/JSONB data types yet?

So as a workaround, I send the JSON data to a text field. When the transformation has completed, I go into PostgreSQL and move the data into a jsonb column, or change the column type to jsonb.
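Until Table Output understands JSONB directly, the manual step can at least be automated: run the type conversion as a follow-up SQL statement (for instance from an "Execute SQL script" job entry) once the load finishes. Table and column names here are hypothetical. Another option worth testing is adding `stringtype=unspecified` to the PostgreSQL connection's JDBC options, which lets the server infer the target type of string parameters.

```sql
-- Hypothetical table/column names: convert the text column that
-- received the JSON into a jsonb column in place.
ALTER TABLE staging_table
    ALTER COLUMN payload TYPE jsonb
    USING payload::jsonb;
```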

Thanks for your input!

Edit schedule in Pentaho Report Designer

I get the error below when I edit a scheduled report in the Pentaho User Console workspace:
"not-null property references a null or transient value: org.pentaho.platform.repository.subscription.Subscription.title"
How can I resolve this?

Clustering markers

Hi,

How can I create clustered markers with OpenLayers in CDE?

Comparing filenames in PDI

I am trying to import a certain .CSV file into my database using PDI 3.2.5 (so there's no easy way out via a "User Defined Java Class" step).

Normally this would be rather easy: just link a "CSV file input" step to a "Table output" step and be good to go. The problem, however, is that I don't know in advance --- before executing the job/transformation --- which file I want to import.

That is because I have many files in my import folder, which all share the same filename format: "KeyDate_Filename_YYYYMMDD.CSV".
The idea is to import the file with the newest YYYYMMDD for a given key date.

My theoretical approach to implement this would be:

  • Make the given key date available in PDI as a parameter (already done)
  • Read in the names of all files stored in the import folder
  • Filter said filenames for the given key date
  • Compare the YYYYMMDD of the remaining files and select the newest
  • Use selected filename as parameter in a "CSV file input" step (already done)
  • Import data via "Table output" step (already done)

Unfortunately I am fairly new to PDI and don't really have a compelling idea of how to implement the middle three steps (reading, filtering, and comparing the filenames), or whether this approach is viable at all.
Can anybody think of a way to get this done? I'd appreciate any feedback.
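The file-selection steps above can be sketched in a few lines of shell (all filenames below are fabricated for the demo); in PDI itself, a step such as "Get File Names" feeding a filter and a sort would express the same logic. Because YYYYMMDD is zero-padded, a plain lexical sort is also a chronological sort:

```shell
#!/bin/sh
# Sketch of the middle three steps with fabricated filenames: list the
# import folder, keep only files for the given key date, and pick the
# newest YYYYMMDD suffix.
DIR=$(mktemp -d)
KEYDATE=20150101
touch "$DIR/${KEYDATE}_sales_20150105.CSV" \
      "$DIR/${KEYDATE}_sales_20150212.CSV" \
      "$DIR/20141231_sales_20150301.CSV"    # other key date: filtered out
newest=$(ls "$DIR/${KEYDATE}"_*_????????.CSV | sort | tail -n 1)
echo "newest file: $(basename "$newest")"
rm -rf "$DIR"
```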

Identify duplicate column value from single row

Hi All,

How can I identify duplicate column values in a single row? For instance, my input looks like this:

id type_1 name type_2 company type_3 address
1 CON xxxxx CON xxxy NOC xxxxxxxx

I need to identify how many times "CON" is repeated.
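Outside PDI, the counting itself is a one-liner over the sample row above; inside PDI, the same loop over the type_* fields could live in a Modified Java Script Value step:

```shell
#!/bin/sh
# Count how often "CON" appears among the fields of a single row
# (the sample row from the post).
row="1 CON xxxxx CON xxxy NOC xxxxxxxx"
count=$(printf '%s\n' "$row" |
        awk '{ n = 0; for (i = 1; i <= NF; i++) if ($i == "CON") n++; print n }')
echo "CON appears $count time(s)"
```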

Cheers,
Harris

Error with Design Studio: Action validation failed

Hello everyone,

I'm new to Pentaho and I'm trying to create an action that sends an email with some data using Design Studio.
The tutorial I followed is here:
https://www.youtube.com/watch?v=1gk4GGPSO_U

I'm doing the exact same thing, but it gives me this error:
Code:

Possible Causes:
    RuntimeContext.ERROR_0035 - Action validation failed.

Action Sequence:Practice01.xaction
Execution Stack:
    EXECUTING ACTION: Send mail (EmailComponent)
        in LOOP ON: LoopResults
        in LOOP ON: LoopList

Loop Index (1-based):N/A
...etc

Does anyone have an idea how to solve this?

Thanks to all.

Parallel Coordinates in Analyser report

Hi ,

I am trying to integrate parallel coordinate charts with my Analyser report.
I am following the link below:
http://wiki.pentaho.com/display/COM/...el+Coordinates

It says to download the plugins and place them under the system folder.

But after doing so, my Analyser report itself no longer loads!

Has anyone worked with any kind of custom visualization charts?

Please let me know the steps to do so.

Thanks!!

Pass User Role (env::roles) from Report to OLAP4J XMLA

Hi Guys,

I want to pass the role of the user (env::roles) to the OLAP4J XMLA connection string to support role restrictions of the OLAP Cube on MDX queries.

Does anyone know a way to pass the user role to the OLAP4J connection within the PRD?

Thank you in advance.

Best regards,
Matthias

Run PDI transformation from HTML app

Hello,


I'd like to execute a PDI transformation directly from a simple HTML/JS app instead of the command window.


I don't know if this is possible, or whether any of you have done something similar.


Any hint or help would be highly appreciated.
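One possible route is to run the transformation on a Carte server and trigger it over HTTP, which an HTML/JS app can do with XMLHttpRequest or fetch. A sketch, where the host, port, and transformation path are all assumptions, and the executeTrans service assumes a reasonably recent PDI version:

```
# Terminal 1: start a Carte server from the PDI directory
./carte.sh localhost 8080

# Terminal 2 (or the HTML/JS app): trigger the transformation over HTTP.
# cluster/cluster is Carte's default login; change it in production.
curl -u cluster:cluster \
  "http://localhost:8080/kettle/executeTrans/?trans=/etl/my_transform.ktr"
```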

How to return non-matching rows?

Hi everybody,
I'm looking for a way to perform an SSIS-style lookup in Pentaho Data Integration.
I'll try to explain with an example:
I have two tables A and B.

Here is the data in table A:
1
2
3
4
5
Here is the data in table B:
3
4
5
6
7
After my process:
All rows in A that are not in B will be inserted into B.
All rows in B that are not in A will be deleted from B.

So here is my final table B:
3
4
5
1
2
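In PDI, this kind of two-way diff is usually done with a "Merge Rows (diff)" step (which flags each row as identical, new, changed, or deleted) followed by a "Synchronize after merge" step. The underlying set logic, sketched with sorted text files and comm(1) using the sample data above:

```shell
#!/bin/sh
# The post's set logic with comm(1) over sorted files: rows only in A
# get inserted into B, rows only in B get deleted from B.
A=$(mktemp); B=$(mktemp)
printf '1\n2\n3\n4\n5\n' > "$A"
printf '3\n4\n5\n6\n7\n' > "$B"
to_insert=$(comm -23 "$A" "$B")   # in A but not in B
to_delete=$(comm -13 "$A" "$B")   # in B but not in A
echo "insert into B: $(echo $to_insert)"
echo "delete from B: $(echo $to_delete)"
rm -f "$A" "$B"
```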

Can someone help me, please?

Commit Size and Lost Data on PDI

Hi all,

I have a transformation with multiple database steps that insert and update rows. When I use a commit size of 1 with about 30,000 rows processed, it works well and no data is lost, but this hurts performance. If I raise the commit size to around 50 or 100, it takes much less time, but some of my rows are lost (e.g. out of 30,000 rows, only about 29,900 reach the database). What is the right commit-size configuration for my Table Output and Update steps? Has anyone else run into this problem?

thanks

Error: This XML file does not appear to have any style information associated with it

Hi

I have Pentaho Release 5.2.0.0.209 installed. When I try to run a report (for example JFreeQuadForRegion), this message appears:
"This XML file does not appear to have any style information associated with it. The document tree is shown below."

I'm running Firefox on a client.
Can anyone tell me how to solve this problem?

Regards,

Commit Size and Lost Data

Hi everyone,

I have a transformation that uses many Table Output and Update steps. When I set the commit size to 1 with about 25,000 rows, all of the data reaches the database, but Pentaho's performance becomes slow. When I change the commit size to 50 with the same 25,000 rows, some rows never make it into the database, but performance becomes fast. How can I solve this problem?

Note:
1. PDI 5.0.0
2. OS: Windows 7

From .ktr transformation to .bat file

Hello,

I'd like to know how to create a batch file (.bat) that runs a .ktr transformation.
The batch file would automatically execute the PDI transformation through the command window.

Any hint or help would be highly appreciated.

Thank you in advance.
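A minimal sketch of such a .bat file, assuming a PDI install under C:\pentaho\data-integration and a transformation at C:\etl\my_transform.ktr (both paths hypothetical); Pan is PDI's command-line runner for transformations:

```
@echo off
REM Hypothetical paths: adjust the PDI install dir and the .ktr path.
cd /d C:\pentaho\data-integration
call Pan.bat /file:"C:\etl\my_transform.ktr" /level:Basic
REM Pan's exit code is 0 on success and non-zero on failure.
if %ERRORLEVEL% NEQ 0 echo Transformation failed with code %ERRORLEVEL%
```

Such a file can then be scheduled with the Windows Task Scheduler or invoked from any other process.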

Pentaho Report Designer - Fill remaining space of page

Hi All,

I have a report with sales line items for each client. I added a group so that a new group is produced for each client, and then a page break so that each client is on its own page. Now I would like the group footer to sit at the bottom of each page, with the details band stretching to fill the remaining height of the page. Can this be done?

Thank you in advance.

David

Need help with 2 series value comparison in bar chart for conditional coloring of bar

Hello,

I have an SQL query with the columns below:

column 1: product name
column 2: actual sales
column 3: target sales


I am using a bar chart. If actual sales > target sales, the bar should get one color; otherwise a different color. How can I do this in Pentaho CDE?

I have this function:
Code:

        function(s) {
            if (s.vars.value.value >= 100) { return "rgb(0, 174, 239)"; }
            else { return "rgb(127, 214, 247)"; }
        }

With this I can compare the first series against a constant: s.vars.value.value gives the value of the first series (actual sales). But how do I get the target sales value through s.vars.value?

Please help.

Pentaho CDE Nested SELECT sql query

Hi all, I hope you are doing well.

I use this query to create a chart in Pentaho CDE:
Code:

        select jira.dataissue.value, count(value), substring(jira.issue.entry, 1, 3)
        from jira.DataIssue, jira.issue
        where jira.dataissue.field = 'version(s)_corrigée(s)'
          and jira.dataissue.value is not null
          and jira.dataissue.issue = jira.issue.id
          and jira.dataissue.issue in
              (select jira.dataissue.issue from jira.dataissue, jira.issue
               where jira.dataissue.issue = jira.issue.id
                 and jira.dataissue.value = 'récit')
          and jira.dataissue.issue in
              (select jira.issue.id from dataissue, issue
               where jira.dataissue.issue = jira.dataissue.issue
                 and jira.dataissue.value = 'Fermée')
          and jira.dataissue.issue in
              (select jira.dataissue.issue from jira.dataissue, jira.issue
               where jira.dataissue.issue = jira.issue.id
                 and jira.dataissue.field = 'point_d_effort')
        group by jira.dataissue.value

No data is found, yet I'm sure the query actually works --- I tested it in MySQL.

Does Pentaho CDE accept nested SQL SELECT queries?

Following job runs on failure of the prior job

Hi all,

I have a case with one big Kettle job consisting of Job A -> Job B -> Job C -> Job D -> Success.

Each of the jobs consists of a set of transformations that are executed sequentially one after another.

The hops between jobs are configured to run only when the prior job has finished successfully.

A transformation fails in Job B, but instead of the entire big job failing, Job C continues to run even though Job B failed.

What would be the best architecture or design to avoid this?

Thanks,

Ron