Pentaho Community Forums

Calculate unique values in parallel for Neo4j Node creation

Hi Kettle and Neo4j fans!
So maybe that title is a little bit over the top but it sums up what this transformation does:

Here is what we do in this transformation:

  • Read a text file
  • Calculate unique values over 2 columns
  • Create Neo4j nodes for each unique value

To do this we first normalize the columns, effectively doubling the number of rows in the set. Then we do some cleanup (remove double quotes). The secret sauce is then to do a partitioned unique value calculation (5 partitions means 5 parallel threads). By partitioning the data on the single column we guarantee that the same data ends up in the same partition (step copy). For this we use a Unique Hash Set step, which uses memory to avoid a costly "Sort/Unique" operation. While we have the data in parallel step copies, we also load the data in parallel into Neo4j. Make sure you drop indexes on the Node/Label you're loading to avoid transaction issues.
This allowed me to condense 28M rows at the starting point into 5M unique values and load those in just over 1 minute on my laptop. I'll post a more comprehensive walk-through example later, but I wanted to show you this strategy because it can help those of you out there in need of decent data-loading capacity.
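If you want to play with the idea outside of Kettle, here's a minimal Java sketch of the same strategy (made-up values; the real transformation does all of this with Kettle steps):
Code:

import java.util.*;
import java.util.concurrent.*;

public class PartitionedUnique {
    public static void main(String[] args) throws Exception {
        int partitions = 5; // 5 partitions -> 5 parallel threads
        List<String> values = Arrays.asList("a", "b", "a", "c", "b", "d");

        // Route each value to a partition by hash: identical values
        // always land in the same partition (the same "step copy").
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String v : values) {
            buckets.get((v.hashCode() & Integer.MAX_VALUE) % partitions).add(v);
        }

        // Deduplicate each partition in its own thread with an
        // in-memory HashSet, avoiding a costly sort/unique pass.
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<Set<String>>> results = new ArrayList<>();
        for (List<String> bucket : buckets) {
            results.add(pool.submit(() -> new HashSet<>(bucket)));
        }
        for (Future<Set<String>> f : results) {
            System.out.println(f.get()); // each unique set can load into Neo4j in parallel
        }
        pool.shutdown();
    }
}
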
Cheers,
Matt



Process the Column data in Excel File

We are receiving data in an Excel file and need to insert it into the database. Please find the input file format and the expected output below.

Could you please provide some input? It would be really helpful.

Input:
TimeStamp FieldA FieldB
2017-01-01T00:00 0.0 0.0
2017-01-01T00:30 0.0 0.0
2017-01-01T01:00 0.0 0.0
2017-01-01T01:30 0.0 0.0
2017-01-01T02:00 0.0 0.0
2017-01-01T02:30 0.0 0.0
2017-01-01T03:00 0.0 0.0
2017-01-01T03:30 0.0 0.0
2017-01-01T04:00 0.0 0.0
2017-01-01T04:30 0.0 0.0
2017-01-01T05:00 0.0 0.0
2017-01-01T05:30 0.0 0.0
2017-01-01T06:00 0.0 0.0
2017-01-01T06:30 0.0 0.0


Expected Output:


FieldName TimeStamp Value
FieldA 2017-01-01T00:00 0.0
FieldA 2017-01-01T00:30 0.0
FieldA 2017-01-01T01:00 0.0
FieldA 2017-01-01T01:30 0.0
FieldA 2017-01-01T02:00 0.0
FieldA 2017-01-01T02:30 0.0
FieldA 2017-01-01T03:00 0.0
FieldA 2017-01-01T03:30 0.0
FieldA 2017-01-01T04:00 0.0
FieldA 2017-01-01T04:30 0.0
FieldA 2017-01-01T05:00 0.0
FieldA 2017-01-01T05:30 0.0
FieldA 2017-01-01T06:00 0.0
FieldA 2017-01-01T06:30 0.0
FieldB 2017-01-01T00:00 0.0
FieldB 2017-01-01T00:30 0.0
FieldB 2017-01-01T01:00 0.0
FieldB 2017-01-01T01:30 0.0
FieldB 2017-01-01T02:00 0.0
FieldB 2017-01-01T02:30 0.0
FieldB 2017-01-01T03:00 0.0
FieldB 2017-01-01T03:30 0.0
FieldB 2017-01-01T04:00 0.0
FieldB 2017-01-01T04:30 0.0
FieldB 2017-01-01T05:00 0.0
FieldB 2017-01-01T05:30 0.0
FieldB 2017-01-01T06:00 0.0
FieldB 2017-01-01T06:30 0.0
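In other words, I need a classic unpivot; I understand PDI's Row Normaliser step does this kind of reshape. Roughly, in plain Java it would be (hypothetical names, just to illustrate):
Code:

import java.util.List;

public class Unpivot {
    public static void main(String[] args) {
        // Each input row: TimeStamp, FieldA value, FieldB value.
        List<String[]> rows = List.of(
            new String[] {"2017-01-01T00:00", "0.0", "0.0"},
            new String[] {"2017-01-01T00:30", "0.0", "0.0"});
        String[] fields = {"FieldA", "FieldB"};

        // Emit one output row per (field, input row) pair:
        // FieldName, TimeStamp, Value.
        for (int f = 0; f < fields.length; f++) {
            for (String[] row : rows) {
                System.out.println(fields[f] + " " + row[0] + " " + row[f + 1]);
            }
        }
    }
}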

Kettle pdi-ce-8.1.0.0-365 error on renaming Excel Input step in transformation

Hello all,

hope you can help me today.

Environment: Windows 10(64 bit)
Java 1.9.1

What I did stepwise:

  • Load pdi-ce-8.1.0.0_356.zip
  • unpack zip
  • go to folder /data-integration
  • create a simple *.xls Excel file test.xls
  • double-click SpoonConsole.bat
  • Select file --> new --> transformation
  • Select design --> input --> Microsoft Excel Input Step
  • Tab = "Files" --> Select in Step --> Browse --> test.xls --> Add the file
  • Tab = "Sheet" --> Select sheet --> select Table1 sheet
  • Tab = "Fields" --> Select "Get fields from header row..." --> ok --> in this tab select "preview rows" --> ok
  • Save the transformation as file *.ktr --> ok
  • Run the transformation --> ok
  • Select the "Microsoft Excel Input" step again to edit the step name --> press OK --> the attached error message occurs: "Unable to open dialog for this step"

It would be very nice if you could help me. Do I have to install something else?
With pdi-ce-7.1.0.0_12 everything works fine.

Thank you in advance

T.


Please find attached the error log message
------------------------------------------------------------
Creating configuration from pentaho.marketplace.di.cfg
Nov 28, 2018 6:39:22 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFORMATION: Setting the server's publish address to be /marketplace
Nov 28, 2018 6:40:11 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFORMATION: Setting the server's publish address to be /repositories
Nov 28, 2018 6:40:27 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFORMATION: Setting the server's publish address to be /browser
2018/11/28 18:40:41 - Carte - Installing timer to purge stale objects after 1440 minutes.
2018/11/28 18:48:35 - /Transformation 1 - Dispatching started for transformation [/Transformation 1]
2018/11/28 18:48:36 - Microsoft Excel Input.0 - Finished processing (I=1, O=0, R=0, W=1, U=0, E=0)
2018/11/28 18:48:36 - dummy.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
2018/11/28 18:48:43 - Spoon - Save as...
2018/11/28 18:48:43 - Spoon - Save file as...
2018/11/28 18:48:58 - Spoon - Using legacy execution engine
2018/11/28 18:48:58 - Spoon - Transformation opened.
2018/11/28 18:48:58 - Spoon - Launching transformation [test_excel_input]...
2018/11/28 18:48:58 - Spoon - Started the transformation execution.
2018/11/28 18:48:59 - test_excel_input - Dispatching started for transformation [test_excel_input]
2018/11/28 18:49:00 - Microsoft Excel Input.0 - Finished processing (I=1, O=0, R=0, W=1, U=0, E=0)
2018/11/28 18:49:00 - Spoon - The transformation has finished!!
java.lang.NullPointerException
at org.pentaho.di.trans.TransMeta.getStepMetaCacheKey(TransMeta.java:6405)
at org.pentaho.di.trans.TransMeta.findPreviousSteps(TransMeta.java:1384)
at org.pentaho.di.trans.TransMeta.findPreviousSteps(TransMeta.java:1371)
at org.pentaho.di.ui.trans.steps.excelinput.ExcelInputDialog.getInfo(ExcelInputDialog.java:1493)
at org.pentaho.di.ui.trans.steps.excelinput.ExcelInputDialog.ok(ExcelInputDialog.java:1468)
at org.pentaho.di.ui.trans.steps.excelinput.ExcelInputDialog.access$600(ExcelInputDialog.java:102)
at org.pentaho.di.ui.trans.steps.excelinput.ExcelInputDialog$7.handleEvent(ExcelInputDialog.java:1028)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.di.ui.trans.steps.excelinput.ExcelInputDialog.open(ExcelInputDialog.java:1214)
at org.pentaho.di.ui.spoon.delegates.SpoonStepsDelegate.editStep(SpoonStepsDelegate.java:120)
at org.pentaho.di.ui.spoon.Spoon.editStep(Spoon.java:8949)
at org.pentaho.di.ui.spoon.trans.TransGraph.editStep(TransGraph.java:3291)
at org.pentaho.di.ui.spoon.trans.TransGraph.mouseDoubleClick(TransGraph.java:785)
at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1375)
at org.pentaho.di.ui.spoon.Spoon.waitForDispose(Spoon.java:8104)
at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:9466)
at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:701)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.pentaho.commons.launcher.Launcher.main(Launcher.java:92)

Can I run Community Edition in a production environment?

Trying to get some input from the Pentaho group on some questions that arose in a discussion.
  1. Can you run the Community Edition in a production environment? If you do, how big is your Pentaho Environment?
  2. Why did you go the CE route instead of using the Pentaho paid subscription?
  3. What do you do for support if you run into a Production issue?
  4. Are there outside companies that provide support, and if so, what is their response time?


Our environment: we use Spoon/Kettle only, with an Oracle 12c backend repository (which is not supported), and do all our scheduling through Windows Task Scheduler. It's been this way for 13 years now.

Join tables with a common where condition

I have two tables in two different environments. I need to use Kettle to simulate a query like the following:

Code:

select *
from t1
left join t2
on t1.a = t2.a
where t1.b = 'VALUE1' OR t2.c = 'VALUE2'

I've tried two Table Input steps with the following queries:

Code:

select *
from t1
where t1.b = 'VALUE1'

Code:

select *
from t2
where t2.c = 'VALUE2'

I then merge the two data flows, but of course I am missing data.
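
Conceptually, since the OR condition spans both tables, I think the filter can only run after the join and cannot be pushed into either side, something like this Java sketch of join-then-filter (dummy data):
Code:

import java.util.List;
import java.util.Map;

public class JoinThenFilter {
    record T1(String a, String b) {}
    record T2(String a, String c) {}

    public static void main(String[] args) {
        List<T1> t1 = List.of(new T1("k1", "VALUE1"), new T1("k2", "x"));
        Map<String, T2> t2 = Map.of("k2", new T2("k2", "VALUE2"));

        for (T1 left : t1) {
            T2 right = t2.get(left.a()); // left join on t1.a = t2.a
            String c = (right == null) ? null : right.c();
            // The OR can only be evaluated once both sides are present,
            // i.e. after the join, never by pre-filtering each input.
            if ("VALUE1".equals(left.b()) || "VALUE2".equals(c)) {
                System.out.println(left.a() + " " + left.b() + " " + c);
            }
        }
    }
}

In PDI terms that would be two unfiltered Table Inputs, Sort rows on a, a Merge Join (LEFT OUTER) on a, and then a Filter Rows step with the OR condition.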

Does this approach make sense, or do you have a better idea how to solve this problem?

Thank you!

pdi-ce-8.1.0.0-365 / 8.2.x.x Excel Writer step DATE/TIMESTAMP column null in sheet

$
0
0
Hi All,

Test Environment

Windows 8.1/10 64 bit
Oracle 12.2.0.1
java version "1.8.0_131"
pdi-ce-(7.1, 8.1, 8.2)

What I did

Create simple transformation
Table Input step (with select * from user_objects)
Excel Writer Step (xlsx) writes file to local folder

What I found

The Excel file is written, but there is no content (null) in the DATE/TIMESTAMP cells. The Oracle table contains two DATE fields, CREATED and LAST_DDL_TIME, and these fields were empty in the Excel file.

I then tried to set a format for the date fields in the Excel Writer step --> nothing changed --> the columns are still null.
Next I turned the DATE fields into formatted strings in the Table Input step's select statement, i.e. to_char(DATE,'dd.mm.yyyy') --> now the fields are filled in the Excel file. But if I use "Format Cells" on the file it says "User defined format", so it is not a real DATE.

Questions

In previous versions (e.g. pdi-4.4) we did not need formatting. Do I always have to specify formatting in the Table Input step?
Will a statement like "select * from my_table" with TIMESTAMP or DATE columns always produce null columns?
Am I using the Excel Writer step in the wrong manner?

Thank you for helping me

T.

Pentaho Server community version 8.1.0.0-365 with Pentaho Data Integration

Hi All,

Is it possible to install Pentaho Data Integration (pdi-ce-8.1.0.0-365) in the Pentaho Server Community Edition?

Any advice on how to invoke Data Integration in the Pentaho console?

In the Browse Files section I'm able to open the result of a transformation (the Steel Wheels Data Integration example) but not the transformation itself.

Or is this only a feature of the Enterprise Version?

Thanks for a tip here!!

T.

Pentaho 8.2 is available!

I've come to accept my inefficiency at keeping up with the technical blog posts. This is the point where one accepts his complete uselessness (and I don't even know if this is a real word!)

Anyway - up to the good things:

Pentaho 8.2 is available!

Get it here!


A really, really solid release! Not a lot of eye-catching new buzzwords, but a huge list of things that will make a serious impact on development effort and production releases out there.


Release overview

Here's the release at a glance:

  • Enhance Ecosystem Integration
    • HCP Connector I
    • MapR DB Support
    • Google Encryption Support

  • Improve Edge to Cloud Processing
    • Enhanced AEL
    • Streaming AMQP

  • Better Data Operation
    • Expanded Lineage
    • Status Monitoring UX
    • OpenJDK support

  • Enable Data Science & Visualization
    • Python Executor
    • PDI Data Science Notebook (Jupyter) Integration
    • Push Streaming

  • Improve Platform Stability and Usability
    • JSON Enhancements
    • BA Chinese Language Localization for PUC
    • Expanded MDI

  • Additional Improvements




And now a little bit of detail on each of them:


Ecosystem Integration

HCP Connectivity

HCP is a distributed storage system designed to support large, growing repositories of fixed-content data, from simple text files and images to video and multi-gigabyte database images. HCP stores objects that include both data and metadata describing that data, and presents these objects as files in a standard directory structure.


An HCP repository is partitioned into namespaces owned and managed by tenants, providing access to objects through a variety of industry-standard protocols, as well as through various HCP-specific interfaces.



There are many use cases for using HCP in the Enterprise context:



  • Globally Compliant Retention Platform (GCRP)
    • Meet Compliance & Legal Retention requirements (WORM, SEC 17A-4, CFTC and MSRB)

  • Secure Analytics Archive
    • Big data source/target (land) for secure analytic workflows
    • Better Data portability
    • Multi-tenant

  • Protect data with much higher durability (up to fifteen 9s) and availability (up to ten 9s) with HCP




The PDI+HCP combo will bring many more resources to these use cases: by leveraging PDI's connectivity to a wide variety of data, we can use HCP as a "staging data lake" for semi-structured and unstructured data, and/or as an environment for executing data science algorithms against this type of content, like enriching HCP metadata or doing deep learning for image recognition.


In this release we implemented a VFS driver for HCP; next versions will include a deeper, metadata-level integration with HCP's functionality.






MapR DB support





A simple but important improvement: MapR DB is now supported! It's an enterprise-grade, high-performance, global NoSQL database management system: a multi-model database that converges operations and analytics in real time, and it exposes the HBase API so HBase applications can run against it, even though not all features are compatible.


PDI is now validated to read/write data from MapR-DB as HBase. In terms of the use cases this enables, I'd call out Operational Data Hub / Real-Time BI, Customer 360, and several IoT-related ones.




Google Cloud Encryption





Google CMEK gives data owners a multilayered security model that secures data and controls access to the data encryption keys. With this new capability, Pentaho users can use these custom encryption keys to access data in Google Cloud Storage and Google BigQuery, enhancing the security of the data. And we're very happy to say that we were able to confirm it just works, with no product change required! Damn, it feels good when that happens (which rarely does!)

Pentaho 8.0 CE - Schedules - Select Folder option not working

Hi,

I am trying to modify/create scheduled jobs and I cannot change the output folder (Generated Content Location). When I press the Select button it shows the Select Folder pop-up, but it never shows the available folders.

I checked the Pentaho Browse Files option and it is able to show them.

Anyone facing the same issue?

Regards,

Manuel

Eliminating garbled (mojibake) characters using PDI

Hi,

Please tell me how we can ignore garbled (mojibake) characters using PDI.

Example: ������added

I want to load only the string "added" into my database. Could you please tell me the PDI step or some logic to do this?
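For example, the cleanup logic I have in mind is roughly this sketch, which keeps only printable ASCII (I assume that is enough for my data; in PDI it could live in a User Defined Java Class step or a Replace in String step with a regex):
Code:

public class StripGarbage {
    public static void main(String[] args) {
        String raw = "\uFFFD\uFFFD\uFFFDadded"; // replacement chars + "added"
        // Keep only printable ASCII; this drops the mojibake characters.
        String cleaned = raw.replaceAll("[^\\x20-\\x7E]", "");
        System.out.println(cleaned); // -> added
    }
}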


Pentaho 8 + Saiku 3.90

Hi,

I am testing the new Pentaho server. I installed the Saiku plugin 3.90. I am trying to export the result of a default cube to Excel and I am getting this error:

driver=mondrian.olap4j.MondrianOlap4jDriver
11:44:31,851 ERROR [Query2Resource] Cannot get excel for query (B0F58368-A2C7-9FB1-A128-6DE8A3A63425)
java.lang.IllegalArgumentException: Merged region A1 must contain 2 or more cells
at org.apache.poi.xssf.usermodel.XSSFSheet.addMergedRegion(XSSFSheet.java:358)
at org.apache.poi.xssf.usermodel.XSSFSheet.addMergedRegion(XSSFSheet.java:323)
at org.saiku.service.util.export.excel.ExcelWorksheetBuilder.buildExcelTableHeader(ExcelWorksheetBuilder.java:818)

Cube:
Quadrant Analysis
Column:
Positions
Region
Row:
Department
Filter:
None

Regards,

Manuel

Table Input step with multiple statements

Hello,

In some ETLs I read a SQL script which I then execute in a Table Input step. The SQL is supposed to output 1 row with 1 column with a specified name. Now, in some of these scripts, I want to add statements that do not generate result rows.

To make this concrete: the query counts what just changed, and the script runs against Hive. An example would be:
Code:

set hive.query.name=something;
select count(*) from somewhere where something;

Pentaho fails with the error:
Quote:

java.sql.SQLException: The query did not generate a result set!
A requirement is that the Pentaho job is a generic job executing whatever SQL it finds in some directory, so I cannot do much in Pentaho itself. The best option I have found so far is to run an external command and capture its output, but that feels very brittle.
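
If I can change the wrapper at all, I imagine something like this JDBC sketch, which splits the script and uses the generic execute() so statements without result sets don't fail (connection URL and a Hive JDBC driver on the classpath assumed):
Code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RunScript {
    public static void main(String[] args) throws Exception {
        String script = "set hive.query.name=something;\n"
                      + "select count(*) from somewhere where something;";
        try (Connection con = DriverManager.getConnection(args[0]);
             Statement stmt = con.createStatement()) {
            // Naive split on ';' -- fine for scripts with no semicolons
            // inside string literals.
            for (String sql : script.split(";")) {
                if (sql.trim().isEmpty()) {
                    continue;
                }
                // execute() returns true only when the statement produced
                // a result set, so "set ..." statements do not fail.
                if (stmt.execute(sql.trim())) {
                    try (ResultSet rs = stmt.getResultSet()) {
                        while (rs.next()) {
                            System.out.println(rs.getLong(1));
                        }
                    }
                }
            }
        }
    }
}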

How could I do what I want?
Thanks,

My Pentaho Server doesn't start!

Hi,
I have a big problem with my online Pentaho Server instance: it hasn't started for 3 days.
I have checked the log but I don't understand it.
Yesterday I did a fresh installation but I still have the same problem.


See log image here https://i.ibb.co/8D6R6Dk/Selection-046.png


How can I resolve it, please? Any idea? :confused:

Regards,

PDI Data Science Notebook (Jupyter) Integration

Hello community!

I am really impressed by the Pentaho 8.2 release overview, especially the new features for data science!
Enable Data Science & Visualization

  • Python Executor
  • PDI Data Science Notebook (Jupyter) Integration


These features seem amazing! Integrating the brilliant big data ETL capabilities with a powerful data science platform, including Jupyter Notebooks, sounds fantastic!

I just downloaded the PDI 8.2 CE edition but I couldn't find anything related to this.

Are these features just for the EE version? Where can I find proper documentation on them?

Are there plans to release them in the CE version? At least as a plugin in the Marketplace?

Br,
Orair.

How to clear an environment variable?

Hello --- I use environment variables for database connection properties in the kettle.properties file.

I need to change a database instance from "blah blah" to blank. Just, blank. Nothing. Nada.

Setting a variable to (blank) doesn't seem to work in jobs. I'm using version 7.1.

Setting a variable to anything else changes it. However, setting it to blank acts like "do nothing, keep the old value."

How do I ACTUALLY set an active variable to blank or clear it out? A JavaScript step or sub-job or something?
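
For what it's worth, at the Kettle API level an empty value does seem to overwrite the old one, as in this sketch (kettle-core on the classpath assumed), so I suspect the blank-means-keep behaviour lives in the Set Variables job entry rather than in the variable space itself:
Code:

import org.pentaho.di.core.variables.Variables;

public class ClearVariable {
    public static void main(String[] args) {
        Variables vars = new Variables();
        vars.setVariable("DB_INSTANCE", "blah blah");

        // Overwriting with an empty string works here; it is the
        // Set Variables job entry that appears to skip blank values.
        vars.setVariable("DB_INSTANCE", "");
        System.out.println("[" + vars.getVariable("DB_INSTANCE") + "]"); // -> []
    }
}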

S3 file output not working in PDI 5.0.1 stable

Hello,

I am trying to export MySQL table data to an Amazon S3 bucket using the "S3 File Output" step, but I get an error every time I run the transformation.

"Error opening new file : org.apache.commons.vfs.FileSystemException: Could not determine the type of file...."

Any help will be much appreciated.


Thanks,

Kylin Cube with Mondrian -> adomd connection

Hi,

We are planning to use a Kylin OLAP cube with Mondrian and the Microsoft ADOMD client DLL. Is it possible to access a Kylin OLAP cube via the ADOMD client DLL?

Regards,
Tamil

How to create a drill-down report in Pentaho Report Designer

Hi,
I am trying to create a drill-down report in Pentaho Report Designer. My end goal is to display the drill-down report on the same page as the main report (just below the parent report); a pop-up would also work.

What I have tried:
First I am trying to achieve the basic functionality of passing a parameter from the parent report to the child report, but I am not able to do it.

I created two reports, first.prpt and drilledreport.prpt, and published them on the BI server.
In the first report I have defined this in the url attribute:


HTML Code:

=DRILLDOWN("remote-sugar"; "http://localhost:8080/pentaho"; {"::pentaho-path"; "/public/Steel Wheels/Reports/drilledreport.prpt"};{"productline"};[PRODUCTLINE])
In drilledreport.prpt I have created a query like this:
select PRODUCTNAME FROM PRODUCTS where productline =${productline}

On clicking the productline it calls the drilled report, but the report does not show any data.

It would be a great help if anyone could point out where I am going wrong. And if anyone has achieved what my end goal is, any idea or sample would be really helpful.

How to embed PDI in OSGI Bundle (Maven + Karaf Dependencies)

I would like to execute a Kettle job from my OSGI bundle running in Karaf.
Currently I cannot compile my bundle because I don't have the correct dependencies in my pom.xml.

From this wiki (https://github.com/pentaho/maven-parent-poms/wiki), I can see that my pom needs to inherit from Pentaho's parent pom.
So in my pom.xml I have added this section which uses the correct version:

Code:

<parent>
    <groupId>org.pentaho</groupId>
    <artifactId>pentaho-ce-bundle-parent-pom</artifactId>
    <version>8.1.0.0-365</version>
</parent>

What dependencies do I need to add to my pom.xml?
Is there any documentation that explains this process? I have tried this (https://help.pentaho.com/Documentati...nter/PDI/Embed) but it didn't help.
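
For context, the code I am trying to get compiling boils down to something like this minimal sketch (I assume the classes come from the kettle-core and kettle-engine artifacts):
Code:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunKettleJob {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle environment (plugin registry etc.).
        KettleEnvironment.init();

        // Load the job from a .kjb file, without a repository.
        JobMeta jobMeta = new JobMeta("/path/to/my_job.kjb", null);

        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        Result result = job.getResult();
        System.out.println("Errors: " + result.getNrErrors());
    }
}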

415 Error while trying to retrieve report content

This is a post related to PDI but the functional issue seems to be in BISERVER.

I've been trying to use BI-SERVER (8.2) to generate some reports and save them as PDF through PDI.

Using a UDJ step I can save binary data that is returned.

Trying to use the REST Client step, I end up with "HTTP Status 415 – Unsupported Media Type" when calling with POST and the output-target parameter set to pageable/pdf.

Using the REST Client step with GET .../generatedContent?output-target=pageable/pdf returns the PDF binary in the response.

All fine and good; however, using the GET method I have to include the user/password in the URL, which I don't like. I would rather include them in the POST parameters.

I've tried to manually include content-type as application/pdf...for the POST, same result.

Is this a Tomcat configuration error, a Pentaho configuration error, or just a limitation of the server?

FYI: POST without specifying an output-target does return an HTML report.
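
One thing I still want to verify: for a POST, the Content-Type header describes the request body being sent (the form parameters), not the media type I want back, so application/x-www-form-urlencoded plus an Accept header may be what the server expects. Outside PDI, the request I'm trying to reproduce looks roughly like this Java 11 sketch (host, report path, and credentials are placeholders):
Code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class PostReport {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder()
            .encodeToString("user:password".getBytes());

        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/pentaho/api/repos/"
                + "%3Apublic%3Areport.prpt/generatedContent"))
            // Content-Type describes the request body (the form
            // parameters), not the PDF we want back.
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Accept", "application/pdf")
            // Credentials go in a header instead of the URL.
            .header("Authorization", "Basic " + auth)
            .POST(HttpRequest.BodyPublishers.ofString(
                "output-target=pageable/pdf"))
            .build();

        HttpResponse<byte[]> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofByteArray());
        System.out.println("HTTP " + resp.statusCode()
            + ", " + resp.body().length + " bytes");
    }
}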

Any ideas?

tia,

BobC