Channel: Pentaho Community Forums

JSON Input step 11 r/s speed ... why so slow?

Hello there---

I'm still working on a data integration project to get Zendesk's native JSON data into a MySQL data warehouse.

My two speed bottlenecks are JSON parsing and then SQL "upserting."


Actually, I've separated these steps to do JSON >> Serialize to File, then DeSerialize >> MySQL, because the DB has a write "time-out" problem for now.



So for JSON >> Serialize to File, I'm getting an 11 r/s write rate. That's abysmal, is it not?

Each JSON document has 1000 records. Here's a dummy sample of "one record" (there are 1000 of these per document).

Code:

  "id":              35436,
  "url":              "https://company.zendesk.com/api/v2/tickets/35436.json",
  "external_id":      "ahg35h3jh",
  "created_at":      "2009-07-20T22:55:29Z",
  "updated_at":      "2011-05-05T10:38:52Z",
  "type":            "incident",
  "subject":          "Help, my printer is on fire!",
  "raw_subject":      "{{dc.printer_on_fire}}",
  "description":      "The fire is very colorful. This is the email body and it can go on forever and ever and ever and blah blah blah. I expect a full refund for the printer fire and bonus inc. Respectfully yours, Mr. JD Fake Name, 1130 W Anderson Ville, Cog Enterprises, 555",
  "priority":        "high",
  "status":          "open",
  "recipient":        "support@company.com",
  "requester_id":    20978392,
  "submitter_id":    76872,
  "assignee_id":      235323,
  "organization_id":  509974,
  "group_id":        98738,
  "collaborator_ids": [35334, 234],
  "forum_topic_id":  72648221,
  "problem_id":      9873764,
  "has_incidents":    false,
  "due_at":          null,
  "tags":            ["enterprise", "other_tag"],
  "via": {
    "channel": "web"
  },
  "custom_fields": [
    {
      "id":    27642,
      "value": "745"
    },
    {
      "id":    27648,
      "value": "yes"
    }
  ],
  "satisfaction_rating": {
    "id": 1234,
    "score": "good",
    "comment": "Great support!"
  },
  "sharing_agreement_ids": [84432]
}



I'm currently parsing out about 17 of these fields using the JSON Input step. Note: the sample "description" field is misleading. In the example it's 200-ish characters, but it actually holds full email bodies and can be 5,000 characters. I am not parsing out this field or the other massive variable-length fields. However, the JSON Input step still has to READ over all these characters, does it not?

I did a test character count on one JSON page (1000 records like above) --- and it comes out to a total of 4.6 million characters.

And again, this is simply reading a local JSON file, parsing it, and serializing to file. The steps are done in parallel, but it takes 90 seconds to get through 1000 records. Perhaps this is reasonable. Is there any way to optimize this process?
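
One idea I'm considering (a minimal, untested sketch, not a drop-in for the JSON Input step): read each document as a single string field, e.g. with a Text file input step, and pull out only the needed fields in a Modified Java Script Value step. The field name json_raw and the assumption that each page wraps its records in a "tickets" array are mine, purely for illustration.

Code:

// Sketch for a Modified Java Script Value step.
// Assumes the upstream step delivers one whole JSON page per row
// in a string field named "json_raw" (illustrative name).
var page = JSON.parse('' + json_raw);              // on older Rhino builds use eval('(' + json_raw + ')')
var tickets = page.tickets ? page.tickets : page;  // assumes records are wrapped in a "tickets" array

// Copy out only the fields you actually load (add them to the step's
// Fields grid); the rest of the parsed object is simply discarded.
var first_id         = tickets[0].id;
var first_status     = tickets[0].status;
var first_created_at = tickets[0].created_at;

// Fanning the 1000 records out to 1000 rows would additionally need the
// step's createRowCopy()/putRow() special functions (see its sample scripts).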

SQL Insert/ Update "Upsert" Scalability and Performance

Hi --

I'm still having issues scaling the Insert/ Update Table Output step.

I have a serialized .cube file (the fastest format Pentaho can read, if I recall) being de-serialized, followed by an Insert / Update based on a unique field, "Ticket_ID."

In total there are 18 fields being either inserted or updated, but they are all < 50 chars, nothing huge.



The problem is that there are currently 250k records in the database (250k unique Ticket IDs).

I'm also updating a database that isn't on the local machine but on a remote web server (which probably slows things significantly; that's one optimization to be made).



Here's the thing though. I'm inserting 1000 records at a time in this "Upsert" step.

It's my understanding that the Ticket_ID of each of these 1000 records has to be compared to every. last. Ticket_ID. of the 250k records. (to determine whether to insert, or update).

Currently it takes a full 20 minutes for the "upsert" of the 1000 records to complete. Truly an abysmal rate, IMHO. And that time will only increase as the 250k records grow to 500k, 1 million, or 2 million.


My question: Surely there is a better way to do this?


My thoughts:

1. Would a table index improve "upsert" time? It's my understanding that an "upsert" consists of both a read query and then a write query. The 'read' may ultimately be the slow part, and an index would fix this. Would a simple B-tree index (or any index) on Ticket_ID, the 'key' here, be a significant time saver? I will probably attempt it.


2. Possibly the solution to this whole mess. Tickets become STATUS: CLOSED after 30 days. At this point -- they will NO LONGER be updated in any way. That table of 250k records? 90% of it is CLOSED. They don't have to be looked up.

Why am I doing an "upsert" that is looking up every ticket since the beginning of time, when in reality, I only need to look at tickets within the last 30 days, or status <> closed? The latter is FAR more scalable, and it doesn't increase linearly with time!

But I have no idea how to technically implement this. Telling SQL to "check if a ticket status is closed, and if so, don't look it up for update" is still forcing it to do lookup work.

My thoughts are possibly a daily or even monthly migration of "closed" tickets to a separate table. That way, the "upsert" is only querying 1-2 months of ticket data. Then perhaps a view can "union" the tables.

Does this make sense? I'm certainly no database administrator. Thanks for any suggestions.

R/S speed, performance log, bottleneck analysis at JOB vs TRANSFORMATION level?

Hi there --- even with detailed logging turned on, I notice that at the JOB level there is no detailed breakdown of the r/s speed of each step, progress, etc., like there is at the bottom of the window when you run a transformation by itself.

I'm curious if there's an easy way to see the duration of each transformation step (and sub-step), or a visual showing what started, what the bottleneck was, and what the speed of step X, Y, or Z was ... just a bird's-eye view for optimization purposes.

I know I can just compare the 'start' and 'end' times in the logging, but that's such a manual effort. Any ideas? Thanks.

Copy and Pasting to Hidden Directory

This seems to be generating an "Unavailable" error for me. Copy works fine; pasting into a hidden directory does not.

Can someone verify this? Should be fairly simple to do.

I've tested on 5.1 GA and 5.3 GA: both give me the error, which is sort of annoying since I have to go and unhide the directory, then paste.

Hello Tutorial - JS error Can't find bundle Messages

I tried the community version of Data Integration with the basic Hello World tutorial.

I'm getting this missing resource bundle error:

Caused by: java.util.MissingResourceException: Can't find bundle for base name org.mozilla.javascript.resources.Messages, locale en_US

Feature Extraction for Building Text-Classifier in Weka

Hello Dear Weka Community,

I want to extract text-features from annotated text, such that the extracted-features could be used to build the text-classifier within Weka.

Specifically, I am looking for an open-source application/library/tool that can transform the input text into indices/numeric features. One such (closed-source) tool is Coh-Metrix (http://cohmetrix.com), which can transform a given text into about 106 numeric features. I have appended its list of features in an attachment.

Please advise me on any open-source application/library/tool that I could use to extract different text features (e.g., n-grams, grammatical features, writing mechanics, vocabulary features, POS features, etc.).

Even the smallest hint would be helpful.

Thank you for your time and support.

Best regards,

Syed

Carte Question/Jobs

I have recently fallen in love with Carte and would like to start using it instead of cron to manage jobs and how often they run. But I have yet to figure out one thing that would make my life much easier.

In a job, on the Start entry, you can set a schedule to run by time/date/etc. What I have not figured out is how to save a "snapshot" of Carte jobs in the event that Carte has to be restarted or crashes. The only way I know how to do this is to log back into Spoon, relaunch the jobs, set the schedule again, and pass them to the remote server.

Is there some means of saving the status so that you do not have to do that again?

Thanks

Help with the specifics of VARIABLES please

Hi,
I am really getting confused about how to use variables.
I have consulted many posts and am still no wiser as to how variables are declared, set, and typed.
Many posts show 'Set Variables' steps as 'Set Environment Variables', and these have columns for Field name and Variable name.
I am using Kettle-Spoon 5.3.0.0 and my 'Set Variables' step has 'Variable name' and 'Value'.
Then there is the wiki (http://wiki.pentaho.com/display/EAI/Set+Variables), which shows even more attributes, like 'Apply formatting'!


Rules I have gleaned so far are:
* you can't set a variable in the same transformation where it is used
* it's not so much a declaration as a setting: "It accepts one (and only one) row of data to set the value of a variable."


What I am trying to do.....
there is a database table keyed on date with a 'needed flag' and a 'done flag'.
....for each Date where Needed=Y and Done=N
...........do the work - in this case take a snapshot fact as-at this date
............mark the date as 'Done'=Y
.....end for


Until I get the date passing working, the 'do the work' step is not in the flow.
It starts with a Table input step, and what I expect is to be able to use a variable
in its where clause, something like: where LS.AS_OF_DATE = '${FACT_LVE_as_at_date_txt}'


So far my Pentaho structure is a job
* Start
* Run transform 'select date' (with no advanced option set)
* Set Variables
* Run transform 'mark off date'


transform 'select date'...
* has a table input step that selects the dates
* has a 'copy rows to result' step

Set Variables
* has no content - odd in the extreme but many posts say this.

Transform 'mark off date'
* Get variables - attempting to get the variable
* a 'get system Info' and formula step (to add constants)
* an Update step


Running this gets various errors, which leads me to these questions:
1. Where is a variable given a name?
2. Where is a variable given a data type / format?
3. How do you get a variable into the list, i.e. 'press CTRL-Space to select a variable to insert'?
4. The whole looping appears to be controlled by 'Copy rows to result' in the lower-level transformation;
from a documentation / maintenance viewpoint this does not seem sound.


Ok, so now feel free to tell me where I have gone off the rails, and/or a much simpler way to achieve the whole thing.
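
For completeness, I gather a variable can also be given its name and value from a Modified Java Script Value step via its built-in setVariable() function, subject to the same rule that the transformation which sets it cannot also use it. A minimal sketch; the field name and the "r" (root job) scope flag are my assumptions:

Code:

// Sketch for a Modified Java Script Value step inside 'select date'.
// Assumes the incoming row carries a date field named AS_OF_DATE.
var dateTxt = '' + AS_OF_DATE;   // force a plain string value

// setVariable(name, value, scope) is one of the step's special functions;
// scope 'r' should make the variable visible to the rest of the root job.
setVariable('FACT_LVE_as_at_date_txt', dateTxt, 'r');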


Thanks
JC

ways to combine multiple tables into unified header field names

Hello, I have an Access database with a few different, unrelated tables. What's the best transformation to bring all of those into a unified CSV output with mapped field names?

Publishing Reports

Hi All,

I am doing a PoC on Pentaho Reporting, so I have downloaded Pentaho Report Designer. I have created my first report by connecting to my local database and am able to preview it, but when I try to publish it I run into an error. Do I need the Pentaho BI Server in order to publish reports?

How to Run Multiple individual queries in PRD

Hi All,

I'm in a situation where I need to run N queries through Pentaho Report Designer and have to merge the result sets at the end as output.

For instance, I have to run 5 individual queries separately and combine the result sets at the end.

Any suggestions would be appreciated.



Regards,
Naveen

Pentaho Ctools Changelogs

In my latest blog post I announced the release of new Ctools versions. I always publish that kind of entry as soon as I get the notice from our engineering team.

However, there's a much better way to follow what's been happening with the Ctools. Our UX team always presents this information in a much, much better way than I can aim for.

You can access the Ctools changelog directly from the website. And let me tell you: I challenge you to find any project that does changelogs better than we do. Don't believe me? Take a look at the site, or just scroll down to see the different themes for the previous 4 Ctools releases (that's when we started doing them).

Ctools Version 15.04.16 - House Ctools

Ctools Version 15.02.26 - And the winner is...

Ctools Version 14.12.10 - Twinkle twinkle

Ctools Version 14.10.15 - Don't fall behind


Extracting data from nested JSON array

Hi

Example of the JSON file I'm dealing with is here -- http://pastebin.com/EhraSZX0
There is one record per customer. A customer, however, can have multiple pieces of contact information, such as phone numbers and email addresses.

Using the JSON Input step (split into a parent and children) and mapping its path only to $.Contact[], I was only able to get one number, "07415897816", which coincides with Customer_PrimaryFlag = "YES". But this is not the intended outcome.

The transformation is as follows:


And the JavaScript I have in "JS: Contact Information" is ...
Code:

//Script here

// Collect every contact in the Contact[] array; the original loop
// overwrote these variables on each pass, so only one value survived.
var Customer_TelPtr = '';
var contact_type = '';

if (json_contact) {

  var segmentContact = json_contact.split('},{');

  for (var i = 0; i < segmentContact.length; i++) {

    var stringBeginsTelno = segmentContact[i].split('"Customer_TelNo":"');
    if (stringBeginsTelno.length > 1) {
      var telNo = stringBeginsTelno[1].split('","')[0];
      Customer_TelPtr = (Customer_TelPtr == '') ? telNo : Customer_TelPtr + ';' + telNo;
    }

    var stringBeginsType = segmentContact[i].split('"Customer_TypeId":"');
    if (stringBeginsType.length > 1) {
      var typeId = stringBeginsType[1].split('","')[0];
      contact_type = (contact_type == '') ? typeId : contact_type + ';' + typeId;
    }
  }
}

Note: in the JS script above, json_contact is the field mapped to the JSON path $..Contact[].

Previewing the output gets me this:


The row we're interested in is the first one (where customer_id is blank). The subsequent rows are other customers. Apparently the JavaScript step has only extracted one row. There are a total of 7 contacts (6 phone numbers and 1 email) for Ms. Megan Denise Fox. The objective here is for PDI to go through the nested Contact[] path and get all the contact information there is. The ideal outcome should be (ignore column C):


I need some help / guidance here to get n rows per customer for the n pieces of contact information nested within Contact[].
I'm starting to believe that PDI cannot iterate through a nested JSON array and that this is a limitation. If it truly is a limitation, then please just say so. Awaiting your responses.
Thank you!
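
For what it's worth, here is a minimal sketch of the "n rows per customer" behaviour I'm after, based on the Modified Java Script Value step's createRowCopy()/putRow() special functions (check the step's sample scripts for exact usage). The extra output field and its position are assumptions for illustration:

Code:

// Sketch: emit one output row per element of the Contact[] array.
// Assumes one new output field (e.g. Contact_TelNo) has been added
// in the step's Fields grid, directly after the input fields.
var outSize = getOutputRowMeta().size();   // total number of output fields
var telIdx  = getInputRowMeta().size();    // index of the first new field

if (json_contact) {
  var segments = json_contact.split('},{');

  for (var i = 0; i < segments.length; i++) {
    var parts = segments[i].split('"Customer_TelNo":"');
    if (parts.length > 1) {
      var row = createRowCopy(outSize);    // copy of the current input row
      row[telIdx] = parts[1].split('","')[0];
      putRow(row);                         // push an extra row downstream
    }
  }
}

// The step still emits its normal row for this input row; skipping it
// leaves only the fanned-out rows going downstream.
trans_Status = SKIP_TRANSFORMATION;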

PCM15 - Pentaho Community Meeting 2015 - vote for the location preferences

FYI: Nelson created a survey to collect location preferences for the next PCM.
The exact date is still undecided but it'll probably be November 7, 14 or 21, so as to not clash with Pentaho World 2015.

Please take a look and if you plan on attending, help us by voting on your preferred option:

https://www.surveymonkey.com/s/2SGPY2F

Google Sheet input data

Dear all,

I'm wondering if it's possible to build a solution that shares the budgeting-cycle data entry between several users.
Probably some of you have already faced this problem.
Using XLS sheets is always complicated because many people have to send the same sheet to different users, each adding his/her own planned data.
For instance, sales agents input their forecast data for a new period / month / year.

BI is not really meant to handle data entry as well, so I'm asking if there are some tricks to achieve this.
Someone told me that Sparkl could be helpful, but I'm not a programmer and it looks a bit techy.

Google Sheets, instead, is easily shareable and could be a good compromise.

Does PDI natively support this?

Any other suggestion is welcome.

Many Thanks

jsonoutput: key from field

Hi, could you please advise on a transformation?


Assuming I have data

User Age
John 26
Kate 58

How could I produce JSON where the username acts as the key and the age as the value?

Code:

{
    "John" : 26,
    "Kate" : 58
}
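
One way I can think of to get this shape (a minimal sketch, not necessarily the only approach): build the '"name": age' fragment per row in a Modified Java Script Value step, then concatenate the fragments with a Group By step using the "Concatenate strings separated by ," aggregation, and finally wrap the result in braces. The field names User and Age come from the sample data; json_pair is an assumed output field name.

Code:

// Per-row sketch for a Modified Java Script Value step:
// turns User = "John", Age = 26 into the fragment "John": 26.
// Add json_pair as a new String field in the step's Fields grid.
var json_pair = '"' + User + '": ' + Age;

After the Group By, a small second JavaScript or Formula step can prepend '{' and append '}' to the concatenated field to produce the object shown above.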

Thanks

Cube published from Schema workbench not seen in Saiku

Hi,

I'm using Pentaho CE 5.3, Schema Workbench as a desktop tool, and the Saiku plugin downloaded from the Marketplace (also the CE version). The DB repository is Oracle, and the BI server runs on Linux.
I have created MyTestSchema in Schema Workbench and published it successfully. I have an Oracle DB repository and I can see an XML file in a BLOB in DS_REPOS_DATASTORE. I started PUC, went to Manage Data Sources, and I see a data source MyTestSchema of type Analysis. It says the Mondrian file is MyTestSchema.mondrian.xml.
But when I start a new Saiku analysis, I don't see MyTestCube.
catalina.out displays these errors:
15:58:37,339 ERROR [GenericServlet] GenericServlet.ERROR_0004 - Resource /saiku-ui/saikuplugin.properties_supported_languages.properties not found in plugin saiku
15:58:37,386 ERROR [GenericServlet] GenericServlet.ERROR_0004 - Resource /saiku-ui/saikuplugin.properties.properties not found in plugin saiku
15:58:37,409 ERROR [GenericServlet] GenericServlet.ERROR_0004 - Resource /saiku-ui/saikuplugin.properties_sr.properties not found in plugin saiku


But these errors don't seem to matter, because when I create a cube from the database-tables source, that cube does show up in Saiku's drop-down box.

Do you have any suggestions on how to solve this problem?

Best Way to Create Surrogate Key for Main Fact Tables?

Hi there--

I'm reading some literature on ETL/ Data Warehousing best practices in general --- you can definitely fill a library.

Anyway, it's highly recommended to create a surrogate key / dummy primary key for the main fact tables (i.e., the tables of measures, numbers, etc., as opposed to dimension tables, which are mappings/descriptions).

It's also recommended that this surrogate key not be a natural key + timestamp, or anything of the sort. That's what I'm currently doing, so it looks like this author knows what he's talking about.



Anyway, I'm trying to visualize how to actually achieve this in Pentaho. Would it be best to look up the max surrogate key ID in the current table (say it's 193832), add one, and increment from there during each load/write of each row?

Or should there just be a universal Pentaho variable for it that is incremented by one every time the job is run?

I'm just trying to visualize what will be most useful in the event of an ETL failure/ error/ rollback, etc. What's the easiest/ most straightforward way to generate surrogate keys in Pentaho?

Plugin for Google Drive Error

Hi all,
I found this nice plugin that provides an input step for Google Sheets:

http://blog.intuitivus.com.br/pt/uti...mo-datasource/

After installing it I can see it in the input step list on the left panel, but when I try to edit it in a transformation I get this error:

org.eclipse.swt.layout.FormData cannot be cast to org.eclipse.swt.layout.GridData

Can someone help me fix it?

Thanks
Giovanni

Details:
java.lang.ClassCastException: org.eclipse.swt.layout.FormData cannot be cast to org.eclipse.swt.layout.GridData
at org.eclipse.swt.layout.GridLayout.layout(Unknown Source)
at org.eclipse.swt.layout.GridLayout.computeSize(Unknown Source)
at org.eclipse.swt.widgets.Composite.computeSize(Unknown Source)
at org.eclipse.swt.widgets.Control.pack(Unknown Source)
at org.eclipse.swt.widgets.Control.pack(Unknown Source)
at org.pentaho.di.ui.trans.step.BaseStepDialog.setSize(BaseStepDialog.java:900)
at org.pentaho.di.ui.trans.step.BaseStepDialog.setSize(BaseStepDialog.java:877)
at org.pentaho.di.ui.trans.step.BaseStepDialog.setSize(BaseStepDialog.java:265)
at com.intuitivus.pdi.steps.spreadsheet.IntuitivusSpreadsheetStepDialog.open(IntuitivusSpreadsheetStepDialog.java:450)
at org.pentaho.di.ui.spoon.delegates.SpoonStepsDelegate.editStep(SpoonStepsDelegate.java:124)
at org.pentaho.di.ui.spoon.Spoon.editStep(Spoon.java:8797)
at org.pentaho.di.ui.spoon.trans.TransGraph.editStep(TransGraph.java:3027)
at org.pentaho.di.ui.spoon.trans.TransGraph.mouseDoubleClick(TransGraph.java:744)
at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1316)
at org.pentaho.di.ui.spoon.Spoon.waitForDispose(Spoon.java:7979)
at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:9310)
at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:654)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.pentaho.commons.launcher.Launcher.main(Launcher.java:92)

Problem with the v-align

Hi.

I've had a problem since I started using version 5.3.
When I run a report (in PDF, HTML or any other format), the v-align of the fields (labels, message-fields, text-fields, etc.) is incorrect: if I define the v-align as "Bottom", the generated report renders it as "Middle"; if I define the v-align as "Middle", it renders as "Top"; and if I set the v-align as "Top", it renders higher than "Top".

Here are a few pictures, I hope this helps:


Thanks, for now. :D