Channel: Pentaho Community Forums

store aggregated field in (mysql) database

I'm importing 500k 'activities' from around 2,000 XML files into my warehouse.
Each activity has an identifier, stored as a varchar(255).
It should be unique, but it is not.
I want to add a field recording how often each identifier occurs.
The table, in simplified form, has the fields: stage_activity_key, iati_identifier, occurence_iati_identifier

I tried it with MySQL:
Code:

TRUNCATE stage_activity_temp;

insert into stage_activity_temp (`iati-identifier`, occurence_iati_identifier)
SELECT `iati-identifier` , count(*)
FROM `stage_activity`
GROUP BY `iati-identifier`;

update stage_activity a
left join stage_activity_temp b on
a.`iati-identifier` like b.`iati-identifier`
set a.occurence_iati_identifier=b.occurence_iati_identifier ;

But after 50,000 seconds the update query was still running...
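(Editor's note: the `like` join is the likely bottleneck here. MySQL can only use an index on `LIKE` when the pattern is a constant prefix; with a column on the right-hand side it falls back to scanning the temp table for every one of the 500k rows. A sketch of the same update using an equality join plus supporting indexes, assuming the table and column names above already exist:)

Code:

-- one-time: index the join columns (skip if they are already indexed)
ALTER TABLE stage_activity_temp ADD INDEX idx_temp_ident (`iati-identifier`);
ALTER TABLE stage_activity ADD INDEX idx_act_ident (`iati-identifier`);

-- equality join lets MySQL do an index lookup per row instead of a scan
UPDATE stage_activity a
JOIN stage_activity_temp b
  ON a.`iati-identifier` = b.`iati-identifier`
SET a.occurence_iati_identifier = b.occurence_iati_identifier;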

I tried it with PDI using a simple three-step transformation:
get rows from stage_activity, look up the occurrence, update stage_activity.
But (I assume) because of database/table locking, only 20,000 rows are read and buffered, and no rows are updated.

I considered using a Group By step while importing the activities, but sorting 500k rows (and still growing) doesn't seem like a wise decision.

Does anybody know a clever solution?
Would it help to use a lookup table for the iati-identifier, so I can update using an integer key?
Something like: (activity_key, identifier_key, identifier, occurence)
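(Editor's note: that lookup-table idea could be sketched as follows. All table and column names here are illustrative, and it assumes a nullable integer column `identifier_key` has been added to stage_activity:)

Code:

CREATE TABLE stage_identifier (
  identifier_key INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  identifier     VARCHAR(255) NOT NULL,
  occurence      INT NOT NULL DEFAULT 0,
  UNIQUE KEY uq_identifier (identifier)
);

-- aggregate once into the small lookup table
INSERT INTO stage_identifier (identifier, occurence)
SELECT `iati-identifier`, COUNT(*)
FROM stage_activity
GROUP BY `iati-identifier`;

-- one-time varchar join to stamp the integer key on each activity;
-- after this, daily refreshes only need to update the counts in
-- stage_identifier and join back on the indexed integer key
UPDATE stage_activity a
JOIN stage_identifier s ON a.`iati-identifier` = s.identifier
SET a.identifier_key = s.identifier_key;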

The data is only renewed daily (at most).

Hope somebody can help.

Jaap-Andre
