I'm importing 500k 'activities' from around 2,000 XML files into my warehouse.
Each activity has an identifier, which is a varchar(255).
It should be unique, but it is not.
I want to add a field that stores how often each identifier is used.
The table, in simplified form, has the fields: stage_activity_key, iati_identifier, occurence_iati_identifier.
I tried it with MySQL:
Code:
TRUNCATE stage_activity_temp;

INSERT INTO stage_activity_temp (`iati-identifier`, occurence_iati_identifier)
SELECT `iati-identifier`, COUNT(*)
FROM stage_activity
GROUP BY `iati-identifier`;

UPDATE stage_activity a
LEFT JOIN stage_activity_temp b
  ON a.`iati-identifier` LIKE b.`iati-identifier`
SET a.occurence_iati_identifier = b.occurence_iati_identifier;
But after 50,000 seconds the UPDATE query was still running...
I tried it with PDI, using a simple three-step transformation:
get rows from stage_activity, look up occurrence, update stage_activity.
But (I assume) because of database/table locking, only 20,000 rows are read and buffered, and no rows are updated.
I considered using a Group By step while importing the activities, but sorting 500k rows (and still growing) doesn't seem a wise decision.
Does anybody know a clever solution?
Would it help to use a lookup table for the iati-identifier, so I can update using an integer key?
Something like (activity_key, identifier_key, identifier, occurence).
The data is refreshed at most once a day.
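One possible culprit for the long-running UPDATE is the `LIKE` join: MySQL generally cannot use an index for it, so every stage_activity row is compared against stage_activity_temp without index support. A sketch of the same update with an equality join and supporting indexes (index names are made up; untested against your schema):

```sql
-- Indexes so the GROUP BY and the join can be resolved via the index
-- instead of repeated full scans (index names are illustrative).
ALTER TABLE stage_activity      ADD INDEX idx_sa_iati  (`iati-identifier`);
ALTER TABLE stage_activity_temp ADD INDEX idx_sat_iati (`iati-identifier`);

-- Same update as above, but joining on equality so the index can be used.
UPDATE stage_activity a
JOIN stage_activity_temp b
  ON a.`iati-identifier` = b.`iati-identifier`
SET a.occurence_iati_identifier = b.occurence_iati_identifier;
```

With an index on the identifier, a join update over ~500k rows should normally finish in minutes rather than tens of thousands of seconds.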
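Regarding the lookup-table idea, a sketch of what it could look like (table and column names follow the description above; an untested outline, not a drop-in solution):

```sql
CREATE TABLE iati_identifier_lookup (
  identifier_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
  identifier     VARCHAR(255) NOT NULL,
  occurence      INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (identifier_key),
  UNIQUE KEY uq_identifier (identifier)
);

-- Rebuild the counts once per daily load: one row per distinct identifier.
INSERT INTO iati_identifier_lookup (identifier, occurence)
SELECT `iati-identifier`, COUNT(*)
FROM stage_activity
GROUP BY `iati-identifier`
ON DUPLICATE KEY UPDATE occurence = VALUES(occurence);
```

stage_activity would then carry the integer identifier_key, and the count is stored once per identifier instead of being copied onto every activity row.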
Hope somebody can help.
Jaap-Andre