Hello there---
I'm still working on a data integration project to get Zendesk's native JSON data into a MySQL data warehouse.
My two bottlenecks for speed are JSON parsing and then SQL "upserting."
Actually, for now I've separated these steps into two stages, JSON >> Serialize to File and then Deserialize >> MySQL, because the DB currently has a write timeout problem.
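For what it's worth, the second stage is conceptually something like the sketch below. It's only rough Python for illustration, not my actual transformation: the tickets table, its columns, the mysql-connector-python driver, and the line-per-record intermediate file are all stand-ins.
Code:
import json
import mysql.connector  # assumes the mysql-connector-python package

# Placeholder connection details for the warehouse.
conn = mysql.connector.connect(
    host="localhost", user="etl_user", password="secret", database="dwh"
)
cur = conn.cursor()

# One upsert statement, keyed on the ticket id (the column list is a stand-in
# for whichever of the ~17 fields actually land in the table).
UPSERT = (
    "INSERT INTO tickets (id, subject, status, priority, updated_at) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE "
    "subject=VALUES(subject), status=VALUES(status), "
    "priority=VALUES(priority), updated_at=VALUES(updated_at)"
)

BATCH_SIZE = 500
batch = []

# tickets_page.jsonl stands in for the intermediate serialized file,
# one JSON record per line.
with open("tickets_page.jsonl") as f:
    for line in f:
        r = json.loads(line)
        batch.append((r["id"], r["subject"], r["status"],
                      r["priority"], r["updated_at"]))
        if len(batch) >= BATCH_SIZE:
            cur.executemany(UPSERT, batch)  # send the whole batch at once
            conn.commit()
            batch = []
    if batch:
        cur.executemany(UPSERT, batch)
        conn.commit()

cur.close()
conn.close()
The batching is the only point of the sketch: each round trip to the DB carries hundreds of rows instead of one.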
For the JSON >> Serialize to File stage, I'm getting about 11 rows/second of write throughput. That's abysmal, is it not?
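In plain code, that first stage boils down to roughly the following. Again, this is just a Python sketch to illustrate, not what actually runs; the field list and the "tickets" wrapper key are my shorthand for the real layout.
Code:
import csv
import json

# Roughly the fields I keep (the exact list of ~17 doesn't matter here).
KEEP = ["id", "url", "external_id", "created_at", "updated_at",
        "type", "subject", "priority", "status", "recipient",
        "requester_id", "submitter_id", "assignee_id", "organization_id",
        "group_id", "forum_topic_id", "problem_id"]

# Parse the whole ~4.6M-character page in one go...
with open("tickets_page.json") as src:
    doc = json.load(src)

# ...then write only the kept fields to a flat file (standing in for the
# serialize-to-file step).
with open("tickets_page.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(KEEP)
    for record in doc["tickets"]:   # 1,000 records per page
        writer.writerow(record.get(k) for k in KEEP)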
Each JSON document holds 1,000 records. Here's a dummy sample of one record.
Code:
"id": 35436,
"url": "https://company.zendesk.com/api/v2/tickets/35436.json",
"external_id": "ahg35h3jh",
"created_at": "2009-07-20T22:55:29Z",
"updated_at": "2011-05-05T10:38:52Z",
"type": "incident",
"subject": "Help, my printer is on fire!",
"raw_subject": "{{dc.printer_on_fire}}",
"description": "The fire is very colorful. This is the email body and it can go on forever and ever and ever and blah blah blah. I expect a full refund for the printer fire and bonus inc. Respectfully yours, Mr. JD Fake Name, 1130 W Anderson Ville, Cog Enterprises, 555",
"priority": "high",
"status": "open",
"recipient": "support@company.com",
"requester_id": 20978392,
"submitter_id": 76872,
"assignee_id": 235323,
"organization_id": 509974,
"group_id": 98738,
"collaborator_ids": [35334, 234],
"forum_topic_id": 72648221,
"problem_id": 9873764,
"has_incidents": false,
"due_at": null,
"tags": ["enterprise", "other_tag"],
"via": {
"channel": "web"
},
"custom_fields": [
{
"id": 27642,
"value": "745"
},
{
"id": 27648,
"value": "yes"
}
],
"satisfaction_rating": {
"id": 1234,
"score": "good",
"comment": "Great support!"
},
"sharing_agreement_ids": [84432]
}
I'm currently parsing out about 17 of these fields using the JSON input step. Note: the sample "description" field is misleading. In the example it's around 200 characters, but in reality it holds full email bodies and can run to 5,000 characters. I'm not parsing out that field, or the other huge variable-length fields, but the JSON input step still has to read over those characters, does it not?
I did a test character count on one JSON page (1,000 records like the one above), and it comes out to 4.6 million characters in total.
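To put that in perspective, a standalone test along these lines (Python, outside the ETL tool, again assuming the records sit under a "tickets" key) would show what a bare parse of one page costs:
Code:
import json
import time

# One raw page from disk, the same file the transformation reads.
with open("tickets_page.json") as f:
    raw = f.read()

print(len(raw), "characters")           # ~4.6 million for one page

start = time.perf_counter()
doc = json.loads(raw)                   # the parser scans every character,
                                        # long description bodies included
elapsed = time.perf_counter() - start

print(f"parsed {len(doc['tickets'])} records in {elapsed:.3f} s")
My understanding is that the parse has to scan all 4.6 million characters regardless of how few fields I keep afterwards, the long description bodies included.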
And again, this is simply reading a local JSON file, parsing it, and serializing to file. The steps run in parallel, yet it takes 90 seconds to get through 1,000 records. Perhaps that is reasonable. Is there any way to optimize this process?