I'm trying to optimize the time it takes for pure inserts (I guess updates are a different story).
I'm wondering what the biggest bottleneck is, or the best way to reduce the time, for basic inserts into a SQL database.
The first thing, probably, is the size of the data, right? That means the number of rows, the number of columns, and the data size in each column. Some of those can't really be reduced, but the KB footprint of each row is one thing that can potentially be optimized, right?
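For example (these table and column names are made up just for illustration), I'm assuming that trimming column types down to what the data actually needs would shrink each row:

    -- Made-up example: narrower types mean fewer bytes per row.
    -- Wide version: everything stored as NVARCHAR / BIGINT "just in case".
    CREATE TABLE dbo.PageData_Wide (
        Id       BIGINT        NOT NULL,  -- 8 bytes
        Category NVARCHAR(255) NULL,      -- up to ~510 bytes
        Notes    NVARCHAR(MAX) NULL
    );

    -- Tighter version: same data, smaller per-row footprint.
    CREATE TABLE dbo.PageData_Narrow (
        Id       INT           NOT NULL,  -- 4 bytes
        Category VARCHAR(50)   NULL,      -- up to 50 bytes
        Notes    VARCHAR(1000) NULL
    );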
What else can be optimized, or what is the largest factor? Is it the transmission medium? That is, how big is the difference between writing to a database on the same machine versus writing over a network connection (one that is robust, fast, and has a 1 ms ping)?
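One thing I assume matters here is round trips: if each INSERT statement costs at least one round trip, then even at 1 ms of ping, 1000 single-row inserts spend a full second on latency alone. So I'm guessing a multi-row insert (same made-up table as above) would amortize that:

    -- One statement, one round trip, many rows.
    -- SQL Server allows up to 1000 rows in a single VALUES list.
    INSERT INTO dbo.PageData_Narrow (Id, Category, Notes)
    VALUES (1, 'a', 'first row'),
           (2, 'b', 'second row'),
           (3, 'c', 'third row');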
Finally, why do multiple parallel connections to the database speed up the process, up to a point? When I have 20 connections making inserts round-robin style, it's about 6-7x faster than one connection writing all the data. I'm curious why that is.
Right now I have 2.2 million rows totaling 2.7 GB, which works out to about 1.23 KB per row.
Inserting 1000 rows at a time (about 1.23 MB) using 14 connections takes 6.7 seconds. That's roughly 150 rows per second overall, or a snail-paced 10.66 rows per second per connection. Even assuming a single connection would be just as fast as all 14 combined (it isn't), 150 rows/second is not exactly "fast" either. I'm writing over a fast, robust network connection because we can't run the ETL process in the same environment as the data warehouse.
So, how can I optimize for speed here?
The reason for 1000 rows at a time is that the data comes in pages of 1000, but optimizing the parsing is a separate issue for now.
I believe I have one primary index, but nothing too write-expensive. So far I've just been doing Monte Carlo-style testing (try it and see what works), but I need something more focused.
I posted a similar question on StackOverflow, and so far the response has been that I need to leverage "bulk inserting" and "minimal logging" (as far as SQL Server goes).
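If I understand that advice correctly, the SQL Server side would look something like this (the file path and table name are made up, and this assumes the data first gets staged as a flat file the server can read):

    -- Rough sketch of the "bulk insert + minimal logging" advice.
    -- As far as I understand, TABLOCK plus SIMPLE or BULK_LOGGED recovery
    -- (and a heap or empty target table) is what allows minimal logging.
    BULK INSERT dbo.PageData_Narrow
    FROM '\\etl-box\staging\page_0001.csv'   -- made-up UNC path
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,
        BATCHSIZE = 100000
    );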
How exactly would I do a bulk insert using Pentaho? I'm already using a transaction commit size of 1000 rows (one data set per commit), but perhaps I need to make that larger. Do I need to use an "execute SQL statement" step instead of a table insert step so I can use the "BULK INSERT" syntax?
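Or, if the right pattern is to keep loading a staging table with the current step and then flush it to the target in bulk, I'm guessing an "execute SQL statement" step would run something like this (the staging and target names are made up):

    -- Made-up staging/target names. With TABLOCK and a suitable recovery
    -- model, this INSERT...SELECT can also qualify for minimal logging.
    INSERT INTO dbo.PageData_Target WITH (TABLOCK)
    SELECT Id, Category, Notes
    FROM dbo.PageData_Staging;

    TRUNCATE TABLE dbo.PageData_Staging;  -- clear staging for the next page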