In my previous post I wrote about porting an existing financial data warehouse to PDW and some of the challenges we faced there.
The main reason for getting the PDW in the first place, however, was a total rewrite of an existing operational data warehouse (called Insight@ISS).
We had a data warehouse setup that delivered what our internal and external customers wanted, but it was extremely expensive to add new customers to the solution, and there was no way it would scale to the number of customers we wanted on the platform.
So we bought into the Microsoft marketing message that the PDW was a mean mother f..ker of a SQL box. Our understanding was that we were buying a supercharged Porsche 911 of SQL Servers: a machine that would crunch anything we threw at it. It is not… What we got was much more akin to a Formula 1 racer. Put it on a track with a good pit crew and a great driver and it will work magic, but trying to take it through Copenhagen rush-hour traffic just will not work. And we wanted to take it through rush-hour traffic at high speed 😉
It was still fast at some things, and with a change in our loading paradigm from the old data warehouse project to the new one, we were seeing good performance.
Now the data warehouse has grown to 5 times the size of the old one, and we are starting to see longer processing times (it still scales close to linearly, but the starting point just wasn't fast enough).
And as we expect data and customers on the platform to grow by another factor of 5 over the next year, we have to do something to improve 🙂
Based on our experience with the financial data warehouse project, we knew that we needed to move from an SSIS ETL approach to an ELT approach: doing all the transformations on the PDW and just orchestrating the work from SSIS.
After a lot of discussion and a small prototype, we arrived at a solution that looks like this:
The purpose of this design is to keep the data on the PDW and make use of its scalability, so statements run in parallel, distributed across all of the compute nodes.
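As a rough illustration of the ELT idea (the table and column names below are made up for the example, not our actual schema), a transformation that used to run row by row through an SSIS data flow becomes a single set-based CTAS statement that the appliance distributes across its compute nodes:

```sql
-- Hypothetical example: load a fact table in one set-based statement
-- instead of pushing rows through an SSIS data flow. On PDW, CREATE
-- TABLE AS SELECT (CTAS) runs the work in parallel on every compute node.
CREATE TABLE dbo.FactMeterReading_Load
WITH (DISTRIBUTION = HASH(CustomerKey))   -- co-locate rows by customer
AS
SELECT  s.CustomerKey,
        s.MeterKey,
        d.DateKey,
        CAST(s.ReadingValue AS decimal(18, 4)) AS ReadingValue
FROM    stage.MeterReading AS s
JOIN    dbo.DimDate        AS d
        ON d.CalendarDate = s.ReadingDate;
```

With this pattern SSIS shrinks to an orchestrator: one Execute SQL Task per statement, while the heavy lifting stays on the appliance.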
Our promise to the business is that we can improve the performance of our ETL load by a factor of 4 to 10 compared to the current setup.
Based on our initial tests, that doesn't look impossible.
Furthermore, we expect that this approach will also make it easier and faster for the developers to add new functionality.
A more detailed look at how we have optimized the queries will follow in a future article.