Our journey with the PDW started a year and a half ago when we got a Microsoft PDW for a total remake of our data warehouse platform.
In this post I will go through some of our high-level learnings. In future posts I will go into more detail.
So we had two different projects that we needed to get onto the PDW:
- Porting an existing financial data warehouse solution.
- A total remake of an operational data warehouse used for operational data analysis and customer reporting.
My primary experience beforehand was designing and developing data warehouses on the Microsoft SQL Server platform using SSIS and SSAS. And it has worked great for a large number of projects.
In a future article I will describe our experience with building a data warehouse from scratch on the PDW platform and what we have learned over the course of this work.
Experiences from porting the financial data warehouse
Together with our BI supplier we set out to do a one-to-one port of the data warehouse, reusing the existing SSIS packages with lots of logic and lookups implemented in data flow tasks, i.e. the usual way to do it 😉 The only change we were looking at was switching our cubes from MOLAP mode to ROLAP in order to avoid the processing time of the cubes.
Needless to say, we failed miserably with our first go. None of the developers had worked on a PDW before, we were implementing on PDW version AU0.5, which was quite unstable, we were doing it across the summer holiday period with a number of changes in the development team due to vacations, and we had a tight deadline.
So the end result was the project being way overdue and not delivering on our expectations.
What we learned from porting the financial data warehouse:
- Don’t treat the PDW as a regular SQL Server on steroids (which was the MS marketing message at the time). It is NOT. We have a quarter-rack PDW, so in fact it is 16 small servers that need to work together in order to deliver the result. This is a really important fact to understand. They need to work together, so you need to design for it.
- A lot of the things that happen automatically on a regular SQL Server have to be programmed/handled by the developers. Especially maintaining statistics and the distribution of data.
- Make sure that you maintain statistics on the PDW. This is perhaps the easiest fix for bad performance. If you don’t create statistics, the PDW will optimize the queries for a load of 1000 rows. And that will most probably be wrong 🙂
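As a minimal sketch of what that maintenance looks like (table and column names are made up for illustration):

```sql
-- Create statistics on the columns used in joins and filters
-- (hypothetical table and column names)
CREATE STATISTICS stat_FactSales_DateKey
    ON dbo.FactSales (DateKey);

-- Refresh after each large load, so the optimizer sees realistic
-- row counts instead of the 1000-row default assumption
UPDATE STATISTICS dbo.FactSales;
```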
- Keep the data on the PDW. With SSIS the setup is to load the data out of the PDW onto your SSIS server and then write it back again – and writing with SSIS is not as fast as internal operations on the PDW. We saw a factor 5 performance improvement on our fact load when changing from SSIS to SQL. The SSIS data flow had more than 50 lookups that needed to be changed to left outer joins. On simpler data flows the improvement would have been even greater.
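As a sketch, an SSIS lookup component that resolves a surrogate key can be replaced by a set-based left outer join that runs entirely inside the appliance (all names here are illustrative, not from our actual solution):

```sql
-- Replace an SSIS lookup with a left outer join,
-- falling back to an 'unknown member' key (-1) when no match is found
INSERT INTO dbo.FactSales (CustomerKey, OrderDateKey, Amount)
SELECT  ISNULL(c.CustomerKey, -1),
        s.OrderDateKey,
        s.Amount
FROM    stage.Sales AS s
LEFT OUTER JOIN dbo.DimCustomer AS c
        ON c.CustomerBusinessKey = s.CustomerId;
```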
- Use dwloader if possible, as it is faster than the SSIS PDW destination adapter. Moving data out of the databases into a file and then loading that file into the PDW feels totally wrong to me. But if you design for maximum performance, then this is what you need to look at. Try to come to terms with it – I’m not there 100% yet.
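A dwloader invocation looks roughly like this (the exact switches are from memory, and the server name, credentials, paths and table name are made up – check the appliance documentation before relying on them):

```shell
:: Load a delimited flat file into a PDW table in append mode
dwloader.exe -S pdw-ctl01 -U loader -P ******** ^
    -i \\fileshare\extracts\FactSales.txt ^
    -T MyDW.dbo.FactSales ^
    -M append -t "|"
```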
- Keep an eye on the distributions. Tables can be either replicated or distributed. Generally, dimensions should be replicated and facts distributed. You also need to set a distribution key that will be used for distributing the data. Here you need to both avoid skew in the data and align the table with what it is going to be joined with later on. One important thing to note here is that all tables distributed on a column with the same name and data type will be placed on the same disks, no matter what database they belong to.
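For example, with a hypothetical schema the two table styles are declared like this:

```sql
-- Dimensions are small: replicate a full copy to every node
CREATE TABLE dbo.DimCustomer
(
    CustomerKey         int            NOT NULL,
    CustomerBusinessKey varchar(20)    NOT NULL,
    CustomerName        nvarchar(100)  NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Facts are large: hash-distribute on a key that both spreads the
-- rows evenly and matches the joins the table will be used in
CREATE TABLE dbo.FactSales
(
    CustomerKey  int   NOT NULL,
    OrderDateKey int   NOT NULL,
    Amount       money NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));
```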
- You can’t have computed columns on the PDW when loading with the SSIS PDW destination adapter. These columns need to come in through the SSIS data flow.
- And you need to change all destination adapters to the PDW destination adapter when you migrate an SSIS package (data flow task-driven) to the PDW.
- Migrating takes longer than you think, as there is a lot to learn as you progress. For us it took 3-4 times the estimated time.
- Divide and conquer should be your driving design pattern. Do small things on large amounts of data instead of doing a lot on a small amount of data. It is often faster to rewrite a 10 million row table than it is to update 1000 rows.
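The rewrite-instead-of-update pattern typically uses CTAS plus a rename; a sketch with made-up names and a made-up transformation:

```sql
-- Rewrite the whole table with the change applied, instead of UPDATEing in place
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(CustomerKey))
AS
SELECT  CustomerKey,
        OrderDateKey,
        CASE WHEN Status = 'open' THEN 'closed' ELSE Status END AS Status,
        Amount
FROM    dbo.FactSales;

-- Swap the new table in for the old one
RENAME OBJECT dbo.FactSales     TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;
```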
I know that some of these experiences sound counterintuitive to most data warehouse developers and architects – they were counterintuitive to me.