Saturday, May 18, 2013

Four Data Profiling Best Practices for Data Warehousing Beginners

Data profiling best practices is the first thing you need if you are planning to create a data warehouse for your company or organization. Data profiling, sometimes called “DP” in ETL professional circles, means a way of analyzing and processing data from  raw data sets to collect statistics and decipher trends and patterns in them. These patterns or statistics can be used to:

 

  1. Assess whether the information can be used to meet the needs of the organization.
  2. Assess the  data quality
  3. Check the potential for using the data in a new or different application.
  4. Assess the risks that any kind of data processing, storage, or management might have on the data.
  5. Have an overall view of managing and governing data for the warehouse.

 

If you stay in the ETL business, you will find that data profiling often becomes the backbone of your work in planning, designing, implementing, and managing a data warehouse. In order to guide you in your projects, you can follow these top four data profiling best practices, all very helpful tips in this line of work. They can be applied to your circumstances, too.

 

Data Profiling Best Practice 1: Do it before data warehousing

 

Do data profiling before data warehousing. This is probably the first and foremost data profiling best practice that you should know. When you find that your company or organization needs to set up a data warehouse, then your first order of business is to profile the information that you now have, before embarking on any of the extract-transform-load processes. This first step allows you to look long and hard objectively into the information that you have and to see what its quality is. This first data profiling step provides metadata that you will usually need when designing the rest of your data warehouse. This data profiling best practice will ensure that your ETL best practices are as efficient as they can be. In most cases, a data profiling done before the actual data warehousing project, will even tell you if the project is worth pursuing—saving you thousands of dollars in manpower hours.

 

Data Profiling Best Practice 2: Know the full array of tools

 

It is necessary for you to know the full array of tools that you can use to create your data profiling. Best practice is to get an idea of what programs or platforms are in the market then compare their functionality, price points, and usage so that you will have an idea of which tool is best suited for your data profiling purposes. Some tools available in the market are DataFlux, Trillium, and Informatica and even the basic data profiling tools that come with SQL Server. Currently, some providers have also released automated profiling tools. Most ETL experts agree that the last option is already suitable for basic, light data profiling work. For the more advanced kinds of projects, however, a tool dedicated or mainly designed for data profiling work is best. Aside from the kind of work, settling on a data profiling tool should, of course, be based on the level of standardization that you want. What you should do as a data profiling best practice is to get on to the Internet and faithfully do your research about each of these tools and read user/programmer feedback. Remember that choosing a data profiling tool really is a commitment to a kind of standard and style for data profiling.

 

Data Profiling Best Practice 3: Combine data profiling with cross-checking

 

If you really want to get the most out of your data profiles, you need to remember that the information you’ll be working with does not only come from inside, but must also be validated by outside sources. That means that you have to cross-check the facts that you have from data profiling with the feedback and other information that you are getting from other sources, which may not be so easy to tabulate and input in a data warehouse. These outside ways of checking data profiles may come from customer feedback, simple questionnaires for your stakeholders, trends relating or corresponding to the items in your data profile, or maybe a general survey of what your industry benchmarks are. This data profiling best practice can help you check the results of your inside work, as well as possibly point out errors or open up new ways of looking at your available data—all for the better understanding of information about your company.

 

Data Profiling Best Practice 4: Review segmentation

 

As a last data profiling best practice, you must continue running this process even after you have designed your data warehouse. Profiling data regularly can help you look for new ways of interpreting the data that you have, which usually brings a fresh insight into how the company or organization is doing, revealing changing trends and sounding off on any need to innovate or adapt so that the company will perform better. More than just regular, periodic data profiling, however, you also have to regularly review the way you segment the information for your data profiling work. This means that you have to cut through your data in new ways so that you get fresh insights into the same information. This will help you broaden or focus your data segmentation so that the metadata will give you either a widened or a specified look into your market and your business transactions. Sometimes, adjusting your data profile segmentation can mean just doing the data profile on a relevant part of information within your data warehouse. This is done so that you can be more efficient at studying the different related information or information chains. Data profiling best practices will make the data identification process faster and will yield more information about the more important aspects of your matrix of data.