Friday, April 26, 2013

Basics of Fact Table Granularity


Do you really need to learn the basics of fact table granularity? To answer that, ask yourself one question: are you embarking on a data warehousing project for your company or organization? If the answer is yes, then understanding fact table granularity is essential, because it drives some of the most important design decisions for your data warehouse and will affect how effective that warehouse is in the long run.

 

Fact Table Granularity Explained

 

The first thing to understand is what fact table granularity means. Granularity refers to the level of detail that a single fact table contains. Higher granularity means each row captures finer detail, so the table holds more rows; lower granularity means coarser detail and fewer rows. For someone just learning the basics, the quickest way to gauge how high or low the granularity of a fact table in your company’s database is, is to look at its row count: a higher number of rows usually means higher granularity. Granularity can also be described as the lowest level of depth in a set of data, known as the grain. For example, the grain for time can be year, quarter, month, day, or hour; the year is the highest, most aggregated level, and corresponds to low fact table granularity.
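
To make this concrete, here is a minimal sketch in Python with pandas, using hypothetical store, product and sales values. The same facts are stored at an hourly grain and then rolled up to a daily grain; notice how the row count shrinks as the grain gets coarser.

    # A minimal sketch (hypothetical data): the same sales facts at two grains.
    import pandas as pd

    # Hourly grain: one row per store, per product, per hour -> more rows, higher granularity.
    hourly = pd.DataFrame({
        "store":   ["S1", "S1", "S1", "S1"],
        "product": ["P1", "P1", "P1", "P1"],
        "ts":      pd.to_datetime(["2013-04-26 09:00", "2013-04-26 10:00",
                                   "2013-04-26 11:00", "2013-04-27 09:00"]),
        "sales":   [120.0, 80.0, 95.0, 110.0],
    })

    # Daily grain: roll the same facts up to one row per store, per product, per day.
    daily = (hourly
             .assign(day=hourly["ts"].dt.date)
             .groupby(["store", "product", "day"], as_index=False)["sales"]
             .sum())

    print(len(hourly), "rows at the hourly grain")   # 4
    print(len(daily), "rows at the daily grain")     # 2 -- coarser grain, fewer rows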

 

 

Granularity in Fact Table Design

 

Understanding granularity is a must if you are involved in designing an efficient, functional fact table for a business or organization. Whether you are the designer or the project supervisor, the decisions you make about granularity will shape how your fact table works. Choosing the grain is one of the most essential steps in designing a fact table, alongside identifying the business’s analysis points, measures, dimensions, and locations. Knowing how to choose granularity properly helps you capture enough data for your needs without bogging down your system with unnecessary detail.

 

Choosing the Level of Granularity

 

Deciding how deep to go in terms of grain level for your fact table is among the basics of fact table granularity. Staying at a very high, coarse level means you will have very limited data available. For example, keeping your time grain at the year, one of the shallowest levels of granularity, means only one row is added per year. That yields almost no insight and is of little help when you want to analyze or drill down into the data. Going too deep, say to the hourly level, means 24 rows a day, 365 days a year, multiplied by every other dimension the fact table tracks. That can produce a flood of data that bogs down your system as well as your analysis: there is too much to sift through, and you end up not getting full use and value from it. In many cases a medium grain is the best choice for a company or organization. A daily grain is manageable in terms of size and analysis, yet it still provides enough detail to extrapolate patterns, trends, and conclusions. Deciding on the level of granularity is a skill learned over time and by gaining an understanding of your business processes.
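
As a rough back-of-the-envelope check, the short Python sketch below (with assumed counts of 200 stores and 5,000 products) shows how quickly the row count grows as the grain gets finer.

    # Back-of-the-envelope row counts for one year of data, assuming a hypothetical
    # retailer with 200 stores and 5,000 products.
    stores, products = 200, 5_000

    rows_per_year = {
        "yearly": stores * products * 1,
        "daily":  stores * products * 365,
        "hourly": stores * products * 365 * 24,
    }

    for grain, rows in rows_per_year.items():
        print(f"{grain:>7}: {rows:,} rows")   # yearly: 1,000,000 ... hourly: 8,760,000,000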

Sunday, April 7, 2013

A Battle of ETL Tools: PowerCenter Informatica Versus Microsoft SSIS

Technology has made it easier for businesses to organize and manipulate data to get a clearer picture of what’s going on with their business. Notably, ETL tools have made managing huge amounts of data significantly easier and faster, boosting many organizations’ business intelligence operations.

There are many third-party vendors offering ETL solutions, but two of the most popular are PowerCenter Informatica and Microsoft SSIS (SQL Server Integration Services). Each technology has its advantages, but they are also similar in how they carry out the extract-transform-load process, often differing only in terminology.

If you’re in the process of choosing ETL tools and PowerCenter Informatica and Microsoft SSIS made it to your shortlist, here is a short comparative discussion detailing the differences between the two, as well as their benefits.

Package Configuration

Most enterprise data integration projects require the capacity to develop a solution in one environment and then test and deploy it in a separate environment without manually changing the established workflow. To achieve this seamless movement between environments, your ETL technology should allow the project’s properties to be updated dynamically using the contents of a parameter file or configuration.

Both Informatica and SSIS support this functionality, using different methodologies. In Informatica, every session can have more than one source connection and one or more destination connections. There are different kinds of connections, the primary kind being relational connections. Every session can be set up dynamically by changing parameters contained in a parameter file.

The same thing can be achieved in SSIS via Configurations. Using the SSIS Configuration Wizard, configuration data is saved in XML files. Unlike Informatica, wherein there can be multiple connections, SSIS allows only a single defined connection that can be applied across all tasks in a package.

In short, Informatica parameters are defined at the session level while SSIS configurations are set at the package level.
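
Neither tool’s file format is reproduced here, but the underlying idea is the same in both: the connection details live outside the code. The Python sketch below (with hypothetical file and connection names) is a language-neutral analogy for that externalized configuration, not Informatica or SSIS syntax.

    # A language-neutral analogy, not Informatica or SSIS syntax: the job reads its
    # connection details from an external file, so the same workflow moves from
    # development to production by swapping the file, not the code.
    import json
    import pathlib

    # Write a tiny sample parameter file (hypothetical names and values).
    pathlib.Path("connections.dev.json").write_text(json.dumps({
        "source": {"dsn": "oracle://dev-host/sales"},
        "target": {"dsn": "mssql://dev-host/warehouse"},
    }))

    with open("connections.dev.json") as f:
        cfg = json.load(f)

    source_dsn = cfg["source"]["dsn"]
    target_dsn = cfg["target"]["dsn"]
    print("Extracting from", source_dsn, "and loading into", target_dsn)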

Data Staging

When you use SSIS, you use the Connection Manager to create a connection that defines the physical location of a file. Multiple files that need to be loaded from multiple locations require multiple connections. All information set in the Connection Manager can be incorporated into the configuration file and updated dynamically at run time.

On the other hand, if you’re using Informatica, you will use the Workflow Manager tool to assign a location to each file. Every task that needs to access that file can be configured with the location and name of that specific file.

Value Extraction

One of the main functions of ETL tools is extracting meaning from the information currently being processed, or supplementing it with additional values derived from the data already in the processing pipeline.

Both SSIS and Informatica have this functionality through the use of derived columns or the capacity to draw new information from existing data. Informatica does this via its expression transformation component while Microsoft SSIS does this via the Derived Column transformation.

The logic used to complete both operations is the same, and the syntax involved is nearly identical. The difference between the two technologies lies in the expression language used to obtain the new data and in the notation style. Informatica groups its functions into categories such as Character, Conversion, Data, Numerical, Scientific, Special and Test, while SSIS uses Mathematical, String, Date/Time, NULL, Type Casts and Operators.
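
As an analogy rather than either tool’s expression syntax, the Python/pandas sketch below derives two new columns, one string and one numeric, from hypothetical order data, which is essentially what a derived-column or Expression transformation does.

    # A minimal sketch of derived columns: new fields computed from existing ones
    # during transformation (hypothetical order data).
    import pandas as pd

    orders = pd.DataFrame({
        "first_name": ["Ada", "Grace"],
        "last_name":  ["Lovelace", "Hopper"],
        "unit_price": [19.99, 5.00],
        "quantity":   [3, 10],
    })

    orders["full_name"]  = orders["first_name"] + " " + orders["last_name"]   # string expression
    orders["line_total"] = orders["unit_price"] * orders["quantity"]          # numeric expression
    print(orders[["full_name", "line_total"]])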

Sorting

Simply defined, sorting is the ability to arrange information into an ordered, for example chronological, data set. While the order of the information may appear immaterial when loading into a relational data warehouse or database, it can matter for other tasks later in the transformation process.

The difference in how SSIS and Informatica carry out this functionality could hardly be subtler. Informatica’s Sorter and SSIS’ Sort can both order data and eliminate duplicates. In SSIS, de-duplication is done by setting the eliminate-duplicates option to TRUE; in Informatica, by selecting the distinct option.
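
Outside either tool, the same sort-and-de-duplicate step looks like the minimal Python/pandas sketch below, run over hypothetical event data.

    # A minimal sketch of sorting plus de-duplication (hypothetical event data).
    import pandas as pd

    events = pd.DataFrame({
        "customer_id": [42, 42, 7, 42],
        "event_date":  pd.to_datetime(["2013-04-02", "2013-04-01",
                                       "2013-04-03", "2013-04-01"]),
    })

    deduped = (events
               .sort_values(["customer_id", "event_date"])   # ordered output
               .drop_duplicates())                           # remove exact duplicate rows
    print(deduped)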


Detection of Data Quality Issues

Similar to all data integration solutions, ETL technologies can be susceptible to data quality problems. Fortunately for users of Microsoft SSIS, it allows for the creation of checkpoints within the data transformation process that can reveal and repair data quality problems. SSIS has a feature called Fuzzy Lookup transform that pairs incoming “dirty” information – unexpected abbreviations, null fields, inserted or missing tokens, truncations, misspellings and other data abnormalities – with clean records contained in a reference table. There is also the Fuzzy Grouping tool that finds similarities among input rows and unites duplicate data.
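
The sketch below is only a rough analogy to that idea, not the SSIS algorithm: it uses Python’s standard-library difflib to match “dirty” incoming names against a clean reference list by similarity score. The company names are made up.

    # A rough analogy to fuzzy matching: score each dirty value against a clean
    # reference list and keep the closest match (hypothetical names).
    import difflib

    reference = ["International Business Machines", "Microsoft Corporation", "Informatica LLC"]
    incoming  = ["Intl Business Machines", "Microsfot Corp.", "Informatica"]

    def similarity(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    for dirty in incoming:
        best = max(reference, key=lambda clean: similarity(dirty, clean))
        print(f"{dirty!r} -> {best!r} (similarity {similarity(dirty, best):.2f})")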

Unfortunately, Informatica does not have the same functionality out of the box. Recreating this capability in Informatica requires human intervention to build a mapping table containing every value encountered in the input stream and its equivalent mapped value.


However, take note that even if you’re using SSIS, you may still need to intervene manually to detect and repair data quality issues. Even the most advanced algorithm may miss something, so you still need to check the accuracy and integrity of your data yourself.

Modularity

Modularity is concerned with the manner in which the work units that make up an end-to-end ETL solution are created and reused.

There’s a slight difference between PowerCenter Informatica and Microsoft SSIS in how they build modular data integration and ETL solutions.

Informatica takes a bottom-up approach to ETL implementation, providing a library of components – mappings, mapplets, transformations, targets, and sources – that can be employed across numerous worklets in the solution. A worklet is composed of a sequence of mapping instances.

By contrast, SSIS uses a top-down approach wherein a general sequence of tasks is defined before the specifics of how those tasks will be carried out are set. Reusability of ETL components is achieved by creating libraries of packages, which can then be executed together from a master package. A package is the counterpart of Informatica’s worklet.

Tracking Changes in Slowly Changing Dimensions

Slowly changing dimensions address the problem of capturing and documenting a history of modifications to entities within a database that are not reflected in a System of Record for that particular data. A common example is an item moving to another product category in a department store: the product’s category attribute changes, but its SKU remains the same.

Informatica and SSIS both have the functionality to track these changes, with very similar features. Informatica is equipped with a slowly changing dimension wizard that lets you create the sources, transformations, ports and destinations needed to accommodate these slowly changing requirements. SSIS also comes with a slowly changing dimension wizard that works similarly. Aside from tracking slowly changing attributes, it can also recognize changes to attributes that are not supposed to change, known as fixed attributes, and you can configure the wizard to raise an error when that happens.
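
To show what tracking such a change can look like in practice, here is a minimal Python/pandas sketch of a Type 2-style update over a hypothetical product dimension: when the category changes, the current row is closed out with an end date and a new current row is added. The column names and dates are assumptions for illustration.

    # A minimal Type 2-style sketch (hypothetical dimension table).
    import pandas as pd

    dim_product = pd.DataFrame({
        "product_key": [1],
        "sku":         ["SKU-1001"],
        "category":    ["Electronics"],
        "valid_from":  [pd.Timestamp("2012-01-01")],
        "valid_to":    [pd.NaT],          # NaT marks the current version
    })

    incoming = {"sku": "SKU-1001", "category": "Home Appliances"}
    today = pd.Timestamp("2013-04-07")

    current = dim_product[(dim_product["sku"] == incoming["sku"]) & dim_product["valid_to"].isna()]
    if not current.empty and current.iloc[0]["category"] != incoming["category"]:
        dim_product.loc[current.index, "valid_to"] = today            # expire the old version
        new_row = {"product_key": dim_product["product_key"].max() + 1,
                   "sku": incoming["sku"], "category": incoming["category"],
                   "valid_from": today, "valid_to": pd.NaT}
        dim_product = pd.concat([dim_product, pd.DataFrame([new_row])], ignore_index=True)

    print(dim_product)   # two rows: the expired Electronics version and the new current one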

Dimension Loading

Dimension loading requires a surrogate key. A surrogate key is a substitute for the natural key, and it is the basis for every join between fact tables and dimension tables.

Informatica and SSIS differ in how they generate surrogate keys.

Out of the box, Informatica PowerCenter comes with a component called the Sequence transformation that can create a surrogate key. It produces an incremental value for every row in the pipeline, which can then be incorporated into a destination table via a surrogate key column.

SSIS does not have a Sequence transformation component. Instead, it uses its Script transform component to generate surrogate keys.
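
Stripped of either tool, surrogate key generation comes down to handing each new natural key the next value from a sequence, as in the minimal Python sketch below (the customer codes are hypothetical).

    # A minimal sketch of surrogate key assignment: each distinct natural key gets
    # the next integer from a sequence, independent of the business data itself.
    import itertools

    key_sequence = itertools.count(start=1)
    surrogate_keys = {}                      # natural key -> surrogate key

    for natural_key in ["CUST-ALPHA", "CUST-BETA", "CUST-ALPHA"]:
        if natural_key not in surrogate_keys:
            surrogate_keys[natural_key] = next(key_sequence)
        print(natural_key, "->", surrogate_keys[natural_key])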

Fact Table Loading

Populating fact tables usually involves two processes: (1) aggregating the data to the needed granularity and (2) retrieving the dimensional surrogate keys.

Informatica PowerCenter carries out these operations via a transformation called the Aggregator, which aggregates values across groupings defined by chosen input columns. SSIS offers the same capability through its Aggregate component. The slight difference is that SSIS provides only the most commonly used functions, such as Minimum, Maximum, Average, Count Distinct, Count and Sum, while Informatica has all of these plus extras such as Variance, Standard Deviation, Percentile, Median, First and Last.
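
Put together, the two steps look like the minimal Python/pandas sketch below, run over hypothetical transactions: aggregate to the daily grain, then join to the dimension table to pick up the surrogate key.

    # A minimal sketch of fact table loading: aggregate, then look up surrogate keys.
    import pandas as pd

    transactions = pd.DataFrame({
        "sku":    ["SKU-1001", "SKU-1001", "SKU-2002"],
        "day":    ["2013-04-01", "2013-04-01", "2013-04-01"],
        "amount": [10.0, 15.0, 7.5],
    })
    dim_product = pd.DataFrame({"sku": ["SKU-1001", "SKU-2002"], "product_key": [1, 2]})

    fact_sales = (transactions
                  .groupby(["sku", "day"], as_index=False)["amount"].sum()   # aggregate to the grain
                  .merge(dim_product, on="sku")                              # surrogate key lookup
                  [["product_key", "day", "amount"]])
    print(fact_sales)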

Which ETL Technology is Right for Your Business?

There are other aspects that illustrate the difference between Informatica PowerCenter and Microsoft SSIS such as Design Time Debugging, Collaborative Development, Security, Integration with Packaged Applications, and Web Services & Service Oriented Architecture. However, the things discussed above cover the basic concepts in ETL technology.

As businesses face bigger challenges integrating data from a constantly increasing number of different systems, choosing an ETL solution that fits your needs is more crucial than ever.

As for the choice between Informatica PowerCenter and Microsoft SSIS, many analysts consider Informatica the leader in ETL technology, while the research firm Forrester once called SSIS’ price-to-performance ratio “downright seductive.”

However, proclaiming a winner in this battle between two ETL technology giants depends greatly on your business requirements. There are pricing differences between the two technologies, as well as notable differences in their features, capabilities and usability. It’s up to you to analyze which technology is the right fit.

Thursday, April 4, 2013

ETL and Its Role in the Success of Your Business Intelligence Projects


 

 

Good business managers don’t make internal and external business decisions out of thin air. More often than not, these business strategies and tactics are based on empirical and numerical data. This is where business intelligence or BI comes in.

 

Simply defined, business intelligence is a group of technologies, architectures, processes, methodologies and theories aimed at culling, interpreting and converting business data into relevant and useful information. This information is used for an array of business purposes, such as identifying new opportunities, crafting new business policies and creating strategies for business sustainability.

 

The Basics of ETL

 

One of the tools vital to the success of your business intelligence projects is ETL, short for extract-transform-load. Basically, ETL software allows businesses to consolidate disparate data and transfer it from one source to another application, or from multiple sources into a single data warehouse for interpretation and analysis.

 

As its name suggests, ETL software involves a three-step process:

 

1.    Extract. The software pinpoints the relevant data sets based on a query and extracts the information from different internal or external, and structured or unstructured sources.

 

2.    Transform. This step involves preparing and modifying the extracted data for loading into the designated data warehouse or warehouses. The level of transformation needed varies: some data requires little to no transformation, while other data has to be cleaned, restructured and organized. The most common types of data transformation include:

 

·         Standardization

·         Recording

·         Pivoting

·         Aggregation

·         Cleaning

·         Filtering

·         Sorting

·         Re-mapping

 

3.    Loading. Many experts agree that this is the least complex step in the ETL process, provided that the extraction and transformation of the data were completed without a glitch. However, depending on the size of your datasets, a partitioning strategy may need to be implemented. Depending on the loading options chosen, data loading may overwrite current data, append the new data to the existing data, or merge the information. The toy sketch after this list walks through all three steps end to end.
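
Here is that toy sketch in Python, under assumed inputs: an in-memory CSV string stands in for the source system and an in-memory SQLite database stands in for the warehouse. Real ETL tools add scheduling, logging, error handling and much more.

    # A toy end-to-end ETL sketch (hypothetical data and targets).
    import csv
    import io
    import sqlite3

    raw_csv = "customer,region,amount\n alice ,EU,10\nBOB,us,20\n"            # extract: read the source

    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):                          # transform: clean and standardize
        rows.append((rec["customer"].strip().title(), rec["region"].upper(), float(rec["amount"])))

    con = sqlite3.connect(":memory:")                                         # load: write to the warehouse
    con.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    print(con.execute("SELECT * FROM sales").fetchall())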

 

Key Benefits of ETL in Business Intelligence Projects

 

The success of your business intelligence project is highly dependent on the quality and integrity of data you have and the efficiency of how you collect, manipulate and consolidate your business information. This is where the benefits of ETL tools primarily lie.

 

ETL tools are especially advantageous when you need to integrate data from different source systems that are usually in different formats. Since you can program an automated processing schedule, ETL is also beneficial when you need to process data repeatedly, whether in real time or on an hourly or daily basis.

 

Below are the more specific benefits of using ETL tools in your business intelligence projects:

 

1.    Collect and integrate data from different locations. If you’re a company with operations spanning different geographical locations, ETL can be used to consolidate data into a central data warehouse for easy aggregation. One of the best examples to demonstrate this benefit is Motorola. The company used ETL to mine data from 30 geographically disparate systems and load the information into a single supply chain management data warehouse. This enabled the technology company to see its total procurement expenses without difficulty. Imagine if this data in a multitude of different formats had to be aggregated manually. It could have easily taken hundreds of man hours to complete.

 

2.    Sharing of business information across different business functions. Another basic role that ETL tools play in your business intelligence projects is sharing information across your different operational units. For example, your CRM system defines your customers using specific parameters. What if your accounting department needs this data to improve how it handles billing? Chances are, different units in your business will require different data formats. With ETL, extracting this information, converting it to a new format and loading it into the new data warehouse can be done seamlessly and with minimal room for error.

 

3.    Ease of migrating, cleansing and warehousing bulk data for more comprehensive business intelligence projects. While it is common practice in business intelligence to work with data samples drawn from bigger populations, logic still dictates that the more information you have, the more informed and educated your business decisions are. ETL’s capability to extract, cleanse and write data with ease allows you to use as much data as you can.

 

Information is a prerequisite to intelligence. It follows that the amount, quality and integrity of the business information you gather is directly proportional to the success of your business intelligence projects. Few tools can rival the efficacy of ETL software when it comes to data processing for business intelligence. Review your business intelligence needs and identify the gaps that ETL tools can bridge.