Technology
has made it easier for businesses to organize and manipulate data to get a
clearer picture of what’s going on with their business. Notably, ETL tools have
made managing huge amounts of data significantly easier and faster, boosting many
organizations’ business intelligence operations.
There
are many third-party vendors offering ETL solutions, but two of the most
popular are PowerCenter Informatica and Microsoft SSIS (SQL Server Integration
Services). Each technology has its advantages but there are also similarities
on how they carry out the extract-transform-load processes and only differ in
terminologies.
If
you’re in the process of choosing ETL tools and PowerCenter Informatica and
Microsoft SSIS made it to your shortlist, here is a short comparative discussion
detailing the differences between the two, as well as their benefits.
Package Configuration
Most
enterprise data integration projects would require the capacity to develop a
solution in one platform and test and deploy it in a separate environment
without having to manually change the established workflow. In order to achieve
this seamless movement between two environments, your ETL technology should allow
the dynamic update of the project’s properties using the content or a parameter
file or configuration.
Both
Informatica and SSIS support this functionality using different methodologies.
In Informatica, every session can have more than one source and one or more
destination connections. There are different kinds of connections the primary
being relational connections. Every session can be set up dynamically by
changing parameters contained in a parameter file.
The
same thing can be achieved in SSIS via Configurations. Using the SSIS
Configuration Wizard, configuration data is saved in XML files. Unlike
Informatica wherein there can be multiple connections, SSIS only allows a
single defined connection that can be applied across all tasks in a package.
In
short, Informatica parameters are defined at the session level while SSIS
configurations are set at the package level.
Data Staging
When
you use SSIS, you will use the Connection Manager to generate a connection
defining the physical location of the file. Multiple files that need to be
loaded from multiple connections would require multiple connections. All
information set in the connection manager can be incorporated in the
Configuration File and can be dynamically updated during run-time.
On
the other hand, if you’re using Informatica, you will use the Workflow Manager
tool to assign a location to each file. Every task that needs to access that
file can be configured with the location and name of that specific file.
Value Extraction
One
of the main functions of ETL tools is being able to extract meaning from the
information that is currently being ran or to supplement that information with
extra information obtained from the current information in the data processing
pipeline.
Both
SSIS and Informatica have this functionality through the use of derived columns
or the capacity to draw new information from existing data. Informatica does
this via its expression transformation component while Microsoft SSIS does this
via the Derived Column transformation.
The
logic used to complete both operations is the same and the syntax involved is
also identical. The difference between the two technologies lies in the
expression language utilized to obtain the new data and the notation style
involved. Informatica uses Character, Conversion, Data, Numerical, Scientific,
Special and Test. On the other hand, SSIS uses Mathematical, String, Date/Time,
NULL, Type Casts and Operators.
Sorting
Simply
defined, sorting is having the ability to sort information into a chronological
data set. While the order of the information may appear to be immaterial for
loading into a relational data warehouse or database, it may matter for the
other tasks later on in the transformation process.
The
difference on how SSIS and Informatica carry out this functionality cannot be subtler.
Informatica’s Sorter and SSIS’ Sort can both chronologically organize data and
eliminate duplicate data. In SSIS, de-duplication can be done by setting the eliminate
duplicates option to TRUE. For Informatica, this can be done by selecting the distinct
option.
Detection of Data
Quality Issues
Similar
to all data integration solutions, ETL technologies can be susceptible to data
quality problems. Fortunately for users of Microsoft SSIS, it allows for the
creation of checkpoints within the data transformation process that can reveal
and repair data quality problems. SSIS has a feature called Fuzzy Lookup
transform that pairs incoming “dirty” information – unexpected abbreviations,
null fields, inserted or missing tokens, truncations, misspellings and other
data abnormalities – with clean records contained in a reference table. There
is also the Fuzzy Grouping tool that finds similarities among input rows and unites
duplicate data.
Unfortunately,
Informatica does not have the same functionality out of the box. If you want to
recreate this capability in Informatica, it would require human intervention to
make a mapping table that contains every referrer value that came across the
input stream and their equivalent mapping values.
However,
take note that even if you’re using SSIS, you may still need to manually
intervene to detect and repair data quality issues. Even the most advanced
algorithm may miss something so you still need to manually check for the
accuracy and integrity of your data.
Modularity
Modularity
is concerned with the manner in which the work units that make up an end-to-end
ETL solution are created and reused.
There’s
a slight difference between PowerCenter Informatica and Microsoft SSIS on how
they build modular data integration and ETL solutions.
Informatica
involves a bottom-up framework to ETL implementation by permitting a library of
components – mappings, mapplets, transformations, targets, and sources – that
can be employed across numerous worklets in the solution. A worklet is composed
of a sequence of mapping instances.
On
the contrary, SSIS uses a top-down approach in wherein a general sequence of
tasks is defined before setting the specifics on how these tasks are going to
be carried out. Reusability of ETL components is achieved by creating libraries
of packages which can then be implemented together with a master package. A
package is the counterpart of Informatica’s worklet.
Tracking Changes in
Slowly Changing Dimensions
Slowly
changing dimensions address the issue of capturing and documenting a history of
modifications or changes to entities within a database that are not reflected
in a System of Record for that particular data. A common example of slowly
changing dimensions is an item moving to another product category in a
department store. This will modify the said product’s category attribute but
its SKU will remain unchanged.
Informatica
and SSIS both have the functionality to track these changes with very similar
features. Informatica is equipped with a “slowly changing dimension wizard”
that will allow you to create the sources, transformations, ports and
destinations that are pertinent to accommodate these slowly changing
requirements. SSIS also comes with a slowly changing dimension wizard that
works similarly. Aside from the ability to keep track of slowly changing
attributes, it can also recognize changes to attributes that are not supposed
to change. These are known as fixed attributes. You have the option to enable
the wizard to raise an error warning accordingly when this happens.
Dimension Loading
In
terms of dimension loading, a surrogate key is necessary. As a substitute to a
natural key, a surrogate key is where every join between fact tables and
dimension tables are based.
Informatica
and SSIS have varying ways on how they generate surrogate keys.
Out
of the box, Informatica PowerCenter comes with a component dubbed as Sequence
transformation that has the capacity to create a surrogate key. It produces an incremental
value for every row in the pipeline which can then be incorporate into a
destination table via a surrogate key column.
SSIS
does not have a Sequence transformation component. Instead, it uses its Script
transform component to generate surrogate keys.
Fact Table Loading
Populating
fact tables usually involved two processes: (1) aggregating the data to the
needed granularity and (2) retrieval of dimensional surrogate keys.
Informatica
PowerCenter carries out these operations via a transformation dubbed as
“Aggregator.” This component cuts across groupings of values from chosen input
columns. SSIS has the same capability through the component called “Aggregate.”
The slight difference between the two is that SSIS only has the most used
functions such as Minimum and Maximum, Average, Count Distinct, Count and Sum.
Informatica has all these and other extra capabilities such as Variance,
Standard Deviation, Percentile, Median, First and Last.
Which ETL Technology
is Right for Your Business?
There
are other aspects that illustrate the difference between Informatica
PowerCenter and Microsoft SSIS such as Design Time Debugging, Collaborative
Development, Security, Integration with Packaged Applications, and Web Services
& Service Oriented Architecture. However, the things discussed above cover
the basic concepts in ETL technology.
As
businesses encounter bigger challenges to synergize data from a constantly
increasing number of different systems, your choice of an ETL solution to fit
your needs is more crucial than ever.
As
for the choice between Informatica PowerCenter and Microsoft SSIS, many
analysts consider Informatica as the leader in ETL technology while reputable
research firm Forrester firm once called SSIS’ price-to-performance ration
“downright seductive.”
However,
proclaiming a winner in this battle between two ETL technology giants would
greatly depend on your business requirements. Of course, there are pricing
differences between the two technologies and notable difference in their
features, capabilities and differences as well as their level of usability. It’s
for you to analyze which technology is the perfect fit.