With the features of Azure Data Factory V2 becoming generally available in the past few months, especially the Integration Services Runtime, the question persists in our practice about which data integration tool is the best fit for a given team and project. Both Azure Data Factory and SQL Server Integration Services are built to move data between disparate sources, and they do have some overlapping capabilities. If you are new to either of these tools, here are some questions to help you begin understanding the differences to better make a selection.
Does your organization use Azure today, or is Azure an option for your project?
SQL Server Integration Services (SSIS) is a tool that has been around for many years. It started as an on-premises service, but that doesn’t mean it’s impossible to run in Azure in either a fully-cloud environment or a hybrid-cloud environment where your organization’s network is extended to Azure.
On the other hand, Azure Data Factory is a cloud service only, and while it absolutely can integrate into on-premises services and data, it does require a cloud service to be running in Azure. So, the first question to address is whether Azure is in use today or could be deployed to help create your solution.
|Hybrid On-Prem & Azure Solution||Pure Azure Solution||On-Prem Only Solution|
|Azure Data Factory (ADF V2)||Yes||Yes||No|
|Integration Services (SSIS)||Yes||Yes||Yes|
Can either or both services connect to the data sources and destinations you need?
Both ADF and SSIS have the capability to connect to a huge variety of data sources and write to different destinations, but in order to move forward it does make sense to verify the specific needs you have at the endpoints of your data integration. If your environment includes anything atypical besides common database engines, files and services, it pays to validate that you can consume and then write what you need to.
Azure Data Factory uses the concept of a source and a sink to read and write data. Microsoft’s Data Factory Documentation covers all ADF’s possible sources and destinations; check out Copy Activity in Azure Data Factory for an overview.
ADF supports a huge variety of both cloud and on-prem services and databases. If you are integrating it into data sources or destinations which reside within your corporate network, you will need a self-hosted Integration Runtime (Gateway) that handles connecting to your internal resources and transmitting the data up to Azure.
SSIS uses the connection manager concept to pull the data from a source and put it into a common/consistent format to work on internally in a data flow. There are standard connections built into SSIS, but also third-party connection managers that you can purchase and plug in for a variety of services.
Will you write ETL or ELT?
The next consideration is a bit more involved if you are new to data integration. Both of these tools excel at transporting data from place to place, but they have important differences in terms of what you can do to modify the data in transit. As a matter of emphasis, ADF has more features geared toward moving the data than performing any complex transformation along the way. SSIS, on the other hand, was built with a large library of transformations that you can chain together to make elaborate data flows including lookups, matching, splitting data, and more.
The tools also overlap quite a lot. In projects this seems to lead to the question of whether you’ll transform the data “in flight” using Extract Transform Load (ETL), or instead move the data to a destination where it’ll be transformed using Extract Load Transform (ELT).
Imagine loading a data warehouse from two or three business application databases. The two patterns here, at a high level, would be to pull the data out of the source and stage it into the data warehouse, essentially as a copy, and then call some stored procedure code to transform it into the warehouse proper, or to pull the data from the sources and use the data flow transformation within the integration tool itself to transform the data. There are pros and cons to each pattern, but deciding criteria includes:
- Does the destination for the data have the capability, like a database, to transform the data after it lands there? If not – say you are loading into a cloud service and not a database – then the decision is sort of forced to transform the data before it lands.
- Does the integration tool have the capability to transform the data? If not, then it may not be practical to try to use it to transform the data in flight.
- Finally, under those constraints, where would you prefer to do the transformation?
Put another way, the place you need to transform the data is key to this. Consider whether it’s possible to transform the data within your destination, as in a warehouse database, and/or whether it’s possible to transform it in transit. The table below represents current generally available capabilities as of September 2018:
(transform in flight)
(load first, call functions in the destination to transform the data)
What scalability challenges are you facing?
The next factor to consider is how much data you are dealing with. SSIS can be made to scale and can offer very good performance, but being mostly a purchased/on-prem tool it also may require that you tangle with servers and configuration to make it fast. ADF has the typical advantages of cloud services: it’s something you subscribe to and you don’t have to carry so much of the burden of infrastructure. To get some idea of possible throughput for ADF see this matrix. If you have a hybrid on-premises/cloud design, remember that the link to the Azure data center and that Integration Runtime (Gateway) that transmits the data up and down can also be a bottleneck.
There are gray areas here, but many organizations find it’s very helpful not to have to work with underlying hardware, virtual machines and servers. As we talked about above, SSIS, recently, can also run within ADF V2, which we’ll turn to in a minute, but if you are starting a new solution that may muddy the waters a bit.
Should you combine these two technologies?
ADF V2 now contains an Integration Services Runtime that enables the cloud service to execute SSIS packages. The idea here is that one can create a hybrid solution with both technologies, which is an exciting possibility but begs some questions. Here are some things to think through if considering combining these two technologies:
- Do you have a large SSIS deployment already, that would benefit from moving it to the cloud without rewriting?
- What’s the volume of data you will be moving? Is it thousands of rows, hundreds of thousands, millions, billions? ADF, again as a matter of emphasis, is mostly a big-data tool, and so it can move large quantities of rows quickly, but with some performance dedicated to overhead operations, start up, and so on. SSIS seems to be somewhat lighter weight, so it may be more appropriate if you have a smaller workload – but conversely SSIS could be slower for huge data volumes. One analogy I like is to think of ADF as a freight train compared to SSIS as a truck. Both capable, but in the case of the train it might take a bit of time to get set up and moving. Once moving, though, the train can move a lot of freight very quickly.
Since the two can now be combined, you can do things like call SSIS packages from ADF, if SSIS is a better fit for your workload.
- What’s your larger orchestration strategy for scheduling work? ADF V2 has a good scheduler with a lot of options, including the ability to chain tasks together with dependencies; will it fit into the larger strategy in your solution for calling automated ETL tasks? SSIS solutions have traditionally used SQL Agent as the scheduling engine and something like a “Master Package” to orchestrate running the ETL process with the appropriate dependencies, and that certainly still works. In this new world it’s possible to mix the two; for example using ADF to schedule the work, but call SSIS packages in the cloud, using the Integration Services Runtime.
- If you are planning to use SSIS in the cloud, how will you host it? On a virtual machine without using ADF, or within the new ADF environment?
- Is your team ready to tackle the two technologies together? It’s not terrible, but there is some learning curve if you are taking on both at the same time, or moving from one to the other.
I hope this will help you to understand some of the important differences between ADF and SSIS for your data integration work, and to make an informed decision so you’re happy with whichever platform you end on. If you are looking to address your data and AI challenges, contact us today! Our experts are ready to assist and determine how we can help you streamline your business.
Check out part 2 of this blog post here!