Two of the more popular methods of uploading data to an Azure SQL Database are Azure Data Factory and SQL Server Integration Services (SSIS). SSIS is the old guard, the comfortable tool that DBAs and ETL experts know and love. In contrast, Azure Data Factory is the new contender, young, fast, agile. Or is it? There isn’t a wealth of knowledge out there comparing the two in terms of pure performance. Here, we’re going to attempt to put some real numbers out there.
Your mileage may vary, but the numbers should give you some idea of how well the two platforms move data. With that caveat out of the way, here's a summary of the testing procedure.
The local system ran SQL Server 2016 with Azure Data Gateway active during all tests. SSIS ran on the same machine as the database service. The machine never exceeded 50% utilization during the tests, so it was likely not a bottleneck. The network's advertised upload speed was 10 megabits per second; Speedtest.net reported 12 megabits to my internet service provider.
The dataset is a 3.7-million-row NY Taxi Sample database (Download Link); the CSV is a little over 400MB. It was imported to a local SQL table. No transformations were performed in either tool. The SSIS package used ADO.NET source and destination connections, and the Azure Data Factory pipeline likewise copied directly from source to destination.
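For reference, loading a CSV like this into a local SQL Server table can be done with a plain `BULK INSERT`. This is only a sketch: the table name and file path below are hypothetical, and the staging table's columns would need to match the taxi file's layout.

```sql
-- Hypothetical staging table and file path; adjust both to your environment.
BULK INSERT dbo.NYTaxiTrips
FROM 'C:\data\nyc_taxi_sample.csv'
WITH (
    FIRSTROW = 2,           -- skip the CSV header row
    FIELDTERMINATOR = ',',  -- column delimiter
    ROWTERMINATOR = '\n',   -- one record per line
    TABLOCK                 -- table lock enables a faster, minimally logged load
);
```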
Between tests, the Azure SQL Database table was truncated. Each configuration was tested once.
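Resetting the target between runs is a one-liner. Assuming a hypothetical target table name, the reset and a quick sanity check look like:

```sql
-- Hypothetical target table name in the Azure SQL Database.
TRUNCATE TABLE dbo.NYTaxiTrips;

-- Verify the table is empty before the next run.
SELECT COUNT(*) AS remaining_rows FROM dbo.NYTaxiTrips;
```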
I recorded results at a variety of pricing tiers for the Azure SQL Database to test relative performance between SSIS and Azure Data Factory.
My first run transferred the data with Azure Data Factory against the Azure SQL Database at the Basic tier (5 Database Transaction Units, or DTUs). It took almost an hour: 59 minutes, 34 seconds. Watching the network graph, it was obvious that the Azure database was the bottleneck, as the network showed spikes only about once every 5 seconds. I did not bother running SSIS at this tier.
Next came a head-to-head comparison at the S2 tier of Azure SQL Database. That gives you 50 DTUs, theoretically about 10 times the capacity of the Basic tier. Instead, I got about 4 times the throughput: Azure Data Factory took 15 minutes, 43 seconds, and SSIS took 14 minutes, 53 seconds. Surprisingly, SSIS was a little faster, finishing roughly a minute ahead.
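The "about 4 times" figure falls straight out of the raw timings: 59:34 on Basic versus 15:43 on S2 for ADF. As a quick check:

```sql
-- Basic tier: 59 min 34 s = 3,574 s; S2 tier: 15 min 43 s = 943 s.
SELECT (59 * 60 + 34.0) / (15 * 60 + 43) AS s2_speedup_vs_basic;  -- ~3.79x, not 10x
```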
I then increased the Azure SQL Database to the S3 tier with 100 DTUs. Both SSIS and ADF showed minor improvements, each lopping about 3 minutes off its transfer time. ADF clocked in at 12 minutes, 42 seconds, while SSIS came in at 11 minutes, 55 seconds. Again, SSIS pulled ahead slightly, with about a minute's advantage.
Then I upped the Azure SQL Database to the P4 tier with 500 DTUs. SSIS roughly halved its time, to 6 minutes, 20 seconds, and ADF closed some of the gap with a copy time of 6 minutes, 39 seconds. Peak throughput hovered around 11 megabits, with less frequent dips. Since neither source nor destination was hitting peak utilization, the transfer finally appeared to be limited by network bandwidth.
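A rough sanity check supports the bandwidth explanation. The source CSV was a little over 400MB, and moving roughly that much data in 6 minutes, 39 seconds works out to about 8 megabits per second, close to the advertised 10-megabit uplink. (The on-the-wire size of a SQL-to-SQL copy will differ somewhat from the CSV size, so treat this as an estimate.)

```sql
-- ~400 MB * 8 bits/byte = ~3,200 megabits; the P4 ADF copy took 6 min 39 s = 399 s.
SELECT 400 * 8.0 / (6 * 60 + 39) AS approx_megabits_per_second;  -- ~8.0 Mbit/s
```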
Finally, I tried the P6 tier with 1,000 DTUs. While ADF does finally beat SSIS in this scenario, we're clearly hitting the limits of the network connection. SSIS dropped by four seconds, to 6 minutes, 16 seconds, and ADF dropped to 6 minutes, 3 seconds. At this price point, the P6 tier showed less than 30% DTU utilization.
Here’s a table to summarize the results:
| Tier  | DTUs | ADF Time (m:ss) | SSIS Time (m:ss) |
|-------|------|-----------------|------------------|
| Basic | 5    | 59:34           | not run          |
| S2    | 50   | 15:43           | 14:53            |
| S3    | 100  | 12:42           | 11:55            |
| P4    | 500  | 6:39            | 6:20             |
| P6    | 1000 | 6:03            | 6:16             |
There are two things being compared here. One is between the two transport services, ADF and SSIS. The other is throughput as Azure SQL DB is scaled.
It’s clear that both Azure Data Factory and SSIS benefit when the Azure SQL Database is scaled up to an appropriate level. The obvious recommendation, then, is that during loading, one ought to scale up the Azure SQL Database to something like 25-50 DTUs per megabit of network bandwidth. Conveniently, you can scale it back down when you’re finished loading.
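Scaling the database for a load window is a single T-SQL statement per direction. Assuming a hypothetical database named `TaxiDb` (and noting that the scale operation is asynchronous, so it may take a few minutes to complete), something like:

```sql
-- Scale up before the load (e.g., to P4 for a roughly 10-megabit link)...
ALTER DATABASE TaxiDb MODIFY (SERVICE_OBJECTIVE = 'P4');

-- ...and back down to a cheaper tier once loading finishes.
ALTER DATABASE TaxiDb MODIFY (SERVICE_OBJECTIVE = 'S0');
```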
What is less obvious is which technology to use. It’s clear that network throughput will be the limiting factor when the target is appropriately sized. What is not clear is which technology uses that bandwidth most efficiently. Azure Data Factory seemed to scale slightly better with greater network bandwidth, but the testing was insufficient to determine a clear winner.
At this point, then, the recommendation must default to whatever suits your needs best. If you have deep SSIS expertise and an existing investment in it, keep using it. However, if you need to capture data from IoT devices or other external data sources, Azure Data Factory might suit you better. If you're on the fence, it would be best to test and compare results for your specific situation.