How to solve ETL problems using SFTP
TL;DR - SFTP can make it incredibly easy to share data securely between several parties where APIs to the data aren’t maintained. Read more below for the full story!
ETL, or Extract, Transfer and Load, is key to any organization dealing with data. The process of ETL involves pulling data from a variety of sources, transforming it to models or data structures, and storing these structures into a database. This database (usually referred to as a Data warehouse) supports better business decisions by making the process of querying and reporting that much quicker and easier. The idea of ETL has been quite popular ever since the 1970s, so the fact that there are tools and frameworks out there that support the orchestration of ETL processes should come as no surprise. Many organizations utilize commercial ETL products such as the incumbent Informatica or new-age cloud solutions such as Fivetran or Xplenty, while other organizations build their own tools using a variety of programming languages and frameworks.
Back in the not so distant past, most companies made use of ETL for the purpose of analyzing their own data: they would collect data from disparate data sources that they owned into a single data warehouse, and by doing so gained a clearer and fuller picture of their own business undertakings. In most cases, this involved pulling data from databases using SQL.
Nowadays, organizations have to combine the data that resides in their internal databases, with data collected by 3rd party services they use. For instance, an organization that uses a CRM such as Salesforce or Pipedrive, would have to pull data about its customers from these services. An organization that uses Facebook Ads and Google Ads for marketing campaigns, would have to pull data about the success of its campaigns from these services. Doing so is far more complex due to the different technologies that come into play and the fact that both the APIs (the interface to the data), and the data itself, are not owned by the company. These 3rd parties may change the interface and the data structure of the data which may break the ETL processes. But what about pulling data from vendors or customers who don’t offer an API to the data you’d like to analyze?
This is where SFTP comes in. SFTP, the Secure File Transfer Protocol, has been around for over 20 years and is widely considered to be a standard and secure protocol for transferring files. Due to its widespread use, SFTP has become the de-facto standard for securely sharing files within and between organizations. While SFTP is as simple as its predecessor, FTP (which is as old as the concept of ETL), it also boasts substantial advances in terms of its security, this being a contributing factor as to why SFTP is widely supported by GUI and command line (CLI) tools on all operating systems that allow both simple users to connect to SFTP servers and IT folks to write quick and dirty automation scripts to move files around. There are also libraries and SDKs for all programming languages that allow smooth connectivity to SFTP servers.
Another example of SFTP’s almost universal popularity is that when working with large corporations as vendors or customers, using SFTP as a safe drop site for files is almost always the default method. It provides a high level of confidence to corporates thanks to its encryption of authentication & data transfer. Another reason for its prevalence amongst corporates is its longevity and reliability — it's just been around for so long. Furthermore, it doesn’t require any additional development so that its setup time is potentially much shorter than other methods. Time, naturally, being a valuable resource.
So if you are working on an ETL process and you need to either receive data or send it to vendors, customers or other 3rd parties, SFTP is a valuable solution. Its reliability, efficiency, and its frequent use amongst corporates make it a viable method of transferring data . Once you have chosen SFTP as the protocol, method and intermediate storage between the source and destination, you will also need to agree on the data structure. A common practice of our current day is to use JSON or JSONL format (JSON Lines or new-line delimited JSON) to store data in files, since the file also describes the schema of the data, while also allowing for dynamic schemas that evolve over time with backward compatibility.
SFTP To Go’s SFTP cloud solution offers a few more features that make it superior to plain SFTP when used in ETL and data processing platforms:
- Multi-protocol support - users and applications can pick and choose whichever protocol they wish to read and write data files. While SFTP is the default protocol, you can also use FTPS, which may be more prominent with some organizations. Amazon S3 is also a viable option, which allows the user to query data directly through Amazon Athena or even perform scalable data processing using Amazon EMR, Spark, or Map/Reduce.
- Filesystem changes webhook notifications - in many cases, ETL processes are triggered by a time based scheduler. An example of such scheduled commands could be: “Load customer data every midnight”. With webhook notifications, you can use a programmatic trigger to start an ETL process each time a file is dropped into your SFTP storage. This allows you to process data as it comes and also help you scale the required processing power (i.e. no resources when there are no new files vs. more resources when many files are uploaded).
- Unlimited storage - the primary concern with every storage is disk space. More often than not, ETLs may break due to disk space running out in either the intermediate storage or in the target data warehouse. By contrast, SFTP To Go’s storage is virtually unlimited. We will let you know if you go over your threshold, so that you won’t have to worry about running out of disk space anymore.