A reasonably common scenario for a data-focussed consultancy is that a client wants to ship sensitive data from their on-premise or cloud environment to your AWS environment. There are a number of reasons that they may want to copy the data into your environment: it may be difficult for you to work with it in-situ, the tools you need may not be inside their environment, there may be no ingress to their data stores from outside, or they may want to provide an extract of data rather than the raw sources. These are all valid scenarios, and in each of them the simplest solution is to be able to dump the sensitive data into an S3 bucket under your control.
As a securable data store, S3 is quite hard to beat. The cost of storage at rest, and of data transfer in and out, is already very low and trending toward zero. There are a number of convenient ways of providing transparent encryption at rest on the server side, and reasonably convenient ways of doing client-side encryption. Access control and auditing can be very fine grained, and unless you do something silly, access can be locked down very hard. The bad old days of inadvertently leaving buckets open to the internet are long past, and it takes fairly active effort to open an S3 bucket for public access.
The drawback of S3 as a “drop box” for clients, though, is that it requires the client to install and manage either the AWS CLI tooling or some third party S3 client software, or to log in to your account to access the bucket via a web browser. It also requires you to provide them with access credentials, which are somewhat prone to leaking.
Recently, AWS has rolled out AWS Transfer for SFTP, which allows you to provide an SFTP service backed by a standard S3 bucket for storage. Wikipedia defines SFTP as follows:
“SSH File Transfer Protocol (also Secure File Transfer Protocol, or SFTP) is a network protocol that provides file access, file transfer, and file management over any reliable data stream.”
This is quite different to FTPS and FTP over SSH. FTPS is a standard FTP session with encryption enabled, and FTP over SSH tunnels the connection through an SSH connection. Neither is considered very secure, both require multiple channels (ports) to be opened, and available tooling makes configuration and use tricky and prone to error. SFTP, on the other hand, sits directly on top of SSH, giving fast and robust encryption, and solid validation of the identity of the client, server or both. More relevantly, it uses a single easily identifiable channel/port (port 22 by default), making network and firewall configuration and monitoring much simpler.
The user guide is fairly good, but does not make it entirely clear that there are three steps. Note that these could in theory be automated using Terraform or similar; however, it’s likely that setting up a service like this will be infrequent, and there may not be much benefit in automating it. Note as well that provisioning the service can take quite a long time (tens of minutes), which does not sit comfortably with the Terraform way of doing infrastructure-as-code.
The first step is to create a bucket, with all public access turned off, and without access policies defined. Ideally you want to enable server-side encryption at rest, and whatever logging, versioning and lifecycle management you want. Remember this is a standard bucket with nothing special about it, so you can easily adapt your standard bucket policies. As an aside, I would actually recommend that the bucket used as a target for SFTP is not where the data is retained: it’s quite straightforward to set up a data pipeline that moves files from the “dropbox” location to a “working” location when they land, which adds further assurance against the data being publicly exposed.
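To make the bucket settings concrete, here is a minimal sketch of the two configuration documents involved: blocking all public access and turning on default server-side encryption. The function name is hypothetical, and the choice of SSE-S3 (`AES256`) rather than a KMS key is an assumption; swap in whatever your standard bucket policy uses.

```python
import json


def dropbox_bucket_settings():
    """Return settings for a private, encrypted drop-box bucket.

    Hypothetical helper: it only builds the configuration documents,
    it does not call AWS.
    """
    # Turn off every route to public access.
    public_access_block = {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }
    # Default encryption at rest; AES256 (SSE-S3) is an assumption here.
    encryption = {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }
    return {
        "PublicAccessBlockConfiguration": public_access_block,
        "ServerSideEncryptionConfiguration": encryption,
    }


if __name__ == "__main__":
    print(json.dumps(dropbox_bucket_settings(), indent=2))
```

If you script this with boto3, these two dictionaries are in the shape that the S3 client's `put_public_access_block` and `put_bucket_encryption` calls take alongside the bucket name.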
Next, an IAM role and policy are needed which allow the SFTP service to read and write the bucket. The user guide gives a clear walkthrough of the requirements and process here. I would suggest having a distinct role and policy for each instance of the SFTP service / bucket pair; roles and policies are essentially free, and it will make auditing much simpler.
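As a rough illustration of what that role looks like, the sketch below builds a trust policy that lets the Transfer service assume the role, and an access policy scoped to a single bucket. The function names and the exact action list are my assumptions; the user guide remains the authority on which S3 actions your use case needs.

```python
import json


def transfer_trust_policy():
    """Trust policy allowing the AWS Transfer service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "transfer.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def bucket_access_policy(bucket_name):
    """Read/write policy scoped to one bucket (action list is an assumption)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object operations apply to the keys inside the bucket.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "client-a-dropbox" is a placeholder bucket name.
    print(json.dumps(transfer_trust_policy(), indent=2))
    print(json.dumps(bucket_access_policy("client-a-dropbox"), indent=2))
```

Keeping one such pair per service/bucket combination means an audit only has to read two short documents to know exactly what a given endpoint can touch.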
The final step is to create an SFTP instance. Again, the user guide is pretty good here, and the creation wizard is extremely clear. The endpoint can be public (which is what you want in order to expose it to a client), or tied to a VPC (which could be useful for the case where you have a VPC that has a private connection into your on-premise network). You want to use “service managed” identities, so that you can authenticate the client against their SSH key pair (you only ever need their public key), and of course you want to tag your resource.
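If you prefer scripting to the wizard, the choices above boil down to a handful of parameters. The sketch below builds them as a plain dictionary; the function name and the tag key are hypothetical, and I am assuming the parameter names used by the boto3 Transfer client's `create_server` call.

```python
def sftp_server_request(client_name):
    """Build the parameters for a public, service-managed SFTP endpoint.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        # Public endpoint, so the client can reach it from their network.
        "EndpointType": "PUBLIC",
        # Users and their SSH public keys are stored by the service itself.
        "IdentityProviderType": "SERVICE_MANAGED",
        # The tag key "client" is an illustrative choice.
        "Tags": [{"Key": "client", "Value": client_name}],
    }


if __name__ == "__main__":
    print(sftp_server_request("client-a"))
```

With boto3 this would be passed as `boto3.client("transfer").create_server(**sftp_server_request("client-a"))`, bearing in mind that the endpoint starts costing money as soon as it exists.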
The final, final step is to add a user to the service. It’s actually at this point that an association is formed between the SFTP endpoint and the bucket you created: the security model is “this user is allowed to instruct the SFTP service to access this bucket”, rather than the service itself having a security configuration that knows about the bucket. A side effect of this, of course, is that you could use different buckets for different users. In addition, the user configuration allows specification of a home directory in the bucket, which gives different users different virtual roots of the filesystem.
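The user is where the endpoint, the role and the bucket come together, which is easier to see written down. This sketch assembles the parameters for one user; the function name and the per-user prefix layout are my assumptions, and the parameter names follow the boto3 Transfer client's `create_user` call as I understand it.

```python
def sftp_user_request(server_id, user_name, role_arn, bucket_name, public_key):
    """Build the parameters tying a user to a role, a bucket and an SSH key.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        "ServerId": server_id,
        "UserName": user_name,
        # The role from the previous step; it decides which bucket is reachable.
        "Role": role_arn,
        # Home directory is "/<bucket>/<prefix>"; one prefix per user is an
        # illustrative layout, not a requirement.
        "HomeDirectory": f"/{bucket_name}/{user_name}",
        # Only the client's public key is ever needed on this side.
        "SshPublicKeyBody": public_key.strip(),
    }


if __name__ == "__main__":
    # All values below are placeholders.
    print(sftp_user_request(
        "s-0123456789abcdef0",
        "client-a",
        "arn:aws:iam::123456789012:role/client-a-sftp",
        "client-a-dropbox",
        "ssh-rsa AAAA... client-a@example.com\n",
    ))
```

Because the bucket only appears in the role and the home directory, pointing a second user at a different bucket is just a matter of passing a different role and bucket name.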
The big drawback to the service is price. While storage and transfer pricing is very low, the base provisioning cost of the service adds up very quickly: at $0.30/hour, a 30-day month (720 hours) comes to $216. This caught me out when I accidentally left an instance up for a week after doing testing for a client, landing me with a most unpleasant bill.
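The arithmetic is worth keeping to hand when deciding how long to leave an endpoint running. A small sketch, using the $0.30/hour rate quoted above and ignoring storage and transfer charges:

```python
HOURLY_RATE_USD = 0.30  # base endpoint rate quoted above


def endpoint_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Base cost in USD of keeping one SFTP endpoint provisioned."""
    return round(hours * hourly_rate, 2)


if __name__ == "__main__":
    print("30-day month:", endpoint_cost(24 * 30))
    print("forgotten week:", endpoint_cost(24 * 7))
```

A 30-day month comes out at the $216 mentioned above, and a forgotten week at a little over $50, before any data has been stored or moved.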
Despite the cost though, as long as your clients are able to manage SSH key pairs securely, and set up outgoing SSH traffic to the SFTP server, this is easily the simplest and most secure way of transferring large volumes of data to your AWS environment.
