Advantages of On-Premises Deep Learning and the Hidden Costs of Cloud Computing

TL;DR

  1. Cloud providers offer ‘free’ compute hours to lure customers into deploying all of their deep learning on cloud. Training costs quickly get out of hand and eclipse would-be on-premises hardware investments.
  2. On-premises systems, such as Exxact Deep Learning workstations and servers, offer maximum flexibility and control over infrastructure, and allow for deployment of experimental frameworks for cutting-edge experiments.
  3. In many cases, due to security and privacy concerns (often mandated by government regulations), sensitive data must remain on-premises or have ‘air-gapped’ properties.

When to Choose On-Premises Over Cloud For Your Deep Learning Applications

This is not meant to be a hit piece on cloud for deep learning, as we realize cloud GPUs offer great utility in the deep learning ecosystem. However, we do intend to show where cloud computing falls woefully short for deep learning applications.

Cloud providers have created a paradigm in which it’s automatically assumed that cloud computing is cheaper and better, with little to no consideration given to running your own hardware. While this may be true for most applications, deep learning is a different entity. Having your own dedicated hardware, especially for deep neural network (DNN) training, can provide significant benefits, with cost effectiveness being only one of them.

As providers of high performance computing (HPC) and specialized systems for deep learning, we have particular expertise in identifying scenarios where on-premises compute is favored over cloud in terms of cost, flexibility, privacy, or security.

How do we know this?

Simple: this is what our customers tell us and what they ask for.

Some of the applications are obvious (you’d hardly want to rely on cloud compute for a self-driving car cruising at highway speeds) while others are not.

For example:

The sheer amount of data and compute necessary for training DNNs.

A naïve company assuming that “cloud is cheaper, we’ll only pay for what we need!” will quickly be hit with runaway costs once its deep learning needs begin to scale.

This is why it’s important to consider every aspect of your computational needs when deciding between cloud and local compute for your next project. You really don’t need your own warehouse-sized data center to match the performance of cloud compute virtual machines. An Exxact Valence VWS-1542881 Deep Learning Workstation or an NVIDIA DGX-1 is not much larger than a conventional personal computer, and can probably handle DNN training for 90% of companies “doing deep learning”.

Note that the ideal scenario may well be a hybrid approach, using cloud services and APIs alongside on-premises hardware for the data- and compute-heavy tasks.

Cost: Beware of the “Free” GPU Instances

Cost comparisons between cloud and on-premises systems vary: estimates put cloud at about 2x the cost of on-premises for data centers in general, and at up to 3-4x the cost for deep learning specific setups.

Large cloud providers tend to offer “free” cloud compute hours to lure in companies and get them “hooked” on their cloud platform, sort of akin to a drug dealer handing out smack on the street corner. We’re not saying it never makes sense to take advantage of cloud services; just know what you’re getting yourself into before your full deep learning stack ends up on a cloud platform and you are training compute-heavy, data-intensive deep neural network models there.

It’s an open secret that cloud computing can be expensive when compared to dedicated systems, particularly for tasks with reliable compute needs known well in advance.

An on-premises deep learning system is also a capital asset, so your organization may be able to claim its depreciation against tax liabilities.

For application specifications that don’t rule out either on-site or cloud solutions, cost is king. In that case it’s time to set the total cost of ownership against a comparable subscription to a major cloud compute provider.

Keep in mind that the numbers below are estimates, and that cloud-based projects often accrue additional costs from things like data storage and transfer that aren’t immediately obvious. The costs for running Amazon Web Services P2 and P3 instances, marketed especially for machine learning, are shown below with and without a 3-year subscription (the 3-year commitment entails partial payment in advance).

(Assuming total depreciation over 3 years, estimated maintenance and operational costs of 50% of the original purchase cost per year, and an estimated electricity cost of ~$0.20 per kWh. For the latest cloud pricing, check the AWS EC2 pricing pages for P2 and P3 instances. It’s worth noting that even with maintenance estimated at 50% per year, on-site systems at 100% utilization would still be significantly cheaper than slower cloud counterparts. Not only does the “lower end” P100 on-site configuration cost about 50% less per hour than a reserved p2.xlarge AWS instance, but P100 GPUs also perform about 4x faster than the older K80 GPUs on TensorFlow benchmarks.)
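The assumptions above can be turned into a rough hourly cost with a few lines of Python. This is only a sketch of the arithmetic, not the calculation behind the published figures: the purchase price, power draw, and cloud rate below are hypothetical placeholders rather than quotes, and the function names are ours. It also assumes, for simplicity, that electricity is consumed only during hours of actual use.

```python
HOURS_PER_YEAR = 365 * 24


def on_prem_hourly_cost(purchase_price, power_kw, years=3,
                        maintenance_rate=0.5, electricity_per_kwh=0.20,
                        utilization=1.0):
    """Estimated cost per hour of actual use for an on-site system.

    Defaults follow the assumptions stated in the text: full depreciation
    over 3 years, maintenance at 50% of the purchase price per year, and
    electricity at about $0.20 per kWh.
    """
    used_hours = years * HOURS_PER_YEAR * utilization
    depreciation = purchase_price  # written off in full over the period
    maintenance = maintenance_rate * purchase_price * years
    # Simplification: the system draws power only while it is in use.
    electricity = power_kw * used_hours * electricity_per_kwh
    return (depreciation + maintenance + electricity) / used_hours


def cost_per_unit_of_work(hourly_cost, relative_speed):
    """Hourly cost normalized by throughput relative to a baseline GPU."""
    return hourly_cost / relative_speed


# Hypothetical inputs for illustration only, not vendor or AWS prices.
on_site = on_prem_hourly_cost(purchase_price=10_000, power_kw=1.0)
cloud_hourly = 2.00  # placeholder hourly rate for a cloud instance

print(f"on-site: ${on_site:.2f} per hour")
print(f"cloud:   ${cloud_hourly:.2f} per hour")
# If the on-site GPU is about 4x faster, compare cost per unit of work.
print(f"on-site, speed-adjusted: ${cost_per_unit_of_work(on_site, 4.0):.2f}")
print(f"cloud, baseline speed:   ${cost_per_unit_of_work(cloud_hourly, 1.0):.2f}")
```

Lowering `utilization` raises the on-site cost per used hour, which is why the comparison favors dedicated hardware most strongly when compute needs are steady and known in advance.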

Flexibility: Listen To What Your Engineers Want, Not Your Accountant

One of the selling points of cloud computing is “elasticity,” or the ability to quickly spin up additional virtual machines as needed. As counter-intuitive as it may sound, this elasticity does not necessarily translate into increased flexibility when it comes to pre-installed frameworks or choice of hardware.

For instance, invest in reserved P2/P3 instances from Amazon Web Services, and you’ll find yourself limited to a choice between older-generation K80 and more capable but pricier Tesla V100 GPUs.

Choosing a custom-built system for your deep learning application allows flexibility in choice of GPUs. On-premises providers also support specialized software configurations that go well beyond the popular TensorFlow, Torch, PyTorch, and Theano to include more esoteric packages such as DL4J, Chainer, and DeepChem for drug discovery.

Specialized frameworks, delivered preconfigured with all dependencies to run smoothly out of the box, offer a flexibility that is not always available from the one-size-fits-all solutions of major cloud providers.

More often than not, developer/researcher time is demonstrably your most valuable resource. Cloud computing obviates the need to worry about upgrades and maintenance, so that you and your team can concentrate on solving real problems. What’s not as obvious is that sourcing a deep learning system from a dedicated provider provides many of the same benefits, with services and warranties you’ll be hard-pressed to do without on a DIY system.

Security and Privacy

Applications serving the government, law enforcement, defense, and medical industries are all subject to strict regulations on data security, which often prevent the use of third-party storage solutions.

The obvious considerations of capability and cost may be the first things to come to mind when debating the cloud vs on-premises decision, but in fact there are many applications where the choice will be made for you by data security or privacy requirements.

As members of the public, we may be growing overly accustomed to news of security breaches in cloud services (such as the personal information of U.S. voters registered for the 2016 election, left exposed on AWS by data services company Deep Root Analytics), but when setting up a research or business project with potentially sensitive data, the consequences are all too real.

The convenience of cloud resources comes at the cost of a larger exposed attack surface, which may be vulnerable to malicious or accidental breaches.

On-premises systems mitigate some of this risk and can be configured to optimize for security, e.g. by building an air-gapped system, which removes network-based attack paths and avoids the cross-tenant side-channel exposure of shared hardware. In other instances the control and protection of private data may be something of a gray area, but internal best practices may encourage on-site data storage. Banking, fintech, and insurance applications all deal with sensitive data, and even in areas without explicit regulatory requirements, data security is a priority when a breach may have long-term reputational consequences.

The Bottom Line: It’s gut-check time for your A.I. initiatives

It’s time to be honest with yourself and your company and determine how seriously you take deep learning and A.I.

If you have no idea what you’re doing, are looking for bragging rights, and just want to play around a little with some neural networks, going full cloud is probably for you. However, if you REALLY want to dive deep into the latest game-changing technology and be on the cutting edge of A.I. research, you’ll find out sooner or later that it’s time to put some skin in the game, buckle down, and get some GPUs.

Cloud computing may seem to make sense for small, unknown, or variable compute requirements, but for deep learning at scale there are numerous advantages to considering a dedicated on-premises system.

For continuous, large-scale, and anticipated deep learning compute requirements, the cost savings of using dedicated on-site systems are significant. Computational needs for a smaller or more experimental workload can be met by a cost-effective yet capable 4x GPU deep learning workstation starting at less than $8k.
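One way to judge whether your workload counts as “continuous” enough is to estimate the utilization at which a dedicated system breaks even with cloud. The sketch below reuses the cost assumptions stated earlier in this article (full depreciation over 3 years, maintenance at 50% of the purchase price per year, electricity at about $0.20 per kWh). The function name and every input value are hypothetical, chosen only to illustrate the arithmetic.

```python
HOURS_PER_YEAR = 365 * 24


def break_even_utilization(purchase_price, power_kw, cloud_hourly_rate,
                           years=3, maintenance_rate=0.5,
                           electricity_per_kwh=0.20):
    """Fraction of the period a system must be busy to match cloud cost.

    Returns None when the cloud rate does not exceed the on-site
    electricity cost per hour, since no break-even point exists then.
    A result above 1.0 means the system cannot break even within the period.
    """
    # Costs paid regardless of usage: purchase plus yearly maintenance.
    fixed_cost = purchase_price * (1 + maintenance_rate * years)
    # Each hour run on site instead of on cloud saves this much.
    saving_per_hour = cloud_hourly_rate - power_kw * electricity_per_kwh
    if saving_per_hour <= 0:
        return None
    break_even_hours = fixed_cost / saving_per_hour
    return break_even_hours / (years * HOURS_PER_YEAR)


# Hypothetical inputs for illustration only, not real prices.
u = break_even_utilization(purchase_price=10_000, power_kw=1.0,
                           cloud_hourly_rate=2.00)
print(f"break-even utilization: {u:.0%}")
```

With these placeholder numbers the system pays for itself once it is busy a little over half the time. Workloads well below the threshold are the “small, unknown, or variable” cases where cloud may still make sense.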

Hey, and if all else fails, you can always mine cryptocurrency on your shiny new GPUs. Try doing that with Azure or AWS.