Are on-premises HPC clusters really millions of dollars? Recent posts from cloud-only vendors have made it seem that HPC clusters are only for firms with deep pockets. TotalCAE offers both on-premises HPC clusters, which we manage in our client’s data center, and managed cloud, which runs in our client’s AWS/Azure subscriptions. Since TotalCAE manages both cloud and on-premises clusters, this post is meant to help clients understand that HPC clusters are not just for the rich and famous. We will discuss what an HPC cluster is made of and where the real costs of on-premises and cloud clusters lie so firms can make an informed decision.
A minimal HPC configuration consists of these components for both on-premises and cloud:
- Head Node – This is a management server that runs the HPC scheduler that schedules workloads on the compute nodes, contains the direct attached storage, performs management functions, and runs the TotalCAE platform. This exists in both on-premises and cloud environments. The head node does not run any computation.
- Compute Nodes – Compute nodes run computational work, and are connected with a low-latency networking solution, such as InfiniBand on-premises and in Azure, or EFA on AWS, so that jobs can scale as the number of CPU cores grows. A compute node typically has 56 to 96 CPU cores and 100 to 384GB of memory to solve your job.
- Networking – For more than two nodes, a specialized RDMA switch for computational traffic, called InfiniBand, is used on-premises. On the cloud, RDMA networking is baked into the instance cost and there is no separate charge.
A typical starter HPC configuration with one management node, 48TB of storage, and two Intel Ice Lake compute nodes (112 total cores) with InfiniBand costs less than $1K per month on-premises. This equates to approximately 1.2 cents per core hour for the entire on-premises solution: affordable for small businesses and far less than the aforementioned million dollars.
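The per-core-hour figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a flat $1,000 per month (the upper bound quoted above) and 730 hours in an average month; the `utilization` parameter is a hypothetical addition to show how the effective rate changes if the cluster is not busy around the clock.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def cost_per_core_hour(monthly_cost_usd, total_cores, utilization=1.0):
    """Return the effective cost in USD of one core for one hour.

    utilization is the fraction of available core hours actually used
    (1.0 means the cluster is busy 24x7). It is an illustrative
    parameter, not a figure from the post.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    usable_core_hours = total_cores * HOURS_PER_MONTH * utilization
    return monthly_cost_usd / usable_core_hours


# Starter configuration: two compute nodes, 112 cores, under $1K per month.
starter = cost_per_core_hour(1000, 112)
print(f"Fully utilized: {starter * 100:.2f} cents per core hour")

# Same hardware, but only busy half of the time.
half_busy = cost_per_core_hour(1000, 112, utilization=0.5)
print(f"50% utilized:   {half_busy * 100:.2f} cents per core hour")
```

At full utilization this gives roughly 1.2 cents per core hour, matching the figure above. The fixed monthly cost is spread over fewer core hours as utilization drops, so the effective rate rises in proportion.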
Common HPC Configurations for CAE
Here are some list prices for on-demand cloud, reserved cloud, and on-premises lease options so you can do your own comparison.
Hardware Option | Cost | Location | Time Duration and Instance Type |
56-core 3rd Gen Intel Xeon Scalable Processor | $420 per month | On-Prem | 3-Year with InfiniBand |
64-core 3rd Gen Intel Xeon Scalable Processor | $1,843 per month | AWS | 3-Year RHEL Reserved c6i.32xlarge Instance with EFA |
96-core AMD EPYC Milan 7R13 | $1,230 per month | AWS | 3-Year RHEL Reserved Hpc6a.48xlarge Instance with EFA |
64-core 3rd Gen Intel Xeon Scalable Processor | $5.57 per Node Hour | AWS | Hourly RHEL On Demand c6i.32xlarge Instance with EFA |
96-core AMD EPYC Milan 7R13 | $2.88 per Node Hour | AWS | Hourly RHEL On Demand Hpc6a.48xlarge with EFA |
120-core AMD EPYC Milan 7003 | $1,445 per month | Azure | 3-Year RHEL Reserved HB120rs V3 with InfiniBand |
120-core AMD EPYC Milan 7003 | $3.96 per Node Hour | Azure | Hourly RHEL On Demand HB120rs V3 |
The direct cloud pricing above is available at https://azure.microsoft.com/en-us/pricing/details/virtual-machines/red-hat/ and https://aws.amazon.com/ec2/pricing/on-demand/.
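One way to do your own comparison is to put every row of the table on the same footing. The sketch below, using the list prices copied from the table, converts each option to cents per core hour and estimates how many hours per month a node must run before a reserved instance becomes cheaper than on-demand. It assumes 730 hours per month and full utilization for the monthly options, and it ignores differences in per-core performance between processors; the function names are illustrative only.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months

# (label, cores per node, list price in USD, billing unit) from the table above
OPTIONS = [
    ("On-prem 56-core Intel, 3-year lease", 56, 420.00, "month"),
    ("AWS c6i.32xlarge, 3-year reserved", 64, 1843.00, "month"),
    ("AWS Hpc6a.48xlarge, 3-year reserved", 96, 1230.00, "month"),
    ("AWS c6i.32xlarge, on demand", 64, 5.57, "hour"),
    ("AWS Hpc6a.48xlarge, on demand", 96, 2.88, "hour"),
    ("Azure HB120rs V3, 3-year reserved", 120, 1445.00, "month"),
    ("Azure HB120rs V3, on demand", 120, 3.96, "hour"),
]


def per_core_hour(cores, price, unit):
    """Normalize a node price to USD per core hour."""
    if unit == "month":
        # Monthly prices are spread over every hour in the month.
        return price / (cores * HOURS_PER_MONTH)
    if unit == "hour":
        return price / cores
    raise ValueError(f"unknown billing unit: {unit}")


def break_even_hours(monthly_price, hourly_price):
    """Hours per month above which the monthly option is cheaper."""
    return monthly_price / hourly_price


for label, cores, price, unit in OPTIONS:
    cents = per_core_hour(cores, price, unit) * 100
    print(f"{label:38s} {cents:6.2f} cents per core hour")

# Reserved vs. on-demand break-even for the same AWS c6i.32xlarge node.
c6i_break_even = break_even_hours(1843.00, 5.57)
print(f"c6i.32xlarge break-even: {c6i_break_even:.0f} hours per month")
```

If your nodes run fewer hours per month than the break-even figure, on-demand pricing is the cheaper of the two cloud options; above it, the reserved price wins.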
The Cost Driver is not the HPC Hardware, it’s People and Software.
Many of these million-dollar total cost of ownership analyses include the cost of HPC software and associated labor. IDC attributes over 67% of on-premises costs to labor (60%) and software (7%). These analyses assume that an army of engineers and a stack of software are required to run your HPC environment, and this is true. However, when TotalCAE manages your on-premises HPC, you get a complete HPC team and all software at a fraction of the cost. No overhead, no installation, no maintenance, and no day-to-day upkeep. Let your team focus on engineering, not IT.
Cloud Doesn’t Eliminate People and Software.
For those who go the do-it-yourself route on AWS or Azure, staffing costs do not go down. In fact, they will be higher, because cloud HPC requires additional skills from your IT staff. The TotalCAE SaaS platform and services, hosted in your cloud environment, will have you up and running in days, not months, while saving you over 60% on operational costs because TotalCAE handles IT management of the cloud environment running the TotalCAE Platform. You get the same benefits as TotalCAE on-prem, with the flexibility and agility of using your own Azure and AWS infrastructure.
TotalCAE HPC Everywhere for Maximum Agility and Lowest Cost On-Premises and Cloud
Firms want to make the best decisions for their company, and being fully informed about both on-premises and cloud options is the way to do that. TotalCAE HPC Everywhere enables clients to run on-premises, in the cloud, or both, for maximum flexibility and the lowest costs across hundreds of CAE applications. TotalCAE’s turnkey managed HPC cloud and managed HPC clusters will get the HPC environment for your engineers up and running in days, not months, while reducing your IT labor and HPC software costs to a fraction of the cost of doing it yourself.