With HPC in the public cloud, we have seen an interesting shift from on-premise HPC: management sees public cloud simulation time as a highly visible cost that needs to be controlled.
Management wants to keep engineers from running too much simulation on the public cloud in order to avoid large cloud bills, so there is increased scrutiny of which on-demand jobs are submitted and why they need to be run.
When each public cloud job costs real money, there is a shift in how engineers do their work. For example, making a mistake in the model now has a direct cost associated with it. If each model issue or typo costs you 500 dollars, you feel it a bit more than you did when the system was on-premise.
This is in contrast to on-premise HPC, where engineers are used to running as much simulation as their fixed resources allow for a fixed dollar amount. Engineers are encouraged to do as much HPC simulation as possible on-premise. We often get management asking questions like “Why isn’t the cluster busy on Sunday night?” when reviewing on-premise utilization reports from TotalCAE Analyzer.
Engineering knows simulation ultimately saves money through fewer warranty claims, less physical testing, better products, and other benefits, but we still see a movement to put caps and various barriers around HPC public cloud spending to keep the directly visible costs in check.
A few things that TotalCAE recommends to keep simulation expense down when using the HPC public cloud:
- It is a good idea to debug models locally on your on-premise HPC or workstation, then send the large job to the cloud after the kinks are worked out. Reserving the public cloud for large jobs that have already been debugged gets you the best bang for the buck.
- It is more important to monitor the jobs you submit to the public cloud. On-premise, a job that does not converge often isn’t a big deal even if you don’t notice it in a timely fashion. If you submit that same job to the cloud and leave on vacation for two weeks, you could be paying for compute time and using up your cloud budget on a run that is not generating useful results (a real-world story).
- The TotalCAE portal enables you to put a comment on the simulation job, which is useful when management reviews a report of the highest-cost runs at the end of the month and wants to know why a particular simulation was run (which can be hard to remember otherwise when doing so many simulations!).
- When doing a DoE, you may be able to control how you search the possible design space so that it takes fewer job runs, as each job on the public cloud has a real dollar cost and isn’t “filling holes” in the on-premise utilization.
- Consider the cost of the job when submitting. The TotalCAE cloud portal will highlight the approximate cost per hour of the job before the user hits “Submit”. This can help you decide, based on the estimated cost, whether this model is one you want to submit to the public cloud.
- Consider setting a “max running” time so that a job is stopped if it runs past what you wanted to spend (because it didn’t converge or is hung up, for example). This avoids a large bill for a job that wasn’t going to solve.
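The arithmetic behind the last two tips is simple enough to sketch. The snippet below is only an illustration and is not the TotalCAE portal's API: the function names, the hourly rate, and the budget figure are all hypothetical. It assumes the job is billed at a flat hourly rate, which real cloud pricing may not match exactly.

```python
# Illustrative only: hypothetical helper names, not part of any TotalCAE API.
# Assumes a flat hourly rate for the whole job.

def estimated_cost(hourly_rate_usd: float, hours: float) -> float:
    """Approximate job cost as hourly rate times wall-clock hours."""
    if hourly_rate_usd < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return hourly_rate_usd * hours


def max_runtime_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """Longest run time that stays within the budget at the given rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly rate must be positive")
    if budget_usd < 0:
        raise ValueError("budget must be non-negative")
    return budget_usd / hourly_rate_usd


def should_stop(elapsed_hours: float, budget_usd: float,
                hourly_rate_usd: float) -> bool:
    """True once the money spent so far reaches the budget."""
    return estimated_cost(hourly_rate_usd, elapsed_hours) >= budget_usd


if __name__ == "__main__":
    rate = 25.0     # hypothetical cost per hour for the job's nodes
    budget = 500.0  # hypothetical cap for this one simulation
    print(f"Max running time: {max_runtime_hours(budget, rate):.1f} hours")
    print(f"Cost of a 12 hour run: ${estimated_cost(rate, 12.0):.2f}")
```

Working out the cap in hours before submitting makes it easy to compare against how long a converging run of that model normally takes. If the expected run time is already close to the cap, that is a sign to revisit either the budget or the model.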
The TotalCAE portal has features to show you a job's cost, cap a job's cost, and pick projects to pull usage from, which simplifies cost management and reporting on the public cloud.
The public cloud is a powerful tool for running very large simulations that could not be run on-premise. Some simple tips like these can help you use it cost-efficiently.