Overview
Hellbender is the latest High Performance Computing (HPC) resource available to researchers and students (with sponsorship by a PI) within the UM System.
Hellbender consists of 208 mixed x86-64 CPU nodes (112 AMD, 96 Intel) providing 18,688 cores, as well as 28 GPU nodes containing a mix of Nvidia GPUs (see the hardware section for more details). Hellbender is attached to our Research Data Ecosystem (RDE), which consists of 8 PB of high-performance and general-purpose research storage. RDE can also be accessed from devices outside of Hellbender, creating a single research data location across different computational environments.
Detailed Information
Investment Model
Overview
Hellbender, the newest High Performance Computing (HPC) resource, has been provided through a partnership with the Division of Research Innovation and Impact (DRII) and is intended to work in conjunction with DRII policies and priorities. This outline defines how fairshare, general access, priority access, and researcher contributions are handled for Hellbender. HPC has been identified as a continually growing need for researchers; as such, DRII has invested in Hellbender as an institutional resource. This investment is intended to increase ease of access to these resources, provide cutting-edge technology, and grow the pool of resources available.
Fairshare
To understand how general access and priority access differ, fairshare must first be defined. Fairshare is an algorithm used by the scheduler to assign priority to jobs in a way that gives every user a fair chance at the available resources. For any given job waiting in the queue, the algorithm weighs several metrics, such as job size, wait time, current and recent usage, and individual user priority levels. Administrators can tune these weights to adjust how the scheduler determines which jobs run next once resources become available.
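As a rough illustration of how such a weighted calculation behaves (a minimal sketch only, not the scheduler's actual implementation; the weight and factor values below are hypothetical):

```python
# Sketch of a multifactor job priority calculation, loosely modeled on a
# Slurm-style multifactor plugin. Weights and factor values are hypothetical.

def job_priority(age_factor, fairshare_factor, size_factor,
                 w_age=1000, w_fairshare=10000, w_size=500):
    """Each factor is normalized to [0.0, 1.0]; a larger result runs sooner."""
    return (w_age * age_factor                # how long the job has waited
            + w_fairshare * fairshare_factor  # low recent usage -> higher factor
            + w_size * size_factor)           # size of the resource request

# A long-waiting job from a light user outranks a fresh job from a heavy user:
print(job_priority(age_factor=0.9, fairshare_factor=0.8, size_factor=0.2))  # 9000.0
print(job_priority(age_factor=0.1, fairshare_factor=0.2, size_factor=0.2))  # 2200.0
```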
Resources Available to Everyone: General Access
General access will be open to any research or teaching faculty, staff, and students from any UM System campus. General access is defined as open access to all resources available to users of the cluster at an equal fairshare value. This means that all users will have the same level of access to the general resource. Research users of the general access portion of the cluster will be given the RDE Standard Allocation to operate from. Larger storage allocations will be provided through RDE Advanced Allocations and are independent of HPC priority status.
Hellbender Advanced: Priority Access
When researcher needs are not being met at the general access level, researchers may request an advanced allocation on Hellbender to gain priority access. Priority access gives a research group a limited set of resources that are available to them without competition from general access users. Priority access is provided on a specific set of hardware through a priority partition containing those resources; this partition is created for, and limited to, the requesting user and their associated group. The same resources also remain in an overlapping pool available to general access users. This pool is administered such that if a priority access user submits jobs to their priority partition, any jobs running on those resources from the overlapping partition are requeued: they begin execution again on another resource in that partition if one is available, or return to wait in the queue. Priority access users retain general access status, and fairshare still moderates their access to the general resource. Fairshare inside a priority partition determines which user's jobs are selected for execution next within that partition. Jobs running inside a priority partition also affect a user's fairshare calculation for the general access partition, so running a large number of jobs inside a priority partition will lower a user's priority for the general resources as well.
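A toy model of that requeue behavior may help (a sketch only; actual preemption is handled by the scheduler itself, and all node and job names here are hypothetical):

```python
# Toy model of the overlapping-partition requeue behavior described above.
# Real preemption/requeue is the scheduler's job; all names are hypothetical.

def place_priority_job(job_name, target_nodes, nodes, free_nodes, queue):
    """Place a priority-partition job, displacing any general job in the way."""
    for node in target_nodes:
        displaced = nodes.get(node)        # general job currently on this node?
        if displaced is not None:
            if free_nodes:                 # restart it on a free node...
                nodes[free_nodes.pop()] = displaced
            else:                          # ...or send it back to wait in line
                queue.append(displaced)
        nodes[node] = job_name

nodes = {"node01": "general-jobA", "node02": "general-jobB"}
free_nodes = ["node03"]
queue = []
place_priority_job("priority-jobX", ["node01", "node02"], nodes, free_nodes, queue)
print(nodes)  # priority-jobX holds node01/node02; general-jobA restarted on node03
print(queue)  # ['general-jobB'] -- no free node left, so it waits in the queue
```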
Traditional Investment
Hellbender Advanced Allocation requests that are not approved for DRII Priority Designation may be treated as traditional investments, with the researcher paying at the defined rate for the resources used to create the Advanced Allocation. These rates are subject to change based on DRII determinations and hardware costs.
Resource Management
Information Technology Research Support Solutions (ITRSS) will procure, set up, and maintain the resource. ITRSS will work in conjunction with MU Division of Information Technology and Facility Services to provide adequate infrastructure for the resource.
Resource Growth
Priority access resources will generally be made available from existing hardware in the general access pool, and the corresponding funds will be retained to allow a larger pool of funds to accumulate for expansion of the resource. This will allow the greatest return on investment over time. If general access resources fall below 50% of the overall resource, an expansion cycle will be initiated to ensure all users still have access to a significant amount of resources. A large contribution from a researcher or research group may also trigger an expansion cycle if that is determined to be advantageous at the time of the contribution.
Benefits of Investing
The primary benefit of investing is receiving "shares" and a priority access partition for you or your research group. Shares are used to calculate the percentage of the cluster owned by an investor. As long as investors have used less than they own, their shares give their jobs higher priority in the general queue. FairShare is by far the largest factor in queue placement and wait times.
Investors will be granted Slurm accounts to use in order to charge their investment (FairShare). These accounts can contain the same members as a POSIX group (storage group) or any other set of users, at the request of the investor.
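As a rough sketch of how usage is weighed against ownership (this is the classic fair-share decay formula used by schedulers such as Slurm; the share and usage values below are hypothetical):

```python
# Classic fair-share factor: F = 2^(-U/S), where U is normalized usage and
# S is the normalized share of the cluster owned. Values are hypothetical.

def fairshare_factor(normalized_usage, normalized_shares):
    """Usage at or below ownership keeps the factor high; overuse decays it."""
    return 2 ** (-normalized_usage / normalized_shares)

print(fairshare_factor(0.05, 0.10))  # used half of what is owned  -> ~0.71
print(fairshare_factor(0.10, 0.10))  # used exactly what is owned  -> 0.5
print(fairshare_factor(0.40, 0.10))  # used 4x what is owned       -> ~0.06
```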
Hellbender Policies
Under no circumstances should your code be running on the login node.
Each user must use their own account to access RSS resources. Account sharing is prohibited.
Software and Procurement
Open source software installed cluster-wide must have an open source license (https://opensource.org/licenses) or be obtained through the procurement process, even if there is no cost associated with it.
Licensed software (any software that requires a license or agreement to be accepted) must follow the procurement process to protect users, their research, and the University. Software must be cleared via the ITSRQ. For more information about this process please reach out to us!
For widely used software, RSS can facilitate the sharing of license fees and/or may support the cost, depending on the cost and situation. Otherwise, users are responsible for funding fee-based licensed software, and RSS can handle the procurement process. We require that, if the license does not preclude it and there are no node or other resource limits, the software be made available to all users on the cluster. All licensed software installed on the cluster is to be used following the license agreement. We will do our best to install and support a wide range of scientific software as resources and circumstances dictate, but in general we only support scientific software that will run on RHEL in an HPC cluster environment. RSS may not support software that is implicitly or explicitly deprecated by the community.
Containers: Singularity/Apptainer/Docker
A majority of scientific software and software libraries can be installed in users' accounts or in group space. We also provide limited support for Singularity/Apptainer for advanced users who require more control over their computing environment. We cannot knowingly assist users in installing software that may put them, the University, or their intellectual property at risk.
Storage
None of the cluster-attached storage available to users is backed up in any way by us; if you delete something and don't have a copy somewhere else, it is gone. Please note that data stored on cluster-attached storage is limited to Data Class 1 and 2 as defined by the UM System Data Classifications. If you need to store DCL3 or DCL4 data, please contact us so we can find a solution for you.
Storage Type | Location | Quota | Description
--- | --- | --- | ---
Home | /home/$USER | 50 GB | Available to all users
Pixstor | /home/$USER/data | 500 GB | Available to all users
Local Scratch | /local/scratch | 1.6-3.2 TB | Available to all users
Pixstor | /cluster/pixstor, /mnt/pixstor | Varies | For investment, cluster attached
Vast | /cluster/VAST | Varies | For investment, cluster/instrument attached
Research Network
Research Network DNS: The domain name for the Research Network (RNet) is rnet.missouri.edu and is for research purposes only. All hosts on RNet will have a .rnet.missouri.edu domain. Subdomains and CNAMEs are not permitted. Reverse records will always point to a host in the .rnet.missouri.edu domain.