Mandate:
Reporting to the Manager and Architect of Advanced Research Computing Infrastructure, the Senior Advanced Research Computing Systems Administrator works as part of a team to design, build, and ensure the operational effectiveness of the university’s research servers and storage. Members of this team maintain systems critical to many research groups on campus and beyond, including web servers, database servers, large high-performance computing (HPC) systems, cloud infrastructure, and container orchestration platforms. These systems are used by researchers at UVic, at institutions across the country, and in international collaborations. They are required to be in operation 24 hours per day, 365 days per year, and decisions regarding them can affect UVic’s obligations to parties beyond the institution.
Objectives:
The Senior Advanced Research Computing Systems Administrator’s work includes the design, installation, configuration, and maintenance of hardware and software, problem determination and resolution, resource allocation, performance and security monitoring, and usage reporting.
Each position has specialized areas of expertise across multiple domains: storage technologies such as Ceph, dCache, GPFS, Lustre, and IBM Spectrum Protect (TSM); deployment technologies such as xCAT, Cobbler, Ansible, Puppet, and Terraform; compute and virtualization technologies such as Kubernetes and OpenStack; HPC schedulers such as SLURM, HTCondor, and Moab; and systems monitoring. The specific technologies used in this role will change over time, and this position is responsible for helping guide how future technologies are selected and deployed.
This position requires significant problem-solving skills to analyze and correct software and hardware problems and to automate administration tasks. This includes solving unanticipated and unique problems for which the incumbent may be the sole expert in the area. The incumbent must also possess effective communication skills in order to provide technical assistance and advice to peers and the user community, and to inform user areas of the impact and implications of system failures, maintenance, and cyber security incidents. This role leads project teams and provides recommendations on the university’s server and storage infrastructure.
System maintenance must usually be performed outside regular hours, and major issues are responded to on a 24/7 basis.
This role may need to work outside of normal work hours on an emergency or pre-scheduled basis. The role may need to travel out of town/country.