Posted 30 June, 2026
HPC Systems Administrator
London, United Kingdom
Full Time
HPC Systems Administrator, Specialist
Salary: Competitive salary and package (Depending on level of experience)
Locations: UK, London (must be willing to travel to client sites throughout the UK on an ad hoc basis)
Salary: Competitive salary and package (Depending on level of experience)
Accenture are partnering with scaled UK AI compute pioneers to lead the charge on next-generation infrastructure for sovereign AI. To support this endeavor, we're building a high-performance compute operations team in London.
Our work will be sensitive, secure and on the most up-to-date high density compute stacks available.
Any offer of employment is subject to satisfactory BPSS and SC security clearance which requires 5 years continuous UK address history (typically including no periods of 30 consecutive days or more spent outside of the UK) at the point of application.
Key Responsibilities:
• Design, deploy, and manage HPC infrastructures including GPU clusters and parallel computing environments.
• Support AI model training platforms by maintaining compute resources, optimizing scheduling, and ensuring compatibility with AI frameworks and libraries.
• Monitor, analyse, and fine tune performance metrics addressing bottlenecks or inefficiencies.
• Develop and maintain automation scripts and tools (e.g., PowerShell, Python, Bash) to streamline operational tasks, monitoring, and reporting.
• Document architecture, configurations, processes, and resolutions for compliance, knowledge transfer, and continuous improvement. Participate in root cause analysis (RCA) and post-incident reviews for compute or HPC-related incidents, implementing preventive measures as needed.
Required Skills:
• Expertise in an HPC environment, including GPU cluster administration (e.g., NVIDIA, AMD) and workload schedulers such as SLURM or PBS.
• Proficiency with AI model training workflows and experience supporting popular AI/ML frameworks (e.g., TensorFlow, PyTorch, CUDA). Solid understanding of networking, storage, and server platforms in both Windows and Linux environments.
• Advanced analytical, troubleshooting, and performance tuning skills, with the ability to diagnose and resolve complex compute and HPC issues.
• Experience with automation, monitoring platforms, and scripting languages (e.g., Python, PowerShell, Bash) to enhance operational efficiency.
• Strong communication and collaboration skills, with a track record of working effectively across technical and non-technical teams. Familiarity with compliance, data security, and best practices for compute and HPC environments.
#LI-EU
Salary: Competitive salary and package (Depending on level of experience)
Locations: UK, London (must be willing to travel to client sites throughout the UK on an ad hoc basis)
Salary: Competitive salary and package (Depending on level of experience)
Accenture are partnering with scaled UK AI compute pioneers to lead the charge on next-generation infrastructure for sovereign AI. To support this endeavor, we're building a high-performance compute operations team in London.
Our work will be sensitive, secure and on the most up-to-date high density compute stacks available.
Any offer of employment is subject to satisfactory BPSS and SC security clearance which requires 5 years continuous UK address history (typically including no periods of 30 consecutive days or more spent outside of the UK) at the point of application.
Key Responsibilities:
• Design, deploy, and manage HPC infrastructures including GPU clusters and parallel computing environments.
• Support AI model training platforms by maintaining compute resources, optimizing scheduling, and ensuring compatibility with AI frameworks and libraries.
• Monitor, analyse, and fine tune performance metrics addressing bottlenecks or inefficiencies.
• Develop and maintain automation scripts and tools (e.g., PowerShell, Python, Bash) to streamline operational tasks, monitoring, and reporting.
• Document architecture, configurations, processes, and resolutions for compliance, knowledge transfer, and continuous improvement. Participate in root cause analysis (RCA) and post-incident reviews for compute or HPC-related incidents, implementing preventive measures as needed.
Required Skills:
• Expertise in an HPC environment, including GPU cluster administration (e.g., NVIDIA, AMD) and workload schedulers such as SLURM or PBS.
• Proficiency with AI model training workflows and experience supporting popular AI/ML frameworks (e.g., TensorFlow, PyTorch, CUDA). Solid understanding of networking, storage, and server platforms in both Windows and Linux environments.
• Advanced analytical, troubleshooting, and performance tuning skills, with the ability to diagnose and resolve complex compute and HPC issues.
• Experience with automation, monitoring platforms, and scripting languages (e.g., Python, PowerShell, Bash) to enhance operational efficiency.
• Strong communication and collaboration skills, with a track record of working effectively across technical and non-technical teams. Familiarity with compliance, data security, and best practices for compute and HPC environments.
#LI-EU

