Posted 30 June, 2026
AI Infrastructure Lead Architect
London, United Kingdom
Full Time
YOU ARE
As a Lead and Principal Infrastructure Architect, you own end-to-end responsibility for designing optimized compute infrastructure for large-scale AI and machine learning systems, including large-scale distributed training environments.
You are the authority who translates business goals, SLAs, and client standards into infrastructure architectures that perform at scale while being deliberately engineered for cost-efficiency. Drawing on deep experience, you weigh multiple viable solutions for any given problem - across compute, networking, storage, orchestration, and model serving - and make rational, well-justified architectural decisions tailored to each client's situation, constraints, and standards. You architect and optimize the full computational stack for performance, power, cost, and scalability; design and tune large-scale GPU clusters and distributed training systems; and ensure infrastructure meets security, compliance, and regulatory requirements.
As the recognized AI infrastructure expert in at least one hyperscaler cloud (such as AWS, Azure, or Google Cloud), you bring authoritative knowledge of that platform's AI/ML services, accelerators, networking, and cost levers, and apply it to deliver best-in-class solutions. Beyond design, you set technical direction and standards, lead and mentor engineers and architects, partner with clients and stakeholders to shape the infrastructure roadmap, and are ultimately accountable for delivering AI/ML infrastructure that meets business SLAs, controls cost, and scales to enterprise and frontier workloads.
THE WORK
- Own the end-to-end architecture and design of optimized compute infrastructure for large-scale AI/ML systems, including large-scale distributed training environments, from concept through delivery.
- Develop and evaluate architecture alternatives, weighing trade-offs across compute, networking, storage, orchestration, and model serving to make rational, well-justified decisions tailored to each client's situation and standards.
- Lead architecture assessments and reviews of existing and proposed environments, identifying gaps, risks, bottlenecks, and optimization opportunities, and recommending remediation.
- Drive architectural decision-making, documenting rationale, trade-offs, and assumptions so decisions are transparent, defensible, and aligned with business SLAs and standards.
- Define and maintain the AI infrastructure roadmap, planning capacity, scaling, and technology evolution in step with business and product goals.
- Architect and optimize the full computational stack for performance, power, cost, and scalability, ensuring infrastructure meets business SLAs while being deliberately engineered for cost-efficiency.
- Design and tune large-scale GPU clusters and distributed training systems, including accelerator selection, interconnect/networking, and storage for high-throughput training workloads.
- Serve as the authoritative AI infrastructure expert in at least one hyperscaler cloud (AWS, Azure, or Google Cloud), applying deep knowledge of its AI/ML services, accelerators, networking, and cost levers.
- Design deployment, automation, and CI/CD strategies for reliable, repeatable, and scalable releases of AI systems, models, and data pipelines into production.
- Establish AI monitoring and observability strategy across InfraOps and MLOps, defining SLAs, SLOs, alerting, and performance/cost tracking, and driving continuous optimization.
- Integrate AI/ML systems into enterprise environments, ensuring interoperability, security, compliance, and adherence to regulatory and client standards.
- Lead capacity planning and cost modeling, forecasting compute needs and engineering cost-efficiency into the architecture without compromising performance.
- Collaborate with clients, stakeholders, and engineering teams to align infrastructure decisions with business outcomes, translating requirements into actionable architecture and standards.
- Set technical direction, standards, and best practices, mentoring engineers and architects and leading design and code reviews across the team.

