Job Description

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed through Slurm, Kubernetes, or direct SSH access. You’ll also support the company’s new products as they come to market.

This is a rare opportunity to work on AI infrastructure at scale, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

$300,000 gross per year
Equity

Job Tags

Permanent employment

Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA