Principal Site Reliability Engineer

Number of employees

780

San Francisco, CA, USA

Posted on: 2025-12-12

Category: energy

Employment type:

Full time

Experience required:

Intermediate

Salary

$261,000 - $326,000 + Bonus

About the company:

Crusoe is the industry’s first vertically integrated, purpose-built AI cloud platform. The company is redefining AI cloud infrastructure and its platform is recognized as the "gold standard" among builders for its reliability and performance in developing, training, and deploying AI models. Powered by clean, renewable energy, Crusoe aligns the future of computing with the future of the climate. Leading Fortune 500 companies trust Crusoe’s advanced, AI-optimized cloud to support their most demanding AI applications.

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role

As a Principal Site Reliability Engineer, you will play a critical role in designing and operating a next-generation NeoCloud built for AI, GPU, and high-performance workloads. This role sits at the intersection of infrastructure architecture, reliability engineering, and technical leadership. You’ll set reliability strategy, influence platform design, and ensure the cloud scales safely, efficiently, and predictably as customer demand accelerates.

You are a hands-on technical leader who thrives in complex distributed systems, drives clarity in ambiguous environments, and raises the bar for operational excellence across the organization.

What You’ll Be Working On

  • Define and own the reliability architecture for a NeoCloud platform supporting GPU-dense, latency-sensitive, and large-scale distributed workloads

  • Design and evolve SLOs, SLIs, and error budgets that meaningfully balance reliability, velocity, and customer experience

  • Lead incident response strategy for high-severity events, including root cause analysis and long-term remediation

  • Architect and improve observability systems (metrics, logs, tracing) to support rapid detection and diagnosis at scale

  • Partner with Infrastructure, Networking, Hardware, and Platform teams to influence system design before production issues occur

  • Drive automation across provisioning, deployment, capacity management, and failure recovery

  • Establish best practices for on-call health, operational readiness, and production change management

  • Serve as a technical authority and mentor for senior and staff-level engineers across the SRE and infrastructure org

What You’ll Bring to the Team

  • 10+ years of experience operating and scaling large-scale distributed systems in production environments

  • Deep expertise in SRE principles: reliability modeling, incident management, toil reduction, and systems thinking

  • Strong background in cloud or infrastructure platforms (public cloud, private cloud, or NeoCloud environments)

  • Hands-on experience with Kubernetes and containerized workloads at scale

  • Proficiency in one or more programming languages (Go, Python, Rust, or similar) with production-grade code ownership

  • Strong understanding of Linux systems, networking fundamentals, and performance bottlenecks

  • Proven ability to lead through influence — setting direction across teams without direct authority

  • Exceptional communication skills, especially during high-stakes incidents and cross-functional decision-making

Bonus Points

  • Experience supporting GPU-based, AI/ML, or HPC workloads

  • Familiarity with bare-metal provisioning, hardware lifecycle management, or data center operations

  • Experience building or scaling a NeoCloud or cloud-adjacent platform from early growth to maturity

  • Background in capacity planning for GPU, storage, or high-throughput networking environments

  • Passion for sustainable infrastructure or next-generation cloud architectures

Benefits:

  • Industry-competitive pay

  • Restricted Stock Units in a fast-growing, well-funded technology company

  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

  • Employer contributions to HSA accounts

  • Paid Parental Leave

  • Paid life insurance, short-term and long-term disability

  • Teladoc

  • 401(k) with a 100% match up to 4% of salary

  • Generous paid time off and holiday schedule

  • Cell phone reimbursement

  • Tuition reimbursement

  • Subscription to the Calm app

  • MetLife Legal

  • Company-paid commuter benefit: $300 per month

Compensation:

Compensation will be paid in the range of $261,000 - $326,000 + Bonus. Restricted Stock Units are included in all offers. Compensation is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
