Principal Site Reliability Engineer

Crusoe

780

San Francisco, CA, United States

Posted on: 2025-12-12

Category: energy

Employment type:

Full time

Experience required:

Intermediate

Salary

Salary not provided

About the company:

Crusoe is the industry’s first vertically integrated, purpose-built AI cloud platform. The company is redefining AI cloud infrastructure and its platform is recognized as the "gold standard" among builders for its reliability and performance in developing, training, and deploying AI models. Powered by clean, renewable energy, Crusoe aligns the future of computing with the future of the climate. Leading Fortune 500 companies trust Crusoe’s advanced, AI-optimized cloud to support their most demanding AI applications.

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role

As a Principal Site Reliability Engineer, you will play a critical role in designing and operating a next-generation NeoCloud built for AI, GPU, and high-performance workloads. This role sits at the intersection of infrastructure architecture, reliability engineering, and technical leadership. You’ll set reliability strategy, influence platform design, and ensure the cloud scales safely, efficiently, and predictably as customer demand accelerates.

You are a hands-on technical leader who thrives in complex distributed systems, drives clarity in ambiguous environments, and raises the bar for operational excellence across the organization.

What You’ll Be Working On

Define and own the reliability architecture for a NeoCloud platform supporting GPU-dense, latency-sensitive, and large-scale distributed workloads
Design and evolve SLOs, SLIs, and error budgets that meaningfully balance reliability, velocity, and customer experience
Lead incident response strategy for high-severity events, including root cause analysis and long-term remediation
Architect and improve observability systems (metrics, logs, tracing) to support rapid detection and diagnosis at scale
Partner with Infrastructure, Networking, Hardware, and Platform teams to influence system design before production issues occur
Drive automation across provisioning, deployment, capacity management, and failure recovery
Establish best practices for on-call health, operational readiness, and production change management
Serve as a technical authority and mentor for senior and staff-level engineers across the SRE and infrastructure org

What You’ll Bring to the Team

10+ years of experience operating and scaling large-scale distributed systems in production environments
Deep expertise in SRE principles: reliability modeling, incident management, toil reduction, and systems thinking
Strong background in cloud or infrastructure platforms (public cloud, private cloud, or NeoCloud environments)
Hands-on experience with Kubernetes and containerized workloads at scale
Proficiency in one or more programming languages (Go, Python, Rust, or similar) with production-grade code ownership
Strong understanding of Linux systems, networking fundamentals, and performance bottlenecks
Proven ability to lead through influence — setting direction across teams without direct authority
Exceptional communication skills, especially during high-stakes incidents and cross-functional decision-making

Bonus Points

Experience supporting GPU-based, AI/ML, or HPC workloads
Familiarity with bare-metal provisioning, hardware lifecycle management, or data center operations
Experience building or scaling a NeoCloud or cloud-adjacent platform from early growth to maturity
Background in capacity planning for GPU, storage, or high-throughput networking environments
Passion for sustainable infrastructure or next-generation cloud architectures

Benefits:

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Compensation:

Compensation will be paid in the range of $261,000 - $326,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
