Employment type:
Full time
Experience required:
Intermediate
Salary
Salary not provided
About the company:
Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
As an Incident Manager at Crusoe, you will be the frontline defender of our service reliability and customer trust. This role is pivotal to our mission, directly impacting the company’s success by minimizing downtime and orchestrating rapid resolutions to critical technical challenges. You will spearhead the management of high-visibility incidents and customer escalations, ensuring that our innovative climate-aligned computing platform remains robust and dependable.
In this full-time position, you will lead the charge on transformative projects, including designing self-serve support processes and partnering with our engineering teams to drive product improvements based on real-world incident data. We are looking for a technically fearless professional who thrives in high-pressure situations and possesses the leadership skills to guide both customers and internal teams through complex technical landscapes.
Incident Response Leadership: Lead the end-to-end management of high-visibility technical incidents and customer escalations, ensuring rapid restoration of services and effective communication throughout the lifecycle.
Complex Troubleshooting: Diagnose and resolve sophisticated technical issues involving InfiniBand, containerization, and distributed training to maintain peak operational efficiency for our customers.
Infrastructure Optimization: Guide and assist customers in implementing and fine-tuning their HPC infrastructure, directly contributing to their performance goals and technical success.
Strategic Collaboration: Act as a critical bridge between customers and internal engineering/product teams, translating frontline feedback into actionable product enhancements and quality improvements.
Knowledge Empowerment: Develop and deliver high-impact training materials, internal documentation, and knowledge base articles to empower both teammates and customers to navigate our solutions effectively.
Process Innovation: Design and implement robust incident response strategies and self-serve support processes to scale our ability to handle complex technical challenges.
Risk Mitigation: Participate in and manage on-call rotations, providing a reliable safety net for our infrastructure and ensuring 24/7 readiness for critical service interruptions.
Technical Linux & Virtualization Expertise: Demonstrate deep technical experience with Linux, virtualization, and Kubernetes to effectively manage and resolve infrastructure incidents.
Network Fundamentals: Apply a solid understanding of the TCP/IP stack to troubleshoot connectivity and performance issues across distributed systems.
Infrastructure-as-Code (IaC) Knowledge: Utilize your understanding of IaC practices to navigate and support modern automated environments.
Proven Customer Leadership: Bring 4-5 years of customer-facing experience, including 3-5+ years in a leadership role acting as a primary liaison between technical teams and stakeholders.
Exceptional Communication: Leverage elite written and verbal communication skills to translate complex technical concepts into clear, actionable updates for diverse audiences.
Analytic Problem-Solving: Apply a rigorous problem-solving mindset to diagnose, isolate, and resolve multifaceted technical issues under pressure.
Programming Proficiency: Experience writing or debugging code in one or more programming languages.
HPC Familiarity: Prior experience working with High-Performance Computing environments or large-scale distributed systems.
Advanced Certifications: Industry-recognized certifications in Linux administration, Kubernetes (CKA), or Incident Management frameworks.
Scalability Mindset: Experience scaling support or incident functions within a high-growth technology startup.
Crusoe also offers a competitive benefits package designed to support financial security, health, and overall well-being, including pension contributions, private health and dental insurance, income protection, life assurance and more.
Compensation will be paid as a salary or an hourly wage and will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.