[AI Agents x RPA - Industry No. 1 SaaS vendor from the US] Principal Site Reliability Engineer
About the Job
[About the Client]
Would you like to work with us at the cutting edge of "Agentic"? Through end-to-end business automation, we have long supported the efficiency and transformation of Japanese companies. Our current focus is "agentic automation": connecting AI agents, RPA robots, and people to automate work across the entire enterprise safely and reliably. The Japan branch has been elevated to a region reporting directly to headquarters, and under a strategy that positions Japan as our most important base, we aim to deliver solutions from Japan to the world. We are looking for curious, agile people who take the initiative. We need colleagues who enjoy the speed and change of business, care for one another, and keep growing together. Join us in making agentic automation a reality and transforming society together.

Role Overview:
This is a high-impact, principal-level role designed for an engineer who excels in the "heat of the moment". Operating with a high degree of autonomy, you will take operational leadership to restore the stability of our large-scale distributed services, blending deep technical SRE expertise with the authoritative presence of an Incident Commander. You will partner closely with platform, infrastructure, and application teams globally to improve service availability, reduce operational toil, and ensure our systems scale reliably under real-world load and failure conditions.
You will act as the Japan regional owner for SRE standards and maintain a close partnership and functional alignment with our Global SRE organization. You will also own service reliability, observability, automation, and continuous improvement initiatives for the region. You will report primarily to the Senior Director of Japan and functionally to the Vice President, SRE, based in the U.S. You will also act in a managerial capacity, with another team member reporting to you.

What You’ll Be Working On:
1. Incident Command & Tactical Response
• Lead Incident Command: Act as the primary Incident Commander for high-stakes technical events. Establish command and control, orchestrate cross-functional response efforts (Compute, Network, Storage, Database), and maintain a common operating picture for all stakeholders.
• Live Site Troubleshooting: Serve as a key escalation point for complex issues. Use your deep understanding of service topology and dependencies to diagnose "grey failures" and resolve disruptions promptly.
• Executive Communication: Own the communication life cycle. Deliver real-time, executive-level briefings during active incidents, translating technical jargon into clear business impact and recovery timelines for leadership.
2. Prevention & Reliability Engineering
• Post-Incident Evolution: Lead thorough retrospectives and RCAs. Beyond documenting what happened, you will drive and influence the discovery and implementation of automated self-healing solutions so that the same issue does not occur twice.
• Observability: Define, track, and improve service health by promoting well-designed SLIs and SLOs. Influence and implement proactive monitoring, dashboards, and early-warning alerts to identify performance bottlenecks before they trigger an incident.
• Toil Automation: Design and implement automation to reduce manual intervention during incidents and routine operations. Apply engineering rigor to operational workflows to eliminate repetitive and error-prone tasks.
• Service Resilience: Know how to test service behavior under load, including degradation modes, scaling characteristics, and dependency failures. Ensure backup, restore, and disaster recovery capabilities are implemented, tested, and maintained.
3. Service Design & Cross-functional Leadership
• Architectural Partnership: Partner with development teams to champion high availability and readiness of the services and promote best practices on reliability, resilience, and operability.
• Team Mentorship: Advocate for SRE best practices. Mentor and support other engineers, helping raise the overall incident response and reliability maturity of the organization.
Required Skills and Experience
What You’ll Bring to the Team:
• Experience: 7+ years in SRE, Cloud Operations, or a related technical field, with at least 3 years in a lead responder or command-oriented role.
• Command Presence: Demonstrated ability to remain calm, focused, and decisive under extreme pressure. You can lead a room of diverse stakeholders and drive technical conversations to successful outcomes.
• Forensics & Investigation: Skill in analyzing system artifacts, network data, and performance dashboards to guide a multidisciplinary audience to the likely root-cause areas of service failures.
• Technical Breadth: Strong proficiency in Python or Go and a holistic understanding of distributed systems, Kubernetes, and cloud infrastructure (preferably Azure).
• Observability Expertise: Deep experience with Prometheus/Grafana, OpenTelemetry, or an equivalent third-party observability stack.
• Availability: Willingness to participate in the on-call rotation as an Incident Commander for high-severity issues.
Preferred Qualifications
Nice to have:
• Command Frameworks: Familiarity with structured command systems used in crisis management, such as the Incident Command System (ICS).
• LLM Ops: Experience using LLMs or AI-driven detection systems to solve reliability and capacity challenges in GPU-heavy, high-performance computing environments.
• AI Tooling: Champion the use of AI tools and LLM-powered agents to improve SRE pillars including, but not limited to, reducing operational toil.
• Event-Driven Remediation: Proven history of building "self-healing" infrastructure via Terraform, Azure Service Operator, or an equivalent solution.
| Expected annual salary | 10 to 20 million yen |
|---|---|
| Position | Principal Site Reliability Engineer |
| Location | Tokyo |