Remote Senior Manager of Site Reliability Engineering

Remotestar

Cambridgeshire, United Kingdom | Full-time | I.T. & Communications
    • Job ID 2773146

    Job Description

    RemoteStar is on the lookout for a talented Senior Site Reliability Engineering Manager to join our client’s team based in the UK, offering a fully remote work opportunity.

    About the Client:

    Our client is revolutionizing the B2B diamond industry with an innovative marketplace that seamlessly connects jewelry retailers with gemstone suppliers. With established operations in major cities including London, Hong Kong, Amsterdam, Mumbai, and New York since 2001, they stand as a leader in the global diamond and gemstones sector.

    About the Role:

    In the position of SRE Manager, you will be at the forefront of optimizing our client’s infrastructure and service performance, ensuring utmost reliability and scalability. Your hands-on technical expertise combined with strategic leadership will be vital in managing and building a first-class SRE team.

    You will take full responsibility for the production environment, encompassing both technical oversight and process management. Your role includes ensuring the continuous operation of live systems while effectively addressing on-call support challenges.

    As part of your responsibilities, you will design and implement a new incident tracking system to facilitate timely root cause analysis and resolution by the development team. Additionally, you will spearhead initiatives aimed at enhancing automation and monitoring capabilities to drive operational efficiency and system reliability.

    Fostering an inspiring culture of collaboration, innovation, and continuous improvement, you will build and guide a high-performing SRE team through mentorship and by setting a strong example.

    Key Requirements:

    • Demonstrated experience in a senior or lead SRE position, showcasing a proven history of developing and sustaining robust infrastructure and services.
    • In-depth expertise in incident management processes, encompassing incident response, resolution, and thorough post-mortem reviews.
    • Proficiency in observability and monitoring tools such as Prometheus, Grafana, the ELK stack, or Datadog.
    • Experience with cloud services such as AWS, Azure, or GCP, including proficiency in infrastructure as code using Terraform or CloudFormation.
    • Strong scripting and automation capabilities in languages such as Python, Bash, or Go.
    • Excellent communication and collaboration skills, with a knack for working effectively in cross-functional remote teams.
    • Proven leadership qualities and a genuine passion for mentoring and developing team members.

    What We Offer:

    • A vibrant and dynamic work environment within a rapidly expanding company.
    • An opportunity to work in an international setting.
    • A collaborative workplace with minimal hierarchy.
    • A chance to engage in intellectually stimulating projects that significantly contribute to the client’s success and scalability.
    • Flexible working hours to promote work-life balance.
