Site Reliability Engineer
This hedge fund is built on a culture of innovation, and it builds and maintains cutting-edge hardware and software solutions. They are looking for an ambitious, enthusiastic and driven Site Reliability Engineer to join their global Production Engineering team in London and help take the firm's reliability to the next level. The team is focused on an unrelenting push to improve the automation, testing and monitoring of systems and processes, making the business more resilient and enabling a high velocity of change.
- Standardisation of monitoring methodologies, systems, tools, libraries
- Automation of operational processes to improve reliability and efficiency and to reduce alert fatigue
- Owning and evolving the firm's systems by pushing for changes that improve resilience and reliability
- Developing, and enabling the development of, high-quality, resilient, scalable and secure systems
- Wearing a strategic resilience and reliability hat in architecture and design discussions
- Maintaining the highest levels of systems availability across the enterprise, mostly for proprietary applications
- A passion for automation and continual improvement, with a track record of identifying high-value automation opportunities
- Intense focus on improving system availability and resilience through testing, standardisation and automation
- Ability to build positive and collaborative relationships with colleagues across teams and geographies
- Broad technical knowledge and strong communication skills, credible across the full technology stack
- Systematic and methodical approach to problem-solving and debugging
- Knowledge of cybersecurity risks
- Expert-level scripting / coding in Python / Ruby / PowerShell / C# / Java / Go or equivalent
- Experience implementing / using containerisation technologies such as Docker / Podman / Kubernetes / OpenShift
- Experience using configuration management tools such as Puppet / Chef / Ansible / DSC / Terraform
- Experience implementing distributed systems such as Hadoop / Spark / Kafka / Flink
- Experience implementing centralised logging and monitoring / alerting systems such as Nagios / Sensu / Zabbix / Grafana / Kibana / Prometheus