System Administrator / SRE
Original posting
The challenge
We're at a pivotal stage in the evolution of our cloud platform. To continue scaling efficiently and strengthening reliability, we are expanding our Operations & SRE capabilities. Our infrastructure supports mission-critical services for our customers, and ensuring performance, stability, and continuous improvement is at the core of our vision.
As a Site Reliability Engineer / Systems Administrator, your mission will be to monitor and optimize our cloud systems, automate processes, ensure effective incident management, and help us maintain a robust, scalable and secure infrastructure. You will play a key role in minimizing downtime, improving operational efficiency, and supporting sustainable growth.
You'll be part of a highly collaborative engineering environment, working closely with DevOps, Product and Development teams to build reliable services from the ground up, enforce good operational practices and contribute to ongoing enhancements that impact thousands of users.
Collaboration will be essential. You will support critical infrastructure decisions, lead incident response, proactively detect risks and ensure that both technology and teams can continue to scale confidently.
Requirements that matter to us
- Experience in administration of large-scale cloud or MSP infrastructures.
- Expert-level knowledge of Linux systems (a must).
- Experience working with critical environments requiring fast and effective incident resolution.
- Solid networking expertise: TCP/IP, DNS, load balancing, firewalling, network virtualization.
- Experience with network storage solutions (Ceph, NFS or similar).
- Knowledge of virtualization and cloud orchestration platforms.
- Database administration basics: MySQL, MariaDB or PostgreSQL.
- Experience with monitoring and tuning tools (Zabbix, Nagios, Prometheus, Grafana, Datadog, or similar).
- Understanding of ITIL processes for managing incidents, problems, and changes.
Key skills and expected impact
- Strong documentation practices and contribution to operational knowledge.
- Monitoring and optimization of performance, identifying bottlenecks and preventing service interruptions.
- Ability to lead root cause analysis and prevent recurring incidents.
- Implementation of centralized log management and analysis.
- Excellent communication skills in Spanish and an intermediate level of English.
Nice to have
- Experience with Ansible for automation and configuration management.
- Experience with web servers and virtualized platforms.
- Advanced security knowledge and system hardening.
- Analytical mindset focused on operational excellence.
- Experience with ticketing systems (workflow creation, prioritization, follow-up).
Tools
- Monitoring & performance: Prometheus, Grafana, Nagios, Zabbix, Datadog, or similar
- Logging: Centralized log management systems
- Databases: MySQL, MariaDB, PostgreSQL
- Orchestration: CloudStack, OpenStack, or similar
- Collaboration & knowledge base: Jira, Confluence, Microsoft 365, Slack
- Ticketing & ITSM: ITIL-based tools
Application managed by Jotelulu