System Administrator / SRE

Jotelulu

Plaza España office · On-site · Competitive · Published 5 months ago · Mid-level · Permanent contract

🇬🇧 English required · Operations

The challenge

We're at a pivotal stage in the evolution of our cloud platform. To continue scaling efficiently and strengthening reliability, we are expanding our Operations & SRE capabilities. Our infrastructure supports mission-critical services for our customers, and ensuring performance, stability, and continuous improvement is at the core of our vision.

As a Site Reliability Engineer / Systems Administrator, your mission will be to monitor and optimize our cloud systems, automate processes, ensure effective incident management, and help us maintain a robust, scalable and secure infrastructure. You will play a key role in minimizing downtime, improving operational efficiency, and supporting sustainable growth.
You'll be part of a highly collaborative engineering environment, working closely with DevOps, Product and Development teams to build reliable services from the ground up, enforce good operational practices and contribute to ongoing enhancements that impact thousands of users.
Collaboration will be essential. You will support critical infrastructure decisions, lead incident response, proactively detect risks and ensure that both technology and teams can continue to scale confidently.

Requirements that are important to us

Experience administering large-scale cloud or MSP infrastructures.
Expert-level knowledge of Linux systems (a must).
Experience working with critical environments requiring fast and effective incident resolution.
Solid networking expertise: TCP/IP, DNS, load balancing, firewalling, network virtualization.
Experience with network storage solutions (Ceph, NFS or similar).
Knowledge of virtualization and cloud orchestration platforms.
Database administration basics: MySQL, MariaDB or PostgreSQL.
Experience with monitoring and tuning tools (Zabbix, Nagios, Prometheus, Grafana, Datadog...).
Understanding of ITIL processes for managing incidents, problems, and changes.

Key skills and expected impact

Strong documentation practices and contribution to operational knowledge.
Monitoring and optimization of performance, identifying bottlenecks and preventing service interruptions.
Ability to lead root cause analysis and prevent recurring incidents.
Implementation of centralized log management and analysis.
Excellent communication in Spanish and intermediate English.

Nice to have

Experience with Ansible for automation and configuration management.
Experience with web servers and virtualized platforms.
Advanced security knowledge and system hardening.
Analytical mindset focused on operational excellence.
Experience with ticketing systems (workflow creation, prioritization, follow-up).

Tools

Monitoring & performance: Prometheus, Grafana, Nagios, Zabbix, Datadog, or similar
Logging: Centralized log management systems
Databases: MySQL, MariaDB, PostgreSQL
Orchestration: CloudStack, OpenStack, or similar
Collaboration & knowledge base: Jira, Confluence, Microsoft 365, Slack
Ticketing & ITSM: ITIL-based tools