As an L3 support member of the team, my responsibilities include, but are not limited to:
- Monitor, maintain, and provision components of the Data Systems Platform.
- Perform software upgrades on the components of the Data Systems Platform.
- Work with the Data Engineering team to help design and implement the next iteration of scaling.
- Work closely with the systems performance, systems operations, and network engineering teams, as needed, to ensure high performance and availability.
- Work with the Data Engineering team to evaluate open-source and commercial software and hardware solutions.
- Develop and implement tools to automate aspects of managing the Data Systems Platform, including upgrades where appropriate.
- Participate in prototyping, proof-of-concept system development, and benchmarking.
- Manage storage restructuring as required.
- Participate in the on-call rotation, responding to alerts and system issues.
- Manage user access and resource allocations for the Data Systems Platform.
- Administer Kafka and Hadoop clusters to support the Data Systems Platform.
- Review all pull requests for the team to ensure high-quality, sustainable automation utilities for both cluster administration and internal engineering clients.
- Develop automation utilities for infrastructure administration, as well as self-healing auto-response actions for common failures.
1. As an L1 and L2 support member, I helped the team with:
a. Updating automated runbooks that were created for basic admin needs.
b. Creating runbooks for repetitive customer requests (Kafka topic deletion, head node decommissioning/recommissioning, and service restarts for Hadoop components).
c. Creating alerts and dashboards for applications, servers, and node memory usage.
d. Allocating and reclaiming resources for the teams, and monitoring their usage.
e. Performing lease returns for all servers on a periodic basis.