Monitoring and Logging:
- Deployed Prometheus, Alertmanager, and Grafana on Kubernetes via Helm charts for real-time monitoring, sharply increasing system visibility and enabling quicker issue resolution.
- Enhanced data analysis and troubleshooting efficiency by integrating Grafana with Prometheus, Loki, and AWS EKS for comprehensive system metrics and log visualization.
- Configured Alertmanager with Slack and email notifications, significantly reducing incident response time and minimizing system downtime.
- Set up Blackbox Exporter to monitor endpoint availability and performance across multiple protocols, ensuring reliable service delivery and quick issue identification.
- Developed custom Prometheus exporters for monitoring key system metrics, and integrated Grafana dashboards with Slack/email alerts, enhancing monitoring capabilities and proactive alerting.
- Migrated and centralized multiple legacy monitoring dashboards into Grafana, improving data accessibility.
- Built a custom CMS toolbox for quick access to monitoring tools such as Prometheus, Alertmanager, and Grafana, as well as runbooks, streamlining workflow and reducing response time to system alerts.
Dashboarding and Reporting:
- Designed Grafana dashboards to monitor query execution status, error logs, and performance trends, improving system reliability and facilitating proactive issue resolution.
- Created and presented weekly/monthly reports on system performance, cost optimization, and data insights.
Cloud Infrastructure & Migration:
- Upgraded EC2 AMIs from CentOS 7 to Oracle Linux 8, minimizing downtime during the migration.
Security & Access Management:
- Enforced IAM security policies, including automated access key rotation and role-based access control.
Terraform Infrastructure Management & Automation:
- Developed and managed Terraform IaC, automating AWS resource provisioning for consistency and repeatability.
- Integrated Terraform with CI/CD pipelines to streamline deployments and accelerate delivery cycles.
- Utilized Terraform state management to track changes and prevent configuration drift.
- Applied Terraform workspaces and variable files to manage multiple environments (Dev, Staging, Prod).
- Configured Terraform state disaster recovery using cross-region S3 replication, ensuring data availability and resilience.
- Automated IAM policy updates and S3 bucket policies using Terraform to enforce security best practices.
Terraform Security, Performance & Optimization:
- Provisioned and managed AWS RDS databases using Terraform, automating backups and scaling to enhance data reliability and performance.
- Implemented Terraform security best practices by configuring AWS Security Groups, IAM roles, and policies, strengthening system security and compliance.
- Integrated AWS ELB with Terraform for efficient traffic distribution and fault tolerance, improving application availability and user experience.
- Configured Terraform remote state management using AWS S3 and DynamoDB for consistency and collaboration.
Incident Management & Root Cause Analysis:
- Managed critical P1 data issues, ensuring timely resolution with root cause analysis and long-term fixes.
ETL & Data Processing:
- Developed and deployed ETL jobs in Oozie and Airflow, optimizing data processing workflows.
- Performed SLA audits for Jira tickets and Airflow ETL jobs, implementing failure alerts to improve job reliability.
- Developed SLA dashboards to track ETL job performance and adherence to SLAs, providing clear visibility into job reliability and compliance.
- Created Airflow DAGs to extract query statistics and store them in AWS RDS for MySQL (short-term) and S3 (long-term).
AWS EMR and Cluster Optimization:
- Provisioned, managed, and optimized AWS EMR clusters, automating cloning and termination to reduce cloud costs.
- Optimized AWS infrastructure costs for EC2, S3, ELB, and EBS using AWS Trusted Advisor recommendations.
Performance Optimization & Auto-Scaling:
- Enhanced monitoring system performance using auto-scaling policies and post-incident analysis.
Stakeholder Communication and Recognition:
- Delivered presentations to stakeholders, highlighting key metrics, improvements, and recommendations.
- Received client appreciation for responding promptly to customer requests and efficiently resolving critical issues.