Summary

Overview

Work History

Experience Summary

Education

Technical Skills

Core Competencies

Certifications and Trainings

Community Contributions

Personal Projects

Service Delivery Manager-SRE Data Science platform

Cloud - Site Reliability Engineer Experience Details

Cloud SRE - Devops Team Management

Cloud SRE - Azure Cloud Team Management

Cloud SRE - Database Team Management

Cloud SRE - Production Monitoring Team Management

Cloud SRE - Automation Team Management

Cloud SRE - Identity Access Management

Cloud SRE - Audit Management

Cloud SRE - Cloud Cost Management

Site Reliability Engineer Experience Details

Associate Site Reliability Engineer Experience Details

Production Management Lead, JP Morgan Chase -India

Production Management , JP Morgan Chase - United Kingdom

Onshore Team Lead, Accenture - United Kingdom

Offshore Team Lead, Accenture - India

Java Developer, Cognizant - Pune

Junior Java Developer, Tata Consultancy Services - India

Network Engineer, Patni Computers - India

Shailesh Chaskar

Site Reliability Engineer

Pune, Maharashtra

Summary

A Site Reliability Engineer and aspiring Cloud Solutions Architect, now expanding into Machine Learning Operations.

An IT specialist with 18 years of experience in development and extensive production management and site reliability in Finance and B2B E-commerce/Supply Chain Management.

As a Service Delivery Manager - SRE, managed a team of 9 L3/SRE members and guided 6 L2/Production Management members who supported a Data Science platform used by approximately 600 data scientists.

As General Manager of Cloud SRE, played a vital role in the success of the organization's cloud initiatives by developing and implementing effective SRE policies and procedures, managing various teams, and ensuring that the cloud platform was reliable, scalable, secure and available.

As a Cloud Site Reliability Engineer, helped the organization achieve its business goals by focusing on the KPIs below.

Maintained cloud platform service availability of 99.7%. Eliminated toil and improved the automation rate, saving at least 25% of person-hours per quarter. Reduced the time taken to provision infrastructure by using IAC tools like Ansible and Terraform.
Ensured the production platform was monitored with a Mean Time to Detect SLA of 99% and a Mean Time to Respond SLA of 99%. Secured cloud IAM with an Identity Secure Score of 40% for a tenant user base of 1k to 10k, and managed the cost pillar using the Azure Advisor Score.
Clearly defined Service Level Indicators and Service Level Objectives for services used on cloud platforms. Maintained a detailed Disaster Recovery and Business Continuity plan, with Recovery Time Objective (RTO) and Recovery Point Objective (RPO) communicated to the business.
Kept production platforms stable through a robust Change Management process governing changes to production. Enhanced CI/CD to improve deployment frequency and reduce the deployment failure rate.
Used trend analysis of app/cloud service incidents and security incidents to drive projects, making continuous improvement an inherent part of the SDLC, as suggested by the ITIL framework.

Overview

years of professional experience

Organizations Worked for

Cloud Experience in Years

SRE Experience in Years

Cloud Trainings

Cloud Professional Certification

SRE Trainings

Work History

Service Delivery Manager-SRE Data Science Platform

EPAM

Pune

12.2023 - Current

General Manager (SRE) Cloud and Production

ElasticRun

03.2022 - 12.2023

Vice President (SRE)

JP Morgan Chase

01.2020 - 03.2022

Associate Vice President (SRE)

JP Morgan Chase

01.2018 - 12.2019

Associate Vice President (Production Support Lead)

JP Morgan Chase

10.2016 - 12.2018

Associate Vice President (Production Support)

JP Morgan Chase

07.2014 - 09.2015

Associate Manager (Onshore Team Lead)

Accenture

04.2011 - 06.2014

Senior Software Engineer (Offshore Lead)

Accenture

01.2010 - 03.2011

Software Engineer (Module Lead)

Accenture

10.2008 - 12.2009

Associate Developer (Java)

Cognizant

06.2007 - 09.2008

Assistant System Engineer (Java Developer)

Tata Consultancy Services

01.2005 - 05.2007

Site Maintenance Engineer

Patni Computers

01.2004 - 12.2004

Experience Summary

Managerial Experience:

Managed DevOps, Automation, Cloud, DBA and Production Monitoring L1 and L2 teams, ensuring the cloud platform was available and reliable. Built and mentored a team of reliability engineers and set up a stringent process to govern production changes - ElasticRun, India.
Managed the SRE team implementing Blue-Green deployments, taking on-call responsibilities, collaborating in real time to avoid issues that could lead to incidents, and assisting Production Management in monitoring batch processing - JP Morgan Chase, India.
Managed the Production Management team, providing knowledge, motivation and guidance while handling critical incidents. Planned and executed Site Resiliency and Disaster Recovery events efficiently - JP Morgan Chase, UK.
Led a development team implementing a GemFire distributed system - Accenture, UK.
Led a development team creating a UI using GWT and a query module consisting of Spring Web Services and Spring Batch - Accenture, India.

Technical Experience:

Enhanced SRE experience with Azure Cloud across Security, Identity Access Management, Cost Optimization, Networking, and managing IAAS and PAAS services. Mentored the DevOps team to improve CI/CD. Secured the cloud platform with endpoint security tools.
Result-oriented - End-to-end ownership of key projects such as the Cyber Security Transformation Program, migrating 100 accounts across environments.
Achieved batch monitoring and stabilization, saving 11 hours of compute over the weekend. Extensive experience interacting with all 3 tiers of engagement in Singapore, the United Kingdom, the United States and Argentina. Experienced in capacity management, upgrading infrastructure, and carrying out basic platform-level and app-level checks to ensure a smooth go-live.
Keen observer with a deep understanding of processes such as ticket management, incident management, change management and problem management. Highly experienced with build and deployment processes and tools, and able to create build pipelines using Jenkins.
Led a database migration project at Accenture, India, to migrate data from a legacy system to a new system. Drove an 80% incident reduction via a grid computing monitoring update. Standardized Site Resiliency/Disaster Recovery execution via a single-click end-to-end run.

Business Development Experience:

Vendor Management - Managed relationships with cloud service providers, aligning on the goal of making the cloud platform reliable and available. Managed various projects with service partners to modernize the platform. Drove various initiatives with app development teams to enhance CI/CD and evolve
Worked closely with the Middle Office and Level 1 business team to capture all the details required to set up a Jira service help desk for traders, Middle Office, internal app teams, Production Management, development teams and SRE to improve collaboration.
Worked closely with the business at Accenture, UK, interacting with potential clients to understand their pain points and produce a proposal for a GemFire distributed system implementation.

Education

Bachelor of Engineering - Telecommunications

Mumbai University

India

06.1999 - 06.2003

Technical Skills

Azure Cloud
Google Cloud Platform
Python
Java
Jenkins
Control-M Scheduler
Geneos
Kubernetes
Docker
Terraform
Ansible
Zabbix
ELK
Grafana
Prometheus
Posit Products

Core Competencies

Dealing with People

Establishing Focus
Providing Motivational Support
Empowering Others
Managing Change
Persuasive Communication
Customer Orientation
Building Collaborative Relationships

Dealing with Business

Analytical Thinking
Forward Thinking
Strategic Thinking
Fostering Innovation
Technical Expertise
Results Orientation
Entrepreneurial Orientation

Self-Management

Self Confidence
Stress Management
Personal Credibility
Flexibility

Certifications and Trainings

May 2024 : Participating in Various Azure AI Challenges - exploring the Azure AI Services on Azure platform.
Jul 2023 : Introduction to Generative AI (United Latino Students Association)
Oct 2022 : ElasticRun Assessor (CHRMP)
Sept 2022 : Azure Cloud Associate (Webmagic Informatica)
Feb 2022 : Google Cloud Certified Professional Cloud Architect (Google)
Jan 2022 : Architecting with Google Kubernetes Engine: Foundations (Pluralsight)
Oct 2021 : Leveraging Load Balancing Options on the GCP (Pluralsight)
Oct 2021 : Site Reliability Engineering: The Big Picture (Pluralsight)
Sept 2021 : SRE – Using Error Budgets to Prioritize Work
Oct 2020 : Solving Real World Problems using Machine Learning (TensorFlow)
Aug 2020 : Machine Learning Boot Camp (Sinhgad Institutes)
May 2020 : Python for Data Science (iNeuron)
Jul 2019 : Python: The Big Picture (Pluralsight)
May 2019 : Agile Scrum Master (Simplilearn)
Feb 2018 : Python Getting Started (Pluralsight)
Apr 2017 : Scrum Development with Jira & JIRA Agile (Pluralsight)
Mar 2016 : Learn DevOps: Continuously Deliver Better Software (Udemy)
Jan 2015 : Information Technology Infrastructure Library Foundation Certificate
Jan 2007 : Sun Certified Java Developer (Sun Microsystems)

Community Contributions

Co-Organizer of Google Developer Group Cloud Pune - organizing Cloud Community Days and GDG Cloud community events.
Created offline and online study groups, mentoring young talent and professionals looking for guidance.
Published multiple Azure AI Services, Azure Cloud and Google Cloud stories via Java Revisited Publications on medium.com
Conducted knowledge-sharing sessions for the Pune Tech Community via the Meetup platform in 2023.
Google Dev Group DevFest 2021 - Cloud Track Quiz winner
Won Active Participant award in Google Garage Series 3 - Infrastructure as a Service
Conducted knowledge-sharing sessions on DevOps/SRE for Bharti Consultancy in 2019, 2020 and 2021.

Personal Projects

Recent project - Custom Web App deployed to Azure App Service via Docker Image pushed to Container Registry
Azure Kubernetes Service Monitoring with Grafana and Prometheus
Setup Azure Kubernetes Service and Container Registry via Private End Point
Cloud Infrastructure Provisioning using IAC Tools such as Terraform
Implement CI/CD using Jenkins and Google Kubernetes Engine (Nov/Dec 21).
• Aim to expose a simple “Lets Sail together” Python containerized application using L7 load balancers.
• Automate the deployment using IAC tools like Terraform for infra and Ansible for application deployment.
Upstream/Downstream Dependency one-stop viewer (Sept 2021)
• Aim to have one view of application dependencies based on schedulers.
• Created using HTML/Flask for 3 applications by parsing data from XML and storing it in a SQLite DB.
Home expense application (Apr/May 2021)
• Aim to create a home expense web app for accounted/unaccounted and monthly expenses.
• Shares the importance of expense management and shows how to categorize expenses for a better view of them.
• Created web portal using Python/Flask/HTML and MySQL/MongoDB.
• Also planned to use MongoDB Atlas integration to showcase cloud importance and features.

Next projects in pipeline –

Exploring Vision and Document Azure AI Services and storing data in a database
IoT Core data processing using GCP's event-driven Pub/Sub product
A BCP/DR site that enables running war games

Service Delivery Manager-SRE Data Science platform

Goal : Managing non-catalog requests from data scientists to achieve stability of the Data Science platform

How was the Goal Achieved :

1. Streamlined workload/bandwidth visibility using Jira.

2. Improved team collaboration by adopting Agile via daily scrum.

3. Improved prioritization of requests by introducing sprints and a backlog prioritization process.

4. Upgraded RStudio Package Manager and RStudio Connect to meet the needs of data scientists.

5. Automated Start of the Day and Start of the Week checks for the platform, validating the service status of over 20 applications.

6. Improved identity security - Onboarded critical R Shiny App - Oracle DB accounts to the Privileged Access Management portal.

7. Improved web security - Assisted the Web Application Framework team in adding a bastion host and WAF layer for the Data Science platform.

8. Presented to key business stakeholders on the progress of deliverables for non-catalog requests raised by Business and IT.

9. Fostered a culture of collaboration and communication between development and operations teams.

Outcome :

Team Workload Visibility - Missing FTE allocation reduced from 20 to 10 hours per week.

Workload Prioritization - Improved by 25%; Highest and High priority requests prioritized rose from 3 to 6 per month.

Resource Utilization and Cost Savings - By automating SOD and SOW checks, time spent on them was reduced by 75%, from 1 hour to 15 minutes.

Lead Time for Changes - Due to improved collaboration, changes to the platform were delivered faster since risks were flagged and addressed proactively.

Cloud - Site Reliability Engineer Experience Details

Managed DevOps, Automation, Cloud, DBA and Production Monitoring L1 and L2 teams, ensuring the cloud platform was available and reliable.

As an SRE, adhered to the SRE principles -

Assessed and managed the risk generated by production services through well-defined service availability with an error budget, clearly defining the window available for unplanned and planned downtime. Example - 99.7% availability over 720 hours in a month leaves an error budget of approximately 2 hours per month.
Defining Service Level Indicators and Service Level Objectives was essential for SRE operations. Service Level Agreements were driven by business leads, with SLOs to be adhered to by the SRE team.
Maximum focus and effort was spent on eliminating toil and on monitoring the system, an absolutely essential component of doing the right thing in production, which helped us distinguish symptoms from causes.
Automation using infrastructure as code was a key focus area for the automation team, using Ansible and Python libraries.
Release engineering with DevOps helped us keep the process simple and robust, giving us agile releases via efficient CI/CD.

Outcome : Key performance indicators (KPIs) defined for SRE are listed below -

Service Availability - Clearly defined and signed off by business owners, reviewed on a quarterly basis.
Error Rates - Clearly defined and signed off by business owners, reviewed on a quarterly basis.
Incident Response and Resolution with Mean Time to Detect - clearly defined and achieved with an SLA of 99%.
Incident Response and Resolution with Mean Time to Respond - clearly defined and achieved with an SLA of 99%.
Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) - clearly defined and reviewed on a quarterly basis.
Change Management - Assess the stability and reliability of services before and after changes - clearly defined and reviewed on a monthly basis. The error budget was the main reference point for evaluating change management success.
Automation Rate - Measure the extent to which manual, repetitive tasks have been automated - savings of at least 40 person-hours were achieved each quarter.
Incident Post-Mortem Quality - 5 Whys analysis was carried out consistently, with an SLA of 99%. For P1 severity incidents a PIR was sent.
Change Failure Rate - Monitor the percentage of changes that result in incidents or rollbacks - clearly defined and reviewed on a monthly basis.
Incident Reduction - Trends were reviewed on a monthly basis.
Disaster Recovery Testing - Recently started war games in non-prod to assess preparedness.
Time-to-Resolution Improvement - Trends were reviewed on a monthly basis and action items were defined and tracked to closure.
Culture of Reliability - Assess the extent to which a culture of reliability and shared responsibility is fostered within the organization - cultural change is not easy to evaluate, but it is directly proportional to the stability of the system.

Cloud SRE - Devops Team Management

Goal : Bridging the gap between development and operations and achieving the goal of delivering software efficiently, reliably, and with agility.

How was the Goal Achieved :
1. Improved Continuous Integration/Continuous Deployment by adopting a tag-based GitLab strategy, achieving consistency across the platform.
It also helped maintain a clear and organized history of the codebase and made it possible to manage deployments and releases more effectively.
2. Integrated Jenkins into the existing CI/CD pipeline and implemented Docker image-based deployments.
3. Managed version control systems such as GitLab and code repository member access.
4. Created and managed infrastructure as code (IAC) using tools like Ansible.
5. Implemented security best practices for infrastructure and applications, using key vaults to store secrets and managed identities to connect to Azure services.
6. Enforced compliance with organizational policies by adding a change request validation module before releasing a deployment.
7. Implemented governance practices to maintain control over resources by moderating changes to critical parameters of containerized applications, such as CPU and memory requests and limits.
8. Fostered a culture of collaboration and communication between development and operations teams.

Outcome :

Deployment Frequency - Increased by 33% from 60 to 80 deployments per month.
Deployment Failure Rate - Dropped by 60% from 10 to 4 deployments per month.
Resource Utilization and Cost Savings - By adopting a new container registry strategy, the team reduced VM cost by 40%.
Lead Time for Changes - Once a change request is approved, deploying a signed-off Docker image takes no more than 10 minutes, including post-release checks.

Cloud SRE - Azure Cloud Team Management

Goal : Managing and optimizing cloud infrastructure and services

How was the Goal Achieved :
1. Developing and maintaining a cloud strategy aligned with the organization's goals, such as converting the AKS cluster from public to private for enhanced security.
2. Infrastructure Provisioning - Creating and configuring virtual machines, containers, and other cloud resources as needed.
3. Managing VPC, subnet network configurations and network security groups.
4. Security and Compliance - Implementing and maintaining security best practices, such as using identities for interaction between services.
5. Managing access controls and identity and access management (IAM) policies using RBAC and AAD groups.
6. Cost Optimization - Monitoring cloud usage on a fortnightly basis and implementing cost control measures using policies.
7. Backup and Disaster Recovery - Implementing backup and disaster recovery strategies and conducting regular backups and recovery drills.
8. Establishing cloud governance policies using PIM and Identity Governance.
9. Capacity Planning - Estimating future resource needs and planning for capacity scaling. This drove projects such as re-networking and migrating the AKS cluster from kubenet to CNI networking mode, which offers advanced networking features like dynamic subnet addition to the cluster.
10. Adoption of New Technologies - Staying up to date with emerging cloud technologies and assessing their potential benefits for the organization. Example - using preview features in non-prod, such as Azure AD Workload Identity with AKS.
11. Resource Management - Configure resource limits and requests for your containers to prevent resource contention and ensure predictable performance.
12. Configure Horizontal and Vertical Scaling - to automatically adjust the number of containers based on resource utilization or custom metrics.
13. Load Balancing - An Azure Application Gateway layer 7 load balancer was provisioned to balance incoming traffic across multiple container instances for high availability and improved performance.
14. Taints, tolerations and labels were used to control how pods are scheduled on nodes in the cluster, making efficient use of spot nodes.
15. Frequent cluster upgrades to ensure the latest Kubernetes version and feature updates, as well as to patch any security vulnerabilities.
16. Kubernetes cluster backup and recovery using Velero.

Outcome :

Cost optimization : 25% cost reduction per day during the first major iteration in Dec 2022 and 12% cost reduction per day during the second major iteration in September 2023.
Services uptime : SLA of 99.7% availability over 720 hours in a month, with an error budget of approximately 2 hours, was achieved.
Security : Identity Secure Score improved by 22% compared to Q1 2023.
Advisor Score : Consistently improved by 5% each quarter.

Cloud SRE - Database Team Management

Goal : Ensure that data is secure, available, and efficiently organized to support the needs of the business. Manage SQL and NoSQL Databases on IAAS and PAAS platform.

How was the Goal Achieved :
1. Installing and configuring database management systems (DBMS) such as MariaDB, MySQL, PostgreSQL, MongoDB, etc. on IAAS using a custom-built image.
2. Setting up database parameters and configuration settings for optimal performance.
3. Access Control : Managing user access and permissions.
4. Backup and Recovery: Creating and maintaining database backup and recovery strategies. Performing regular backups and testing recovery procedures.
5. Performance Tuning and Optimization: Configuring master and replica with required configuration.
6. Estimating future database growth and resource requirements. Scaling database systems to handle increased data loads by federating databases.
7. Frequent Database Patching and Updates : Applying patches and updates to the database software. Planning and executing database version upgrades.
8. High Availability and Disaster Recovery: Implementing high availability solutions such as replication, or failover mechanisms.
9. Evaluating options to move to PAAS Service - flexi DB
10. Monitoring and Alerting: Setting up database monitoring tools and alerts. Responding to alerts and proactively addressing issues.
11. Investigating and resolving database-related issues and incidents.
12. Developing strategies for data archiving and purging to maintain database efficiency.
13. Cost Optimization - Monitoring database-related costs, storage, and resource utilization. Identifying cost-saving opportunities.

Outcome :

Database availability : SLA of 99.7% availability over 720 hours in a month, with an error budget of approximately 2 hours, was achieved.
Database performance : Query response time and throughput (queries processed per unit of time) were monitored closely; early detection of symptoms reduced impact or prevented incidents altogether.
Database security : Compliance Adherence achieved with no Major or Minor NC during Internal and external Audits.
Backup and Recovery : Recovery Time Objective (RTO) and Recovery Point Objective (RPO) well defined and achieved for critical app categories I and II
Change Success Rate: Assess the success rate of database changes, updates, and patches without causing disruptions - 90% SLA achieved.

Cloud SRE - Production Monitoring Team Management

Goal : Ensure reliability, availability, and performance of services deployed on the cloud platform. Core responsibilities typically involve monitoring, managing, and responding to incidents in real time to maintain the stability of production and non-production environments.

How was the Goal Achieved :

1. 24/7 Monitoring: Continuously monitor production systems, networks, and applications to detect and respond to anomalies, issues, or performance degradation.

2. Utilize monitoring tools like Zabbix and monitor alerts via Zabbix and Grafana dashboards to track app and cloud services health and performance. Prometheus, an open-source monitoring and alerting system, was used for monitoring cloud platform reliability and scalability. Grafana, an open-source visualization platform, was integrated with Prometheus to create interactive and customizable dashboards for various data sources. Three main types of dashboard were created: one for Cloud Cost, one for App and Cloud Services Monitoring, and one for Critical Configuration Monitoring.

3. Incident Detection and Response: Follow established incident management processes to resolve issues or escalate them to appropriate teams.

4. Alert Management: Configure and fine-tune alerting thresholds to reduce false positives and ensure timely notifications for critical incidents. Acknowledge and prioritize alerts based on severity and impact. Data was queried from Prometheus or Azure Monitor using PromQL or Kusto queries. Prometheus rules were configured on the metrics to generate customized alerts on the Grafana dashboard. The alerting system was enhanced by integrating Zabbix and Grafana with Exotel. Grafana OnCall features were also explored in the background.

5. Performance Analysis: Analyze system and application performance metrics to identify bottlenecks, latency issues, and opportunities for optimization. Collaborate with other teams to implement performance improvements.

6. Diagnostics and Troubleshooting: Conduct initial troubleshooting to isolate the root causes of incidents. Provide detailed incident reports and logs to support problem resolution.

7. Documentation and Reporting: Maintain detailed records of incidents, responses, and resolutions. Generate regular reports on system performance, incident trends, and response times.

8. Communication: Provide clear and timely communication to stakeholders, including App teams, management, and end-users, regarding incidents and their status. Collaborate with other teams during incidents to ensure effective resolution.

9. Documentation and Runbooks: Create and maintain runbooks and documentation for common incidents and procedures to ensure consistency and efficiency in incident response.

10. Continuous Improvement: Participate in post-incident reviews and root cause analysis to identify opportunities for process improvement and system resilience enhancement.

11. Ensuring end-to-end monitoring/alerting coverage of all services consumed on the cloud platform. Various Prometheus exporters such as Blackbox, Elasticsearch, Kafka, mysqld, nginx, node and Redis were configured to collect metrics from different sources. Metricbeat was also used to gather system and service stats. Thanos was configured for metrics storage and AlertSnitch for alert storage. Metrics from Azure components such as Application Gateway, Flexi DB and Azure VMs were pushed to Prometheus too. Even Debezium connectors were scraped for data, which was pushed to Prometheus.

Outcome :

Mean Time to Detect (MTTD) - For Disaster-type alerts, an SLA of 98% was met and improved each quarter.
Mean Time to Resolve (MTTR) - Based on the Critical Application Framework, issues were resolved with minimum business impact. An SLA of 95% was met and improved each quarter.
Mean Time Between Failures (MTBF) - Problem management and root cause analysis helped improve the MTBF.
Incident Severity Classification : Incident matrix maintained based on priority and severity, with impacted users and financial loss as the two decisive attributes.
Incident Trend Analysis : Incident data was published weekly. A monthly incident scorecard was shared, showing severity against the stage that caused the incident.
Alert Reduction : In each sprint cycle, consistent effort went into reducing alerts by approximately 10%.

Cloud SRE - Automation Team Management

Goal : Improving efficiency of SRE Team members, reducing manual toil, and ensuring consistency in processes.

How was the Goal Achieved :
1. Define the SRE team's automation strategy and roadmap and publish progress on a monthly basis.
2. Identify areas where automation can provide the most significant benefits and prioritize automation initiatives based on impact and ROI.
3. Develop custom automation scripts, workflows, and code as needed to automate specific tasks or processes. Use programming languages and scripting tools like Python, Azure CLI, PowerShell or Kusto queries.
4. Create, manage, and version infrastructure as code (IAC) using tools like Terraform and Ansible.
5. Automate data extraction, transformation, and loading (ETL) processes; for example, pre-prod refresh and data backup/restore were automated.
6. Implement automated monitoring solutions to detect and respond to system and application issues - pod-level status checks were automated with a rolling restart strategy.
7. Collaborate with other teams, including DevOps, development, and SRE, to integrate automation solutions seamlessly into existing workflows.
9. Ensure that automation scripts and workflows are thoroughly tested and validated for accuracy and reliability.
10. Monitor and optimize costs associated with cloud resources and automation tools - Azure Python SDK libraries were used to find resource costs and redirect to the cost dashboard. Disks and snapshots were auto-deleted using the Python SDK.
11. Automate infrastructure provisioning and configuration to support scalability and consistency - Ansible playbooks were written to automate password reset, Linux/Unix user creation and database user creation, triggered from Jenkins.
12. Enhance change management governance by implementing change request validation logic during deployments and VM login.

Outcome :

Time Saved - Varied by use case, but approximately 50-75% of person-hours were saved.
Cost Savings - Varied by use case, but approximately 25% was saved through automatic disk cleanup and 15% through automatic snapshot cleanup.
Error Reduction - Major reduction from 5 incidents to almost 1 incident since last year.
Cycle Time Reduction - Shorter cycle times with improved speed and agility: execution of a simple task like post-release checks dropped by over 80%, from 30 to less than 5 minutes.
Productivity Improvement - Output per employee/team increased: with mundane tasks reduced, the team could focus on other crucial projects.
Reliability and Availability - Since tasks were completed faster, system downtime and incidents were reduced.
Reduction in Manual Intervention - For activities like VM patching, almost 95% of person-hours were saved, with no manual intervention needed except monitoring the Jenkins job to completion.
Security and Compliance - Improved, since the new change request process prevents unauthorized changes.

Cloud SRE - Identity Access Management

Goal : Ensure that only authorized individuals and systems can access cloud resources while maintaining the highest level of security.
Responsible for developing, implementing, and managing an organization's identity and access management (IAM) program.

How was the Goal Achieved :
1. User Provisioning and Deprovisioning : Create and manage user accounts, ensuring that users have the appropriate access based on their roles and responsibilities.
2. Access Control : Define and enforce access control policies and permissions for various resources, such as databases, applications, and network resources.
Implement role-based access control (RBAC) to manage permissions based on job roles.
3. Authentication and Authorization : Manage authentication mechanisms, including passwords, multi-factor authentication (MFA), and single sign-on (SSO). Conduct MFA Campaign each quarter.
4. Directory Services : Administer and maintain directory services, such as Azure Active Directory to manage user and group information.
5. Identity Lifecycle Management : Monitor and manage the entire lifecycle of user identities, from onboarding to offboarding, ensuring that access rights are appropriate at all stages.
6. Access Reviews and Auditing : Regular Monthly review and audit user RBAC access, identify and remediate any excessive or unauthorized access.
7. Risk Assessment and Compliance : Ensure compliance with ISO 27001 and abide by internal security standards.
8. Identity Federation : Implement and manage identity federation solutions to enable passwordless/token based access to multiple systems and applications.
9. Technology Evaluation and Integration: Evaluate and select IAM technologies and tools that align with the organization's security and business needs.
10. Collaboration : Collaborate with other App Dev Teams and Support teams to ensure a holistic approach to Identity access management.
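
The monthly access review in step 6 can be pictured as comparing each user's assigned roles against what their job role permits. Below is a minimal sketch, assuming a hypothetical ALLOWED_ROLES baseline and a hypothetical Assignment record; neither comes from the actual platform, and a real review would read assignments from the directory service rather than a hard-coded list.

```python
from dataclasses import dataclass

# Hypothetical baseline: RBAC roles permitted per job role (illustrative names only).
ALLOWED_ROLES = {
    "data_scientist": {"Reader", "Storage Blob Data Reader"},
    "sre": {"Reader", "Contributor"},
}


@dataclass(frozen=True)
class Assignment:
    user: str
    job_role: str
    rbac_role: str


def find_excessive_access(assignments):
    """Return assignments whose RBAC role is outside the baseline for the user's job role.

    Users with an unknown job role have an empty baseline, so every role
    they hold is flagged for manual review.
    """
    findings = []
    for assignment in assignments:
        allowed = ALLOWED_ROLES.get(assignment.job_role, set())
        if assignment.rbac_role not in allowed:
            findings.append(assignment)
    return findings
```

The flagged assignments would then feed the remediation step (removing or downgrading the role), which is left out of this sketch.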

Outcome :

Reduction in Security Incidents : Fewer security incidents related to unauthorized access: 100% reduction in Major Security Incident tickets and 75% reduction in Minor Security Incident tickets.
Compliance Adherence : 100% reduction in Major and Minor NC (non-conformities) after the ISO 27001 audit. 80% reduction in NC related to identity and access management after the internal audit.
Access Review Completion Rates : 100% of access reviews completed each month. High completion rates demonstrate effective monitoring and auditing.
Time to Provision and Deprovision Users : Time taken to onboard and offboard users reduced from a week to 1 day.
Azure Identity Score : Increased identity Secure Score by 22% compared to FY 2022.

Cloud SRE - Audit Management

External Audit

Goal : ISO 27001 - Information Security Management Systems. Ensure controls are well documented, communicated, and regularly reviewed and updated to ensure they remain relevant and effective in addressing emerging threats and business requirements.

How was the Goal Achieved :
1. Reviewed the controls on a monthly basis and collected relevant evidence.
2. Ensured the internal monthly audit report had no major NC, as preparation for the external audit.
3. Conducted external and internal audits and submitted all required evidence with no minor or major NCs.
4. Focused on improving the process to manage risk (CR process), enhancing security controls to protect the cloud platform, and monitoring the effectiveness of those controls.
5. Permit to Prod process to govern release change management.
6. Ensured that automation solutions adhere to compliance standards and governance policies.

Outcome :

No Major or Minor Non Conformities. Few observations to be closed. 100% improvement in NC and 50% in Observations.

Internal Audit

Goal : Secure Cloud operations by assessing risk, detecting and responding to security incidents, and ensuring compliance.

How was the Goal Achieved :
1. Improved security posture by providing least-privilege access using Azure RBAC for clusters and IAM for cloud resources.
2. Reduced risk by decommissioning local Unix users and using Azure AD (AAD) login for VMs, so that users log in with Azure AD credentials instead of traditional username/password combinations. This enhances security and simplifies user management.
3. Improved efficiency by automating critical tasks such as creating users on VMs or databases.
4. Reduced costs by limiting the impact of security incidents on the cloud platform.
5. Improved cloud security by using Privileged Identity Management for just-in-time access.
6. Enhanced identity governance through monthly review of the user base and the access provided to users.

Outcome :

No Major or Minor Non Conformities. Few observations to be closed. 100% improvement in NC and 50% in Observations.

Cloud SRE - Cloud Cost Management

Goal : Manage costs on the Microsoft Azure cloud platform to stay within budget and optimize cloud spending.

How was the Goal Achieved :
1. Cost tracking and analysis on a fortnightly basis with clearly defined action items.
2. Resource tagging to categorize resources and allocate costs to specific Lines of Business.
3. Azure Policy to enforce defined cost management rules.
4. Regular review and resizing of resources based on Azure Advisor recommendations.
5. Purchasing Azure Reserved Instances (RIs) for predictable workloads, which offer significant discounts.
6. Isolating Prod and Non-Prod subscriptions to control usage in lower environments.
7. Automated cleanup and decommissioning of unused resources like disks and snapshots.
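
The automated cleanup in step 7 comes down to selecting resources that are safe to remove. The sketch below is one way to express that selection for disks. It assumes a hypothetical inventory of plain dictionaries with the keys name, attached_to, last_detached and tags, plus a hypothetical "keep" tag as an opt-out. The 30-day idle threshold is also an assumption, not a figure from the actual process.

```python
from datetime import datetime, timedelta, timezone


def cleanup_candidates(disks, now, min_idle_days=30):
    """Return names of disks that are unattached, idle long enough, and not tagged keep=true.

    `disks` is a hypothetical inventory: a list of dicts with the keys
    name, attached_to, last_detached (datetime or None) and optional tags.
    """
    cutoff = now - timedelta(days=min_idle_days)
    candidates = []
    for disk in disks:
        if disk["attached_to"] is not None:
            continue  # still in use by a VM
        if disk.get("tags", {}).get("keep") == "true":
            continue  # owner opted out of automated cleanup
        if disk["last_detached"] is not None and disk["last_detached"] <= cutoff:
            candidates.append(disk["name"])
    return candidates


if __name__ == "__main__":
    now = datetime(2023, 9, 1, tzinfo=timezone.utc)
    inventory = [
        {"name": "disk-old", "attached_to": None,
         "last_detached": datetime(2023, 6, 1, tzinfo=timezone.utc)},
        {"name": "disk-live", "attached_to": "vm-1", "last_detached": None},
    ]
    print(cleanup_candidates(inventory, now))
```

Returning only a candidate list, rather than deleting directly, keeps a review or change-request step between detection and decommissioning.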

Outcome :

25% cost reduction per day during the first major iteration in December 2022 and 12% cost reduction per day during the second major iteration in September 2023.

Site Reliability Engineer Experience Details

Supported Equities Risk Platform Applications
1. A real-time application, used by traders to manage risk by tracking exposure and P&L (profit and loss).
2. Quantitative Library Service, which encapsulates the complexity of invoking QLIB functionality, with efficient job distribution to a large pool of compute backbone, while maintaining a real-time data grid of QLIB instrument, market and results data.

Projects
1. Blue/Green Deployment for real-time application
Plan : Due to issues faced after a major release in Quarter 1, and considering the real-time nature of the application, the Blue/Green deployment pattern was adopted to achieve reliability.

How was the aim Achieved
The initial critical step was to create two instances of the schedules, one each for the Prod and Prod-Pilot instances. Next, the capacity of the application across data centers was reviewed and a new virtual host was added. Later, config changes were made to accommodate two instances of the stack and to use Automation as a Service to coordinate the flip between BLUE and GREEN instances.

Outcome : Apart from the standard advantages of being simple, fast, well understood, and easy to implement, the main benefit was straightforward rollback: in case of any issues, traffic is simply flipped back to the old environment.
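
The flip and rollback described above can be sketched in a few lines. This is an illustrative model only: the BlueGreenRouter class and the health_check callback are hypothetical names, not the firm's Automation as a Service tooling. The idea shown is that traffic moves to the idle instance only if that instance passes a health check, and rollback is the same flip in the other direction.

```python
class BlueGreenRouter:
    """Tracks which of two identical stacks ("blue" or "green") receives live traffic."""

    def __init__(self, live="blue"):
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        self.live = live

    @property
    def idle(self):
        """The stack that is currently not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def flip(self, health_check):
        """Switch traffic to the idle stack if health_check(stack_name) returns True.

        Returns True when the flip happened, False when the idle stack
        failed its check and traffic stayed where it was.
        """
        target = self.idle
        if not health_check(target):
            return False
        self.live = target
        return True


if __name__ == "__main__":
    router = BlueGreenRouter(live="blue")
    router.flip(lambda stack: True)   # release: blue -> green
    router.flip(lambda stack: True)   # rollback: green -> blue
    print(router.live)
```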

2. Cyber security Transformation Program : Technology Privileged Access

Plan : This project came directly from top management and hence had the highest priority and attention during implementation. The aim was to use control solutions that involve "vaulting" privileged access credentials and using discovery processes to ensure control coverage, and to adopt tools that eliminate human knowledge of passwords for privileged accounts, enabling secure and auditable privileged access.

How was the aim Achieved
The challenge was to understand the new tools onboarded in the firm's tech storefront. Attended several "parties" (short, week-long programs) to understand the technical specs and learn how to onboard/configure each application based on its current architecture. Kerberos was used as the underlying network protocol to authenticate service requests between hosts, using symmetric encryption and a trusted third party, known as the Key Distribution Center (KDC), to authenticate users to a suite of network services. Once a user authenticates to the KDC, it sends a ticket specific to that session back to the user's machine, and any Kerberized service looks for the ticket on the user's machine rather than asking the user to authenticate with a password.
Once the basics were understood, a solution fitting all 3 environments (Prod, UAT and DEV) had to be designed. This was more challenging because some application hosts were shared between applications; hence more specific use cases had to be designed in discussion with other application and database teams.
Frequent communication and iterative releases were key over a period of 2-3 months to implement the changes with proper post-release checks, especially over the weekend.

Outcome : In the first quarter, lower-environment services covering 25 environments and around 100 application functional accounts were migrated.
In the second quarter, production-environment services covering 3 regions and about 20 accounts were migrated.
Most of the changes were verified through automated post-release checks, resulting in zero production incidents.

Highly Appreciated by senior management

3. Reduce Intra Day Toil

Plan : Identified 2 key scenarios for reducing toil, presented the solutions to the application owner and got them implemented. One concerned intra-day bounces of multiple components and the other Gemfire cache archiving of intra-day data.

How was the aim Achieved
When an application depends on an upstream application service being available, performance issues in the upstream application become challenging because both applications have to be bounced. Using Automation as a Service, a pipeline was defined that bounces both the upstream and downstream applications with a single click.
In the other scenario, if a user loaded too many objects/data into the Gemfire cache, an alert sent an email to the user, and a script was later invoked via Automation as a Service to archive or purge the unwanted data.

Outcome : Production management overhead of bouncing multiple applications was reduced, saving at least 15 minutes of manual effort.

4. Compute backbone monitoring set-up for dynamic compute allocation

Plan : Compute backbone is a high-performance computing grid used for risk analysis. Overnight batches use it for processing risk calculations. Dynamic profiling was enabled to use the compute available on the grid based on the context of the day, for example one profile for overnight and a different one for intra-day, depending on the consumer. To accommodate this, production monitoring had to be enhanced.

How was the aim Achieved
The plan was to invoke the Compute backbone API to get the resource plan, an XML document for a specific context defined with a 24-hour schedule; automate the extraction of the resource plan; and then extract the consumer-specific compute expected for a time slot. Schedule-driven alerting was added, notifying the Prod Management and SRE teams via a pop-up alert and email.
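
The extraction step can be illustrated with a short sketch. The real resource plan schema is not described here, so the XML layout below (a resourcePlan root holding slot elements with start/end times and consumer children with a cores attribute) is purely an assumption, as are the function names and the consumer name used in the sample. The sketch reads the plan, finds the slot covering a given time, and reports a mismatch when the compute actually allocated is below what the plan expects.

```python
import xml.etree.ElementTree as ET
from datetime import time

# Hypothetical resource plan; the real schema is not known.
SAMPLE_PLAN = """
<resourcePlan context="weekday">
  <slot start="00:00" end="07:00">
    <consumer name="overnight-batch" cores="4000"/>
    <consumer name="intraday-risk" cores="500"/>
  </slot>
  <slot start="07:00" end="23:59">
    <consumer name="overnight-batch" cores="500"/>
    <consumer name="intraday-risk" cores="1500"/>
  </slot>
</resourcePlan>
"""


def expected_cores(plan_xml, consumer, at):
    """Return the cores planned for `consumer` in the slot covering time `at`, or None."""
    root = ET.fromstring(plan_xml)
    for slot in root.findall("slot"):
        start = time.fromisoformat(slot.get("start"))
        end = time.fromisoformat(slot.get("end"))
        if start <= at < end:
            for entry in slot.findall("consumer"):
                if entry.get("name") == consumer:
                    return int(entry.get("cores"))
    return None


def check_allocation(plan_xml, consumer, at, actual_cores):
    """Return an alert message if allocation is missing or below plan, else None."""
    expected = expected_cores(plan_xml, consumer, at)
    if expected is None:
        return f"No plan entry for {consumer} at {at.isoformat()}"
    if actual_cores < expected:
        return f"{consumer}: expected {expected} cores at {at.isoformat()}, found {actual_cores}"
    return None


if __name__ == "__main__":
    print(check_allocation(SAMPLE_PLAN, "intraday-risk", time(10, 0), 1000))
```

In a schedule-driven setup, a non-empty message from check_allocation would be what triggers the pop-up alert and email.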

Outcome : Before this monitoring was set up, approx 10 incidents were observed over a period of one month, impacting start-of-day risk processing.
After the setup, 2 incidents were raised per month (an 80% reduction) for a mismatch in the compute expected to run the intra-day load, due to a bug in the dynamic profile loading script, which was fixed in later iterations.

Highly Appreciated by senior management

5. Standardization of Site Resiliency /Disaster Recovery execution Via single click end to end run

Plan : Site Resiliency and Disaster Recovery were key events carried out in defined periods over the year. Presented a plan to the application owner and production management to automate application start-up and shutdown using a "push-button" approach.

How was the aim Achieved
Explored various built-in tools in the firm that allow automating application failover across data centers.
Automation as a Service was used to automate the Domain Name System disaster recovery pairs.
Later, the distributed scheduling tool Control-M was used to stop and start the application using a header as the push button. Post-event checks were also automated using Control-M to define the correct workflow.
Info shares were conducted around the globe across different production management teams to onboard them to this new automated process.

Outcome : Standardization of QLS SR/DR execution reduced the end-to-end run time from 3 hours to 1.5 hours. Production Management was able to carry out major events without AD presence thanks to automation of the start and stop process, including post-event checks.
Responsibilities:
Release management
Scrum management (Manage project dashboard to show case progress)
Co-ordinate between PM team, QA team and AD to ensure that there is a clear understanding of changes introduced to the ecosystem.
Ensure appropriate monitoring in place.
Ensure that the appropriate change documentation is complete.
Ensure (automated) deployment of the software to production and handle events as they arise.
For every event, alert, or incident, conduct root cause analysis and develop code/scripts to eliminate recurrence or establish a "no-touch" mitigation for potential recurrence.
Gemfire Distributed caching Enhancement
Lower environment management / Envs Management and Release Automation
Lower environment and Production new infrastructure set-up / GEMFIRE host migration
Incident management
Pricing Failure Analysis for business
Conduct Series of Info share sessions for Kerberos implementation

Highly Appreciated by Production Management Lead and Application Owner

Associate Site Reliability Engineer Experience Details

Supported Equities Risk Platform Applications
1. Quantitative Library Service, which encapsulates the complexity of invoking QLIB functionality, with efficient job distribution to a large pool of compute backbone, while maintaining a real-time data grid of QLIB instrument, market and results data.

Projects:
1. Stabilize Weekend and weekdays batches

Plan: To achieve the aim, created the plan below and set achievable expectations with key stakeholders, including downstream applications that consume the risk processed by the batch.
In Q1, stabilize highly impacted and critical AMER region Weekend/Weekday Batch
In Q2, stabilize critical and huge White friars EMEA batches
In Q3, split critical huge AMER batches to stagger the workload on Quantitative Library Service

How was the aim achieved:
Analyze compute contention on weekends by monitoring the performance real-time.
Review Database capacity and weekend Database maintenance schedule with Database Engineer.
Implement effective batch staggering, including amending the schedule strategy, to avoid service component crashes over the weekend.
Provided consistent weekly progress updates to stakeholders to keep them in the loop on the effort made.

Outcome:
After the above analysis and review, gained 11 hours of additional batch run time on the Saturday run, which was highly significant.
It also reduced prod management manual effort from multiple hours to zero manual actions while supporting the weekend batches, a huge resource gain.
Zero batch-related incidents due to SLA breach for AMER in Q1 and EMEA in Q2.

Highly appreciated by Business leads since risk was delivered before SLA.
Also appreciated by Production Management since a lot of manual effort was saved.

2. Improve Quantitative Library Service Gemfire Cache capacity monitoring

Plan : To achieve the aim, created the plan below and set achievable expectations with Production Management and the Application Owner.
Week 1 - Review why we have so many L3 calls for cache capacity related queries
Week 2 - Update existing monitoring set-up to cater to increased business data in-flow and categorize the usage.
Week 3 - Conduct management info share to define new rules in maintaining data and provide details on increased usage by business.
Week 4 - Automate data archiving process with adequate notification on daily basis to respective business application teams.

How was the aim achieved:
Initial analysis of the current monitoring set-up, together with information gathered from queries raised by Production Management, gave us the scope of work.
Once the scope of work was defined, the monitoring update plus automation of the data archiving process solved the issues.

Outcome:
In the first month after implementation, a 50% drop was observed in prod support tickets raised to App L3 for cache capacity, a big saving in developer and PM man hours.

3. Application Production Incidents Dashboard

Plan : To reflect current state in production, decided to create a dashboard and reflect monthly incidents caused by the application. Presented this idea to Application owner and once acknowledged, sent first draft view of the Dashboard to senior management.

How was the aim Achieved
Explored tools available in the firm to determine whether a new one was needed or an existing one could be reused. Found a built-in Business Intelligence analytics and visualization web application that allows users to present data from different data sources, one of which was the Incident management system. The next step was to extract the required data and shape it for presentation in two categories: 1. component level and 2. context driven, e.g. weekday vs weekend incidents.

Outcome : As this dashboard consistently provided a holistic view of Application Production Incidents to Leads, it was one of the key sources for prioritizing further development and production management work group pipelines. A 25% reduction in weekend incidents was achieved by fixing basic problems around scheduling and simple bugs.

Highly Appreciated by Senior management.

4. Application Production Monitoring Dashboard

Plan : Taking a leaf out of the Production Incidents Dashboard, decided to create a dashboard reflecting monthly alerts generated by the application. Presented this idea to the Application owner and, once acknowledged, sent a first draft view of the dashboard to production senior management. Further decided to review the stats and amend the alerting to reduce noise iteratively, one monthly iteration at a time.

How was the aim Achieved
Explored the built-in Business Intelligence analytics and visualization web application, which allows users to present data from different data sources, one of which was the Alerting management system. After training and an info share from the web application team, created a dashboard and enabled the subscription mechanism. The subscription feature sent a weekly email of the dashboard reflecting the alerts generated in the previous week. This was used to review alerts, fix the noise and update the alerting rules iteratively over periods of 2 weeks.

Outcome : With the dashboard, everyone had a single source of stats, making it easy to align resources to fix the alerts in line with the latest business needs. Over 2 months (4 sprints of 15 days each), monthly alerts were reduced by 35% in the first month and 15% in the second month.

Highly Appreciated by Production Management Lead and Application Owner

Responsibilities:

Managed critical incidents and owned them end to end.
Asserted ideas and contributed significantly to the diverse and collaborative working culture by consistently performing the role of Scrum Master.
As a release manager, ensured releases were effectively and efficiently managed.
Followed the ITSM workflow with no unauthorized ITSM and ensured various PTO calls were organized with Level 1 and Level 2 teams.
Regularly sent notifications to clients for critical deliverables.
Drove production stability delivery to Production, working with global teams.
Performed the Scrum Master role to improve collaboration and coordination within distributed teams.
Supported critical incidents end to end, and trained and empowered PM to make better decisions, especially on weekends, thus reducing AD developer support hours by 50%.
Demonstrated understanding of Control-M/Geneos tools to enable an effective and efficient support model for other applications.
Managed communication well with stakeholders.
Tracked releases and communicated with clients.
Removed silos to improve collaboration and coordination within distributed teams.

Highly Appreciated by Production Management Lead and Application Owner

Production Management Lead, JP Morgan Chase -India

Role: Production Support Analyst (Oct 2016 – Dec 2017)

Project:
CIB Equities risk and trading application support.

Responsibilities:

Communicates clearly and articulates messages critical to the team – like weekend events ensuring proper support coverage if needed.
Keeping team up to date with weekend events and ensuring AD coverage is in place for weekends.
Provide two tier support to all applications and assist all end users to identify any issues in production.
Monitor all performance metrics for various production systems and identify root cause for all technical issues and recommend solutions.
Maintain effective relationships with various system administrators and development teams.
Participate in periodic meetings, maintain all applications for production, and plan appropriate strategies.
Support the key departmental specific line-of-business applications including regular updates which will involve some out-of-hours work
Take ownership of user problems and be proactive when dealing with user issues.
Performed disaster recovery tests to ensure system availability and functionality
Assisted in streamlining processes to reduce daily workload and make it more efficient.

Technologies : Geneos Monitoring Tool, Control-M Scheduling tool and Unix Scripting

Production Management , JP Morgan Chase - United Kingdom

Role: Production Support Analyst (July 2014 – Sept 2016)

Project: CIB Equities risk and trading application support.

Responsibilities:

Provide two tier support to all applications and assist all end users to identify any issues in production.

Monitor all performance metrics for various production systems and identify root cause for all technical issues and recommend solutions.

Maintain effective relationships with various system administrators and development teams.

Participate in periodic meetings, maintain all applications for production, and plan appropriate strategies.

Support the key departmental specific line-of-business applications including regular updates which will involve some out-of-hours work.

Take ownership of user problems and be proactive when dealing with user issues.

Performed disaster recovery tests to ensure system availability and functionality.

Assisted in streamlining processes to reduce daily workload and make it more efficient.

Technologies : Geneos Monitoring Tool, Control-M Scheduling tool and Unix Scripting.

Onshore Team Lead, Accenture - United Kingdom

Role: Onshore Team Lead in UK (April 2011 - Jun 2014)
Client : JP Morgan Chase, UK
Project:
Pyramid Position Services (PPS) - a middleware application for trading services, built on the Gemfire Caching System to enhance performance. PPS uses the Gemfire Enterprise application (a high-performance distributed caching application) to store its trade positions instead of a traditional DB. Hence positions are retrieved at a much faster rate (high throughput) and with low latency.
Responsibilities:
Propose improvement to the existing build and deploy framework. Carry out SIT and UAT regression testing effectively.
Automate Release Build Process. Configure Scheduling Tool. Enhancements and defects fixing.
Manage and co-ordinate production releases. Plan the release iterations and define scope within the team.
Streambase the position data and produce the reconciled data on live view server.
Parsing log files using Logstash and plotting the results using ggplot.
Expand and establish relationships with UAT Env and L2 Operate Team to facilitate deliverables on time.
Provide Out of Office Support as per SLA and efficiently manage Disaster Recovery infrastructure checks for Production.
Technologies : Core Java, Gemfire Caching, Jenkins, XML, XSLT, Maven, Streambase live view, Logstash, ggplot, Control-M Scheduling tool, Restful Web Service and Jython Scripting.

Offshore Team Lead, Accenture - India

Role: Offshore Team Lead (Feb 2010 - Mar 2011)
Projects:
Pyramid Exception Management - UI developed for operate team to monitor exceptions generated by trading systems to process trade on daily basis.
PaymentNet4 - web-based administration and reporting application supporting commercial card clients with card services for all types of expenditures.
Responsibilities :
Design and develop the Database Access Layer and Service Layer. Develop test cases for the complete application. Planning and estimation of the modules developed offshore. Performance-tested the GWT UI to accommodate 500 records using local and remote pagination.
Daily status reporting to client and onshore team. Ensured that the QA and UAT cycles were executed as per schedule, meeting the defined SLA.
Technologies : GWT, Jetty Server, Power Mock, Spring Batch and Spring Web Service.

Accenture, Mumbai
Role: Module Lead (Oct 2008 - Jan 2010)
Projects:
ATS IM UI - The Access Target State (ATS) Information Management (IM) User Interface provided user screens to access the IM application.
User Account Setup (UAS) - application that enables users to onboard traders. Since the onboarding and offboarding process was moved to this application, the legacy data had to be migrated to the new database.
IX SWIFT - designed to facilitate transfer of data from both external and internal client applications to JPMC systems. IX was capable of understanding and converting data received in a variety of formats into a standardized representation.
Responsibilities :
Created the Detailed Design Document and Unit Test Plan. XML verification against the designed XSD for all the developed services.
Provided an approach to migrate data from the legacy system to the UAS system based on code reusability. Guided and assisted the team in carrying out the migration task.
Developed a Message Adaptor Framework (“MAF”) using Mule that enables straight-through processing (STP) of data in various formats for various clients, making it more scalable and flexible.

Technologies : Spring Web Service, Struts, iBATIS, Apache ActiveMQ and WebSphere MQ, Java Web Mail and Mule.

Java Developer, Cognizant - Pune

Role: Java Developer (June 2007 – Sep 2008)
Clients: Putnam Investments, Wellington Management Company and Fidelity Japan.
Projects:
Technical Support Services – included various applications distributed across three streams for Putnam. One stream was financial engineering, which consisted of 12 applications running Autosys jobs on a Unix box. These jobs fetched data from upstream, processed it and made it available to downstream applications and data users through a UI.
Cash Flow Systems – was developed for Wellington Management Company to manage client flows of various types based on business criteria. CFS sent flow information to the upstream application using web services. Three web services were developed: a. to manage flows, b. to manage accounts and c. to search for flow details based on flow Id.
CashBack System – was designed as a stand-alone system for Fidelity users to handle the Cash back activities to provide monetary benefits based on business criteria. The Cashback system had a scheduled job to import data from CCM (Original) database into Cashback database (New).
Responsibilities :
As a team lead, managed 6 applications with a team strength of 5 members. Interacted with the customer to update daily activities and cycle status.
Developed all three web services for the CFS application and tested the end-to-end functionality using a web client.
Developed user management functionality using Struts and EJB. Handled additional responsibilities of deploying the application on a remote location server throughout the project cycle.
Technologies : Core Java/J2EE, Struts, EJB, XML, JAXB, SOAP based Web Services and UNIX.

Junior Java Developer, Tata Consultancy Services - India

Role: Junior Java Developer (Jan 2005 – May 2007)
Client: Nortel Networks.
Projects:
CACP 15.0 Enhancements (AABS Phase II) - Enhancement of CACP tool to provide functionality added by Adaptive Antenna Beam Selection Phase II Enabler feature.
CACP 14.0 Maintenance - CACP (CDMA Access Configuration and Provisioning) is a single application that allows the user to perform a wide array of configuration activities for the BSC & CBRS. CACP is based on a client/server model.
Base Sub Systems Manager Commissioning and Upgrade - Indian Lab Operations (ILO) is a group responsible for support and maintenance of Nortel Networks CDMA devices. ILO looks after CDMA technology integration, MTX commissioning, BSC commissioning, BTS commissioning and BSSM installation.
Responsibilities:
Interact with end users to understand feature requirements and ensure compliance as per the requirement. Documenting – FS (Functional Specification), HLD (High Level Design), DD (Detailed Design), DTP (Design Test Phase).
Involved in the sustenance of the product to resolve issues reported by the end users of the tool. Provide training on the tool to the end users.
Administration of Sun Solaris Systems (Netra 440/1280/240 and E450). Integration of BTS, BSC and MTX for CDMA/GSM technology.
Technologies : Core Java, Swing and UNIX Scripting.

Network Engineer, Patni Computers - India

Role: Network Engineer (Jan 2004 – Dec 2004)
Client: Indian Airlines and Indian Aluminum.
Project:
LAN, WAN and SUN SOLARIS Systems maintenance - Supporting WAN and LAN network at Indian Airlines site – Mumbai.
Responsibilities :
Maintenance of Dial-Up and Lease line.

Maintenance of Sun Solaris servers (E-450).

Commissioning of Sun Solaris servers.

Technologies : Unix and Sun Solaris Systems.

Raghavendra K
QA Manager (Data Science & AI Platform Governance Platform) at Corridor Platforms Pvt Ltd
Krishna Meenon S
Head, Data Platform Engineering & SRE at Standard Chartered Bank
Sunil Kumar Gosai
Cloud Delivery Lead / Infrastructure/Solutions Architect (Cloud Platform Ops / SRE) at Hitachi Vantara LLC
