Annapurna Uma Gayathri

Bangalore

Summary

Certified Data Engineering, Management, and Governance Specialist with 8+ years of experience designing and implementing large-scale data pipelines, cloud migrations, real-time data processing, and automation solutions. Proven expertise in Google Cloud Platform (GCP), BigQuery, Apache Airflow (Cloud Composer), Terraform, and Python. Adept at working with modern data integration tools such as Debezium CDC, Kafka, and Pub/Sub, and at orchestrating robust ETL/ELT workflows using SQL, Cloud Functions, and CI/CD pipelines (Azure DevOps and GitHub Actions). Successfully delivered projects across diverse domains including finance, retail, cybersecurity, and enterprise asset management.

Skilled in performance optimization, data modeling, compliance automation, and conversational AI integrations using AgentSpace AI. Strong foundation in Hadoop ecosystem, with prior experience in Cloudera/Hortonworks administration, Hive, NiFi, Sqoop, and Pig.


  • Maintained and monitored Hadoop clusters (CDH 5.x), managed HDFS partitions, and executed data transfer jobs using Sqoop and Pig.
  • Automated data ingestion using Apache NiFi with formats like Avro, Parquet, and ORC into Hive for the World Bank Group's data lake project.
  • Administered Hadoop clusters for VISA and Mercedes Benz, handling user onboarding, Kerberos configuration, and SLA-based monitoring.
  • Developed Shell scripts and XML configurations to automate Control-M job deployments and integrate scheduling logic.
  • Created Hive external tables with advanced partitioning and bucketing strategies for optimized querying and reporting.
  • Leveraged Terraform, GitHub Actions, and Cloud Shell for deploying Compute Engine and KMS setups in a secure and scalable way.
  • Collaborated with cross-functional teams including DevOps, compliance, and security to enforce data governance policies and ensure audit readiness.

Overview

  • 9 years of professional experience
  • 1 Certification

Work History

Data & Analytics Specialist | Real-Time Pipelines

Accenture
11.2024 - Current
  • Designed and managed scalable ETL/CDC pipelines using Python and Debezium CDC from PostgreSQL to BigQuery in GCP for real-time data streaming.
  • Migrated batch data pipelines from AWS to GCP, optimizing performance using BigQuery partitioning, clustering, and STRUCT data types.
  • Developed DAGs using Cloud Composer (Apache Airflow) to orchestrate and monitor data workflows from raw to insights layers in BigQuery.
  • Implemented CI/CD pipelines using Azure DevOps, integrating with GCP Cloud Functions for automated deployment and data processing.
  • Built and deployed real-time Purchase Order chatbot using AgentSpace AI, integrated with BigQuery and GCP Cloud Functions for dynamic query handling.
  • Trained NLP chatbot models to interpret natural language queries like “What’s the status of PO123?” and generate SQL against structured purchase order data.
  • Developed custom CDC logic with deduplication and snapshot comparison to ensure data freshness and integrity in BigQuery (see the sketch after this list).
  • Automated VPN compliance audits using Prisma Cloud, Microsoft Graph APIs, and Python scripts to validate Conditional Access Policies.
  • Created JSON-based compliance reports and pushed them securely to AWS S3 via authenticated cloud credentials.
  • Converted legacy Control-M XML job definitions to Cloud Composer DAGs using custom-built Python scripts and Janus Converter logic.
  • Built reusable and modular DAG templates using Python, deployed to GCS, and managed Airflow triggers for cross-environment orchestration.
  • Migrated on-premise MySQL, Oracle, and Hive workloads to GCP using Bash scripting, Terraform, and Airflow, supporting hybrid data flows.
  • Integrated Hive, HQL, and Sqoop job logic into GCP with caching, using DataProcOperator and custom transformations in Airflow.
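
A minimal sketch of the deduplication and merge step described above, assuming Debezium-style change events have landed in a BigQuery staging table; the project ID, dataset, table, and column names (po_id, status, op, event_ts) are hypothetical placeholders rather than the project's actual schema.

    # Hypothetical example: apply the latest change per business key from a staging
    # table of CDC events into the serving table, honoring deletes.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

    MERGE_SQL = """
    MERGE `analytics.purchase_orders` AS tgt
    USING (
      SELECT * EXCEPT(rn) FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY po_id ORDER BY event_ts DESC) AS rn
        FROM `staging.raw_po_events`
      )
      WHERE rn = 1  -- keep only the newest event per purchase order
    ) AS src
    ON tgt.po_id = src.po_id
    WHEN MATCHED AND src.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET status = src.status, updated_at = src.event_ts
    WHEN NOT MATCHED AND src.op != 'd' THEN
      INSERT (po_id, status, updated_at) VALUES (src.po_id, src.status, src.event_ts)
    """

    client.query(MERGE_SQL).result()  # run the MERGE and wait for completion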

Data Engineering, Management and Governance Specialist

Accenture
11.2024 - Current
  • Project Involvement: Cyberfest and ISD Applications
  • Event Management Platform: Cyberfest is an event management system where participants register for various events, each managed by a facilitator and co-facilitator, with a minimum of five participants per event.
  • Data Storage & Streaming: Event data is stored in PostgreSQL and streamed using an IMT Connector (Debezium CDC) with Python-based producer and consumer clients.
  • Scheduler Migration: The scheduler, initially on AWS, was migrated to Google Cloud Platform (GCP) to push data into BigQuery.
  • BigQuery Optimization: Implemented Change Data Capture (CDC) logic in BigQuery to track inserts, updates, and deletes, ensuring only the latest records are retained. Utilized STRUCTs, partitioning, and clustering for performance optimization.
  • Data Pipeline Orchestration: Developed a Cloud Composer DAG (Python) to manage the data pipeline from GCP Raw to Insights tables. The scheduler triggers the DAG, processes records, and purges temporary data post-processing.
  • CI/CD Integration: All deployment scripts and CI/CD pipelines are managed via Azure DevOps, with integration to BigQuery through Google Cloud Functions.
  • ISD Application
  • Asset Management: All application assets are stored in PostgreSQL, with multiple ETL tools processing and loading data.
  • Real-Time CDC Streaming: Used an IMT connector (Debezium CDC) to capture real-time changes (insert, update, delete) from PostgreSQL and stream them to Google BigQuery.
  • Custom CDC Logic: Developed custom Python CDC logic, integrated within Azure DevOps pipelines, scheduled at 5-minute, hourly, and weekly intervals. This ensures only the latest, non-duplicate records are captured for accurate reporting in BigQuery.
  • VPN Managed Access Compliance Automation
  • Authentication: Authenticates with Prisma Cloud and Microsoft Graph APIs.
  • Configuration Retrieval: Retrieves VPN configurations from Azure and AWS.
  • Compliance Checks: Checks each VPN for required Azure AD attributes and Conditional Access policies.
  • Evaluation: Marks each VPN as “Compliant” or “Not Compliant” with managed access requirements.
  • Reporting: Outputs compliance results to a JSON file.
  • Cloud Integration: Uploads the compliance report to an AWS S3 bucket using environment-provided credentials (see the first sketch after this list).
  • Real-Time Purchase Order Chatbot using AgentSpace AI (a Cloud Function sketch also follows this list)
  • Tech Stack: BigQuery, GCP Cloud Functions, Python, AgentSpace AI, SQL, REST APIs
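
A minimal sketch of the reporting and upload step of the VPN compliance automation described above; the bucket environment variable, report key, and result structure are assumptions for illustration, and the Prisma Cloud and Microsoft Graph evaluation logic is omitted.

    # Hypothetical example: write compliance results to JSON and push them to S3
    # using credentials provided by the environment (AWS_ACCESS_KEY_ID, etc.).
    import datetime
    import json
    import os

    import boto3

    results = [
        {"vpn": "corp-vpn-eu", "compliant": True},    # placeholder evaluation output
        {"vpn": "corp-vpn-us", "compliant": False},
    ]

    report = {
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "results": results,
    }
    report_key = f"vpn-compliance/{datetime.date.today()}.json"

    s3 = boto3.client("s3")  # picks up credentials from the environment
    s3.put_object(
        Bucket=os.environ["COMPLIANCE_BUCKET"],  # assumed bucket variable name
        Key=report_key,
        Body=json.dumps(report, indent=2).encode("utf-8"),
    )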
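
A hedged sketch of how a Cloud Function behind the chatbot could answer a purchase-order lookup against BigQuery; the table, columns, and request shape are assumptions, and the AgentSpace AI conversational layer is not shown.

    # Hypothetical HTTP Cloud Function: expects JSON like {"po_id": "PO123"} and
    # returns a short natural-language answer built from a parameterised query.
    import functions_framework
    from google.cloud import bigquery

    client = bigquery.Client()

    @functions_framework.http
    def po_status(request):
        po_id = (request.get_json(silent=True) or {}).get("po_id")
        if not po_id:
            return {"error": "po_id is required"}, 400

        job = client.query(
            "SELECT status, updated_at FROM `analytics.purchase_orders` "
            "WHERE po_id = @po_id",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[bigquery.ScalarQueryParameter("po_id", "STRING", po_id)]
            ),
        )
        rows = list(job.result())
        if not rows:
            return {"answer": f"No purchase order found for {po_id}"}, 404
        return {"answer": f"{po_id} is {rows[0].status} (updated {rows[0].updated_at})"}, 200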

Data Engineer

Accenture
04.2023 - 03.2024
  • Control-M to GCP Cloud Migration & Workflow Automation
  • Configured and scheduled batch jobs using Control-M/Control Scheduler to automate data workflows across environments.
  • Developed and maintained Shell scripts for job automation, data extraction, and system-level tasks.
  • Created and modified XML files to define job configurations and deployment parameters for Control Scheduler.
  • Automated deployment of Control Scheduler jobs using Shell scripts integrated with XML templates.
  • Managed job dependencies, conditions, and triggers through XML-based job definitions.
  • Troubleshot and resolved job failures by analyzing Control Scheduler logs and Shell script outputs.
  • Collaborated with DevOps/Operations teams to ensure seamless deployment of automation workflows and reduce manual intervention.
  • Migration to GCP Services:
  • Supported the migration from Control-M to Google Cloud services, including Google Cloud Composer (Airflow) and Cloud Functions.
  • Developed automation scripts such as the Janus Converter to transform Control-M XML job definitions into Airflow DAGs, enabling large-scale migration.
  • Built Compute Engine instances, managed KMS keys, and deployed Cloud Composer environments using Terraform and GitHub Actions.
  • Actively worked on retiring legacy applications from Control-M and providing migration solutions.
  • Provided solutions for various Airflow operators, including DataProcOperator, PythonOperator, BashOperator, EmailOperator, SQLOperator, and BigQueryOperator.
  • Developed reusable DAG templates in Python, deployed them to Google Cloud Storage, and triggered them in Apache Airflow (see the sketch after this list).
  • Hands-on experience in configuring and establishing connections to Hive for data extraction within Apache Airflow.
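
A minimal sketch of a reusable DAG template of the kind described above; the factory parameters, schedule, and task commands are illustrative placeholders rather than actual converted Control-M job definitions.

    # Hypothetical DAG factory: stamp out one Airflow DAG per migrated job definition.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def build_dag(dag_id: str, schedule: str, extract_cmd: str) -> DAG:
        """Build a simple extract-then-validate DAG from a handful of parameters."""
        with DAG(
            dag_id=dag_id,
            schedule_interval=schedule,
            start_date=datetime(2024, 1, 1),
            catchup=False,
        ) as dag:
            extract = BashOperator(task_id="extract", bash_command=extract_cmd)
            validate = PythonOperator(
                task_id="validate",
                python_callable=lambda: print("row counts OK"),  # placeholder check
            )
            extract >> validate
        return dag


    # One converted job; in practice the parameters would come from the translated XML.
    converted_job_dag = build_dag("converted_controlm_job", "0 2 * * *", "echo extract step")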

Application Developer Senior Analyst

Accenture
03.2021 - 04.2023
  • PROFESSIONAL EXPERIENCE 1: Sr. GCP Data Engineer (2022)
  • Client: Google | Sub-Client: Walmart
  • Project: Migration of On-premise Data to GCP using Apache Airflow (Walmart Phase 1 & 2)
  • Developed and orchestrated data workflows using Apache Airflow, leveraging Python for workflow authoring, scheduling, and monitoring.
  • Utilized a variety of Airflow operators (DataprocOperator, PythonOperator, BashOperator, EmailOperator, SQLOperator, BigQueryOperator) to automate ETL processes across MySQL, Oracle, Azure, AWS, and GCP.
  • Created Python scripts to read CSV files and deploy data into cloud storage as HQL, as well as copy data between storage locations.
  • Converted HQL, Sqoop, and BigQuery content into Yarn files and scheduled complex cron jobs in Airflow.
  • Designed and deployed reusable DAG templates in Python, stored in Google Cloud Storage, and triggered via Apache Airflow.
  • Configured and established Hive connections in Airflow for data extraction.
  • Successfully migrated Hive HQL, SQL, and Sqoop data into GCP storage using Airflow with caching mechanisms.
  • PROFESSIONAL EXPERIENCE 2: GCP Data Engineer (2021–2022)
  • Client: Google | Sub-Client: PayPal
  • Project: Migration of On-premise Data to GCP using Cloud Shell & Bash Scripts
  • Developed wrapper scripts to automate workload migration to Google Cloud, converting Hive and Spark-submit commands to cloud-compatible equivalents.
  • Managed deployment of dependent workloads and orchestrated ephemeral cluster creation, job submission, and cluster teardown using Cloud Shell.
  • Applied advanced shell scripting (including SED and AWK) for data transformation and troubleshooting.
  • Built and managed Dataflow pipelines for scalable data processing.
  • PROFESSIONAL EXPERIENCE 3: GCP Data Engineer (2021)
  • Client: Google | Sub-Client: Ford
  • Project: Change Data Capture & SCD in BigQuery, Data Flow from Kafka to BigQuery
  • Implemented Change Data Capture (CDC) and Slowly Changing Dimension (SCD1, SCD2) logic in BigQuery to track and manage historical data changes.
  • Utilized Terraform for basic infrastructure provisioning in the cloud console.
  • Developed Dataflow pipelines to stream data from Pub/Sub and Kafka into BigQuery (see the sketch after this list).
  • Maintained and queried historical datasets using CDC and SCD methodologies for robust data lineage and reporting.
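
A minimal Apache Beam sketch of the Pub/Sub-to-BigQuery streaming path referenced above; the subscription, output table, and message layout are assumed for illustration, and the target table is assumed to already exist.

    # Hypothetical streaming pipeline: read JSON messages from Pub/Sub and append
    # them to an existing BigQuery table (run with the Dataflow runner in practice).
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # add runner/project/region flags to deploy on Dataflow

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-gcp-project/subscriptions/orders-sub")
            | "Decode" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-gcp-project:warehouse.orders_raw",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )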

Technical Analyst

Infosys
03.2016 - 03.2021
  • Project Experience 1: Hadoop Developer – World Bank Group (Wave Project)
  • Developed ELT workflows using Apache NiFi, leveraging Avro, Parquet, and ORC formats for efficient data storage in HDFS.
  • Utilized SQL for high-performance reporting and database requests; strong experience with Jupyter notebooks and Python for data collection and visualization.
  • Automated ingestion of raw data from PostgreSQL and MongoDB into Hive databases.
  • Built Python scripts using pandas for data visualization and developed PII data handling for GDPR compliance.
  • Designed and managed multi-step workflows in NiFi, including Kafka streaming and JSON transformation before Hive ingestion.
  • Automated Dremio REST API interactions with Python scripts, including physical-to-virtual dataset conversion, tagging, and SQL job management.
  • Analyzed user requirements and developed automated database applications using SQL.
  • Conducted root cause analysis and performance tuning for SQL and database issues in both production and release environments.
  • Designed and created Hive external tables with partitioning and bucketing, using a shared Hive metastore for optimized storage (see the sketch after this list).
  • Implemented Apache Pig scripts for data loading into Hive and worked with various HDFS file and compression formats (Avro, Sequence File, Snappy, Gzip).
  • Provided business user support for reporting needs and managed data loading/querying via the Hue interface.
  • Project Experience 2: Hadoop Administrator – VISA International (CAE Hadoop Infra)
  • Upgraded CDH from version 5.5 to 5.8 and onboarded users/applications to Hadoop clusters.
  • Managed data export from Hadoop to relational databases and external file systems using Sqoop, HDFS get, and copyToLocal.
  • Processed raw data using MapReduce, Apache Pig, and Hive; developed Pig scripts for CDC and delta processing.
  • Coordinated firewall blueprints for connectivity, managed Cloudera support cases, and applied patches.
  • Monitored HDFS utilization, collaborated with QoS teams on server builds, and implemented DDL changes via change requests.
  • Maintained, monitored, and configured Hadoop clusters using Cloudera Manager (CDH5).
  • Analyzed logs for NameNode, ResourceManager, and other services; documented Solr admin dashboard usage.
  • Performed Hive queries for data analysis, managed data node commissioning/decommissioning, and set up Kerberos authentication.
  • Responsible for Hadoop system performance tuning, backup, and recovery.
  • Transferred data between RDBMS and HDFS using Sqoop; created and optimized Hive tables with static/dynamic partitions.
  • Installed, configured, expanded, and maintained Hadoop clusters, including hardware sizing and OS-level tuning.
  • Project Experience 3: Hadoop Administrator – Mercedes Benz (CISIM Tool Handling)
  • Managed cluster monitoring, node health, SLA, incident, and request management for on-premise and cloud Hadoop environments.
  • Experienced in cluster setup, configuration, and troubleshooting Linux system resources (memory, CPU, OS, storage, network).
  • Monitored multiple Hadoop clusters and worked extensively with Kerberos authentication.
  • Supported and managed Cloudera and Hortonworks Hadoop distributions, including installation and configuration of Hadoop, HBase, Hive, Pig, Flume, and Zeppelin.
  • Analyzed log files to identify and resolve Hadoop-related issues; managed Kerberos keytabs for user and daemon authentication.
  • Led data migration initiatives between different environments, ensuring secure and efficient transfers.
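
An illustrative example of the partitioned, bucketed Hive external table pattern mentioned above, submitted here through beeline; the database, columns, bucket count, and HDFS location are hypothetical placeholders.

    # Hypothetical example: create an external, partitioned, bucketed ORC table in Hive.
    import subprocess

    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS finance.loan_events (
      loan_id     STRING,
      amount      DECIMAL(18,2),
      event_type  STRING
    )
    PARTITIONED BY (event_date STRING)        -- prune scans by day
    CLUSTERED BY (loan_id) INTO 32 BUCKETS    -- bucket for join and sampling efficiency
    STORED AS ORC
    LOCATION '/data/lake/finance/loan_events'
    """

    # Placeholder HiveServer2 JDBC URL; -e submits the statement and exits.
    subprocess.run(
        ["beeline", "-u", "jdbc:hive2://hive-server.example.internal:10000", "-e", DDL],
        check=True,
    )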

Education

Master of Engineering - Cloud Computing

Vellore Institute of Technology
Bangalore
06.2015

Skills

  • Google Cloud Platform (GCP) – Cloud Composer, BigQuery, Cloud Functions, Compute Engine, Cloud Shell, KMS, GCS, Dataflow, Pub/Sub
  • Change Data Capture (CDC) – Using Debezium, custom Python logic, and BigQuery CDC logic for real-time inserts/updates/deletes
  • Apache Airflow – Designed and deployed reusable DAGs using Python; handled operators like BashOperator, PythonOperator, SQLOperator, BigQueryOperator, etc
  • Python Programming – Custom CDC scripts, REST API integration, chatbot logic, data validation, orchestration, and DevOps automation
  • Terraform – Infrastructure as code for provisioning GCP resources like Composer, Compute Engine, and KMS
  • BigQuery Optimization – STRUCTs, partitioning, clustering, and SCD logic for high performance and reduced cost (see the sketch after this list)
  • ETL/ELT Pipelines – End-to-end design using Shell, Python, NiFi, Airflow, Cloud Functions, and external connectors
  • Azure DevOps – CI/CD pipelines for automated build, deployment, and integration with GCP Cloud Functions and BigQuery
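
A hedged example of the STRUCT, partitioning, and clustering pattern listed above; the dataset, table, and fields are invented for illustration.

    # Hypothetical example: a date-partitioned, clustered table with a nested STRUCT column.
    from google.cloud import bigquery

    client = bigquery.Client()

    DDL = """
    CREATE TABLE IF NOT EXISTS analytics.purchase_orders (
      po_id      STRING,
      status     STRING,
      supplier   STRUCT<id STRING, name STRING, country STRING>,  -- nested attributes
      updated_at TIMESTAMP
    )
    PARTITION BY DATE(updated_at)   -- prunes scanned bytes (and cost) by date
    CLUSTER BY po_id, status        -- co-locates rows that are filtered together
    """

    client.query(DDL).result()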

Certification

  • Google Cloud Certified – Associate Cloud Engineer
  • Google Cloud Certified – Professional Data Engineer
  • Google Cloud Certified – Digital Leader
  • HashiCorp Certified – Terraform Associate

Timeline

Data & Analytics Specialist | Real-Time Pipelines

Accenture
11.2024 - Current

Data Engineering, Management and Governance Specialist

Accenture
11.2024 - Current

Data Engineer

Accenture
04.2023 - 03.2024

Application Developer Senior Analyst

Accenture
03.2021 - 04.2023

Technical Analyst

Infosys
03.2016 - 03.2021

Master of Engineering - Cloud Computing

Vellore Institute of Technology
06.2015