Summary
Overview
Work History
Skills
Resume - Sections
Languages
Timeline

Deepak Satam

Cloud Data Engineer
Schaumburg

Summary

Certified AWS Data Engineer with over 18 years of experience specializing in data management, integration, and big data infrastructure. Proven track record guiding successful AWS Cloud implementation and migration projects. Proficient in both Microsoft Azure and AWS cloud technologies, with deep experience setting up EMR and serverless computing for cloud-based big data environments. Strong data integration skills with RDBMS and NoSQL sources, including Hive, HBase, and Sqoop, and a comprehensive understanding of MapReduce and Hadoop. Built data intake pipelines and leveraged PySpark for data science models. Effective communicator who delivers compelling commercial value to stakeholders. Highly skilled in designing, building, and maintaining robust data pipelines and architectures on the Databricks platform, with expertise in processing large-scale datasets, developing and deploying machine learning models, and creating real-time analytics solutions. Proven ability to optimize data processing performance, ensure data quality, and implement data governance practices.

Overview

12
years of professional experience

Work History

Cloud Data Engineer

Lead Data Engineer | Exeliq Consulting Inc |
03.2023 - Current

Client-USPTO

● Designed and oversaw data pipelines utilizing AWS Glue, S3, and Redshift.

● Conducted analytical operations on health insurance data, extracting valuable insights.

● Orchestrated a data pipeline using AWS Glue to amalgamate data from diverse sources.

● Integrated Snowflake into the existing infrastructure while analyzing health insurance data, enhancing data processing capabilities.

● Imported files from an on-premises SFTP server to an S3 bucket and coordinated the data pipeline through an AWS Glue workflow.

● Engineered and deployed Lambda-based invocations to trigger Glue workflows (see the sketch after this list).

● Implemented data validation processes and audit-level jobs to ensure impeccable data integrity.

● Maintained all code repositories in AWS CodeCommit.

● Executed seamless code migrations from the DEV environment to STAGE and PROD using AWS CodePipeline.

● Designed and implemented a scalable data lakehouse on Databricks for business use cases.

● Monitored and optimized Databricks cluster performance, ensuring cost savings and efficient resource utilization.

● Developed efficient Spark pipelines for specific data processing tasks, achieving quantifiable results.

● Implemented Delta Lake for data management and performance improvement.
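
A minimal sketch of the Lambda-to-Glue trigger pattern referenced above, assuming a hypothetical workflow name; the actual USPTO resource names and event wiring are not reproduced here:

    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # Start the Glue workflow that orchestrates the downstream Glue jobs.
        # "uspto-ingest-workflow" is a placeholder, not the real resource name.
        run = glue.start_workflow_run(Name="uspto-ingest-workflow")
        return {"RunId": run["RunId"]}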


Cloud/Data Platform Lead for Envision Healthcare

Highlights: Set up a data lake on AWS for the VBC product; set up Redshift enterprise-wide and served as the data warehouse SME for Envision.

Key Result Areas -

● Created a data lake on AWS using S3, Lambda, Glue, Athena, Databricks, and QuickSight (see the Athena query sketch after this list).

● Migrated on-premises Oracle database objects and data to Redshift.

● Created data models and ingestion pipelines loading Redshift schemas from RDBMS and DB2 sources.

● Created Redshift models and ELT pipelines using dbt, along with a data quality architecture.
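
For illustration, querying such a data lake through Athena can be scripted with boto3 as below; the database, table, and results bucket are placeholders, not the actual Envision resources:

    import time
    import boto3

    athena = boto3.client("athena")

    run = athena.start_query_execution(
        QueryString="SELECT claim_id, paid_amount FROM vbc_claims LIMIT 10",
        QueryExecutionContext={"Database": "vbc_datalake"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )

    # Poll until the query finishes, then fetch the result rows.
    while True:
        status = athena.get_query_execution(QueryExecutionId=run["QueryExecutionId"])
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=run["QueryExecutionId"])
        print(rows["ResultSet"]["Rows"])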


Data Lake Implementation

AVP | Standard Chartered Bank |
10.2022 - 03.2023

Project: Data Lake Implementation

● Designed and implemented a data lake leveraging AWS Glue, S3, and Athena, with Python and Spark for development.

● Established a seamless connection with the SFTP server through a Glue job for downloading and processing files.

● Deployed Docker images to the AWS Elastic Container Registry (ECR) repository to fulfill specific requirements.

● Configured the AWS Batch environment to execute all jobs stored in CodeCommit.

● Engineered an AWS Step Functions definition for parallel execution of batch jobs (sketch after this list).

● Implemented and configured HP Diagnostics for application monitoring and critical client application call stack analysis.

● Monitored system-level metrics for various virtual server implementations using HP SiteScope and vSphere.

● Reviewed and analyzed Performance Testing deliverables for Performance Testing Projects.

● Developed a test harness tool to monitor and regulate the MQ and EMS queues for optimal performance.
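
A sketch of the parallel-batch orchestration mentioned above, expressed as an Amazon States Language definition built in Python; the job names, queue, and role ARN are placeholders rather than the bank's actual resources:

    import json
    import boto3

    # Each branch submits one AWS Batch job and waits for it (.sync integration).
    branches = []
    for job in ("ingest", "validate"):
        branches.append({
            "StartAt": job,
            "States": {
                job: {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::batch:submitJob.sync",
                    "Parameters": {
                        "JobName": job,
                        "JobDefinition": f"{job}-job-def",
                        "JobQueue": "datalake-queue",
                    },
                    "End": True,
                }
            },
        })

    definition = {
        "StartAt": "RunBatchJobsInParallel",
        "States": {
            "RunBatchJobsInParallel": {"Type": "Parallel", "Branches": branches, "End": True}
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="batch-parallel-pipeline",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/states-batch-role",  # placeholder
    )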


Project: Environmental, Social, and Governance (ESG) | Client: GBS

● Utilized Python for web scraping to extract data from 1500 companies on unpri.org/signatories (see the scraping sketch after this list).

● Assisted in transitioning processing from a NoSQL MongoDB architecture to a cloud-based solution.

● Established a distributed computing infrastructure leveraging Apache Spark.
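
A simplified scraping sketch along the lines of the work above; the selector and the assumption that the signatory directory is served as static HTML are illustrative only, since the real unpri.org page layout may differ or be JavaScript-rendered:

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.unpri.org/signatories"

    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Collect visible link text as a rough list of signatory entries.
    signatories = [a.get_text(strip=True) for a in soup.select("a") if a.get_text(strip=True)]
    print(len(signatories), "entries scraped")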

Senior Big Data Lead

Tesseract Solutions Pvt. Ltd
06.2015 - 01.2020

Project: Blue Shield Florida

● Provided the data science team with a view of all available datasets by creating a data lake based on Hive and Hadoop.

● Created DW data models in Hive (ORC, Avro) with partitioning and bucketing for high performance (sketch below).
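
A PySpark sketch of the partitioned, bucketed Hive modeling described above; the path, table, and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("claims-dw-model")
        .enableHiveSupport()
        .getOrCreate()
    )

    claims = spark.read.parquet("/landing/claims")  # assumed landing-zone path

    # Partition by a low-cardinality column and bucket by the join key,
    # storing the result as an ORC table in the Hive metastore.
    (
        claims.write
        .format("orc")
        .partitionBy("plan_year")
        .bucketBy(32, "member_id")
        .sortBy("member_id")
        .mode("overwrite")
        .saveAsTable("dw.claims_fact")
    )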


Project: Pacific Gas Energy

● Created a MapR cluster on AWS EC2 instances and installed Hive, Spark, Pig, and Sqoop.

● Transposed data from Aladdin output files to Eagle.

● Created a uniform Spark layer to process data from different sources including Hive, MongoDB, and Splunk (see the sketch after this list).
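
A sketch of the uniform Spark access layer noted above, assuming the MongoDB Spark connector is available on the cluster; table names, the URI, and the join key are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("uniform-spark-layer")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive source: any metastore table is reachable through Spark SQL.
    hive_df = spark.sql("SELECT * FROM energy.meter_readings")

    # MongoDB source: requires the MongoDB Spark connector jar on the classpath.
    mongo_df = (
        spark.read
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", "mongodb://mongo-host:27017/energy.outages")
        .load()
    )

    # Both sources now expose the same DataFrame API, so downstream logic is uniform.
    combined = hive_df.join(mongo_df, on="site_id", how="left")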


Project: Fitch Ratings

● Migrated data from Oracle to MongoDB collections (see the sketch after this list).

● Built a Spark cluster to process data and integrated MongoDB with Hadoop.

● Automated migration of the MongoDB cluster to AWS using Chef and CloudFormation.

● Created web services using Informatica Data Services.
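
A minimal sketch of the Oracle-to-MongoDB migration pattern referenced above, using the python-oracledb and pymongo clients; connection details, the table, and the collection names are placeholders:

    import oracledb
    from pymongo import MongoClient

    ora = oracledb.connect(user="etl_user", password="***", dsn="orahost/ORCLPDB1")
    mongo = MongoClient("mongodb://mongo-host:27017")

    cursor = ora.cursor()
    cursor.execute("SELECT issuer_id, issuer_name, rating FROM ratings")  # hypothetical table
    columns = [col[0].lower() for col in cursor.description]

    # Convert each relational row into a document and bulk-insert it.
    docs = [dict(zip(columns, row)) for row in cursor.fetchmany(10_000)]
    if docs:
        mongo["ratings_db"]["issuer_ratings"].insert_many(docs)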

Data Integration Architect

Syntel India Ltd
10.2013 - 05.2015

Project: Investment Derivatives, Front Office, and Support

● Designed and optimized Informatica mappings for data processing.

● Created reports and reconciliation tools using Tableau.

● Developed .Net web services for data access.

Consultant

Rave Technologies
05.2012 - 03.2013

Project: Trading Portfolio Management System

● Implemented ETL strategies and data flows using Informatica.

● Developed reconciliation between Eagle and Aladdin systems.

● Supported .Net applications and Tableau reports.



Databricks:

  • Design and develop data pipelines to ingest, store, and process data from multiple sources (see the PySpark sketch after this list)
  • Develop and maintain data models to support data analysis and reporting
  • Develop and maintain data warehouses and data marts
  • Develop and maintain ETL processes to move data between systems
  • Develop and maintain data quality and governance processes
  • Develop and maintain data security processes
  • Develop and maintain data visualization tools
  • Develop and maintain machine learning models
  • Develop and maintain data mining algorithms
  • Develop and maintain data analysis and reporting tools
  • Develop and maintain data integration processes
  • Develop and maintain data governance and compliance processes
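
A compact PySpark example of the pipeline and data quality duties listed above, written for a Databricks notebook; the bucket, column, and table names are placeholders:

    from pyspark.sql import SparkSession, functions as F

    # On Databricks a SparkSession is already provided; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    raw = spark.read.json("s3://example-raw-bucket/events/")

    # Basic quality gate: drop rows missing the key, de-duplicate, stamp the load time.
    clean = (
        raw.dropna(subset=["event_id"])
           .dropDuplicates(["event_id"])
           .withColumn("ingested_at", F.current_timestamp())
    )

    # Persist to a curated Delta table that reports and models read downstream.
    clean.write.format("delta").mode("append").saveAsTable("curated.events")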


Dataiku:


Installation, configuration, and maintenance of the Dataiku application on AWS servers.

AWS environment management covering patching, security audits, compliance checks, IAM audits, and network configurations.

Creation and management of projects in a Kubernetes environment.

Configuration of Dataiku with Kubernetes and resource optimization for offloading machine learning computations to Kubernetes.

Dataiku application health and lifecycle management.

Implementing automations using Python and Linux shell scripts (see the sketch below).

Use of Dataiku Fleet Manager and Ansible scripting to manage the lifecycle of the Dataiku application.

Managing incidents and issues, focusing on effective communication and follow-up with stakeholders and other technical teams.

Analyzing user requirements, fulfilling requests, and resolving incidents.

Maintaining documentation and learning resources.
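
One example of such an automation, sketched with the dataikuapi public client; the host, API key, project key, and scenario name are placeholders and assume an API key with sufficient privileges:

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "dku_api_key")

    # Light health check: confirm the node responds and count its projects.
    project_keys = client.list_project_keys()
    print(f"DSS reachable, {len(project_keys)} projects visible")

    # Trigger a housekeeping scenario as part of a nightly automation run.
    client.get_project("ANALYTICS_OPS").get_scenario("NIGHTLY_CLEANUP").run()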

Minimum Dataiku Core Designer certified; preferably also Dataiku DSS Administration certified.

Good working experience with AWS Cloud Services.

Experience engineering applications to run on Linux servers within an enterprise environment, including integration with security infrastructure.

Some experience in delivering AI solutions that create business value.

Strong sense of ownership; prioritizing outcome over output.

Strong intrinsic motivation to learn new skills and concepts.

Ability to collaborate and align with the team rather than deliver alone.

Proven experience with Python, shell scripting, Git, Linux, Hadoop, Kubernetes, and AWS services.

Good communication skills in English and an understanding of Scrum/Agile/DevOps ways of working.

Skills

AWS Cloud (Compute, Storage, Analytics)

Resume - Sections

  • Databricks Cluster, A Databricks Cluster is a combination of computation resources and configurations on which you can run jobs and notebooks. Some of the workloads that you can run on a Databricks Cluster include Streaming Analytics, ETL Pipelines, Machine Learning, and Ad-hoc analytics. The workloads are run as commands in a notebook or as automated tasks. There are two types of Databricks Clusters: All-purpose Clusters and Job Clusters.
  • All-purpose Clusters, These types of Clusters are used to analyze data collaboratively via interactive notebooks. They can be terminated and restarted manually and can be shared by multiple users to do collaborative tasks interactively.
  • Job Clusters, These types of clusters are used for running fast and robust automated tasks. They are created when you run a job on your new Job Cluster and terminate the Cluster once the job ends. A Job Cluster cannot be restarted.
  • Types of clusters in Databricks, Standard, High Concurrency, and Single Node clusters are supported by Azure Databricks. Cluster mode is set to Standard by default.
  • Standard Clusters, For a single user, a Standard cluster is ideal. Workloads written in Python, SQL, R, and Scala can all be run on standard clusters.
  • High Concurrency Clusters, A managed cloud resource is a high-concurrency cluster. High-concurrency clusters have the advantage of fine-grained resource sharing for maximum resource utilisation and low query latencies.
  • Single Node Clusters, Spark jobs run on the driver node in a Single Node cluster, which has no workers. To execute Spark jobs in a Standard cluster, at least one Spark worker node is required in addition to the driver node.
  • Extract Transform Load (ETL), ETL stands for extract, transform, and load. It is the process data engineers use to extract data from different sources, transform the data into a usable and trusted resource, and load that data into the systems end-users can access and use downstream to solve business problems.
  • Delta Lake, Delta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale (see the sketch after this list).
  • Databricks notebooks, Notebooks are a common tool in data science and machine learning for developing code and presenting results. In Databricks, notebooks are the primary tool for creating data science and machine learning workflows and collaborating with colleagues. Databricks notebooks provide real-time coauthoring in multiple languages, automatic versioning, and built-in data visualizations.
  • Databricks Database table, A database in Azure Databricks is a collection of tables and a table is a collection of structured data. Tables in Databricks are equivalent to DataFrames in Apache Spark. This means that you can cache, filter and perform any operations on tables that are supported by DataFrames. You can also query tables using the Spark APIs and Spark SQL.
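
As a sketch of the single-copy batch-and-streaming point above, the same Delta table can be read both ways; the paths are hypothetical and assume a Databricks runtime (or Spark with Delta Lake configured):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    table_path = "/mnt/datalake/curated/events"

    # Batch: read the current snapshot of the Delta table.
    snapshot = spark.read.format("delta").load(table_path)
    print(snapshot.count(), "rows in the latest snapshot")

    # Streaming: the same table is also an incremental source, so new commits
    # flow into a downstream Delta location without a second copy of the data.
    query = (
        spark.readStream.format("delta").load(table_path)
             .writeStream.format("delta")
             .option("checkpointLocation", "/mnt/datalake/_checkpoints/events_mirror")
             .start("/mnt/datalake/curated/events_mirror")
    )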

Languages

English
Advanced (C1)

Timeline

Cloud Data Engineer

Lead Data Engineer | Exeliq Consulting Inc |
03.2023 - Current

Data Lake Implementation

AVP | Standard Chartered Bank |
10.2022 - 03.2023

Senior Big Data Lead

Tesseract Solutions Pvt. Ltd
06.2015 - 01.2020

Data Integration Architect

Syntel India Ltd
10.2013 - 05.2015

Consultant

Rave Technologies
05.2012 - 03.2013