Summary
Overview
Work History
Education
Skills
Certification
Automation and configuration management
Containerization and orchestration
Problem solving and collaboration
Areas of excellence
Other skills
Programming
Timeline
Hi, I’m

David Wise

Data Engineer
Minneapolis, Minnesota

Summary

Detail-oriented Data Engineer who designs, develops, and maintains highly scalable, secure, and reliable data structures. Accustomed to working closely with system architects, software architects, and design analysts to understand business or industry requirements and develop comprehensive data models. Proficient at developing database architectural strategies at the modeling, design, and implementation stages.

Background includes data mining, warehousing, and analytics. Proficient in machine and deep learning. Quality-driven and hardworking with excellent communication and project management skills.

Organized and dependable, successful at managing multiple priorities with a positive attitude and a willingness to take on added responsibilities to meet team goals.

Pursuing a full-time role in the data domain that presents professional challenges and leverages interpersonal skills, effective time management, and problem-solving expertise. Hardworking and passionate, with strong organizational skills and a drive to help the team achieve company goals.

Overview

2 Certifications

Work History

Toyota Auto Mobile

Data Engineer
05/2019 - Current

Job overview

Project Details:

ETL Pipeline for Customer Data Analytics

  • Optimized data processing by implementing efficient ETL pipelines and streamlining database design.
  • Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.
  • Enhanced data quality by performing thorough cleaning, validation, and transformation tasks.
  • Automated routine tasks using Python scripts, increasing team productivity and reducing manual errors.
  • Migrated legacy systems to modern big-data technologies, improving performance and scalability while minimizing business disruption.
  • Fine-tuned query performance and optimized database structures for faster, more accurate data retrieval and reporting.


Project Overview:
This project involved designing and implementing an end-to-end ETL (Extract, Transform, Load) pipeline to consolidate customer data from multiple sources for business intelligence and advanced analytics. The data sources included customer relationship management (CRM) systems, marketing platforms, transactional databases, and third-party data providers.

Goal:
To create a unified, clean, and organized dataset for deeper insights into customer behavior, segmentation, and lifetime value, as well as to optimize marketing strategies and business decisions.

Key Goals:

  • Data Consolidation:
    Integrate and consolidate customer data from multiple, diverse sources into a single, consistent dataset.
  • Data Transformation:
    Clean, enrich, and standardize raw data to ensure consistency and usability for analytics teams.
  • Automation:
    Automate the ETL process to reduce manual intervention and ensure timely updates for reports and dashboards.
  • Business Intelligence:
    Provide stakeholders with real-time and batch reports for data-driven decision-making using customer behavior and marketing insights.
  • Scalability:
    Ensure the ETL pipeline can scale as new data sources and increased volumes are introduced.
  • Apache Airflow:
    Used as the orchestrator for the ETL pipeline. It allowed for scheduling, monitoring, and managing complex data workflows, providing a visual interface to track data movements and failures (a minimal DAG sketch appears after this list).
  • AWS Redshift:
    Deployed as the data warehouse for storing structured, transformed customer data. It provided the performance needed to run complex queries and produce timely reports.
  • Python:
    Utilized for data extraction, transformation, and loading. Python scripts interacted with APIs, databases, and data files for data ingestion and manipulation before loading into Redshift.
  • Snowflake:
    Used as an additional data warehouse for staging and advanced analytics. Its scalable architecture enabled concurrent queries and fast processing of large datasets, ideal for ad-hoc analysis and machine learning workloads.
  • Tableau:
    Leveraged for visualizing customer data and creating interactive dashboards. Tableau connected directly to AWS Redshift, providing real-time insights for marketing, sales, and management teams.
  • ETL Pipeline Design:
    Designed the overall ETL architecture to extract customer data from multiple sources like Salesforce, Google Analytics, and internal CRM systems. Developed a modular and flexible pipeline that could easily accommodate new data sources and transformations as business needs evolved.
  • Data Ingestion and Extraction:
    Built custom Python scripts to extract data from various sources, including REST APIs, relational databases (PostgreSQL, MySQL), and CSV files. Automated data ingestion using Apache Airflow to schedule regular data pulls, ensuring real-time updates to the data warehouse.
  • Data Transformation and Cleaning:
    Developed transformation logic in Python to clean and standardize data, handling missing values, removing duplicates, and enriching data with external sources. Applied business rules to transform raw data into meaningful customer metrics such as average order value, customer segmentation, and purchase frequency.
  • Loading Data into Data Warehouse:
    Created optimized loading strategies to store cleaned and transformed data in AWS Redshift, ensuring efficient handling of large volumes of customer records. Partitioned and indexed tables in Redshift for fast querying and minimal report generation time.
  • Data Aggregation and Reporting:
    Aggregated data by customer segments, regions, and time periods to generate comprehensive reports on customer trends and behaviors. Automated reporting and dashboard updates using Tableau for up-to-date customer insights.
  • Performance Optimization:
    Tuned Airflow DAGs for optimized execution times and resource utilization. Optimized ETL scripts by parallelizing processes, reducing overall data pipeline execution time by 20%.
  • Data Security and Governance:
    Implemented data access controls and encryption for sensitive customer data to ensure compliance with GDPR and internal security policies. Developed data quality checks and validation processes to catch anomalies and ensure data accuracy.
  • Collaboration and Cross-Functional Work:
    Worked closely with marketing, sales, and business intelligence teams to tailor the ETL pipeline to their data requirements. Collaborated with data scientists to provide clean datasets for predictive models, including customer segmentation, churn prediction, and lifetime value estimation.
  • Challenge: Inconsistent Data Across Sources
    Solution: Developed comprehensive data cleaning and standardization processes to harmonize data from multiple sources. Custom Python scripts identified and corrected inconsistencies automatically.
  • Challenge: Complex Data Transformations
    Solution: Used Apache Airflow to modularize transformations, making the pipeline more flexible and maintainable. Python allowed for custom transformations tailored to business logic.
  • Challenge: Real-Time Data Integration
    Solution: Automated data extraction using Airflow to schedule real-time pulls from various systems and APIs, ensuring fresh data in reports and dashboards.
  • Challenge: Large-Scale Data Processing
    Solution: Optimized batch processing with parallel transformations and used Redshift’s partitioning and indexing features to significantly reduce query times.
  • Improved Data Accessibility:
    Consolidated customer data from multiple sources into a single data warehouse, providing stakeholders with easy access to consistent, clean, and reliable data.
  • Reduced Manual Effort:
    Automated the entire ETL process, reducing manual effort and enabling real-time campaign adjustments by the marketing and sales teams.
  • Enhanced Data Quality:
    Comprehensive cleaning and validation improved data accuracy, resulting in more reliable analytics and better customer-centric strategies.
  • Scalable Architecture:
    Designed a scalable ETL pipeline capable of handling millions of customer records daily, with the ability to scale further as the business grows.
  • Faster Processing and Reporting:
    Optimized ETL processes reduced overall time for data transformation and report generation by 20%, improving delivery timelines for insights.
  • Business Impact:
    Actionable insights into customer behavior helped marketing teams optimize campaigns and increase customer retention, while up-to-date reports enabled data-driven decision-making for better business performance.
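
The orchestration pattern described above (Airflow scheduling Python extract, transform, and load steps into the warehouse) can be illustrated with a minimal sketch. This is not the production pipeline: the DAG id, task names, helper functions (extract_crm, transform_customers, load_to_redshift), and the sample data are hypothetical placeholders, and the sketch assumes Airflow 2.x.

# Minimal illustrative Airflow 2.x DAG for an extract -> transform -> load flow.
# All task names, sample data, and helper logic are placeholders, not the
# production pipeline described in the project above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_crm(**context):
    """Pull raw customer records from a CRM source (placeholder)."""
    # The real pipeline pulled from Salesforce, Google Analytics, and internal CRM APIs.
    return [{"customer_id": 1, "orders": 3, "spend": 120.0}]


def transform_customers(ti, **context):
    """Clean and enrich raw records into customer metrics (placeholder)."""
    raw = ti.xcom_pull(task_ids="extract_crm")
    return [
        {**row, "avg_order_value": row["spend"] / max(row["orders"], 1)}
        for row in raw
    ]


def load_to_redshift(ti, **context):
    """Load transformed rows into the warehouse (placeholder)."""
    rows = ti.xcom_pull(task_ids="transform_customers")
    # A real implementation would stage the rows and COPY them into Redshift.
    print(f"Loading {len(rows)} rows into the customer metrics table")


with DAG(
    dag_id="customer_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_crm", python_callable=extract_crm)
    transform = PythonOperator(task_id="transform_customers", python_callable=transform_customers)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    extract >> transform >> load

Keeping each stage as its own task mirrors the modular design called out above: adding a new source only requires wiring an additional extract task into the same DAG.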


The HCI Companies

Data Engineer
04/2014 - 07/2016

Job overview

Project Details:

Data Lake Implementation for Batch Processing

  • Led end-to-end implementation of multiple high-impact projects from requirements gathering through deployment and post-launch support stages.
  • Fine-tuned query performance and optimized database structures for faster, more accurate data retrieval and reporting.
  • Designed scalable and maintainable data models to support business intelligence initiatives and reporting needs.
  • Streamlined complex workflows by breaking them down into manageable components for easier implementation and maintenance.
  • Evaluated various tools, technologies, and best practices for potential adoption in the company's data engineering processes.
  • Collaborated with cross-functional teams for seamless integration of data sources into the company's data ecosystem.


Project Overview:
Architected and implemented a cloud-based data lake solution to handle large-scale batch processing and data storage. Managed both structured and unstructured data from multiple sources, including databases, CRM systems, IoT devices, and external APIs.

The data lake served as a unified repository for raw data, enabling efficient access, storage, and processing for advanced analytics and machine learning.

Key Goals:

  • Scalability:
    Build a scalable architecture to handle growing data volumes without performance issues.
  • Cost Efficiency:
    Implement cost-effective storage strategies using Azure’s tiered storage capabilities.
  • Data Accessibility:
    Ensure seamless access to data for data scientists, analysts, and business stakeholders.
  • Processing Efficiency:
    Optimize batch processing to transform and aggregate data for analytics within minimal timeframes.
  • Data Governance:
    Enforce strong data governance and security protocols in compliance with organizational and industry regulations.
  • Azure Data Lake:
    Centralized, scalable storage solution capable of handling both structured and unstructured data. Enabled storing data in raw form for flexibility in downstream processing and transformation.
  • Azure Blob Storage:
    Utilized as the landing zone for raw data ingestion from multiple data sources. Tiered storage reduced costs by moving infrequently accessed data to lower-cost storage while ensuring availability for active data.
  • Databricks:
    Primary platform for processing and transforming large datasets using PySpark for distributed computing. Allowed for efficient batch processing and integration with Azure services.
  • PySpark:
    Used for developing and running batch processing jobs to clean, transform, and aggregate large datasets. Optimized for distributed processing to enhance pipeline performance.
  • Delta Lake:
    Implemented on top of Azure Data Lake for ACID transactions, schema enforcement, and data versioning. Enabled time-travel queries and better data governance.
  • Azure Data Factory:
    Orchestrated data ingestion and batch processing workflows. Automated data movement and transformation between services, providing low-maintenance scheduling and monitoring.
  • Azure DevOps:
    Managed continuous integration/continuous delivery (CI/CD) pipelines to deploy and monitor updates to data processing workflows. Supported agile development practices.
  • Architecting the Data Lake:
    Designed the overall architecture to ensure scalability as data volumes grew. Planned data partitioning and tiered storage strategies to optimize retrieval times and costs.
  • Data Ingestion:
    Developed data ingestion pipelines to bring raw data from multiple sources (databases, APIs, IoT sensors) into the data lake. Supported multiple formats like JSON, CSV, Parquet, and binary.
  • Batch Processing with PySpark:
    Implemented batch processing jobs using PySpark on Databricks to clean, transform, and aggregate data. Streamlined ETL workflows for optimized data preparation and loading (a minimal PySpark sketch appears after this list).
  • Delta Lake Integration:
    Leveraged Delta Lake for version control, time-travel queries, and schema enforcement. Ensured ACID compliance and robust governance across evolving datasets.
  • Performance Optimization:
    Optimized data partitioning in Azure Data Lake to improve query performance, reducing batch processing times by 30%. Tuned PySpark configurations to minimize computational overhead.
  • Collaboration and Cross-Functional Work:
    Worked with data scientists and analysts to deliver pipelines for advanced analytics and machine learning. Created pipelines providing clean, structured data for BI dashboards.
  • Data Governance and Security:
    Implemented governance practices, including access controls, encryption, and auditing to ensure compliance with regulations (e.g., GDPR). Developed monitoring systems for data quality.
  • Data Transformation and Aggregation:
    Developed transformation logic for data aggregation across dimensions (e.g., time, location, customer segmentation) to support business analytics. Standardized data, removing duplicates and ensuring consistency.
  • Challenge: Handling large-scale data ingestion from diverse sources
    Solution: Implemented a modular architecture using Azure Blob Storage for initial data landing, and Azure Data Factory for automated data movement into the data lake.
  • Challenge: Ensuring consistent performance as data volumes increased
    Solution: Optimized batch processing workflows using PySpark and Delta Lake. Scaled the solution horizontally to handle growing data volumes efficiently.
  • Challenge: Maintaining data consistency and version control
    Solution: Integrated Delta Lake for version control, time-travel queries, and schema enforcement. Ensured robust data governance and quality control.
  • Challenge: Managing costs for large-scale storage and processing
    Solution: Utilized Azure’s tiered storage and optimized partitioning strategies to minimize costs, especially for older, infrequently accessed data.
  • Improved Scalability:
    The data lake scaled to handle petabytes of data across different formats, supporting future growth without major redesign.
  • Reduced Processing Time:
    Batch processing times were reduced by 30% through partitioning and job tuning, allowing for quicker data access.
  • Enhanced Data Accessibility:
    Provided data scientists and analysts access to a unified dataset, enabling advanced analytics and machine learning models for better decision-making.
  • Stronger Data Governance:
    Implemented robust governance and security measures, ensuring compliance with organizational policies and regulatory standards like GDPR.
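
A minimal sketch of the batch pattern described above: PySpark on Databricks reads raw files from the lake, cleans them, aggregates per customer and day, and writes a partitioned Delta table. The paths, column names, and aggregation logic are hypothetical placeholders, not the production jobs.

# Illustrative PySpark batch job writing to a Delta table. Paths, columns,
# and aggregations are placeholders; Delta Lake is assumed to be available
# on the cluster (as it is on Databricks).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch-aggregation").getOrCreate()

# Read raw files landed in the lake by the ingestion pipelines.
raw = spark.read.parquet("/mnt/raw/events/")

# Basic cleaning: drop exact duplicates and rows missing a customer key.
clean = raw.dropDuplicates().dropna(subset=["customer_id"])

# Aggregate per customer and day to support downstream analytics.
daily = (
    clean.groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
    .agg(
        F.count("*").alias("event_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Write a partitioned Delta table; Delta adds ACID transactions, schema
# enforcement, and time travel on top of the data lake storage.
(
    daily.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("/mnt/curated/customer_daily/")
)

Partitioning the output by date supports the partition pruning that, together with job tuning, the project above credits for the 30% reduction in batch processing time.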


Navy Federal Union

Associate Data Engineer
05/2017 - 02/2018

Job overview

Project Details:

Real-Time Data Streaming and Analytics Platform

  • Developed polished visualizations to share results of data analyses.
  • Analyzed large datasets to identify trends and patterns in customer behaviors.
  • Improved data collection methods by designing surveys, polls and other instruments.
  • Ran statistical analyses within software to process large datasets.
  • Compiled, cleaned and manipulated data for proper handling.
  • Mentored junior engineers on various aspects of data engineering principles while fostering an environment conducive to learning new skills efficiently.


Project Overview:
This project involved building a real-time data streaming and analytics platform to capture, process, and analyze data streams from multiple sources in real time. The goal was to enable the company to make faster, data-driven decisions, improve operational efficiency, and support predictive analytics by processing continuous data streams such as user activity, IoT device data, and website traffic.

The platform needed to ingest high-velocity data, process it in real time to derive meaningful insights, and store both raw and processed data for future batch analytics. Real-time dashboards were set up for monitoring KPIs and generating alerts for critical events such as system anomalies or potential fraud detection.

Key Goals:

  • Real-Time Data Processing:
    Capture and process data streams in real time to provide immediate insights into customer behavior, system performance, and operational metrics.
  • Scalability:
    Build an architecture that could easily scale to handle increased data velocity and volume as the business and data sources grew.
  • High Availability:
    Ensure 24/7 data processing with minimal downtime and timely alerts for critical events.
  • Data Integrity:
    Ensure accurate and reliable data processing despite high throughput and potential system failures.
  • Predictive Analytics:
    Support advanced analytics use cases such as fraud detection, predictive maintenance, and real-time personalization.
  • Apache Kafka:
    Core messaging system to capture, store, and route real-time data streams from multiple sources (e.g., website events, IoT sensors, third-party APIs). Provided scalable and fault-tolerant distributed streaming capabilities (a minimal producer/consumer sketch appears after this list).
  • Apache Flink:
    Used for real-time stream processing, applying transformations, enriching data, and performing complex event processing and real-time aggregations (e.g., calculating rolling averages, detecting anomalies).
  • Amazon Kinesis:
    Worked alongside Kafka to ensure redundancy and support multi-region data streaming for critical data. Provided low-latency processing at scale.
  • AWS Lambda:
    Utilized for real-time event-driven processing to trigger specific actions based on incoming data events (e.g., sending alerts, logging exceptions, transforming data).
  • Amazon S3:
    Served as the data lake for storing both raw and processed data for future batch processing and analysis. Provided cost-effective, long-term data archiving.
  • AWS Glue:
    Used for metadata cataloging and automated ETL workflows for batch processing historical data stored in S3. Helped structure raw data for analysts and data scientists.
  • Grafana and Kibana:
    Used to visualize real-time data, display dashboards for monitoring KPIs and system performance, and detect anomalies. Provided customizable alerts and real-time visual feedback.
  • Architecting the Data Streaming Platform:
    Led the design and implementation of the real-time data streaming architecture. Ensured scalability, fault tolerance, and low latency. Designed Kafka topics and partitions to route high-volume data streams efficiently.
  • Data Ingestion and Streaming Setup:
    Configured Apache Kafka to handle multiple data sources and set up producers to push data into the platform from IoT devices, website events, and user interactions. Integrated Amazon Kinesis for redundancy and multi-region availability of critical data streams.
  • Stream Processing with Flink:
    Developed stream processing jobs in Apache Flink to process data in-flight, including filtering, enrichment, and transformation. Implemented real-time windowed aggregations (e.g., rolling averages, detecting traffic spikes) for actionable insights.
  • Event-Driven Processing:
    Used AWS Lambda to trigger actions based on events (e.g., sending alerts when system thresholds were exceeded). Built logic to detect and handle real-time anomalies, such as system errors or fraudulent activity.
  • Data Storage and Archiving:
    Established a pipeline for storing real-time data streams in Amazon S3 for long-term storage. Used AWS Glue for automated ETL workflows, preparing historical data for batch processing and reporting.
  • Monitoring and Alerting:
    Created real-time dashboards in Grafana and Kibana to monitor KPIs, system health, and streaming metrics (e.g., throughput, latency, error rates). Configured alerts for rapid response to system issues.
  • Performance Optimization:
    Tuned Kafka configurations (e.g., partitioning, replication, consumer groups) to ensure high throughput and reliability. Optimized Flink jobs by parallelizing tasks, reducing latency, and improving resource efficiency.
  • Data Governance and Security:
    Implemented security protocols for managing data access and encrypting sensitive streams, ensuring compliance with regulations (e.g., GDPR). Developed role-based access controls (RBAC) for data stream permissions.
  • Challenge: Handling High-Velocity Data Streams
    Solution: Optimized Kafka topics and partitions to distribute data evenly across consumers, ensuring minimal lag and high throughput. Used Flink for stateful processing to ensure data integrity in real time.
  • Challenge: Ensuring Low-Latency Processing
    Solution: Used Flink’s windowing and real-time aggregation capabilities to process data in sub-second intervals, ensuring near-instant feedback for time-sensitive events.
  • Challenge: Ensuring High Availability
    Solution: Implemented data redundancy across Amazon Kinesis and Kafka to ensure critical data availability even during regional outages or system failures.
  • Challenge: Managing System Failures and Data Loss
    Solution: Leveraged Kafka’s replication and failover capabilities, ensuring messages were replicated across multiple brokers, reducing the risk of data loss.
  • Real-Time Decision-Making:
    Enabled real-time decision-making by providing immediate insights into system performance, customer behavior, and operational metrics. Allowed for faster reactions to issues such as system failures or user activity spikes.
  • Improved System Monitoring:
    Real-time dashboards in Grafana and Kibana allowed stakeholders to monitor KPIs in real time, improving operational efficiency and reducing downtime by ensuring rapid response to alerts.
  • Enhanced Data Processing Efficiency:
    Optimized real-time stream processing with Apache Flink, reducing processing latency by 25%. Enabled near-instant access to fresh data for time-sensitive decisions.
  • Scalable Architecture:
    Designed a scalable architecture capable of handling tens of thousands of events per second. Scaled further as data volumes grew without significant reconfiguration.
  • Improved Business Insights:
    Provided deeper insights into customer behavior and operational metrics in real time. Predictive analytics supported by real-time processing improved decision-making for marketing, fraud detection, and personalization.
  • Cost-Effective Data Storage:
    Used Amazon S3 for long-term storage of large datasets at minimal cost, supporting both real-time and batch analytics.
  • Business Impact:
    The platform improved operational efficiency and customer engagement. Real-time data processing led to faster response times, reduced system downtime, and enhanced customer satisfaction.
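
A minimal sketch of the ingestion side described above, written with the kafka-python client (an assumption for illustration; in the platform itself, producers fed Kafka and Apache Flink consumed the streams). The broker address, topic name, and event schema are placeholders.

# Illustrative Kafka producer/consumer pair using the kafka-python client.
# Brokers, topic, and event fields are placeholders, not production values.
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["localhost:9092"]  # placeholder bootstrap servers
TOPIC = "user-activity"       # placeholder topic

# Producer: push JSON events onto the topic, keyed by user id so events for
# the same user land in the same partition (preserving per-user ordering).
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, key="user-42", value={"action": "page_view", "page": "/home"})
producer.flush()

# Consumer: read events as part of a consumer group so partitions are shared
# across consumer instances, which is how throughput scales horizontally.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)

Keying by user id and sharing partitions across a consumer group reflects the topic partitioning and consumer-group tuning described in the performance optimization item above.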

Education

Kwara Polytechnics

Business Administration

Skills

ETL development

Certification

AWS Certified Solutions Architect - Associate

Automation and configuration management

Skilled in utilizing DevOps and automation tools such as GitOps, Argo CD, Azure Pipelines, and GitHub Actions to streamline deployment processes, improve efficiency, and enhance code quality. Experienced in infrastructure-as-code and configuration management tools such as Terraform and Ansible.

Containerization and orchestration

Strong experience in containerization technologies, including Docker, and container orchestration using Kubernetes. Proficient in deploying and managing Docker images and leveraging Kubernetes features like ReplicaSets, Pods, Deployments, and Services.

Problem solving and collaboration

Adept at troubleshooting complex issues, identifying root causes, and devising innovative solutions to business challenges. Collaborative team player, skilled in working closely with software development teams and stakeholders to understand infrastructure requirements and implement optimal cloud-based solutions.

Areas of excellence

  • AWS DevOps
  • Azure DevOps
  • Scripting (Python, Bash)
  • Docker, Dockerfile
  • Kubernetes
  • Security / Performance
  • Troubleshooting / Issue Resolution
  • Git
  • Terraform - IaC
  • OS - Linux and Windows

Other skills

Linux (RHEL/CentOS 7 & 8, Ubuntu), MS Windows Server 2012/2016, SQL Server, HBase, MySQL, Apache Kafka, HornetQ, Hadoop, YARN, ZooKeeper; strong problem-solving skills with the ability to identify and resolve issues proactively, develop innovative solutions to business challenges, and streamline operations.

Programming

  • Apache Kafka:
    Expertise in configuring Kafka topics, partitions, and consumer groups for efficient data streaming and processing at scale. Hands-on experience with Kafka's API for both producers and consumers.
  • Apache Flink:
    Proficient in developing real-time stream processing applications, leveraging Flink's event-driven processing, windowing functions, and stateful operations for low-latency data transformation and aggregation.
  • AWS Lambda:
    Strong experience in writing event-driven functions for real-time data processing, automating alerts, and performing lightweight data transformations without managing servers (a minimal handler sketch appears after this list).
  • Amazon Kinesis:
    Skilled in configuring Amazon Kinesis for real-time data ingestion, ensuring data redundancy, and supporting multi-region streaming for mission-critical applications.
  • Python / Java:
    Extensive experience using Python and Java for building real-time data pipelines, implementing stream processing logic, and integrating with AWS services like Lambda and Kinesis.
  • SQL / NoSQL Databases:
    Ability to work with both relational (e.g., PostgreSQL) and non-relational databases (e.g., DynamoDB, MongoDB) for querying and storing real-time data.
  • AWS Glue & S3:
    Expertise in using AWS Glue for building automated ETL workflows and integrating with Amazon S3 for cost-effective, long-term data storage and batch analytics.
  • Grafana & Kibana Dashboards:
    Proficient in configuring real-time data visualization dashboards in Grafana and Kibana to monitor KPIs, system performance, and detect anomalies.
  • Data Security & Governance:
    Experience in implementing security protocols, data encryption, and role-based access controls (RBAC) to ensure data compliance and privacy.
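
As referenced in the AWS Lambda item above, the sketch below shows a minimal event-driven handler that publishes an alert when a metric crosses a threshold. The event shape, the latency threshold, and the SNS topic ARN are hypothetical placeholders, not production code.

# Illustrative event-driven Lambda handler: inspect incoming records and
# publish an SNS alert when latency exceeds a threshold. Event shape,
# threshold, and topic ARN are placeholders.
import json
import os

import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = os.environ.get(
    "ALERT_TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:alerts"  # placeholder
)
LATENCY_THRESHOLD_MS = 500  # placeholder threshold


def lambda_handler(event, context):
    """Check each record's latency and raise an SNS alert if it is too high."""
    records = event.get("records", [])
    alerts = 0
    for record in records:
        payload = json.loads(record["body"])
        if payload.get("latency_ms", 0) > LATENCY_THRESHOLD_MS:
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Subject="Latency threshold exceeded",
                Message=json.dumps(payload),
            )
            alerts += 1
    return {"processed": len(records), "alerts": alerts}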

Timeline

Data Engineer

Toyota Auto Mobile
05/2019 - Current

Data Engineer

The HCI Companies
04/2014 - 07/2016

Associate Data Engineer

Navy Federal Union
05/2017 - 02/2018

Kwara Polytechnics

Business Administration