Karan Gupta

Thane

Summary

I’m a detail-oriented data analyst who enjoys digging into raw data to find the story behind it. I have hands-on experience in data cleaning, preprocessing, and statistical analysis using Python and pandas, and I like turning messy datasets into clear, actionable insights. I build efficient workflows that make daily work smoother and collaborate with teams to solve real analytical problems. Whether the goal is ensuring data quality for business decisions or supporting industries such as manufacturing and market intelligence, I bring both rigor and creativity to every project. I’m curious by nature and keep exploring new methods, particularly in monitoring and in strategies for social media and business growth.

Work History

Analyst

Biltrax Construction Data Research Insight & Technologies Pvt Ltd
Mumbai
09.2025 - Current
  • Collected and validated construction project data from multiple sources, ensuring accuracy and completeness for a market intelligence database
  • Performed quality checks on construction project information including project specifications, timelines, and financial details to maintain database integrity
  • Organized and structured large volumes of construction industry data using systematic data entry protocols, supporting business intelligence operations
  • Conducted data verification and validation to identify inconsistencies, duplicates, and errors in construction project records
  • Maintained data quality standards by following established data entry procedures and documentation requirements for construction market research
  • Assisted in building comprehensive construction project databases that supported analytical reporting and market insights for clients
  • Key Skills: Data Validation, Data Quality Control, Attention to Detail, MS Excel

Data Analyst

Linux World
Jaipur
06.2023 - 07.2023
  • Performed data cleaning and preprocessing on large datasets using Python pandas, handling missing values, removing duplicates, and standardizing data formats to ensure accuracy
  • Applied advanced pandas functions for data manipulation including type conversions, string operations, and data transformation to prepare datasets for analysis
  • Identified outliers and data anomalies during the cleaning process to maintain data quality standards and support reliable analytical outcomes
  • Collaborated with teams to understand data requirements and translate technical processes into actionable insights
  • Key Skills: Python, Pandas, Data Preprocessing, Data Validation, Jupyter Notebook

Education

B.Tech - Production Engineering

Veermata Jijabai Technological Institute
Matunga, Mumbai
05.2025

Skills

  • Data Validation & Quality Control: Ensured accuracy and integrity of large datasets through systematic validation, error detection, and quality assurance processes
  • Data Preprocessing: Applied best practices for transforming raw data (cleaning, normalization, outlier handling, feature engineering) for robust analysis and model building
  • Python Programming: Proficient in Python for data analysis, cleaning, visualization, and machine learning (experience with pandas, NumPy, scikit-learn, matplotlib, seaborn)
  • Statistical Analysis: Conducted statistical hypothesis tests, correlation analysis, and regression modeling for decision-making and insight generation
  • MS Excel: Spreadsheet management, basic formulas, simple data entry and visualization
  • SQL: Wrote efficient SQL queries for data extraction, aggregation, joins, and reporting from relational databases

Competitive Examination

  • MHT-CET: 94.78 percentile
  • High School (12th Grade): 84.5%

Projects

Smartphone Dataset Cleaning and Exploratory Data Analysis
Python, pandas, Data Cleaning, Data Visualization

  • Phase 1 – Data Cleaning: Performed comprehensive data cleaning on a raw smartphone dataset with 900+ records; removed duplicates, handled missing values, standardized inconsistent formats, normalized feature names, and corrected data type errors to ensure analysis-ready data quality
    (Cleaning Code: https://www.kaggle.com/code/karanguptaofiicial/smartphone-cleaning-code)
  • Shared the cleaned dataset publicly on Kaggle for community use and reproducibility
  • Phase 2 – Exploratory Data Analysis (EDA): Conducted in-depth EDA on the cleaned dataset to uncover trends, patterns, and insights related to smartphone pricing, brand preferences, and feature correlations; utilized matplotlib and seaborn for clear, informative visualizations
    (EDA Notebook: https://www.kaggle.com/code/karanguptaofiicial/insightful-eda-of-smartphone-dataset-with-stunning)
  • Published the full project notebook on Kaggle for public demonstration and peer review

Tools: Python, pandas, matplotlib, seaborn

Projects

Customer Churn Prediction
Python, Machine Learning, Data Analysis

  • Developed an end-to-end machine learning solution to predict telecom customer churn (7,000+ records, 20+ features) using real-world data
  • Conducted exploratory data analysis (EDA), identified key risk factors, and engineered features for modeling
  • Balanced target classes with SMOTE and compared Logistic Regression, Random Forest, and XGBoost classifiers
  • Selected the Random Forest model, which had the highest F1-score and recall (accuracy: 82%, ROC-AUC: 0.84)
  • Assessed model results using confusion matrix and feature importance; provided actionable insights for business retention strategies
  • Documented process, code, and insights with a comprehensive project report and technical README
  • Published the full project on Kaggle and uploaded code to GitHub for public review

Demo Notebooks & Code:

  • Kaggle Notebook: https://www.kaggle.com/code/karanguptaofiicial/telco-customer-churn-eda-to-ml-pipeline
  • GitHub Repository: https://github.com/ropkarangupta-hub/customer-churn-ml

Tools: Python, pandas, scikit-learn, imbalanced-learn, xgboost, matplotlib, seaborn
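
Illustrative sketch (not the project code): the metrics used above to compare the churn classifiers can all be derived from the four confusion-matrix counts. The function name and the counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative,
    and true-negative counts for the positive ("churn") class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Recall matters most in a churn setting when missing a customer who is about to leave costs more than contacting one who would have stayed, which is one plausible reason to favor it over raw accuracy.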

Kaggle Competition

Kaggle Playground Series S5E10: Road Accident Risk Prediction (2025)

  • Participated in the global Kaggle Playground Series S5E10 competition, focused on predicting the likelihood of road accidents from simulated data using machine learning.
  • Built an end-to-end machine learning pipeline covering data preprocessing, feature engineering, and gradient-boosted regression ensembles (XGBoost, LightGBM, CatBoost).
  • Ranked 1,508 on the official competition leaderboard among more than 2,500 participants.
  • Demonstrated practical skills in regression modeling, RMSE-based model evaluation, and teamwork in a large international competition.

Feature Selection and Statistical Filtering for High-Dimensional Data (Python, Machine Learning)

  • Executed end-to-end feature selection and statistical filtering across complex datasets, including the Human Activity Recognition and UCI SECOM manufacturing datasets, using Python (pandas, scikit-learn).
  • Applied duplicate column removal, variance thresholding, correlation analysis, the ANOVA F-test, and the Chi-squared test to reduce dimensionality and eliminate redundant features (from 561 and 592 features, respectively, to 10–100 key variables).
  • Enhanced model accuracy and performance (test accuracy improvement from 87% to 97%) by minimizing multicollinearity and data noise, resulting in more interpretable and efficient predictive models.
  • Demonstrated strong data preprocessing, feature engineering, and machine learning pipeline development skills applicable to industry-scale datasets and real-world analytical challenges.

Project Links:

  • Human Activity Recognition: https://www.kaggle.com/code/karanguptaofiicial/feature-selection-on-human-activity-dataset
  • UCI SECOM Manufacturing: https://www.kaggle.com/code/karanguptaofiicial/statistical-feature-filtering-for-secom-data
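
Illustrative sketch (not the notebook code): the first three filtering steps described above (duplicate column removal, variance thresholding, and correlation filtering) can be expressed in plain Python as follows. The function names, thresholds, and toy columns are assumptions made for this example; the linked notebooks use pandas and scikit-learn, and the supervised ANOVA F-test and Chi-squared steps are not covered here.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_features(columns, min_variance=0.0, max_abs_corr=0.95):
    """Return the names of the columns that survive all three filters.

    columns maps a feature name to its list of numeric values.
    """
    # Step 1: drop exact duplicate columns, keeping the first occurrence.
    seen, unique = set(), {}
    for name, values in columns.items():
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            unique[name] = values

    # Step 2: drop columns whose variance does not exceed the threshold
    # (with the default of 0.0 this removes constant columns).
    varied = {n: v for n, v in unique.items() if pvariance(v) > min_variance}

    # Step 3: drop a column if it is highly correlated with one already kept.
    kept = []
    for name, values in varied.items():
        if all(abs(pearson(values, varied[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Toy data, for illustration only.
features = {
    "a": [1, 2, 3, 4, 5],
    "a_copy": [1, 2, 3, 4, 5],         # exact duplicate of "a"
    "constant": [7, 7, 7, 7, 7],       # zero variance
    "a_scaled": [2, 4, 6, 8, 10.5],    # almost perfectly correlated with "a"
    "b": [5, 1, 4, 2, 3],              # weakly correlated with "a"
}
print(filter_features(features))
```

The correlation step is greedy: it keeps whichever column of a correlated pair appears first, so column order affects the result.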

Projects

Real Estate Data Cleaning
Gurgaon Properties

  • Performed comprehensive data cleaning on raw real estate listings for Gurgaon properties, including removal of duplicates, handling of missing values, correction of inconsistent formats, and normalization of key features
  • Engineered property features (such as location, size, price per square foot, amenities) to improve data quality and analysis
  • Documented workflow and results for review and future project reproducibility

Tools: Python, pandas, scikit-learn, matplotlib
