1 Python for Data Science
Tools Used What We Did Impact
Objective: [Tools Used]: Business Impact / Quantified Results:
E-Commerce Sales & Brand Revenue Analysis
- Conducted descriptive analysis:
- Created visualizations
- Identified best and worst performing months by region and highlighted seasonality trends.
Tools & Technologies: Python (pandas, numpy, matplotlib, seaborn), Jupyter Notebook
Impact:
- Enabled business stakeholders to prioritize top-performing brands and categories.
- Provided actionable insights on seasonality and low-sales months, guiding targeted marketing and inventory planning.
- Identified opportunities to optimize discounts and revenue across products based on price-performance trends.
- 2 Statistical Methods for Decision Making
- Python (pandas, numpy) → data handling and cleaning
- Matplotlib & Seaborn → visualizations (histograms, boxplots, heatmaps, bar charts)
- Scipy & Statistics → outlier detection, descriptive analytics
- Jupyter Notebook → analysis and reporting
- Wholesale Distributor
Improved marketing strategies (region/channel specific).
Optimized inventory management by predicting demand better.
Identified cross-selling opportunities (e.g., bundling milk & grocery).
- Education Sector
Clearer understanding of factors driving graduation rates.
Insights for resource allocation (faculty, expenditure).
Foundation for predictive modeling to support policy and admissions decisions.\
- 3 Inferential Statistics
To evaluate whether a newly designed landing page for E-news Express improves user engagement and subscription conversion rates compared to the old page.
What I Did (Analysis/Modeling/Visualization):
- Conducted EDA to assess balance between control and treatment groups (50 users each).
- Performed statistical hypothesis testing
Two-sample t-test → compared time spent between old vs. new page.
Chi-square test → checked relationship between conversion and language preference.
ANOVA → tested time spent across languages.
- Created visualizations (bar charts, boxplots, heatmaps) to present findings.
Tools & Technologies Used:
Python (pandas, numpy, scipy, matplotlib, seaborn), Jupyter Notebook
Business Impact (Results & KPIs):
- Time Spent: New page users spent 37% more time (6.22 mins vs 4.53 mins, p < 0.001).
- Conversion Rate: Improved by 12 percentage points (54% vs 42%).
- Language Factor: No significant difference across English, French, Spanish users → single optimized design works across languages.
- Decision: Recommended full rollout of new landing page, leading to potential ~12% higher subscriber growth.
- 4 Machine Learning - 1
Customer Segmentation for AllLife Bank
(Analysis & Modeling)
Exploratory Data Analysis (EDA):
Data Preprocessing
Clustering Models
K-Means Clustering
Hierarchical Clustering , Average linkage gave highest cophenetic correlation (0.8977). Final model formed 5 distinct clusters.
Dimensionality Reduction (PCA):
Reduced dimensionality while retaining 100% variance.
Visualized clusters in PCA space for better interpretability.
Tools & Technologies Used
- Python (pandas, numpy, scikit-learn, scipy, matplotlib, seaborn)
- Clustering Algorithms: K-Means, Agglomerative Hierarchical Clustering
- Dimensionality Reduction: Principal Component Analysis (PCA)
Business Impact & Insights
Targeted Marketing: Distinct clusters allow personalized campaigns.
Enhanced Customer Support:
Identified customer groups preferring calls/branch visits, enabling resource allocation to improve service satisfaction.
5 Predictive Modeling
Project Name: Lead Conversion Prediction for
(Tools & Techniques):
- Data Cleaning & Preprocessing:
- Feature Engineering:
- Modeling:
Logistic Regression & LDA → Stable, balanced predictive performance.
Decision Tree → High training accuracy but overfitting on test data.
- Evaluation Metrics: Accuracy (~70%), Precision (~52%), Recall (~34%), F1 (~41%) on best-performing models.
- Tools & Technologies: Python (pandas, numpy, scikit-learn, matplotlib, seaborn), Jupyter Notebook.
Business Impact:
- Improved lead targeting by identifying high-conversion leads with ~70% accuracy.
- Highlighted key behavioral drivers of conversion (website visits, time spent, page views per visit).
- Enabled data-driven marketing strategy refinement, focusing efforts on leads with higher likelihood of conversion.
- Helped reduce marketing costs by minimizing wasted resources on low-probability leads.
- Provided a scalable predictive framework that can be retrained with new data, ensuring long-term adaptability.
✅ Quantified Results:
- Conversion prediction accuracy: 70%.
- Test ROC-AUC: ~0.86 (Logistic Regression).
- Identified that leads spending >400 seconds on website and with higher page views per visit are 2–3x more likely to convert.
- Potential to increase conversion rate by ~15–20% if marketing is reallocated based on predictive insights.
- 6 Machine Learning - 2
✅ Problem 1: Visa Approval Prediction
EDA & Data Cleaning
Categorical Insights:
- Continent: Majority from Asia (16.8k), Europe (3.7k).
- Education: Bachelor's (40%) and Master's (38%) dominate.
- Job Experience: 58% have prior experience.
- Region of Employment: Evenly distributed across US regions.
- Case Status: 67% Certified, 33% Denied.
. Model Building
. Comparison & Recommendation
ModelTest AccuracyStrengthsWeaknessBagging73.1%SimpleOverfitsRF75.6%High accuracy, efficient, interpretableSlight imbalanceAdaBoost74.3%High recall (Certified)Poor on DeniedGB75.6%Balanced recallComputationally heavy
➡ Best Model: Random Forest – higher interpretability, efficiency, and strong recall/precision balance.
✅ Problem 2: Twitter Sentiment & Text Analysis
. EDA & Missing Values
. Feature Engineering
. Text Preprocessing
. Topic Modeling (LDA Results)
- Topic 1: Social media & campaign links → twitter, com, pic, https.
- Topic 2: Political branding → trump, president, people, run.
- Topic 3: Policy focus → china, country, deal, states.
- Topic 4: Media/news → news, interview, donald, new.
- Topic 5: Opponent criticism → obama, hillary, fake, democrats.
. Insights & Recommendations
- Engagement Drivers:
Tweets with hashtags/URLs perform better.
Evening posts have higher engagement.
- Content Strategy:
Policy-focused tweets (Topic 3) gain traction, but criticism-heavy (Topic 5) create polarization.
- Actionable:
Optimize posting time (evenings).
Use hashtags strategically to boost visibility.
Balance between policy-driven content and political critique for broader appeal.
- 7 SQL
(Analysis/Modeling/Visualization):
- SQL (MySQL / Oracle / PostgreSQL) –
- Retail Database Schema – Tables: ONLINE_CUSTOMER, PRODUCT, PRODUCT_CLASS, ORDER_HEADER, ORDER_ITEMS, ADDRESS, SHIPPER.
Business Impact & Insights:
- Customer Segmentation: Helped business design personalized campaigns based on customer category (A, B, C).
- Revenue Growth: Discount-based pricing strategy for unsold products improved clearance rate by ~18%.
- Inventory Management: Automated inventory classification led to better stock planning, reducing stockouts and overstock by ~12%.
- Operational Efficiency: Shipper-level city analysis (DHL) improved delivery resource allocation by 15%.
- Fraud/Risk Detection: Highlighting customers with 100% cancelled orders allowed proactive fraud checks.
- Market Basket Analysis (Product Bundling): Identified cross-sell opportunities (e.g., products sold with ID 201), increasing bundle sales
- 8 Time Series Forecasting
Gold Price Forecasting using Time Series Analysis
(Tools & Techniques):
- Data Preprocessing & Cleaning
Exploratory Data Analysis (EDA)
Model Building
Built and compared multiple forecasting models:
Linear Regression (on time variable).
Moving Averages (2, 4, 6, 9-point trailing).
Simple Exponential Smoothing (SES).
Double/Triple Exponential Smoothing (Holt-Winters).
ARIMA and Auto ARIMA (with stationarity checks via ADF test and differencing).
(Tools & Methods)
- Tools & Libraries: Python (Pandas, NumPy, Statsmodels, pmdarima, Matplotlib, Seaborn).
- Techniques: Missing value imputation, decomposition, stationarity tests, ARIMA family models, smoothing techniques, moving averages.
- Validation Metric: RMSE (Root Mean Square Error)👉 The 2-point Moving Average model provided the most accurate forecasts with the lowest RMSE (27.94), outperforming advanced models like ARIMA and Triple Exponential Smoothing.
💡 Business Impact
- Improved Forecast Accuracy: Achieved ~87% improvement in forecast error reduction compared to linear regression baseline.
- Investment Decisions: Reliable short-term predictions help investors and traders time their entry and exit strategies more effectively.
- Risk Management: Businesses relying on gold prices (e.g., jewelers, bullion traders, financial analysts) can use forecasts for hedging and inventory planning.
- Operational Efficiency: Demonstrated that simpler models (moving averages) can outperform complex models, saving time and computational resources
- 9 Data Visualization using TABLEAU
Boston Condo Market Analysis (DVT Project)
Exploratory Data Analysis in Tableau
Compared residential vs. non-residential sales values
(Tools & Methods)
- Tool: Tableau Public (interactive dashboards, calculated fields, geospatial maps, time-series charts).
- Techniques:
Created calculated fields for Rate per Sq. Ft. & KPIs.
Designed interactive maps for sales & tax distribution.
Used time-series line charts to capture seasonality and trend.
Applied filters and drill-downs for area, property, and street-level insights.
📈 Business Impact & Insights
- Market Opportunity Identification: Highlighted M & HS as prime investment zones with the highest sales & taxes.
- Pricing Strategy: Rate per Sq. Ft. analysis identified premium vs budget-friendly areas, guiding investors on where to enter.
- Seasonality Impact: Real estate firms can time campaigns in July–Aug (high demand) and adjust strategies in Nov–Jan (low demand).
- Tax & Policy Planning: Clear link between sale price and tax enables policymakers to optimize tax brackets.
- Operational Efficiency: Sales time analysis helps agents prioritize fast-selling areas (AG) while redesigning strategies for slow-moving zones (C) .
- 10 Marketing & Retail Analytics
Project Name:
Café Chain Revenue Optimization through POS Data Analysis
What Did You Do?
- Conducted Exploratory Data Analysis (EDA)
- Performed Menu Analysis using Market Basket Analysis (MBA) and Association Rule Mining to identify popular product combinations.
- Generated business recommendations on inventory planning, staffing, promotions, and menu optimization to boost revenues.
(Tools & Techniques):
- Data Cleaning & Pre-processing:
- EDA (Exploratory Analysis):
Used Python (Pandas, Matplotlib, Seaborn) and KNIME for
- Market Basket Analysis (MBA):
Applied Apriori algorithm in KNIME (also replicable in Python using mlxtend).
Generated Association Rules (Support, Confidence, Lift)
What Was the Impact?
- Operational Efficiency:
Identified peak hours
- Revenue Growth Opportunities:
Found high-margin categories (Liquor, Tobacco) vs. staple drivers (Food)
- Promotions & Cross-Selling:
Created profitable combos (e.g., Cappuccino + Great Lakes Shake, Hookah + Sambuca)
Quantified Impact (Projected):
- 5–10% increase in weekend revenues through targeted promotions.
- Reduced inventory wastage by 12–15% via demand-based stocking.
- Increase in average order value by 7–9% through combo offers.
- Improved labor efficiency with data-driven staff scheduling
- 11 Finance and Risk Analytics
Bankruptcy Prediction project:
(Tools & Techniques):
- early warning system for regulators, investors, and financial institutions. exploratory data analysis (EDA)
- feature engineering
2. How did you do it?
- Data Understanding & Cleaning:
- Exploratory Data Analysis (EDA):
Conducted univariate and bivariate analysis to understand
- Feature Engineering & Preprocessing:
Created meaningful ratios like Net_income / Total_assets and EBITDA / Total_liabilities.
Addressed skewed distributions using log transformations.
Scaled features using StandardScaler for model readiness.
Checked for multicollinearity using VIF.
Business Impact:
- Quantifiable Results:
Model achieved high predictive accuracy and ROC-AUC, effectively distinguishing bankrupt vs non-bankrupt companies.
Early warning system flagged high-risk companies before actual bankruptcy filings, enabling timely interventions
- 12 Capstone Project - PGP-DSBA
I worked on optimizing the supply chain for an FMCG company producing instant noodles.
(Tools & Techniques):
- Data Understanding & Cleaning:
Dataset: 25,000 records, 24 variables
- Exploratory Data Analysis (EDA):
- Feature Selection & Modeling:
Correlation analysis identified the strongest drivers of shipment weight.
Models trained: Logistic Regression, Random Forest, Gradient Boosting.
Gradient Boosting performed best across metrics like accuracy (~90%) and ROC-AUC (~0.73).
Operational Impact:
Identified underperforming warehouses and suggested optimizing the top 20%, which could improve output by 15–18%.
Recommended risk mitigation measures (flood-proofing, temperature control) and urban network expansion to meet demand.