Vaibhav Rachalwar

Summary

Experienced Data Scientist in the US healthcare and retail e-commerce space, skilled in OCR, NLP, statistical analysis, and LLMs. Proficient in Python, SQL, and various ML frameworks, including PySpark for large-scale data processing. Adept at improving medical coding, optimizing data pipelines, and providing actionable insights. Passionate about finance, combining technical skills with business insight to drive data-driven solutions.

Overview

4 years of professional experience

Work History

Jr. Data Scientist

Datalink Software
07.2023 - 06.2024
  • Created 'Sherlock,' an AI platform using Python, Streamlit, OpenAI, and PyMuPDF, enhancing medical coding efficiency by 5-10 times
  • Utilized OpenCV and Tesseract for OCR to extract text from PDFs/XML, and applied NLP models like LayoutLMv3 and BERT for key information extraction
  • Implemented parallel processing to optimize performance and scalability, and collaborated with a teammate to integrate it with the Streamlit interface
  • Built logistic regression and random forest models to predict diseases
  • Used upsampling, hyperparameter tuning with grid search, and ensembling methods (bagging, AdaBoost, gradient boosting) to improve model performance
  • Created an AI document review system using OpenCV, pytesseract, fuzzywuzzy, and Pandas, including a web interface with Flask, HTML, and JavaScript for navigation and information extraction
  • Used image processing techniques like contour detection, dilation, and erosion to extract key information from complex layouts
  • Used NLP techniques like fuzzy string matching to categorize document sections, such as patient records and encounter details
  • Developed a bookmarking feature for saving document snippets with notes and OCR text, enhancing user experience and collaboration
  • Developed a Column Mapping Tool using Python, Streamlit, Pandas, and scikit-learn for efficient data integration
  • Implemented Levenshtein distance, Metaphone phonetic similarity, fuzzywuzzy, and BERT/DeBERTa embeddings to improve mapping accuracy
  • Created an intuitive Streamlit interface for file uploads, mapping reviews, and manual selections
  • Developed an interactive healthcare analytics dashboard using Streamlit, Pandas, and PySpark, efficiently processing datasets of 100,000+ members and over 1 million rows to provide actionable insights
  • Leveraged PySpark for initial data processing and complex aggregations, reducing query time for large-scale operations by 70%, while using Pandas for final data shaping and visualization
  • Employed statistical analysis techniques like Z-score and correlation for hypothesis testing, utilizing PySpark's MLlib for computations on full datasets to identify key performance indicators and trends
  • Integrated OpenAI's GPT-4 for generating accurate summaries and insights, and optimized code performance using Streamlit's @st.cache_data decorator and PySpark's distributed computing capabilities, reducing overall report generation time by 60%
  • Worked with stakeholders to develop quarterly roadmaps prioritized by impact, effort, and test coordination
  • Used advanced querying, visualization, and analytics tools to analyze and process complex data sets

ML Data Associate

Amazon
08.2020 - 08.2021
  • Collaborated with a team of 8 to automate labeling and object counting in Amazon Robotic Fulfillment Centers using AWS SageMaker, enhancing efficiency and decision-making
  • Worked primarily on labeling jobs; helped reduce takt time by 45% and generated a financial impact of INR 14.08 lakh per month
  • Trained and coached over 40 people as an Interim Trainer, contributing to team development and growth

Education

Master of Technology - Medical Device Innovation

Indian Institute of Technology
06.2023

Skills

    Languages – Python, SQL

    Key Competencies – ML, NLP, Deep Learning, LLM, Prompt Engineering, FastAPI, PySpark, ETL, Data Pipelines

    Developer Tools – Google Colaboratory, Jupyter Notebook, VS Code, Git

    UI – Streamlit, Gradio, Tkinter, Flask

    Database – MSSQL
