Neharika Mazumdar

Pune

Summary

Data Scientist and Machine Learning Engineer with 5+ years of experience across ML, LLMs, GenAI, and large-scale data engineering. Background in building ML pipelines, developing multimodal models, and working across NLP, computer vision, and real-time data systems. Experienced in high-dimensional data analysis and dimensionality reduction (PCA, MDS), with a research foundation in life sciences and bioinformatics. Thrive in autonomous, research-driven environments.

Overview

11
years of professional experience

Work History

Data Science Engineer

Cognologix Technologies
Pune
03.2021 - Current
  • Industrial Automation via OCR: Built end-to-end automation using DenseNet and PaddleOCR to extract tabular data from PDF diagrams and map it to real-time screen elements. Enabled automated actions within client tools, reducing manual mapping by 60–70%.
  • Multimodal RAG System: Built a pipeline using CLIP, ALIGN, and ColPali to generate joint embeddings from PDFs (text, images, tables). Integrated with Milvus and LLMs (OpenAI, Claude) to enable multimodal Q&A, achieving ~20–30% improvement in semantic answer accuracy through embedding comparisons and prompt tuning.
  • LLM Ops Integration: Evaluated Arize AI, Azure ML, and Evidently for monitoring drift, bias, and performance issues across deployed ML/LLM models; set up monitoring pipelines and alerts.
  • Toxicity Detection Tool: Designed a content moderation pipeline using BERT, DistilBERT, OpenAI GPT, and Google Perspective API; conducted comparative analysis of models across model hubs.
  • Data Engineering Projects: Built real-time data ingestion and aggregation pipelines using Apache Spark, Kafka, and Apache Flink.

Tech Stack: Python, Spark, Kafka, Azure ML, OpenAI, CLIP, ALIGN, HuggingFace, Google APIs, REST, PyTorch, TensorFlow, Flask

Data Science R&D Consultant

Fraudlens Inc.
San Francisco
12.2019 - 02.2021
  • Developed unsupervised clustering models (e.g., agglomerative clustering) to analyze semantic similarity and groupings across standardized dental codes, aiding taxonomy refinement.
  • Built fraud detection pipelines over large-scale dental insurance claims using scikit-learn classifiers, with engineered features from cleaned and normalized fields (names, addresses, codes).
  • Conducted EDA, feature engineering, and model evaluation to support fraud analytics and anomaly detection workflows.

Tech Stack: Python, scikit-learn, pandas, NumPy, SciPy, matplotlib, Jupyter, Regex

Graduate Research Assistant

San Francisco State University
San Francisco
09.2017 - 12.2020
  • Applied ML and graph algorithms to biomedical and social data for temporal modeling and network analysis.
  • Modeled viral transmission using MST-based seriation and dimensionality reduction (PCA, MDS) on high-dimensional genetic data.
  • Built HCV host-pathogen graphs to trace transmission pathways.
  • Analyzed the behavior of opioid users on Reddit using NLP and social network graphs; applied SVM and Random Forest for classification.
  • Developed an interactive web tool to visualize seriation and temporal progression in biomedical datasets.

Tech Stack: Python, BioPython, NetworkX, scikit-learn, Flask, PCA, MSTs, TF-IDF, Plotly

Software Engineer – Analytics Business Unit

Persistent Systems
Pune
11.2014 - 05.2016
  • Seagate Big Data Platform: Built batch processing jobs and managed multi-cluster Hadoop environments.
  • Cisco SMA: Built REST APIs and implemented NLP pipelines for analyzing product-related Twitter data.
  • Cloud Security Testing: Developed API testing tools and automation for security apps across AWS, Box, and Salesforce.

Tech Stack: Java, Python, Hadoop, Hive, Sqoop, Spark, Cassandra, Tableau, REST, NLTK, Selenium, TestNG

Education

M.S. - Computer Science

San Francisco State University
San Francisco, USA
09.2019

B.E. - Information Technology

Pune University
Pune, India
08.2014

Skills

  • Languages: Python, Java, SQL
  • Big Data & Infra: Apache Spark, Kafka, HDFS, Hive, Sqoop, Azure Synapse, AWS
  • ML/DL Libraries: scikit-learn, TensorFlow, PyTorch, NumPy, pandas, HuggingFace, LangChain, Transformers
  • Visualization: matplotlib, seaborn, Plotly, Tableau
  • DevOps & ML Ops: Azure ML, Arize AI, Evidently, Git, Docker
  • Others: REST APIs, Flask, web scraping, NLP, CLIP, ALIGN, OpenAI APIs, Google Perspective API

CAREER INTERESTS

Actively seeking roles in:

  • Data Science & Machine Learning (R&D & applied)
  • Deep Learning, LLMs, NLP, Multimodal AI
  • Global teams that value creativity, autonomy, and cross-disciplinary thinking
