Neharika Mazumdar

Pune

Summary

Data Scientist and Machine Learning Engineer with 5+ years of experience across ML, LLMs, GenAI, and large-scale data engineering. Background in building ML pipelines, developing multimodal models, and working across NLP, computer vision, and real-time data systems. Experienced in high-dimensional data analysis and dimensionality reduction (PCA, MDS), with a research foundation in life sciences and bioinformatics. Thrive in autonomous, research-driven environments.

Work History

Data Science Engineer

Cognologix Technologies

Pune

03.2021 - Current

Industrial Automation via OCR: Built end-to-end automation using DenseNet and PaddleOCR to extract tabular data from PDF diagrams and map it to real-time screen elements. Enabled automated actions within client tools, reducing manual mapping by 60–70%.
Multimodal RAG System: Built a pipeline using CLIP, ALIGN, and ColPali to generate joint embeddings from PDFs (text, images, tables). Integrated it with Milvus and LLMs (OpenAI, Claude) to enable multimodal Q&A, achieving a ~20–30% improvement in semantic answer accuracy through embedding comparisons and prompt tuning.
LLM Ops Integration: Evaluated Arize AI, Azure ML, and Evidently for monitoring drift, bias, and performance issues across deployed ML/LLM models; set up monitoring pipelines and alerts.
Toxicity Detection Tool: Designed a content moderation pipeline using BERT, DistilBERT, OpenAI GPT, and Google Perspective API; conducted comparative analysis of models across model hubs.
Data Engineering Projects: Built real-time data ingestion and aggregation pipelines using Apache Spark, Kafka, and Apache Flink.

Tech Stack: Python, Spark, Kafka, Azure ML, OpenAI, CLIP, ALIGN, HuggingFace, Google APIs, REST, PyTorch, TensorFlow, Flask

Data Science R&D Consultant

Fraudlens Inc.

San Francisco

12.2019 - 02.2021

Developed unsupervised clustering models (e.g., agglomerative clustering) to analyze semantic similarity and groupings across standardized dental codes, aiding taxonomy refinement.
Built fraud detection pipelines over large-scale dental insurance claims using scikit-learn classifiers, with engineered features from cleaned and normalized fields (names, addresses, codes).
Conducted EDA, feature engineering, and model evaluation to support fraud analytics and anomaly detection workflows.

Tech Stack: Python, scikit-learn, pandas, NumPy, SciPy, matplotlib, Jupyter, regex

Graduate Research Assistant

San Francisco State University

San Francisco

09.2017 - 12.2020

Applied ML and graph algorithms to biomedical and social data for temporal modeling and network analysis.
Modeled viral transmission using MST-based seriation and dimensionality reduction (PCA, MDS) on high-dimensional genetic data.
Built HCV host-pathogen graphs to trace transmission pathways.
Analyzed opioid-related user behavior on Reddit using NLP and social network graphs; applied SVM and Random Forest for classification.
Developed an interactive web tool to visualize seriation and temporal progression in biomedical datasets.

Tech Stack: Python, BioPython, NetworkX, scikit-learn, Flask, PCA, MSTs, TF-IDF, Plotly

Software Engineer – Analytics Business Unit

Persistent Systems

Pune

11.2014 - 05.2016

Seagate Big Data Platform: Built batch processing jobs and managed multi-cluster Hadoop environments.
Cisco SMA: Built REST APIs and implemented NLP pipelines for analyzing product-related Twitter data.
Cloud Security Testing: Developed API testing tools and automation for security apps across AWS, Box, and Salesforce.

Tech Stack: Java, Python, Hadoop, Hive, Sqoop, Spark, Cassandra, Tableau, REST, NLTK, Selenium, TestNG

Education

M.S. - Computer Science

San Francisco State University

San Francisco, USA

09.2019

B.E. - Information Technology

Pune University

Pune, India

08.2014

Skills

Languages: Python, Java, SQL
Big Data & Infra: Apache Spark, Kafka, HDFS, Hive, Sqoop, Azure Synapse, AWS
ML/DL Libraries: scikit-learn, TensorFlow, PyTorch, NumPy, pandas, HuggingFace, LangChain, Transformers
Visualization: matplotlib, seaborn, Plotly, Tableau
DevOps & MLOps: Azure ML, Arize AI, Evidently, Git, Docker
Others: REST APIs, Flask, web scraping, NLP, CLIP, ALIGN, OpenAI APIs, Google Perspective API

Career Interests

Actively seeking roles in:

Data Science & Machine Learning (R&D & applied)
Deep Learning, LLMs, NLP, Multimodal AI
Global teams that value creativity, autonomy, and cross-disciplinary thinking
