Projects
Creating Skills-to-Role Architecture for Identifying the Skill Gap in the Job Market [AWS, Redshift, Scraping, TensorFlow, spaCy, Flask, Informatica]
- Project focused on identifying the roles actively being hired for in a given industry and the skills needed for each of those roles
- Created a PDF extraction engine in Python for SKILLS-FRAMEWORK documents covering 33 industries, which contain detailed information on job roles, their responsibilities, and the proficiency and knowledge levels required
- Scraped LinkedIn and Indeed, and used Common Crawl, to collect active industry-wise job postings and their details
- Trained a sequential sentence classification model in AllenNLP to label each job-description sentence as a role, skill, experience, company, or other sentence
- Trained a separate spaCy NER model to extract technical, generic, and industry-specific skills from the skill sentences identified in the previous step (see the NER sketch after this project)
- After extracting data from both the skills framework and the job postings, designed the schema for this information and loaded it into Amazon Redshift
- Assisted in building multiple Power BI dashboards on the Redshift data to surface insights and identify skill gaps between the roles our client required and those their competitors required
- Utilized advanced querying, visualization and analytics tools to analyze and process complex data sets.
- Evaluated Informatica for the ELT process, but since the data was raw, unprocessed text it could not be used to its full extent; instead built a custom Python-based text preprocessor and cleaner
- Proposed a framework, built on the Redshift data, for evaluating and comparing skills across proficiency levels (e.g. skill level: beginner, intermediate, advanced)
- Since sentence-level context captures what a job role is, devised a sentence-level TF-IDF technique named Sentence Frequency IDF (SFIDF); a sketch follows this project
- Served the SFIDF results through a Flask endpoint
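A minimal sketch of the skill-extraction step with a trained spaCy NER model; the model path and entity labels below are illustrative assumptions, not the project's actual names.

```python
# Minimal sketch of applying a custom-trained spaCy NER model to skill sentences.
# The model path "skill_ner_model" and the label names it would emit are assumed.
import spacy

nlp = spacy.load("skill_ner_model")  # custom pipeline trained on annotated skill sentences

def extract_skills(skill_sentences):
    """Return (text, label) pairs for every skill entity found."""
    skills = []
    for doc in nlp.pipe(skill_sentences):
        skills.extend((ent.text, ent.label_) for ent in doc.ents)
    return skills

print(extract_skills(["Experience with TensorFlow and Redshift is required."]))
```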
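A hedged sketch of one plausible reading of SFIDF, where each sentence is treated as its own "document" so the IDF term becomes an inverse sentence frequency; the exact weighting the project used is not specified above.

```python
# Sentence-level TF-IDF ("SFIDF") sketch: sentences act as the documents, so the
# IDF component counts how many sentences a term appears in. Assumed interpretation.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Design and maintain ETL pipelines on Redshift",
    "Build NER models to extract skills from postings",
    "Maintain dashboards for skill gap analysis",
]

vectorizer = TfidfVectorizer()            # term frequency x inverse sentence frequency
sfidf_matrix = vectorizer.fit_transform(sentences)
print(sfidf_matrix.shape)                 # (num_sentences, vocabulary_size)
```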
Text Analytics for a Live User Engagement Platform [Python, Flask, BERT, Google Cloud NLP]
- Created two Flask endpoints, one for sentiment analysis and one for word-cloud generation
- To increase engagement on our in-house web-based meeting/webinar platform, built two tools that let meeting hosts ask attendees questions about the session or any general topic
- Based on the answers received live, a rule-based algorithm decides how to visualize the results
- Depending on which rules are satisfied, either a word cloud is generated or a clustering algorithm is applied
- Used SBERT (Sentence-BERT) for clustering, which provides sentence-level embeddings (see the sketch after this project)
- Hosted both tools with Flask on an AWS EC2 instance
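A minimal sketch of the SBERT clustering step; the checkpoint name and the use of KMeans are assumptions, since the exact clustering algorithm is not stated above.

```python
# Embed live answers at the sentence level with SBERT and group them.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

answers = [
    "More hands-on demos please",
    "Loved the live coding section",
    "Could we get the slides afterwards?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint
embeddings = model.encode(answers)                # one vector per answer
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)                                     # cluster id per answer
```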
GUI Tool for Business Stakeholders and Skilling Sciences Team [PyQt, Qt5 Framework, Python, SQL]
- The skilling sciences and engineering services teams both needed a tool to access the database and perform editing and manual vetting before data was showcased to each client
- Proposed an in-house GUI (graphical user interface) solution that could be built within the given timeline and avoid the cost of outsourcing the whole project
- Worked with stakeholders to develop quarterly roadmaps based on impact, effort, and test coordination
- Created a GUI annotation tool from scratch per the requirements and connected it to an event-driven backend (a minimal PyQt sketch follows this project)
- Each user's tag controls which data they can access and annotate, and users are notified when data becomes available for them to annotate
- First person, and first project, in the organization to adopt Test-Driven Development (TDD)
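A minimal PyQt5 sketch in the spirit of the annotation tool: a text box, a label dropdown, and a save button wired to a slot. The widget names and labels are illustrative; the real tool connected to an event-driven backend with role-based access.

```python
import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QVBoxLayout,
                             QComboBox, QPlainTextEdit, QPushButton)

class Annotator(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Skill Annotator")
        self.text = QPlainTextEdit("Paste the sentence to annotate here")
        self.label = QComboBox()
        self.label.addItems(["tech skill", "generic skill", "industry skill"])
        save = QPushButton("Save annotation")
        save.clicked.connect(self.save)          # event-driven: button click -> slot
        layout = QVBoxLayout(self)
        for widget in (self.text, self.label, save):
            layout.addWidget(widget)

    def save(self):
        # In the real tool this would write to the backend / database.
        print(self.text.toPlainText(), "->", self.label.currentText())

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = Annotator()
    window.show()
    sys.exit(app.exec_())
```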
Data and AI Pipeline [Google Cloud (Composer, Container Registry, Kubernetes, Airflow, Cloud Storage, SQL Server), Shell Scripting, Docker, Python]
- Preprocessed the raw job postings collected from multiple sources
- Defined the schemas for the data lake [SQL Server] and the data warehouse [BigQuery]
- Built and deployed a set of 6 AI models, including sequential sentence classification to tag each posting sentence as role-, skill-, or experience-related, NER to extract skills, a model to classify the industry a posting belongs to, and a model to predict its occupation, then populated the data lake with the results
- Orchestrated all of these steps with Google Cloud Composer (Airflow) along with other services from the Google Cloud stack (a minimal DAG sketch follows this project)
- Created a Kubeflow pipeline to parallelize the processes and control the resources (RAM, storage, vCPUs) allotted to each of them
- The main aim was to become familiar with the Google Cloud stack for data engineering
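A minimal Airflow DAG sketch of the orchestration described above; the task names and callables are placeholders for the real preprocessing, model, and load steps run on Cloud Composer.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def preprocess(): ...
def classify_sentences(): ...
def extract_skills(): ...
def load_data_lake(): ...

with DAG(
    dag_id="job_posting_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t2 = PythonOperator(task_id="classify_sentences", python_callable=classify_sentences)
    t3 = PythonOperator(task_id="extract_skills", python_callable=extract_skills)
    t4 = PythonOperator(task_id="load_data_lake", python_callable=load_data_lake)
    t1 >> t2 >> t3 >> t4   # linear ordering; the real DAG fans out across 6 models
```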
MLOps Pipeline for Classification Model [GCP (Cloud Build, Cloud Functions, Pub/Sub, GKE), Git, Python, Docker, Flask]
- Built an automated CI/CD pipeline for an image classification model
- Configured Cloud Build for the given image classification source code
- Once a build completes successfully, a message is pushed to a Cloud Pub/Sub topic
- Pub/Sub delivers the message to a Cloud Function, which extracts the deployment information from the message and triggers a new deployment on the Kubernetes cluster in GKE (sketched after this project)
- Kubernetes rolls the new image out to production; the entire flow runs from a single Git commit
- The design lets ML engineers iterate on the model architecture without worrying about deployment for a given problem statement
- All they have to do is push the latest code to Git and the model gets deployed
- Optimization: the initial build time was 20 minutes
- Developed a multi-stage Docker image and pushed it to GCR (Google Container Registry)
- Reduced the Kubernetes deployment steps by preloading constants; after optimization, the build time dropped to 4 minutes
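A hedged sketch of the Pub/Sub-triggered Cloud Function step (gen1 background-function signature) that reads an image tag from the build message and patches a GKE deployment via the Kubernetes Python client; the message fields, deployment name, and credential setup are assumptions, not the original function.

```python
import base64
import json
from kubernetes import client, config

def deploy_on_build(event, context):
    """Triggered by a build notification published to Pub/Sub (assumed payload shape)."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    image = message["image"]                  # e.g. gcr.io/<project>/classifier:<tag>

    # Credential setup is simplified here; inside a Cloud Function you would build
    # GKE credentials explicitly rather than rely on a local kubeconfig.
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "image-classifier", "image": image}]}}}}
    apps.patch_namespaced_deployment(
        name="image-classifier", namespace="default", body=patch)
```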
Job Demand Report [GCP (Cloud Build, Container Registry), Git, Python, Docker, FastAPI, Alteryx]
- Started as a POC; our data pipeline collects more than 100k job postings per month
- To put this large volume of data to good use, the idea was to develop a job demand report
- The report covers two views of the data: the most in-demand occupations in the market and the trending occupations in the market (a serving sketch follows this project)
- The pipeline also extracts skills from the job postings, which are reflected in the report in the same two ways as occupations
- Created highly configurable and repeatable workflows using Alteryx for transforming the data lake
- Once the POC was done, stakeholders loved the insights displayed in the report
- Given their interest, the POC was taken to the next step of building an interactive user interface
- Developed the wireframe in Figma and led a team of 3: a frontend developer, a backend engineer, and a data engineer
- Delivered the project successfully and monitored the product's adoption across clients
- Clients were happy with the stories and statistical results we produced, so the organization decided to turn this into a marketing tool to attract new clients
- Worked with the marketing team to integrate the application with Salesforce to drive our customer lead-generation engine
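A minimal FastAPI sketch of how the report data could be served; the endpoint path, response fields, and in-memory sample rows are illustrative assumptions, not the production API.

```python
from fastapi import FastAPI

app = FastAPI(title="Job Demand Report")

# Illustrative in-memory sample; the real report is built from the data lake.
DEMAND = [
    {"occupation": "Data Engineer", "postings": 1840},
    {"occupation": "ML Engineer", "postings": 1320},
]

@app.get("/occupations/in-demand")
def most_in_demand(limit: int = 10):
    """Return the occupations with the most postings this month."""
    return sorted(DEMAND, key=lambda row: row["postings"], reverse=True)[:limit]
```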
PROJECTS [Part of M.Tech & UG]
Invoice Categorization
- Large logistics firms find it very difficult to keep track of invoices, especially those in scanned PDF format
- The dataset contained scanned copies of 5655 invoices and the test set contained 1394 invoices
- The model I created automates extracting the text from the scanned copies and categorizing each invoice into a given product category
- It is a multi-class random forest model that correctly predicted 1392 out of 1394 invoices in the test file (a pipeline sketch follows this project)
- Overall model accuracy was 99.59% on the given public dataset
- Notably, the model can identify a class that occurred only twice in the entire dataset; a deliberate split placed one data point of that class in the training set
- The model correctly classified the second data point of that class during testing
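A hedged sketch of the invoice categorization approach: extracted invoice text vectorized with TF-IDF and classified with a random forest. The OCR step is omitted, and the feature choice and hyperparameters are assumptions; only the multi-class random forest is stated above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative sample; the real data is OCR text from scanned invoices.
invoice_texts = ["freight charges container shipment", "office chairs and desks invoice"]
categories = ["logistics", "furniture"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
model.fit(invoice_texts, categories)
print(model.predict(["monthly container freight bill"]))
```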
Multi Label Text Classification
- Toxic Comment Classification with an accuracy of 98%
- Created a custom tokenizer for the tokenization step
- Selected a classifier that can handle multiple labels and predict the probability of each label for a new comment
- Used a pipeline to handle vectorization and classification together, which also reduced the model's training time (see the sketch below)
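A hedged sketch of a multi-label comment-classification pipeline: TF-IDF vectorization chained with a one-vs-rest logistic regression. The exact classifier is not named above, so OneVsRest with LogisticRegression is an assumption; a custom tokenizer could be passed to the vectorizer.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

comments = ["you are awful", "have a great day", "awful and rude behaviour"]
labels = [[1, 0], [0, 0], [1, 1]]                  # illustrative columns: toxic, insult

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                  # a custom tokenizer can be passed via tokenizer=
    ("ovr", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(comments, labels)
print(clf.predict_proba(["so rude"]))              # per-label probabilities
```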
Banking Behavioral Scorecard for Internal Liability Customers
- A banking behavioral scorecard is a model maintained for a customer based on their liability transactions
- Liability transactions are those made by an internal customer of a bank
- Internal customers of a bank are customers who hold a savings account (SA) with the bank
- Customers pay loans in equated monthly installments (EMIs)
- Loans are paid through post-dated cheques
- Customers can also pay through the Electronic Clearing System (ECS) or standing instructions to debit their HDFC Bank account for the EMI amount
- The customer risk profile is the probability of a customer defaulting on their EMI payments
- I was given an imbalanced binary classification dataset
- One class accounts for 2% of the data and the other for the remaining 98%
- There were 2395 anonymised feature columns; through exploratory data analysis I found 1632 columns carrying most of the feature importance
- Using these, I built a voting classifier and achieved an accuracy of 85.87 on the private dataset and 90.70 on the public dataset (a sketch follows)
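A hedged sketch of the voting-classifier setup on an imbalanced binary problem; the constituent estimators, their settings, and the synthetic data are assumptions, since only the voting classifier and the 2%/98% class split are stated above.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for the anonymised, imbalanced dataset (~2% positive class).
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.98, 0.02], random_state=42)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",          # average predicted probabilities across models
)
voter.fit(X, y)
print(voter.predict_proba(X[:3]))
```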