
Imaya Bharathi

Data Scientist
Chennai

Summary

Full-stack Data Scientist with a strong computer science and machine learning background and a special focus on Natural Language Processing (NLP), predictive analysis and data modeling. Involved in the Python open-source community, with a strong interest in NLP, deep learning and transfer learning. Experienced with both batch and stream processing, and familiar with key concepts of cryptography and ethical hacking. Creative thinker with strong storytelling and data visualization skills; a natural team player and mentor with a firm command of data engineering tools and model architecture. Experienced in identifying opportunities and strategizing methods for improvement; detail-oriented, methodical and enterprising, with a strong focus on devising and running effective processes.

Overview

4 years of professional experience
6 years of post-secondary education
5 certifications

Work History

Data Scientist

Disprz
Chennai
03.2020 - Current

Projects


Creating a Skills-to-Role Architecture for Identifying the Skill Gap in the Job Market [AWS, Redshift, Scraping, TensorFlow, spaCy, Flask, Informatica]

  • The project focuses on identifying the roles actively hired for in a given industry and the skills those roles require
  • Created a PDF extraction engine in Python for SKILLS-FRAMEWORK documents covering 33 industries, which contain detailed information on job roles, their responsibilities and the proficiency and knowledge levels required
  • Scraped LinkedIn and Indeed, and used Common Crawl, to collect active job postings and their details by industry
  • Used AllenNLP to train a sequential sentence classification model that labels each job-description sentence as a role, skill, experience, company or other sentence
  • Trained a separate spaCy NER model to extract technical, generic and industry-specific skills from the skill sentences identified in the previous step
  • After extracting data from both the skills framework and the job postings, designed the schema for this information and loaded it into Amazon Redshift
  • Helped build multiple Power BI dashboards on the Redshift data to visualize the clusters and highlight skill gaps between the roles our clients require and those their competitors require
  • Utilized advanced querying, visualization and analytics tools to analyze and process complex data sets
  • Evaluated Informatica for the ELT process; because the data was raw, unprocessed text, Informatica could not be used to its full extent, so we built a custom Python-based text preprocessor and cleaner
  • Using the Redshift data, proposed a framework for skill evaluation and comparison across levels (e.g., beginner, intermediate, advanced)
  • Since sentence-level context captures what a job role is, we devised a sentence-level variant of TF-IDF that we named Sentence Frequency IDF (SFIDF); a sketch follows this list
  • Served SFIDF as an endpoint using Flask
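A minimal sketch of the SFIDF idea, assuming each posting sentence is treated as its own TF-IDF document (scikit-learn plus NLTK); the function name, tokenization and parameters here are illustrative, not the production code.

```python
# Hypothetical sketch of sentence-level TF-IDF ("SFIDF"): every sentence of a
# posting is treated as a TF-IDF "document". Names and parameters are assumptions.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def sfidf_matrix(postings):
    """Split postings into sentences and fit TF-IDF over those sentences."""
    nltk.download("punkt", quiet=True)                 # sentence tokenizer model
    sentences = [s for p in postings for s in nltk.sent_tokenize(p)]
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    return vectorizer, vectorizer.fit_transform(sentences)

vectorizer, matrix = sfidf_matrix(["Designs ML pipelines. Works with AWS Redshift."])
```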


Text Analytics for a Live User Engagement Platform [Python, Flask, BERT, Google Cloud NLP]

  • Created two Flask endpoints for the tasks of sentiment analysis and word cloud generation
  • To increase engagement on our built-in web-based meeting/webinar platform, we created two tools that let meeting hosts ask questions about the session, or any question in general
  • Based on the answers received live, a rule-based algorithm decides how to visualize the results
  • Depending on which rules are satisfied, either a word cloud is generated or a clustering algorithm is applied
  • For clustering we used SBERT (Sentence-BERT), which produces sentence-level embeddings; a sketch of the clustering endpoint follows this list
  • Hosted both tools with Flask on an AWS EC2 instance
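A minimal sketch of how the clustering path of such an endpoint could look, assuming the sentence-transformers library, scikit-learn KMeans and Flask; the checkpoint name, route and cluster count are illustrative assumptions.

```python
# Hypothetical clustering endpoint: SBERT sentence embeddings + KMeans behind Flask.
# The model checkpoint, route and cluster count are assumptions.
from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

app = Flask(__name__)
model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed SBERT checkpoint

@app.route("/cluster", methods=["POST"])
def cluster_answers():
    answers = request.get_json()["answers"]          # live answers from the hosts' questions
    embeddings = model.encode(answers)               # sentence-level embeddings
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(embeddings)
    return jsonify({"labels": labels.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```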


GUI Tool for Business Stakeholders And Skilling Sciences Team [PyQt, Qt5 Framework, Python, SQL]

  • Both the skilling sciences and engineering services teams needed a tool to access the database and perform editing and manual vetting before showcasing the data from client to client
  • Proposed an in-house GUI (Graphical User Interface) solution that could be built within the given timeline and avoid the cost of outsourcing the whole project
  • Worked with stakeholders to develop quarterly roadmaps based on impact, effort and test coordination
  • Built a GUI annotation tool from scratch to the requirements and connected it to an event-driven backend; a sketch follows this list
  • Each user can annotate only the data their tag grants them access to, and is notified when data becomes available for them to annotate
  • First person, and first project, in the organization to adopt Test-Driven Development (TDD)
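A minimal PyQt5 sketch of the event-driven annotation idea (a button's clicked signal wired to a save handler); the widgets, labels and handler are illustrative, not the tool's actual design.

```python
# Hypothetical PyQt5 annotation window; widget names, labels and the save
# handler are placeholders. The real tool persists annotations to the SQL backend.
import sys
from PyQt5.QtWidgets import (QApplication, QComboBox, QPushButton,
                             QTextEdit, QVBoxLayout, QWidget)

class AnnotationWindow(QWidget):
    def __init__(self):
        super().__init__()
        layout = QVBoxLayout(self)
        self.record = QTextEdit()                     # record being vetted/edited
        self.decision = QComboBox()
        self.decision.addItems(["approve", "edit", "reject"])
        save = QPushButton("Save annotation")
        save.clicked.connect(self.save_annotation)    # event-driven: signal -> slot
        for widget in (self.record, self.decision, save):
            layout.addWidget(widget)

    def save_annotation(self):
        # Placeholder: the real handler would write to the database.
        print(self.decision.currentText(), self.record.toPlainText())

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = AnnotationWindow()
    window.show()
    sys.exit(app.exec_())
```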


Data and AI Pipeline [Google Cloud (Composer, Container Registry, Kubernetes, Airflow, Cloud Storage, SQL Server), Shell Scripting, Docker, Python]

  • Preprocessed the raw job postings collected from multiple sources
  • Defined the schema for the Data Lake [SQL Server] and Data Warehouse [BigQuery]
  • Built and deployed a set of six AI models: sequential sentence classification to map each posting sentence to role-, skill- or experience-related, NER to extract skills, a model to classify the industry a posting belongs to and a model to predict its occupation, and used their outputs to populate the data lake
  • Orchestrated all of this with Google Cloud Composer (Airflow) alongside the rest of the Google Cloud stack; a DAG sketch follows this list
  • Created a Kubeflow pipeline to parallelize all steps and to control the resources (RAM, storage, vCPUs) each step uses
  • The main aim was to become familiar with the Google stack for data engineering
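A minimal sketch of what the Composer/Airflow orchestration could look like, assuming Airflow 2.x PythonOperators; the DAG id, schedule and task callables are illustrative placeholders.

```python
# Hypothetical Airflow DAG for the posting pipeline; ids, schedule and callables
# are placeholders, not the production DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def preprocess_postings():
    """Placeholder: clean raw postings from the collection sources."""

def run_ai_models():
    """Placeholder: sentence classifier, NER, industry and occupation models."""

def load_data_lake():
    """Placeholder: write model outputs to the data lake / warehouse."""

with DAG(
    dag_id="job_posting_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_postings)
    models = PythonOperator(task_id="run_models", python_callable=run_ai_models)
    load = PythonOperator(task_id="load_data_lake", python_callable=load_data_lake)
    preprocess >> models >> load
```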


MLOps Pipeline For Classification Model [GCP (Cloud Build, Cloud Functions, Pub/Sub, GKE), GIT, Python, Docker, Flask]

  • An automated CI/CD pipeline for image classification
  • The source code for the image classification model was given, and a Cloud Build trigger was configured for it
  • Once a build completes successfully, a message is pushed to a Cloud Pub/Sub topic
  • Pub/Sub delivers the message to a Cloud Function, which extracts the deployment information from the message and triggers a new deployment on the Kubernetes cluster in GKE; a sketch follows this list
  • Kubernetes rolls the new image out to production; the entire flow is driven by a single Git commit
  • The pipeline was designed so that ML engineers can iterate on the model architecture without worrying about deployment for a given problem statement
  • All they have to do is push the latest code to Git, and the model gets deployed
  • Optimization: initially the build time was 20 minutes
  • Developed a multi-stage Docker image and deployed it to GCR (Google Container Registry)
  • Reduced the Kubernetes deployment steps by preloading constants; after optimization, the build time dropped to 4 minutes
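A minimal sketch of the Pub/Sub-triggered step, assuming the Cloud Build status message format and the kubernetes Python client; the deployment name, namespace and credential handling are illustrative assumptions.

```python
# Hypothetical Cloud Function (background function) subscribed to the cloud-builds
# Pub/Sub topic; deployment name, namespace and credential setup are assumptions.
import base64
import json
from kubernetes import client, config

def deploy_on_build(event, context):
    """Patch the GKE deployment with the freshly built image."""
    build = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if build.get("status") != "SUCCESS":
        return                                        # ignore failed builds
    image = build["results"]["images"][0]["name"]     # image pushed to GCR
    config.load_kube_config()                         # in practice: GKE credentials via google-auth
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment(
        name="image-classifier",
        namespace="default",
        body={"spec": {"template": {"spec": {"containers": [
            {"name": "image-classifier", "image": image}]}}}},
    )
```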


Job Demand Report [GCP (Cloud Build, Container Registry), GIT, Python, Docker, FastAPI, Alteryx]

  • Started as a POC; our data pipeline collects more than 100k job postings per month
  • To put this large volume of data to good use, we proposed developing a job demand report
  • The report covers two aspects of the data: the most in-demand occupations in the market and the trending occupations in the market
  • The data pipeline also extracts skills from the job postings, and these are reflected in the report in the same two ways as occupations
  • Created highly configurable and repeatable workflows using Alteryx for transforming the data lake
  • Once the POC was done, the stakeholders loved the insights displayed in the report
  • Given their interest, the POC was taken to the next step of building an interactive user interface
  • Developed the wireframe in Figma and led a team of three: a frontend developer, a backend engineer and a data engineer
  • Delivered the project successfully and monitored adoption of the product across clients
  • Clients were happy with the stories and statistical results we produced, so the organization decided to make this a marketing tool to attract new clients
  • Worked with the marketing team to integrate the application with Salesforce to drive our customer lead generation engine


PROJECTS [Part of M.Tech & UG]


Invoice Categorization

  • Large logistics firms find it very difficult to keep track of invoices, especially those in scanned PDF format
  • The training set contained scanned copies of 5655 invoices and the test set contained 1394 invoices
  • The model I created automates extracting the text from the scanned copies and categorizing each invoice into a given product category
  • It is a multi-class random forest model that predicted 1392 of the 1394 invoices in the test file correctly; a sketch follows this list
  • Overall model accuracy was 99.59% on the given public dataset
  • Notably, the model could identify a class that occurred only twice in the entire dataset; a deliberate split placed one of those two data points in the training set
  • The model correctly classified the second data point of that class at test time
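A minimal sketch of the categorization step, assuming OCR text has already been extracted from the scanned PDFs; the feature extraction, hyperparameters and variable names are illustrative.

```python
# Hypothetical sketch: TF-IDF features over OCR-extracted invoice text feeding a
# multi-class random forest. `train_texts`/`train_categories`/`test_texts` are placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
clf.fit(train_texts, train_categories)      # OCR text and product-category labels
predicted_categories = clf.predict(test_texts)
```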


Multi Label Text Classification

  • Toxic Comment Classification with an accuracy of 98%
  • Created a custom tokenizer to handle tokenization
  • Selected a classifier that can handle multiple labels and predict the probability of each label for a new comment
  • Used a pipeline to handle vectorization and classification together, which also reduced the model's training time; a sketch follows this list
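A minimal sketch of such a multi-label pipeline, assuming scikit-learn with a one-vs-rest logistic regression; the tokenizer body and label matrix are illustrative placeholders.

```python
# Hypothetical multi-label pipeline: custom tokenizer -> TF-IDF -> one-vs-rest classifier.
# `comments`, `label_matrix` and `new_comments` are placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

def custom_tokenizer(text):
    """Placeholder for the custom tokenizer mentioned above."""
    return text.lower().split()

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=custom_tokenizer)),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
pipeline.fit(comments, label_matrix)          # label_matrix: n_samples x n_labels (0/1)
probabilities = pipeline.predict_proba(new_comments)
```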


Banking behavioral scorecard for Internal Liability customers

  • A banking behavioral scorecard is a model maintained for a customer based on their liability transactions
  • Liability transactions are transactions made by an internal customer of a bank
  • Internal customers of a bank are the customers who have a savings account (SA) with the bank
  • Customers pay their loans in equated monthly installments (EMIs)
  • Loans get paid through post-dated cheques
  • Customers also have the option to pay through the Electronic Clearing System (ECS) or via standing instructions to debit their HDFC Bank account for the EMI amount
  • The customer risk profile is the probability of the customer defaulting on an EMI payment
  • I was given a heavily imbalanced binary classification dataset
  • One class accounts for 2% of the data and the other for the remaining 98%
  • There were 2395 anonymised feature columns; exploratory data analysis showed that 1632 of them carried most of the feature importance
  • Using these, I built a voting classifier and achieved 85.87% accuracy on the private dataset and 90.70% on the public dataset; a sketch follows this list
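A minimal sketch of the voting classifier, assuming scikit-learn; the base estimators, their parameters and the class-weight handling are illustrative choices, and `X_train`, `y_train` and `important_columns` are placeholders.

```python
# Hypothetical soft-voting ensemble over the 1632 high-importance columns of an
# imbalanced dataset; estimators and parameters are placeholders.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("rf", RandomForestClassifier(n_estimators=300, class_weight="balanced")),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",                        # average predicted probabilities
)
voting.fit(X_train[important_columns], y_train)   # 1632 high-importance columns
default_probabilities = voting.predict_proba(X_test[important_columns])[:, 1]
```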

Data Analyst

Whirldata Labs
Chennai
06.2018 - 02.2020
  • Identified and documented detailed business rules and use cases based on requirements analysis.
  • Researched and resolved issues regarding integrity of data flow into databases.
  • Collaborated with business-unit leaders to identify and prioritize problems.
  • Upheld security and confidentiality of documents and data within area of responsibility.

Projects

SAS Code Recommendation Engine [Python, CoreNLP]

  • Designed & developed an AI based SAS code recommendation engine for a health sciences company to reduce the time consumed by their SAS programmers to write code.
  • Achieved 50% time reduction for client side programmers developing SAS code.
  • Trained models on a repository of SAS code developed by SAS programmers as our training data. Solution involved building custom SAS tokenizer due to unavailability of pre-built models capable of processing SAS code.
  • Manually labelled training data where required.
  • Designed and trained a supervised CRF (Conditional Random Fields) model using the labelled data.
  • The CRF models were used to tag new SAS code.
  • Used cosine similarity for vector comparison against the SAS code repository to generate recommendations; a sketch follows this list.
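A minimal sketch of the recommendation step, assuming TF-IDF vectors built with the custom SAS tokenizer; `sas_tokenizer`, `sas_repository` and the top-k value are illustrative placeholders.

```python
# Hypothetical recommendation lookup: TF-IDF over tokenized SAS programs and
# cosine similarity against the repository. Names below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(tokenizer=sas_tokenizer)     # custom SAS tokenizer
repo_vectors = vectorizer.fit_transform(sas_repository)   # existing SAS programs

def recommend(snippet, top_k=5):
    """Return indices of the repository programs most similar to the snippet."""
    scores = cosine_similarity(vectorizer.transform([snippet]), repo_vectors)[0]
    return scores.argsort()[::-1][:top_k]
```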

Cloud-based Asset Tracker for an Autonomous Vehicle Data Management Company [AWS Stack, Python, Ionic, Angular]

  • Technologies used: Python, AWS API Gateway, AWS Lambda functions, AWS RDS (MySQL), AWS Cognito User Pool, Ionic, Angular 6.
  • Understood the client's requirements and designed the database and microservices architecture for the application.
  • Created microservice-based APIs serving both the mobile and web applications; a sketch of one endpoint follows this list.
  • Managed security over data transfer using AWS Cognito user management.
  • Created a Selenium-based automated testing tool to verify the APIs.
  • Used Talend to support real-time dashboard reporting of each and every vehicle movement.
  • Reduced dashboard load time by 45% by using complex queries in Talend.
  • Supported the front-end applications (Ionic & Angular 6).
  • Created an automated deployment script to deploy all services to AWS from the client side using the AWS CLI/SDK.
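A minimal sketch of one such microservice endpoint as an AWS Lambda handler behind API Gateway, assuming the proxy-integration event format and a MySQL RDS backend; the table, path parameter and connection details are illustrative assumptions.

```python
# Hypothetical Lambda handler for an asset-tracking endpoint; table/column names
# and connection details are placeholders (real credentials come from configuration).
import json
import pymysql

def lambda_handler(event, context):
    """Return the latest location for an asset id passed as an API Gateway path parameter."""
    asset_id = event["pathParameters"]["asset_id"]
    conn = pymysql.connect(host="rds-endpoint", user="app", password="***", database="assets")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT lat, lon, updated_at FROM asset_location WHERE asset_id = %s",
                (asset_id,),
            )
            row = cur.fetchone()
    finally:
        conn.close()
    return {"statusCode": 200, "body": json.dumps({"asset_id": asset_id, "location": str(row)})}
```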

Entity Extraction on Operating System Registry for Automated Software Updates and Deletion in an IT Infrastructure [Python, CoreNLP]

  • Created an approach for automating OS software updates and maintaining the same software stack across each and every machine in an infrastructure.
  • Built a Flask application to provide an API interface for connecting to every system in the infrastructure and collecting each system's details.
  • Built an entity recognition model that extracts Vendor, Title, Version and Edition for all software on a system.
  • The Stanford NER (Named Entity Recognition) architecture was used to build the entity recognition model; a sketch follows this list.
  • With a purpose-built tokenizer, the model achieves 94% accuracy in extracting Vendor, Title, Version and Edition.
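A minimal sketch of tagging a registry entry with a trained Stanford NER model through NLTK's wrapper; the model and jar paths, the tokenizer body and the example entry are illustrative assumptions.

```python
# Hypothetical use of a custom-trained Stanford NER model via NLTK; the file
# paths, tokenizer and registry entry below are placeholders.
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    "models/software-ner.ser.gz",   # assumed path to the trained CRF model
    "stanford-ner.jar",             # assumed path to the Stanford NER jar
)

def registry_tokenizer(entry):
    """Placeholder for the purpose-built tokenizer mentioned above."""
    return entry.replace("_", " ").split()

tokens = registry_tokenizer("Microsoft_Office_Professional_2016_x64")
print(tagger.tag(tokens))           # e.g. [('Microsoft', 'VENDOR'), ('Office', 'TITLE'), ...]
```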

Question Answering Model Using a Bi-Directional Attention Flow Network [TensorFlow, Python, GloVe Embeddings]

  • Created a question answering model in which any question asked about a given paragraph is answered by an RNN model.
  • The model is trained on SQuAD (Stanford Question Answering Dataset), a collection of questions over Wikipedia articles.
  • The model represents the meaning of the given paragraph using GloVe embeddings, which are pre-trained vector representations of words.
  • Used Gated Recurrent Units (GRUs) to generate a bi-directional attention flow, with attention flowing from context to question and from question to context; an encoder sketch follows this list.
  • Context refers to the given paragraph.
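A minimal TensorFlow/Keras sketch of the embedding and GRU encoder layers only, assuming a pre-loaded GloVe matrix; the vocabulary size, dimensions and `glove_matrix` are illustrative, and the bi-directional attention computation itself is omitted.

```python
# Hypothetical encoder layers for the BiDAF-style model: frozen GloVe embeddings
# feeding a shared bidirectional GRU. Sizes and `glove_matrix` are placeholders.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, HIDDEN = 100_000, 300, 128    # illustrative sizes

embedding = tf.keras.layers.Embedding(
    VOCAB_SIZE, EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
    trainable=False,
)
encoder = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(HIDDEN, return_sequences=True))

context_ids = tf.keras.Input(shape=(None,), dtype="int32")   # the paragraph
question_ids = tf.keras.Input(shape=(None,), dtype="int32")  # the question
context_enc = encoder(embedding(context_ids))
question_enc = encoder(embedding(question_ids))
# Context-to-question and question-to-context attention are then computed from a
# similarity matrix between context_enc and question_enc before answer prediction.
```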

Real Time Data Analytics and Visualization [Tableau, SQL, Snowflake]

  • Performed real-time data analytics for a claims processing company and created visualizations for the same in Tableau.
  • Assisted the client in moving data from their MySQL data lake to Snowflake.
  • Since the log data was growing rapidly and the client wanted the entire project dedicated to data analytics, moving data from MySQL to Snowflake was a huge benefit for both the client and us.
  • It helps simplify raw data into a very easily understandable format.
  • Data analysis is very fast with Tableau, and the visualizations take the form of dashboards and worksheets.
  • Created stories from the data with the help of statistics and visualized them in an appealing manner.
  • Built a forecasting model for predicting the month-wise rise in the number of claims.
  • Based on the client's requirements, created stored procedures for retrieving data from a large SQL database.

A Control Hub for Real-Time Data-Intensive Applications [Docker, Python, Hashicorp Vault, Docker Compose]

  • Created a Docker architecture for a web-based control hub with more than 25 components.
  • Managed dockerization of that architecture and created a CI/CD pipeline with Jenkins and Ansible.
  • Optimized Docker images to decrease build and deployment time, reducing it from 1 hour 40 minutes to 35 minutes.
  • Developed a generic deployment script for CentOS, Red Hat, Ubuntu and Windows, which reduced the time spent providing runtime inputs from 30 minutes to 12 minutes.
  • Performed sandboxing of the application on client request; a comparison was done between Amazon Redshift and Ampool Enterprise with Tableau Desktop.
  • Used TPC-DS (Transaction Processing Performance Council - Decision Support) data (10 GB, 100 GB, 1 TB, 100 TB) for benchmarking Amazon Redshift and Ampool Enterprise.

Education

B.E - Computer Science and Engineering

Anna University
Chennai, India
07.2014 - 05.2018

M.Tech - Data Science and Engineering

BITS
Pilani, Rajasthan
03.2020 - 03.2022

Skills

Oracle 11g, MySQL, MongoDB, SQL


Accomplishments

Publications & Research Work

  • S. Imaya Bharathi et al., "Data Security Using GOS System," IEEE Conference on Data Security, ISBN 978-1-5386-0373-4, June 2016
  • S. Imaya Bharathi et al., "Stock Volume Prediction Based on Polarity of Tweets, News, and Historical Data Using Deep Learning," 2nd International Conference on Big-data Service and Intelligent Computation, December 2020

Seminars

  • Presented a hands-on seminar on Predicting Maruti Stocks with Time Series Forecasting.
  • Two-week Faculty Development Programme on "Cutting Edge Trends in Deep Learning Approaches," 16 November – 1 December 2020; presented 4 sessions on NLP, text analytics and RNN architecture, rated the best sessions by heads of departments from institutes ranging from IIT Hyderabad (IITH) to VIT.
  • World Class Smart Manufacturing Powered by Artificial Intelligence and Machine Learning - Industry 4.0: delivered 2 sessions on the importance of AI in manufacturing, along with industry use cases.

Certification

Machine Learning Course by Andrew Ng on Coursera.

Timeline

Data Scientist

Disprz
03.2020 - Current

M.Tech - Data Science and Engineering

BITS
03.2020 - 03.2022

Data Analyst

Whirldata Labs
06.2018 - 02.2020

B.E - Computer Science and Engineering

Anna University
07.2014 - 05.2018

Machine Learning Course by Andrew Ng on Coursera.

Neural Networks and Deep Learning Course by Andrew Ng on Coursera

Python Best practices from Udemy

Machine Learning for Trading - Specialization Coursera

Git Complete: A Definitive Guide to Git - Udemy
