Projects
Creating Skills-to-Role Architecture for Identifying the Skill Gap in the Job Market [AWS, Redshift, Scraping, TensorFlow, spaCy, Flask, Informatica]
- Project focused on identifying the roles actively being hired for in a given industry and the skills needed for each of those roles
- Created a PDF extraction engine in Python for SKILLS-FRAMEWORK documents covering 33 industries, which contain detailed information on job roles, their responsibilities, and the proficiency and knowledge levels required
- Scraped LinkedIn and Indeed, and used Common Crawl, to collect active industry-wise job postings and their details
- Trained a sequential sentence classification model in AllenNLP to label each job-description sentence as a role, skill, experience, company, or other sentence
- Trained a separate spaCy NER model to extract technical, generic, and industry-specific skills from the skill sentences identified in the previous step (see the NER sketch after this project)
- After extracting data from both the skills framework and the job postings, designed the schema for this information and loaded it into Amazon Redshift
- Assisted in building multiple Power BI dashboards on the Redshift data to surface insights and identify skill gaps between the roles our client required and those their competitors required
- Utilized advanced querying, visualization and analytics tools to analyze and process complex data sets.
- Evaluated Informatica for the ELT process, but since the data was raw, unprocessed text it could not be used to its full extent; instead built a custom Python-based text preprocessor and cleaner
- Proposed a framework, built on the Redshift data, for evaluating and comparing skills across proficiency levels (e.g. skill level: beginner, intermediate, advanced)
- Since sentence-level context captures what a job role is, devised a sentence-level TF-IDF technique named Sentence Frequency IDF (SFIDF); a sketch follows this project
- Served the SFIDF results through a Flask endpoint
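A minimal sketch of the skill-extraction step with a trained spaCy NER model; the model path and entity labels below are illustrative assumptions, not the project's actual names.

```python
# Minimal sketch of applying a custom-trained spaCy NER model to skill sentences.
# The model path "skill_ner_model" and the label names it would emit are assumed.
import spacy

nlp = spacy.load("skill_ner_model")  # custom pipeline trained on annotated skill sentences

def extract_skills(skill_sentences):
    """Return (text, label) pairs for every skill entity found."""
    skills = []
    for doc in nlp.pipe(skill_sentences):
        skills.extend((ent.text, ent.label_) for ent in doc.ents)
    return skills

print(extract_skills(["Experience with TensorFlow and Redshift is required."]))
```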
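A hedged sketch of one plausible reading of SFIDF, where each sentence is treated as its own "document" so the IDF term becomes an inverse sentence frequency; the exact weighting the project used is not specified above.

```python
# Sentence-level TF-IDF ("SFIDF") sketch: sentences act as the documents, so the
# IDF component counts how many sentences a term appears in. Assumed interpretation.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Design and maintain ETL pipelines on Redshift",
    "Build NER models to extract skills from postings",
    "Maintain dashboards for skill gap analysis",
]

vectorizer = TfidfVectorizer()            # term frequency x inverse sentence frequency
sfidf_matrix = vectorizer.fit_transform(sentences)
print(sfidf_matrix.shape)                 # (num_sentences, vocabulary_size)
```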
Text Analytics for a Live User Engagement Platform [Python, Flask, BERT, Google Cloud NLP]
- Created two Flask endpoints, one for sentiment analysis and one for word-cloud generation
- To increase engagement on our in-house web-based meeting/webinar platform, built two tools that let meeting hosts ask attendees questions about the session or any general topic
- Based on the answers received live, a rule-based algorithm decides how to visualize the results
- Depending on which rules are satisfied, either a word cloud is generated or a clustering algorithm is applied
- Used SBERT (Sentence-BERT) for clustering, which provides sentence-level embeddings (see the sketch after this project)
- Hosted both tools with Flask on an AWS EC2 instance
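A minimal sketch of the SBERT clustering step; the checkpoint name and the use of KMeans are assumptions, since the exact clustering algorithm is not stated above.

```python
# Embed live answers at the sentence level with SBERT and group them.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

answers = [
    "More hands-on demos please",
    "Loved the live coding section",
    "Could we get the slides afterwards?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint
embeddings = model.encode(answers)                # one vector per answer
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)                                     # cluster id per answer
```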
GUI Tool for Business Stakeholders and Skilling Sciences Team [PyQt, Qt5 Framework, Python, SQL]
- The skilling sciences and engineering services teams both needed a tool to access the database and perform editing and manual vetting before data was showcased to each client
- Proposed an in-house GUI (graphical user interface) solution that could be built within the given timeline and avoid the cost of outsourcing the whole project
- Worked with stakeholders to develop quarterly roadmaps based on impact, effort, and test coordination
- Created a GUI annotation tool from scratch per the requirements and connected it to an event-driven backend (a minimal PyQt sketch follows this project)
- Each user's tag controls which data they can access and annotate, and users are notified when data becomes available for them to annotate
- First person, and first project, in the organization to adopt Test-Driven Development (TDD)
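A minimal PyQt5 sketch in the spirit of the annotation tool: a text box, a label dropdown, and a save button wired to a slot. The widget names and labels are illustrative; the real tool connected to an event-driven backend with role-based access.

```python
import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QVBoxLayout,
                             QComboBox, QPlainTextEdit, QPushButton)

class Annotator(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Skill Annotator")
        self.text = QPlainTextEdit("Paste the sentence to annotate here")
        self.label = QComboBox()
        self.label.addItems(["tech skill", "generic skill", "industry skill"])
        save = QPushButton("Save annotation")
        save.clicked.connect(self.save)          # event-driven: button click -> slot
        layout = QVBoxLayout(self)
        for widget in (self.text, self.label, save):
            layout.addWidget(widget)

    def save(self):
        # In the real tool this would write to the backend / database.
        print(self.text.toPlainText(), "->", self.label.currentText())

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = Annotator()
    window.show()
    sys.exit(app.exec_())
```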
Data and AI Pipeline [Google Cloud (Composer, Container Registry, Kubernetes, Airflow, Cloud Storage, SQL Server), Shell Scripting, Docker, Python]
- Preprocessed the raw job postings collected from multiple sources
- Defined the schemas for the data lake [SQL Server] and the data warehouse [BigQuery]
- Built and deployed a set of 6 AI models, including sequential sentence classification to tag each posting sentence as role-, skill-, or experience-related, NER to extract skills, a model to classify the industry a posting belongs to, and a model to predict its occupation, then populated the data lake with the results
- Orchestrated all of these steps with Google Cloud Composer (Airflow) along with other services from the Google Cloud stack (a minimal DAG sketch follows this project)
- Created a Kubeflow pipeline to parallelize the processes and control the resources (RAM, storage, vCPUs) allotted to each of them
- The main aim was to become familiar with the Google Cloud stack for data engineering
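A minimal Airflow DAG sketch of the orchestration described above; the task names and callables are placeholders for the real preprocessing, model, and load steps run on Cloud Composer.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def preprocess(): ...
def classify_sentences(): ...
def extract_skills(): ...
def load_data_lake(): ...

with DAG(
    dag_id="job_posting_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t2 = PythonOperator(task_id="classify_sentences", python_callable=classify_sentences)
    t3 = PythonOperator(task_id="extract_skills", python_callable=extract_skills)
    t4 = PythonOperator(task_id="load_data_lake", python_callable=load_data_lake)
    t1 >> t2 >> t3 >> t4   # linear ordering; the real DAG fans out across 6 models
```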
MLOps Pipeline for Classification Model [GCP (Cloud Build, Cloud Functions, Pub/Sub, GKE), Git, Python, Docker, Flask]
- Built an automated CI/CD pipeline for an image classification model
- Configured Cloud Build for the given image classification source code
- Once a build completes successfully, a message is pushed to a Cloud Pub/Sub topic
- Pub/Sub delivers the message to a Cloud Function, which extracts the deployment information from the message and triggers a new deployment on the Kubernetes cluster in GKE (sketched after this project)
- Kubernetes rolls the new image out to production; the entire flow runs from a single Git commit
- The design lets ML engineers iterate on the model architecture without worrying about deployment for a given problem statement
- All they have to do is push the latest code to Git and the model gets deployed
- Optimization: the initial build time was 20 minutes
- Developed a multi-stage Docker image and pushed it to GCR (Google Container Registry)
- Reduced the Kubernetes deployment steps by preloading constants; after optimization, the build time dropped to 4 minutes
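A hedged sketch of the Pub/Sub-triggered Cloud Function step (gen1 background-function signature) that reads an image tag from the build message and patches a GKE deployment via the Kubernetes Python client; the message fields, deployment name, and credential setup are assumptions, not the original function.

```python
import base64
import json
from kubernetes import client, config

def deploy_on_build(event, context):
    """Triggered by a build notification published to Pub/Sub (assumed payload shape)."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    image = message["image"]                  # e.g. gcr.io/<project>/classifier:<tag>

    # Credential setup is simplified here; inside a Cloud Function you would build
    # GKE credentials explicitly rather than rely on a local kubeconfig.
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "image-classifier", "image": image}]}}}}
    apps.patch_namespaced_deployment(
        name="image-classifier", namespace="default", body=patch)
```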
Job Demand Report [GCP (Cloud Build, Container Registry), Git, Python, Docker, FastAPI, Alteryx]
- Started as a POC; our data pipeline collects more than 100k job postings per month
- To put this large volume of data to good use, the idea was to develop a job demand report
- The report covers two views of the data: the most in-demand occupations in the market and the trending occupations in the market (a serving sketch follows this project)
- The pipeline also extracts skills from the job postings, which are reflected in the report in the same two ways as occupations
- Created highly configurable and repeatable workflows using Alteryx for transforming the data lake
- Once the POC was done, stakeholders loved the insights displayed in the report
- Given their interest, the POC was taken to the next step of building an interactive user interface
- Developed the wireframe in Figma and led a team of 3: a frontend developer, a backend engineer, and a data engineer
- Delivered the project successfully and monitored the product's adoption across clients
- Clients were happy with the stories and statistical results we produced, so the organization decided to turn this into a marketing tool to attract new clients
- Worked with the marketing team to integrate the application with Salesforce to drive our customer lead-generation engine
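A minimal FastAPI sketch of how the report data could be served; the endpoint path, response fields, and in-memory sample rows are illustrative assumptions, not the production API.

```python
from fastapi import FastAPI

app = FastAPI(title="Job Demand Report")

# Illustrative in-memory sample; the real report is built from the data lake.
DEMAND = [
    {"occupation": "Data Engineer", "postings": 1840},
    {"occupation": "ML Engineer", "postings": 1320},
]

@app.get("/occupations/in-demand")
def most_in_demand(limit: int = 10):
    """Return the occupations with the most postings this month."""
    return sorted(DEMAND, key=lambda row: row["postings"], reverse=True)[:limit]
```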
PROJECTS [Part of M.Tech & UG]
Invoice Categorization
- Large logistics firms find it very difficult to keep track of invoices, especially those in scanned PDF format
- The dataset contained scanned copies of 5655 invoices and the test set contained 1394 invoices
- The model I created automates extracting the text from the scanned copies and categorizing each invoice into a given product category
- It is a multi-class random forest model that correctly predicted 1392 out of 1394 invoices in the test file (a pipeline sketch follows this project)
- Overall model accuracy was 99.59% on the given public dataset
- Notably, the model can identify a class that occurred only twice in the entire dataset; a deliberate split placed one data point of that class in the training set
- The model correctly classified the second data point of that class during testing
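A hedged sketch of the invoice categorization approach: extracted invoice text vectorized with TF-IDF and classified with a random forest. The OCR step is omitted, and the feature choice and hyperparameters are assumptions; only the multi-class random forest is stated above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative sample; the real data is OCR text from scanned invoices.
invoice_texts = ["freight charges container shipment", "office chairs and desks invoice"]
categories = ["logistics", "furniture"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
model.fit(invoice_texts, categories)
print(model.predict(["monthly container freight bill"]))
```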
Multi Label Text Classification
- Toxic Comment Classification with an accuracy of 98%
- Created a custom tokenizer for the tokenization step
- Selected a classifier that can handle multiple labels and predict the probability of each label for a new comment
- Used a pipeline to handle vectorization and classification together, which also reduced the model's training time (see the sketch below)
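A hedged sketch of a multi-label comment-classification pipeline: TF-IDF vectorization chained with a one-vs-rest logistic regression. The exact classifier is not named above, so OneVsRest with LogisticRegression is an assumption; a custom tokenizer could be passed to the vectorizer.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

comments = ["you are awful", "have a great day", "awful and rude behaviour"]
labels = [[1, 0], [0, 0], [1, 1]]                  # illustrative columns: toxic, insult

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                  # a custom tokenizer can be passed via tokenizer=
    ("ovr", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(comments, labels)
print(clf.predict_proba(["so rude"]))              # per-label probabilities
```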
Banking Behavioral Scorecard for Internal Liability Customers
- A banking behavioral scorecard is a model maintained for a customer based on their liability transactions
- Liability transactions are those made by an internal customer of a bank
- Internal customers of a bank are customers who hold a savings account (SA) with the bank
- Customers pay loans in equated monthly installments (EMIs)
- Loans are paid through post-dated cheques
- Customers can also pay through the Electronic Clearing System (ECS) or standing instructions to debit their HDFC Bank account for the EMI amount
- The customer risk profile is the probability of a customer defaulting on their EMI payments
- I was given an imbalanced binary classification dataset
- One class accounts for 2% of the data and the other for the remaining 98%
- There were 2395 anonymised feature columns; through exploratory data analysis I found 1632 columns carrying most of the feature importance
- Using these, I built a voting classifier and achieved an accuracy of 85.87 on the private dataset and 90.70 on the public dataset (a sketch follows)
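A hedged sketch of the voting-classifier setup on an imbalanced binary problem; the constituent estimators, their settings, and the synthetic data are assumptions, since only the voting classifier and the 2%/98% class split are stated above.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for the anonymised, imbalanced dataset (~2% positive class).
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.98, 0.02], random_state=42)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",          # average predicted probabilities across models
)
voter.fit(X, y)
print(voter.predict_proba(X[:3]))
```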