About
Azure- and IBM-certified data scientist with 6+ years of experience building, productionizing, and delivering state-of-the-art, scalable AI/ML-based solutions.
Skills & Expertise
Python
Spark
Scala
R
SAS
SPSS
Azure
Databricks
GCP
AWS
MLOps
Hadoop
LGBM
XGBoost
PCA
Autoencoders
Bayesian
ensemble models
T5
GPT
BERT
CNN
RNN
LSTM
TigerGraph
Neo4j
TensorFlow
PyTorch
Keras
Trax
NLTK
scikit-learn
matplotlib
Tableau
Power BI
RShiny
JavaScript
jQuery
PHP
CSS
HTML
XML
JSON
REST
SQL Server
MySQL
NoSQL
Jira
Git
GitHub
Windows
Linux
MS Office
Excel
Access
Word
PPT
SQL
PySpark
SparkSQL
MLlib
fastai
Convolutional Neural Networks
Recurrent Neural Networks
Google Cloud Platform
Plotly
Cordova
Ionic
Framework7
Research Interests
problem solving
communication
collaboration
presentation
analytics
leadership
stakeholder management
Experience
Sr Data Scientist
Client: Humana, KY
Built predictive models using LightGBM and neural networks to identify Medicare members at risk of inpatient admission in the next 6 months; improved overall model performance by 20% and was recognized by McKinsey & IMPAQ as a leader in the payor industry. Processed text data from health records, phone call transcripts, and nurse notes using BERT, BioBERT, and ClinicalBERT; reduced dimensionality using PCA, assessed similarity using cosine similarity, and improved the performance of the inpatient predictive model by 5%. Trained T5, an NLP encoder-decoder architecture, and a BERT model on the ~10K ICD-10 codes and predicted the sequence of the first 5 ICD codes in the next admission with 68% precision and 56% recall. Modelled the progression of diabetes using the Diabetes Complication Severity Index (DCSI); built multiple models, including OLS, Lasso, gradient boosting, and random forest, and the XGBoost regression model performed best, with an R-squared of 0.89 and an RMSE of 1.2. To account for seasonality in the data, created month-level data snapshots and used them to build an LSTM model for predicting diabetes progression. Productionized the model with end-to-end pipelines on Azure Cloud using MLflow and Azure DevOps. Developed pipelines to extract, clean, transform, and analyze data using Python and PySpark. Cleaned and transformed data using techniques such as removing unitary columns, correlation analysis, VIF, recursive feature elimination, one-hot encoding, scaling, and imputation. Worked with stakeholders to translate the business problem into a predictive analytics solution. Analyzed STARS measures using various metrics from insurance claims. Segmented the Medicare and Medicaid populations into multiple cohorts based on disease severity, Social Determinants of Health (SDoH), CMS risk adjustment, Charlson Comorbidity Index (CCI), and Facility Condition Index (FCI). Calculated retention rates for the Medicare, Medicaid, and commercial populations. Grouped ICD-10 CM/PCS codes into Major Diagnostic Categories (MDC) to understand the cost, frequency, demographics, seasonality, and geographical distribution of claims for each cohort.
Client: MARS, NJ
Predicted customer churn for the pet care segment across three business units with deep learning models such as DeepSurv and DeepHit, improving model performance by 15%. Ideated and developed a feedback loop to measure the impact of marketing strategies, which helped to better understand customers' needs, recommend appropriate solutions, and achieve better retention. Built multiple dashboards that provide weekly, monthly, and quarterly customer statistics.
Data Scientist
Created queries for data extraction using SAS and generated reports using SQL and Tableau. Predicted seasonality in patient visits per specialty so that nursing and junior doctor staff could be used effectively without jeopardizing quality. Performed predictive analytics on time series healthcare data (patient vitals, ECG, EMR, ventilator, and other clinical data) using machine learning algorithms such as support vector machines (SVM), decision trees, and Naive Bayes classifiers to predict adverse clinical events, health severity progression, and mortality rates. Collaborated with stakeholders from clinical, operations, and product teams to identify analytics opportunities and leverage solutions, and maintained a monitoring system in Tableau.
Data Scientist
Developed a SQL database from the data warehouse for the data analytics team. Extracted data from SQL Server and performed data cleaning and wrangling in Python. Performed Kaplan-Meier survival analysis to compare glucose sensor survival rates across countries. Glucose sensor life increased by 6% after K-means clustering identified shelf life and S0 as the reasons for early sensor retirement. From CGM data, predicted the life of glucose sensors using the XGBoost algorithm with an AUC of 0.78. Created multiple dashboards in Power BI by extracting data from the SQL database, and customized them using R scripts to generate more comprehensive analysis and visualizations. Forecasted new customers 15% better than in previous years using an ARIMA time series model in Python. Increased accessibility to the data by designing visualizations that include statistical graphs and information graphics in Power BI. Imported, cleaned, merged, and transformed datasets in R to build dashboards. Built a KPI dashboard in RShiny that automatically sends monthly performance reports and restricts access for different employees based on their position in the organization.
Data Analyst
Multiple claim types from the 2017 Indiana Medicaid challenge, including dental claims, high-cost claims, and Emergency Department (ED) Medicaid insurance claims, were analyzed for average cost per recipient and average cost per claim among federally qualified and non-qualified clinics, and among primary care physicians (PCPs) and non-PCPs. Federally qualified health clinics (FQHCs) and PCPs were found to have higher-cost claims than their counterparts, which is due to preventive care in FQHCs and frequent visits to PCPs, respectively. Using demographics, groups of individuals with high healthcare costs and more prevalent diseases were identified. Underlying conditions like job type, socioeconomic status, and education level were correlated. These claims were analyzed in Python using various machine learning algorithms:
o K-means clustering to identify patterns in the amount and number of claims per disease and physician specialty
o Time series analysis to forecast claim amounts and diseases per geographic region
o Regression algorithms (multivariate regression, decision trees, and random forest) to identify underlying causes like education, socioeconomic status, drug abuse, etc.
Indiana Community Health Centers (CHCs) data was concatenated using MySQL; disease patterns were identified using descriptive and inferential statistics and plotted in Tableau. A negative Pearson correlation was identified between income and frequency of diseases. A predictive model was built using linear regression in Python with 84% accuracy.
Indiana State Department of Health (ISDH) data was normalized, and the number of active physicians was correlated with deaths per county in Indiana. Death patterns in all counties from 2011 to 2015 were visualized using ggplot2 in R.
Built a web application using Python Flask to present both human and machine results of analyzing chest X-ray reports; statistical tests in Python (t-tests, one-way ANOVA, F-score) showed improved human performance under machine guidance.
Experienced in machine learning regression algorithms like simple, multiple, and polynomial regression, SVR (Support Vector Regression), decision tree regression, and random forest regression. Experienced in machine learning classification algorithms like logistic regression, K-NN, SVM, kernel SVM, Naive Bayes, and decision tree and random forest classification. Proficient in various statistical models like MANCOVA, two-way ANOVA, chi-square, etc. Performed various parametric tests like t-tests and one-way ANOVA, nonparametric statistical tests like Spearman's correlation, the Mann-Whitney U test, the Friedman test, and the Kruskal-Wallis test, and ad hoc analysis.
Education
Indiana University Purdue University
Krishna University
Projects
Development of mobile application to monitor baby care and teach healthcare workers
A hybrid mobile application was built based on the Essential Care for Every Baby (ECEB) action plan and is currently being tested. Technologies used: jQuery, JavaScript, CSS, HTML, Cordova, Ionic, and Framework7.
Analysis of Substance Abuse Trends in the United States: An Epidemiologic Study
Extracted 10 million records, then cleaned them for missing values and preprocessed them using R. The data was analyzed by performing predictive analysis. Technologies used:
o R: Boruta algorithm for variable selection
o Python: Recursive feature elimination; logistic regression & Support Vector Machines
o Graphical representation: Choropleth using Plotly in Python
Identification of diagnosis from discharge notes using RNN
Natural Language Processing (NLP) using Recurrent Neural Networks (RNN) was applied to the MIMIC-III note-events data to identify diagnoses from discharge notes. The diagnosis table was merged with the note-events table; our model was trained on the merged table and predicted the diagnosis with 82% accuracy. Technologies used: Recurrent Neural Networks, LSTM, fastai, Python, Google Cloud Platform (GCP).
RSNA pneumonia detection challenge (Kaggle)
A deep learning algorithm was built to detect pneumonia with lung opacities from chest X-rays. Along with pneumonia detection, bounding boxes were generated to identify the location of the opacities. Our algorithm performed better than 700 teams that participated in the competition. Technologies used: Convolutional Neural Networks (CNN), fastai, Python.
Fraud detection in Medicare claims data from CMS
Medicare data (25 GB) was pulled from CMS and concatenated per category, such as Part B Prescriber, Part D Prescriber, Inpatient, Outpatient, Nursing facilities, etc. After merging all these datasets into a single dataset, a fraud detection model was built using machine learning algorithms like random forest, gradient boosting, and logistic regression, with 80% accuracy. Technologies used: PySpark, SparkSQL, MLlib.
Certificates & Licenses (3)
IBM Data Science Professional
Natural Language Processing Specialization
Microsoft Azure Data Science Associate
Awards & Achievements (2)
🏆 2 SPOT awards
🏆 1 STAR award