Comparative Performance Evaluation of Machine Learning Models for Early Prediction of Diabetes

Sakshi Tupkar

Transparent Peer Review By Scholar9

Abstract

Diabetes is an increasingly prevalent chronic metabolic disorder. Left untreated or undiagnosed, it can lead to life-altering long-term health complications (e.g., cardiovascular disease, kidney failure, vision loss). Early recognition of diabetes is therefore important for preventing complications and managing patient outcomes. With the emergence of data-driven methods in healthcare, machine learning provides mechanisms for developing strong predictive models from patient data. This research explores six supervised machine learning algorithms (Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, Decision Tree, Naive Bayes, and XGBoost) for the predictive modeling of diabetes, using clinical, behavioral, and socioeconomic features from a structured dataset of 'Last health question' records. The dataset's variables included high blood pressure, cholesterol values, body mass index (BMI), exercise, alcohol consumption, reported health, healthcare access, and demographics (age, sex, education level, and income). Because medical datasets are often class-imbalanced, we used SMOTE, which balances the training data and improves sensitivity to the minority class (i.e., diabetic individuals). We evaluated each model with well-established metrics, including accuracy, precision, recall, F1 score, and ROC AUC score, which indicate how well the model classifies participants. Of the models we tested, XGBoost was the highest performer, with high F1 and AUC scores, indicating that the model was able to distinguish diabetic from non-diabetic individuals while balancing false positives against false negatives. Overall, our results demonstrate that ensemble-based methods like XGBoost can improve predictive accuracy in a clinical diagnostic setting. This study is evidence that machine learning can be a powerful ancillary tool for screening programs and can support personalized healthcare decisions.
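As a reading aid, the threshold-based metrics named in the abstract (accuracy, precision, recall, F1) can be sketched in plain Python. This is a minimal illustration, not the authors' evaluation code: the function name `classification_metrics` and the label lists are made up for the example, and label 1 is assumed to mark the diabetic (minority) class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall is the metric that tracks missed diabetic cases (false negatives), which is why it is usually watched closely in screening settings, while precision tracks false alarms. F1 summarizes the two in a single number.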

Niravkumar K Patel Reviewer

Review Request Accepted

Niravkumar K Patel Reviewer

Revision Required

Hello Researcher,

I hope you are doing well. I have reviewed your research paper in the healthcare domain, and I have a few suggestions to tighten its machine learning process. Machine learning algorithms can process a large amount of data in a single pipeline that covers steps such as filtering and accuracy evaluation and produces a correct result. I can't see any machine learning algorithm used in this research study. I am giving you a few suggestions to improve this research. Machine learning is a large technology stack, and it needs to be handled with correct security, correct dataset processing, and the correct algorithms.

1. Data Dependency

Requires Large and High-Quality Data: ML models need large volumes of clean, relevant data to learn effectively.
Garbage In, Garbage Out: Poor-quality or biased data leads to inaccurate or unfair results.

2. Computational Cost

High Resource Usage: Training complex models (e.g., deep learning) demands significant processing power, memory, and time.
Expensive Infrastructure: Costs for GPUs, cloud computing, and storage can be substantial.

3. Lack of Interpretability

"Black Box" Models: Many ML models, especially deep neural networks, are difficult to interpret or explain.
Hard to Debug: Understanding why a model made a specific decision can be challenging.

4. Bias and Fairness Issues

Inherent Bias in Data: Models may learn societal or historical biases present in the training data.
Unfair Outcomes: Can result in discrimination or unfair treatment in sensitive applications (e.g., hiring, lending, law enforcement).

5. Overfitting and Underfitting

Overfitting: Model learns the training data too well, performing poorly on new, unseen data.
Underfitting: Model is too simple to capture underlying patterns, leading to poor performance.

6. Security and Privacy Concerns

Data Privacy: Training on sensitive personal data can raise privacy concerns.
Model Vulnerabilities: ML systems can be attacked through adversarial examples or model extraction.

7. Dependence on Human Supervision

Needs Expert Involvement: Model selection, feature engineering, and tuning require domain knowledge and expertise.
Ongoing Maintenance: Models need retraining and updating as data and environments change.

8. Ethical and Legal Challenges

Accountability: Difficult to assign responsibility for incorrect or harmful decisions made by ML systems.
Regulatory Issues: Legal frameworks for AI/ML are still evolving and often lag behind technology.

9. Limited Generalization

Narrow Intelligence: ML models are usually task-specific and don’t generalize well across domains without retraining.
Lack of Common Sense: ML lacks the reasoning abilities that humans use naturally.

These are the main challenges of the process. If we can improve on them, it will be very helpful for processing large amounts of data in the healthcare industry.
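Point 5 in the list above (overfitting) can be illustrated with a small sketch. Everything here is hypothetical: the one-dimensional data, the deliberately mislabeled training point at 2.6, and the hand-picked cutoff of 4.0 are assumptions made for the example, not values from the paper. A 1-nearest-neighbour model memorizes the training set, noise included, while a simple threshold rule ignores the noisy point.

```python
def one_nn_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def threshold_predict(x, cutoff=4.0):
    """Simple rule: predict class 1 above the (hand-picked) cutoff."""
    return 1 if x > cutoff else 0


def accuracy(predict, data):
    """Fraction of (x, label) pairs that the predict function gets right."""
    return sum(1 for x, y in data if predict(x) == y) / len(data)


# Hypothetical data: the true class is 1 for x > 4; (2.6, 1) is a noisy label.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (2.6, 1),
         (3.0, 0), (5.0, 1), (6.0, 1), (7.0, 1)]
test = [(0.5, 0), (2.5, 0), (2.7, 0), (3.5, 0), (4.5, 1), (6.5, 1)]


def memorizer(x):
    return one_nn_predict(train, x)


print("1-NN      train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("threshold train/test:", accuracy(threshold_predict, train), accuracy(threshold_predict, test))
```

The memorizing model is perfect on the training data but worse on the held-out points near the noisy label, while the simpler rule shows the opposite pattern. A large gap between training and held-out scores is the usual warning sign of overfitting.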

Linear SVM (with Stochastic Gradient Descent - SGD)

Use SGDClassifier with hinge loss (this fits a linear SVM).
Much faster than a kernel SVM and scalable to millions of data points.
Works well when the data is linearly separable or close to it.

from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='hinge')  # Linear SVM
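To make the suggestion more concrete, here is a minimal pure-Python sketch of what training a linear SVM with stochastic gradient descent on the hinge loss involves. It is an illustration only and not scikit-learn's implementation, which differs in details such as the learning-rate schedule and data shuffling. The function names, the hyperparameters (`lr`, `alpha`, `epochs`) and the toy data are assumptions made for the example; labels are assumed to be +1 and -1.

```python
def train_linear_svm_sgd(X, y, lr=0.1, alpha=0.01, epochs=20):
    """Fit weights w and bias b by SGD on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Gradient of the L2 penalty: shrink the weights on every step.
            w = [wj * (1 - lr * alpha) for wj in w]
            # Hinge loss is non-zero only when the margin is below 1.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Classify x as +1 or -1 from the sign of the linear score."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, linearly separable toy data (labels are +1 / -1).
X = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0),
     (-2.0, -1.0), (-3.0, -2.0), (-1.5, -2.5)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm_sgd(X, y)
predictions = [predict(w, b, xi) for xi in X]
print(predictions)
```

Each update touches a single example, so the cost per step does not grow with the dataset size. This is the property that makes the approach attractive for very large datasets.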

If we can use this technique, the data will be processed efficiently in the healthcare industry, and it will give accurate results. I like this research on diabetes because a lot of people suffer from this disease, and the work can improve the healthcare process.

Thanks for giving me the chance to review your research paper.

Thanks,

Nirav Patel

Niravkumar K Patel Reviewer
23 Jun 2025 10:31 AM

Approved Rating

Relevance and Originality

Methodology

Validity & Reliability

Clarity and Structure

Results and Analysis

Comment

Hello Researcher,

I have given a few comments to improve this research. If you can address them, it will be very good with respect to the security concerns.

IJ Publication Publisher

Dear Sir,

We have forwarded the revision request to the author. We will publish this paper.
