MALARIA PREDICTION MODEL USING MACHINE LEARNING ALGORITHMS

Measures have been taken to protect individuals from the burden of vector-borne disease, yet malaria still causes more deaths in Africa than any other disease. Despite these efforts, many lives are lost, particularly among children below five years of age. The effect of malaria is most severe in developing countries. In 2019, 51% of malaria fatalities occurred in Africa, a figure that increased by 20% in 2020 owing to the COVID-19 pandemic. The majority of African countries suffer from weak health care systems, poor environmental and housing conditions, economic hardship, limited funding for the health sector, and an absence of sound policies to ensure the safety of individuals. Information on the effect of malaria has to reach the public through awareness programs, so that people become acquainted with the disease and preventive measures can be maintained. A prediction model can help policymakers anticipate when malaria is likely to occur based on existing features, so that people are informed about the disease in time and the government can make health equipment and medication available through its policy. In this research, weather conditions, non-climatic features, and malaria cases are used to design a prediction model, and the performance of six machine learning classifiers (Support Vector Machine, K-Nearest Neighbour, Random Forest, Decision Tree, Logistic Regression, and Naïve Bayes) is compared. Random Forest is found to be the best, with an accuracy of 97.72%, an AUC of 98%, and a precision of 100% on the data set used in the analysis.


INTRODUCTION
Malaria is a dangerous communicable disease that kills people around the world every day, despite the efforts made to eradicate it. Although it can be prevented if certain measures are put in place, it remains a leading cause of death, especially in Africa [1]. Worldwide malaria statistics for 2019 show that 94% of cases and 51% of deaths occurred in Africa. Nigeria had the highest share, followed by the Democratic Republic of the Congo (DRC), the United Republic of Tanzania, Mozambique, Niger, and Burkina Faso, accounting for 23%, 11%, 5%, 4%, 4%, and 4% respectively, with 409,000 resulting deaths globally [2].
The effects of vector-borne disease are greater in Africa because of limited medical resources, lack of proper medication, inadequate equipment, insufficient funding, lack of policies to manage the situation, economic hardship, poor environmental sanitation, and poor housing conditions [2]. In 2020, malaria cases in Africa increased as the coronavirus pandemic became a challenge that led to the loss of many lives around the world. This poses a serious challenge to health care managers and non-governmental organizations, who must find every possible measure to answer the long-standing question of why malaria remains a leading cause of death, especially in third world countries.
Malaria is regarded as the disease most responsive to weather patterns [3]; changes in atmospheric conditions can therefore spread it directly, or indirectly through human actions such as the improper dumping of household and industrial waste [4]. Numerous factors, such as crowded settlements, unclean gutters, garbage dumps, and shallow pools of brackish water, help mosquito larvae complete their development.
Reports suggest that vulnerable places could be rescued if good policy is put in place to tackle the malaria scourge: the number of people affected would be reduced, or the disease eradicated altogether. This can be achieved when policymakers know where malaria is liable to occur, so that all necessary action can be taken to save lives. There is therefore a need for a proper mechanism to predict when, where, and how an outbreak will occur. This research uses several algorithms to design a model that can predict the incidence of malaria and help the government, health care providers, and other relevant stakeholders take all possible measures to ensure the safety of individuals long before an epidemic, by enlightening people about the effect of vector-borne disease and by increasing malaria prevention facilities [5][6][7].

LITERATURE REVIEW
Many researchers have attempted to predict outbreaks from weather conditions, which are likely the strongest predictors of malaria transmission, although other factors are also involved under different circumstances [8,10]. It has been found that heavy or scant rainfall is not the only factor that influences the incidence of malaria [11]. Rainfall and temperature are the essential components contributing to malaria transmission [12]. Analysis has shown that the fatality rate increases during the rainy season regardless of temperature and humidity [13].
Temperature, precipitation, and humidity influence the growth and development of mosquitoes throughout their life cycle [14]. The vector larvae survive only when environmental conditions are conducive, at moderate temperatures above 16°C, and die when the temperature is too low or too high [15]. The rate at which mosquitoes bite humans to feed on their blood increases when environmental conditions favour their survival.
Approaches to prediction range from mathematical and statistical modelling to machine learning algorithms [16]. Mathematical and statistical models play an important role in producing predictions that support decisions. Machine learning is used in health care, where it helps diagnose many diseases such as cancer [17][18]. It also helps pharmacologists find the right formulation for a reliable medicine to combat a pathogen [19,20], and it is used to select effective treatments [20]. In agriculture it can be used to predict plant pests [21], and it can also be used to predict stock market movements [22].
The choice of technique for proper prediction depends on the problem to be solved. Options include Support Vector Machine (SVM), Naïve Bayes, Decision Tree, Artificial Neural Network (ANN), Random Forest, Logistic Regression, and Extreme Gradient Boosting [23].

Data Collection
The datasets used to train the machine learning algorithms contain the following parameters: the percentage of the population using at least basic sanitation, the percentage of the population using at least basic drinking water, average temperature, average rainfall, the total number of malaria cases reported per year, and the total incidence of malaria (per 1,000 population at risk). The malaria incidence data were obtained from the World Health Organization portal (http://apps.who.int/gho/data/node.gswcah) and from Kaggle (https://www.kaggle.com/lydia70/malaria-in-africa), and the atmospheric data on rainfall and temperature from the World Bank Climate Knowledge Portal (https://climateknowledgeportal.worldbank.org/download-data).

Building a Model
Jupyter Notebook was used to build the model. It is available free of charge and stores the user's work in a single document that combines text and program code, which makes it a convenient tool for data analysis. Machine learning algorithms were used to analyse the data and to develop models for making predictions.

Data transformation and Execution
The data are stored in CSV format, with the atmospheric dataset and the malaria incidence dataset integrated into a single file. The dataset is split into two parts: eighty percent (80%) for model training and twenty percent (20%) for testing. Six classification techniques are used: Support Vector Machine, Logistic Regression, Random Forest, Naïve Bayes, Decision Tree, and KNN.
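The integration and the 80/20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the notebook code used in this research; the column names (country, year, avg_temp, avg_rain, cases), the small in-memory tables, and the helper names merge_on_keys and split_80_20 are all assumptions made for the example.

```python
import random

def merge_on_keys(climate_rows, malaria_rows, keys=("country", "year")):
    """Inner-join two lists of dict records on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in climate_rows}
    merged = []
    for row in malaria_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**match, **row})
    return merged

def split_80_20(rows, seed=0):
    """Shuffle a copy of the records and return (train, test) in an 80/20 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records standing in for the climate and malaria CSV files.
climate = [{"country": "X", "year": 2000 + i, "avg_temp": 25.0 + i, "avg_rain": 100.0 + i}
           for i in range(10)]
malaria = [{"country": "X", "year": 2000 + i, "cases": 1000 * i} for i in range(10)]

dataset = merge_on_keys(climate, malaria)
train, test = split_80_20(dataset)
print(len(dataset), len(train), len(test))
```

In a notebook the same steps are commonly done with pandas and with scikit-learn's train_test_split (test_size=0.2); the plain-Python version is shown only to keep the example self-contained.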

Algorithms Used
Popular classification algorithms were used and their performances compared.

Support Vector Machine
Support Vector Machine (SVM), also called Support Vector Classifier (SVC), computes a hyperplane that separates groups of different classes [24]. In this algorithm, each feature corresponds to a coordinate in an N-dimensional space, so that each observation is a data point to be classified. It is one of the best-known techniques.
Let the training data be A = {(B_1, C_1), ..., (B_n, C_n)}, where each B_i is a feature vector and each C_i is its class label. C_i can take two values (-1 and +1), which indicate the class to which B_i belongs. A hyperplane is the set of points B satisfying P · B - K = 0, where P is the normal vector to the hyperplane and K / ||P|| is the offset of the hyperplane from the origin along P.
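To make the role of the hyperplane P · B - K = 0 concrete, the sketch below classifies a point by the sign of P · B - K and computes its distance to the hyperplane. It does not train an SVM (finding P and K is the optimisation step, omitted here); the values of P and K and the function names are hypothetical and chosen only for illustration.

```python
import math

def decision_value(P, B, K):
    """Return P . B - K, whose sign tells on which side of the hyperplane B lies."""
    return sum(p * b for p, b in zip(P, B)) - K

def predict(P, B, K):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(P, B, K) >= 0 else -1

def distance_to_hyperplane(P, B, K):
    """Perpendicular distance from point B to the hyperplane P . B - K = 0."""
    norm = math.sqrt(sum(p * p for p in P))
    return abs(decision_value(P, B, K)) / norm

P, K = [3.0, 4.0], 5.0  # hypothetical hyperplane parameters, not fitted to data
print(predict(P, [3.0, 4.0], K))                 # a point on the positive side
print(predict(P, [0.0, 0.0], K))                 # the origin lies on the negative side
print(distance_to_hyperplane(P, [0.0, 0.0], K))  # distance of the origin, K / ||P||
```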

K-Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors is a supervised machine learning technique that is popular for its simplicity in solving classification problems. It predicts the class of a new observation by identifying its closest neighbors among the existing observations and taking the majority vote of the classes of its k nearest neighbors, which gives it good predictive power and makes its output quick to interpret [25]. A well-chosen value of K minimizes the training and validation time.
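The majority-vote rule can be sketched in a few lines. The example below assumes Euclidean distance and two made-up features (average temperature and average rainfall) with made-up labels; knn_predict is an illustrative helper, not part of the model built in this research.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical observations: (average temperature, average rainfall) -> incidence level.
train_X = [(20.0, 50.0), (21.0, 55.0), (22.0, 60.0),
           (28.0, 200.0), (29.0, 210.0), (30.0, 220.0)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (27.0, 190.0)))
print(knn_predict(train_X, train_y, (21.0, 52.0)))
```

Because the features here are on different scales, rainfall dominates the distance; in practice the features are usually standardised before KNN is applied.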

Logistic Regression Algorithm
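Logistic regression is used when the dependent variable is binary. Because the relationship between the independent variables and a binary outcome is not linear, the algorithm estimates the probability that an observation belongs to each group. The final result, the group to which the observation is assigned, has to be binary [26].
In this formulation, Z denotes the numeric output of a linear regression on the independent variables, which the logistic (sigmoid) function converts into a probability between 0 and 1.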
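The step from the linear value Z to a binary decision can be illustrated as follows, assuming the standard logistic function 1 / (1 + e^(-Z)) and a decision threshold of 0.5. The weights, bias, and threshold are hypothetical values chosen for the example, not coefficients fitted in this research.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Compute Z as a linear combination of the features, then apply the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(weights, bias, features, threshold=0.5):
    """Turn the estimated probability into a binary outcome."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

weights, bias = [0.8, -0.5], 0.1  # hypothetical coefficients
print(predict_proba(weights, bias, [1.0, 1.0]), predict_class(weights, bias, [1.0, 1.0]))
print(predict_proba(weights, bias, [0.0, 2.0]), predict_class(weights, bias, [0.0, 2.0]))
```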

Random Forest Algorithm
Random Forest can be applied to both classification and regression problems; it works by combining several decision trees [27]. Each decision tree is fed a sample of rows drawn with replacement (bootstrapping) together with a sample of features. The results of the individual trees are then combined by majority voting to give the final prediction; for a regression problem, the mean or median is used instead. Each decision tree has low bias and high variance, but combining them achieves low variance. A hyper-parameter sets the number of features to be considered.
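The sampling and aggregation steps can be sketched without fitting any trees. In the example below, the individual tree predictions are hypothetical placeholders; the helper names, the seed, and the choice of two features out of six are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_rows(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def sample_features(n_features, max_features, rng):
    """Pick the subset of feature indices that one tree is allowed to use."""
    return sorted(rng.sample(range(n_features), max_features))

def majority_vote(tree_predictions):
    """Classification: the class predicted by most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average(tree_predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return mean(tree_predictions)

rng = random.Random(42)
rows = bootstrap_rows(10, rng)        # some rows repeat, others are left out
features = sample_features(6, 2, rng)
print(rows, features)
print(majority_vote([1, 0, 1, 1, 0]))  # three of five hypothetical trees vote 1
print(average([2.0, 4.0, 6.0]))        # hypothetical regression outputs
```

Here max_features plays the role of the hyper-parameter that sets how many features each tree considers.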

Naive Bayes Algorithm
This type of classifier uses probability and treats all the features as independent of one another. It calculates the probability of each class of the target variable given the values of the independent variables, based on the training data [28].
Let Z be the target variable and q = (q_1, ..., q_n) an array of independent variables. The probability can then be obtained from Bayes' theorem as P(Z | q) = P(Z) · P(q | Z) / P(q), where the independence assumption gives P(q | Z) = P(q_1 | Z) · ... · P(q_n | Z).
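A minimal sketch of this calculation for categorical features is given below. It estimates the prior P(Z) and the per-feature likelihoods P(q_j | Z) from counts and normalises by P(q). The toy records and labels are invented for illustration, and no smoothing is applied, so the sketch assumes that at least one class has seen every value in q.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def posterior(class_counts, value_counts, q):
    """Return P(Z | q) for every class Z, treating the features as independent."""
    total = sum(class_counts.values())
    scores = {}
    for z, n_z in class_counts.items():
        score = n_z / total                              # prior P(Z)
        for j, value in enumerate(q):
            score *= value_counts[(z, j)][value] / n_z   # likelihood P(q_j | Z)
        scores[z] = score
    evidence = sum(scores.values())                      # P(q), the normalising constant
    return {z: s / evidence for z, s in scores.items()}

# Hypothetical records: (season, temperature band) -> incidence level.
rows = [("wet", "hot"), ("wet", "mild"), ("dry", "hot"),
        ("dry", "mild"), ("dry", "mild"), ("wet", "hot")]
labels = ["high", "high", "high", "low", "low", "low"]

class_counts, value_counts = train_naive_bayes(rows, labels)
print(posterior(class_counts, value_counts, ("wet", "hot")))
```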

Decision Tree Algorithm
The decision tree classifier is one of the best-known techniques and is used in many areas of research. The dataset is divided into branches based on conditions. It uses a tree-like structure in which each internal node is a decision point and each leaf represents an output [29].
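The structure of decision points and leaves can be illustrated with a hand-written tree. The features, thresholds, and labels below are hypothetical and were not learned from the dataset; the sketch shows only how a prediction is read off by walking from the root to a leaf.

```python
def predict(node, sample):
    """Walk from the root to a leaf, following the test at each decision node."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# A hand-written tree: internal nodes hold a test, leaves hold an output label.
tree = {
    "feature": "avg_rain", "threshold": 100.0,
    "left": {"label": "low"},
    "right": {
        "feature": "avg_temp", "threshold": 16.0,
        "left": {"label": "low"},
        "right": {"label": "high"},
    },
}

print(predict(tree, {"avg_rain": 50.0, "avg_temp": 25.0}))
print(predict(tree, {"avg_rain": 150.0, "avg_temp": 25.0}))
```

A learning algorithm would choose the features and thresholds automatically from the training data; only the traversal is shown here.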

MODEL PERFORMANCE EVALUATION
Several aspects are considered before concluding which algorithm performs better than the others. In addition to accuracy, the receiver operating characteristic (ROC) and the area under the curve (AUC) are among the measures used in the final assessment of the model.

Classification Models Performance
The performances of the algorithms were checked and compared to identify the accuracy of the malaria prediction algorithms, which allows possible malaria outbreaks in a particular society to be identified. Performance is measured using accuracy, area under the curve (AUC), precision, sensitivity/recall, specificity, F1 score (the harmonic mean of precision and recall), error rate (ER), true positive rate (TPR), false positive rate (FPR), macro average, and weighted average, all computed from the dataset. The overall results are summarized in Tables 1 and 2.
Table 1 shows the percentage performance of each classifier, which indicates the advantage of one classifier over another. Random Forest has the highest accuracy of all the classifiers assessed, and it also achieves higher precision and AUC for the prediction of malaria outbreaks based on the variables involved in the analysis.
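For reference, the measures listed above can be computed from the four cells of a binary confusion matrix, and AUC can be computed as the fraction of positive/negative pairs that the classifier orders correctly. The sketch below uses hypothetical counts and scores; they are not the results reported in Table 1.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common measures from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity, also the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

metrics = classification_metrics(tp=8, fp=2, tn=8, fn=2)  # hypothetical counts
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {auc([0.9, 0.8, 0.4], [0.7, 0.3]):.3f}")     # hypothetical scores
```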
Table 2 gives the performance ranking: the classifier with the highest performance is assigned the value one, the second the value two, and so on up to the last, which is assigned the value six. For recall/sensitivity, all classifiers have the same value except the Decision Tree classifier.
After the ranks in each row of Table 2 are added, the classifier with the lowest total is considered the best. Random Forest performs best, with 97.72% accuracy, 98% AUC, and 100% precision, followed by Logistic Regression. In the same vein, the analysis indicates that KNN and Naïve Bayes have a low ability to detect and predict malaria outbreaks in a given society based on the available variables, and in turn score lower on the performance measures. Decision Tree and SVM demonstrate average predictive power and ability to detect malaria outbreaks.
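The rank-sum procedure can be sketched as follows. The classifier names A, B, and C and their scores are placeholders invented for the example, not the values in Table 2. Ties are broken by input order in this sketch, which is an assumption; the tie-handling rule used for Table 2 is not stated.

```python
def rank_sum(scores):
    """scores: {classifier: {metric: value}}. Return the total rank per classifier (1 = best)."""
    metrics = list(next(iter(scores.values())).keys())
    totals = {name: 0 for name in scores}
    for metric in metrics:
        # Higher score is better, so sort in descending order before ranking.
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals

# Placeholder classifiers and scores, for illustration only.
scores = {
    "A": {"accuracy": 0.90, "precision": 0.80, "auc": 0.95},
    "B": {"accuracy": 0.80, "precision": 0.90, "auc": 0.85},
    "C": {"accuracy": 0.70, "precision": 0.70, "auc": 0.75},
}

totals = rank_sum(scores)
best = min(totals, key=totals.get)  # the lowest total rank is the best classifier
print(totals, best)
```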
The three best-performing classifiers in our use case are Random Forest, Logistic Regression, and Decision Tree, while Naïve Bayes has lower performance.

CONCLUSION
The results indicate that atmospheric factors and non-climatic features are significant in determining the occurrence of malaria outbreaks in a given society. They also imply that one algorithm outperforms the others in giving accurate results on the dataset used.
Accurate prediction is essential in determining when, where, and how an outbreak will occur, so that people know how to prepare for it and take all necessary measures to protect themselves against this cause of death.
Since malaria can be prevented if certain measures are taken, governments and non-governmental organizations have to use the available information to provide all the resources that can help individuals fight malaria, especially in the tropical and temperate regions of sub-Saharan Africa. Programs such as the Global Technical Strategy for Malaria 2016-2030 [30] have to be encouraged, together with public campaigns on the effect of malaria and on how people can manage their environmental sanitation effectively.

FUTURE WORK
This research can be improved by using a hybridized ensemble method of machine learning to obtain a better-performing model that integrates atmospheric features with non-climatic features, including population size, the vegetation of the area, the nature of interventions received, and other environmental features, during prediction. It is highly recommended to enlighten the public about the effect of malaria and to publish monthly malaria data for each country or region, as was done for the COVID-19 pandemic, so that more malaria datasets are available at any time; most current malaria datasets tend to be incomplete and vague.