Introduction: The world is facing a severe water crisis, and drinkable water is increasingly scarce. In India, moreover, the water supply depends heavily on wells, ponds, and similar sources, which has deepened the crisis. The aim of this project is to analyze a well-depth dataset that I received from my Prof. A.B (ISI Kolkata), examine its missing features, and present a solution for imputing the missing values.
Requirements: The libraries we are going to use are pandas, numpy, matplotlib, basemap, and fancyimpute, among others. These libraries are used intensively for numeric computation and for visualizing the data across India.
Description of the dataset: The dataset contains multiple features for wells across seven states of India.
Loading the dataset: The dataset is in CSV format, and the pandas library helps to read it. We also import a number of other libraries for later use. The output is shown below.
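Loading can be sketched as follows. Since the post does not give the actual file name or full column list, the snippet reads a tiny inline sample whose column names are hypothetical, modeled on the features discussed later; with the real file you would pass its path to pd.read_csv instead.

```python
import pandas as pd
from io import StringIO

# Tiny inline sample standing in for the real CSV (hypothetical columns).
sample_csv = StringIO(
    "State,Latitude,Longitude,Well Depth,Pre_2015,Pst_2015\n"
    "Maharashtra,19.75,75.71,12.4,10.1,8.9\n"
    "Haryana,29.06,76.09,,9.5,7.2\n"
)
df = pd.read_csv(sample_csv)  # for the real data: pd.read_csv("wells.csv")

# Inspect the first rows and the inferred column types.
print(df.head())
print(df.dtypes)
```

Note that the empty field in the Haryana row is read as NaN, which is exactly how the missing values analyzed below appear in the frame.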
State counts: The first step is to visualize the number of wells recorded in each state. As the figure below shows, data were taken from seven states, with the largest number of well records collected from Maharashtra, followed by Haryana and finally Madhya Pradesh (M.P.).
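The count plot comes down to a value_counts call on the state column. A minimal sketch with synthetic data (the real frame has one row per well with a State column; names and counts here are illustrative only):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Hypothetical rows; the real dataset has one row per well.
df = pd.DataFrame(
    {"State": ["Maharashtra"] * 5 + ["Haryana"] * 3 + ["Madhya Pradesh"] * 2}
)

# value_counts sorts in descending order by default.
counts = df["State"].value_counts()
print(counts)

counts.plot(kind="bar")
plt.ylabel("Number of wells")
plt.tight_layout()
plt.savefig("state_counts.png")
```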
Missing Values: The next job is to visualize the missing data in descending order, since analyzing missing values is one of the most important tasks. Missing data can be of 3 types:
1. Missing Completely At Random (MCAR): The probability that a value is missing is unrelated to both the observed and the unobserved data; the gaps are purely random. Little's MCAR test can be used to check whether data are missing completely at random.
2. Missing At Random (MAR): The probability that a value is missing depends only on other observed values, not on the missing value itself.
3. Missing Not At Random (MNAR): The probability that a value is missing depends on the unobserved value itself; the data are not missing at random.
Note: Imputing missing values with the mean, median, or most frequent value is not advisable. Before imputing, we have to determine which of the above types of missingness applies; only then can the error introduced by imputation be minimized.
o The largest missing rate is for the column named Aquifers, at 0.78 (i.e. 78%).
o The missing rate for Well Depth is 0.43 (i.e. 43%).
o From the theory above, these values appear to be missing at random.
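The per-column missing rates above can be computed in one line with pandas. A minimal sketch on a hypothetical frame (the real one has many more columns and rows):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real dataset.
df = pd.DataFrame({
    "Aquifers":   [np.nan, "Alluvium", np.nan, np.nan],
    "Well Depth": [10.0, np.nan, 12.5, 9.8],
})

# isna() gives a boolean mask; the column mean of that mask is the
# fraction of missing values. Sort descending to rank the worst columns.
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate)
```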
Heat map of India: The next task is to plot the latitude and longitude of each well over a map of India, using basemap. The aim is to plot every well on the map and then analyze the geographical and geological effects on the water crisis across these states. From the map shown below we can infer the following:
1. Very few well records have been collected.
2. The data cover only specific regions of the particular states.
3. Some of the wells are located near riversides, which might affect their water levels. If the river flows throughout the year, the water level of those wells will not be very deep.
4. The depth of a well's water level is proportional to the scarcity of water in that region.
5. The grey areas represent the wells with the maximum depth. From the map we can see that some of the wells of Gujarat, Rajasthan, and Bangalore are facing a severe water crisis, while green represents wells whose depth is better than that of the others.
6. The color bar on the right side of the map represents the variation in well depth.
Trend Analysis: The dataset contains pre- and post-monsoon water depths from 2015 to 2019. Since this period is given, I carried out a trend analysis to extract useful insights from the dataset. Here I analyzed the well depth together with the pre- and post-monsoon depths.
Inference: From the figure we can infer:
1. Pre- and post-monsoon depths follow similar trends.
2. Pre- and post-monsoon well depths are smaller than the overall well depth, yet the trends remain similar.
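A trend plot of this kind reduces to averaging the yearly Pre_YYYY/Pst_YYYY columns and plotting the two series against the year. A minimal sketch with hypothetical per-well values (only the column naming scheme is taken from the post):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

years = [2015, 2016, 2017, 2018, 2019]

# Hypothetical two-well frame; the real one has many rows.
df = pd.DataFrame({
    "Pre_2015": [8.0, 8.4], "Pre_2016": [8.5, 8.7], "Pre_2017": [9.0, 9.2],
    "Pre_2018": [8.8, 9.0], "Pre_2019": [9.3, 9.5],
    "Pst_2015": [6.0, 6.2], "Pst_2016": [6.3, 6.5], "Pst_2017": [6.8, 7.0],
    "Pst_2018": [6.6, 6.8], "Pst_2019": [7.1, 7.3],
})

# Mean depth per year for each season.
pre_mean = df[[f"Pre_{y}" for y in years]].mean()
pst_mean = df[[f"Pst_{y}" for y in years]].mean()

plt.plot(years, pre_mean.values, marker="o", label="Pre-monsoon")
plt.plot(years, pst_mean.values, marker="o", label="Post-monsoon")
plt.xlabel("Year")
plt.ylabel("Mean water depth")
plt.legend()
plt.savefig("trend.png")
```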
Correlation: After analyzing the dataset we are left with a number of numerical features. Instead of analyzing all of them, only the features correlated with well depth will be taken into account. In the heat map, deep red cells indicate pairs of features that are highly correlated with each other. Some features are significantly correlated with well depth: Pre_2015, Pre_2016, Pre_2017, Pre_2018, Pre_2019, Pst_2015, Pst_2016, Pst_2017, Pst_2018, and Pst_2019.
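Selecting the features correlated with well depth can be sketched as follows. The data here are synthetic (three columns built from a shared signal so they correlate strongly, plus one unrelated column), and the 0.5 threshold is an illustrative choice, not one stated in the post:

```python
import numpy as np
import pandas as pd

# Synthetic frame: three columns share a common signal, one does not.
rng = np.random.default_rng(1)
base = rng.normal(10, 3, 300)
df = pd.DataFrame({
    "Well Depth": base + rng.normal(0, 0.5, 300),
    "Pre_2015":   base + rng.normal(0, 0.5, 300),
    "Pst_2015":   base + rng.normal(0, 0.5, 300),
    "Unrelated":  rng.normal(0, 1, 300),
})

corr = df.corr()  # pairwise Pearson correlation matrix
# Keep only features strongly correlated with Well Depth (|r| > 0.5).
strong = corr["Well Depth"][corr["Well Depth"].abs() > 0.5].index.tolist()
print(strong)
```

A seaborn heat map of `corr` would reproduce the deep-red blocks described above.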
Imputation: Imputation is essential to get a clear picture of the dataset, since several of its features contain missing values. To impute them, we will use a dedicated package, fancyimpute.
!pip install fancyimpute
from fancyimpute import IterativeImputer
# MICE-style imputer: models each column from the others, iteratively.
mice_imputer = IterativeImputer()
# num_cols (hypothetical name) would hold the numeric columns to impute:
# df[num_cols] = mice_imputer.fit_transform(df[num_cols])
Before imputation, the percentage of missing values in each column was:

% Null in Well Depth: 43.49 %
% Null in Pre_2015: 6.81 %
% Null in Pre_2016: 2.56 %
% Null in Pre_2017: 2.94 %
% Null in Pre_2018: 2.45 %
% Null in Pre_2019: 8.7 %
% Null in Pst_2015: 5.42 %
% Null in Pst_2016: 2.97 %
% Null in Pst_2017: 1.47 %
% Null in Pst_2018: 2.33 %
% Null in Pst_2019: 10.02 %

After imputation using fancyimpute, every one of these columns (Well Depth, Pre_2015 through Pre_2019, and Pst_2015 through Pst_2019) has 0 missing values.
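The full imputation step can be sketched end to end. To keep the sketch self-contained I use scikit-learn's IterativeImputer, which implements the same MICE-style iterative approach as the fancyimpute class used in the post; the frame and its values are hypothetical:

```python
import numpy as np
import pandas as pd
# scikit-learn's iterative imputer is still experimental, so it must be
# explicitly enabled before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical frame with correlated columns and a few gaps.
df = pd.DataFrame({
    "Well Depth": [12.0, np.nan, 9.5, 11.2, np.nan],
    "Pre_2015":   [10.1, 8.3, 8.0, 9.7, 7.9],
    "Pst_2015":   [8.9, 7.0, np.nan, 8.4, 6.6],
})
print(df.isna().mean())  # missing fraction per column, before

imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.isna().sum())  # zero missing values per column, after
```

Because each column is modeled from the correlated ones, the filled values respect the relationships found in the correlation step rather than collapsing to a single mean.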
Results: To check whether the imputation is up to the mark, we plot the first 100 rows of well depth before and after imputation, as shown in the figures below. The imputed values follow the existing trend of the data, and similar behavior can be seen across all of the features imputed with this algorithm. I believe the trend analysis helped in observing the dataset and the variation of its features. These are our results; further analysis can now be done to derive more useful insights.
The dataset is available here
My GitHub for the complete analysis
My Kaggle profile
All the analysis was done in a Jupyter notebook provided by kaggle.com.