Statistical Analysis Before Classification

Rabi Kumar Singh
8 min read · Aug 30, 2021

Introduction: Breast cancer arises in the lining cells (epithelium) of the ducts (85%) or lobules (15%) in the glandular tissue of the breast. Over time, these in situ (stage 0) cancers may progress and invade the surrounding breast tissue (invasive breast cancer) then spread to the nearby lymph nodes (regional metastasis) or to other organs in the body (distant metastasis). If a woman dies from breast cancer, it is because of widespread metastasis.

Breast cancer treatment can be highly effective, especially when the disease is identified early. Treatment of breast cancer often consists of a combination of surgical removal, radiation therapy and medication (hormonal therapy, chemotherapy and/or targeted biological therapy) to treat the microscopic cancer that has spread from the breast tumor through the blood.
In 2020, there were 2.3 million women diagnosed with breast cancer and 685 000 deaths globally. As of the end of 2020, there were 7.8 million women alive who had been diagnosed with breast cancer in the past 5 years, making it the world’s most prevalent cancer. More disability-adjusted life years (DALYs) are lost by women to breast cancer globally than to any other type of cancer.

Initially, I checked for missing values, performed data pre-processing and transformation, examined the correlation among the features and multicollinearity, and sliced the dataset; models were then applied to produce the desired outputs. Several classifiers were used: Logistic Regression, KNN, Decision Tree, Random Forest, SVM and Boosting. Besides that, important metrics for each model were used to check and compare which model works best with the dataset.

Who is at risk? Breast cancer is not a transmissible or infectious disease. There are no known viral or bacterial infections linked to the development of breast cancer. Certain factors increase the risk of breast cancer including increasing age, obesity, harmful use of alcohol, family history of breast cancer, history of radiation exposure, reproductive history (such as age that menstrual periods began and age at first pregnancy), tobacco use and postmenopausal hormone therapy.

Behavioral choices and related interventions that reduce the risk of breast cancer include:

  • prolonged breastfeeding;
  • regular physical activity;
  • weight control;
  • avoidance of harmful use of alcohol;
  • avoidance of exposure to tobacco smoke;
  • avoidance of prolonged use of hormones; and
  • avoidance of excessive radiation exposure.

Treatment Breast cancer treatment can be highly effective, achieving survival probabilities of 90% or higher, particularly when the disease is identified early. Treatment generally consists of surgery and radiation therapy for control of the disease in the breast, lymph nodes and surrounding areas (locoregional control) and systemic therapy (anti-cancer medicines given by mouth or intravenously) to treat and/or reduce the risk of the cancer spreading (metastasis).

Dataset:- The dataset has been collected from an open source and is available here: dataset. Besides that, it is also available on Kaggle. This dataset is used for the classification task. The features available in this dataset are listed below:

Details of Scales:- We can see that “id” is one of the features of the dataset; it is the serial number of each patient who underwent breast cancer diagnosis. Since “id” is a serial number, it is on an interval scale.

Features and their scales

The figure above details the scale of each feature available in the dataset.

Explanatory and Response variables: The dataset has a number of explanatory variables and a single response variable, as shown in the figure below.

Response and Explanatory Variables

Note: There are a total of 32 variables. “diagnosis” is the response variable, while the rest, excluding “id”, are the explanatory variables.

Response Variable:- Its distribution can be seen in the figure below.

Response Variables

From the figure it can be observed that B (benign) accounts for 62% of the cases while M (malignant) accounts for almost 38%, so the data is somewhat imbalanced. In the analysis I have considered two cases:

Case-I: where the data imbalance is ignored.

Case-II: where the data imbalance is taken into account and sampling is performed. A sketch of one possible resampling approach is shown below.
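The article does not specify the sampling method, but a minimal sketch of one common option, upsampling the minority class with scikit-learn’s resample utility (the column name “diagnosis” and the labels “M”/“B” are taken from the dataset), could look like this:

import pandas as pd
from sklearn.utils import resample

# Split the frame by class: B (benign) is the majority, M (malignant) the minority
majority = df[df["diagnosis"] == "B"]
minority = df[df["diagnosis"] == "M"]

# Upsample the minority class with replacement until the classes are balanced
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])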

Missing Values:- After carefully analyzing the dataset, we see that it contains no null values. Hence our work became easy, as no imputation had to be performed. The code to find the missing values is shown below.

for i in df:
    # Fraction of null values in this column, expressed as a percentage
    null = df[i].isnull().sum() / len(df) * 100
    if null > 0:
        print("{}'s null rate {} %".format(i, null))

In our case the input (explanatory) variables are numerical and the output (response) variable is categorical. For the correlation analysis I have encoded the categorical response as a numerical variable and computed its relationship with each explanatory variable.
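A minimal sketch of this encoding, assuming the response column is named “diagnosis” with labels “M” (malignant) and “B” (benign):

# Map the two class labels to 1/0 so correlations can be computed
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})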

Pearson’s Correlation Coefficient:- To find the relationship between the response variable and each explanatory variable, I have calculated Pearson’s coefficient. The formula for the correlation coefficient is given by

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

This is the formula for the correlation coefficient, and it is denoted by r.

The value of r lies between -1 and 1.

  • Correlation refers to the association between the observed values of two variables.
  • Correlation quantifies this association, often as a value between -1 and 1 for perfectly negatively correlated and perfectly positively correlated. The calculated value is referred to as the “correlation coefficient”.
  • The correlation between two variables that each have a Gaussian distribution can be calculated using standard methods such as Pearson’s correlation.
  • This procedure cannot be used for data that does not have a Gaussian distribution; instead, rank correlation methods must be used.

Case-1: when r = -1, the two variables are perfectly negatively correlated: as one variable increases, the other decreases.

Case-2: when r = 0, there is no linear relationship between the two variables (note that zero correlation does not by itself imply independence).

Case-3: when r = 1, the two variables are perfectly positively correlated: as one variable increases, the other also increases.

Now we calculate the coefficient of every explanatory feature against the response variable.
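A minimal sketch of this calculation with SciPy, assuming the response has already been encoded numerically as above:

from scipy.stats import pearsonr

# Pearson's r and its p-value for each explanatory feature vs. the response
for col in df.columns.drop(["id", "diagnosis"]):
    r, p = pearsonr(df[col], df["diagnosis"])
    print("{}: r = {:.3f}, p-value = {:.4f}".format(col, r, p))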

Pearson's coefficient and P-value

Observation:- On the basis of the Pearson correlation values, we can see which features are correlated with the response variable and which are nearly independent of it.

Spearman’s Correlation Coefficient:- This is used in the same way as Pearson’s correlation coefficient, but it is computed on the ranks of the data.

Spearman’s Correlation Coefficient

We get almost the same correlation values as in the case of Pearson’s coefficient.
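A minimal sketch with SciPy’s spearmanr, using one feature as an example (the column name “radius_mean” is an assumption about the dataset):

from scipy.stats import spearmanr

# Spearman's rank correlation between one feature and the response
rho, p = spearmanr(df["radius_mean"], df["diagnosis"])
print("rho = {:.3f}, p-value = {:.4f}".format(rho, p))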

Difference between Spearman and Pearson’s coefficient

  • Pearson’s correlation assesses linear relationships, while Spearman’s correlation assesses monotonic relationships (whether linear or not).

Multi-collinearity: There are many features in the dataset, 32 including the response variable, so there is a high chance of multicollinearity among the features. To detect multicollinearity we have multiple options.

Variance inflation factor (VIF) measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables.

  • VIF = 1: not correlated.
  • VIF between 1 and 5: moderately correlated.
  • VIF greater than 5: highly correlated.
  • A common rule of thumb is VIF > 10: features above this threshold are removed. A sketch of the computation is shown below.
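A minimal sketch of this computation with statsmodels (the dropped column names are assumptions about the dataset):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF is computed per column against all the other explanatory columns
X = df.drop(columns=["id", "diagnosis"])
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))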
VIF of Features

Observation: we can make some observations:

  • The VIF of most of the features is more than 10.
  • It is not possible to remove all of these features.
  • There is an alternative, iterative way: remove the feature with the highest VIF, recompute, and repeat; the remaining features are then used in the model.

Alternative Methods to remove features:- There are two methods that I have performed in order to remove some of the features which might not be useful for modelling.

In our case VIF alone is not working well, so I have chosen an alternative. First, we perform some statistical analysis: we calculate the p-value for each feature and remove those features which have a high p-value.
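A minimal sketch of this filter, reusing the Pearson p-values computed earlier (the 0.05 threshold is an assumption, not stated in the article):

from scipy.stats import pearsonr

alpha = 0.05  # assumed significance threshold

# Keep only the features whose correlation with the response is significant
keep = [col for col in df.columns.drop(["id", "diagnosis"])
        if pearsonr(df[col], df["diagnosis"])[1] <= alpha]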

Next, a normality test is performed for each variable. Features that do not follow the Gaussian distribution will not be used for modelling. To perform the normality test, we can use a Q-Q plot, a distribution plot and a histogram. Besides that, the Kolmogorov-Smirnov (KS) test and the chi-square goodness-of-fit test are also performed.

Q-Q plot

  • Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.
  • It is a technique to check whether two sets of sample points follow the same distribution.
  • One distribution is known; we have to check the other distribution against it.
  • If the unknown sample follows the given distribution, we get a scatter plot where the data points lie on the straight line y = x.
  • The idea is to plot the quantile values of the two distributions/samples and see if they make a straight line or not. If the quantiles of the two sample sets are similar or, in the best case, identical, then the samples are from the same distribution.

The process of a QQ plot (a minimal sketch follows the list):

  • Arrange the dataset in increasing order.
  • Calculate the percentile of each value.
  • Plot those percentiles with the help of a scatter plot.
  • Done.
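A minimal sketch using SciPy’s probplot, which performs these steps against a normal distribution (the column name “radius_mean” is an assumption):

import matplotlib.pyplot as plt
from scipy import stats

# Q-Q plot of one feature against the theoretical normal quantiles
stats.probplot(df["radius_mean"], dist="norm", plot=plt)
plt.show()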

Shapiro-Wilk Test

  • Evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution.
  • The Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that the test may only be suitable for smaller samples of data, e.g. thousands of observations or fewer.
  • The function returns both the W-statistic calculated by the test and the p-value, as in the sketch below.
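A minimal sketch with SciPy’s shapiro function (the column name “radius_mean” and the 0.05 threshold are assumptions):

from scipy.stats import shapiro

stat, p = shapiro(df["radius_mean"])
print("W = {:.3f}, p-value = {:.4f}".format(stat, p))
if p > 0.05:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")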

References

Dataset is available here

My GitHub for the complete analysis

My Kaggle profile

https://www.kaggle.com/jurk06
