EDA (Exploratory Data Analysis) on Haberman’s Survival Data Set

kartika Panwar
9 min read · Oct 14, 2020


1. Introduction

Haberman's Survival data set contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on patients who had undergone surgery for breast cancer.

I will walk through the whole data analysis step by step using a range of operations. The end goal is to predict the survival status of patients who have undergone surgery.

EDA (Exploratory Data Analysis) is, at its core, the analysis of data. It works best when we have good domain knowledge, so that we can relate the features to one another and draw accurate conclusions. So I will explain the features of this data set and, as simply as possible, how each one affects the others.

Kaggle source: https://www.kaggle.com/gilsousa/habermans-survival-data-set

The data set has 4 attributes: 3 features and 1 class attribute. It contains 306 instances.

Number of Axillary Nodes (Lymph Nodes): number of positive axillary nodes detected (numerical)

Age: age of the patient at the time of operation (numerical)

Operation Year: year in which the patient underwent surgery

Survival Status: whether the patient survived 5 years or longer after surgery. Patients who survived 5 years or more are represented as 1, and patients who survived less than 5 years are represented as 2.

Lymph Nodes:

Lymph nodes are small, bean-shaped organs that act as filters for harmful substances. They contain immune cells that help destroy germs. Lymph nodes try to trap cancer cells before they reach other parts of the body. If a person has cancer cells in the lymph nodes under the arms (the axillary nodes), it suggests an increased risk of the cancer spreading.

Sentinel Lymph Nodes (Lymph Nodes closest to the Tumour)

(Source: https://www.nationalbreastcancer.org/breast-cancer-lymph-node-removal)

Now let’s start exploring the dataset and drawing conclusions.

2. Libraries

I have used several libraries: Pandas, NumPy, Seaborn and Matplotlib. These Python modules make data science easy and effective. I chose Python because it offers a rich collection of libraries and modules for data handling and mathematical operations.

The various libraries are imported first for the operations that follow.

Now we will read the CSV file. With read_csv from the pandas package, you can import data stored in CSV format and then perform all further operations on it.
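As a minimal sketch of this step, the snippet below reads a CSV with pandas. The column names, the file name `haberman.csv` and the missing header row are assumptions, not details given in the article, and the five rows are illustrative values in the assumed layout so that the snippet runs without the downloaded file.

```python
import io

import pandas as pd

# Hypothetical column names; the article only refers to the features informally.
COLUMNS = ["age", "year_of_op", "axillary_nodes", "survival_status"]

# Illustrative rows (not taken from the real file) in the assumed layout:
# age, operation year, axillary nodes, survival status, with no header row.
sample_csv = io.StringIO(
    "30,64,1,1\n"
    "34,59,0,2\n"
    "43,63,14,1\n"
    "52,69,3,2\n"
    "65,58,0,1\n"
)

# With the downloaded file this would become something like:
#   df = pd.read_csv("haberman.csv", header=None, names=COLUMNS)
df = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(df.head())
```

If the downloaded file already has a header row, dropping `header=None` and `names=COLUMNS` would be the natural adjustment.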

3. Loading the Data set

The snippet of code below shows the shape of the data, that is, the number of rows and columns present in it.
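A sketch of how the shape and column names could be inspected is shown here. It uses a synthetic 60-row stand-in with hypothetical column names, so it prints `(60, 4)`; on the full data set the article reports 306 rows and 4 columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Haberman data (NOT the real values), with
# hypothetical column names, so the snippet runs without the CSV file.
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

n_instances, n_attributes = df.shape   # (rows, columns)
print(df.shape)
print(df.columns)                      # column names on a single line
```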

This tells us that 306 instances and 4 attributes were imported successfully. We can also see the column names on a single line.

The snippet of code below shows the value counts of the survival status in the dataset.
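Counting the classes could look like the sketch below. The ten status codes are made up for illustration, and the column name is hypothetical; the last lines only turn the counts reported in this article (225 and 81 out of 306) into percentages.

```python
import pandas as pd

# Made-up status codes for illustration: 1 = survived 5 years or more,
# 2 = survived less than 5 years.
status = pd.Series([1, 1, 2, 1, 1, 2, 1, 1, 1, 2], name="survival_status")

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction per class
print(counts)
print(shares)

# Percentages for the counts reported in the article.
long_share = 225 / 306
short_share = 81 / 306
print(f"{long_share:.1%} long survival, {short_share:.1%} short survival")
```

Roughly three quarters of the patients fall into one class, so the data set is imbalanced, which is worth keeping in mind when reading the plots.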

The output above tells us that 225 of the 306 patients survived 5 years or more, and only 81 patients survived less than 5 years.

Now let’s deep dive into the actual classification of data so that we can easily come to a conclusion.

Before analysing any data set, we need to be clear about the actual objective. Objective: to predict whether the patient will survive after the given treatment or not. There are many kinds of plots to choose from; pick the ones that suit the analysis, and don’t lose sight of the objective.

4. Bi-variate Analysis

4.1 2-D Scatter Plot

When reading any plot, we should always start by understanding the axes: their labels and scale.
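A plain 2-D scatter plot with labelled axes might be drawn as in this sketch. The data is a synthetic stand-in and the column names are hypothetical; the choice of age against axillary nodes is also an assumption, since the original plot is not shown.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

fig, ax = plt.subplots()
ax.scatter(df["age"], df["axillary_nodes"])   # every point has the same colour
ax.set_xlabel("age (years)")                  # label both axes explicitly
ax.set_ylabel("axillary_nodes (count)")
ax.set_title("Age vs. axillary nodes")
```

Call `plt.show()` to display the figure when running this as a script.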

Observation:

This plot does not tell us much, because we cannot tell the data points of age, axillary nodes and year_of_op apart. So we will draw another plot using sns, the usual alias for the widely used seaborn library.
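One way to make the points distinguishable with seaborn is to colour them by class, as sketched below. This is not necessarily the call used for the original figure; the data is synthetic, and the column names and the `outcome` labels are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

# Readable class labels for the legend (hypothetical helper column).
df["outcome"] = df["survival_status"].map(
    {1: "survived 5+ years", 2: "survived < 5 years"}
)

fig, ax = plt.subplots()
# hue= assigns one colour per class, so the two groups can be told apart.
sns.scatterplot(data=df, x="age", y="axillary_nodes", hue="outcome", ax=ax)
```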

Now let’s go deeper into the analysis. From the above plot we cannot tell what is actually happening or which features we should take forward. To find out which features to use, we will introduce the concept of pair plots.

4.2 Pair Plots

Pair plots help in identifying which features are the most suitable of all.

The image above combines scatter plots for every pair of features. Plots of this type are called pair plots.

Observation:

In the above plot, every pair of features overlaps heavily across the two classes, but the plots involving axillary nodes show slightly more separation than the rest.

Going through the plots one by one will show which features I take forward in the analysis. I want the features that separate the two classes more clearly than any others. So let’s analyse each plot except plots 1, 5 and 9, which sit on the diagonal and show a smoothed histogram of each individual feature.

Plot 2: This plot has operation year on the X-axis and age on the Y-axis. The points of the two classes mostly overlap, so we cannot distinguish anything properly.

Plot 3: Some points in this plot are distinguishable. The points still overlap, but it is better than all the other plots, and we can reach a more precise conclusion later with the histogram and CDF. So I will select the features of this plot, age and axillary nodes.

Plot 4: This is plotted using operation year and age, so it shows the same plot as Plot 2 rotated by 90 degrees. I will eliminate this combination.

Plot 6: This plots operation year against axillary nodes. It is similar to Plot 2, but the points seem to overlap even more than in the other plots. So I will also reject this combination.

Plot 7: This plot is the same as Plot 3 with the axes interchanged, so it is rotated by 90 degrees. I will accept this combination for further operations as well.

Plot 8: This is exactly the same as Plot 6 with the axes interchanged.

We have now selected the features that give the clearest picture for analysing the data: age and axillary nodes.

4.3 1-D Scatter plot

Observation:

From this plot we can conclude that patients with fewer than 50 nodes were able to survive more than 5 years, while those above 60 and below 70 were not. We cannot read the data clearly here, because the points overlap a lot.
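A 1-D scatter plot of this kind can be sketched by placing every value on a single horizontal line, one series per class. The data is synthetic, the column names are hypothetical, and plotting axillary nodes is an assumption about which feature the original figure used.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

long_surv = df[df["survival_status"] == 1]    # survived 5 years or more
short_surv = df[df["survival_status"] == 2]   # survived less than 5 years

fig, ax = plt.subplots()
# All points share y = 0, so only the x position carries information.
ax.plot(long_surv["axillary_nodes"], np.zeros(len(long_surv)), "o",
        label="survived 5+ years")
ax.plot(short_surv["axillary_nodes"], np.zeros(len(short_surv)), "o",
        label="survived < 5 years")
ax.set_xlabel("axillary_nodes")
ax.set_yticks([])
ax.legend()
```

Because identical values land on exactly the same spot, the overlap described in the observation is built into this kind of plot.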

To analyse it more precisely, we need to look at the PDF and CDF.

5. Uni-variate Analysis

5.1 Probability Density Function (PDF)

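The probability density function describes how likely a variable is to take a value near x; the probability of falling within a range is the area under the curve over that range. Its plot is basically a smoothed form of the histogram.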

Now we will plot the PDF for the features to find the one that best distinguishes patients who survived more than 5 years after treatment from those who survived less than 5 years. Here I explain only the feature used for the analysis, but you can also plot the PDFs of the other features for better understanding.
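The sketch below approximates the PDF of axillary nodes per class with a density-normalised histogram, which is a stand-in for the smoothed curve rather than the exact plot from this article. The data is synthetic and the column names are hypothetical. The last lines check that the bar areas add up to 1, which is what makes it a density.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

fig, ax = plt.subplots()
for status, label in [(1, "survived 5+ years"), (2, "survived < 5 years")]:
    nodes = df.loc[df["survival_status"] == status, "axillary_nodes"]
    # density=True rescales the bars so that their total area is 1.
    ax.hist(nodes, bins=10, density=True, alpha=0.5, label=label)
ax.set_xlabel("axillary_nodes")
ax.set_ylabel("density")
ax.legend()

# Numerical check on the whole column: bar height times bar width sums to 1.
density, bin_edges = np.histogram(df["axillary_nodes"], bins=10, density=True)
area = np.sum(density * np.diff(bin_edges))
print(area)
```

For a smooth curve on top of the bars, seaborn's `histplot` with `kde=True` is one option.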

Output:

Observation:

  1. Here we can conclude that this feature separates the classes best: patients with 0 axillary nodes mostly survived, but for patients with more than 0 nodes we cannot be sure whether they survived or not.
  2. Among patients with more than 5 axillary nodes, most died.

5.2 Cumulative Distribution Function (CDF)

The Cumulative Distribution Function is the probability that the variable takes a value less than or equal to x.
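A CDF can be built from a histogram by accumulating the share of observations bin by bin, as in this sketch. The data is a synthetic stand-in with hypothetical column names, and the threshold of 5 nodes is only used to show how a value is read off.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

long_nodes = df.loc[df["survival_status"] == 1, "axillary_nodes"].to_numpy()

counts, bin_edges = np.histogram(long_nodes, bins=10)
pdf = counts / counts.sum()   # share of observations in each bin
cdf = np.cumsum(pdf)          # running total: share up to each bin's right edge

fig, ax = plt.subplots()
ax.plot(bin_edges[1:], pdf, label="PDF (share per bin)")
ax.plot(bin_edges[1:], cdf, label="CDF")
ax.set_xlabel("axillary_nodes")
ax.legend()

# Reading the CDF at a threshold: share of this group with at most 5 nodes.
share_at_most_5 = np.mean(long_nodes <= 5)
print(share_at_most_5)
```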

Output:

Observation:

Suppose we take 5 axillary nodes as the threshold. At that point the CDF curve reads about 78% to 80% on the y-axis, so around 80% of these patients have fewer than 5 nodes and a good chance of survival. As the number of axillary nodes increases, the chance of survival decreases, and once the nodes exceed 40, 100% of patients have a low chance of survival.

5.3 Box Plot

A box plot is another way of visualising the 1-D scatter plot more intuitively. The plot below relies on the interquartile range (IQR), which is used to place the whiskers.

The whiskers in the plot don’t necessarily correspond to the minimum and maximum values; by default seaborn draws them at the most extreme points that lie within 1.5 times the IQR of the box.
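The whisker positions can be worked out by hand, as in this sketch, which assumes the common 1.5 × IQR rule. The node counts are made up for illustration.

```python
import numpy as np

# Made-up node counts for illustration (NOT the real data).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12, 46])

q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1                      # interquartile range: height of the box

# Fences: points beyond these are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
lower_whisker = nodes[nodes >= lower_fence].min()
upper_whisker = nodes[nodes <= upper_fence].max()
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```

Here the largest value lies outside the upper fence, so the upper whisker stops below the maximum, which is exactly why whiskers and min/max can differ.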

Output:

Observation:

This is the observation for plot 1, since we are taking the feature axillary nodes forward for the rest of the analysis.

  1. For long survival, the whiskers span roughly 0 to 7. The 25th and 50th percentiles are nearly the same, at 0, and the box up to the 75th percentile spans 0 to 3.
  2. For short survival, the whiskers span 0 to 25. The 25th percentile is 1 or 2, the 50th percentile is about the same as the 75th percentile of long survival, and the 75th percentile is around 12.

5.4 Violin Plot

Violin plot is the combination of a box plot and probability density function.
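A violin plot per class could be drawn with seaborn as sketched here, one violin for each survival status. The data is synthetic and the column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Haberman data (NOT the real values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "age": rng.integers(30, 84, n_rows),
    "year_of_op": rng.integers(58, 70, n_rows),
    "axillary_nodes": rng.poisson(4, n_rows),
    "survival_status": np.where(np.arange(n_rows) % 4 == 0, 2, 1),
})

fig, ax = plt.subplots()
# The width of each violin shows the density; the inner marks show the
# quartiles in the same way a box plot does.
sns.violinplot(data=df, x="survival_status", y="axillary_nodes", ax=ax)
```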

Output:

Observation:

  1. In violin 1, the densest region is at 0, so patients with 0 axillary nodes are more likely to survive. Its whiskers range from 0 to 7.
  2. In violin 2, the whiskers range from 0 to 25 and the densest part lies between about 0 and 12.

6. Multivariate Analysis

6.1 Contour Plot

A contour line of a function of two variables is a curve along which the function has a constant value. It is a cross-section of the three-dimensional graph. In a contour probability density plot, the shading from dark to light blue represents the estimated density of the data points.

Output:

Observation:

Here we can clearly observe that when age is between 44 and 64 and the number of axillary nodes is between 0 and 3, more people survive; this is the density plot for long survival. We can find this combination of features by comparing the overlap in the pair plots.

Conclusion

  1. We can finally conclude that the Haberman data set can be analysed with various techniques, and that, based on these visualisations, the diagnosis of a cancer patient can be informed by analysing the axillary nodes.
  2. Patients younger than 35 have a better chance of survival, but age and operation year are not the only deciding factors.


Written by kartika Panwar

Application Engineer at SLB || Data Science Enthusiast
