______________________________________________________________

 

Abstract: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. In this paper we make a predictive analysis of what sorts of people were likely to survive, using machine learning tools to predict with accuracy which passengers survived the tragedy.


 

Index Terms – Machine learning.

________________________________________________________________________________________________________

   
1. Introduction

 

 

           

Machine learning means the application of any computer-enabled algorithm that can be applied against a data set to find a pattern in the data. This encompasses essentially all types of data science algorithms: supervised, unsupervised, segmentation, classification, or regression. A few important areas where machine learning can be applied are:

Handwriting recognition: convert written letters into digital letters
Language translation: translate spoken and/or written languages (e.g. Google Translate)
Speech recognition: convert voice snippets to text (e.g. Siri, Cortana, and Alexa)
Image classification: label images with appropriate categories (e.g. Google Photos)
Autonomous driving: enable cars to drive themselves (e.g. NVIDIA and Google Car)

 

 

Features are the observations that a machine learning algorithm uses to form predictions:

 

For image classification, the pixels
are the features
For voice recognition, the pitch and
volume of the sound samples are the features
For autonomous cars, data from the
cameras, range sensors, and GPS are features

 

Extracting relevant features is important for building a model. For example, the source of a mail is an irrelevant feature when classifying images, but it is relevant when classifying emails, because spam often originates from reported sources.

2. Literature Survey

Every machine learning algorithm works best under a given set of conditions; making sure your algorithm fits its assumptions and requirements ensures superior performance. You cannot use just any algorithm in any condition. For a binary classification problem such as this one, you should try algorithms such as logistic regression, decision trees, SVM, random forest, etc.

Why Logistic Regression?

Logistic regression is used to model the probability of an event occurring depending on the values of the independent variables, which can be categorical or numerical; to estimate the probability that an event occurs for a randomly selected observation versus the probability that it does not occur; to predict the effect of a series of variables on a binary response variable; and to classify observations by estimating the probability that an observation falls in a particular category. A minimal sketch is shown below.
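As a rough illustration (the paper does not show its code; the column names and file path here are assumptions, not taken from the paper), such a model can be fitted with scikit-learn:

# A minimal sketch: logistic regression on Titanic-style data. Column names
# such as Pclass, Sex, Age, Fare, and Survived are assumed for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")                        # assumed path to the raw data
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # encode the categorical feature
df["Age"] = df["Age"].fillna(df["Age"].median())     # impute missing ages

X = df[["Pclass", "Sex", "Age", "Fare"]]             # independent variables
y = df["Survived"]                                   # binary response variable

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

probs = model.predict_proba(X)[:, 1]                 # estimated survival probabilities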

 

Performance of the logistic regression model:

AIC (Akaike Information Criterion) – The analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.

Null Deviance and Residual Deviance – The null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. The residual deviance indicates the response predicted by a model on adding independent variables; again, the lower the value, the better the model.

Confusion Matrix – A tabular representation of actual vs. predicted values. It helps us find the accuracy of the model and avoid overfitting.

McFadden R² – Called a pseudo R². When analyzing data with a logistic regression, an equivalent statistic to R-squared does not exist. However, several pseudo R-squareds have been developed to evaluate the goodness of fit of logistic models.
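These statistics can be read from a fitted model. A sketch using statsmodels (one possible tool, not necessarily the paper's), with X and y as in the earlier sketch:

# Sketch: reading the fit statistics described above from a statsmodels fit.
import statsmodels.api as sm

result = sm.Logit(y, sm.add_constant(X)).fit()

print("AIC:", result.aic)                       # lower is better
print("Null deviance:", -2 * result.llnull)     # intercept-only model
print("Residual deviance:", -2 * result.llf)    # after adding predictors
print("McFadden pseudo R2:", result.prsquared)  # 1 - llf / llnull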

 

accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)
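Written out as a small helper function (the example call uses made-up counts):

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=50, tn=40, fp=5, fn=5))  # 0.9 on made-up counts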

 

 

Decision Trees

 

A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) classes. The attributes of the classes can be any type of variable, from binary, nominal, and ordinal to quantitative values, while the classes must be of a qualitative type (categorical, binary, or ordinal). In short, given data of attributes together with its classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class.

One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node. With each successive division, the members of the resulting sets become more and more similar to each other.

 

Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning; a minimal sketch follows below.
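Assuming the X and y from the logistic regression sketch, a shallow scikit-learn tree makes the learned rule hierarchy visible:

# Sketch: recursive partitioning with a decision tree classifier.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # shallow tree keeps the rules readable
tree.fit(X, y)

# Print the hierarchy of decision rules (the "series of questions")
print(export_text(tree, feature_names=list(X.columns)))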

Decision tree applications:

predicting whether tumor cells are benign or malignant
classifying credit card transactions as legitimate or fraudulent
classifying buyers from non-buyers
deciding whether or not to approve a loan
diagnosing various diseases based on symptoms and profiles

 

3. Methodology:

 

 

Our approach to solving the problem:

1. Collect the raw data needed to solve the problem.
2. Import the dataset into the working environment.
3. Preprocess the data, which includes data wrangling and feature engineering.
4. Explore the data and prepare a model for performing analysis using machine learning algorithms.
5. Evaluate the model and iterate until we get satisfactory model performance.
6. Compare the results and select the model which gives the most accurate result.

 

 

The data we collected is still raw data, which is very likely to contain mistakes, missing values, and corrupt values. Before drawing any conclusions from the data we need to do some data preprocessing, which involves data wrangling and feature engineering.

Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.

Feature engineering attempts to create additional relevant features from the existing raw features in the data, to increase the predictive power of the learning algorithms. A brief sketch of both steps is given below.
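(Column names such as Embarked, SibSp, and Parch, and the derived FamilySize and IsAlone features, are illustrative assumptions rather than the paper's actual choices.)

# Sketch of the wrangling and feature-engineering steps described above.
import pandas as pd

df = pd.read_csv("train.csv")  # assumed path to the raw data

# Data wrangling: clean up missing values
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Feature engineering: derive new features from the existing raw ones
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1     # size of the travelling party
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)  # flag solo travellers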

 

 

 

4. Experimental Analysis and Discussion

 

  a) Data set description:

The original data has been split into two groups: a training dataset (70%) and a test dataset (30%). The training set should be used to build your machine learning models.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger; it is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic. A sketch of such a split is shown below.
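(A hedged sketch; the paper does not show its splitting code.)

# Sketch: holding out 30% of the data for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)  # 70% train, 30% test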

 

b) Measures

The models are compared by goodness of fit, using the confusion matrix and the accuracy measure defined above.

c) Results

After training with the algorithms, we validate the trained algorithms against the test data set and measure each algorithm's performance by goodness of fit, using the confusion matrix. 70% of the data is used as the training data set and 30% as the test data set.

Confusion matrix for decision tree:

Training data set:

                 Reference
  Prediction      0      1
           0    395     71
           1     45    203

Test data set:

                 Reference
  Prediction      0      1
           0     97     20
           1     12     48
Confusion matrix for logistic regression:

Training data set:

                 Reference
  Prediction      0      1
           0    395     12
           1     21    204

Test data set:

                 Reference
  Prediction      0      1
           0     97     12
           1     21     47
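As a check, applying the accuracy formula above to the decision-tree training table gives (395 + 203) / (395 + 71 + 45 + 203) ≈ 0.837, and applying it to the logistic-regression test table gives (97 + 47) / (97 + 12 + 21 + 47) ≈ 0.813. Such tables can be produced as a sketch with scikit-learn, reusing the assumed split and fitted tree from earlier (note that scikit-learn places actual values on the rows and predictions on the columns, the transpose of the layout above):

# Sketch: confusion matrix and accuracy on the held-out test set.
from sklearn.metrics import confusion_matrix, accuracy_score

pred = tree.predict(X_test)
print(confusion_matrix(y_test, pred))   # rows: actual, columns: predicted
print(accuracy_score(y_test, pred))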

 

d) Enhancements and reasoning

Predicting the survival rate with other machine learning algorithms, such as random forests and various support vector machines, may improve the accuracy of prediction for the given data set; a brief sketch follows.
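(A hedged sketch of those enhancements, reusing the assumed train/test split from earlier; the hyperparameters are illustrative.)

# Sketch: trying random forest and SVM classifiers on the same split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              SVC(kernel="rbf")):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, pred))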

 

 

5. Conclusion:

The analyses revealed interesting patterns across individual-level features. Factors such as socioeconomic status, social norms, and family composition appeared to have an impact on the likelihood of survival. These conclusions, however, were derived from findings in the data. Many stories and oral histories have been collected by both survivors and relatives of the passengers in the past century, and these qualitative data sets may help to elucidate what really happened that fateful night. The accuracy of predicting the survival rate using the decision tree algorithm (83.7%) is high when compared with logistic regression (81.3%) for the given data set.