Data Mining Process: business understanding, data understanding, data preparation, modeling, evaluation, deployment.

Two types of categorical variables are often distinguished: nominal and ordinal.
Two types of quantitative variables are often distinguished: continuous and discrete.

Rows and columns of a data frame are identified by integer or string labels. The set of row labels is called the index and the set of column labels is called columns.

Imputation is the estimation of missing values with descriptive statistics or predicted values. Simpler methods of imputation that use the feature's mean, median, or mode are valid only if the values are missing at random.
The data_frame.dropna() function removes rows with missing data from a data frame.
The data_frame.fillna(value, method) function replaces each missing value with either a specified value or a value produced by a fill method.

Standardization rescales a feature's values to mean 0 and standard deviation 1 by computing the z-score: (observation - mean) / standard deviation.
Normalization rescales a feature's values to the range [0, 1] by computing: (observation - min) / (max - min).

The goal of supervised learning is to predict a particular feature's value based on other features' values. Examples: k-nearest neighbors (KNN), logistic regression.
Unsupervised learning methods do not attempt to predict an output value but instead detect and identify patterns and relationships in data. Examples: cluster analysis, association rules.
Partitioning is the process of splitting the data into training, validation, and test sets.

Gini index: I(A) = 1 - sum(Pk^2), where Pk is the proportion of observations in class k. The overall Gini index of a split is the average of the partitions' indices, weighted by partition size.
Entropy = -sum(Pk * log2(Pk)).
Information gain is the entropy before splitting on an attribute minus the weighted average entropy of the partitions that the split produces.

Accuracy = (TP + TN) / all observations
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Bootstrapping is the process of generating simulated samples by repeatedly drawing with replacement from an existing sample.
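
Several of the formulas in these notes can be sketched in plain Python using only the standard library. This is a minimal illustration, not a reference implementation: every function name below (impute_mean, standardize, normalize, gini, entropy, weighted, information_gain, classification_metrics, bootstrap_means) is hypothetical, the choice of the population standard deviation for the z-score is an assumption (the notes do not say which one is meant), and missing values are assumed to be represented as None.

```python
import math
import random
import statistics
from collections import Counter


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Z-scores: (observation - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Min-max rescaling to [0, 1]: (observation - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def gini(labels):
    """Gini index: 1 - sum(Pk^2) over the class proportions Pk."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum(Pk * log2(Pk)) over the class proportions Pk."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted(measure, partitions):
    """Average of an impurity measure over partitions, weighted by size."""
    total = sum(len(part) for part in partitions)
    return sum(len(part) / total * measure(part) for part in partitions)


def information_gain(parent, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    return entropy(parent) - weighted(entropy, partitions)


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


def bootstrap_means(sample, n_resamples, seed=0):
    """Means of resamples drawn with replacement, each the size of the sample."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(sample, k=len(sample)))
            for _ in range(n_resamples)]
```

With made-up counts, classification_metrics(tp=40, tn=45, fp=5, fn=10) gives accuracy 0.85, precision of about 0.889, and recall 0.8. The overall Gini index of a split can be obtained as weighted(gini, partitions), mirroring how information_gain uses weighted(entropy, partitions). Note that standardize and normalize divide by zero when all values are equal; a fuller version would handle that case.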