This was our fifth week in CST 383 - Data Science.

Reflection

This week we covered how to process missing data in datasets, how to scale datasets, and the basics of training and testing machine learning algorithms.

Some of the more straightforward concepts this week were those about missing data. Although much of the information itself was new, it corresponded the most with what we've learned about DataFrames and other data types in previous weeks.

I would say that the most difficult concepts covered this week were those about training and testing algorithms. We covered some basic ideas around this in the first weeks of the course, but this was the first time that we covered how this is done in code. Despite the material being difficult, enough was covered in class to greatly help with the homework that covered these concepts. I imagine that we will expand on these concepts in upcoming weeks.

Class Lecture

Missing Data

This week we discussed missing and bad data within data sets. Missing data can be split into two categories:

Easy Case: The missing data is stored in a standard form, such as NAN or None.
Hard Case: The missing data is stored in a non-standard form, such as 0, -1, or "N/A". Despite the name, hard cases are not that bad if the missing data is well-documented.

None: None is a special Python value that represents the absence of a value. It is similar to NULL in other languages. Arithmetic operations involving None raise an error (a TypeError).

NAN: NAN is a special floating point value that represents a value that is not a number. Unlike None, using NAN in operations results in NAN instead of an error. We can use the function math.isnan(variable) to tell if a variable is NAN. When using NumPy, the constant np.nan represents NAN.
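To make the difference between None and NAN concrete, here is a small sketch using only the standard library. The variable names are just for illustration.

```python
import math

# None cannot take part in arithmetic: Python raises a TypeError.
try:
    None + 1
    none_raised = False
except TypeError:
    none_raised = True

# NaN propagates through arithmetic instead of raising an error.
nan_result = float("nan") + 1

# NaN is never equal to itself, so test with math.isnan rather than ==.
nan_equals_itself = nan_result == nan_result

print(none_raised, math.isnan(nan_result), nan_equals_itself)
```

This is why math.isnan (or pd.isna) is the safer check: comparing a value to NAN with == always gives False.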

Pandas NA: Pandas uses the NA value to represent None or NAN. We can use the function pd.isna(value) to tell if something is either None or NAN. We can also use this function on a dataframe df.isna() to return a boolean mask of NA values. NA is a data science term that stands for "Not Available" or "Not Applicable".

Data Type Limitations: Because NumPy and Pandas data structures are of a single data type, None and NAN can only be used in arrays of objects and floats respectively.
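The dtype limitation can be seen directly by building a few small arrays. This is a minimal sketch with made-up values, assuming NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

# None forces NumPy to fall back to a generic object array.
obj_arr = np.array([1, None, 3])

# NaN is itself a float, so the array stays numeric (float64).
float_arr = np.array([1, np.nan, 3])

# In a numeric Series, Pandas converts None to NaN and upcasts to float64.
s = pd.Series([1, None, 3])

print(obj_arr.dtype, float_arr.dtype, s.dtype)
print(pd.isna(s).tolist())
```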

Dealing with Missing Data

There are two common approaches when dealing with missing data. The simplest is to remove the rows/columns containing missing data. We can also try to "impute" missing data (replace it with plausible values).

Continuous Data: Replace missing values with median or mean values.
Categorical or Discrete Data: Replace missing values with the variable's mode, or with a new category (like "NA" or "UNKNOWN").
Advanced: Use machine learning to impute values.

Deleting Missing Data

We can use df.dropna() to remove all rows with any NA values, and df.dropna(axis=1) to remove any columns with NA values.

We can use the thresh parameter to retain columns/rows with a certain number of non-NA values (thresh=6 would retain rows/columns with at least 6 non-NA values).

We can set the how parameter to 'any' or 'all' to remove rows/columns with any missing data or only missing data.

We can also set the inplace parameter to True to modify the data directly instead of returning a copy.
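Here is a short sketch of how these dropna options behave on a tiny, made-up DataFrame (the column names and values are only for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [7.0, 8.0, np.nan, np.nan],
})

# Drop rows containing any NA value: only row 0 is complete.
any_rows = df.dropna()

# Drop rows only when every value is NA: only row 3 is removed.
all_rows = df.dropna(how="all")

# Keep rows with at least 2 non-NA values: rows 0 and 1.
thresh_rows = df.dropna(thresh=2)

# Drop columns containing any NA value: every column has one, so none remain.
no_na_cols = df.dropna(axis=1)

print(any_rows.index.tolist(), all_rows.index.tolist(), thresh_rows.index.tolist())
print(no_na_cols.shape)
```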

Imputing Missing Data

We can use df.fillna(value) to replace NA values. As a rule of thumb, we should use the median to impute quantitative data, and the mode to impute categorical data.
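As a rough sketch of this rule of thumb, we can fill a numeric column with its median and a categorical column with its mode. The DataFrame below is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 90.0],
    "city": ["Salinas", "Marina", None, "Marina"],
})

filled = df.copy()

# Quantitative column: impute with the median (40.0 here).
filled["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode; mode() returns a Series, so take the first entry.
filled["city"] = df["city"].fillna(df["city"].mode()[0])

print(filled)
```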

Scaling Data

We also discussed scaling data, which can help in visualizing and exploring data, and help some machine learning algorithms perform better.

To scale data using Pandas, we can use the method df.apply while passing in a scaling function such as unit_scale or z_scale. These are helper functions that we define ourselves, not Pandas built-ins.

unit_scale: Scales values so they range from zero to one. This is an easy to compute and understand way of scaling data, but is sensitive to min and max values.
z_scale: Scales values to show how many standard deviations they are away from the mean.
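The two scaling functions are not part of Pandas, so the definitions below are my own assumption of what they might look like, based on the descriptions above. The data is made up.

```python
import pandas as pd

def unit_scale(col):
    # Hypothetical helper: maps the column's min to 0 and its max to 1.
    return (col - col.min()) / (col.max() - col.min())

def z_scale(col):
    # Hypothetical helper: number of standard deviations from the mean.
    # Note: Series.std() uses the sample standard deviation (ddof=1).
    return (col - col.mean()) / col.std()

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 60.0, 100.0],
})

# df.apply calls the function once per column.
unit = df.apply(unit_scale)
z = df.apply(z_scale)

print(unit)
print(z)
```

One detail to keep in mind: Scikit-Learn's StandardScaler divides by the population standard deviation, so its results can differ slightly from a z_scale written with Series.std().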

To scale data using Scikit-Learn we can:

from sklearn.preprocessing import StandardScaler
X = df.values
scaler = StandardScaler()
scaler.fit(X)  # finds the mean and std. dev. for each column of X
X_scaled = scaler.transform(X)  # does the scaling

Note: Scikit-Learn is oriented towards arrays, not dataframes.

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a method of imputing a value by comparing the variable to its k nearest neighbors. For example, if k = 1, then we would compare the variable to its nearest neighbor; if k = 3, then we would compare the variable to its three nearest neighbors.

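We can calculate the distance between two variables using the Pythagorean Theorem (the Euclidean distance): sqrt( (var1[0] - var2[0])^2 + ... + (var1[last] - var2[last])^2 ).

Two ways we can calculate a distance in code are:

np.sqrt(np.sum((x - y)**2))  # Euclidean distance (the formula above)
np.sum(np.abs(x - y))  # Manhattan distance (sum of absolute differences)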

When using KNN, we often compare a variable to training data, and impute it based on its nearest neighbors in the training data. We should also scale the data before using KNN, as the result could differ with unscaled data.
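To see what KNN is doing under the hood, here is a minimal from-scratch sketch using NumPy only. The training points, labels, and the new point are invented, and majority voting is assumed for picking the label.

```python
import numpy as np

# Tiny made-up training set: two points per class.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])
x_new = np.array([1.2, 1.4])

# Euclidean distance from x_new to every training row.
dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

# Indices of the k closest training rows.
k = 3
nearest = np.argsort(dists)[:k]

# Majority vote among the neighbors' labels.
pred = np.bincount(y_train[nearest]).argmax()

print(nearest.tolist(), pred)
```

Two of the three nearest neighbors belong to class 0, so the new point is labeled 0.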

Finally, one way we can use KNN is with the following code:

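from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

predictors = ['column1', 'column2']
target = 'column3'
X = df[predictors].values
y = (df[target] == 'Yes').values.astype(int)

# split into training and test data (see Test Sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# scale the data (see Scaling Data), fitting the scaler on the training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# train on the training data, then predict on the test data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)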

Then, we can get the accuracy of our predictions with (y_pred == y_test).mean().

Test Sets

Test/Train Split

A common way to train on data is to split data into a training set and a test set. The class suggests a rule of thumb of splitting data into 70% training data and 30% testing data. Increasing the percentage of training data will generally result in higher test accuracy, while increasing the percentage of testing data will make the measured test accuracy more reliable.

We can get X_test and y_test with:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Cross Validation

Another way to train data is to split data into multiple folds (the lecture suggests 10). Then we choose one fold to test the data and train the data using the rest. We repeat this until every fold has been used for testing, then we take the mean of the accuracy values.
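The fold mechanics can be sketched with plain NumPy. This only shows how the indices are split; the sample count and number of folds are made up, and in practice the data would usually be shuffled first.

```python
import numpy as np

n_samples, n_folds = 20, 5
indices = np.arange(n_samples)

# Split the sample indices into n_folds roughly equal parts.
folds = np.array_split(indices, n_folds)

splits = []
for i, test_idx in enumerate(folds):
    # Every fold except the i-th one is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))

for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Each sample lands in the test set exactly once, which is what lets us average the accuracy over all folds.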

Scoring

Some ways to score a classifier's accuracy include the following:

Train/Test: After splitting the data and training the classifier, we use clf.score(X_test, y_test).
Cross Validation: cross_val_score(clf, X_train, y_train, cv=10).mean(), where cross_val_score is imported from sklearn.model_selection.
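Both scoring approaches can be tried side by side. The sketch below uses synthetic data (two well-separated clusters) that I made up, with KNN as the classifier; the random_state values are only there to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two well-separated clusters, 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=3)

# Train/test scoring: fit on the training set, score on the held-out test set.
clf.fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)

# Cross-validation scoring: one accuracy value per fold, then the mean.
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)

print(test_acc, cv_scores.mean())
```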

Assessing Classifiers

False Positive: When a classifier predicts a positive, but the actual is negative.

False Negative: When a classifier predicts a negative, but the actual is positive.

When assessing classifiers, we assess their precision and recall. We want both of these to be as high as possible.

Precision: What fraction of the positive predictions are correct? We can calculate this by dividing the number of correct positive predictions by the total number of positive predictions, that is, true positives plus false positives (0/3 in this case).
Recall: What fraction of the positive cases are predicted correctly? We can calculate this by dividing the number of correct positive predictions by the total number of actual positive cases, that is, true positives plus false negatives (0/5 in this case).

Using precision and recall, we can calculate a classifier's F1 score (the harmonic mean of precision and recall, which is not the same as its accuracy) with the formula 2*((precision*recall) / (precision+recall)).
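As a worked example, here is a sketch that counts true positives, false positives, and false negatives by hand and plugs them into the formula above, which is the F1 score. The labels are made up.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive

precision = tp / (tp + fp)  # fraction of positive predictions that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found
f1 = 2 * (precision * recall) / (precision + recall)

print(tp, fp, fn, precision, recall, f1)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75.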

We can also use the following methods to assess classifiers' accuracy:

Python Data Science Handbook Ch. 16

One of our textbooks discusses handling missing data.

Approaches in Handling Missing Data

It describes two general strategies for storing missing data.

Masking Approach: We use a boolean mask of the data or append a boolean value in each cell to track missing values. This method requires extra storage.
Sentinel Approach: We use one or more values, such as -9999 or NAN, to indicate missing values. This method reduces the range of valid values and requires extra computation.
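Pandas happens to offer both styles, which makes for a quick comparison. This is only a sketch with made-up values: the default float Series uses NAN as a sentinel, while the nullable "Int64" dtype keeps a separate mask.

```python
import numpy as np
import pandas as pd

# Sentinel approach: NaN marks the missing entry, which forces a float dtype.
sentinel = pd.Series([1, np.nan, 3])

# Masking approach: the nullable "Int64" dtype keeps the integers as integers
# and tracks missing entries in a separate boolean mask.
masked = pd.array([1, None, 3], dtype="Int64")

print(sentinel.dtype, masked.dtype)
print(sentinel.isna().tolist(), masked.isna().tolist())
```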

Hands-On Machine Learning Ch. 2

Look For Correlations

We can get the correlation of a dataframe using the corr function: df.corr().
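A tiny made-up DataFrame shows what df.corr() returns: a square table of correlation coefficients between every pair of numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # rises with x
    "z": [8, 6, 4, 2],   # falls as x rises
})

corr = df.corr()

# Correlations of every column with x, strongest positive first.
x_corr = corr["x"].sort_values(ascending=False)

print(corr)
print(x_corr)
```

Values near 1 mean the columns rise together, and values near -1 mean one falls as the other rises.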

We can also plot the correlations between attributes using the scatter_matrix function:

from pandas.plotting import scatter_matrix
scatter_matrix(housing[attributes], figsize=(12,8))

Clean the Data

Another way to impute data is by using the Scikit-Learn SimpleImputer object.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

We can also use the 'mean', 'most_frequent', or 'constant' strategies; with 'constant', we set the fill_value parameter to the replacement value.

housing_num = housing.select_dtypes(include = [np.number])

The median can only be computed on numerical data, so we need to exclude non-numerical columns before fitting the imputer.

imputer.fit(housing_num)
X = imputer.transform(housing_num)

There are also more powerful imputers available from sklearn.impute, such as KNNImputer and IterativeImputer.
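As a rough sketch of one of these, KNNImputer fills each missing value with the average of that feature over the nearest rows. The array below is made up, and n_neighbors=2 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # second feature is missing
    [3.0, 6.0],
    [10.0, 20.0],
])

# The missing value is replaced by the mean of that feature over the
# 2 nearest rows, measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The two rows closest to the incomplete row are the first and third, so the missing value becomes the mean of 2.0 and 6.0.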

Handling Text and Categorical Attributes

Ordinal Encoding

Most machine learning algorithms prefer to use numbers. We can convert text to numbers using Scikit-Learn's OrdinalEncoder.

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
Then we can use the ordinal_encoder.categories_ attribute to get a list of the category arrays:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]

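One issue with this method is that machine learning algorithms will assume that nearby values are more similar than distant ones, which may not be true of the categories. In the above output, '<1H OCEAN' (index 0) is actually more similar to 'NEAR OCEAN' (index 4) than to 'INLAND' (index 1).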

One Hot Encoding

To fix this problem, we can use one-hot encoding, where each category has a binary attribute that is set to one or zero. We can use one-hot encoding using Scikit-Learn's OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder
hot_encoder = OneHotEncoder()
housing_cat_encoded = hot_encoder.fit_transform(housing_cat)

The OneHotEncoder returns a sparse matrix, which stores only the locations of the non-zero values. We can convert it to a dense array with housing_cat_encoded.toarray(). We can also set sparse_output=False in the OneHotEncoder constructor to get a dense array directly.

We can also use the Pandas function get_dummies to convert a categorical attribute to one-hot encoding. However, using OneHotEncoder is preferred.
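One reason OneHotEncoder is often preferred is that it remembers the categories it was fitted on. The sketch below uses small made-up DataFrames to show the difference when new data contains only some of the categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
new = pd.DataFrame({"ocean_proximity": ["NEAR BAY"]})

# OneHotEncoder learns the categories during fit and reuses them later.
encoder = OneHotEncoder()
train_encoded = encoder.fit_transform(train).toarray()
new_encoded = encoder.transform(new).toarray()  # still has two columns

# get_dummies only sees the categories present in the data it is given.
dummies = pd.get_dummies(new)  # only one column

print(encoder.categories_)
print(new_encoded)
print(list(dummies.columns))
```

Because the encoder keeps one column per category seen during fitting, training and new data always end up with the same layout.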

Feature Scaling and Transformation

Min-Max Scaling

Min-Max scaling, often called normalization, shifts each attribute's values so that they are between zero and one. It does this by subtracting the min value and dividing by the difference between the min and max values.

Scikit-Learn's MinMaxScaler can be used to scale values in this way. We can also set the feature_range attribute in its constructor to change the range.

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
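To check the formula against the library, here is a sketch on a small made-up column. The manual version scales to the default range of zero to one, and the MinMaxScaler version uses feature_range=(-1, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

# Manual min-max scaling to the range [0, 1]: (x - min) / (max - min).
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same idea with Scikit-Learn, stretched to the range [-1, 1].
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(manual.ravel(), scaled.ravel())
```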

Standardization

Standardization instead subtracts the mean value and divides by the standard deviation. Standardization is much less affected by outliers than min-max scaling.

We can use Scikit-Learn's StandardScaler object.

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

Transforming Skewed Data

If data is heavily skewed, then we will want to transform it to make the scaling look better. A good way to do this is to replace the values with their square roots (or raise them to a power between zero and one), or their logs if they are very heavily skewed (power law distribution).

We can also use bucketizing, where we split the distribution into several roughly equally sized buckets and replace each value with the index of the bucket it belongs to. We can also divide each bucket index by the number of buckets.
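Both ideas can be sketched on a made-up, heavily skewed column. Using np.log10 and pd.qcut here is my own choice for the illustration, not necessarily what the textbook uses.

```python
import numpy as np
import pandas as pd

# Made-up, heavily skewed values spanning several orders of magnitude.
population = pd.Series([10, 100, 1_000, 10_000, 100_000, 1_000_000], dtype=float)

# Log transform: compresses the long tail into evenly spaced values.
log_pop = np.log10(population)

# Bucketizing: 3 equal-sized buckets, each value replaced by its bucket index.
buckets = pd.qcut(population, q=3, labels=False)

# Optionally rescale the bucket indices by the number of buckets.
bucket_fraction = buckets / 3

print(log_pop.tolist())
print(buckets.tolist())
```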
