This was our seventh week in CST 383 - Data Science.

Reflection

This week we reviewed ordinal and nominal (dummy variable, or one-hot) encoding. We discussed a couple of ways to do each in code, and when to use one type of encoding over the other. Although we have covered this in previous weeks, it was still helpful to review the topic, as we had to use it during this week's lab.

We also discussed logistic regression, which is similar in concept to linear regression but is designed specifically to predict binary, categorical variables. Although linear regression can be made to do this (for example, if we consider all predictions above 0.5 to be True), logistic regression is generally better at it. Although we didn't have much of a chance to explore the regression model during the lab, it appears to be a powerful tool.

In one of our textbooks, we also covered in more detail how linear and logistic models are trained. We also covered another similar model, the polynomial model. This model is trained like the linear model, but it fits a polynomial curve instead of a straight line. This lets polynomial models more accurately predict data that doesn't follow a linear shape, something that plain linear models cannot do.

In the lab this week we were handed a set of data and asked to preprocess (encode and scale) it, split the data into train and test sets, and create two models to try and predict the test data. Although this assignment seemed daunting at first, it ended up being the most enjoyable assignment so far in the class. I ended up training three different models instead of the required two.

Class Lecture

Types of Variables

We reviewed the different types of variables in data science:

Quantitative: Data that represents a measurement (continuous) or a count (discrete) instead of a category.
Categorical: Data that represents discrete categories. These are further split into nominal (unordered) and ordinal (ordered).

Encoding

Ordinal Encoding

In Ordinal Encoding we map each unique value of a variable to a number, starting from zero.

To ordinally encode Pandas data (without using extra libraries), we can create a dictionary that maps values from the column to numbers, then replace the values with those numbers.

labels = {'val1': 0, 'val2': 1}
df.loc[:, 'column'] = df['column'].replace(labels)

We can also use the replace method on multiple columns at the same time:

df = df.replace({'column1': labels1, 'column2': labels2})
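As a small runnable sketch of the dictionary approach above, here is the same idea on made-up data. The 'size' column and its labels are hypothetical, and I'm using map here instead of replace, which is one of a few ways to do the lookup.

```python
import pandas as pd

# Hypothetical data: 'size' is an ordinal column (made up for illustration).
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Order matters for ordinal data, so number the labels from lowest to highest.
size_labels = {'small': 0, 'medium': 1, 'large': 2}

# map() looks each value up in the dictionary. Values missing from the
# dictionary become NaN, which makes typos in the data easy to spot.
df['size_encoded'] = df['size'].map(size_labels)

print(df)
```

Writing to a new column keeps the original text values around, which is handy for checking that the mapping did what we expected.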

Nominal Encoding

We can encode nominal variables by creating "dummy variables". This creates one new column for each value of the variable, and each new column stores a boolean that is true when the row has that value. This is also known as One Hot Encoding.

We can use the get_dummies function to create dummy variables.

df = pd.get_dummies(df, columns=['sex', 'role'], drop_first=True)
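To see what get_dummies actually produces, here is a minimal sketch on a made-up DataFrame. The column names match the line above, but the data itself is invented for illustration.

```python
import pandas as pd

# Hypothetical data using the same column names as above.
df = pd.DataFrame({
    'sex': ['F', 'M', 'F'],
    'role': ['admin', 'user', 'guest'],
    'age': [34, 27, 45],
})

# drop_first=True drops the first level of each encoded column, since it
# can be inferred from the others (all remaining dummies are false).
encoded = pd.get_dummies(df, columns=['sex', 'role'], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Columns that are not listed in columns= (here 'age') are passed through unchanged.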

One problem with dummy variables is that the number of new columns, and therefore the memory required, balloons with the number of values in a column. There are a couple of ways to mitigate this before creating dummy variables:

Combine rare values into an "other" category. For example, combining the Arctic and Antarctica into "other" since few people live there.
Combine values into larger groups. For example, combining Canada, Mexico, and the US into 'North America'.
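The first option can be sketched in Pandas like this. The data and the min_count cutoff are assumptions I made up for the example, not something from the lecture.

```python
import pandas as pd

# Hypothetical column with a couple of rare values.
regions = pd.Series(['US', 'US', 'Canada', 'US', 'Canada', 'Arctic', 'Antarctica'])

# Count how often each value appears, then pick the ones below a cutoff.
counts = regions.value_counts()
min_count = 2  # assumed cutoff; choose one that suits the data
rare_values = counts[counts < min_count].index

# Keep values that are not rare, and replace the rare ones with 'other'.
regions_grouped = regions.where(~regions.isin(rare_values), 'other')

print(regions_grouped.value_counts())
```

After this step, creating dummy variables yields one column for "other" instead of one column per rare value.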

Which Encoding to use?

We generally want to use Ordinal Encoding for ordinal predictor variables, and Nominal Encoding (DV) for nominal predictor variables and categorical target variables.

Logistic Regression

Logistic Regression is similar to Linear Regression, in that training a Logistic Regression model creates an equation similar to the one in Linear Regression models, but we can use it to predict categorical target values. It can be thought of as predicting the probability of the output being a certain value. Note that the basic form of Logistic Regression covered here only works for binary classification (the target variable has two possible values).
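For reference, the standard form of that equation is shown below. The notation is my own choice rather than the lecture's: b_0 is the intercept, b_1 through b_n are the coefficients, and x_1 through x_n are the predictor values. The linear part is the same as in Linear Regression, and the logistic function squeezes it into a probability between 0 and 1.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_n x_n)}}
```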

We can create and fit a Logistic Regression model with:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)

Assuming that our target variable has two values, we can get predictions of the probabilities with:

y_pred_probs = clf.predict_proba(X_test)

The above code gives us an array with one row per test sample and one column per class, holding the predicted probability of each class:

Array generated from the above code

Then we can use (y_pred_probs[:, 1] > 0.5).astype(int) to turn the probabilities into 0/1 predictions, or we can use clf.predict(X_test) to get the predicted class labels directly.
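Putting the pieces above together, here is a minimal end-to-end sketch. The data comes from scikit-learn's make_classification, which is a stand-in I chose so the example runs on its own; it is not the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for real data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# One row per test sample; column 0 is P(class 0), column 1 is P(class 1).
y_pred_probs = clf.predict_proba(X_test)

# Threshold the class-1 probability at 0.5 to get 0/1 predictions by hand.
y_pred_manual = (y_pred_probs[:, 1] > 0.5).astype(int)

# Or let the model produce the class labels directly.
y_pred = clf.predict(X_test)

print(y_pred_probs.shape)
print((y_pred_manual == y_pred).mean())  # fraction of predictions that agree
```

Each row of y_pred_probs sums to 1, so for a binary target the second column carries all the information we need.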

Maximum Likelihood

We use a concept called Maximum Likelihood to compare Logistic Regression models: the model that assigns the higher probability to the data we actually observed is the one that explains the data better.

To compute the likelihood by hand, for each model we take the probability that the model assigned to each outcome that actually happened, then multiply them together. Whichever model has the higher result has the higher likelihood.
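The by-hand calculation can be sketched in a few lines of plain Python. The outcomes and the two sets of predicted probabilities are made up for illustration.

```python
# Observed outcomes (1 = event happened, 0 = it did not); made-up example.
y_true = [1, 0, 1, 1]

# Predicted probability of class 1 from two hypothetical models.
model_a_probs = [0.9, 0.2, 0.8, 0.7]
model_b_probs = [0.6, 0.5, 0.5, 0.6]


def likelihood(y_true, probs):
    """Multiply the probabilities the model gave to what actually happened."""
    result = 1.0
    for y, p in zip(y_true, probs):
        # If the outcome was 1 use p; if it was 0 use (1 - p).
        result *= p if y == 1 else (1 - p)
    return result


like_a = likelihood(y_true, model_a_probs)
like_b = likelihood(y_true, model_b_probs)
print(like_a, like_b)
```

Here model A comes out ahead (0.9 * 0.8 * 0.8 * 0.7 versus 0.6 * 0.5 * 0.5 * 0.6), because it was more confident about the outcomes that actually occurred. With many rows the product gets very small, so in practice the logs of the probabilities are usually summed instead.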

Similarities with Linear Regression

Like Linear Regression:

Making predictions is fast
Training data does not have to be scaled
Total number of model parameters equals the number of predictor variables plus one
No hyperparameters to tune

Unlike Linear Regression:

Predicts probabilities
Used for classification problems
Trained using the MLE principle

Overfitting

One problem with machine learning is that ML algorithms are trained to do the best on the training data, but we want them to do the best possible job on data that either isn't used to train the model (test data) or data we don't have yet (future data). We can try to solve this by capturing the rough shape of the training data instead of its fine detail.

Bias: When a model has bias, it consistently makes prediction errors across multiple subsets of training data. Example: Using Linear Regression when a dataset's shape is not linear.

Bias in subsets of training data. The blue line shows predictions.

Variance: A model has high variance when it is very sensitive to the training data. Higher variance leads to lower prediction errors on average, but the predicted values vary highly between training sets. Example: Using KNN with a small k value.

Variance in subsets of training data. The blue line shows predictions.

"Irreducible Error": Random noise in the data that we cannot do anything about.

Overfitting: When a model's variance is too high, and thus cannot 'generalize' well between datasets.
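As a rough illustration of the KNN example above, the sketch below compares a very small k with a larger one on synthetic, noisy data. The dataset, the amount of label noise, and the choice of k values are all assumptions of mine for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some label noise (flip_y) so there is something to overfit.
X, y = make_classification(
    n_samples=300, n_features=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each k.
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(k, scores[k])
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the labels are noisy. Comparing that with the test accuracy gives a feel for how much the model has memorized instead of generalized.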

We usually cannot compute bias and variance from the data directly, since we would need to know the relationship between the predictors and the target. Instead, we can see what happens if we change the size of the training set. For example, we can take 10% of the training data, compute the training and test (or CV) error rate, plot the two points, then repeat with a larger training subset until we do this with 100% of the training data.

Error rate on test and training data with different sized subsets of the training data.

Error on test data: As the number of training examples gets larger, the model tends to generalize better.

Error on training data: With fewer examples, it is easier to get a good fit on the training data.
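The procedure of growing the training subset from 10% to 100% can be sketched as a simple loop. The synthetic data and the choice of LogisticRegression as the model are assumptions for the example; any classifier with fit and score would work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (a stand-in for a real dataset).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

fractions = np.linspace(0.1, 1.0, 10)  # 10%, 20%, ..., 100%
train_errors, test_errors = [], []

for frac in fractions:
    n = int(frac * len(X_train))  # size of this training subset
    X_sub, y_sub = X_train[:n], y_train[:n]
    model = LogisticRegression().fit(X_sub, y_sub)
    # Error rate = 1 - accuracy.
    train_errors.append(1 - model.score(X_sub, y_sub))
    test_errors.append(1 - model.score(X_test, y_test))

for frac, tr, te in zip(fractions, train_errors, test_errors):
    print(f"{frac:.0%} of training data: train error {tr:.3f}, test error {te:.3f}")
```

Plotting train_errors and test_errors against the subset size gives the two curves described above. scikit-learn also has a learning_curve function in sklearn.model_selection that does a cross-validated version of this.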

Error rate on test and training data with high variance model.

The above model is too flexible, as it leaves a large gap between the two curves. We can solve this by getting more training data, or by using a less flexible model.

Error rate on test and training data with low variance model.

The above model is too inflexible, so a larger dataset won't help. We could use a more flexible model, or try adding more and better data features.

Hands-On Machine Learning Ch. 4

Polynomial Regression

As we've seen with some of the above examples, we cannot use a linear equation to accurately predict non-linear datasets. To do this we can use Polynomial Regression by preprocessing the dataset using Scikit-Learn's PolynomialFeatures before training the Linear Regression Model.

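from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
# then we train the Linear Regression model using X_poly in place of X
# (y is assumed to hold the target values)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)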
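Here is a self-contained sketch of the whole process on synthetic data. The quadratic relationship and noise level are made up so that we know what coefficients the model should recover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with a little noise (made up for illustration):
# y = 0.5 * x^2 + x + 2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

# Add x^2 as a second feature next to x.
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# An ordinary linear model trained on [x, x^2] fits a curve in terms of x.
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)

# New data must go through the same transformation before predicting.
X_new = np.array([[1.5]])
print(lin_reg.predict(poly_features.transform(X_new)))
```

The fitted intercept and coefficients should land close to 2, 1, and 0.5, the values used to generate the data.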
