CST 383 Week 7

 This was our seventh week in CST 383 - Data Science.

Reflection

This week we reviewed ordinal and nominal (dummy variables or one hot) encoding. We discussed a couple of ways to do each in code, and when we want to use one type of encoding over the other. Although we have covered this in previous weeks, it was still helpful to review the topic, as we had to use it during this week's lab.

We also discussed logistic regression, which is similar in concept to linear regression but is designed specifically to predict binary, categorical variables. Although linear regression can be made to do this (for example, if we consider all predictions above 0.5 to be True), logistic regression is generally better at it. Although we didn't have much of a chance to explore the regression model during the lab, it appears to be a powerful tool.

In one of our textbooks, we also covered in more detail how linear and logistic models are trained. We also covered another similar model, the polynomial model. It is trained the same way as a linear model, but powers of the predictors are added as extra features, so the fitted curve is no longer a straight line. This lets polynomial models more accurately predict data that doesn't follow a linear shape, something a plain linear model cannot do.

In the lab this week we were handed a set of data and asked to preprocess (encode and scale) it, split the data into train and test sets, and create two models to try to predict the test data. Although this assignment seemed daunting at first, it ended up being the most enjoyable assignment so far in the class. I ended up training three different models instead of the required two.

Class Lecture

Types of Variables

We reviewed the different types of variables in data science:

  • Quantitative: Data that represents a measurement (continuous) or a count (discrete) instead of a category.
  • Categorical: Data that represents discrete categories. Is further split into nominal (unordered) and ordinal (ordered).
 

Encoding 

Ordinal Encoding

In Ordinal Encoding we map each unique value of a variable to a number, starting from zero.

To ordinally encode Pandas data (without using extra libraries), we can create a dictionary that maps values from the column to numbers, then replace the values with those numbers.

  • labels = {'val1': 0, 'val2': 1}
  • df.loc[:, 'column'] = df['column'].replace(labels)

We can also use the replace method on multiple columns at the same time:

  • df = df.replace({'column1': labels1, 'column2': labels2})
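As a minimal sketch of the idea above, here is a small made-up example. The 'size' column and its labels are hypothetical, and I use Series.map here, which does the same dictionary lookup as replace when every value appears in the dictionary.

```python
import pandas as pd

# Hypothetical data: 'size' is an ordinal column with a natural order.
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Map each category to an integer that preserves the order.
size_labels = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_labels)

print(df)
```

Note that map leaves a missing value (NaN) for any category that is not in the dictionary, so it is worth checking that the dictionary covers every value in the column.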

Nominal Encoding

We can encode nominal variables by creating "dummy variables". This creates one new column per value of the original variable, and each new column stores a boolean that is true when the row has that value. This is also known as One Hot Encoding.

We can use the get_dummies function to create dummy variables.

  •  df = pd.get_dummies(df, columns=['sex', 'role'], drop_first=True)
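To see what get_dummies actually produces, here is a small sketch on a made-up DataFrame (the rows are hypothetical; only the column names 'sex' and 'role' come from the line above).

```python
import pandas as pd

# Hypothetical data with two nominal columns.
df = pd.DataFrame({
    'sex': ['F', 'M', 'F'],
    'role': ['admin', 'user', 'guest'],
})

# drop_first=True drops the first category of each column,
# since it can be inferred when all the other dummies are false.
encoded = pd.get_dummies(df, columns=['sex', 'role'], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

With drop_first=True, a column with k distinct values becomes k - 1 dummy columns instead of k.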

One problem with dummy variables is that the number of new columns, and therefore the memory required, balloons with the number of values in a column. There are a couple of ways to mitigate this before creating dummy variables:

  • Combine rare values into an "other" category. For example, combining the Arctic and Antarctica into "other" since few people live there.
  • Combine values into larger groups. For example, combining Canada, Mexico, and the US into 'North America'.
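One possible way to do the first option in Pandas is sketched below. The data and the min_count threshold are made up for illustration; the right cutoff depends on the dataset.

```python
import pandas as pd

# Hypothetical nominal column with a few rare values.
regions = pd.Series(['US', 'US', 'Canada', 'Mexico', 'US', 'Antarctica', 'Canada'])

# Count how often each value appears.
counts = regions.value_counts()

# Assumed threshold: values seen fewer than min_count times are "rare".
min_count = 2
rare = counts[counts < min_count].index

# Keep common values as they are and replace rare ones with 'other'.
grouped = regions.where(~regions.isin(rare), 'other')

print(grouped.value_counts())
```

After this step the column has three distinct values instead of four, so get_dummies would create fewer columns.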

Which Encoding to use?

We generally want to use Ordinal Encoding for ordinal predictor variables, and Nominal Encoding (dummy variables) for nominal predictor variables and categorical target variables.

Logistic Regression

Logistic Regression is similar to Linear Regression in that training it also produces a linear equation of the predictors, but the output is used to predict a categorical target value. It can be thought of as predicting the probability of the output being a certain value. Note that Logistic Regression, in its basic form, only works for binary classification (the target variable has two possible values).

We can create and fit a Logistic Regression model with: 

  • from sklearn.linear_model import LogisticRegression
  • clf = LogisticRegression()
  • clf.fit(X_train, y_train) 

Assuming that our target variable has two values, we can get predictions of the probabilities with:

  • y_pred_probs = clf.predict_proba(X_test)

The above code returns an array with one row per test sample and two columns: the predicted probability of the first class and the predicted probability of the second class.

Then we can use (y_pred_probs[:, 1] > 0.5).astype(int) to turn the values into integers, or we can use clf.predict(X_test) to instead get an array of integers.
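Putting the steps in this section together, here is a minimal sketch. It uses synthetic data from scikit-learn's make_classification as a stand-in, since the lab dataset isn't shown here; the sample sizes and random_state values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for a real dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create and fit the model.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# One row per test sample; column 0 is P(class 0), column 1 is P(class 1).
y_pred_probs = clf.predict_proba(X_test)

# Two ways to get 0/1 predictions.
y_pred_manual = (y_pred_probs[:, 1] > 0.5).astype(int)
y_pred = clf.predict(X_test)

print(y_pred_probs[:5])
print(y_pred[:5])
```

Thresholding the second column at 0.5 should give the same labels as clf.predict for a two-class problem like this one.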

Maximum Likelihood

We use a concept called Maximum Likelihood to determine which Logistic Regression model is better: the model under which the observed data is more probable is the one that explains the data better.

To test Maximum Likelihood by hand, for each model we take the probability that the model assigns to each outcome that actually happened, then multiply those probabilities together. Whichever model has the higher result is considered to have the higher likelihood.
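The by-hand calculation can be sketched in a few lines. The outcomes and the two sets of predicted probabilities below are made up for illustration, and likelihood is a helper name I chose, not a library function.

```python
import numpy as np

# Hypothetical observed outcomes (1 = positive class).
y_true = np.array([1, 0, 1, 1])

# Hypothetical predicted P(y = 1) from two candidate models.
probs_a = np.array([0.9, 0.2, 0.8, 0.7])
probs_b = np.array([0.6, 0.5, 0.6, 0.5])

def likelihood(y, p):
    """Multiply the probabilities each model gave to what actually happened."""
    # If the outcome was 1 the model's probability for it is p, otherwise 1 - p.
    per_row = np.where(y == 1, p, 1 - p)
    return per_row.prod()

like_a = likelihood(y_true, probs_a)
like_b = likelihood(y_true, probs_b)

print(like_a, like_b)
print('Model A' if like_a > like_b else 'Model B', 'has the higher likelihood')
```

In practice the product gets very small with many rows, so the logs of the probabilities are usually summed instead; the model ranking stays the same.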

 
 

Comparison with Linear Regression

Like Linear Regression:

  • Making predictions is fast
  • Training data does not have to be scaled
  • Total number of model parameters equals the number of predictor variables plus one
  • No hyperparameters to tune

Unlike Linear Regression:

  • Predicts probabilities
  • Used for classification problems
  • Trained using the MLE principle

Overfitting

One problem with machine learning is that ML algorithms are trained to do the best on the training data, but we want them to do the best possible job on data that either isn't used to train the model (test data) or data we don't have yet (future data). We can try to solve this by capturing the rough shape of the training data instead of its fine detail.

Bias: When a model has bias, it consistently makes prediction errors across multiple subsets of training data. Example: Using Linear Regression when a dataset's shape is not linear.

Bias in subsets of training data. The blue line shows predictions.

Variance: A model has high variance when it is very sensitive to the training data. A high-variance model tends to have lower prediction errors on average, but its predicted values vary widely between training sets. Example: Using KNN with a small k value.

Variance in subsets of training data. The blue line shows predictions.

"Irreducible Error": Random noise in the data that we cannot do anything about.

Overfitting: When a model's variance is too high, so it cannot 'generalize' well from the training data to other datasets.

We usually cannot compute bias and variance from the data directly, since we would need to know the relationship between the predictors and the target. Instead, we can see what happens if we change the size of the training set. For example, we can take 10% of the training data, compute the training and test (or CV) error rate, plot the two points, then repeat with a larger training subset until we do this with 100% of the training data.
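The procedure above can be sketched as a loop. This is only an illustration under my own assumptions: synthetic data from make_classification, a Logistic Regression model, and ten evenly spaced training fractions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train on 10%, 20%, ..., 100% of the training data.
fractions = np.linspace(0.1, 1.0, 10)
train_errors, test_errors = [], []

for frac in fractions:
    n = int(round(frac * len(X_train)))
    clf = LogisticRegression().fit(X_train[:n], y_train[:n])
    # Error rate = 1 - accuracy.
    train_errors.append(1 - clf.score(X_train[:n], y_train[:n]))
    test_errors.append(1 - clf.score(X_test, y_test))

for frac, tr, te in zip(fractions, train_errors, test_errors):
    print(f'{frac:.0%} of training data: train error {tr:.3f}, test error {te:.3f}')
```

Plotting train_errors and test_errors against the number of training examples gives the two curves discussed in this section.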

Error rate on test and training data with different sized subsets of the training data.

Error on test data: As the number of training examples gets larger, the model tends to generalize better.

Error on training data: With fewer examples, it is easier to get a good fit on the training data.

 

Error rate on test and training data with high variance model.

The above model is too flexible, as it leaves a large gap between the two curves. We can solve this by getting more training data, or by using a less flexible model.

 

Error rate on test and training data with low variance model.

The above model is too inflexible, so a larger dataset won't help. We could use a more flexible model, or more and better data features might help.

Hands-On Machine Learning Ch. 4

Polynomial Regression

As we've seen with some of the above examples, we cannot use a linear equation to accurately predict non-linear datasets. To do this we can use Polynomial Regression by preprocessing the dataset using Scikit-Learn's PolynomialFeatures before training the Linear Regression Model.

  • from sklearn.preprocessing import PolynomialFeatures
  • poly_features = PolynomialFeatures(degree=2, include_bias=False)
  • X_poly = poly_features.fit_transform(X)
  • then we train the Linear Regression model using X_poly in place of X
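Here is a minimal sketch of those steps from start to finish. The data is synthetic: I generate points from a quadratic equation of my own choosing with a little noise, so that there is a known curve to recover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = 0.5 * x^2 + x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

# Add x^2 as an extra feature; X_poly has columns [x, x^2].
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Train an ordinary Linear Regression model on the expanded features.
lin_reg = LinearRegression().fit(X_poly, y)

print(lin_reg.intercept_, lin_reg.coef_)
```

The fitted intercept and coefficients should come out close to the 2, 1, and 0.5 used to generate the data. To predict on new data, the same poly_features.transform has to be applied to it first.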
 
