CST 383 Week 6

This was our sixth week in CST 383 - Data Science. 

Reflection

This week we continued to focus on hyperparameters, which we began discussing last week. We also covered KNN and linear regression models, and how we can assess the accuracy of regression models using MSE, RMSE, and MAE. Going through the lecture, the concepts seemed relatively straightforward. I didn't feel confused about any of them going into the homework.

This feeling continued into the homework. While the portions of the homework that covered this week's concepts weren't necessarily easy, none of them were problematic. The one homework topic that did give me problems was one-hot encoding. We learned about encoding last week, but last week's homework didn't really include it. This week's homework did, and it proved to be the most challenging part by far. It felt like either the reading material wasn't fully explaining how to use the OneHotEncoder object properly, or that I simply wasn't understanding it. I eventually got it to work, although I ended up writing much more logic for it than for the other problems.

Class Lecture

Hyperparameter Tuning

Hyperparameters are extra parameters that we set before training, rather than values the model learns from the data. They control things like how many neighbors KNN compares against, and choosing them well is important for building an accurate model.

Strategy 1: Test all Combinations

One way we can find the best combination of hyperparameters is to simply test all possible combinations. We can do this with for loops, for example:

  • import pandas as pd
  • from sklearn.model_selection import cross_val_score
  • from sklearn.neighbors import KNeighborsClassifier
  • results = []
  • for k in range(1, 20):
    • for dist_fun in [1, 2]: # p=1 is Manhattan distance, p=2 is Euclidean distance
      • clf = KNeighborsClassifier(n_neighbors=k, p=dist_fun)
      • cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean() # mean accuracy over 10 folds
      • results.append([k, dist_fun, cv_acc])
  • df_results = pd.DataFrame(results, columns=['k', 'dist_fun', 'cv_acc'])

This method of testing hyperparameter combinations is also called Grid Search. The Scikit-Learn library includes the GridSearchCV object, which tests each combination of hyperparameter values like the code above:

  • from sklearn.model_selection import GridSearchCV
  • from sklearn.neighbors import KNeighborsClassifier 
  • grid =  {'n_neighbors':range(1, 20), 'p':[1, 2]}
  • knn_cv = GridSearchCV(KNeighborsClassifier(), grid)
  • knn_cv.fit(X_train, y_train)
  • knn_cv.best_params_ # To get the best scoring parameters
  • knn_cv.best_score_ # To get the score of the best parameters 

One problem with testing all combinations is that the number of possible combinations grows rapidly with the number of hyperparameters and values. This can make finding the optimal combination slow. Furthermore, many combinations have similar results, so we end up calculating the accuracy of combinations that we do not need to.
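To get a feel for how quickly the grid grows, here is a small sketch that just counts combinations. The 'weights' hyperparameter is my own addition for illustration (it wasn't part of the lecture grid), and the cv=10 fold count is assumed to match the earlier examples.

```python
from itertools import product

# Hypothetical search space: the two KNN hyperparameters used above,
# plus a third one ('weights') added only to show the growth.
grid = {
    'n_neighbors': list(range(1, 20)),   # 19 values
    'p': [1, 2],                         # 2 values
    'weights': ['uniform', 'distance'],  # 2 values
}

# Every combination is one element of the Cartesian product of the value lists.
combinations = list(product(*grid.values()))
num_combinations = len(combinations)     # 19 * 2 * 2

# With 10-fold cross validation, each combination is fit 10 times.
num_fits = num_combinations * 10

print(num_combinations, num_fits)
```

Adding one more hyperparameter with only two values doubles the work, which is why grid search gets slow so fast.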

Strategy 2: Test Random Combinations 

Another strategy, which avoids the problems of strategy one, is to test random combinations. This strategy is also called Random Search.

  • import numpy as np
  • import pandas as pd
  • from sklearn.model_selection import cross_val_score
  • from sklearn.neighbors import KNeighborsClassifier
  • num_trials = 10
  • results = []
  • for trial in range(num_trials):
    • k = np.random.choice([3, 5, 9, 13, 17, 21, 27, 35, 49]) # pick a random number of neighbors
    • dist_fun = np.random.choice([1, 2]) # pick a random distance function
    • clf = KNeighborsClassifier(n_neighbors=k, p=dist_fun)
    • cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    • results.append([k, dist_fun, cv_acc])
  • df_results = pd.DataFrame(results, columns=['k', 'dist_fun', 'cv_acc'])

Similar to the GridSearchCV object from strategy one, we can use the RandomizedSearchCV object instead of the code above.

  • from sklearn.model_selection import RandomizedSearchCV
  • from sklearn.neighbors import KNeighborsClassifier
  • grid = {'p':[1, 2], 'n_neighbors':[3, 5, 9, 13, 17, 21, 27, 35, 49]}
  • knnCV = RandomizedSearchCV(KNeighborsClassifier(), grid, cv=10, n_iter=10)
  • knnCV.fit(X_train, y_train)
  • knnCV.best_params_ # To get the best scoring parameters
  • knnCV.best_score_ # To get the score of the best parameters 

While this strategy takes less time than strategy one, it only tries a sample of the combinations, so it can miss the best one. If we have time then we should use grid search, otherwise we should use random search.

Hyperparameters in Machine Learning

We can summarize machine learning in two steps. In step one, we pass in the hyperparameters and training data to train the model. In step two, we use the model to predict the outcomes for new data.

Best Practices

  1. Perform a train/test split on the data.
  2. Use cross validation with the training data to compare models/hyperparameters.
  3. Train the best model/hyperparameter combination on the full training set.
  4. Compute the test score using the test set. 
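The four steps above can be sketched end to end like this. The iris dataset, the 25% test size, and the random_state are stand-ins I picked so the example runs on its own; they are not from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset, not the class data

# 1. Perform a train/test split on the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Use cross validation on the training data only to compare hyperparameters.
grid = {'n_neighbors': list(range(1, 20)), 'p': [1, 2]}
knn_cv = GridSearchCV(KNeighborsClassifier(), grid, cv=10)
knn_cv.fit(X_train, y_train)

# 3. With the default refit=True, GridSearchCV retrains the best
#    combination on the full training set once the search is done.
best_model = knn_cv.best_estimator_

# 4. Compute the test score; the test set is touched only here.
test_acc = best_model.score(X_test, y_test)
print(knn_cv.best_params_, test_acc)
```

The point of the ordering is that the test set never influences which hyperparameters get picked.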

KNN - Regression

With machine learning (more specifically Supervised Learning, according to our class lecture), we can try to predict the category a data point belongs to, such as whether a college is public or private, or a quantity, such as a college's tuition. Predicting categories is called classification, and predicting quantities is called regression.

When doing KNN Regression, we take the average value of the nearest neighbors to predict the value, instead of the mode like in KNN Classification. 
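Here is a minimal by-hand sketch of that averaging step, assuming Euclidean distance and a made-up toy dataset. The knn_predict helper is a hypothetical name of mine, not a Scikit-Learn function.

```python
import numpy as np

# Toy training data (made up): one predictor, one numeric target.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

def knn_predict(x_new, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Regression: take the mean (classification would take the mode)
    return y_train[nearest].mean()

pred = knn_predict(np.array([2.5]))
print(pred)  # neighbors are the points at 1, 2 and 3, so the mean is 20.0
```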

We can use the KNeighborsRegressor object to perform KNN Regression.

  • from sklearn.neighbors import KNeighborsRegressor
  • knn = KNeighborsRegressor()
  • knn.fit(X_train, y_train)
  • y_pred = knn.predict(X_test)
  • MSE = ((y_pred - y_test)**2).mean() # Mean Squared Error, lower is better

Linear Regression

With linear regression, we try to come up with a linear equation that predicts the target values as accurately as possible. The image below represents what this might look like on a graph. Each line is a different linear equation, and the line that is closest to all of the points on average (the one with the lowest Mean Squared Error (MSE)) is the most accurate equation.

These equations only have one predictor variable (displacement), but we can use more than one predictor variable.

The machine learning concept is the same with linear regression as with other types of regression or classification. We train a model to find the best parameters/coefficients for the function. We can use the LinearRegression object to do this.

  • from sklearn.linear_model import LinearRegression
  • regr = LinearRegression()
  • regr.fit(X_train, y_train)

We can get the y intercept and coefficients with:

  • regr.intercept_ # To get the y intercept
  • regr.coef_[index] # To get the coefficients
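To see how the intercept and coefficients fit together, this sketch rebuilds a prediction by hand and compares it to predict(). The data is synthetic and noise-free (generated from y = 3 + 2*x1 - 1*x2, an equation I made up), so the fitted values should come out very close to 3, 2, and -1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two predictor variables and no noise (an assumption
# made so the recovered coefficients are easy to check).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 3 + 2 * X_train[:, 0] - 1 * X_train[:, 1]

regr = LinearRegression()
regr.fit(X_train, y_train)

# A prediction is just: intercept + sum(coefficient * predictor value)
x_new = np.array([[1.0, 4.0]])
manual = regr.intercept_ + x_new @ regr.coef_

print(regr.intercept_, regr.coef_)
print(manual, regr.predict(x_new))  # the two predictions should match
```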

We can get the Root Mean Squared Error (RMSE) with:

  • y_pred = regr.predict(X_test)
  • rmse = np.sqrt( ((y_pred - y_test)**2).mean() )

Linear models like this are simple and easy to understand compared to other types of models, and are also able to make predictions quickly. However, they have limited flexibility, and on their own they cannot capture relationships between predictor variables.

Assessing Regressors

MSE and RMSE 

To assess regression models, we can use their Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Both of these measure how far off a model's predictions are from the real values on average.

We can calculate them with:

  • MSE = ((predicted - actual)**2).mean()
  • RMSE = np.sqrt(MSE)
 
We want both MSE and RMSE values to be as low as possible. However, what is considered low differs depending on the data. If the target is the price of used cars, 500 might be a good RMSE value, whereas 0.5 might be a good RMSE value if the target is cat weight.
 
We can also use the Scikit-Learn library to compute MSEs:

  • from sklearn.metrics import mean_squared_error
  • mse = mean_squared_error(y_test, y_pred)

To figure out which RMSE or MSE values are considered good for a dataset, we can calculate its baseline MSE or RMSE. The baseline is the error we would get by always predicting the average target value of the training data.

  • mean_target = y_train.mean()
  • mse_baseline = ((mean_target - y_test)**2).mean()
  • rmse_baseline = np.sqrt(mse_baseline)

If the above code results in a baseline RMSE of 17.7, then a model with an RMSE below 17.7 is doing better than simply guessing the average, and a model with an RMSE above 17.7 is doing worse.
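As a quick worked example of comparing a model against its baseline, here is a sketch with made-up numbers. The y_pred values stand in for some model's predictions; no real model is trained here.

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([12.0, 28.0, 35.0])
y_pred = np.array([14.0, 25.0, 33.0])  # hypothetical model predictions

# RMSE of the (hypothetical) model
rmse_model = np.sqrt(((y_pred - y_test) ** 2).mean())

# Baseline RMSE: always predict the mean of the training targets
mean_target = y_train.mean()
rmse_baseline = np.sqrt(((mean_target - y_test) ** 2).mean())

print(rmse_model, rmse_baseline)
print(rmse_model < rmse_baseline)  # True means the model beats the baseline
```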

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is similar to MSE, but we take the absolute value of the errors instead of squaring them. MAE is easier to interpret than MSE and RMSE because of the more straightforward math.
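Here is a small by-hand sketch of MAE next to RMSE on the same errors. The numbers are made up so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up values, chosen so the errors are 10, -10, 10 and -40.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 360.0])

errors = predicted - actual

# MAE: average of the absolute errors -> (10 + 10 + 10 + 40) / 4
mae = np.abs(errors).mean()

# RMSE for comparison: square, average, then take the square root
rmse = np.sqrt((errors ** 2).mean())

print(mae)   # 17.5
print(rmse)  # sqrt(475), roughly 21.8
```

The MAE reads directly as "off by 17.5 on average", which is the kind of plain interpretation RMSE doesn't give.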

Instead of computing it by hand, we can use the Scikit-Learn library to calculate a model's MAE.

  • from sklearn.metrics import mean_absolute_error
  • mae = mean_absolute_error(y_test, y_pred) 

We can also get the baseline MAE with the following code:

  • mean_target = y_train.mean()
  • mae_baseline = np.abs(mean_target - y_test).mean() 

Cross-Validation RMSE/MAE

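When doing cross validation with cross_val_score, we can set its scoring parameter to 'neg_root_mean_squared_error' to use RMSE, and 'neg_mean_absolute_error' to use MAE. The scores come back negated (so that higher is still better), which means we need to flip the sign to get the actual RMSE or MAE.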
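A minimal sketch of what that looks like, assuming the built-in diabetes dataset and a plain LinearRegression as stand-ins (neither is from the lecture):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)  # stand-in regression dataset

# Each call returns one score per fold, and every score is <= 0.
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=10,
                           scoring='neg_root_mean_squared_error')
neg_mae = cross_val_score(LinearRegression(), X, y, cv=10,
                          scoring='neg_mean_absolute_error')

# Flip the sign to get the usual "lower is better" error values.
cv_rmse = -neg_rmse.mean()
cv_mae = -neg_mae.mean()

print(cv_rmse, cv_mae)
```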
