CST 383 Week 6
This was our sixth week in CST 383 - Data Science.
Reflection
This week we continued to focus on hyperparameters, which we began discussing last week. We also covered KNN and linear regression models, and how we can assess the accuracy of regression models using MSE, RMSE, and MAE. Going through the lecture, the concepts seemed to be relatively straightforward. I didn't feel confused about any of the concepts going into the homework.
This feeling continued into the homework. While the portions of the homework that covered this week's concepts weren't necessarily easy, none of them were problematic. The one homework topic that did give me problems was One Hot Encoding. We learned about encoding last week, but last week's homework didn't really include it. This week's homework did, and it proved to be the most challenging part by far. It felt like either the reading material wasn't fully explaining how to use the OneHotEncoder object properly, or that I simply wasn't understanding it. I eventually got it to work, although I ended up writing much more logic for it than for the other problems.
Class Lecture
Hyperparameter Tuning
Hyperparameters are extra parameters that we set before training. They control things like how many neighbors to compare to, and are important for making an accurate algorithm.
Strategy 1: Test all Combinations
One way we can find the best combination of hyperparameters is to simply test all possible combinations. We can do this with for loops, for example:
- import pandas as pd
- from sklearn.model_selection import cross_val_score
- from sklearn.neighbors import KNeighborsClassifier
- results = []
- for k in range(1, 20):
-     for dist_fun in [1, 2]:  # p=1 is Manhattan, p=2 is Euclidean
-         clf = KNeighborsClassifier(n_neighbors=k, p=dist_fun)
-         cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
-         results.append([k, dist_fun, cv_acc])
- df_results = pd.DataFrame(results, columns=['k', 'dist_fun', 'cv_acc'])
This method of testing hyperparameter combinations is also called Grid Search. The Scikit-Learn library also includes the GridSearchCV object, which tests each combination of hyperparameter values like the code above:
- from sklearn.model_selection import GridSearchCV
- from sklearn.neighbors import KNeighborsClassifier
- grid = {'n_neighbors':range(1, 20), 'p':[1, 2]}
- knn_cv = GridSearchCV(KNeighborsClassifier(), grid)
- knn_cv.fit(X_train, y_train)
- knn_cv.best_params_ # To get the best scoring parameters
- knn_cv.best_score_ # To get the score of the best parameters
One problem with testing all combinations is that the number of possible combinations grows rapidly with the number of hyperparameters and values. This can make finding the optimal combination slow. Furthermore, many combinations have similar results, so we end up calculating the accuracy of combinations that we do not need to.
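To get a feel for how quickly the grid grows, here is a small sketch that counts the combinations. The first grid is the one from the code above; the extra `weights` hyperparameter and the fold count of 10 are my own assumptions for illustration.

```python
from itertools import product

# The grid from the GridSearchCV example above
grid = {'n_neighbors': range(1, 20), 'p': [1, 2]}
n_combos = len(list(product(*grid.values())))  # 19 * 2

# Assumption for illustration: add one more hyperparameter with two values
bigger_grid = dict(grid, weights=['uniform', 'distance'])
n_bigger = len(list(product(*bigger_grid.values())))  # 19 * 2 * 2

# With 10-fold cross validation, every combination is fit 10 times
n_fits = n_bigger * 10

print(n_combos, n_bigger, n_fits)  # 38 76 760
```

Each extra hyperparameter multiplies the number of combinations, and cross validation multiplies the number of model fits again.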
Strategy 2: Test Random Combinations
Another strategy we can use to avoid strategy one's problems is testing random combinations. This strategy is also called Random Search. We can do this with a loop, for example:
- import numpy as np
- import pandas as pd
- from sklearn.model_selection import cross_val_score
- from sklearn.neighbors import KNeighborsClassifier
- num_trials = 10
- results = []
- for trial in range(num_trials):
-     k = np.random.choice([3, 5, 9, 13, 17, 21, 27, 35, 49])
-     dist_fun = np.random.choice([1, 2])  # p=1 is Manhattan, p=2 is Euclidean
-     clf = KNeighborsClassifier(n_neighbors=k, p=dist_fun)
-     cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
-     results.append([k, dist_fun, cv_acc])
- df_results = pd.DataFrame(results, columns=['k', 'dist_fun', 'cv_acc'])
The Scikit-Learn library also includes the RandomizedSearchCV object, which tests random combinations for us:
- from sklearn.model_selection import RandomizedSearchCV
- from sklearn.neighbors import KNeighborsClassifier
- grid = {'p':[1, 2], 'n_neighbors':[3, 5, 9, 13, 17, 21, 27, 35, 49]}
- knnCV = RandomizedSearchCV(KNeighborsClassifier(), grid, cv=10, n_iter=10)
- knnCV.fit(X_train, y_train)
- knnCV.best_params_ # To get the best scoring parameters
- knnCV.best_score_ # To get the score of the best parameters
While this strategy takes less time than strategy one, it is less thorough: because it only tries some of the combinations, it can miss the best one. If we have time then we should use grid search; otherwise we should use random search.
Hyperparameters in Machine Learning
We can summarize machine learning in two steps. In step one, we pass in the hyperparameters and training data to train the model. In step two, we use the model to predict the outcomes for new data.
Best Practices
- Perform a train/test split on the data.
- Use cross validation with the training data to compare models/hyperparameters.
- Train the best model/hyperparameter combination on the full training set.
- Compute the test score using the test set.
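The four steps above can be sketched end to end like this. The synthetic data from make_classification and the particular grid values are my own assumptions for illustration; a real project would load its own dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Step 1: hold out a test set that is not touched until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: compare hyperparameters with cross validation on the training data only
grid = {'n_neighbors': [3, 5, 9, 13], 'p': [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
search.fit(X_train, y_train)

# Step 3: with the default refit=True, GridSearchCV retrains the best
# combination on the full training set and stores it in best_estimator_
best_model = search.best_estimator_

# Step 4: compute the test score once, using the held-out test set
test_acc = best_model.score(X_test, y_test)
print(search.best_params_, test_acc)
```

The key point is that the test set is only used in the last step, so the test score is not influenced by the hyperparameter search.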
KNN - Regression
With machine learning (more specifically Supervised Learning according to our class lecture), we can try to predict either a category, such as whether a college is public or private, or a quantity, such as a college's tuition. Predicting categories is called classification, and predicting quantities is called regression.
When doing KNN Regression, we take the average value of the nearest neighbors to predict the value, instead of the mode like in KNN Classification.
We can use the KNeighborsRegressor object to perform KNN Regression.
- from sklearn.neighbors import KNeighborsRegressor
- knn = KNeighborsRegressor()
- knn.fit(X_train, y_train)
- y_pred = knn.predict(X_test)
- MSE = ((y_pred - y_test)**2).mean() # Mean Squared Error, lower is better
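To see what "the average value of the nearest neighbors" means, here is a hand-written sketch of a single KNN regression prediction. The function name knn_predict and the tiny dataset are made up for illustration, and it uses plain Euclidean distance; it is not how KNeighborsRegressor is implemented internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict one value as the mean target of the k nearest training rows."""
    # Euclidean distance from x_new to every training row
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indexes of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Regression: average the neighbors' target values
    return y_train[nearest].mean()

# Made-up data: one predictor column, four training rows
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

pred = knn_predict(X_train, y_train, np.array([2.0]), k=3)
print(pred)  # 20.0, the mean of 10, 20, and 30
```

The far-away row at 10.0 is not among the three nearest neighbors, so its target of 100 has no effect on the prediction.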
Linear Regression
With linear regression, we try to come up with a linear equation that predicts the outcome as accurately as possible. On a graph, this looks like several candidate lines drawn through a scatter of data points. Each line is a different linear equation, and the line that is closest to all of the points on average (the one with the lowest Mean Squared Error (MSE)) is the most accurate equation.
In this example the equations have only one predictor variable (displacement), but we can use more than one predictor variable.
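As a rough sketch of the idea of comparing candidate lines, we can compute the MSE of a few lines by hand and pick the lowest. The data points and the three candidate (slope, intercept) pairs are made up for illustration.

```python
import numpy as np

# Made-up data with one predictor variable
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Candidate lines as (slope, intercept) pairs, chosen arbitrarily
candidates = [(1.0, 1.0), (2.0, 0.0), (3.0, -2.0)]

def mse_of_line(slope, intercept):
    """Mean squared error of the line y = slope * x + intercept on the data."""
    y_pred = slope * x + intercept
    return ((y_pred - y) ** 2).mean()

scores = {line: mse_of_line(*line) for line in candidates}
best_line = min(scores, key=scores.get)
print(best_line)  # (2.0, 0.0)
```

Training a linear regression model does the same kind of comparison, except that it finds the best coefficients directly instead of trying a handful of guesses.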
The machine learning concept is the same with linear regression as with other types of regression or classification. We train a model to find the best parameters/coefficients for the function. We can use the LinearRegression object to do this.
- from sklearn.linear_model import LinearRegression
- regr = LinearRegression()
- regr.fit(X_train, y_train)
- regr.intercept_ # To get the y intercept
- regr.coef_[index] # To get the coefficients
We can get the Root Mean Squared Error (RMSE) with:
- y_pred = regr.predict(X_test)
- rmse = np.sqrt( ((y_pred - y_test)**2).mean() )
Linear Models like this are simple and easy to understand compared to other types of models, and are also able to make predictions quickly. However, they have limited flexibility, and cannot capture relationships between predictor variables.
Assessing Regressors
MSE and RMSE
To assess regression models, we can use their Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Both of these are calculations of how far off a model's predictions are from the real values on average.
To calculate them, we can use:
- # By hand
- MSE = ((predicted - actual)**2).mean()
- RMSE = np.sqrt(MSE)
- # With Scikit-Learn
- from sklearn.metrics import mean_squared_error
- mse = mean_squared_error(y_test, y_pred)
- # Baseline to compare against: always predict the mean of the training target
- mean_target = y_train.mean()
- mse_baseline = ((mean_target - y_test)**2).mean()
- rmse_baseline = np.sqrt(mse_baseline)
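Here is a small worked example of these calculations with made-up numbers, so the arithmetic can be checked by hand. A model is only useful if its error is lower than the baseline error.

```python
import numpy as np

# Made-up values for illustration
y_train = np.array([2.0, 4.0, 6.0, 8.0])  # mean is 5.0
y_test = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])        # pretend model predictions

# Model errors are -1, 0, 2, so the squared errors are 1, 0, 4
mse = ((y_pred - y_test) ** 2).mean()     # 5/3
rmse = np.sqrt(mse)

# Baseline errors are 2, 0, -2, so the squared errors are 4, 0, 4
mean_target = y_train.mean()
mse_baseline = ((mean_target - y_test) ** 2).mean()  # 8/3
rmse_baseline = np.sqrt(mse_baseline)

print(rmse < rmse_baseline)  # True
```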
Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is similar to MSE, but we take the absolute value of the errors instead of squaring them. MAE is easier to interpret than MSE and RMSE because it is simply the average size of the errors, in the same units as the target.
Instead of calculating it by hand, we can use the Scikit-Learn library to calculate a model's MAE.
- from sklearn.metrics import mean_absolute_error
- mae = mean_absolute_error(y_test, y_pred)
We can also get the baseline MAE with the following code:
- mean_target = y_train.mean()
- mae_baseline = np.abs(mean_target - y_test).mean()
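One way to see the difference between taking absolute values and squaring is to compare MAE and RMSE on two made-up sets of errors. The helper functions mae and rmse below are my own names for illustration, not library functions.

```python
import numpy as np

def mae(errors):
    """Mean absolute error from an array of (predicted - actual) values."""
    return np.abs(errors).mean()

def rmse(errors):
    """Root mean squared error from an array of (predicted - actual) values."""
    return np.sqrt((errors ** 2).mean())

# Made-up errors: four small misses versus one large miss
even_errors = np.array([1.0, -1.0, 1.0, -1.0])
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])

print(mae(even_errors), rmse(even_errors))      # 1.0 1.0
print(mae(one_big_error), rmse(one_big_error))  # 1.0 2.0
```

Both sets of errors have the same MAE, but squaring makes the single large miss count for more, so the RMSE is higher for the second set.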
Cross-Validation RMSE/MAE
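When doing cross validation with cross_val_score, we can set its scoring parameter to 'neg_root_mean_squared_error' to use RMSE, and 'neg_mean_absolute_error' to use MAE. Both return negated values (so that a higher score is still better), so we flip the sign to get the actual error.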
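A minimal sketch of cross-validated RMSE might look like this. The synthetic data, the choice of KNeighborsRegressor, and the 5 folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# The scores come back negated, one per fold
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y,
                         cv=5, scoring='neg_root_mean_squared_error')

# Flip the sign to get the average RMSE across folds
cv_rmse = -scores.mean()
print(cv_rmse)
```

Swapping the scoring string for 'neg_mean_absolute_error' gives the cross-validated MAE in the same way.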