CST383 Week 1

This was our first week in CST 383 - Data Science. This class discusses machine learning, how to analyze data, and how to use Python for data analysis.

Python Array Operations

In this week's video lecture we discussed various operations one can perform on Python arrays.

  • Slicing: Python arrays can be accessed using [start:stop:step]. Start defaults to 0, stop defaults to the end of the array, and step defaults to 1. For example, to access the last three elements in an array we would use array[len(array) - 3:] (or the shorter array[-3:]), to access the first half of the array we would use array[:len(array) // 2] (integer division, since slice indices must be whole numbers), and to access every other element we would use array[::2].
  • Fancy Indexing: We can use an array as a list of indices we want to get from a different array. For example, array[[0, 2, 4]] would return the first, third, and fifth elements of the array.
  • Broadcasting: We can apply an operation to an array to quickly perform it on each element. If we have the array [1, 2, 3, 4], we could use array + 1 to get [2, 3, 4, 5]. This works for arithmetic and boolean operations, with either a constant or another array of a compatible shape as the other operand.
  • Boolean Masking: We can perform boolean operations on an array to get an array of booleans. For example, if array holds [1, 2, 3, 4], then array < 3 would return [True, True, False, False]. Boolean arrays like this are often called masks. We can use a mask to select elements from an array: with the array and mask above, array[mask] gives [1, 2]. We could also use the boolean operation directly as the index, such as array[array < 3].
  • NumPy arrays are faster than traditional Python lists. We can build them by using functions such as np.array([1, 2, 3, 4]) or np.arange(10).
  • NumPy has array functions that are equivalent to built-in Python methods, but perform several times faster. Some of the ones we covered this week were: np.sum, np.median, np.mean, np.max, np.min, and np.sort.
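
To see the operations from the list above in one place, here is a small sketch on a NumPy array. The variable names (arr, mask, and so on) are made up for illustration and are not from the lecture.

```python
import numpy as np

arr = np.arange(10)                 # [0 1 2 3 4 5 6 7 8 9]

# Slicing with [start:stop:step]
last_three = arr[-3:]               # same as arr[len(arr) - 3:]
first_half = arr[:len(arr) // 2]    # // keeps the index an integer
every_other = arr[::2]

# Fancy indexing: a list of indices selects those positions
picked = arr[[0, 2, 4]]

# Broadcasting: the operation is applied to every element
plus_one = arr + 1
doubled = arr + arr

# Boolean masking: a comparison gives a boolean array,
# which can then be used to index the original array
mask = arr < 3
small = arr[mask]                   # same as arr[arr < 3]

print(last_three, first_half, every_other)
print(picked, plus_one, doubled)
print(mask, small)
```

Note that fancy indexing, broadcasting, and masking are NumPy features; a plain Python list supports slicing but not the other three.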

We also covered the attributes dtype, shape, and size, which can be used to get the data type, the shape (represented as a tuple), and the total number of elements of an array.
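
As a quick check of these attributes and a few of the NumPy functions mentioned above, here is a minimal sketch. The array values are arbitrary, and the exact integer dtype printed may differ between platforms.

```python
import numpy as np

arr = np.array([4, 1, 3, 2])

# Attributes describe the array itself
print(arr.dtype)    # an integer type; the exact width depends on the platform
print(arr.shape)    # (4,)
print(arr.size)     # 4

# Aggregate functions work on the whole array at once
total = np.sum(arr)         # 10
middle = np.median(arr)     # 2.5
average = np.mean(arr)      # 2.5
largest = np.max(arr)       # 4
smallest = np.min(arr)      # 1
ordered = np.sort(arr)      # sorted copy; arr itself is unchanged

print(total, middle, average, largest, smallest, ordered)
```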

We also covered 2D NumPy arrays. All of the operations, functions, and attributes discussed above also work for 2D and multi-dimensional arrays. When using 2D arrays, the terms rows and columns are often used. If we use the following array as an example:

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]

Its rows would be [1, 2, 3], [4, 5, 6], and [7, 8, 9], and its columns would be [1, 4, 7], [2, 5, 8], [3, 6, 9].
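
Rows and columns of the example array can be pulled out with two-part indexing. This is a small sketch; the name grid is just an illustrative choice.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Index order is [row, column]; a colon means "everything along this axis"
first_row = grid[0]         # [1 2 3]
first_col = grid[:, 0]      # [1 4 7]
center = grid[1, 1]         # 5

# axis=0 collapses the rows (one result per column),
# axis=1 collapses the columns (one result per row)
col_sums = grid.sum(axis=0)     # [12 15 18]
row_sums = grid.sum(axis=1)     # [ 6 15 24]

print(grid.shape)               # (3, 3)
print(first_row, first_col, center)
print(col_sums, row_sums)
```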

Introduction to Machine Learning with Python - Chapter 1

One of the textbooks we will be reading in this class is Introduction to Machine Learning with Python.

The first chapter defines two types of machine learning:

  • Supervised Learning: The algorithm is fed both a series of inputs and the expected output for each input. For example, when training an algorithm to detect fraudulent activity, it is fed transaction logs and bank activity as the input, and it is told which transactions are fraudulent as the output.
  • Unsupervised Learning: The algorithm is fed a series of inputs, but no corresponding outputs. For example, an algorithm to detect abnormal behavior in website traffic will be given an input consisting of said website traffic, but it will not be told which traffic is abnormal.

The first chapter also says that Python is widely used in machine learning and data science because of its extensive library support. It lists scikit-learn, NumPy, SciPy, matplotlib, pandas, and mglearn as particularly important libraries.

It is also important to test an algorithm after training it. The same data used to train an algorithm should not be used to test it, because the algorithm may simply remember the training data, and the test would then say nothing about how it handles new data. Instead we must split a dataset into two sections, one used to train the algorithm and one used to test it.
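
Here is a rough sketch of such a split done by hand with NumPy so that each step is visible. The data is made up, and the one-quarter test size is an arbitrary choice for this example; in practice scikit-learn provides train_test_split in sklearn.model_selection for the same job.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the split is repeatable

X = np.arange(20).reshape(10, 2)    # 10 made-up samples with 2 features each
y = np.arange(10)                   # 10 made-up labels, one per sample

order = rng.permutation(len(X))     # shuffled row indices
n_test = len(X) // 4                # hold out about a quarter of the rows
test_idx, train_idx = order[:n_test], order[n_test:]

# Use the same indices for inputs and labels so they stay paired
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting matters when the original data is ordered, for example sorted by label; otherwise the test set could end up containing only one kind of sample.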

Python Data Science Handbook - Chapter 2

Another one of the textbooks we will be reading is Python Data Science Handbook.

This chapter discusses useful shell commands offered by IPython.

  • %run [filename] - Runs the entered file.
  • %timeit [line of Python code] - Runs the entered Python line repeatedly and reports how long it takes to execute.
  • [object]? - Prints the documentation (docstring) for the entered object.
  • %magic - Prints documentation about the magic function system.
  • %lsmagic - Gives a list of all available magic functions.
  • In - A list of the commands entered so far; In[n] is the nth input.
  • Out - A dictionary of the outputs produced so far; Out[n] is the output of the nth input, if it had one.
  • Underscores (_) - One to three underscores (_, __, ___) can be used to access the last one to three outputs.

Reflection

Despite Python's popularity, I had very little experience with it before this class. I was familiar with the basic syntax, but I did not know about more advanced features, such as its many array operations. Going into this class, I was excited to learn more about Python.

The first week immediately fulfilled this excitement by discussing Python's many array features. As one who is used to working with Java or C style arrays, I found Python's array features immediately very interesting. On the code-writing side, many tasks are simplified by taking advantage of Python's functionality. In other languages, if I wanted to see which elements are below a certain value I would have to write a for loop, but with Python I can accomplish this in one line.

One aspect of Python's array functions that I found confusing was its 2D arrays, specifically the notation for rows and columns. It took me a while to figure out what was meant by rows and columns, and how I might access each column specifically.

 
