CST 383 Week 2

This is our second week in CST 383 - Introduction to Data Science.

Pandas

This week we discussed Pandas, which is a Python library that is built on top of NumPy and is designed for data analysis.

Series 

First we discussed the Pandas Series, which is analogous to a list or 1D array.

  • Series are made up of two 1D arrays: one contains the values, and the other contains the indices for the data. The values array and the index array can have different data types. To create a Series we use pd.Series, such as x = pd.Series([.2, .4, .3, 1.0], index=['Mon', 'Tue', 'Wed', 'Thu']). If no index is explicitly given, the Series defaults to integer indices 0 through length − 1.
  • We can also create a Series from a dictionary by passing the dictionary into pd.Series. For example, if we had the dictionary d={'Mon':0.2, 'Tue':0.4} we could create a Series with x=pd.Series(d).
  • In the above example, the labels ['Mon', 'Tue', 'Wed', 'Thu'] are known as explicit indices. To access a value by explicit index we can use x['Mon'] or x.loc['Mon'].
  • We can also access a Series' values by position, using integers 0 through length − 1 as with 1D arrays. These are called implicit indices, and we access them using x.iloc[0].
  • We can index into Series using slicing, fancy indexing, and boolean masks like with arrays, and we can also perform vectorized operations and broadcasting.
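The points above can be put together in a short sketch, reusing the example Series from the lecture:

```python
import pandas as pd

# The Series from the example above
x = pd.Series([0.2, 0.4, 0.3, 1.0], index=['Mon', 'Tue', 'Wed', 'Thu'])

print(x['Mon'])      # explicit index -> 0.2
print(x.loc['Tue'])  # explicit index via .loc -> 0.4
print(x.iloc[0])     # implicit (positional) index -> 0.2

print(x['Mon':'Wed'])  # explicit slice; the stop label IS included
print(x[x > 0.3])      # boolean mask keeps Tue and Thu
print(x * 100)         # broadcasting scales every value
```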


We also discussed some useful methods we can use with Series.

  • x.isin([...]) will return a boolean mask indicating which values appear in the passed-in list or array.
  • x.between(num1, num2) will return a boolean mask for which values are between num1 and num2 (inclusive).
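Both methods can be sketched on the same example Series as before:

```python
import pandas as pd

x = pd.Series([0.2, 0.4, 0.3, 1.0], index=['Mon', 'Tue', 'Wed', 'Thu'])

mask_in = x.isin([0.2, 1.0])        # True where the value appears in the list
mask_between = x.between(0.3, 1.0)  # True where 0.3 <= value <= 1.0

print(x[mask_in])       # Mon and Thu
print(x[mask_between])  # Tue, Wed, Thu
```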

Dataframes 

We also discussed Pandas dataframes, which are analogous to 2D arrays.

  • We can create a dataframe using pd.DataFrame().
    • We can create a dataframe using a dictionary: hw = pd.DataFrame({'hw1':[89, 74, 68, 94], 'hw2':[92, 90, 78, 97]}, index=[1, 2, 3, 4]).
    • We can also create a dataframe using a 2D array:  hw = pd.DataFrame(np.array([[89, 92], [74, 90], [68, 78], [94, 97]]), columns=['hw1', 'hw2']).

 We have several operations we can perform on columns:

  • Selecting a column: hw['hw2']
  • Selecting multiple columns (note the double brackets): hw[['hw1', 'hw2']]
  • Adding a column: hw['hw3'] = [86, 77, 80, 86] 
  • Removing a column: hw.drop('hw3', axis=1, inplace=True)
  • Removing multiple columns: hw.drop(['hw1', 'hw2'], axis=1, inplace=True)
  • Renaming columns: hw.columns = ['hw-1', 'hw-2']
  • Rearranging columns: hw = hw[['hw2', 'hw1']] 

We also have several operations we can perform on rows:

  • Selecting a row: hw.loc['001'] for explicit indices, or hw.iloc[2] for implicit indices (hw.values[0] also works, but returns the first row as a NumPy array).
    • Note: loc includes stop of slices, while iloc does not.
    • To select both rows and columns at the same time, we can use .iloc for rows and .loc for columns: hw.iloc[1].loc['hw2'].
  • Selecting multiple rows: We can use slicing hw[:2], fancy indexing on row labels hw.loc[['001', '002']], or boolean masks hw[hw['hw1'] > 80].
  • Dropping rows: hw.drop([0, 2], inplace=True)
  • Sample random rows: hw_2 = hw.sample(3)
  • Stack dataframes vertically: pd.concat([df1, df2])
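The row operations above can be sketched on the homework dataframe, here with string row labels:

```python
import pandas as pd

hw = pd.DataFrame({'hw1': [89, 74, 68, 94], 'hw2': [92, 90, 78, 97]},
                  index=['001', '002', '003', '004'])

print(hw.loc['001'])          # row by explicit label
print(hw.iloc[2])             # row by position
print(hw.iloc[1].loc['hw2'])  # row by position, column by label -> 90

print(hw[:2])                  # slicing: first two rows
print(hw.loc[['001', '003']])  # fancy indexing on row labels
print(hw[hw['hw1'] > 80])      # boolean mask: rows '001' and '004'

sample = hw.sample(2)          # two random rows
stacked = pd.concat([hw, hw])  # stack vertically -> 8 rows
```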


Aggregation

This week we discussed simple aggregation techniques that we can use on series and dataframes.

  • Like with arrays, we can use methods like min, max, mean, and more to aggregate data.
  • We can also use the aggregate function. We can either pass in certain strings, like df.aggregate('min'), or pass in functions, like df.aggregate(np.max).
  • We can also apply multiple aggregations at once: df.aggregate(['min', 'mean', 'max']).
  • df.groupby('column'): Groups the rows by each distinct value of the column, returning a GroupBy object; applying an aggregation to it (e.g., df.groupby('column').mean()) then produces one result per group. We can also use slicing, fancy indexing, and boolean masks on the grouped results.
  • df['column'].value_counts(): Returns a Series containing the number of times each value appears in the column. We can also pass in normalize=True to turn the counts into proportions.
  • Pandas supports vectorized operations on strings through the .str accessor, including methods like len, lower, replace, split, isspace, and partition.
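A minimal sketch of these aggregation tools, using a made-up dataframe (the column names and values here are hypothetical, not from the lecture):

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'major': ['CS', 'CS', 'Math', 'Math'],
                   'gpa': [3.2, 3.8, 3.5, 3.1]})

print(df['gpa'].mean())                             # simple aggregation
print(df['gpa'].aggregate(['min', 'mean', 'max']))  # several at once
print(df['gpa'].aggregate(np.max))                  # a function, not a string

by_major = df.groupby('major')['gpa'].mean()  # one mean per distinct major
print(by_major)

print(df['major'].value_counts())                # counts per value
print(df['major'].value_counts(normalize=True))  # proportions per value

print(df['major'].str.lower())  # vectorized string method via .str
```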

 Reading Data

 We briefly discussed how to read data from CSV files using pandas.

  • We can read from a CSV file using df = pd.read_csv(...), where the argument is a file path or URL.
  • We can then get some info about the data using df.info().
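A small self-contained sketch; io.StringIO stands in for a real file path or URL, and the CSV contents here are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file or URL
csv_text = """name,age
Ana,21
Ben,34
"""

df = pd.read_csv(io.StringIO(csv_text))
df.info()        # column names, dtypes, and non-null counts
print(df.shape)  # (2, 2)
```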

Our class lecture also briefly discussed data. Depending on the angle from which we look at data, it can mean different things. For example, a computer scientist may look at 1 as simply an integer, but a data analyst may look at 1 as categorical.

Our class lecture defines two types of data, quantitative and categorical, and then two subtypes for each. Quantitative data can be split into discrete data, whose values usually come from counting, and continuous data, whose values usually come from measurement. Categorical data can be split into nominal and ordinal data.

When finding or analyzing data, we want lots of data that is recent, reliable, and relevant while keeping privacy concerns in mind.

We usually gather data using random samples of a population. In this case, a population is a group of something, and a sample is a subset of said population. We want samples to be selected randomly, and we ideally want larger samples that better represent the population.

We discussed some useful functions for analyzing data.

  • We can get the variance of data using df.values.var() 
  • We can get the standard deviation using df.std()
  • We can get various data about the data using df.describe().

We can calculate the probability of a question about data by using the mean on a boolean mask. For example, if we wanted to know the probability of age being 55 or greater, we could use (df['age'] >= 55).mean().
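The statistics above and the boolean-mask trick can be sketched together on a hypothetical age column:

```python
import pandas as pd

# Hypothetical ages
df = pd.DataFrame({'age': [25, 40, 55, 60, 70]})

print(df['age'].var())  # sample variance (pandas uses ddof=1; note that
                        # df.values.var() is NumPy's, which defaults to ddof=0)
print(df['age'].std())  # standard deviation
print(df.describe())    # count, mean, std, min, quartiles, max

p = (df['age'] >= 55).mean()  # P(age >= 55): mean of a boolean mask
print(p)                      # 3 of 5 rows -> 0.6
```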

We discussed two different types of graphs. 

  • A histogram, which resembles a bar graph, is a chart that shows how many data points fall into each of a set of discrete bins.
  • probability density function (PDF) shows the distribution of a continuous variable. 
    • The median of a PDF is the point associated with the line that splits the area under the curve into two equal parts.
    • The cumulative distribution function (CDF) is a graph that is derived from the PDF, and can be used to directly get probabilities.
    • A PDF with a longer left tail has a negative skew, while a PDF with a longer right tail has a positive skew.

Crosstabs

We briefly discussed crosstabs, which are a way to represent joint distributions. Crosstabs show the number of observations for each combination of two or more variables' values. The number of cells in a crosstab is the product of the number of values in each variable. In this case we have two variables with two values each, giving us 2*2=4 cells. If we had three variables with two values each we would have 2*2*2=8 cells.

                final: poor   final: good
  quiz: poor        15             2
  quiz: good        12            14

For example, the above crosstab shows how many students scored poorly on the quiz and final, poorly on the quiz but well on the final, well on the quiz but poorly on the final, and well on the quiz and final. We can get the total number of students by adding together all cells (43 for this example). We can use this crosstab to calculate various probabilities about the students' scores. For example, the probability of a student scoring poorly on both the quiz and the final would be 15 (the number of students who scored poorly on both) divided by 43, or about 0.35. As another example, the probability of a student scoring well on the final given that they scored well on the quiz would be 14 (the number of students who scored well on both) divided by 26 (the total number of students who scored well on the quiz), or about 0.54.
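Pandas can build a crosstab directly with pd.crosstab. A small sketch on made-up quiz/final results (a six-student toy dataset, not the lecture's 43-student data), including the two conditional calculations described above:

```python
import pandas as pd

# Hypothetical quiz/final results
df = pd.DataFrame({
    'quiz':  ['poor', 'poor', 'good', 'good', 'good', 'poor'],
    'final': ['poor', 'good', 'good', 'poor', 'good', 'poor'],
})

ct = pd.crosstab(df['quiz'], df['final'])
print(ct)

n = ct.values.sum()                       # total number of students
p_both_poor = ct.loc['poor', 'poor'] / n  # P(poor quiz AND poor final)
p_good_given_good = ct.loc['good', 'good'] / ct.loc['good'].sum()
print(p_both_poor, p_good_given_good)
```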

Reflection

This week we covered the pandas library in Python. Based on the lecture, it appears to be a powerful library, like NumPy, that is widely used for data analytics. Some notable aspects of pandas that we covered include Series (list-like objects with additional methods), dataframes (2D-array-like objects with additional methods), and several methods shared by a variety of pandas objects. We also briefly covered different ways to view data, including histograms, PDFs, and crosstabs.

The most challenging topic of this week was dataframes, like how the most challenging topic of last week was 2D arrays. Dataframes are incredibly useful for processing data but they were sometimes hard to visualize, which made solving certain problems difficult.
