CST 383 Week 4
This was our fourth week in CST 383 - Data Science.
Reflection
This week we went into more detail on crosstabs: how to manipulate them and how to plot them. We also briefly covered how to plot more than two variables on a scatter plot, as well as violin plots and seaborn FacetGrids.
The second week introduced crosstabs without covering them in detail, so I had been wondering when we would return to them. At first crosstabs seemed like a unique data structure, but after learning about them and manipulating them in code, they feel like any other DataFrame or 2D structure.
Speaking of 2D structures, at the beginning of this course I found manipulating 2D arrays and DataFrames to be quite complex. I often had problems envisioning the data that I was working on. After having worked with 2D structures for the past few weeks, I am happy to find that it now feels easier. While the syntax for crosstabs is more complicated, I was able to intuit the data more easily than I could with 2D arrays.
Similar to how crosstabs were not fully covered in the week that they were introduced, I wonder if we will cover facet grids in more detail. Facet grids seem like a powerful tool for visualizing large amounts of data, but I feel like not enough was covered in class for me to take advantage of them.
Crosstabs
We can create a crosstab in code using the pandas.crosstab function, such as pandas.crosstab(index=df['column1'], columns=df['column2'], margins=True). Setting margins=True adds a row and a column of totals.
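As a minimal sketch of the call above, here is a crosstab built from a small made-up DataFrame. The data is hypothetical and only the column names (Private, Selective) are borrowed from the examples later in this post.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({
    "Private":   ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Selective": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})

# Count how often each (Private, Selective) pair occurs.
# margins=True adds an "All" row and column holding the totals.
counts = pd.crosstab(index=df["Private"], columns=df["Selective"], margins=True)
print(counts)
```

The bottom-right "All" cell holds the total number of rows in the DataFrame.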
Normalizing Crosstabs
We can set the normalize parameter to 'index', 'columns', or 'all'/True to normalize the crosstab.
'index': Each row sums to one. This shows the conditional probabilities of the column values given each row value.
'columns': Each column sums to one. This shows the conditional probabilities of the row values given each column value.
'all'/True: All of the cells together sum to one. This shows the joint probability of each cell instead of counts.
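The three settings can be compared side by side. This is a small sketch on made-up data (the values are hypothetical), checking which sums come out to one in each case.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({
    "Private":   ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Selective": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})

by_row = pd.crosstab(df["Private"], df["Selective"], normalize="index")
by_col = pd.crosstab(df["Private"], df["Selective"], normalize="columns")
joint = pd.crosstab(df["Private"], df["Selective"], normalize="all")

print(by_row.sum(axis=1))      # every row sums to 1
print(by_col.sum(axis=0))      # every column sums to 1
print(joint.to_numpy().sum())  # the whole table sums to 1
```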
Probabilities using Crosstabs
The calculations listed below assume that normalize='all', with var1 as the index and var2 as the columns.
Joint Probability: Joint probability is the probability of two events or values occurring at the same time, and is represented with P(var1 = 0, var2 = 0). We can look up the joint probability of two values by accessing the matching cell of the crosstab using crosstab.loc[0, 0].
Marginal Probability: Marginal probability is the probability of having a certain outcome regardless of other variables, and is represented with P(var1 = 0). We can calculate a variable's marginal probability by summing across the row or column for that value, for example crosstab.loc[0, :].sum().
Conditional Probability: Conditional probability is the probability of one event occurring given another event, and is represented with P(var1 = 0 | var2 = 0) or P(var1 = 0 given var2 = 0). We can calculate a conditional probability, for example P(var1 = 1 | var2 = 0), in either of two ways:
- crosstab.loc[1, 0] / crosstab.loc[:, 0].sum()
- or cond = crosstab.loc[:, 0] / crosstab.loc[:, 0].sum(), then cond[1]
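The three calculations can be sketched together on a small made-up dataset. The 0/1 values below are hypothetical, with var1 as the index and var2 as the columns.

```python
import pandas as pd

# Hypothetical 0/1 data, not the course dataset.
df = pd.DataFrame({
    "var1": [0, 0, 1, 1, 1, 0, 1, 1],
    "var2": [0, 1, 0, 0, 1, 0, 0, 1],
})
crosstab = pd.crosstab(df["var1"], df["var2"], normalize="all")

joint = crosstab.loc[1, 0]                # P(var1 = 1, var2 = 0): 3 of 8 rows
marginal_var1 = crosstab.loc[1, :].sum()  # P(var1 = 1): 5 of 8 rows
marginal_var2 = crosstab.loc[:, 0].sum()  # P(var2 = 0): 5 of 8 rows
conditional = joint / marginal_var2       # P(var1 = 1 | var2 = 0): 3 of 5 rows

# Same conditional probability, computed for the whole column at once.
cond = crosstab.loc[:, 0] / crosstab.loc[:, 0].sum()
print(joint, marginal_var1, conditional, cond.loc[1])
```

Dividing the whole column by its sum gives P(var1 | var2 = 0) for every value of var1, which is handy when more than one of them is needed.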
Graphing Crosstabs
We can plot crosstabs the same way we plot DataFrames or Series. For example, we can plot a crosstab with crosstab.plot.bar(). This creates a bar plot of the counts, with one group of bars per row of the crosstab.
If we want the graph to show conditional probabilities, such as P(selective given private), then we will want to set the condition (private in this case) to be the index, then normalize the crosstab on the index. For example, crosstab = pd.crosstab(index=df['Private'], columns=df['Selective'], normalize='index').
We can create stacked bar plots by setting the stacked parameter to True in the bar function call. For example, crosstab.plot.bar(stacked=True).
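Putting those pieces together, here is a rough sketch of a stacked bar plot of P(Selective given Private). The data is made up, and the Agg backend line is only an assumption so the script runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({
    "Private":   ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Selective": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})

# The condition (Private) is the index, and each row is normalized to sum to 1.
crosstab = pd.crosstab(index=df["Private"], columns=df["Selective"], normalize="index")

# One stacked bar per value of Private; each bar has total height 1.
ax = crosstab.plot.bar(stacked=True)
ax.set_ylabel("P(Selective | Private)")
plt.tight_layout()
```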
Scatterplots
Besides using pairwise scatterplots, we can show more than two columns in a single scatterplot by setting the hue and size parameters. For example, seaborn.scatterplot(data=df, x='Top10perc', y='Grad.Rate', hue='Private', size='Outstate'). The hue parameter also works with some other seaborn plots, such as barplots or violin plots.
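Here is a minimal sketch of that call on a handful of made-up rows. The numbers are hypothetical and only the column names come from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({
    "Top10perc": [23, 16, 60, 38, 10, 45],
    "Grad.Rate": [60, 56, 90, 72, 50, 80],
    "Outstate":  [7440, 12280, 19000, 15000, 6500, 17000],
    "Private":   ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# x and y place each point, hue colors it by Private,
# and size scales the marker by Outstate: four columns in one plot.
ax = sns.scatterplot(data=df, x="Top10perc", y="Grad.Rate",
                     hue="Private", size="Outstate")
```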
Violin Plot
Using a similar setup to the scatterplot above, we can create a violin plot, which is similar to a box plot, using seaborn.violinplot. For example, seaborn.violinplot(data=df, x='Selective', y='S.F.Ratio', hue='Private', split=True). With split=True, the two hue levels are drawn as the two halves of a single violin.
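The violin plot call can be tried the same way. This is only a sketch on made-up values, with a few rows per group so that each half of a violin has something to draw.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({
    "Selective": ["Yes"] * 6 + ["No"] * 6,
    "Private":   ["Yes", "Yes", "Yes", "No", "No", "No"] * 2,
    "S.F.Ratio": [9.5, 11.0, 12.5, 14.0, 15.5, 17.0,
                  12.0, 13.5, 15.0, 16.5, 18.0, 19.5],
})

# One violin per value of Selective; split=True puts the two
# Private groups on opposite halves of each violin.
ax = sns.violinplot(data=df, x="Selective", y="S.F.Ratio",
                    hue="Private", split=True)
```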
FacetGrid
We can combine multiple plots into a grid using seaborn's FacetGrid. This lets us more directly compare multiple variables along the grid's rows and/or columns.
First we have to create a FacetGrid object by calling seaborn.FacetGrid. For example, grid = seaborn.FacetGrid(df, col='Selective', height=5.5, aspect=.5).
Then we call the map method on the grid object, for example grid.map(seaborn.barplot, "Private", "Outstate", order=['Yes', 'No']). This call draws one bar plot of Outstate by Private for each value of Selective.
For another example, if we create a FacetGrid object with grid = seaborn.FacetGrid(df, row='Private', col='Selective'), we can create scatterplots using grid.map(plt.scatter, 'Grad.Rate', 'Outstate', s=25). This draws one scatterplot for each combination of Private and Selective.