CST 383 Week 3

Reflection

This was our third week in CST383 - Data Science. This week we learned about visualizing data with graphs using pandas, matplotlib, and SciPy. We focused on the definitions of various types of graphs, the functions that create them, and the parameters we can use to alter them.

As with the array functions and operations from week one, I was surprised at how robust these libraries are, and also by how easy their basic uses are to learn. In very little time, I was able to start making nice-looking graphs, and with more research I could quickly make professional-quality ones.

In previous weeks, I found half of the material to be relatively straightforward (namely 1D arrays and Series), and the rest of it to be more difficult (2D arrays and Data Frames). This week felt a little different, with all of the different types of graphs and parameters feeling about as advanced and complicated as each other. If one thing felt more difficult, it would be using the plt.subplot function to create multiple, separate plots. It is not surprising that this was more difficult, as it resembles 2D arrays or Data Frames, just with graphs.

Basic Plotting

Density Plots

To make a density plot we use the function df['column'].plot.density(). Important arguments for this function include:

title=String: Change the title of the plot.
bw_method=float: Changes the plot's smoothness. A higher value means a smoother plot. This parameter is important for ensuring that a plot does not have too little or too much detail.

To plot a random resample of the data instead, we can first use samples = df['column'].sample(frac=1, replace=True), which draws as many rows as the original with replacement, and then call samples.plot.density().
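
As a rough illustration of how bw_method and resampling interact, here is a minimal sketch. The data is synthetic and the column name 'age' is only an assumption for the example. It also assumes SciPy is installed, since pandas relies on it for density plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical data: 200 synthetic ages drawn from a normal distribution.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(loc=55, scale=9, size=200)})

# A smaller bw_method gives a bumpier curve, a larger one a smoother curve.
ax = df["age"].plot.density(bw_method=0.2, title="Age density", label="bw_method=0.2")
df["age"].plot.density(bw_method=1.0, ax=ax, label="bw_method=1.0")

# Resample with replacement (same size as the original) and overlay its density.
samples = df["age"].sample(frac=1, replace=True, random_state=0)
samples.plot.density(ax=ax, label="resample")
ax.legend()
```

Passing ax=ax draws every curve on the same axes, which makes the effect of each setting easy to compare.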

Cumulative Density Plots

To make a cumulative density plot, we first import the seaborn library and then call seaborn.kdeplot(df['column'], cumulative=True).

Multi-Variable Density Plots

To make multi-variable density plots we use the function df[*Indexing/Statement for multiple columns*].plot.density(figsize=(x,y)).

Histograms

We can create a histogram, which shows the distribution of a numeric variable by counting how many values fall into each bin, using any of the following calls. Note that there are minor differences between these methods.

df['column'].plot.hist()
df['column'].hist()
df['column'].plot(kind='hist')

We can also use the bins parameter to change the number and width of the bars, such as df['column'].hist(bins=range(30, 90, 10)).
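
To see what bins=range(30, 90, 10) actually does, the sketch below counts values per bin on a small made-up 'age' column. NumPy's histogram function is used here only to print the counts; the bar heights from df['age'].hist(bins=range(30, 90, 10)) should match them.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen by hand so the bin counts are easy to verify.
df = pd.DataFrame({"age": [31, 38, 42, 45, 47, 53, 55, 58, 61, 64, 66, 72, 79]})

# range(30, 90, 10) gives the edges 30, 40, ..., 80, i.e. five bins of width 10.
edges = list(range(30, 90, 10))
counts, _ = np.histogram(df["age"], bins=edges)

# Map each bin's left edge to the number of values that fell into it.
print(dict(zip(edges[:-1], counts.tolist())))
```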

Boxplots

We can create a boxplot using df['column'].plot.box(). Boxplots are useful for showing the distribution of a continuous variable.

Heatmaps

We can create a heatmap, which shows the correlation of each combination of two variables, using seaborn.heatmap, such as seaborn.heatmap(df.corr(), cmap='PRGn', vmin=-1, vmax=1). Correlations range from -1 to 1, so vmin and vmax are set to cover that full range.

Which Plot to use?

Plotting in Notebook

When plotting using a Notebook, we can use the following functions to alter any plots we have made:

plt.figure(figsize=(width, height)): Sets the width and height of the figure in inches.
plt.title(String): Sets the title of the plot.
plt.xlabel(String): Gives a label to the x-axis.
plt.xlim(min, max): Sets the start and ending values of the x-axis.
plt.ylabel(String): Sets the label of the y-axis.

When using the above functions in a notebook, it is important to use plt.show() or to append ; to the last plt statement.
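
Putting the functions above together, a notebook cell might look roughly like the following sketch. The 'maxhr' values are made up for illustration, and the Agg backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical maximum heart rates.
df = pd.DataFrame({"maxhr": [120, 135, 142, 150, 158, 163, 171, 180]})

fig = plt.figure(figsize=(6, 4))   # 6 inches wide, 4 inches tall
df["maxhr"].plot.hist(bins=5)      # a Series plot draws on the current figure
plt.title("Maximum heart rate")
plt.xlabel("maxhr (bpm)")
plt.ylabel("count")
plt.xlim(100, 200)
ax = plt.gca()                     # keep a handle on the axes for inspection
plt.show()
```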

Famous Distributions

Uniform Distribution

Uniform distribution is a type of probability distribution in which all outcomes are equally likely. There are two types of uniform distribution:

Discrete Uniform Distribution: Used when there are a finite number of outcomes, each with the same probability. A good example of this is rolling a die: each side of the die has an equal probability.
Continuous Uniform Distribution: Used when outcomes are continuously spread between two values. The probability density function is flat across the interval.
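
A minimal sketch of both cases, assuming a fair six-sided die for the discrete case and an arbitrary interval [2, 5] for the continuous one:

```python
import numpy as np
import pandas as pd

# Discrete uniform: a fair six-sided die, each face with probability 1/6.
die = pd.Series(1 / 6, index=range(1, 7))

# Continuous uniform on [a, b]: the density is the constant 1 / (b - a).
a, b = 2.0, 5.0
density = 1 / (b - a)

# Simulated draws from each distribution.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)   # upper bound 7 is exclusive
draws = rng.uniform(a, b, size=10_000)
```

With this many draws, a histogram of rolls or draws should look roughly flat.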

Normal Distribution

Normal distributions are continuous probability distributions that form the famous bell-shaped curve. Real-world data often follow this distribution/shape.

A normal distribution is defined by two parameters: μ: mean (center of the curve) and σ: standard deviation (spread).

The 68-95-99.7 Rule

The 68-95-99.7 rule states that in a normal distribution approximately 68% of the data is within one standard deviation of the mean, 95% is within two, and 99.7% is within three.
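
The three percentages can be checked numerically. For a normal variable, the probability of landing within k standard deviations of the mean equals erf(k / sqrt(2)), so a short sketch using only the standard library is enough (prob_within is a helper name made up for this example):

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```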

Joint and Conditional Probabilities

Joint Probabilities

The joint probability is the probability of two things both happening. For example, the probability of age being at most 60 and maxhr being above 130 could be represented as P(age <= 60 and maxhr > 130), and could be calculated with ((df['age'] <= 60) & (df['maxhr'] > 130)).mean().

A related idea is the probability of either of two things happening, or P(age <= 60 or maxhr > 130). This can be calculated in a similar way.
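
Here is a small sketch of both calculations on made-up data. The 'age' and 'maxhr' values are hypothetical and chosen so the results can be checked by hand.

```python
import pandas as pd

# Hypothetical patient data, small enough to verify manually.
df = pd.DataFrame({
    "age":   [42, 45, 48, 50, 55, 58, 63, 67],
    "maxhr": [172, 128, 150, 141, 135, 120, 118, 145],
})

young = df["age"] <= 60
fast = df["maxhr"] > 130

p_and = (young & fast).mean()   # P(age <= 60 and maxhr > 130)
p_or = (young | fast).mean()    # P(age <= 60 or maxhr > 130)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
p_or_check = young.mean() + fast.mean() - p_and
```

The mean of a boolean Series is the fraction of True values, which is why .mean() gives a probability here.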

Conditional Probabilities

A conditional probability is the probability of one thing happening given that another has happened. This can be represented with P(maxhr > 130 | age <= 50), read as "maxhr > 130 given age <= 50", and calculated with df_age50 = df[df['age'] <= 50] followed by (df_age50['maxhr'] > 130).mean().

Another way to calculate the above probability: P(maxhr > 130 | age <= 50) = P(maxhr > 130 and age <= 50)/P(age <= 50).
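
The two approaches should agree. A minimal sketch on hypothetical data (the same kind of made-up 'age' and 'maxhr' columns as in the text):

```python
import pandas as pd

# Hypothetical patient data, small enough to verify manually.
df = pd.DataFrame({
    "age":   [42, 45, 48, 50, 55, 58, 63, 67],
    "maxhr": [172, 128, 150, 141, 135, 120, 118, 145],
})

# Approach 1: restrict to the rows where the condition holds, then take the mean.
df_age50 = df[df["age"] <= 50]
p_direct = (df_age50["maxhr"] > 130).mean()

# Approach 2: joint probability divided by the probability of the condition.
p_joint = ((df["maxhr"] > 130) & (df["age"] <= 50)).mean()
p_ratio = p_joint / (df["age"] <= 50).mean()
```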

Correlation

To represent correlation we often use the Greek letter ρ.

Positive Correlation: When one variable is high the other tends to be high, and vice versa.

Negative Correlation: When one variable is high the other tends to be low, and vice versa.

Covariance: Covariance shows the joint variability of two variables. If their covariance is below zero they negatively correlate, if it is above zero they positively correlate, and if it is near zero they do not correlate. Unlike correlation, covariance is not limited to the range -1 to 1.

Covariance can be calculated by finding the mean of: (val1 - val1_mean) * (val2 - val2_mean).
We can find two variables' covariance with:

val1_centered = df['val1'] - df['val1'].mean()
val2_centered = df['val2'] - df['val2'].mean()
covariance = (val1_centered * val2_centered).mean()

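We can also use df[['val1', 'val2']].cov(). Note that pandas divides by n - 1 rather than n, so its result is slightly larger in magnitude than the mean-based calculation.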
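
The sketch below compares the two calculations on a tiny made-up dataset. It shows that the mean-based version divides by n while pandas' cov divides by n - 1.

```python
import pandas as pd

# Hypothetical data with means 3 and 4, so the arithmetic is easy to follow.
df = pd.DataFrame({"val1": [1, 2, 3, 4, 5], "val2": [2, 4, 5, 4, 5]})
n = len(df)

val1_centered = df["val1"] - df["val1"].mean()
val2_centered = df["val2"] - df["val2"].mean()

cov_population = (val1_centered * val2_centered).mean()          # divides by n
cov_sample = (val1_centered * val2_centered).sum() / (n - 1)     # divides by n - 1
cov_pandas = df[["val1", "val2"]].cov().loc["val1", "val2"]      # matches cov_sample
```

For large datasets the difference between the two is negligible.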

Correlation: Two variables' Pearson Correlation Coefficient is the normalized version of their covariance (between -1 and 1).

Correlation is calculated with the formula: Covariance/(σₓ*σᵧ)
We can find it using:

covariance = (val1_centered * val2_centered).mean()
btm = df['val1'].std(ddof=0) * df['val2'].std(ddof=0)  # ddof=0 divides by n, matching the .mean() above
covariance / btm

We can also use df[['val1', 'val2']].corr().
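
As a quick check that the manual formula and corr() agree, here is a sketch on made-up data. The standard deviations use ddof=0 so that they divide by n, just as the mean-based covariance does.

```python
import pandas as pd

# Hypothetical data; any two numeric columns would do.
df = pd.DataFrame({"val1": [1, 2, 3, 4, 5], "val2": [2, 4, 5, 4, 5]})

val1_centered = df["val1"] - df["val1"].mean()
val2_centered = df["val2"] - df["val2"].mean()

covariance = (val1_centered * val2_centered).mean()
btm = df["val1"].std(ddof=0) * df["val2"].std(ddof=0)
corr_manual = covariance / btm

corr_pandas = df[["val1", "val2"]].corr().loc["val1", "val2"]
```

Mixing a covariance that divides by n with standard deviations that divide by n - 1 would give a slightly smaller value than corr().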

Variance: A variable's variance is its covariance with itself. It can be found with (val_centered * val_centered).mean(). The matrix returned by the pandas cov function contains each variable's variance on its diagonal and the covariances everywhere else.

Scatterplots

We can create a scatterplot using df.plot.scatter(x='column1', y='column2').

Overplotting

Scatterplots can often have many values that overlap, making the graph difficult to read. This is called overplotting, and it can happen when the dataset is large, many points share similar values, or the data is discrete. We can reduce overplotting by changing the size, color, transparency, or shape of the points. We can also change the scale of one of the axes or plot a sample of the data.
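
A minimal sketch of three of these fixes (smaller points, transparency, and sampling) on synthetic data. The column names and the 10% sample fraction are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Synthetic data: 5,000 rows with a discrete x variable, which overplots badly.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"val1": rng.integers(0, 20, size=n)})
df["val2"] = df["val1"] + rng.normal(scale=3, size=n)

# Plot a 10% random sample with small (s=8), semi-transparent (alpha=0.3) points.
subset = df.sample(frac=0.1, random_state=0)
ax = subset.plot.scatter(x="val1", y="val2", s=8, alpha=0.3)
```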

Pairwise Scatterplots

Pairwise Scatterplots show multiple scatterplots between multiple variables, somewhat like a crosstab. We can create a pairwise scatterplot using the seaborn function seaborn.pairplot.

Scatterplot with Regression Line

We can make a scatterplot with a regression line, a line that helps see correlation, using the seaborn function seaborn.regplot. A function call of regplot could look like: seaborn.regplot(x='val1', y='val2', data=df, scatter_kws={'s':25}, line_kws={'color':'dimgrey'}).
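
By default, regplot fits a straight line to the data. To show what that line represents, the sketch below computes a least-squares line with numpy.polyfit instead of seaborn; this is a stand-in for illustration, not what regplot calls internally, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a mild positive relationship.
df = pd.DataFrame({"val1": [1, 2, 3, 4, 5], "val2": [2, 4, 5, 4, 5]})

# Least-squares fit of val2 = slope * val1 + intercept.
slope, intercept = np.polyfit(df["val1"], df["val2"], deg=1)

# Points on the fitted line, one per observed val1.
fitted = slope * df["val1"] + intercept
```

A positive slope corresponds to a positive correlation and a negative slope to a negative one.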

Contour Plots

We can create contour plots using seaborn.kdeplot, such as seaborn.kdeplot(data=df, x='val1', y='val2'). We can overlay this with a scatterplot by calling the scatterplot function above first.

Probability Mass Function

A probability mass function (PMF) gives the probability of each possible value of a discrete variable. It can be drawn as a bar plot whose bar heights are those probabilities. For the graph to be correct, all probabilities must sum to 1, which we can get by normalizing the counts.

We can create a PMF using the same functions we would use to create a histogram. For example:

probs = pd.Series({0: 0.5, 1: 0.25, 2: 0.15, 3: 0.10}) or probs = df['column'].value_counts(normalize=True)
probs.plot.bar()

Calculating PMF Mean

We can calculate the mean, also called the expectation or expected value, of a PMF by using weighted averages. Using probs from the example above, we can get its mean by:

0.5*0 + 0.25*1 + 0.15*2 + 0.1*3 = 0.85
(probs.index * probs.values).sum() = 0.85

Calculating PMF Variance
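
A minimal sketch, assuming the usual definition of variance as the probability-weighted average of squared deviations from the mean, and reusing the probs values from the example above:

```python
import pandas as pd

probs = pd.Series({0: 0.5, 1: 0.25, 2: 0.15, 3: 0.10})

values = probs.index.to_numpy()
weights = probs.to_numpy()

mean = (values * weights).sum()                      # expected value of the PMF
variance = (((values - mean) ** 2) * weights).sum()  # weighted squared deviations
std = variance ** 0.5
```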

Cumulative Distribution Function

Cumulative Distribution Functions (CDFs) are an alternative to PMFs that show the same data in a different way: for each value x, a CDF gives the probability that the variable is less than or equal to x. We can create a CDF with:

cumulative = probs.cumsum()
cumulative.plot.bar()
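
As a short sketch of how to read values off a CDF, using the same probs as the PMF example:

```python
import pandas as pd

probs = pd.Series({0: 0.5, 1: 0.25, 2: 0.15, 3: 0.10})
cumulative = probs.cumsum()

p_at_most_1 = cumulative[1]          # P(X <= 1)
p_more_than_1 = 1 - cumulative[1]    # P(X > 1), the complement
```

The last value of a CDF is always 1, since it covers every possible outcome.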

Random Variables

While variables have a value, random variables are objects that have a distribution. For example, a random variable could be the sum of the results when rolling two dice. Random variables are often referred to by their distribution: X ~ N(μ, σ). This statement means that X is a random variable that follows a normal distribution with parameters μ and σ.

SciPy stats objects are useful for representing random variables because they have an underlying distribution and mean and rvs functions for getting the expectation and random samples. For example, we can create a stats object with:

mean = df['column'].mean()
std = df['column'].std()
rv = stats.norm(loc=mean, scale=std)
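
Once the object exists, it can be queried like a random variable. The following sketch uses made-up values for 'column' and assumes SciPy is installed:

```python
import pandas as pd
from scipy import stats

# Hypothetical data with a mean of exactly 55.
df = pd.DataFrame({"column": [48.0, 52.0, 55.0, 59.0, 61.0]})

mean = df["column"].mean()
std = df["column"].std()
rv = stats.norm(loc=mean, scale=std)

expectation = rv.mean()                    # the expectation, equal to loc
samples = rv.rvs(size=5, random_state=0)   # five random draws
p_below_mean = rv.cdf(mean)                # half the distribution lies below the mean
```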

Bar Plots

Bar plots look similar to Histograms, and are useful for showing the distribution of a discrete variable. We can create a bar plot using df['column'].value_counts().plot.bar().

We can create a horizontal bar plot using barh() instead of bar(), and we can rotate the x-axis labels using the rot parameter (e.g. rot=0).
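
A small sketch of both orientations side by side, with a made-up categorical column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical discrete variable.
df = pd.DataFrame({"column": ["a", "b", "a", "c", "a", "b"]})
counts = df["column"].value_counts()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
counts.plot.bar(ax=ax1, rot=0)   # vertical bars with upright x-axis labels
counts.plot.barh(ax=ax2)         # horizontal bars
```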
