Helpful Seaborn Linear Regression Visualisations for Total Beginners

Nikol Holicka
5 min read · Jun 19, 2019

I was immediately attracted to Seaborn visualisations when I started learning about data science and coding in Python. Plots in Seaborn look smoother and sleeker by default, and the library also contains some very handy features that will save you code and time.

The first thing you need to do to make use of its features is to import it into your Jupyter Notebook.

import seaborn as sns
1. Countplots

A countplot is similar to a bar chart: it counts the occurrences of each value in a column for you and gives you insight into how your values are distributed. This is helpful for checking whether your data is roughly normally distributed or skewed.

sns.countplot(df['bedrooms'])

You can edit the design of your plot with some of these simple additional features.

sns.set(style="white")
sns.countplot(df['bedrooms'])
sns.despine()

Let’s break this simple code down line by line:

  1. Sets background as ‘white’
  2. Plots your column
  3. Removes the upper and right axes spines, which are not necessary in this case. With the additional arguments bottom=True, left=True you can also remove the remaining spines if you want to make your plot very simple (see the sketch below).
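
For instance, a fully stripped-down version of the same countplot could look like this (a minimal sketch reusing the 'bedrooms' column from above):

# Remove all four spines for a completely minimal look
sns.set(style="white")
sns.countplot(df['bedrooms'])
sns.despine(bottom=True, left=True)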

2. KDE (Kernel Density Estimate) Plots

KDE plots are really useful for visualising numerical variables, and one of their benefits is smoothing out noise in the data. Compared to a line chart, a KDE plot gives you a more general look at the distribution of your data: its vertical axis represents an estimate of how often each value occurs, i.e. its probability density.

sns.kdeplot(data=df.sqft_living)
sns.despine()

3. Jointplots

A jointplot in seaborn provides you with multiple visualisations at once, the main one of which is a scatter plot that reveals the relationship between two variables. On the sides, you can see the histograms for each variable. If you are working with a large data set, you can pass an additional argument alpha=0.1, which sets the transparency of the dots from 0 (invisible) to 1 (totally opaque).

Another helpful feature is the argument stat_func=pearsonr, which you can use if you import pearsonr from scipy.stats. This will display the Pearson's correlation coefficient and the p-value. The figure will tell you more about the relationship between your variables straight away! You can see the Pearson's correlation coefficient for the variables shown is 0.7, which suggests a strong positive correlation. The p-value is 0, which means we can safely assume that our evidence of the relationship between the variables is very strong.

Note that the p-value is never actually 0; it is most likely less than 0.0005 and simply rounded and reported as 0. There is always some chance that such results happened by accident. However, reporting a p-value as 0.0001 is not 'cool' among professional statisticians, which I learnt from here.

from scipy.stats import pearsonr
sns.jointplot(x='sqft_living', y='price', data=df, alpha=0.1, stat_func=pearsonr)
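
If your version of seaborn no longer accepts the stat_func argument, you can compute the same two numbers directly with scipy (a minimal sketch using the same columns):

from scipy.stats import pearsonr

# Pearson's correlation coefficient and the corresponding p-value
r, p = pearsonr(df['sqft_living'], df['price'])
print(f"Pearson r = {r:.2f}, p-value = {p:.4f}")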

Jointplots will also allow you to display the regression line if you use the argument kind=’reg’. The histograms on the side will turn into KDE plots, which I explained above.

sns.jointplot(x='sqft_living', y='price', data=df, kind='reg')

You can visualise a hexagonal heatmap if you pass kind='hex' as an argument. This will allow you to see areas with a high density of your variables. Alternatively, you can use kind='kde' to visualise the probability density.
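
For example (a quick sketch assuming the same df and columns as above):

# Hexbin variant: bins the points into hexagons so dense regions stand out
sns.jointplot(x='sqft_living', y='price', data=df, kind='hex')

# KDE variant: draws the joint probability density as smooth contours
sns.jointplot(x='sqft_living', y='price', data=df, kind='kde')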

sns.jointplot(x='sqft_living', y='price', data=df, kind='resid')

Last but not least, passing kind=’resid’ as an argument will show you the scatterplot of the residuals.

4. Visualising Multicollinearity

It is important to check whether any of your columns are multicollinear before you start building your model. Multicollinearity can negatively impact your model by weakening the precision of your final coefficients. You can find out more about multicollinearity here.

High multicollinearity among your columns is a reason to drop one of them from your data set, as they describe the same reality. As a self-proclaimed minimalist, I always appreciate a good reason to throw something out!
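
Once the heatmap below reveals a near-duplicate pair, dropping one of the columns is a pandas one-liner (the column name here is only a hypothetical example for illustration):

# Drop one column of a highly correlated pair before modelling
# ('sqft_above' is a hypothetical example of a column that overlaps with 'sqft_living')
df = df.drop(columns=['sqft_above'])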

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white")
corr = df.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.9, center=0, square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5});

Let's break down this code line by line (skipping the imports at the top):

  1. This sets the background of your figure. You can choose from white, dark, whitegrid, darkgrid (default) and ticks.
  2. This method calculates the correlation of columns and returns a dataframe with correlation values
  3. This line creates a Boolean array with the help of NumPy
  4. A mask is created that ‘covers’ the upper triangle of the correlation matrix. This way we are not presented with repetitive information, as the correlation matrix is symmetrical.
  5. Sets up a matplotlib figure.
  6. Sets up a diverging colour palette that will help you visualise values by colour intensity and spot multicollinear columns instantly. Seaborn uses the HUSL colour system
  7. Sets up our heat map! We pass our parameters, which we defined above.
  8. The ‘annot’ parameter is especially useful, as it writes the data value in each square. This way we can see the correlation coefficient.
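
As a follow-up (not part of the original code), you can also turn the same matrix into a ranked list of pairs, so you can see numerically which pairs the heatmap highlights:

# Keep only the lower triangle (the mask covers the upper one), flatten it to
# (column_a, column_b) pairs and rank them by absolute correlation
pairs = corr.where(~mask).stack()
print(pairs.abs().sort_values(ascending=False).head(10))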

There is more to seaborn beyond these four examples. If you want to find out more, check out this guide.
