Scientific Visualization with Python

A super quick crash course

Authors:¶

Stephanie Wollherr (@swollherr)

This notebook introduces some basic plotting examples using matplotlib.

#we need the following packages, always execute this cell at the beginning

#plots inside the notebook
%matplotlib inline

#you can use these libraries by refering to their appreviation plt., np. or pd.
#basic plotting library
import matplotlib.pyplot as plt

#scientifc computing library
import numpy as np

#data analysis tool
import pandas as pd

Reading in your data¶

To handle data in python we first have to read it in.

Simple ascii/txt/csv files.¶

loadtxt by numpy¶

#read in a seismogram
#4 columns with 
#time North-South East-West Up-Down
time, ns, ew, ud = np.loadtxt('data/station_1.dat').T
print(time)
print(ns)

[1.490116e-08 2.500002e-02 5.000002e-02 ... 1.599000e+02 1.599250e+02
 1.599500e+02]
[0.00794364 0.00824447 0.00853784 ... 0.00865288 0.00865299 0.00865258]

usefull parameters for loadtxt:

comments : The characters or list of characters used to indicate the start of a comment; default: ‘#’.
skiprows : Skip the first skiprows lines; default: 0.
usecols : Which columns to read, with 0 being the first. For example, usecols = (1,4,5) will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read.

#in action: we only need the time series and the North-South component
time, ns = np.loadtxt('data/station_1.dat', usecols=(0,1)).T
print(time)
print(ns)

[1.490116e-08 2.500002e-02 5.000002e-02 ... 1.599000e+02 1.599250e+02
 1.599500e+02]
[0.00794364 0.00824447 0.00853784 ... 0.00865288 0.00865299 0.00865258]

read_csv by pandas¶

much faster than loadtxt, in particular for large files

#read it in without any specifications
data_pd = pd.read_csv('data/station_1.dat')
print (data_pd)

     # Broadband (3D) simulated ground motion for 1992 Landers EQ
0                                                    #           
1                                        # Station: luc          
2                            #    longitude: -116.61250          
3                               #    latitude: 34.56900          
4             #    closest distance to fault (km): 1.42          
...                                                 ...          
6409  1.598500e+02\t8.652060e-03\t3.179460e-02\t-9.2...          
6410  1.598750e+02\t8.652000e-03\t3.179530e-02\t-9.2...          
6411  1.599000e+02\t8.652880e-03\t3.179580e-02\t-9.2...          
6412  1.599250e+02\t8.652990e-03\t3.179550e-02\t-9.2...          
6413  1.599500e+02\t8.652580e-03\t3.179440e-02\t-9.2...          

[6414 rows x 1 columns]

It reads in everything!

So we have to set a couple of parameters. Possible parameters that we can set:

sep : string; delimiter to use; default ‘,’
header : integer or list of integers; row number(s) to use as the column names and the start of the data
names : List of column names to use. If file contains no header row, then you should explicitly pass header=None; array-like; default None.
usecols : Return a subset of the columns; array-like or callable; default None.
skiprows : Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file; list-like or integer or callable; default None.

... and many more: full list available at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Let's try this again:

#files uses tab-seperation, so we'll let read_csv know
#names can be used for header names, header can also be read in but now we're skipping them with comment='#'
data_pd = pd.read_csv('data/station_1.dat', sep = '\t', comment='#', names=['time','NS','EW', 'UD'])
print (data_pd)

              time        NS        EW        UD
0     1.490116e-08  0.007944 -0.000750 -0.015390
1     2.500002e-02  0.008244 -0.000600 -0.015707
2     5.000002e-02  0.008538 -0.000434 -0.016018
3     7.500002e-02  0.008838 -0.000254 -0.016323
4     1.000000e-01  0.009148 -0.000065 -0.016622
...            ...       ...       ...       ...
6394  1.598500e+02  0.008652  0.031795 -0.000928
6395  1.598750e+02  0.008652  0.031795 -0.000928
6396  1.599000e+02  0.008653  0.031796 -0.000928
6397  1.599250e+02  0.008653  0.031795 -0.000928
6398  1.599500e+02  0.008653  0.031794 -0.000928

[6399 rows x 4 columns]

Now we can nicely refer to our data with:

print('time')
print (data_pd['time'])

time
0       1.490116e-08
1       2.500002e-02
2       5.000002e-02
3       7.500002e-02
4       1.000000e-01
            ...     
6394    1.598500e+02
6395    1.598750e+02
6396    1.599000e+02
6397    1.599250e+02
6398    1.599500e+02
Name: time, Length: 6399, dtype: float64

Other data formats¶

There are other possibilties to read in your data when it's not in ascii format.

For example, we will later use netcdf-based data formats in the second tutorial.

More packages for handling data input:

netcdf4 package (Network Common Data Form)
read in hdf with panda (Hierarchical Data Format)

What kind of data format are you using?

Plotting your data - Simple Plots¶

Matplotlib is a widely used plotting library that we will use here for our first simple plots.

Single Figures
Axis labels, titles
Subplots
Styles
Scatter plots
Other types of plots
How to save your plots

1. Single figures¶

Call the plotting function together with its package that we called 'plt'.

The plot gets visible by inserting plt.show() at the end.

plt.plot(data_pd['time'], data_pd['NS'])
plt.show()

2. Labels and titles¶

Let's insert more information!

label the axis
insert a legend
give a title

#default label is the name of the array, but you can also label it with a own name

plt.plot(data_pd['time'], data_pd['NS'])
#plt.plot(data_pd['time'],data_pd['NS'], label='North-South')

plt.legend()

#axis labels
plt.xlabel('time [s]')
plt.ylabel('acceleration cm/s²')

#title
plt.title('Station LUC')

plt.show()

No handles with labels found to put in legend.

3. Subplots¶

We now want to plot all three components of the seismogram in one single plot.

Subplots are structured as follows:

We will now introduce three axis in one figure.

#create a figure f and three subplots
f, (ax1, ax2, ax3) = plt.subplots(3, 1)

ax1.plot(data_pd['time'], data_pd['NS'])
#for axis properties we use set_*
ax1.set_xlabel('time [s]')
ax1.set_ylabel('acceleration cm/s²')
ax1.legend()

ax2.plot(data_pd['time'], data_pd['EW'])
ax2.set_xlabel('time [s]')
ax2.set_ylabel('acceleration cm/s²')
ax2.legend()

ax3.plot(data_pd['time'], data_pd['UD'])
ax3.set_xlabel('time [s]')
ax3.set_ylabel('acceleration cm/s²')
ax3.legend()


#plot title over all subplots
f.suptitle('station LUC', size=20)

#needs to be shifted when tight_layout is used
f.subplots_adjust(top=0.5)
#makes all axis labels visible
f.tight_layout()

f.show()

No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
/Users/lion/miniconda3/envs/seismo_live/lib/python3.7/site-packages/ipykernel_launcher.py:29: UserWarning: Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.

Beautify the plot!

plots can share axis, plots can even share boxes
focus on the area of interest

#share x axis
f, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, sharey=True)

ax1.plot(data_pd['time'], data_pd['NS'])
ax1.set_ylabel('acc. cm/s²')
ax1.legend()

ax2.plot(data_pd['time'], data_pd['EW'])
ax2.set_ylabel('acc. cm/s²')
ax2.legend()

ax3.plot(data_pd['time'], data_pd['UD'])
ax3.set_ylabel('acc. cm/s²')
ax3.set_xlabel('time')
ax3.legend()

#share the box
plt.subplots_adjust(hspace=0)

#reduce number of ticks
plt.locator_params(axis='both', numtick=8)

#focus on the time where the signal is
plt.xlim((10.0,40.0))

#plot title over all subplots
f.suptitle('station LUC', size=20)


plt.show()

No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
/Users/lion/miniconda3/envs/seismo_live/lib/python3.7/site-packages/ipykernel_launcher.py:21: MatplotlibDeprecationWarning: MaxNLocator.set_params got an unexpected parameter: numtick

4. Styles¶

You can use different style option to render your plot.

Full documentation available here: https://matplotlib.org/devdocs/gallery/style_sheets/style_sheets_reference.html

print(plt.style.available)

['seaborn-dark', 'seaborn-darkgrid', 'seaborn-ticks', 'fivethirtyeight', 'seaborn-whitegrid', 'classic', '_classic_test', 'fast', 'seaborn-talk', 'seaborn-dark-palette', 'seaborn-bright', 'seaborn-pastel', 'grayscale', 'seaborn-notebook', 'ggplot', 'seaborn-colorblind', 'seaborn-muted', 'seaborn', 'Solarize_Light2', 'seaborn-paper', 'bmh', 'tableau-colorblind10', 'seaborn-white', 'dark_background', 'seaborn-poster', 'seaborn-deep']

# choose a style from above
plt.style.use('default')

#you can also combine different styles
#plt.style.use(('fivethirtyeight', 'seaborn-white', 'seaborn-pastel'))

f, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, sharey=True)

ax1.plot(data_pd['time'],data_pd['NS'])
ax1.set_ylabel('acc. cm/s²')

ax2.plot(data_pd['time'],data_pd['EW'])
ax2.set_ylabel('acc. cm/s²')

ax3.plot(data_pd['time'],data_pd['UD'], label='data1')
ax3.set_ylabel('acc. cm/s²')
ax3.set_xlabel('time')
ax3.legend()


#reduce number of ticks
plt.locator_params(axis='both',numtick=8)

#focus on the time where the signal is
plt.xlim((10.0,40.0))

#plot title over all subplots
f.suptitle('station LUC', size=20)


plt.show()

/Users/lion/miniconda3/envs/seismo_live/lib/python3.7/site-packages/ipykernel_launcher.py:22: MatplotlibDeprecationWarning: MaxNLocator.set_params got an unexpected parameter: numtick

or you can change the line style and colors.

#read in another dataset
data2_pd = pd.read_csv('data/station_2.dat', sep = '\t', comment='#', names=['time','NS','EW', 'UD'])

f, (ax1, ax2, ax3) = plt.subplots(3,1, sharex=True, sharey=True)

ax1.plot(data_pd['time'], data_pd['NS'])
ax1.plot(data2_pd['time'], data2_pd['NS'], ls='-.')
ax1.set_ylabel('acc. cm/s²')

ax2.plot(data_pd['time'], data_pd['EW'])
ax2.plot(data2_pd['time'], data2_pd['EW'], ls='--')
ax2.set_ylabel('acc. cm/s²')

ax3.plot(data_pd['time'], data_pd['UD'], label='station 1')
ax3.plot(data2_pd['time'], data2_pd['UD'], ls=':', label='station 2')
ax3.set_ylabel('acc. cm/s²')
ax3.set_xlabel('time')
ax3.legend()


#reduce number of ticks
plt.locator_params(axis='both',numtick=8)

#focus on the time where the signal is
plt.xlim((10.0,40.0))

#plot title over all subplots
f.suptitle('Landers earthquake', size=20)


plt.show()

/Users/lion/miniconda3/envs/seismo_live/lib/python3.7/site-packages/ipykernel_launcher.py:19: MatplotlibDeprecationWarning: MaxNLocator.set_params got an unexpected parameter: numtick

5. Scatter Plots¶

Using scatter plots we can show discrete data, for example measurements in dependence of two (or even three) dimensions.

Full documentation: https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.scatter.html

# For this example we first create some random input.
# You can use your own dataset by reading in your data first
x = np.random.randn(100)
y = x + np.random.randn(100) + 10

#-------
# Insert linear fit

#from scipy import stats
#slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
#line = slope*x+intercept
#we can add some oppacity with alpha (0 to 1)
#plt.plot(x, line, alpha=0.5)

#-------
plt.xlabel('x')
plt.ylabel('y')
plt.title('Example Scatter Plot')

plt.scatter(x,y)


plt.show()

Additionally, we can use color and size as a third dimension of information in our scatter plot.

Available colorbars: https://matplotlib.org/users/colormaps.html

#create a third dimension, for example temperature
z = np.random.randn(100)*70

#plt.scatter(x,y, cmap='viridis', c=z)

#even better to spot the difference: 
#change the size according to the third data dimension
#colorbar can be restricted by vmin and vmax
plt.scatter(x, y, cmap='viridis', c=z, s=z, vmin=-40, vmax=140)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Example Scatter Plot')

plt.colorbar()

#set colorbar axis to customized range
#plt.set_cmap([0,150])

plt.show()

/Users/lion/miniconda3/envs/seismo_live/lib/python3.7/site-packages/matplotlib/collections.py:857: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

Additional options on markers and labels¶

You can customize your scatter plot by adding different markers for different datasets and labels.

Markerstyles can be found here: https://matplotlib.org/api/markers_api.html

#create a second data set
x_2 = np.random.randn(100)
y_2 = x_2 + np.random.randn(100) + 10
z_2 = np.random.randn(100)*70

#shades of red with z=color itensity
plt.scatter(x, y, cmap='Reds', c=z, label='dataset 1')

#shades of blue with z=color intensity
plt.scatter(x_2, y_2, cmap='Blues', c=z_2, marker ='v', label='dataset 2')


plt.xlabel('x')
plt.ylabel('y')
plt.legend()

plt.title('Difference day and night')
plt.show()

3D scatter plots¶

You can also use a 3D figure to plot your three dimensional data set.

# we need the following package
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

#colored by the z values
ax.scatter(x, y, z, c=z)

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.title('3D scatter plot')
plt.show()

6. Other types of plots¶

Some additional graphics and inspiration.

Bar plots¶

#Generates
n = 12
#creates a array ranging from 0 to 11
X = np.arange(n)

#n random number, uniform distribution
Y1 = np.random.uniform(0.5, 1.0, n)
Y2 = np.random.uniform(0.5, 1.0, n)

#set a face and edge color
plt.bar(X, +Y1, facecolor='#9999ff')
plt.bar(X, -Y2, facecolor='#ff9999')

for x,y in zip(X,Y1):
    #annotations
    #shift the values slightly above the bars
    plt.text(x+0.1, y+0.05, '%.2f' % y, ha='center', va= 'bottom')

for x,y in zip(X,Y2):
    plt.text(x+0.1, -y-0.05, '%.2f' % y, ha='center', va= 'top')

plt.xlim(-.5, n)
#remove ticks on x axis
plt.xticks([])

plt.ylim(-1.25, +1.25)
#or fully remove the ticks/labels
plt.yticks([])

plt.title('Histogram example')

plt.show()

Seaborn¶

Seaborn is a package dedicated to statistical graphics.

More general information: http://seaborn.pydata.org/examples/

Jointplots combine scatter plots with distribution plots along the two axis.

More information about jointplots: https://seaborn.pydata.org/generated/seaborn.jointplot.html

#same random data set as before for the scatter plot
#this time with seaborn
import seaborn as sns

x_2 = np.random.randn(100)
y_2 = x_2 + np.random.randn(100) + 10

#data needs to put into this format
data = pd.DataFrame({"x": x_2, "y": y_2})

#scatter plot with distribution along x and y coordinate
#first plot: default
sns.jointplot(x=x_2, y=y_2, data=data)

#but you can also play around with the "kind" option
#second plot: hex + changed color
sns.jointplot(x=x_2, y=y_2, data=data, kind="reg", space=0, color="r")

#third plot: kde + changed color
sns.jointplot(x=x_2, y=y_2, data=data, kind="kde", space=0, color="g")

plt.show()

Some more inspiration

#Error bar plots

sns.set(style="ticks")

#Load the example tips dataset
tips = sns.load_dataset("tips")

#Draw a nested boxplot to show bills by day and sex
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips, palette="PRGn")

#This setting removes the borders to minimalize the figure
sns.despine(offset=10, trim=True)

plt.show()

# Heatmaps
sns.set()

# Load the example flights dataset and conver to long-form
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot("month", "year", "passengers")

# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(9, 6))

#main command
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax)
plt.title('Passengers')

plt.show()

7. How to save your plots¶

You can easily save your nice figures by using

plt.savefig('my_figure.png')

before calling

plt.show()

The resolution can be increase by increasing dpi values or saving in pdf or svg format.

plt.savefig('my_figure.png', dpi=800)
plt.savefig('my_figure.svg')