
CHPC/BBML – Python Workshops

Malcolm Tobias
mtobias@wustl.edu
(314) 362-1594

Xing Huang
x.huang@wustl.edu

http://chpc2.wustl.edu
http://chpc.wustl.edu

Maze Ndonwi
ndonwimaze@wustl.edu

Marcy Vana
vanam@wustl.edu

https://becker.wustl.edu/services/science-informatics-support

Aditi Gupta
agupta24@wustl.edu

Madhurima Kaushal
kaushalm@wustl.edu

https://informatics.wustl.edu

Introduction to Python #1 – Getting Started with Python

Introduction to Python #2 – Using Python for Data Analysis

Introduction to Python #2 – Using Python for Data Analysis

Goals
- Learn to write Python code to perform data analysis and visualization
- Learn to run Python code in the virtual environment set up on the CHPC
Topics covered in Intro to Python #1

• Variables and Python data types
• Python lists
• Numpy arrays
• Matplotlib for basic data visualization
• Jupyter Notebook
Topics to be covered today

• Conditions and Loops
• Functions and Methods
• Packages for Data Analysis (Numpy, Pandas, and Matplotlib)
• Python Virtual Environment on the CHPC


Conditions and loops

Python uses relational and logical operators in conditions and loops to compare objects and to combine the results of those comparisons.
Booleans
• Relational Operators

<    strictly less than
<=   less than or equal
>    strictly greater than
>=   greater than or equal
==   equal
!=   not equal

>>> 4 < 5
True
>>> 6 <= 3
False
>>> 10.7 > 8.2
True
>>> 12 >= 12.0
True
>>> 3/5 == 0.6
True
>>> 2**4 != 16
False
Booleans
• Logical Operators

and
>>> True and True
True
>>> False and True
False
>>> True and False
False
>>> False and False
False

or
>>> True or True
True
>>> False or True
True
>>> True or False
True
>>> False or False
False

not
>>> not True
False
>>> not False
True
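Relational and logical operators are usually combined in a single condition. The sketch below is illustrative only: the names score and attended and their values are assumptions, not part of the workshop data.

```python
# Hypothetical values chosen for illustration
score = 85
attended = True

# Two comparisons joined with "and"
is_b = score >= 80 and score < 90

# Python also allows the same test as a chained comparison
is_b_chained = 80 <= score < 90

# "not" and "or" can be mixed in; "not" binds tighter than "and", "and" tighter than "or"
passes = score >= 60 and attended
needs_review = not passes or score < 70

print(is_b, is_b_chained, passes, needs_review)
```

Parentheses can always be added to make the intended grouping explicit.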
Conditional Statements
if-elif-else

if condition1:
    expression 1
elif condition2:
    expression 2
elif condition3:
    expression 3
……
elif condition n-1:
    expression n-1
else:
    expression n

Example:

if score >= 90:
    letter = 'A'
elif score >= 80:
    letter = 'B'
elif score >= 70:
    letter = 'C'
elif score >= 60:
    letter = 'D'
else:
    letter = 'F'
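The grading example can be wrapped in a function so it can be tried with several scores. This is a minimal sketch; the function name letter_grade is hypothetical, and the cutoffs are the ones from the slide.

```python
def letter_grade(score):
    """Return a letter grade using the 90/80/70/60 cutoffs."""
    # Conditions are tested from top to bottom; the first true one wins
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

for s in [95, 80, 79.9, 60, 12]:
    print(s, letter_grade(s))
```

Because the branches are checked in order, a score of 95 never reaches the score >= 80 test.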
Looping
while

while condition1:
    expression 1
    expression 2
    ……
    expression n

Example:

lines = list()
print('Enter lines of text.')
print('Enter an empty line to quit.')

line = input('Next line: ')
while line != '':
    lines.append(line)
    line = input('Next line: ')

print('Your lines were:')
print(lines)
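The same while pattern can be tried without typing anything by reading from a list. In this sketch the list typed is a hypothetical stand-in for keyboard input, with an empty string marking the end.

```python
# Hypothetical stand-in for what a user might type; '' ends the input
typed = ['first line', 'second line', '', 'never read']

lines = []
position = 0
line = typed[position]

# Keep looping until the empty "line" is reached
while line != '':
    lines.append(line)
    position += 1
    line = typed[position]

print('Your lines were:')
print(lines)
```

The loop stops at the empty string, so the last entry of typed is never appended.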
Looping
for … in …

for iterating_variable in sequence:
    expression 1
    ……
    expression n

Example:

height = [74, 70, 63, 69, 67, 71, 64, 66, 71]
total = 0.0    # named total so that the built-in sum() is not overwritten

for i in range(len(height)):
    total += height[i]

avg = total/len(height)
print('The average height is:')
print(avg)
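A for loop can also walk over the list items directly, without indexing. The sketch below shows that form and uses enumerate() when the position is needed as well; the variable names are illustrative.

```python
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

# Iterate over the values themselves instead of range(len(height))
total = 0
for h in height:
    total += h
avg = total / len(height)

# enumerate() yields (index, value) pairs when the position matters
first_tallest = None
for i, h in enumerate(height):
    if h == max(height):
        first_tallest = i
        break

print('The average height is:', avg)
print('Index of the tallest person:', first_tallest)
```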
Built-in Functions

• pieces of reusable code


• solve particular tasks
• call a function instead of writing your own code

We have seen print() and type(), what else?


https://docs.python.org/3/library/functions.html
Built-in Functions
• max(), min(), sum(), len()
• Let's use the height list from Intro to Python #1
• height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

>>> max(height)    # return the largest item in an object
74
>>> min(height)    # return the smallest item in an object
63
>>> sum(height)    # sum the items of an object from left to right and return the total
615
>>> len(height)    # return the length (the number of items) of an object
9
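Built-in functions can be combined to replace hand-written loops. As a short sketch, the average, range, and median of the height list can be computed with sum(), len(), max(), min(), sorted(), and round(), all of which are standard built-ins.

```python
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

# Average without writing a loop
average = sum(height) / len(height)

# Spread between the tallest and shortest person
height_range = max(height) - min(height)

# sorted() returns a new sorted list and leaves height unchanged
ordered = sorted(height)
median = ordered[len(ordered) // 2]   # middle item; the list has an odd length

print(round(average, 2), height_range, median)
```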
Define your own Functions
define your function

def function_name([argv]):
    body

call your function

function_name([argv])

Note that the function must be defined first before it can be called.

Example:

PI = 3.14159265358979

def circleArea(radius):
    return PI*radius*radius

def circleCircumference(radius):
    return 2*PI*radius

def main():
    print('circle area with radius 5:', circleArea(5))
    print('circumference with radius 5:', circleCircumference(5))

main()
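A variation on the circle example, offered as a sketch: it takes pi from the standard math module instead of a hand-typed constant and returns two values from one function. The names circle_area and circle_stats are hypothetical.

```python
import math

def circle_area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

def circle_stats(radius):
    """Return (area, circumference) as a tuple."""
    return circle_area(radius), 2 * math.pi * radius

# The tuple can be unpacked into two variables at the call site
area, circumference = circle_stats(5)
print('circle area with radius 5:', area)
print('circumference with radius 5:', circumference)
```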
Methods
• Objects: everything in Python is an object, including data and functions
• Objects have methods associated with them, depending on their type
• Methods: functions that are called on objects
• Syntax: object.method(parameters)
• String Methods
• List Methods
Methods
• String Methods

find()     shows the location of the 1st occurrence of the searched string

>>> s = 'Mississippi'
>>> s.find('si')
3
>>> s.find('sa')    # returns -1 when the string is not found
-1

split()

>>> s = 'Mississippi'
>>> s.split('i')
['M', 'ss', 'ss', 'pp', '']
>>> s.split()       # no white space
['Mississippi']
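String methods are often chained together when cleaning up text data. The sketch below uses a hypothetical comma-separated record to show split() and find() alongside a few other standard string methods (join(), count(), upper(), replace()).

```python
# Hypothetical record, similar to one line of a CSV file
record = 'US,United States,326474013'

fields = record.split(',')        # break the line at each comma
code = fields[0]
population = int(fields[2])       # convert the text to a number
first_comma = record.find(',')    # position of the first comma

# join() is the reverse of split()
joined = ' | '.join(fields)

s = 'Mississippi'
print(fields, first_comma, joined)
print(s.count('ss'), s.upper(), s.replace('ss', 'SS'))
```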
Methods
• List Methods

index()

>>> fam = ['emma', 1.68, 'mom', 1.71, 'dad', 1.89]
>>> fam.index('mom')
2

count()

>>> fam.count(1.68)
1

append()

>>> fam.append(1.78)
>>> fam
['emma', 1.68, 'mom', 1.71, 'dad', 1.89, 1.78]
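List methods combine naturally with slicing. As an illustration, the sketch below uses index() to look up the height stored right after a name, and slices with a step of 2 to separate names from heights; the variable names are assumptions.

```python
fam = ['emma', 1.68, 'mom', 1.71, 'dad', 1.89]

# Each height sits one position after its name
dad_height = fam[fam.index('dad') + 1]

# Every second item, starting at 0 (names) or at 1 (heights)
names = fam[0::2]
heights = fam[1::2]

# Name of the tallest family member
tallest = names[heights.index(max(heights))]

# sort() changes the list in place
heights.sort(reverse=True)

print(dad_height, names, heights, tallest)
```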
Packages
• A collection of Python scripts
• Thousands of them are available on the internet
• Packages for data science

Numpy: arrays

Pandas: dataframes

Matplotlib: data visualization


Numpy: Basic Statistics
np_city: data containing the height and weight of 5000 people

>>> import numpy as np
>>> np_city
array([[ 1.69, 83.24],
       [ 1.38, 58.78],
       [ 1.89, 85.14],
       ...,
       [ 1.75, 66.55],
       [ 1.61, 54.46],
       [ 1.86, 95.69]])
>>> np.mean(np_city[:,0])
1.7188
>>> np.median(np_city[:,1])
63.43
>>> np.std(np_city[:,0])
0.1719
Numpy: Basic Statistics
Generate data

>>> import numpy as np
>>> height = np.round(np.random.normal(1.72, 0.17, 5000), 2)
>>> weight = np.round(np.random.normal(63.45, 18, 5000), 2)
>>> np_city = np.column_stack((height, weight))

The three arguments of np.random.normal() are the average value, the standard deviation, and the number of data points.
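The generation and statistics steps can be put together in one script. This sketch assumes a NumPy version that provides np.random.default_rng, and the seed value 0 is an arbitrary choice made here so that reruns give the same numbers; the exact statistics printed will differ from those on the slide.

```python
import numpy as np

# Seeded generator (an assumption) so the random data is reproducible
rng = np.random.default_rng(0)

# mean, standard deviation, number of data points
height = np.round(rng.normal(1.72, 0.17, 5000), 2)
weight = np.round(rng.normal(63.45, 18, 5000), 2)

# Stack the two 1-D arrays as the columns of a 5000 x 2 array
np_city = np.column_stack((height, weight))

print(np_city.shape)
print('mean height:  ', np.mean(np_city[:, 0]))
print('median weight:', np.median(np_city[:, 1]))
print('std of height:', np.std(np_city[:, 0]))
```

The sample statistics should land close to the values passed to normal(), but not exactly on them.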
Pandas
Handle data of different types (for example, CSV files)

data.csv
      country          population       area          capital
US    United States    326,474,013      9,144,930     Washington, DC
RU    Russia           143,375,006      16,292,614    Moscow
IN    India            1,342,512,706    2,973,450     New Delhi
CH    China            1,388,232,693    9,386,293     Beijing
BR    Brazil           211,243,220      8,349,534     Brasilia

Column labels: country, population, area, capital (first row)
Row labels: US, RU, IN, CH, BR (first column)
Pandas

>>> import pandas as pd
>>> data = pd.read_csv("path_to_data.csv")
>>> data
  Unnamed: 0        country  population      area         capital
0         US  United States   326474013   9144930  Washington, DC
1         RU         Russia   143375006  16292614          Moscow
2         IN          India  1342512706   2973450       New Delhi
3         CH          China  1388232693   9386293         Beijing
4         BR         Brazil   211243220   8349534        Brasilia
Pandas

>>> data = pd.read_csv("path_to_data.csv", index_col = 0)
>>> data
          country  population      area         capital
US  United States   326474013   9144930  Washington, DC
RU         Russia   143375006  16292614          Moscow
IN          India  1342512706   2973450       New Delhi
CH          China  1388232693   9386293         Beijing
BR         Brazil   211243220   8349534        Brasilia
Pandas
Column Access

>>> data[["country"]]
          country
US  United States
RU         Russia
IN          India
CH          China
BR         Brazil

This output results from specifying index_col=0. For column access this is optional; without it, the rows are labeled with the default integers 0 to 4 instead of the country codes.

Row Access

>>> data.loc[["BR"]]
   country  population     area   capital
BR  Brazil   211243220  8349534  Brasilia

This output also results from specifying index_col=0. For row access by label this is required; otherwise Python raises an exception because the label "BR" is not in the index.
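The read and access steps can be tried without a file on disk by reading the CSV text from memory. In this sketch, io.StringIO stands in for path_to_data.csv, and the CSV text is an assumed reconstruction of the table shown earlier (the capital of the US is quoted because it contains a comma).

```python
import io
import pandas as pd

# Assumed contents of data.csv; the first column holds the row labels
csv_text = ''',country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
RU,Russia,143375006,16292614,Moscow
IN,India,1342512706,2973450,New Delhi
CH,China,1388232693,9386293,Beijing
BR,Brazil,211243220,8349534,Brasilia
'''

data = pd.read_csv(io.StringIO(csv_text), index_col=0)

col_frame = data[["country"]]   # double brackets: a DataFrame with one column
col_series = data["country"]    # single brackets: a Series
row_frame = data.loc[["BR"]]    # one-row DataFrame, selected by row label

print(col_frame)
print(row_frame)
```

Double brackets keep the tabular DataFrame form, while single brackets return a one-dimensional Series.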
Pandas
Add Column

>>> data["on_earth"] = [True, True, True, True, True]
>>> data
          country  population      area         capital  on_earth
US  United States   326474013   9144930  Washington, DC      True
RU         Russia   143375006  16292614          Moscow      True
IN          India  1342512706   2973450       New Delhi      True
CH          China  1388232693   9386293         Beijing      True
BR         Brazil   211243220   8349534        Brasilia      True
Pandas
Add Column

>>> data["density"] = data["population"] / data["area"]
>>> data
          country  population      area         capital  density
US  United States   326474013   9144930  Washington, DC       36
RU         Russia   143375006  16292614          Moscow        9
IN          India  1342512706   2973450       New Delhi      452
CH          China  1388232693   9386293         Beijing      148
BR         Brazil   211243220   8349534        Brasilia       25

The density values are shown rounded to whole numbers; the division itself produces floating-point results.
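The density column is computed element by element from two existing columns. The sketch below rebuilds a small version of the table from a dictionary (an assumption made so the example runs without data.csv) and rounds the result to match the whole numbers on the slide.

```python
import pandas as pd

# Population and area values copied from the table on the slides
data = pd.DataFrame(
    {
        "population": [326474013, 143375006, 1342512706, 1388232693, 211243220],
        "area": [9144930, 16292614, 2973450, 9386293, 8349534],
    },
    index=["US", "RU", "IN", "CH", "BR"],
)

# Column arithmetic works row by row, with no explicit loop
data["density"] = data["population"] / data["area"]

# The raw result is floating point; round it for display
data["density_rounded"] = data["density"].round().astype(int)

print(data)
```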
Pandas
Element Access

>>> data.loc["US", "capital"]
'Washington, DC'

>>> data["capital"].loc["US"]
'Washington, DC'

>>> data.loc["US"]["capital"]
'Washington, DC'
Matplotlib
Advanced features

fig, ax = plt.subplots(m, n, sharex = True, sharey = True)

ax – user-defined name for this set of plots (can be anything)
m – number of rows of plots
n – number of columns of plots
sharex = True, share the x axis
sharey = True, share the y axis
Matplotlib
Advanced features
1. add axis labels
ax.set_xlabel('abc')
ax.set_xlabel('angle', fontsize=16)
2. specify axis limits
ax.set_xlim([a,b])
ax.set_xlim([0,10])
3. specify axis tick values
ax.set_xticks([a, b, c, d, e])
ax.set_xticks([-2,-1,0,1,2])
Matplotlib
Advanced features

4. specify axis tick positions
ax.xaxis.set_ticks_position('bottom'/'top')
ax.yaxis.set_ticks_position('left'/'right')
ax.xaxis.set_ticks_position('bottom')
5. specify font size for tick label
ax.yaxis.set_tick_params()
ax.yaxis.set_tick_params(labelsize=15)
6. add plot title
ax.set_title('abc')
ax.set_title('House price', fontsize = 12)
Matplotlib
Advanced features
7. change line color and line width
8. add customized legend

ax.plot(x, y, color='XX', lw=XX, label='XXX')
ax.legend()

ax.plot(data[:,0], data[:,1], color='#1980FF', lw=2, label='New York')

ax.legend(loc = (0.03,0.77), prop=dict(size=11), frameon=True, framealpha=.2, handlelength=1.25, handleheight=0.5, labelspacing=0.25, handletextpad=0.8, ncol=1, numpoints=1, columnspacing=0.5)

color – customize the color, either by specifying a name, e.g., "red", or a hexadecimal color code, e.g., #FF0000. To look up the hexadecimal code for a specific color, go to: https://color.adobe.com/create/color-wheel/
lw – specify the line width in units of pt.
loc – customize the position of the legend relative to the entire plot
handlelength – specify the length of the legend handle
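The calls listed on the Matplotlib slides can be combined into one short script. This is a sketch under stated assumptions: the sine and cosine data are made up for illustration, and the Agg backend is selected on the assumption that no display is available (for example on a cluster login node).

```python
import matplotlib
matplotlib.use('Agg')   # assumption: render without a display
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data, not from the workshop
x = np.linspace(0, 10, 50)

# 1 row and 2 columns of plots that share the y axis
fig, axes = plt.subplots(1, 2, sharey=True)

axes[0].plot(x, np.sin(x), color='#1980FF', lw=2, label='sin')
axes[1].plot(x, np.cos(x), color='red', lw=2, label='cos')

for ax in axes:
    ax.set_xlabel('angle', fontsize=16)
    ax.set_xlim([0, 10])
    ax.set_xticks([0, 2, 4, 6, 8, 10])
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_tick_params(labelsize=15)
    ax.legend(loc=(0.03, 0.77), prop=dict(size=11), frameon=True)

axes[0].set_title('sine', fontsize=12)
axes[1].set_title('cosine', fontsize=12)

# fig.savefig('example.png')   # hypothetical file name; uncomment to write the figure
```

With more than one plot, plt.subplots returns an array of axes, so each plot is addressed by its index.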
Using Python on CHPC
Set up Python virtual environment
http://login02.chpc.wustl.edu/wiki119/index.php/Python

1. Manually add Anaconda to your path:

[xhuang@login01 ~]$ export PATH=/act/Anaconda3-2.3.0/bin:${PATH}

2. Create an environment with:

[xhuang@login01 ~]$ conda create --name <name_of_env> python=3

where <name_of_env> can be whatever you want to call it. You can also use python=2, depending on which version of Python you want to use.
Using Python on CHPC
Set up Python virtual environment

3. Activate this environment using:

[xhuang@login01 ~]$ source activate <name_of_env>

You can now install any package you'd like with:

(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> <name_of_package>

Besides being more flexible, this installation method won't interfere with software modules.
Using Python on CHPC
Set up Python virtual environment

4. Install the numpy package:

(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> numpy

5. Install the matplotlib package:

(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> matplotlib

6. Install the pandas package:

(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> pandas
Using Python on CHPC
Once the virtual environment is set up, you can run Python scripts on the CHPC

1. Write the script:
(py_vir_env)[xhuang@login02 ~]$ vi TEST.py
#!/usr/bin/python
base = 2.5
height = 3
area = base * height / 2
print(area)

2. Make it executable:
(py_vir_env)[xhuang@login02 ~]$ chmod +x TEST.py

3. Run the script:
(py_vir_env)[xhuang@login02 ~]$ ./TEST.py
3.75
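To confirm that a script is really using the virtual environment, it can report which interpreter is running it and whether the installed packages are visible. The sketch below uses only the standard library. Note that a script whose first line is #!/usr/bin/python may be run by the system interpreter rather than the environment's; #!/usr/bin/env python is a common alternative, offered here as a suggestion rather than something the CHPC instructions state.

```python
import importlib.util
import sys

# Path of the interpreter running this script and the root of its environment
print('interpreter:', sys.executable)
print('environment prefix:', sys.prefix)
print('Python version: %d.%d' % (sys.version_info.major, sys.version_info.minor))

# find_spec() returns None when a package cannot be imported
available = {}
for name in ['numpy', 'pandas', 'matplotlib']:
    available[name] = importlib.util.find_spec(name) is not None
    print(name, 'installed' if available[name] else 'NOT installed')
```

Inside an activated environment, the interpreter path would be expected to point into that environment's directory rather than to /usr/bin.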
