Brief Introduction to Python


Outline

Why Python

Python is a general-purpose scripting language used for many purposes:

  • System administration
  • Web development
  • Game development
  • Multimedia and natural language processing
  • Scientific computing
  • Data science and statistics
  • Machine learning, …

Python is increasingly popular in virtually every area of the scientific community. Economists have been rapidly adopting Python for research and teaching, and it is no stranger to young PhD students and researchers.

The ecosystem of Python developers and users is huge; it is safe to say that Python is the language for everything.

Python was named the programming language of the year for 2020.

Python trends

Jupyter notebook

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include:

  • Live code
  • Interactive widgets
  • Plots
  • Narrative text
  • Equations
  • Images
  • Video

These documents provide a complete and self-contained record of a computation that can be converted to various formats and shared with others using email, Dropbox, version control systems (like git/GitHub) or nbviewer.jupyter.org.

The Jupyter Notebook combines three components:

  • The notebook web application: An interactive web application for writing and running code interactively and authoring notebook documents.
  • Kernels: Separate processes started by the notebook web application that run users’ code in a given language and return output back to the notebook web application.
  • Notebook documents: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative text, equations, images, and rich media representations of objects. Each notebook document has its own kernel.

Notebooks consist of a sequence of cells. There are three basic cell types:

  • Code cells: Input and output of live code that is run in the kernel
  • Markdown cells: Narrative text with embedded LaTeX equations
  • Raw cells: Unformatted text that is included, without modification, when notebooks are converted to different formats using nbconvert

Internally, notebook documents are JSON data with binary values base64 encoded. This allows them to be read and manipulated programmatically by any programming language. Because JSON is a text format, notebook documents are version control friendly.
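
Because a notebook document is plain JSON, it can be inspected programmatically with a few lines of Python. A minimal sketch using the standard json module; the file name example.ipynb is only an illustration:

import json

# load a notebook document and list its cells by type (assumes example.ipynb exists)
with open('example.ipynb', encoding='utf-8') as f:
    nb = json.load(f)

for cell in nb['cells']:
    print(cell['cell_type'])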

Notebooks can be exported to different static formats including HTML, reStructuredText, LaTeX, PDF, and slide shows (reveal.js) using Jupyter’s nbconvert utility.

Furthermore, any notebook document available from a public URL or on GitHub can be shared via nbviewer. This service loads the notebook document from the URL and renders it as a static web page. The resulting web page may thus be shared with others without their needing to install the Jupyter Notebook.

Through Jupyter’s kernel and messaging architecture, the Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook. Each kernel is capable of running code in a single programming language, and kernels are available for many languages, including Python, R, and Julia.

The default kernel runs Python code. The notebook provides a simple way for users to pick which of these kernels is used for a given notebook.

Each of these kernels communicates with the notebook web application and web browser using a JSON over ZeroMQ/WebSockets message protocol that is described in the Jupyter messaging documentation. Most users don’t need to know about these details, but it helps to understand that “kernels run code.”

Python starter

Python is general-purpose. It can serve as a simple calculator to start with, and it also powers the large production systems run by corporations.

Data types

  • boolean
  • int, float, complex
  • strings
  • None
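
A quick look at these basic types, using the built-in type() function:

x = True            # boolean
n = 42              # int
y = 3.14            # float
z = 2 + 3j          # complex
s = "hello"         # string
nothing = None      # None marks the absence of a value
for value in (x, n, y, z, s, nothing):
    print(value, type(value))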

Operators

  • mathematical
  • logical
  • bitwise
  • membership
  • identity
  • assignment and in-place operators
  • operator precedence

Collections

  • Sequence containers - list, tuple
  • Set and mapping containers - set, dict
  • The collections module

Functions and methods

  • Anatomy of a function
  • Docstrings
  • Class methods

Control flow

  • if and the ternary operator
  • Checking conditions - what evaluates as true/false?
  • if-elif-else
  • while
  • break, continue
  • pass
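
A small sketch of these pieces together: if-elif-else, the ternary expression, and which values count as false:

temperature = 12
if temperature > 25:
    label = "hot"
elif temperature > 15:
    label = "mild"
else:
    label = "cold"
print(label)                                                  # cold

sign = "non-negative" if temperature >= 0 else "negative"     # ternary expression
print(sign)                                                   # non-negative

# zero, empty containers and None all evaluate as False
for value in [0, "", [], None, 7]:
    print(repr(value), bool(value))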

Loops and comprehensions

  • for, range, enumerate
  • lazy and eager evaluation
  • list, set, dict comprehensions
  • generator expression
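
List comprehensions appear again below; dict and set comprehensions follow the same pattern, while a generator expression is lazy and yields items only when asked:

squares_list = [x**2 for x in range(5)]        # eager: builds the whole list
squares_set = {x**2 for x in range(5)}         # set comprehension
squares_dict = {x: x**2 for x in range(5)}     # dict comprehension
squares_gen = (x**2 for x in range(5))         # lazy generator expression
print(squares_list, squares_set, squares_dict)
print(sum(squares_gen))                        # items produced on demand: 30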

Packages and namespace

  • Modules (file)
  • Package (hierarchical modules)
  • Namespace and naming conflicts
  • Using import
  • Batteries included
3 + 4      # Python as a calculator, to start with
7

Arithmetic Operators

The arithmetic operators in Python are:

Operator Description
+ addition
- subtraction
* multiplication
/ division
** exponentiation
% remainder (or modulo)
// integer division
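
Each operator in a short example:

print(7 + 2, 7 - 2, 7 * 2)   # 9 5 14
print(7 / 2)                 # 3.5  (division always returns a float)
print(7 // 2)                # 3    (integer division)
print(7 % 2)                 # 1    (remainder)
print(7 ** 2)                # 49   (exponentiation)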

Notice that division of integers always returns a float, as in the example above; // performs integer (floor) division. Longer arithmetic expressions can be split across lines by wrapping them in parentheses:

long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)
print("long_winded_computation=",long_winded_computation)
long_winded_computation= 210

Comparison Operators

Comparison operators produce Boolean values as output. For example, if we have variables x and y with numeric values, we can evaluate the expression x < y and the result is a boolean value either True or False.

Comparison Operator Description
< strictly less than
<= less than or equal
> strictly greater than
>= greater than or equal
== equal
!= not equal

For example:
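
x, y = 3, 5
print(x < y, x <= y)    # True True
print(x > y, x >= y)    # False False
print(x == y, x != y)   # False True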

Boolean Operators

We combine logical expressions using boolean operators and, or and not.

Boolean Operator Description
A and B returns True if both A and B are True
A or B returns True if either A or B is True
not A returns True if A is False

For example:

math_is_scary = False
print(math_is_scary)
False
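
The boolean operators themselves combine conditions like this:

a, b = True, False
print(a and b)                 # False
print(a or b)                  # True
print(not a)                   # False
print((3 < 5) and (5 < 10))    # True: both comparisons hold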
# Note the indentation
names = ["Bob", "Alice", "Zack", "Tyler"]
for x in names:
    print(x)
Bob
Alice
Zack
Tyler

Reserved Words

Summarized below are the reserved words in Python 3. Python will raise a syntax error if you try to assign a value to any of these keywords, so you must avoid them as variable names.

False class finally is return
None continue for lambda try
True def from nonlocal while
and del global not with
as elif if or yield
assert else import pass break
except in raise
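
The up-to-date list of reserved words is always available from the standard library's keyword module:

import keyword
print(keyword.kwlist)    # all reserved words in the running Python version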

Built-in Function Names

There are several functions which are included in the standard Python library. Do not use the names of these functions as variable names; otherwise the reference to the built-in function will be lost. For example, do not use sum, min, max, list or sorted as a variable name. See the Python documentation for the full list of built-in functions.
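
A quick illustration of what goes wrong when a built-in name is shadowed, and how del restores it:

sum = 10                     # shadows the built-in sum()
try:
    sum([1, 2, 3])           # fails: an int is not callable
except TypeError as e:
    print("TypeError:", e)
del sum                      # remove the shadowing variable
print(sum([1, 2, 3]))        # the built-in works again: 6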

"""Anonymous functions are handy"""
fib = [0,1,1,2,3,5,8,13,21,34,55]
results = list(filter(lambda x: x % 2==0, fib))
print('results:', results)
results: [0, 2, 8, 34]

Loops

for Loops

A for loop allows us to execute a block of code multiple times with some parameters updated each time through the loop. A for loop begins with the for statement:

iterable = [1,2,3]
for item in iterable:
    # code block indented 4 spaces
    print(item)
1
2
3

while Loops

What if we want to execute a block of code multiple times but we don’t know exactly how many times? We can’t easily write a for loop because that requires setting the number of iterations in advance. This is a situation where a while loop is useful.

The following example illustrates a while loop:

n = 5
while n > 0:
    print(n)
    n = n - 1
5
4
3
2
1

Sequences

The main sequence types in Python are lists, tuples and range objects. The main differences between these sequence types are:

  • Lists are mutable and their elements are usually homogeneous (things of the same type making a list of similar objects)
  • Tuples are immutable and their elements are usually heterogeneous (things of different types making a tuple describing a single structure)
  • Range objects are efficient sequences of integers (commonly used in for loops), use a small amount of memory and yield items only when needed

Lists

Create a list using square brackets [ ... ] with items separated by commas. For example, create a list of Fibonacci numbers, assign it to a variable, and use a list comprehension to select the even entries:

fib = [0,1,1,2,3,5,8,13,21,34,55]
results2 = [x for x in fib if x % 2 == 0]
for x in results2:
    print(x)
0
2
8
34

Tuples

Tuples are similar to lists but are immutable, i.e., elements in a tuple cannot be changed.

listx = [3,4]
tupley = (3,4)
listx[1]=6
print("listx",listx)
try:
    tupley[1] = 0
except TypeError:
    print("cannot modify a tuple")
listx [3, 6]
cannot modify a tuple

Dictionaries

grades = {"Joel": 80, "Tim": 95}   # a dictionary maps keys to values
x = grades.keys()                  # view of the keys
print("x", x)
y = grades.values()                # view of the values
print("y", y)
x dict_keys(['Joel', 'Tim'])
y dict_values([80, 95])
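
Values are looked up by key, new keys can be added on the fly, and .get() returns a default when a key is missing:

print(grades["Joel"])               # look up a value by key: 80
grades["Kate"] = 88                 # add a new key-value pair
print(grades.get("Bob", 0))         # 0: default for a missing key
for name, score in grades.items():  # iterate over key-value pairs
    print(name, score)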

Functions

A function is a named block of code that takes inputs, returns results, and can be used repeatedly.
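
A minimal sketch of the anatomy of a function: the def statement, a docstring, a default argument, and a return value (net_income is just an illustrative name):

def net_income(gross, tax_rate=0.25):
    """Return income after a proportional tax; tax_rate defaults to 25%."""
    return gross * (1.0 - tax_rate)

print(net_income(1000))                   # 750.0, using the default rate
print(net_income(1000, tax_rate=0.4))     # 600.0

A more substantial example follows: solving a consumer's problem analytically and coding the solution.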

"""
This code solves a consumer's utility maximization problem. U(C, l) is Cobb-Douglas, where l is leisure and C is consumption.
The first-order conditions were worked out by hand and then translated into code.
"""
import numpy as np

# utility function
def utilityfun(l,C,shareC, sharel, sigma):
    tmplc = (l**sharel) * (C**shareC)
    return (tmplc**(1.0-sigma) - 1.0)/(1.0-sigma)
# optimal choices

def optimalChoice(sharel,shareC,wage,hmax,profit,tax):
    lstar = (sharel/(sharel+shareC)) *(wage*hmax + profit - tax) / wage
    Cstar = lstar * wage * shareC / sharel
    return lstar,Cstar

# parameters
shareC = 0.65
sharel = 1.0 - shareC
sigma = 1.4

wage, profit, tax = 0.5, 1.8, 0.5
hmax = 5.5
#---optimal choices: C/l = wage * shareC/sharel
lstar, Cstar=optimalChoice(sharel,shareC,wage,hmax,profit,tax)
# maximized utility under wage
util0 = utilityfun(lstar,Cstar,shareC, sharel, sigma)

print('Optimal C=', Cstar, "optimal l=", lstar)
print('Maximized utility level=', util0)
Optimal C= 2.6325 optimal l= 2.8349999999999995
Maximized utility level= 0.8200869606862031

Built-in Functions

The standard Python library has a collection of built-in functions ready for us to use. We have already seen a few of these functions in previous sections such as type(), print() and sum(). The following is a list of built-in functions that we’ll use most often:

Function Description
print(object) print object to output
type(object) return the type of object
abs(x) return the absolute value of x (or modulus if x is complex)
int(x) return the integer constructed from float x by truncating decimal
len(sequence) return the length of the sequence
sum(sequence) return the sum of the entries of sequence
max(sequence) return the maximum value in sequence
min(sequence) return the minimum value in sequence
range(a,b,step) return the range object of integers from a to b (exclusive) by step
list(sequence) return a list constructed from sequence
sorted(sequence) return the sorted list from the items in sequence
reversed(sequence) return the reversed iterator object from the items in sequence
enumerate(sequence) return the enumerate object constructed from sequence
zip(a,b) return an iterator that aggregates items from sequences a and b
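
A few of these built-ins in action:

data = [3, 1, 4, 1, 5]
print(len(data), sum(data), min(data), max(data))   # 5 14 1 5
print(sorted(data))                                 # [1, 1, 3, 4, 5]
print(list(range(0, 10, 2)))                        # [0, 2, 4, 6, 8]
for i, value in enumerate(["a", "b", "c"]):         # index-value pairs
    print(i, value)
print(list(zip([1, 2, 3], ["x", "y", "z"])))        # [(1, 'x'), (2, 'y'), (3, 'z')]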

Python for scientific computing

Python is one of the core languages of scientific computing. In fact, it is the modules (libraries) available for Python that have become essential tools of the field.

Many of these modules wrap numerical libraries written in C and Fortran.

Its popularity in economics has been rising rapidly; one collaborative effort is quantecon.org.

Here we take a bird’s-eye view of the Python modules for scientific computing.

In Python, a module is a file that contains Python definitions and statements. We import a module when we need to use it.

The NumPy library is fundamental to many other modules. It provides matrix and array manipulation and operations.

Scipy is a comprehensive numerical library.

Pandas is a package for data processing and analysis; it is popular, largely because there are no good substitutes. Its DataFrame is modeled on R’s data frame.

Matplotlib is a basic (but rich) library for visualization. Complements include Bokeh, Plotly, etc.

Statsmodels is a library for statistical estimation and inference.

Scikit-learn is a Python machine learning package, mostly for supervised learning. There are other machine learning libraries as well, such as TensorFlow for neural networks.

Numpy

import numpy as np                     # convention to name numpy as np in your codes

x = np.array([[1,2,3],[4,5,6]])
print ('x=',x)
for xi in x:
    print('xi=',xi)
y = x.T
print('y=',y)
x= [[1 2 3]
 [4 5 6]]
xi= [1 2 3]
xi= [4 5 6]
y= [[1 4]
 [2 5]
 [3 6]]
a = np.linspace(-np.pi, np.pi, 10)    # Create even grid from -π to π
print('a=', a)
a= [-3.14159265 -2.44346095 -1.74532925 -1.04719755 -0.34906585  0.34906585
  1.04719755  1.74532925  2.44346095  3.14159265]
np.fromfunction(lambda x, y: x*3 + y + 1, (2,3))
array([[1., 2., 3.],
       [4., 5., 6.]])

Solving a system of linear equations:

$$\begin{align*} 3 x_0 + 2 x_1 - x_2 &= 9 \\ x_0 - 2 x_1 + 0.5 x_2 &= 2 \\ 5 x_0 + 0.2 x_1 - 2 x_2 &= 10 \end{align*}$$

In matrix form, $A x = b$.

A = np.array([[3, 2, -1], [1,-2, 0.5], [5,0.2,-2]])
b = np.array([9,2,10])
x = np.linalg.solve(A, b)
print('x=',x)
x= [3.11428571 1.28571429 2.91428571]

Scipy

Scipy is built on top of Numpy; it includes modules for function interpolation, integration, optimization, linear algebra, Fourier transforms, eigenvalues, and more. See the SciPy documentation for a complete list.

Example: calculate $ \int_{-1.2}^2 \phi(x)\, dx $ where $ \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal density function.

from scipy.stats import norm
from scipy.integrate import quad
phi = norm()
y, error = quad(phi.pdf, -1.2, 2)  # integrate numerically with adaptive quadrature
print('y=',y)
y= 0.8621801978301126

Matplotlib

Given data in Numpy arrays, Matplotlib plots 2D or 3D figures and can also create animations.

The first thing to look at is the anatomy of a Matplotlib figure.

Example: Standard normal distribution.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn
seaborn.set(style='ticks')

x = np.linspace(norm.ppf(0.0000001), norm.ppf(0.9999999), 120)
pdfx = norm.pdf(x)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 6), sharey='all')
ax.plot(x, pdfx, color="green", linewidth=1.6, linestyle="-")
# shade the upper-tail area where x > 1.96
ax.fill_betweenx(pdfx, x, x2=1.96, where=x > 1.96, interpolate=True)
plt.axvline(x=0.0, color='black', linewidth=1.4, linestyle='--')
plt.text(2.0, .1, '\n$\\alpha$=shaded area', fontsize=16, horizontalalignment='left')
plt.autoscale(enable=True, axis='x', tight=True)
ax.set_title('Standard normal distribution', fontsize=20)
ax.tick_params(axis='y', labelsize=16)
ax.tick_params(axis='x', labelsize=16)
ax.set_xlim([-4, 4])
ax.set_ylim([0.0, 0.4])
plt.xlabel("z", fontsize=16)
ax.set_ylabel(r'$\phi(z)$', fontsize=14)
#ax.set_aspect('equal')
#fig.tight_layout()

[Figure: Standard normal distribution with the upper tail beyond 1.96 shaded]

Pandas

Pandas is not especially fast, efficient, flexible, or well designed, and I find its syntax confusing at times.

But Pandas has been improving.

For large data sets, consider https://dask.org.

# Example 1: use pandas to create time series
import pandas as pd
np.random.seed(1234)
data = np.random.randn(5, 2)  # 5x2 matrix of N(0, 1) random draws
dates = pd.date_range('2010-12-28', periods=5)   # ISO date format avoids ambiguous day/month parsing
df = pd.DataFrame(data, columns=('price', 'weight'), index=dates)
print(df)
               price    weight
2010-12-28  0.471435 -1.190976
2010-12-29  1.432707 -0.312652
2010-12-30 -0.720589  0.887163
2010-12-31  0.859588 -0.636524
2011-01-01  0.015696 -2.242685


df.mean()
price     0.411768
weight   -0.699135
dtype: float64
# Example 2: read Canadian labor market data from a CSV file
import pandas as pd
cansimid = '14100287'
filename = 'tbl14100287Final3.csv'
lfs = pd.read_csv(filename, index_col=0)
lfs.index = pd.to_datetime(lfs.index)
lfsQtr = lfs.resample('QS-OCT').mean()
lfsQtr = lfsQtr.round(2)
lfsQtr

emplBothSex emplFemale emplFullBothSex emplFullFemale emplFullMale emplMale emplPartBothSex emplPartFemale emplPartMale emplRateBothSex ... participRateMale populationBothSex populationFemale populationMale unemplBothSex unemplFemale unemplMale unemplRateBothSex unemplRateFemale unemplRateMale
ref_date
1976-01-01 9666.90 3566.33 8469.30 2735.10 5734.20 6100.53 1197.57 831.23 366.30 57.23 ... 77.93 16891.70 8538.53 8353.20 718.17 307.77 410.40 6.93 7.93 6.30
1976-04-01 9737.53 3604.93 8529.63 2756.80 5772.83 6132.60 1207.87 848.10 359.73 57.27 ... 77.73 17008.10 8599.70 8408.40 718.13 312.23 405.90 6.87 7.97 6.20
1976-07-01 9778.40 3635.07 8562.97 2779.30 5783.70 6143.37 1215.43 855.83 359.67 57.10 ... 77.53 17121.43 8659.10 8462.33 753.40 336.07 417.33 7.17 8.43 6.37
1976-10-01 9824.70 3673.90 8568.27 2790.33 5777.90 6150.80 1256.43 883.60 372.87 57.07 ... 77.53 17210.97 8705.67 8505.33 785.97 338.47 447.50 7.43 8.43 6.77
1977-01-01 9869.73 3700.57 8605.30 2810.80 5794.50 6169.17 1264.43 889.73 374.67 57.03 ... 77.70 17300.70 8752.77 8548.00 829.93 358.83 471.10 7.77 8.83 7.10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-07-01 19102.03 9081.50 15495.80 6783.90 8711.87 10020.57 3606.27 2297.57 1308.70 62.00 ... 70.17 30806.87 15611.33 15195.57 1140.47 503.03 637.43 5.63 5.27 5.97
2019-10-01 19124.53 9102.93 15521.23 6787.10 8734.17 10021.60 3603.27 2315.83 1287.43 61.83 ... 69.90 30931.17 15671.30 15259.87 1154.90 509.90 645.00 5.70 5.33 6.07
2020-01-01 18842.40 8899.60 15438.27 6697.90 8740.33 9942.83 3404.13 2201.70 1202.47 60.70 ... 69.23 31032.37 15719.97 15312.40 1268.40 607.10 661.30 6.30 6.43 6.23
2020-04-01 16695.60 7783.43 13971.77 6042.47 7929.30 8912.17 2723.80 1740.97 982.87 53.67 ... 66.47 31118.57 15760.20 15358.33 2496.70 1199.43 1297.27 13.00 13.37 12.73
2020-07-01 18135.83 8564.80 14692.03 6352.70 8339.33 9571.07 3443.80 2212.10 1231.70 58.13 ... 69.47 31196.97 15796.80 15400.17 2021.03 895.67 1125.37 10.03 9.47 10.50

179 rows × 27 columns

#--unemployment rate, raw
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 8), sharey='all')
ax.plot(lfsQtr['unemplRateBothSex'],color="blue", linewidth=2.5, linestyle="-",label='Unemployment rate')
plt.autoscale(enable=True, axis='x', tight=True)
ax.set_title('Unemployment rate, Canada', fontsize=18)
ax.tick_params(axis='x', labelsize=14)
plt.xlabel("\nLast date: 2020 July. Data source: Statistics Canada.", fontsize=14)
ax.set_ylabel('Percentage', fontsize=16)
fig.tight_layout()
filename = 'UnemploymentRate_Canada.png'
plt.savefig(filename, dpi=200, format='png')
plt.show()

[Figure: Unemployment rate, Canada]

Statsmodels

We read in Mroz’s (1987) data on wages from the PSID, then run an OLS estimation of the wage equation.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

mroz = pd.read_csv("mroz1987.csv")
mroz

inlf hours kidslt6 kidsge6 age educ wage repwage hushrs husage ... faminc mtr motheduc fatheduc unem city exper nwifeinc lwage expersq
0 1 1610 1 0 32 12 3.3540 2.65 2708 34 ... 16310 0.7215 12 7 5.0 0 14 10.910060 1.210154 196
1 1 1656 0 2 30 12 1.3889 2.65 2310 30 ... 21800 0.6615 7 7 11.0 1 5 19.499980 0.328512 25
2 1 1980 1 3 35 12 4.5455 4.04 3072 40 ... 21040 0.6915 12 7 5.0 0 15 12.039910 1.514138 225
3 1 456 0 3 34 12 1.0965 3.25 1920 53 ... 7300 0.7815 7 7 5.0 0 6 6.799996 0.092123 36
4 1 1568 1 2 31 14 4.5918 3.60 2000 32 ... 27300 0.6215 12 14 9.5 1 7 20.100060 1.524272 49
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
748 0 0 0 2 40 13 0.0000 0.00 3020 43 ... 28200 0.6215 10 10 9.5 1 5 28.200000 NaN 25
749 0 0 2 3 31 12 0.0000 0.00 2056 33 ... 10000 0.7715 12 12 7.5 0 14 10.000000 NaN 196
750 0 0 0 0 43 12 0.0000 0.00 2383 43 ... 9952 0.7515 10 3 7.5 0 4 9.952000 NaN 16
751 0 0 0 0 60 12 0.0000 0.00 1705 55 ... 24984 0.6215 12 12 14.0 1 15 24.984000 NaN 225
752 0 0 0 3 39 9 0.0000 0.00 3120 48 ... 28363 0.6915 7 7 11.0 1 12 28.363000 NaN 144

753 rows × 22 columns

# OLS estimation of the wage equation for husbands
mroz['ones'] = 1.0
X = mroz.loc[:,['ones','huseduc']]
Y = np.log(mroz['huswage'])
#Y = mroz4['huswage']

model = sm.OLS(Y, X)
results = model.fit()
print(" ")
print(results.summary())

# prediction: results.fittedvalues
Yhat = results.params['ones'] + results.params['huseduc']*X['huseduc']

Yhat2 = results.params['ones']*0.7 + results.params['huseduc']*1.4*X['huseduc']
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                huswage   R-squared:                       0.155
Model:                            OLS   Adj. R-squared:                  0.154
Method:                 Least Squares   F-statistic:                     138.3
Date:                Thu, 12 Jan 2023   Prob (F-statistic):           2.03e-29
Time:                        09:20:08   Log-Likelihood:                -598.39
No. Observations:                 753   AIC:                             1201.
Df Residuals:                     751   BIC:                             1210.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
ones           0.9105      0.083     10.943      0.000       0.747       1.074
huseduc        0.0761      0.006     11.758      0.000       0.063       0.089
==============================================================================
Omnibus:                      144.321   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              371.275
Skew:                          -0.986   Prob(JB):                     2.39e-81
Kurtosis:                       5.819   Cond. No.                         55.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Networks and Graphs

Disclaimer: this subsection is adapted from quantecon.org.

Python has many libraries for studying graphs.

One well-known example is NetworkX. Its features include, among many other things:

  • standard graph algorithms for analyzing networks
  • plotting routines

Here’s some example code that generates and plots a random graph, with node color determined by shortest path length from a central node.

%matplotlib inline
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10,6)
np.random.seed(1234)

# Generate a random graph
p = dict((i, (np.random.uniform(0, 1), np.random.uniform(0, 1)))
         for i in range(200))
g = nx.random_geometric_graph(200, 0.12, pos=p)
pos = nx.get_node_attributes(g, 'pos')

# Find node nearest the center point (0.5, 0.5)
dists = [(x - 0.5)**2 + (y - 0.5)**2 for x, y in list(pos.values())]
ncenter = np.argmin(dists)

# Plot graph, coloring by path length from central node
p = nx.single_source_shortest_path_length(g, ncenter)
plt.figure()
nx.draw_networkx_edges(g, pos, alpha=0.4)
nx.draw_networkx_nodes(g,
                       pos,
                       nodelist=list(p.keys()),
                       node_size=120, alpha=0.5,
                       node_color=list(p.values()),
                       cmap=plt.cm.jet_r)
plt.show()

[Figure: Random geometric graph, nodes colored by shortest path length from the central node]
