NumPy and Pandas
NumPy
- NumPy (or Numpy) is a Linear Algebra Library for Python, the reason it is so important for Data Science with Python is that almost all of the libraries in the PyData Ecosystem rely on NumPy as one of their main building blocks.
- Numpy is also incredibly fast, as it has bindings to C libraries. For more info on why you would want to use Arrays instead of lists, check out this great [StackOverflow post](http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists).
Installation
It is highly recommended you install Python using the Anaconda distribution to make sure all underlying dependencies (such as Linear Algebra libraries) all sync up with the use of a conda install.
If you have Anaconda, install NumPy by:
conda install numpy
If you are not using Anaconda distribution:
pip install numpy
Then, to use it:
import numpy as np
arr = np.arange(0,10)
Arrays
Method/Operation | Description/Comments | Example | |||
---|---|---|---|---|---|
import numpy as np
| |||||
Methods for creating NumPy Arrays |
From a Python List |
array()
|
We can create an array by directly converting a list or list of lists. | my_list = [1,2,3]
| |
From Built-in NumPy Methods |
arange()
|
Return evenly spaced values within a given interval. | np.arange(0,10)
| ||
zeros()
|
Generate arrays of zeros. | np.zeros(3)
| |||
ones()
|
Generate arrays of ones. | np.ones(3)
| |||
linspace()
|
Return evenly spaced numbers over a specified interval. | np.linspace(0,10,3)
| |||
eye()
|
Creates an identity matrix. | np.linspace(0,10,50)
| |||
random
|
rand()
|
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1) .
|
np.random.rand(2)
np.random.rand(5,5)
# Another way to invoke a function:
from numpy.random import rand
# Then you can call the function directly
rand(5,5)
| ||
randn()
|
Return a sample (or samples) from the "standard normal" distribution. Unlike rand which is uniform. | np.random.randn(2)
| |||
randint()
|
Return random integers from low (inclusive) to high (exclusive).
|
np.random.randint(1,100)
| |||
seed()
|
sets the random seed of the NumPy pseudo-random number generator. It provides an essential input that enables NumPy to generate pseudo-random numbers for random processes. See s1 and s2. for explanation. | np.random.seed(101)
| |||
Others Array Attributes and Methods |
reshape()
|
Returns an array containing the same data with a new shape. | arr.reshape(5,5)
| ||
max() , min() , argmax() , argmin()
|
Finding max or min values. Or to find their index locations using argmin or argmax. | arr.max()
| |||
shape()
|
Shape is an attribute that arrays have (not a method). | NO LO ENTENDI.. REVISAR!
arr_length = arr2d.shape[1]
| |||
dtype()
|
You can also grab the data type of the object in the array. | arr.dtype
| |||
- | - | - | - | - | |
Indexing and Selection
|
Expand
Creating sample array for the following examples: | ||||
Bracket Indexing and Selection (Slicing) |
Note: When we create a sub-array slicing an array (slice_of_arr = arr[0:6]), data is not copied, it's a view of the original array! This avoids memory problems! To get a copy, need to use the method copy(). See important note below. | #Get a value at an index
arr[8]
#Get values in a range
arr[1:5]
slice_of_arr = arr[0:6]
#2D
arr_2d[1]
arr_2d[1][0]
arr_2d[1,0] # The same that above
#Shape (2,2) from top right corner
arr_2d[:2,1:]
#Output:
array([[10, 15],
[25, 30]])
#Shape bottom row
arr_2d[2,:]
| |||
Expand
Fancy Indexing: | |||||
Broadcasting
|
Setting a value with index range:
Numpy arrays differ from a normal Python list because of their ability to broadcast. |
arr[0:5]=100 #Show arr Output: array([100, 100, 100, 100, 100, 5, 6, 7, 8, 9, 10]) | |||
#Setting all the values of an Array
arr[:]=99 | |||||
Get a copy of an Array |
copy()
|
Note: When we create a sub-array slicing an array (slice_of_arr = arr[0:6]), data is not copied, it's a view of the original array! This avoids memory problems! To get a copy, need to use the method copy(). See important note below. | arr_copy = arr.copy() | ||
Important notes on Slices |
|||||
Using brackets for selection based on comparison operators and booleans |
|||||
- | - | - | - | - | |
Arithmetic operations |
arr + arr
|
Warning on division by zero, but not an error!
|
import numpy as np
arr = np.arange(0,10)
arr + arr
# Output:
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
arr**3
# Output:
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
| ||
Universal Array Functions |
np.sqrt(arr)
|
Taking Square Roots | np.sin(arr)
# Output:
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
-0.95892427, -0.2794155 , 0.6569866 , 0.98935825, 0.41211849])
| ||
np.exp(arr)
|
Calcualting exponential (e^) | ||||
np.max(arr)
same as |
Max | ||||
np.sin(arr)
|
Sin | ||||
np.log(arr)
|
Natural logarithm |
Pandas
You can think of pandas as an extremely powerful version of Excel, with a lot more features. In this section of the course, you should go through the notebooks in this order:
Series
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.
Method/Operator | Description/Comments | Example | ||
---|---|---|---|---|
import pandas as pd
| ||||
Creating Pandas Series
You can convert a |
From a List |
pd.Series(my_list)
|
# Creating some test data:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(data=my_list)
pd.Series(my_list)
pd.Series(arr)
# Output:
0 10
1 20
2 30
dtype: int64
pd.Series(data=my_list,index=labels)
pd.Series(my_list,labels)
pd.Series(arr,labels)
pd.Series(d)
# Output:
a 10
b 20
c 30
dtype: int64
| |
From a NumPy Array |
pd.Series(arr)
| |||
From a Dectionary |
pd.Series(d)
| |||
Data in a Series |
A pandas Series can hold a variety of object types. Even functions (although unlikely that you will use this)pd.Series(data=labels)
# Output:
0 a
1 b
2 c
dtype: object
# Holding «functions» into a Series
# Output:
pd.Series([sum,print,len])
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
| |||
Index in Series |
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
# Output:
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser1['USA']
# Output:
1
# Operations are then also done based off of index:
ser1 + ser2
# Output:
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
|
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
DataFrame Columns are just Series:
type(df['W'])
# Output:
pandas.core.series.Series
Method/
Operator |
Description/Comments | Example | ||
---|---|---|---|---|
Selection and Indexing
Let's learn the various methods to grab data from a DataFrame |
Standard systax |
df['']
|
# Pass a list of column names:
df[['W','Z']]
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
| |
SQL syntax(NOT RECOMMENDED!) |
df.W
|
|||
Selecting Rows |
df.loc['']
|
df.loc['A']
# Or select based off of position instead of label :
df.iloc[2]
# Output:
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
| ||
Selecting subset of rows and columns |
df.loc['','']
|
df.loc['B','Y']
# Output:
-0.84807698340363147
df.loc[['A','B'],['W','Y']]
# Output:
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
| ||
Conditional Selection |
Expand
An important feature of pandas is conditional selection using bracket notation, very similar to numpy: | |||
Creating a new column |
df['new'] = df['W'] + df['Y']
| |||
Removing Columns |
df.drop()
|
|||
Resetting the index |
Reset to default(0,1...n index) |
df.reset_index()
|
||
Setting index to something else |
df.set_index('')
|
|||
Multi-Indexed DataFrame |
Creating a Multi-Indexed DataFrame |
|||
Multi-Index and Index Hierarchy |
Missing Data
https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/
Pandas will recognise a value as null if it is a np.nan object, which will print as NaN in the DataFrame.
Let's show a few convenient methods to deal with Missing Data in pandas.
dropna()
method allows the user to analyze and drop Rows/Columns with Null values in different ways:
DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
fillna()
allows to fill Null fields with a given value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
# Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
'''By default, dropna() drop all the rows without Null values:'''
df.dropna()
df.dropna(axis=0) # Same as default
# Output:
A B C
0 1.0 5.0 1
'''If we want to display all the columns without Null values:'''
df.dropna(axis=1)
'''If we want to display all the rows that have at least 2 non-null values:'''
df.dropna(thresh=2)
# Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
'''Columns with at least 3 non-null values:'''
df.dropna(thresh=3)
# Output:
A B C
0 1.0 5.0 1
'''You can also use df.isnull() to check for Null values:
df.isnull()
# Output:
A B C
0 False False False
1 False True False
2 True True False
'''To fill null fields with a given value:'''
df.fillna(value='FILL VALUE')
# Output:
A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
'''But many times what we want to do is to replace these null fields with, for example, the «mean» of the columns. We can do it this way:'''
df['A'].fillna(value=df['A'].mean())
# Output:
0 1.0
1 2.0
2 1.5 # *
Name: A, dtype: float64
'''* The Null field has been filled with the mean of the column'''
GroupBy
The groupby method allows you to group rows of data together and call aggregate functions
Now you can use the .groupby() method to group rows together based on a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:
Method | Description/Example | |
---|---|---|
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
# Output:
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350
| ||
GroupBy
|
df.groupby('')
|
For instance let's group based off of Company:df.groupby('Company')
# Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2027fdd470>
'''You can save this object as a new variable:'''
by_comp = df.groupby("Company")
'''And then call aggregate methods off the object:'''
|
We can call aggregate methods on the
|
df.mean()
|
df.groupby('Company').mean()
by_comp.mean()
# Output:
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
|
df.std()
|
by_comp.std()
# Output:
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
| |
df.min() df.max()
|
by_comp.min()
# Output:
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
by_comp.max()
| |
df.count()
|
by_comp.count()
# Output:
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
| |
df.describe()
|
by_comp.describe()
# Output:
Sales
Company
FB count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
GOOG count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
MSFT count 2.000000
mean 232.000000
std 152.735065
min 124.000000
25% 178.000000
50% 232.000000
75% 286.000000
max 340.000000
by_comp.describe().transpose()
by_comp.describe().transpose()['GOOG']
# Output:
count mean std min 25% 50% 75% max
Sales 2.0 160.0 56.568542 120.0 140.0 160.0 180.0 200.0
|
Concatenation - Merging - Joining
Method | Description/Comments | Example | |
---|---|---|---|
Concatenation |
pd.concat()
|
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use |
Example:
import pandas as pd
pd.concat([df1,df2,df3])
pd.concat([df1,df2,df3],ignore_index=True)
pd.concat([df1,df2,df3],axis=1)
|
Merging |
pd.merge()
|
The |
Example 1:
import pandas as pd
pd.merge(left,right,how='inner',on='key')
|
Example 2:
import pandas as pd
pd.merge(left, right, on=['key1', 'key2'])
pd.merge(left, right, how='outer', on=['key1', 'key2'])
pd.merge(left, right, how='right', on=['key1', 'key2'])
pd.merge(left, right, how='left', on=['key1', 'key2'])
| |||
Joining |
df.join(df)
|
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. |
Example:
left.join(right)
left.join(right, how='outer')
|
Comparison with SQL
- https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html
- https://www.edureka.co/blog/sql-joins-types
Some operations
Method/Operator | Description/Comments | Example | ||
---|---|---|---|---|
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
# Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
| ||||
Selecting Data |
Select from DataFrame using criteria from multiple columns. | newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
# Output:
col1 col2 col3
3 4 444 xyz
| ||
Info on Unique Values |
df.unique()
|
df['col2'].unique()
# Output:
array([444, 555, 666])
df['col2'].nunique()
# Output:
3
| ||
Count values |
df.value_counts()
|
df['col2'].value_counts()
# Output:
444 2
555 1
666 1
Name: col2, dtype: int64
| ||
Removing a Column |
del df['col']
|
del df['col1']
# Output:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
| ||
Get column and index names |
df.columns
|
df.columns
# Output:
Index(['col2', 'col3'], dtype='object')
df.index
# Output:
RangeIndex(start=0, stop=4, step=1)
| ||
Sorting and Ordering a DataFrame |
df.sort_values()
|
df.sort_values(by='col2') #inplace=False by default
# Output:
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi
| ||
Applying Functions |
df[''].apply(some_function)
|
def function():
|
We can define our own function | def times2(x):
return x*2
df['col1'].apply(times2)
# Output
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
|
len
|
df['col3'].apply(len)
# Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
| |||
df[''].sum()
|
Sum values in a column | df['col1'].sum()
# Output:
10
| ||
df.pivot_table()
|
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
# Output:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
# Output:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
|
Data Input and Output
CSV
CSV Input
CSV Output
Excel
Excel Input
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
dart = pd.read_excel('dart.xls', index_col=0)
Excel Output
HTML
HTML Input
SQL