Arrays
Often we want to perform the same operation on many values at once. This can be accomplished using the numpy module.
To use numpy, we have to import it into our workspace:
import numpy as np
Why np? The imported name is shortened to np for better readability of code using NumPy. This is a widely adopted convention that makes your code more readable for everyone working on it.
Once imported, we can access its attributes via the . operator.
Array Creation
Use np.array to create arrays from lists/tuples.
1D Arrays
Let's create a numpy array of 5 numbers:
a = np.array([3,4,7,8,9])
a
array([3, 4, 7, 8, 9])
b = np.arange(1,6) # Could you tell what this does?
b
array([1, 2, 3, 4, 5])
NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape.
a.shape
(5,)
b.shape
(5,)
The shape shows the number of elements in each dimension. The arrays above have one dimension, so the shape is a tuple of length 1, with the number 5 representing the size of the array in that dimension.
Doing math:
a + b
array([ 4, 6, 10, 12, 14])
a - b
array([2, 2, 4, 4, 4])
a * b
array([ 3, 8, 21, 32, 45])
a / b
array([3. , 2. , 2.33333333, 2. , 1.8 ])
a % b
array([0, 0, 1, 0, 4])
a ** b
array([ 3, 16, 343, 4096, 59049])
a // b
array([3, 2, 2, 2, 1])
What if we want to combine a scalar and an array? NumPy's broadcasting rule relaxes the same-shape requirement when the arrays' shapes meet certain constraints. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation:
a + 3 # adds 3 to every element
array([ 6, 7, 10, 11, 12])
a * 4 # multiplies every element by 4
array([12, 16, 28, 32, 36])
Indexing
For 1D arrays, indexing works the same way as with lists and tuples:
a[0] # get the first element
3
a[0] = 10 # set the first element to 10
a[::-1] # reverse the array
array([ 9, 8, 7, 4, 10])
np.flip(a) # reverse the array
array([ 9, 8, 7, 4, 10])
a[:4] # from the first to the 4th element
array([10, 4, 7, 8])
2D Arrays:
a2 = np.array([[1,2],[3,4]])
a2
array([[1, 2],
[3, 4]])
a2.shape
(2, 2)
b2 = np.array([[5,6,7,8]])
b2
array([[5, 6, 7, 8]])
b2.shape
(1, 4)
c = np.array([[5],[6],[7],[8]])
c
array([[5],
[6],
[7],
[8]])
c.shape
(4, 1)
All of these are two-dimensional arrays, but notice the difference in the shapes: a2 is a \(2 \times 2\) array, b2 is a \(1\times 4\) array and c is a \(4\times 1\) array.
Indexing
This is a bit different from what we learned previously. We use the usual \(X_{ij}\) notation, where \(i\) represents the rows and \(j\) represents the columns, in conjunction with the extraction operator []:
a3 = np.array([[5,6,7,8],[9,10,11,12], [13,14,15,16]])
a3[0] # the first object, i.e. row 0
array([5, 6, 7, 8])
a3[0][0] # the first element of the first object
5
a3[0,0] # the first element, i.e. the object at row 0 and column 0 (same as above)
5
a3[1,:] # the entirety of row 1, i.e. row 1 for all columns
array([ 9, 10, 11, 12])
a3[:,0] # the entirety of column 0
array([ 5, 9, 13])
a3[1:,:2] # rows 1 to end for columns 0 and 1
array([[ 9, 10],
[13, 14]])
Note that for entire axes, we can use the ellipsis (...) instead of the colon:
a3[..., 0]
array([ 5, 9, 13])
You might see the difference when using higher dimensions:
arr_dim3 = np.arange(18).reshape(2,3,3)
arr_dim3
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]]])
arr_dim3.shape
(2, 3, 3)
arr_dim3[:, 0] # first row of every array
array([[ 0, 1, 2],
[ 9, 10, 11]])
arr_dim3[..., 0] # first column of every array
array([[ 0, 3, 6],
[ 9, 12, 15]])
arr_dim3[:, :, 0]
array([[ 0, 3, 6],
[ 9, 12, 15]])
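As a small sketch of the difference: the ellipsis stands for as many full slices as are needed to cover the remaining axes, whereas a single colon covers exactly one axis. The array name x below is only for illustration.

```python
import numpy as np

x = np.arange(18).reshape(2, 3, 3)

# The ellipsis expands to as many ':' as needed to reach the last axis.
last_axis_with_ellipsis = x[..., 0]   # same as x[:, :, 0]
last_axis_with_colons = x[:, :, 0]

# A single ':' only covers one axis, so x[:, 0] indexes axis 1 instead.
second_axis = x[:, 0]                 # same as x[:, 0, :]

print(np.array_equal(last_axis_with_ellipsis, last_axis_with_colons))
print(second_axis.shape)
```

The same x[..., 0] expression would keep working if x had four or more dimensions, which is where the ellipsis saves typing.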
Broadcasting
a2
array([[1, 2],
[3, 4]])
a2 + a2
array([[2, 4],
[6, 8]])
a2 * a2 # elementwise
array([[ 1, 4],
[ 9, 16]])
Can we do math operations on arrays of different shapes?
a2
array([[1, 2],
[3, 4]])
b2
array([[5, 6, 7, 8]])
c
array([[5],
[6],
[7],
[8]])
a2 + b2
ValueError: operands could not be broadcast together with shapes (2,2) (1,4)
a2 + c
ValueError: operands could not be broadcast together with shapes (2,2) (4,1)
b2 + c
array([[10, 11, 12, 13],
[11, 12, 13, 14],
[12, 13, 14, 15],
[13, 14, 15, 16]])
Only b2 + c worked. Why? How was numpy able to do the above? By using what is known as broadcasting. The above is the simplest notion of broadcasting. Broadcasting simply stretches an array along a given dimension to match the size of another array in that dimension. Within a particular dimension, broadcasting occurs only when one array has size 1 and the other has a different size. In a2 + b2 and a2 + c, one dimension has sizes 2 and 4, neither of which is 1, so the shapes cannot be matched.
Notice how with broadcasting, we did not have to write a loop to do all the sums.
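The rule can be checked without building any arrays. The sketch below assumes a NumPy version that provides np.broadcast_shapes (1.20 or later); the helper name broadcast_result is made up for illustration.

```python
import numpy as np

def broadcast_result(shape_a, shape_b):
    """Return the broadcast shape, or None if the two shapes are incompatible."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

print(broadcast_result((2, 2), (1, 4)))  # incompatible: 2 vs 4 in the last dimension
print(broadcast_result((2, 2), (4, 1)))  # incompatible: 2 vs 4 in the first dimension
print(broadcast_result((1, 4), (4, 1)))  # both size-1 dimensions get stretched
```

Only the last pair yields a shape, namely (4, 4), which matches the result of b2 + c.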
How can we use this to do math?
Suppose we have been given an array a which is 1D. A simple question is to find the distance from each point to every other point. In this case, we could use broadcasting. But how?
First, let's look at a loop solution:
a = [3,4,7,8,9]
a
[3, 4, 7, 8, 9]
We need a \(5\times 5\) matrix as the result, i.e.
np.array([[abs(i-j) for j in a] for i in a])
array([[0, 1, 4, 5, 6],
[1, 0, 3, 4, 5],
[4, 3, 0, 1, 2],
[5, 4, 1, 0, 1],
[6, 5, 2, 1, 0]])
How can we do the same using broadcasting?
All we have to do is ensure that, when subtracting \(a\) from itself, the second copy of \(a\) has a different orientation (a column instead of a row); numpy will then stretch the two to match.
arr = np.array(a)
arr.reshape(5,1)
array([[3],
[4],
[7],
[8],
[9]])
abs(arr - arr.reshape(5, 1))
array([[0, 1, 4, 5, 6],
[1, 0, 3, 4, 5],
[4, 3, 0, 1, 2],
[5, 4, 1, 0, 1],
[6, 5, 2, 1, 0]])
In the above, we used reshape(5,1) to turn the vector into a matrix. Here we specified that there should be 5 rows and 1 column. Sometimes we do not know the size of the first dimension beforehand and therefore need a way to reshape without stating it. Here are the other ways:
arr[:, None]
array([[3],
[4],
[7],
[8],
[9]])
abs(arr - arr[:, None])
array([[0, 1, 4, 5, 6],
[1, 0, 3, 4, 5],
[4, 3, 0, 1, 2],
[5, 4, 1, 0, 1],
[6, 5, 2, 1, 0]])
abs(arr - arr.reshape(-1,1)) # We use -1 to tell the computer to calculate for us
array([[0, 1, 4, 5, 6],
[1, 0, 3, 4, 5],
[4, 3, 0, 1, 2],
[5, 4, 1, 0, 1],
[6, 5, 2, 1, 0]])
abs(arr - arr[:, np.newaxis])
array([[0, 1, 4, 5, 6],
[1, 0, 3, 4, 5],
[4, 3, 0, 1, 2],
[5, 4, 1, 0, 1],
[6, 5, 2, 1, 0]])
The way to rearrange the arrays is very important and you need to know this.
More examples:
a3 = np.array([[5,6,7],[9,10,11], [13,14,15]])
Suppose we wanted to subtract 5,9,13 from the first row, then 5,9,13 from the second row and also from the third row. How will we do this?
b = np.array([5,9,13])
Note that since the operation is row-wise, we only need the row length to match. It already does: b is packed as one unit of 3 elements, and each row of a3 has 3 elements. We can therefore do the subtraction directly:
a3 - b
array([[ 0, -3, -6],
[ 4, 1, -2],
[ 8, 5, 2]])
What if we wanted to subtract 5 from the first row, 9 from the second row and 13 from the third row? We use broadcasting. We ensure the object to be subtracted is packed as a unit. Note that we have to arrange b such that it has 3 rows and 1 element per row.
To do the task, first check the shape of a3:
a3.shape
(3, 3)
We can then do:
b.shape = (3, 1) # similar to b.shape = 3, 1
b
array([[ 5],
[ 9],
[13]])
Now we can do the subtraction:
a3 - b
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
Note that the method I used above, replacing the shape with a tuple, changes b completely. I now have a b that is shaped as (3, 1). If there were other operations that depended on the previous version of b, they would fail if the dimensions did not align. Thus, instead of doing an in-place replacement, we should use one of the methods shown above.
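A minimal sketch of the non-destructive alternative: reshape and indexing with None both return a new array object and leave the shape of the original untouched. The names col and col_alt are only for illustration.

```python
import numpy as np

b = np.array([5, 9, 13])

col = b.reshape(-1, 1)   # new (3, 1) array object; b keeps its own shape
col_alt = b[:, None]     # same result via indexing

print(b.shape)
print(col.shape)
```

Here b stays one-dimensional, so earlier code that relies on its shape keeps working.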
First, let's revert b to how it was:
b = b.flatten() # or b.shape = (3,) or b = b.ravel()
b
array([ 5, 9, 13])
Now let's use any of the methods previously introduced.
a3 - b[:, None]
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
Now what if we wanted to subtract 5 from all the rows and columns, then 9 from all the rows and columns, then 13 from all the rows and columns? This is a little more advanced:
a3 - b[:,None,None]
array([[[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10]],
[[-4, -3, -2],
[ 0, 1, 2],
[ 4, 5, 6]],
[[-8, -7, -6],
[-4, -3, -2],
[ 0, 1, 2]]])
Don't worry too much about the last example, but notice how 5 was subtracted from the whole of a3, then 9, then 13. We get three matrices.
Broadcasting is useful as it alleviates the need for loops whenever the operations are independent.
Question
flowers = [[1,6],[3,7],[9,12],[4,13]]
people = [2,3,7,11]
flowers = np.array(flowers)
people = np.array(people)[:, None]
((flowers[:,0] <= people) & (people <= flowers[:,1])).sum(1)
array([1, 2, 2, 2])
Elementary Math functions
There are a lot of elementary functions. Some basic ones are:
Polynomial functions:
np.sqrt([0,4,9])
array([0., 2., 3.])
np.cbrt([0,27])
array([0., 3.])
np.array([0,4,5])**2
array([ 0, 16, 25])
Exponential and Logarithm
np.e # constant e
2.718281828459045
np.exp([0,1,2]) # e^x, using the same constant as above
array([1. , 2.71828183, 7.3890561 ])
np.exp2([0,1,2]) # 2^x
array([1., 2., 4.])
np.expm1([0,1,2]) # e^x - 1, computed with greater precision for small x
array([0. , 1.71828183, 6.3890561 ])
np.log([10, 20, 30]) # log base e
array([2.30258509, 2.99573227, 3.40119738])
np.log10([10,100,1000]) # log base 10
array([1., 2., 3.])
np.log1p([1,2,3]) # log base e of 1+x, with high precision for small x
array([0.69314718, 1.09861229, 1.38629436])
np.log2([2,4, 8, 16]) # log base 2
array([1., 2., 3., 4.])
Trigonometric and hyperbolic functions
np.sin(np.radians(30)) # same as np.sin(30*np.pi/180). We are used to degrees
0.49999999999999994
np.degrees(np.arcsin(0.5)) # inverse of the sin function
30.000000000000004
np.cosh(2) # hyperbolic cosine
3.7621956910836314
(np.exp(2) + np.exp(-2))/2 # definition of hyperbolic cosine
3.7621956910836314
Other trigonometric functions could be found here and here.
The hyperbolic functions could be found here and their relations to the trigonometric functions could be found here
Methods
arr = np.array([[1, -1, 2, 3, 5], [4,-8,-3,0,5]])
Instance Methods
arr.min() # minimum for the whole array
-8
arr.min(0) # minimum across the rows (along the column)
array([ 1, -8, -3, 0, 5])
arr.min(1) # minimum across the columns (along the row)
array([-1, -8])
arr.argmin() # position where minimum occurs (in the flattened array)
6
arr.argmin(0) # position where minimum occurs across rows/along columns
array([0, 1, 1, 1, 0], dtype=int64)
arr.argmin(1) # similar to arr.argmin(axis = 1). Likewise for the above
array([1, 1], dtype=int64)
arr.max() # arr.max((0, 1)) or arr.max(axis = (0, 1))
arr.max(0) # arr.max(axis = 0)
arr.max(1)
arr.sum()
arr.sum(0)
arr.sum(1)
arr.mean()
arr.mean(0)
arr.mean(1)
arr.std()
arr.std(0)
arr.var()
arr.var(0)
arr.cumsum()
arr.cumsum(0)
arr.cumsum(1)
arr.prod()
arr.prod(0)
arr.cumprod()
arr.sort() # CAUTION. DOES INPLACE ORDERING, AND CHANGES THE ORIGINAL ARRAY
arr.sort(0) # CAUTION. DOES INPLACE ORDERING, AND CHANGES THE ORIGINAL ARRAY
arr.argsort() # returns indices at which the current values should be for the array to be sorted
arr.argsort(0)
(arr >= 0).all(0)
(arr >= 0).any(0)
You can find other instance methods by typing the instance name followed by a period and then pressing the Tab key.
Question:
Find the distance matrix for the array ar below, where the distance is defined as:
\[ d(x_i, x_j) = \sqrt{\sum_{l} (x_{il} - x_{jl})^2} \]
ar = np.arange(10).reshape(-1, 2)
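One possible way to approach this, as a sketch rather than the only answer, is to treat each row of ar as a point and reuse the None trick from the broadcasting section. The intermediate names diff and dist are only for illustration.

```python
import numpy as np

ar = np.arange(10).reshape(-1, 2)          # 5 points, 2 coordinates each

# (5, 1, 2) - (1, 5, 2) broadcasts to (5, 5, 2): all pairwise differences.
diff = ar[:, None, :] - ar[None, :, :]

# Sum the squared differences over the coordinate axis, then take the root.
dist = np.sqrt((diff ** 2).sum(axis=-1))

print(dist.shape)
```

The result is symmetric with zeros on the diagonal, as a distance matrix should be. Note that the intermediate array holds n × n × d values, so this approach trades memory for the absence of loops.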
Module numeric array functions.
Most of the instance methods above are also available as functions in the numpy module. Thus for nearly every instance method, there is an equivalent module-level function.
np.sum(arr)
np.sum(arr, 1)
np.sum(arr, axis = 1)
np.min(arr)
In addition, the module provides extra functions that are not available as instance methods:
np.minimum(arr, -10) # minimum per element
array([[-10, -10, -10, -10, -10],
[-10, -10, -10, -10, -10]])
np.maximum(arr, 0) # maximum per element
array([[1, 0, 2, 3, 5],
[4, 0, 0, 0, 5]])
np.add(arr, arr) # seems redundant? But it's not. We will see
array([[ 2, -2, 4, 6, 10],
[ 8, -16, -6, 0, 10]])
The comparison above does not do justice to the provided functions. Let's take another example.
np.minimum([1,4,7,9],[0,5,8,3])
array([0, 4, 7, 3])
The module also provides equivalent functions for when there are nan values in an array:
np.nansum(arr)
np.nansum(arr, axis = 1)
np.nanmean(arr)
np.nanmax(arr)
np.nancumsum(arr) # etc
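Since arr above contains no nan values, the nan-aware functions behave exactly like the plain ones on it. The sketch below uses a small made-up array, values, to show where the two differ.

```python
import numpy as np

values = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan]])

plain_sum = np.sum(values)               # nan: one missing value poisons the result
nan_sum = np.nansum(values)              # nan entries are treated as zero
row_means = np.nanmean(values, axis=1)   # nan entries are skipped per row

print(plain_sum, nan_sum, row_means)
```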
Other vector/array functions
arr
array([[ 1, -1, 2, 3, 5],
[ 4, -8, -3, 0, 5]])
np.where
Think of this in 2 ways.
As a vectorized if else ternary operator:
np.where(arr >= 4, 0, 5) # 0 if arr >= 4 else 5
array([[5, 5, 5, 5, 0], [0, 5, 5, 5, 0]])
np.where(arr >= 4, arr - 1, abs(arr))
array([[1, 1, 2, 3, 4], [3, 8, 3, 0, 4]])
Compare the following:
a = np.array([1,4,7,9])
b = np.array([0,5,8,3])
np.minimum(a, b)
array([0, 4, 7, 3])
np.where(a < b, a, b)
array([0, 4, 7, 3])
As a vectorized find function, i.e. it gives the indices where the condition is True:
np.where(arr >= 4)
(array([0, 1, 1], dtype=int64), array([4, 0, 4], dtype=int64))
np.select
A vectorized generic elif statement, i.e. a nested np.where.
E.g. grading scale: A: >=90, B: >=80, C: >=70, D: >=60, E: >=50, F: <50
grade = np.array([97, 90, 72, 89, 50, 23])
np.where(grade>=90, "A", np.where(grade>=80, "B", np.where(grade>=70, "C", np.where(grade>=60, "D", np.where(grade>=50, "E","F")))))
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U1')
That was a long one. We can simply use np.select:
conditions = [grade>=90, grade>=80, grade>=70, grade>=60, grade>=50, grade<50]
choices = 'A','B','C','D','E','F'
np.select(conditions, choices)
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U3')
Note that the else part, i.e. the very last condition, could be omitted and a default value passed to the np.select function, e.g.:
conditions = [grade>=90, grade>=80, grade>=70, grade>=60, grade>=50]
choices = 'ABCDE'
np.select(conditions, choices, 'F')
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U1')
np.in1d
Determines whether the elements of one array are present in the other. Note that the function is specifically named 1d because it deals with 1d arrays. In NumPy 2.0 and later, np.in1d is deprecated in favour of np.isin.
np.in1d([1,3,4,6], [2,5,7,6,3])
array([False, True, False, True])
np.unique
Among the top 3 most useful functions for data science. Determines the unique values in an array, their positions, their counts, etc.
arr1 = [1,2,2,2,1,1,1,3,3,6,3,2,2,4,4,4]
np.unique(arr1)
array([1, 2, 3, 4, 6])
np.unique(arr1, return_index = True)
(array([1, 2, 3, 4, 6]), array([ 0, 1, 7, 13, 9], dtype=int64))
np.unique(arr1, return_inverse = True)
(array([1, 2, 3, 4, 6]), array([0, 1, 1, 1, 0, 0, 0, 2, 2, 4, 2, 1, 1, 3, 3, 3], dtype=int64))
np.unique(arr1, return_counts = True)
(array([1, 2, 3, 4, 6]), array([4, 5, 3, 3, 1], dtype=int64))
np.unique(arr1, return_index = True, return_inverse = True, return_counts = True)
(array([1, 2, 3, 4, 6]), array([ 0, 1, 7, 13, 9], dtype=int64), array([0, 1, 1, 1, 0, 0, 0, 2, 2, 4, 2, 1, 1, 3, 3, 3], dtype=int64), array([4, 5, 3, 3, 1], dtype=int64))
np.diff
Returns the differences of a given order. Note that the first difference is the difference between the next point and the current point.
arr = np.array([[1,3,5,6,9],[2,9,7,4,10]])
np.diff(arr)
array([[ 2, 2, 1, 3], [ 7, -2, -3, 6]])
np.diff(arr, axis = 0)
array([[ 1, 6, 2, -2, 1]])
np.diff(arr, n=2)
array([[ 0, -1, 2], [-9, -1, 9]])
np.diff(arr, n = 2, axis = 0)
array([], shape=(0, 5), dtype=int32)
np.concatenate
Used to combine multiple arrays into one array.
a1 = np.arange(12).reshape(-1,3)
a2 = np.arange(13,25).reshape(-1,3)
a3 = np.array([1,2,3])
np.concatenate([a1, a2])
array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24]])
np.concatenate([a1, a2], axis=1)
array([[ 0, 1, 2, 13, 14, 15], [ 3, 4, 5, 16, 17, 18], [ 6, 7, 8, 19, 20, 21], [ 9, 10, 11, 22, 23, 24]])
np.row_stack - Stacks the arrays row-wise
np.column_stack - Stacks the arrays column-wise
np.hstack - Stacks the arrays horizontally. Equal to np.column_stack IFF the arrays have more than one dimension
np.vstack - Stacks the arrays vertically. Equivalent to np.row_stack
Here is a link to showcase the differences about the 4 functions above.
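As a quick sketch of where these functions differ, the example below stacks two made-up 1D arrays, u and v, and then checks the 2D case. It uses np.vstack rather than np.row_stack, since the two are described above as equivalent.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

rows = np.vstack([u, v])          # each 1D array becomes a row
flat = np.hstack([u, v])          # 1D arrays are joined end to end
cols = np.column_stack([u, v])    # each 1D array becomes a column

print(rows.shape, flat.shape, cols.shape)

# For 2D inputs, hstack and column_stack agree.
m = np.arange(6).reshape(3, 2)
same_for_2d = np.array_equal(np.hstack([m, m]), np.column_stack([m, m]))
print(same_for_2d)
```

With 1D inputs, np.hstack returns a longer 1D array while np.column_stack returns a 2D array, which is the difference the "IFF" remark above refers to.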
np.r_
np.r_[a1, a2]
array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24]])
np.r_[0, 1:10, a3]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3])
np.c_
np.c_[a1, a2]
array([[ 0, 1, 2, 13, 14, 15], [ 3, 4, 5, 16, 17, 18], [ 6, 7, 8, 19, 20, 21], [ 9, 10, 11, 22, 23, 24]])
np.append only works with 2 input arrays.
np.append(a1, a2)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])
np.append(a1, a2, axis = 1)
array([[ 0, 1, 2, 13, 14, 15], [ 3, 4, 5, 16, 17, 18], [ 6, 7, 8, 19, 20, 21], [ 9, 10, 11, 22, 23, 24]])
Matrix Operations/ functions
The best way to work with arrays is to use matrix operations. These operations allow us to easily manipulate data. Some of the functions include:
T / transpose
Transpose your matrix/array. transpose is more general, as it accepts the axes to be permuted.
a1.T
array([[ 0, 3, 6, 9], [ 1, 4, 7, 10], [ 2, 5, 8, 11]])
a1.transpose()
array([[ 0, 3, 6, 9], [ 1, 4, 7, 10], [ 2, 5, 8, 11]])
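For a 2D array the two give the same result, as seen above. A small sketch with a made-up 3D array x shows what passing axes to transpose adds.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

reversed_axes = x.T                    # .T reverses all axes
swapped_first_two = x.transpose(1, 0, 2)  # swap only the first two axes

print(reversed_axes.shape)
print(swapped_first_two.shape)
```

After x.transpose(1, 0, 2), the element at position [j, i, k] is the element that was at [i, j, k] in x.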
@
Used for matrix multiplication / inner product for vectors:
a3 @ a3
14
a3.dot(a3)
14
a2 @ a3
array([ 86, 104, 122, 140])
Universal functions
Suppose you have 3 students who sat different exams. You are asked to find the average grade of each student, e.g.
students = [1, 1, 2, 2, 1, 1, 3, 3, 2]
grades = [90, 70, 87, 84, 65, 87, 98, 99, 86]
Note that each grade corresponds to a student. To visualize, we have:
student 1: 90, 70, 65, 87
student 2: 87, 84, 86
student 3: 98, 99
How shall we go about this? So far, we know that numpy arrays only hold rectangular data, i.e. data in which every row has the same length.
We could revert to for-loops or start thinking of other ways to solve this. Luckily, numpy provides a way out. Notice that so far we have been passing only the axis into the functions; there are other arguments that can be passed in.
We could create a matrix but also have indices to indicate those elements that we are interested in:
grades_mat = np.array([[90,70,65,87],
[87,84,86,np.nan],
[98,99,np.nan,np.nan]])
np.nanmean(grades_mat, 1) # average per student
array([78. , 85.66666667, 98.5 ])
grades_mat.mean(1, where = ~np.isnan(grades_mat)) # average per student
array([78. , 85.66666667, 98.5 ])
This method would require us to manually manipulate the data, add the nans and then solve the problem. That would be tedious. Is there a way to directly use the data given? Yes:
students = np.array(students)
grades = np.array(grades)
sort_index = np.argsort(students)
sorted_grades = grades[sort_index]
_, idx, counts = np.unique(students[sort_index], return_index = True, return_counts = True)
np.add.reduceat(sorted_grades, idx)/counts
array([78. , 85.66666667, 98.5 ])
The second method is quite intriguing. We did not have to manually structure our data in a certain way. We just had to use the reduceat method provided by the add function.
Note that if we want the maximum per student, we use the reduceat provided by the universal function maximum. The rest of the code remains the same:
np.minimum.reduceat(sorted_grades, idx)
array([65, 84, 98])
np.maximum.reduceat(sorted_grades, idx)
array([90, 87, 99])
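To make the behaviour of reduceat more concrete, the sketch below writes out the slices it reduces over: each index marks where a group starts, and a group runs up to the next index (the last group runs to the end). The arrays values and starts are made up to mirror the sorted grades above.

```python
import numpy as np

values = np.array([90, 70, 65, 87, 87, 84, 86, 98, 99])
starts = np.array([0, 4, 7])   # first position of each group

group_sums = np.add.reduceat(values, starts)

# The same sums written out with explicit slices.
manual = np.array([values[0:4].sum(), values[4:7].sum(), values[7:].sum()])

print(group_sums)
print(manual)
```

This is why the grades have to be sorted by student first: reduceat only reduces over contiguous stretches of the array.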
Extra:
Note that we could do the same using only Python's built-in functions:
[sum(vec:=[grade for stud, grade in zip(students, grades) if stud == i])/len(vec) for i in set(students)]
[78.0, 85.66666666666667, 98.5]
Note how we used the inbuilt universal functions. We could write a function and vectorize it.
Ways to vectorize a function:
np.vectorize
np.frompyfunc
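A minimal sketch of both, using a made-up scalar function letter_grade. Keep in mind that np.vectorize is provided mainly for convenience and is essentially a loop under the hood, so it should not be expected to match the speed of the built-in universal functions.

```python
import numpy as np

def letter_grade(score):
    """Plain Python function that handles one score at a time."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    return "C"

scores = np.array([97, 85, 60])

# np.vectorize wraps the function so that it accepts whole arrays.
vec_grade = np.vectorize(letter_grade)
result = vec_grade(scores)

# np.frompyfunc needs the number of inputs and outputs; it returns object arrays.
obj_grade = np.frompyfunc(letter_grade, 1, 1)
obj_result = obj_grade(scores)

print(result)
print(obj_result)
```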
Note that at times we are just interested to apply the function along a given axis or over some axes. Use the functions:
np.apply_along_axis
np.apply_over_axes
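As an illustration of the first of these, the sketch below applies a made-up function, value_range, to every row and then to every column of a small array. np.apply_along_axis passes one 1D slice at a time to the function.

```python
import numpy as np

def value_range(v):
    """Largest minus smallest value of a 1D array."""
    return v.max() - v.min()

m = np.array([[1, -1, 2, 3, 5],
              [4, -8, -3, 0, 5]])

per_row = np.apply_along_axis(value_range, 1, m)     # one result per row
per_column = np.apply_along_axis(value_range, 0, m)  # one result per column

print(per_row)
print(per_column)
```

For this particular function, m.max(1) - m.min(1) would give the same per-row answer; apply_along_axis is mainly useful when no such built-in combination exists.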
Time won't allow me to talk about loading data into python using numpy, dealing with character arrays, rolling windows/stride tricks using the np.lib.stride_tricks module, padding arrays using the np.lib.arraypad module, convolutions, etc.
There is still a lot to learn from this package; we haven't even scratched the surface. Once you grasp what is happening and can respond to problems, you are ready for DATA SCIENCE. Are you ready?
Options:
- Move directly to Data Science, i.e. scipy, sklearn, statsmodels. You need to learn the np.linalg module.
- Move to Data Analytics, i.e. pandas, json, sql, pyspark, siuba. Pandas is an extension of numpy for data frames, so no extra numpy knowledge is needed. Pandas provides easy ways to solve such problems.
Both of the above still require Data Visualization, using matplotlib and seaborn. This is something you can easily learn.
Question
https://stackoverflow.com/questions/77262300/how-do-i-filter-on-multiple-criteria-in-group-by
https://stackoverflow.com/questions/77260897/r-how-to-do-the-rowwise-mutate-operation
https://stackoverflow.com/questions/77262541/loop-combination-two-columns-sums-in-r
https://stackoverflow.com/questions/77262398/update-values-in-df-after-groupby-and-get-group