Arrays
Often we want to perform the same operation on many values at once. This can be accomplished using the numpy module.
To use numpy, we first have to import it into our workspace:
import numpy as np
Why np? The imported name is shortened to np to keep code that uses NumPy concise. This is a widely adopted convention, so following it makes your code more readable for everyone working on it.
Once imported, we can access its attributes via the . operator.
Array Creation
Use np.array to create arrays from lists/tuples.
1D Arrays
Let's create a numpy array of 5 numbers:
a = np.array([3,4,7,8,9])
a
array([3, 4, 7, 8, 9])
b = np.arange(1,6) # Could you tell what this does?
b
array([1, 2, 3, 4, 5])
NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape.
a.shape
(5,)
b.shape
(5,)
The shape shows the number of elements in each dimension. The arrays above have one dimension, so the shape is a tuple of length 1, with the number 5 giving the size of the array in that dimension.
Doing math:
a + b
array([ 4,  6, 10, 12, 14])
a - b
array([2, 2, 4, 4, 4])
a * b
array([ 3,  8, 21, 32, 45])
a / b
array([3.        , 2.        , 2.33333333, 2.        , 1.8       ])
a % b
array([0, 0, 1, 0, 4])
a ** b
array([    3,    16,   343,  4096, 59049])
a // b
array([3, 2, 2, 2, 1])
What if we want to combine a scalar and an array? NumPy’s broadcasting rule relaxes the same-shape constraint when the arrays’ shapes meet certain conditions. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation:
a + 3 # adds 3 to every element
array([ 6,  7, 10, 11, 12])
a * 4 # multiplies every element by 4
array([12, 16, 28, 32, 36])
Indexing
For 1D arrays, indexing works the same way as with lists and tuples:
a[0] # get the first element
3
a[0] = 10 # set the first element to 10
a[::-1] # reverse the array
array([ 9,  8,  7,  4, 10])
np.flip(a) # reverse the array
array([ 9,  8,  7,  4, 10])
a[:4] # from the first to the 4th element
array([10,  4,  7,  8])
2D Arrays:
a2 = np.array([[1,2],[3,4]])
a2
array([[1, 2],
       [3, 4]])
a2.shape
(2, 2)
b2 = np.array([[5,6,7,8]])
b2
array([[5, 6, 7, 8]])
b2.shape
(1, 4)
c = np.array([[5],[6],[7],[8]])
c
array([[5],
       [6],
       [7],
       [8]])
c.shape
(4, 1)
All of these are two-dimensional arrays, but notice the difference in the shapes: a2 is a \(2 \times 2\) array, b2 is a \(1\times 4\) array and c is a \(4\times 1\) array.
Indexing
This is a bit different from what we have seen so far. We use the usual \(X_{ij}\) notation, where \(i\) indexes the rows and \(j\) indexes the columns, together with the extraction operator [].
a3 = np.array([[5,6,7,8],[9,10,11,12], [13,14,15,16]])
a3[0] # the first object, i.e. row 0
array([5, 6, 7, 8])
a3[0][0] # the first element of the first object
5
a3[0,0] # the first element, i.e. the object at row 0 and column 0 (same as above)
5
a3[1,:] # the entirety of row 1, i.e. row 1 for all columns
array([ 9, 10, 11, 12])
a3[:,0] # the entirety of column 0
array([ 5,  9, 13])
a3[1:,:2] # rows 1 to end for columns 0 and 1
array([[ 9, 10],
       [13, 14]])
Note that to take entire axes, we can use the ellipsis (...) in place of one or more colons:
a3[...,0]
array([ 5,  9, 13])
You might see the difference when using higher dimensions:
arr_dim3 = np.arange(18).reshape(2,3,3)
arr_dim3
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])
arr_dim3.shape
(2, 3, 3)
arr_dim3[:,0] # first row of every array
array([[ 0,  1,  2],
       [ 9, 10, 11]])
arr_dim3[...,0] # first column of every array
array([[ 0,  3,  6],
       [ 9, 12, 15]])
arr_dim3[:, :, 0]
array([[ 0,  3,  6],
       [ 9, 12, 15]])
Broadcasting
a2
array([[1, 2],
       [3, 4]])
a2 + a2
array([[2, 4],
       [6, 8]])
a2 * a2 # elementwise
array([[ 1,  4],
       [ 9, 16]])
Can we do math operations on arrays of different shapes?
a2
array([[1, 2],
       [3, 4]])
b2
array([[5, 6, 7, 8]])
c
array([[5],
       [6],
       [7],
       [8]])
a2 + b2
ValueError: operands could not be broadcast together with shapes (2,2) (1,4)
a2 + c
ValueError: operands could not be broadcast together with shapes (2,2) (4,1)
b2 + c
array([[10, 11, 12, 13],
       [11, 12, 13, 14],
       [12, 13, 14, 15],
       [13, 14, 15, 16]])
Only b2 + c worked. Why? NumPy handled it through what is known as broadcasting, and the above is its simplest form. Broadcasting stretches an array along a given dimension to match the size of the other array in that dimension. This stretching only happens when, within a particular dimension, one array has length 1 while the other has a different size. In a2 + b2 and a2 + c, one dimension pairs a length of 2 with a length of 4, and neither is 1, so the shapes cannot be broadcast.
Notice how with broadcasting, we did not have to write a loop to do all the sums.
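The rule described above can be checked without doing any arithmetic. The sketch below is only an illustration: the helper name broadcast_shape_or_none is made up for this example, and np.broadcast_shapes is assumed to be available (it was added in NumPy 1.20).

```python
import numpy as np

a2 = np.array([[1, 2], [3, 4]])        # shape (2, 2)
b2 = np.array([[5, 6, 7, 8]])          # shape (1, 4)
c = np.array([[5], [6], [7], [8]])     # shape (4, 1)

def broadcast_shape_or_none(x, y):
    """Hypothetical helper: return the broadcast shape, or None if incompatible."""
    try:
        return np.broadcast_shapes(x.shape, y.shape)
    except ValueError:
        return None

# Last dimension pairs 2 with 4 and neither is 1, so this fails.
print(broadcast_shape_or_none(a2, b2))
# First dimension pairs 2 with 4 and neither is 1, so this fails too.
print(broadcast_shape_or_none(a2, c))
# Each dimension pairs a 1 with a 4, so both arrays stretch to (4, 4).
print(broadcast_shape_or_none(b2, c))
```

Shapes are compared dimension by dimension, starting from the last one, which is why b2 and c are compatible even though neither has the final shape on its own.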
How can we use this to do math?
Suppose we have been given a 1D array a. A simple task is to find the distance from each point to every other point. Broadcasting can do this, but how?
First, let's look at a loop solution:
a = [3,4,7,8,9]
a
[3, 4, 7, 8, 9]
We need a \(5\times 5\) array as the result, i.e.
np.array([[abs(i-j) for j in a] for i in a])
array([[0, 1, 4, 5, 6],
       [1, 0, 3, 4, 5],
       [4, 3, 0, 1, 2],
       [5, 4, 1, 0, 1],
       [6, 5, 2, 1, 0]])
How can we do the same using broadcasting?
All we have to do is ensure that, when subtracting \(a\) from itself, the second \(a\) has a different shape; numpy will then stretch the two to match.
arr = np.array(a)
arr.reshape(5,1)
array([[3],
       [4],
       [7],
       [8],
       [9]])
abs(arr - arr.reshape(5, 1))
array([[0, 1, 4, 5, 6],
       [1, 0, 3, 4, 5],
       [4, 3, 0, 1, 2],
       [5, 4, 1, 0, 1],
       [6, 5, 2, 1, 0]])
In the above, we used reshape(5,1) to turn the vector into a matrix, specifying 5 rows and 1 column. Sometimes we do not know the size of the first dimension beforehand, so we need a way to reshape without stating it. Here are the other ways:
arr[:, None]
array([[3],
       [4],
       [7],
       [8],
       [9]])
abs(arr - arr[:, None])
array([[0, 1, 4, 5, 6],
       [1, 0, 3, 4, 5],
       [4, 3, 0, 1, 2],
       [5, 4, 1, 0, 1],
       [6, 5, 2, 1, 0]])
abs(arr - arr.reshape(-1,1)) # We use -1 to tell the computer to calculate for us
array([[0, 1, 4, 5, 6],
       [1, 0, 3, 4, 5],
       [4, 3, 0, 1, 2],
       [5, 4, 1, 0, 1],
       [6, 5, 2, 1, 0]])
abs(arr - arr[:, np.newaxis])
array([[0, 1, 4, 5, 6],
       [1, 0, 3, 4, 5],
       [4, 3, 0, 1, 2],
       [5, 4, 1, 0, 1],
       [6, 5, 2, 1, 0]])
The way to rearrange the arrays is very important and you need to know this.
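To make the point concrete, here is a small sketch that puts the reshaping options side by side and confirms they all produce the same column. np.expand_dims does not appear in the text above; it is included here as an assumption about what else you might reach for.

```python
import numpy as np

arr = np.array([3, 4, 7, 8, 9])

col_reshape = arr.reshape(-1, 1)         # -1 lets numpy infer the number of rows
col_none = arr[:, None]                  # None inserts a new axis of length 1
col_newaxis = arr[:, np.newaxis]         # np.newaxis is an alias for None
col_expand = np.expand_dims(arr, axis=1) # function form of the same idea

# Every variant is a (5, 1) column holding the same values.
for col in (col_reshape, col_none, col_newaxis, col_expand):
    assert col.shape == (5, 1)
    assert np.array_equal(col, col_reshape)

# None of these change arr itself; it is still 1D.
print(arr.shape)
```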
More examples:
a3 = np.array([[5,6,7],[9,10,11], [13,14,15]])
Suppose we wanted to subtract 5, 9, 13 from the first row, then 5, 9, 13 from the second row, and also from the third row. How would we do this?
b = np.array([5,9,13])
Note that since the operation is row-wise, we only need the length of b to match the length of each row of a3. They already match: b is packed as one unit of 3 elements, and each row of a3 has 3 elements. We can therefore do the subtraction directly:
a3 - b
array([[ 0, -3, -6],
       [ 4,  1, -2],
       [ 8,  5,  2]])
What if we wanted to subtract 5 from the first row, 9 from the second row and 13 from the third row? We again use broadcasting, but we must arrange b so that it has 3 rows and 1 element per row.
To do the task, first check the shape of a3:
a3.shape
(3, 3)
We can then do:
b.shape = (3, 1) # similar to b.shape = 3, 1
b
array([[ 5],
       [ 9],
       [13]])
Now we can do the subtraction:
a3 - b
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
Note that the method used above, replacing the shape with a tuple, changes b itself: b now has shape (3, 1). Any other operation that depended on the previous version of b would fail if the dimensions no longer aligned. Thus, instead of an in-place replacement, we should use one of the methods shown earlier.
First, let's revert b to how it was:
b = b.flatten() # or b.shape = (3,) or b = b.ravel()
b
array([ 5,  9, 13])
Now let's use any of the methods previously introduced:
a3 - b[:, None]
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
Now what if we wanted to subtract 5 from every element of a3, then 9 from every element, then 13 from every element? This is a little more advanced:
a3 - b[:,None,None]
array([[[ 0,  1,  2],
        [ 4,  5,  6],
        [ 8,  9, 10]],

       [[-4, -3, -2],
        [ 0,  1,  2],
        [ 4,  5,  6]],

       [[-8, -7, -6],
        [-4, -3, -2],
        [ 0,  1,  2]]])
Anyway, don't worry about the last example. But notice how 5 was subtracted from the whole of a3, then 9, then 13, giving three matrices.
Broadcasting is useful as it alleviates the need for loops whenever the operations are independent.
Question
flowers = [[1,6],[3,7],[9,12],[4,13]]
people = [2,3,7,11]
flowers = np.array(flowers)
people = np.array(people)[:, None]
((flowers[:,0] <= people) & (people <= flowers[:,1])).sum(1)
array([1, 2, 2, 2])
Elementary Math functions
There are a lot of elementary functions. Some basic ones are:
Polynomial functions:
np.sqrt([0,4,9])
array([0., 2., 3.])
np.cbrt([0,27])
array([0., 3.])
np.array([0,4,5])**2
array([ 0, 16, 25])
Exponential and Logarithm
np.e # constant e
2.718281828459045
np.exp([0,1,2]) # the e^x function, with e as above
array([1.        , 2.71828183, 7.3890561 ])
np.exp2([0,1,2]) # 2^x
array([1., 2., 4.])
np.expm1([0,1,2]) # e^x - 1, computed with greater precision for small x
array([0.        , 1.71828183, 6.3890561 ])
np.log([10, 20, 30]) # log base e
array([2.30258509, 2.99573227, 3.40119738])
np.log10([10,100,1000]) # log base 10
array([1., 2., 3.])
np.log1p([1,2,3]) # log base e of 1+x, with high precision for small x
array([0.69314718, 1.09861229, 1.38629436])
np.log2([2,4, 8, 16]) # log base 2
array([1., 2., 3., 4.])
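The comments on expm1 and log1p mention better precision for small x. A minimal sketch of what that means is below; the value x = 1e-10 is an arbitrary choice for illustration. For such a small x, both e^x - 1 and log(1 + x) should be very close to x itself.

```python
import numpy as np

x = 1e-10  # arbitrary small value chosen for this illustration

naive_exp = np.exp(x) - 1     # 1 + x is rounded before the subtraction
precise_exp = np.expm1(x)     # computes e^x - 1 without that rounding step

naive_log = np.log(1 + x)     # again, 1 + x is rounded first
precise_log = np.log1p(x)     # computes log(1 + x) directly

# Compare how far each result is from x.
print(abs(naive_exp - x), abs(precise_exp - x))
print(abs(naive_log - x), abs(precise_log - x))
```

In both cases the dedicated function lands much closer to the true value, because the naive version loses digits when 1 + x is stored as a float.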
Trigonometric and hyperbolic functions
np.sin(np.radians(30)) # same as np.sin(30*np.pi/180); we are used to degrees
0.49999999999999994
np.degrees(np.arcsin(0.5)) # inverse of the sin function
30.000000000000004
np.cosh(2) # hyperbolic cosine
3.7621956910836314
(np.exp(2) + np.exp(-2))/2 # definition of hyperbolic cosine
3.7621956910836314
Other trigonometric functions could be found here and here.
The hyperbolic functions could be found here and their relations to the trigonometric functions could be found here
Methods
arr = np.array([[1, -1, 2, 3, 5], [4,-8,-3,0,5]])
Instance Methods
arr.min() # minimum for the whole array
-8
arr.min(0) # minimum across the rows (along the column)
array([ 1, -8, -3,  0,  5])
arr.min(1) # minimum across the columns (along the row)
array([-1, -8])
arr.argmin() # position (in the flattened array) where the minimum occurs
6
arr.argmin(0) # position where the minimum occurs across rows/along columns
array([0, 1, 1, 1, 0], dtype=int64)
arr.argmin(1) # same as arr.argmin(axis = 1); likewise for the above
array([1, 1], dtype=int64)
arr.max() #arr.max((0, 1)) or arr.max(axis = (0, 1))
arr.max(0) # arr.max(axis = 0)
arr.max(1)
arr.sum()
arr.sum(0)
arr.sum(1)
arr.mean()
arr.mean(0)
arr.mean(1)
arr.std()
arr.std(0)
arr.var()
arr.var(0)
arr.cumsum()
arr.cumsum(0)
arr.cumsum(1)
arr.prod()
arr.prod(0)
arr.cumprod()
arr.sort()# CAUTION. DOES INPLACE ORDERING, AND CHANGES THE ORIGINAL ARRAY
arr.sort(0) # CAUTION. DOES INPLACE ORDERING, AND CHANGES THE ORIGINAL ARRAY
arr.argsort() # returns indices at which the current values should be for the array to be sorted
arr.argsort(0)
(arr >= 0).all(0)
(arr >= 0).any(0)
You can discover other instance methods by typing the instance name followed by a period and then pressing the Tab key.
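The caution about sort deserves a closer look. The sketch below contrasts the in-place method with np.sort, a module-level function that is not used in the text above and is brought in here only to show a non-destructive alternative.

```python
import numpy as np

arr = np.array([[1, -1, 2, 3, 5], [4, -8, -3, 0, 5]])

# np.sort returns a sorted copy (along the last axis by default).
sorted_copy = np.sort(arr)
print(arr[0])           # the original first row is unchanged

# The method sorts in place and returns None, so work on a copy
# if the original order still matters.
arr_inplace = arr.copy()
result = arr_inplace.sort()
print(result)           # None
print(arr_inplace[0])   # this copy is now sorted
```

If later steps rely on the original ordering, calling the method on the only copy of the data will silently break them, which is why the warning in the list above is written in capitals.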
Question:
Find the distance matrix for the ar below. Where the distance is defined as:
\[ d(x_i, x_j) = \sqrt{\sum_{l} (x_{il} - x_{jl})^2} \]
ar = np.arange(10).reshape(-1, 2)
Module numeric array methods.
Most instance methods have an equivalent function at the module level. Thus for nearly every instance method, there is a matching numpy function that takes the array as its first argument.
np.sum(arr)
np.sum(arr, 1)
np.sum(arr, axis = 1)
np.min(arr)
In addition, the module provides extra functions that have no instance-method counterpart:
np.minimum(arr, -10) # minimum per element
array([[-10, -10, -10, -10, -10],
       [-10, -10, -10, -10, -10]])
np.maximum(arr, 0) # maximum per element
array([[1, 0, 2, 3, 5],
       [4, 0, 0, 0, 5]])
np.add(arr, arr) # seems redundant? But it's not. We will see
array([[  2,  -2,   4,   6,  10],
       [  8, -16,  -6,   0,  10]])
The comparison above does not do justice to the provided functions. Let's take another example:
np.minimum([1,4,7,9],[0,5,8,3])
array([0, 4, 7, 3])
The module also provides equivalent functions for arrays that contain nan values:
np.nansum(arr)
np.nansum(arr, axis = 1)
np.nanmean(arr)
np.nanmax(arr)
np.nancumsum(arr) # etc
Other vector/array functions
arr
array([[ 1, -1,  2,  3,  5],
       [ 4, -8, -3,  0,  5]])
np.where
Think of this in 2 ways.
As a vectorized if else ternary operator:
np.where(arr >= 4, 0, 5) # 0 if arr >= 4 else 5
array([[5, 5, 5, 5, 0],
       [0, 5, 5, 5, 0]])
np.where(arr>=4, arr - 1, abs(arr)) # arr - 1 if arr >= 4 else abs(arr)
array([[1, 1, 2, 3, 4],
       [3, 8, 3, 0, 4]])
Compare the following:
a = np.array([1,4,7,9])
b = np.array([0,5,8,3])
np.minimum(a, b)
array([0, 4, 7, 3])
np.where(a<b, a, b)
array([0, 4, 7, 3])
As a vectorized find function, i.e. it gives the indices where the condition is True:
np.where(arr >= 4)
(array([0, 1, 1], dtype=int64), array([4, 0, 4], dtype=int64))
np.select
A vectorized generic elif statement, i.e. a nested np.where. E.g. grading scale: A: >=90, B: >=80, C: >=70, D: >=60, E: >=50, F: <50
grade = np.array([97, 90, 72, 89, 50, 23])
np.where(grade >=90, "A", np.where(grade>=80, "B", np.where(grade>=70, "C", np.where(grade>=60, "D", np.where(grade>=50, "E","F")))))
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U1')
That was a long one. We simply use np.select:
conditions = [grade>=90, grade>=80, grade>=70, grade>=60, grade>=50, grade<50]
choices = 'A','B','C','D','E','F'
np.select(conditions, choices)
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U3')
Note that the else part, i.e. the very last condition, could be omitted and a default value passed to the np.select function, e.g.:
conditions = [grade>=90, grade>=80, grade>=70, grade>=60, grade>=50]
choices = 'ABCDE'
np.select(conditions, choices, 'F')
array(['A', 'A', 'C', 'B', 'E', 'F'], dtype='<U1')
np.in1d
Determines whether elements of one array are in the other. Note that the function is specifically named 1d as it deals with 1d arrays:
np.in1d([1,3,4,6], [2,5,7,6,3])
array([False,  True, False,  True])
np.unique
Among the top 3 most useful for data science. Determines the unique values in an array, their positions, their counts, etc.
arr1 = [1,2,2,2,1,1,1,3,3,6,3,2,2,4,4,4]
np.unique(arr1)
array([1, 2, 3, 4, 6])
np.unique(arr1, return_index = True)
(array([1, 2, 3, 4, 6]), array([ 0,  1,  7, 13,  9], dtype=int64))
np.unique(arr1, return_inverse = True)
(array([1, 2, 3, 4, 6]), array([0, 1, 1, 1, 0, 0, 0, 2, 2, 4, 2, 1, 1, 3, 3, 3], dtype=int64))
np.unique(arr1, return_counts = True)
(array([1, 2, 3, 4, 6]), array([4, 5, 3, 3, 1], dtype=int64))
np.unique(arr1, return_index = True, return_inverse = True, return_counts = True)
(array([1, 2, 3, 4, 6]),
 array([ 0,  1,  7, 13,  9], dtype=int64),
 array([0, 1, 1, 1, 0, 0, 0, 2, 2, 4, 2, 1, 1, 3, 3, 3], dtype=int64),
 array([4, 5, 3, 3, 1], dtype=int64))
np.diff
Returns the differences w.r.t. a certain order. Note that the first difference is the difference between the next point and the current point.
arr = np.array([[1,3,5,6,9],[2,9,7,4,10]])
np.diff(arr)
array([[ 2,  2,  1,  3],
       [ 7, -2, -3,  6]])
np.diff(arr, axis = 0)
array([[ 1,  6,  2, -2,  1]])
np.diff(arr, n=2)
array([[ 0, -1,  2],
       [-9, -1,  9]])
np.diff(arr, n = 2, axis = 0)
array([], shape=(0, 5), dtype=int32)
np.concatenate
Used to combine multiple arrays into one array.
a1 = np.arange(12).reshape(-1,3)
a2 = np.arange(13,25).reshape(-1,3)
a3 = np.array([1,2,3])
np.concatenate([a1, a2])
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [13, 14, 15],
       [16, 17, 18],
       [19, 20, 21],
       [22, 23, 24]])
np.concatenate([a1, a2], axis=1)
array([[ 0,  1,  2, 13, 14, 15],
       [ 3,  4,  5, 16, 17, 18],
       [ 6,  7,  8, 19, 20, 21],
       [ 9, 10, 11, 22, 23, 24]])
np.row_stack - Stacks the arrays row-wise
np.column_stack - Stacks the arrays column-wise
np.hstack - Stacks the arrays horizontally. Equal to np.column_stack IFF the arrays have more than one dimension
np.vstack - Stacks the arrays vertically. Equivalent to np.row_stack
Here is a link to showcase the differences about the 4 functions above.
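The difference between the stacking functions shows up most clearly on 1D inputs. The sketch below uses two made-up vectors x and y, plus two made-up 2D arrays p and q, purely for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

h = np.hstack([x, y])           # joins end to end: shape (6,)
v = np.vstack([x, y])           # each input becomes a row: shape (2, 3)
cs = np.column_stack([x, y])    # each input becomes a column: shape (3, 2)
print(h.shape, v.shape, cs.shape)

# With 2D inputs, hstack and column_stack give the same result.
p = np.arange(6).reshape(2, 3)
q = p + 10
print(np.array_equal(np.hstack([p, q]), np.column_stack([p, q])))
```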
np.r_
np.r_[a1, a2]
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [13, 14, 15],
       [16, 17, 18],
       [19, 20, 21],
       [22, 23, 24]])
np.r_[0, 1:10, a3]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3])
np.c_
np.c_[a1, a2]
array([[ 0,  1,  2, 13, 14, 15],
       [ 3,  4,  5, 16, 17, 18],
       [ 6,  7,  8, 19, 20, 21],
       [ 9, 10, 11, 22, 23, 24]])
np.append
Only works with 2 input arrays.
np.append(a1, a2)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])
np.append(a1, a2, axis = 1)
array([[ 0,  1,  2, 13, 14, 15],
       [ 3,  4,  5, 16, 17, 18],
       [ 6,  7,  8, 19, 20, 21],
       [ 9, 10, 11, 22, 23, 24]])
Matrix Operations/ functions
The best way to work with arrays is to use matrix operations. These operations allow us to easily manipulate data. Some of the functions include:
T/transpose
Transpose your matrix/array. transpose is the more general of the two, as it accepts the axes to be transposed.
a1.T
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
a1.transpose()
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
@
Used for matrix multiplication / inner product for vectors.
a3 @ a3
14
a3.dot(a3)
14
a2 @ a3
array([ 86, 104, 122, 140])
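As a sketch of why transpose is called generic, the example below passes an explicit axis order on a 3D array and also checks the shapes that @ produces. The array named cube is made up for this illustration; a1 and a3 are redefined with the same values as earlier so the block runs on its own.

```python
import numpy as np

a1 = np.arange(12).reshape(-1, 3)   # shape (4, 3)
a3 = np.array([1, 2, 3])            # shape (3,)

# For a 2D array, .T is the same as swapping the two axes explicitly.
print(np.array_equal(a1.T, a1.transpose(1, 0)))

# With more dimensions, transpose lets us choose the new axis order.
cube = np.arange(24).reshape(2, 3, 4)
swapped = cube.transpose(0, 2, 1)   # keep axis 0, swap the last two
print(swapped.shape)                # (2, 4, 3)

# @ follows matrix rules: the inner dimensions must agree.
gram = a1.T @ a1                    # (3, 4) @ (4, 3) gives (3, 3)
proj = a1 @ a3                      # (4, 3) @ (3,) gives (4,)
print(gram.shape, proj)
```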
Universal functions
Suppose you have 3 students who did different exams. You are asked to find the average grade of each student, e.g.
students = [1, 1, 2, 2, 1, 1, 3, 3, 2]
grades = [90, 70, 87, 84, 65, 87, 98, 99, 86]
Note that each grade corresponds to a student. To visualize, we have:
student 1: 90, 70, 65, 87
student 2: 87, 84, 86
student 3: 98, 99
How shall we go about this? So far, we know that numpy arrays only hold rectangular data, i.e. data in which every row has the same length.
We could revert to for-loops or start looking for other ways to solve the problem. Luckily, numpy provides a way out. So far we have only been passing the axis into the functions, but there are other arguments that can be passed in.
We could create a matrix but also have indices to indicate those elements that we are interested in:
grades_mat = np.array([[90,70,65,87],
                       [87,84,86,np.nan],
                       [98,99,np.nan,np.nan]])
np.nanmean(grades_mat,1) # average per student
array([78.        , 85.66666667, 98.5       ])
grades_mat.mean(1, where = ~np.isnan(grades_mat)) # average per student
array([78.        , 85.66666667, 98.5       ])
This method would require us to manually manipulate the data, add the nans then solve the problem. That will be tedious. Is there a way to directly use the data given? Yes:
students = np.array(students)
grades = np.array(grades)
sort_index = np.argsort(students)
sorted_grades = grades[sort_index]
_, idx, counts = np.unique(students[sort_index], return_index = True, return_counts = True)
np.add.reduceat(sorted_grades, idx)/counts
array([78.        , 85.66666667, 98.5       ])
The second method is quite intriguing. We did not have to manually structure our data in a certain way; we just had to use the reduceat method provided by the add function.
Note that if we want the minimum or maximum per student, we use the reduceat provided by the universal functions minimum and maximum. The rest of the code remains the same:
np.minimum.reduceat(sorted_grades, idx)
array([65, 84, 98])
np.maximum.reduceat(sorted_grades, idx)
array([90, 87, 99])
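To see what reduceat is doing, it may help to spell out the slices it reduces over. The sketch below rebuilds the same inputs and compares reduceat with explicit slicing; kind="stable" and the names order, bounds and manual_sums are assumptions added here so the example is reproducible and self-contained.

```python
import numpy as np

students = np.array([1, 1, 2, 2, 1, 1, 3, 3, 2])
grades = np.array([90, 70, 87, 84, 65, 87, 98, 99, 86])

# Group each student's grades together (stable sort keeps their original order).
order = np.argsort(students, kind="stable")
sorted_grades = grades[order]

# idx holds the position where each student's block starts.
_, idx = np.unique(students[order], return_index=True)
print(idx)   # [0 4 7]

# reduceat applies the ufunc to sorted_grades[idx[i]:idx[i+1]],
# with the last block running to the end of the array.
bounds = list(idx) + [len(sorted_grades)]
manual_sums = np.array([sorted_grades[lo:hi].sum()
                        for lo, hi in zip(bounds[:-1], bounds[1:])])
ufunc_sums = np.add.reduceat(sorted_grades, idx)
print(manual_sums, ufunc_sums)
```

Both approaches give the per-student totals; reduceat simply does the slicing and reducing in one vectorized call.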
Extra:
Note that we could do the same using plain Python, without numpy:
[sum(vec:=[grade for stud, grade in zip(students, grades) if stud ==i ])/len(vec) for i in set(students)]
[78.0, 85.66666666666667, 98.5]
Note how we used the built-in universal functions. We could also write our own function and vectorize it.
Ways to vectorize a function:
np.vectorize
np.frompyfunc
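As a rough sketch of both options, the example below wraps an ordinary scalar function so that it accepts arrays. The function letter_grade is made up for this illustration and reuses the grading scale from the np.select example. Note that np.vectorize is mainly a convenience: it still loops in Python under the hood.

```python
import numpy as np

def letter_grade(score):
    """Scalar function: map one numeric score to a letter grade."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    if score >= 50:
        return "E"
    return "F"

grade = np.array([97, 90, 72, 89, 50, 23])

# np.vectorize wraps the scalar function so it applies element by element.
vec_grade = np.vectorize(letter_grade)
letters = vec_grade(grade)

# np.frompyfunc needs the number of inputs and outputs (1 and 1 here)
# and always returns an array of dtype object.
ufunc_grade = np.frompyfunc(letter_grade, 1, 1)
letters_obj = ufunc_grade(grade)

print(letters)
print(letters_obj.dtype)
```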
Note that at times we are just interested to apply the function along a given axis or over some axes. Use the functions:
np.apply_along_axis
np.apply_over_axes
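A minimal sketch of the two functions follows. The helper value_range and the array cube are made up for this illustration; the 2D array reuses the values from the Methods section.

```python
import numpy as np

arr = np.array([[1, -1, 2, 3, 5], [4, -8, -3, 0, 5]])

def value_range(v):
    """Made-up helper: spread of a 1D slice (max minus min)."""
    return v.max() - v.min()

# apply_along_axis hands 1D slices taken along the given axis to the function.
per_row = np.apply_along_axis(value_range, 1, arr)   # one value per row
per_col = np.apply_along_axis(value_range, 0, arr)   # one value per column
print(per_row, per_col)

# apply_over_axes applies a reducing function over several axes in turn
# and keeps the reduced axes with length 1.
cube = np.arange(24).reshape(2, 3, 4)
totals = np.apply_over_axes(np.sum, cube, [0, 2])
print(totals.shape)   # (1, 3, 1)
```

For simple reductions like these, the built-in methods with an axis argument are usually the better choice; these helpers are for functions that have no axis argument of their own.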
Time won't allow me to talk about loading data into Python using numpy, dealing with character arrays, rolling windows/stride tricks using the np.lib.stride_tricks module, padding arrays using the np.lib.arraypad module, convolutions, etc.
There is still a lot to learn from this package; we haven't even scratched the surface. Once you grasp what is happening and can respond to problems, you are ready for DATA SCIENCE. Are you ready?
Options:
Move directly to Data Science, i.e. scipy, sklearn, statsmodels. You need to learn the np.linalg module.
Move to Data Analytics, i.e. pandas, json, sql, pyspark, siuba. Pandas is an extension of numpy for data frames, so no extra numpy knowledge is needed. Pandas provides easy ways to solve problems.
Both of the above still require Data Visualization, using matplotlib and seaborn. This is something you can easily learn.
Question
https://stackoverflow.com/questions/77262300/how-do-i-filter-on-multiple-criteria-in-group-by
https://stackoverflow.com/questions/77260897/r-how-to-do-the-rowwise-mutate-operation
https://stackoverflow.com/questions/77262541/loop-combination-two-columns-sums-in-r
https://stackoverflow.com/questions/77262398/update-values-in-df-after-groupby-and-get-group