Python for ML
Basic Python reference useful for ML
- Python Collections
- Python Conditions
- NumPy
- Pandas
- Creating a Series using Pandas
- Using an Index
- DataFrames
- Selection and Indexing
- Adding a new column to the DataFrame
- Removing Columns from DataFrame
- Dropping rows using axis=0
- Selecting Rows
- Selecting by index position instead of label: use iloc instead of loc
- Selecting subset of rows and columns using loc function
- Conditional Selection
- More indexing features
- Missing Data
- Groupby
- groupby function for each aggregation
- combining DataFrames together:
- Concatenation
- Merging
- Operations
- Permanently Removing a Column
- Get column and index names
- Sorting and Ordering a DataFrame
- Find Null Values or Check for Null Values
- Filling in NaN values with something else
- Data Input and Output
Collection Types:
1) List is a collection which is ordered and changeable. Allows duplicate members
2) Tuple is a collection which is ordered and unchangeable. Allows duplicate members
3) Set is a collection which is unordered and unindexed. No duplicate members
4) Dictionary is a collection of key-value pairs which is changeable and indexed by key. It preserves insertion order as of Python 3.7. No duplicate keys.
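The properties listed above can be checked side by side. This is a minimal sketch; the variable names (`fruits_list`, `fruits_tuple`, `fruits_set`, `fruit_prices`) are illustrative and do not appear elsewhere in these notes.

```python
# Illustrative names; none of these variables come from the original notes.
fruits_list = ["apple", "banana", "apple"]    # ordered, changeable, duplicates kept
fruits_tuple = ("apple", "banana", "apple")   # ordered, unchangeable, duplicates kept
fruits_set = {"apple", "banana", "apple"}     # unindexed, duplicates collapsed
fruit_prices = {"apple": 1, "banana": 2}      # values looked up by unique key

fruits_list[0] = "cherry"                     # lists can be modified in place

try:
    fruits_tuple[0] = "cherry"                # tuples reject item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(fruits_list)             # ['cherry', 'banana', 'apple']
print(tuple_changed)           # False
print(len(fruits_set))         # 2
print(fruit_prices["banana"])  # 2
```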
list1 = ["apple", "grapes", "banana"]  # named list1 so the built-in list type is not shadowed
print(list1)
print(list1[1]) #access the list items by referring to the index number
print(list1[-1]) #Negative indexing means beginning from the end, -1 refers to the last item
list2 = ["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]
print(list2[:4]) #By leaving out the start value, the range will start at the first item
print(list2[2:])
print(list2[-4:-1]) #range
list3 = ["A", "B", "C"]
list3[1] = "D" #change the value of a specific item by referring to the index number
print(list3)
# For loop
list4 = ["apple", "banana", "cherry"]
for x in list4:
    print(x)
#To determine if a specified item is present in a list
if "apple" in list4:
    print("Yes")
#To determine how many items a list has
print(len(list4))
List Methods:
- append() : Adds an element at the end of the list
- clear() : Removes all the elements from the list
- copy() : Returns a copy of the list
- count() : Returns the number of elements with the specified value
- extend() : Add the elements of a list (or any iterable), to the end of the current list
- index() : Returns the index of the first element with the specified value
- insert() : Adds an element at the specified position
- pop() : Removes the element at the specified position
- remove() : Removes the item with the specified value
- reverse() : Reverses the order of the list
- sort() : Sorts the list
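Several of the methods listed above (count, index, sort, reverse, pop) work in place or return a value. A minimal sketch, using a hypothetical list named `nums`:

```python
# Illustrative data; the name `nums` is hypothetical.
nums = [3, 1, 2, 3]

print(nums.count(3))   # 2: the value 3 appears twice
print(nums.index(3))   # 0: position of the first 3

nums.sort()            # sorts in place and returns None
print(nums)            # [1, 2, 3, 3]

nums.reverse()         # reverses in place and returns None
print(nums)            # [3, 3, 2, 1]

popped = nums.pop(1)   # removes and returns the item at index 1
print(popped, nums)    # 3 [3, 2, 1]
```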
#append() method to append an item
list4.append("orange")
print(list4)
#Insert an item as the second position
list4.insert(1, "orange")
print(list4)
#The remove() method removes the specified item
list4.remove("banana")
print(list4)
#pop() method removes the specified index
#and the last item if index is not specified
list4.pop()
print(list4)
#The del keyword removes the specified index
del list4[0]
print(list4)
#The del keyword can also delete the list completely
del list4
#print(list4)  # would raise NameError because list4 no longer exists
#The clear() method empties the list
list5 = ["apple", "banana", "cherry"]
list5.clear()
print(list5)
#the copy() method to make a copy of a list
list5 = ["apple", "banana", "cherry"]
mylist = list5.copy()
print(mylist)
#Join Two Lists
list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]
list3 = list1 + list2
print(list3)
#Append list2 into list1
list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]
for x in list2:
    list1.append(x)
print(list1)
#the extend() method to add list2 at the end of list1
list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]
list1.extend(list2)
print(list1)
tuple1 = ("apple", "banana", "cherry")
print(tuple1)
#access tuple item
print(tuple1[1])
#Negative indexing means beginning from the end, -1 refers to the last item
print(tuple1[-1])
#Range : Return the third, fourth, and fifth item
tuple2 = ("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")
print(tuple2[2:5])
#Specify negative indexes if you want to start the search from the end of the tuple
print(tuple2[-4:-1])
#loop through the tuple items by using a for loop
tuple3 = ("apple", "banana", "cherry")
for x in tuple3:
    print(x)
#Check if Item Exists
if "apple" in tuple3:
    print("Yes")
#Print the number of items in the tuple
print(len(tuple3))
# join two or more tuples you can use the + operator
tuple1 = ("a", "b" , "c")
tuple2 = (1, 2, 3)
tuple3 = tuple1 + tuple2
print(tuple3)
#Using the tuple() method to make a tuple
thistuple = tuple(("apple", "banana", "cherry")) # note the double round-brackets
print(thistuple)
set1 = {"apple", "banana", "cherry"}
print(set1)
#Access items, Loop through the set, and print the values
for x in set1:
    print(x)
if "apple" in set1:
    print("Yes")
Set methods:
- add() Adds an element to the set
- clear() Removes all the elements from the set
- copy() Returns a copy of the set
- difference() Returns a set containing the difference between two or more sets
- difference_update() Removes the items in this set that are also included in another, specified set
- discard() Remove the specified item
- intersection() Returns a set, that is the intersection of two other sets
- intersection_update() Removes the items in this set that are not present in other, specified set(s)
- isdisjoint() Returns True if the two sets have no elements in common
- issubset() Returns whether another set contains this set or not
- issuperset() Returns whether this set contains another set or not
- pop() Removes an element from the set
- remove() Removes the specified element
- symmetric_difference() Returns a set with the symmetric differences of two sets
- symmetric_difference_update() inserts the symmetric differences from this set and another
- union() Return a set containing the union of sets
- update() Update the set with the union of this set and others
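The set-algebra methods listed above are not exercised in the cells that follow, so here is a minimal sketch of them. The sets `left` and `right` are hypothetical; results are printed through `sorted()` because sets have no guaranteed display order.

```python
# Illustrative sets; the names `left` and `right` are hypothetical.
left = {1, 2, 3, 4}
right = {3, 4, 5}

only_left = left.difference(right)                  # items in left but not in right
common = left.intersection(right)                   # items in both sets
either_not_both = left.symmetric_difference(right)  # items in exactly one set
combined = left.union(right)                        # items in either set

print(sorted(only_left))         # [1, 2]
print(sorted(common))            # [3, 4]
print(sorted(either_not_both))   # [1, 2, 5]
print(sorted(combined))          # [1, 2, 3, 4, 5]
print({3, 4}.issubset(left))     # True
print(left.isdisjoint({9, 10}))  # True: no shared elements

left.discard(99)                 # discard() silently ignores a missing item
try:
    left.remove(99)              # remove() raises KeyError for a missing item
    remove_failed = False
except KeyError:
    remove_failed = True
print(remove_failed)             # True
```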
# Adding new items
set1.add("orange")
print(set1)
#Add multiple items to a set, using the update() method
set1.update(["orange", "mango", "grapes"])
print(set1)
# length of the set
print(len(set1))
# remove item
set1.remove("banana")
print(set1)
#pop() removes and returns an arbitrary item, because sets are unordered
set2 = {"apple", "banana", "cherry"}
x = set2.pop()
print(x)
print(set2)
#clear() method empties the set
thisset = {"apple", "banana", "cherry"}
thisset.clear()
print(thisset)
#del keyword will delete the set completely
thisset = {"apple", "banana", "cherry"}
del thisset
#print(thisset)  # would raise NameError because thisset no longer exists
#use the union() method that returns a new set containing all items from both sets,
#or the update() method that inserts all the items from one set into another
set1 = {"a", "b" , "c"}
set2 = {1, 2, 3}
set3 = set1.union(set2)
print(set3)
#update() method inserts the items in set2 into set1
set1 = {"a", "b" , "c"}
set2 = {1, 2, 3}
set1.update(set2)
print(set1)
dict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(dict)
#access the items of a dictionary by referring to its key name, inside square brackets
dict["model"]
Dict methods
- clear() Removes all the elements from the dictionary
- copy() Returns a copy of the dictionary
- fromkeys() Returns a dictionary with the specified keys and value
- get() Returns the value of the specified key
- items() Returns a view object containing a tuple for each key-value pair
- keys() Returns a view object containing the dictionary's keys
- pop() Removes the element with the specified key
- popitem() Removes the last inserted key-value pair
- setdefault() Returns the value of the specified key. If the key does not exist: insert the key, with the specified value
- update() Updates the dictionary with the specified key-value pairs
- values() Returns a view object of all the values in the dictionary
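A few of the methods above behave in ways that are easy to misremember, notably get() with a default, setdefault(), update() and fromkeys(). The following is a minimal sketch; the names `car`, `keys_view` and `blank` are hypothetical.

```python
# Illustrative dictionary; the names `car`, `keys_view` and `blank` are hypothetical.
car = {"brand": "Ford", "model": "Mustang"}

print(car.get("color"))             # None: get() does not raise for a missing key
print(car.get("color", "unknown"))  # unknown: the optional second argument is a default

car.setdefault("year", 1964)        # "year" is missing, so it is inserted
car.setdefault("brand", "Tesla")    # "brand" exists, so its value stays "Ford"
car.update({"color": "red", "year": 2018})  # adds "color", overwrites "year"
print(car)  # {'brand': 'Ford', 'model': 'Mustang', 'year': 2018, 'color': 'red'}

keys_view = car.keys()              # a live view of the keys, not a snapshot
car["owner"] = "Ann"
print("owner" in keys_view)         # True: the view reflects the later insert

blank = dict.fromkeys(["x", "y"], 0)
print(blank)                        # {'x': 0, 'y': 0}
```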
#use get() to get the same result
dict.get("model")
#change the value of a specific item by referring to its key name
dict1 = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
dict1["year"] = 2018
print(dict1)
#loop through a dictionary by using a for loop
for x in dict1:
    print(x)
#Print all values in the dictionary, one by one
for x in dict1:
    print(dict1[x])
#use the values() method to return values of a dictionary
for x in dict1.values():
    print(x)
#Loop through both keys and values, by using the items() method
for x, y in dict1.items():
    print(x, y)
#Check if a key is present in the dictionary
if "model" in dict1:
    print("Yes")
print(len(dict1))
#adding items
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict["color"] = "red"
print(thisdict)
#pop() method removes the item with the specified key name
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict.pop("model")
print(thisdict)
# popitem() method removes the last inserted item
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict.popitem()
print(thisdict)
#del keyword removes the item with the specified key name
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
del thisdict["model"]
print(thisdict)
#dictionary can also contain many dictionaries, this is called nested dictionaries
myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}
#Create three dictionaries, then create one dictionary that will contain the other three dictionaries
child1 = {
"name" : "Emil",
"year" : 2004
}
child2 = {
"name" : "Tobias",
"year" : 2007
}
child3 = {
"name" : "Linus",
"year" : 2011
}
myfamily = {
"child1" : child1,
"child2" : child2,
"child3" : child3
}
a = 100
b = 200
if b > a:
    print("b is greater than a")
#simplified:
a = 100
b = 200
if a < b: print("b is greater than a")
a = 20
b = 20
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
a = 200
b = 100
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")
# simplified:
a = 100
b = 300
print("A") if a > b else print("B")
a = 200
b = 33
c = 500
if a > b and c > a:
    print("Both conditions are True")
a = 200
b = 33
c = 500
if a > b or a > c:
    print("At least one of the conditions is True")
x = 41
if x > 10:
    print("Above ten,")
    if x > 20:
        print("and also above 20!")
    else:
        print("but not above 20.")
#if statements cannot be empty, but if you for some reason have an if statement
#with no content, put in the pass statement to avoid getting an error
a = 33
b = 200
if b > a:
    pass
i = 1
while i < 6:
    print(i)
    i += 1
i = 1
while i < 6:
    print(i)
    if i == 3:
        break
    i += 1
# with Continue
i = 0
while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)
### Else statement
i = 1
while i < 6:
    print(i)
    i += 1
else:
    print("i is no longer less than 6")
# For loop for List
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
# strings
for x in "banana":
    print(x)
#break statement
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
    if x == "banana":
        break
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    if x == "banana":
        break
    print(x)
#continue
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    if x == "banana":
        continue
    print(x)
# Range
for x in range(6):
    print(x)
for x in range(2, 6):
    print(x)
for x in range(2, 30, 3):
    print(x)
for x in range(6):
    print(x)
else:
    print("Finally finished!")
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
for x in adj:
    for y in fruits:
        print(x, y)
for x in [0, 1, 2]:
    pass
def my_function():
    print("Hello")
my_function()
def my_function(*kids):
    print("The youngest child is " + kids[2])
my_function("Emil", "Tobias", "Linus")
def my_function(child3, child2, child1):
    print("The youngest child is " + child3)
my_function(child1 = "Emil", child2 = "Tobias", child3 = "Linus")
#Passing a List as an Argument
def my_function(food):
    for x in food:
        print(x)
fruits = ["apple", "banana", "cherry"]
my_function(fruits)
#return value
def my_function(x):
    return 5 * x
print(my_function(3))
#Recursion Example
def tri_recursion(k):
    if k > 0:
        result = k + tri_recursion(k - 1)
        print(result)
    else:
        result = 0
    return result
print("\n\nRecursion Example Results")
tri_recursion(6)
x = lambda a, b, c : a + b + c
print(x(5, 6, 2))
def myfunc(n):
    return lambda a : a * n
def myfunc(n):
    return lambda a : a * n
mydoubler = myfunc(2)
print(mydoubler(11))
def myfunc(n):
    return lambda a : a * n
mydoubler = myfunc(2)
mytripler = myfunc(3)
print(mydoubler(11))
print(mytripler(11))
#f = open("demofile.txt", "r")
#print(f.read())
#f = open("D:\\myfiles\\welcome.txt", "r")
#print(f.read())
#Read one line of the file
#f = open("demofile.txt", "r")
#print(f.readline())
#Loop through the file line by line
#f = open("demofile.txt", "r")
#for x in f:
# print(x)
#Close the file when you are finished with it
#f = open("demofile.txt", "r")
#print(f.readline())
#f.close()
#Open the file "demofile2.txt" and append content to the file
#f = open("demofile2.txt", "a")
#f.write("Now the file has more content!")
#f.close()
#open and read the file after the appending:
#f = open("demofile2.txt", "r")
#print(f.read())
#Open the file "demofile3.txt" and overwrite the content
#f = open("demofile3.txt", "w")
#f.write("Woops! I have deleted the content!")
#f.close()
#open and read the file after the overwriting:
#f = open("demofile3.txt", "r")
#print(f.read())
#Create a file called "myfile.txt"
#f = open("myfile.txt", "x")
#Remove the file "demofile.txt"
#import os
#os.remove("demofile.txt")
#Check if file exists, then delete it:
#import os
#if os.path.exists("demofile.txt"):
# os.remove("demofile.txt")
#else:
# print("The file does not exist")
#Try to open and write to a file that is not writable:
#try:
# f = open("demofile.txt")
# f.write("Lorum Ipsum")
#except:
# print("Something went wrong when writing to the file")
#finally:
# f.close()
#Raise an error and stop the program if x is lower than 0:
#x = -1
#if x < 0:
# raise Exception("Sorry, no numbers below zero")
#Raise a TypeError if x is not an integer:
#x = "hello"
#if not type(x) is int:
# raise TypeError("Only integers are allowed")
import numpy as np
simple_list = [1,2,3]
np.array(simple_list)
list_of_lists = [[1,2,3], [4,5,6], [7,8,9]]
np.array(list_of_lists)
np.arange(0,10)
np.arange(0,21,5)
np.zeros(50)
np.ones((4,5))
np.linspace(0,20,10)
np.eye(5)
np.random.rand(3,2)
np.random.randint(5,20,10)
np.arange(30)
np.random.randint(0,100,20)
sample_array = np.arange(30)
sample_array.reshape(5,6)
rand_array = np.random.randint(0,100,20)
rand_array.argmin()
sample_array.shape
sample_array.reshape(1,30)
sample_array.reshape(30,1)
sample_array.dtype
a = np.random.randn(2,3)
a.T
sample_array = np.arange(10,21)
sample_array
sample_array[[2,5]]
sample_array[1:2] = 100
sample_array
sample_array = np.arange(10,21)
sample_array[0:7]
sample_array = np.arange(10,21)
sample_array
subset_sample_array = sample_array[0:7]
subset_sample_array
subset_sample_array[:]=1001
subset_sample_array
sample_array
copy_sample_array = sample_array.copy()
copy_sample_array
copy_sample_array[:]=10
copy_sample_array
sample_array
sample_matrix = np.array(([50,20,1,23], [24,23,21,32], [76,54,32,12], [98,6,4,3]))
sample_matrix
sample_matrix[0][3]
sample_matrix[0,3]
sample_matrix[3,:]
sample_matrix[3]
sample_matrix = np.array(([50,20,1,23,34], [24,23,21,32,34], [76,54,32,12,98], [98,6,4,3,67], [12,23,34,56,67]))
sample_matrix
sample_matrix[:,[1,3]]
sample_matrix[:,(3,1)]
sample_array=np.arange(1,31)
sample_array
mask = sample_array < 10  # named mask so the built-in bool type is not shadowed
sample_array[mask]
sample_array[sample_array <10]
a=11
sample_array[sample_array < a]
sample_array + sample_array
sample_array / sample_array
10/sample_array
sample_array + 1
np.var(sample_array)
array = np.random.randn(6,6)
array
np.std(array)
np.mean(array)
sports = np.array(['golf', 'cric', 'fball', 'cric', 'Cric', 'fooseball'])
np.unique(sports)
sample_array
simple_array = np.arange(0,20)
simple_array
np.save('sample_array', sample_array)
np.savez('2_arrays.npz', a=sample_array, b=simple_array)
np.load('sample_array.npy')
archive = np.load('2_arrays.npz')
archive['b']
np.savetxt('text_file.txt', sample_array,delimiter=',')
np.loadtxt('text_file.txt', delimiter=',')
data = {'prodID': ['101', '102', '103', '104', '104'],
'prodname': ['X', 'Y', 'Z', 'X', 'W'],
'profit': ['2738', '2727', '3497', '7347', '3743']}
import pandas as pd
score = [10, 15, 20, 25]
pd.Series(data=score, index = ['a','b','c','d'])
demo_matrix = np.array(([13,35,74,48], [23,37,37,38], [73,39,93,39]))
demo_matrix
demo_matrix[2,3]
np.arange(0,22,6)
demo_array=np.arange(0,10)
demo_array
demo_array <3
demo_array[demo_array <6]
np.max(demo_array)
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])  # pass the Series as separate list items; s1+s2 would join the strings element-wise instead
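labels = ['w','x','y','z']
my_list = [10,20,30,40]  # named my_list so the built-in list type is not shadowed
array = np.array([10,20,30,40])
my_dict = {'w':10,'x':20,'y':30,'z':40}  # named my_dict so the built-in dict type is not shadowed
pd.Series(data=my_list)
pd.Series(data=my_list,index=labels)
pd.Series(my_list,labels)
pd.Series(array)
pd.Series(array,labels)
pd.Series(my_dict)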
sports1 = pd.Series([1,2,3,4],index = ['Cricket', 'Football','Basketball', 'Golf'])
sports1
sports2 = pd.Series([1,2,5,4],index = ['Cricket', 'Football','Baseball', 'Golf'])
sports2
sports1 + sports2
from numpy.random import randn
np.random.seed(1)
dataframe = pd.DataFrame(randn(10,5),index='A B C D E F G H I J'.split(),columns='Score1 Score2 Score3 Score4 Score5'.split())
dataframe
dataframe['Score3']
# Pass a list of column names in any order necessary
dataframe[['Score2','Score1']]
#DataFrame Columns are nothing but a Series each
type(dataframe['Score1'])
dataframe['Score6'] = dataframe['Score1'] + dataframe['Score2']
dataframe
# Use axis=0 for dropping rows and axis=1 for dropping columns
dataframe.drop('Score6',axis=1)
# The column is not removed from dataframe itself unless inplace=True is passed
dataframe
dataframe.drop('Score6',axis=1,inplace=True)
dataframe
# Likewise, a row is removed from dataframe itself only if inplace=True is passed
dataframe.drop('A',axis=0)
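The effect of `inplace` described in the comments above can be checked on a tiny frame. This is a minimal sketch; `df`, `dropped` and `result` are illustrative names, not part of the original notes.

```python
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

dropped = df.drop("b", axis=1)    # returns a new DataFrame; df is untouched
print(list(df.columns))           # ['a', 'b']
print(list(dropped.columns))      # ['a']

result = df.drop("b", axis=1, inplace=True)  # modifies df itself and returns None
print(result)                     # None
print(list(df.columns))           # ['a']
```

Reassigning the result (`df = df.drop(...)`) is a common alternative to `inplace=True` and has the same visible effect here.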
dataframe.loc['F']
dataframe.iloc[2]
dataframe.loc['A','Score1']
dataframe.loc[['A','B'],['Score1','Score2']]
dataframe>0.5
dataframe[dataframe>0.5]
dataframe[dataframe['Score1']>0.5]
dataframe[dataframe['Score1']>0.5]['Score2']
dataframe[dataframe['Score1']>0.5][['Score2','Score3']]
# Reset to default index value instead of A to J
dataframe.reset_index()
# Setting new index value
newindex = 'IND JP CAN GE IT PL FY IU RT IP'.split()
dataframe['Countries'] = newindex
dataframe
dataframe.set_index('Countries')
# Once again, pass inplace=True to change dataframe itself
dataframe
dataframe.set_index('Countries',inplace=True)
dataframe
dataframe = pd.DataFrame({'Cricket':[1,2,np.nan,4,6,7,2,np.nan],
'Baseball':[5,np.nan,np.nan,5,7,2,4,5],
'Tennis':[1,2,3,4,5,6,7,8]})
dataframe
dataframe.dropna()
# Use axis=1 for dropping columns with nan values
dataframe.dropna(axis=1)
dataframe.dropna(thresh=2)
dataframe.fillna(value=0)
dataframe['Baseball'].fillna(value=dataframe['Baseball'].mean())
dat = {'CustID':['1001','1001','1002','1002','1003','1003'],
'CustName':['UIPat','DatRob','Goog','Chrysler','Ford','GM'],
'Profitinlakhs':[2005,3245,1245,8765,5463,3547]}
dataframe = pd.DataFrame(dat)
dataframe
We can now use the .groupby() method to group rows together based on a column name. For example let's group based on CustID. This will create a DataFrameGroupBy object:
dataframe.groupby('CustID') #This object can be saved as a variable
CustID_grouped = dataframe.groupby("CustID") #Now we can aggregate using the variable
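CustID_grouped.mean(numeric_only=True)  # numeric_only=True skips the text column CustName (needed in pandas 2.0+)
dataframe.groupby('CustID').mean(numeric_only=True)
CustID_grouped.std(numeric_only=True)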
CustID_grouped.min()
CustID_grouped.max()
CustID_grouped.count()
CustID_grouped.describe()
CustID_grouped.describe().transpose()
CustID_grouped.describe().transpose()['1001']
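The separate aggregation calls above can also be combined into one step. The sketch below rebuilds the same toy data so it runs on its own; the names `sales` and `summary` are hypothetical, and selecting the numeric column before calling `agg` is one possible approach rather than the only one.

```python
import pandas as pd

# Same toy data as above, rebuilt so the block is self-contained.
sales = pd.DataFrame({
    "CustID": ["1001", "1001", "1002", "1002", "1003", "1003"],
    "CustName": ["UIPat", "DatRob", "Goog", "Chrysler", "Ford", "GM"],
    "Profitinlakhs": [2005, 3245, 1245, 8765, 5463, 3547],
})

# Select the numeric column first, then compute several aggregations per group.
summary = sales.groupby("CustID")["Profitinlakhs"].agg(["sum", "mean", "max"])
print(summary)

# Individual results can be read back by group label and aggregation name.
print(summary.loc["1001", "sum"])   # 5250
print(summary.loc["1002", "max"])   # 8765
```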
dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
'Sales': [13456, 45321, 54385, 53212],
'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3'],
'Prime': ['yes', 'no', 'no', 'yes']},
index=[0, 1, 2, 3])
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
'Sales': [13456, 54385, 53212, 4534],
'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7'],
'Imp': ['yes', 'no', 'no', 'no']},
index=[4, 5, 6, 7])
dafa3 = pd.DataFrame({'CustID': ['101', '104', '105', '106'],
'Sales': [13456, 53212, 4534, 3241],
'Pol': ['CAT8', 'CAT9', 'CAT10', 'CAT11'],
'Level': ['yes', 'no', 'no', 'yes']},
index=[8, 9, 10, 11])
pd.concat([dafa1,dafa2])
pd.concat([dafa1,dafa2,dafa3],axis=1)
pd.merge(dafa1,dafa2,how='outer',on='CustID')
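The `how` argument used in the merge above controls which keys survive. A minimal sketch comparing three options on two small hypothetical frames (`left_df` and `right_df` are illustrative names):

```python
import pandas as pd

# Hypothetical frames sharing the key column CustID.
left_df = pd.DataFrame({"CustID": ["101", "102", "103"], "Sales": [10, 20, 30]})
right_df = pd.DataFrame({"CustID": ["101", "103", "105"], "Prime": ["yes", "no", "no"]})

inner = pd.merge(left_df, right_df, how="inner", on="CustID")      # keys present in both frames
outer = pd.merge(left_df, right_df, how="outer", on="CustID")      # keys present in either frame
left_join = pd.merge(left_df, right_df, how="left", on="CustID")   # every key from left_df

print(len(inner), len(outer), len(left_join))  # 2 4 3
```

Rows without a match on the other side are filled with NaN in the outer and left results.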
dataframe = pd.DataFrame({'custID':[1,2,3,4],'SaleType':['big','small','medium','big'],'SalesCode':['121','131','141','151']})
dataframe.head()
Info on Unique Values
dataframe['SaleType'].unique()
dataframe['SaleType'].nunique()
dataframe['SaleType'].value_counts()
Selecting Data
#Select from DataFrame using criteria from multiple columns
newdataframe = dataframe[(dataframe['custID']!=3) & (dataframe['SaleType']=='big')]
newdataframe
Applying Functions
def profit(a):
    return a*4
dataframe['custID'].apply(profit)
dataframe['SaleType'].apply(len)
dataframe['custID'].sum()
dataframe
del dataframe['custID']
dataframe
dataframe.columns
dataframe.index
dataframe.sort_values(by='SaleType') #inplace=False by default
dataframe.isnull()
# Drop rows with NaN Values
dataframe.dropna()
dataframe = pd.DataFrame({'Sale1':[5,np.nan,10,np.nan],
'Sale2':[np.nan,121,np.nan,141],
'Sale3':['XUI','VYU','NMA','IUY']})
dataframe.head()
dataframe.fillna('Not nan')
CSV Input
# dataframe = pd.read_csv('filename.csv')
CSV output
#If index=False then the csv does not store index values
# dataframe.to_csv('filename.csv',index=False)
Excel Input
# pd.read_excel('filename.xlsx',sheet_name='Data1')
Excel Output
# dataframe.to_excel('Consumer2.xlsx',sheet_name='Sheet1')