Python Collections

Collection Types:

1) List is a collection which is ordered and changeable. Allows duplicate members.

2) Tuple is a collection which is ordered and unchangeable. Allows duplicate members.

3) Set is a collection which is unordered and unindexed. No duplicate members.

4) Dictionary is a collection which is changeable and indexed by key. Since Python 3.7 it keeps insertion order. No duplicate keys.
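The four descriptions above can be checked directly. The following is a minimal sketch; the variable names and values are illustrative and not part of the original notes.

```python
# Illustrative values only; these names do not appear elsewhere in the notes.
fruits_list = ["apple", "apple", "banana"]            # list: ordered, changeable, keeps duplicates
fruits_tuple = ("apple", "apple", "banana")           # tuple: ordered, unchangeable, keeps duplicates
fruits_set = {"apple", "apple", "banana"}             # set: duplicates collapse into one member
fruits_dict = {"apple": 1, "banana": 2, "apple": 3}   # dict: a repeated key keeps the last value

fruits_list[0] = "cherry"           # allowed, lists are changeable

try:
    fruits_tuple[0] = "cherry"      # not allowed, tuples are unchangeable
except TypeError as err:
    tuple_error = str(err)

print(len(fruits_list), len(fruits_tuple), len(fruits_set), len(fruits_dict))
print(fruits_dict)
```

The set and the dictionary both end up with two entries, while the list and the tuple keep all three.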

1) List

list1 = ["apple", "grapes", "banana"] #named list1 so that the built-in list type is not shadowed
print(list1)
['apple', 'grapes', 'banana']
print(list1[1]) #access the list items by referring to the index number
grapes
print(list1[-1]) #Negative indexing means beginning from the end, -1 refers to the last item
banana
list2 = ["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"] 
print(list2[:4]) #By leaving out the start value, the range will start at the first item
['apple', 'banana', 'cherry', 'orange']
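print(list2[2:]) #By leaving out the end value, the range will go on to the end of the list
['cherry', 'orange', 'kiwi', 'melon', 'mango']
print(list2[-4:-1]) #Negative range: from the fourth-last item up to, but not including, the last item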
['orange', 'kiwi', 'melon']
list3 = ["A", "B", "C"]
list3[1] = "D" #change the value of a specific item by referring to the index number
print(list3)
['A', 'D', 'C']
# For loop

list4 = ["apple", "banana", "cherry"]
for x in list4:
  print(x)
apple
banana
cherry
#To determine if a specified item is present in a list

if "apple" in list4:
  print("Yes")
Yes
#To determine how many items a list has

print(len(list4))
3

List Methods:

  • append() : Adds an element at the end of the list
  • clear() : Removes all the elements from the list
  • copy() : Returns a copy of the list
  • count() : Returns the number of elements with the specified value
  • extend() : Adds the elements of a list (or any iterable) to the end of the current list
  • index() : Returns the index of the first element with the specified value
  • insert() : Adds an element at the specified position
  • pop() : Removes and returns the element at the specified position (the last element if no position is given)
  • remove() : Removes the first item with the specified value
  • reverse() : Reverses the order of the list in place
  • sort() : Sorts the list in place
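A few of the methods listed above (count(), index(), sort(), reverse()) are not demonstrated in the cells that follow. A small sketch with made-up numbers:

```python
numbers = [3, 1, 2, 3]               # illustrative data

count_of_three = numbers.count(3)    # how many times 3 occurs
first_three_at = numbers.index(3)    # position of the first 3

numbers.sort()                       # sorts in place and returns None
ascending = numbers.copy()

numbers.reverse()                    # reverses in place
descending = numbers.copy()

newly_sorted = sorted([3, 1, 2])     # sorted() returns a new list and leaves its input alone

print(count_of_three, first_three_at)
print(ascending, descending, newly_sorted)
```

Because sort() and reverse() return None, writing numbers = numbers.sort() would lose the list.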
#append() method to append an item

list4.append("orange")
print(list4)
['apple', 'banana', 'cherry', 'orange']
#Insert an item at the second position (index 1)

list4.insert(1, "orange")
print(list4)
['apple', 'orange', 'banana', 'cherry', 'orange']
#The remove() method removes the specified item

list4.remove("banana")
print(list4)
['apple', 'orange', 'cherry', 'orange']
#pop() method removes the item at the specified index,
#or the last item if no index is specified

list4.pop()
print(list4)
['apple', 'orange', 'cherry']
#The del keyword removes the item at the specified index

del list4[0]
print(list4)
['orange', 'cherry']
#The del keyword can also delete the list completely
del list4
print(list4)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-fcb70c6c4d66> in <module>
      1 #The del keyword can also delete the list completely
      2 del list4
----> 3 print(list4)

NameError: name 'list4' is not defined
#The clear() method empties the list

list5 = ["apple", "banana", "cherry"]
list5.clear()
print(list5)
[]
#Use the copy() method to make a copy of a list
list5 = ["apple", "banana", "cherry"]
mylist = list5.copy()
print(mylist)
['apple', 'banana', 'cherry']
#Join Two Lists

list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]

list3 = list1 + list2
print(list3)
['a', 'b', 'c', 1, 2, 3]
#Append list2 into list1

list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]

for x in list2:
  list1.append(x)

print(list1)
['a', 'b', 'c', 1, 2, 3]
#Use the extend() method to add list2 at the end of list1

list1 = ["a", "b" , "c"]
list2 = [1, 2, 3]

list1.extend(list2)
print(list1)
['a', 'b', 'c', 1, 2, 3]

2) Tuple

A tuple is a collection which is ordered and unchangeable.

tuple1 = ("apple", "banana", "cherry")
print(tuple1)
('apple', 'banana', 'cherry')
#access tuple item

print(tuple1[1])
banana
#Negative indexing means beginning from the end, -1 refers to the last item

print(tuple1[-1])
cherry
#Range : Return the third, fourth, and fifth item

tuple2 = ("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")
print(tuple2[2:5])
('cherry', 'orange', 'kiwi')
#Specify negative indexes if you want to start the search from the end of the tuple

print(tuple2[-4:-1])
('orange', 'kiwi', 'melon')
#loop through the tuple items by using a for loop

tuple3 = ("apple", "banana", "cherry")
for x in tuple3:
  print(x)
apple
banana
cherry
#Check if Item Exists

if "apple" in tuple3:
  print("Yes")
Yes
#Print the number of items in the tuple

print(len(tuple3))
3
#To join two or more tuples, use the + operator

tuple1 = ("a", "b" , "c")
tuple2 = (1, 2, 3)

tuple3 = tuple1 + tuple2
print(tuple3)
('a', 'b', 'c', 1, 2, 3)
#Using the tuple() constructor to make a tuple

thistuple = tuple(("apple", "banana", "cherry")) # note the double round-brackets
print(thistuple)
('apple', 'banana', 'cherry')
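Since a tuple is unchangeable, a common workaround for "changing" an item is to convert to a list, edit, and convert back. A minimal sketch with illustrative names; it also shows the trailing comma needed for a one-item tuple.

```python
colors = ("red", "green", "blue")   # illustrative data

as_list = list(colors)              # tuples cannot be edited, so copy into a list
as_list[1] = "yellow"
colors = tuple(as_list)             # build a new tuple from the edited list

single = ("red",)                   # the trailing comma makes this a tuple
not_a_tuple = ("red")               # without the comma this is just a string

print(colors)
print(type(single).__name__, type(not_a_tuple).__name__)
```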

3) Set

A set is a collection which is unordered and unindexed. Sets are written with curly brackets.

set1 = {"apple", "banana", "cherry"}
print(set1)
{'banana', 'apple', 'cherry'}
#Set items cannot be accessed by index; loop through the set to print the values

for x in set1:
  print(x)
banana
apple
cherry
if "apple" in set1:
  print("Yes")
Yes

Set methods:

  • add() : Adds an element to the set
  • clear() : Removes all the elements from the set
  • copy() : Returns a copy of the set
  • difference() : Returns a set containing the difference between two or more sets
  • difference_update() : Removes the items in this set that are also included in another, specified set
  • discard() : Removes the specified item; does nothing if the item is not present
  • intersection() : Returns a set that is the intersection of two or more sets
  • intersection_update() : Removes the items in this set that are not present in the other, specified set(s)
  • isdisjoint() : Returns True if the two sets have no items in common
  • issubset() : Returns whether another set contains this set or not
  • issuperset() : Returns whether this set contains another set or not
  • pop() : Removes and returns an arbitrary element from the set
  • remove() : Removes the specified element; raises KeyError if the element is not present
  • symmetric_difference() : Returns a set with the items that are in exactly one of the two sets
  • symmetric_difference_update() : Updates this set to keep only the items that are in exactly one of the two sets
  • union() : Returns a set containing the union of sets
  • update() : Updates the set with the union of this set and others
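The comparison methods in the list above are easiest to follow on two small sets. A sketch with made-up values:

```python
a = {1, 2, 3, 4}    # illustrative sets
b = {3, 4, 5}

only_in_a = a.difference(b)               # items in a but not in b
in_both = a.intersection(b)               # items in both sets
in_one_only = a.symmetric_difference(b)   # items in exactly one of the sets
in_either = a.union(b)                    # items in at least one of the sets
small_is_subset = {1, 2}.issubset(a)      # True: every item of {1, 2} is in a
no_overlap = a.isdisjoint({7, 8})         # True: nothing in common

a.discard(99)        # missing item: discard() silently does nothing
try:
    a.remove(99)     # missing item: remove() raises KeyError
    remove_failed = False
except KeyError:
    remove_failed = True

print(only_in_a, in_both, in_one_only, in_either)
print(small_is_subset, no_overlap, remove_failed)
```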
# Adding new items 


set1.add("orange")
print(set1)
{'banana', 'apple', 'cherry', 'orange'}
#Add multiple items to a set, using the update() method

set1.update(["orange", "mango", "grapes"])

print(set1)
{'banana', 'cherry', 'orange', 'apple', 'grapes', 'mango'}
# length of the set


print(len(set1))
6
# remove item 

set1.remove("banana")

print(set1)
{'cherry', 'orange', 'apple', 'grapes', 'mango'}
#Remove an arbitrary item by using the pop() method (sets are unordered, so you cannot know which item is removed)

set2 = {"apple", "banana", "cherry"}

x = set2.pop()

print(x)
print(set2)
banana
{'apple', 'cherry'}
#clear() method empties the set


thisset = {"apple", "banana", "cherry"}

thisset.clear()

print(thisset)
set()
#del keyword will delete the set completely

thisset = {"apple", "banana", "cherry"}

del thisset

print(thisset)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-67-b8e1fa6a22f4> in <module>
      5 del thisset
      6 
----> 7 print(thisset)

NameError: name 'thisset' is not defined
#use the union() method that returns a new set containing all items from both sets, 
#or the update() method that inserts all the items from one set into another

set1 = {"a", "b" , "c"}
set2 = {1, 2, 3}

set3 = set1.union(set2)
print(set3)
{'b', 1, 2, 3, 'a', 'c'}
#update() method inserts the items in set2 into set1

set1 = {"a", "b" , "c"}
set2 = {1, 2, 3}

set1.update(set2)
print(set1)
{'b', 1, 2, 3, 'a', 'c'}

4) Dictionary

A dictionary is a collection of key-value pairs which is changeable and indexed by key. Since Python 3.7, dictionaries keep insertion order.

dict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(dict)
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
#access the items of a dictionary by referring to its key name, inside square brackets

dict["model"]
'Mustang'

Dict methods

  • clear() : Removes all the elements from the dictionary
  • copy() : Returns a copy of the dictionary
  • fromkeys() : Returns a dictionary with the specified keys and value
  • get() : Returns the value of the specified key, or None (or a given default) if the key does not exist
  • items() : Returns a view containing a (key, value) tuple for each key-value pair
  • keys() : Returns a view of the dictionary's keys
  • pop() : Removes the element with the specified key and returns its value
  • popitem() : Removes and returns the last inserted key-value pair
  • setdefault() : Returns the value of the specified key. If the key does not exist: inserts the key with the specified value
  • update() : Updates the dictionary with the specified key-value pairs
  • values() : Returns a view of all the values in the dictionary
#get() returns the same result as dict["model"]

dict.get("model")
'Mustang'
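Square brackets and get() differ when the key is missing. The sketch below also touches setdefault(), update() and fromkeys() from the method list; the dictionary name car is illustrative.

```python
car = {"brand": "Ford", "model": "Mustang", "year": 1964}   # illustrative name

missing = car.get("color")                # None, no error
fallback = car.get("color", "unknown")    # a default can be supplied

try:
    car["color"]                          # square brackets raise KeyError instead
    key_error_raised = False
except KeyError:
    key_error_raised = True

first = car.setdefault("color", "red")    # key absent: inserts it and returns "red"
second = car.setdefault("color", "blue")  # key present: returns the stored "red"

car.update({"year": 2018})                # overwrites existing keys, adds new ones
blank = dict.fromkeys(["a", "b"], 0)      # {'a': 0, 'b': 0}

print(missing, fallback, key_error_raised)
print(first, second, car, blank)
```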
#change the value of a specific item by referring to its key name

dict1 = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
dict1["year"] = 2018

print(dict1)
{'brand': 'Ford', 'model': 'Mustang', 'year': 2018}
#loop through a dictionary by using a for loop

for x in dict1:
  print(x)
brand
model
year
#Print all values in the dictionary, one by one

for x in dict1:
  print(dict1[x])
Ford
Mustang
2018
#use the values() method to return values of a dictionary

for x in dict1.values():
  print(x)
Ford
Mustang
2018
#Loop through both keys and values, by using the items() method

for x, y in dict1.items():
  print(x, y)
brand Ford
model Mustang
year 2018
#Check if a key is present in the dictionary

if "model" in dict1:
  print("Yes")
Yes
print(len(dict1))
3
#adding items

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
thisdict["color"] = "red"
print(thisdict)
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964, 'color': 'red'}
#pop() method removes the item with the specified key name

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
thisdict.pop("model")
print(thisdict)
{'brand': 'Ford', 'year': 1964}
# popitem() method removes the last inserted item

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
thisdict.popitem()
print(thisdict)
{'brand': 'Ford', 'model': 'Mustang'}
#del keyword removes the item with the specified key name

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
del thisdict["model"]
print(thisdict)
{'brand': 'Ford', 'year': 1964}
#A dictionary can also contain other dictionaries; this is called a nested dictionary

myfamily = {
  "child1" : {
    "name" : "Emil",
    "year" : 2004
  },
  "child2" : {
    "name" : "Tobias",
    "year" : 2007
  },
  "child3" : {
    "name" : "Linus",
    "year" : 2011
  }
}
#Create three dictionaries, then create one dictionary that will contain the other three dictionaries

child1 = {
  "name" : "Emil",
  "year" : 2004
}
child2 = {
  "name" : "Tobias",
  "year" : 2007
}
child3 = {
  "name" : "Linus",
  "year" : 2011
}

myfamily = {
  "child1" : child1,
  "child2" : child2,
  "child3" : child3
}
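Values in a nested dictionary are reached by chaining keys. The sketch below reuses the family data from the cells above; the derived names (second_name, birth_years, youngest_key, shared) are illustrative. It also shows that the outer dictionary stores references to the inner ones, not copies.

```python
child1 = {"name": "Emil", "year": 2004}
child2 = {"name": "Tobias", "year": 2007}
child3 = {"name": "Linus", "year": 2011}
myfamily = {"child1": child1, "child2": child2, "child3": child3}

second_name = myfamily["child2"]["name"]     # chain the keys to reach an inner value

# Collect one field from every inner dictionary
birth_years = {key: child["year"] for key, child in myfamily.items()}

# Key of the child with the latest birth year
youngest_key = max(myfamily, key=lambda key: myfamily[key]["year"])

child1["year"] = 2005                        # editing child1 is visible through myfamily
shared = myfamily["child1"]["year"]

print(second_name, birth_years, youngest_key, shared)
```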

Python Conditions

If statement

a = 100
b = 200
if b > a:
  print("b is greater than a")
b is greater than a
#simplified:

a = 100
b = 200
if a < b: print("a is less than b")
a is less than b
a = 20
b = 20
if b > a:
  print("b is greater than a")
elif a == b:
  print("a and b are equal")
a and b are equal
a = 200
b = 100
if b > a:
  print("b is greater than a")
elif a == b:
  print("a and b are equal")
else:
  print("a is greater than b")
a is greater than b
# simplified (conditional expression):

a = 100
b = 300
print("A") if a > b else print("B")
B

AND and OR Statement

a = 200
b = 33
c = 500
if a > b and c > a:
  print("Both conditions are True")
Both conditions are True
a = 200
b = 33
c = 500
if a > b or a > c:
  print("At least one of the conditions is True")
At least one of the conditions is True

Nested If

x = 41

if x > 10:
  print("Above ten,")
  if x > 20:
    print("and also above 20!")
  else:
    print("but not above 20.")

Pass

#if statements cannot be empty, but if you for some reason have an if statement 
#with no content, put in the pass statement to avoid getting an error

a = 33
b = 200

if b > a:
  pass

The while Loop

i = 1
while i < 6:
  print(i)
  i += 1
1
2
3
4
5

Break Statement

i = 1
while i < 6:
  print(i)
  if i == 3:
    break
  i += 1
1
2
3
# with Continue

i = 0
while i < 6:
  i += 1
  if i == 3:
    continue
  print(i)
1
2
4
5
6
### Else statement

i = 1
while i < 6:
  print(i)
  i += 1
else:
  print("i is no longer less than 6")
1
2
3
4
5
i is no longer less than 6
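The else block of a while loop runs only when the loop ends because its condition became false; leaving the loop with break skips it. A minimal sketch, where the function name find_first_multiple is hypothetical:

```python
def find_first_multiple(limit, divisor):
    """Search 1..limit-1 for the first multiple of divisor (illustrative helper)."""
    i = 1
    while i < limit:
        if i % divisor == 0:
            result = "found " + str(i)
            break                   # leaving via break skips the else block
        i += 1
    else:
        result = "none found"       # runs only if the loop was never broken
    return result

print(find_first_multiple(6, 4))
print(find_first_multiple(6, 7))
```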

For Loops

# For loop for List

fruits = ["apple", "banana", "cherry"]
for x in fruits:
  print(x)
apple
banana
cherry
# strings

for x in "banana":
  print(x)
b
a
n
a
n
a
#break statement

fruits = ["apple", "banana", "cherry"]
for x in fruits:
  print(x)
  if x == "banana":
    break
apple
banana
fruits = ["apple", "banana", "cherry"]
for x in fruits:
  if x == "banana":
    break
  print(x)
apple
#continue

fruits = ["apple", "banana", "cherry"]
for x in fruits:
  if x == "banana":
    continue
  print(x)
apple
cherry
# Range

for x in range(6):
  print(x)
0
1
2
3
4
5
for x in range(2, 6):
  print(x)
2
3
4
5
for x in range(2, 30, 3):
  print(x)
2
5
8
11
14
17
20
23
26
29
for x in range(6):
  print(x)
else:
  print("Finally finished!")
0
1
2
3
4
5
Finally finished!
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

for x in adj:
  for y in fruits:
    print(x, y)
red apple
red banana
red cherry
big apple
big banana
big cherry
tasty apple
tasty banana
tasty cherry
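The nested loop above visits every combination of the two lists. As an alternative, and not something the original notes use, itertools.product from the standard library gives the same pairs in the same order:

```python
from itertools import product

adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

# product(adj, fruits) yields (x, y) in the same order as the nested for loops
pairs = [x + " " + y for x, y in product(adj, fruits)]

for pair in pairs:
    print(pair)
```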
for x in [0, 1, 2]:
  pass

Creating a Function

def my_function():
  print("Hello")


my_function()
Hello
def my_function(*kids):
  print("The youngest child is " + kids[2])

my_function("Emil", "Tobias", "Linus")
The youngest child is Linus
def my_function(child3, child2, child1):
  print("The youngest child is " + child3)

my_function(child1 = "Emil", child2 = "Tobias", child3 = "Linus")
The youngest child is Linus
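Besides *args and keyword arguments, functions can declare default values and accept arbitrary keyword arguments with **. A short sketch; the function names greet and describe_child are hypothetical.

```python
def greet(name, greeting="Hello"):
    # greeting falls back to "Hello" when the caller leaves it out
    return greeting + ", " + name

def describe_child(**kid):
    # **kid collects all keyword arguments into a dictionary
    return kid["fname"] + " was born in " + str(kid["year"])

print(greet("Emil"))
print(greet("Emil", greeting="Hi"))
print(describe_child(fname="Linus", year=2011))
```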
#Passing a List as an Argument

def my_function(food):
  for x in food:
    print(x)

fruits = ["apple", "banana", "cherry"]

my_function(fruits)
apple
banana
cherry
#return value

def my_function(x):
  return 5 * x

print(my_function(3))
15
#Recursion Example

def tri_recursion(k):
  if(k > 0):
    result = k + tri_recursion(k - 1)
    print(result)
  else:
    result = 0
  return result

print("\n\nRecursion Example Results")
tri_recursion(6)

Recursion Example Results
1
3
6
10
15
21
21
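The recursion above adds k + (k-1) + ... + 1, which is the k-th triangular number. The sketch below compares a print-free version of the same recursion with the closed form k(k+1)/2; the name tri_formula is illustrative.

```python
def tri_recursion(k):
    # same recursion as above, without the print calls
    if k > 0:
        return k + tri_recursion(k - 1)
    return 0

def tri_formula(k):
    # closed form for 1 + 2 + ... + k
    return k * (k + 1) // 2

print(tri_recursion(6), tri_formula(6))
print(all(tri_recursion(k) == tri_formula(k) for k in range(50)))
```

For large k the formula is the safer choice, because Python limits recursion depth.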

lambda function

x = lambda a, b, c : a + b + c
print(x(5, 6, 2))
13
def myfunc(n):
  return lambda a : a * n

mydoubler = myfunc(2)

print(mydoubler(11))
22
def myfunc(n):
  return lambda a : a * n

mydoubler = myfunc(2)
mytripler = myfunc(3)

print(mydoubler(11))
print(mytripler(11))
22
33
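Lambdas are most often passed straight to another function as a short throwaway argument. A sketch using the built-ins sorted(), map() and filter(); the list people is made-up sample data.

```python
people = [("Emil", 2004), ("Linus", 2011), ("Tobias", 2007)]   # illustrative data

# key= receives each item and returns the value to sort by
by_year = sorted(people, key=lambda person: person[1])

doubled = list(map(lambda n: n * 2, [1, 2, 3]))           # apply to every item
evens = list(filter(lambda n: n % 2 == 0, range(6)))      # keep items where the lambda is True

print(by_year)
print(doubled, evens)
```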

Open a File on the Server

Reading files

#f = open("demofile.txt", "r")
#print(f.read())

#f = open("D:\\myfiles\\welcome.txt", "r")
#print(f.read())
#Read one line of the file

#f = open("demofile.txt", "r")
#print(f.readline())
#Loop through the file line by line

#f = open("demofile.txt", "r")
#for x in f:
#  print(x)
#Close the file when you are finished with it

#f = open("demofile.txt", "r")
#print(f.readline())
#f.close()

Writing files:

#Open the file "demofile2.txt" and append content to the file

#f = open("demofile2.txt", "a")
#f.write("Now the file has more content!")
#f.close()

#open and read the file after the appending:
#f = open("demofile2.txt", "r")
#print(f.read())
#Open the file "demofile3.txt" and overwrite the content

#f = open("demofile3.txt", "w")
#f.write("Woops! I have deleted the content!")
#f.close()

#open and read the file after the overwriting:
#f = open("demofile3.txt", "r")
#print(f.read())
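The open/close pairs above can also be written with a with statement, which closes the file automatically even if an error occurs. The sketch below runs in a temporary folder so it does not touch real files; the file name and contents are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demofile.txt")   # illustrative file name

    with open(path, "w") as f:        # "w" creates the file or overwrites it
        f.write("first line\n")

    with open(path, "a") as f:        # "a" appends to the end
        f.write("second line\n")

    with open(path, "r") as f:        # "r" reads; the file is closed on leaving the block
        lines = f.read().splitlines()

    try:
        open(path, "x")               # "x" creates a new file and fails if it exists
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(lines)
print(exclusive_failed)
```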
#Create a file called "myfile.txt"

#f = open("myfile.txt", "x")
#Remove the file "demofile.txt"

#import os
#os.remove("demofile.txt")
#Check if file exists, then delete it:

#import os
#if os.path.exists("demofile.txt"):
#  os.remove("demofile.txt")
#else:
#  print("The file does not exist")
#Try to open and write to a file that is not writable:

#f = open("demofile.txt")
#try:
#  f.write("Lorem ipsum")
#except OSError:
#  print("Something went wrong when writing to the file")
#finally:
#  f.close()
#Raise an error and stop the program if x is lower than 0:

#x = -1

#if x < 0:
#  raise Exception("Sorry, no numbers below zero")
#Raise a TypeError if x is not an integer:

#x = "hello"

#if not isinstance(x, int):
#  raise TypeError("Only integers are allowed")
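The two raise examples above stop the program. In practice the caller usually catches the error. The sketch below wraps both checks in a hypothetical function, check_non_negative_int, and uses ValueError for the range check in place of the plain Exception used above.

```python
def check_non_negative_int(x):
    # Hypothetical helper combining both checks from the cells above
    if not isinstance(x, int):
        raise TypeError("Only integers are allowed")
    if x < 0:
        raise ValueError("Sorry, no numbers below zero")
    return x

messages = []
for candidate in (5, -1, "hello"):
    try:
        check_non_negative_int(candidate)
        messages.append("ok")
    except (TypeError, ValueError) as err:
        # record the error type and message instead of stopping the program
        messages.append(type(err).__name__ + ": " + str(err))

for message in messages:
    print(message)
```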

NumPy

import numpy as np
simple_list = [1,2,3]
np.array(simple_list)
array([1, 2, 3])
list_of_lists = [[1,2,3], [4,5,6], [7,8,9]]
np.array(list_of_lists)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
np.arange(0,10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(0,21,5)
array([ 0,  5, 10, 15, 20])
np.zeros(50)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.ones((4,5))
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
np.linspace(0,20,10)
array([ 0.        ,  2.22222222,  4.44444444,  6.66666667,  8.88888889,
       11.11111111, 13.33333333, 15.55555556, 17.77777778, 20.        ])
np.eye(5)
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
np.random.rand(3,2)
array([[0.24202235, 0.57396416],
       [0.0400231 , 0.38224147],
       [0.30024483, 0.20187655]])
np.random.randint(5,20,10)
array([10, 14, 18, 11,  9, 15, 16, 19, 13,  9])
np.arange(30)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
np.random.randint(0,100,20)
array([81, 69, 90, 47,  3, 97, 31,  9, 58, 77, 92, 64, 73, 37, 65, 66,  9,
       21, 25, 73])
sample_array = np.arange(30)
sample_array.reshape(5,6)
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])
rand_array = np.random.randint(0,100,20)
rand_array.argmin()
12
sample_array.shape
(30,)
sample_array.reshape(1,30)
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])
sample_array.reshape(30,1)
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19],
       [20],
       [21],
       [22],
       [23],
       [24],
       [25],
       [26],
       [27],
       [28],
       [29]])
sample_array.dtype
dtype('int32')
a = np.random.randn(2,3)
a.T
array([[-1.866579  , -0.77167212],
       [-0.24050824, -1.86954729],
       [ 1.09606272,  0.5064306 ]])
sample_array = np.arange(10,21)
sample_array
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
sample_array[[2,5]]
array([12, 15])
sample_array[1:2] = 100
sample_array
array([ 10, 100,  12,  13,  14,  15,  16,  17,  18,  19,  20])
sample_array = np.arange(10,21)
sample_array[0:7]
array([10, 11, 12, 13, 14, 15, 16])
sample_array = np.arange(10,21)
                        
sample_array
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
subset_sample_array = sample_array[0:7]

subset_sample_array
array([10, 11, 12, 13, 14, 15, 16])
subset_sample_array[:]=1001
subset_sample_array
array([1001, 1001, 1001, 1001, 1001, 1001, 1001])
sample_array
array([1001, 1001, 1001, 1001, 1001, 1001, 1001,   17,   18,   19,   20])
copy_sample_array = sample_array.copy()
copy_sample_array
array([1001, 1001, 1001, 1001, 1001, 1001, 1001,   17,   18,   19,   20])
copy_sample_array[:]=10
copy_sample_array
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
sample_array
array([1001, 1001, 1001, 1001, 1001, 1001, 1001,   17,   18,   19,   20])
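The cells above show that a slice of an array is a view on the same data, while copy() gives independent data. A compact sketch of the same behaviour using np.shares_memory; the variable names are illustrative.

```python
import numpy as np

base = np.arange(10, 21)

window = base[0:3]        # basic slice: a view on base
window[:] = 0             # writes through to base

picked = base[[3, 4]]     # indexing with a list of positions: a copy
picked[:] = -1            # base is not affected

safe = base[5:7].copy()   # explicit copy of a slice
safe[:] = 99              # base is not affected

print(base)
print(np.shares_memory(base, window), np.shares_memory(base, picked))
```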
sample_matrix = np.array(([50,20,1,23], [24,23,21,32], [76,54,32,12], [98,6,4,3]))
sample_matrix
array([[50, 20,  1, 23],
       [24, 23, 21, 32],
       [76, 54, 32, 12],
       [98,  6,  4,  3]])
sample_matrix[0][3]
23
sample_matrix[0,3]
23
sample_matrix[3,:]
array([98,  6,  4,  3])
sample_matrix[3]
array([98,  6,  4,  3])
sample_matrix = np.array(([50,20,1,23,34], [24,23,21,32,34], [76,54,32,12,98], [98,6,4,3,67], [12,23,34,56,67]))
sample_matrix
array([[50, 20,  1, 23, 34],
       [24, 23, 21, 32, 34],
       [76, 54, 32, 12, 98],
       [98,  6,  4,  3, 67],
       [12, 23, 34, 56, 67]])
sample_matrix[:,[1,3]]
array([[20, 23],
       [23, 32],
       [54, 12],
       [ 6,  3],
       [23, 56]])
sample_matrix[:,(3,1)]
array([[23, 20],
       [32, 23],
       [12, 54],
       [ 3,  6],
       [56, 23]])
sample_array=np.arange(1,31)
sample_array
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
mask = sample_array < 10
sample_array[mask]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sample_array[sample_array <10]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a=11
sample_array[sample_array < a]
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
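Conditions can be combined element by element with & (and) and | (or). Each condition needs its own parentheses, and the Python keywords and/or do not work on arrays. A sketch with illustrative names:

```python
import numpy as np

values = np.arange(1, 31)

# even numbers strictly between 5 and 15
selected = values[(values > 5) & (values < 15) & (values % 2 == 0)]

# numbers at either end of the range
outside = values[(values < 3) | (values > 28)]

# np.where keeps values where the condition holds and puts 0 elsewhere
kept = np.where(values < 4, values, 0)

print(selected)
print(outside)
print(kept[:6])
```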
sample_array + sample_array
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
       36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60])
sample_array / sample_array
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
10/sample_array
array([10.        ,  5.        ,  3.33333333,  2.5       ,  2.        ,
        1.66666667,  1.42857143,  1.25      ,  1.11111111,  1.        ,
        0.90909091,  0.83333333,  0.76923077,  0.71428571,  0.66666667,
        0.625     ,  0.58823529,  0.55555556,  0.52631579,  0.5       ,
        0.47619048,  0.45454545,  0.43478261,  0.41666667,  0.4       ,
        0.38461538,  0.37037037,  0.35714286,  0.34482759,  0.33333333])
sample_array + 1
array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
np.var(sample_array)
74.91666666666667
array = np.random.randn(6,6)
array
array([[-1.2513939 ,  0.63036933,  1.34352857,  0.69169362,  0.01026876,
         0.59189891],
       [-1.17904234, -0.12504466,  0.31374784,  0.09035803, -0.61388114,
         1.1150514 ],
       [ 1.06328715,  0.46405969,  0.00697848, -2.29704625,  0.96100601,
         0.83872649],
       [ 0.3548689 , -0.20216495, -1.17393345,  0.04961487, -0.67034172,
         0.55421924],
       [-2.2873708 , -1.24865618, -0.5852612 , -1.14245419,  0.63155215,
        -0.86846749],
       [-0.19474274,  0.26641693, -1.72485259,  1.13081737, -0.48967084,
        -0.56814362]])
np.std(array)
0.9421603403314502
np.mean(array)
-0.15316678701199804
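np.var, np.std and np.mean work on the whole array by default. The axis argument computes them per column or per row, and ddof=1 switches from the population variance to the sample variance. A sketch on a small made-up array:

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])     # illustrative data

col_means = table.mean(axis=0)          # one mean per column
row_means = table.mean(axis=1)          # one mean per row

population_var = np.var([1, 2, 3, 4])        # divides by n
sample_var = np.var([1, 2, 3, 4], ddof=1)    # divides by n - 1

print(col_means, row_means)
print(population_var, sample_var)
```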
sports = np.array(['golf', 'cric', 'fball', 'cric', 'Cric', 'fooseball'])

np.unique(sports)
array(['Cric', 'cric', 'fball', 'fooseball', 'golf'], dtype='<U9')
sample_array
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
simple_array = np.arange(0,20)
simple_array
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
np.save('sample_array', sample_array)
np.savez('2_arrays.npz', a=sample_array, b=simple_array)
np.load('sample_array.npy')
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
archive = np.load('2_arrays.npz')
archive['b']
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
np.savetxt('text_file.txt', sample_array,delimiter=',')
np.loadtxt('text_file.txt', delimiter=',')
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
       27., 28., 29., 30.])
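Note in the output above that loadtxt returns floats even though integers were saved. The sketch below repeats the save and load round trip in a temporary folder, so no files are left behind, and passes dtype=int to get integers back. File and variable names are illustrative.

```python
import os
import tempfile

import numpy as np

original = np.arange(1, 6)

with tempfile.TemporaryDirectory() as folder:
    npy_path = os.path.join(folder, "sample_array.npy")
    np.save(npy_path, original)                 # binary format, keeps the dtype
    restored = np.load(npy_path)

    npz_path = os.path.join(folder, "two_arrays.npz")
    np.savez(npz_path, a=original, b=original * 2)
    with np.load(npz_path) as archive:          # the archive behaves like a dictionary
        from_archive = archive["b"]

    txt_path = os.path.join(folder, "text_file.txt")
    np.savetxt(txt_path, original, delimiter=",", fmt="%d")
    as_float = np.loadtxt(txt_path, delimiter=",")             # default dtype is float
    as_int = np.loadtxt(txt_path, delimiter=",", dtype=int)    # ask for integers

print(restored, from_archive)
print(as_float.dtype, as_int.dtype)
```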
data = {'prodID': ['101', '102', '103', '104', '104'],
        'prodname': ['X', 'Y', 'Z', 'X', 'W'],
        'profit': ['2738', '2727', '3497', '7347', '3743']}

Pandas

import pandas as pd
score = [10, 15, 20, 25]
pd.Series(data=score, index = ['a','b','c','d'])
a    10
b    15
c    20
d    25
dtype: int64
demo_matrix = np.array(([13,35,74,48], [23,37,37,38], [73,39,93,39]))
demo_matrix
array([[13, 35, 74, 48],
       [23, 37, 37, 38],
       [73, 39, 93, 39]])
demo_matrix[2,3]
39
np.arange(0,22,6)
array([ 0,  6, 12, 18])
demo_array=np.arange(0,10)
demo_array
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
demo_array <3
array([ True,  True,  True, False, False, False, False, False, False,
       False])
demo_array[demo_array <6]
array([0, 1, 2, 3, 4, 5])
np.max(demo_array)
9
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1+s2])
0    ac
1    bd
dtype: object

Creating a Series using Pandas

You can convert a list, a NumPy array, or a dictionary to a Series in the following manner:

labels = ['w','x','y','z']
list = [10,20,30,40]
array = np.array([10,20,30,40])
dict = {'w':10,'x':20,'y':30,'z':40}
pd.Series(data=list)
0    10
1    20
2    30
3    40
dtype: int64
pd.Series(data=list,index=labels)
w    10
x    20
y    30
z    40
dtype: int64
pd.Series(list,labels)
w    10
x    20
y    30
z    40
dtype: int64
pd.Series(array)
0    10
1    20
2    30
3    40
dtype: int32
pd.Series(array,labels)
w    10
x    20
y    30
z    40
dtype: int32
pd.Series(dict)
w    10
x    20
y    30
z    40
dtype: int64

Using an Index

The following two Series show how the index is used in arithmetic: values are matched by index label, not by position.

sports1 = pd.Series([1,2,3,4],index = ['Cricket', 'Football','Basketball', 'Golf'])
sports1
Cricket       1
Football      2
Basketball    3
Golf          4
dtype: int64
sports2 = pd.Series([1,2,5,4],index = ['Cricket', 'Football','Baseball', 'Golf'])
sports2
Cricket     1
Football    2
Baseball    5
Golf        4
dtype: int64
sports1 + sports2
Baseball      NaN
Basketball    NaN
Cricket       2.0
Football      4.0
Golf          8.0
dtype: float64
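Labels that appear in only one of the two Series produce NaN, and the result becomes float. If a missing label should count as zero instead, Series.add with fill_value is one option. A sketch reusing the two Series above; the names plain and total are illustrative.

```python
import pandas as pd

sports1 = pd.Series([1, 2, 3, 4], index=['Cricket', 'Football', 'Basketball', 'Golf'])
sports2 = pd.Series([1, 2, 5, 4], index=['Cricket', 'Football', 'Baseball', 'Golf'])

plain = sports1 + sports2                    # unmatched labels become NaN
total = sports1.add(sports2, fill_value=0)   # a missing side is treated as 0

print(plain.isna().sum())
print(total)
```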

DataFrames

The DataFrame concept in pandas is similar to the data frame in the R programming language. A DataFrame is a collection of Series that share the same index.

from numpy.random import randn
np.random.seed(1)
dataframe = pd.DataFrame(randn(10,5),index='A B C D E F G H I J'.split(),columns='Score1 Score2 Score3 Score4 Score5'.split())
dataframe
     Score1    Score2    Score3    Score4    Score5
A  1.624345 -0.611756 -0.528172 -1.072969  0.865408
B -2.301539  1.744812 -0.761207  0.319039 -0.249370
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769
D -1.099891 -0.172428 -0.877858  0.042214  0.582815
E -1.100619  1.144724  0.901591  0.502494  0.900856
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
H -0.012665 -1.117310  0.234416  1.659802  0.742044
I -0.191836 -0.887629 -0.747158  1.692455  0.050808
J -0.636996  0.190915  2.100255  0.120159  0.617203

Selection and Indexing

Ways in which we can grab data from a DataFrame

dataframe['Score3']
A   -0.528172
B   -0.761207
C   -0.322417
D   -0.877858
E    0.901591
F   -0.935769
G   -0.687173
H    0.234416
I   -0.747158
J    2.100255
Name: Score3, dtype: float64
# Pass a list of column names in any order necessary
dataframe[['Score2','Score1']]
     Score2    Score1
A -0.611756  1.624345
B  1.744812 -2.301539
C -2.060141  1.462108
D -0.172428 -1.099891
E  1.144724 -1.100619
F -0.122890 -0.683728
G -0.396754 -0.691661
H -1.117310 -0.012665
I -0.887629 -0.191836
J  0.190915 -0.636996
#Each DataFrame column is a Series
type(dataframe['Score1'])
pandas.core.series.Series

Adding a new column to the DataFrame

dataframe['Score6'] = dataframe['Score1'] + dataframe['Score2']
dataframe
     Score1    Score2    Score3    Score4    Score5    Score6
A  1.624345 -0.611756 -0.528172 -1.072969  0.865408  1.012589
B -2.301539  1.744812 -0.761207  0.319039 -0.249370 -0.556727
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769 -0.598033
D -1.099891 -0.172428 -0.877858  0.042214  0.582815 -1.272319
E -1.100619  1.144724  0.901591  0.502494  0.900856  0.044105
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355 -0.806618
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246 -1.088414
H -0.012665 -1.117310  0.234416  1.659802  0.742044 -1.129975
I -0.191836 -0.887629 -0.747158  1.692455  0.050808 -1.079465
J -0.636996  0.190915  2.100255  0.120159  0.617203 -0.446080

Removing Columns from DataFrame

# Use axis=0 for dropping rows and axis=1 for dropping columns
    
dataframe.drop('Score6',axis=1)             
     Score1    Score2    Score3    Score4    Score5
A  1.624345 -0.611756 -0.528172 -1.072969  0.865408
B -2.301539  1.744812 -0.761207  0.319039 -0.249370
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769
D -1.099891 -0.172428 -0.877858  0.042214  0.582815
E -1.100619  1.144724  0.901591  0.502494  0.900856
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
H -0.012665 -1.117310  0.234416  1.659802  0.742044
I -0.191836 -0.887629 -0.747158  1.692455  0.050808
J -0.636996  0.190915  2.100255  0.120159  0.617203
# The column is not dropped from the original DataFrame unless inplace=True is passed
dataframe
     Score1    Score2    Score3    Score4    Score5    Score6
A  1.624345 -0.611756 -0.528172 -1.072969  0.865408  1.012589
B -2.301539  1.744812 -0.761207  0.319039 -0.249370 -0.556727
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769 -0.598033
D -1.099891 -0.172428 -0.877858  0.042214  0.582815 -1.272319
E -1.100619  1.144724  0.901591  0.502494  0.900856  0.044105
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355 -0.806618
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246 -1.088414
H -0.012665 -1.117310  0.234416  1.659802  0.742044 -1.129975
I -0.191836 -0.887629 -0.747158  1.692455  0.050808 -1.079465
J -0.636996  0.190915  2.100255  0.120159  0.617203 -0.446080
dataframe.drop('Score6',axis=1,inplace=True)
dataframe
     Score1    Score2    Score3    Score4    Score5
A  1.624345 -0.611756 -0.528172 -1.072969  0.865408
B -2.301539  1.744812 -0.761207  0.319039 -0.249370
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769
D -1.099891 -0.172428 -0.877858  0.042214  0.582815
E -1.100619  1.144724  0.901591  0.502494  0.900856
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
H -0.012665 -1.117310  0.234416  1.659802  0.742044
I -0.191836 -0.887629 -0.747158  1.692455  0.050808
J -0.636996  0.190915  2.100255  0.120159  0.617203
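An alternative to inplace=True is to assign the result of drop() back to a variable, since drop() always returns a new DataFrame. The sketch below uses a small made-up DataFrame and the columns= and index= keywords, which mean the same as axis=1 and axis=0.

```python
import pandas as pd

df = pd.DataFrame({"Score1": [1.0, 2.0], "Score2": [3.0, 4.0]},
                  index=["A", "B"])            # illustrative data
df["Score3"] = df["Score1"] + df["Score2"]

without_score3 = df.drop(columns="Score3")     # same as df.drop("Score3", axis=1)
still_there = "Score3" in df.columns           # the original is unchanged

df = df.drop(index="A")                        # same as df.drop("A", axis=0), then reassign

print(list(without_score3.columns), still_there)
print(list(df.index))
```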

Dropping rows using axis=0

# drop() returns a new DataFrame; the row is removed from the original only if inplace=True is passed

dataframe.drop('A',axis=0)      
     Score1    Score2    Score3    Score4    Score5
B -2.301539  1.744812 -0.761207  0.319039 -0.249370
C  1.462108 -2.060141 -0.322417 -0.384054  1.133769
D -1.099891 -0.172428 -0.877858  0.042214  0.582815
E -1.100619  1.144724  0.901591  0.502494  0.900856
F -0.683728 -0.122890 -0.935769 -0.267888  0.530355
G -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
H -0.012665 -1.117310  0.234416  1.659802  0.742044
I -0.191836 -0.887629 -0.747158  1.692455  0.050808
J -0.636996  0.190915  2.100255  0.120159  0.617203

Selecting Rows

dataframe.loc['F']
Score1   -0.683728
Score2   -0.122890
Score3   -0.935769
Score4   -0.267888
Score5    0.530355
Name: F, dtype: float64

To select by integer position instead of by label, use iloc instead of loc

dataframe.iloc[2]
Score1    1.462108
Score2   -2.060141
Score3   -0.322417
Score4   -0.384054
Score5    1.133769
Name: C, dtype: float64

Selecting a subset of rows and columns using loc

dataframe.loc['A','Score1']
1.6243453636632417
dataframe.loc[['A','B'],['Score1','Score2']]
     Score1    Score2
A  1.624345 -0.611756
B -2.301539  1.744812
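One difference between loc and iloc that is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position, like a normal Python slice. A sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Score1": [10, 20, 30], "Score2": [1, 2, 3]},
                  index=["A", "B", "C"])    # illustrative data

by_label = df.loc["A":"B"]        # label slice: the end label B is included
by_position = df.iloc[0:1]        # position slice: the end position 1 is excluded

same_cell = df.loc["B", "Score2"] == df.iloc[1, 1]   # two ways to reach one cell

print(list(by_label.index), list(by_position.index), same_cell)
```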

Conditional Selection

As with NumPy, we can make conditional selections by passing a boolean condition inside square brackets

dataframe>0.5
   Score1  Score2  Score3  Score4  Score5
A    True   False   False   False    True
B   False    True   False   False   False
C    True   False   False   False    True
D   False   False   False   False    True
E   False    True    True    True    True
F   False   False   False   False    True
G   False   False   False   False   False
H   False   False   False    True    True
I   False   False   False    True   False
J   False   False    True   False    True
dataframe[dataframe>0.5]
Score1 Score2 Score3 Score4 Score5
A 1.624345 NaN NaN NaN 0.865408
B NaN 1.744812 NaN NaN NaN
C 1.462108 NaN NaN NaN 1.133769
D NaN NaN NaN NaN 0.582815
E NaN 1.144724 0.901591 0.502494 0.900856
F NaN NaN NaN NaN 0.530355
G NaN NaN NaN NaN NaN
H NaN NaN NaN 1.659802 0.742044
I NaN NaN NaN 1.692455 NaN
J NaN NaN 2.100255 NaN 0.617203
dataframe[dataframe['Score1']>0.5]
Score1 Score2 Score3 Score4 Score5
A 1.624345 -0.611756 -0.528172 -1.072969 0.865408
C 1.462108 -2.060141 -0.322417 -0.384054 1.133769
dataframe[dataframe['Score1']>0.5]['Score2']
A   -0.611756
C   -2.060141
Name: Score2, dtype: float64
dataframe[dataframe['Score1']>0.5][['Score2','Score3']]
Score2 Score3
A -0.611756 -0.528172
C -2.060141 -0.322417
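
The chained form above (filter first, then pick columns) can also be written as a single loc call, and several conditions can be combined with & or |. This is a sketch on a made-up frame; `df`, `mask` and the values are assumptions for illustration.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({'Score1': [1.6, -2.3, 1.4, -1.0],
                   'Score2': [-0.6, 1.7, -2.0, -0.1],
                   'Score3': [-0.5, -0.7, -0.3, -0.8]},
                  index=['A', 'B', 'C', 'D'])

mask = df['Score1'] > 0.5                      # boolean Series, one entry per row

chained = df[mask][['Score2', 'Score3']]       # two-step selection
single = df.loc[mask, ['Score2', 'Score3']]    # same rows and columns in one step

# Each condition needs its own parentheses when combined with & (and) or | (or)
both = df.loc[(df['Score1'] > 0.5) & (df['Score2'] < -1), 'Score3']

print(chained.equals(single))   # True
print(both.index.tolist())      # ['C']
```

The single loc call is usually the safer choice when you later want to assign to the selection, since the chained form may operate on a temporary copy.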

Some more indexing features include:

  • resetting the index
  • setting a different value
  • index hierarchy
# Reset to the default integer index (0, 1, 2, ...); the old index labels move into a regular column
dataframe.reset_index()
Countries Score1 Score2 Score3 Score4 Score5
0 IND 1.624345 -0.611756 -0.528172 -1.072969 0.865408
1 JP -2.301539 1.744812 -0.761207 0.319039 -0.249370
2 CAN 1.462108 -2.060141 -0.322417 -0.384054 1.133769
3 GE -1.099891 -0.172428 -0.877858 0.042214 0.582815
4 IT -1.100619 1.144724 0.901591 0.502494 0.900856
5 PL -0.683728 -0.122890 -0.935769 -0.267888 0.530355
6 FY -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
7 IU -0.012665 -1.117310 0.234416 1.659802 0.742044
8 RT -0.191836 -0.887629 -0.747158 1.692455 0.050808
9 IP -0.636996 0.190915 2.100255 0.120159 0.617203
# Setting new index value
newindex = 'IND JP CAN GE IT PL FY IU RT IP'.split()
dataframe['Countries'] = newindex
dataframe
Score1 Score2 Score3 Score4 Score5 Countries
Countries
IND 1.624345 -0.611756 -0.528172 -1.072969 0.865408 IND
JP -2.301539 1.744812 -0.761207 0.319039 -0.249370 JP
CAN 1.462108 -2.060141 -0.322417 -0.384054 1.133769 CAN
GE -1.099891 -0.172428 -0.877858 0.042214 0.582815 GE
IT -1.100619 1.144724 0.901591 0.502494 0.900856 IT
PL -0.683728 -0.122890 -0.935769 -0.267888 0.530355 PL
FY -0.691661 -0.396754 -0.687173 -0.845206 -0.671246 FY
IU -0.012665 -1.117310 0.234416 1.659802 0.742044 IU
RT -0.191836 -0.887629 -0.747158 1.692455 0.050808 RT
IP -0.636996 0.190915 2.100255 0.120159 0.617203 IP
dataframe.set_index('Countries')
Score1 Score2 Score3 Score4 Score5
Countries
IND 1.624345 -0.611756 -0.528172 -1.072969 0.865408
JP -2.301539 1.744812 -0.761207 0.319039 -0.249370
CAN 1.462108 -2.060141 -0.322417 -0.384054 1.133769
GE -1.099891 -0.172428 -0.877858 0.042214 0.582815
IT -1.100619 1.144724 0.901591 0.502494 0.900856
PL -0.683728 -0.122890 -0.935769 -0.267888 0.530355
FY -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
IU -0.012665 -1.117310 0.234416 1.659802 0.742044
RT -0.191836 -0.887629 -0.747158 1.692455 0.050808
IP -0.636996 0.190915 2.100255 0.120159 0.617203
# Once again, pass inplace=True to make the change permanent
dataframe
Score1 Score2 Score3 Score4 Score5 Countries
Countries
IND 1.624345 -0.611756 -0.528172 -1.072969 0.865408 IND
JP -2.301539 1.744812 -0.761207 0.319039 -0.249370 JP
CAN 1.462108 -2.060141 -0.322417 -0.384054 1.133769 CAN
GE -1.099891 -0.172428 -0.877858 0.042214 0.582815 GE
IT -1.100619 1.144724 0.901591 0.502494 0.900856 IT
PL -0.683728 -0.122890 -0.935769 -0.267888 0.530355 PL
FY -0.691661 -0.396754 -0.687173 -0.845206 -0.671246 FY
IU -0.012665 -1.117310 0.234416 1.659802 0.742044 IU
RT -0.191836 -0.887629 -0.747158 1.692455 0.050808 RT
IP -0.636996 0.190915 2.100255 0.120159 0.617203 IP
dataframe.set_index('Countries',inplace=True)
dataframe
Score1 Score2 Score3 Score4 Score5
Countries
IND 1.624345 -0.611756 -0.528172 -1.072969 0.865408
JP -2.301539 1.744812 -0.761207 0.319039 -0.249370
CAN 1.462108 -2.060141 -0.322417 -0.384054 1.133769
GE -1.099891 -0.172428 -0.877858 0.042214 0.582815
IT -1.100619 1.144724 0.901591 0.502494 0.900856
PL -0.683728 -0.122890 -0.935769 -0.267888 0.530355
FY -0.691661 -0.396754 -0.687173 -0.845206 -0.671246
IU -0.012665 -1.117310 0.234416 1.659802 0.742044
RT -0.191836 -0.887629 -0.747158 1.692455 0.050808
IP -0.636996 0.190915 2.100255 0.120159 0.617203
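
Index hierarchy is listed above but not demonstrated, so here is a minimal sketch of a two-level index. The level names 'Group' and 'Num', the labels and the values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

outer = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inner = [1, 2, 3, 1, 2, 3]

# Build a two-level (hierarchical) index from (outer, inner) pairs
hier_index = pd.MultiIndex.from_tuples(list(zip(outer, inner)),
                                       names=['Group', 'Num'])
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=hier_index, columns=['A', 'B'])

g1 = df.loc['G1']                # all rows under the outer label 'G1'
cell = df.loc[('G2', 3), 'B']    # one value: outer 'G2', inner 3, column 'B'
num1 = df.xs(1, level='Num')     # cross-section: inner label 1 from every group

print(g1.index.tolist())    # [1, 2, 3]
print(cell)                 # 11
print(num1.index.tolist())  # ['G1', 'G2']
```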

Missing Data

Methods to deal with missing data in Pandas

dataframe = pd.DataFrame({'Cricket':[1,2,np.nan,4,6,7,2,np.nan],
                  'Baseball':[5,np.nan,np.nan,5,7,2,4,5],
                  'Tennis':[1,2,3,4,5,6,7,8]})
dataframe
Cricket Baseball Tennis
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
3 4.0 5.0 4
4 6.0 7.0 5
5 7.0 2.0 6
6 2.0 4.0 7
7 NaN 5.0 8
dataframe.dropna()
Cricket Baseball Tennis
0 1.0 5.0 1
3 4.0 5.0 4
4 6.0 7.0 5
5 7.0 2.0 6
6 2.0 4.0 7
# Use axis=1 for dropping columns with nan values

dataframe.dropna(axis=1)       
Tennis
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
dataframe.dropna(thresh=2)
Cricket Baseball Tennis
0 1.0 5.0 1
1 2.0 NaN 2
3 4.0 5.0 4
4 6.0 7.0 5
5 7.0 2.0 6
6 2.0 4.0 7
7 NaN 5.0 8
dataframe.fillna(value=0)
Cricket Baseball Tennis
0 1.0 5.0 1
1 2.0 0.0 2
2 0.0 0.0 3
3 4.0 5.0 4
4 6.0 7.0 5
5 7.0 2.0 6
6 2.0 4.0 7
7 0.0 5.0 8
dataframe['Baseball'].fillna(value=dataframe['Baseball'].mean())
0    5.000000
1    4.666667
2    4.666667
3    5.000000
4    7.000000
5    2.000000
6    4.000000
7    5.000000
Name: Baseball, dtype: float64
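
The thresh argument is easy to misread, so the sketch below counts the non-null values per row to show why only row 2 is dropped by thresh=2. It also fills every column with its own mean in one call, an extension of the single-column fill above. The names `non_null_per_row`, `kept` and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                   'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                   'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})

# thresh=2 keeps a row only if it has at least 2 non-null values
non_null_per_row = df.notnull().sum(axis=1)
kept = df.dropna(thresh=2)

# Passing a Series of column means fills each column with its own mean
filled = df.fillna(df.mean())

print(non_null_per_row.tolist())  # [3, 2, 1, 3, 3, 3, 3, 2]
print(kept.index.tolist())        # [0, 1, 3, 4, 5, 6, 7]
```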

Groupby

The groupby method is used to group rows together and perform aggregate functions

dat = {'CustID':['1001','1001','1002','1002','1003','1003'],
       'CustName':['UIPat','DatRob','Goog','Chrysler','Ford','GM'],
       'Profitinlakhs':[2005,3245,1245,8765,5463,3547]}
dataframe = pd.DataFrame(dat)
dataframe
CustID CustName Profitinlakhs
0 1001 UIPat 2005
1 1001 DatRob 3245
2 1002 Goog 1245
3 1002 Chrysler 8765
4 1003 Ford 5463
5 1003 GM 3547

We can now use the .groupby() method to group rows together based on a column name. For example, let's group by CustID. This creates a DataFrameGroupBy object:

dataframe.groupby('CustID') #This object can be saved as a variable
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FDFCEFE9C8>
CustID_grouped = dataframe.groupby("CustID") #Now we can aggregate using the variable
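CustID_grouped.mean(numeric_only=True) #numeric_only=True leaves out the text column CustName, which recent pandas versions no longer skip automatically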
Profitinlakhs
CustID
1001 2625
1002 5005
1003 4505

The grouping and the aggregation can also be chained in a single call. Other aggregation methods work the same way:

dataframe.groupby('CustID').mean(numeric_only=True)
Profitinlakhs
CustID
1001 2625
1002 5005
1003 4505
CustID_grouped.std(numeric_only=True)
Profitinlakhs
CustID
1001 876.812409
1002 5317.442995
1003 1354.816593
CustID_grouped.min()
CustName Profitinlakhs
CustID
1001 DatRob 2005
1002 Chrysler 1245
1003 Ford 3547
CustID_grouped.max()
CustName Profitinlakhs
CustID
1001 UIPat 3245
1002 Goog 8765
1003 GM 5463
CustID_grouped.count()
CustName Profitinlakhs
CustID
1001 2 2
1002 2 2
1003 2 2
CustID_grouped.describe()
Profitinlakhs
count mean std min 25% 50% 75% max
CustID
1001 2.0 2625.0 876.812409 2005.0 2315.0 2625.0 2935.0 3245.0
1002 2.0 5005.0 5317.442995 1245.0 3125.0 5005.0 6885.0 8765.0
1003 2.0 4505.0 1354.816593 3547.0 4026.0 4505.0 4984.0 5463.0
CustID_grouped.describe().transpose()
CustID 1001 1002 1003
Profitinlakhs count 2.000000 2.000000 2.000000
mean 2625.000000 5005.000000 4505.000000
std 876.812409 5317.442995 1354.816593
min 2005.000000 1245.000000 3547.000000
25% 2315.000000 3125.000000 4026.000000
50% 2625.000000 5005.000000 4505.000000
75% 2935.000000 6885.000000 4984.000000
max 3245.000000 8765.000000 5463.000000
CustID_grouped.describe().transpose()['1001']
Profitinlakhs  count       2.000000
               mean     2625.000000
               std       876.812409
               min      2005.000000
               25%      2315.000000
               50%      2625.000000
               75%      2935.000000
               max      3245.000000
Name: 1001, dtype: float64
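
When only a few statistics are needed, selecting the numeric column first and passing a list of function names to agg is a compact alternative to describe(). This is a sketch using the same customer data; the name `summary` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'CustID': ['1001', '1001', '1002', '1002', '1003', '1003'],
                   'CustName': ['UIPat', 'DatRob', 'Goog', 'Chrysler', 'Ford', 'GM'],
                   'Profitinlakhs': [2005, 3245, 1245, 8765, 5463, 3547]})

# Pick the numeric column, then compute several aggregates per group at once
summary = df.groupby('CustID')['Profitinlakhs'].agg(['mean', 'min', 'max'])

print(summary.columns.tolist())        # ['mean', 'min', 'max']
print(summary.loc['1001', 'mean'])     # 2625.0
```

Selecting the column before aggregating also sidesteps the question of how text columns such as CustName should be handled.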

There are three ways of combining DataFrames:

  • Merging
  • Joining
  • Concatenating
dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                        'Sales': [13456, 45321, 54385, 53212],
                        'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3'],
                        'Prime': ['yes', 'no', 'no', 'yes']},
                        index=[0, 1, 2, 3])

dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                        'Sales': [13456, 54385, 53212, 4534],
                        'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7'],
                        'Imp': ['yes', 'no', 'no', 'no']},
                         index=[4, 5, 6, 7]) 

dafa3 = pd.DataFrame({'CustID': ['101', '104', '105', '106'],
                        'Sales': [13456, 53212, 4534, 3241],
                        'Pol': ['CAT8', 'CAT9', 'CAT10', 'CAT11'],
                        'Level': ['yes', 'no', 'no', 'yes']},
                        index=[8, 9, 10, 11])

Concatenation

Concatenation stacks DataFrames along an axis: rows with axis=0 (the default) or columns with axis=1.

The frames do not need identical shapes. pandas aligns them on the other axis and fills any gaps with NaN, as the outputs below show.

pd.concat([dafa1,dafa2],sort=True) #sort=True orders the columns alphabetically when the frames do not share all of them
CustID Imp Payback Prime Priority Sales
0 101 NaN NaN yes CAT0 13456
1 102 NaN NaN no CAT1 45321
2 103 NaN NaN no CAT2 54385
3 104 NaN NaN yes CAT3 53212
4 101 yes CAT4 NaN NaN 13456
5 103 no CAT5 NaN NaN 54385
6 104 no CAT6 NaN NaN 53212
7 105 no CAT7 NaN NaN 4534
pd.concat([dafa1,dafa2,dafa3],axis=1)
CustID Sales Priority Prime CustID Sales Payback Imp CustID Sales Pol Level
0 101 13456.0 CAT0 yes NaN NaN NaN NaN NaN NaN NaN NaN
1 102 45321.0 CAT1 no NaN NaN NaN NaN NaN NaN NaN NaN
2 103 54385.0 CAT2 no NaN NaN NaN NaN NaN NaN NaN NaN
3 104 53212.0 CAT3 yes NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN 101 13456.0 CAT4 yes NaN NaN NaN NaN
5 NaN NaN NaN NaN 103 54385.0 CAT5 no NaN NaN NaN NaN
6 NaN NaN NaN NaN 104 53212.0 CAT6 no NaN NaN NaN NaN
7 NaN NaN NaN NaN 105 4534.0 CAT7 no NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN 101 13456.0 CAT8 yes
9 NaN NaN NaN NaN NaN NaN NaN NaN 104 53212.0 CAT9 no
10 NaN NaN NaN NaN NaN NaN NaN NaN 105 4534.0 CAT10 no
11 NaN NaN NaN NaN NaN NaN NaN NaN 106 3241.0 CAT11 yes
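
Two concat options help with the NaN-heavy results above: join='inner' keeps only the columns that all frames share, and ignore_index=True renumbers the rows. The sketch below uses two small made-up frames, `left` and `right`, which are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({'CustID': ['101', '102'],
                     'Sales': [13456, 45321],
                     'Prime': ['yes', 'no']})
right = pd.DataFrame({'CustID': ['103', '104'],
                      'Sales': [54385, 53212],
                      'Imp': ['no', 'no']})

# Default outer join: union of columns, NaN where a frame lacks a column
stacked = pd.concat([left, right], sort=False)

# Inner join: only the shared columns survive; rows get a fresh 0..n-1 index
shared = pd.concat([left, right], join='inner', ignore_index=True)

print(stacked['Imp'].isnull().sum())  # 2
print(shared.index.tolist())          # [0, 1, 2, 3]
```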

Merging

Just like joining tables in SQL, the merge function in pandas combines DataFrames on one or more key columns

pd.merge(dafa1,dafa2,how='outer',on='CustID')
CustID Sales_x Priority Prime Sales_y Payback Imp
0 101 13456.0 CAT0 yes 13456.0 CAT4 yes
1 102 45321.0 CAT1 no NaN NaN NaN
2 103 54385.0 CAT2 no 54385.0 CAT5 no
3 104 53212.0 CAT3 yes 53212.0 CAT6 no
4 105 NaN NaN NaN 4534.0 CAT7 no
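
The how argument decides which keys survive the merge. The sketch below compares 'inner', 'left' and 'outer' on two trimmed-down frames, and adds a join call, which combines frames on their index rather than on a column. The frames `a` and `b` are assumptions for illustration.

```python
import pandas as pd

a = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                  'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3']})
b = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                  'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7']})

inner = pd.merge(a, b, how='inner', on='CustID')   # keys present in both frames
left = pd.merge(a, b, how='left', on='CustID')     # every key from a
# indicator=True adds a _merge column saying where each row came from
outer = pd.merge(a, b, how='outer', on='CustID', indicator=True)

# join() works on the index, so move CustID into the index first
joined = a.set_index('CustID').join(b.set_index('CustID'), how='inner')

print(len(inner), len(left), len(outer))  # 3 4 5
```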

Operations

Let us look at some useful operations in pandas

dataframe = pd.DataFrame({'custID':[1,2,3,4],'SaleType':['big','small','medium','big'],'SalesCode':['121','131','141','151']})
dataframe.head()
custID SaleType SalesCode
0 1 big 121
1 2 small 131
2 3 medium 141
3 4 big 151

Info on Unique Values

dataframe['SaleType'].unique()
array(['big', 'small', 'medium'], dtype=object)
dataframe['SaleType'].nunique()
3
dataframe['SaleType'].value_counts()
big       2
small     1
medium    1
Name: SaleType, dtype: int64

Selecting Data

#Select from DataFrame using criteria from multiple columns
newdataframe = dataframe[(dataframe['custID']!=3) & (dataframe['SaleType']=='big')]
newdataframe
custID SaleType SalesCode
0 1 big 121
3 4 big 151

Applying Functions

def profit(a):
    return a*4
dataframe['custID'].apply(profit)
0     4
1     8
2    12
3    16
Name: custID, dtype: int64
dataframe['SaleType'].apply(len)
0    3
1    5
2    6
3    3
Name: SaleType, dtype: int64
dataframe['custID'].sum()
10
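
The apply calls above pass a named function. A lambda works the same way, and apply with axis=1 hands each row to the function instead of each value. For simple arithmetic, plain vectorized operators give the same result. A sketch under those assumptions, with the hypothetical names `times_four`, `vectorized` and `labels`:

```python
import pandas as pd

df = pd.DataFrame({'custID': [1, 2, 3, 4],
                   'SaleType': ['big', 'small', 'medium', 'big']})

# Lambda version of the profit() function
times_four = df['custID'].apply(lambda a: a * 4)

# The same arithmetic without apply
vectorized = df['custID'] * 4

# axis=1 passes one row at a time, so several columns can be combined
labels = df.apply(lambda row: f"{row['custID']}-{row['SaleType']}", axis=1)

print(times_four.tolist())  # [4, 8, 12, 16]
print(labels.tolist())      # ['1-big', '2-small', '3-medium', '4-big']
```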

Permanently Removing a Column

dataframe
custID SaleType SalesCode
0 1 big 121
1 2 small 131
2 3 medium 141
3 4 big 151
del dataframe['custID']
dataframe
SaleType SalesCode
0 big 121
1 small 131
2 medium 141
3 big 151

Get column and index names

dataframe.columns
Index(['SaleType', 'SalesCode'], dtype='object')
dataframe.index
RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame

dataframe.sort_values(by='SaleType') #inplace=False by default
SaleType SalesCode
0 big 121
3 big 151
2 medium 141
1 small 131

Checking for Null Values

dataframe.isnull()
SaleType SalesCode
0 False False
1 False False
2 False False
3 False False
# Drop rows with NaN Values
dataframe.dropna()
SaleType SalesCode
0 big 121
1 small 131
2 medium 141
3 big 151

Filling in NaN values with something else

dataframe = pd.DataFrame({'Sale1':[5,np.nan,10,np.nan],
                   'Sale2':[np.nan,121,np.nan,141],
                   'Sale3':['XUI','VYU','NMA','IUY']})
dataframe.head()
Sale1 Sale2 Sale3
0 5.0 NaN XUI
1 NaN 121.0 VYU
2 10.0 NaN NMA
3 NaN 141.0 IUY
dataframe.fillna('Not nan')
Sale1 Sale2 Sale3
0 5 Not nan XUI
1 Not nan 121 VYU
2 10 Not nan NMA
3 Not nan 141 IUY

Data Input and Output

DataFrames can be read from external sources with the pd.read_* functions and written back with the matching to_* methods

CSV Input

# dataframe = pd.read_csv('filename.csv')

CSV output

#If index=False, the CSV file does not store the index values

# dataframe.to_csv('filename.csv',index=False)    

Excel Input

# pd.read_excel('filename.xlsx',sheet_name='Data1')

Excel Output

# dataframe.to_excel('Consumer2.xlsx',sheet_name='Sheet1')
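
The calls above are commented out because they need real files. To try the CSV round trip without touching the disk, an in-memory text buffer can stand in for the file. This is a sketch; `buffer`, `csv_text` and `restored` are hypothetical names, and the small frame is made up.

```python
import io
import pandas as pd

df = pd.DataFrame({'SaleType': ['big', 'small'], 'SalesCode': [121, 131]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: no index column is written
csv_text = buffer.getvalue()

# Rewind the buffer and read the same text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text.splitlines()[0])   # SaleType,SalesCode
print(restored.columns.tolist())  # ['SaleType', 'SalesCode']
```

With a real file, replace the buffer with a path such as 'filename.csv' in both calls.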