Blog

Musings of a Developer

Let’s learn Panda

Pandas…. Abbreviation of Panel data. Panda is a Python library module for data analysis and data manipulation. Data, as it is said, is one of the most important aspect in the field of Machine Learning. Machines cannot be trained until and unless it is provided with data (because Duh!!! What will it learn ???). That being said, In this world of infinite type of data formats and their uneven scattering, Data will not present itself every time in a decorated and beautiful manner(Damn! Its a big mess). This is where Panda comes to the rescue. Panda makes data cleaning, formatting, refining and all sorts of data manipulation millions of times faster and efficient ! Its ease of use, open source nature, Fantastic documentation, Huge community, active development since its initial release, are just some of the features that panda owes to for its fame among the data scientist.

Without any further due let’s get our hands dirty and get familiar with some of the basic functionalities Panda has to offer.

NOTE: In this post I am going to assume that you already have panda installed on your machine. If not, you might want to do that first i.e. ‘pip install pandas’

Lets start with some basic panda setup and usage!

import pandas as pd import matplotlib.pyplot as plt from matplotlib import stylev style.use('fivethirtyeight') stats = { 'Day':[1,2,3,4,5,6], 'Visitors':[43,34,65,56,29,76], 'Bounce_Rate':[65,67,78,65,45,52] } df = pd.DataFrame(stats) print(df.head()) print(df.tail()) print(df.head(2)) print(df.tail(2)) print(df.shape) df.set_index('Day', inplace = True) print(df) df.index.name = 'Day' df.reset_index(inplace = True) print(df)

Now lets move on to some of the basic column operation.

import pandas as pd import matplotlib.pyplot as plt from matplotlib import style style.use('fivethirtyeight') stats = { 'Day':[1,2,3,4,5,6], 'Visitors':[43,34,65,56,29,76], 'Bounce_Rate':[65,67,78,65,45,52] } df = pd.DataFrame(stats) #AXIS = 0 STANDS FOR ROWS df.drop(['Day', 'Visitors'], axis = 1, inplace = True) print(df.Bounce_Rate) print('Bounce_Rate') df.rename(columns = {'Day': 'day', 'Visitors': 'visitors'}, inplace = True) print(df.columns)

Some basic Filtering and Sorting operations can be done as follows.

import pandas as pd import matplotlib.pyplot as plt from matplotlib import style style.use('fivethirtyeight') stats = { 'Day':[1,2,3,4,5,6], 'Visitors':[43,34,65,56,29,76], 'Bounce_Rate':[65,67,78,65,45,52] } df = pd.DataFrame(stats) x = df.Visitors.sort_values() print(x) ''' OR another method to do the same would be: x = df['Visitors'].sort_values() print(x) ''' x = df.sort_values('Day') print(x) x = df[df.Day > 2] print(x) x = df[(Day > 2) and (Visitors > 29)] print(x) y = df[(Day > 2) or (Visitors > 29)] print(y)

Let’s read from a csv file and write to a new csv file.

import pandas as pd df = pd.read_csv('somecsvfile.csv') df.to_csv('anothercsvfile.csv') df['somecolumn'].to_csv('yetanothercsvfile.csv') df = pd.read_csv('somecsvfile.csv', usecols = ['column_name', 'another_column_name']) df = pd.read_csv('somecsvfile.csv', index_col = 0) df.rename(columns = {'originalname' : 'newname'}, inplace = True)

Let’s read from an HTML page.

import pandas as pd df = pd.read_html('here goes the url link of the website') print(df) print(df[0]) print(df[0][0]) for item in df[0][0][1:]: print(str(item))

Now, lets try to join two pandas dataframe in 4 different ways namely Joining, Concatenating, Appending and Merging. All of them might seem to work in a similar manner at first glance but dont be mistaken because they all have very unique and different methodologies of operations.

#CONCATENATING AND APPENDING import pandas as pd df1 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2001, 2002, 2003, 2004]) df2 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2005, 2006, 2007, 2008]) df3 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'Low_tier_HPI':[50, 52, 50, 53]}, index = [2001, 2002, 2003, 2004]) concat_1 = pd.concat([df1, df2]) print concat concat_2 = pd.concat([df1, df2, df3]) print(concat_2) df4 = df1.append(df2) print(df4) s = pd.Series([40, 90, 65], index = ['HPI', 'Int_rate', 'US_GDP_Thousands']) df5 = df1.append(s, ignore_index = True) print(df5) #JOINING AND MERGING import pandas as pd df1 = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) df2 = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) first_merge = pd.merge(df1, df2, on = 'id') print(first_merge) second_merge = pd.merge(df1, df2, on = ['id', 'subject_id']) print(second_merge) # MERGE USING 'HOW' ARGUMENT """ THERE ARE 4 TYPES OF MERGING METHOD USING 'HOW' ARGUMENT 1) LEFT OUTER 2) RIGHT OUTER 3) FULL OUTER $) INNER """ left_merge = pd.merge(df1, df2, on = 'subject_id', how = 'left') print(left_merge) right_merge = pd.merge(df1, df2, on = 'subject_id', how = 'right') print(right_merge) outer_merge = pd.merge(df1, df2, on = 'subject_id', how = 'outer') print(outer_merge) inner_merge = pd.merge(df1, df2, on = 'subject_id', how = 'inner') print(inner_merge) ''' JOINING WILL BE PERFORMED ON INDEX. JOIN OPERATION HONORS THE OBJECT ON WHICH IT IS CALLED. SO, A.JOIN(B) IS NOT EQUAL TO B.JOIN(A). ''' df1.set_index('subject_id', inplace=True) df3.set_index('subject_id', inplace=True) right_joined = df1.join(df3, how="right") print(right_joined) right_joined = df1.join(df3, how="left") print(left_joined) right_joined = df1.join(df3, how="inner") print(inner_joined) right_joined = df1.join(df3, how="outer") print(outer_joined)

One very common occurring instance in data sets is the missing of values. A NAN(Not A Number). To handle those missing values, Panda has some built in features which we can utilize.

import pandas as pd df = pd.read_csv('http://bit.ly/uforeports') df.isnull().tail() df.notnull().tail() df.isnull().sum() df[df.City.isnull()] df.dropna(how = 'any') df.dropna(how = 'all') df.dropna(subset = ['City', 'Shape Reported'], how = 'any')\ df.fillna(value = 'heregoesthevalueyouwanttofill')

Quandl is an online website with tons of dataset in a variety of fields (e.g: American housing price indexes, equities, commodities, foreign exchange and more). For more on Quandl visit quandl.com. Let’s setup Quandl with pandas to fetch data from Quandl and use it in analysis.

import quandl api_key = 'yoursuperamazingquandlAPIkey' df = quandl.get('heregoesthequandlcode', authtoken = api_key) print(df.head())

Here we conclude this post of dirtying our hands with Pandas module library. Pandas is a very feature rich module and has a Myriad of functionalities to offer, most of which cannot be readily covered in a single blog post or even in a series of blog posts. This post was intended to get your hands adept with the library and to show its capabilities, but it does not cover even the slightest part of what truly pandas has to offer. Thus if you are inspired to learn more of pandas then please head to the official docs page.

Footnote : I have a Github repository which contains all of the above code in a very well commented structure alongside all the necessary references that I have used to explore this beautiful library. I will keep updating this repository as the time passes and as I learn something worth updating.

Stay tuned. Until next time…!