Pandas is an open source Python library for data analysis. With Pandas, data becomes spreadsheet-like (i.e. with headers and columns) and can be loaded, manipulated, aligned, merged, and so on.
Unlike a spreadsheet programme, which has its own "macros" language for more complex calculations and analysis, Pandas builds on top of Python, enabling users to automate, reproduce, and share analyses across many different operating systems.
Pandas introduces two new data types - Series and DataFrame. The DataFrame represents an entire spreadsheet or rectangular dataset, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
By using a Pandas DataFrame, you can attach labels to your variables, making it easier to manage and manipulate data.
Pandas is a Python library that makes handling tabular data easier. Since we're doing data science, this is something we'll use from time to time!
It's one of three libraries you'll encounter repeatedly in the field of data science:
Pandas introduces "DataFrames" and "Series" that allow you to slice and dice rows and columns of information.
You'll also encounter "NumPy arrays", which are multi-dimensional array objects. It is easy to create a Pandas DataFrame from a NumPy array, and a Pandas DataFrame can be cast back as a NumPy array. NumPy arrays are mainly important because of...
The machine learning library we'll use throughout this course is scikit-learn (imported as sklearn), and it generally takes NumPy arrays as its input.
So, a typical thing to do is to load, clean, and manipulate your input data using Pandas, then convert your Pandas DataFrame into a NumPy array as it's being passed into some scikit-learn function. That conversion can often happen automatically.
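That round trip can be sketched as follows (the two-column data here is made up; `to_numpy()` is the explicit conversion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})  # made-up data
arr = df.to_numpy()                            # DataFrame -> NumPy array
back = pd.DataFrame(arr, columns=df.columns)   # NumPy array -> DataFrame
print(type(arr))  # <class 'numpy.ndarray'>
```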
Let's start by loading some comma-separated value data using Pandas into a DataFrame:
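The original loading cell isn't reproduced here; as a self-contained sketch, an in-memory string stands in for the course's actual CSV file (the column names and values below are made up):

```python
import io
import pandas as pd

# In a real notebook you would pass a filename, e.g. pd.read_csv("data.csv").
csv_text = io.StringIO(
    "name,age,city\n"
    "Ann,34,Oslo\n"
    "Bob,29,Lima\n"
    "Cy,41,Oslo\n"
    "Dee,29,Rome\n"
)
df = pd.read_csv(csv_text)
print(df)
```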
head() is a handy way to visualize what you've loaded. You can pass it an integer to see some specific number of rows at the beginning of your DataFrame:
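For example, with a small made-up DataFrame standing in for the loaded CSV:

```python
import pandas as pd

# Made-up sample data in place of the course's CSV.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df.head())   # first 5 rows by default (all 4 here)
print(df.head(2))  # just the first 2 rows
```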
You can also view the end of your data with tail():
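A sketch with the same kind of made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df.tail(2))  # the last 2 rows
```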
We often talk about the "shape" of your DataFrame. This is just its dimensions, as a (rows, columns) tuple. This particular CSV file has 13 rows with 7 columns per row:
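With a made-up 4-row, 3-column DataFrame (rather than the 13×7 file described above):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df.shape)  # -> (4, 3)
```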
The total size of the DataFrame is rows * columns:
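Again with made-up data (4 rows × 3 columns):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df.size)  # 4 rows * 3 columns -> 12
```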
The len() function gives you the number of rows in a DataFrame:
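For example:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(len(df))  # number of rows -> 4
```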
If your DataFrame has named columns (in our case, extracted automatically from the first row of a .csv file), you can get an array of them back:
Extracting a single column from your DataFrame looks like this - this gives you back a "Series" in Pandas:
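For example, pulling a single made-up column out as a Series:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
ages = df["age"]           # a single column -> a Series
print(type(ages).__name__) # Series
```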
You can also extract a given range of rows from a named column, like so:
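A sketch with made-up data, slicing rows 1 and 2 out of one column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df["age"][1:3])  # rows 1 and 2 of the "age" column
```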
Or even extract a single value from a specified column / row combination:
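For example, row 2 of the made-up "age" column (`df.loc` is the equivalent label-based form):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df["age"][2])       # -> 41
print(df.loc[2, "age"])   # equivalent: row label 2, column "age"
```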
To extract more than one column, you pass in a list of column names instead of a single one:
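With two made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
subset = df[["age", "city"]]  # note the list of column names
print(subset)
```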
You can also extract specific ranges of rows from more than one column, in the way you'd expect:
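Combining the two, again with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df[["age", "city"]][1:3])  # rows 1-2 of two columns
```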
Sorting your DataFrame by a specific column looks like this:
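For example, sorting a made-up DataFrame by its "age" column with `sort_values()`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
print(df.sort_values("age"))  # ascending by default
```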
You can break down the number of unique values in a given column into a Series using value_counts() - this is a good way to understand the distribution of your data:
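For example, counting the repeated values in a made-up "city" column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
counts = df["city"].value_counts()  # a Series of value -> frequency
print(counts)
```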
Pandas even makes it easy to plot a Series or DataFrame - just call plot():
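A minimal sketch, assuming matplotlib is installed (the headless "Agg" backend is used here so no display window is needed; the data is made up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; assumption for environments without a display
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "age": [34, 29, 41, 29],
                   "city": ["Oslo", "Lima", "Oslo", "Rome"]})
ax = df["age"].plot()  # line plot of the Series against its index
```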
The first main data type we will learn about for Pandas is the Series data type. Let's import Pandas and explore the Series object.
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
Let's explore this concept through some examples:
2.1 Creating a Series
You can convert a list, NumPy array, or dictionary to a Series:
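A minimal sketch with made-up labels and values, showing all three conversions (and a label-based lookup at the end):

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c"]
my_list = [10, 20, 30]
arr = np.array(my_list)
d = {"a": 10, "b": 20, "c": 30}

s1 = pd.Series(data=my_list, index=labels)  # from a list
s2 = pd.Series(arr, index=labels)           # from a NumPy array
s3 = pd.Series(d)                           # from a dict; keys become the index
print(s3["b"])  # label-based lookup -> 20
```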