Selecting, Slicing and Filtering data in a Pandas DataFrame

Source:www.opentechguides.com | Date Published: 2019-10-16 12:30:14

One of the essential features that a data analysis tool must provide for working with large data sets is the ability to select, slice, and filter data easily. Pandas provides this through the DataFrame. A DataFrame consists of data arranged in rows and columns, together with row and column labels. You can select, slice, or take a subset of the data in several ways, for example by label, by index position, or by value. Here we demonstrate some of these operations using a sample DataFrame.

First, let's create a DataFrame with 5 rows and 4 columns, holding the values 0 to 19. We use the arange() and reshape() functions from the NumPy library to create a two-dimensional array, which is then passed to the Pandas DataFrame constructor.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5,4), columns=["A","B","C","D"])
print(df)

Output:

    A   B   C   D
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

Select rows and columns using labels

You can select rows and columns in a Pandas DataFrame by using their corresponding labels.
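
As a minimal sketch of label-based selection, the snippet below rebuilds the sample DataFrame and picks out columns with square brackets and rows with loc. The variable names (col_a, cols_ac, row_2, cell) are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample DataFrame so this snippet runs on its own
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["A", "B", "C", "D"])

col_a = df["A"]            # one column label -> Series
cols_ac = df[["A", "C"]]   # list of column labels -> DataFrame
row_2 = df.loc[2]          # row whose index label is 2 -> Series
cell = df.loc[2, "C"]      # single value at row label 2, column "C"

print(col_a.tolist())      # [0, 4, 8, 12, 16]
print(row_2.tolist())      # [8, 9, 10, 11]
print(cell)                # 10
```

Here the row labels happen to be the default integers 0 to 4, so loc[2] looks like a positional lookup, but it matches on the label, not the position.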

Select by Index Position

You can select data from a Pandas DataFrame by its integer position using the iloc attribute. Note that positions are counted from zero.
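
A short sketch of position-based selection with iloc, again on the sample DataFrame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample DataFrame so this snippet runs on its own
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["A", "B", "C", "D"])

first_row = df.iloc[0]         # row at position 0 -> Series
last_col = df.iloc[:, -1]      # all rows, last column (D) -> Series
cell = df.iloc[3, 1]           # row position 3, column position 1
picked_rows = df.iloc[[0, 4]]  # list of positions -> DataFrame with rows 0 and 4

print(first_row.tolist())      # [0, 1, 2, 3]
print(last_col.tolist())       # [3, 7, 11, 15, 19]
print(cell)                    # 13
```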

Slicing Rows and Columns using labels

You can select a range of rows or columns using labels or by position. To slice by labels, use the loc attribute of the DataFrame. Unlike ordinary Python slices, a label slice includes both its start and end labels.
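
The following sketch slices the sample DataFrame by label with loc. Note that both ends of a label slice are included in the result. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample DataFrame so this snippet runs on its own
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["A", "B", "C", "D"])

rows_1_to_3 = df.loc[1:3]          # row labels 1, 2 and 3 (end label included)
cols_b_to_c = df.loc[:, "B":"C"]   # all rows, columns B and C
block = df.loc[1:3, "B":"C"]       # both slices combined

print(block.values.tolist())       # [[5, 6], [9, 10], [13, 14]]
```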

Slicing Rows and Columns by position

To slice a Pandas DataFrame by position, use the iloc attribute. Positions run from 0 to (number of rows or columns - 1) and, as with ordinary Python slices, the end position is excluded.
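
A sketch of positional slicing with iloc on the sample DataFrame. Here the end of each slice is excluded, which differs from label slicing with loc. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample DataFrame so this snippet runs on its own
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["A", "B", "C", "D"])

block = df.iloc[1:3, 0:2]     # row positions 1-2, column positions 0-1
every_other = df.iloc[::2]    # every second row: positions 0, 2, 4
last_two_cols = df.iloc[:, -2:]  # all rows, last two columns (C and D)

print(block.values.tolist())      # [[4, 5], [8, 9]]
print(list(every_other.index))    # [0, 2, 4]
```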

Subsetting by boolean conditions

You can use boolean conditions to obtain a subset of the data from the DataFrame.
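
As a sketch of boolean filtering, the snippet below builds conditions on column values and uses them to keep only the matching rows. The thresholds and variable names are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample DataFrame so this snippet runs on its own
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["A", "B", "C", "D"])

# A comparison on a column yields a boolean Series, one value per row
mask = df["A"] > 8
big_a = df[mask]                   # keeps only rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = df[(df["A"] > 4) & (df["D"] < 19)]

# loc accepts a boolean condition for rows plus a column selection
subset = df.loc[df["C"] >= 10, ["A", "C"]]

print(list(big_a.index))           # [3, 4]
print(list(both.index))            # [2, 3]
print(subset.values.tolist())      # [[8, 10], [12, 14], [16, 18]]
```

The parentheses around each condition are needed because & and | bind more tightly than comparison operators in Python.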

