A question came up the other day. How do you filter rows in a dataframe based on specific values in a column?
My original inclination was to perform a series of tests on the values needed, then use that as the mask.
First, a recap of row selection from a Pandas DataFrame...
To select all rows, just call the dataframe...
df
To return a single column, specify that in brackets.
df["day"]
That returns a Series.
To see whether certain values are in that Series, you can test for it using a specific value like shown below:
df["day"] == "Friday"
This returns a Series of True/False values depending on whether a particular entry matches the test or not.
Let us assign that to a mask variable. We will use Monday instead of Friday.
mask = df["day"] == "Monday"
We can now use this to subset rows that meet this requirement.
df[mask]
This gives us just the rows that meet this requirement.
If we need to return rows for several masks, we need to use the bitwise "or" operator "|"
First, let us create the different masks
mask_Saturday = df["day"] == "Saturday"
mask_Sunday = df["day"] == "Sunday"
mask_Monday = df["day"] == "Monday"
Now let's subset the rows with these masks.
df[mask_Saturday | mask_Sunday | mask_Monday]
This returns the following rows.
That is a lot of typing.
That got me thinking...
Was there a simpler way to do this?
I ran the following command:
dir(pd.Series)
to get a list of attributes/methods belonging to it.
On scrolling down, I came across the isin()
method. That looked promising. On checking the documentation, I realised it was exactly what I needed. You just pass a list of words to it, and if it finds a match in the DataFrame, it will create that mask and return the values needed.
df[df["day"].isin(["Saturday", "Sunday", "Monday"])]
Exact same results as before. With much less typing.
What's not to like about pd.Series.isin()
?