Dates and Times with pandas.DatetimeIndex

Jun 24, 2021

Introduction

When working with data, one task that we might have to deal with often is manipulation of dates and times. The format provided in the dataset may not always be one which we can easily use in the various tasks required. Transforming from that format into one that can be easily used is something that forms a core part of the preparatory work needed to be performed on a dataset.

In this article, we shall explore the use of the Pandas function DatetimeIndex in transforming date data into a format we can use for further exploration.

We shall then look at a practical example using data from the Kiva for Good dataset.

Finally we shall use Streamlit to generate a dashboard from the practical examples.

Preliminary steps

Read in the Data

Import Pandas and read in the data. We shall use the kiva loans dataset. We shall use just the funded_amount and funded_time columns for this exercise.

        
import pandas as pd

df = pd.read_csv("data/kiva_loans.csv", usecols=[1,13])
df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
df = df[["funded_amount", "funded_time"]]

view raw dta_001.py hosted with ❤ by GitHub

Conversion to `datetime` Object

Say you want to parse the funded_time column and get the different portions of the date (second, minute, hour, day, month, year). You could use regular text processing to do this (similar to the methods described in this article). Or you could use the Pandas DatetimeIndex function, which is what we shall use in this article.

If you explore the data types of the two columns, you will find that the funded_time column has the object data type.

First we need to convert the values in that column into a datetime object. For that we use the pandas.to_datetime() method.

df["funded_time"] = pd.to_datetime(df["funded_time"])

view raw dta_002.py hosted with ❤ by GitHub

When you now check the data types after performing this operation, you will see that the funded_time column is now of type datetime64[ns, UTC]. The number 64 indicates that the data is stored as a 64-bit integer. The ns indicates that this data is stored in units of nanoseconds. UTC indicates that this time is stored as Coordinated Universal Time (more on UTC can be found at the Bureau International des Poids et Mesures).

We can now proceed to process this column using the DatetimeIndex.

DatetimeIndex Methods and Attributes

Before processing all the rows, let us first pick out one instance of a timestamp and use that to walk through the various methods (denoted by parenthesis () after the name) and attributes.

Date

date() returns just the date portion of the time stamp.

Year

year returns the year.

is_year_start tests to see if it is the first day of the year.

is_year_end tests to see if it is the last day of the year.

is_leap_year tests to see if it is a leap year.

Quarter

quarter returns the quarter number of the year.

is_quarter_start tests to see if it is the first day of the quarter.

is_quarter_end tests to see if it is the last day of the quarter.

Month

month returns the month number.

month_name() returns the name of the month.

is_month_start tests to see if it is the first day of the month.

is_month_end tests to see if it is the last day of the month.

Day

day returns the date in the month.

day_name() returns the name of the day.

dayofweek returns the ordinal position of the day in a week. Monday is considered the first, with an ordinal position of 0.

dayofyear returns the number of the day in the year.

daysinmonth returns the number of days in that month.

Time

time() returns just the time portion of the time stamp.

Hour

hour returns the hour of the day.

Minute

minute returns the minute of the hour.

Second and Sub-Seconds

second returns the second of the minute.

microsecond returns microseconds.

nanosecond returns nanoseconds.

Normalize

normalize() this converts all times to midnight. It is useful in cases where the time does not matter, and all you are interested in is the date.

Timezones

tz returns the timezone of the timestamp.

tz_convert() converts the timestamp into one of the specified timezone. Provide the new timezone as Continent/City

Practical Example

Now that we have seen some of the attributes and methods of DatetimeIndex, let us use some of them in exploring our data. In this part, we shall answer a series of questions.

But before we get to the questions, we need to first set the index of our dataframe to the funded_time column. This will make it easier to get the required date or time component.

df.set_index("funded_time",inplace=True)

view raw dta_008.py hosted with ❤ by GitHub

Without doing this, we would have to use a for loop or function to make the required change in an entire column as shown in the code snippet below:

df["year"] = df["funded_time"].apply(lambda x: x.year)

view raw dta_009.py hosted with ❤ by GitHub

With the funded_time column set as the index, all we need to do is call the appropriate attribute or method to get the required change as shown below:

df["year"] = df.index.year

view raw dta_010.py hosted with ❤ by GitHub

On to the questions.

Question 1: What year saw the highest funded loan amounts?

Create a new column with years derived from the timestamp.
Aggregate funded amounts in each year.
Generate and display a dataframe with this information

        
df["year"] = df.index.year

year_aggregates = df.groupby("year")["funded_amount"].sum().sort_values(ascending=False).reset_index()
year_aggregates

view raw dta_003.py hosted with ❤ by GitHub

The resulting dataframe shows 2016 as having the highest funded amount. This is also visually evident from the Seaborn bar plot.

Question 2: In each year, which month recorded the highest funded amount?

Create a new column that has the month.
For each year, aggregate by month.
Identify the month with the highest funded amount.
Generate and display a dataframe that has year, month and funded amount.

        
df["month"] = df.index.month_name()

high_month_list = []
for year in list(df["year"].unique()):
    df_high_month = df[df["year"] == year][["year","funded_amount","month"]]
    df_high_month = df_high_month.groupby(["year","month"])["funded_amount"].sum().sort_values(ascending=False).head(1)
    high_month_list.append(df_high_month)
df_high_month = pd.concat(high_month_list).reset_index()
df_high_month

view raw dta_004.py hosted with ❤ by GitHub

We see that in each year other than 2017, December was the month that recorded the highest funded amount.

Question 3: Is there a specific day in each month that recorded the highest funded amount?

Create two new columns, one for number of month, and the second for day. The number of month column will be used to sort the final dataframe so that the months appear chronologically.
For each year, aggregate the daily funded amounts.
Get the day in each month that has the highest funded amount.
Generate and display a dataframe that has year, month, day and funded amount.

        
df["month_number"] = df.index.month
df["day"] = df.index.day

high_day_list = []
for year in list(df["year"].unique()):
    df_year = df.groupby(["year","month","month_number","day"])["funded_amount"].sum().reset_index()
    df_year = df_year[df_year["year"] == year][["year","month","day","funded_amount", "month_number"]]
    for month in list(df["month"].unique()):
        df_high_day = df_year[df_year["month"] == month]
        df_high_day = df_high_day[df_high_day["funded_amount"] == df_high_day["funded_amount"].max()]
        high_day_list.append(df_high_day)
df_high_day = pd.concat(high_day_list).reset_index(drop=True)
sorted_df_high_day = []
for year in list (df_high_day["year"].unique()):
    sorted_df_high_day.append(df_high_day[df_high_day["year"] == year].sort_values(by="month_number"))
df_high_day = pd.concat(sorted_df_high_day).reset_index(drop=True).drop("month_number","columns")
df_high_day

view raw dta_005.py hosted with ❤ by GitHub

Let’s use Plotly Express to visualize this data.

We can see from the plot the days in each year that saw the highest funded amount.

Question 4: For each month, what was the daily average funded amount?

Create a new column with number of days in month
Aggregate the funded amount per month for each year.
Get the average funded amount for each month.
Generate and display a dataframe with year, month and average funded amount.

        
df["days_in_month"] = df.index.daysinmonth

df_average = df.groupby(["year","month","month_number","days_in_month"])["funded_amount"].sum().reset_index()
df_average["monthly_average"] = round(df_average["funded_amount"] / df_average["days_in_month"],2)
sorted_df_average = []
for year in list(df_average["year"].unique()):
    sorted_df_average.append(df_average[df_average["year"] == year].sort_values(by="month_number"))
df_average = pd.concat(sorted_df_average).reset_index(drop=True).\
         drop(["month_number","days_in_month","funded_amount"],"columns")
df_average

view raw dta_007.py hosted with ❤ by GitHub

An aside:

While going over the article before publishing, it occurred to me that I had not really checked to see if the daysinmonth attribute would report the correct number of days for February for leap years and regular years. At first I thought I would have to use the is_leap_year test in this portion of the exercise, but on further research I found that this has been well taken care of by DatetimeIndex.

To test this, I generated two dates in February, one on a leap year, one on a regular year, and checked the number of days in the month for both years. The expected results were produced by the code.

Like for the daily highest, let us visualize the monthly averages using a Plotly bar chart.

Going through these four examples should provide enough guidance on how to proceed. There’s more that can be gleamed from this data, and for that I leave it to you for further exploration.

Streamlit Dashboard

Finally, we can deploy a dashboard using Streamlit based on the brief exploration we have done. This will display a sample of the dataframe, plus the three plots generated in the course of this exercise.

For details on how to work with Streamlit, refer to this article.

Conclusion

In this article, we explored the methods and attributes of the Pandas function, DatetimeIndex.

We saw it’s use in extracting needed date and time information from a timestamp to aid in analysis of data.

We walked through several examples using sample data from the Kiva loans dataset.

Finally we deployed a dashboard showing a dataframe with the generated columns and some visualizations using Streamlit.

The Jupyter Notebook with the code from this article can be accessed here.

The Streamlit code for the dashboard deployment can be accessed here.

Tech & Art with Eric

Discussion about this post