- Week 13: Datetime and Time Series
- Objective
- Datasets
- Datetime
- Create datetime object
- Convert from string to datetime
- Convert from datetime to UTC timestamp and vice versa
- Format a datetime object to string
- Arithmetic on datetime
- Timezone conversion
- Bonus: how to get a datetime object for the current time without microseconds
- Bonus: Deal with different scales of time durations
- Time Series Basics
- Time Series Advanced Topics
- References
- Understand the principle of timestamp and datetime format
- Master basic computation on datetime values, including handling timezone conversion
- Understand periodical analysis (daily, weekly, monthly, seasonal, etc.)
Modules:
- datetime
- dateutil (parser)
- pandas
  - basic visualisation using polyline: .plot()
  - zoom in/out: .resample, .aggregate
- seaborn
  - basic visualisation using polyline
- NBC Russian Troll on Twitter dataset
- Twitter Data of the Donald & Ivanka Trump analysis -- reproduce the charts.
- Financial data usually comes in the form of time series, e.g. Yahoo Finance API.
- Bitcoin transactions are available via the blockchain.com API. A free cryptocurrency API is available to query the exchange rate between two symbols.
The datetime module supplies classes for manipulating dates and times, covering tasks like parsing, formatting and even arithmetic. You can read its documentation here.
First we create a datetime object:
from datetime import datetime
dt = datetime(year=1993, month=10, day=4,
hour=9, minute=8, second=7)
dt
Or you can use positional arguments:
from datetime import datetime
dt = datetime(1993, 10, 4, 9, 8, 7)
dt
Output:
datetime.datetime(1993, 10, 4, 9, 8, 7)
The result is a datetime object: a single object containing all the information about one point in time.
In many cases, we need to standardise the dates/times we scraped from the Internet into datetime objects for further processing. We can use parse from the dateutil library. See this case:
from dateutil.parser import parse
dt_1 = parse("Thu Sep 25 10:36:28 2018")
dt_2 = parse('19/May/2017 04:10:06')
dt_3 = parse('2018.2.3')
dt_4 = parse('June 12, 2018')
dt_1, dt_2, dt_3, dt_4
Output:
(datetime.datetime(2018, 9, 25, 10, 36, 28),
datetime.datetime(2017, 5, 19, 4, 10, 6),
datetime.datetime(2018, 2, 3, 0, 0),
datetime.datetime(2018, 6, 12, 0, 0))
All the time strings in different formats have been converted into datetime objects. The level of detail depends on how explicit the information in the raw data is.
In some cases, we may need to parse ambiguous dates like parse("10-09-2003"). We need to tell the parser, via a parameter, what the first figure represents:
from dateutil.parser import parse
dt_1 = parse("10-09-2003", dayfirst=True)
dt_2 = parse("10-09-03", yearfirst=True)
dt_1, dt_2
Output:
(datetime.datetime(2003, 9, 10, 0, 0), datetime.datetime(2010, 9, 3, 0, 0))
Many time strings on the Internet are not as well-formed as 2018/12/22. Instead, many of them look like 12/22 or even Thu 10:36:28. We can define a default time:
from datetime import datetime
from dateutil.parser import parse
DEFAULT = datetime(2018, 11, 25)
dt_1 = parse("Thu Sep 10:36:28", default=DEFAULT)
dt_2 = parse("Thu 10:36:28", default=DEFAULT)
dt_3 = parse("12/25", default=DEFAULT)
dt_4 = parse("10:36", default=DEFAULT)
dt_1,dt_2,dt_3,dt_4
Output:
(datetime.datetime(2018, 9, 27, 10, 36, 28),
datetime.datetime(2018, 11, 29, 10, 36, 28),
datetime.datetime(2018, 12, 25, 0, 0),
datetime.datetime(2018, 11, 25, 10, 36))
- The parameter default tells the parser how to fill in the units that are missing from the input. For example, Thu Sep 10:36:28 lacks the year, and 12/25 lacks both the year and the time of day. Python fills them in from the corresponding parts of default.
However, parsing will fail if the input string is not in a recognisable format.
from dateutil.parser import parse
parse("2月15日 10:36:28")
This line will raise a ValueError
:
ValueError: ('Unknown string format:', '2月15日 10:36:28')
- If we want to parse "2月15日 10:36:28" (February 15), we need to convert it into a format without Chinese characters first. Try str.replace() before parsing, as shown in the sketch below.
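For instance, a minimal sketch (the replacement characters chosen here are just one possible mapping):
from dateutil.parser import parse
raw = '2月15日 10:36:28'
cleaned = raw.replace('月', '/').replace('日', '')  # '2/15 10:36:28'
parse(cleaned)  # month=2, day=15; the missing year is filled from the current date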
Generally, the dateutil module provides powerful extensions to the standard datetime module. You can check more parse examples here.
A timestamp is the time in seconds since the epoch, expressed as a floating point number. The epoch is the point where the time starts, and it is platform dependent. On Windows and most Unix systems, the epoch is January 1, 1970, 00:00:00 (UTC).
from datetime import datetime, timezone
dt = datetime(
1993, 10, 4, 9, 8, 7,
tzinfo=timezone.utc)
ts = dt.timestamp()
ts
Output:
749725687.0
from datetime import datetime
datetime.utcfromtimestamp(ts)
Output:
datetime.datetime(1993, 10, 4, 9, 8, 7)
Bonus: UTC (Coordinated Universal Time) is the primary time standard by which the world regulates clocks and time.
from datetime import datetime
dt = datetime.now()
ds = dt.isoformat(timespec='seconds', sep=' ')
print(ds, type(ds))
Output:
2018-11-19 16:45:44 <class 'str'>
- Question: What is the type of ds? Try changing the parameters of .isoformat(), or removing them, to see what happens.
- You can also use str(dt) to convert a datetime object into a string; a quick comparison is sketched below.
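For example (a small sketch with a fixed time so that the outputs are reproducible):
from datetime import datetime
dt = datetime(2018, 11, 19, 16, 45, 44)
str(dt)          # '2018-11-19 16:45:44'
dt.isoformat()   # '2018-11-19T16:45:44'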
In some cases, we may need to convert a datetime into a string with a specific format. We can use strftime(). Here is an example:
from datetime import datetime
dt = datetime(2018, 11, 20, 14, 0, 0)
print(dt.strftime('%I:%M%p %m/%d(%a),%Y'))
print(dt.strftime('%H:%M:%S %Y/%m/%d'))
Output:
02:00PM 11/20(Tue),2018
14:00:00 2018/11/20
In this case, %H and %I represent the hour on a 24-hour clock and a 12-hour clock respectively. For the 12-hour clock, we use %p to get AM/PM from the datetime object. You can click here to see what each format code in strftime() represents.
One can perform boolean comparison on datetime
objects:
from datetime import datetime
result_1 = datetime(2018, 6, 12, 0, 0) > datetime(2018, 2, 3, 0, 0)
result_2 = datetime(2018, 6, 12, 0, 0) == datetime(2018, 2, 3, 0, 0)
result_3 = datetime(2018, 6, 12, 0, 0) < datetime(2018, 2, 3, 0, 0)
print(result_1, result_2, result_3)
Output:
True False False
A timedelta object represents a duration, the difference between two dates or times. We can build a timedelta object like this:
from datetime import timedelta
td = timedelta(days=1)
td
Output:
datetime.timedelta(days=1)
The parameter days can be replaced with seconds, microseconds, milliseconds, minutes, hours or weeks. We can also combine them, e.g. timedelta(weeks=1, days=2, hours=12), as in the sketch below.
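A combined duration is normalised into days, seconds and microseconds:
from datetime import timedelta
timedelta(weeks=1, days=2, hours=12)
Output:
datetime.timedelta(days=9, seconds=43200)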
To get the duration between two datetime objects, we can calculate like this:
from datetime import datetime
datetime(2018, 6, 12, 0, 0) - datetime(2018, 2, 3, 0, 0)
Output:
datetime.timedelta(days=129)
We can also do arithmetic between a datetime object and a timedelta object, e.g. what is the date 4 weeks later?
from datetime import datetime, timedelta
td_today = datetime(2018, 11, 19)
# td_today = datetime.now()  # use the current date and time instead
td = td_today + timedelta(weeks=4)
str(td)
Output:
'2018-12-17 00:00:00'
In this case, you can use datetime.now() instead to get the real-time date.
- For TZ conversion of a datetime object: https://stackoverflow.com/a/18646797
  - Use .replace to add TZ information
  - Use .astimezone to convert to a new TZ
- For TZ conversion of a pandas.Timestamp object: module-pandas.md#timezone-converson-for-pandastimestamp
  - Use .tz_localize to add TZ information
  - Use .tz_convert to convert to a new TZ
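A minimal sketch of both approaches, using Hong Kong time (UTC+8) as the target timezone (the example values are only for illustration):
from datetime import datetime, timezone, timedelta
import pandas as pd

# datetime: .replace adds TZ information, .astimezone converts to a new TZ
dt = datetime(2018, 11, 19, 8, 0, 0)
dt_utc = dt.replace(tzinfo=timezone.utc)
dt_hk = dt_utc.astimezone(timezone(timedelta(hours=8)))

# pandas.Timestamp: .tz_localize adds TZ information, .tz_convert converts to a new TZ
ts = pd.Timestamp('2018-11-19 08:00:00')
ts_utc = ts.tz_localize('UTC')
ts_hk = ts_utc.tz_convert('Asia/Hong_Kong')

print(dt_hk, ts_hk)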
You may have found that datetime.now() returns something like datetime.datetime(2018, 11, 19, 16, 48, 33, 369859). In the earlier part, we used dt.isoformat(timespec='seconds', sep=' ') to omit the microseconds, but that method converts the datetime object into a string. We can keep the current time as a datetime object for further calculation with these lines:
from datetime import datetime
dt = datetime.now()
dt_without_microseconds = datetime(dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
dt_without_microseconds
Output:
datetime.datetime(2018, 11, 19, 16, 41, 15)
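A shorter alternative (not used in the original notes, but part of the standard library) is to drop the microseconds with .replace():
dt_without_microseconds = dt.replace(microsecond=0)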
After we have scraped time strings from a website, we may need to deal with a group of durations expressed on different scales:
from datetime import datetime, timedelta
time_list = ['30 Minutes ago', '12 Hours ago',
             '4 Days ago', '3 Weeks ago',
             '2 Month ago', '1 Year ago']
dt = datetime.now()
now = datetime(dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
print('The current time is {}'.format(now))
mylist = []
for i in time_list:
    if i.lower().find('minute') != -1:
        post_time = now - timedelta(minutes=int(i.split()[0]))
    elif i.lower().find('hour') != -1:
        post_time = now - timedelta(hours=int(i.split()[0]))
    elif i.lower().find('day') != -1:
        post_time = now - timedelta(days=int(i.split()[0]))
    elif i.lower().find('week') != -1:
        post_time = now - timedelta(weeks=int(i.split()[0]))
    elif i.lower().find('month') != -1:
        post_time = now - timedelta(days=int(i.split()[0]) * 30)   # approximate a month as 30 days
    elif i.lower().find('year') != -1:
        post_time = now - timedelta(days=int(i.split()[0]) * 365)  # approximate a year as 365 days
    mylist.append(post_time)
    output = 'The time {} is {}.'.format(i, post_time)
    print(output)
Now we get their precise time points. Output:
The current time is 2018-11-19 10:57:50.
The time 30 Minutes ago is 2018-11-19 10:27:50.
The time 12 Hours ago is 2018-11-18 22:57:50.
The time 4 Days ago is 2018-11-15 10:57:50.
The time 3 Weeks ago is 2018-10-29 10:57:50.
The time 2 Month ago is 2018-09-20 10:57:50.
The time 1 Year ago is 2017-11-19 10:57:50.
- Basic requirement: plot the time series at different granularities, e.g. by hour, by day, by week, ... Articulate the findings from the polyline plot.
- Check out this notebook for a concrete case of analysing term frequency changes over time in the Tweets.
The core code is as follows:
df_kws = df.set_index('datetime').resample('1m').aggregate('sum')
df_kws.plot()
The key points of plotting time series using pandas:
- First you need to put datetime-typed data onto the index. This usually involves .apply with a function that converts strings to datetime, then .set_index to move the datetime column onto the index. This step is essential for time series operations because the later functions all read the datetime values from the index.
- Use .resample to put the data points into different buckets. This is essentially a .groupby operation: instead of working on categorical values like .groupby, .resample works on datetime ranges. One can specify a time length when resampling, e.g. one week (1w) or two days (2d).
- Use .aggregate to turn each bucket of data points into a single value. This is the same groupby + aggregate process, but applied to datetime-indexed data.
One small piece of the above Tweets keyword time series is missing from the current discussion. Besides handling the index, you also need numeric data in the columns, e.g. kw-hillary as you can see from the chart. You can check out the Most Common Names in Tweets example to see how to encode tweet text into such numeric indicator variables.
A time series is a series of data points indexed (or listed or graphed) in time order. Time series are very frequently plotted as line charts and used in many fields like statistics, pattern recognition, mathematical finance, weather forecasting, earthquake prediction, astronomy and communications engineering. Check here for more information: Time series - Wikipedia. Time series become even more important when we deal with larger datasets. See this case:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/hupili/python-for-data-and-media-communication/master/text-analysis/regular_reader_tweets.csv')
print('The length of df is {}'.format(len(df)))
df.head()
Output:
The length of df is 203482
There are more than 200 thousand rows in this dataframe. This is just the raw data; we can extract different time series from it.
At an early stage, we can use sample() to return a random sample of items from an axis of the object. Sampling may lower reliability, but it helps us handle large amounts of data for which a full census is impractical. One can make inferences or extrapolations from the sample to the population. See this sampling step:
import pandas as pd
df = df.sample(frac=0.1)
print('After sampling, the length of df is {}'.format(len(df)))
df.head()
Output:
After sampling, the length of df is 20348
We can see that 1/10 of the rows (because of frac=0.1) have been randomly selected, and the row order has been shuffled. You can also learn more about sampling in the pandas official documentation.
In the pandas library, resample() is a convenience method for frequency conversion and resampling of time series. Its object must have an index composed of datetime-like values, such as Datetime or Timedelta. Therefore, let's first use what we learned before to parse these tweets' post times into machine-recognisable datetime objects:
from datetime import datetime
from dateutil import parser
import numpy

def parse_datetime(x):
    try:
        return parser.parse(x)
    except:
        return numpy.nan

df['datetime'] = df['created_str'].apply(parse_datetime)
Now we can use resample('1w') to see how many tweets were posted every week.
df.set_index('datetime').resample('1w').aggregate('count').tail()
Output:
Notes:
- Setting the 'datetime' column as the index is necessary, because resample() requires an index composed of datetime-like values.
- '1w' is an essential positional argument; it means we collect tweets per 7-day period. You can also use frequency strings like 'S' (second), 'Min' (minute), 'M' (month), 'SM' (semi-month) and so forth. You can check here to read more instructions. A daily example is sketched after this list.
- aggregate('count') counts how many tweets were posted in each week. We will introduce aggregate in the next part.
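For instance (a small sketch reusing the same df and its 'datetime' column), zooming in to a daily granularity only requires changing the frequency string:
df.set_index('datetime').resample('1d').aggregate('count').tail()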
In statistics, resampling is a method of drawing randomly, with replacement, from a set of data points; it includes exchanging labels on data points when performing significance tests and validating models by using random subsets. Resampling as a methodology has also been widely used in analogue signal processing and audio compression for many years. You can learn more about it from Resampling - Wikipedia.
Aggregation is a process where the values of multiple rows are combined into a single value of more significant meaning, e.g. a sum, a max or a mean. See how it works in this case:
def has_hillary(t):
    return 'hillary' in str(t).lower()

def has_trump(t):
    return 'trump' in str(t).lower()

df['kw-hillary'] = df['text'].apply(has_hillary)
df['kw-trump'] = df['text'].apply(has_trump)
df.head()
df.set_index('datetime').resample('1w').aggregate('sum').tail()
Output:
The sum for each week has been computed. You can check here to learn more about what aggregate can do.
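As a further sketch (reusing the kw-hillary and kw-trump columns built above), you can pass several aggregation functions at once, e.g. the weekly sum (number of matching tweets) and mean (share of matching tweets):
df.set_index('datetime')[['kw-hillary', 'kw-trump']].resample('1w').aggregate(['sum', 'mean']).tail()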
After resampling and aggregating, we can use plot() to do the visualisation. Here is an example:
df['kw-all'] = df['text'].apply(lambda x: 1)
df.set_index('datetime').resample('1w').aggregate('sum').plot()
Output: You can check out more visual analysis in this notebook. It is a concrete case of analysing term frequency changes over time in the Tweets.
When analysing/visualising time series, one of the most common issues is dealing with short-period fluctuations. This is especially important in the technical analysis of stock prices. A stock price can fluctuate a lot within minutes, but the fluctuation is less impactful when viewed over a larger time span. We need to smooth the time series curves in order to discover the long-term trend. pandas provides .rolling on both Series and DataFrame, whose .mean() calculates the "moving average" (the "MA-xx" curves you see in stock software). The moving average captures the momentum in the data, and the crossing of two MAs of different lengths is commonly used as an indicator of buy/sell signals. Check out this notebook for more details; a sketch follows below.
Image credit: Michael Galarnyk
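A hedged sketch of the moving average, reusing the weekly kw-all counts built in the plotting example above (the column names are the ones defined there):
import matplotlib.pyplot as plt

weekly = df.set_index('datetime').resample('1w').aggregate('sum')
ma4 = weekly['kw-all'].rolling(window=4).mean()   # 4-week moving average
ax = weekly['kw-all'].plot(label='weekly count')
ma4.plot(ax=ax, label='4-week moving average')
ax.legend()
plt.show()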
A time series usually involves several components (a decomposition sketch follows after the list):
- Trend - the non-repeating movement in the data, e.g. increasing stock price
- Seasonal - repeating movement in the data, e.g. more sales before the tax payment period.
- Noise/ residual - other movements that do not belong to the above, e.g. interruptions in the market/ move-the-market news.
Image credit: Jae Duk Seo
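If you want to compute these components for your own series, here is a possible sketch, assuming a recent version of statsmodels and the weekly kw-all series built above; the period of 52 weeks is an assumed yearly seasonality, not something from the original notes:
from statsmodels.tsa.seasonal import seasonal_decompose

weekly = df.set_index('datetime').resample('1w').aggregate('sum')
# model='additive' treats the series as trend + seasonal + residual
decomposition = seasonal_decompose(weekly['kw-all'], model='additive', period=52)
decomposition.plot()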
The next question is how to forecast a time series. Predictive analysis is not a requirement of this introductory course; our main focus is on the descriptive part. Interested readers can check out the following models in other literature.
- AR
- MA
- ARMA
- ARIMA
Check out this tutorial for an implementation of ARIMA using pyramid-arima (on PyPI) and statsmodels.
This tutorial has a more detailed explanation of AR and MA and their decompositions.
Note that the above models are highly simplified representations of reality. They work reasonably well for one-way markets like sales forecasting, where vendor and consumer have clear roles. They do not work well for stock market price prediction, because the market participants act as each other's counterparties, and their predictions affect their behaviour, which in turn affects the market.
- Timestamps usually come in units of milliseconds (1/1000 of a second). An example to parse this timestamp format into datetime format; see also the sketch after this list.
- Past notes of datetime from spring 2018.
- Brockwell, Peter J., and Richard A. Davis. Introduction to Time Series and Forecasting. 2nd ed. Springer Texts in Statistics. New York: Springer, 2002.
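As mentioned in the first bullet above, a possible sketch for parsing a millisecond timestamp (the value is just an example):
from datetime import datetime, timezone
ms = 1542614400000                                  # milliseconds since the epoch
datetime.fromtimestamp(ms / 1000, tz=timezone.utc)  # datetime.datetime(2018, 11, 19, 8, 0, tzinfo=datetime.timezone.utc)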