Event log pre-processing tutorial

This tutorial shows how to process your event log before feeding it into Apromore, using the BPI2017 event log as an example. Importantly, it describes only the minimum required set of preprocessing operations; the accuracy of predictions will greatly benefit from more sophisticated feature engineering.

Import Required Libraries

In [2]:
import pandas as pd
import numpy as np

Define which columns contain the case identifiers, activity names and timestamps

In [3]:
case_id_col = 'Case ID'
activity_col = 'Activity'
timestamp_col = 'Complete Timestamp'

For each column that will be used for predictions, define its type

On the one hand, columns may contain event or case attributes; on the other hand, each attribute can be either numeric or categorical.

dynamic_cat_cols - Event attributes that are of categorical type
static_cat_cols - Case attributes that are of categorical type
dynamic_num_cols - Event attributes that are of numeric type
static_num_cols - Case attributes that are of numeric type

In [4]:
dynamic_cat_cols = ['Activity', 'Resource', 'Action', 'EventOrigin'] # i.e. event attributes
static_cat_cols = ['ApplicationType', 'LoanGoal'] # i.e. case attributes that are known from the start
dynamic_num_cols = ['FirstWithdrawalAmount', 'NumberOfTerms', 'OfferedAmount', 'duration']
static_num_cols = []
In [5]:
static_cols = static_cat_cols + static_num_cols + [case_id_col]
dynamic_cols = dynamic_cat_cols + dynamic_num_cols + [timestamp_col]
cat_cols = dynamic_cat_cols + static_cat_cols

Define a function to extract useful features from timestamps

  • Time elapsed since previous event
  • Time elapsed since the case started
  • Month, weekday and hour of event completion
  • Remaining processing time (this will be predicted)
In [6]:
def extract_timestamp_features(group):
    
    # Sort the events of the case from latest to earliest (stable sort)
    group = group.sort_values(timestamp_col, ascending=False, kind='mergesort')
    
    # Time elapsed since the previous event (0 for the first event of the case)
    tmp = group[timestamp_col] - group[timestamp_col].shift(-1)
    group["timesincelastevent"] = tmp.fillna(pd.Timedelta(0)).dt.total_seconds()

    # Time elapsed since the case started
    tmp = group[timestamp_col] - group[timestamp_col].iloc[-1]
    group["timesincecasestart"] = tmp.fillna(pd.Timedelta(0)).dt.total_seconds()
    
    # Month, weekday and hour of event completion
    group["month"] = group[timestamp_col].dt.month
    group["weekday"] = group[timestamp_col].dt.weekday
    group["hour"] = group[timestamp_col].dt.hour
    
    # Remaining processing time until the case ends (the prediction target)
    tmp = group[timestamp_col].iloc[0] - group[timestamp_col]
    group["remtime"] = tmp.fillna(pd.Timedelta(0)).dt.total_seconds()
   
    return group
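
To sanity-check the function, you can run it on a small hypothetical case (the toy timestamps below are made up for illustration and are not part of the BPI2017 log):

In [ ]:
toy = pd.DataFrame({
    case_id_col: ["c1", "c1", "c1"],
    timestamp_col: pd.to_datetime(["2016-01-01 09:00:00",
                                   "2016-01-01 09:10:00",
                                   "2016-01-01 10:00:00"]),
})
# Expected (events returned latest-first): timesincelastevent [3000, 600, 0],
# timesincecasestart [3600, 600, 0], remtime [0, 3000, 3600]
extract_timestamp_features(toy)[["timesincelastevent", "timesincecasestart", "remtime"]]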

Optionally, define a function that assigns a label to each case, based on user-defined rule(s)

For example, let's say a case is labeled as True if the activity A_Denied occurred at least once in the case. Otherwise, it is labeled as False.

In [7]:
def set_label(group):
    # A case is labeled True if A_Denied occurred at least once in it
    group["label"] = (group[activity_col] == "A_Denied").any()
    return group
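
A quick check of the labeling function on a hypothetical two-event case:

In [ ]:
toy = pd.DataFrame({activity_col: ["A_Create_Application", "A_Denied"]})
set_label(toy)["label"].iloc[0]  # True, since A_Denied occurs once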

Read in your event log

In [8]:
data = pd.read_csv("../logdata/bpi17_sample.csv", sep=",")
data.head()
Out[8]:
Case ID Activity Resource Complete Timestamp Variant index ApplicationType LoanGoal label Action EventOrigin FirstWithdrawalAmount NumberOfTerms OfferedAmount duration event_nr remtime
0 1002747267 A_Create_Application User_1 2016/10/10 18:27:48 12 New credit Home improvement True Created Application 15000.0 48.0 15000.0 0.000000e+00 1 7.245104
1 1002747267 A_Submitted User_1 2016/10/10 18:27:49 12 New credit Home improvement True statechange Application 15000.0 48.0 15000.0 1.469907e-07 2 7.245093
2 1002747267 A_Concept User_1 2016/10/10 18:28:28 12 New credit Home improvement True statechange Application 15000.0 48.0 15000.0 7.598958e-06 3 7.244641
3 1002747267 A_Accepted other 2016/10/11 21:55:32 12 New credit Home improvement True statechange Application 15000.0 48.0 15000.0 1.906322e-02 4 6.100845
4 1002747267 O_Create_Offer other 2016/10/11 21:58:19 12 New credit Home improvement True Created Offer 15000.0 48.0 15000.0 3.231269e-05 5 6.098912
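
Before subsetting to the declared columns, it is worth verifying that they all exist in the log (a minimal sketch using the column lists defined above):

In [ ]:
# Fail fast if any declared column is absent from the log
missing = [c for c in static_cols + dynamic_cols if c not in data.columns]
assert not missing, f"Columns missing from the log: {missing}"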
In [9]:
data = data[static_cols + dynamic_cols]

Extract useful features from timestamps

In [10]:
data[timestamp_col] = pd.to_datetime(data[timestamp_col], format="%Y/%m/%d %H:%M:%S")  # timestamps in the sample log use slashes, e.g. 2016/10/10 18:27:48
data = data.groupby(case_id_col).apply(extract_timestamp_features)
data.head()
Out[10]:
ApplicationType LoanGoal Case ID Activity Resource Action EventOrigin FirstWithdrawalAmount NumberOfTerms OfferedAmount duration Complete Timestamp timesincelastevent timesincecasestart month weekday hour remtime
Case ID
1079589 273 New credit Home improvement 1079589 A_Denied other statechange Application NaN 62.0 8000.0 8.169291e-02 2016-11-15 17:12:05 0.0 2320394.0 11 1 17 0.0
274 New credit Home improvement 1079589 O_Refused other statechange Offer NaN 62.0 8000.0 4.436728e-09 2016-11-15 17:12:05 0.0 2320394.0 11 1 17 0.0
275 New credit Home improvement 1079589 W_Validate_application other Deleted Workflow NaN 62.0 8000.0 1.543209e-09 2016-11-15 17:12:05 423496.0 2320394.0 11 1 17 0.0
272 New credit Home improvement 1079589 O_Returned other statechange Offer NaN 62.0 8000.0 7.586806e-07 2016-11-10 19:33:49 4.0 1896898.0 11 3 19 423496.0
271 New credit Home improvement 1079589 A_Validating other statechange Application NaN 62.0 8000.0 3.340686e-01 2016-11-10 19:33:45 1731812.0 1896894.0 11 3 19 423500.0
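
Note that, depending on your pandas version, groupby().apply() may prepend the case identifier as an extra index level, as seen above. Since Case ID is also kept as a regular column, this can make later groupby calls ambiguous on recent pandas versions; if that happens, flatten the index (a precautionary step, safe to skip otherwise):

In [ ]:
# Drop the case-ID index level so that 'Case ID' refers only to the column
data = data.reset_index(drop=True)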

Optionally, apply your case labeling function

In [11]:
data = data.groupby(case_id_col).apply(set_label)
data["label"].value_counts()
Out[11]:
False    5661
True     4331
Name: label, dtype: int64

Impute missing values

In [12]:
grouped = data.sort_values(timestamp_col, ascending=True, kind='mergesort').groupby(case_id_col)
for col in static_cols + dynamic_cols:
    # Propagate the last observed value forward within each case
    data[col] = grouped[col].transform(lambda grp: grp.ffill())

# Fill the remaining gaps: 'missing' for categorical, 0 for numeric columns
data[cat_cols] = data[cat_cols].fillna('missing')
data = data.fillna(0)
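
To illustrate the imputation rule on a hypothetical column: forward-filling propagates the last observed value within a case, and whatever remains missing (e.g. before the first observation) is filled with a default:

In [ ]:
toy = pd.Series([np.nan, 15000.0, np.nan], name="OfferedAmount")
toy.ffill().fillna(0).tolist()  # [0.0, 15000.0, 15000.0]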

For categorical variables, set infrequent factor levels to "other"

In [13]:
max_category_levels = 10
for col in cat_cols:
    # Keep activity names intact; for the other categorical columns,
    # retain the 10 most frequent levels and map the rest to "other"
    if col != activity_col:
        counts = data[col].value_counts()
        mask = data[col].isin(counts.index[max_category_levels:])
        data.loc[mask, col] = "other"
data.head()
Out[13]:
ApplicationType LoanGoal Case ID Activity Resource Action EventOrigin FirstWithdrawalAmount NumberOfTerms OfferedAmount duration Complete Timestamp timesincelastevent timesincecasestart month weekday hour remtime label
Case ID
1079589 273 New credit Home improvement 1079589 A_Denied other statechange Application 0.0 62.0 8000.0 8.169291e-02 2016-11-15 17:12:05 0.0 2320394.0 11 1 17 0.0 True
274 New credit Home improvement 1079589 O_Refused other statechange Offer 0.0 62.0 8000.0 4.436728e-09 2016-11-15 17:12:05 0.0 2320394.0 11 1 17 0.0 True
275 New credit Home improvement 1079589 W_Validate_application other Deleted Workflow 0.0 62.0 8000.0 1.543209e-09 2016-11-15 17:12:05 423496.0 2320394.0 11 1 17 0.0 True
272 New credit Home improvement 1079589 O_Returned other statechange Offer 0.0 62.0 8000.0 7.586806e-07 2016-11-10 19:33:49 4.0 1896898.0 11 3 19 423496.0 True
271 New credit Home improvement 1079589 A_Validating other statechange Application 0.0 62.0 8000.0 3.340686e-01 2016-11-10 19:33:45 1731812.0 1896894.0 11 3 19 423500.0 True
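
As a quick check, every categorical column except the activity should now have at most max_category_levels + 1 distinct values (the frequent levels plus "other"):

In [ ]:
for col in cat_cols:
    if col != activity_col:
        assert data[col].nunique() <= max_category_levels + 1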

Finally, save your modified log as a CSV file

In [14]:
data.to_csv("../logdata/bpi17_sample_processed.csv", sep=",", index=False)
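
As a final sanity check, you can re-read the saved file and confirm that no rows or columns were lost (a minimal round-trip check):

In [ ]:
check = pd.read_csv("../logdata/bpi17_sample_processed.csv", sep=",")
assert check.shape == data.shape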