This little Power BI project is something like my “Hello World” for using R visuals. I picked Process Mining as the topic for the model for several reasons:
- Process Mining is quite different from typical number-aggregating scenarios, so hopefully interesting!
- The R visuals add possibilities to the project that can’t easily be replaced by standard visuals.
- It is still easy to model and understand.
Here is the link to the Github Repository that hosts the Power BI file and some additional resources:
Process Mining is about gaining insights into processes. The basic information source is a so-called event log. I use the example of a purchase order. In usual models one would have a key like “purchase order number” combined with “line number”, and several attributes. If we capture the state of the purchase order at different points in time, we get a more detailed data model: the event log.
We talk about cases (here: the purchase order lines) and activities with timestamps. Here is the list of activities in the purchase order example:
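To make the structure of cases, activities and timestamps concrete, here is a minimal sketch of what a few event log rows could look like (the third activity name is illustrative, not taken from the actual activity list):

```r
# A few hypothetical event log rows for one purchase order line
event_log <- data.frame(
  case_id     = c("4507002829_00010", "4507002829_00010", "4507002829_00010"),
  activity_id = c("Create Purchase Requisition Item",
                  "Create Purchase Order Item",
                  "Record Goods Receipt"),
  timestamp   = as.POSIXct(c("2018-01-02 09:15:00",
                             "2018-01-03 11:40:00",
                             "2018-01-10 08:05:00")),
  stringsAsFactors = FALSE
)
event_log
```

One case, three events: the same case_id repeats for every activity that happened to that purchase order line.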
The order of these activities is not fixed at all, and of course not all activities appear in all purchase orders, while some appear multiple times. Here is an example of a so-called “Process Map” that visualizes the order of activities for a small subset of 3 example purchase orders. From the start, two of the three (= 66%) have the activity “Create Purchase Requisition Item” as their first activity, followed by “Create Purchase Order Item” as the second activity. One (33%) goes directly to “Create Purchase Order Item”.
While the Process Map visualizes all processes together, the so-called “Traces” visualize each process individually. In the example, the three purchase orders have the following traces:
Of course many processes can (and will) have identical traces. In that case the frequency of the trace is shown on the right side of the visualization.
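The idea behind traces can be sketched in a few lines of base R: order the events per case, concatenate the activity names into one string per case, and count identical sequences (a toy event log with made-up case and activity names):

```r
# Toy event log: three cases, two of which share the same trace
log <- data.frame(
  case_id     = c("A", "A", "B", "B", "C"),
  activity_id = c("Req", "Order", "Req", "Order", "Order"),
  timestamp   = 1:5  # already in chronological order for this toy example
)

# Sort by case and time, then build one trace string per case
log    <- log[order(log$case_id, log$timestamp), ]
traces <- tapply(log$activity_id, log$case_id, paste, collapse = " -> ")

# Count how often each distinct trace occurs
table(traces)
```

Cases A and B share the trace “Req -> Order” (frequency 2), case C has its own trace “Order” (frequency 1).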
The Power BI data model has two tables: PurchaseDoc and PurchaseLog, standing in a 1:n relationship. For the Process Mining part of the model, only the PurchaseLog table is necessary, as it is the event log.
The PurchaseDoc table is the more “classical” purchase order table. It has dimensions like spend area, vendor and fiscal periods, and also a purchase value, which is an important measure for the interpretation of the insights into the processes. But technically it is not required for the process mining.
The field “case_id” is the number of the purchase order line, something like 4507002829_00010. The purchase order number is 4507002829, the line number 00010. Both parts are redundantly available as the fields purchase_doc and purchase_doc_item.
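Assembling such a case_id from its two parts is a one-liner (using the field names from the model):

```r
purchase_doc      <- "4507002829"
purchase_doc_item <- "00010"

# Concatenate purchase order number and line number into the case_id
case_id <- paste(purchase_doc, purchase_doc_item, sep = "_")
case_id  # "4507002829_00010"
```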
There are about 214,000 purchase order lines (= records in PurchaseDoc) with more than 1.1 million records in the event log (PurchaseLog).
The data originally comes from a contest of the “International Conference on Process Mining 2019” (ICPM): https://icpmconference.org/2019/icpm-2019/contests-challenges/bpi-challenge-2019/
The conference webpage is also a good source for a better understanding of how process mining is done (in contrast to this little blog post), since case studies of several participants in this competition are available for download there.
I find it useful to build a separate R script to test the visualizations “standalone”, i.e. without Power BI. The script ProcessMining_work.R contains the data load, some histograms to look at the data, and finally the two visualizations, Process Map and Trace Explorer, that will be used from within Power BI.
But the most important step is the first one: loading the library that is specialised in process mining and thus contains the two visualizations. It is bupaR: https://www.bupar.net/
I won’t explain it line by line; the main task is to transfer the data into the event log structure that bupaR requires. My data already uses the column names that bupaR suggests, e.g. case_id instead of “purchase order line”. So the expression to feed the case_id is simply case_id = “case_id”. Otherwise it would be something like case_id = “purchase order line id”.
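If the source data used domain column names instead, a small renaming step would bridge the gap before handing the data to bupaR. A sketch with assumed source names:

```r
# Hypothetical source data with domain-specific column names
raw <- data.frame(purchase_order_line_id = "4507002829_00010",
                  order_status           = "Create Purchase Order Item",
                  stringsAsFactors = FALSE)

# Map the domain names to the column names bupaR suggests
names(raw)[names(raw) == "purchase_order_line_id"] <- "case_id"
names(raw)[names(raw) == "order_status"]           <- "activity_id"

names(raw)  # "case_id" "activity_id"
```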
Actually, after having built this prototype, I’m of the opinion that the model should not use process mining terms like case and activity, but rather terms from the domain of the processes being analysed (like purchase_order, order_activity, order_status…).
When the event log is initialized, it can easily be used in the process_map and trace_explorer functions (and I admit: x is not the best variable name for an event log):
```r
x = eventlog(purchase_log_excerpt,
             activity_id = "activity_id",
             case_id = "case_id",
             resource_id = "resource_id",
             activity_instance_id = "activity_instance_id",
             lifecycle_id = "lifecycle_id",
             timestamp = "timestamp")

process_map(x, type = frequency("relative", color_scale = "Purples"), render = TRUE)

trace_explorer(x, type = "frequent", coverage = 0.985)
```
Power BI Visualization
The Power BI project consists of three pages. For the R visuals, only the page “Process Map” is relevant. The “Dashboard” is a rather classical environment, accompanied by a drillthrough target “Details”. On the Process Map page there are two R visualizations, one for the Process Map and one for the Trace Explorer.
The R code is quite simple: more or less the lines shown in the paragraph above, with the process_map and the trace_explorer function respectively. The R visuals are aware of the filters created by the visuals on the left. This is important, because without filtering the R visuals would try to visualize the full 1.1 million records, which takes some time and wouldn’t produce useful output.
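Inside a Power BI R visual, the filtered rows arrive as a data frame named dataset. A sketch of what the Process Map visual’s script could look like, with dataset simulated by a small stand-in (and the bupaR calls guarded so the sketch also runs where the packages aren’t installed):

```r
# In Power BI, `dataset` is injected by the R visual; simulated here
# with hypothetical values so the sketch is self-contained:
dataset <- data.frame(
  case_id              = c("X_00010", "X_00010"),
  activity_id          = c("Create Purchase Requisition Item",
                           "Create Purchase Order Item"),
  activity_instance_id = c("1", "2"),
  lifecycle_id         = c("complete", "complete"),
  resource_id          = c("user1", "user1"),
  timestamp            = as.POSIXct(c("2018-01-02", "2018-01-03")),
  stringsAsFactors = FALSE
)

# Only the filtered rows reach this script, so the event log stays small
if (requireNamespace("bupaR", quietly = TRUE) &&
    requireNamespace("processmapR", quietly = TRUE)) {
  x <- bupaR::eventlog(dataset,
                       case_id = "case_id",
                       activity_id = "activity_id",
                       activity_instance_id = "activity_instance_id",
                       lifecycle_id = "lifecycle_id",
                       resource_id = "resource_id",
                       timestamp = "timestamp")
  processmapR::process_map(x, render = TRUE)
}
```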