Theoretical Background

XES Standard

The XES standard defines a grammar for a tag-based language whose aim is to provide designers of information systems with a unified and extensible methodology for capturing systems behaviors by means of event logs and event streams is defined in the XES standard. An XML Schema describing the structure of an XES event log/stream and a XML Schema describing the structure of an extension of such a log/stream are included in this standard. Moreover, a basic collection of so-called XES extension prototypes that provide semantics to certain attributes as recorded in the event log/stream is included in this standard [2].

Several formats have been proposed during the years for the standard storage of event logs in process mining. The IEEE standard is XES, for which different implementations exist in the ProM6 process mining framework. Among noticeable implementations, we can cite XES Lite, that provides a memory-efficient handling of event logs, while supports the XES standard on relational databases, albeit with a performance deficit, and DBXES that use relational databases to support some intermediate calculations. OpyenXES took XES in an open-source Python implementation and the PM4Py Python process mining library followed obtaining a full certification [3].

Event Logs

We assume the existence of an event log where each event refers to a case, an activity, and a point in time. An event log can be seen as a collection of cases. A case can be seen as a trace/sequence of events.

Event data may come from

  • a database system (e.g., patient data in a hospital),

  • a comma-separated values (CSV) file or spreadsheet,

  • a transaction log (e.g., a trading system),

  • a business suite/ERP system (SAP, Oracle, etc.),

  • a message log (e.g., from IBM middleware),

  • an open API providing data from websites or social media and others [4]

Events are listed together with their attributes in an event log. Attributes that are typically listed are the case ID, the time stamps of the start and end times, and other attributes of the event recorded by the IT system. An event log can also be the documentation of several related business processes [5].

Process Mining for Python (PM4PY)

Pm4py provides a process mining software which is easily extendable, allows for algorithmic customization and allows user to easily conduct large scale experiments.

The data science world, both for classic data science and for cutting-edge machine learning research is heavily using Python. Other libraries, albeit with a lower number of features, exist already for the Python language. The bupaR library supports process mining in the statistical language R, that is widely used in data science. The main focal points of the novel PM4Py library are:

  • Lowering the barrier for algorithmic development and customization when performing a process mining analysis compared to existing academic tools such as ProM, RAPIdProM and Apromore.

  • Allow for the easy integration of process mining algorithms with algorithms from other data science fields, implemented in various state-of-the-art Python packages.

  • Create a collaborative eco-system that easily allows researchers and practitioners to share valuable code and results with the process mining world.

  • Provide accurate user-support by means of a rich body of documentation on the process mining techniques made available in the library.

  • Algorithmic stability by means of rigorous testing. [1]

Architecture and features

In order to maximize the possibility to understand and re-use the code, and to be able to execute large-scale experiments, the following architectural guidelines have been adopted on the development of PM4Py:

  • A strict separation between objects, algorithms (Alpha Miner, Inductive Miner, alignments) and visualizations in different packages. In the pm4py.object package, classes to import/export and to store the information related to the objects are provided, along with some utilities to convert objects (e.g. process trees into Petri nets); while in the pm4py.algo package, algorithms to discover, perform conformance checking, enhancement and evaluation are provided. All visualizations of objects are provided in the pm4py.visualization package.

  • Most functionality in PM4Py has been realized through factory methods. These factory methods provide a single access point for each algorithm, with a standardized set of input objects, e.g., event data and a parameters object. Consider the factory method of the Alpha Miner. Factory methods allow for the extension of existing algorithms whilst ensuring backward-compatibility. The factory methods typically accept the name of the variant of the algorithm to use, and some parameters (shared among variants, or variant-specific). [1]

Object management

Within process mining, the main source of data are event data, often referred to as an event log. Such an event log, represents a collection of events, describing what activities have been performed for different instances of the process under study. PM4Py provides support for different types of event data structures:

  • Event logs, i.e., representing a list of traces. Each trace, in turn, is a list of events. The events are structured as key-value maps.

  • Event Streams representing one list of events (again represented as key-value maps) that are not (yet) organized in cases.

Conversion utilities are provided to convert event data objects from one format to the other. Furthermore, PM4Py supports the use of pandas data frames, which are efficient in case of using larger event data. Other objects currently supported by PM4Py include: heuristic nets, accepting Petri nets, process trees and transition systems. [1]

Algorithms

The PM4Py library provides several mainstream process mining techniques, including:

  • Process discovery: Alpha (+) Miner and Inductive Miner.

  • Conformance Checking: Token-based replay and alignments.

  • Measurement of fitness, precision, generalization and simplicity of process models.

  • Filtering based on time-frame, case performance, trace endpoints, trace variants, attributes, and paths.

  • Case management: statistics on variants and cases.

  • Graphs: case duration, events per time, distribution of a numeric attribute’s values.

  • Social Network Analysis: handover of work, working together, subcontracting and similar activities networks.[1]

API Definition

API stands for Application Programming Interface. A Web API is an application programming interface for the Web. A Browser API can extend the functionality of a web browser. A Server API can extend the functionality of a web server [9].

The REST architecture was introduced in the year 2000, by Thomas Fielding, and is based on the principles that support the World Wide Web. In summary, according to the REST principles, REST interfaces rely exclusively on Uniform Resource Identifiers (URI) for resource detection and interaction, and usually on the Hypertext Trans-fer Protocol (HTTP) for message transfer. A REST service URI only provides location and name of the resource, which serves as a unique resource identifier. The predefined HTTP verbs are used to define the type of operation that should be performed on the selected resource (e.g., GET to retrieve, DELETE to remove a resource).

Possibly due to HTTP’s features (which fit the REST architecture rather well), long-term presence, and general understandability, REST has become a de facto standard way for offering a service on the Web. Despite this, REST is merely an architectural style, provided without standard specifications. This implies that several decisions have to be made by developers when exposing service APIs, which may result in diverse APIs and, in some cases, in poor design decisions (e.g., using a single HTTP verb for retrieving or deleting a resource) [10].