transitionMatrix Documentation

_images/overview.png

transitionMatrix is a pure Python powered library for the statistical analysis and visualization of state transition phenomena. It can be used to analyze any dataset that captures timestamped transitions in a discrete state space.

Use cases include applications in finance (for example credit rating transitions), IT (system state event logs) and more.

NB: transitionMatrix is still in alpha release / active development. If you encounter issues please raise them in our GitHub repository.

The transitionMatrix Library

Functionality

You can use transitionMatrix to:

  • Estimate transition matrices from historical event data using a variety of estimators
  • Characterise transition matrices (identify their key properties)
  • Visualize event data and transition matrices
  • Manipulate transition matrices (derive generators, perform comparisons, stress transition rates etc.)
  • Access standardized Datasets for testing
  • Extract and work with credit default curves (absorbing states)
  • Map credit ratings using mapping tables
  • More (still to be documented :-)

Architecture

  • transitionMatrix provides intuitive objects for handling transition matrices individually and as sets (based on numpy arrays)
  • supports file input/output in json and csv formats
  • it has a powerful API for handling event data (based on pandas and numpy)
  • supports visualization using matplotlib

Installation

You can install and use the transitionMatrix package on any system that supports the SciPy ecosystem of tools.

Dependencies

  • TransitionMatrix requires Python 3 (currently 3.7)
  • It depends on numerical and data processing Python libraries (Numpy, Scipy, Pandas).
  • The Visualization API depends on Matplotlib.
  • The precise dependencies are listed in the requirements.txt file.
  • TransitionMatrix may work with earlier versions of Python and of these packages, but this is not tested.

From PyPI

pip3 install transitionMatrix

From sources

Download the sources in your preferred directory:

git clone https://github.com/open-risk/transitionMatrix

Using virtualenv

It is advisable to install the package in a virtualenv so as not to interfere with your system’s Python distribution.

virtualenv -p python3 tm_test
source tm_test/bin/activate

If you do not have pandas already installed make sure you install it first (this will also install numpy and other required dependencies).

pip3 install -r requirements.txt

Finally issue the install command and you are ready to go!

python3 setup.py install

File structure

The distribution has the following structure:

| transitionMatrix/     Directory with the library source code
| -- model.py           File with main data structures
| -- estimators/        Directory with the estimator methods
| -- statespaces/       Directory with state space objects and methods
| -- creditratings/     Directory with predefined credit rating structures
| -- generators/        Directory with data generator methods
| -- utils/             Directory with helper classes and methods
| -- examples/          Directory with usage examples
| ---- python/          Examples as standalone python scripts
| ---- notebooks/       Examples as jupyter notebooks
| -- datasets/          Directory with a variety of datasets useful for getting started
| -- tests/             Directory with the testing suite

Other similar open source software

  • etm, an R package for estimating empirical transition matrices
  • msSurv, an R Package for Nonparametric Estimation of Multistate Models
  • msm, Multi-state modelling with R
  • mstate, competing risks and multistate models in R
  • lifelines, python survival package

Getting Started

The transitionMatrix components

The transitionMatrix package includes several components (organized in sub-packages) providing a variety of functionality for working with state transition phenomena. The overall organization and functionality is summarized in the following graphic:

_images/Architecture_Overview.png

The library is structured in a modular way: users may mix and match the various components to meet their own needs. The main workflow can be captured in the standard pre-processing, modelling and post-processing stages.

A secondary segmentation that is important to keep in mind is between general functionality, which is relevant to any data about state transitions, and functionality for specific domains with their own conventions and needs. At present the only specific domain concerns credit ratings.

Here we dive straight into using transitionMatrix by going through a concrete (and typical) example using historical credit rating transitions. Further resources and links to more detailed and specific usage are available at the end of this section. People who are not at all familiar with the machinery of transition matrices might want to start with Basic Operations.

An end-to-end usage example from credit risk

In order to give a quick introduction to the package we discuss here a concrete, end-to-end example of using transitionMatrix that is drawn from the credit ratings space. The example does not cover all functionality, but it demonstrates the core workflow.

We will use the data set “rating_data.csv” that is available in the Datasets directory. The code snippets discussed here are all from the script <examples/python/estimate_matrix.py>.

Step 1: Loading the data

Data loading is best done via pandas dataframes:

import pandas as pd

data = pd.read_csv('../../datasets/rating_data.csv')
print(data.head())

   CustomerId        Date Rating  RatingNum
0           1  30-05-2000   CCC+          7
1           1  31-12-2000     B+          6
2           2  21-05-2003     B+          6
3           3  30-12-1999    BB+          5
4           3  30-10-2000     B+          6

We see that there is just enough metadata in the csv header to get an impression of how the data set captures transitions:

  • Each entity is identified by an integer (First column: CustomerId)
  • State measurements / transitions are observed at dates (in the DD-MM-YYYY format) (Second column: Date)
  • There is an implied credit rating scale using symbols ‘B+, BB+’ etc (Third column: Rating)
  • The rating scale is also expressed as integers (Fourth column: RatingNum)

Note

There are several important points we need to clarify before we can confidently extract information from this dataset and (ultimately) estimate a transition matrix. For example:

  • Do we understand the column labels or do we need additional metadata?
  • What is the observation window?
  • Are the dates indicating a measurement (including no change) or a changed state?
  • Are all possible transitions observed in the sample?
  • Are all data provided consistent?
  • Etc

Some of those questions may be answerable with the tools offered by transitionMatrix, but it is always the responsibility of the data scientist to make sure they are correctly interpreting the data and using the tools accordingly!

Step 2: Understanding the data format

Our first task is to identify which data format is closest for us to use. In Input Data Formats we see that what we have looks closest to a Compact Form of Long Format with the temporal information in String Dates.

Step 3: Data cleaning and normalization

Having data in the right format is only the first step!

Warning

As mentioned above, we need to be careful that the input data are “clean” and have unambiguous interpretation. Here are some examples of potential issues:

Example 1: The entity with ID=41 has only one measurement and it is NR. What does it mean? Can we remove it from the data without impact?

  • 40, 30-12-2003, A+, 3
  • 41, 21-07-2000, NR, 0
  • 42, 30-06-2004, A+, 3

Example 2: ID 46 has three identical measurements at different times. What does it mean? Can we ignore the intermediate observations without impact? (Observing a no-change is not the same as not observing a change!)

  • 46, 30-05-1999, AA+, 2
  • 46, 30-08-2001, AA+, 2
  • 46, 30-12-2002, AA+, 2
  • 46, 30-12-2003, A+, 3

Example 3: ID 54 is transitioning to D (absorbing state) and then to NR. This means that the label ‘NR’ is used in multiple ways: Something that is not rated because we know its state anyway (D) and something that

  • 54,30-10-2001, CCC+, 7
  • 54,30-07-2002, D, 8
  • 54,30-12-2002, NR, 0

Those examples illustrate that converting the raw input data into a clean dataset might require additional assumptions. This must be done on a case-by-case basis. For example: if an entity is only observed once in a state, maybe it is valid to assume it is in that state throughout the observation window. Another example: maybe it is valid to assume that multiple observations of no changing state do not carry any information and thus can be merged, etc.
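As an illustration of the second assumption above (repeated observations of an unchanged state carry no information), the following minimal sketch merges such runs for the ID 46 records shown earlier. The helper merge_repeated_states is hypothetical and is not part of transitionMatrix; whether the merge is appropriate depends on how the data were collected.

```python
from itertools import groupby

# Observations for entity 46 from Example 2: (ID, date, rating label)
observations = [
    (46, "30-05-1999", "AA+"),
    (46, "30-08-2001", "AA+"),
    (46, "30-12-2002", "AA+"),
    (46, "30-12-2003", "A+"),
]


def merge_repeated_states(rows):
    """Keep only the first record of each run of identical (ID, state) pairs.

    Assumes rows are already sorted by entity and then chronologically.
    """
    merged = []
    for _, run in groupby(rows, key=lambda r: (r[0], r[2])):
        merged.append(next(run))  # first record of the run
    return merged


cleaned = merge_repeated_states(observations)
for row in cleaned:
    print(row)
```

Keeping the first record of each run preserves the time at which the state was first observed. If the data instead record confirmed "no change" measurements that matter for the analysis, this step would discard information.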

Note

For a (non-exhaustive) list of data cleaning steps check out the script examples/python/data_cleaning_example.py

Step 4: Establish the State Space

Let's rename the columns accordingly:

data = data.rename(columns={"Rating": "State", "Date": "Time", "CustomerId": "ID"})
print(unique_states(data))

['CCC+' 'B+' 'BB+' 'AA+' 'A+' 'BBB+' 'NR' 'D' 'AAA']

We see that we have 9 unique states:

  • 7 ratings states: AAA, AA+, etc presumably refer to different credit qualities (it is typical when the rating scale uses the (+) qualifier to also have (-) but here this is not the case).
  • D probably means an absorbing (Default) state
  • NR probably means not rated

Let us create the State Space:

originator = 'me'
full_name = 'my state space'
definition = [('0', 'NR'), ('1', "AAA"), ('2', "AA+"), ('3', "A+"), ('4', "BBB+"),
              ('5', "BB+"), ('6', "B+"), ('7', "CCC+"),
              ('8', "D")]

mySS = StateSpace(definition=definition, originator=originator, full_name=full_name, cqs_mapping=None)

print(mySS.validate_dataset(data))

Note

The above shows the functionality of the StateSpace object. In this case a successful validation is expected, as we constructed the labels from what we found in the data set. If the rating scale is given independently of the data, this becomes a more insightful validation exercise.
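The idea behind such a validation can be sketched in plain Python. This is only an illustration of the check (are all observed labels part of the state space?), not the implementation of StateSpace.validate_dataset; the function find_unknown_labels and the stray label 'BBB-' are hypothetical.

```python
# State space definition as used above: a list of (index, label) tuples
definition = [('0', 'NR'), ('1', 'AAA'), ('2', 'AA+'), ('3', 'A+'), ('4', 'BBB+'),
              ('5', 'BB+'), ('6', 'B+'), ('7', 'CCC+'), ('8', 'D')]


def find_unknown_labels(observed_labels, definition):
    """Return the observed labels that are missing from the state space."""
    known = {label for _, label in definition}
    return sorted(set(observed_labels) - known)


# Labels found in the example dataset
observed = ['CCC+', 'B+', 'BB+', 'AA+', 'A+', 'BBB+', 'NR', 'D', 'AAA']

# No unknown labels when the state space was built from the data itself
print(find_unknown_labels(observed, definition))

# A hypothetical stray label is reported
print(find_unknown_labels(observed + ['BBB-'], definition))
```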

Further Resources

There is a large and growing set of examples and other training material to get you started:

Examples Directory

Look at the Usage Examples directory of the transitionMatrix distribution for a variety of typical workflows.

Note

Many scripts contain multiple examples. You need to manually edit the example ID within the file to select the desired example

Open Risk Academy

For more in depth study, the Open Risk Academy has courses elaborating on the use of the library:

Note

The Example scripts from the Open Risk Academy course PYT26038 are available in a separate repo

Input Data Formats

The transitionMatrix package supports a variety of input data formats for empirical (observation) data. Two key ones are described here in more detail. More background about data formats is available at the Open Risk Manual Risk Data Category

Long Data Format

Long Data Format is a tabular representation of time series data that records the states (measurements) of multiple entities. Its defining characteristic is that each table row contains data pertaining to one entity at one point in time.

Canonical Form of Long Data

The Long Data Format (also Narrow or Stacked) consists of Tuples, e.g. (Entity ID, Time, From State, To State) indicating the time T at which an entity with ID migrated from the (From State) -> to the (To State).

The canonical form used as input to duration based estimators uses normalized timestamps (from 0 to T_max, where T_max is the last timepoint) and looks as follows:

ID Time From To
1 1.1 0 1
1 2.0 1 2
1 3.4 2 3
1 4.0 3 2
2 1.2 0 1
2 2.4 1 2
2 3.5 2 3

The canonical form has the advantage of being unambiguous about the context where the transition occurs. The meaning of each row of data stands on its own and does not rely on the order (or even the presence) of other records. This facilitates, for example, the algorithmic processing of the data. On the flipside, the format is less efficient in terms of storage (the state information occurs twice) compared to the compact format (See below).

The canonical format requires that the final state of all entities at the end of the observation window (Time F) is included (otherwise we have no indication about when the measurements stopped). Alternatively such information is provided as separate metadata (or implicitly, for example if measurements are understood to span a number of full annual periods).

Note

Synthetic_data(7, 8, 9) in the Datasets collection are examples of data in long format and canonical form

String Dates

It is frequent that transition data (e.g. from financial applications) have timestamps in the form of a date string. For example:

ID Date String From To
1 10-10-2010 0 1
1 10-11-2010 1 2

String dates must be converted to a numerical representation before we can work with the transition data. The transitionMatrix.utils.converters.datetime_to_float() function of the transitionMatrix.utils subpackage can be used to convert such data into the canonical form.
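To make the conversion concrete, here is a minimal standard-library sketch. It is not the library's datetime_to_float() and may differ from it in origin and units: the function dates_to_floats is hypothetical, and it assumes DD-MM-YYYY strings, time measured in years, and the earliest date as time zero.

```python
from datetime import datetime


def dates_to_floats(date_strings, fmt="%d-%m-%Y", days_per_year=365.25):
    """Convert date strings to years elapsed since the earliest date."""
    parsed = [datetime.strptime(s, fmt) for s in date_strings]
    start = min(parsed)
    return [(d - start).days / days_per_year for d in parsed]


# The two dates from the table above
times = dates_to_floats(["10-10-2010", "10-11-2010"])
print(times)
```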

Note

Synthetic_data9 and rating_data in the Datasets collection have observation times in string date form.

Compact Form of Long Format

The format uses triples (ID, Time, State), indicating the time T at which an entity ID left its previous state S (the state it migrates to is encoded in the next observation of the same entity). The convention can obviously be reversed to indicate the time of entering a new state (in which case we need some information to bound the start of the observation window).

The compact long format avoids the duplication of data of the canonical approach but requires the presence of other records to infer the realised sequence of events.

The format also requires that the final state of all entities at the end of the observation window (Time F) is included as the last record (otherwise we have no indication about when the measurements stopped). Alternatively such information is provided separately (or implicitly, e.g. if measurements are understood to span a number of full annual periods).

ID Time State
1 1.1 0
1 2.0 1
1 3.4 2
1 4.0 3
1 F 2
2 1.2 0
2 2.4 1
2 3.5 2
2 F 3
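The relation between the compact and canonical forms can be sketched as follows. The helper compact_to_canonical is hypothetical (not a transitionMatrix function), and the numeric value used for the window end F is an assumption made only so that the example runs.

```python
from collections import defaultdict

# Compact form (ID, Time, State) from the table above.
# T_FINAL stands in for the window end "F"; the value 5.0 is an assumption.
T_FINAL = 5.0
compact = [
    (1, 1.1, 0), (1, 2.0, 1), (1, 3.4, 2), (1, 4.0, 3), (1, T_FINAL, 2),
    (2, 1.2, 0), (2, 2.4, 1), (2, 3.5, 2), (2, T_FINAL, 3),
]


def compact_to_canonical(rows):
    """Pair each record with the next record of the same entity.

    Each compact record gives the time at which the entity left a state;
    the state it moved to is read from the entity's following record.
    """
    by_entity = defaultdict(list)
    for entity, time, state in rows:
        by_entity[entity].append((time, state))
    canonical = []
    for entity in sorted(by_entity):
        history = sorted(by_entity[entity])  # chronological order
        for (time, from_state), (_, to_state) in zip(history, history[1:]):
            canonical.append((entity, time, from_state, to_state))
    return canonical


canonical = compact_to_canonical(compact)
for row in canonical:
    print(row)
```

The resulting rows reproduce the canonical table shown in the Canonical Form of Long Data section. Note that the final record of each entity only supplies the last "To" state, so the window end itself must be kept separately if it is needed later.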

Wide Data Format

Wide Data Format is an alternative tabular representation of time series data that records the states (measurements) of multiple entities. Its defining characteristic is that each table row contains all the data pertaining to any one entity. The measurement times are not arbitrary but encoded in the column labels:

ID 2011 2012 2013
A1 1 0 1
A2 2 1 3
A3 0 1 2

Conversion from wide to long formats can be handled using the pandas wide_to_long method.

(This method will be more integrated in the future)
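A minimal sketch of such a conversion for the table above. It uses pandas DataFrame.melt rather than wide_to_long, because the year columns here have no common stub prefix; the column names Time and State are chosen for illustration.

```python
import pandas as pd

# Wide format table from the example above
wide = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "2011": [1, 2, 0],
    "2012": [0, 1, 1],
    "2013": [1, 3, 2],
})

# One row per (entity, time) pair, i.e. a compact long representation
long_df = wide.melt(id_vars="ID", var_name="Time", value_name="State")
long_df["Time"] = long_df["Time"].astype(int)
long_df = long_df.sort_values(["ID", "Time"]).reset_index(drop=True)
print(long_df)
```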

Other Formats

As mentioned, a design choice is that data ingestion of transitionMatrix is via a pandas dataframe so other formats can be handled with additional code by the user. If there is a format that you repeatedly encounter submit an issue with your desired format / transformation suggestion.

Datasets

The transitionMatrix distribution includes a number of datasets to support testing / training objectives. Datasets come in two main types:

  • State Transition Data (used in estimation). There are both dummy (synthetic) examples and some actual data. Transition data are usually in CSV format.
  • Transition Matrices and Multi-period Sets of matrices (again both dummy and actual examples). Transition matrices are usually in JSON format.

State Transition Data

The scripts are located in examples/python. For testing purposes all examples can be run using the run_examples.py script located in the root directory. Some scripts have an example flag that selects alternative input data or estimators.

List of Transition Datasets
File                  Format     Events  Entities  States  Generator             Description
rating_data_raw.csv   Compact    4000    1829      9       Extract               A typical credit rating dataset
rating_data.csv       Compact    3780    1642      9       Data cleaning script  A typical credit rating dataset
scenario_data.csv     Compact    550     50        5
synthetic_data.csv    Compact    100     10        2
synthetic_data1.csv   Compact    100     1         4       Generator(=1)         DURATION TYPE DATASETS (Compact format)
synthetic_data2.csv   Compact    10000   1000      2       Generator(=2)         DURATION TYPE DATASETS (Compact format)
synthetic_data3.csv   Compact    2000    100       7       Generator(=3)         DURATION TYPE DATASETS (Compact format)
synthetic_data4.csv   Compact    10000   1000      8       Generator(=4)         Cohort type dataset (Generic Rating Matrix). Offers a semi-realistic example
synthetic_data5.csv   Compact    50000   10000     3       Generator(=5)         Large cohort type dataset useful for testing convergence
synthetic_data6.csv   Compact    20000   1000      2       Generator(=6)         COHORT TYPE DATASETS
synthetic_data7.csv   Canonical  1295    1000      8       Generator(=7)         Duration type datasets in Long Format
synthetic_data8.csv   Canonical  10000   10000     2       Generator(=8)         Duration type datasets in Long Format
synthetic_data9.csv   Canonical  1338    1000      8       Generator(=9)         Duration type datasets in Long Format
synthetic_data10.csv  Canonical  12000   2000      9       Generator(=10)        Credit Rating Migrations in Long Format / Compact Form
test.csv              Compact    14      7         3

Transition Matrices

  • generic_monthly
  • generic_multiperiod
  • JLT
  • sp 2017

Preprocessing

The preprocessing stage includes preparatory steps leading up to the matrix Estimation to produce a transition matrix (or matrix set).

The precise steps required depend on the sources of data, the nature of the data, use-specific requirements (best practices, regulation etc.) and, not least, the desired estimation method.

State Spaces

A State Space is a fundamental concept in probability theory and computer science representing the possible configurations for a modelled system

The StateSpace object stores a state space structure as a List of tuples. The first two elements of each tuple contain the index (base-0) and label of the state space respectively.

Additional fields are reserved for further characterisation

Example: Map credit ratings between systems

  • Script: state_space_operations.py

Example workflows for converting data from one credit rating system to another using an established mapping table
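The core of a label conversion is a lookup through a mapping table. The sketch below is independent of state_space_operations.py: the target scale ('High', 'Medium', 'Low') and the grouping of source labels are hypothetical and serve only to show the mechanics.

```python
# Hypothetical mapping table from the example scale to a coarser scale.
# The target labels and the grouping are illustrative assumptions only.
mapping_table = {
    'AAA': 'High', 'AA+': 'High', 'A+': 'High',
    'BBB+': 'Medium', 'BB+': 'Medium',
    'B+': 'Low', 'CCC+': 'Low',
    'D': 'D', 'NR': 'NR',
}


def map_labels(labels, table):
    """Translate labels through the table, failing loudly on unknown labels."""
    missing = sorted(set(labels) - set(table))
    if missing:
        raise KeyError(f"No mapping defined for: {missing}")
    return [table[label] for label in labels]


mapped = map_labels(['CCC+', 'B+', 'BB+', 'D'], mapping_table)
print(mapped)
```

Raising an error on unmapped labels, rather than silently dropping them, keeps gaps in the mapping table visible before any estimation is run.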

Cohorts

Organizing data in cohorts can be an important step in understanding transition data or towards applying a Cohort Estimator. Cohorts in this context are understood as the grouping of entities within a temporal interval. For example, in a credit rating analysis context cohorts could be groups of annual observations. The implication of cohorting data is that the more granular information embedded in a more precise timestamp is not relevant. It is also possible that input data are only available in cohort form (when the precise timestamp information is not recorded at the source).

Note

Cohorting can bias the estimation in various subtle ways, so it is important that any procedure is well documented.

Cohorting Utilities

Cohorting utilities are part of Preprocessing. Presently the core algorithm is implemented in transitionMatrix.utils.preprocessing.bin_timestamps().

Intermediate Cohort Data Formats

The cohort data format is a tabular representation of time series data that records the states (measurements) of multiple entities. Its defining characteristic is that each table row contains data pertaining to one entity at one point in time.

The canonical form used as input to duration based estimators uses normalized timestamps (from 0 to T_max, where T_max is the last timepoint) and looks as follows:

ID Time From To
1 1.1 0 1
1 2.0 1 2
1 3.4 2 3
1 4.0 3 2
2 1.2 0 1
2 2.4 1 2
2 3.5 2 3

Cohorting Examples

Cohorting Example 1

An example with limited data (the dataset contains only one entity). It is illustrated in the script examples/python/matrix_from_duration_data.py with the example flag set to 1. The input data set is synthetic_data1.csv.

The state space is as follows (for brevity we work directly with the integer representation)

[('0', "A"), ('1', "B"), ('2', "C"), ('3', "D")]

The cohorting algorithm that assigns the last state to the cohort results in the following table. We notice that there is a lot of movement inside each cohort (high count) and that only two of the states are represented at the cohort level (0 and 1).

   ID  Cohort State       Time  Count
0   0       0     0   2.061015   21.0
1   0       1     1   4.400105   14.0
2   0       2     0   6.665899   28.0
3   0       3     0   8.842277   14.0
4   0       4     0  11.111733   21.0
5   0       5     0  11.182184    2.0
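A "last state per cohort" rule of the kind described above can be sketched in a few lines. This is an independent illustration, not the code of bin_timestamps(); the function cohort_last_state, the interval length and the sample observations are all hypothetical (the data are not taken from synthetic_data1.csv).

```python
import math


def cohort_last_state(rows, interval):
    """Bin (ID, Time, State) records into cohorts of fixed length.

    For every (ID, cohort) pair, keep the last observed state and time and
    count how many observations fell inside the cohort.
    """
    cohorts = {}
    for entity, time, state in sorted(rows, key=lambda r: (r[0], r[1])):
        key = (entity, math.floor(time / interval))
        count = cohorts[key][2] + 1 if key in cohorts else 1
        cohorts[key] = (state, time, count)  # later records overwrite earlier ones
    return [(e, c, s, t, n) for (e, c), (s, t, n) in sorted(cohorts.items())]


# Hypothetical observations for a single entity
rows = [(0, 0.4, 1), (0, 1.3, 0), (0, 1.9, 0), (0, 2.2, 1), (0, 3.7, 2), (0, 5.1, 0)]
table = cohort_last_state(rows, interval=2.0)
for row in table:
    print(row)  # (ID, Cohort, State, Time, Count)
```

A high count signals that several state changes were collapsed into a single cohort observation, which is exactly the information loss that cohorting implies.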

Credit Ratings

Working with credit data is a core use case of transitionMatrix. Functionality that is specific to credit ratings is generally grouped in the credit ratings subpackage (although the distinction of what is generic and what credit specific is not always clear).

The following sections document various credit rating related activities. General documentation about credit rating systems

Predefined Rating Scales

The transitionMatrix package supports a variety of credit rating scales. They are grouped together in transitionMatrix.creditratings.creditsystems.

The key ones are described here in more detail.

Rating Scales currently covered

The focus of the current selection is on long-term issuer ratings scales (others will be added):

  • AM Best Europe-Rating Services Ltd.
  • ARC Ratings S.A.
  • Cerved Rating Agency S.p.A.
  • Creditreform Rating AG
  • DBRS Ratings Limited
  • Fitch Ratings
  • Moody’s Investors Service
  • Scope Ratings AG
  • Standard & Poor’s Ratings Services

Data per Scale

Each rating scale is a StateSpace (see State Spaces) and thus inherits the attributes and methods of that object, namely:

  • The entity defining the scale (the originating entity)
  • The full name of the scale (as most originators of rating scales offer multiple scales with different meaning and/or use)
  • The definition of the scale (as a list of tuples in the form [(‘0’, ‘X1’), … , (‘N-1’, ‘XN’)], where X are the symbols used to denote the credit state)
  • The CQS (credit quality step) mapping of the scale as defined by regulatory authorities (see next section)

CQS Mappings

The Credit Quality Step (CQS) denotes a standardised indicator of Credit Risk that is recognized in the European Union

  • The CQS Credit Rating Scale is based on numbers, ranging from 1 to 6.
  • 1 is the highest quality, 6 is the lowest quality

The European Supervisory Authorities maintain mappings between credit rating agencies and CQS

Note

Consult the original documents for definitive mappings, available at the EBA Website.

The Rating Agency State Spaces and mappings are obtained from the latest (20 May 2019) Regulatory Reference:

JC 2018 11, FINAL REPORT: REVISED DRAFT ITS ON THE MAPPING OF ECAIS’ CREDIT ASSESSMENTS UNDER CRR

Example of Label Conversion

Convert labels between credit rating scales

_images/scale_conversions.png

Withdrawn Ratings

Withdrawn ratings are a common issue that needs to be handled in the context of estimating transition matrices. See right censoring issues

Adjust NR (Not Rated) States

Adjusting for NR states can be done via the transitionMatrix.model.TransitionMatrix.remove() method.

Single Period Matrix

Example of using transitionMatrix to adjust the (not-rated) NR state. Input data are the Standard and Poor’s historical data (1981 - 2016) for corporate credit rating migrations. Example of handling

  • Script: examples/python/adjust_nr_states.py

Multi-period Matrix

  • Script: examples/python/fix_multiperiod_matrix.py

Example of using transitionMatrix to detect and solve various pathologies that might be affecting transition matrix data

Credit Curves

A Credit Curve denotes a grouping of credit risk metrics (parameters) that provide estimates that a legal entity experiences a Credit Event over different (an increasing sequence of longer) time periods. See Credit Curves

A multi-period matrix and a credit curve are closely related objects (under some circumstances the latter can be thought of as a subset of the former). The transitionMatrix package offers the following main functionality concerning credit curves:

  • The transitionMatrix.creditratings.creditcurve.CreditCurve class for storing and working with credit curves
  • The transitionMatrix.model.TransitionMatrixSet.default_curves() transitionMatrixSet method that extracts from a matrix set the default curve

Example: Calculate and Plot Credit Curves

Example of using transitionMatrix to calculate and visualize multi-period credit curves

  • Script: examples/python/credit_curves.py

_images/credit_curves.png

Estimation

The estimation of a transition matrix is one of the core functionalities of transitionMatrix. Several methods and variations are available in the literature depending on aspects such as:

  • The nature of the observations / data (e.g., whether temporal homogeneity is a valid assumption)
  • Whether or not there are competing risk effects
  • Whether or not observations have coincident values
  • Treating the Right-Censorship of observations (Outcomes beyond the observation window)
  • Treating the Left-Truncation of observations (Outcomes prior to the observation window)

Estimator Types

  • Cohort Based Methods that group observations in cohorts
  • Duration (also Hazard Rate or Intensity) Based Methods that utilize the actual duration of each state

The main estimators currently implemented are as follows:

Simple Estimator

The estimation of a transition matrix is one of the core functionalities of transitionMatrix. The two main estimators currently implemented are:

Cohort Estimator

A cohort estimator (more accurately, a discrete time estimator) is a class of estimators of multi-state transitions that is a simpler alternative to Duration type estimators.
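The counting idea behind a cohort estimator can be sketched with numpy. This is not the CohortEstimator class: the function cohort_transition_matrix and the sample records are hypothetical, and the sketch pools all periods into a single matrix (a time-homogeneity assumption) and computes no confidence intervals.

```python
import numpy as np


def cohort_transition_matrix(rows, n_states):
    """Estimate a one-period matrix from (ID, Timestep, State) cohort records.

    Counts transitions between consecutive timesteps of each entity and
    normalises every row of the count matrix by its total.
    """
    counts = np.zeros((n_states, n_states))
    ordered = sorted(rows)  # by ID, then timestep
    for (id_a, t_a, s_a), (id_b, t_b, s_b) in zip(ordered, ordered[1:]):
        if id_a == id_b and t_b == t_a + 1:
            counts[s_a, s_b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observations are left as zeros rather than guessed
    freqs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return counts, freqs


# Hypothetical cohort data: three entities observed at timesteps 0, 1, 2
rows = [(1, 0, 0), (1, 1, 0), (1, 2, 1),
        (2, 0, 0), (2, 1, 1), (2, 2, 1),
        (3, 0, 1), (3, 1, 1), (3, 2, 1)]
counts, freqs = cohort_transition_matrix(rows, n_states=2)
print(counts)
print(freqs)
```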

Estimate a Transition Matrix from Cohort Data

Example workflows using transitionMatrix to estimate a transition matrix from data that are already grouped in cohorts

  • Script: examples/python/matrix_from_cohort_data.py
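  • Example ID: 3

import pandas as pd
import transitionMatrix as tm
# Import path assumed from the example script; verify against your installed version
from transitionMatrix.estimators import cohort_estimator as es

# dataset_path is assumed to point to the datasets directory of the distribution
data = pd.read_csv(dataset_path + 'synthetic_data6.csv', dtype={'State': str})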
sorted_data = data.sort_values(['ID', 'Timestep'], ascending=[True, True])
myState = tm.StateSpace()
myState.generic(2)
print(myState.validate_dataset(dataset=sorted_data))
myEstimator = es.CohortEstimator(states=myState, ci={'method': 'goodman', 'alpha': 0.05})
result = myEstimator.fit(sorted_data)
myMatrixSet = tm.TransitionMatrixSet(values=result, temporal_type='Incremental')

myEstimator.print(select='Counts', period=0)
myEstimator.print(select='Frequencies', period=18)

Aalen-Johansen Estimator

The Aalen-Johansen estimator is a multi-state (matrix) version of the Kaplan–Meier estimator for the hazard of a survival process. The estimator can be used to estimate the transition probability matrix of a Markov process with a finite number of states.

Whichever estimator is chosen, the outcome of the estimation is an Empirical Transition Matrix (or potentially a matrix set).
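For orientation, the Aalen-Johansen estimate is the product over event times of (I + dA), where the off-diagonal entries of dA are the observed transition counts divided by the number of entities at risk in the origin state. The sketch below is a simplified textbook version, not the library's estimator: the function aalen_johansen is hypothetical, and it assumes every entity is observed from time 0 to the end of the window (no censoring or truncation) and that no record is a self-transition.

```python
import numpy as np
from collections import defaultdict


def aalen_johansen(initial_states, transitions, n_states):
    """Return the event times and the estimated P(0, t) at each of them.

    initial_states: dict mapping entity ID to its state at time 0
    transitions:    list of (ID, Time, From, To) in canonical long form
    """
    current = dict(initial_states)  # state of each entity just before t
    events = defaultdict(list)
    for entity, time, from_state, to_state in transitions:
        events[time].append((entity, from_state, to_state))

    P = np.eye(n_states)
    times, matrices = [], []
    for t in sorted(events):
        # Number of entities in each state just before the event time
        at_risk = np.bincount(list(current.values()), minlength=n_states)
        dA = np.zeros((n_states, n_states))
        for _, from_state, to_state in events[t]:
            dA[from_state, to_state] += 1.0 / at_risk[from_state]
        # Diagonal chosen so that every row of dA sums to zero
        dA[np.diag_indices(n_states)] = -dA.sum(axis=1)
        P = P @ (np.eye(n_states) + dA)
        for entity, _, to_state in events[t]:
            current[entity] = to_state
        times.append(t)
        matrices.append(P.copy())
    return times, matrices


# Hypothetical data: entities 1 and 2 start in state 0, entity 3 in state 1
initial_states = {1: 0, 2: 0, 3: 1}
transitions = [(1, 1.0, 0, 1), (2, 2.0, 0, 1)]
times, matrices = aalen_johansen(initial_states, transitions, n_states=2)
for t, m in zip(times, matrices):
    print(t)
    print(m)
```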

Implementation Notes

  • All estimators derive from the highest level BaseEstimator class.
  • Duration type estimators derive from the DurationEstimator class

Estimation Examples

The first example of estimating a transition matrix is covered in the Getting Started section. Here we have a few more examples:

Estimation Example 1

Example workflows using transitionMatrix to estimate an empirical transition matrix from duration type data. The datasets are produced using examples/generate_synthetic_data.py. This example uses the Aalen-Johansen estimator.

  • Script: examples/python/empirical_transition_matrix.py

By setting the example variable the script covers a number of variations:

  • Version 1: Credit Rating Migration example
  • Version 2: Simple 2x2 Matrix for testing
  • Version 3: Credit Rating Migration example with timestamps in raw date format

Plot of estimated transition probabilities

_images/transition_probabilities.png

Estimation Example 2

Example workflows using transitionMatrix to estimate a transition matrix from data that are in duration format. The datasets are first grouped in period cohorts

  • Script: examples/python/matrix_from_duration_data.py

Post-processing

The post-processing stage includes steps and activities after the estimation of a transition matrix. The precise steps required depend on specific circumstances but might involve some of the following:

  • “Fixing” a matrix by correcting deficiencies linked to data quality
  • Obtaining the infinitesimal generator of a matrix, a powerful tool for further analysis
  • Working with multi-period matrices
  • Visualizing transition datasets and transition frequencies

Basic Operations

The core TransitionMatrix object implements a typical (one period) transition matrix. It supports a variety of operations (more details are documented in the API section)

  • Initialize a matrix (from data, predefined matrices etc)
  • Validate a matrix
  • Attempt to fix a matrix
  • Compute generators, powers etc.
  • Print a matrix
  • Output to json/csv/xlsx formats
  • Output to html format

Simple Operation Examples

Note

The script examples/python/matrix_operations.py contains the below and plenty more simple single matrix examples

Initialize a matrix with values

There is a growing list of ways to initialize a transition matrix

  • Initialize a generic matrix of dimension n
  • Any list can be used for initialization (but not all shapes are valid transition matrices!)
  • Any numpy array can be used for initialization (but not all are valid transition matrices!)
  • Values can be loaded from json or csv files
  • The transitionMatrix.creditratings.predefined module includes a number of predefined matrices

import transitionMatrix as tm

A = tm.TransitionMatrix(values=[[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]])
print(A)
A.print_matrix(format_type='Standard', accuracy=2)

[[0.6 0.2 0.2]
 [0.2 0.6 0.2]
 [0.2 0.2 0.6]]

0.60 0.20 0.20
0.20 0.60 0.20
0.20 0.20 0.60

A.print_matrix(format_type='Percent', accuracy=1)

60.0% 20.0% 20.0%
20.0% 60.0% 20.0%
20.0% 20.0% 60.0%

Both the intrinsic print function and the specific print_matrix method will print the matrix, but print_matrix aims to present the values in a more legible format.

General Matrix Algebra

Note

All standard numerical matrix operations are available as per the numpy API.

Some example operations that leverage the underlying numpy API:

E = tm.TransitionMatrix(values=[[0.75, 0.25], [0.0, 1.0]])
print(E.validate())
# ATTRIBUTES
# Getting matrix info (dimensions, shape)
print(E.ndim)
print(E.shape)
# Obtain the matrix transpose
print(E.T)
# Obtain the matrix inverse
print(E.I)
# Summation methods:
# - along columns
print(E.sum(0))
# - along rows
print(E.sum(1))
# Multiplying all elements of a matrix by a scalar
print(0.01 * A)
# Transition Matrix algebra is very intuitive
print(A * A)
print(A ** 2)
print(A ** 10)

Validating, Fixing and Characterizing a matrix

Validate a Matrix

The validate() method of the object checks for required properties of a valid transition matrix:

  1. check squareness
  2. check that all values are probabilities (between 0 and 1)
  3. check that all rows sum to one

C = tm.TransitionMatrix(values=[1.0, 3.0])
print(C.validate())

[('Matrix Dimensions Differ: ', (1, 2))]

Characterise a Matrix

The characterize() method attempts to characterise a matrix:

  1. diagonal dominance
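The checks listed for validation and the diagonal dominance property can be sketched with numpy as follows. These helper functions are hypothetical and only mirror the listed checks; they are not the library's validate() or characterize() methods, and their messages and return values are illustrative.

```python
import numpy as np


def validate_matrix(values, accuracy=1e-3):
    """Return a list of (message, detail) problems; empty if none found."""
    m = np.atleast_2d(np.asarray(values, dtype=float))
    problems = []
    if m.shape[0] != m.shape[1]:
        problems.append(("Matrix dimensions differ", m.shape))
        return problems  # remaining checks assume a square matrix
    out_of_range = (m < 0) | (m > 1)
    if out_of_range.any():
        problems.append(("Values outside [0, 1]", int(out_of_range.sum())))
    bad_rows = np.flatnonzero(np.abs(m.sum(axis=1) - 1.0) > accuracy)
    if bad_rows.size:
        problems.append(("Rows not summing to one", bad_rows.tolist()))
    return problems


def is_diagonally_dominant(values):
    """True if each diagonal entry is at least the sum of the rest of its row."""
    a = np.abs(np.asarray(values, dtype=float))
    diag = np.diag(a)
    return bool(np.all(diag >= a.sum(axis=1) - diag))


A_values = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
print(validate_matrix([1.0, 3.0]))
print(validate_matrix(A_values))
print(is_diagonally_dominant(A_values))
```

The accuracy tolerance matters in practice because estimated (empirical) matrices rarely satisfy the row-sum constraint exactly.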

Working with an actual matrix

The core capability of transitionMatrix is to produce estimated matrices but getting a realistic example requires quite some work. In this section we assume we have estimated one.

Let's look at a realistic example from the JLT paper:

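import numpy as np
from scipy.linalg import expm
import transitionMatrix as tm
# Assumed location of the predefined JLT matrix (the predefined module mentioned above)
from transitionMatrix.creditratings.predefined import JLT

# Reproduce JLT Generator
# We load it using different sources
# (dataset_path is assumed to point to the datasets directory)
E = tm.TransitionMatrix(values=JLT)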
E_2 = tm.TransitionMatrix(json_file=dataset_path + "JLT.json")
E_3 = tm.TransitionMatrix(csv_file=dataset_path + "JLT.csv")
# Lets check there are no errors
Error = E - E_3
print(np.linalg.norm(Error))
# Lets look at validation and generators"
# Empirical matrices will not satisfy constraints exactly
print(E.validate(accuracy=1e-3))
print(E.characterize())
print(E.generator())
Error = E - expm(E.generator())
# Frobenious norm
print(np.linalg.norm(Error))
# L1 norm
print(np.linalg.norm(Error, 1))
# Use pandas style API for saving to files
E.to_csv("JLT.csv")
E.to_json("JLT.json")
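
To make the relation between a matrix and its generator more concrete, here is a minimal numpy-only sketch using the small 2x2 matrix from the algebra section. It computes the matrix logarithm and exponential through an eigendecomposition, which is a swapped-in technique that only works for diagonalizable matrices with positive real eigenvalues; it is not how the library's generator() method is implemented. The helper `matrix_function` is hypothetical.

```python
import numpy as np

def matrix_function(m, f):
    """Apply a scalar function f to a diagonalizable matrix via its eigendecomposition."""
    w, v = np.linalg.eig(m)
    return (v @ np.diag(f(w)) @ np.linalg.inv(v)).real

P = np.array([[0.75, 0.25],
              [0.00, 1.00]])  # annual matrix, second state absorbing

G = matrix_function(P, np.log)               # candidate generator: matrix logarithm of P
P_back = matrix_function(G, np.exp)          # exp(G) should reproduce P
P_month = matrix_function(G / 12.0, np.exp)  # one-month matrix implied by G

print(G)                                    # rows sum to zero, off-diagonals are non-negative
print(np.linalg.norm(P - P_back))           # round-trip error (Frobenius norm)
print(np.linalg.matrix_power(P_month, 12))  # twelve monthly steps recover the annual matrix
```

For empirical matrices the logarithm is not guaranteed to be a valid generator (negative off-diagonal entries can appear), which is why the round-trip error is worth inspecting.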

Multi-Period Transitions

The transitionMatrix package adopts a multi-period paradigm that is more general than a Markov chain framework, which imposes the Markov assumption over successive periods. To this end, the TransitionMatrixSet object stores a family of TransitionMatrix objects as a time-ordered list. Besides basic storage, this structure allows a variety of simultaneous operations on the collection of related matrices.

There are two basic representations of a multi-period set of transitions:

  • The first (cumulative form) is the most fundamental. Each successive (k-th) element stores transition rates from an initial time to timepoint k. This could be for example the input of an empirical transition matrix dataset
  • In the second (incremental form) successive elements store transition rates from timepoint k-1 to timepoint k.

The TransitionMatrixSet class allows converting between the two representations
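
The relation between the two forms can be sketched with plain numpy arrays. The helpers below are hypothetical and are not the TransitionMatrixSet API. They assume the row-stochastic convention, in which the cumulative matrix to timepoint k is the ordered product of the incremental matrices up to k, and they require each cumulative matrix to be invertible for the reverse conversion.

```python
import numpy as np

def incremental_to_cumulative(incremental):
    """C_k = I_1 @ I_2 @ ... @ I_k (hypothetical helper)."""
    cumulative, running = [], np.eye(incremental[0].shape[0])
    for step in incremental:
        running = running @ step
        cumulative.append(running)
    return cumulative

def cumulative_to_incremental(cumulative):
    """I_k = inv(C_{k-1}) @ C_k, with C_0 the identity (hypothetical helper)."""
    incremental, previous = [], np.eye(cumulative[0].shape[0])
    for current in cumulative:
        # solve(previous, current) returns X with previous @ X = current
        incremental.append(np.linalg.solve(previous, current))
        previous = current
    return incremental

# Two made-up one-period matrices with an absorbing second state
steps = [np.array([[0.9, 0.1], [0.0, 1.0]]),
         np.array([[0.8, 0.2], [0.0, 1.0]])]

cumulative = incremental_to_cumulative(steps)
recovered = cumulative_to_incremental(cumulative)
print(cumulative[1])  # transition probabilities from the initial time to timepoint 2
print(recovered[1])   # recovers the second incremental matrix
```

Empirical cumulative matrices need not correspond to valid incremental matrices: the recovered increments can contain negative entries when the data are inconsistent with a period-by-period chain.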

Matrix Set Operations

  • Script: matrix_set_operations.py

Contains examples using transitionMatrix to perform various transition matrix set operations (Multi-period measurement context)

Default Curves

Absorbing states (in a credit risk context, borrower default) are particularly important, so the package provides specific functionality to isolate the corresponding default rate curve. (See also the CreditCurve object.)
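
As a minimal sketch of the idea, the snippet below reads cumulative default curves off a set of cumulative matrices by taking the column of the absorbing state. It uses plain numpy rather than the CreditCurve object, the matrix values are made up, and the cumulative set is built from powers of a single matrix, which assumes a time-homogeneous Markov chain purely for illustration.

```python
import numpy as np

# Made-up one-period matrix; the last state is default (absorbing)
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.80, 0.10],
              [0.00, 0.00, 1.00]])

periods = 5
# Cumulative matrices from the initial time to timepoints 1..periods
cumulative_set = [np.linalg.matrix_power(P, k) for k in range(1, periods + 1)]

# Cumulative default probability per starting state: last column of each matrix
default_curves = np.array([c[:, -1] for c in cumulative_set])  # shape (periods, n_states)

print(default_curves[:, 0])  # curve for an entity starting in the first state
print(default_curves[:, 1])  # curve for an entity starting in the second state
```

Because the default state is absorbing, each curve is non-decreasing in time.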

Visualization

transitionMatrix aims to support native (Python based) visualization of various transition related datasets using matplotlib and other native python visualization libraries.

Note

The visualization functionality is not yet refactored into a reusable API. For now the visualization functionality is implemented separately as a demo script.

Visualization Examples

Example workflows using transitionMatrix to generate visualizations of migration phenomena

  • Script: examples/python/generate_visuals.py

Example 1

Plotting the state space trajectory of a single entity

_images/single_entity.png

Example 2

Plotting the state space trajectory of multiple entities

_images/sampled_histories.png

Example 3

Histogram plot of transition frequencies

_images/estimation.png

Example 4

Colored scatterplot of entity transitions over time

_images/scatterplot.png

Example 5

Colored scatterplot of entity transitions over time (alternative form)

_images/scatterplot2.png

Example 6

Visualize a transition matrix using a Hinton-style visual

_images/TransitionMatrix.png

Example 7

Visualize a transition matrix using a sankey visual (a logarithmic adaptation that is useful for qualitative insight)

_images/sankey.png

Data Generators

The transitionMatrix distribution includes a number of data generators to support testing / training objectives.

  • exponential_transitions: Generate continuous time events from exponential distribution and uniform sampling from state space. Suitable for testing cohorting algorithms and duration based estimators.
  • markov_chain: Generate discrete events from a markov chain matrix in Compact data format. Suitable for testing cohort based estimators
  • long_format: Generate continuous events from a markov chain matrix in Long data format. Suitable for testing duration based estimators
  • portfolio_labels: Generate a collection of credit rating states emulating a snapshot of portfolio data. Suitable for mappings and transformations of credit rating states
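
To give a feel for the first kind of generator in the list above, here is a minimal standard-library sketch that draws exponential waiting times and picks each new state uniformly from the state space. The function name, its parameters and the (entity, time, state) output layout are assumptions for illustration and do not reproduce the signature of exponential_transitions.

```python
import random

def simulate_exponential_events(n_entities, n_states, rate, horizon, seed=0):
    """Hypothetical generator: returns (entity_id, time, state) tuples in long format."""
    rng = random.Random(seed)
    events = []
    for entity in range(n_entities):
        t = 0.0
        events.append((entity, t, rng.randrange(n_states)))  # initial state at time zero
        while True:
            t += rng.expovariate(rate)  # exponential waiting time until the next event
            if t > horizon:
                break
            # uniform draw of the new state (may repeat the current state)
            events.append((entity, t, rng.randrange(n_states)))
    return events

events = simulate_exponential_events(n_entities=3, n_states=4, rate=2.0, horizon=5.0)
for row in events[:5]:
    print(row)
```

Data of this shape, with one timestamped row per observed state, is the kind of input that duration based estimators work with.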

Note

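Do not confuse data generators (which produce synthetic event data) with matrix generators (the infinitesimal generator of a transition matrix, as returned by generator()).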

Data Generation Examples

All data generation examples are in the script examples/python/generate_synthetic_data.py

Federation

Credit Rating Ontology

The Credit Ratings Ontology is a framework that aims to represent and categorize knowledge about Credit Rating Agencies and related data (Credit Ratings) using semantic web information technologies.

This is a new project, related resources can be found here:

Note

transitionMatrix functionality to federate semantically annotated credit data is planned

Usage Examples

The examples directory includes both standalone python scripts and jupyter notebooks to help you get started. (NB: Currently there are more scripts than notebooks).

A selection of topics covered:

  • Generating transition matrices from data (using various estimators)
  • Manipulating transition matrices
  • Computing and visualizing credit curves corresponding to a set of transition matrices
  • Mapping rating states between different rating systems

Python Scripts

The scripts are located in examples/python. For testing purposes all examples can be run using the run_examples.py script located in the root directory. Some scripts have an example flag that selects alternative input data or estimators.

List of Example Scripts

Script Name | Flag | Input Data | Description
adjust_nr_state.py | 1 | - | Adjust the NR (not-rated) statistics.
adjust_nr_state.py | 2 | - | Adjust the NR (not-rated) statistics.
credit_curves.py | - | - | Compute and Visualize credit curves
characterize_datasets.py | - | - | Load the available datasets and compute various statistics
compare_estimators.py | - | synthetic_data4.csv | Compare the cohort and aalen-johansen estimators on a discrete timestep sample
data_cleaning_example.py | - | rating_data_raw.csv | Prepare transition data sets (data cleansing) using some provided methods
deterministic_paths.py | - | - | Create a transition dataset by replicating given trajectories through a graph
empirical_transition_matrix.py | 1 | synthetic_data7.csv | Credit Rating Migration example
empirical_transition_matrix.py | 2 | synthetic_data8.csv | Simple 2x2 Matrix for testing
empirical_transition_matrix.py | 3 | synthetic_data9.csv | Credit Rating Migration example with timestamps in raw date format
estimate_matrix.py | - | rating_data.csv | An end-to-end example of estimating a credit rating matrix from historical data
fix_multiperiod_matrix.py | - | sp_1981-2016.csv | Detect and solve various pathologies that might be affecting transition matrix data
generate_full_multiperiod_set.py | - | sp_NR_adjusted.json | Use infinitesimal generator methods to generate a full multi-period matrix set.
generate_synthetic_data.py | 1 | - | Generate synthetic data. The first set of examples produces duration type data.
generate_synthetic_data.py | 2 | - | The second set of examples produces cohort type data using markov chain simulation
generate_synthetic_data.py | 3 | - | The second set of examples produces cohort type data using markov chain simulation
generate_visuals.py | 6 | JLT.json | Plot Transition Probabilities
generate_visuals.py | 7 | JLT.json | Logarithmic Sankey Diagram of Credit Migration Rates
generate_visuals.py | 5 | scenario_data.csv | Plot entity transitions
generate_visuals.py | 1 | synthetic_data1.csv | Step Plot of a single observation
generate_visuals.py | 4 | synthetic_data3.csv | Entity Transitions Plot
generate_visuals.py | 2 | synthetic_data4.csv | Step Plot of individual observations
generate_visuals.py | 3 | synthetic_data5.csv | Histogram Plots of transition frequencies
matrix_from_cohort_data.py | 3 | synthetic_data4.csv | S&P Style Credit Rating Migration Matrix
matrix_from_cohort_data.py | 2 | synthetic_data5.csv | IFRS 9 Style Migration Matrix (Large sample for testing)
matrix_from_cohort_data.py | 1 | synthetic_data6.csv | Simplest Absorbing Case for validation
matrix_from_duration_data.py | 1 | synthetic_data1.csv | Duration example with limited data (dataset contains only one entity)
matrix_from_duration_data.py | 2 | synthetic_data2.csv | Duration example n entities with ~10 observations each, [0,1] state, 50%/50% transition matrix
matrix_from_duration_data.py | 3 | synthetic_data3.csv | -
matrix_lendingclub.py | - | - | Estimate a matrix from LendingClub data. Input data are in a special cohort format as the published datasets have some limitations
matrix_operations.py | - | - | Perform various transition matrix operations illustrating the matrix algebra
matrix_set_lendingclub.py | - | - | Estimate a matrix from LendingClub data. Input data are in a special cohort format as the published datasets have some limitations
matrix_set_operations.py | - | - | Perform operations with multi-period transition matrix sequences
state_space_operations.py | - | - | Examples working with state spaces (mappings)

Jupyter Notebooks

  • Adjust_NotRated_State.ipynb
  • Matrix_Operations.ipynb
  • Monthly_from_Annual.ipynb

Open Risk Academy Scripts

Additional examples are available in the Open Risk Academy course Analysis of Credit Migration using Python TransitionMatrix. The scripts developed in the course are available here

API

The transitionMatrix package structure and API.

Warning

The library is still being expanded / refactored. Significant structure and API changes are likely.

transitionMatrix Package

The core module

transitionMatrix Classes

TransitionMatrix
TransitionMatrixSet
EmpiricalTransitionMatrix

Todo

This is future functionality

transitionMatrix Subpackages

Estimators SubPackage

This subpackage implements the various estimators

transitionMatrix.estimators.simple_estimator module
transitionMatrix.estimators.cohort_estimator module
transitionMatrix.estimators.aalen_johansen_estimator module
transitionMatrix.estimators.kaplan_meier_estimator module

Todo

This is future functionality

State Spaces SubPackage

This subpackage implements state space functionality

transitionMatrix.statespaces.statespace module

Credit Ratings SubPackage

This subpackage collects credit rating specific functionality

transitionMatrix.creditratings.creditcurve module
transitionMatrix.creditratings.creditsystems module
transitionMatrix.creditratings.predefined module

Generators SubPackage

This subpackage implements test data generation

transitionMatrix.generators contents

Visualization subpackage

This subpackage implements visualization functionality

Warning

not yet implemented

transitionMatrix.visualization contents

Utilities SubPackage

This subpackage collects various utilities

transitionMatrix.utils.converters module
transitionMatrix.utils.preprocessing module

Testing

Testing transitionMatrix has two major components:

  • normal code testing aiming to certify the correctness of code execution
  • algorithm testing aiming to validate the correctness of algorithmic implementation

Note

In general algorithmic testing is not as precise as code testing and may be more subject to uncertainties such as numerical accuracy. To make those tests as revealing as possible transitionMatrix implements a number of standardized round-trip tests:

  • starting with a matrix
  • generating compatible data
  • estimating a matrix from the data
  • comparing the values of input and estimated matrices
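
The round-trip steps above can be sketched with plain numpy as follows. This is an illustrative stand-in rather than the library's test suite: the helper names are hypothetical, the data are simulated from a discrete-time Markov chain, and the estimator is a simple count of observed one-step transitions.

```python
import numpy as np

def simulate_chain(P, n_entities, n_steps, rng):
    """Generate state paths of shape (n_entities, n_steps + 1) from a row-stochastic matrix."""
    n_states = P.shape[0]
    paths = np.empty((n_entities, n_steps + 1), dtype=int)
    paths[:, 0] = rng.integers(n_states, size=n_entities)  # uniform initial states
    for t in range(n_steps):
        for i in range(n_entities):
            paths[i, t + 1] = rng.choice(n_states, p=P[paths[i, t]])
    return paths

def estimate_matrix(paths, n_states):
    """Count one-step transitions and normalise each row (assumes every state is visited)."""
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (paths[:, :-1].ravel(), paths[:, 1:].ravel()), 1)
    return counts / counts.sum(axis=1, keepdims=True)

# 1. start with a matrix
P = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
# 2. generate compatible data
rng = np.random.default_rng(42)
paths = simulate_chain(P, n_entities=500, n_steps=20, rng=rng)
# 3. estimate a matrix from the data
P_hat = estimate_matrix(paths, n_states=3)
# 4. compare input and estimated values
print(np.abs(P - P_hat).max())
```

The comparison can only be expected to hold up to sampling error, which shrinks as the number of simulated transitions grows; a test of this kind therefore needs a tolerance rather than exact equality.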

Running all the examples

Running all the examples is a quick way to check that everything is installed properly, all paths are defined etc. At the root of the distribution:

python3 run_examples.py

The script simply iterates over a standalone list of Usage Examples and executes each one.

filelist = ['adjust_nr_state', 'credit_curves', 'empirical_transition_matrix', 'fix_multiperiod_matrix', 'generate_synthetic_data', 'generate_visuals', 'matrix_from_cohort_data', 'matrix_from_duration_data', 'matrix_lendingclub', 'matrix_set_lendingclub', 'matrix_operations', 'matrix_set_operations']

Warning

The script may write a number of files / images to various locations within the distribution.

Test Suite

The testing framework is based on unittest. Before you get started and depending on how you obtained / installed the library check:

  • If required adjust the source directory path in transitionMatrix/__init__
  • Unzip the data files in the datasets directory

Then run all tests

python3 test.py

For an individual test:

pytest tests/test_TESTNAME.py

Roadmap

transitionMatrix is an ongoing project. Several significant extensions are already in the pipeline. transitionMatrix aims to become the most intuitive and versatile tool to analyse discrete transition data. The Roadmap lays out upcoming steps / milestones in this journey. The Todo list is a more granular collection of outstanding items.

You are welcome to contribute to the development of transitionMatrix by creating Issues or Pull Requests on the github repository. Feature requests, bug reports and any other issues can be logged there.

General usage of the library is discussed here

0.5

Version 0.5 will be the next major release (still considered alpha) and will be available e.g. on PyPI

0.4.X

The 0.4.X family of updates will focus on rounding out and (above all) documenting a number of functionalities already introduced

Todo List

A list of todo items, no triaging / prioritisation implied

Core Architecture and API

  • Introduce exceptions / error handling throughout
  • Solve numpy.matrix deprecation (implement equivalent API in terms of ndarray)
  • Complete testing framework

Input Data Preprocessing

  • Handling of markov chain transition formats (single entity)
  • Native handling of Wide Data Formats (concrete data sets missing)
  • Generalize cohorting algorithm to user specified function

Reference Data

  • Additional credit rating scales (e.g short term ratings)
  • Integration with credit rating ontology

Transition Matrix Analysis Functionality

  • Further validation and characterisation of transition matrices (mobility indexes)
  • Generate random matrix subject to constraints
  • Fixing common problems encountered by empirically estimated transition matrices

Statistical Analysis Functionality

  • Aalen Johansen Estimator
    • Covariance calculation
    • Various other improvements / tests
  • Cohort Estimator
    • Read Data by labels
    • Edge cases
  • Kaplan Meier Estimator NEW
    • (link to survival frameworks)
  • Duration based methods
  • Bootstrap based confidence intervals

State Space package

  • Multiple absorbing states (competing risks)
  • Automated coarsening of states (merging of similar)

Utilities

  • Continuous time data generation from arbitrary chain

Further Refactoring of packages

  • Introduce visualization objects / API

Performance / Big data

  • Handling very large data sets, moving away from in-memory processing

Documentation

  • Sphinx documentation (complete)
  • Expand the jupyter notebook collection to (at least) match the standalone scripts

Releases / Distribution

  • Adopt regular github/PyPI release schedule
  • Conda distribution

ChangeLog

PLEASE NOTE THAT THE API OF TRANSITION MATRIX IS STILL UNSTABLE AS MORE USE CASES / FEATURES ARE ADDED REGULARLY

v0.5.0 (21-02-2022)

  • Installation:
    • Bump python dependency to 3.7
    • PyPI release update

v0.4.9 (04-05-2021)

  • Refactoring: All non-core functionality moved to separate directories/sub-packages
    • credit curve stuff moved to creditratings modules
    • data generators moved to generators modules
    • etc.
  • Documentation: Major expansion (Still incomplete)
    • Expanded Data Formats
    • Rating Scales, CQS etc
    • Listing all datasets and examples
  • Testing / Training: An interesting use case raised as issue #20
    • Added an end-to-end example of estimating a credit rating matrix from raw data
    • Includes various data preprocessing examples
  • Datasets:
    • rating_data.csv (cleaned up credit data)
    • synthetic_data10.csv Credit Rating Migrations in Long Format / Compact Form (for testing)
    • deterministic generator (replicate given trajectories)
  • Tests:
    • test_roundtrip.py testing via round-tripping methods

v0.4.8 (07-02-2021)

  • Documentation: Pulled all rst files in docs
  • Refactoring: credit rating data moved into separate module

v0.4.7 (29-09-2020)

  • Documentation: Expanded and updated description of classes
  • Documentation: Including Open Risk Academy code examples
  • Feature: logarithmic sankey visualization

v0.4.6 (22-05-2019)

  • Feature: Update of CQS Mappings, addition of new rating scales
  • Documentation: Documentation of rating scale structure and mappings
  • Training: Example of mapping portfolio data to CQS

v0.4.5 (21-04-2019)

  • Training: Monthly_from_Annual.ipynb (a Jupyter notebook illustrating how to interpolate transition rates on monthly intervals)
  • Datasets: generic_monthly.json
  • Feature: print_matrix function for generic matrix pretty printing
  • Feature: matrix_exponent function for obtaining arbitrary integral matrices from a given generator

v0.4.4 (03-04-2019)

  • Documentation: Cleanup of docs following separation of threshold / portfolio models
  • Datasets: generic_multiperiod.json
  • Feature: CreditCurve class for holding credit curves

v0.4.3 (29-03-2019)

  • Refactoring: Significant rearrangement of code (the threshold models package moved to portfolioAnalytics for more consistent structure of the code base / functionality)

v0.4.2 (29-01-2019)

  • Feature: converter function in transitionMatrix.utils.converters to convert long form dataframes into canonical float form
  • Datasets: synthetic_data9.csv (datetime in string format)
  • Training: new data generator in examples/generate_synthetic_data.py to generate long format with string dates
  • Training: Additional example (=3) in examples/empirical_transition_matrix.py to process long format with string dates
  • Documentation: More detailed explanation of Long Data Formats with links to Open Risk Manual
  • Documentation: Enabled sphinx.ext.autosectionlabel for easy internal links / removed duplicate labels

v0.4.1 (31-10-2018)

  • Feature: Added functionality for conditioning multi-period transition matrices
  • Training: Example calculation and visualization of conditional matrices
  • Datasets: State space description and CQS mappings for top-6 credit rating agencies

v0.4.0 (23-10-2018)

  • Installation: First PyPI and wheel installation options
  • Feature: Added Aalen-Johansen Duration Estimator
  • Documentation: Major overhaul of documentation, now targeting ReadTheDocs distribution
  • Training: Streamlining of all examples
  • Datasets: Synthetic Datasets in long format

v0.3.1 (21-09-2018)

  • Feature: Expanded functionality to compute and visualize credit curves

v0.3 (27-08-2018)

  • Feature: Addition of portfolio models (formerly portfolio_analytics_library) for data generation and testing
  • Training: Added examples in jupyter notebook format

v0.2 (05-06-2018)

  • Feature: Addition of threshold generation algorithms

v0.1.3 (04-05-2018)

  • Documentation: Sphinx based documentation
  • Training: Additional visualization examples

v0.1.2 (05-12-2017)

  • Refactoring: Dataset paths
  • Bugfix: Correcting requirement dependencies (missing matplotlib)
  • Documentation: More detailed instructions

v0.1.1 (03-12-2017)

  • Feature: TransitionMatrix model: new methods to merge States, fix problematic probability matrices, I/O APIs
  • Feature: TransitionMatrixSet model: json and csv readers, methods for set-wise manipulations
  • Datasets: Additional multiperiod datasets (Standard and Poors historical corporate rating transition rates)
  • Feature: Enhanced matrix comparison functionality
  • Training: Three additional example workflows
    • fixing multiperiod matrices (completing State Space)
    • adjusting matrices for withdrawn entries
    • generating full multi-period sets from limited observations

v0.1.0 (11-11-2017)

  • First public release of the package
