Utilities SubPackage

This subpackage collects utilities for format conversion and data preprocessing.

transitionMatrix.utils.converters module

Converter utilities to help switch between various formats

transitionMatrix.utils.converters.datetime_to_float(dataframe, time_column='Time', format=None)[source]

datetime_to_float() converts dates from string format to the canonical float format

Parameters:
  • time_column – the column label of the observation times

  • dataframe – Pandas dataframe with dates in string format

Returns:

Pandas dataframe with dates in float format

Return type:

object

Note

The date string must be recognizable by the pandas to_datetime function.
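As an illustration of the kind of conversion involved, the following sketch turns date strings into floats with pandas. It is not the library's implementation: the helper name dates_to_year_fractions and the choice of unit (years elapsed since the earliest observation, at 365.25 days per year) are assumptions made for this example, and the actual canonical float convention should be checked against the source.

```python
import pandas as pd

def dates_to_year_fractions(dataframe, time_column="Time", format=None):
    """Hypothetical helper: replace date strings by years elapsed since the earliest date."""
    result = dataframe.copy()  # leave the caller's dataframe untouched
    parsed = pd.to_datetime(result[time_column], format=format)
    elapsed = parsed - parsed.min()
    # Assumed unit: years of 365.25 days, measured from the earliest observation
    result[time_column] = elapsed.dt.total_seconds() / (365.25 * 24 * 3600)
    return result

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Time": ["2020-01-01", "2021-01-01", "2020-07-01"],
    "State": [0, 1, 0],
})
converted = dates_to_year_fractions(df)
```

The earliest date maps to 0.0 and later dates to positive floats, so the resulting column can be ordered and binned numerically.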

transitionMatrix.utils.converters.frame_to_array(dataframe)[source]

Convert a pandas dataframe to a numpy array

Parameters:

dataframe

Returns:

numpy array

transitionMatrix.utils.converters.to_canonical(dataframe)[source]

to_canonical() converts a dataframe that is in compact form into a canonical form

Parameters:

dataframe

Returns:

dataframe

transitionMatrix.utils.converters.to_compact(dataframe)[source]

to_compact() converts a dataframe that is in canonical form into a compact form

Parameters:

dataframe

Returns:

dataframe
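To make the difference between the two layouts concrete, here is a minimal sketch of one direction of the conversion. It assumes that the compact form has ID, Time and State columns and that the canonical form records each transition with ID, Time, From and To columns; both the column layout and the helper name compact_to_transitions are assumptions for illustration, not the behaviour of to_canonical() itself.

```python
import pandas as pd

def compact_to_transitions(compact):
    """Hypothetical helper: pair each observation with the entity's previous state."""
    ordered = compact.sort_values(["ID", "Time"]).reset_index(drop=True)
    # Previous state of the same entity; NaN for each entity's first observation
    previous = ordered.groupby("ID")["State"].shift(1)
    has_previous = previous.notna()
    return pd.DataFrame({
        "ID": ordered.loc[has_previous, "ID"],
        "Time": ordered.loc[has_previous, "Time"],
        "From": previous[has_previous].astype(ordered["State"].dtype),
        "To": ordered.loc[has_previous, "State"],
    }).reset_index(drop=True)

compact = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Time": [0.0, 1.0, 2.5, 0.0, 3.0],
    "State": [0, 1, 2, 0, 0],
})
transitions = compact_to_transitions(compact)
```

In this sketch the first observation of each entity has no predecessor and therefore produces no row; whether the library keeps such rows is not stated here.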

transitionMatrix.utils.preprocessing module

module transitionMatrix.utils - helper classes and functions

transitionMatrix.utils.preprocessing.bin_timestamps(sorted_data, cohorts, output_format=0, remove_stale=False)[source]

Bin timestamped data in a dataframe so as to have ingoing and outgoing states per cohort interval

Parameters:
  • sorted_data (pandas dataframe) – the dataframe to cohort (must already be sorted)

  • cohorts – the number of cohorts

  • output_format (int) – how to structure the outputs (0=cohorts, 1=event_list)

  • remove_stale (bool) – whether to remove successive observations with identical state

Returns:

returns dataframe with cohorted data and cohort intervals

Note

The ‘ID’ and ‘Time’ column labels are used by default.

Warning

Cohorting is a ‘lossy’ operation: Timestamps are discretised (binned) and any intermediate state transitions are lost.

Warning

The data must be sorted already
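The loss of information mentioned in the warning can be seen in a small pure-Python sketch. The helper bin_last_state, the tuple-based event layout and the rule that the last observation in a cohort wins are all assumptions made for illustration; they are not taken from bin_timestamps().

```python
def bin_last_state(sorted_events, t_min, dt, cohorts):
    """Hypothetical helper: keep the last observed state per (entity, cohort)."""
    binned = {}
    for entity_id, time, state in sorted_events:
        # Clamp so that the maximum timestamp falls in the final cohort
        cohort = min(int((time - t_min) // dt), cohorts - 1)
        # Later events overwrite earlier ones within the same cohort
        binned[(entity_id, cohort)] = state
    return binned

events = [(1, 0.1, "A"), (1, 0.2, "B"), (1, 0.3, "C"), (1, 1.2, "A"), (2, 0.4, "A")]
binned = bin_last_state(events, t_min=0.0, dt=1.0, cohorts=2)
```

Entity 1 passes through A, B and C inside the first cohort, but only C survives the binning, which is why cohorting should be treated as lossy.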

transitionMatrix.utils.preprocessing.generate_cohort_bounds(data, cohorts)[source]

Generate cohort intervals given an input transition dataframe and the desired number of cohorts. The function finds the range of timestamps and divides it equally

Parameters:
  • data – a pandas dataframe

  • cohorts (int) – the number of cohorts

Returns:

cohort_bounds (the boundaries of the cohort intervals), dt (the cohort interval)

Warning

The ‘Time’ column must be in float format.
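The equal division of the timestamp range can be sketched as follows. The helper name equal_width_bounds and the convention of returning cohorts + 1 boundary values are assumptions for this example; the function documented above may structure its outputs differently.

```python
def equal_width_bounds(timestamps, cohorts):
    """Hypothetical helper: split [min, max] of the timestamps into equal intervals."""
    t_min, t_max = min(timestamps), max(timestamps)
    dt = (t_max - t_min) / cohorts
    # cohorts intervals are delimited by cohorts + 1 boundary values
    cohort_bounds = [t_min + k * dt for k in range(cohorts + 1)]
    return cohort_bounds, dt

bounds, dt = equal_width_bounds([0.0, 0.4, 1.3, 2.0], cohorts=4)
```

Because the interval width is computed by arithmetic on the timestamps, the input must already be numeric, which is the reason for the float-format requirement.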

transitionMatrix.utils.preprocessing.generate_event_dict(data, dt, cohort_bounds)[source]

Loop over all events and construct a dictionary in the following format:

event_dict = {
  (entity_id, cohort interval) : [(time, state), ..., (time, state)],
  (entity_id, cohort interval) : [(time, state), ..., (time, state)]
}
  • Create a unique key as per (entity, interval)

  • Find the interval of each event (the cohort it belongs to)

  • Add (time, state) pairs as variable length list

This data structure allows applying arbitrary state assignment to each cohort interval

Parameters:
  • data – a pandas dataframe

  • dt – the cohort interval

  • cohort_bounds – the boundaries of the cohort intervals

Returns:

dict
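The three steps listed above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: events are passed as (entity_id, time, state) tuples instead of dataframe rows, cohort_bounds is assumed to hold one more value than there are cohorts, and the key uses the cohort index. The name build_event_dict is hypothetical.

```python
from collections import defaultdict

def build_event_dict(events, dt, cohort_bounds):
    """Hypothetical helper: group (time, state) pairs by (entity_id, cohort index)."""
    t_min = cohort_bounds[0]
    last_cohort = len(cohort_bounds) - 2  # index of the final interval
    event_dict = defaultdict(list)
    for entity_id, time, state in events:
        # Find the interval of the event, clamping the maximum timestamp
        cohort = min(int((time - t_min) // dt), last_cohort)
        # Append to a variable-length list under the (entity, interval) key
        event_dict[(entity_id, cohort)].append((time, state))
    return dict(event_dict)

events = [(1, 0.1, 0), (1, 0.7, 1), (1, 1.5, 2), (2, 2.0, 0)]
event_dict = build_event_dict(events, dt=1.0, cohort_bounds=[0.0, 1.0, 2.0])
```

Keeping the full list of (time, state) pairs per key defers the choice of a representative state, so any assignment rule (first, last, most frequent) can be applied afterwards.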

transitionMatrix.utils.preprocessing.remove_stale_events(data)[source]

Parse an event dictionary and remove transitions to the same state:

event_dict = {
  (entity_id, cohort interval) : [(time, state), ..., (time, state)],
  (entity_id, cohort interval) : [(time, state), ..., (time, state)]
}
Parameters:

data – a pandas dataframe

Returns:

dict
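A minimal sketch of removing same-state transitions from such a dictionary is given below. It assumes that staleness is judged within each list and that the earliest observation of a repeated state is the one kept; the helper name drop_repeated_states is hypothetical and the library may apply a different rule.

```python
def drop_repeated_states(event_dict):
    """Hypothetical helper: drop events that repeat the preceding state in each list."""
    cleaned = {}
    for key, events in event_dict.items():
        kept = []
        for time, state in events:
            # Keep an event only if it changes the state relative to the last kept one
            if not kept or kept[-1][1] != state:
                kept.append((time, state))
        cleaned[key] = kept
    return cleaned

event_dict = {
    (1, 0): [(0.1, 0), (0.4, 0), (0.7, 1)],
    (2, 0): [(0.2, 3), (0.9, 3)],
}
cleaned = drop_repeated_states(event_dict)
```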

transitionMatrix.utils.preprocessing.total_timestamps(data)[source]

Count total number of timestamps in a dataframe

Parameters:

data – dataframe. The ‘Time’ column is used by default

Returns:

returns an integer

transitionMatrix.utils.preprocessing.transitions_summary(dataframe)[source]

Calculate summary statistics about transitions

Parameters:

dataframe – input dataframe

Returns:

dict

transitionMatrix.utils.preprocessing.unique_entities(data)[source]

Identify unique entities in a dataframe

Parameters:

data – dataframe. The ‘ID’ column is used by default

Returns:

returns a numpy array

transitionMatrix.utils.preprocessing.unique_states(data)[source]

Identify unique states in a dataframe

Parameters:

data – dataframe. The ‘State’ column is used by default for Compact formats, ‘From’ column as fallback for Canonical format

Returns:

returns a numpy array

transitionMatrix.utils.preprocessing.unique_timestamps(data)[source]

Identify unique timestamps in a dataframe

Parameters:

data – dataframe. The ‘Time’ column is used by default

Returns:

returns a sorted numpy array
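For orientation, comparable results can be obtained directly with pandas and numpy, as in the sketch below. It uses the default ‘ID’, ‘State’ and ‘Time’ column labels mentioned above and only standard pandas and numpy calls; it does not reproduce the library functions themselves, and the ordering of their outputs may differ.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 1, 2, 1],
    "Time": [1.0, 0.0, 0.0, 2.0],
    "State": [1, 0, 0, 1],
})

# Series.unique() returns a numpy array in order of first appearance
entities = data["ID"].unique()
states = data["State"].unique()

# np.unique() returns a numpy array sorted in ascending order
timestamps = np.unique(data["Time"])
```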

transitionMatrix.utils.preprocessing.validate_absorbing_state(dataframe, state)[source]

Validate whether a given state is actually absorbing (there should be no transitions to another state)

Parameters:
  • dataframe – an input data frame

  • state (int) – the state to validate

Returns:

a list of exceptions
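The check can be pictured with a short pure-Python sketch. It assumes transitions are available as (entity_id, time, from_state, to_state) tuples and that an "exception" is any transition leaving the given state; the helper name find_exits_from and this data layout are assumptions for illustration only.

```python
def find_exits_from(transitions, state):
    """Hypothetical helper: list transitions that leave a supposedly absorbing state."""
    return [
        (entity_id, time, origin, target)
        for entity_id, time, origin, target in transitions
        if origin == state and target != state
    ]

transitions = [(1, 1.0, 0, 1), (1, 2.0, 1, 2), (2, 1.5, 2, 0), (3, 0.5, 2, 2)]
violations = find_exits_from(transitions, state=2)
```

An empty list means no transition out of the state was observed; here entity 2 leaves state 2, so that state is not absorbing in this sample.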