Utilities SubPackage
This subpackage collects various utilities
transitionMatrix.utils.converters module
Converter utilities to help switch between various formats
- transitionMatrix.utils.converters.datetime_to_float(dataframe, time_column='Time', format=None)[source]
datetime_to_float() converts dates from string format to the canonical float format
- Parameters:
time_column – the column label of the observation times
dataframe – Pandas dataframe with dates in string format
- Returns:
Pandas dataframe with dates in float format
- Return type:
object
Note
The date string must be recognizable by the pandas to_datetime function.
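As a rough illustration of this kind of conversion, the sketch below parses date strings with pandas and rescales them to floats. The helper name `dates_to_unit_interval` is hypothetical, and the convention used (elapsed days as a fraction of the observed window) is an assumption made for the example, not a statement of what datetime_to_float() computes.

```python
import pandas as pd

def dates_to_unit_interval(frame, time_column="Time", fmt=None):
    """Hypothetical helper: map date strings to floats in [0, 1]."""
    out = frame.copy()
    # Any string that pandas.to_datetime can parse is acceptable here
    stamps = pd.to_datetime(out[time_column], format=fmt)
    start, end = stamps.min(), stamps.max()
    span_days = (end - start).days
    # Assumed convention: elapsed days as a fraction of the observed window
    out[time_column] = (stamps - start).dt.days / span_days
    return out

frame = pd.DataFrame({
    "ID": [0, 0, 1],
    "Time": ["2020-01-01", "2020-07-01", "2021-01-01"],
    "State": [0, 1, 0],
})
converted = dates_to_unit_interval(frame)
print(converted)
```

The input frame is copied rather than modified in place, so the original string dates remain available for checking.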
- transitionMatrix.utils.converters.frame_to_array(dataframe)[source]
Convert a pandas dataframe to a numpy array
- Parameters:
dataframe – the pandas dataframe to convert
- Returns:
a numpy array
transitionMatrix.utils.preprocessing module
module transitionMatrix.utils - helper classes and functions
- transitionMatrix.utils.preprocessing.bin_timestamps(sorted_data, cohorts, output_format=0, remove_stale=False)[source]
Bin timestamped data in a dataframe so as to have ingoing and outgoing states per cohort interval
- Parameters:
sorted_data (pandas dataframe) – the dataframe to cohort (must already be sorted)
cohorts – the number of cohorts
output_format (int) – how to structure the outputs (0=cohorts, 1=event_list)
remove_stale (bool) – whether to remove successive observations with identical state
- Returns:
returns dataframe with cohorted data and cohort intervals
Note
The ‘ID’ and ‘Time’ column labels are used by default.
Warning
Cohorting is a ‘lossy’ operation: Timestamps are discretised (binned) and any intermediate state transitions are lost.
Warning
The data must be sorted already
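The lossy nature of cohorting can be seen in a small plain-Python sketch. The helper `last_state_per_interval` is hypothetical, and keeping the last observed state in each interval is only one possible assignment rule, chosen here for illustration; it is not claimed to be the rule bin_timestamps() applies.

```python
def last_state_per_interval(events, t_min, dt, cohorts):
    """Hypothetical helper: keep one state per (entity_id, interval)."""
    binned = {}
    # Process events in (entity, time) order so later events win
    for entity_id, time, state in sorted(events, key=lambda e: (e[0], e[1])):
        # The final boundary is folded into the last interval
        interval = min(int((time - t_min) // dt), cohorts - 1)
        binned[(entity_id, interval)] = state
    return binned

# Entity 1 passes through state 1 at time 0.5, inside the first interval
events = [(1, 0.1, 0), (1, 0.5, 1), (1, 0.9, 2), (1, 1.5, 2)]
binned = last_state_per_interval(events, t_min=0.0, dt=1.0, cohorts=2)
print(binned)  # {(1, 0): 2, (1, 1): 2}
```

The visit to state 1 leaves no trace in the binned output, which is the kind of information loss the warning describes.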
- transitionMatrix.utils.preprocessing.generate_cohort_bounds(data, cohorts)[source]
Generate cohort intervals given an input transition dataframe and the desired number of cohorts. The function finds the range of timestamps and divides it equally
- Parameters:
data – a pandas dataframe
cohorts (int) – the number of cohorts
- Returns:
cohort_bounds – the boundaries of the cohort intervals
dt – the cohort interval
Warning
The ‘Time’ column must be in float format
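The equal division of the timestamp range can be sketched in a few lines of plain Python. The helper `equal_width_bounds` is hypothetical and works on a bare list of float times rather than a dataframe; the order of the returned pair is an assumption of the example.

```python
def equal_width_bounds(times, cohorts):
    """Hypothetical helper: split [min(times), max(times)] into equal intervals."""
    t_min, t_max = min(times), max(times)
    dt = (t_max - t_min) / cohorts
    # cohorts intervals need cohorts + 1 boundary points
    bounds = [t_min + i * dt for i in range(cohorts + 1)]
    return bounds, dt

bounds, dt = equal_width_bounds([0.0, 0.7, 1.3, 4.0], cohorts=4)
print(bounds)  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(dt)      # 1.0
```

The subtraction and division only make sense for numeric times, which is why date strings need converting to floats first.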
- transitionMatrix.utils.preprocessing.generate_event_dict(data, dt, cohort_bounds)[source]
Loop over all events and construct a dictionary in the following format:
event_dict = { (entity_id, cohort interval) : [(time, state), ..., (time, state)], (entity_id, cohort interval) : [(time, state), ..., (time, state)] }
Create a unique key as per (entity, interval)
Find the interval of each event (the cohort it belongs to)
Add (time, state) pairs as a variable-length list
This data structure allows applying arbitrary state assignment to each cohort interval
- Parameters:
data – a pandas dataframe
dt – the cohort interval
cohort_bounds – the boundaries of the cohort intervals
- Returns:
dict
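A minimal sketch of the dictionary construction described above, assuming events arrive as plain (entity_id, time, state) tuples rather than dataframe rows. The helper `build_event_dict` is hypothetical, and the handling of the final boundary (folded into the last interval) is an assumption of the example.

```python
from collections import defaultdict

def build_event_dict(events, dt, cohort_bounds):
    """Hypothetical helper: group events by (entity_id, interval)."""
    t_min = cohort_bounds[0]
    last_interval = len(cohort_bounds) - 2
    event_dict = defaultdict(list)
    for entity_id, time, state in events:
        # Find the interval of the event; clamp the final boundary
        interval = min(int((time - t_min) // dt), last_interval)
        # Unique key per (entity, interval), value is a variable-length list
        event_dict[(entity_id, interval)].append((time, state))
    return dict(event_dict)

events = [(1, 0.2, 0), (1, 0.8, 1), (1, 2.5, 2), (2, 4.0, 0)]
event_dict = build_event_dict(
    events, dt=1.0, cohort_bounds=[0.0, 1.0, 2.0, 3.0, 4.0]
)
print(event_dict)
# {(1, 0): [(0.2, 0), (0.8, 1)], (1, 2): [(2.5, 2)], (2, 3): [(4.0, 0)]}
```

Because every (time, state) pair inside an interval is retained, any rule for assigning a single state to the interval (first, last, most frequent) can be applied afterwards.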
- transitionMatrix.utils.preprocessing.remove_stale_events(data)[source]
Parse an event dictionary and remove transitions to the same state:
event_dict = { (entity_id, cohort interval) : [(time, state), ..., (time, state)], (entity_id, cohort interval) : [(time, state), ..., (time, state)] }
- Parameters:
data – a pandas dataframe
- Returns:
dict
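One way to picture the removal of same-state transitions is the sketch below, which operates on an event dictionary of the shape shown above. The helper `drop_repeated_states` is hypothetical, and it compares states only within a single (entity_id, interval) list; whether the library also compares across intervals is not stated here.

```python
def drop_repeated_states(event_dict):
    """Hypothetical helper: drop events that repeat the previous kept state."""
    cleaned = {}
    for key, events in event_dict.items():
        kept = []
        for time, state in events:
            # Keep the first event, then only genuine changes of state
            if not kept or kept[-1][1] != state:
                kept.append((time, state))
        cleaned[key] = kept
    return cleaned

event_dict = {
    (1, 0): [(0.1, 0), (0.4, 0), (0.9, 1)],
    (2, 0): [(0.3, 2), (0.5, 2)],
}
cleaned = drop_repeated_states(event_dict)
print(cleaned)
# {(1, 0): [(0.1, 0), (0.9, 1)], (2, 0): [(0.3, 2)]}
```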
- transitionMatrix.utils.preprocessing.total_timestamps(data)[source]
Count total number of timestamps in a dataframe
- Parameters:
data – dataframe. The ‘Time’ column is used by default
- Returns:
returns an integer
- transitionMatrix.utils.preprocessing.transitions_summary(dataframe)[source]
Calculate some summary statistics about transitions
- Parameters:
dataframe – input dataframe
- Returns:
dict
- transitionMatrix.utils.preprocessing.unique_entities(data)[source]
Identify unique entities in a dataframe
- Parameters:
data – dataframe. The ‘ID’ column is used by default
- Returns:
returns a numpy array
- transitionMatrix.utils.preprocessing.unique_states(data)[source]
Identify unique states in a dataframe
- Parameters:
data – dataframe. The ‘State’ column is used by default for Compact formats, ‘From’ column as fallback for Canonical format
- Returns:
returns a numpy array
- transitionMatrix.utils.preprocessing.unique_timestamps(data)[source]
Identify unique timestamps in a dataframe
- Parameters:
data – dataframe. The ‘Time’ column is used by default
- Returns:
returns a sorted numpy array
- transitionMatrix.utils.preprocessing.validate_absorbing_state(dataframe, state)[source]
Validate whether a given state is actually absorbing (there should be no transitions to another state)
- Parameters:
dataframe – an input data frame
state (int) – the state to validate
- Returns:
a list of exceptions
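The absorbing-state check can be sketched without pandas, assuming observations are given as (entity_id, time, state) tuples. The helper `absorbing_violations` and the shape of the returned list are hypothetical; the library's own list of exceptions may be structured differently.

```python
def absorbing_violations(records, state):
    """Hypothetical helper: list (entity_id, time, next_state) exits from `state`."""
    violations = []
    previous = {}  # entity_id -> last observed state
    for entity_id, time, observed in sorted(records, key=lambda r: (r[0], r[1])):
        # A violation is any move out of the supposedly absorbing state
        if previous.get(entity_id) == state and observed != state:
            violations.append((entity_id, time, observed))
        previous[entity_id] = observed
    return violations

records = [(1, 0.0, 0), (1, 1.0, 2), (2, 0.0, 2), (2, 1.0, 1), (2, 2.0, 2)]
violations = absorbing_violations(records, state=2)
print(violations)  # [(2, 1.0, 1)]
```

An empty list means no entity was observed leaving the state, which is consistent with the state being absorbing in the data at hand.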