medicaid_utils.preprocessing package¶

Submodules¶

medicaid_utils.preprocessing.max_cc module¶

This module has MAXCC class which wraps together cleaning/ preprocessing routines specific for MAX CC files

class medicaid_utils.preprocessing.max_cc.MAXCC(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: MAXFile

Scripts to preprocess CC file

agg_conditions(lst_conditions: List[str]) → None[source]¶

Aggregate condition indicators across payer levels

Parameters:: lst_conditions (list of str) – List of condition columns to aggregate

get_chronic_conditions(lst_conditions: List[str] | None = None) → DataFrame[source]¶

Get chronic condition indicators

Parameters:: lst_conditions (list of str, default=None) – List of condition columns to check

medicaid_utils.preprocessing.max_file module¶

This module has the MAXFile class, which is the base class for all MAX file type classes

class medicaid_utils.preprocessing.max_file.MAXFile(ftype: str, year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: object

Parent class for all MAX file classes, each of which will have clean and preprocess functions

add_gender() → None[source]¶: Adds integer ‘female’ column based on ‘EL_SEX_CD’ column. Undefined values (‘U’) in EL_SEX_CD column will result in female column taking the value -1
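
The mapping described for add_gender can be sketched in plain pandas. This is an illustration only, not the library's code: the helper name `add_female_flag` is hypothetical, and the coding 'F' -> 1, 'M' -> 0 is an assumption (the description above only states that 'U' becomes -1).

```python
import pandas as pd


def add_female_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an integer 'female' column derived from EL_SEX_CD."""
    out = df.copy()
    # Assumed coding: 'F' -> 1, 'M' -> 0; anything else ('U' or missing) -> -1
    sex = out["EL_SEX_CD"].str.strip().str.upper()
    out["female"] = sex.map({"F": 1, "M": 0}).fillna(-1).astype(int)
    return out


claims = pd.DataFrame({"EL_SEX_CD": ["F", "M", "U", None]})
flagged = add_female_flag(claims)
```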

cache_results(repartition: bool = False) → None[source]¶

Save results in intermediate steps of some lengthy processing. Saving intermediate results speeds up processing

Parameters:: repartition (bool, default=False) – Repartition the dask dataframe

calculate_payment() → None[source]¶

Calculates payment amount.

New Column(s):

pymt_amt - “MDCD_PYMT_AMT” + “TP_PYMT_AMT”

clean() → None[source]¶: Cleaning routines to process date and gender columns

clean_diag_codes() → None[source]¶: Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case

clean_proc_codes() → None[source]¶: Clean procedure code columns by removing non-alphanumeric characters and converting them to upper case
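
The cleaning rule shared by clean_diag_codes and clean_proc_codes (drop non-alphanumeric characters, then upper-case) can be illustrated with a small standalone function. The name `clean_code` is hypothetical; the library applies the rule to dask dataframe columns, which this sketch does not attempt to reproduce.

```python
import re


def clean_code(code: str) -> str:
    """Strip non-alphanumeric characters and upper-case a diagnosis/procedure code."""
    return re.sub(r"[^A-Za-z0-9]", "", code).upper()


cleaned = [clean_code(c) for c in ["v45.81", " 250-00", "e11.9 "]]
```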

export(dest_folder: str, output_format: str = 'csv', repartition: bool = False) → None[source]¶

Exports the files.

Parameters:

dest_folder (str) – Destination folder
output_format (str, default='csv') – Export format (‘csv’ or ‘parquet’)
repartition (bool, default=False) – Repartition the dask dataframe

flag_ed_use() → None[source]¶

Detects ED use in claims.

New Column(s):

ed_cpt - 0 or 1, Claim has a procedure billed in ED code range (99281–99285)
(PRCDR_CD_SYS_{1-6} == 01 & PRCDR_CD_{1-6} in (99281–99285))

ed_ub92 - 0 or 1, Claim has an ED revenue center code (0450 - 0459, 0981) in UB_92_REV_CD_GP_{1-23}

ed_tos - 0 or 1, Claim has an outpatient type of service (MAX_TOS = 11) (if ftype == ‘ip’)

ed_pos - 0 or 1, Claim has PLC_OF_SRVC_CD set to 23 (if ftype == ‘ot’)

ed_use - any of ed_cpt, ed_ub92, ed_tos or ed_pos is 1

any_ed - 0 or 1, 1 when any other claim from the same visit has ed_use set to 1 (if ftype == ‘ot’)

Uses the below as reference:

If the patient is a Medicare beneficiary, the general surgeon should bill the level of ED code (99281-99285) (https://web.archive.org/web/20231125185256/https://bulletin.facs.org/2013/02/coding-for-hospital-admission/)

Inpatient files: Revenue Center Codes 0450-0459, 0981 (https://web.archive.org/web/20210303085851/https://www.resdac.org/resconnect/articles/144)
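
A minimal pandas sketch of the flag logic listed for flag_ed_use, under several simplifying assumptions: only the first procedure and revenue code columns are checked (the method scans PRCDR_CD_{1-6} and UB_92_REV_CD_GP_{1-23}), codes are stored as strings, MAX_TOS and PLC_OF_SRVC_CD are integers, and the visit-level any_ed flag is omitted. The function below is illustrative and is not the library's implementation.

```python
import pandas as pd

ED_CPT_CODES = {str(code) for code in range(99281, 99286)}  # 99281-99285
ED_REV_CODES = {f"{code:04d}" for code in range(450, 460)} | {"0981"}


def flag_ed_use(df: pd.DataFrame, ftype: str) -> pd.DataFrame:
    """Add ed_cpt, ed_ub92, ed_tos/ed_pos and ed_use columns to a copy of df."""
    out = df.copy()
    out["ed_cpt"] = (
        (out["PRCDR_CD_SYS_1"] == "01") & out["PRCDR_CD_1"].isin(ED_CPT_CODES)
    ).astype(int)
    out["ed_ub92"] = out["UB_92_REV_CD_GP_1"].isin(ED_REV_CODES).astype(int)
    flags = ["ed_cpt", "ed_ub92"]
    if ftype == "ip":
        out["ed_tos"] = (out["MAX_TOS"] == 11).astype(int)
        flags.append("ed_tos")
    elif ftype == "ot":
        out["ed_pos"] = (out["PLC_OF_SRVC_CD"] == 23).astype(int)
        flags.append("ed_pos")
    # ed_use is 1 when any of the component flags is 1
    out["ed_use"] = out[flags].max(axis=1)
    return out


ot_claims = pd.DataFrame({
    "PRCDR_CD_SYS_1": ["01", "01", "06"],
    "PRCDR_CD_1": ["99283", "99213", "99283"],
    "UB_92_REV_CD_GP_1": ["0300", "0450", "0300"],
    "PLC_OF_SRVC_CD": [11, 11, 11],
})
ot_flagged = flag_ed_use(ot_claims, "ot")
```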

classmethod get_claim_instance(claim_type: str, *args: Any, **kwargs: Any) → MAXFile[source]¶

Returns an instance of the requested claim type

Parameters:

claim_type ({'ip', 'ot', 'cc', 'rx'}) – Claim type
*args (list) – List of position arguments
**kwargs (dict) – Dictionary of keyword arguments

pq_export(dest_path_and_fname: str, repartition: bool = False) → DataFrame[source]¶

Export parquet files (overwrite safe)

Parameters:

dest_path_and_fname (str) – Destination path
repartition (bool, default=False) – Repartition the dask dataframe

preprocess() → None[source]¶: Add basic constructed variables

process_date_cols() → None[source]¶

Convert datetime columns to datetime type and add basic date based constructed variables

New columns:

birth_year, birth_month, birth_day - date components of EL_DOB (date of birth)

birth_date - birth date (EL_DOB)

death - 0 or 1, if EL_DOD or MDCR_DOD is not empty and falls in the claim year or before

age - age in years, integer format

age_day - age in days

adult - 0 or 1, 1 when patient’s age is in [18,115]

child - 0 or 1, 1 when patient’s age is in [0,17]

if ftype == ‘ip’:

Clean/ impute admsn_date and add ip duration related columns

New column(s):

admsn_date - Admission date (ADMSN_DT)

srvc_bgn_date - Service begin date (SRVC_BGN_DT)

srvc_end_date - Service end date (SRVC_END_DT)

prncpl_proc_date - Principal procedure date (PRNCPL_PRCDR_DT)

missing_admsn_date - 0 or 1, 1 when admission date is missing

missing_prncpl_proc_date - 0 or 1, 1 when principal procedure date is missing

flag_admsn_miss - 0 or 1, 1 when admsn_date was imputed

los - ip service duration in days

ageday_admsn - age in days as on admsn_date

age_admsn - age in years, with decimals, as on admsn_date

age_prncpl_proc - age in years as on principal procedure date

age_day_prncpl_proc - age in days as on principal procedure date

if ftype == ‘ot’:

Adds duration column, provided service end and begin dates are clean

New Column(s):

srvc_bgn_date - Service begin date (SRVC_BGN_DT)

srvc_end_date - Service end date (SRVC_END_DT)

diff & duration - duration of service in days

age_day_srvc_bgn - age in days as on service begin date

age_srvc_bgn - age in years, with decimals, as on service begin date

if ftype == ‘ps’:

New Column(s):

date_of_death - Date of death (EL_DOD)

medicare_date_of_death - Medicare date of death (MDCR_DOD)
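
The common age columns listed for process_date_cols (birth_date, age, adult, child) can be sketched as follows. Two points are assumptions, not taken from the documentation: EL_DOB is treated as a YYYYMMDD string, and age is measured on 31 December of the claim year, which reduces to claim year minus birth year. The helper name is hypothetical.

```python
import pandas as pd


def add_age_columns(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Add birth_date, age, adult and child columns to a copy of df."""
    out = df.copy()
    out["birth_date"] = pd.to_datetime(out["EL_DOB"], format="%Y%m%d", errors="coerce")
    # Assumption: age as of 31 December of the claim year
    out["age"] = year - out["birth_date"].dt.year
    # Missing birth dates yield NaN ages, which fall outside both ranges
    out["adult"] = out["age"].between(18, 115).astype(int)
    out["child"] = out["age"].between(0, 17).astype(int)
    return out


benes = pd.DataFrame({"EL_DOB": ["19800615", "20050101", None]})
aged = add_age_columns(benes, 2012)
```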

medicaid_utils.preprocessing.max_ip module¶

This module has MAXIP class which wraps together cleaning/ preprocessing routines specific for MAX IP files

class medicaid_utils.preprocessing.max_ip.MAXIP(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: MAXFile

clean() → None[source]¶: Runs cleaning routines and adds common exclusion flags based on default filters

flag_common_exclusions() → None[source]¶

flag_duplicates() → None[source]¶

flag_ip_overlaps() → None[source]¶

Identifies duplicate/ overlapping claims. When several/ overlapping claims exist with the same MSIS_ID, claim with the largest payment amount is retained.

New Column(s):

flag_ip_undup - 0 or 1, 1 when row is not a duplicate

flag_ip_dup_drop - 0 or 1, 1 when row is duplicate and must be dropped

flag_ip_overlap_drop - 0 or 1, 1 when row overlaps with another claim

ip_incl - 0 or 1, 1 when row is clean (flag_ip_dup_drop = 0 & flag_ip_overlap_drop = 0) and has los > 0

Parameters:: dd.DataFrame (df)
Return type:: None
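
One way to express the overlap rule described for flag_ip_overlaps (keep the claim with the largest payment among overlapping claims of the same MSIS_ID) is sketched below. The exact overlap definition is an assumption: claims are grouped into an episode when a claim's admsn_date is on or before the latest srvc_end_date seen so far for that beneficiary. The `episode` column and the function name are hypothetical, and duplicate handling is not covered.

```python
import pandas as pd


def flag_overlaps(df: pd.DataFrame) -> pd.DataFrame:
    """Flag all but the highest-paid claim within each run of overlapping stays."""
    out = df.sort_values(["MSIS_ID", "admsn_date", "srvc_end_date"]).copy()
    # Latest end date among earlier claims of the same beneficiary
    prev_end = out.groupby("MSIS_ID")["srvc_end_date"].transform(
        lambda s: s.cummax().shift()
    )
    new_episode = prev_end.isna() | (out["admsn_date"] > prev_end)
    out["episode"] = new_episode.astype(int).groupby(out["MSIS_ID"]).cumsum()
    # Within each episode, retain the claim with the largest payment amount
    keep_idx = out.groupby(["MSIS_ID", "episode"])["pymt_amt"].idxmax()
    out["flag_ip_overlap_drop"] = (~out.index.isin(keep_idx)).astype(int)
    return out.sort_index()


ip_claims = pd.DataFrame({
    "MSIS_ID": ["A", "A", "A", "B"],
    "admsn_date": pd.to_datetime(["2012-01-01", "2012-01-03", "2012-02-01", "2012-01-04"]),
    "srvc_end_date": pd.to_datetime(["2012-01-05", "2012-01-08", "2012-02-02", "2012-01-06"]),
    "pymt_amt": [100.0, 300.0, 50.0, 80.0],
})
ip_flagged = flag_overlaps(ip_claims)
```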

preprocess() → None[source]¶: Adds payment, ed use, and overlap flags

medicaid_utils.preprocessing.max_ot module¶

This module has MAXOT class which wraps together cleaning/ preprocessing routines specific for MAX OT files

class medicaid_utils.preprocessing.max_ot.MAXOT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: MAXFile

add_ot_flags() → None[source]¶

Assign flags for IP, OT and ED calculation, based on a hierarchical principle: IP first, then ED, and then OT. Marks claims that have overlapping IP claims, have ED services, or have only OT services.

New Column(s):

ip_incl - 0 or 1, 1 when has no dental and transport claims, and has an overlapping IP claim

ed_incl - 0 or 1, 1 when has no dental and transport claims, has no overlapping IP claim, and has an ED service in any visits corresponding to this claim

ot_incl - 0 or 1, 1 when has no dental and transport claims, has no overlapping IP claim, and has no ED service in any visits corresponding to this claim

flag_drop - 0 or 1, 1 when ip_incl, ed_incl and ot_incl are all null
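
The IP > ED > OT hierarchy used by add_ot_flags can be illustrated with boolean masks. This sketch assumes the input already carries 0/1 columns named dental, transport, overlap and any_ed (names taken from the other methods on this page), and it sets flag_drop to 1 when none of the three inclusion flags is set, using 0 where the description says null.

```python
import pandas as pd


def add_ot_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Assign mutually exclusive ip_incl / ed_incl / ot_incl flags to a copy of df."""
    out = df.copy()
    eligible = (out["dental"] == 0) & (out["transport"] == 0)
    has_ip = out["overlap"] == 1
    has_ed = out["any_ed"] == 1
    out["ip_incl"] = (eligible & has_ip).astype(int)
    out["ed_incl"] = (eligible & ~has_ip & has_ed).astype(int)
    out["ot_incl"] = (eligible & ~has_ip & ~has_ed).astype(int)
    # Claims that fall into none of the three categories are marked for dropping
    out["flag_drop"] = (out[["ip_incl", "ed_incl", "ot_incl"]].sum(axis=1) == 0).astype(int)
    return out


ot_claims = pd.DataFrame({
    "dental": [0, 0, 0, 1],
    "transport": [0, 0, 0, 0],
    "overlap": [1, 0, 0, 1],
    "any_ed": [1, 1, 0, 0],
})
ot_flags = add_ot_flags(ot_claims)
```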

clean() → None[source]¶: Runs cleaning routines and adds common exclusion flags based on default filters

find_ot_ip_overlaps(df_ip: DataFrame) → None[source]¶

Checks for OT claims that have an overlapping IP claim New Column(s):

overlap - 0 or 1, 1 when OT claim has an overlapping IP claim

Parameters:: df_ip (DataFrame) – IP DataFrame

flag_common_exclusions() → None[source]¶

flag_dental() → None[source]¶

Flag dental claims.

New Column(s):

dental_TOS - 0 or 1, 1 when MAX_TOS = 9

dental_PRCDR - 0 or 1, 1 when PRCDR_CD starts with ‘D’

dental - 0 or 1, 1 when any of dental_TOS or dental_PRCDR
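
The dental flags can be reproduced on a toy frame as below. This is a sketch that assumes MAX_TOS is stored as an integer and PRCDR_CD as a non-missing string; the library operates on dask dataframes and may handle types differently.

```python
import pandas as pd

claims = pd.DataFrame({
    "MAX_TOS": [9, 11, 12],
    "PRCDR_CD": ["D0120", "99213", "D1110"],
})
claims["dental_TOS"] = (claims["MAX_TOS"] == 9).astype(int)
claims["dental_PRCDR"] = claims["PRCDR_CD"].str.startswith("D").astype(int)
# dental is 1 when either component flag is 1
claims["dental"] = claims[["dental_TOS", "dental_PRCDR"]].max(axis=1)
```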

flag_duplicates() → None[source]¶

flag_em() → None[source]¶

Flag claim if procedure code belongs to E/M category.

New Column(s):

EM - 0 or 1, 1 when PRCDR_CD in [99201, 99215] or [99301, 99350]
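
Reading the two bracketed intervals as inclusive numeric ranges (an interpretation, since the notation is not spelled out), the E/M check can be sketched as a small function. The name `is_em_code` is hypothetical.

```python
def is_em_code(code: str) -> int:
    """Return 1 when a 5-digit code falls in 99201-99215 or 99301-99350, else 0."""
    if len(code) != 5 or not code.isdigit():
        return 0
    value = int(code)
    return int(99201 <= value <= 99215 or 99301 <= value <= 99350)


em_flags = [is_em_code(c) for c in ["99213", "99283", "99305", "D0120"]]
```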

flag_ip_overlaps_and_ed(df_ip: DataFrame) → None[source]¶

Adds flags to indicate overlaps with IP claims

Parameters:: df_ip (pd.DataFrame) – IP claim dataframe

flag_transport() → None[source]¶

Flag transport claims.

New Column(s):

transport_TOS - 0 or 1, 1 when MAX_TOS = 26

transport_POS - 0 or 1, 1 when PLC_OF_SRVC_CD is 41 or 42

transport - 0 or 1, 1 when any of transport_TOS or transport_POS
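
A short pandas sketch of the transport flags, assuming MAX_TOS and PLC_OF_SRVC_CD are stored as integers (the stored types are not stated on this page):

```python
import pandas as pd

claims = pd.DataFrame({
    "MAX_TOS": [26, 11, 8],
    "PLC_OF_SRVC_CD": [11, 41, 23],
})
claims["transport_TOS"] = (claims["MAX_TOS"] == 26).astype(int)
claims["transport_POS"] = claims["PLC_OF_SRVC_CD"].isin([41, 42]).astype(int)
# transport is 1 when either component flag is 1
claims["transport"] = claims[["transport_TOS", "transport_POS"]].max(axis=1)
```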

preprocess() → None[source]¶: Adds payment, ed use, transport, dental, and em flags

medicaid_utils.preprocessing.max_ps module¶

This module has MAXPS class which wraps together cleaning/ preprocessing routines specific for MAX PS files

class medicaid_utils.preprocessing.max_ps.MAXPS(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, rural_method: str = 'ruca', tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: MAXFile

Scripts to preprocess PS file

add_eligibility_status_columns() → None[source]¶

Add eligibility columns based on MAX_ELG_CD_MO_{month} values for each month. The MAX_ELG_CD_MO codes 00 (NOT ELIGIBLE) and 99 (UNKNOWN ELIGIBILITY) denote ineligibility.

New Column(s):

elg_mon_{month} - 0 or 1 value column, denoting eligibility for each month

total_elg_mon - No. of eligible months

elg_full_year - 0 or 1 value column, 1 if total_elg_mon = 12

elg_over_9mon - 0 or 1 value column, 1 if total_elg_mon >= 9

elg_over_6mon - 0 or 1 value column, 1 if total_elg_mon >= 6

elg_cont_6mon - 0 or 1 value column, 1 if patient has 6 continuous eligible months

mas_elg_change - 0 or 1 value column, 1 if patient had multiple mas group memberships during claim year

mas_assignments - comma separated list of MAS assignments

boe_assignments - comma separated list of BOE assignments

dominant_boe_group - BOE status held for the most number of months

boe_elg_change - 0 or 1 value column, 1 if patient had multiple boe group memberships during claim year

child_boe_elg_change - 0 or 1 value column, 1 if patient had multiple boe group memberships during claim year

elg_change - 0 or 1 value column, 1 if patient had multiple eligibility group memberships during claim year

eligibility_aged - Eligibility as aged anytime during the claim year

eligibility_child - Eligibility as child anytime during the claim year

max_gap - Maximum gap in enrollment in months

max_cont_enrollment - Maximum duration of continuous enrollment
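
The month-count columns for add_eligibility_status_columns can be sketched as follows. The sketch assumes the monthly columns are named MAX_ELG_CD_MO_1 through MAX_ELG_CD_MO_12 and hold two-character string codes; only the counting columns are shown, and the MAS/BOE columns are omitted. Helper names are hypothetical.

```python
import pandas as pd

MONTH_COLS = [f"MAX_ELG_CD_MO_{m}" for m in range(1, 13)]
INELIGIBLE_CODES = ["00", "99"]


def longest_run(flags) -> int:
    """Length of the longest consecutive run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def add_eligibility_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    eligible = ~out[MONTH_COLS].isin(INELIGIBLE_CODES)
    for month, col in enumerate(MONTH_COLS, start=1):
        out[f"elg_mon_{month}"] = eligible[col].astype(int)
    out["total_elg_mon"] = eligible.sum(axis=1)
    out["elg_full_year"] = (out["total_elg_mon"] == 12).astype(int)
    out["elg_over_9mon"] = (out["total_elg_mon"] >= 9).astype(int)
    out["elg_over_6mon"] = (out["total_elg_mon"] >= 6).astype(int)
    out["max_cont_enrollment"] = eligible.apply(longest_run, axis=1)
    out["elg_cont_6mon"] = (out["max_cont_enrollment"] >= 6).astype(int)
    # Longest run of ineligible months
    out["max_gap"] = (~eligible).apply(longest_run, axis=1)
    return out


ps = pd.DataFrame(
    [["11"] * 12, ["11"] * 4 + ["00"] * 3 + ["11"] * 5],
    columns=MONTH_COLS,
)
elig = add_eligibility_columns(ps)
```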

clean() → None[source]¶: Runs cleaning routines and adds common exclusion flags based on default filters

flag_common_exclusions() → None[source]¶

Adds exclusion flags New Column(s):

excl_duplicated_bene_id - 0 or 1, 1 when bene’s index column is repeated

flag_duals() → None[source]¶

Flags dual patients.

New column(s):

dual - 0 or 1 column, 0 if 0 <= EL_MDCR_DUAL_ANN <= 9 for years 2007, 2009, 2011; 0 <= EL_MDCR_DUAL_ANN <= 9 for other years

flag_restricted_benefits() → None[source]¶

Checks individual’s eligibility for various medicaid services, based on EL_RSTRCT_BNFT_FLG_{month} values:

1 = full scope; INDIVIDUAL IS ELIGIBLE FOR MEDICAID DURING THE MONTH AND IS ENTITLED TO THE FULL SCOPE OF MEDICAID BENEFITS.

2 = alien; INDIVIDUAL IS ELIGIBLE FOR MEDICAID DURING THE MONTH BUT ONLY ENTITLED TO RESTRICTED BENEFITS BASED ON ALIEN STATUS

3 = dual

4 = pregnancy

5 = other, eg. substance abuse, medically needy

6 = family planning

7 = alternative package of benchmark equivalent coverage, 2011 data had no values of 7 and 8

8 = “money follows the person” rebalancing demonstration, 2011 data had no values of 7 and 8

9 = unknown

A = Psychiatric residential treatments demonstration

B = Health Opportunity Account

C = CHIP dental coverage, supplemental to employer sponsored insurance

W = Medicaid health insurance premium payment assistance (MA, NJ, VT, OK)

X = rx drug

Y = drug and dual

Z = drug and dual, but Medicaid was not paying for the benefits.

Benefits are non-comprehensive (restricted) when EL_RSTRCT_BNFT_FLG_{month} has any of the below values:

“2”, “3”, “6”: for states other than “AR”, “ID”, “SD”

“2”, “4”, “3”, “6”: for states “AR”, “ID”, “SD”

New column(s):

any_restricted_benefit_month: 0 or 1, 1 when bene’s benefits are restricted for at least 1 month

restricted_benefit_months: Number of restricted benefit months

restricted_benefits: 0 or 1, 1 when the number of restricted benefit months is more than the number of months the bene was enrolled in medicaid
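
The state-dependent code sets for flag_restricted_benefits can be sketched as below. The monthly column names EL_RSTRCT_BNFT_FLG_1 through EL_RSTRCT_BNFT_FLG_12 and string-typed codes are assumptions, and only the two month-count columns are derived; the restricted_benefits column, which also depends on enrollment months, is left out. Helper names are hypothetical.

```python
import pandas as pd

MONTH_COLS = [f"EL_RSTRCT_BNFT_FLG_{m}" for m in range(1, 13)]


def restricted_codes(state: str) -> list:
    """Codes treated as non-comprehensive benefits for the given state."""
    base = ["2", "3", "6"]
    # AR, ID and SD additionally treat pregnancy-only coverage ("4") as restricted
    return base + ["4"] if state in {"AR", "ID", "SD"} else base


def flag_restricted(df: pd.DataFrame, state: str) -> pd.DataFrame:
    out = df.copy()
    restricted = out[MONTH_COLS].isin(restricted_codes(state))
    out["restricted_benefit_months"] = restricted.sum(axis=1)
    out["any_restricted_benefit_month"] = restricted.any(axis=1).astype(int)
    return out


ps = pd.DataFrame([["1"] * 10 + ["4"] * 2, ["1"] * 12], columns=MONTH_COLS)
flags_al = flag_restricted(ps, "AL")
flags_ar = flag_restricted(ps, "AR")
```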

flag_rural(method: str = 'ruca') → None[source]¶

Classifies benes into rural/ non-rural on the basis of RUCA/ RUCC of their resident ZIP/ FIPS codes

New Columns:

resident_state_cd

rural - 0/ 1/ -1, 1 when bene’s residence is in a rural location, 0 when not. -1 when zip code is missing

pcsa - resident PCSA code

{ruca_code/ rucc_code} - resident ruca_code

This function uses

RUCA 3.1 dataset (from https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/). RUCA codes >= 4 denote rural and the rest denote urban as per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6286055/#SD1
RUCC codes were downloaded from https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/. RUCC codes >= 8 denote rural and the rest denote urban.
ZCTAs x zipcode crosswalk from UDSMapper (https://udsmapper.org/zip-code-to-zcta-crosswalk/),
zipcodes from multiple sources
distance between centroids of zipcodes using NBER data (https://nber.org/distance/2016/gaz/zcta5/gaz2016zcta5centroid.csv)

Parameters:: method ({'ruca', 'rucc'}) – Method to use for rural variable construction
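
The thresholds quoted above (RUCA >= 4 or RUCC >= 8 denote rural) can be applied as in the sketch below, which assumes the RUCA/RUCC code has already been merged onto each row. The column name `resident_zip` is hypothetical, and the crosswalk and centroid-distance steps are not reproduced.

```python
import pandas as pd

THRESHOLDS = {"ruca": ("ruca_code", 4), "rucc": ("rucc_code", 8)}


def flag_rural(df: pd.DataFrame, method: str = "ruca") -> pd.DataFrame:
    """Add a rural column: 1 rural, 0 non-rural, -1 when the zip code is missing."""
    code_col, threshold = THRESHOLDS[method]
    out = df.copy()
    out["rural"] = (out[code_col] >= threshold).astype(int)
    # 'resident_zip' is a hypothetical column name used for illustration
    out.loc[out["resident_zip"].isna(), "rural"] = -1
    return out


benes = pd.DataFrame({
    "resident_zip": ["60637", "59001", "82001", None],
    "ruca_code": [1.0, 4.0, 10.0, None],
})
rural_flags = flag_rural(benes, method="ruca")
```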

flag_tanf() → None[source]¶

The Temporary Assistance for Needy Families (TANF) program provides temporary financial assistance for pregnant women and families with one or more dependent children. This provides financial assistance to help pay for food, shelter, utilities, and expenses other than medical. In MAX files this is identified via EL_TANF_CASH_FLG:

1 = INDIVIDUAL DID NOT RECEIVE TANF BENEFITS DURING THE MONTH;
2 = INDIVIDUAL DID RECEIVE TANF BENEFITS DURING THE MONTH.

CO and ID have either 0 or 9

New Column(s):

tanf : 0 or 1, denoting usage of TANF benefits in any of the months
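
A minimal sketch of the tanf flag, assuming twelve monthly columns named EL_TANF_CASH_FLG_1 through EL_TANF_CASH_FLG_12 holding string codes. The monthly suffix is an assumption; the page only names EL_TANF_CASH_FLG.

```python
import pandas as pd

# Hypothetical monthly column names
TANF_COLS = [f"EL_TANF_CASH_FLG_{m}" for m in range(1, 13)]

ps = pd.DataFrame(
    [["1"] * 12, ["1"] * 5 + ["2"] + ["1"] * 6, ["9"] * 12],
    columns=TANF_COLS,
)
# Code "2" means the individual received TANF benefits during that month
ps["tanf"] = (ps[TANF_COLS] == "2").any(axis=1).astype(int)
```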

preprocess(rural_method: str = 'ruca') → None[source]¶

Adds rural, eligibility criteria, dual, and restricted benefits indicator variables

Parameters:: rural_method ({'ruca', 'rucc'}) – Method to use for rural variable construction

medicaid_utils.preprocessing.taf_file module¶

This module has the TAFFile class, which is the base class for all TAF file type classes

class medicaid_utils.preprocessing.taf_file.TAFFile(ftype: str, year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: object

Parent class for all TAF file classes, each of which will have clean and preprocess functions

add_custom_subtype(subtype_name: str, df_file: DataFrame) → None[source]¶

Add custom subtype file to claim object.

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = dd.from_pandas(pd.DataFrame({'col': [1]}), npartitions=1)
>>> taf.add_custom_subtype('my_subtype', df)

cache_results(subtype: str | None = None, repartition: bool = False) → None[source]¶

Save results in intermediate steps of some lengthy processing. Saving intermediate results speeds up processing, and avoids dask cluster crashes for large datasets

Parameters:

subtype (str, default=None) – File type. Eg. ‘header’. If empty, all subtypes will be cached
repartition (bool, default=False) – Repartition the dask dataframe

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', tmp_folder='/tmp/cache')
>>> taf.cache_results(subtype='base')

clean() → None[source]¶

Cleaning routines to process date and gender columns, and add duplicate check flags.

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.clean()

clean_codes() → None[source]¶

Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case and NDC codes columns by removing white space characters and padding 0s to the left so the codes are of length 12.

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.clean_codes()

clean_diag_codes() → None[source]¶

Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'DGNS_CD_1': ['a12.3', 'B45-6'],
...                     'other_col': [1, 2]})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'base': ddf}
>>> obj.clean_diag_codes()
>>> result = obj.dct_files['base'].compute()
>>> list(result['DGNS_CD_1'])
['A123', 'B456']

clean_ndc_codes() → None[source]¶

Clean NDC codes columns by removing white space characters and padding 0s to the left so the codes are of length 12.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'NDC': ['1234', ' 5678 ']})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'line': ddf}
>>> obj.clean_ndc_codes()
>>> result = obj.dct_files['line'].compute()
>>> list(result['NDC'])
['000000001234', '000000005678']

clean_proc_codes() → None[source]¶

Clean procedure code columns by removing non-alphanumeric characters and converting them to upper case.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'PRCDR_CD_1': ['ab.1', 'C2-d'],
...                     'PRCDR_CD_SYS_1': ['ICD', 'CPT']})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'base': ddf}
>>> obj.clean_proc_codes()
>>> result = obj.dct_files['base'].compute()
>>> list(result['PRCDR_CD_1'])
['AB1', 'C2D']
>>> list(result['PRCDR_CD_SYS_1'])
['ICD', 'CPT']

export(dest_folder: str, output_format: str = 'csv', repartition: bool = False) → None[source]¶

Exports the files.

Parameters:

dest_folder (str) – Destination folder
output_format (str, default='csv') – Export format (‘csv’ or ‘parquet’)
repartition (bool, default=False) – Repartition the dask dataframe

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.export('/tmp/output', output_format='csv')

flag_duplicates() → None[source]¶

Removes duplicated rows. TAF claims have multiple versions for each month. This function keeps the most recent file version date for each month using the variables IP_VRSN, LT_VRSN, OT_VRSN, and RX_VRSN. Retains only the claims with maximum value of production data run ID (DA_RUN_ID) for each claim ID (CLM_ID).

References

Identifying beneficiaries with a substance use disorder

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.flag_duplicates()
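
The DA_RUN_ID rule described for flag_duplicates (retain, for each CLM_ID, the rows with the maximum production run ID) can be illustrated in plain pandas. This sketch covers only that rule; the file version handling via IP_VRSN and related variables is not shown, and the library itself works on dask dataframes.

```python
import pandas as pd

claims = pd.DataFrame({
    "CLM_ID": ["c1", "c1", "c2"],
    "DA_RUN_ID": [100, 205, 150],
})
# Keep only rows whose DA_RUN_ID equals the maximum for their CLM_ID
is_latest = claims["DA_RUN_ID"] == claims.groupby("CLM_ID")["DA_RUN_ID"].transform("max")
deduped = claims[is_latest].reset_index(drop=True)
```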

flag_ffs_and_encounter_claims() → None[source]¶

Flags claims where CLM_TYPE_CD is equal to one of the following values:

1: A FFS Medicaid or Medicaid-expansion claim
3: Medicaid or Medicaid-expanding managed care encounter record
A: Separate CHIP (Title XXI) FFS claim
C: Separate CHIP (Title XXI) encounter record

References

Identifying beneficiaries with a substance use disorder

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.flag_ffs_and_encounter_claims()
>>> 'ffs_or_encounter_claim' in taf.dct_files['base'].columns
True
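
The CLM_TYPE_CD check behind ffs_or_encounter_claim can be sketched on a toy frame, assuming the codes are stored as strings:

```python
import pandas as pd

# 1 and A are FFS claims; 3 and C are managed care encounter records
FFS_OR_ENCOUNTER_CODES = ["1", "3", "A", "C"]

claims = pd.DataFrame({"CLM_TYPE_CD": ["1", "2", "A", "C", "5"]})
claims["ffs_or_encounter_claim"] = (
    claims["CLM_TYPE_CD"].isin(FFS_OR_ENCOUNTER_CODES).astype(int)
)
```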

gather_bene_level_diag_ndc_codes() → None[source]¶

Constructs patient level NDC and diagnosis code list columns and saves them to individual files.

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.gather_bene_level_diag_ndc_codes()

classmethod get_claim_instance(claim_type: str, *args: Any, **kwargs: Any) → TAFFile[source]¶

Returns an instance of the requested claim type

Parameters:

claim_type ({'ip', 'ot', 'cc', 'rx'}) – Claim type
*args (list) – List of position arguments
**kwargs (dict) – Dictionary of keyword arguments

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> ip_claim = TAFFile.get_claim_instance('ip', 2019, 'AL', '/data/cms')

pq_export(f_subtype: str, dest_path_and_fname: str, repartition: bool = False) → None[source]¶

Export parquet files (overwrite safe)

Parameters:

f_subtype (str) – File type. Eg. ‘header’
dest_path_and_fname (str) – Destination path
repartition (bool, default=False) – Repartition the dask dataframe

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.pq_export('base', '/tmp/output/base')

preprocess() → None[source]¶

Add basic constructed variables.

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', preprocess=False)
>>> taf.preprocess()

process_date_cols() → None[source]¶

Convert datetime columns to datetime type and add basic date based constructed variables

New columns:

birth_year, birth_month, birth_day - date components of EL_DOB (date of birth)

birth_date - birth date (EL_DOB)

death - 0 or 1, if DEATH_DT is not empty and falls in the claim year or before

age - age in years, integer format

age_day - age in days

adult - 0 or 1, 1 when patient’s age is in [18,115]

child - 0 or 1, 1 when patient’s age is in [0,17]

If ftype == ‘ip’:

Clean/ impute admsn_date and add ip duration related columns

New column(s):

admsn_date - Admission date (ADMSN_DT)

srvc_bgn_date - Service begin date (SRVC_BGN_DT)

srvc_end_date - Service end date (SRVC_END_DT)

prncpl_proc_date - Principal procedure date (PRCDR_CD_DT_1)

missing_admsn_date - 0 or 1, 1 when admission date is missing

missing_prncpl_proc_date - 0 or 1, 1 when principal procedure date is missing

flag_admsn_miss - 0 or 1, 1 when admsn_date was imputed

los - ip service duration in days

ageday_admsn - age in days as on admsn_date

age_admsn - age in years, with decimals, as on admsn_date

age_prncpl_proc - age in years as on principal procedure date

age_day_prncpl_proc - age in days as on principal procedure date

if ftype == ‘ot’:

Adds duration column, provided service end and begin dates are clean

New Column(s):

srvc_bgn_date - Service begin date (SRVC_BGN_DT)

srvc_end_date - Service end date (SRVC_END_DT)

diff & duration - duration of service in days

age_day_srvc_bgn - age in days as on service begin date

age_srvc_bgn - age in years, with decimals, as on service begin date

if ftype == ‘ps’:

New Column(s):

date_of_death - Date of death (DEATH_DT)

Examples

>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.process_date_cols()

medicaid_utils.preprocessing.taf_ip module¶

This module has TAFIP class which wraps together cleaning/ preprocessing routines specific for TAF IP files

class medicaid_utils.preprocessing.taf_ip.TAFIP(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: TAFFile

clean() → None[source]¶

Cleaning routines to clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.

Examples

>>> from medicaid_utils.preprocessing.taf_ip import TAFIP
>>> ip = TAFIP(2019, 'AL', '/data/cms', clean=False)
>>> ip.clean()

flag_common_exclusions() → None[source]¶

Adds commonly used IP claim exclusion flag columns. New Columns:

ffs_or_encounter_claim, 0 or 1, 1 when base claim is an FFS or Encounter claim

excl_missing_dob, 0 or 1, 1 when base claim does not have birth date

excl_missing_admsn_date, 0 or 1, 1 when base claim does not have admission date

excl_missing_prncpl_proc_date, 0 or 1, 1 when base claim does not have principal procedure date

Examples

>>> from medicaid_utils.preprocessing.taf_ip import TAFIP
>>> ip = TAFIP(2019, 'AL', '/data/cms', clean=False)
>>> ip.flag_common_exclusions()

preprocess() → None[source]¶

Add basic constructed variables.

Examples

>>> from medicaid_utils.preprocessing.taf_ip import TAFIP
>>> ip = TAFIP(2019, 'AL', '/data/cms', preprocess=False)
>>> ip.preprocess()

medicaid_utils.preprocessing.taf_lt module¶

This module has TAFLT class which wraps together cleaning/ preprocessing routines specific for TAF LT files

class medicaid_utils.preprocessing.taf_lt.TAFLT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: TAFFile

clean() → None[source]¶

Cleaning routines to clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.

Examples

>>> from medicaid_utils.preprocessing.taf_lt import TAFLT
>>> lt = TAFLT(2019, 'AL', '/data/cms', clean=False)
>>> lt.clean()

preprocess() → None[source]¶

Add basic constructed variables.

Examples

>>> from medicaid_utils.preprocessing.taf_lt import TAFLT
>>> lt = TAFLT(2019, 'AL', '/data/cms', preprocess=False)
>>> lt.preprocess()

medicaid_utils.preprocessing.taf_ot module¶

This module has TAFOT class which wraps together cleaning/ preprocessing routines specific for TAF OT files

class medicaid_utils.preprocessing.taf_ot.TAFOT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: TAFFile

clean() → None[source]¶

Cleaning routines to clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.

Examples

>>> from medicaid_utils.preprocessing.taf_ot import TAFOT
>>> ot = TAFOT(2019, 'AL', '/data/cms', clean=False)
>>> ot.clean()

flag_common_exclusions() → None[source]¶

Adds commonly used OT claim exclusion flag columns. New Columns:

ffs_or_encounter_claim, 0 or 1, 1 when base claim is an FFS or Encounter claim

excl_missing_dob, 0 or 1, 1 when base claim does not have birth date

excl_missing_srvc_bgn_date, 0 or 1, 1 when base claim does not have service begin date

Examples

>>> from medicaid_utils.preprocessing.taf_ot import TAFOT
>>> ot = TAFOT(2019, 'AL', '/data/cms', clean=False)
>>> ot.flag_common_exclusions()

preprocess() → None[source]¶

Add basic constructed variables.

Examples

>>> from medicaid_utils.preprocessing.taf_ot import TAFOT
>>> ot = TAFOT(2019, 'AL', '/data/cms', preprocess=False)
>>> ot.preprocess()

medicaid_utils.preprocessing.taf_ps module¶

This module has TAFPS class which wraps together cleaning/ preprocessing routines specific for TAF PS files

class medicaid_utils.preprocessing.taf_ps.TAFPS(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, rural_method: str = 'ruca', tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: TAFFile

Scripts to preprocess PS file

add_gender() → None[source]¶

Adds integer ‘female’ column based on ‘SEX_CD’ column. Undefined values (‘U’) in SEX_CD column will result in female column taking the value -1.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.add_gender()
>>> 'female' in ps.dct_files['base'].columns
True

add_mas_boe() → None[source]¶

Adds columns denoting number of months in each Maintenance Assistance Status (MAS) and Basis of Eligibility (BOE) category. Columns added are,

boe_chip_months : Number of months in Separate-CHIP BOE category

boe_aged_months : Number of months in Aged BOE category

boe_blind_disabled_months : Number of months in Blind/Disabled BOE category

boe_child_months : Number of months in Children BOE category

boe_adults_months : Number of months in Adult BOE category

boe_breast_and_cervical_cancer_months : Number of months in Breast and Cervical Cancer Prevention and Treatment Act of 2000 BOE category

boe_child_of_unemployed_months : Number of months in Child of Unemployed Adult BOE category

boe_unemployed_months : Number of months in Unemployed Adult BOE category

boe_foster_care_children_months : Number of months in Foster Care Children BOE category

boe_unknown_months : Number of months in Unknown BOE category

mas_chip_months : Number of months in Separate-CHIP MAS category

mas_cash_sec_1931_months : Number of months in Individuals receiving cash assistance or eligible under section 1931 of the Act MAS category

mas_medically_needy_months : Number of months in Medically Needy MAS category

mas_poverty_months : Number of months in Poverty Related Eligibles MAS category

mas_other_months : Number of months in Other Eligibles MAS category

mas_demonstration_months : Number of months in Section 1115 Demonstration expansion eligible MAS category

mas_unknown_months : Number of months in Unknown MAS category

max_mas_type : Top MAS category for the bene

max_boe_type : Top BOE category for the bene

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.add_mas_boe()

add_risk_adjustment_scores() → None[source]¶

Adds bene level risk adjustment scores. Currently supports Elixhauser scores.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', tmp_folder='/tmp/ps')
>>> ps.add_risk_adjustment_scores()

clean() → None[source]¶

Runs cleaning routines and creates common exclusion flags based on default filters.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False)
>>> ps.clean()

compute_enrollment_gaps() → None[source]¶

Computes enrollment gaps using dates file. Adds number of enrollment gaps and length of maximum enrollment gap in days columns.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.compute_enrollment_gaps()
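
The two quantities named for compute_enrollment_gaps (number of gaps and length of the longest gap in days) can be derived from a list of enrollment spans as sketched below. The span representation and the function name are hypothetical, and a gap is assumed to be one or more uncovered days between consecutive spans; the layout of the TAF dates file is not reproduced here.

```python
from datetime import date


def enrollment_gaps(spans):
    """Return (number_of_gaps, max_gap_days) for a list of (start, end) date pairs."""
    ordered = sorted(spans)
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        # Days strictly between the covered period and the next span's start
        gap_days = (start - covered_until).days - 1
        if gap_days > 0:
            gaps.append(gap_days)
        covered_until = max(covered_until, end)
    return len(gaps), max(gaps, default=0)


spans = [
    (date(2019, 1, 1), date(2019, 3, 31)),
    (date(2019, 5, 1), date(2019, 8, 31)),
    (date(2019, 9, 1), date(2019, 12, 31)),
]
n_gaps, max_gap_days = enrollment_gaps(spans)
```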

flag_common_exclusions() → None[source]¶

Adds commonly used exclusion flags

New Column(s):

excl_duplicated_bene_id - 0 or 1, 1 when bene’s index column is repeated

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False)
>>> ps.flag_common_exclusions()
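
A minimal sketch of the duplicate flag described above, using plain Python lists rather than a dask dataframe. It assumes every row sharing a repeated identifier is flagged (not only the later copies); the sample identifiers are made up.

```python
from collections import Counter

# Hypothetical index values; "A1" appears twice.
bene_ids = ["A1", "B2", "A1", "C3"]

# Count occurrences once, then flag each row whose id occurs more than once.
counts = Counter(bene_ids)
excl_duplicated_bene_id = [int(counts[bene_id] > 1) for bene_id in bene_ids]
print(excl_duplicated_bene_id)
```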

flag_dual() → None[source]¶

Flags dual-eligible benes. Benes with DUAL_ELGBL_CD equal to 1 (full dual), 2 (partial dual), or 3 (other dual) in any month are flagged as duals.

References

Identifying beneficiaries with a substance use disorder

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_dual()
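
The "any month" rule above can be sketched as follows. The sketch assumes the twelve monthly codes are already integers (or None when missing); the helper name `is_dual` is hypothetical, and the actual column encoding in the files may differ.

```python
# Codes treated as dual eligibility, per the description above.
DUAL_CODES = {1, 2, 3}


def is_dual(monthly_codes):
    """Return 1 if any monthly code marks the bene as a dual, else 0."""
    return int(any(code in DUAL_CODES for code in monthly_codes))


# Made-up examples: a bene with one partial-dual month, and one with none.
print(is_dual([0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
print(is_dual([0, None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
```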

flag_ffs_months() → None[source]¶

Creates flags for months enrolled in Medicaid without enrollment in managed care plans of 3 categories, and adds columns denoting total number of months enrolled in these plans and the enrollment sequence pattern.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_ffs_months()

flag_managed_care_months() → None[source]¶

Creates flags for 3 categories of managed care plans for each month, and adds columns denoting total number of months enrolled in these plans and the enrollment sequence pattern.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_managed_care_months()

flag_medicaid_enrolled_months() → None[source]¶

Creates flags for Medicaid enrollment for each month and computes the total number of months enrolled in Medicaid. A bene must be enrolled for all days of a month, with no missing eligibility information, for that month to count as a Medicaid-enrolled month.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_medicaid_enrolled_months()
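
The full-month requirement can be illustrated with a short sketch. It assumes a per-month count of enrolled days is available (None when eligibility information is missing); the function name `enrolled_month_flags` and the input layout are assumptions for illustration only.

```python
import calendar


def enrolled_month_flags(year, enrolled_days):
    """Return twelve 0/1 flags, 1 only when every day of the month is covered."""
    flags = []
    for month, days in enumerate(enrolled_days, start=1):
        days_in_month = calendar.monthrange(year, month)[1]
        # Missing information (None) never counts as an enrolled month.
        flags.append(int(days is not None and days == days_in_month))
    return flags


# Made-up data: partial enrollment in March, missing information in April.
days_2019 = [31, 28, 15, None, 31, 30, 31, 31, 30, 31, 30, 31]
flags = enrolled_month_flags(2019, days_2019)
total_enrolled_months = sum(flags)
print(flags, total_enrolled_months)
```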

flag_restricted_benefits() → None[source]¶

Flags beneficiaries whose benefits are restricted. Benes with the below values in their RSTRCTD_BNFTS_CD_XX columns are NOT assumed to have restricted benefits:

1. Individual is eligible for Medicaid or CHIP and entitled to the full scope of Medicaid or CHIP benefits.
4. Individual is eligible for Medicaid or CHIP but only entitled to restricted benefits for pregnancy-related services.
5. Individual is eligible for Medicaid or Medicaid-Expansion CHIP but, for reasons other than alien, dual-eligibility or pregnancy-related status, is only entitled to restricted benefits (e.g., restricted benefits based upon substance abuse, medically needy or other criteria).
7. Individual is eligible for Medicaid and entitled to Medicaid benefits under an alternative package of benchmark-equivalent coverage, as enacted by the Deficit Reduction Act of 2005.

Reference: Identifying beneficiaries with a substance use disorder

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_restricted_benefits()
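
A month-level sketch of the code list above. It only classifies a single monthly code; how monthly results are combined into a bene-level flag is not stated here, so that step is left out. The helper name `month_is_restricted` and the treatment of missing codes (returned as None) are assumptions.

```python
# Codes that are NOT treated as restricted benefits, per the list above.
UNRESTRICTED_CODES = {"1", "4", "5", "7"}


def month_is_restricted(code):
    """Return True/False for a monthly code, or None when the code is missing."""
    if code is None:
        return None
    # Accept either integers or strings by comparing on the string form.
    return str(code).strip() not in UNRESTRICTED_CODES


for code in ("1", "2", 4, None):
    print(code, month_is_restricted(code))
```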

flag_rural(method: str = 'ruca') → None[source]¶

Classifies benes into rural/ non-rural on the basis of RUCA/ RUCC of their resident ZIP/ FIPS codes

New Columns:

resident_state_cd
rural - 0/ 1/ np.nan, 1 when bene’s residence is in a rural location, 0 when not, -1 when zip code is missing
pcsa - resident PCSA code
census_region - resident census region
census_division - resident census division
{ruca_code/ rucc_code} - resident RUCA or RUCC code, depending on method

This function uses

RUCA 3.1 dataset. RUCA codes >= 4 denote rural and the rest denote urban as per Cole, Megan B et al

RUCC codes. RUCC codes >= 8 denote rural and the rest denote urban.

ZCTAs x zipcode crosswalk from UDSMapper.

zipcodes from multiple sources

Distance between centroids of zipcodes using NBER data

Parameters:: method ({'ruca', 'rucc'}) – Method to use for rural variable construction

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_rural(method='ruca')
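
The two thresholds described above (RUCA >= 4 and RUCC >= 8 denote rural) can be sketched as a small lookup. The helper name `classify_rural` is hypothetical, and the sketch returns None for a missing code rather than modeling how the library encodes missing values.

```python
# Minimum code treated as rural for each method, per the description above.
RURAL_THRESHOLDS = {"ruca": 4, "rucc": 8}


def classify_rural(code, method="ruca"):
    """Return 1 for rural, 0 for non-rural, None when the code is missing."""
    if method not in RURAL_THRESHOLDS:
        raise ValueError(f"method must be one of {sorted(RURAL_THRESHOLDS)}")
    if code is None:
        return None
    return int(code >= RURAL_THRESHOLDS[method])


print(classify_rural(2, method="ruca"))
print(classify_rural(4.1, method="ruca"))
print(classify_rural(7, method="rucc"))
print(classify_rural(8, method="rucc"))
```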

flag_tanf() → None[source]¶

TANF_CASH_CD:

1: INDIVIDUAL DID NOT RECEIVE TANF BENEFITS DURING THE YEAR;
2: INDIVIDUAL DID RECEIVE TANF BENEFITS DURING THE YEAR

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_tanf()
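
Based on the TANF_CASH_CD values listed above, a flag could be derived roughly as below. The helper name `tanf_flag`, the 0/1 output, and returning None for a missing code are assumptions for illustration; the column actually added by the method is not documented here.

```python
def tanf_flag(tanf_cash_cd):
    """Return 1 when the code says TANF benefits were received, 0 otherwise.

    A missing code (None) is passed through as None in this sketch.
    """
    if tanf_cash_cd is None:
        return None
    # Code 2 means the individual received TANF benefits during the year.
    return int(str(tanf_cash_cd).strip() == "2")


for code in ("1", "2", 2, None):
    print(code, tanf_flag(code))
```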

gather_bene_level_diag_ndc_codes() → None[source]¶

Constructs patient level NDC and diagnosis code list columns and saves them to individual file.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', tmp_folder='/tmp/ps')
>>> ps.gather_bene_level_diag_ndc_codes()

preprocess(rural_method: str = 'ruca', add_risk_adjustment_scores: bool = False) → None[source]¶

Adds rural and eligibility criteria indicator variables.

Parameters:

rural_method (str, default='ruca') – Method to use for rural classification. Options: ‘ruca’, ‘rucc’.
add_risk_adjustment_scores (bool, default=False) – Whether to add Elixhauser risk adjustment scores.

Examples

>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', preprocess=False)
>>> ps.preprocess(rural_method='ruca')

medicaid_utils.preprocessing.taf_rx module¶

This module has TAFRX class which wraps together cleaning/ preprocessing routines specific for TAF Pharmacy files

class medicaid_utils.preprocessing.taf_rx.TAFRX(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶

Bases: TAFFile

clean() → None[source]¶

Cleaning routines that clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.

Examples

>>> from medicaid_utils.preprocessing.taf_rx import TAFRX
>>> rx = TAFRX(2019, 'AL', '/data/cms', clean=False)
>>> rx.clean()