medicaid_utils.preprocessing package¶
Submodules¶
medicaid_utils.preprocessing.max_cc module¶
This module has MAXCC class which wraps together cleaning/ preprocessing routines specific for MAX CC files
- class medicaid_utils.preprocessing.max_cc.MAXCC(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
MAXFile
Scripts to preprocess CC file
medicaid_utils.preprocessing.max_file module¶
This module has the MAXFile class, which is the base class for all MAX file type classes
- class medicaid_utils.preprocessing.max_file.MAXFile(ftype: str, year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
object
Parent class for all MAX file classes, each of which will have clean and preprocess functions
- add_gender() None[source]¶
Adds integer ‘female’ column based on ‘EL_SEX_CD’ column. Undefined values (‘U’) in EL_SEX_CD column will result in female column taking the value -1
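The mapping described for add_gender can be sketched in plain pandas as below. This is only an illustration, not the package's dask implementation: the helper name is hypothetical, and the F -> 1 / M -> 0 coding is an assumption (the docstring only states that ‘U’ becomes -1).

```python
import pandas as pd


def add_female_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive an integer 'female' column from EL_SEX_CD."""
    sex = df["EL_SEX_CD"].str.strip().str.upper()
    # Assumed coding: F -> 1, M -> 0; anything else ('U', missing) -> -1
    female = sex.map({"F": 1, "M": 0}).fillna(-1).astype(int)
    return df.assign(female=female)


df_gender = add_female_flag(pd.DataFrame({"EL_SEX_CD": ["F", "M", "U", None]}))
print(df_gender["female"].tolist())
```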
- cache_results(repartition: bool = False) None[source]¶
Save results in intermediate steps of some lengthy processing. Saving intermediate results speeds up processing
- Parameters:
repartition (bool, default=False) – Repartition the dask dataframe
- calculate_payment() None[source]¶
Calculates payment amount.
New Column(s):
pymt_amt - “MDCD_PYMT_AMT” + “TP_PYMT_AMT”
- clean_diag_codes() None[source]¶
Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case
- clean_proc_codes() None[source]¶
Clean procedure code columns by removing non-alphanumeric characters and converting them to upper case
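The cleaning rule shared by clean_diag_codes and clean_proc_codes (drop non-alphanumeric characters, then upper-case) could look roughly like this in pandas. The helper name and the explicit column list are assumptions made for the sketch; the package itself operates on dask dataframes and selects the columns internally.

```python
import pandas as pd


def clean_code_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: strip non-alphanumerics and upper-case code columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.replace(r"[^a-zA-Z0-9]+", "", regex=True)  # remove punctuation/spaces
            .str.upper()
        )
    return out


df_codes = clean_code_columns(
    pd.DataFrame({"PRCDR_CD_1": ["99.281", "d-0120"]}), ["PRCDR_CD_1"]
)
print(df_codes["PRCDR_CD_1"].tolist())
```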
- export(dest_folder: str, output_format: str = 'csv', repartition: bool = False) None[source]¶
Exports the files.
- flag_ed_use() None[source]¶
Detects ED use in claims.
New Column(s):
ed_cpt - 0 or 1, 1 when the claim has a procedure billed in the ED code range (99281–99285) (PRCDR_CD_SYS_{1-6} == 01 & PRCDR_CD_{1-6} in (99281–99285))
ed_ub92 - 0 or 1, 1 when the claim has an ED revenue center code (0450-0459, 0981) in UB_92_REV_CD_GP_{1-23}
ed_tos - 0 or 1, 1 when the claim has an outpatient type of service (MAX_TOS = 11) (if ftype == ‘ip’)
ed_pos - 0 or 1, 1 when the claim has PLC_OF_SRVC_CD set to 23 (if ftype == ‘ot’)
ed_use - 0 or 1, 1 when any of ed_cpt, ed_ub92, ed_tos or ed_pos is 1
any_ed - 0 or 1, 1 when any other claim from the same visit has ed_use set to 1 (if ftype == ‘ot’)
Uses the below as reference:
If the patient is a Medicare beneficiary, the general surgeon should bill the level of ED code (99281-99285) (https://web.archive.org/web/20231125185256/https://bulletin.facs.org/2013/02/coding-for-hospital-admission/)
Inpatient files: Revenue Center Codes 0450-0459, 0981 (https://web.archive.org/web/20210303085851/https://www.resdac.org/resconnect/articles/144)
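The ED flags listed for flag_ed_use can be restated as a small pandas sketch. It is an illustration under assumptions, not the library's code: the function name is hypothetical, codes are assumed to be stored as strings, and the visit-level any_ed flag is left out because the visit definition is not given here.

```python
import pandas as pd

ED_CPT_CODES = [str(c) for c in range(99281, 99286)]           # 99281-99285
ED_REV_CODES = [f"{c:04d}" for c in range(450, 460)] + ["0981"]  # 0450-0459, 0981


def flag_ed_use_sketch(df: pd.DataFrame, ftype: str) -> pd.DataFrame:
    """Hypothetical re-statement of ed_cpt, ed_ub92, ed_tos/ed_pos and ed_use."""
    out = df.copy()
    proc_cols = [c for c in out.columns if c.startswith("PRCDR_CD_") and "SYS" not in c]
    ed_cpt = pd.Series(False, index=out.index)
    for col in proc_cols:
        sys_col = col.replace("PRCDR_CD_", "PRCDR_CD_SYS_")
        is_cpt = out[sys_col].astype(str).str.zfill(2) == "01"  # coding system 01
        ed_cpt |= is_cpt & out[col].isin(ED_CPT_CODES)
    rev_cols = [c for c in out.columns if c.startswith("UB_92_REV_CD_GP_")]
    ed_ub92 = out[rev_cols].isin(ED_REV_CODES).any(axis=1)
    out["ed_cpt"] = ed_cpt.astype(int)
    out["ed_ub92"] = ed_ub92.astype(int)
    setting = pd.Series(False, index=out.index)
    if ftype == "ip":
        setting = pd.to_numeric(out["MAX_TOS"], errors="coerce") == 11
        out["ed_tos"] = setting.astype(int)
    elif ftype == "ot":
        setting = pd.to_numeric(out["PLC_OF_SRVC_CD"], errors="coerce") == 23
        out["ed_pos"] = setting.astype(int)
    out["ed_use"] = (ed_cpt | ed_ub92 | setting).astype(int)
    return out


ed_demo = pd.DataFrame({
    "PRCDR_CD_1": ["99283", "99213", "99213"],
    "PRCDR_CD_SYS_1": ["01", "01", "01"],
    "UB_92_REV_CD_GP_1": ["0300", "0450", "0300"],
    "PLC_OF_SRVC_CD": [11, 11, 11],
})
ed_flags = flag_ed_use_sketch(ed_demo, ftype="ot")
print(ed_flags[["ed_cpt", "ed_ub92", "ed_pos", "ed_use"]])
```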
- classmethod get_claim_instance(claim_type: str, *args: Any, **kwargs: Any) MAXFile[source]¶
Returns an instance of the requested claim type
- pq_export(dest_path_and_fname: str, repartition: bool = False) DataFrame[source]¶
Export parquet files (overwrite safe)
- process_date_cols() None[source]¶
Convert datetime columns to datetime type and add basic date based constructed variables
New columns:
birth_year, birth_month, birth_day - date components of EL_DOB (date of birth)
birth_date - birth date (EL_DOB)
death - 0 or 1, if EL_DOD or MDCR_DOD is not empty and falls in the claim year or before
age - age in years, integer format
age_day - age in days
adult - 0 or 1, 1 when patient’s age is in [18,115]
child - 0 or 1, 1 when patient’s age is in [0,17]
- if ftype == ‘ip’:
Clean/ impute admsn_date and add ip duration related columns
New column(s):
admsn_date - Admission date (ADMSN_DT)
srvc_bgn_date - Service begin date (SRVC_BGN_DT)
srvc_end_date - Service end date (SRVC_END_DT)
prncpl_proc_date - Principal procedure date (PRNCPL_PRCDR_DT)
missing_admsn_date - 0 or 1, 1 when admission date is missing
missing_prncpl_proc_date - 0 or 1, 1 when principal procedure date is missing
flag_admsn_miss - 0 or 1, 1 when admsn_date was imputed
los - ip service duration in days
ageday_admsn - age in days as on admsn_date
age_admsn - age in years, with decimals, as on admsn_date
age_prncpl_proc - age in years as on principal procedure date
age_day_prncpl_proc - age in days as on principal procedure date
- if ftype == ‘ot’:
Adds duration column, provided service end and begin dates are clean
New Column(s):
srvc_bgn_date - Service begin date (SRVC_BGN_DT)
srvc_end_date - Service end date (SRVC_END_DT)
diff & duration - duration of service in days
age_day_srvc_bgn - age in days as on service begin date
age_srvc_bgn - age in years, with decimals, as on service begin date
- if ftype == ‘ps’:
New Column(s):
date_of_death - Date of death (EL_DOD)
medicare_date_of_death - Medicare date of death (MDCR_DOD)
medicaid_utils.preprocessing.max_ip module¶
This module has MAXIP class which wraps together cleaning/ preprocessing routines specific for MAX IP files
- class medicaid_utils.preprocessing.max_ip.MAXIP(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
MAXFile
- clean() None[source]¶
Runs cleaning routines and adds common exclusion flags based on default filters
- flag_ip_overlaps() None[source]¶
Identifies duplicate/ overlapping claims. When several/ overlapping claims exist with the same MSIS_ID, claim with the largest payment amount is retained.
New Column(s):
flag_ip_undup - 0 or 1, 1 when row is not a duplicate
flag_ip_dup_drop - 0 or 1, 1 when row is duplicate and must be dropped
flag_ip_overlap_drop - 0 or 1, 1 when row overlaps with another claim
ip_incl - 0 or 1, 1 when row is clean (flag_ip_dup_drop = 0 & flag_ip_overlap_drop = 0) and has los > 0
- Parameters:
dd.DataFrame (df)
- Return type:
None
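The "retain the claim with the largest payment" rule from flag_ip_overlaps can be illustrated with a short pandas sketch. Everything beyond that rule is an assumption: the helper name is hypothetical, duplicates are taken to be claims sharing MSIS_ID and admsn_date, and the date-range overlap check behind flag_ip_overlap_drop is not attempted.

```python
import pandas as pd


def flag_ip_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the highest-paid claim per (MSIS_ID, admsn_date)."""
    ordered = df.sort_values(
        ["MSIS_ID", "admsn_date", "pymt_amt"], ascending=[True, True, False]
    )
    # After sorting, the first row of each duplicate group has the largest payment
    dup = ordered.duplicated(subset=["MSIS_ID", "admsn_date"], keep="first")
    out = df.copy()
    out["flag_ip_dup_drop"] = dup.astype(int)  # aligned back on the original index
    out["flag_ip_undup"] = 1 - out["flag_ip_dup_drop"]
    return out


ip_demo = pd.DataFrame({
    "MSIS_ID": ["A", "A", "B"],
    "admsn_date": pd.to_datetime(["2012-01-05", "2012-01-05", "2012-02-01"]),
    "pymt_amt": [100.0, 250.0, 80.0],
})
ip_flags = flag_ip_duplicates_sketch(ip_demo)
print(ip_flags[["MSIS_ID", "pymt_amt", "flag_ip_dup_drop", "flag_ip_undup"]])
```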
medicaid_utils.preprocessing.max_ot module¶
This module has MAXOT class which wraps together cleaning/ preprocessing routines specific for MAX OT files
- class medicaid_utils.preprocessing.max_ot.MAXOT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
MAXFile
- add_ot_flags() None[source]¶
Assigns flags for IP, OT and ED calculation, based on a hierarchical principle: IP first, then ED, and then OT. Marks claims that have overlapping IP claims, have ED services, or have only OT services.
New Column(s):
ip_incl - 0 or 1, 1 when has no dental and transport claims, and has an overlapping IP claim
ed_incl - 0 or 1, 1 when has no dental and transport claims, has no overlapping IP claim, and has an ED service in any visit corresponding to this claim
ot_incl - 0 or 1, 1 when has no dental and transport claims, has no overlapping IP claim, and has no ED service in any visit corresponding to this claim
flag_drop - 0 or 1, 1 when ip_incl, ed_incl and ot_incl are all null
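The IP, then ED, then OT hierarchy behind add_ot_flags can be sketched as follows. The input column names `transport`, `overlap` and `any_ed` and the helper name are assumptions for the illustration, and the sketch writes 0 where the docstring speaks of null flags, so flag_drop here means "none of the three flags is set".

```python
import pandas as pd


def add_ot_flags_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: assign mutually exclusive IP > ED > OT flags."""
    out = df.copy()
    eligible = (out["dental"] == 0) & (out["transport"] == 0)
    has_ip = out["overlap"] == 1   # OT claim overlaps an IP claim
    has_ed = out["any_ed"] == 1    # ED service in any claim of the visit
    out["ip_incl"] = (eligible & has_ip).astype(int)
    out["ed_incl"] = (eligible & ~has_ip & has_ed).astype(int)
    out["ot_incl"] = (eligible & ~has_ip & ~has_ed).astype(int)
    # Dental/transport claims fall outside all three categories
    out["flag_drop"] = (~eligible).astype(int)
    return out


ot_demo = pd.DataFrame({
    "dental": [0, 0, 0, 1],
    "transport": [0, 0, 0, 0],
    "overlap": [1, 0, 0, 1],
    "any_ed": [1, 1, 0, 0],
})
ot_flags = add_ot_flags_sketch(ot_demo)
print(ot_flags[["ip_incl", "ed_incl", "ot_incl", "flag_drop"]])
```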
- clean() None[source]¶
Runs cleaning routines and adds common exclusion flags based on default filters
- find_ot_ip_overlaps(df_ip: DataFrame) None[source]¶
Checks for OT claims that have an overlapping IP claim.
New Column(s):
overlap - 0 or 1, 1 when OT claim has an overlapping IP claim
- Parameters:
df_ip (DataFrame) – IP DataFrame
- flag_dental() None[source]¶
Flag dental claims.
New Column(s):
dental_TOS - 0 or 1, 1 when MAX_TOS = 9
dental_PRCDR - 0 or 1, 1 when PRCDR_CD starts with ‘D’
dental - 0 or 1, 1 when any of dental_TOS or dental_PRCDR is 1
- flag_em() None[source]¶
Flag claim if procedure code belongs to E/M category.
New Column(s):
EM - 0 or 1, 1 when PRCDR_CD in [99201, 99215] or [99301, 99350]
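A minimal sketch of the E/M range check, assuming the two bracketed ranges are inclusive numeric ranges and that non-numeric procedure codes (for example dental ‘D’ codes) should never match. The helper name is hypothetical.

```python
import pandas as pd


def flag_em_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: EM = 1 for PRCDR_CD in 99201-99215 or 99301-99350."""
    code = pd.to_numeric(df["PRCDR_CD"], errors="coerce")  # non-numeric -> NaN
    em = code.between(99201, 99215) | code.between(99301, 99350)
    return df.assign(EM=em.astype(int))


em_flags = flag_em_sketch(
    pd.DataFrame({"PRCDR_CD": ["99213", "99283", "99305", "D0120"]})
)
print(em_flags["EM"].tolist())
```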
- flag_ip_overlaps_and_ed(df_ip: DataFrame) None[source]¶
Adds flags to indicate overlaps with IP claims
- Parameters:
df_ip (pd.DataFrame) – IP claim dataframe
medicaid_utils.preprocessing.max_ps module¶
This module has MAXPS class which wraps together cleaning/ preprocessing routines specific for MAX PS files
- class medicaid_utils.preprocessing.max_ps.MAXPS(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, rural_method: str = 'ruca', tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
MAXFile
Scripts to preprocess PS file
- add_eligibility_status_columns() None[source]¶
Add eligibility columns based on MAX_ELG_CD_MO_{month} values for each month. MAX_ELG_CD_MO codes 00 (NOT ELIGIBLE) and 99 (UNKNOWN ELIGIBILITY) denote ineligibility.
New Column(s):
elg_mon_{month} - 0 or 1 value column, denoting eligibility for each month
total_elg_mon - No. of eligible months
elg_full_year - 0 or 1 value column, 1 if total_elg_mon = 12
elg_over_9mon - 0 or 1 value column, 1 if total_elg_mon >= 9
elg_over_6mon - 0 or 1 value column, 1 if total_elg_mon >= 6
elg_cont_6mon - 0 or 1 value column, 1 if patient has 6 continuous eligible months
mas_elg_change - 0 or 1 value column, 1 if patient had multiple mas group memberships during claim year
mas_assignments - comma separated list of MAS assignments
boe_assignments - comma separated list of BOE assignments
dominant_boe_group - BOE status held for the most number of months
boe_elg_change - 0 or 1 value column, 1 if patient had multiple boe group memberships during claim year
child_boe_elg_change - 0 or 1 value column, 1 if patient had multiple boe group memberships during claim year
elg_change - 0 or 1 value column, 1 if patient had multiple eligibility group memberships during claim year
eligibility_aged - Eligibility as aged anytime during the claim year
eligibility_child - Eligibility as child anytime during the claim year
max_gap - Maximum gap in enrollment in months
max_cont_enrollment - Maximum duration of continuous enrollment
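The monthly eligibility counts listed for add_eligibility_status_columns can be sketched in pandas as below. Assumptions are flagged in the comments: the month suffix is taken to run from 1 to 12, missing codes are treated as ineligible, and the MAS/BOE columns are not reproduced. The helper name is hypothetical.

```python
import pandas as pd

INELIGIBLE = ["00", "99"]  # NOT ELIGIBLE, UNKNOWN ELIGIBILITY
MONTH_COLS = [f"MAX_ELG_CD_MO_{m}" for m in range(1, 13)]  # assumed column naming


def longest_run(row) -> int:
    """Length of the longest streak of consecutive eligible months."""
    best = cur = 0
    for value in row:
        cur = cur + 1 if value else 0
        best = max(best, cur)
    return best


def add_eligibility_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: monthly flags, totals and continuity indicators."""
    codes = df[MONTH_COLS].astype(str).apply(lambda s: s.str.zfill(2))
    # Assumption: a missing code counts as not eligible
    elg = (~codes.isin(INELIGIBLE) & df[MONTH_COLS].notna()).astype(int)
    elg.columns = [f"elg_mon_{m}" for m in range(1, 13)]
    out = df.join(elg)
    out["total_elg_mon"] = elg.sum(axis=1)
    out["elg_full_year"] = (out["total_elg_mon"] == 12).astype(int)
    out["elg_over_9mon"] = (out["total_elg_mon"] >= 9).astype(int)
    out["elg_over_6mon"] = (out["total_elg_mon"] >= 6).astype(int)
    out["max_cont_enrollment"] = elg.apply(longest_run, axis=1)
    out["elg_cont_6mon"] = (out["max_cont_enrollment"] >= 6).astype(int)
    return out


elg_demo = pd.DataFrame(
    [["11"] * 12, ["11"] * 5 + ["00"] + ["11"] * 5 + ["99"]],
    columns=MONTH_COLS,
)
elg_flags = add_eligibility_sketch(elg_demo)
print(elg_flags[["total_elg_mon", "elg_over_9mon", "max_cont_enrollment", "elg_cont_6mon"]])
```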
- clean() None[source]¶
Runs cleaning routines and adds common exclusion flags based on default filters
- flag_common_exclusions() None[source]¶
Adds exclusion flags.
New Column(s):
excl_duplicated_bene_id - 0 or 1, 1 when bene’s index column is repeated
- flag_duals() None[source]¶
Flags dual patients.
New column(s):
dual - 0 or 1 column, 0 if 0 <= EL_MDCR_DUAL_ANN <= 9 for years 2007, 2009, 2011 0 <= EL_MDCR_DUAL_ANN <= 9 for other years
- flag_restricted_benefits() None[source]¶
Checks individual’s eligibility for various Medicaid services, based on EL_RSTRCT_BNFT_FLG_{month} values:
1 = full scope; INDIVIDUAL IS ELIGIBLE FOR MEDICAID DURING THE MONTH AND IS ENTITLED TO THE FULL SCOPE OF MEDICAID BENEFITS.
2 = alien; INDIVIDUAL IS ELIGIBLE FOR MEDICAID DURING THE MONTH BUT ONLY ENTITLED TO RESTRICTED BENEFITS BASED ON ALIEN STATUS
3 = dual
4 = pregnancy
5 = other, e.g. substance abuse, medically needy
6 = family planning
7 = alternative package of benchmark equivalent coverage, 2011 data had no values of 7 and 8
8 = “money follows the person” rebalancing demonstration, 2011 data had no values of 7 and 8
9 = unknown
A = Psychiatric residential treatments demonstration
B = Health Opportunity Account
C = CHIP dental coverage, supplemental to employer sponsored insurance
W = Medicaid health insurance premium payment assistance (MA, NJ, VT, OK)
X = rx drug
Y = drug and dual
Z = drug and dual, but Medicaid was not paying for the benefits.
Benefits are non-comprehensive (restricted) when EL_RSTRCT_BNFT_FLG_{month} has any of the below values:
“2”, “3”, “6”: for states other than “AR”, “ID”, “SD”
“2”, “4”, “3”, “6”: for states “AR”, “ID”, “SD”
New column(s):
any_restricted_benefit_month: 0 or 1, 1 when bene’s benefits are restricted for at least 1 month
restricted_benefit_months: Number of restricted benefit months
restricted_benefits: 0 or 1, 1 when the number of restricted benefit months is more than the number of months the bene was enrolled in Medicaid
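The state-dependent code sets above translate into a short pandas sketch. The helper names and the 1 to 12 month suffix are assumptions, and the restricted_benefits column is left out because it also depends on the enrollment-month count computed elsewhere.

```python
import pandas as pd

RSTRCT_COLS = [f"EL_RSTRCT_BNFT_FLG_{m}" for m in range(1, 13)]  # assumed naming


def restricted_codes(state: str) -> list:
    """Codes treated as restricted; AR, ID and SD additionally include '4'."""
    if state in {"AR", "ID", "SD"}:
        return ["2", "3", "4", "6"]
    return ["2", "3", "6"]


def flag_restricted_benefits_sketch(df: pd.DataFrame, state: str) -> pd.DataFrame:
    """Hypothetical helper: count months with restricted benefits."""
    restricted = df[RSTRCT_COLS].astype(str).isin(restricted_codes(state))
    out = df.copy()
    out["restricted_benefit_months"] = restricted.sum(axis=1)
    out["any_restricted_benefit_month"] = (
        out["restricted_benefit_months"] > 0
    ).astype(int)
    return out


rb_demo = pd.DataFrame([["1"] * 10 + ["4", "2"]], columns=RSTRCT_COLS)
rb_ca = flag_restricted_benefits_sketch(rb_demo, state="CA")
rb_ar = flag_restricted_benefits_sketch(rb_demo, state="AR")
print(rb_ca["restricted_benefit_months"].iloc[0], rb_ar["restricted_benefit_months"].iloc[0])
```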
- flag_rural(method: str = 'ruca') None[source]¶
Classifies benes into rural/ non-rural on the basis of RUCA/ RUCC of their resident ZIP/ FIPS codes
New Columns:
resident_state_cd
rural - 0/ 1/ -1, 1 when bene’s residence is in a rural location, 0 when not, -1 when zip code is missing
pcsa - resident PCSA code
{ruca_code/ rucc_code} - resident RUCA or RUCC code, depending on method
This function uses:
RUCA 3.1 dataset (from https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/). RUCA codes >= 4 denote rural and the rest denote urban as per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6286055/#SD1
RUCC codes were downloaded from https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/. RUCC codes >= 8 denote rural and the rest denote urban.
ZCTAs x zipcode crosswalk from UDSMapper (https://udsmapper.org/zip-code-to-zcta-crosswalk/), zipcodes from multiple sources
distance between centroids of zipcodes using NBER data (https://nber.org/distance/2016/gaz/zcta5/gaz2016zcta5centroid.csv)
- Parameters:
method ({'ruca', 'rucc'}) – Method to use for rural variable construction
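The thresholds used by flag_rural (RUCA >= 4 or RUCC >= 8 means rural) can be illustrated on an already-merged code column. This sketch skips the ZIP/ZCTA crosswalk and centroid-distance steps entirely; the function name is hypothetical, and returning -1 for a missing code stands in for the documented "-1 when zip code is missing".

```python
import pandas as pd

RURAL_THRESHOLD = {"ruca": 4, "rucc": 8}


def classify_rural(codes: pd.Series, method: str = "ruca") -> pd.Series:
    """Hypothetical helper: 1 = rural, 0 = not rural, -1 = code unavailable."""
    numeric = pd.to_numeric(codes, errors="coerce")
    rural = (numeric >= RURAL_THRESHOLD[method]).astype(int)
    return rural.where(numeric.notna(), -1)


rural_ruca = classify_rural(pd.Series([1.0, 4.0, 10.2, None]), method="ruca")
rural_rucc = classify_rural(pd.Series([3, 8, 9]), method="rucc")
print(rural_ruca.tolist(), rural_rucc.tolist())
```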
- flag_tanf() None[source]¶
The Temporary Assistance for Needy Families (TANF) program provides temporary financial assistance for pregnant women and families with one or more dependent children. This provides financial assistance to help pay for food, shelter, utilities, and expenses other than medical. In MAX files this is identified via
- EL_TANF_CASH_FLG:
1 = INDIVIDUAL DID NOT RECEIVE TANF BENEFITS DURING THE MONTH;
2 = INDIVIDUAL DID RECEIVE TANF BENEFITS DURING THE MONTH.
CO and ID either 0 or 9
- New Column(s):
tanf : 0 or 1, denoting usage of TANF benefits in any of the months
medicaid_utils.preprocessing.taf_file module¶
This module has the TAFFile class, which is the base class for all TAF file type classes
- class medicaid_utils.preprocessing.taf_file.TAFFile(ftype: str, year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
object
Parent class for all TAF file classes, each of which will have clean and preprocess functions
- add_custom_subtype(subtype_name: str, df_file: DataFrame) None[source]¶
Add custom subtype file to claim object.
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = dd.from_pandas(pd.DataFrame({'col': [1]}), npartitions=1)
>>> taf.add_custom_subtype('my_subtype', df)
- cache_results(subtype: str | None = None, repartition: bool = False) None[source]¶
Save results in intermediate steps of some lengthy processing. Saving intermediate results speeds up processing and avoids dask cluster crashes for large datasets
- Parameters:
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', tmp_folder='/tmp/cache')
>>> taf.cache_results(subtype='base')
- clean() None[source]¶
Cleaning routines to process date and gender columns, and add duplicate check flags.
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.clean()
- clean_codes() None[source]¶
Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case, and clean NDC code columns by removing white space characters and left-padding with 0s so the codes are of length 12.
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.clean_codes()
- clean_diag_codes() None[source]¶
Clean diagnostic code columns by removing non-alphanumeric characters and converting them to upper case.
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'DGNS_CD_1': ['a12.3', 'B45-6'],
...                     'other_col': [1, 2]})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'base': ddf}
>>> obj.clean_diag_codes()
>>> result = obj.dct_files['base'].compute()
>>> list(result['DGNS_CD_1'])
['A123', 'B456']
- clean_ndc_codes() None[source]¶
Clean NDC codes columns by removing white space characters and padding 0s to the left so the codes are of length 12.
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'NDC': ['1234', ' 5678 ']})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'line': ddf}
>>> obj.clean_ndc_codes()
>>> result = obj.dct_files['line'].compute()
>>> list(result['NDC'])
['000000001234', '000000005678']
- clean_proc_codes() None[source]¶
Clean procedure code columns by removing non-alphanumeric characters and converting them to upper case.
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'PRCDR_CD_1': ['ab.1', 'C2-d'],
...                     'PRCDR_CD_SYS_1': ['ICD', 'CPT']})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> obj = object.__new__(TAFFile)
>>> obj.dct_files = {'base': ddf}
>>> obj.clean_proc_codes()
>>> result = obj.dct_files['base'].compute()
>>> list(result['PRCDR_CD_1'])
['AB1', 'C2D']
>>> list(result['PRCDR_CD_SYS_1'])
['ICD', 'CPT']
- export(dest_folder: str, output_format: str = 'csv', repartition: bool = False) None[source]¶
Exports the files.
- Parameters:
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.export('/tmp/output', output_format='csv')
- flag_duplicates() None[source]¶
Removes duplicated rows. TAF claims have multiple versions for each month. This function keeps the most recent file version date for each month using the variables IP_VRSN, LT_VRSN, OT_VRSN, and RX_VRSN. Retains only the claims with maximum value of production data run ID (DA_RUN_ID) for each claim ID (CLM_ID).
References
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.flag_duplicates()
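The retention rule described for flag_duplicates (keep the rows with the maximum DA_RUN_ID for each CLM_ID) can be written as a short pandas sketch. It covers only that rule, not the per-month version handling, and the helper name is hypothetical.

```python
import pandas as pd


def keep_latest_run(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: retain rows with the max DA_RUN_ID per CLM_ID."""
    max_run = df.groupby("CLM_ID")["DA_RUN_ID"].transform("max")
    return df[df["DA_RUN_ID"] == max_run]


claims_demo = pd.DataFrame({
    "CLM_ID": ["c1", "c1", "c2"],
    "DA_RUN_ID": [100, 200, 150],
})
latest_claims = keep_latest_run(claims_demo)
print(latest_claims)
```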
- flag_ffs_and_encounter_claims() None[source]¶
Flags claims where CLM_TYPE_CD is equal to one of the following values:
1: A FFS Medicaid or Medicaid-expansion claim
3: Medicaid or Medicaid-expansion managed care encounter record
A: Separate CHIP (Title XXI) FFS claim
C: Separate CHIP (Title XXI) encounter record
References
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.flag_ffs_and_encounter_claims()
>>> 'ffs_or_encounter_claim' in taf.dct_files['base'].columns
True
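The CLM_TYPE_CD check behind flag_ffs_and_encounter_claims amounts to a membership test, sketched here in pandas. The output column name comes from the doctest above; the helper name and the string normalisation are assumptions.

```python
import pandas as pd

FFS_OR_ENCOUNTER_CODES = ["1", "3", "A", "C"]


def flag_ffs_or_encounter(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: 1 when CLM_TYPE_CD is an FFS or encounter code."""
    codes = df["CLM_TYPE_CD"].astype(str).str.strip().str.upper()
    flag = codes.isin(FFS_OR_ENCOUNTER_CODES).astype(int)
    return df.assign(ffs_or_encounter_claim=flag)


clm_flags = flag_ffs_or_encounter(
    pd.DataFrame({"CLM_TYPE_CD": ["1", "3", "A", "C", "2", "4"]})
)
print(clm_flags["ffs_or_encounter_claim"].tolist())
```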
- gather_bene_level_diag_ndc_codes() None[source]¶
Constructs patient level NDC and diagnosis code list columns and saves them to individual files.
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.gather_bene_level_diag_ndc_codes()
- classmethod get_claim_instance(claim_type: str, *args: Any, **kwargs: Any) TAFFile[source]¶
Returns an instance of the requested claim type
- Parameters:
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> ip_claim = TAFFile.get_claim_instance('ip', 2019, 'AL', '/data/cms')
- pq_export(f_subtype: str, dest_path_and_fname: str, repartition: bool = False) None[source]¶
Export parquet files (overwrite safe)
- Parameters:
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms')
>>> taf.pq_export('base', '/tmp/output/base')
- preprocess() None[source]¶
Add basic constructed variables.
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', preprocess=False)
>>> taf.preprocess()
- process_date_cols() None[source]¶
Convert datetime columns to datetime type and add basic date based constructed variables
New columns:
birth_year, birth_month, birth_day - date components of EL_DOB (date of birth)
birth_date - birth date (EL_DOB)
death - 0 or 1, if DEATH_DT is not empty and falls in the claim year or before
age - age in years, integer format
age_day - age in days
adult - 0 or 1, 1 when patient’s age is in [18,115]
child - 0 or 1, 1 when patient’s age is in [0,17]
- If ftype == ‘ip’:
Clean/ impute admsn_date and add ip duration related columns
New column(s):
admsn_date - Admission date (ADMSN_DT)
srvc_bgn_date - Service begin date (SRVC_BGN_DT)
srvc_end_date - Service end date (SRVC_END_DT)
prncpl_proc_date - Principal procedure date (PRCDR_CD_DT_1)
missing_admsn_date - 0 or 1, 1 when admission date is missing
missing_prncpl_proc_date - 0 or 1, 1 when principal procedure date is missing
flag_admsn_miss - 0 or 1, 1 when admsn_date was imputed
los - ip service duration in days
ageday_admsn - age in days as on admsn_date
age_admsn - age in years, with decimals, as on admsn_date
age_prncpl_proc - age in years as on principal procedure date
age_day_prncpl_proc - age in days as on principal procedure date
- if ftype == ‘ot’:
Adds duration column, provided service end and begin dates are clean
New Column(s):
srvc_bgn_date - Service begin date (SRVC_BGN_DT)
srvc_end_date - Service end date (SRVC_END_DT)
diff & duration - duration of service in days
age_day_srvc_bgn - age in days as on service begin date
age_srvc_bgn - age in years, with decimals, as on service begin date
- if ftype == ‘ps’:
New Column(s):
date_of_death - Date of death (DEATH_DT)
Examples
>>> from medicaid_utils.preprocessing.taf_file import TAFFile
>>> taf = TAFFile('ip', 2019, 'AL', '/data/cms', clean=False)
>>> taf.process_date_cols()
medicaid_utils.preprocessing.taf_ip module¶
This module has TAFIP class which wraps together cleaning/ preprocessing routines specific for TAF IP files
- class medicaid_utils.preprocessing.taf_ip.TAFIP(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
TAFFile
- clean() None[source]¶
Cleaning routines to clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.
Examples
>>> from medicaid_utils.preprocessing.taf_ip import TAFIP
>>> ip = TAFIP(2019, 'AL', '/data/cms', clean=False)
>>> ip.clean()
- flag_common_exclusions() None[source]¶
Adds commonly used IP claim exclusion flag columns. New Columns:
ffs_or_encounter_claim, 0 or 1, 1 when base claim is an FFS or Encounter claim
excl_missing_dob, 0 or 1, 1 when base claim does not have birth date
excl_missing_admsn_date, 0 or 1, 1 when base claim does not have admission date
excl_missing_prncpl_proc_date, 0 or 1, 1 when base claim does not have principal procedure date
Examples
>>> from medicaid_utils.preprocessing.taf_ip import TAFIP
>>> ip = TAFIP(2019, 'AL', '/data/cms', clean=False)
>>> ip.flag_common_exclusions()
medicaid_utils.preprocessing.taf_lt module¶
This module has TAFLT class which wraps together cleaning/ preprocessing routines specific for TAF LT files
- class medicaid_utils.preprocessing.taf_lt.TAFLT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
TAFFile
medicaid_utils.preprocessing.taf_ot module¶
This module has TAFOT class which wraps together cleaning/ preprocessing routines specific for TAF OT files
- class medicaid_utils.preprocessing.taf_ot.TAFOT(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
TAFFile
- clean() None[source]¶
Cleaning routines to clean diagnosis & procedure code columns, process date and gender columns, and add duplicate check flags.
Examples
>>> from medicaid_utils.preprocessing.taf_ot import TAFOT
>>> ot = TAFOT(2019, 'AL', '/data/cms', clean=False)
>>> ot.clean()
- flag_common_exclusions() None[source]¶
Adds commonly used OT claim exclusion flag columns. New Columns:
ffs_or_encounter_claim, 0 or 1, 1 when base claim is an FFS or Encounter claim
excl_missing_dob, 0 or 1, 1 when base claim does not have birth date
excl_missing_srvc_bgn_date, 0 or 1, 1 when base claim does not have service begin date
Examples
>>> from medicaid_utils.preprocessing.taf_ot import TAFOT
>>> ot = TAFOT(2019, 'AL', '/data/cms', clean=False)
>>> ot.flag_common_exclusions()
medicaid_utils.preprocessing.taf_ps module¶
This module has TAFPS class which wraps together cleaning/ preprocessing routines specific for TAF PS files
- class medicaid_utils.preprocessing.taf_ps.TAFPS(year: int, state: str, data_root: str, index_col: str = 'BENE_MSIS', clean: bool = True, preprocess: bool = True, rural_method: str = 'ruca', tmp_folder: str | None = None, pq_engine: str = 'pyarrow')[source]¶
Bases:
TAFFile
Scripts to preprocess PS file
- add_gender() None[source]¶
Adds integer ‘female’ column based on ‘SEX_CD’ column. Undefined values (‘U’) in SEX_CD column will result in female column taking the value -1.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.add_gender()
>>> 'female' in ps.dct_files['base'].columns
True
- add_mas_boe() None[source]¶
Adds columns denoting number of months in each Maintenance Assistance Status (MAS) and Basis of Eligibility (BOE) category. Columns added are,
boe_chip_months : Number of months in Separate-CHIP BOE category
boe_aged_months : Number of months in Aged BOE category
boe_blind_disabled_months : Number of months in Blind/Disabled BOE category
boe_child_months : Number of months in Children BOE category
boe_adults_months : Number of months in Adult BOE category
boe_breast_and_cervical_cancer_months : Number of months in Breast and Cervical Cancer Prevention and Treatment Act of 2000 BOE category
boe_child_of_unemployed_months : Number of months in Child of Unemployed Adult BOE category
boe_unemployed_months : Number of months in Unemployed Adult BOE category
boe_foster_care_children_months : Number of months in Foster Care Children BOE category
boe_unknown_months : Number of months in Unknown BOE category
mas_chip_months : Number of months in Separate-CHIP MAS category
mas_cash_sec_1931_months : Number of months in Individuals receiving cash assistance or eligible under section 1931 of the Act MAS category
mas_medically_needy_months : Number of months in Medically Needy MAS category
mas_poverty_months : Number of months in Poverty Related Eligibles MAS category
mas_other_months : Number of months in Other Eligibles MAS category
mas_demonstration_months : Number of months in Section 1115 Demonstration expansion eligible MAS category
mas_unknown_months : Number of months in Unknown MAS category
max_mas_type : Top MAS category for the bene
max_boe_type : Top BOE category for the bene
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.add_mas_boe()
- add_risk_adjustment_scores() None[source]¶
Adds bene level risk adjustment scores. Currently supports Elixhauser scores.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', tmp_folder='/tmp/ps')
>>> ps.add_risk_adjustment_scores()
- clean() None[source]¶
Runs cleaning routines and creates common exclusion flags based on default filters.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False)
>>> ps.clean()
- compute_enrollment_gaps() None[source]¶
Computes enrollment gaps using dates file. Adds number of enrollment gaps and length of maximum enrollment gap in days columns.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.compute_enrollment_gaps()
- flag_common_exclusions() None[source]¶
Adds commonly used exclusion flags
New Column(s):
excl_duplicated_bene_id - 0 or 1, 1 when bene’s index column is repeated
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False)
>>> ps.flag_common_exclusions()
- flag_dual() None[source]¶
Flags dual-eligible benes. Benes with DUAL_ELGBL_CD equal to 1 (full dual), 2 (partial dual), or 3 (other dual) in any month are flagged as duals.
References
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_dual()
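The "any month with DUAL_ELGBL_CD in 1, 2 or 3" rule from flag_dual can be sketched as follows. The monthly column naming (a DUAL_ELGBL_CD prefix with a month suffix), the output column name `dual`, and the helper name are assumptions for the illustration.

```python
import pandas as pd


def flag_dual_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dual = 1 when any monthly code is 1, 2 or 3."""
    month_cols = [c for c in df.columns if c.startswith("DUAL_ELGBL_CD")]
    codes = df[month_cols].apply(pd.to_numeric, errors="coerce")  # '01' -> 1
    is_dual = codes.isin([1, 2, 3]).any(axis=1)
    return df.assign(dual=is_dual.astype(int))


dual_flags = flag_dual_sketch(pd.DataFrame({
    "DUAL_ELGBL_CD_01": ["00", "00", None],
    "DUAL_ELGBL_CD_02": ["01", "00", "03"],
}))
print(dual_flags["dual"].tolist())
```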
- flag_ffs_months() None[source]¶
Creates flags for months enrolled in medicaid without enrollment in managed care plans of 3 categories, and adds columns denoting total number of months enrolled in these plans and the enrollment sequence pattern.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_ffs_months()
- flag_managed_care_months() None[source]¶
Creates flags for 3 categories of managed care plans for each month, and adds columns denoting total number of months enrolled in these plans and the enrollment sequence pattern.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_managed_care_months()
- flag_medicaid_enrolled_months() None[source]¶
Creates flags for medicaid enrollment for each month and computes the total number of months enrolled in medicaid. Bene has to be enrolled for all days of the month without missing eligibility information for the month to be considered a medicaid enrolled month.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_medicaid_enrolled_months()
- flag_restricted_benefits() None[source]¶
Flags beneficiaries whose benefits are restricted. Benes with the below values in their RSTRCTD_BNFTS_CD_XX columns are NOT assumed to have restricted benefits:
1. Individual is eligible for Medicaid or CHIP and entitled to the full scope of Medicaid or CHIP benefits.
4. Individual is eligible for Medicaid or CHIP but only entitled to restricted benefits for pregnancy-related services.
5. Individual is eligible for Medicaid or Medicaid-Expansion CHIP but, for reasons other than alien, dual-eligibility or pregnancy-related status, is only entitled to restricted benefits (e.g., restricted benefits based upon substance abuse, medically needy or other criteria).
7. Individual is eligible for Medicaid and entitled to Medicaid benefits under an alternative package of benchmark-equivalent coverage, as enacted by the Deficit Reduction Act of 2005.
Reference: Identifying beneficiaries with a substance use disorder
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_restricted_benefits()
- flag_rural(method: str = 'ruca') None[source]¶
Classifies benes into rural/ non-rural on the basis of RUCA/ RUCC of their resident ZIP/ FIPS codes
New Columns:
resident_state_cd
rural - 0/ 1/ np.nan, 1 when bene’s residence is in a rural location, 0 when not, -1 when zip code is missing
pcsa - resident PCSA code
census_region - resident census region
census_division - resident census division
{ruca_code/ rucc_code} - resident ruca_code
This function uses
RUCA 3.1 dataset. RUCA codes >= 4 denote rural and the rest denote urban as per Cole, Megan B et al
RUCC codes. RUCC codes >= 8 denote rural and the rest denote urban.
ZCTAs x zipcode crosswalk from UDSMapper.
zipcodes from multiple sources
Distance between centroids of zipcodes using NBER data
- Parameters:
method ({'ruca', 'rucc'}) – Method to use for rural variable construction
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_rural(method='ruca')
- flag_tanf() None[source]¶
The Temporary Assistance for Needy Families (TANF) program provides temporary financial assistance for pregnant women and families with one or more dependent children. This provides financial assistance to help pay for food, shelter, utilities, and expenses other than medical. In TAF files this is identified via
- TANF_CASH_CD:
1: INDIVIDUAL DID NOT RECEIVE TANF BENEFITS DURING THE YEAR;
2: INDIVIDUAL DID RECEIVE TANF BENEFITS DURING THE YEAR
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', clean=False, preprocess=False)
>>> ps.flag_tanf()
- gather_bene_level_diag_ndc_codes() None[source]¶
Constructs patient level NDC and diagnosis code list columns and saves them to individual file.
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', tmp_folder='/tmp/ps')
>>> ps.gather_bene_level_diag_ndc_codes()
- preprocess(rural_method: str = 'ruca', add_risk_adjustment_scores: bool = False) None[source]¶
Adds rural and eligibility criteria indicator variables.
- Parameters:
Examples
>>> from medicaid_utils.preprocessing.taf_ps import TAFPS
>>> ps = TAFPS(2019, 'AL', '/data/cms', preprocess=False)
>>> ps.preprocess(rural_method='ruca')
medicaid_utils.preprocessing.taf_rx module¶
This module has TAFRX class which wraps together cleaning/ preprocessing routines specific for TAF Pharmacy files