medicaid_utils.filters.patients package
Submodules
medicaid_utils.filters.patients.cohort_extraction module
This module has functions that extract cohorts for studies based on multiple filters
- medicaid_utils.filters.patients.cohort_extraction.apply_range_filter(tpl_range: tuple, df: DataFrame, filter_name: str, col_name: str, data_type: str, logger_name: str = '/home/runner/work/medicaid-utils/medicaid-utils/medicaid_utils/filters/patients/cohort_extraction.py') → DataFrame
Applies a date or numeric range filter to a dataframe
- Parameters:
tpl_range (tuple) – Tuple of lower and upper bounds, in the form (start, end)
df (dd.DataFrame) – Dataframe to be filtered
filter_name (str) – Name of filter. Should be of the format range_[datatype]_[col_name]. Only date and numeric range filters are currently supported
col_name (str) – Name of column
data_type (str) – Datatype of column. Eg. date, int
logger_name (str, default=__file__) – Logger name
- Return type:
dd.DataFrame
Examples
Apply a numeric range filter:
>>> import pandas as pd
>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     apply_range_filter,
... )
>>> df = pd.DataFrame({'age': [5, 12, 20, 35]})
>>> result = apply_range_filter(
...     (0, 18), df, 'range_numeric_age', 'age', 'int')
>>> result['age'].tolist()
[5, 12]
Apply a date range filter:
>>> df_dates = pd.DataFrame({
...     'service_date': pd.to_datetime(
...         ['20200101', '20200601', '20210101'])
... })
>>> result = apply_range_filter(
...     ('20200101', '20200630'), df_dates,
...     'range_date_service_date', 'service_date', 'date')
>>> len(result)
2
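The filter naming convention and the inclusive bounds can be illustrated without the library. The following is a minimal sketch in plain Python, not the package's implementation: `parse_range_filter_name` and `in_range` are hypothetical helpers introduced only for illustration, and the assumption that bounds are inclusive is taken from the "inclusive of both 0 and 18" wording used for the filter dictionaries in this module.

```python
def parse_range_filter_name(filter_name):
    """Split 'range_[datatype]_[col_name]' into (datatype, col_name).

    Hypothetical helper, shown only to illustrate the naming convention.
    """
    # Split at most twice so column names containing underscores stay intact.
    prefix, data_type, col_name = filter_name.split("_", 2)
    if prefix != "range" or data_type not in ("date", "numeric"):
        raise ValueError(f"Unsupported range filter name: {filter_name}")
    return data_type, col_name


def in_range(value, tpl_range):
    """Return True when value lies within (start, end), bounds included."""
    lower, upper = tpl_range
    return lower <= value <= upper


parsed = parse_range_filter_name("range_numeric_age_prncpl_proc")
ages = [5, 12, 20, 35]
kept_ages = [age for age in ages if in_range(age, (0, 18))]
print(parsed)     # ('numeric', 'age_prncpl_proc')
print(kept_ages)  # [5, 12]
```

Splitting with a maximum of two splits matters here because column names such as `age_prncpl_proc` contain underscores themselves.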
- medicaid_utils.filters.patients.cohort_extraction.export_cohort_datasets(df_cohort: DataFrame, year: int, state: str, lst_types_to_export: List[str], dct_export_filters: dict, dct_data_paths: dict, cms_format: str = 'MAX', clean_exports: bool = False, preprocess_exports: bool = False, export_format: str = 'csv', logger_name: str = '/home/runner/work/medicaid-utils/medicaid-utils/medicaid_utils/filters/patients/cohort_extraction.py') → None
Exports the claim files of the cohort, applying the export filters passed to this function
- Parameters:
df_cohort (dd.DataFrame) – Dataframe with patient IDs (BENE_MSIS) and an indicator flag denoting inclusion in the cohort (include=1)
year (int) – Year of the claim files
state (str) – State
lst_types_to_export (list of str) – List of file types to export. Supported types are [ip, ot, ps, rx]
dct_export_filters (dict) –
Additional filters that should be applied to the raw claims of the selected cohort while exporting. Filter dictionary should be of the format:
{claim_type_1: {range_[datatype]_[col_name]: (start, end),
                excl_[col_name]: [0/1],
                [col_name]: value, ..},
 claim_type_2: ...}
date and numeric range type filters are currently supported. Filter names beginning with excl_ with values set to 1 will exclude benes that have a positive value for that exclusion flag. Filter names that are just column names will restrict the result to benes with the filter value for the corresponding column. Eg:
{'ip': {'range_numeric_age_prncpl_proc': (0, 18),
        'missing_dob': 0,
        'excl_female': 1},
 'ot': {'range_numeric_age_srvc_bgn': (0, 18),
        'missing_dob': 0,
        'excl_female': 1}}
The example filter excludes IP and OT claims of female benes, as well as claims with a missing DOB. The result is further restricted to claims of benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).
dct_data_paths (dict) –
Dictionary with information on raw claim files root folder and export folder. Should be of the format,
{'source_root': '/path/to/medicaid/folder', 'export_folder': '/path/to/export/data'}
cms_format ({'MAX', 'TAF'}) – CMS file format.
clean_exports (bool, default=False) – Should the exported datasets be cleaned?
preprocess_exports (bool, default=False) – Should the exported datasets be preprocessed?
export_format (str, default='csv') – Format of exported files
logger_name (str, default=__file__) – Logger name
- Raises:
FileNotFoundError – Raised when any of the requested file types does not exist for the state and year
OSError – Raised when the cohort file index does not match the claim file index
Examples
>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     export_cohort_datasets,
... )
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf_cohort = pd.DataFrame({
...     'MSIS_ID': ['A', 'B'],
...     'include': [1, 1],
... }).set_index('MSIS_ID')
>>> ddf_cohort = dd.from_pandas(pdf_cohort, npartitions=1)
>>> dct_paths = {
...     'source_root': '/data/cms/',
...     'export_folder': '/output/cohort/',
... }
>>> export_cohort_datasets(
...     ddf_cohort, 2012, 'AL',
...     ['ip', 'ps'], {}, dct_paths,
...     cms_format='MAX')
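The three kinds of entries allowed in a filter dictionary (range filters, `excl_` flags, and plain column filters) can be sketched on plain Python records. This is an illustration of the documented semantics, not the library's code: `row_passes_filters` is a hypothetical helper, and it assumes that an exclusion flag is stored in a column with the same name as the filter (for example `excl_female`).

```python
def row_passes_filters(row, dct_filters):
    """Return True when a claim record satisfies every filter (illustrative only)."""
    for name, value in dct_filters.items():
        if name.startswith("range_"):
            # range_[datatype]_[col_name]: keep rows inside (start, end), inclusive.
            _, _, col_name = name.split("_", 2)
            lower, upper = value
            if not lower <= row[col_name] <= upper:
                return False
        elif name.startswith("excl_"):
            # excl_[col_name] set to 1: drop rows with a positive exclusion flag.
            # Assumption: the flag column carries the same name as the filter.
            if value == 1 and row[name] > 0:
                return False
        elif row[name] != value:
            # Plain column name: keep only rows matching the filter value.
            return False
    return True


claims = [
    {"age_prncpl_proc": 10, "missing_dob": 0, "excl_female": 0},
    {"age_prncpl_proc": 10, "missing_dob": 0, "excl_female": 1},
    {"age_prncpl_proc": 30, "missing_dob": 0, "excl_female": 0},
    {"age_prncpl_proc": 18, "missing_dob": 1, "excl_female": 0},
]
ip_filters = {
    "range_numeric_age_prncpl_proc": (0, 18),
    "missing_dob": 0,
    "excl_female": 1,
}
kept = [claim for claim in claims if row_passes_filters(claim, ip_filters)]
print(len(kept))  # 1
```

Only the first record survives: the second is dropped by the exclusion flag, the third by the age range, and the fourth by the `missing_dob` column filter.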
- medicaid_utils.filters.patients.cohort_extraction.extract_cohort(state: str, lst_year: List[int], dct_diag_proc_codes: dict, dct_filters: dict, lst_types_to_export: List[str], dct_data_paths: dict, restrict_dx_proc_to_ip: bool = False, cms_format: str = 'MAX', clean_exports: bool = True, preprocess_exports: bool = True, export_format: str = 'csv', pq_engine: str = 'pyarrow', logger_name: str = '/home/runner/work/medicaid-utils/medicaid-utils/medicaid_utils/filters/patients/cohort_extraction.py') → None
Extracts the cohort defined by the input filters and exports the corresponding claim files
- Parameters:
state (str) – State
lst_year (list of int) – List of years from which cohort should be created
dct_diag_proc_codes (dict) –
Dictionary of diagnosis and procedure codes. Should be in the format
{'diag_codes': {condition_name: {['incl' / 'excl']: {[9 / 10]: list of codes}}},
 'proc_codes': {procedure_name: {procedure_system_code: list of codes}},
 'column_values': {condition_or_procedure_name: {column_name: list of numerical values}}}
Eg:
{'diag_codes': {'oud_nqf': {'incl': {9: ['3040', '3055']}}},
 'proc_codes': {'methadone_7': {7: 'HZ81ZZZ,HZ84ZZZ,HZ85ZZZ,HZ86ZZZ,HZ91ZZZ,HZ94ZZZ,HZ95ZZZ,HZ96ZZZ'.split(",")}},
 'column_values': {'diag_delivery': {'RCPNT_DLVRY_CD': [1]}}}
dct_filters (dict) –
Filters to apply to the cohort, and the exported claim files. Filter dictionary should be of the format:
{'cohort': {claim_type_1: {range_[datatype]_[col_name]: (start, end),
                           excl_[col_name]: [0/1],
                           [col_name]: value, ..}},
 'export': {claim_type_1: {range_[datatype]_[col_name]: (start, end),
                           excl_[col_name]: [0/1],
                           [col_name]: value, ..}}}
date and numeric range type filters are currently supported. Filter names beginning with excl_ with values set to 1 will exclude benes that have a positive value for that exclusion flag. Filter names that are just column names will restrict the result to benes with the filter value for the corresponding column. Eg:
{'cohort': {'ip': {'range_numeric_age_prncpl_proc': (0, 18),
                   'missing_dob': 0,
                   'excl_female': 1},
            'ot': {'range_numeric_age_srvc_bgn': (0, 18),
                   'missing_dob': 0,
                   'excl_female': 1}}}
The example filter builds the cohort after excluding IP and OT claims of female benes, as well as claims with a missing DOB. The cohort is further restricted to benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).
lst_types_to_export (list of str) – List of types to export. Currently supported types are ip, ot, rx, ps.
dct_data_paths (dict) –
Dictionary with information on raw claim files root folder and export folder. Should be of the format,
{'source_root': '/path/to/medicaid/folder', 'export_folder': '/path/to/export/data'}
restrict_dx_proc_to_ip (bool, default=False) – Apply the diagnosis and procedure code filters to the IP file only
cms_format ({'MAX', 'TAF'}) – CMS file format.
clean_exports (bool, default=True) – Should the exported datasets be cleaned?
preprocess_exports (bool, default=True) – Should the exported datasets be preprocessed?
export_format (str, default='csv') – Format of exported files
pq_engine (str, default='pyarrow') – Parquet engine
logger_name (str, default=__file__) – Logger name
- Raises:
FileNotFoundError – Raised when any of the requested file types does not exist for the state and year
Examples
>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     extract_cohort,
... )
>>> dct_codes = {
...     'diag_codes': {'oud': {'incl': {9: ['3040', '3055']}}},
...     'proc_codes': {},
... }
>>> dct_filters = {
...     'cohort': {'ip': {'missing_dob': 0}},
...     'export': {},
... }
>>> dct_paths = {
...     'source_root': '/data/cms/',
...     'export_folder': '/output/cohort/',
... }
>>> extract_cohort(
...     'AL', [2012], dct_codes, dct_filters,
...     ['ip', 'ps'], dct_paths,
...     cms_format='MAX')
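When code lists are kept as comma-separated strings, stray whitespace around a code would produce entries that may not match any claim. The sketch below assembles a `dct_diag_proc_codes` dictionary in the documented shape while trimming each code. `split_codes` is a hypothetical convenience helper, not part of medicaid_utils, and the shortened code list is for illustration only.

```python
def split_codes(csv_codes):
    """Split a comma-separated string of codes, trimming whitespace and blanks."""
    return [code.strip() for code in csv_codes.split(",") if code.strip()]


dct_diag_proc_codes = {
    # condition_name -> 'incl' / 'excl' -> ICD version (9 or 10) -> list of codes
    "diag_codes": {"oud_nqf": {"incl": {9: ["3040", "3055"]}}},
    # procedure_name -> procedure system code -> list of codes
    "proc_codes": {"methadone_7": {7: split_codes("HZ81ZZZ,HZ84ZZZ, HZ95ZZZ")}},
    # condition_or_procedure_name -> column_name -> list of numerical values
    "column_values": {"diag_delivery": {"RCPNT_DLVRY_CD": [1]}},
}

print(dct_diag_proc_codes["proc_codes"]["methadone_7"][7])
# ['HZ81ZZZ', 'HZ84ZZZ', 'HZ95ZZZ']
```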
- medicaid_utils.filters.patients.cohort_extraction.filter_claim_files(claim: MAXFile | TAFFile, dct_claim_filters: dict, tmp_folder: str, subtype: str | None = None, logger_name: str = '/home/runner/work/medicaid-utils/medicaid-utils/medicaid_utils/filters/patients/cohort_extraction.py') → Tuple[MAXFile | TAFFile, DataFrame]
Filters claim files
- Parameters:
claim (Union[max_file.MAXFile, taf_file.TAFFile]) – Claim object
dct_claim_filters (dict) –
Filters to apply. Filter dictionary should be of the format:
{claim_type_1: {range_[datatype]_[col_name]: (start, end),
                excl_[col_name]: [0/1],
                [col_name]: value, ..},
 claim_type_2: ...}
date and numeric range type filters are currently supported. Filter names beginning with excl_ with values set to 1 will exclude benes that have a positive value for that exclusion flag. Filter names that are just column names will restrict the result to benes with the filter value for the corresponding column. Eg:
{'ip': {'range_numeric_age_prncpl_proc': (0, 18),
        'missing_dob': 0,
        'excl_female': 1},
 'ot': {'range_numeric_age_srvc_bgn': (0, 18),
        'missing_dob': 0,
        'excl_female': 1}}
The example filter excludes IP and OT claims of female benes, as well as claims with a missing DOB. The result is further restricted to claims of benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).
tmp_folder (str) – Temporary folder used to cache intermediate results. With large datasets, the dask task graph can grow large enough to crash the cluster, so results are cached at intermediate stages.
subtype (str, default=None) – Claim subtype (required for TAF datasets)
logger_name (str, default=__file__) – Logger name
- Return type:
Tuple[Union[max_file.MAXFile, taf_file.TAFFile], pd.DataFrame]
- Raises:
ValueError – Raised when the subtype parameter is missing for a TAFFile claim input
Examples
>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     filter_claim_files,
... )
>>> from medicaid_utils.preprocessing import max_ip
>>> claim = max_ip.MAXIP(
...     2012, 'AL', '/data/cms/')
>>> filtered_claim, stats = filter_claim_files(
...     claim,
...     {'ip': {'missing_dob': 0}},
...     '/tmp/cache')
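The role of `tmp_folder` can be pictured as checkpointing: each stage's result is written to disk and read back, so the next stage starts from stored data instead of an ever-growing chain of pending operations. The sketch below is a plain-Python analogy using JSON files. It does not use dask and is not how the library stores its cache; `checkpoint` and the stage names are hypothetical.

```python
import json
import os
import tempfile


def checkpoint(records, tmp_folder, stage_name):
    """Write intermediate records to tmp_folder and reload them from disk."""
    os.makedirs(tmp_folder, exist_ok=True)
    path = os.path.join(tmp_folder, f"{stage_name}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    # Reading the result back means later stages depend only on the stored file.
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp_folder:
    claims = [
        {"missing_dob": 0, "age": 10},
        {"missing_dob": 1, "age": 12},
        {"missing_dob": 0, "age": 40},
    ]
    # Stage 1: column filter, then cache the result.
    stage_1 = checkpoint(
        [c for c in claims if c["missing_dob"] == 0], tmp_folder, "missing_dob"
    )
    # Stage 2: inclusive age range applied to the cached stage 1 result.
    stage_2 = checkpoint(
        [c for c in stage_1 if 0 <= c["age"] <= 18], tmp_folder, "age_range"
    )

print(stage_2)  # [{'missing_dob': 0, 'age': 10}]
```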