medicaid_utils.filters.patients package

Submodules

medicaid_utils.filters.patients.cohort_extraction module

This module provides functions that extract study cohorts based on multiple filters.

medicaid_utils.filters.patients.cohort_extraction.apply_range_filter(tpl_range: tuple, df: DataFrame, filter_name: str, col_name: str, data_type: str, logger_name: str = __file__) → DataFrame

Applies a date or numeric range filter to a dataframe

Parameters:
  • tpl_range (tuple) – Tuple of (lower bound, upper bound)

  • df (dd.DataFrame) – Dataframe to be filtered

  • filter_name (str) – Name of the filter. Should be of the format range_[datatype]_[col_name]. Date and numeric range filters are currently supported

  • col_name (str) – Name of column

  • data_type (str) – Datatype of the column, e.g. date or int

  • logger_name (str, default=__file__) – Logger name

Return type:

dd.DataFrame

Examples

Apply a numeric range filter:

>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     apply_range_filter,
... )
>>> import pandas as pd
>>> df = pd.DataFrame({'age': [5, 12, 20, 35]})
>>> result = apply_range_filter(
...     (0, 18), df, 'range_numeric_age', 'age', 'int')
>>> result['age'].tolist()
[5, 12]

Apply a date range filter:

>>> df_dates = pd.DataFrame({
...     'service_date': pd.to_datetime(
...         ['20200101', '20200601', '20210101'])
... })
>>> result = apply_range_filter(
...     ('20200101', '20200630'), df_dates,
...     'range_date_service_date', 'service_date', 'date')
>>> len(result)
2
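
The range semantics can also be illustrated without the library. What follows is a minimal pandas-only sketch, not the package's implementation: sketch_range_filter is a hypothetical helper, and it assumes that both bounds are inclusive and that date bounds are given as YYYYMMDD strings, as in the examples above.

```python
import pandas as pd


def sketch_range_filter(tpl_range, df, col_name, data_type):
    """Hypothetical stand-in for a range filter; bounds assumed inclusive."""
    lower, upper = tpl_range
    if data_type == "date":
        # Assumption: date bounds arrive as YYYYMMDD strings.
        lower = pd.to_datetime(lower, format="%Y%m%d")
        upper = pd.to_datetime(upper, format="%Y%m%d")
    # Series.between includes both bounds by default.
    return df.loc[df[col_name].between(lower, upper)]


df_age = pd.DataFrame({"age": [5, 12, 18, 35]})
ages = sketch_range_filter((0, 18), df_age, "age", "int")["age"].tolist()

df_dates = pd.DataFrame(
    {
        "service_date": pd.to_datetime(
            ["20200101", "20200601", "20210101"], format="%Y%m%d"
        )
    }
)
n_dates = len(
    sketch_range_filter(
        ("20200101", "20200630"), df_dates, "service_date", "date"
    )
)
```

Under these assumptions, ages keeps the boundary value 18, and n_dates counts the two dates that fall in the first half of 2020.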

medicaid_utils.filters.patients.cohort_extraction.export_cohort_datasets(df_cohort: DataFrame, year: int, state: str, lst_types_to_export: List[str], dct_export_filters: dict, dct_data_paths: dict, cms_format: str = 'MAX', clean_exports: bool = False, preprocess_exports: bool = False, export_format: str = 'csv', logger_name: str = __file__) → None

Exports MAX files corresponding to the cohort as defined by the filters input to this function

Parameters:
  • df_cohort (dd.DataFrame) – Dataframe with patient IDs (BENE_MSIS) and an indicator flag denoting inclusion in the cohort (include=1)

  • year (int) – Year of the claim files

  • state (str) – State

  • lst_types_to_export (list of str) – List of file types to export. Supported types are [ip, ot, ps, rx]

  • dct_export_filters (dict) –

    Additional filters that should be applied to the raw claims of the selected cohort while exporting. Filter dictionary should be of the format:

    {claim_type_1: {range_[datatype]_[col_name]: (start, end),
                    excl_[col_name]: [0/1],
                    [col_name]: value,
                    ..},
     claim_type_2: ...}
    

    Date and numeric range filters are currently supported. Filter names beginning with excl_ and set to 1 exclude benes that have a positive value for that exclusion flag. Filter names that are plain column names restrict the result to benes with the filter value in the corresponding column. E.g.:

    {'ip': {'range_numeric_age_prncpl_proc': (0, 18),
            'missing_dob': 0,
            'excl_female': 1},
     'ot': {'range_numeric_age_srvc_bgn': (0, 18),
            'missing_dob': 0,
            'excl_female': 1}
    }
    

    The example filter excludes all IP and OT claims of female benes, as well as claims with a missing DOB. The result is further restricted to claims of benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).

  • dct_data_paths (dict) –

    Dictionary with information on raw claim files root folder and export folder. Should be of the format,

    {'source_root': /path/to/medicaid/folder,
     'export_folder': /path/to/export/data}
    

  • cms_format ({'MAX', 'TAF'}) – CMS file format.

  • clean_exports (bool, default=False) – Should the exported datasets be cleaned?

  • preprocess_exports (bool, default=False) – Should the exported datasets be preprocessed?

  • export_format (str, default='csv') – Format of exported files

  • logger_name (str, default=__file__) – Logger name

Raises:
  • FileNotFoundError – Raised when any of the requested file types does not exist for the state and year

  • OSError – Raised when the cohort file index does not match the claim file index

Examples

>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     export_cohort_datasets,
... )
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf_cohort = pd.DataFrame({
...     'MSIS_ID': ['A', 'B'],
...     'include': [1, 1],
... }).set_index('MSIS_ID')
>>> ddf_cohort = dd.from_pandas(pdf_cohort, npartitions=1)
>>> dct_paths = {
...     'source_root': '/data/cms/',
...     'export_folder': '/output/cohort/',
... }
>>> export_cohort_datasets(
...     ddf_cohort, 2012, 'AL',
...     ['ip', 'ps'], {}, dct_paths,
...     cms_format='MAX')
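
To make the three filter kinds described for dct_export_filters concrete, here is a minimal pandas sketch. It is an illustration only, not the library's code: apply_filters_sketch is a hypothetical helper, it handles numeric ranges only, and it assumes that each exclusion flag lives in a column named exactly like the filter (for example excl_female).

```python
import pandas as pd


def apply_filters_sketch(df, dct_filters):
    """Hypothetical illustration of range, exclusion and column filters."""
    for name, value in dct_filters.items():
        if name.startswith("range_"):
            # range_[datatype]_[col_name]: keep rows within inclusive bounds.
            _, _data_type, col_name = name.split("_", 2)
            df = df.loc[df[col_name].between(value[0], value[1])]
        elif name.startswith("excl_"):
            # excl_[col_name] set to 1: drop rows whose flag is positive.
            # Assumption: the flag column shares the filter's name.
            if value == 1:
                df = df.loc[~(df[name] > 0)]
        else:
            # Plain column name: keep rows equal to the filter value.
            df = df.loc[df[name] == value]
    return df


df_ip = pd.DataFrame(
    {
        "age_prncpl_proc": [4, 17, 18, 30, 10],
        "missing_dob": [0, 0, 0, 0, 1],
        "excl_female": [0, 1, 0, 0, 0],
    }
)
dct_ip_filters = {
    "range_numeric_age_prncpl_proc": (0, 18),
    "missing_dob": 0,
    "excl_female": 1,
}
kept = apply_filters_sketch(df_ip, dct_ip_filters)
kept_ages = kept["age_prncpl_proc"].tolist()
```

In this toy frame only the rows aged 4 and 18 survive: the others are dropped by the exclusion flag, the age range, or the missing DOB restriction.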

medicaid_utils.filters.patients.cohort_extraction.extract_cohort(state: str, lst_year: List[int], dct_diag_proc_codes: dict, dct_filters: dict, lst_types_to_export: List[str], dct_data_paths: dict, restrict_dx_proc_to_ip: bool = False, cms_format: str = 'MAX', clean_exports: bool = True, preprocess_exports: bool = True, export_format: str = 'csv', pq_engine: str = 'pyarrow', logger_name: str = __file__) → None

Extracts and exports the claim files corresponding to the cohort defined by the input filters

Parameters:
  • state (str) – State

  • lst_year (list of int) – List of years from which cohort should be created

  • dct_diag_proc_codes (dict) –

    Dictionary of diagnosis and procedure codes. Should be in the format

    {'diag_codes': {condition_name:
                        {['incl' / 'excl']: {[9/ 10]: list of codes}}},
     'proc_codes': {procedure_name:
                        {procedure_system_code: list of codes}},
     'column_values': {condition_or_procedure_name:
                        {column_name: list of numerical values}
                      }
    }
    

    Eg:

    {'diag_codes':
        {'oud_nqf':
            {'incl': {9: ['3040','3055']}}},
     'proc_codes':
        {'methadone_7':
            {7: ('HZ81ZZZ,HZ84ZZZ,HZ85ZZZ,HZ86ZZZ,HZ91ZZZ,HZ94ZZZ,'
                 'HZ95ZZZ,HZ96ZZZ').split(",")}
        },
     'column_values':
        {'diag_delivery':
            {'RCPNT_DLVRY_CD': [1]}
        }
    }
    

  • dct_filters (dict) –

    Filters to apply to the cohort, and the exported claim files. Filter dictionary should be of the format:

    {'cohort':
        {claim_type_1:
            {range_[datatype]_[col_name]: (start, end),
             excl_[col_name]: [0/1],
             [col_name]: value,
             ..}
        },
     'export':
        {claim_type_1:
            {range_[datatype]_[col_name]: (start, end),
             excl_[col_name]: [0/1],
             [col_name]: value,
             ..}
        }
    }
    

    Date and numeric range filters are currently supported. Filter names beginning with excl_ and set to 1 exclude benes that have a positive value for that exclusion flag. Filter names that are plain column names restrict the result to benes with the filter value in the corresponding column. E.g.:

    {'cohort':
        {'ip':
            {'range_numeric_age_prncpl_proc': (0, 18),
             'missing_dob': 0,
             'excl_female': 1
            },
         'ot':
            {'range_numeric_age_srvc_bgn': (0, 18),
             'missing_dob': 0,
             'excl_female': 1
            }
        }
    }
    

    The example filter excludes from the cohort all IP and OT claims of female benes, as well as claims with a missing DOB. The result is further restricted to claims of benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).

  • lst_types_to_export (list of str) – List of types to export. Currently supported types are ip, ot, rx, ps.

  • dct_data_paths (dict) –

    Dictionary with information on raw claim files root folder and export folder. Should be of the format,

    {'source_root': /path/to/medicaid/folder,
     'export_folder': /path/to/export/data}
    

  • restrict_dx_proc_to_ip (bool, default=False) – Apply the diagnosis and procedure code filters to the IP file only

  • cms_format ({'MAX', 'TAF'}) – CMS file format.

  • clean_exports (bool, default=True) – Should the exported datasets be cleaned?

  • preprocess_exports (bool, default=True) – Should the exported datasets be preprocessed?

  • export_format (str, default='csv') – Format of exported files

  • logger_name (str, default=__file__) – Logger name

Raises:

FileNotFoundError – Raised when any of the requested file types does not exist for the state and year

Examples

>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     extract_cohort,
... )
>>> dct_codes = {
...     'diag_codes': {'oud': {'incl': {9: ['3040', '3055']}}},
...     'proc_codes': {},
... }
>>> dct_filters = {
...     'cohort': {'ip': {'missing_dob': 0}},
...     'export': {},
... }
>>> dct_paths = {
...     'source_root': '/data/cms/',
...     'export_folder': '/output/cohort/',
... }
>>> extract_cohort(
...     'AL', [2012], dct_codes, dct_filters,
...     ['ip', 'ps'], dct_paths,
...     cms_format='MAX')
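
Because dct_diag_proc_codes is deeply nested, a structural check before calling extract_cohort can catch typos early. The sketch below is an illustration under stated assumptions, not part of the package: check_codes_dict is a hypothetical helper, and it assumes the 9/10 keys denote ICD versions and that every code collection is a Python list.

```python
def check_codes_dict(dct_codes):
    """Hypothetical structural check; returns a list of problems found."""
    problems = []
    allowed = {"diag_codes", "proc_codes", "column_values"}
    for key in dct_codes:
        if key not in allowed:
            problems.append(f"unexpected top-level key: {key!r}")

    # diag_codes: condition -> 'incl'/'excl' -> version (9 or 10) -> codes
    for condition, dct_rule in dct_codes.get("diag_codes", {}).items():
        for rule_type, dct_versions in dct_rule.items():
            if rule_type not in ("incl", "excl"):
                problems.append(f"{condition}: bad rule type {rule_type!r}")
                continue
            for version, codes in dct_versions.items():
                if version not in (9, 10):
                    problems.append(f"{condition}: bad version {version!r}")
                if not isinstance(codes, list):
                    problems.append(f"{condition}: codes must be a list")

    # proc_codes: procedure -> procedure system code -> codes
    for procedure, dct_systems in dct_codes.get("proc_codes", {}).items():
        for _system, codes in dct_systems.items():
            if not isinstance(codes, list):
                problems.append(f"{procedure}: codes must be a list")

    # column_values: name -> column -> list of numerical values
    for name, dct_cols in dct_codes.get("column_values", {}).items():
        for column, values in dct_cols.items():
            if not isinstance(values, list):
                problems.append(f"{name}.{column}: values must be a list")
    return problems


good = {
    "diag_codes": {"oud_nqf": {"incl": {9: ["3040", "3055"]}}},
    "proc_codes": {"methadone_7": {7: "HZ81ZZZ,HZ84ZZZ".split(",")}},
    "column_values": {"diag_delivery": {"RCPNT_DLVRY_CD": [1]}},
}
bad = {
    "diag_codes": {"oud": {"include": {9: ["3040"]}}},
    "proc_codes": {"methadone_7": {7: "HZ81ZZZ"}},
}
good_problems = check_codes_dict(good)
bad_problems = check_codes_dict(bad)
```

Here good passes cleanly, while bad is flagged twice: once for the misspelled 'include' key and once for a procedure code given as a bare string instead of a list.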

medicaid_utils.filters.patients.cohort_extraction.filter_claim_files(claim: MAXFile | TAFFile, dct_claim_filters: dict, tmp_folder: str, subtype: str | None = None, logger_name: str = __file__) → Tuple[MAXFile | TAFFile, DataFrame]

Filters claim files

Parameters:
  • claim (Union[max_file.MAXFile, taf_file.TAFFile]) – Claim object

  • dct_claim_filters (dict) –

    Filters to apply. Filter dictionary should be of the format:

    {claim_type_1:
        {range_[datatype]_[col_name]: (start, end),
         excl_[col_name]: [0/1],
         [col_name]: value,
         ..},
     claim_type_2: ...
    }
    

    Date and numeric range filters are currently supported. Filter names beginning with excl_ and set to 1 exclude benes that have a positive value for that exclusion flag. Filter names that are plain column names restrict the result to benes with the filter value in the corresponding column. E.g.:

    {'ip':
        {'range_numeric_age_prncpl_proc': (0, 18),
         'missing_dob': 0,
         'excl_female': 1},
     'ot':
        {'range_numeric_age_srvc_bgn': (0, 18),
         'missing_dob': 0,
         'excl_female': 1
         }
    }
    

    The example filter excludes all IP and OT claims of female benes, as well as claims with a missing DOB. The result is further restricted to claims of benes aged 0 to 18 (inclusive of both 0 and 18) as of the principal procedure date (IP) or the service begin date (OT).

  • tmp_folder (str) – Temporary folder used to cache intermediate results. With large datasets the dask task graph can grow large enough to crash the cluster; caching results at intermediate stages keeps the graph small.

  • subtype (str, default=None) – Claim subtype (required for TAF datasets)

  • logger_name (str, default=__file__) – Logger name

Return type:

Tuple[Union[max_file.MAXFile, taf_file.TAFFile], pd.DataFrame]

Raises:

ValueError – Raised when the subtype parameter is missing for a TAFFile claim input

Examples

>>> from medicaid_utils.filters.patients.cohort_extraction import (
...     filter_claim_files,
... )
>>> from medicaid_utils.preprocessing import max_ip
>>> claim = max_ip.MAXIP(
...     2012, 'AL', '/data/cms/')
>>> filtered_claim, stats = filter_claim_files(
...     claim,
...     {'ip': {'missing_dob': 0}},
...     '/tmp/cache')