medicaid-utils: Python Toolkit for Medicaid Claims Data Analysis

PyPI version · Python 3.11+ · CI · License: MIT · Documentation

medicaid-utils is an open-source Python toolkit for constructing patient-level analytic files from Medicaid claims data. It implements validated cleaning routines, variable construction methods, and public-domain clinical algorithms for both MAX (Medicaid Analytic eXtract) and TAF (T-MSIS Analytic Files, derived from the Transformed Medicaid Statistical Information System) file formats published by the Centers for Medicare & Medicaid Services (CMS).

Built on Dask for scalable, distributed processing of large-scale claims datasets — from single-state analyses to multi-state observational studies.

pip install medicaid-utils

Why medicaid-utils?

Working with Medicaid claims data involves repetitive, error-prone preprocessing that every research team reimplements from scratch: cleaning diagnosis codes, constructing enrollment windows, applying risk adjustment algorithms, and building cohorts from millions of claims. medicaid-utils packages these validated routines so researchers can focus on their study design rather than data plumbing.

  • Dual-format support — seamless handling of both MAX (ICD-9 era) and TAF (ICD-10 era) Medicaid claims data

  • Validated preprocessing — standardized cleaning, deduplication, and variable construction for inpatient, outpatient, pharmacy, long-term care, and person summary files

  • 8 clinical algorithms — Elixhauser comorbidity scoring, CDPS-Rx risk adjustment, BETOS (Berenson-Eggers Type of Service) classification, ED and IP Prevention Quality Indicators (PQI), NYU/Billings ED classification, PMCA (Pediatric Medical Complexity Algorithm), and low-value care measures

  • Flexible cohort extraction — filter patients by diagnosis codes, procedure codes, prescriptions, and demographic criteria across claim types

  • Scalable — Dask-based distributed computing handles state-level and multi-state datasets on laptops, workstations, and HPC clusters
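To illustrate the kind of routine the preprocessing modules standardize, here is a hand-rolled sketch (not medicaid-utils' actual implementation) of one common step: normalizing raw diagnosis codes, which often arrive with inconsistent casing, stray whitespace, and embedded decimal points.

```python
def clean_diag_code(code):
    """Normalize a raw ICD diagnosis code: strip whitespace and the
    decimal point, and uppercase. Returns "" for missing values.
    Illustrative only -- not the toolkit's actual cleaning routine."""
    if code is None:
        return ""
    return code.strip().replace(".", "").upper()

raw_codes = [" e11.9", "250.00 ", None]
cleaned = [clean_diag_code(c) for c in raw_codes]
# " e11.9" becomes "E119"; "250.00 " becomes "25000"; None becomes ""
```

The toolkit applies steps like this, plus deduplication and variable construction, uniformly across claim types so each team does not have to reinvent them.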

Getting Started

Load and clean inpatient claims in three lines:

from medicaid_utils.preprocessing import max_ip

ip = max_ip.MAXIP(year=2012, state="WY", data_root="/path/to/data")
df_ip = ip.df  # cleaned Dask DataFrame, ready for analysis

Extract a Type 2 diabetes cohort:

from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

extract_cohort(
    state="WY", lst_year=[2012],
    dct_diag_proc_codes={
        "diag_codes": {"diabetes_t2": {"incl": {9: ["250"], 10: ["E11"]}}},
        "proc_codes": {},
    },
    dct_filters={"cohort": {"ip": {"missing_dob": 0}}, "export": {}},
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths={"source_root": "/data", "export_folder": "/output/"},
    cms_format="MAX",
)

Apply Elixhauser comorbidity scoring (requires constructing LST_DIAG_CD first — see MAX vs TAF: CMS File Formats):

from medicaid_utils.adapted_algorithms.py_elixhauser.elixhauser_comorbidity import score

df_scored = score(ip.df, lst_diag_col_name="LST_DIAG_CD", cms_format="MAX")
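LST_DIAG_CD is a column holding each claim's diagnosis codes as a single list. A pandas-style sketch of constructing it from positional diagnosis columns (the DIAG_CD_1..DIAG_CD_3 names are illustrative; consult the MAX vs TAF documentation for the actual column layout):

```python
import pandas as pd

df = pd.DataFrame({
    "DIAG_CD_1": ["25000", "4019"],
    "DIAG_CD_2": ["4019", None],
    "DIAG_CD_3": [None, None],
})

# Collapse the positional columns into one list column,
# dropping missing positions.
diag_cols = ["DIAG_CD_1", "DIAG_CD_2", "DIAG_CD_3"]
df["LST_DIAG_CD"] = df[diag_cols].apply(
    lambda row: [cd for cd in row if pd.notna(cd)], axis=1
)
```

The same row-wise pattern applies to a Dask DataFrame; once the column exists, it can be passed to score() as shown above.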

User Guide
