Expected Data Layout
The package expects Medicaid claim files stored as Parquet datasets, split by year and
state, and sorted by beneficiary ID (BENE_MSIS or MSIS_ID).
The folder hierarchy under your data_root must follow the structure below.
MAX Files
data_root/
    medicaid/
        {YEAR}/
            {STATE}/
                max/
                    ip/parquet/    # Inpatient claims
                    ot/parquet/    # Outpatient claims
                    ps/parquet/    # Person Summary
                    cc/parquet/    # Chronic Conditions
Example path: data_root/medicaid/2012/WY/max/ip/parquet/
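The path convention above can be sketched as a small helper. This is only an illustration of the layout, not part of the package's API: the names max_parquet_dir and MAX_FILE_TYPES are hypothetical.

```python
from pathlib import Path

# The four MAX file types listed in the layout above.
MAX_FILE_TYPES = ("ip", "ot", "ps", "cc")


def max_parquet_dir(data_root, year, state, file_type):
    """Return the expected Parquet location for one MAX file type.

    Hypothetical helper: it only mirrors the documented folder hierarchy.
    """
    if file_type not in MAX_FILE_TYPES:
        raise ValueError(f"unknown MAX file type: {file_type!r}")
    return (
        Path(data_root) / "medicaid" / str(year) / state / "max" / file_type / "parquet"
    )


# Reproduces the example path from the text (without the trailing slash).
print(max_parquet_dir("data_root", 2012, "WY", "ip").as_posix())
```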
TAF Files
Each TAF file type is split into multiple subtype files:
data_root/
    medicaid/
        {YEAR}/
            {STATE}/
                taf/
                    ip/                       # Inpatient
                        iph/parquet/          # Header (base)
                        ipl/parquet/          # Line
                        ipoccr/parquet/       # Occurrence codes
                        ipdx/parquet/         # Diagnosis codes
                        ipndc/parquet/        # NDC codes
                    ot/                       # Outpatient
                        oth/parquet/
                        otl/parquet/
                        otoccr/parquet/
                        otdx/parquet/
                        otndc/parquet/
                    lt/                       # Long-Term Care
                        lth/parquet/
                        ltl/parquet/
                        ltoccr/parquet/
                        ltdx/parquet/
                        ltndc/parquet/
                    rx/                       # Pharmacy
                        rxh/parquet/          # Header (base)
                        rxl/parquet/          # Line
                        rxndc/parquet/        # NDC codes
                    de/                       # Demographics/Eligibility
                        debse/parquet/        # Base demographics
                        dedts/parquet/        # Dates
                        demc/parquet/         # Managed care
                        dedsb/parquet/        # Disability
                        demfp/parquet/        # Money Follows the Person
                        dewvr/parquet/        # Waiver
                        dehsp/parquet/        # Home health/SPF
                        dedxndc/parquet/      # Diagnosis & NDC codes
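To check a TAF tree before running anything against it, the subtype listing above can be written down as a lookup table. The sketch below assumes each subtype folder sits inside its file-type folder (for example taf/rx/rxh/parquet/), which is how the listing reads. TAF_SUBTYPES, taf_parquet_dirs and missing_taf_dirs are hypothetical names, not package functions.

```python
from pathlib import Path

# Subtype folders per TAF file type, copied from the listing above.
TAF_SUBTYPES = {
    "ip": ("iph", "ipl", "ipoccr", "ipdx", "ipndc"),
    "ot": ("oth", "otl", "otoccr", "otdx", "otndc"),
    "lt": ("lth", "ltl", "ltoccr", "ltdx", "ltndc"),
    "rx": ("rxh", "rxl", "rxndc"),
    "de": ("debse", "dedts", "demc", "dedsb", "demfp", "dewvr", "dehsp", "dedxndc"),
}


def taf_parquet_dirs(data_root, year, state, file_type):
    """Map each subtype of one TAF file type to its expected Parquet location."""
    base = Path(data_root) / "medicaid" / str(year) / state / "taf" / file_type
    return {sub: base / sub / "parquet" for sub in TAF_SUBTYPES[file_type]}


def missing_taf_dirs(data_root, year, state, file_type):
    """Return the subtypes whose expected location does not exist on disk."""
    dirs = taf_parquet_dirs(data_root, year, state, file_type)
    return [sub for sub, path in dirs.items() if not path.exists()]
```

Calling missing_taf_dirs(data_root, 2019, "WY", "ip") for a given year and state returns an empty list when all five inpatient subtypes are in place.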
Each Parquet dataset can be a single file or a directory of partitioned Parquet files. Files must be pre-sorted by beneficiary ID to enable efficient partition-level operations.
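Because a dataset location may be either a single file or a directory of part files, code that inspects the layout has to handle both cases. A minimal sketch using only the standard library follows. The helper name parquet_files is hypothetical, and the ".parquet" suffix on part files is an assumption about how the files were written.

```python
from pathlib import Path


def parquet_files(dataset_path):
    """List the Parquet file(s) behind a dataset location.

    Handles both layouts described above: a single file, or a directory
    of partitioned files.
    """
    path = Path(dataset_path)
    if path.is_file():
        return [path]
    if path.is_dir():
        # Sorted for a stable order; assumes part files end in ".parquet".
        return sorted(path.glob("*.parquet"))
    raise FileNotFoundError(f"no Parquet dataset at {path}")
```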
Preparing Your Data
If your raw CMS data is in SAS or CSV format, you will need to convert it to Parquet and organize it into the folder structure above. Key points:
Sort by beneficiary ID before writing to Parquet. This enables efficient partition-level joins and lookups.
Split by year and state into separate directories.
Use pyarrow as the Parquet engine for best compatibility.