Cohort Extraction¶
The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.
Overview¶
A typical cohort extraction workflow:
Define diagnosis/procedure codes that identify your condition of interest
Define filters for inclusion criteria (age, gender, enrollment)
Call
extract_cohort()Receive exported claim files for the matching cohort
The extract_cohort Function¶
from medicaid_utils.filters.patients.cohort_extraction import extract_cohort
extract_cohort(
state="AL",
lst_year=[2016, 2017, 2018],
dct_diag_proc_codes=dct_codes,
dct_filters=dct_filters,
lst_types_to_export=["ip", "ot", "ps"],
dct_data_paths=dct_paths,
cms_format="TAF",
)
This function:
Loads IP, OT, PS (and optionally RX) claims for each year
Applies cohort filters to narrow the population
Identifies patients with the specified diagnoses/procedures
Exports filtered claim files for the matching cohort
Defining Diagnosis and Procedure Codes¶
The dct_diag_proc_codes dictionary supports three types of criteria:
Diagnosis Codes¶
Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic:
dct_codes = {
"diag_codes": {
"diabetes_t2": {
"incl": {
9: ["250"], # ICD-9 prefix
10: ["E11"], # ICD-10 prefix
},
"excl": {
9: ["25001", "25003", "25011", "25013",
"25021", "25023", "25031", "25033",
"25041", "25043", "25051", "25053",
"25061", "25063", "25071", "25073",
"25081", "25083", "25091", "25093"],
10: ["E10"], # Exclude Type 1
},
},
},
"proc_codes": {},
}
Codes are matched using prefix matching – "250" matches "2500", "25000",
"25002", etc. Exclusion codes take precedence over inclusion codes.
Procedure Codes¶
Procedure codes are keyed by procedure coding system:
dct_codes = {
"diag_codes": {},
"proc_codes": {
"methadone": {
7: [ # ICD-10-PCS (system code 7)
"HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
"HZ91ZZZ", "HZ94ZZZ", "HZ95ZZZ", "HZ96ZZZ",
],
},
},
}
Common procedure system codes:
1– CPT/HCPCS6– ICD-9-CM procedure7– ICD-10-PCS
Column Value Conditions¶
You can also flag patients based on specific column values:
dct_codes = {
"diag_codes": {},
"proc_codes": {},
"column_values": {
"diag_delivery": {
"RCPNT_DLVRY_CD": [1], # Delivery recipient code = 1
},
},
}
Defining Filters¶
Filters control which claims and patients are included. The filter dictionary has two
sections: cohort (applied before cohort identification) and export (applied when
exporting claim files).
dct_filters = {
"cohort": {
"ip": {
"missing_dob": 0, # Exclude missing DOB
"range_numeric_age_prncpl_proc": (18, 64), # Age 18-64
},
"ot": {
"missing_dob": 0,
"range_numeric_age_srvc_bgn": (18, 64),
},
},
"export": {
"ip": {
"missing_dob": 0,
},
},
}
Filter Types¶
Range filters restrict to a numeric or date range:
# Numeric range (inclusive on both ends)
"range_numeric_age_srvc_bgn": (18, 64)
# Date range (YYYYMMDD format)
"range_date_srvc_bgn_date": ("20160101", "20181231")
Exclusion filters remove patients with a positive exclusion flag:
# Exclude female patients (excl_female column = 1)
"excl_female": 1
Column value filters restrict to rows with a specific value:
# Keep only claims where missing_dob = 0
"missing_dob": 0
Data Paths¶
dct_paths = {
"source_root": "/data/cms/", # Where raw claim files live
"export_folder": "/output/my_cohort/", # Where to write results
"tmp_folder": "/tmp/cohort_cache/", # Optional intermediate cache
}
Output Files¶
After extraction, the export folder contains:
cohort_{STATE}.csv– patient-level file with condition flags, inclusion indicator, and date of birthcohort_{STATE}_{YEAR}.csv– year-specific patient filecohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet– filter statistics showing how many claims were removed at each stepExported claim files in the requested format (CSV or Parquet)
Cohort File Columns¶
The cohort_{STATE}_{YEAR}.csv file (indexed by BENE_MSIS) contains:
Column Pattern |
Description |
|---|---|
|
1 if patient included in final cohort, 0 if excluded |
|
Claim year |
|
State code |
|
Date of birth (merged from PS) |
|
1 if condition found in claim type (e.g., |
|
Date of first occurrence |
|
1 if procedure found in claim type |
|
Date of first occurrence |
|
1 if ANY diagnosis condition matched |
|
1 if ANY procedure condition matched |
Where {type} is the claim type (ip, ot, ot_line), {condition} comes
from your dct_diag_codes keys, and {procedure} comes from your dct_proc_codes keys.
Lower-Level Functions¶
For more control, you can use the building blocks directly:
Flagging Diagnoses and Procedures¶
from medicaid_utils.filters.claims import dx_and_proc
# Flag claims with specific diagnosis codes
df_flagged = dx_and_proc.flag_diagnoses_and_procedures(
dct_diag_codes={"diabetes_t2": {"incl": {10: ["E11"]}}},
dct_proc_codes={},
df_claims=ip.dct_files["base"],
cms_format="TAF",
)
# Get patient IDs with conditions
pdf_patients, dct_stats = dx_and_proc.get_patient_ids_with_conditions(
dct_diag_codes={"diabetes_t2": {"incl": {10: ["E11"]}}},
dct_proc_codes={},
cms_format="TAF",
ip=ip.dct_files["base"],
ot=ot.dct_files["base"],
)
Flagging Prescriptions¶
from medicaid_utils.filters.claims import rx as rx_filter
# Flag pharmacy claims by NDC codes
df_rx_flagged = rx_filter.flag_prescriptions(
dct_ndc_codes={
"metformin": ["00093727801", "00093727901"],
},
df_claims=rx_claims.dct_files["base"],
)
Filtering Claim Files¶
from medicaid_utils.filters.patients.cohort_extraction import filter_claim_files
claim, filter_stats = filter_claim_files(
claim=ip,
dct_claim_filters={"ip": {"missing_dob": 0}},
tmp_folder="/tmp/cache",
)