In [13]:
#pip install textstat pdfplumber python-docx kaleido

Readability Analysis

Overview

This analysis scores the reading difficulty of member-facing privacy documents from five major private insurers operating in Georgia. Three validated readability metrics are applied to each document to assess whether members can meaningfully understand and act on the consent terms they are presented with.

Metrics

Flesch-Kincaid Grade Level

Translates readability into a US school grade level using average sentence length and average syllables per word. A score of 8 means the document requires an 8th grade reading level. The American Medical Association recommends patient health materials be written at or below a 6th grade reading level.
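
As a rough illustration of how the grade level follows from those two averages, the standard Flesch-Kincaid formula can be sketched as below. The function name and the example counts are hypothetical. textstat does its own sentence and syllable counting, so its scores will not match a hand calculation exactly.

```python
def fk_grade_from_counts(words, sentences, syllables):
    """Standard Flesch-Kincaid grade formula, computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
fk_example = fk_grade_from_counts(words=100, sentences=5, syllables=150)
print(round(fk_example, 2))  # 9.91
```

Longer sentences and longer words both push the grade up, which is why dense legal phrasing scores so high.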

Flesch Reading Ease

A 0–100 scale where higher scores indicate easier reading. Documents scoring below 30 are considered very difficult and typically require a college education to comprehend. Scores of 60–70 are considered standard and accessible to most adults. The metric has historically been used in state insurance regulation to set minimum readability requirements for policy documents.
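
The score uses the same two averages as Flesch-Kincaid, with different weights and the opposite direction. A minimal sketch of the standard English formula, using a hypothetical function name and hypothetical counts:

```python
def reading_ease_from_counts(words, sentences, syllables):
    """Standard (English) Flesch Reading Ease formula, computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
ease_example = reading_ease_from_counts(words=100, sentences=5, syllables=150)
print(round(ease_example, 3))  # 59.635
```

The formula is not clamped, so very dense text can score below 0 and very simple text above 100.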

SMOG Index

The Simple Measure of Gobbledygook (McLaughlin, 1969) estimates the years of education required to understand a document from the density of polysyllabic words, meaning words with three or more syllables. It is often considered better suited to health materials than Flesch-Kincaid and is widely used in health literacy research.
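
The commonly cited form of the SMOG formula can be sketched as follows. The function name and the counts are hypothetical. The formula was defined on a 30-sentence sample, and the `30 / sentences` factor rescales texts of other lengths.

```python
import math


def smog_from_counts(polysyllables, sentences):
    """Commonly cited SMOG formula: polysyllable count normalized to 30 sentences."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical sample: 30 sentences containing 49 words of three or more syllables.
smog_example = smog_from_counts(polysyllables=49, sentences=30)
print(round(smog_example, 4))  # 10.4301
```

Unlike the two Flesch measures, sentence length enters only through the normalization, so the score is driven almost entirely by vocabulary.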

Why Readability Matters for This Population

One in five US adults may find dense consent documents difficult to read, which makes truly informed consent a challenge (Kutner et al., 2006). Lower health literacy is more prevalent among racial and ethnic minorities, older adults, people with low educational attainment, and people with chronic illness (Berkman et al., 2011).

A study of informed-consent form text from the IRBs of 114 US medical schools found a mean Flesch-Kincaid score of grade 10.6 (Paasche-Orlow et al., 2003). That is 4.6 grade levels above the 6th grade level the AMA recommends.

People navigating insurance consent during periods of housing instability, reentry from incarceration, or recovery from substance use disorder face compounded barriers. They are disproportionately likely to have lower health literacy, less stable access to the internet or printing, and less capacity to engage in multi-step bureaucratic opt-out processes. For this population, a postgraduate-level consent document is not an inconvenience. It is a structural barrier to meaningful consent.

Key Findings

All ten documents in the Georgia primary case exceed the AMA recommended 6th grade threshold. Scores range from grade 6.7 (UnitedHealthcare HIPAA Notice) to grade 18.4 (UnitedHealthcare Online Privacy Policy).

The documents governing the broadest data collection and offering the least member control are consistently the hardest to read. The two most extreme examples belong to the same insurer. UnitedHealthcare's HIPAA notice is the most readable document in the dataset at grade 6.7 while its online privacy policy is the hardest at grade 18.4. The HIPAA notice covers the narrowest data practices. The online privacy policy covers location data, behavioral tracking, device identifiers, and third-party data combining with no unified opt-out.

The Cigna Gramm-Leach-Bliley notice scores grade 16.5. It is the only document in the dataset that explicitly states members cannot limit data sharing. The hardest documents to read contain the most harmful terms.

The Anthem BCBS Spanish notice scores grade 12.4, harder than the English version at grade 9.9. Spanish-speaking members therefore receive a harder document than English-speaking members, even though both groups are governed by identical terms.
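
To make the size of the gap concrete, the sketch below takes only the five Flesch-Kincaid grades quoted in the findings above and computes how far each sits above the AMA threshold. The dictionary and variable names are illustrative. The other five documents are not quoted in the text and are left out.

```python
AMA_GRADE = 6  # AMA-recommended maximum grade level, as cited above

# Flesch-Kincaid grades quoted in the findings (five of the ten documents).
quoted_scores = {
    "UnitedHealthcare HIPAA Notice": 6.7,
    "Anthem BCBS English notice": 9.9,
    "Anthem BCBS Spanish notice": 12.4,
    "Cigna Gramm-Leach-Bliley notice": 16.5,
    "UnitedHealthcare Online Privacy Policy": 18.4,
}

# Grade levels above the threshold, rounded to avoid floating-point noise.
gaps = {name: round(grade - AMA_GRADE, 1) for name, grade in quoted_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} grade levels above the threshold")
```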

Methods

Text was extracted from PDF documents using pdfplumber and from Word documents using python-docx. Readability scores were computed using the textstat Python library. Each document was scored in full without section-level filtering. All runs are timestamped to enable reproducibility and comparison across document versions.

See scripts/readability_scoring.py for the full scoring script.
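
The Spanish notice is scored with the same textstat calls as the English documents, and textstat's defaults are English. One option, sketched here and not part of the scoring script, is to tell textstat the document language before scoring. This assumes the installed textstat exposes `set_lang`, as the 0.7 series does. `language_for` and `score_in_language` are hypothetical helpers, and the "(Spanish)" suffix check is an assumed convention based on the `doc_type` labels used in this notebook.

```python
def language_for(doc):
    """Guess a language code from doc_type (assumed convention: a '(Spanish)' suffix)."""
    return "es" if "(Spanish)" in doc["doc_type"] else "en_US"


def score_in_language(text, lang):
    """Score text after telling textstat which language it is written in."""
    import textstat  # imported lazily so language_for() works without textstat

    textstat.set_lang(lang)  # assumed available, as in the textstat 0.7 series
    try:
        return {
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        }
    finally:
        textstat.set_lang("en_US")  # restore the default for later documents


print(language_for({"doc_type": "HIPAA Notice (Spanish)"}))  # es
```

Scores produced this way would not be directly comparable with English-formula scores, so any cross-language comparison would need to state which setting was used.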

In [2]:
### Readability Scoring
In [3]:
import pdfplumber
import textstat
import csv
import os
from docx import Document
from datetime import datetime
In [4]:
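def extract_text_from_pdf(path):
    """Return the text of every page in a PDF, joined by spaces."""
    with pdfplumber.open(path) as pdf:
        # Extract each page once. Pages with no text layer yield None or "",
        # so they are skipped.
        page_texts = (page.extract_text() for page in pdf.pages)
        return " ".join(text for text in page_texts if text)


def extract_text_from_docx(path):
    """Return the text of every non-empty paragraph in a Word document."""
    doc = Document(path)
    # doc.paragraphs covers body paragraphs only, not text inside tables.
    return " ".join(para.text for para in doc.paragraphs if para.text)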
In [5]:
documents = [
    {"insurer": "Aetna", "state": "Georgia", "doc_type": "Web Privacy Policy", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\Aetna privacy.docx"},
    {"insurer": "Anthem BCBS", "state": "Georgia", "doc_type": "HIPAA Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\anthem BCBS privacy practices.pdf"},
    {"insurer": "Anthem BCBS", "state": "Georgia", "doc_type": "HIPAA Notice (Spanish)", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\anthem BCBS privacy spanish.pdf"},
    {"insurer": "Cigna", "state": "Georgia", "doc_type": "Data Sharing Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\Cigna privacy data sharing.docx"},
    {"insurer": "Cigna", "state": "Georgia", "doc_type": "Global Health Benefits Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\cigna-global-health-benefits-privacy-notice-eng_copy.pdf"},
    {"insurer": "Cigna", "state": "Georgia", "doc_type": "HIPAA Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\cigna-health-care-and-cigna-supplemental-benefits-privacy-notice-eng_copy.pdf"},
    {"insurer": "Cigna", "state": "Georgia", "doc_type": "GLB Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\gramm-leach-bliley-act-privacy-notice_copy.pdf"},
    {"insurer": "Humana", "state": "Georgia", "doc_type": "HIPAA Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\humana privacy practices.pdf"},
    {"insurer": "UnitedHealthcare", "state": "Georgia", "doc_type": "Web Privacy Policy", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\UHC privacy.docx"},
    {"insurer": "UnitedHealthcare", "state": "Georgia", "doc_type": "HIPAA Notice", "path": r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\united hipaa privacy.pdf"},
]

run_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
results = []
In [6]:
for doc in documents:
    print(f"Scoring: {doc['insurer']} - {doc['doc_type']}")
    try:
        # Choose the extractor from the file extension (case-insensitive).
        path_lower = doc["path"].lower()
        if path_lower.endswith(".pdf"):
            text = extract_text_from_pdf(doc["path"])
        elif path_lower.endswith(".docx"):
            text = extract_text_from_docx(doc["path"])
        else:
            print(f"Unsupported file type: {doc['path']}")
            continue

        results.append({
            "run_timestamp": run_timestamp,
            "insurer": doc["insurer"],
            "state": doc["state"],
            "doc_type": doc["doc_type"],
            "word_count": textstat.lexicon_count(text),
            "flesch_kincaid_grade": round(textstat.flesch_kincaid_grade(text), 2),
            "flesch_reading_ease": round(textstat.flesch_reading_ease(text), 2),
            "smog_index": round(textstat.smog_index(text), 2),
            "path": doc["path"],
        })

    except Exception as e:
        print(f"Error processing {doc['path']}: {e}")

output_path = r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_scores.csv"
Scoring: Aetna - Web Privacy Policy
Scoring: Anthem BCBS - HIPAA Notice
Scoring: Anthem BCBS - HIPAA Notice (Spanish)
Scoring: Cigna - Data Sharing Notice
Scoring: Cigna - Global Health Benefits Notice
Scoring: Cigna - HIPAA Notice
Scoring: Cigna - GLB Notice
Scoring: Humana - HIPAA Notice
Scoring: UnitedHealthcare - Web Privacy Policy
Scoring: UnitedHealthcare - HIPAA Notice
In [7]:
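# Load rows from earlier runs so that each run appends to the same CSV.
existing_rows = []
if os.path.exists(output_path):
    with open(output_path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        existing_rows = list(reader)

# A row is identified by document plus run. Because run_timestamp is part of
# the key, rows are only skipped when this cell is re-run within the same run.
existing_keys = {
    (row["insurer"], row["state"], row["doc_type"], row["run_timestamp"])
    for row in existing_rows
}

new_rows = [
    r for r in results
    if (r["insurer"], r["state"], r["doc_type"], r["run_timestamp"])
    not in existing_keys
]

all_rows = existing_rows + new_rows

# Rewrite the whole file: earlier rows first, then the rows from this run.
with open(output_path, "w", newline="", encoding="utf-8") as f: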
    fieldnames = ["run_timestamp", "insurer", "state", "doc_type",
                  "word_count", "flesch_kincaid_grade", "flesch_reading_ease",
                  "smog_index", "path"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_rows)

print(f"\nRun timestamp: {run_timestamp}")
print(f"New rows added: {len(new_rows)}")
print(f"Total rows in file: {len(all_rows)}")
print(f"Results saved to {output_path}")
Run timestamp: 2026-06-17 00:15:55
New rows added: 10
Total rows in file: 10
Results saved to C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_scores.csv
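
Because every row carries a run timestamp, later runs can be compared against earlier ones. The sketch below shows one way to do that with rows shaped like those in readability_scores.csv. `score_changes` is a hypothetical helper, and the example rows are invented for illustration. They are not scores from this analysis.

```python
from collections import defaultdict


def score_changes(rows, metric="flesch_kincaid_grade"):
    """Return {document key: latest score minus earliest score} for re-scored documents."""
    by_doc = defaultdict(list)
    for row in rows:
        key = (row["insurer"], row["state"], row["doc_type"])
        # "%Y-%m-%d %H:%M:%S" timestamps sort chronologically as plain strings.
        by_doc[key].append((row["run_timestamp"], float(row[metric])))

    changes = {}
    for key, runs in by_doc.items():
        runs.sort()
        if len(runs) > 1:  # documents scored only once have nothing to compare
            changes[key] = round(runs[-1][1] - runs[0][1], 2)
    return changes


# Invented rows for illustration only.
example_rows = [
    {"run_timestamp": "2026-01-01 09:00:00", "insurer": "Insurer A",
     "state": "Georgia", "doc_type": "HIPAA Notice", "flesch_kincaid_grade": "11.2"},
    {"run_timestamp": "2026-06-01 09:00:00", "insurer": "Insurer A",
     "state": "Georgia", "doc_type": "HIPAA Notice", "flesch_kincaid_grade": "9.8"},
    {"run_timestamp": "2026-06-01 09:00:00", "insurer": "Insurer B",
     "state": "Georgia", "doc_type": "GLB Notice", "flesch_kincaid_grade": "15.0"},
]

changes = score_changes(example_rows)
print(changes)
```

A negative change means the later version of the document is easier to read than the earlier one.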
In [36]:
### Visualizations
In [42]:
import pandas as pd
import plotly.graph_objects as go

df = pd.read_csv(r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_scores.csv")

latest = (
    df.sort_values("run_timestamp", ascending=False)
    .drop_duplicates(subset=["insurer", "state", "doc_type"])
)

# Rank insurers by their hardest document (highest Flesch-Kincaid grade).
insurer_max = (
    latest.groupby("insurer")["flesch_kincaid_grade"]
    .max()
    .sort_values(ascending=False)
)

insurer_order = {insurer: i for i, insurer in enumerate(insurer_max.index)}
latest["insurer_rank"] = latest["insurer"].map(insurer_order)

latest = latest.sort_values(
    ["insurer_rank", "flesch_kincaid_grade"],
    ascending=[False, False]
)

latest["label"] = latest["insurer"] + "<br>  " + latest["doc_type"]

fig = go.Figure()

fig.add_trace(go.Bar(
    y=latest["label"],
    x=latest["smog_index"],
    name="SMOG index",
    orientation="h",
    marker=dict(color="#888780", opacity=0.5),
    customdata=latest[["insurer", "doc_type", "smog_index"]].values,
    hovertemplate=(
        "<b>%{customdata[0]}</b><br>"
        "%{customdata[1]}<br>"
        "SMOG index: <b>%{customdata[2]:.1f}</b>"
        "<extra></extra>"
    ),
))

fig.add_trace(go.Bar(
    y=latest["label"],
    x=latest["flesch_kincaid_grade"],
    name="Flesch-Kincaid grade level",
    orientation="h",
    marker=dict(color="#3266ad", opacity=0.9),
    customdata=latest[[
        "insurer", "doc_type", "word_count",
        "flesch_reading_ease", "smog_index", "run_timestamp"
    ]].values,
    hovertemplate=(
        "<b>%{customdata[0]}</b><br>"
        "%{customdata[1]}<br>"
        "<br>"
        "Flesch-Kincaid grade: <b>%{x:.1f}</b><br>"
        "Flesch reading ease: <b>%{customdata[3]:.1f}</b><br>"
        "SMOG index: <b>%{customdata[4]:.1f}</b><br>"
        "Word count: <b>%{customdata[2]:,}</b><br>"
        "<br>"
        "<i>Scored: %{customdata[5]}</i>"
        "<extra></extra>"
    ),
))

fig.add_vline(
    x=6,
    line_dash="dash",
    line_color="#c0392b",
    line_width=1.5,
)

fig.update_layout(
    barmode="overlay",
    title=dict(
        text=(
            "Readability of private insurer privacy documents<br>"
            "<sup>Georgia primary case, documents scored June 17, 2026, "
            "dashed line = AMA recommended threshold (grade 6)</sup>"
        ),
        font=dict(size=16, color="#333"),
    ),
    xaxis=dict(
        title="Grade level required to understand document",
        range=[0, 22],
        tickvals=[0, 3, 6, 9, 12, 15, 18, 21],
        ticktext=["0", "3rd", "6th", "9th", "12th", "15th", "18th", "21st"],
        gridcolor="#eeeeee",
        gridwidth=1,
    ),
    yaxis=dict(
        title=None,
        automargin=True,
        tickfont=dict(size=11),
    ),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=-0.08,
        xanchor="center",
        x=0.5,
        font=dict(size=12),
    ),
    height=700,
    margin=dict(l=20, r=20, t=100, b=100),
    plot_bgcolor="white",
    paper_bgcolor="white",
    font=dict(family="Arial", size=12, color="#333"),
)

output_html = r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_chart_grouped.html"
output_png = r"C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_chart_grouped.png"

fig.write_html(output_html, include_plotlyjs="cdn")
print(f"Interactive chart saved to {output_html}")

try:
    fig.write_image(output_png, width=1200, height=700, scale=2)
    print(f"Static image saved to {output_png}")
except Exception as e:
    print(f"PNG export skipped: {e}")

fig.show()
Interactive chart saved to C:\Users\victo\OneDrive\Desktop\Privacy Policies\readability_chart_grouped.html
PNG export skipped: 
Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
    $ pip install -U kaleido

References

Berkman, N. D., Sheridan, S. L., Donahue, K. E., Halpern, D. J., & Crotty, K. (2011). Low health literacy and health outcomes: An updated systematic review. Annals of Internal Medicine, 155(2), 97–107. https://doi.org/10.7326/0003-4819-155-2-201107190-00005

Kutner, M., Greenberg, E., Jin, Y., & Paulsen, C. (2006). The health literacy of America's adults: Results from the 2003 National Assessment of Adult Literacy (NCES 2006–483). U.S. Department of Education, National Center for Education Statistics. https://nces.ed.gov/pubs2006/2006483.pdf

McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.

Paasche-Orlow, M. K., Taylor, H. A., & Brancati, F. L. (2003). Readability standards for informed-consent forms as compared with actual readability. New England Journal of Medicine, 348(8), 721–726. https://doi.org/10.1056/NEJMsa021212

Rudd, R. E. (2007). Health literacy skills of U.S. adults. American Journal of Health Behavior, 31(Suppl 1), S8–S18.

Generative AI Statement

Portions of this analysis were developed with assistance from Claude (Anthropic, claude-sonnet-4-6). AI assistance was used for code generation, literature identification, and methodological scaffolding. All citations were independently verified against primary sources. Coding decisions, interpretive judgments, and research conclusions are the author's own.