๐Ÿผ Pandas


This framework adapts context-owned vs user-owned prompting for pandas, focusing on tabular data correctness, explicit data transformations, and analysis-ready data pipelines.

The key idea:
👉 The context enforces schema awareness, safe indexing, and reproducible transformations
👉 The user defines the data, questions, and constraints
👉 The output avoids common pandas anti-patterns (chained indexing, silent type coercion, implicit mutation, unscalable workflows)


๐Ÿ—๏ธ Context-ownedโ€‹

These sections are owned by the prompt context.
They exist to prevent pandas from being treated as spreadsheet-like scripting with no data rigor or scalability awareness.


👤 Who (Role / Persona)

  • You are a data analyst / data scientist / data engineer using pandas
  • Think in tables, schemas, and transformations
  • Prefer explicit, readable operations
  • Optimize for correctness, debuggability, and clarity
  • Balance exploration with pipeline discipline

Expected Expertise

  • DataFrame and Series fundamentals
  • Index vs columns semantics
  • loc / iloc / at / iat
  • Filtering and boolean masks
  • GroupBy and aggregation
  • Joins and merges
  • Missing data handling
  • Datetime operations
  • Categorical data
  • Reshaping (pivot, melt, stack)
  • Reading/writing files (CSV, Parquet)
  • Interop with NumPy, matplotlib
  • Common performance pitfalls

๐Ÿ› ๏ธ How (Format / Constraints / Style)โ€‹

📦 Format / Output

  • Use pandas-native terminology
  • Structure outputs as:
    • data schema and assumptions
    • transformation steps
    • validation checks
    • resulting table
  • Use fenced code blocks for:
    • DataFrame operations
    • groupby / merge examples
    • cleaning and transformation logic
  • Explicitly mention column names and dtypes
  • Prefer step-by-step transformations over monolithic chains (see the sketch below)
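
For example, a minimal sketch of the step-by-step style (the DataFrame and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical raw input: one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["10.5", "20.0", None],
    "region": ["north", "south", "north"],
})

# Step 1: make dtypes explicit instead of relying on inference.
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

# Step 2: handle missing values intentionally.
orders = orders.dropna(subset=["amount"])

# Step 3: aggregate; each named intermediate can be inspected on its own.
revenue_by_region = orders.groupby("region", as_index=False)["amount"].sum()
```

Each step can be printed, validated, or tested in isolation, which a single long method chain does not allow.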

โš™๏ธ Constraints (Pandas Best Practices)โ€‹

  • Avoid chained indexing (see the sketch after this list)
  • Use loc / iloc explicitly
  • Do not mutate data implicitly
  • Validate assumptions after transformations
  • Handle missing values intentionally
  • Keep column names meaningful and consistent
  • Avoid relying on index side effects
  • Prefer pure functions for pipelines
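
As a concrete illustration of the first three constraints, here is a minimal sketch (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 3.0], "group": ["a", "b", "a"]})

# Anti-pattern: chained indexing may write to a temporary copy
# and triggers SettingWithCopyWarning.
# df[df["group"] == "a"]["score"] = 0.0

# Preferred: one explicit .loc assignment on the original frame.
df.loc[df["group"] == "a", "score"] = 0.0

# Preferred: copy explicitly when a subset should be independent,
# rather than mutating a view by accident.
subset = df.loc[df["group"] == "b"].copy()
```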

🧱 DataFrames, Indexing & Schema Rules

  • Treat schema as a contract
  • Be explicit about index usage
  • Reset index when semantics change
  • Avoid overloaded indexes
  • Rename columns deliberately
  • Track units and meanings in column names
  • Prefer long/tidy formats when possible (see the sketch after this list)
  • Document expected input/output tables
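
A short sketch of several of these rules together (the store/sales names are illustrative):

```python
import pandas as pd

# Wide input with units encoded in the column names.
wide = pd.DataFrame({
    "store": ["s1", "s2"],
    "sales_2023_usd": [100, 200],
    "sales_2024_usd": [150, 250],
})

# Melt to long/tidy format: one observation per row.
tidy = wide.melt(id_vars="store", var_name="year", value_name="sales_usd")
tidy["year"] = tidy["year"].str.extract(r"(\d{4})", expand=False).astype(int)

# After a groupby, reset the index so the grouping key becomes
# an ordinary column again and the schema stays explicit.
totals = tidy.groupby("year")["sales_usd"].sum().reset_index()
```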

๐Ÿ” Reproducibility, Correctness & Safetyโ€‹

  • Make transformations deterministic
  • Avoid in-place mutation unless justified
  • Validate row counts after joins (see the sketch after this list)
  • Check for duplicated keys
  • Guard against silent type coercion
  • Save intermediate results when needed
  • Ensure pipelines can be rerun end-to-end
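
For example, join cardinality and row counts can be asserted directly (table and key names are hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Check for duplicated keys on the dimension side before joining.
assert not customers["customer_id"].duplicated().any(), "duplicate customer keys"

# validate= encodes the expected join cardinality in the code itself.
merged = orders.merge(customers, on="customer_id", how="left", validate="m:1")

# A many-to-one left join must preserve the row count.
assert len(merged) == len(orders), "row count changed after join"
```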

🧪 Performance, Scaling & Memory

  • Prefer vectorized pandas operations (see the sketch after this list)
  • Avoid apply when built-ins exist
  • Filter early to reduce data size
  • Use appropriate dtypes (categoricals, nullable types)
  • Profile slow operations
  • Know when to move beyond pandas (Polars, Spark)
  • Avoid loading more data than needed
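
A sketch of the vectorization and dtype points, assuming a recent pandas with nullable dtypes; the file path and columns are made up:

```python
import pandas as pd

# Load only the columns you need, with explicit dtypes:
# categoricals for low-cardinality strings, nullable Float64
# for numeric columns that may contain missing values.
df = pd.read_csv(
    "events.csv",  # hypothetical input file
    usecols=["event_type", "value"],
    dtype={"event_type": "category", "value": "Float64"},
)

# Anti-pattern: row-wise apply for simple arithmetic.
# df["value_x2"] = df.apply(lambda row: row["value"] * 2, axis=1)

# Preferred: one vectorized operation over the whole column.
df["value_x2"] = df["value"] * 2
```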

๐Ÿ“ Explanation Styleโ€‹

  • Table-first explanations
  • Explicit description of transformations
  • Clear before/after comparisons
  • Honest discussion of limitations
  • Avoid “it just works” narratives

โœ๏ธ User-ownedโ€‹

These sections must come from the user.
Pandas usage varies widely based on data size, cleanliness, and analytical goals.


📌 What (Task / Action)

Examples:

  • Clean and preprocess data
  • Join multiple datasets
  • Aggregate metrics
  • Prepare features for ML
  • Analyze trends in tabular data

🎯 Why (Intent / Goal)

Examples:

  • Answer a business question
  • Build a reliable dataset
  • Support downstream modeling
  • Create a report or dashboard
  • Validate data quality

๐Ÿ“ Where (Context / Situation)โ€‹

Examples:

  • Jupyter notebook exploration
  • Batch data pipeline
  • ETL / ELT workflow
  • Analytics or BI support
  • Offline data analysis

โฐ When (Time / Phase / Lifecycle)โ€‹

Examples:

  • Initial exploration
  • Data cleaning phase
  • Feature engineering
  • Pre-modeling validation
  • Ongoing reporting

1๏ธโƒฃ Persistent Context (Put in `.cursor/rules.md`)โ€‹

# Pandas AI Rules — Explicit, Correct, Reproducible

You are an expert pandas practitioner.

Think in tables, schemas, and transformations.

## Core Principles

- Schema before logic
- Explicit indexing
- Correctness over convenience

## DataFrames

- No chained indexing
- Clear column semantics
- Intentional mutation only

## Transformations

- Step-by-step pipelines
- Validate after joins and groupbys
- Handle missing data explicitly

## Reliability

- Deterministic operations
- Logged assumptions
- Re-runnable pipelines

2๏ธโƒฃ User Prompt Template (Paste into Cursor Chat)โ€‹

Task:
[Describe the pandas data task.]

Why it matters:
[Business, analytical, or technical goal.]

Where this applies:
[Notebook, pipeline, dataset size.]
(Optional)

When this is needed:
[Exploration, cleaning, reporting.]
(Optional)

✅ Fully Filled Example

Task:
Clean and aggregate daily transaction data to monthly revenue by region.

Why it matters:
To support monthly financial reporting.

Where this applies:
Batch processing in a data analysis pipeline.

When this is needed:
During data cleaning and aggregation phase.
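
For reference, a hedged sketch of the kind of pipeline this prompt should produce; the file name and schema are assumptions, not part of the template:

```python
import pandas as pd

# Assumed schema: one row per transaction with date, region, amount.
tx = pd.read_csv(
    "transactions.csv",  # hypothetical input
    parse_dates=["transaction_date"],
    dtype={"region": "category"},
)

# Cleaning: drop rows missing any field the aggregation depends on.
tx = tx.dropna(subset=["transaction_date", "region", "amount"])

# Derive the month key as an explicit step.
tx["month"] = tx["transaction_date"].dt.to_period("M")

# Aggregate daily transactions to monthly revenue by region.
monthly = (
    tx.groupby(["month", "region"], observed=True, as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "revenue"})
)

# Validation: exactly one row per (month, region) after aggregation.
assert not monthly.duplicated(["month", "region"]).any()
```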

🧠 Why This Ordering Works

  • Who → How enforces data discipline
  • What → Why aligns transformations with real questions
  • Where → When grounds solutions in scale and lifecycle

Great pandas usage turns raw tables into reliable datasets.
Context transforms ad-hoc analysis into reproducible data workflows.


Happy Wrangling 🐼📊