# 🐼 Pandas

## 📋 Table of Contents

- 🐼 Pandas
This framework adapts context-owned vs user-owned prompting for pandas, focusing on tabular data correctness, explicit data transformations, and analysis-ready data pipelines.
The key idea:
- The context enforces schema awareness, safe indexing, and reproducible transformations
- The user defines the data, questions, and constraints
- The output avoids common pandas anti-patterns (chained indexing, silent type coercion, implicit mutation, unscalable workflows)
## 🏛️ Context-owned
These sections are owned by the prompt context.
They exist to prevent treating pandas as spreadsheet-like scripting without data rigor or scalability awareness.
### 👤 Who (Role / Persona)

#### Default Persona (Recommended)
- You are a data analyst / data scientist / data engineer using pandas
- Think in tables, schemas, and transformations
- Prefer explicit, readable operations
- Optimize for correctness, debuggability, and clarity
- Balance exploration with pipeline discipline
#### Expected Expertise

- `DataFrame` and `Series` fundamentals
- Index vs columns semantics
- `loc` / `iloc` / `at` / `iat`
- Filtering and boolean masks
- GroupBy and aggregation
- Joins and merges
- Missing data handling
- Datetime operations
- Categorical data
- Reshaping (`pivot`, `melt`, `stack`)
- Reading/writing files (CSV, Parquet)
- Interop with NumPy, matplotlib
- Common performance pitfalls
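As a quick illustration of the reshaping vocabulary above, here is a minimal sketch (with made-up column names) of `melt` turning a wide table into long/tidy form:

```python
import pandas as pd

# Hypothetical wide-format table: one revenue column per month
wide = pd.DataFrame({
    "region": ["north", "south"],
    "jan": [100, 80],
    "feb": [120, 90],
})

# melt() reshapes wide -> long ("tidy"): one row per (region, month) pair
tidy = wide.melt(id_vars="region", var_name="month", value_name="revenue")
```

The resulting `tidy` frame has columns `region`, `month`, `revenue` and one row per region-month combination, which is the shape most groupby and plotting workflows expect.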
### 🛠️ How (Format / Constraints / Style)

#### 📦 Format / Output
- Use pandas-native terminology
- Structure outputs as:
- data schema and assumptions
- transformation steps
- validation checks
- resulting table
- Use fenced code blocks for:
- DataFrame operations
- groupby / merge examples
- cleaning and transformation logic
- Explicitly mention column names and dtypes
- Prefer step-by-step transformations over monolithic chains
#### ⚙️ Constraints (Pandas Best Practices)

- Avoid chained indexing
- Use `loc` / `iloc` explicitly
- Do not mutate data implicitly
- Validate assumptions after transformations
- Handle missing values intentionally
- Keep column names meaningful and consistent
- Avoid relying on index side effects
- Prefer pure functions for pipelines
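The first three constraints can be sketched in a few lines; the frame and column names here are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Anti-pattern (chained indexing): df[df["qty"] > 1]["price"] = 0
# may assign into a temporary copy and silently change nothing.

# Explicit single-step .loc assignment mutates df predictably
df.loc[df["qty"] > 1, "price"] = 0.0

# Non-mutating alternative: derive a new frame from an explicit copy
discounted = df.copy()
discounted["price"] = discounted["price"] * 0.9
```

The `.copy()` variant keeps the original table intact, which is what the "no implicit mutation" rule is asking for in pipeline code.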
#### 🧱 DataFrames, Indexing & Schema Rules
- Treat schema as a contract
- Be explicit about index usage
- Reset index when semantics change
- Avoid overloaded indexes
- Rename columns deliberately
- Track units and meanings in column names
- Prefer long/tidy formats when possible
- Document expected input/output tables
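One way to treat the schema as a contract is a small validation helper run at pipeline boundaries; the column names, dtypes, and function below are a hypothetical sketch, not a fixed convention:

```python
import pandas as pd

# Hypothetical contract: required columns and dtypes (units encoded in names)
EXPECTED = {"order_id": "int64", "amount_usd": "float64"}

def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Raise early if the table violates the expected schema."""
    missing = set(EXPECTED) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in EXPECTED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    return df

orders = pd.DataFrame({"order_id": [1, 2], "amount_usd": [9.99, 5.00]})
validated = validate_schema(orders)  # passes; raises on contract violations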
#### 🔁 Reproducibility, Correctness & Safety
- Make transformations deterministic
- Avoid in-place mutation unless justified
- Validate row counts after joins
- Check for duplicated keys
- Guard against silent type coercion
- Save intermediate results when needed
- Ensure pipelines can be rerun end-to-end
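The join-related checks above map directly onto pandas features: `duplicated()` for key uniqueness and the `validate=` argument of `merge`, which raises if the key relationship is not what you declared. A minimal sketch with made-up tables:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Check the dimension side for duplicated keys before joining
assert not customers["customer_id"].duplicated().any()

n_before = len(orders)

# validate="m:1" makes pandas raise if the right side is not unique
joined = orders.merge(customers, on="customer_id", how="left", validate="m:1")

# A left many-to-one join must neither add nor drop rows
assert len(joined) == n_before
```

If the fan-out assumption ever breaks upstream, the `validate=` check fails loudly instead of silently duplicating rows.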
#### 🧪 Performance, Scaling & Memory

- Prefer vectorized pandas operations
- Avoid `apply` when built-ins exist
- Filter early to reduce data size
- Use appropriate dtypes (categoricals, nullable types)
- Profile slow operations
- Know when to move beyond pandas (Polars, Spark)
- Avoid loading more data than needed
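Two of these points in a small sketch, with synthetic data: categorical dtype for repeated strings, and vectorized arithmetic in place of a row-wise `apply`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east"], size=n),
    "amount": rng.random(n),
})

# Categorical dtype stores the repeated strings as small integer codes
df["region"] = df["region"].astype("category")

# Vectorized arithmetic runs in compiled code over the whole column
df["amount_cents"] = (df["amount"] * 100).round().astype("int64")
# Slower equivalent, calling Python once per row:
# df["amount"].apply(lambda x: round(x * 100))
```

`df.memory_usage(deep=True)` is a quick way to confirm what the categorical conversion actually saves on your data.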
#### 📖 Explanation Style
- Table-first explanations
- Explicit description of transformations
- Clear before/after comparisons
- Honest discussion of limitations
- Avoid "it just works" narratives
## ✍️ User-owned
These sections must come from the user.
Pandas usage varies widely based on data size, cleanliness, and analytical goals.
### 📝 What (Task / Action)
Examples:
- Clean and preprocess data
- Join multiple datasets
- Aggregate metrics
- Prepare features for ML
- Analyze trends in tabular data
### 🎯 Why (Intent / Goal)
Examples:
- Answer a business question
- Build a reliable dataset
- Support downstream modeling
- Create a report or dashboard
- Validate data quality
### 📍 Where (Context / Situation)
Examples:
- Jupyter notebook exploration
- Batch data pipeline
- ETL / ELT workflow
- Analytics or BI support
- Offline data analysis
### ⏰ When (Time / Phase / Lifecycle)
Examples:
- Initial exploration
- Data cleaning phase
- Feature engineering
- Pre-modeling validation
- Ongoing reporting
## 📋 Final Prompt Template (Recommended Order)

### 1️⃣ Persistent Context (Put in `.cursor/rules.md`)
```markdown
# Pandas AI Rules – Explicit, Correct, Reproducible

You are an expert pandas practitioner.
Think in tables, schemas, and transformations.

## Core Principles
- Schema before logic
- Explicit indexing
- Correctness over convenience

## DataFrames
- No chained indexing
- Clear column semantics
- Intentional mutation only

## Transformations
- Step-by-step pipelines
- Validate after joins and groupbys
- Handle missing data explicitly

## Reliability
- Deterministic operations
- Logged assumptions
- Re-runnable pipelines
```
### 2️⃣ User Prompt Template (Paste into Cursor Chat)

```text
Task:
[Describe the pandas data task.]

Why it matters:
[Business, analytical, or technical goal.]

Where this applies:
[Notebook, pipeline, dataset size.]
(Optional)

When this is needed:
[Exploration, cleaning, reporting.]
(Optional)
```
### ✅ Fully Filled Example

```text
Task:
Clean and aggregate daily transaction data to monthly revenue by region.

Why it matters:
To support monthly financial reporting.

Where this applies:
Batch processing in a data analysis pipeline.

When this is needed:
During data cleaning and aggregation phase.
```
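Under the context rules above, a response to this filled-in example might sketch the transformation like this; the input column names (`transaction_date`, `region`, `amount_usd`) are assumptions about the dataset, not givens:

```python
import pandas as pd

# Hypothetical input schema: transaction_date, region, amount_usd
tx = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2024-01-03", "2024-01-15", "2024-02-02"]),
    "region": ["north", "north", "south"],
    "amount_usd": [100.0, 50.0, 75.0],
})

# Clean: drop rows with missing amounts, as an explicit decision
tx = tx.dropna(subset=["amount_usd"])

# Derive a month column, then aggregate to monthly revenue by region
tx["month"] = tx["transaction_date"].dt.to_period("M")
monthly = (
    tx.groupby(["month", "region"], as_index=False)["amount_usd"]
    .sum()
    .rename(columns={"amount_usd": "revenue_usd"})
)

# Validate: one row per (month, region) pair
assert len(monthly) == len(monthly[["month", "region"]].drop_duplicates())
```

Note the structure the context demands: schema stated up front, explicit cleaning step, step-by-step aggregation, and a validation check at the end.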
## 🧠 Why This Ordering Works

- Who → How enforces data discipline
- What → Why aligns transformations with real questions
- Where → When grounds solutions in scale and lifecycle
Great pandas usage turns raw tables into reliable datasets.
Context transforms ad-hoc analysis into reproducible data workflows.
Happy Wrangling 🐼📊