VelvetShark

Data analysis with Claude Code: 500,000 rows without writing code


Most tutorials about data analysis with AI stay theoretical. A few prompts, some possibilities, but you rarely see someone work through a real dataset, deal with the messy parts, and produce something you could actually send to your manager.

This article is different. I'll show you exactly how to go from a raw Excel file with half a million rows to a finished report - step by step, with real prompts you can copy and use.

By the end, you'll have:

  • An Excel report with KPIs, top products, top customers, sales trends, and country breakdown
  • Professional charts ready for presentations
  • A customer cohort analysis with a retention heatmap
  • A reusable script that regenerates all of this with a single command

And you won't need to know Python.

The dataset

I'm using the Online Retail Dataset from the UCI Machine Learning Repository. It's real transaction data from a UK-based gift retailer - 541,909 transactions over about a year (December 2010 to December 2011).

It has everything you'd expect: invoice numbers, product codes, quantities, prices, customer IDs, countries. And it's messy. There are returns and refunds mixed in with orders, cancellations, negative quantities, ambiguous dates, inconsistencies. About a quarter of the rows are missing customer IDs.

This is what real business data looks like.

"But I don't know Python"

You don't need to.

Throughout this entire process, I'm going to:

  • Write prompts to Claude Code
  • Verify the data makes sense
  • Run CLI commands
  • Review output files and charts

What I'm NOT going to do is open a Python file and read the code. I'm not going to debug syntax errors. I'm not going to explain what pandas or matplotlib does under the hood.

Claude Code handles all of that. My job is to know what questions to ask my data. What do I want to analyze? How should the output be structured? What counts as a "return" versus a "cancellation"?

That's the skill. Knowing your business and knowing what you want to learn from your data. If you're good with Excel, you already have those skills. Prompting is just a different way of expressing them.

Think of it this way: you probably can't write a VLOOKUP formula from memory (I know I can't). But you know when you need one and what it should do. Same idea here - but more powerful.
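To make the analogy concrete: the pandas equivalent of a VLOOKUP is a merge. This is a tiny sketch with made-up data, not code from the actual session - but it's the kind of thing Claude writes for you so you never have to:

```python
import pandas as pd

# Hypothetical lookup: attach a product description to each order line,
# the same job a VLOOKUP does in Excel.
orders = pd.DataFrame({"StockCode": ["A1", "B2", "A1"], "Quantity": [3, 1, 2]})
products = pd.DataFrame({"StockCode": ["A1", "B2"],
                         "Description": ["Mug", "Lamp"]})

# how="left" keeps every order row, like VLOOKUP with exact match
merged = orders.merge(products, on="StockCode", how="left")
print(merged)
```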

One-time environment setup

If you've already got Python and Claude Code ready, skip to the next section.

First, you need Claude Code installed. Then I have Claude Code set up the Python environment for me:

Python Environment Setup with UV

I need you to set up my Python environment using uv for data analysis.

My Operating System: macOS

---
macOS Setup

1. First, check if uv is installed by running: uv --version
2. If not installed, install it with: curl -LsSf https://astral.sh/uv/install.sh | sh
3. After installation, ensure ~/.local/bin is in my PATH (restart terminal or run: source ~/.zshrc)
4. Create a virtual environment in my current project directory: uv venv
5. Install the required packages: uv pip install pandas openpyxl seaborn matplotlib jupyter
6. Show me how to activate the environment when needed

---
Update `CLAUDE.md`

After setting up the environment, add the following to my project's `CLAUDE.md` file (create it if it doesn't exist):

## Python Environment

- Always use `uv` for Python package management (not pip directly)
- Always use `uv` for running Python scripts. No exceptions.
- Virtual environment: `.venv` in project root
- Activate before running Python: `source .venv/bin/activate` (macOS/Linux)
- Install packages with: `uv pip install <package>`
- Required packages: pandas, openpyxl, seaborn, matplotlib, jupyter

## Chart Style

When creating charts with matplotlib/seaborn, always apply this configuration at the start of the script:

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams.update({
'font.family': 'sans-serif',
'font.sans-serif': ['Inter', 'Helvetica Neue', 'Arial', 'sans-serif'],
'font.size': 11,
'axes.titlesize': 16,
'axes.titleweight': 'bold',
'axes.labelsize': 12,
'axes.spines.top': False,
'axes.spines.right': False,
'axes.linewidth': 0.8,
'axes.edgecolor': '#333333',
'axes.grid': True,
'grid.alpha': 0.3,
'grid.linewidth': 0.5,
'axes.axisbelow': True,
'figure.facecolor': 'white',
'axes.facecolor': 'white',
'figure.dpi': 150,
'savefig.dpi': 200,
'savefig.bbox': 'tight',
'savefig.pad_inches': 0.2,
})

CHART_COLORS = ['#4C78A8', '#F58518', '#54A24B', '#E45756', '#72B7B2', '#FF9DA6', '#9D755D', '#BAB0AC']
sns.set_palette(CHART_COLORS)

---
Verification

After setup is complete, verify everything works by running:
uv run python -c "import pandas; import openpyxl; import seaborn; import matplotlib; print('All packages installed successfully!')"

The CLAUDE.md file is like a memory file that helps Claude remember my preferences in future sessions. You run this setup once and never come back to it.

Loading the data: establishing guardrails

AI models can hallucinate. They can make up numbers that sound plausible but aren't real.

So the first thing I do is establish guardrails. I tell Claude: don't guess, don't estimate. Write code that loads the data, run it, and THEN tell me what you found.

I have a retail dataset at @"data/Online Retail.xlsx"

Before you answer anything with numbers:
1. Write Python code that loads the file with pandas
2. Execute it
3. Only then summarize findings from the printed outputs

Rules:
- Do not estimate or infer row counts, date ranges, or totals in chat
- All numbers must come from code execution
- Do not force dayfirst=True for date parsing unless min/max check proves it's needed

Start by printing: columns, dtypes, row count, and min/max InvoiceDate.

Claude writes the code, runs it, and shows me what it found directly from the data. 541,909 rows - that matches what we expected. Date range from December 2010 to December 2011 - that's correct.

If the dates had shown January to September, I'd know something went wrong with the date parsing. This verification step catches those errors early.

The @ symbol is how you reference files in Claude Code. It's pointing to the Excel file in my data folder.
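For the curious, the sanity check Claude writes at this stage is typically just a few lines of pandas. Here's a sketch - in the real session it loads "data/Online Retail.xlsx" with pd.read_excel; this version uses a tiny stand-in frame so it's self-contained:

```python
import pandas as pd

# Stand-in for pd.read_excel("data/Online Retail.xlsx")
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "Quantity": [6, -1],
    "UnitPrice": [2.55, 27.50],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2011-12-09 12:50"]),
})

# The guardrail: every number comes from printed output, not from memory
print(df.dtypes)
print("rows:", len(df))
print("date range:", df["InvoiceDate"].min(), "to", df["InvoiceDate"].max())
```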

Data prep: keeping the messy stuff

Here's where I do something different from most tutorials. Most people would say "drop the cancellations, drop the returns, clean up the data."

I'm not going to do that.

Returns and cancellations ARE the interesting business data. If a product has a 40% return rate, I want to know about that. If certain months have more cancellations, that's valuable information.

Create a data prep step that DOES NOT delete returns or cancellations.

Requirements:
1. Parse InvoiceDate to datetime - verify min/max after parsing matches Dec 2010 → Dec 2011
2. Add these columns:
- IsCancellation (True if InvoiceNo starts with 'C' or Quantity < 0)
- LineTotal = Quantity * UnitPrice
- GrossLineTotal = max(LineTotal, 0)
- ReturnLineTotal = min(LineTotal, 0)
3. Print a data quality summary:
- Total rows
- Rows flagged as cancellations
- Rows with missing CustomerID
- Rows with UnitPrice <= 0
- Date range confirmed

Do not remove any rows yet - just flag them.

Now I can see clearly: about 10,500 rows flagged as cancellations, 135,000 rows with missing customer IDs. The data is still all there - but it's flagged so we can handle it appropriately.

When I analyze products, I'll use all the data. When I analyze customers, I'll exclude the rows without customer IDs - but only for that specific analysis.

This is what real business analysis looks like. Gross revenue, returns, net revenue. Your CFO cares about all three of those numbers.
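The flagging step itself is simple. This is a sketch on stand-in rows (the real script runs on all 541,909), but the logic matches the prompt: label, don't delete.

```python
import pandas as pd

# Three stand-in rows: a normal sale, a cancellation, a row missing CustomerID
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536520"],
    "Quantity": [6, -1, 4],
    "UnitPrice": [2.55, 27.50, 0.85],
    "CustomerID": [17850.0, 14527.0, None],
})

# Flag, don't drop
df["IsCancellation"] = df["InvoiceNo"].str.startswith("C") | (df["Quantity"] < 0)
df["LineTotal"] = df["Quantity"] * df["UnitPrice"]
df["GrossLineTotal"] = df["LineTotal"].clip(lower=0)   # max(LineTotal, 0)
df["ReturnLineTotal"] = df["LineTotal"].clip(upper=0)  # min(LineTotal, 0)

print("cancellations:", int(df["IsCancellation"].sum()))
print("missing CustomerID:", int(df["CustomerID"].isna().sum()))
```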

Building the report generator

Here's the main event. Instead of running one-off queries, I'm going to have Claude build me a reusable script. One command to generate an entire Excel report with multiple sheets and charts.

Build a Python script called retail_report.py that generates an Excel report from the dataset.

Parameters (with defaults):
- --top_n 20 (number of top products/customers to show)
- --freq M (time frequency: D/W/M for daily/weekly/monthly)
- --start_date, --end_date (optional date filters)
- --output output/retail_report.xlsx

The report should have these sheets:

1. KPIs
- GrossRevenue, Returns (absolute value), NetRevenue
- Total Invoices, Total Customers, Total Products
- Average Order Value
- Return Rate (%)

2. Top_Products (top N by NetRevenue, grouped by StockCode)
- StockCode, Description, QuantitySold, GrossRevenue, Returns, NetRevenue, ReturnRate

3. Top_Customers (top N by NetRevenue)
- CustomerID, Country, NumOrders, GrossRevenue, Returns, NetRevenue, AOV
- Note: exclude rows with missing CustomerID for this sheet only

4. Sales_Over_Time (resampled by --freq)
- Period, GrossRevenue, Returns, NetRevenue, NumInvoices

5. Sales_By_Country
- Country, NumCustomers, NumOrders, GrossRevenue, Returns, NetRevenue, % of Total

6. QA_Checks
- Row counts at each stage
- Reconciliation: NetRevenue via sum(LineTotal) vs sum(GrossLineTotal)+sum(ReturnLineTotal)
- Sample of 5 invoices (seeded for reproducibility) with their line items

Also save charts to output/charts/ using seaborn with a clean, professional style:
- top_products_bar.png (horizontal bar, top 15 products)
- sales_trend_line.png (line with markers)
- countries_bar.png (top 10 countries)

Chart requirements:
- Set figure size to (12, 6) for 16:9 aspect ratio
- Add value labels on bar charts
- Include clear titles and axis labels

After building the script:
- Run it with defaults
- Print a preview of KPIs and Top 3 products
- Print how to use the script and what options are available (remember to always use `uv`)
- Open the generated images with charts

Claude writes the Python code, encounters a few errors, fixes them automatically without me needing to do anything, and produces the finished script.

It took about 4 minutes to generate everything. I didn't need to give any additional input.
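You never need to read the script, but if you peek inside, the parameter handling is standard argparse. This is a hypothetical skeleton of the interface, not the actual generated code - the real retail_report.py will differ in detail:

```python
import argparse

def parse_args(argv=None):
    # Mirrors the parameters requested in the prompt
    p = argparse.ArgumentParser(description="Generate the retail Excel report")
    p.add_argument("--top_n", type=int, default=20)
    p.add_argument("--freq", choices=["D", "W", "M"], default="M")
    p.add_argument("--start_date")
    p.add_argument("--end_date")
    p.add_argument("--output", default="output/retail_report.xlsx")
    return p.parse_args(argv)

# Simulate: uv run python retail_report.py --top_n 100 --freq W
args = parse_args(["--top_n", "100", "--freq", "W"])
print(args.top_n, args.freq, args.output)
```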

The output

The Excel report has all the sheets:

  • KPIs sheet with gross revenue, returns, net revenue, return rate
  • Top Products with quantities, revenue, return rates for each product
  • Top Customers with return numbers, average order values
  • Sales Over Time by month
  • Country breakdown with all the important KPIs for each country
  • QA sheet showing row counts at each stage and reconciliation checks

And the charts - clean, ready to put into a presentation.

Changing parameters

Want to see top 100 products instead of top 20?

Re-run the report with --top_n 100. Show me the new Top_Products row count.

One parameter change. Done.

Weekly data instead of monthly?

Re-run with --freq W (weekly instead of monthly). How many time periods now?

Just change M to W.

Only Q1 2011?

Re-run for just Q1 2011: --start_date 2011-01-01 --end_date 2011-03-31

In Excel, each of these changes would mean rebuilding pivot tables, adjusting formulas, recreating charts. Here it's just a parameter.
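The reason one flag can swap the time grain is that pandas does the regrouping in a single resample call. A minimal sketch with stand-in data:

```python
import pandas as pd

# Three stand-in invoices; the real script resamples the full invoice table
sales = pd.DataFrame(
    {"NetRevenue": [100.0, 50.0, 75.0]},
    index=pd.to_datetime(["2011-01-03", "2011-01-10", "2011-02-01"]),
)

# Changing --freq just changes this one string: "D", "W", or "M"
monthly = sales.resample("M")["NetRevenue"].sum()
weekly = sales.resample("W")["NetRevenue"].sum()
print(len(monthly), "monthly periods;", len(weekly), "weekly periods")
```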

Getting insights

Now I've got the data. But what does it mean?

I could read through all the numbers myself, or I can have Claude help:

Based on the generated report, write 5 manager-ready insights.

Rules:
- Every insight must include at least one specific number from the report
- Each insight must cite its source: sheet name and column
Example: "Source: KPIs(ReturnRate), Top_Products(Returns)"
- No speculation about causes - describe what we observe
- Format as a Markdown section I can paste into an email

Each insight has a specific number and a citation. "Return rate is X%, Source: KPIs sheet." This prevents Claude from making things up. If it can't point to where the number came from, it shouldn't be in the insight.

Save the insights as output/summary.md and list any concerning patterns you found.

Now I have a shareable document. Not just chat history - an actual file I can attach to an email.

Verification: don't skip this

Most tutorials skip this part. But it's maybe the most important one.

AI can make mistakes. I've seen it calculate things wrong, misinterpret columns, use the wrong date format. You need to verify before sharing results with anyone in your business.

High-level reconciliation

In QA_Checks, show me a reconciliation for NetRevenue using two independent methods:

- Method A (direct): sum(LineTotal) over the included rows
- Method B (components): sum(GrossLineTotal) + sum(ReturnLineTotal)

Show NetRevenue_A, NetRevenue_B, and the absolute + percent difference.
They should match (difference ~ 0). If not, explain which rows differ.

The difference is zero. That's a good sign - they're calculated differently but give the same result.
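The reconciliation itself is a two-line identity check. A sketch on stand-in line totals - the real check runs over the full included rows:

```python
import pandas as pd

line_totals = pd.Series([15.30, -27.50, 3.40])
gross = line_totals.clip(lower=0)    # GrossLineTotal
returns = line_totals.clip(upper=0)  # ReturnLineTotal

net_a = line_totals.sum()             # Method A: direct sum
net_b = gross.sum() + returns.sum()   # Method B: sum of components
print(f"A={net_a:.2f}  B={net_b:.2f}  diff={abs(net_a - net_b):.6f}")
```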

Drill-down verification

For the #1 product (by StockCode) in Top_Products:
- Create output/qa_top_product_rows.csv with all its transactions
- Include: StockCode, Description, InvoiceNo, InvoiceDate, Quantity, UnitPrice, LineTotal, IsCancellation
- Print the StockCode, Description, computed NetRevenue total, and row count

Now I can open this CSV and spot-check. Pick a few rows. Does Quantity times UnitPrice equal LineTotal? Sum the LineTotal column - does it match what the report says?

This takes two minutes and can save you from presenting wrong numbers to your manager.
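You can even automate the eyeball check. This sketch (hypothetical rows) flags any line where Quantity times UnitPrice drifts from the recorded LineTotal:

```python
import pandas as pd

# Stand-in for the exported qa_top_product_rows.csv
qa = pd.DataFrame({
    "Quantity": [6, -1],
    "UnitPrice": [2.55, 27.50],
    "LineTotal": [15.30, -27.50],
})

# Any row where Quantity * UnitPrice != LineTotal (beyond rounding)
mismatch = qa[(qa["Quantity"] * qa["UnitPrice"] - qa["LineTotal"]).abs() > 0.005]
print("mismatched rows:", len(mismatch))
```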

Cohort analysis: something hard to do in Excel

One more thing. This is hard to do in Excel, but valuable for understanding customer behavior: cohort analysis.

Which month's customers are the most loyal? How does retention change over time?

Create a customer cohort analysis showing retention over time.

Requirements:
1. Define each customer's cohort by their FIRST purchase month (e.g., "Dec 2010", "Jan 2011")
2. For each cohort, track how many customers returned in subsequent months (Month 0, Month 1, Month 2, etc.)
3. Calculate retention rate as: (customers active in Month N) / (customers in cohort) × 100

Output:
- A DataFrame showing cohorts as rows, months (0-12) as columns, with retention percentages
- Save to output/cohort_analysis.xlsx with two sheets:
- "Retention_Rates" (percentages)
- "Customer_Counts" (raw numbers)

Visualization:
- Create a heatmap (output/charts/cohort_heatmap.png)
- Use a sequential color palette (e.g., Blues or YlGnBu)
- Annotate cells with the percentage values
- Title: "Customer Retention by Cohort"
- X-axis: "Months Since First Purchase"
- Y-axis: "Cohort (First Purchase Month)"
- Figure size: (14, 8) for readability

Notes:
- Exclude rows with missing CustomerID
- Month 0 should always be 100% (the month they first purchased)
- Only include cohorts with at least 3 months of possible follow-up data

Print a summary: which cohort has the best Month 3 retention? Which has the worst?

One minute later, I have a nice heatmap. Each row is a cohort - customers who made their first purchase in that month. Each column shows what percentage came back. Month 0 is always 100% - that's when they first bought. Then you can see how it drops off.

Try doing this in Excel. It would take hours. We built it with one prompt.
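The core of the cohort table is only a few pandas operations - which is exactly why it's painful in Excel and trivial here. A sketch with stand-in transactions (the real analysis runs on the full dataset with missing CustomerIDs excluded):

```python
import pandas as pd

tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-05", "2011-01-10", "2010-12-20", "2011-02-02", "2011-01-15"]
    ),
})

# Cohort = month of each customer's first purchase
tx["OrderMonth"] = tx["InvoiceDate"].dt.to_period("M")
tx["CohortMonth"] = (
    tx.groupby("CustomerID")["InvoiceDate"].transform("min").dt.to_period("M")
)
tx["MonthsSince"] = (tx["OrderMonth"] - tx["CohortMonth"]).apply(lambda d: d.n)

# Unique active customers per cohort per month, then divide by Month 0
counts = (
    tx.groupby(["CohortMonth", "MonthsSince"])["CustomerID"]
    .nunique()
    .unstack(fill_value=0)
)
retention = counts.div(counts[0], axis=0) * 100  # Month 0 is always 100%
print(retention)
```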

Bonus: Jupyter notebook

If you want a more interactive way to explore the data:

Create a companion Jupyter notebook retail_analysis.ipynb that imports functions from retail_report.py and walks through the analysis step by step with explanations. This is for colleagues who want to understand the methodology.

Claude generates the notebook, and now you can change parameters and see results immediately - without going back to the command line.

What we built

In about 15-20 minutes, we:

  1. Built a reusable report generator with KPIs, rankings, time trends, and country breakdown
  2. Created professional charts ready for presentations
  3. Built a customer retention analysis with a heatmap
  4. Set up verification steps to make sure the numbers are right

Next month, when new data comes in? One command. No pivot table rebuilding. No formula fixes.

This approach works for all kinds of data:

  • Survey responses
  • Financial data and expense reports
  • Marketing campaign performance
  • Inventory tracking
  • Really any data that comes in a spreadsheet

I never even opened a Python file. What matters is knowing what you want to learn from your data and being able to describe it clearly in a prompt.

If Claude makes an error - maybe it gets a file path or a column name wrong - you just tell it. "The file is actually here" or "That column is called X not Y." It fixes itself.

And always verify. Always. The QA sheet, the drill-downs, the spot-checks. That's what separates useful AI assistance from blind trust.


Now, go play with your data!
