Data analysis with Claude Code: 500,000 rows without writing code

Most tutorials about data analysis with AI stay theoretical. A few prompts, some possibilities, but you rarely see someone work through a real dataset, deal with the messy parts, and produce something you could actually send to your manager.
This article is different. I'll show you exactly how to go from a raw Excel file with half a million rows to a finished report - step by step, with real prompts you can copy and use.
By the end, you'll have:
- An Excel report with KPIs, top products, top customers, sales trends, and country breakdown
- Professional charts ready for presentations
- A customer cohort analysis with a retention heatmap
- A reusable script that regenerates all of this with a single command
And you won't need to know Python.
The dataset
I'm using the Online Retail Dataset from the UCI Machine Learning Repository. It's real transaction data from a UK-based gift retailer - 541,909 transactions over about a year (December 2010 to December 2011).
It has everything you'd expect: invoice numbers, product codes, quantities, prices, customer IDs, countries. And it's messy: returns and refunds are mixed in with regular orders, and there are cancellations, negative quantities, ambiguous dates, and other inconsistencies. About a quarter of the rows are missing customer IDs.
This is what real business data looks like.
"But I don't know Python"
You don't need to.
Throughout this entire process, I'm going to:
- Write prompts to Claude Code
- Verify the data makes sense
- Run CLI commands
- Review output files and charts
What I'm NOT going to do is open a Python file and read the code. I'm not going to debug syntax errors. I'm not going to explain what pandas or matplotlib does under the hood.
Claude Code handles all of that. My job is to know what questions to ask my data. What do I want to analyze? How should the output be structured? What counts as a "return" versus a "cancellation"?
That's the skill. Knowing your business and knowing what you want to learn from your data. If you're good with Excel, you already have those skills. Prompting is just a different way of expressing them.
Think of it this way: you probably can't write a VLOOKUP formula from memory (I know I can't). But you know when you need one and what it should do. Same idea here - but more powerful.
One-time environment setup
If you've already got Python and Claude Code ready, skip to the next section.
First, you need Claude Code installed. Then I have it set up the Python environment for me.
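Something along these lines does the job (the wording is mine, and uv plus the library list are my choices, not requirements):

```
Set up a Python environment for data analysis in this folder using uv.
Install pandas, openpyxl, and matplotlib. Then create a CLAUDE.md that
records my preferences: I don't read Python code, so always write
scripts, run them yourself, and report the results in plain language.
```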
CLAUDE.md acts as a memory file: it helps Claude remember my preferences in future sessions. You run this setup once and never come back to it.
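Mine ended up looking roughly like this (illustrative, not the verbatim file):

```markdown
# Preferences
- Use the uv-managed Python environment in this folder
- I don't read Python code: run scripts and report results in plain language
- Never guess or estimate numbers - always compute them from the data
- Save generated reports and charts to output/
```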
Loading the data: establishing guardrails
AI models can hallucinate. They can make up numbers that sound plausible but aren't real.
So the first thing I do is establish guardrails. I tell Claude: don't guess, don't estimate. Write code that loads the data, run it, and THEN tell me what you found.
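My prompt went roughly like this (the path after @ is a placeholder for wherever your file lives):

```
Load @data/online_retail.xlsx. Don't guess or estimate anything. Write
code that loads the file, run it, and then tell me: how many rows, which
columns, what date range, and how many unique customers.
```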
Claude writes the code, runs it, and shows me what it found directly from the data. 541,909 rows - that matches what we expected. Date range from December 2010 to December 2011 - that's correct.
If the dates had shown January to September, I'd know something went wrong with the date parsing. This verification step catches those errors early.
The @ symbol is how you reference files in Claude Code. It's pointing to the Excel file in my data folder.
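For reference, the code Claude writes and runs for this check is only a few lines - a sketch, assuming pandas and an illustrative file path (I didn't read it at the time, and didn't need to):

```python
import pandas as pd

# Load the raw file and report facts straight from the data - no guessing.
df = pd.read_excel("data/online_retail.xlsx")  # illustrative path

print(f"Rows: {len(df):,}")
print(f"Columns: {list(df.columns)}")
print(f"Dates: {df['InvoiceDate'].min()} to {df['InvoiceDate'].max()}")
print(f"Unique customers: {df['CustomerID'].nunique():,}")
```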
Data prep: keeping the messy stuff
Here's where I do something different from most tutorials. Most people would say "drop the cancellations, drop the returns, clean up the data."
I'm not going to do that.
Returns and cancellations ARE the interesting business data. If a product has a 40% return rate, I want to know about that. If certain months have more cancellations, that's valuable information.
So instead, I have Claude add flag columns - is this row a cancellation, does it have a customer ID - without deleting anything. Now I can see clearly: about 10,500 rows flagged as cancellations, 135,000 rows with missing customer IDs. The data is still all there - but it's flagged so we can handle it appropriately.
When I analyze products, I'll use all the data. When I analyze customers, I'll exclude the rows without customer IDs - but only for that specific analysis.
This is what real business analysis looks like. Gross revenue, returns, net revenue. Your CFO cares about all three of those numbers.
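The flagging itself is tiny. A sketch, using this dataset's convention that invoice numbers starting with "C" mark cancellations:

```python
import pandas as pd

df = pd.read_excel("data/online_retail.xlsx")  # illustrative path

# Flag rather than delete: the messy rows stay available for analysis.
df["is_cancellation"] = df["InvoiceNo"].astype(str).str.startswith("C")
df["has_customer_id"] = df["CustomerID"].notna()

print(df["is_cancellation"].sum(), "cancellation rows")
print((~df["has_customer_id"]).sum(), "rows without a customer ID")
```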
Building the report generator
Here's the main event. Instead of running one-off queries, I'm going to have Claude build me a reusable script. One command to generate an entire Excel report with multiple sheets and charts.
Claude writes the Python code, hits a few errors along the way, fixes them itself, and produces the finished script. Generating everything took about four minutes, with no additional input from me.
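I never read the generated script, but for the curious, its shape is roughly this - a simplified sketch with illustrative names, not the actual generated code (the real script has more sheets, charts, and a QA step):

```python
import pandas as pd

def build_report(df: pd.DataFrame, top_n: int = 20, freq: str = "M") -> None:
    """Write a multi-sheet Excel report (simplified illustration)."""
    df = df.copy()
    df["LineTotal"] = df["Quantity"] * df["UnitPrice"]

    kpis = pd.DataFrame({
        "gross": [df.loc[df["Quantity"] > 0, "LineTotal"].sum()],
        "returns": [-df.loc[df["Quantity"] < 0, "LineTotal"].sum()],
        "net": [df["LineTotal"].sum()],
    })
    top_products = df.groupby("Description")["LineTotal"].sum().nlargest(top_n)
    sales_over_time = df.set_index("InvoiceDate")["LineTotal"].resample(freq).sum()

    with pd.ExcelWriter("report.xlsx") as writer:  # needs openpyxl installed
        kpis.to_excel(writer, sheet_name="KPIs", index=False)
        top_products.to_excel(writer, sheet_name="Top Products")
        sales_over_time.to_excel(writer, sheet_name="Sales Over Time")

build_report(pd.read_excel("data/online_retail.xlsx"))  # illustrative path
```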
The output
The Excel report comes out with every sheet I asked for:
- KPIs sheet with gross revenue, returns, net revenue, return rate
- Top Products with quantities, revenue, and return rate for each product
- Top Customers with return counts and average order values
- Sales Over Time by month
- Country breakdown with the key KPIs for each country
- QA sheet showing row counts at each stage and reconciliation checks
And the charts - clean, ready to put into a presentation.
Changing parameters
Want to see top 100 products instead of top 20?
One parameter change. Done.
Weekly data instead of monthly?
Just change M to W.
Only Q1 2011?
A start and an end date.
In Excel, each of these changes would mean rebuilding pivot tables, adjusting formulas, recreating charts. Here it's just a parameter.
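In terms of the sketch above, each of those tweaks is just a different call (again illustrative):

```python
import pandas as pd

df = pd.read_excel("data/online_retail.xlsx")  # illustrative path

build_report(df, top_n=100)  # top 100 products instead of top 20
build_report(df, freq="W")   # weekly trend instead of monthly

# Only Q1 2011: filter by date before building the report.
q1 = df[(df["InvoiceDate"] >= "2011-01-01") & (df["InvoiceDate"] < "2011-04-01")]
build_report(q1)
```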
Getting insights
Now I've got the data. But what does it mean?
I could read through all the numbers myself. Or I can have Claude help.
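A prompt along these lines does it (wording approximate; use your actual report filename):

```
Read report.xlsx and write a one-page summary of the five most important
findings as insights.md. Every claim needs the exact number and the sheet
it came from. If you can't point to a source, leave the claim out.
```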
Each insight has a specific number and a citation. "Return rate is X%, Source: KPIs sheet." This prevents Claude from making things up. If it can't point to where the number came from, it shouldn't be in the insight.
Now I have a shareable document. Not just chat history - an actual file I can attach to an email.
Verification: don't skip this
Most tutorials skip this part. But it's maybe the most important one.
AI can make mistakes. I've seen it calculate things wrong, misinterpret columns, use the wrong date format. You need to verify before you share this with others in your business.
High-level reconciliation
First, I have Claude recompute net revenue straight from the raw file - independently of the report script - and compare it to the KPIs sheet. The difference is zero. That's a good sign: two different calculation paths, same result.
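In sketch form, the check looks something like this (assuming pandas and the UCI column names):

```python
import pandas as pd

# Recompute net revenue independently of the report script.
df = pd.read_excel("data/online_retail.xlsx")  # illustrative path
df["LineTotal"] = df["Quantity"] * df["UnitPrice"]

gross = df.loc[df["Quantity"] > 0, "LineTotal"].sum()
returns = -df.loc[df["Quantity"] < 0, "LineTotal"].sum()
net = df["LineTotal"].sum()

# gross - returns should equal net, and net should match the KPIs sheet.
print(f"Gross {gross:,.2f} - Returns {returns:,.2f} = Net {gross - returns:,.2f}")
print(f"Direct net: {net:,.2f}  Difference: {(gross - returns) - net:,.2f}")
```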
Drill-down verification
Next, I have Claude export the line items behind one number - say, the top customer's orders - to a CSV. Now I can open it and spot-check. Pick a few rows. Does Quantity times UnitPrice equal LineTotal? Sum the LineTotal column - does it match what the report says?
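The export behind that is tiny - a sketch (the customer ID is an example; take one from your own Top Customers sheet):

```python
import pandas as pd

df = pd.read_excel("data/online_retail.xlsx")  # illustrative path
df["LineTotal"] = df["Quantity"] * df["UnitPrice"]

# Dump every line item for one customer so the numbers can be checked by hand.
customer_id = 14646  # example ID - use one from your report
df[df["CustomerID"] == customer_id].to_csv("check_customer.csv", index=False)
```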
This takes two minutes and can save you from presenting wrong numbers to your manager.
Cohort analysis: something hard to do in Excel
One more thing. This is hard to do in Excel, but valuable for understanding customer behavior: cohort analysis.
Which month's customers are the most loyal? How does retention change over time?
I describe what I want - cohorts by first-purchase month, retention as a percentage, drawn as a heatmap - and one minute later, I have it. Each row is a cohort: customers who made their first purchase in that month. Each column shows what percentage of them came back in a later month. Month 0 is always 100% - that's when they first bought. Then you can see how it drops off.
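The heart of the cohort computation looks roughly like this - a sketch, assuming pandas, with rows lacking a customer ID excluded, as discussed earlier:

```python
import pandas as pd

df = pd.read_excel("data/online_retail.xlsx")  # illustrative path
df = df[df["CustomerID"].notna()].copy()  # cohorts need a customer ID

# Tag each purchase with its month and each customer with a first-purchase month.
df["order_month"] = df["InvoiceDate"].dt.to_period("M")
first_purchase = df.groupby("CustomerID")["InvoiceDate"].transform("min")
df["cohort"] = first_purchase.dt.to_period("M")
df["months_since"] = (df["order_month"] - df["cohort"]).apply(lambda d: d.n)

# Active customers per cohort per month, divided by the cohort's starting size.
active = df.groupby(["cohort", "months_since"])["CustomerID"].nunique().unstack()
retention = active.div(active[0], axis=0)  # month 0 is 100% by definition
print(retention.round(2))
```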
Try doing this in Excel. It would take hours. We built it with one prompt.
Bonus: Jupyter notebook
If you want a more interactive way to explore the data, ask for a notebook.
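For example (wording approximate):

```
Create a Jupyter notebook that loads the dataset, exposes the report
parameters (top N, frequency, date range) in the first cell, and shows
the tables and charts inline so I can tweak values and re-run.
```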
Claude generates the notebook, and now you can change parameters and see results immediately - without going back to the command line.
What we built
In about 15-20 minutes, we:
- Built a reusable report generator with KPIs, rankings, time trends, and country breakdown
- Created professional charts ready for presentations
- Built a customer retention analysis with a heatmap
- Set up verification steps to make sure the numbers are right
Next month, when new data comes in? One command. No pivot table rebuilding. No formula fixes.
This approach works for all kinds of data:
- Survey responses
- Financial data and expense reports
- Marketing campaign performance
- Inventory tracking
- Really any data that comes in a spreadsheet
I never even opened a Python file. What matters is knowing what you want to learn from your data and being able to describe it clearly in a prompt.
If Claude makes an error - maybe it misses a file path or gets a column name wrong - you just tell it. "The file is actually here" or "That column is called X not Y." It fixes itself.
And always verify. Always. The QA sheet, the drill-downs, the spot-checks. That's what separates useful AI assistance from blind trust.
Resources
- Dataset: UCI Online Retail Dataset (CC BY 4.0)
- Claude Code: claude.ai/code
- UV (Python package manager): astral.sh/uv
Now, go play with your data!
