ExcelPython Tutorial: Read, Write, and Analyze Spreadsheets with Python
Spreadsheets remain a core tool for business, finance, data analysis, and personal productivity. Python, with its rich ecosystem of libraries, lets you automate, clean, analyze, and visualize spreadsheet data far more reliably and quickly than manual editing. This tutorial walks through reading, writing, and analyzing Excel files using Python, covering common libraries, practical examples, best practices, and performance tips.
Why use Python with Excel?
- Automation: Replace repetitive manual tasks (formatting, copying, formula updates) with repeatable scripts.
- Scalability: Process many files or very large datasets without manual intervention.
- Reproducibility: Scripts serve as documented, versionable workflows.
- Powerful analysis: Leverage pandas, NumPy, and visualization libraries to do analyses that are cumbersome in Excel.
Overview of popular Python libraries for Excel
- pandas — High-level data manipulation; reads/writes Excel via engine backends.
- openpyxl — Read/write xlsx files; supports styles, charts, and formulas.
- xlrd / xlwt — Legacy libraries for old .xls files (limited for modern use).
- pyxlsb — Read binary Excel files (.xlsb).
- xlwings — Live interaction with Excel application (Windows/macOS), run Python from Excel and manipulate the UI.
- win32com (pywin32) — Automate Excel through COM on Windows (powerful but platform-specific).
- odfpy — Work with OpenDocument spreadsheets (.ods).
For most data tasks, pandas + openpyxl (for xlsx features) or pandas + xlrd/pyxlsb (for specific formats) will be sufficient.
Setup and installation
Install core packages via pip:
pip install pandas openpyxl xlrd pyxlsb xlwings
Note: Since version 2.0, xlrd reads only .xls files, so use openpyxl for .xlsx and pyxlsb for .xlsb. pandas selects an engine automatically when reading Excel; you can override it with the engine parameter.
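As a minimal sketch of the engine override, the snippet below writes a small throwaway workbook and reads it back twice, once with automatic engine selection and once with engine="openpyxl". The file name and column names are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway workbook so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "engine_demo.xlsx"
pd.DataFrame({"OrderID": ["A1", "A2"], "Total": [10.5, 20.0]}).to_excel(path, index=False)

# Let pandas choose the engine, then name it explicitly.
auto_df = pd.read_excel(path)
explicit_df = pd.read_excel(path, engine="openpyxl")

print(explicit_df)
```

Naming the engine is mostly useful when a file has an unusual extension or when you want an import error to point at the missing package.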
Reading Excel files
Basic reading with pandas:
import pandas as pd
df = pd.read_excel("data.xlsx")  # first sheet by default
df2 = pd.read_excel("data.xlsx", sheet_name="Sheet2")
sheets = pd.read_excel("data.xlsx", sheet_name=None)  # returns dict of DataFrames
Common options:
- sheet_name: str, int, list, or None (None -> all sheets)
- usecols: list or string like “A:C, F” to read specific columns
- skiprows: int or list to skip header rows
- header: row index to use for column names
- dtype: enforce data types
- parse_dates: parse columns as datetimes
Example reading specific columns and parsing dates:
df = pd.read_excel(
    "sales.xlsx",
    sheet_name="Orders",
    usecols=["OrderID", "Date", "Total"],
    parse_dates=["Date"],
    dtype={"OrderID": str},
)
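The remaining options (skiprows, header, a string usecols, nrows) tend to matter when an export carries banner rows above the real header. The sketch below builds such a sheet with openpyxl and reads it back; the layout and column names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Build a sheet with two banner rows above the real header (hypothetical layout).
wb = Workbook()
ws = wb.active
ws.title = "Orders"
ws.append(["Quarterly export"])
ws.append(["Generated by the sales system"])
ws.append(["OrderID", "Region", "Total", "Notes"])
ws.append([1001, "North", 250.0, "rush"])
ws.append([1002, "South", 99.5, None])
ws.append([1003, "East", 410.0, None])
path = Path(tempfile.mkdtemp()) / "banner.xlsx"
wb.save(path)

df_banner = pd.read_excel(
    path,
    sheet_name="Orders",
    skiprows=2,              # skip the two banner rows
    header=0,                # the first remaining row holds the column names
    usecols="A:C",           # Excel-style column range; leaves out "Notes"
    nrows=2,                 # read only the first two data rows
    dtype={"OrderID": str},  # keep IDs as text
)

print(df_banner)
```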
Handling large files:
- Use usecols and nrows to limit IO.
- For extremely large Excel files, convert to CSV if possible and stream with chunksize in pandas.read_csv.
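The convert-then-stream approach can be sketched as follows. A 1,000-row workbook stands in for a genuinely large file, and the chunk size of 250 is an arbitrary choice for the demo.

```python
import tempfile
from pathlib import Path

import pandas as pd

tmp_dir = Path(tempfile.mkdtemp())
xlsx_path = tmp_dir / "big.xlsx"
csv_path = tmp_dir / "big.csv"

# Stand-in for a large workbook: 1,000 rows with a numeric Total column.
pd.DataFrame({"OrderID": range(1000), "Total": [1.5] * 1000}).to_excel(xlsx_path, index=False)

# One-time conversion: read only the needed columns, then save as CSV.
pd.read_excel(xlsx_path, usecols=["OrderID", "Total"]).to_csv(csv_path, index=False)

# Stream the CSV so that only one 250-row chunk is in memory at a time.
running_total = 0.0
rows_seen = 0
for chunk in pd.read_csv(csv_path, chunksize=250):
    running_total += chunk["Total"].sum()
    rows_seen += len(chunk)

print(rows_seen, running_total)
```

The conversion step still loads the workbook once, so the benefit shows up when the same data is processed repeatedly afterwards.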
Writing Excel files
Write a single DataFrame:
df.to_excel("output.xlsx", index=False)
Write multiple sheets:
with pd.ExcelWriter("multi_sheet.xlsx", engine="openpyxl") as writer:
    df_orders.to_excel(writer, sheet_name="Orders", index=False)
    df_customers.to_excel(writer, sheet_name="Customers", index=False)
Append to an existing workbook:
# mode="a" opens the existing workbook; no separate load_workbook call is needed
with pd.ExcelWriter("existing.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
    df_new.to_excel(writer, sheet_name="NewData", index=False)
Preserving formats: pandas writes raw values; to preserve styles or add formatting, use openpyxl directly or style after writing.
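One way to style after writing, sketched under the assumption that the openpyxl engine is in use: writer.sheets exposes the openpyxl worksheet behind each sheet name, so formatting can be applied before the writer closes. The data, fill colour, and number format here are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

df_orders = pd.DataFrame({"OrderID": ["A1", "A2"], "Total": [10.5, 20.0]})
path = Path(tempfile.mkdtemp()) / "styled_output.xlsx"

with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df_orders.to_excel(writer, sheet_name="Orders", index=False)
    ws = writer.sheets["Orders"]        # openpyxl worksheet behind the sheet
    for cell in ws[1]:                  # row 1 holds the header cells
        cell.fill = PatternFill("solid", fgColor="DDDDDD")
    ws.freeze_panes = "A2"              # keep the header visible when scrolling
    for cell in ws["B"][1:]:            # data cells in the Total column
        cell.number_format = "#,##0.00"

# Reopen the file to confirm the formatting was saved.
check = load_workbook(path)["Orders"]
print(check.freeze_panes, check["B2"].number_format)
```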
Working with openpyxl for formatting and formulas
openpyxl lets you modify workbook structure, cell styles, and formulas.
Example: creating a workbook, adding styles, and formulas:
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils import get_column_letter
wb = Workbook()
ws = wb.active
ws.title = "Report"
# Headers with bold font
headers = ["Item", "Qty", "Price", "Total"]
for col, h in enumerate(headers, start=1):
    cell = ws.cell(row=1, column=col, value=h)
    cell.font = Font(bold=True)
    cell.fill = PatternFill("solid", fgColor="DDDDDD")
# Data rows
data = [["Apple", 10, 0.5], ["Banana", 5, 0.7]]
for i, row in enumerate(data, start=2):
    ws.cell(row=i, column=1, value=row[0])
    ws.cell(row=i, column=2, value=row[1])
    ws.cell(row=i, column=3, value=row[2])
    ws.cell(row=i, column=4, value=f"=B{i}*C{i}")
# Auto-adjust column widths
for col in ws.columns:
    max_length = max(len(str(cell.value)) if cell.value is not None else 0 for cell in col)
    ws.column_dimensions[get_column_letter(col[0].column)].width = max_length + 2
wb.save("styled_report.xlsx")
openpyxl supports charts, merged cells, filters, and named ranges.
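A short sketch of three of those features (a bar chart, a merged cell range, and an auto-filter) follows. The sheet contents and cell positions are made up for the example.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
ws.title = "Report"
rows = [["Item", "Qty"], ["Apple", 10], ["Banana", 5], ["Cherry", 8]]
for row in rows:
    ws.append(row)

# Bar chart of Qty by Item, anchored with its top-left corner at D2.
chart = BarChart()
chart.title = "Quantity by item"
values = Reference(ws, min_col=2, min_row=1, max_row=len(rows))  # includes the header
labels = Reference(ws, min_col=1, min_row=2, max_row=len(rows))
chart.add_data(values, titles_from_data=True)
chart.set_categories(labels)
ws.add_chart(chart, "D2")

# A merged footnote cell and a filter over the data range.
ws["A6"] = "Source: sample data"
ws.merge_cells("A6:B6")
ws.auto_filter.ref = "A1:B4"

path = Path(tempfile.mkdtemp()) / "chart_report.xlsx"
wb.save(path)

# Reopen to confirm the structure survived the round trip.
reloaded = load_workbook(path)["Report"]
merged_ranges = [rng.coord for rng in reloaded.merged_cells.ranges]
print(merged_ranges, reloaded.auto_filter.ref)
```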
Using xlwings for live Excel automation
xlwings is ideal when you need to control the Excel application (macros, user-interactive sheets) or call Python from Excel.
Basic example:
import pandas as pd
import xlwings as xw
wb = xw.Book("interactive.xlsx")  # opens or connects to workbook (requires Excel)
sht = wb.sheets["Sheet1"]
# read the contiguous block starting at A1 into a DataFrame
data = sht.range("A1").expand().options(pd.DataFrame, index=False).value
# write back results: label in F1, formula in G1
sht.range("F1").value = ["Total", "=SUM(C2:C100)"]
xlwings can create UDFs (user-defined functions) callable from Excel cells and integrate with VBA workflows.
Data cleaning and transformation patterns
pandas makes spreadsheet-style cleaning reproducible.
- Drop or rename columns:
df = df.drop(columns=["Unnecessary"])
df = df.rename(columns={"OldName": "NewName"})
- Fill or drop missing values:
df["Qty"] = df["Qty"].fillna(0)
df = df.dropna(subset=["OrderID"])
- Convert data types:
df["Date"] = pd.to_datetime(df["Date"])
df["Price"] = df["Price"].astype(float)
- Pivot, groupby, and aggregate:
summary = df.groupby("Category").agg(
    total_sales=pd.NamedAgg(column="Total", aggfunc="sum"),
    avg_price=pd.NamedAgg(column="Price", aggfunc="mean"),
    orders=pd.NamedAgg(column="OrderID", aggfunc="nunique"),
).reset_index()
- Merge/join sheets:
merged = df_orders.merge(df_customers, on="CustomerID", how="left")
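The pivot pattern named in the list, together with a quick check on the left join, might look like the sketch below. The sample frames and the Region column are invented for illustration; indicator=True is a standard merge option that records where each row came from.

```python
import pandas as pd

df_orders = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "CustomerID": ["C1", "C2", "C1", "C9"],
    "Category": ["Fruit", "Fruit", "Dairy", "Dairy"],
    "Total": [10.0, 5.0, 7.5, 2.5],
})
df_customers = pd.DataFrame({"CustomerID": ["C1", "C2"], "Region": ["North", "South"]})

# Left join; the _merge column flags orders with no matching customer.
merged = df_orders.merge(df_customers, on="CustomerID", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# Excel-style pivot: categories as rows, regions as columns, summed totals.
pivot = merged.pivot_table(
    index="Category", columns="Region", values="Total", aggfunc="sum", fill_value=0
)

print(unmatched[["OrderID", "CustomerID"]])
print(pivot)
```

Rows without a Region (the unmatched order here) are left out of the pivot, which is one reason to inspect the unmatched rows first.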
Example workflow: Monthly sales report
- Read raw order and product sheets.
- Clean dates and numeric types.
- Compute order totals and join product categories.
- Aggregate monthly totals and top products.
- Write a styled Excel report with summary sheet and charts.
Sketch:
import pandas as pd
# Read
orders = pd.read_excel("orders.xlsx", sheet_name="Orders", parse_dates=["OrderDate"])
products = pd.read_excel("orders.xlsx", sheet_name="Products")
# Clean
orders["Quantity"] = orders["Quantity"].fillna(0).astype(int)
orders["UnitPrice"] = orders["UnitPrice"].astype(float)
orders["Total"] = orders["Quantity"] * orders["UnitPrice"]
# Join (assumes ProductName lives on the Products sheet; it is needed for the ranking below)
df = orders.merge(products[["ProductID", "ProductName", "Category"]], on="ProductID", how="left")
# Aggregate ("M" means month-end; pandas 2.2+ prefers the alias "ME")
monthly = df.set_index("OrderDate").resample("M")["Total"].sum().rename("MonthlySales").reset_index()
top_products = df.groupby("ProductName")["Total"].sum().nlargest(10).reset_index()
# Write
with pd.ExcelWriter("monthly_report.xlsx", engine="openpyxl") as writer:
    monthly.to_excel(writer, sheet_name="Monthly", index=False)
    top_products.to_excel(writer, sheet_name="TopProducts", index=False)
Add charts later via openpyxl or use matplotlib/seaborn to create images and insert into the workbook.
Performance tips
- Read only needed columns and rows (usecols, nrows).
- Avoid reading many small Excel files repeatedly; batch them or convert to a common format (CSV/Parquet).
- Use vectorized pandas operations rather than row-by-row loops.
- For extremely large tabular data, convert to Parquet and operate there; write back to Excel only for final reporting.
- Use multiprocessing or Dask for parallel processing of many files.
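A rough sketch of batching many files and then computing with a vectorized operation is shown below. It uses a thread pool from the standard library rather than multiprocessing or Dask to keep the example short; ProcessPoolExecutor offers the same map interface if parsing turns out to be CPU-bound. File names and columns are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd

# Create three small monthly workbooks to stand in for a folder of exports.
tmp_dir = Path(tempfile.mkdtemp())
for month in (1, 2, 3):
    sample = pd.DataFrame({"Qty": [1, 2], "UnitPrice": [2.0, 3.0]})
    sample.to_excel(tmp_dir / f"sales_{month:02d}.xlsx", index=False)


def load_one(path):
    """Read only the needed columns and remember which file each row came from."""
    frame = pd.read_excel(path, usecols=["Qty", "UnitPrice"])
    frame["SourceFile"] = path.name
    return frame


files = sorted(tmp_dir.glob("sales_*.xlsx"))
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(load_one, files))

# Combine once, then compute with a vectorized expression instead of a row loop.
combined = pd.concat(frames, ignore_index=True)
combined["Total"] = combined["Qty"] * combined["UnitPrice"]

print(combined.groupby("SourceFile")["Total"].sum())
```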
Common pitfalls and troubleshooting
- Mixed datatypes in columns cause dtype surprises—use dtype or convert after reading.
- Date parsing can fail for nonstandard formats—use pd.to_datetime with format or dayfirst flags.
- Excel formulas are stored as formula strings; reading via pandas returns evaluated values, not formulas. Use openpyxl to read/edit formulas.
- When appending sheets, watch for sheet-name collisions and how the if_sheet_exists parameter handles them (error, new, replace, or overlay).
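Two of these pitfalls can be reproduced in a few lines. The sketch below writes a formula with openpyxl and reads it back three ways, then parses day-first date strings with an explicit format. One nuance it illustrates: a workbook that has never been opened and saved by Excel has no cached results, so value-only reads come back empty. The sample values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook, load_workbook

# A workbook whose Total cell is a formula, written without Excel.
wb = Workbook()
ws = wb.active
ws.append(["Qty", "Price", "Total"])
ws.append([10, 0.5, "=A2*B2"])
path = Path(tempfile.mkdtemp()) / "formulas.xlsx"
wb.save(path)

# Default load keeps the formula text.
formula_text = load_workbook(path).active["C2"].value
# data_only=True returns the cached result, which is missing here because
# Excel has never calculated and saved this file.
cached_value = load_workbook(path, data_only=True).active["C2"].value
# pandas also reads cached results, so Total is missing for the same reason.
df_formulas = pd.read_excel(path)

# Day-first dates: an explicit format avoids month/day ambiguity.
raw_dates = pd.Series(["03/04/2024", "15/04/2024"])
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(formula_text, cached_value)
print(parsed)
```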
Security considerations
- Beware of malicious macros in Excel files. Do not enable macros from untrusted sources.
- When automating Excel via COM or xlwings, user interaction and unsaved changes can affect runs—test in a controlled environment.
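As a small precaution before opening an unfamiliar workbook, you can check whether it embeds a VBA project at all. The sketch below relies on the fact that .xlsx and .xlsm files are zip archives and that macro-enabled ones carry a vbaProject.bin part. The helper name has_vba_macros is hypothetical, and the check only detects the presence of macros; it does not judge whether they are malicious.

```python
import tempfile
import zipfile
from pathlib import Path


def has_vba_macros(path):
    """Return True if a zip-based workbook (.xlsx/.xlsm) embeds a VBA project."""
    if not zipfile.is_zipfile(path):
        # Legacy .xls and other non-zip formats need a different kind of check.
        raise ValueError(f"{path} is not a zip-based workbook")
    with zipfile.ZipFile(path) as archive:
        return any(name.lower().endswith("vbaproject.bin") for name in archive.namelist())


# Stand-in archives for the demo; these are not real workbooks.
tmp_dir = Path(tempfile.mkdtemp())
plain_path = tmp_dir / "plain.xlsx"
macro_path = tmp_dir / "with_macro.xlsm"
with zipfile.ZipFile(plain_path, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
with zipfile.ZipFile(macro_path, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/vbaProject.bin", b"placeholder bytes, not real VBA")

print(has_vba_macros(plain_path), has_vba_macros(macro_path))
```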
Next steps and recommended learning path
- Master pandas DataFrame operations (groupby, pivot, joins).
- Learn openpyxl for formatting and charts.
- Explore xlwings if you need tight coupling with the Excel app.
- Practice converting Excel workflows to scripted pipelines, and use version control for reproducibility.
This tutorial covered practical reading, writing, cleaning, and reporting patterns for working with Excel files in Python using pandas, openpyxl, and xlwings.