Use openpyxl - Convert to DataFrame in Pandas

October 13, 2018 python

Page content

Introduction

Previously, I wrote several articles on working Excel files using openpyxl.

In this article, I introduce how to convert openpyxl data to Pandas data format called DataFrame.

Preparation

Install modules

First, install module with pip command.

pip install openpyxl pandas

Make sample data

Second, create a new file named sample.xlsx including the following data.

Workbook has a sheet named no_header that doesn’t have header line.

no_header_sample_sheet

Workbook has a sheet named sample that has a header line.

sample_sheet

Convert openpyxl object to DataFrame

Load Excel data with openpyxl and convert to DataFrame. DataFrame is used to represent 2D data on Pandas. Since Excel data is also 2D data expressed by rows and columns, Worksheet object in [openpyxl] (https://openpyxl.readthedocs.io/en/stable/index.html) can be converted to Pandas DataFrame object.

Data without header line

When converting a file that has no header line, give values property on Worksheet object to DataFrame constructor.

 1from openpyxl import load_workbook
 2import pandas as pd
 3
 4# Load workbook
 5wb = load_workbook('sample.xlsx')
 6# Access to a worksheet named 'no_header'
 7ws = wb['no_header']
 8
 9# Convert to DataFrame
10df = pd.DataFrame(ws.values)

Check the result.

1df.head(3)

	0	1	2
0	1	Mike	male
1	2	Mayor	female
2	3	Jon	male

The column number is displayed as a header.

Data with header line

In case of a file that has a header line, it is necessary to change processing. When creating a DataFrame object, specify column name with columns option. At this time, the length of array given columns option must be equal to length of columns in DataFrame.

The sample code is as follows.

 1from openpyxl import load_workbook
 2import pandas as pd
 3
 4wb = load_workbook('sample.xlsx')
 5ws = wb['sample']
 6
 7data = ws.values
 8# Get the first line in file as a header line
 9columns = next(data)[0:]
10# Create a DataFrame based on the second and subsequent lines of data
11df = pd.DataFrame(data, columns=columns)

Check the result.

1df.head(3)

	ID	Name	Sex
0	1	Mike	male
1	2	Mayor	female
2	3	Jon	male

The column name is displayed.

Loading Excel file easier with read_excel function

Using the read_excel function in Pandas, we can do the same processing. To use read_excel function, install xlrd and openpyxl.

1pip install openpyxl pandas xlrd

Call read_excel function as below.

1import pandas as pd
2
3df = pd.read_excel('sample.xlsx', sheet_name='sample')
4df.head()

Conclusion

It is available to

Convert Worksheet object with or without headers to DataFrame object
Make it simpler with read_excel function in Pandas

Use openpyxl - Convert to DataFrame in Pandas

Introduction

Preparation

Install modules

Make sample data

Convert openpyxl object to DataFrame

Data without header line

Data with header line

Loading Excel file easier with read_excel function

Conclusion

Use openpyxl - read and write Cell in Python

Use openpyxl - create a new Worksheet, change sheet property in Python

Use openpyxl - open, save Excel files in Python

Google Colaboratory is the best tool for machine learning engineer

Convert a string representing the date or epoch time to datetime and change timezone in Python