Python Tips: Converting Office Files to PDF

May 31, 2021

This article comes from the WeChat official account Python Technology. Author: Pieson Sauce

In day-to-day work we often rely on small tricks to solve the small problems we run into. Today's article shares a quick and easy tip for converting Office files (doc/docx/ppt/pptx/xls/xlsx) to PDF in bulk. You need Microsoft Office installed on your PC first; the conversion itself is then done through Python's win32com package.

Install win32com

Before you start, you need to install Python's win32com, which is provided by the pywin32 package. The detailed installation steps are as follows:

Install with the pip command

pip install pywin32

If the installation fails, upgrade pip with python -m pip install --upgrade pip and then try again:

python -m pip install --upgrade pip
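
Once the install finishes, it can be worth confirming that the package is actually importable before running any conversion. The following is a minimal sketch and not part of the original article: the helper name has_module is hypothetical, and it relies only on importlib.util.find_spec from the standard library.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    # find_spec returns None when a top-level module is not installed
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # win32com and pythoncom are both shipped by the pywin32 package
    for module in ("win32com", "pythoncom"):
        status = "found" if has_module(module) else "missing"
        print(module, status)
```

If either module is reported as missing, the pip step above has probably not completed for the interpreter you are running.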

Install from the offline installation package

If the pip installation does not succeed, you can install from an offline package instead. First download the installer that matches your Python version from the official site: sourceforge.net/projects/pywin32/files/pywin32/Build%20221/ and then run it, clicking through the prompts to complete the installation.

File conversion logic

The detailed code is as follows:

import os

import pythoncom
from win32com.client import Dispatch, DispatchEx, constants, gencache


class PDFConverter:
    def __init__(self, pathname, export='.'):
        # File types that can be converted
        self._handle_postfix = ['doc', 'docx', 'ppt', 'pptx', 'xls', 'xlsx']
        # Files queued for conversion
        self._filename_list = list()
        # Note: the export argument is not used; PDFs always go to this folder
        self._export_folder = os.path.join(os.path.abspath('.'), 'file_server/pdfconver')
        # makedirs also creates the intermediate file_server folder
        os.makedirs(self._export_folder, exist_ok=True)
        self._enumerate_filename(pathname)


    def _enumerate_filename(self, pathname):
        '''
        Collect the names of all files to convert
        '''
        full_pathname = os.path.abspath(pathname)
        if os.path.isfile(full_pathname):
            if self._is_legal_postfix(full_pathname):
                self._filename_list.append(full_pathname)
            else:
                raise TypeError('File {} has an invalid extension! Only the following file types are supported: {}.'.format(pathname, ', '.join(self._handle_postfix)))
        elif os.path.isdir(full_pathname):
            for relpath, _, files in os.walk(full_pathname):
                for name in files:
                    # os.walk already yields the full path of each directory
                    filename = os.path.join(relpath, name)
                    if self._is_legal_postfix(filename):
                        self._filename_list.append(filename)
        else:
            raise TypeError('File/folder {} does not exist or is invalid!'.format(pathname))


    def _is_legal_postfix(self, filename):
        # Supported extension, and not a temporary file whose name starts with '~'
        return filename.split('.')[-1].lower() in self._handle_postfix and not os.path.basename(filename).startswith(
            '~')


    def run_conver(self):
        print('Number of files to convert:', len(self._filename_list))
        for filename in self._filename_list:
            postfix = filename.split('.')[-1].lower()
            # Look up the method named after the extension (doc, pptx, ...)
            funcCall = getattr(self, postfix)
            print('Source file:', filename)
            funcCall(filename)
        print('Conversion complete!')
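
The file-collection step can be tried on its own, without Office installed. The following is a rough sketch under assumptions: is_supported and collect_office_files are hypothetical standalone functions, not the class methods above, and they use os.path.splitext rather than splitting on dots. The sketch walks a folder, keeps the six supported extensions, and skips names starting with ~, such as the ~$ lock files Office creates while a document is open.

```python
import os
import tempfile

SUPPORTED = {"doc", "docx", "ppt", "pptx", "xls", "xlsx"}


def is_supported(filename):
    """True for a supported extension, unless the name starts with '~'."""
    base = os.path.basename(filename)
    ext = os.path.splitext(base)[1].lstrip(".").lower()
    return ext in SUPPORTED and not base.startswith("~")


def collect_office_files(root):
    """Return sorted absolute paths of convertible files under root."""
    found = []
    for dirpath, _, files in os.walk(os.path.abspath(root)):
        for name in files:
            if is_supported(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Build a throwaway folder so the sketch runs anywhere
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "sub"))
        for rel in ("a.docx", "~$a.docx", "notes.txt", os.path.join("sub", "b.XLS")):
            open(os.path.join(tmp, rel), "w").close()
        for path in collect_office_files(tmp):
            print(os.path.relpath(path, tmp))
```

Only a.docx and the spreadsheet in the subfolder are kept; the lock file and the text file are filtered out.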

Converting doc/docx to PDF

The code for converting doc/docx to PDF is as follows:

    def doc(self, filename):
        # splitext keeps dots inside the base name (report.v2.doc -> report.v2.pdf)
        name = os.path.splitext(os.path.basename(filename))[0] + '.pdf'
        exportfile = os.path.join(self._export_folder, name)
        print('Saving PDF file:', exportfile)
        # Initialize COM for this thread before creating any COM object
        pythoncom.CoInitialize()
        # Load the Word type library wrappers so that constants.wd* resolve
        gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)
        w = Dispatch("Word.Application")
        doc = w.Documents.Open(filename)
        doc.ExportAsFixedFormat(exportfile, constants.wdExportFormatPDF,
                                Item=constants.wdExportDocumentWithMarkup,
                                CreateBookmarks=constants.wdExportCreateHeadingBookmarks)
        w.Quit(constants.wdDoNotSaveChanges)


    def docx(self, filename):
        self.doc(filename)
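
One detail worth checking when building the output path is how the PDF name is derived from a source file whose name contains more than one dot. The following minimal sketch uses two hypothetical helpers (neither is in the original code) to compare cutting at the first dot with os.path.splitext, which strips only the final extension:

```python
import os


def pdf_name_first_dot(filename):
    """Cut the base name at the first dot, then append .pdf."""
    return os.path.basename(filename).split(".")[0] + ".pdf"


def pdf_name_for(filename):
    """Strip only the last extension, then append .pdf."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    return stem + ".pdf"


if __name__ == "__main__":
    for sample in ("plan.docx", "report.v2.docx"):
        print(sample, pdf_name_first_dot(sample), pdf_name_for(sample))
```

With the first-dot approach, report.v1.docx and report.v2.docx would both map to report.pdf, so in a shared export folder the second export would overwrite the first.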

Converting ppt/pptx to PDF

The code for converting ppt/pptx to PDF is as follows:

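    def ppt(self, filename):
        name = os.path.splitext(os.path.basename(filename))[0] + '.pdf'
        exportfile = os.path.join(self._export_folder, name)
        # Initialize COM for this thread before creating any COM object
        pythoncom.CoInitialize()
        # This GUID is the Word type library; it is probably not needed here,
        # since no wd* constants are used below
        gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)
        p = Dispatch("PowerPoint.Application")
        # Open(FileName, ReadOnly, Untitled, WithWindow): open without a window
        ppt = p.Presentations.Open(filename, False, False, False)
        # 2 is ppFixedFormatTypePDF
        ppt.ExportAsFixedFormat(exportfile, 2, PrintRange=None)
        print('Saving PDF file:', exportfile)
        p.Quit()


    def pptx(self, filename):
        self.ppt(filename)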
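
If an export call raises an exception, the Quit() line after it is never reached, and the Office process may keep running in the background. One way to guard against that is a context manager built on try/finally. The following is a sketch under assumptions: office_app is a hypothetical helper, and the factory argument stands in for a call such as Dispatch("PowerPoint.Application"), so the pattern can be exercised with a stand-in object on a machine without Office.

```python
from contextlib import contextmanager


@contextmanager
def office_app(factory):
    """Create an application object and always call Quit() on exit."""
    app = factory()
    try:
        yield app
    finally:
        # Runs on success and on error alike
        app.Quit()


class FakeApp:
    """Stand-in for a COM application object, for demonstration only."""

    def __init__(self):
        self.quit_called = False

    def Quit(self):
        self.quit_called = True


if __name__ == "__main__":
    fake = FakeApp()
    try:
        with office_app(lambda: fake) as app:
            raise RuntimeError("simulated export failure")
    except RuntimeError:
        pass
    # Quit() was still called even though the body raised
    print(fake.quit_called)
```

On Windows the same pattern would presumably be used as with office_app(lambda: Dispatch("PowerPoint.Application")) as p, with Dispatch imported from win32com.client.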

Converting xls/xlsx to PDF

    def xls(self, filename):
        name = os.path.splitext(os.path.basename(filename))[0] + '.pdf'
        exportfile = os.path.join(self._export_folder, name)
        # Initialize COM for this thread before creating any COM object
        pythoncom.CoInitialize()
        # DispatchEx starts a separate Excel instance
        xlApp = DispatchEx("Excel.Application")
        xlApp.Visible = False
        xlApp.DisplayAlerts = 0
        # Second argument is UpdateLinks; False skips updating external links
        books = xlApp.Workbooks.Open(filename, False)
        # 0 is xlTypePDF
        books.ExportAsFixedFormat(0, exportfile)
        books.Close(False)
        print('Saving PDF file:', exportfile)
        xlApp.Quit()


    def xlsx(self, filename):
        self.xls(filename)
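
With all six handlers defined, the getattr lookup in run_conver can resolve any supported extension to a method of the same name. The idea can be seen in isolation in the following sketch; StubConverter and its return values are purely illustrative and perform no real conversion.

```python
import os


class StubConverter:
    """Illustrative only: handlers return a label instead of converting."""

    SUPPORTED = ("doc", "docx", "xls")

    def doc(self, filename):
        return "word"

    def docx(self, filename):
        # Same handler as doc, mirroring the delegation pattern
        return self.doc(filename)

    def xls(self, filename):
        return "excel"

    def handle(self, filename):
        # The lowercase extension doubles as the handler's method name
        postfix = os.path.splitext(filename)[1].lstrip(".").lower()
        if postfix not in self.SUPPORTED:
            raise ValueError("unsupported file type: " + filename)
        return getattr(self, postfix)(filename)


if __name__ == "__main__":
    converter = StubConverter()
    for sample in ("a.DOCX", "b.xls"):
        print(sample, converter.handle(sample))
```

Checking the extension against an explicit list before calling getattr keeps an unexpected file name from resolving to some unrelated attribute of the object.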

Running the conversion

if __name__ == "__main__":
    # Batch conversion of a whole folder is supported
    #folder = 'tmp'
    #pathname = os.path.join(os.path.abspath('.'), folder)
    # Converting a single file is also supported
    pathname = "G:/python_study/test.doc"
    pdfConverter = PDFConverter(pathname)
    pdfConverter.run_conver()

That concludes W3Cschool编程狮's tips on converting Office files to PDF with Python.