Python in Practice 96: Optical Character Recognition with pytesseract


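Introduction

OCR (Optical Character Recognition) is the process by which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text.
Tesseract is an open-source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, and so on), but it handles handwriting poorly. pytesseract is a Python package that wraps Tesseract.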

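Preparation

First, download and install the tesseract-ocr software. On macOS, run brew install tesseract.
Install the Python libraries: pip install pytesseract and pip install pillow.
Take a screenshot of some text and save the image as zen.png. The text used here is: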

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
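Because pytesseract only calls the tesseract executable installed in the first step, it can be worth confirming that the binary is reachable before running any OCR code. The following is a minimal sketch that uses only the standard library; the helper name tesseract_status is made up for this illustration and is not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_status(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version banner may go to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    status = tesseract_status()
    print(status if status is not None else "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the install step probably did not complete, or the install directory is missing from PATH.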

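Code

import pytesseract
from PIL import Image

if __name__ == '__main__':
    # Open the screenshot and print the text Tesseract recognizes in it.
    # With no lang argument, Tesseract falls back to English.
    print(pytesseract.image_to_string(Image.open('zen.png')))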
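The script above relies on the default language, English. For other languages, image_to_string accepts a lang argument, and Tesseract lets several language codes be combined with a plus sign. The following is a rough sketch of how that could be wrapped. The helpers build_lang and ocr_file are hypothetical names introduced here, and any code other than eng (for example chi_sim for Simplified Chinese) assumes the matching trained-data file is installed.

```python
def build_lang(*codes):
    """Join Tesseract language codes with '+', e.g. ('chi_sim', 'eng') -> 'chi_sim+eng'."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_file(path, *codes):
    """OCR the image at `path` with the given language codes (English if none are given)."""
    # Imported here so build_lang() stays usable without the OCR dependencies.
    import pytesseract
    from PIL import Image

    lang = build_lang(*(codes or ("eng",)))
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling ocr_file('zen.png') should behave like the script above, while ocr_file('page.png', 'chi_sim', 'eng') would try both Simplified Chinese and English on a hypothetical mixed-language image named page.png.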

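Code location

The articles and code in this series are archived as a project on GitHub, in the repository jumper2014/PyCodeComplete. If you find them helpful, please star the repository on GitHub; your support is what keeps me updating. What, you don't have a GitHub account? How can you learn Python without one? Go and sign up!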

About author

曾月天
