[Buildroot] [PATCH v2 2/2] tesseract-ocr: new package

Thomas Petazzoni thomas.petazzoni at free-electrons.com
Sun Mar 19 13:54:55 UTC 2017


Hello,

On Sun, 19 Mar 2017 09:07:53 +0100, Gilles Talis wrote:
> diff --git a/package/tesseract-ocr/Config.in b/package/tesseract-ocr/Config.in
> new file mode 100644
> index 0000000..4fd0668
> --- /dev/null
> +++ b/package/tesseract-ocr/Config.in
> @@ -0,0 +1,44 @@
> +comment "tesseract-ocr needs a toolchain w/ threads, C++, gcc >= 4.8 & dynamic library"
> +	depends on BR2_USE_MMU
> +	depends on !BR2_INSTALL_LIBSTDCPP || !BR2_TOOLCHAIN_HAS_THREADS || \
> +        !BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 || BR2_STATIC_LIBS

Indentation of this last line should have been two tabs.

> +menuconfig BR2_PACKAGE_TESSERACT_OCR
> +	bool "tesseract-ocr"
> +	depends on BR2_INSTALL_LIBSTDCPP
> +	depends on BR2_TOOLCHAIN_HAS_THREADS
> +	depends on BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 # C++11
> +	depends on BR2_USE_MMU # fork()
> +	depends on !BR2_STATIC_LIBS
> +	select BR2_PACKAGE_JPEG
> +	select BR2_PACKAGE_LEPTONICA
> +	select BR2_PACKAGE_LIBPNG
> +	select BR2_PACKAGE_TIFF

I don't see where jpeg, libpng and tiff are mandatory. In fact, I don't
see them being used by tesseract-ocr, so I've dropped those
dependencies for nwo.


> +TESSERACT_OCR_VERSION = 3.05.00
> +TESSERACT_OCR_DATA_VERSION = 3.04.00
> +TESSERACT_OCR_SITE = $(call github,tesseract-ocr,tesseract,$(TESSERACT_OCR_VERSION))
> +TESSERACT_OCR_LICENSE = Apache-2.0
> +TESSERACT_OCR_LICENSE_FILES = COPYING
> +
> +# Source from github, no configure script provided
> +TESSERACT_OCR_AUTORECONF = YES
> +
> +TESSERACT_OCR_DEPENDENCIES += leptonica jpeg libpng tiff

I've dropped jpeg, libpng and tiff. Instead, I've added host-pkgconf
which is really needed since configure.ac uses PKG_CHECK_MODULES().

I've also passed --disable-opencl since your package hasn't added
explicit support for OpenCL.

> +# Language data files download
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_ENG),y)
> +TESSERACT_OCR_DATA_FILES += eng.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_FRA),y)
> +TESSERACT_OCR_DATA_FILES += fra.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_DEU),y)
> +TESSERACT_OCR_DATA_FILES += deu.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_SPA),y)
> +TESSERACT_OCR_DATA_FILES += spa.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_SIM),y)
> +TESSERACT_OCR_DATA_FILES += chi_sim.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_TRA),y)
> +TESSERACT_OCR_DATA_FILES += chi_tra.traineddata
> +endif

Regarding the language files, I'm not entirely happy with the current
solution, but I couldn't come up with something better. I looked at the
two following options:

 * Creating a separate package for the tessdata repository
   https://github.com/tesseract-ocr/tessdata/, but this repository is
   3.4GB in size, which is admittedly a bit annoying to download when
   you just want a single language.

 * Since the list of languages is quite long, having an explicit option
   for each of them is a bit annoying. So I looked into turning your
   one-option-per-language idea into a single option with a space
   separated list of languages. Except that we anyway need to have the
   hash file for each language in tesseract-ocr.hash.

So in the end, I kept it as-is. We'll see if other folks have better
idea.

So in the mean time, I've applied with the fixes described above.

Thanks!

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com



More information about the buildroot mailing list