Mar 16

Configuring OCR in Alfresco

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. It extracts the characters from images or scanned documents, which makes images that contain text searchable. OCR is a very useful feature for any ECM product. In this blog, we will see how to configure it in Alfresco Community Edition using Tesseract. We have tested this with Alfresco versions 5.1.f and 5.2.e; it should also work with other nearby versions.

Prerequisites:

  1. Alfresco Community Edition installed and running
  2. Basic knowledge of Alfresco administration

Steps to Configure Tesseract:

  1. Download Tesseract and install it.
  2. Download the context file for your platform from the following links:
    1. For Windows: Download from here.
    2. For Linux: Download from here.

Add the following properties to the alfresco-global.properties file:

For Windows: ocr.script=/opt/<ALFRESCO-HOME>/ocr.bat

For Linux: ocr.script=/opt/<ALFRESCO-HOME>/ocr.sh

ghostscript.exe=gs
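
Before restarting, it can help to confirm that both properties really are present in the file. The sketch below is only an illustration: the function name check_ocr_props is hypothetical (it is not part of Alfresco), and the path you pass in is whatever location your installation uses for alfresco-global.properties.

```shell
#!/bin/sh
# Hypothetical helper (not part of Alfresco): report whether the two
# properties from this post are present in a given properties file.
check_ocr_props() {
    props_file=$1
    missing=0
    for key in ocr.script ghostscript.exe; do
        # Anchoring at the start of the line skips commented-out entries.
        if grep -q "^${key}=" "$props_file"; then
            echo "found: $key"
        else
            echo "missing: $key"
            missing=1
        fi
    done
    # Exit status 0 when both keys are present, 1 otherwise.
    return "$missing"
}
```

Call it as check_ocr_props followed by the path to your properties file. In installer-based setups that file usually sits under tomcat/shared/classes inside the Alfresco home directory, but treat that location as an assumption and check your own installation.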

  3. Place the context file at the following location: <ALFRESCO-HOME>\tomcat\shared\classes\alfresco\extension\<tesseract-context.xml>
  4. Create a .bat file for Windows or a .sh file for Linux and place it at <ALFRESCO-HOME>\ocr.bat (Windows) or <ALFRESCO-HOME>/ocr.sh (Linux), matching the ocr.script property.

a) ocr.bat (for Windows)

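REM Create the working directory if it does not already exist
if not exist C:\tmp mkdir C:\tmp
REM Log each request to see what happens
echo from %1 to %2 >> C:\tmp\ocrtransform.log
REM Copy the source file into the working directory
copy /Y "%~1" "C:\tmp\%~n1%~x1"
echo target %~d2%~p2%~n2
REM Call tesseract; it appends .txt to the output base name,
REM so pass the target path without its extension
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" "C:\tmp\%~n1%~x1" "%~d2%~p2%~n2" -l eng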

b) ocr.sh (for Linux)

#!/bin/sh
# save arguments to variables
SOURCE=$1
TARGET=$2
TMPDIR=/tmp/Tesseract
FILENAME=$(basename "$SOURCE")
OCRFILE=$FILENAME.tif
# Create temp directory if it doesn't exist
# (sudo only works here if the user running Alfresco has passwordless sudo rights)
sudo mkdir -p "$TMPDIR"
# to see what happens
#echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log
sudo cp -f "$SOURCE" "$TMPDIR/$OCRFILE"
# call tesseract; it appends .txt to the output base name,
# so strip the extension from $TARGET before passing it on
sudo /usr/local/bin/tesseract "$TMPDIR/$OCRFILE" "${TARGET%\.*}" -l eng
#sudo tesseract "$TMPDIR/$OCRFILE" "${TARGET%\.*}" -l eng
sudo rm -f "$TMPDIR/$OCRFILE"

  5. Restart your server and test the setup by uploading an image file to the Alfresco repository, then search for any word contained in that file.
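
The ${TARGET%\.*} expansion in ocr.sh is worth a closer look. Tesseract appends .txt to the output base name it is given, so the script strips the extension from the target path before handing it over. Here is a minimal sketch of that expansion on its own; the function name output_base and the example paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: strip the last extension from a target path,
# equivalent to the ${TARGET%\.*} expansion used in ocr.sh.
output_base() {
    target=$1
    # "%.*" removes the shortest suffix matching ".*", i.e. the final extension.
    printf '%s\n' "${target%.*}"
}

# Example paths are invented for illustration.
output_base /tmp/Alfresco/scan_0001.txt   # prints /tmp/Alfresco/scan_0001
output_base /tmp/Alfresco/scan.page1.txt  # prints /tmp/Alfresco/scan.page1
```

Only the final extension is removed, so a name such as scan.page1.txt keeps its inner dot.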

Important:

  1. Make sure you are passing the correct arguments in the context file (the entries in the context file differ between Windows and Linux).
  2. Check that your .bat or .sh commands work properly.
  3. Verify that tesseract creates a text file for an image file. To verify that, run the following command:
     tesseract --tessdata-dir ./ ./<image file-name> ./<text file-name> -l eng

If a text file is created with content in it, tesseract is working.
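
That check can be scripted. The sketch below assumes the text file has already been produced by the tesseract command above; the function name ocr_output_ok is hypothetical, and it only tests that the file exists and is not empty, not that the recognized text is accurate.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given text file exists and has content.
ocr_output_ok() {
    text_file=$1
    # -s is true when the file exists and its size is greater than zero.
    if [ -s "$text_file" ]; then
        echo "OK: $text_file has content"
    else
        echo "FAIL: $text_file is missing or empty"
        return 1
    fi
}
```

Remember that tesseract adds the .txt extension itself, so pass the full name of the generated file, including .txt, to the function.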

 

Comment here if your content is still not searchable. We are happy to hear about your ECM challenges, as we love solving them!

Kintu Barot

About The Author

Kintu is an Alfresco Certified Engineer (ACE501). Apart from shaping Alfresco projects to the requirements of the clients, he keeps himself busy coordinating the Alfresco training program for new developers at ContCentric.

2 Comments

  1. Luis
    October 9, 2017 at 8:30 pm

    Hi, I want to know where the alfresco-global.properties file is located.

  2. Daniela
    October 24, 2017 at 6:49 pm

    Good morning.
    I have followed all the steps for Windows and it still does not work.
