How to Integrate OCR into Your Web Scraping Workflow

Hi everyone, this is Sarah McKenna at Sequentum. I've got Z Zhang on the line, who's our New York City tech lead. He's going to show you how to integrate OCR into your CG Enterprise workflow. With that, Z, I'm going to hand it over to you.

Hi guys. In today's webinar we're going to go over how to integrate OCR into CG Enterprise. There are usually two ways of handling OCR. The first is training your OCR so that it better recognizes suboptimal characters. The second is to use third-party software to enhance the image so that the letters become more legible. In this webinar we're going to cover the second case, where we edit our images so that the OCR can read them.

So first off, come over here to the GOCR site and download GOCR, like in our previous webinar. After you've downloaded GOCR, you need to put the converter into a shared folder, like we did last time as well.

In this case we're also going to download ImageMagick. You can come directly to the home page over here at imagemagick.org. This is the third-party software that we're going to use to edit our images so that they become more legible.

After you've downloaded ImageMagick and GOCR, we can get started with reading a captcha. The image that we're going to read from is this one I have over here, which just says d584f.

First and foremost, come over here and open up a new agent. In the new agent we can add a new calculated value, and in this calculated value we're just going to specify the path of this image. I think the image is on my desktop, so I'm just going to specify my desktop real quick, and let me just call it test.jpg.

In another case, where you're downloading an image directly from a website, you can just use the Download Document command. After you download the document, it works essentially the same as a calculated value: if you reference the downloaded document, it references the location where the image is stored. In this case we're using an image that was already downloaded, so I'm just going to delete this command. I'm just going to rename this to Captcha Image.

Secondly, I'm going to add another calculated value, where we're going to write our script that reads from the image and then displays the text. I'm just going to call this one Read Captcha, then press Transformation Script, and in here we head over to C# and write our code to read the image.

First we just create an empty string called results. Next we specify the image path that we defined before, which is under args.DataRow; I think I renamed it Captcha Image. I'm going to test whether this actually works by just returning the image path, and it returns the image that was on my desktop. Okay.

From here, we specify the location of the ImageMagick program that we downloaded. I downloaded it to my Program Files folder, so it's under Program Files, ImageMagick, version 7.0.9 (this should be the latest one), Q16.

Secondly, we need to specify the location of our GOCR converter; I'll call it the GOCR path. I think we need to add using System.IO, yep, before we can use Path. So, Path.GetFullPath, and now we specify the full path, which starts from args.Agent.DirectoryName. Here we're just getting the directory where the agent is stored. Afterwards we navigate back one directory to our shared folder, then navigate to Converters, then gocr, followed by the GOCR executable.
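The path resolution described here can be sketched outside the agent. This is a minimal Python illustration, not the C# script written in the webinar. The folder names `Shared`, `Converters`, and `gocr`, the file name `gocr.exe`, and the helper `resolve_gocr_path` are all assumptions made for the example. It uses `ntpath` so that the Windows-style path handling behaves the same on any platform.

```python
import ntpath


def resolve_gocr_path(agent_dir: str) -> str:
    """Go up one level from the agent folder, then down into the shared converter folder."""
    # Folder and file names below are assumptions for illustration only.
    relative = ntpath.join(agent_dir, "..", "Shared", "Converters", "gocr", "gocr.exe")
    # normpath collapses the ".." segment, similar in spirit to Path.GetFullPath in .NET.
    return ntpath.normpath(relative)
```

Calling `resolve_gocr_path` with a hypothetical agent folder such as `C:\Agents\CaptchaDemo` yields a path under `C:\Agents\Shared`, because the `..` segment steps out of the agent folder first.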

Okay. Now that we've specified the location of our ImageMagick folder and of GOCR, we need to define the arguments so that we can run ImageMagick via the command line. I'm just going to format a string, starting with magick convert; this line of code basically runs ImageMagick via the command line.

From here, the first path I pass in will be the location of the ImageMagick folder, and the second path will be the GOCR path. Next I'm going to specify some parameters to edit our image so that it becomes clearer.

The first one is flatten. Flattening the image creates a canvas the size of the image using the current background, and then it clips out anything falling outside of that canvas. Secondly, we specify the colorspace of our image as gray. After that we normalize; normalizing the image increases the contrast by stretching the range of the intensity values.

Afterwards, we specify opaque as none. Opaque changes a color to the fill color within the image, and we don't actually want that, so I'm just going to specify it as none. After that I'm going to specify some additional fields, like alpha as off; alpha controls how an image is combined with a background to create the appearance of partial transparency. Then compose over: compose sets the type of image composition, and we want that as over. Finally we specify repage, which just adjusts the canvas and offset information of the image.

After that we specify pnm, which is the type of file that we want the image to be in afterwards.

From here we specify the path of our GOCR, with some additional parameters. Now I'm just going to pass in, oops, add the quotes here, and now I pass in my image path and my GOCR path.
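The argument string described above can be sketched as follows. This is a Python illustration of the idea rather than the C# code typed in the webinar. The exact option spellings (for example `+repage` and `pnm:-`), the trailing `-` that asks GOCR to read standard input, and the helper name `build_ocr_command` are assumptions reconstructed from the spoken description.

```python
def build_ocr_command(image_path: str, gocr_path: str) -> str:
    """Return one command line that cleans the image and pipes it into GOCR."""
    # Option spellings are assumptions reconstructed from the spoken description.
    magick_options = [
        "-flatten",             # merge everything onto a single canvas
        "-colorspace", "Gray",  # convert to grayscale
        "-normalize",           # stretch the intensity range to raise contrast
        "-opaque", "none",      # passed as described in the webinar
        "-alpha", "off",        # turn the transparency channel off
        "-compose", "over",     # composition type
        "+repage",              # reset canvas and offset information
    ]
    # "pnm:-" writes PNM data to standard output; "-" tells GOCR to read standard input.
    return 'magick convert "{0}" {1} pnm:- | "{2}" -'.format(
        image_path, " ".join(magick_options), gocr_path
    )
```

Quoting both paths keeps the command intact when a folder name contains spaces, which is common for paths under Program Files or a user's desktop.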

After we've defined our arguments, we can create a ProcessStartInfo to start up the process and run this via the command line. For that we need to add using System.Diagnostics. From here we can instantiate a new ProcessStartInfo; I'm going to call it startInfo. We pass in cmd, which specifies that this is going to run via the command line, and then I pass in the arguments afterwards.

From here we define a couple of other parameters. The working directory is the ImageMagick folder (oops, that's an extra semicolon). I'm also going to specify the window style to be ProcessWindowStyle.Hidden. Next we specify that it's not going to create a window: CreateNoWindow, set that to true. UseShellExecute, set that to false. Then RedirectStandardOutput, set this one to true, and RedirectStandardError, set that to true as well.
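For readers who want to try the same setup outside the agent, the settings above correspond loosely to keyword arguments of Python's `subprocess.run`. This is a sketch under that assumption, not a translation of the C# ProcessStartInfo; `build_run_options` is a hypothetical helper, and `shell=True` stands in for launching through cmd.

```python
import subprocess


def build_run_options(working_dir: str) -> dict:
    """Keyword arguments for subprocess.run, loosely matching the settings described above."""
    return {
        "shell": True,              # run the command line through the system shell
        "cwd": working_dir,         # working directory, e.g. the ImageMagick folder
        "stdout": subprocess.PIPE,  # capture standard output
        "stderr": subprocess.PIPE,  # capture standard error
        "text": True,               # decode captured output as text
        # CREATE_NO_WINDOW exists only on Windows; 0 is a harmless default elsewhere.
        "creationflags": getattr(subprocess, "CREATE_NO_WINDOW", 0),
    }
```

The options would be used as `subprocess.run(command, **build_run_options(folder))`, where `command` is the full command line and `folder` is the working directory.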

Once we have specified these parameters, we can start up the process. Let's create a new process: Process process equals Process.Start, and we pass in the startInfo. Then let's check whether the process is null, and if that's the case we throw a new exception, something like "Error starting process". Then we have the process wait for exit, process.WaitForExit, and set that to 10 seconds. Sorry, this should actually be inside of here.

Then for the empty results string that we defined before, we set it to process.StandardOutput.ReadToEnd, yep. We want to trim this, removing empty space and Environment.NewLine. Oops, process, no, I think StandardOutput, yep, and then Trim, yep. Afterwards we can just return this result after splitting it on the delimiter character and returning the first element.

So in this program we're defining our Captcha Image path, then the location of the ImageMagick program that we're going to use, followed by our OCR directory. From here we're using ImageMagick to edit the image in such a way that it will be readable by GOCR, and then we're starting the process via the command line, and this returns the results read from the image.
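The launch-and-read steps in this recap can be sketched in Python as well. The sketch assumes the output is split on line breaks and that the first line is the captcha text; the webinar does not name the delimiter, so that choice is an assumption, as is the helper name `run_and_read_first_line`. The 10-second timeout follows the value mentioned for the wait.

```python
import subprocess
import sys


def run_and_read_first_line(command: str, timeout_seconds: float = 10.0) -> str:
    """Run a shell command, wait up to timeout_seconds, and return the first line of its output."""
    try:
        completed = subprocess.run(
            command,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout_seconds,  # raises TimeoutExpired if the command runs too long
        )
    except OSError as error:
        # Rough counterpart of the "process is null" check described above.
        raise RuntimeError("Error starting process") from error

    # Trim surrounding whitespace and newlines, then keep only the first line.
    output = completed.stdout.strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Stand-in command so the sketch runs without ImageMagick or GOCR installed.
    demo = '"{0}" -c "print(\'d584f\')"'.format(sys.executable)
    print(run_and_read_first_line(demo))
```

In real use, the `command` argument would be the ImageMagick and GOCR pipeline string rather than the stand-in shown in the demo.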

Initially, the image here is d584f. Now if we test out our transformation... okay, it looks like this froze here. Sorry, give this a second.

So after all of this, we just press Test Transformation, and we should see the same result as our image over here, d584f. Okay, I'm just going to press Test Transformation, and you can see d584f. Our script has successfully read from this image, after editing the image using ImageMagick and then reading it using GOCR. And that is how we go about using third-party OCR to read from captchas.

Oh hi, how are you? Good.

So thanks everyone for attending today. I really appreciate your time. I want to open the floor up to questions. Are there any burning questions you have about integrating OCR with CG Enterprise, any troubles that you'd like us to delve into, any samples you want to bring to our attention, or any other questions?

Okay, so it seems like there are no questions today. With that I'm going to conclude this webinar. Thanks again for joining us. Next week our webinar will cover how to integrate text analytics into your CG Enterprise workflow for sentiment analysis, entity extraction, etc. Glad to have you, and if you have any other questions after this webinar, feel free to email us at sales@sequentum.com. Thank you.