How to Evaluate Web Scraping Software: Key Features You Need for Success

Hi everyone, and welcome to the Sequentum webinar on key evaluation criteria for building a large web scraping operation. Can everyone hear me? I just want to make sure we're set up properly. It looks like we've got a fair number of people on the line. Let's see... it looks like we have some questions. Okay, yes, you can hear me. Fantastic, thank you so much. All right then, let's get started.

First of all, I just want to thank you for getting ahead of the curve, staying ahead of the curve, joining our webinar today, and educating yourselves on how best to do this. Obviously the landscape is changing: decisions are becoming data driven, the incorporation of data-driven decision making is becoming a mandate across enterprises, and finding the smartest, most efficient, most reliable way to incorporate data feeds from the public internet is also becoming a mandate. Sequentum has been in this business for 10 years, our practice is, we believe, best in class, and we're excited to share this webinar with you today.

So at a high level, the key evaluation criteria we believe you should use when looking at web scraping software are the following. Number one, it has to be easy to use. When you're pulling in many different data feeds and handling the myriad problems that are going to come up, your software needs to make your life easier.

Number two, there has to be as much automation as possible around the typical functions and methods you're going to need as part of your everyday workflow. Productivity enablers are a key evaluation criterion, and you should make sure that whatever software you're choosing has many of them, that they work well, and that they make your life easier and your work more efficient.

The third is a laser focus on data quality. You need to make sure that the data you're pulling across is of the highest quality, and if there's any degradation to that quality, that it's easy for you to configure rules and handling for any of the cases that may come up. Since the quality of any data-driven decision-making process, whether it's AI or ML, depends on the quality of the underlying data, this is really a mission-critical evaluation criterion for your software.

The fourth is blocking. There are hundreds of millions of dollars being invested in blocking software and services. You need to make sure that your web data collection software is viewed as close to human-like as possible and is not blocked, and in the event that it is blocked, there need to be simple, tried-and-true ways to reconfigure your agents so they're no longer blocked. So that's number four.

Number five is that you want to centralize everything you possibly can. Since there are so many different pieces to a large web scraping operation, you want to centralize as much as possible to support your operations managers and also to support any compliance oversight of it. And that leads into the next one.

The sixth is that the tech stack you put in place should support any compliance operating guidelines your legal group wants to establish to govern the web scraping operation. It should make the operation transparent, it should make adherence to compliance operating guidelines easily checked, and there should be key features and configurations in the software that allow you to conform to those guidelines.

And last but not least is rich data interoperability. You need to be able to put this data collection operation in the middle of your data-driven decision-making process. Obviously each company is at a different stage of maturity and is choosing different ways to implement data-driven decision making, so this software and tech stack needs to interoperate with whatever you put in place and whatever your team might transition to in the future. Not only do you need to be able to support many different data sources, you also need to be able to export to many different formats and distribute to many different targets. Maybe it's S3 this year, and maybe in the future there's some API you want to distribute your data collection to; those distribution targets should be flexible and fungible. And of course you should have full API integration into every aspect of your web scraping operation, so the data collection toolset can really be embedded into the overall enterprise-wide system that you're setting up.

So with that, I'm going to start with an overview of the Sequentum toolset and how it delivers on these key evaluation criteria. At a very basic level, we've got three components. One is the desktop, where you point and click to author and maintain your agents; it automates 95% of what you need to do in order to create and maintain those agents, and I will demo that for you shortly. The servers are just your workhorses: they run all of your data collection and deliver that content, in whatever format you've specified, to your targets. And then what sits in the middle, tying it all together, is the Agent Control Center. At its most basic foundation, the Agent Control Center is a version control repository: it keeps track of your agent and all of the associated dependent artifacts, like input data.

David, you're asking why the screen isn't changing. I'm just going through the diagram now, and I'll go through a live demo in a moment.

live demo in a moment um yeah so the agent control center has the

Version Control repository that keeps track of all the age independencies for example input data reference data any

third-party libraries you're using um

um your schema uh any credentials that

you have to either third party uh S3 buckets or database

credentials and then on top of that we build a whole deployment mechanism

server management mechanism and proxy management mechanism

Okay, so with that I'd like to show you a quick demo of our desktop. Now, for the folks on this call: some of you are new to Content Grabber, and some of you are very familiar with it and have been customers for the full decade we've been in existence. I'm going to give a quick overview of these features and show you how it automates 95% of what you need to do. As an example, I'm going to bring up a simple web page. It's a typical web scraping scenario where I've done a search based on a given category and a list page has come up.

As you see as I'm mousing around, there is a browser inside the tool; the tool is not inside a browser. This gives us tremendous capabilities to detect and handle errors as they come up. The tool is also context aware: for example, when I click on the first item in the list, hold down the shift key, and click on the next item in the list, it automatically knows that I'm creating a list container, and it's actually generating the XPath for me automatically. So these typical things you need to do as a web scraping engineer or operations manager, your most junior staff can do without assistance.

Instead of having a team of 15 highly paid Python programmers or full-stack data scientists collecting all this information for your data-driven decision making, you can have one senior web scraping engineer, and reporting to that engineer you can have 40 more junior staff who are doing 95% of the work. That senior engineer is really only focused on the interesting problems that merit that level of expertise; the more basic things can be done by more junior staff who are attuned to handling a lot of very detail-oriented work, the things programmers are not really interested in. The mind-numbingly repetitive stuff is automated for the most part.

And then a lot of the checking and so on is done by more junior staff. Here I'm just adding a simple pagination container so the agent can go through each one of the items in the list, and each page in the list. As I click on the detail page, it's keeping track; you'll see the notes here in the browser. It's keeping track of the workflow automatically, and it's generating a schema on the back end automatically. These capture commands that you see under the agent explorer here, assuming I spell them correctly, are actually going to be the column headers of your export.

Another thing we have automated for you: if you want to extract any portion of a string you're pulling from an element on the page, you can easily generate the regular expression automatically by simply highlighting the text and collecting it. Again, none of these things are rocket science, but automating them all in a tried-and-true way saves you time. Your junior staff can focus on doing the 95% of the work that needs to happen; they can do it with accuracy, they can do it quickly, and they can also do it in an auditable way, so when something does go wrong you can go back and figure out what it is.

Okay. Another thing that is very typical in scraping operations is that there are all kinds of edge cases. These are third-party websites; they don't want you to pull their data, and they try to make it difficult. Sometimes it's really hard to tell why, 15% of the time, your data isn't coming across properly. Well, you can visually debug: because we have the browser inside the tool, we can display exactly the flow of that data collection operation, which makes it so much easier for your teams to debug. So when you're looking at web scraping tools, you want to make sure you have incredible ease of use, you want point-and-click functionality, you want all of these productivity enablers in there, and you want it to be easy to debug.

And then of course the next step, taking it a step further, is data quality.

You want to make sure you have data quality, so what we've done is build fine-grained data validation at the level of every field of every row of data collected. You can go in and set strict type expectations for the data that's coming across, and you can specify things like whether or not you allow a field to be null. You can take it a step further and define rules like this: say you're pulling used cars and you collect the dealers, but you don't want to collect the individual sellers because you consider that to be PII. You can write a regular expression that says if the seller field matches this list of dealers, great; otherwise, mask the content.

Similarly with date-times. For those of you already doing large-scale web scraping, you know that proxies sometimes move: one week they're in the US, the next week they're in Japan, and when that happens the website will often present information in a different way. There may be a different currency symbol, there may be a different date-time format, and you really don't want that to blow up your data-driven decision making. So for things like date-times, you want to specify exactly what the format is, and you can do that and validate in real time. Similarly with value ranges: you don't want a negative price; you want that put into an error file. And any time you're pulling JSON, you want to validate it, because you don't want to trust these third-party websites. You can also specify things like time zone: if you're pulling airline airfare information, for example, you're dealing with a lot of different time zones depending on your origin and destination markets, and you want to be able to specify exactly what the time zone is, so you're comparing apples to apples when you're doing, for example, competitive pricing analysis. And then we allow you to specify whether to run this validation at runtime, which is the point at which you're extracting the content from the web page, or at export, or both.
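The field-level rules just described (null checks, regex-based masking of PII, date-time formats, value ranges) are point-and-click configuration in the tool; as a rough illustration of the underlying idea, here is a minimal sketch in plain Python, with all field names and rule values hypothetical:

```python
import re
from datetime import datetime

# Hypothetical per-field rules mirroring the validation options described:
# strict types, null checks, regex-based masking, date-time formats, value ranges.
RULES = {
    "price":  {"type": float, "min": 0.0, "allow_null": False},
    "listed": {"datetime_format": "%Y-%m-%d"},
    "seller": {"mask_unless": re.compile(r"^(Acme Motors|City Cars)$"), "masked": "***"},
}

def validate_field(name, value, rules):
    """Return (clean_value, error_or_None) for one field of one row."""
    rule = rules.get(name, {})
    if value is None:
        return (None, None) if rule.get("allow_null", True) else (None, f"{name}: null not allowed")
    if "type" in rule:
        try:
            value = rule["type"](value)
        except (TypeError, ValueError):
            return None, f"{name}: not a {rule['type'].__name__}"
        if "min" in rule and value < rule["min"]:
            return None, f"{name}: below minimum"   # e.g. a negative price goes to the error file
    if "datetime_format" in rule:
        try:
            datetime.strptime(value, rule["datetime_format"])
        except ValueError:
            return None, f"{name}: bad datetime format"
    if "mask_unless" in rule and not rule["mask_unless"].match(value):
        value = rule["masked"]                      # mask presumed PII (individual sellers)
    return value, None
```

In the tool these checks can run at extraction time, at export, or both; here they would simply be called at whichever stage you choose.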

Here you can also specify what your keys are, say for deduplication, another extremely typical action your web scraping engineers are going to have to implement. You want this built into the tool, and you want it to be easily configured: you specify what your keys are. For Amazon, for example, maybe the key is not just the job ID but also the location. You want to be able to specify your keys for deduplication, and you want it to work every time; you don't want to reinvent the wheel for every agent that you're writing.
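Composite-key deduplication of the kind just described, for example job ID plus location, boils down to something like this sketch (field names hypothetical):

```python
def deduplicate(rows, keys):
    """Keep the first row seen for each composite key (e.g. job_id + location)."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[key] for key in keys)  # the configured dedup key
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out
```

The point of having it built into the tool is that the key list is configuration, not code rewritten per agent.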

And of course you don't want to be hemmed in. You want all these productivity enablers, but you don't want to be restricted in any way: if you ever need to do something custom, you want to be able to write freeform custom logic at any point, and that's what's in this tool. You can use Python 3 and all the Python 3 libraries, you can use C#, normal programming languages that you're familiar with, JavaScript, regular expressions, etc.

I see I have a question: can this software deal with downloading Excel files through JavaScript links? Yes, absolutely, we can download files, and because it is Windows-based software we have integrated the IFilter packs that are supported by SharePoint, so we can deal with almost any file type. Microsoft Office files, OpenOffice files, PDFs, any sort of archive file: every single file type supported by the IFilter packs can be read natively. And you're asking specifically about Excel files, so yes.

Okay, so this is the data validation, and it's happening in real time. Let's say a website has window dressing for the Halloween holiday that's coming up, or they're doing A/B testing or A/B/C testing. You want this data validation running in real time, and you want error handling logic happening in real time as well. You've got all of these capabilities built in: you can do things like branch logic, if/else, etc., to catch and handle your errors depending on what the website is doing.

The other thing we have that ties in with data validation is success criteria, at the level of every run or the parent job. You can specify, say, "I want 95% of the same number of actions that I had in the last run." Because it's a database-backed tool and we track KPIs as we're running our jobs, day over day or hour over hour, you can easily make sure you're getting the same amount of data each time, and if not, it'll raise an alarm or an alert; it'll notify you that something is amiss. And you can really set thresholds. Let's say you're getting data from a government site, which typically has a lot of errors; maybe you know that you get a minimum of 300 errors every time you work with that site, and you really don't want to be notified unless you have 325 or more errors. You can specify that in your success criteria. There are two layers to the success criteria: you have success criteria for the individual runs, and then you can create a job, which is a bucket of runs, and have success criteria at that higher level. So we are laser focused in our tech stack on making sure we get the right data quality, that we get the right data counts, that we're setting thresholds, and that we're sending alerts and alarms any time there is an issue.
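The success criteria described above, a minimum fraction of the previous run's record count plus a tolerated error floor, can be sketched roughly like this (thresholds hypothetical; in the tool this is configuration, not code):

```python
def run_passes(current_count, previous_count, min_ratio=0.95,
               error_count=0, max_errors=0):
    """Success criteria for one run: require e.g. 95% of the previous run's
    record count, and tolerate a known noise floor of errors before alerting."""
    enough_data = current_count >= min_ratio * previous_count
    errors_ok = error_count <= max_errors
    return enough_data and errors_ok
```

A job-level check would apply the same idea over the bucket of runs that make up the job.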

You can integrate the notifications into any API-enabled ticketing system; our Agent Control Center also comes with a ticketing system built in that allows you to do this very simply, and of course you can set email notifications as well. At any point you can write custom code to do any custom validation you want. I see I have some questions coming up.

Does data validation stop or pause the agent? No, it's running in real time, in parallel, as the data is coming across. There's validation happening at runtime, in real time, and then there's validation happening at export; that's at the end of the run, when you're exporting all of the data from the internal database to the export target.

And I have another question here: what if the data doesn't match? If the success criteria aren't met, that will trigger an alert, an alarm, and a notification to whoever you have configured to receive those notifications. If you wanted, you could put some branch logic in to say, in this case, if I'm getting more than 300 errors on this government website, or if the error is a 503 Service Unavailable, then pause collection and resume in an hour. You could do something like that where it doesn't actually notify anyone; it just handles the problem, because who knows, maybe a government website goes down frequently enough that you would just come back in an hour and try again. Okay, so that's data quality.
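The pause-and-resume branch logic just described might look roughly like this in plain Python (retry count and pause length are hypothetical; in the tool this is expressed as point-and-click branch logic rather than code):

```python
import time

def fetch_with_backoff(fetch, retries=3, pause_seconds=3600, sleep=time.sleep):
    """If the site returns 503 Service Unavailable, pause and resume later
    instead of alerting anyone; only give up after several attempts."""
    for attempt in range(retries):
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < retries - 1:
            sleep(pause_seconds)   # e.g. a flaky government site: try again in an hour
    return status, body
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.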

Now, blocking. With blocking there are basically two types. There's the type of blocking where they're applying rate limits to your proxy; this is the old style of blocking. To counter this, we have built centralized proxy management components into the tech stack. We have a provider pool component that allows you to set up any type of proxy provider: whether it's a super proxy, or back-connect ports, or a list of IPs, whatever type of proxy provider integration is required, it's supported. And from there you can create proxy pools. Let's say you're doing Australian retail and you've got a bunch of different providers for Australian retail, because for critical data collection we never trust any one provider. You set up your proxy pools and you can configure those in your agent. So when a provider has an outage, or a particular subnet has gone down, or some small subset of a provider's pool that they've extended to you isn't working properly, you can easily fix that in one place and all your agents keep running. This is a big productivity enabler and a key evaluation criterion: you want to make sure you are not tied to only one provider, and you want absolute control over who your providers are, what your pools are, and what types of IPs you're using, because this is an area that's changing rapidly on the web, and you want to make sure your teams have full control.
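A centralized proxy pool of the kind described, several providers behind one pool, fixable in one place, can be sketched like this (provider names and addresses hypothetical):

```python
import itertools

class ProxyPool:
    """Round-robin over several providers so one outage never stops an agent."""
    def __init__(self, providers):
        # providers: {"provider_a": ["ip:port", ...], "provider_b": [...]}
        self.providers = dict(providers)
        self._rebuild()

    def _rebuild(self):
        flat = [p for ips in self.providers.values() for p in ips]
        self._cycle = itertools.cycle(flat)

    def next_proxy(self):
        """Hand the next proxy to whichever agent asks."""
        return next(self._cycle)

    def drop_provider(self, name):
        """Fix an outage in one place; every agent using the pool keeps running."""
        self.providers.pop(name, None)
        self._rebuild()
```

The key design point is that agents reference the pool, not individual providers, so a provider change is one edit, not one edit per agent.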

So that's one type of blocking. The other type is to look at your device fingerprint. The way they do this is they take objects in the browser, typically, and information about the environment your scraper is running in, flatten all of those different variables into a single hash, which is the fingerprint for your device, and then track the rate of requests from that device. So if you have a server doing your scraping and you're trying to pull data from a site that has implemented blocking based on rate limits against your device fingerprint, your server is going to be dead in the water pretty quickly, unless you're using our software. Because we have a custom browser, a custom version of Chromium inside our tool, we can randomize all the objects that are used to create that device fingerprint. Typically what you'll see is a site will require that you run your scraper in a full browser, which is very expensive but very easy to do in our software. So we'll make one request in that browser and we'll randomize all the types of things they typically look at. For example, we'll make sure to load a full browser, and then we'll do things like clear the storage, make sure the connection is kept alive, rotate the proxy, and put random delays in there if we want. We can randomize the browser size, which is another thing they typically look at, rotate the user agent, and rotate the web browser profile, typically along with rotating the proxy address. And then we can do other things at the level of the web browser: sometimes they'll block if you don't have canvas reading turned on, and you can rotate your canvas string. It's okay if you don't know what these things are; we have troubleshooting steps in our 600-page manual that explain all of this. There are all these things you can do to randomize what your device fingerprint looks like.
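A rough sketch of the randomization idea: vary the browser-observable attributes a fingerprinting script would hash together. The attribute list, user agents, and value ranges here are illustrative only, not the tool's actual behavior:

```python
import random

# Hypothetical user-agent strings for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5) Chrome/118.0",
]

def random_fingerprint(rng=random):
    """Pick a fresh combination of the attributes a fingerprinting script
    typically hashes: user agent, viewport size, request timing, storage state."""
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": (rng.randint(1200, 1920), rng.randint(700, 1080)),
        "request_delay_s": round(rng.uniform(1.0, 5.0), 2),  # random delay between requests
        "clear_storage": True,   # start each session with empty local storage
    }
```

Because every attribute feeds the fingerprint hash, changing any of them yields a different device identity from the site's point of view.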

So for all of those hundreds of millions of dollars going into blocking services, Sequentum is not blocked. Sometimes we come across interesting cases that are new to us, but for the most part we're not blocked, and that's a key evaluation criterion you should consider when you're looking at web scraping software. Now, another point I want to mention: going back to our diagram here, I'm going to walk you through the Agent Control Center.

The Agent Control Center is accessible to your agent developer and also, via the browser, to your ops managers, dev managers, or compliance managers. Your developer can check in an agent; it's very easy to just go to version control and check in your agent, and there are some simple views here that the developer can use. What you're getting is basically a full version of every agent and all of its dependent artifacts, so there's no more losing track of exactly what the input file was, no more losing track of exactly what the credentials were at that point, or where it was writing to, or what API it was using. Everything is stored in one place, and it's stored together in a single version, so you can keep track. As a manager, you can also look and see: ah yes, this is the version that had the handling added for the "access denied" error, so this is the one I actually want to deploy. Then you can deploy that version to your production cluster, or you can say, "I want to add a deployment just to my QA server, because I want to verify this is actually working." And when something goes wrong, you can easily open up the audit log and see any changes that have happened to schedules, deployments, etc. So this is a very rich approach, and it's a critical feature you're going to want in your web scraping software.

When you're adding scores or hundreds or thousands of agents to a large-scale web scraping operation, like many of our customers do, you really need automation at this level. You need everything pulled together for you, and you need a very clean way to know what's running where. You can check run history and job history; here, the success criteria in the job is set to mark it as a failure, but the developer can go in, if they've been assigned a bug for example, load the run history, and see what issues have come up, how this has been running, what to expect, etc. They can see the success criteria, they can see what proxy pool is configured, and so on. Everything you need is basically in one place.

Now I'm going to show you the browser view. When you log in to your Agent Control repository, you've basically got the same view, but this is for a user who isn't necessarily writing and maintaining agents; they just want to see what's going on. They can go and see: okay, this is my Nike agent, these are my fields, here are some sample values, here are all my data validation rules, what I'm doing, where I'm distributing to, etc., and this is how my server is configured. All of the variables that need to be set for a particular server can be set in one place; again, you don't have to open every agent any time a server variable changes. Really important: your credentials to any trusted source are encrypted. You create your connections in here, they're stored as an encrypted file in the agent repository, and the agents have access to them, but your most junior staff do not have access to all the data in those trusted resources.

What this also enables you to do, as an organization, is really implement compliance guidelines across your entire operation. Web scraping is an unregulated field, and there's been a fair bit of litigation in the space trying to define what's a good actor and what's a bad actor. We work with a lot of very sensitive institutions: large hedge funds and banks that have deep pockets and are governed by the SEC, and they're concerned about potentially having liability brought to them in this space. As CEO of Sequentum, I'm actually working with a financial information standards committee, a non-profit, to define standards for web scraping operations. But basically, what we do in our software is make sure that every single configuration that has to do with compliance is explicit, and that it's fully auditable.

So for example, suppose you're selling real estate data, and there's a real estate site that is quite certain you're reselling data you pulled from their site. They come to you and say, "I think you owe me X amount of dollars, because every night at midnight I have to spin up 80 extra servers for bots pulling down data." Well, you can go into your software and show that, according to your guidelines, you're only pulling less than X percent of average daily volumes. Here are all the configurations, and here are the rate limits you have set on each of these data collection agents. You can go through and show the run history and the job history and show exactly what you pulled and when, and you can basically nip it in the bud right there. So you mitigate the risk that you're going to get blamed for the scraping activities of bad actors that may, at some point, look like they're coming from you.
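The "less than X percent of average daily volume" guideline described above amounts to an explicit, auditable rate limit on each agent. A simple sliding-window limiter sketch (numbers hypothetical; the tool exposes this as agent configuration):

```python
import time

class RateLimiter:
    """Sliding-window rate limit: cap requests per window so collection stays
    under an explicit, auditable ceiling relative to the site's traffic."""
    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock          # injectable clock, for testing
        self.timestamps = []

    def allow(self):
        """Return True and record the request if we are under the cap."""
        now = self.clock()
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

Because the cap is an explicit setting, it is also something you can point to when demonstrating compliance.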

Not only that: because you can pull down each of these agents at any point, you can see what version was running, actually retrieve that particular agent, and make sure it's configured the way you expected it to be. For things like CAPTCHA, it's an explicit configuration: if you want to add CAPTCHA handling, you can do that. SEC-governed institutions don't tend to like to do that, but retail companies, for example, routinely do. These are different approaches from different compliance groups at different companies, but because the software has all these productivity enablers and tracks every explicit configuration, you can very easily see whether you're handling CAPTCHAs or not. Similarly, you can specify whether or not you want to follow robots.txt. You can always follow it, so if the robots.txt changes from one run to the next, the agent will automatically stop collecting; or you can choose to warn only, or to ignore it. All of these things are explicit configurations and fit nicely with your compliance requirements.
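The three robots.txt modes described, always follow, warn only, or ignore, can be sketched with Python's standard-library robots.txt parser (the mode names here are my own labels, not the tool's):

```python
from urllib import robotparser

def robots_decision(robots_txt, url, user_agent="*", mode="always"):
    """Decide what to do with a URL under one of three explicit modes:
    'always' -> skip disallowed URLs, 'warn' -> fetch but flag, 'ignore' -> fetch."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    allowed = rp.can_fetch(user_agent, url)
    if mode == "ignore":
        return "fetch"
    if mode == "warn":
        return "fetch" if allowed else "fetch-with-warning"
    return "fetch" if allowed else "skip"   # 'always' mode
```

Re-fetching robots.txt each run gives the described behavior where a changed robots.txt automatically stops collection in "always" mode.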

I have a couple of questions here. What about third-party proxy providers, can you connect them? Yes, you can basically set up any proxy provider you want. Right here we have a simple list, but you can set up any type of proxy provider, whether it's a super proxy, a list of IPs, back-connect ports, or an API; whatever it is, you can configure it in your providers, and from there you can create pools that you then set up in your agent. So yes, basically any provider can be supported here.

There's a question about CG Professional. We have legacy versions of our software: Visual Web Ripper, CG Professional, and CG Premium. Those do not include all the enterprise features. CG Enterprise is the third generation of our product, and the enterprise large-scale web scraping features are only available in CG Enterprise. We're happy to walk you through the process to upgrade.

The last point I want to mention before going to questions is that you want rich data interoperability: you want to make sure this data collection process works with your data sources and with your data targets. For example, even just in collecting data, you want to make sure you can collect from any source. One of the incredibly rich features in this software is that we can mix and match requests. If you are getting rate limited by a device fingerprint and you need to load up a full browser page, well, for those of you who have done this before, that's an incredibly expensive thing to do on your server: it takes a lot of memory, it takes a lot of CPU power, and it costs a lot to run full-browser scrapes. So maybe you do just one request that way, and after that you can use these lower-level, simpler requests, what we call parsers, that will parse the information. And it's incredibly easy to find back-end APIs; they power probably 40 percent of all sites. We have purpose-built browsers that will help you find back-end APIs that expose all the data you're trying to pull.

In this example I'm showing a JSON back-end API, and then I'm switching to the web request editor to look at the request that actually pulled up that data. I'm just modifying one of the parameters, which I'm able to do visually; it's incredibly easy for me to do, and I'm going to format it for input. You could also format it to put into C# or Python code, or into a regular expression; for this I'm just going to create it for input. As a quick example, I'm going to get rid of all of these full-browser requests, because I don't actually need to do this scrape in a full browser, and I'm going to pull up the content in the JSON parser. Here, just like I did in the browser, I can point and click, choose my elements, and pull down all the data I want. I'm clicking on one job node, holding down the shift key, and clicking on another job node; it's creating my list component. And then again, that context awareness: I click on this jobs node and it gives me options for all the types of things I would typically want to do here. I'm going to capture all the web elements, etc. So now, really in a couple of moments, I have found that with this data source there's a back-end API, and I'm able to pull all of the data very simply.
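Pulling from a back-end JSON API like the one in the demo reduces to validating and walking the JSON rather than rendering a page; a minimal sketch with hypothetical field names:

```python
import json

def parse_jobs(payload):
    """Pull the fields we'd otherwise point-and-click out of the rendered page
    straight from the back-end API's JSON. Field names are hypothetical."""
    doc = json.loads(payload)   # validate the JSON rather than trusting the site
    return [
        {"title": job.get("title"), "location": job.get("location")}
        for job in doc.get("jobs", [])
    ]
```

This is also why the JSON-validation rule mentioned earlier matters: `json.loads` fails loudly on a malformed payload instead of silently producing bad rows.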

So you really want a tool that allows rich data interoperability with your sources and your targets. For this one, I'm going to go ahead and run it. Okay, it's making one request, and it's gotten a couple of errors; these are just data validation errors, because it's defaulting to short text for the description. I'm not going to worry too much about that.

I've collected this data I'm going to um

first of all I may want to do change tracking right I may want to I may want to

specify okay if I don't see this job listed for two days show it as deleted

so that I'm tracking what jobs have been filled presumably so I can set up change tracking I can
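The two-day rule just described can be sketched as a small Python routine. The `grace_days` parameter and the id-to-date bookkeeping are assumptions about one way to implement the idea, not the tool's actual mechanism:

```python
from datetime import date, timedelta

def apply_change_tracking(previous, current_ids, today, grace_days=2):
    """If a job id has not been seen for more than `grace_days`, mark it
    deleted (presumably filled). `previous` maps job id -> last date seen."""
    seen = dict(previous)
    for job_id in current_ids:
        seen[job_id] = today  # refresh anything still listed today
    status = {}
    for job_id, last_seen in seen.items():
        if today - last_seen > timedelta(days=grace_days):
            status[job_id] = "deleted"
        else:
            status[job_id] = "active"
    return seen, status
```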

I can also set up export targets, to export to basically any format. If your operation is like ours, or like some of our customers', you have different units that are all pulling the same data, and they want their content in Excel, CSV, JSON, Parquet; they may want it in lots of different formats, and you can export to as many formats as you want at the same time. Then you can deliver those to any delivery target: you can use SFTP, you can email them, you can send to S3 or Azure. And of course you can write a script, again using normal programming languages, Python 3 or C#, and you can have a script library, reusing the same logic over and over.
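The idea of writing the same rows to several formats at once can be sketched with the Python standard library; only CSV and JSON are shown, since Parquet and Excel would need third-party packages such as pyarrow or openpyxl:

```python
import csv
import io
import json

def export_all_formats(rows):
    """Render the same list of row dicts to multiple formats at once."""
    outputs = {}
    outputs["json"] = json.dumps(rows, indent=2)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    outputs["csv"] = buf.getvalue()
    return outputs
```

Each rendered string could then be handed to whatever delivery step you use (SFTP upload, email attachment, S3 put, and so on).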

So you can export to any format, you can distribute to any target, and, last but not least, you've got full API integration at every point. Anytime you want to kick off a job ad hoc, or you want to pull down data at a specific time to integrate with your operation, you can embed this large-scale operation into your wider data engineering pipeline, while you have full transparency into everything that's going on. And we've got a full manual, some 600 pages, that details all the types of things you can do; it's an incredibly rich toolset, built over 10 years. This is just an example of the agent command capabilities: you can do all of these different types of things, pretty much everything that you would ever want to be able to do.

So I guess the last point with data interoperability is that you can integrate with external libraries. If you wanted to integrate OCR, or any custom PDF reader or image reader, you can do that. You can read data from files; you can read data from databases, whether it's an OLE DB or a NoSQL DB; and of course you can interact with any API. So if you want to enrich your data on the fly, with IBM Watson sentiment analysis or entity extraction, that's easily done.
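On-the-fly enrichment of this kind can be sketched as a generic step in Python. Here `analyze` stands in for whatever third-party client you plug in (Watson or otherwise), and `toy_sentiment` is a deliberately naive stand-in for illustration, not a real service call:

```python
def enrich_rows(rows, analyze):
    """Add a sentiment field to each scraped row via a pluggable analyzer."""
    enriched = []
    for row in rows:
        result = dict(row)  # copy, so the original rows stay untouched
        result["sentiment"] = analyze(row.get("review", ""))
        enriched.append(result)
    return enriched

def toy_sentiment(text):
    """Naive word-list scorer; a real text-analytics API would replace this."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```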

So a key evaluation criterion is that your toolset allows rich data interoperability, both at the level of the sources and at the level of the targets. Just to summarize: you want to make sure that your tool is incredibly easy to use, so you don't have to pay top dollar for each of your resources, and you're not boring your highly skilled engineers with mind-numbingly repetitive work that can be automated and managed by much lower-level staff. You want all of those automation and productivity enablers, all those key functions and methods that you need to use over and over again, like change tracking, or just taking the new data from a list. You want monitoring, controls, and alarms over data quality. You want to make sure that you're not blocked. You want centralization of all the aspects of your operation, separating out all the components that could possibly change and affect your agents, so that anything that's going wrong you can fix in one place and have all your agents automatically working again. And then of course you want that centralized transparency as well, so that you can set up a compliance and governance program to make sure that you're checking all the boxes from a risk-mitigation standpoint. And then of course you want that data

interoperability. So let's see, I have one more question here: is it possible to connect inputs from a database and write outputs to a database? Yes. MySQL? Yes, we support MySQL. You can read from and write to an OLE DB with MySQL, SQL Server, MariaDB, or Postgres, and you can do that at any point in your workflow; it's easily done with custom code, and of course we have a lot of built-in capabilities.
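Reading from and writing to a SQL database inside a custom-code step can be sketched in Python. sqlite3 is used here purely as a stand-in so the example is self-contained; with MySQL, SQL Server, MariaDB, or Postgres the same pattern would apply via the appropriate connector library:

```python
import sqlite3

def write_jobs(conn, rows):
    """Write scraped rows into a jobs table and return the row count.
    The table name and columns are illustrative, not a fixed schema."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS jobs (title TEXT, location TEXT)")
    cur.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM jobs")
    return cur.fetchone()[0]
```

Reading inputs is the mirror image: a `SELECT` whose rows seed the agent's input list.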

Depending on how you configure your internal database, you can basically set it to be whatever OLE DB-type database you have on site; I guess it's MySQL for you, so you can just configure it to use MySQL. And because it's a database-backed tool, you can pick up wherever you left off. For example, once you configure your clusters and servers, if one of your servers drops dead for some reason, you can easily move that work to another server and it will pick up where it left off, because it's a database-backed tool. Okay, so someone has asked: where can I

download a demo? We don't make our software publicly available, just because it is so rich and robust; we like to hold your hand a little bit while you're demoing the software. We're happy to provide a free trial, to let you use the desktop, and we'll give you a demo environment of our ACC that you can upload your content to. We're happy to do that; just contact us after the webinar and we will set you up with a trial license. Okay, any other questions?

Let's see... I don't see any other questions. I'm going to sit here for a moment.

Looks like we may have a couple more. Ballpark pricing? Okay, so here we go. We're on an annual-subscription basis, and you've got a product development team that's dedicated to you. Our desktop is five thousand dollars annual subscription, and that includes support. So basically, we're going to get you unstuck if you have a problem. Maybe you don't know how to use something, and you're having trouble figuring out where to turn in our manual: you send your agent in to us, and we can usually get you unstuck in a matter of 10 minutes. There's email support, and we typically respond in a maximum of 24 hours; it's often a lot quicker than that. And then, if there's a particular agent that you're blocked on, or you really just need us to write something for you because you're pressed for time, we can do that ad hoc for $150 an hour; we'll write agents for you.

The server comes bundled right now with the Agent Control Center. The pricing is not based on per-core or any of those measures; it is ten thousand dollars annual subscription per server. So if you wanted to set it up initially on a small form factor, and as your operation grows you deactivate it and reactivate it on bigger hardware, that's fine; it's still the same price. From our point of view, it's very difficult to estimate exactly how many resources your agents are going to use in the future, so we've come up with this pricing model for the server to make it incredibly easy for teams to expand and grow within their budget. We're very focused on

keeping your budget lean. You may have noticed that there's no per-page-load pricing, no per-agent pricing, and no per-core pricing. That makes it extremely friendly to teams that are really just getting started and trying to set up large-scale web scraping operations from scratch, and you basically know exactly how much you're going to spend. As you start adding thousands of agents, you're going to buy more servers; we know that, because one server is going to be optimized for full-browser workloads, another is going to be optimized for parser workloads, etc. And so we're confident that you'll grow with us, and that you'll build very profitable businesses on top of our software; we have many customers with over a billion in annual revenue. So we're very confident that our pricing is competitive and that it's really working for our customers.

Yes, the desktop pricing is five thousand dollars annual subscription per seat, so that's per user. If you wanted to set it up on a virtual desktop and have one person in one time zone use it for one shift, and another person in another time zone use it for another shift, that's fine; we're not limiting you by the number of users. In fact, in the Agent Control Center there's no limit on the number of users: you can configure your organization and your users, there's no limit there, and there's no license attached to those users. Again, we're trying to make it as easy as possible to get things going and get off the ground. Another question: what kind of businesses or activities do our clients use this for, and what can we do with it in general? Okay, so this is a very common question that we get.

What are the use cases that people are using web data collection for? It's an incredibly vast field. For example, maybe you're a university professor looking at macroeconomics: you're trying to assemble a macroeconomic index or indicator that shows the overall health of the economy, so you might look at rental prices, you might look at sale prices. One common one that's been written about in the press: in summertime, they look at RV rentals and used-RV sales. How fast are they moving? Are the prices getting discounted, which would represent a certain weakness in the market? These are all macroeconomic indicators. If you're an investor and you're running a hedge fund,

that's a large portion; probably 70 percent of our bespoke services are for hedge funds right now. The business of investment is getting automated. They will look at anything that a company has put online that exposes its internal operations. So if it's an airline, they'll look at how many markets it's servicing, whether it has dropped any markets, whether the flights are late and how often, what airports it's servicing, and how much traffic there is in general in those airports. This is all information you can find online. They'll look at prices; they'll look at seat-map availability per class. These are all facts you can find online by looking at various websites, and they'll be able to construct the balance sheet of those airlines before the earnings are announced. They'll automatically take in these feeds, apply algorithms to them, and basically create what they call signals. A positive signal means invest, or hold; a negative signal means get out of here, stop investing, this is going down, or short it.
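The signal idea can be illustrated with a toy Python function; the threshold and the labels here are invented purely for illustration and bear no relation to any real strategy:

```python
def signal_from_series(values, threshold=0.05):
    """Turn a scraped metric series (e.g. average listed price over
    successive scrapes) into a toy positive/negative/hold signal."""
    if len(values) < 2 or values[0] == 0:
        return "hold"
    change = (values[-1] - values[0]) / values[0]
    if change > threshold:
        return "positive"  # invest, or hold the position
    if change < -threshold:
        return "negative"  # exit, or short
    return "hold"
```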

And so that's a finance use case. In real estate, they're marrying government data to real estate sale or rental listings on typical websites, and they're able to come up with an entire information database on all the closings and all the real estate activity in a particular district, city, region, or market. Real estate is a very big area.

In retail, of course, they're doing competitive pricing: online retailers and e-commerce sites are constantly checking each other's prices. And now there's another layer, which is that the product manufacturers and brands are checking the retail sites, like Walmart or Amazon, to see where their product sits in the order of products returned first. So that's a whole other level. Then there are social media use cases:

maybe you're pulling sentiment around a brand. For example, for a new product like meatless meat, fake meat, whatever it's called, the meat alternatives that are coming out: a good place to look to assess how things are going is social media, Yelp, and these types of sites that have reviews. You pull all that review content down, and then you run it through some sentiment analysis and keyword extraction, text analytics, and you get a pretty clear picture, pretty quickly, about how that new product is doing and how individuals and consumers are responding to it. So those are a number of different use-case examples. Jobs are also a really

typical one; it'll show you what areas a company is expanding into. And again, you can use those third-party APIs to do text analytics, to enrich the raw data that you're pulling with our software. So, one more question I have: what defines

a server? Right now, the way the server is defined, a server is basically the software that runs on a particular machine and does all of the data collection work that's defined as part of an agent. And right now the server license is a bundle of the server and the Agent Control Center, so those two things come together.

All right, let's see, do we have any other questions? It looks like we're almost done; we have about three minutes left. Oh, we've got one more question here: how is CG handling the new Google photo-based captcha? Oh, I'm so glad you asked this. We have layers and layers of ways

of dealing with captchas. Whenever you encounter a captcha, you can either reissue the request with a fresh identity: clear your storage, clear your headers, reset with a new IP, etc., and get around the captcha that way. Or you can go ahead and automate the captcha, using a built-in function that we have. You can do it with a custom script (there are ways to use AI to get past these captchas), or you can use third-party services; Death By Captcha and 2Captcha are two very common ones. Basically, you have a key to their service, and this enables you to very quickly pass that captcha on to their human beings, who will click through for you and then pass the success key back to your scraping session, and off you go.
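The first strategy described, reissuing the request with a fresh identity before falling back to a solving service, can be sketched like this. The `fetch` callable, the `looks_like_captcha` check, and the proxy list are all hypothetical stand-ins; real captcha detection is site-specific:

```python
import random

def looks_like_captcha(html):
    """Crude check for a captcha interstitial; real detection is site-specific."""
    return "captcha" in html.lower()

def fetch_with_captcha_retry(fetch, proxies, max_attempts=3):
    """On a captcha, retry the request from a fresh proxy (standing in for
    clearing storage/headers and rotating the IP); after max_attempts,
    hand off to a third-party solving service instead."""
    for _ in range(max_attempts):
        html = fetch(random.choice(proxies))
        if not looks_like_captcha(html):
            return html
    raise RuntimeError("still blocked; hand off to a captcha-solving service")
```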

And all of this is documented in our manual, which I should mention is quite voluminous. It has information on basic things, techniques, limitations, etc., and then it has detailed information about pretty much everything that you would need to know: how to collect data, how to work with data, any anonymity requirements you have, compliance, etc. There's full documentation of how to deal with captchas in our manual.

Any other questions?

Okay, well, we're just coming up on the hour. I want to thank you all for attending, and congratulate you for investing time and effort into making sure that you have the best solution possible for your companies and your businesses. If you have any follow-up questions, please go ahead and email us at sales; we'd be happy to answer any follow-on questions you have. We're also happy to set up trials and get you set up to really get hands-on with this software. I hope this has been an educational session. I look forward to seeing some of you on some of the future webinars that we're doing; we have a full schedule that we've sent out, and we're doing one webinar a week on specific topics. This one has been on key evaluation criteria. We look forward to staying in touch and helping you achieve your large-scale web scraping operation goals.