Scraping Google Ranks for Fun and Profit
written 2012 by Justone [justone(at)squabbel.com]

Changelog:
4/28/2012 - update and bugfixes
12/06/2012 - adapted to new Google design
12/07/2012 - added filter configuration to support exact human results
6/13/2013 - bugfix in functions.php
3/19/2014 - rewrite of Google parser
6/19/2016 - small fixes
1/5/2017 - appended information about scraping.services

Using my experience with scraping and backend IT solutions, I wrote the free Advanced Google Scraper in 2009. The source code was offered for free online and received feedback every week. Over the years I have received a lot of positive comments and many questions about scraping, usually for professional projects. The new scraper/rank checker presented here is a complete rewrite of the original with a more stable HTML parser and a better internal structure. Because average developers completely lack experience with this kind of project, I decided to make this sort of challenging development my new profession; if you require customization, just contact me. I can help with IP addresses, development, hosting, server management and whatever else you need to run such a challenging project professionally. However, it is not too hard to get everything up and running if you or your developer has some experience with servers and PHP.
Scraping through a service (2017)
A completely new option to consider for scraping Google or Bing is another project I have been working on for a customer of mine: Scraping.services. It is no longer completely open source, but it is a high volume scraping solution that takes almost all tasks off your shoulders. In many cases it will be the more cost efficient solution; in some cases you might want to go both routes to reduce your dependency on a single approach (higher reliability). In any case I suggest reading this website, as you will learn a lot about the tricks and difficulties of scraping Google (most of which also applies to Bing and others).
The new and free Google Rank Checker addresses the requirements described in the sections below.
I will now cover the principles of scraping and go into detail about some of the rank checker's features; some information will overlap with my Advanced Google Serp Scraper website. Scraping search engines is a serious task: these days companies invest a lot of money and effort into organic and paid search engine traffic. This project concentrates on organic search engine rank and focuses only on Google results. Google is still so far ahead of the competing search engines that I have not yet invested the time to parse the competition. Especially for larger SEO projects it is also important to know the keyword rankings of local country results, so I invested the time to analyze this part as well and learn what is required to make it possible.

Accurate scraping results
Sounds simple, but there are actually multiple traps one can fall into that lead to inaccurate search results. That might not matter for pure link scraping, but it is very important when it comes to rank analysis. Some Google parameters can affect ranking results, and bad IPs/proxies or too many requests can also change ranking results. The free Rank Checker provides accurate ranking results, filters out advertisements and is able to parse rank, URL, title and description of each result. The sketch below shows the request parameters that matter most.
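For reference, this is a simplified sketch of how the search URLs are assembled (the real code is scrape_serp_google() in functions.php below, which adds a few more parameters); num and filter are the two parameters that most directly affect accuracy, and the variable names here are illustrative:

// Simplified sketch of the query URL built in scrape_serp_google() (functions.php).
// num=10 keeps rankings accurate; num=100 saves proxies but distorts ranks.
// filter=1 enables Google's duplicate filter, matching what a human sees.
$url = "http://".$google_domain."/search"
     . "?q=".urlencode($keyword)
     . "&hl=".$language_code      // result/interface language
     . "&ie=utf-8"
     . "&num=10"                  // results per page
     . "&filter=".$filter         // 0 = unfiltered, 1 = normal filter
     . "&start=".($page*10);      // offset of the requested result page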
Country and language specific scraping results
At first I thought it was impossible to provide country/language specific result pages without using IPs from those countries. I ran many detailed experiments during development and with the finished Google Rank Checker, and finally learned that it is possible to provide accurate location specific results. As usual, things are not as easy as one might expect! Google provides more than 160 different languages and domains, often more than one official language per country, and each language can produce different ranking results. Each domain will provide different results. So there is no single "ranking list" for one country; there can be multiple such rankings that are all authentic.

The Google Rank Checker therefore requires you to provide two codes, one for the language and one for the country (see the short example below), and it is able to print a list of all codes/languages/countries. The default configuration is "Google Worldwide" and "English", which produces the typical US ranking results.

For verification I used four fresh browser installations of Google Chrome with local IP addresses from the USA, UK, Germany and Austria. At the same time I ran the Google Rank Checker using a seo-proxies.com license, configured for each of the four countries and languages. The results were very similar, with small variations in ranking positions; of course that was a concern and required deeper analysis! Using specific Google parameters (please see the PHP source code for details) I was able to receive good ranking results for each country. Warning: I have also seen some very strange ranking results when using lower quality IP addresses from heavily used US datacenters. I used seo-proxies.com for the real tests, which provides good IP quality.

As I had multiple German servers and access to a German DSL (residential) computer at that time (for another project), I used Google Germany for the main test. The results showed similar tiny ranking changes from computer to computer, so even within one country the ranking results are not always exactly the same. In most cases one or two sites next to each other swap positions on a result page. This might be because Google ranks are not always completely synchronized among all servers, or another influence I have not yet got a grip on.

Summary on local Google ranking results: the free Google Rank Checker provides results as accurate as those I was able to receive when using local IP addresses. A small inaccuracy is always possible, even when using different IPs from the country itself.
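For example, to check German rankings on google.de, the two configuration variables of the script are set like this (codes as printed by the built-in "help" listing):

$test_country="de";   // Google Germany (google.de)
$test_language="de";  // German-language results
// the defaults, producing worldwide results in English:
//$test_country="global";
//$test_language="en";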
IP/Proxy management
When scraping Google it is essential to avoid detection. Google does not want to be abused by thousands of people, as this could have an impact on their servers (and they do not like to share their database with us). Avoiding detection is not a hard task if you are doing things right:
a) do not push out more than 20 requests per hour per IP address
b) try to spread the requests evenly (do not push 20 in 1 minute and then wait 59 minutes)
c) avoid cookies entirely, they are not required
d) rotate your IP address for each new keyword (you can and should request multiple pages of one keyword with the same IP)
The free Google Rank Checker includes IP management functions which "remember" IP usage (also between application runs) and take the number of available IPs into account when adding delays between requests, as the sketch below illustrates.
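To illustrate rules a) and b): with a pool of IPs and a hard ceiling of 20 requests per hour per IP, the average delay between two requests is easy to derive. This is a minimal sketch with an example pool size; the actual implementation is delay_time() in functions.php, which spreads requests even more conservatively (12 per hour per IP):

// Minimal sketch: spread requests evenly over the whole IP pool.
$number_of_ips = 10;                            // size of your proxy pool (example value)
$max_requests_per_hour = 20 * $number_of_ips;   // hard per-IP ceiling: 20 per hour
$delay_seconds = 3600 / $max_requests_per_hour; // here: 18 seconds between requests
sleep((int)$delay_seconds);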
Local file based caching
One challenge, mainly during the development of a scraper, is keeping the IPs (often a smaller number of IPs than in production use) in good quality. The programmer often has to do a large number of test runs, and each run accesses Google, using up the IPs. The free Google Rank Checker contains caching functions that store each parsed page as a serialized "object" in a directory. This way the scraper can be run as often as needed while tweaking program functions or the parser; if it has already scraped a page, it will not access Google again. The default "resolution" is 24 hours, which can of course be changed. Optionally, caching can be disabled or forced. The sketch below shows the core idea.
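The caching logic, condensed from load_cache() in functions.php (ignoring the 100-results-per-page variant for brevity): the cache key is an md5 hash over keyword, language, country and result page, and files older than 24 hours are ignored:

$hash = md5($search_string."_".$lc."_".$cc.".".$page); // one file per keyword/locale/page
$file = "$working_dir/$hash.cache";
if (file_exists($file) && (time() - filemtime($file)) < 60*60*24)
    $serp_data = unserialize(file_get_contents($file)); // fresh enough: reuse it
else
    $serp_data = NULL; // missing or stale: Google will be scraped again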
The seo-proxies.com API
The free Google Rank Checker supports the seo-proxies.com API! seo-proxies.com is a high quality proxy service which solves a major task in scraping: getting a reliable and trustworthy source of IPs. Using low quality IPs can result in a lot of trouble; Google often knows those IPs already due to frequent previous abuse. SEO-Proxies.com also features a special API function, used by the Rank Checker, that provides Google domains and language codes for country/language specific ranks. It is of course possible to replace those functions in the free Rank Checker with your own solution, but I can recommend seo-proxies.com for production environments and serious projects.
Modular source code design
As always, code developed in a rush often does not look nice and is hard to follow at a later time. That was the case with the Advanced Google Scraper from my last article. The Google Rank Checker is based on the Advanced Google Scraper but heavily cleaned up, and should be much easier for other programmers to follow.
Pure PHP code
PHP is a nice programming language for scraping projects: it allows on-the-fly changes and tweaks to adapt to problems, and it can run reliably long-term as a console application. PHP is mainly focused on web programming, which also means it comes with many functions that are very useful for scraping tasks. In our case we use the powerful libCURL API to send our HTTP requests and the DOM (Document Object Model) functions to parse the results. Of course you can also launch the Google Rank Checker through a webserver, but for production use it is recommended to run it as a console script. The code was tested on PHP 5.2.6 but should be compatible with most PHP 5 versions.
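As a minimal, self-contained illustration of that combination (not the project code itself, which routes everything through the proxy-aware cURL session in functions.php):

<?php
// Fetch a page with libCURL ...
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the page instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0"); // send a browser-like user agent
$html = curl_exec($ch);
curl_close($ch);
// ... and parse it with the DOM extension
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings caused by real-world HTML
foreach ($dom->getElementsByTagName('a') as $a)
    echo $a->getAttribute('href'), "\n"; // list all link targets
?>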
Multithreading
For larger-scale scraping it is often necessary to run multiple threads at the same time. The code already contains comments where small changes would be required to make it multi-threading compatible; see the sketch below. Anyway, for most projects you will be fine with a single instance. When using seo-proxies.com it is possible to request additional proxy processes for multi-threading; alternatively you can use multiple accounts.
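The source comments spell out the key adjustment (see delay_time() in functions.php): the delay must be multiplied by the number of parallel instances so the per-IP request rate stays constant. A sketch only, with $threads as a hypothetical setting:

// Sketch: scale the delay by the number of parallel workers so the
// total request rate per IP stays within the safe limit.
$threads = 4; // hypothetical number of parallel script instances
$d = (3600*1000000 / (((float)$LICENSE['total_ips']) * 12)) * $threads;
usleep($d);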
Hints for scraping Google and avoiding detection
The heart of my article: The free Google Rank Checker, written in PHP for web or console (recommended) usage
This source is free for your fun and profit; you can adapt it to your requirements. Please make sure to read the short license agreement at the top of the source code. The script's features are listed in the header comment of the source code below.
For professional projects PHP is well suited, but you should run the scraper as a console script for best reliability. Requirements: * PHP 5.2 (any PHP 5+ should do) with libCURL and DOM support * A SEO-Proxies.com license for high quality IPs and the Google country API * Write rights in the local directory; the script will create a working directory for local storage
Download the three source code files here: simple_html_dom.php, google-rank-checker.php and functions.php. The latter two are listed in full below; simple_html_dom.php is the standard PHP Simple HTML DOM Parser library.
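google-rank-checker.php: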
#!/usr/bin/php
<?php
/* License: free for private and commercial use
This code is free to use and modify as long as this comment stays untouched on top, with one exception (see below).
URL of original source: http://google-rank-checker.squabbel.com
Author of original source: justone@squabbel.com
This tool should be completely legal but in any case you may not sue or seek compensation from the original Author for any damages or legal issues the use may cause.
By using this source code you agree NOT to increase the request rates beyond the IP management function limitations, this would only harm our common cause.
Exception:
Public redistributing modifications of this source code project is not allowed without written agreement. Contact me by email if you are unsure.
Using this work for private and commercial projects is allowed, redistributing it is not allowed.
The reason behind this is that my website shall stay the primary location for this source.
If you need customization of this source code you are welcome to contact me justone@squabbel.com
Some possible extensions:
* database integration with in/out queue for synchronous/asynchronous full automated script interaction
* increasing available functionality, adding different search modes, different resultset parsing
* modification into a scheduled script with custom data retrieval/placement
* modification into a background service
*/
error_reporting(E_ALL);
// ************************* Configuration variables *************************
// Your seo-proxies api credentials
$pwd="2b24aff3c1266-----your-api-key---"; // Your www.seo-proxies.com API password
$uid=YOUR_USER_ID; // Your www.seo-proxies.com API userid
// General configuration
$test_website_url="http://www.website.com"; // The URL, or a sub-string of it, of the indexed website. you can use a domain/hostname as well but including http:// is recommended to avoid false positives (like http://alexa.com/siteinfo/domain) !
$test_keywords="some keyword,another keyword"; // comma separated keywords to test the rank for
$test_max_pages=3; // The number of result pages to test until giving up per keyword. Each page contains up to 100 results or 10 results when using Google Instant
$test_100_resultpage=0; // Warning: Google ranking results will become inaccurate! Set to 1 to receive 100 instead of 10 results and reduce the amount of proxies required. Mainly useful for scraping relevant websites.
//$test_safe_search="medium"; // {right now not supported by the script}. Google safe search configuration. Possible choices: off, medium (default), high
/* Local result configuration. Enter 'help' to receive a list of possible choices. use global and en for the default worldwide results in english
* You need to define a country as well as the language. Visit the Google domain of the specific country to see the available languages.
* Only a correct combination of country and language will return the correct search engine result pages. */
$test_country="global"; // Country code. "global" is default. Use "help" to receive a list of available codes. [com,us,uk,fr,de,...]
$test_language="en"; // Language code. "EN" is default Use "help" to receive a list. Visit the local Google domain to find available langauges of that domain. [en,fr,de,...]
$filter=1; // 0 for no filter (recommended for maximizing content), 1 for normal filter (recommended for accuracy)
$force_cache=0; // set this to 1 if you wish to force the loading of cache files, even if the files are older than 24 hours. Set to -1 if you wish to force a new scrape.
$load_all_ranks=1; /* set this to 0 if you wish to stop scraping once the $test_website_url has been found in the search engine results,
* if set to 1 all $test_max_pages will be downloaded. This might be useful for more detailed ranking analysis.*/
$portal="int"; // int or us (must match your settings, int is default)
$show_html=0; // 1 means: output formatted with HTML tags. 0 means output for console (recommended script usage)
$show_all_ranks=1; // set to 1 to display a complete list of all ranks per keyword, set to 0 to only display the ranks for the specified website
// ***************************************************************************
$working_dir="./wd_rank_checker";
/*Description:
* This is a working and full featured Google Rank Checker
 * This script can and should be used as a base for your own developments and customizations, but it is also useful as a standalone tool
 * Knowing your website's rank for important keywords, and watching how it changes in response to website changes or competition, is essential; this tool can be a great help with that.
 * There are websites that might do the same for you, but they are unreliable and often produce wrong results; this tool puts that power into your own hands.
* Traffic estimation: 450kb per 100 results
*Features:
* + seo-proxies.com API support - getting reliable results from Google can be a pain and most proxies are not well suited for this, seo-proxies makes it easy
* + local country result feature (default is the main english google result set) (read more in last notes)
* + multipage DOM parsing - this tool is an advanced project, it can test for more than one result page and will interpret the results like a real browser (DOM)
* + correct proxy management - built in IP management, the tool will use and manage IPs in an optimal way to avoid blocks, wrong results and similar issues
* + multi-keyword support - test for more than one keyword
 * + local cache (file based) to prevent unnecessary SERP lookups (resolution is one lookup per keyword-page per day)
*Requirements:
* + local write rights to create the working directory and store files in it (script will create directory and files automated)
* + Remove timeout for console scripts (when run on console)
* + Based on usage consider increasing max memory for console scripts (when run on console)
*
*Possible upgrades and ideas:
* + Multi-threading support can easily be added. delay_time() and the proxy API need adaption (&offset=n) for custom seo-proxies licenses with parallel proxy support.
* + Database support for ranking results is recommended for professional usage, this would easily allow a ranking history.
* + When used in production environments, emailing support should be added so any warnings or aborts result in an emergency email to the project manager
*
*
*Last notes:
* DONATE if you like the source or information so I can keep working and updating.
* The recommended source of IPs is the built in seo-proxies.com service. It is of course possible to modify the source code and change the proxy support.
* But this can result in accuracy/gray+blocklist troubles and even legal issues with Google, it is not recommended to change the proxy source or the IP management functions without advanced scraping experience.
* From time to time Google changes parts of the design, sometimes this can cause parsing issues. The website will try to always stay up to date on such changes.
*
* In general it is not recommended to use 100 results per page, this will reduce the amount of proxies required (in best case by 10) but also reduce the
* accuracy of the ranking results. It is required to use the '10 results per page' option if ranking results shall be accurate.
*
* The country specific results have been verified using geolocated IP addresses and browsers for UK, USA, DE and AT.
* In this project I'm using the small "google api" from seo-proxies to retrieve country codes and google domains.
* In my tests it has been found that the Rank checker is able to produce IDENTICAL ranking results when using my seo-proxies.com license.
* So it is possible to test for local results WITHOUT maintaining expensive proxies or servers in each of the countries.
 * However, I can't guarantee this for all results. But this was true for the result sets I've personally tested.
*
*
* The cache files contain a serialized php array. The main reason for this is that Google changes their layout from time to time, storing the raw html content
* in cache files would require to keep all "old" processing methods to be able to parse the output at a later time.
* The cache can be cleared by a crontab/scheduler which removes files older than 24 hours (based on unix "find" for example)
*
*/
require "functions.php";
$page=0;
$PROXY=array(); // after the rotate api call this variable contains these elements: [address](proxy host),[port](proxy port),[external_ip](the external IP),[ready](0/1)
$LICENSE=array(); // contains details about the seo-proxies.com license used for proper IP management
$results=array();
if ($show_html) $NL="<br>\n"; else $NL="\n";
if ($show_html) $HR="<hr>\n"; else $HR="---------------------------------------------------------------------------------------------------\n";
if ($show_html) $B="<b>"; else $B="!";
if ($show_html) $B_="</b>"; else $B_="!";
/*
* Start of main()
*/
if ($show_html)
{
echo "<html><body>";
}
$keywords=explode(",",$test_keywords);
if (!count($keywords)) die ("Error: no keywords defined.$NL");
if (!rmkdir($working_dir)) die("Failed to create/open $working_dir$NL");
$country_data=get_google_cc($test_country,$test_language);
if (!$country_data) die("Invalid country/language code specified.$NL");
$ready=get_license();
if (!$ready) die("The specified seo-proxies.com license ($uid) is not active. $NL");
if ($LICENSE['protocol'] != "http") die("The seo-proxies.com proxy protocol of license $uid is not set to HTTP, please change the protocol to HTTP. $NL");
echo "$NL$B Google rank checker for $test_website_url initated $B_ $NL$NL";
/*
* This loop iterates through all keyword combinations
*/
$ch=NULL;
$rotate_ip=0; // variable that triggers an IP rotation (normally only during keyword changes)
$max_errors_total=3; // abort script if there are 3 keywords that can not be scraped (something is going wrong and needs to be checked)
$rank_data=array();
$siterank_data=array();
foreach($keywords as $keyword)
{
$rank=0;
$max_errors_page=5; // abort script if there are 5 errors in a row, that should not happen
if ($test_max_pages <= 0) break;
$search_string=urlencode($keyword);
$rotate_ip=1; // IP rotation for each new keyword
/*
* This loop iterates through all result pages for the given keyword
*/
for ($page=0;$page<$test_max_pages;$page++)
{
$serp_data=load_cache($search_string,$page,$country_data,$force_cache); // load results from local cache if available for today
$maxpages=0;
if (!$serp_data)
{
$ip_ready=check_ip_usage(); // test if ip has not been used within the critical time
while (!$ip_ready || $rotate_ip)
{
$ok=rotate_proxy(); // start/rotate to the IP that has not been started for the longest time, also tests if proxy connection is working
if ($ok != 1)
die ("Fatal error: proxy rotation failed:$NL $ok$NL");
$ip_ready=check_ip_usage(); // test if ip has not been used within the critical time
if (!$ip_ready) die("ERROR: No fresh IPs left, try again later. $NL");
else
{
$rotate_ip=0; // ip rotated
break; // continue
}
}
delay_time(); // stop scraping based on the license size to spread scrapes best possible and avoid detection
global $scrape_result; // contains metainformation from the scrape_serp_google() function
$raw_data=scrape_serp_google($search_string,$page,$country_data); // scrape html from search engine
if ($scrape_result != "SCRAPE_SUCCESS")
{
if ($max_errors_page--)
{
echo "There was an error scraping (Code: $scrape_result), trying again .. $NL";
$page--;
continue;
} else
{
$page--;
if ($max_errors_total--)
{
echo "Too many errors scraping keyword $search_string (at page $page). Skipping remaining pages of keyword $search_string .. $NL";
break;
} else
{
die ("ERROR: Max keyword errors reached, something is going wrong. $NL");
}
}
}
mark_ip_usage(); // store IP usage, this is very important to avoid detection and gray/blacklistings
global $process_result; // contains metainformation from the process_raw() function
$serp_data=process_raw_v2($raw_data,$page); // process the html and put results into $serp_data
if (($process_result == "PROCESS_SUCCESS_MORE") || ($process_result == "PROCESS_SUCCESS_LAST"))
{
$result_count=count($serp_data);
$serp_data['page']=$page;
if ($process_result != "PROCESS_SUCCESS_LAST")
$serp_data['lastpage']=1;
else
$serp_data['lastpage']=0;
$serp_data['keyword']=$keyword;
$serp_data['cc']=$country_data['cc'];
$serp_data['lc']=$country_data['lc'];
$serp_data['result_count']=$result_count;
store_cache($serp_data,$search_string,$page,$country_data); // store results into local cache
}
if ($process_result != "PROCESS_SUCCESS_MORE")
break; // last page
if (!$load_all_ranks)
{
for ($n=0;$n < $result_count;$n++)
if (strstr($serp_data[$n]['url'],$test_website_url)) // bugfix: check the freshly parsed results, not the unused $results array
{
verbose("Located $test_website_url within search results.$NL");
$site_found=1; // flag: stop paging once the ranks of this page have been recorded
break;
}
}
} // scrape clause
$result_count=$serp_data['result_count'];
for ($ref=0;$ref<$result_count;$ref++)
{
$rank++;
$rank_data[$keyword][$rank]['title']=$serp_data[$ref]['title'];
$rank_data[$keyword][$rank]['url']=$serp_data[$ref]['url'];
$rank_data[$keyword][$rank]['host']=$serp_data[$ref]['host'];
//$rank_data[$keyword][$rank]['desc']=$serp_data[$ref]['desc']; // not really required
if (strstr($rank_data[$keyword][$rank]['url'],$test_website_url))
{
$info=array();
$info['rank']=$rank;
$info['url']=$rank_data[$keyword][$rank]['url'];
$siterank_data[$keyword][]=$info;
}
}
if (isset($serp_data['lastpage']) && $serp_data['lastpage']) $last_page=1; // also honor the flag stored in cached pages
if ((isset($last_page) && $last_page) || (isset($site_found) && $site_found))
{
$last_page=0;
$site_found=0;
break; // stop paging: last result page reached or site already located
}
} // page loop
} // keyword loop
if ($show_all_ranks)
{
foreach ($rank_data as $keyword => $ranks)
{
echo "$NL$NL$B"."Ranking information for keyword \"$keyword\" $B_$NL";
echo "$B"."Rank - Website - Title$B_$NL";
$pos=0;
foreach ($ranks as $rank)
{
$pos++;
if (strstr($rank['url'],$test_website_url))
echo "$B$pos - $rank[url] - $rank[title] $B_$NL";
else
echo "$pos - $rank[url] - $rank[title] $NL";
}
}
}
foreach ($keywords as $keyword)
{
if (!isset($siterank_data[$keyword])) echo "$NL$B"."The specified site was not found in the search results for keyword \"$keyword\". $B_$NL";
else
{
$siteranks=$siterank_data[$keyword];
echo "$NL$NL$B"."Ranking information for keyword \"$keyword\" and website \"$test_website_url\" [$test_country / $test_language] $B_$NL";
foreach ($siteranks as $siterank)
echo "Rank $siterank[rank] for URL $siterank[url]$NL";
}
}
//var_dump($siterank_data);
if ($show_html)
{
echo "</body></html>";
}
?>
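functions.php: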
<?php
/* License: open source for private and commercial use
This code is free to use and modify as long as this comment stays untouched on top.
URL of original source: http://google-rank-checker.squabbel.com
Author of original source: justone@squabbel.com
This tool should be completely legal but in any case you may not sue or seek compensation from the original Author for any damages or legal issues the use may cause.
By using this source code you agree NOT to increase the request rates beyond the IP management function limitations, this would only harm our common cause.
*/
function verbose($text)
{
echo $text;
}
/*
* By default (no force) the function will load cached data within 24 hours otherwise reject the cache.
* Google does not change its ranking too frequently, that's why 24 hours has been chosen.
*
* Multithreading: When multithreading you need to work on a proper locking mechanism
*/
function load_cache($search_string,$page,$country_data,$force_cache)
{
global $working_dir;
global $NL;
global $test_100_resultpage;
if ($force_cache < 0) return NULL;
$lc=$country_data['lc'];
$cc=$country_data['cc'];
if ($test_100_resultpage)
$hash=md5($search_string."_".$lc."_".$cc.".".$page.".100p");
else
$hash=md5($search_string."_".$lc."_".$cc.".".$page);
$file="$working_dir/$hash.cache";
$now=time();
if (file_exists($file))
{
$ut=filemtime($file);
$dif=$now-$ut;
$hour=(int)($dif/(60*60));
if ($force_cache || ($dif < (60*60*24)))
{
$serdata=file_get_contents($file);
$serp_data=unserialize($serdata);
verbose("Cache: loaded file $file for $search_string and page $page. File age: $hour hours$NL");
return $serp_data;
}
return NULL;
} else
return NULL;
}
/*
* Multithreading: When multithreading you need to work on a proper locking mechanism
*/
function store_cache($serp_data,$search_string,$page,$country_data)
{
global $working_dir;
global $NL;
global $test_100_resultpage;
$lc=$country_data['lc'];
$cc=$country_data['cc'];
if ($test_100_resultpage)
$hash=md5($search_string."_".$lc."_".$cc.".".$page.".100p");
else
$hash=md5($search_string."_".$lc."_".$cc.".".$page);
$file="$working_dir/$hash.cache";
$now=time();
if (file_exists($file))
{
$ut=filemtime($file);
$dif=$now-$ut;
if ($dif < (60*60*24)) echo "Warning: cache storage initated for $search_string page $page which was already cached within the past 24 hours!$NL";
}
$serdata=serialize($serp_data);
file_put_contents($file,$serdata, LOCK_EX);
verbose("Cache: stored file $file for $search_string and page $page.$NL");
}
// check_ip_usage() must be called before first use of mark_ip_usage()
function check_ip_usage()
{
global $PROXY;
global $working_dir;
global $NL;
global $ip_usage_data; // usage data object as array
if (!isset($PROXY['ready'])) return 0; // proxy not ready/started
if (!$PROXY['ready']) return 0; // proxy not ready/started
if (!isset($ip_usage_data))
{
if (!file_exists($working_dir."/ipdata.obj")) // usage data object as file
{
echo "Warning!$NL"."The ipdata.obj file was not found, if this is the first usage of the rank checker everything is alright.$NL"."Otherwise removal or failure to access the ip usage data will lead to damage of the IP quality.$NL$NL";
sleep(5);
$ip_usage_data=array();
} else
{
$ser_data=file_get_contents($working_dir."/ipdata.obj");
$ip_usage_data=unserialize($ser_data);
}
}
if (!isset($ip_usage_data[$PROXY['external_ip']]))
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // the IP was not used yet
}
if (!isset($ip_usage_data[$PROXY['external_ip']]['requests'][20]['ut_google']))
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // the IP has not been used 20+ times yet, return true
}
$ut_last=(int)$ip_usage_data[$PROXY['external_ip']]['ut_last-usage']; // last time this IP was used
$req_total=(int)$ip_usage_data[$PROXY['external_ip']]['request-total']; // total number of requests made by this IP
$req_20=(int)$ip_usage_data[$PROXY['external_ip']]['requests'][20]['ut_google']; // the 20th request (if IP was used 20+ times) unixtime stamp
$now=time();
if (($now - $req_20) > (60*60) )
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // more than an hour passed since 20th usage of this IP
} else
{
$cd_sec=(60*60) - ($now - $req_20);
verbose("IP $PROXY[external_ip] needs $cd_sec seconds cooldown, not ready for use yet $NL");
return 0; // the IP is overused, it can not be used for scraping without being detected by the search engine yet
}
}
// return 1 if license is ready, otherwise 0
function get_license()
{
global $uid;
global $pwd;
global $LICENSE;
global $NL;
$res=proxy_api("hello"); // will fill $LICENSE
$ip="";
if ($res <= 0)
{
verbose("API error: Proxy API connection failed (Error $res). trying again soon..$NL$NL");
return 0;
} else
{
($LICENSE['active']==1) ? $ready="active" : $ready="not active";
verbose("API success: License is $ready.$NL");
if ($LICENSE['active']==1) return 1;
return 0;
}
}
/* Delay (sleep) based on the license size to allow optimal scraping
*
* Warning!
* Do NOT change the delay to be shorter than the specified delay.
* When scraping Google you should never do more than 20 requests per hour per IP address
* This function will create a delay based on your total IP addresses.
*
* Together with the IP management functions this will ensure that your IPs stay healthy (no wrong rankings) and undetected (no virus warnings, blacklists, captchas)
*
* Multithreading:
* When multithreading you need to multiply the delay time ($d) by the number of threads
*/
function delay_time()
{
global $NL;
global $LICENSE;
$d=(3600*1000000/(((float)$LICENSE['total_ips'])*12));
verbose("Delay based on license size, please wait.. $NL");
usleep($d);
}
/*
* Updates and stores the ip usage data object
* Marks an IP as used and re-sorts the access array
*/
function mark_ip_usage()
{
global $PROXY;
global $working_dir;
global $NL;
global $ip_usage_data; // usage data object as array
if (!isset($ip_usage_data)) die("ERROR: Incorrect usage. check_ip_usage() needs to be called once before mark_ip_usage()!$NL");
$now=time();
$ip_usage_data[$PROXY['external_ip']]['ut_last-usage']=$now; // last time this IP was used
if (!isset($ip_usage_data[$PROXY['external_ip']]['request-total'])) $ip_usage_data[$PROXY['external_ip']]['request-total']=0;
$ip_usage_data[$PROXY['external_ip']]['request-total']++; // total number of requests made by this IP
// shift fifo queue
for ($req=19;$req>=1;$req--)
{
if (isset($ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google']))
{
$ip_usage_data[$PROXY['external_ip']]['requests'][$req+1]['ut_google']=$ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google'];
}
}
$ip_usage_data[$PROXY['external_ip']]['requests'][1]['ut_google']=$now;
$serdata=serialize($ip_usage_data);
file_put_contents($working_dir."/ipdata.obj",$serdata, LOCK_EX);
}
// access google based on parameters and return raw html or "0" in case of an error
function scrape_serp_google($search_string,$page,$local_data)
{
global $ch;
global $NL;
global $PROXY;
global $LICENSE;
global $scrape_result;
global $test_100_resultpage;
global $filter;
$scrape_result="";
$google_ip=$local_data['domain'];
$hl=$local_data['lc'];
if ($page == 0)
{
if ($test_100_resultpage)
$url="http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=100&filter=$filter";
else
$url="http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=10&filter=$filter";
} else
{
if ($test_100_resultpage)
{
$num=$page*100;
$url="http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=100&filter=$filter";
} else
{
$num=$page*10;
$url="http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=10&filter=$filter";
}
}
//verbose("Debug, Search URL: $url$NL");
curl_setopt ($ch, CURLOPT_URL, $url);
$htmdata = curl_exec ($ch);
if (!$htmdata)
{
$error = curl_error($ch);
$info = curl_getinfo($ch);
echo "\tError scraping: $error [ $error ]$NL";
$scrape_result="SCRAPE_ERROR";
sleep (3);
return "";
} else
if (strlen($htmdata) < 20)
{
$scrape_result="SCRAPE_EMPTY_SERP";
sleep (3);
return "";
}
if (strstr($htmdata,"computer virus or spyware application"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. $NL");
$scrape_result="SCRAPE_DETECTED";
die();
}
if (strstr($htmdata,"entire network is affected"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. $NL");
$scrape_result="SCRAPE_DETECTED";
die();
}
if (strstr($htmdata,"http://www.download.com/Antivirus"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. $NL");
$scrape_result="SCRAPE_DETECTED";
die();
}
if (strstr($htmdata,"/images/yellow_warning.gif"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. $NL");
$scrape_result="SCRAPE_DETECTED";
die();
}
$scrape_result="SCRAPE_SUCCESS";
return $htmdata;
}
/*
* Parser
* This function will parse the Google html code and create the data array with ranking information
* The variable $process_result will contain general information or warnings/errors
*/
require_once "simple_html_dom.php";
function process_raw_v2($data, $page)
{
global $process_result; // contains metainformation from the process_raw() function
global $test_100_resultpage;
global $NL;
global $B;
global $B_;
$results=array();
$html = new simple_html_dom();
$html->load($data);
/** @var $interest simple_html_dom_node */
$interest = $html->find('div#ires ol li.g');
echo "found interesting elements: ".count($interest)."\n";
$interest_num=0;
foreach ($interest as $li)
{
$result = array('title'=>'undefined','host'=>'undefined','url'=>'undefined','desc'=>'undefined','type'=>'organic');
$interest_num ++;
$h3 = $li->find('h3.r',0);
if (!$h3)
{
continue;
}
$a = $h3->find('a',0);
if (!$a) continue;
$result['title'] = html_entity_decode($a->plaintext);
$lnk = urldecode($a->href);
if ($lnk)
{
preg_match('/.+(ht[^&]*).+/', $lnk, $m);
if ($m[1])
{
$result['url']=$m[1];
$tmp=parse_url($m[1]);
$result['host']=$tmp['host'];
} else
{
if (strstr($result['title'],'News')) $result['type']='news';
}
}
if ($result['type']=='organic')
{
$sp = $li->find('span.st',0);
if ($sp)
{
$result['desc']=html_entity_decode($sp->plaintext);
$sp->clear();
}
}
$h3->clear();
$a->clear();
$li->clear();
$results[]=$result;
}
$html->clear();
// Analyze if more results are available (next page)
$next = 0;
if (strstr($data, "Next</a>"))
{
$next = 1;
} else
{
if ($test_100_resultpage)
{
$needstart = ($page + 1) * 100;
} else
{
$needstart = ($page + 1) * 10;
}
$findstr = "start=$needstart";
if (strstr($data, $findstr)) $next = 1;
}
$page++;
if ($next)
{
$process_result = "PROCESS_SUCCESS_MORE"; // more data available
} else
{
$process_result = "PROCESS_SUCCESS_LAST";
} // last page reached
return $results;
}
function process_raw($htmdata,$page) // process the html and put results into $serp_data
{
global $process_result; // contains metainformation from the process_raw() function
global $test_100_resultpage;
global $NL;
global $B;
global $B_;
$dom = new domDocument;
$dom->strictErrorChecking = false;
$dom->preserveWhiteSpace = true;
@$dom->loadHTML($htmdata);
$lists=$dom->getElementsByTagName('li');
$num=0;
$results=array();
foreach ($lists as $list)
{
unset($ar);unset($divs);unset($div);unset($cont);unset($result);unset($tmp);
$ar=dom2array_full($list);
if (count($ar) < 2)
{
verbose("s");
continue; // skipping advertisements
}
if ((!isset($ar['class'])) || ($ar['class'] != 'g'))
{
verbose("x");
continue; // skipping non-search result entries
}
// adaption to new google layout
if (isset($ar['div'][1]))
$ar['div']=&$ar['div'][0];
if (isset($ar['div'][1]))
$ar['div']=&$ar['div'][0];
//$ar=&$ar['div']['span']; // changes 2011 - Google changed layout
//$ar=&$ar['div']; // changes 2011 - Google changed layout // change again, 2012-2013
$orig_ar=$ar; // 2012-2013
// adaption finished
$divs=$list->getElementsByTagName('div');
$div=$divs->item(1);
getContent($cont,$div);
$num++;
$result['title']=&$ar['h3']['a']['textContent'];
$tmp=strstr(&$ar['h3']['a']['@attributes']['href'],"http");
$result['url']=$tmp;
if (strstr(&$ar['h3']['a']['@attributes']['href'],"interstitial")) echo "!";
$tmp=parse_url(&$result['url']);
$result['host']=&$tmp['host'];
$desc=strstr($cont,"<span class='st'>"); // instead of using DOM the string is parsed traditional due to frequent layout changes by Google
$desc=substr($desc,17);
$desc=strip_tags($desc);
$result['desc']=$desc;
// 2012-2013 addon, might be extended with on request
if (isset($ar['table']) && (strlen($result['title']) < 2)) // special mode - embedded video or similar (bugfix: parenthesis was misplaced)
{
// if interesting the object can be parsed here
$result['title']="embedded object";
$result['url']="embedded object";
}
//echo "$B Result parsed:$B_ $result[title]$NL";
verbose("r");
flush();
$results[]=$result; // This adds the result to our large result array
}
verbose(" !$NL");
// Analyze if more results are available (next page)
$next=0;
$tables=$dom->getElementsByTagName('table');
if (strstr($htmdata,"Next</a>")) $next=1;
else
{
if ($test_100_resultpage)
$needstart=($page+1)*100;
else
$needstart=($page+1)*10;
$findstr="start=$needstart";
if (strstr($htmdata,$findstr)) $next=1;
}
$page++;
if ($next)
{
$process_result="PROCESS_SUCCESS_MORE"; // more data available
} else
$process_result="PROCESS_SUCCESS_LAST"; // last page reached
//var_dump($results);
return $results;
}
function rotate_proxy()
{
global $PROXY;
global $ch;
global $NL;
$max_errors=3;
$success=0;
while ($max_errors--)
{
$res=proxy_api("rotate"); // will fill $PROXY
$ip="";
if ($res <= 0)
{
verbose("API error: Proxy API connection failed (Error $res). trying again soon..$NL$NL");
sleep(21); // retry after a while
} else
{
verbose("API success: Received proxy IP $PROXY[external_ip] on port $PROXY[port]$NL");
$success=1;
break;
}
}
if ($success)
{
$ch=new_curl_session($ch);
return 1;
} else
return "API rotation failed. Check license, firewall and API credentials.$NL";
}
/*
 * This is the API function for $portal.seo-proxies.com, supporting the "hello" and "rotate" commands
 * "hello" fills the $LICENSE variable; on a successful "rotate" it will define the $PROXY variable, adding the elements ready,address,port,external_ip, and return 1
 * On failure the return is <= 0 and the PROXY variable ready element is set to "0"
 */
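// Helper used by proxy_api() and get_google_cc(): strip the HTTP response headers and return only the body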
function extractBody($response_str)
{
$parts = preg_split('|(?:\r?\n){2}|m', $response_str, 2);
if (isset($parts[1])) return $parts[1];
return '';
}
function proxy_api($cmd,$x="")
{
global $pwd;
global $uid;
global $PROXY;
global $LICENSE;
global $NL;
global $portal;
$fp = fsockopen("$portal.seo-proxies.com", 80);
if (!$fp)
{
echo "Unable to connect to proxy API $NL";
return -1; // connection not possible
} else
{
if ($cmd == "hello")
{
fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=hello&extended=1 HTTP/1.0\r\nHost: $portal.seo-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res="";
$n=0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo "API: Connection timed out! $NL";
$LICENSE['active']=0;
return -2; // api timeout
} else
{
if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
$data=extractBody($res);
$ar=explode(":",$data);
if (count($ar) < 4) return -100; // invalid api response
switch ($ar[0])
{
case "ERROR":
echo "API Error: $res $NL";
$LICENSE['active']=0;
return 0; // Error received
break;
case "HELLO":
$LICENSE['max_ips']=$ar[1]; // number of IPs licensed
$LICENSE['total_ips']=$ar[2]; // number of IPs assigned
$LICENSE['protocol']=$ar[3]; // current proxy protocol (http, socks, vpn)
$LICENSE['processes']=$ar[4]; // number of proxy processes
if ($LICENSE['total_ips'] > 0) $LICENSE['active']=1; else $LICENSE['active']=0;
return 1;
break;
default:
echo "API Error: Received answer $ar[0], expected \"HELLO\"";
$LICENSE['active']=0;
return -101; // unknown API response
}
}
} // cmd==hello
if ($cmd == "rotate")
{
$PROXY['ready']=0;
fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=rotate&randomness=0&offset=0 HTTP/1.0\r\nHost: $portal.seo-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res="";
$n=0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo "API: Connection timed out! $NL";
return -2; // api timeout
} else
{
if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
$data=extractBody($res);
$ar=explode(":",$data);
if (count($ar) < 4) return -100; // invalid api response
switch ($ar[0])
{
case "ERROR":
echo "API Error: $res $NL";
return 0; // Error received
break;
case "ROTATE":
$PROXY['address']=$ar[1];
$PROXY['port']=$ar[2];
$PROXY['external_ip']=$ar[3];
$PROXY['ready']=1;
usleep(250000); // additional time to avoid connecting during proxy bootup phase, to be 100% sure 1 second needs to be waited
return 1;
break;
default:
echo "API Error: Received answer $ar[0], expected \"ROTATE\"";
return -101; // unknown API response
}
}
} // cmd==rotate
}
}
function dom2array($node)
{
$res = array();
if($node->nodeType == XML_TEXT_NODE)
{
$res = $node->nodeValue;
} else
{
if($node->hasAttributes())
{
$attributes = $node->attributes;
if(!is_null($attributes))
{
$res['@attributes'] = array();
foreach ($attributes as $index=>$attr)
{
$res['@attributes'][$attr->name] = $attr->value;
}
}
}
if($node->hasChildNodes())
{
$children = $node->childNodes;
for($i=0;$i<$children->length;$i++)
{
$child = $children->item($i);
$res[$child->nodeName] = dom2array($child);
}
$res['textContent']=$node->textContent;
}
}
return $res;
}
function getContent(&$NodeContent="",$nod)
{
$NodList=$nod->childNodes;
for( $j=0 ; $j < $NodList->length; $j++ )
{
$nod2=$NodList->item($j);
$nodemane=$nod2->nodeName;
$nodevalue=$nod2->nodeValue;
if($nod2->nodeType == XML_TEXT_NODE)
$NodeContent .= $nodevalue;
else
{ $NodeContent .= "<$nodemane ";
$attAre=$nod2->attributes;
foreach ($attAre as $value)
$NodeContent .= "{$value->nodeName}='{$value->nodeValue}'" ;
$NodeContent .= ">";
getContent($NodeContent,$nod2);
$NodeContent .= "</$nodemane>";
}
}
}
function dom2array_full($node)
{
$result = array();
if($node->nodeType == XML_TEXT_NODE)
{
$result = $node->nodeValue;
} else
{
if($node->hasAttributes())
{
$attributes = $node->attributes;
if((!is_null($attributes))&&(count($attributes)))
foreach ($attributes as $index=>$attr)
$result[$attr->name] = $attr->value;
}
if($node->hasChildNodes())
{
$children = $node->childNodes;
for($i=0;$i<$children->length;$i++)
{
$child = $children->item($i);
if($child->nodeName != '#text')
if(!isset($result[$child->nodeName]))
$result[$child->nodeName] = dom2array($child);
else
{
$aux = $result[$child->nodeName];
$result[$child->nodeName] = array( $aux );
$result[$child->nodeName][] = dom2array($child);
}
}
}
}
return $result;
}
function getip()
{
global $PROXY;
if (!isset($PROXY['ready']) || !$PROXY['ready']) return -1; // proxy not ready
$curl_handle=curl_init();
curl_setopt($curl_handle,CURLOPT_URL,'http://squabbel.com/ipxx.php'); // this site will return the plain IP address, great for testing if a proxy is ready
curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,10);
curl_setopt($curl_handle,CURLOPT_TIMEOUT,10);
curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);
$curl_proxy = "$PROXY[address]:$PROXY[port]";
curl_setopt($curl_handle, CURLOPT_PROXY, $curl_proxy);
$tested_ip=curl_exec($curl_handle);
if(preg_match("^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}^", $tested_ip))
{
curl_close($curl_handle);
return $tested_ip;
}
else
{
$info = curl_getinfo($curl_handle);
curl_close($curl_handle);
return 0; // possible error would be a wrong authentication IP or a firewall
}
}
function new_curl_session($ch=NULL)
{
global $PROXY;
if ((!isset($PROXY['ready'])) || (!$PROXY['ready'])) return $ch; // proxy not ready
if (isset($ch) && ($ch != NULL))
curl_close($ch);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER , 1);
$curl_proxy = "$PROXY[address]:$PROXY[port]";
curl_setopt($ch, CURLOPT_PROXY, $curl_proxy);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en; rv:1.9.0.4) Gecko/2009011913 Firefox/3.0.6");
return $ch;
}
function rmkdir($path, $mode = 0755) {
if (file_exists($path)) return 1;
return @mkdir($path, $mode);
}
/*
* For country&language specific searches
*/
function get_google_cc($cc,$lc)
{
global $pwd;
global $uid;
global $PROXY;
global $LICENSE;
global $NL;
global $portal;
$fp = fsockopen("$portal.seo-proxies.com", 80);
if (!$fp)
{
echo "Unable to connect to google_cc API $NL";
return NULL; // connection not possible
} else
{
fwrite($fp, "GET /g_api.php?api=1&uid=$uid&pwd=$pwd&cmd=google_cc&cc=$cc&lc=$lc HTTP/1.0\r\nHost: $portal.seo-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res="";
$n=0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo "API: Connection timed out! $NL";
return NULL; // api timeout
} else
{
$data=extractBody($res);
if (strlen($data) < 4) return NULL; // invalid api response (bugfix: check before returning, the old check was unreachable)
$obj=unserialize($data);
if (isset($obj['error'])) echo $obj['error']."$NL";
if (isset($obj['info'])) echo $obj['info']."$NL";
return $obj['data'];
}
}
}
?>