class_http.php
Author: Troy Wolf (troy@troywolf.com)
Modified Date: 2005-06-24 14:30
Download: class_http.zip
View class source: class_http.php source
class_http.php is a "screen-scraping" utility that makes it easy to scrape
content and cache scraped content for any number of seconds desired before
hitting the live source again. Caching makes you a good neighbor!
The class has 2 static methods that make it easy to
extract individual tables of data out of web pages. The class even comes with
a companion script that makes it easy to use and cache external images
directly within img elements.
The class cloaks itself as the User Agent of the user making the request to
your script. It also sends your script as the Referer, since in essence, it
is the referrer. This means you should be able to screen-scrape sites that
normally block screen-scraping. This class is not meant to help you break
any company's usage policies. Be a good neighbor, and always use caching when
you can.
Need to access protected content? The class can do basic authentication.
However, a lot of sites that require login do not use basic authentication.
To use the class in your scripts, you first need to include the class file.
Modify the path to fit your needs.
require_once(dirname(__FILE__).'/class_http.php');
Instantiate a new http object. You can create one object and use it over and
over again throughout your script, or you can create multiple objects as
needed.
$h = new http();
The caching feature requires a directory on your webserver to save the cache
files. If you prefer, you can hard-code this in the class itself by modifying
the 'dir' property in the http() function (the class constructor). The class
will default to storing the cache files in the current directory, but for
security, you should store them in a non web-accessible directory. You can
set this property per object using the code below. You must end the value with
a "/". If you do not plan to use caching, don't worry about this property.
$h->dir = "/home/foo/bar/";
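Because the value must end with a "/", it can help to normalize the path before assigning it. The following is only a minimal sketch: with_trailing_slash() is a hypothetical helper written for this example and is not part of class_http.php.

```php
<?php
// Hypothetical helper (not part of class_http.php): guarantees exactly one
// trailing "/" on a directory path.
function with_trailing_slash(string $dir): string
{
    return rtrim($dir, '/') . '/';
}

// Usage sketch, assuming $h is an http object created as shown earlier:
// $h->dir = with_trailing_slash("/home/foo/bar");

echo with_trailing_slash("/home/foo/bar"), "\n";   // prints /home/foo/bar/
echo with_trailing_slash("/home/foo/bar/"), "\n";  // prints /home/foo/bar/
```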
Example to screen-scrape the Google home page without caching.
if (!$h->fetch("http://www.google.com")) {
    echo "<h2>There is a problem with the http request!</h2>";
    echo $h->log;
    exit();
}
Once you have executed the fetch() method, three properties are available:
the HTTP status, the HTTP headers, and the body. Usually, you will only be
interested in the body content.
echo "Status: ".$h->status;
echo "<pre>".$h->header."</pre>";
echo $h->body;
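If you need a single header value rather than the whole header text, a small parser is enough. This is a sketch under two assumptions: that the header property holds raw header text with one "Name: value" pair per line, and that a status line may or may not be present. parse_raw_headers() is a hypothetical helper, not part of the class.

```php
<?php
// Hypothetical helper (not part of class_http.php): turns raw header text
// into an array keyed by lower-cased header name.
function parse_raw_headers(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n|\r/', trim($raw)) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // status line or blank line: no "Name: value" pair
        }
        $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
    }
    return $headers;
}

// Made-up header text for illustration; in a script you might pass $h->header.
$parsed = parse_raw_headers("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 42");
echo $parsed['content-type'], "\n"; // prints text/html
```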
Here is an example that screen-scrapes the MSFT stock page at moneycentral.msn.com WITH caching.
You can pass in a TTL (Time-To-Live): the number of seconds for which the
cached data is considered "good". For example, if you set the ttl to 600, the
local cache is checked before going to the source site for the data. If the
cache file exists and is not more than 10 minutes old, the
class will use the cache. Otherwise, the source site will be scraped, and the
local cache file will be updated. This makes your page faster and makes you a
better neighbor to the external site.
$url = "http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT";
if (!$h->fetch($url, 600)) {
    echo "<h2>There is a problem with the http request!</h2>";
    echo $h->log;
    exit();
}
There is a special ttl value of "daily". This tells the class to consider the
cached data "good" as long as it was scraped today. Otherwise, the class gets a
fresh copy of the content from the source site and updates the local cache.
if (!$h->fetch($url, "daily")) {
    echo "<h2>There is a problem with the http request!</h2>";
    echo $h->log;
    exit();
}
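The two freshness rules described above (a TTL in seconds and the special "daily" value) can be sketched as a single check. This is an illustration of the rules as the text states them, not the class's actual code. cache_is_fresh() and its parameters are hypothetical. In a real script, $mtime would typically come from filemtime() on the cache file and $now from time().

```php
<?php
// Hypothetical helper (not the class's actual code): decides whether a cache
// file last written at $mtime is still "good" at time $now.
// $ttl is either a number of seconds or the special string "daily".
function cache_is_fresh(int $mtime, $ttl, int $now): bool
{
    if ($ttl === "daily") {
        // Fresh only if the file was written on the same calendar day.
        return date('Y-m-d', $mtime) === date('Y-m-d', $now);
    }
    // Fresh if the file is no more than $ttl seconds old.
    return ($now - $mtime) <= (int) $ttl;
}

$now = mktime(12, 0, 0, 6, 24, 2005);                  // noon on 2005-06-24
var_dump(cache_is_fresh($now - 300, 600, $now));       // bool(true): 5 minutes old
var_dump(cache_is_fresh($now - 900, 600, $now));       // bool(false): 15 minutes old
var_dump(cache_is_fresh($now - 3600, "daily", $now));  // bool(true): 11:00 the same day
var_dump(cache_is_fresh($now - 86400, "daily", $now)); // bool(false): the previous day
```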
Optionally, you can pass in a name that will be used to name the cache file.
This is useful if you want to be able to know which cache files are which.
If you do not pass a name, it will default to an MD5 hash of the url.
if (!$h->fetch($url, 600, "MSFT_Info")) {
    echo "<h2>There is a problem with the http request!</h2>";
    echo $h->log;
    exit();
}
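The naming rule can be pictured as follows. This is a sketch of the behavior described above, not the class's own code. cache_name() is a hypothetical helper, and any prefix, suffix, or extension the class may add to the file name is not shown.

```php
<?php
// Hypothetical helper (not the class's actual code): picks the cache name
// from an optional caller-supplied name, else the MD5 hash of the URL.
function cache_name(string $url, ?string $name = null): string
{
    return $name !== null ? $name : md5($url);
}

$url = "http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT";
echo cache_name($url, "MSFT_Info"), "\n"; // prints MSFT_Info
echo cache_name($url), "\n";              // prints a 32-character hex string
```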
The class comes with 2 static methods you can use to extract data out of
HTML tables.
- table_into_array() will rip a single table into an array.
- table_into_xml() will internally call table_into_array() then
create an XML document from the array. I thought this would be cool, but in
practice, I've never used this method since the array is so easy to work
with.
This example builds on the previous example to extract the MSFT stats out of
http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT.
Read the comments in the class file to learn how to use this static method.
$msft_stats = http::table_into_array($h->body, "Avg Daily Volume", 1, null);
/* Print out the array so you can see the stats data. */
echo "<pre>";
print_r($msft_stats);
echo "</pre>";
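To illustrate the general idea of pulling one table out of a page by searching for text it contains, here is a standalone sketch that uses PHP's DOM extension. It is not how table_into_array() is implemented, and the array that method returns may be shaped differently. first_table_containing() is a hypothetical helper, and the sample row value is invented.

```php
<?php
// Hypothetical helper (a different technique from the class's own method):
// returns the rows of the first <table> whose text contains $needle, as an
// array of rows, each row an array of cell text.
function first_table_containing(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world, non-conforming HTML.
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('table') as $table) {
        if (strpos($table->textContent, $needle) === false) {
            continue;
        }
        $rows = [];
        foreach ($table->getElementsByTagName('tr') as $tr) {
            $cells = [];
            foreach ($tr->childNodes as $cell) {
                if ($cell instanceof DOMElement
                    && in_array($cell->nodeName, ['td', 'th'], true)) {
                    $cells[] = trim($cell->textContent);
                }
            }
            $rows[] = $cells;
        }
        return $rows;
    }
    return [];
}

// Invented sample markup; in a script you might pass $h->body instead.
$html = '<table><tr><td>Avg Daily Volume</td><td>65.2 Mil</td></tr></table>';
print_r(first_table_containing($html, "Avg Daily Volume"));
```

This sketch does not treat nested tables specially: an outer table that contains the needle text is matched before the inner one.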
The class can do basic authentication to scrape protected content. Note that
most sites that require login do not use basic authentication.
Pass your username and password in like this:
$url = "http://someprivatesite.net";
$h->fetch($url, 0, null, "MyUserName","MyPassword");
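For reference, HTTP basic authentication sends the username and password joined by a colon and base64-encoded in an Authorization header. The sketch below shows that encoding only. basic_auth_header() is a hypothetical helper, and how the class builds its own request is not shown here.

```php
<?php
// Hypothetical helper: builds the header used by HTTP basic authentication,
// i.e. "user:password" base64-encoded after the word "Basic".
function basic_auth_header(string $user, string $pass): string
{
    return "Authorization: Basic " . base64_encode($user . ":" . $pass);
}

echo basic_auth_header("MyUserName", "MyPassword"), "\n";
```

Base64 is an encoding, not encryption, so the credentials are readable by anyone who can see the request unless the connection uses https.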
If you need to access content on a port other than 80 (or 443 for https),
just put the port in the URL in the standard way:
$h->fetch("http://somedomain.org:8088");
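If you want to check which port a URL resolves to, PHP's parse_url() can extract it. The following is a small sketch. url_port() is a hypothetical helper, and the fallback to 80 or 443 mirrors the defaults mentioned above rather than the class's internal logic.

```php
<?php
// Hypothetical helper: returns the port a URL points at, falling back to
// the scheme default (443 for https, otherwise 80) when none is given.
function url_port(string $url): int
{
    $port = parse_url($url, PHP_URL_PORT);
    if (is_int($port)) {
        return $port;
    }
    return parse_url($url, PHP_URL_SCHEME) === 'https' ? 443 : 80;
}

echo url_port("http://somedomain.org:8088"), "\n"; // prints 8088
echo url_port("https://somedomain.org"), "\n";     // prints 443
echo url_port("http://somedomain.org"), "\n";      // prints 80
```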
The class includes a companion script named image_cache.php that can be used
as the src attribute within an image element. Why not just link directly to a
neighbor's images? If your site has a lot of traffic, that's a lot of hits to
your neighbor's site. So why not just copy their image to your own server?
That's fine for images that do not change, but some sites create dynamic
images such as stock charts that are generated new every minute.
image_cache.php in conjunction with class_http.php makes it easy to directly
link to third-party images and cache the image data for whatever TTL makes
sense for your application.
View the source for image_cache.php.
In this example, we will cache the chart image found at
http://moneycentral.msn.com/investor/charts/chartdl.asp?FC=1&Symbol=MSFT&CA=1&CB=1&CC=1&CD=1&CP=0&PT=5
You have to look at the page source code to find the URL of their image. Then
you URL-encode their image URL and pass it as a parameter to image_cache.php
in your image's src attribute. The embedded URL is very long because it was
long to start with, and after URL encoding, it is much longer. In this example,
we have set ttl=60 which means cache the image for 1 minute before hitting the
source site again.
<img src="image_cache.php?ttl=60&url=http%3A%2F%2Fdata.moneycentral.msn.com%2Fscripts%2Fchrtsrv.dll%3FSymbol%3DMSFT%26C1%3D0%26C2%3D1%26C9%3D2%26CA%3D1%26CB%3D1%26CC%3D1%26CD%3D1%26CF%3D0%26EFR%3D236%26EFG%3D246%26EFB%3D254%26E1%3D0" width="448" height="300" alt="Chart Graphic" />
Tip: Use PHP's urlencode() function to encode your embedded URLs.
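Building the src value in PHP avoids encoding the URL by hand. This is a minimal sketch. image_cache_src() is a hypothetical helper, example.com stands in for a real image URL, and the ttl and url parameter names are taken from the example above.

```php
<?php
// Hypothetical helper: builds the src value for image_cache.php from a
// third-party image URL and a TTL in seconds.
function image_cache_src(string $imageUrl, int $ttl): string
{
    return "image_cache.php?ttl=" . $ttl . "&url=" . urlencode($imageUrl);
}

// example.com is a placeholder for the real image URL.
$src = image_cache_src("http://example.com/chart?Symbol=MSFT&CA=1", 60);

// htmlspecialchars() escapes the "&" between parameters for use in HTML.
echo '<img src="' . htmlspecialchars($src) . '" alt="Chart Graphic" />', "\n";
```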
Finally, anytime you have problems, be sure to look at the 'log' property
which will give you specific information related to problems with your
http requests or problems with caching.
/*
The log property contains a log of the object's events. Very useful for
testing and debugging. If there are problems, the log will tell you what
is wrong. For example, if the cache dir specified does not have write
privileges, the log will tell you it could not open the cache file. If a
socket to the remote server could not be opened, the log will tell you this.
*/
echo "<h1>Log</h1>";
echo $h->log;
About the author
Troy Wolf operates ShinySolutions Webhosting, and is the author of
SnippetEdit--a PHP application
providing browser-based website editing that even non-technical people can
use. Website editing as easy as it gets. Troy has been a professional
Internet and database application developer for over 10 years. He has many
years' experience with ASP, VBScript, PHP, Javascript, DHTML, CSS, SQL, and
XML on Windows and Linux platforms.