Albedo Systems




LOGSITE.EXE V1.1

Website Log Analysis

Copyright Albedo Systems Ltd 1998
http://www.albedo.co.uk/
17 March 1998

INTRODUCTION
LOGSITE.EXE is the stand-alone version of Albedo Systems' Windows NT/95 LogSite log analysis program. It enables site logs to be generated as a scheduled task under NT. (For example, you could update the logs for all your smaller sites every midnight.)

The program should also prove useful to non-Cold Fusion users. Although LogSite is specifically designed to operate with user logs produced using Cold Fusion logging tags, users of other systems, such as ASP, can generate such user logs from their pages too using Perl or other means.

Furthermore, two companion programs are provided with this program (COMBCONV.EXE and CFX_COMMCONV.EXE) that will convert NCSA/CERN Combined and Common access logs to a format that can be used by LogSite, albeit, in the case of Common logs, with less information available.

CONDITIONS AND PURCHASING
LOGSITE.EXE 1.1 is commercial software. It may not be distributed except in the form of a time-limited demo version. The price for the full version is $70 (US), or £45 (UK). All users who register before 21 June 1998 will receive free upgrades to the software when these become available. (The accompanying conversion programs, COMBCONV.EXE and CFX_COMMCONV.EXE, are freeware and may be freely used and distributed).

This freely downloadable version is time-disabled - it will not analyse any logs or portions of logs outside the period 21 March 1998 to 30 April 1998.

We're afraid that we must ask for snail-mail international money orders or UK cheques (an e-commerce solution should be available soon). Please make them payable to Albedo Systems Ltd and send them to...

Albedo Systems Ltd
268 Amhurst Road
Stoke Newington
London N16 7UP
United Kingdom

...making sure that you state your email address - LogSite is compact enough to be distributed by email, and we are adopting this as the simplest method.

All technical queries with regard to the use of LogSite should be addressed to admin@albedo.co.uk. If you use our software we'd, of course, appreciate it if you credited us/linked to us, but you know you don't have to. We'll be doing more stuff soon, so please do drop by our website at http://www.albedo.co.uk/

SUMMARY
LogSite is unique as a low-cost solution to log analysis:
  • Provided the user log is not too vast, LogSite can be run remotely, on the server, at the user's whim, allowing instant feedback for a webmaster or a client. The program is fully server-compliant and tested.
  • It is fast. LogSite is an entirely self-contained piece of system software, independent of ODBC data sources. A high degree of optimisation means that even a large log file can be analysed in a matter of seconds.
  • It produces clear, accessible figures. Advanced dynamic graphing techniques have been employed to create attractive charts that download rapidly, along with clearly readable numeric tables.
  • It is comprehensive. Besides figures for pages served - and visited - LogSite also logs individual visitors, the pages that have referred them to the site and the agents/browsers that have visited the site. It is also completely unique in providing a log of search engine spider visits to the site.
  • It is highly configurable. A number of the pages are entirely optional - a boon when a log is being analysed on the server - and a host of parameters means that the appearance and content of LogSite pages can be tailored to the exact needs of the user.
  • It looks good. Our object was to produce a log program that a webmaster can happily show to a client without embarrassment about bad HTML techniques or unattractive graphics. The program can also be altered to harmonise with the website it describes.
GENERATING THE LOG
The program analyses user logs with the following comma-separated format:

Date,Time,IP Address,User Agent,Page,Page Description,Referring Page

Here is an example line:

"06-Feb-98", "06:25 PM", "127.0.0.0", "Mozilla/3.0", "c:\webs\mysite\htdocs\index.cfm", "Main Index Page", "http://www.someone_else.co.uk/links.htm"

These can be set up in various ways, but pose no special problem for Cold Fusion users. A number of freeware CF_ logging tags are available, including the excellent CF_LOG from Ben Forta of StoneAge Software. To take CF_LOG as an example, this works as any CF_ tag would - the tag file, log.cfm, is placed in the directory where pages are to be logged. Each page to be logged will then contain a line of the sort...

<CF_Log FILE="c:\webs\mysite/logs/mysite.log" TEXT="Main Index Page">

...which will append the appropriate page data to the log file. Unfortunately, while LogSite is fully compatible with the CF_LOG tag provided on Allaire's site, that tag does not supply Page Referral information. A slightly modified version of the tag that does so is supplied with this program.
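
For pages that are not served by Cold Fusion, the same kind of line can be produced by any script that can append to a file. The following Python sketch writes and reads back one line in the format shown above. The function names are hypothetical, the spacing after commas simply follows the example line, and the sketch assumes that no field contains a double quote, since this document does not say how LogSite treats one.

```python
import csv
from datetime import datetime

# Field order taken from the format line in this document.
FIELDS = ["date", "time", "ip", "agent", "page", "description", "referrer"]


def format_log_line(when, ip, agent, page, description, referrer=""):
    """Build one quote-qualified, comma-separated log line (no trailing newline)."""
    fields = [
        when.strftime("%d-%b-%y"),  # e.g. 06-Feb-98 (assumes an English locale)
        when.strftime("%I:%M %p"),  # e.g. 06:25 PM
        ip, agent, page, description, referrer,
    ]
    # Assumption: fields never contain a double quote.
    return ", ".join('"' + field + '"' for field in fields)


def append_log_line(log_path, line):
    """Append one line to the user log, creating the file if needed."""
    with open(log_path, "a") as log:
        log.write(line + "\n")


def parse_log_line(line):
    """Split one log line back into a dict keyed by FIELDS, plus a timestamp."""
    # skipinitialspace tolerates the space after each comma in the example line.
    row = next(csv.reader([line], skipinitialspace=True))
    if len(row) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(row)))
    record = dict(zip(FIELDS, row))
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%d-%b-%y %I:%M %p"
    )
    return record
```

A page handler would call format_log_line once per request and pass the result to append_log_line. Whether LogSite accepts lines without the space after each comma is not stated here, so the sketch keeps the spacing of the example.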

LOG CONVERSION
There is another approach to using LogSite, and that is to convert the existing logs (NCSA format) on the server to LogSite-compatible format. These come in two basic formats depending on how the systems administrator has configured the server:

  • Combined NCSA/CERN: Contains full logging information.
  • Common NCSA/CERN: An older format that lacks any information on user agents and referral pages.
Accordingly, we have provided two programs that can convert server logs to LogSite-compatible format. See their accompanying documentation (provided with LogSite) for how to use them.

READING THE LOG
When LogSite is run, it generates a series of HTML pages with figures and charts describing your website usage (up to eight pages, as of version 1.1 - later versions will be more complex). All the pages are hyperlinked - main sections are as follows:

1. Log Summary: Front page to the log, with a few summary figures and links to the other pages.

2. Page Analysis: This will show which pages are most popular. It contains:
  • A daily graph, showing a 'radar scan' of page accesses over the period of the log.
  • A graph showing average page hits for days of the week.
  • A graph showing page hits for each month of the log.
  • A graph showing average page hits for hours of the day.
  • A graph showing the Top 20 pages accessed.
  • A complete list of all pages accessed, with figures.
3. Visit Analysis: This page shows where visitors are entering your site. It is achieved by analysing the site referral field in the log. A visited page is defined as any page that was accessed from an external site. The graphs here are in the same categories as for Page Analysis.

4. Referral Analysis: This lists the external pages that are referring to your site. It contains a complete list of them, plus a Top 20 graph.

5. Visitor Analysis: For this to work properly, a 'reverse DNS' operation should be performed to provide an IP address/domain cross-reference file (see Appendix 1). Otherwise the program can only tell you how many unique visitors the site has had.

Assuming that a reference file has been supplied, the program will supply the following breakdowns:
  • A table of visitors by sector (where this part of the domain name is meaningful in this sense).
  • A Top 20 graph of visitors by nationality.
  • A complete list of nationalities visiting, with figures.
  • An optional user search table. This can be configured to give figures on a particular domain - or a set of domains like America Online or Demon - for its format see Appendix 4.
  • A complete list, on a separate page, of all domains visiting with figures. (Warning - this can be rather long).
6. Agent Analysis: This, among other things, tells you which browsers and other agents are being used to access the site. It is set up using an optional file supplied by the user. This not only means that when new browsers come on to the market, they can immediately be included in the figures, but that the breakdown can vary from the simple to the complex.

Hence a simple breakdown into Netscape, MS Internet Explorer and others is possible, as is an analysis that covers every single issue of these browsers. Agents other than browsers (for example robot spiders) can also be scanned for. For a description of the Agent configuration file see Appendix 2. A useful example file is included with the program.

Agent Analysis gives a Top 20 graph of agents visiting, plus a table giving figures for all agents in the file.

7. Robot Analysis: It can be helpful to know whether, and how often, the search engines' robot spiders have accessed your site. This page does just that, and relies on a file supplied by the user to identify the spiders. (An example, covering the main search engines, is supplied. For a complete description of the file format see Appendix 3.)

For each search engine/robot that has visited, the page gives a 'radar scan' of visits over the period of the log, plus a listing of all pages visited (and, presumably, indexed).

Search engine analysis, it should be noted, is unique to LogSite at the time of writing.

INSTALLATION
The zip file that contains this file should contain:

  • readme.txt: a brief note
  • lsdsdoc.cfm: this file.
  • logsite.exe: Stand-alone executable version of LogSite.
  • combconv.exe: Program to convert Combined NCSA/CERN access logs.
  • combdocd.cfm: Documentation for combconv.exe.
  • commconv.exe: Program to convert Common NCSA/CERN access logs.
  • commdocd.cfm: Documentation for commconv.exe.
  • dot_graph.gif: Small gif used to draw graphs.
  • dot_back.gif: Small gif used as a background for graphs.
  • dot_clr.gif: Transparent formatting gif.
  • logsite.gif: LogSite logo.
  • agent.txt: Sample agent file.
  • robo.txt: Sample search engine file.
  • search.txt: Sample user search file.
  • Albedo6.gif, back.gif, rule_pnk.gif: Various layout gifs used in these documentation files.
Simply unzip the files into the directory you wish to run them from.


USAGE
LogSite is run from the DOS prompt, and it requires a parameter file to run successfully. You may put this file wherever you want, but the program defaults to a file called Logsite.ini placed in the same directory as the program. If, on the other hand, you wish to use a different parameter file, the usage is logsite [filename], where [filename] is the full pathname of the parameter file.

Here's an example of a parameter file - all the attributes must be on separate lines. (Bear in mind that not all of these parameters are mandatory).

LOGFILE="d:\mysite\logs\site.log"
DIRECTORY="d:\mysite\analysis\"
DOMAIN_NAME="http://www.mysite.co.uk/"
SUB_DOMAIN1="homesites/"
SUB_DOMAIN2="users/"
TITLE="My Site"
HEADER="d:\mysite\analysis\header.cfm"
FOOTER="d:\mysite\analysis\footer.cfm"
TABLE_BACK1="##80FFFF"
TABLE_BACK2="##C0FFC0"
PATH="d:\mysite\htdocs\"
IP_FORBID1="255.255.255.255"
IP_FORBID2="127.0.0.0"
REFERRALS="Y"
IP_DOMAIN_NAME="d:\mysite\logs\dns.txt"
BROWSER_FILE="d:\mysite\logs\agents.txt"
ROBOT_FILE="d:\mysite\logs\robo.txt"
VISITOR_FILE="d:\mysite\logs\search.txt"
LOG_CYCLE="31"
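
As an illustration of the file layout only, the Python sketch below reads NAME="value" lines of this kind into a dictionary and reports which mandatory parameters are missing. It is not LogSite's own parser: the function names are hypothetical, and treating names as case-insensitive and skipping malformed lines are assumptions.

```python
# LOGFILE and DIRECTORY are the two parameters this document marks as mandatory.
MANDATORY = ("LOGFILE", "DIRECTORY")


def parse_parameters(text):
    """Read NAME="value" lines into a dict; blank and malformed lines are skipped."""
    params = {}
    for raw_line in text.splitlines():
        name, sep, value = raw_line.strip().partition("=")
        if not sep or not name:
            continue
        value = value.strip()
        # Remove one pair of enclosing double quotes, if present.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        # Assumption: parameter names are case-insensitive.
        params[name.strip().upper()] = value
    return params


def missing_mandatory(params):
    """Return the mandatory parameter names that are absent or empty."""
    return [name for name in MANDATORY if not params.get(name)]
```

Checking a parameter file this way before scheduling a run makes it easier to spot a missing LOGFILE or DIRECTORY line.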

The parameters...

LOGFILE (mandatory) The absolute pathname for the log file you wish to analyse.

DIRECTORY (mandatory) points to the directory where you wish your log pages to be generated.

DOMAIN (optional, defaults to NULL; written as DOMAIN_NAME in the example file above) is the fully qualified domain name for your site. LogSite uses this to determine whether or not a referring page is external to your site. Without it, visit processing will not return any results.

SUB_DOMAIN1 and SUB_DOMAIN2 (both optional, default to NULL) These two strings are also used in the referral processing. Referrals show all sites external to the user domain from which people have visited the site. However, you may have directories inside your site that you wish to regard as 'external' - for example users or homesites that you are hosting within your site. If you wish to know whether you are getting any visits and refers from these, put a string in one or both of these fields that uniquely identifies the directories you wish to be treated as external. For example, "users/" would mean that all references of the form http://www.mysite.co.uk/users/... or http://www.mysite.co.uk/.../users/... will be regarded as external.
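
The rule described for the domain and sub-domain strings can be sketched in Python as follows. This is one reading of the text above, not LogSite's actual code: the function name is hypothetical, and the case-insensitive comparison and the treatment of a blank referrer as 'not a visit' are assumptions.

```python
def is_external(referrer, domain_name, sub_domains=()):
    """Decide whether a referring page counts as external to the site."""
    # Without a domain name, visit processing returns no results (see above);
    # a blank referrer is assumed not to count as an external visit.
    if not referrer or not domain_name:
        return False
    ref = referrer.lower()
    site = domain_name.lower()
    if not ref.startswith(site):
        return True  # a page on some other site
    # Inside the site's own domain: external only if a sub-domain
    # string occurs somewhere in the rest of the URL.
    path = ref[len(site):]
    return any(sub and sub.lower() in path for sub in sub_domains)
```

With the example parameter file, a referrer of http://www.mysite.co.uk/users/... would be reported as external, while http://www.mysite.co.uk/index.cfm would not.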

TITLE (optional, defaults to NULL). The name of your site. In fact this is only used if you settle for the default page header.

HEADER (optional, defaults to NULL): The path name of an HTML file that will be used as the header for the page. This allows you a great measure of customisation - you can provide your own background, logo and so on. If absent, this defaults to a simple header with the site title.

FOOTER (optional, defaults to NULL): The path name of an HTML file that will be used as the footer for the page. As with the header, this allows you yet more customisation. If absent, this defaults to a simple footer that closes off the <BODY> and <HTML> tags.

TABLE_BACK1 (optional, defaults to FFFF80 (pale yellow)). Background colour for the main labels on the graphs and tables. Note the Cold Fusion double hash.

TABLE_BACK2 (optional, defaults to C0C0C0 (light grey)). Second background colour for the histograms. Note the Cold Fusion double hash.

PATH (optional, defaults to NULL). An entirely cosmetic string, used to clip unnecessary information from the start of the page names.

IP_FORBID1 and IP_FORBID2 (optional, default to NULL). If you are interested in your true log counts, then you probably don't want to count site accesses by the site designer and/or owner. If you know the appropriate IP numbers, you can ensure that these accesses are not included in the analysis here.

REFERRALS (optional, defaults to "Y"). External referral information may not be available for your site, particularly if you are using a converted NCSA log; in that case a referral page is not generated. You may also wish to switch the option off if you just want to run a skeleton log - in which case an "N" is appropriate here.

IP_DOMAIN (optional, defaults to NULL; written as IP_DOMAIN_NAME in the example file above). This is the pathname of the IP address/domain reference file that the program uses to generate the visitor page. This file will have to be produced by using an external Reverse DNS procedure. More about all this, including the file format, is provided in Appendix 1.

BROWSER_FILE (optional, defaults to NULL). This is the pathname of the browser (agent) reference file that the program uses to generate the agent analysis page (assuming that the log contains user agent information). This file can be as simple or complex as you wish, up to a limit of 120 user-specified browsers or agents. A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 2.

ROBOT_FILE (optional, defaults to NULL). This is the pathname of the robot (spider) reference file that the program uses to generate the robot analysis page (assuming that the log contains user agent information). This file can be as simple or complex as you wish, up to a limit of 120 search engines. A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 3.

VISITOR_FILE (optional, defaults to NULL). The pathname of a user-definable search file that the program uses to generate extra visitor analysis (for example you may be interested in all users from AOL, or from a given university). There is a limit of 200 search strings in this file (warning - overdoing this could seriously affect program performance). A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 4.

LOG_CYCLE (optional, defaults to NULL). If you are running LogSite online, perhaps on a daily basis, it is helpful to be able to clip the log to a convenient size, possibly arranged from midnight to midnight. If LOG_CYCLE is specified, the log is clipped to the number of days specified. Extra lines are removed from the start of the log and appended to a file called archive.log in the log directory and, for good measure, the original log is backed up as log.bak.
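
The clipping step can be pictured with the following Python sketch, which splits log lines into those kept and those moved to the archive. It is an illustration only: the function name is hypothetical, and counting the cycle back from 'today' (inclusive) is an assumption, since the exact cut-off rule is not given here.

```python
from datetime import date, datetime, timedelta


def clip_log(lines, cycle_days, today=None):
    """Split log lines into (kept, archived) using a window of cycle_days days."""
    if today is None:
        today = date.today()
    # Assumption: the window is the last cycle_days days, including today.
    cutoff = today - timedelta(days=cycle_days)
    kept, archived = [], []
    for line in lines:
        # The date is the first quote-qualified field, e.g. "06-Feb-98".
        date_text = line.split(",", 1)[0].strip().strip('"')
        day = datetime.strptime(date_text, "%d-%b-%y").date()
        if day > cutoff:
            kept.append(line)
        else:
            archived.append(line)
    return kept, archived
```

In this picture the archived lines are the ones appended to archive.log, and the kept lines form the new, shorter log.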

CUSTOMISING THE DESIGN
LogSite logs can be visually customised in several ways...
  • Using the Header and Footer parameters, HTML can be added at the beginning and end of each log page, using images, text, links and/or backgrounds.
  • The table colours can be varied using the TABLE_BACK parameters.
  • Finally, two small coloured gifs are used to create the graphs. dot_graph.gif is used to draw the graph columns. It comes supplied as red (FF0000), but can be altered to a different colour with a graphics package if you wish. dot_back.gif is used as a 'blanker' to show the slice of the year in some of the graphs. It is supplied as black, but can also be altered if you prefer.
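
One common way of drawing graph columns from a single small gif is to stretch it with the WIDTH and HEIGHT attributes of an image tag. The Python sketch below shows that idea; it is an assumption about the general technique rather than a description of LogSite's own output, and the function name, bar sizes and ALT text are hypothetical.

```python
def bar_html(value, max_value, max_height=100, width=12, image="dot_graph.gif"):
    """Return an <IMG> tag that stretches a small gif into one graph column."""
    if value <= 0 or max_value <= 0:
        return ""  # nothing to draw for an empty column
    # Scale the column against the largest value; keep at least one pixel.
    height = max(1, round(max_height * value / max_value))
    return '<IMG SRC="%s" WIDTH="%d" HEIGHT="%d" ALT="%d">' % (
        image, width, height, value
    )
```

Because every column reuses the same tiny image, recolouring dot_graph.gif in a graphics package changes every graph at once.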
APPENDIX 1 - Identifying visitors
All visitors to a site are identified by a unique IP address, contained in the logs. No useful information can be obtained from the address in itself, but it will correspond to a domain name that describes a unique account. Finding that name is a 'reverse DNS' procedure, which requires looking up each IP address on the web and takes time (do not set your web server to perform this, unless you seriously want to slow down your sites). Hence LogSite does not itself perform reverse DNS.

It can, however, accept an IP address/domain file that has been prepared earlier in batch mode. This has the simple comma-separated quote-qualified format: ip address,domain. LogSite then uses this to prepare statistics. There are many ways available to prepare a file like this - we recommend a Cold Fusion tag called CFX_GetIPHostName from Ben Forta of Stone Age Software.

Besides this, LogSite offers a helping hand with these procedures. For a start, it de-duplicates the DNS-IP file to save processing time for the next run (it is easy to end up with duplicate records). It also outputs a simple file, ip.txt, which consists of a list of 'unresolved' ip addresses that can then be fed into CFX_GetIPHostName. This file is a comma-separated list, with no quote qualifiers and no line returns.
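
The two housekeeping steps described above can be sketched in Python as follows. The function names are hypothetical, and keeping the first entry when an IP address is duplicated is an assumption; the output of unresolved_ips follows the ip.txt description (comma-separated, no quotes, no line returns).

```python
import csv


def load_dns_map(text):
    """Read "ip","domain" lines into a dict, dropping duplicate IP entries."""
    dns = {}
    for row in csv.reader(text.splitlines(), skipinitialspace=True):
        if len(row) >= 2 and row[0].strip():
            # Assumption: the first entry for an IP address wins.
            dns.setdefault(row[0].strip(), row[1].strip())
    return dns


def dump_dns_map(dns):
    """Write the de-duplicated map back in the same quote-qualified format."""
    return "".join('"%s","%s"\n' % (ip, domain) for ip, domain in dns.items())


def unresolved_ips(log_ips, dns):
    """List each IP seen in the log but absent from the map, once, in log order."""
    pending = dict.fromkeys(ip for ip in log_ips if ip not in dns)
    return ",".join(pending)
```

The string returned by unresolved_ips is the kind of list that could then be passed to a reverse DNS tool before the next run.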

APPENDIX 2 - Tailoring the agent file
Since the world of browsers (and other agents, like robots) is ever-changing, LogSite allows the user to specify which agent/browsers can be searched for. The agent file takes a very simple format: search string, display text (quote-qualified). An example file is supplied.

Note that the analysis can be made as fine or coarse as desired. For example the demo example runs:

"MSIE 1","Microsoft Internet Explorer 1.x"
"MSIE 2","Microsoft Internet Explorer 2.x"
"MSIE 3","Microsoft Internet Explorer 3.x"
"MSIE 4","Microsoft Internet Explorer 4.x"
"Mozilla/1","Netscape 1.x"
"Mozilla/2","Netscape 2.x"
"Mozilla/3","Netscape 3.x"
"Mozilla/4","Netscape 4.x"
"Lynx","Lynx"

(Note the order here - MSIE must come first, because Microsoft cleverly puts Mozilla-compatible fields in its user agent entry. Reversing MSIE and Netscape here would mean that no MSIE entries would be identified.)

It would be quite possible, however, to get really simple and just do...

"MSIE","Microsoft Internet Explorer"
"Mozilla","Netscape"

Or really picky and go for...

"MSIE 4.01","Microsoft Internet Explorer 4.01"
"MSIE 4.0","Microsoft Internet Explorer 4.0"
...
"Mozilla/4.04","Netscape 4.04"
...
and so on - it's really up to you.
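
The note on ordering suggests that the first matching search string decides how an agent is counted. The Python sketch below shows that first-match behaviour using the entries from the demo example; the function name is hypothetical, and case-sensitive substring matching is an assumption.

```python
# (search string, display text) pairs in the order given in the demo example.
AGENT_RULES = [
    ("MSIE 1", "Microsoft Internet Explorer 1.x"),
    ("MSIE 2", "Microsoft Internet Explorer 2.x"),
    ("MSIE 3", "Microsoft Internet Explorer 3.x"),
    ("MSIE 4", "Microsoft Internet Explorer 4.x"),
    ("Mozilla/1", "Netscape 1.x"),
    ("Mozilla/2", "Netscape 2.x"),
    ("Mozilla/3", "Netscape 3.x"),
    ("Mozilla/4", "Netscape 4.x"),
    ("Lynx", "Lynx"),
]


def classify_agent(user_agent, rules=AGENT_RULES):
    """Return the display text of the first rule whose search string occurs."""
    for search, display in rules:
        if search in user_agent:
            return display
    return None  # agent not covered by the file
```

An MSIE user agent such as "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" contains both "MSIE 4" and "Mozilla/4", so it is only counted as Internet Explorer when the MSIE lines come first.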

APPENDIX 3 - Spotting search robots
One of the unique features of LogSite is its ability to scan for robot agents - whether these be from the main search engines or those pesky email address thieves. It can therefore give you an idea as to how well your site has been indexed.

To achieve this in an ever-changing world, LogSite uses a user-configurable file (an example covering the major search engines is included). This has rather a more complex format than the browser file, since robots can be identified in a number of different ways.

The format is:

Search string 1, Search string 2, Search string 3, Robot Name, Owner's Name, Owner's URL, Comments

Here's an example for Alta Vista's Scooter:

"scooter.pa-x.dec.com","204.123.9.20","Scooter","Scooter","Alta Vista","http://www.altavista.digital.com/","Digital Equipment's comprehensive web indexer"

The first two search strings are each used to search both the IP addresses and the DNS names (if present). The third search field is used to search the User Agent field only. The rest of the fields are used to set up the page. Comments are optional, but can be a nice touch.

The reason for the search complexity is to ensure that robots can be identified even for logs that contain no user agent information. Two IP/domain search fields are needed, because some spiders use more than one. The search strings can be partial, of course - for example "204.123.9." would identify all IPs of the form 204.123.9.nnn.
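
The matching rules just described can be sketched in Python as follows: the first two strings are tried against the IP address and the domain, the third against the user agent. The class and function names are hypothetical, and plain case-sensitive substring matching is an assumption.

```python
from typing import NamedTuple


class RobotRule(NamedTuple):
    """One line of the robot file, in the field order given above."""
    search1: str
    search2: str
    agent_search: str
    name: str
    owner: str
    owner_url: str
    comments: str = ""


def matches_robot(rule, ip, domain="", user_agent=""):
    """True if the visitor's IP, domain or user agent identifies this robot."""
    for needle in (rule.search1, rule.search2):
        # Each IP/domain search string is tried against both fields.
        if needle and (needle in ip or needle in domain):
            return True
    # The third search string applies to the User Agent field only.
    return bool(rule.agent_search) and rule.agent_search in user_agent
```

Because the strings are matched as substrings, a rule whose first field is "204.123.9." would match any visitor address of the form 204.123.9.nnn, even when the log holds no user agent information.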

If you are interested in knowing more about spiders and other robots on the web, try http://www.botspot.com/

APPENDIX 4 - Configurable user search
The visitor analysis page of LogSite provides a number of breakdowns, but you may have specific visitors or groups of visitors that you wish to count or be alerted to, whether they be from Sheffield University or America Online. The very simple user search file allows you to do just this. It has the straightforward format:

search string, display text (both quote-qualified)

An example file is included. The search string can be used to match any part of a domain/host, thus "aol.com" will count all visits from AOL people.
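
As a rough illustration of how such a search file could be applied, the Python sketch below counts the hosts that contain each search string. The function name is hypothetical, and counting a host under every search string it matches is an assumption; the document does not say how overlapping strings are handled.

```python
def count_visitors(hosts, searches):
    """Count hosts per display text; searches is a list of (string, text) pairs."""
    counts = {display: 0 for _search, display in searches}
    for host in hosts:
        for search, display in searches:
            # The search string may match any part of the domain/host.
            if search in host:
                counts[display] += 1
    return counts
```

For instance, a search line of "aol.com","America Online" would count every host whose name contains aol.com.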

Happy logs...
Fin Fahey
Fiona Daly

LEGAL DISCLAIMER
Neither Albedo Systems Ltd. nor anyone else who has been involved in the creation, production or delivery of this product shall be liable for any direct, indirect, consequential or incidental damages (including damages for loss of business profits, business interruption, loss of business information, and the like) arising out of the use or inability to use this product even if Albedo Systems Ltd. has been advised of the possibility of such damages.