Website Log Analysis
Copyright Albedo Systems Ltd 1998
http://www.albedo.co.uk/
17 March 1998
LOGSITE.EXE is the stand-alone version of Albedo Systems' Windows NT/95 LogSite log analysis program. It enables site logs to be generated as a scheduled task under NT. (For example, you could update the logs for all your smaller sites every midnight.)

The program should also prove useful to non-Cold Fusion users. Although LogSite is specifically designed to operate with user logs produced using Cold Fusion logging tags, users of other systems, such as ASP, can generate such user logs from their pages too, using Perl or other means. Furthermore, two companion programs are provided with this program (COMBCONV.EXE and CFX_COMMCONV.EXE) that will convert NCSA/CERN Combined and Common access logs to a format that can be used by LogSite, albeit, in the case of Common logs, with less information available.

CONDITIONS AND PURCHASING

LOGSITE.EXE 1.1 is commercial software. It may not be distributed except in the form of a time-limited demo version. The price for the full version is $70 (US), or £45 (UK). All users who register before 21 June 1998 will receive free upgrades to the software when these become available. (The accompanying conversion programs, COMBCONV.EXE and CFX_COMMCONV.EXE, are freeware and may be freely used and distributed.)

This freely downloadable version is time-disabled - it will not analyse any logs or portions of logs outside the period 21 March 1998 to 30 April 1998.

We're afraid that we must ask for snail-mail international money orders or UK cheques (an e-commerce solution should be available soon). Please make them payable to Albedo Systems Ltd and send them to...

    Albedo Systems Ltd
    268 Amhurst Road
    Stoke Newington
    London N16 7UP
    United Kingdom

...making sure that you state your email address - LogSite is compact enough to be distributed by email, and we are adopting this as the simplest method.

All technical queries with regard to the use of LogSite should be addressed to admin@albedo.co.uk.
If you use our software we'd, of course, appreciate it if you credited us/linked to us, but you know you don't have to. We'll be doing more stuff soon, so please do drop by our website at http://www.albedo.co.uk/

SUMMARY

LogSite is unique as a low-cost solution to log analysis:
The program analyses user logs with the following comma-separated format:

    Date,Time,IP Address,User Agent,Page,Page Description,Referring Page

Here is an example line:

    "06-Feb-98", "06:25 PM", "127.0.0.0", "Mozilla/3.0", "c:\webs\mysite\htdocs\index.cfm", "Main Index Page", "http://www.someone_else.co.uk/links.htm"

Logs of this kind can be set up in various ways, but pose no special problem for Cold Fusion users. A number of freeware CF_ logging tags are available, including the excellent CF_LOG from Ben Forta of StoneAge Software. To take CF_LOG as an example, this works as any CF_ tag would - the tag file, log.cfm, is placed in the directory where pages are to be logged. Each page to be logged will then contain a line of the sort...

    <CF_Log FILE="c:\webs\mysite/logs/mysite.log" TEXT="Main Index Page">

...which will append the appropriate page data to the log file. Unfortunately, while LogSite is fully compatible with the CF_LOG tag provided on Allaire's site, that tag does not supply Page Referral information. A slightly modified version of the tag that does so is supplied with this program.

LOG CONVERSION

There is another approach to using LogSite, and that is to convert the existing access logs (NCSA format) on the server to LogSite-compatible format. These come in two basic formats, Common and Combined, depending on how the systems administrator has configured the server.
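Whichever route produces the log, each line ends up in the seven-field, quote-qualified format described above. As a rough illustration only (this is not LogSite's own code), the following Python sketch reads one such line. The field identifiers and the date/time layout are assumptions inferred from the single example line in this document.

```python
import csv
from datetime import datetime

# Field order follows the format line in the text; the identifiers are our own.
FIELDS = ["date", "time", "ip", "agent", "page", "description", "referrer"]

def parse_log_line(line):
    """Split one quote-qualified, comma-separated log line into a dict."""
    # skipinitialspace tolerates the blank after each comma in the example line.
    row = next(csv.reader([line], skipinitialspace=True))
    if len(row) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(row)))
    record = dict(zip(FIELDS, row))
    # Assumed layout: DD-Mon-YY and 12-hour HH:MM AM/PM, as in the example line.
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%d-%b-%y %I:%M %p")
    return record

example = ('"06-Feb-98", "06:25 PM", "127.0.0.0", "Mozilla/3.0", '
           '"c:\\webs\\mysite\\htdocs\\index.cfm", "Main Index Page", '
           '"http://www.someone_else.co.uk/links.htm"')
record = parse_log_line(example)
```

A log written by a different tag or converter may format dates differently, in which case the strptime pattern would need adjusting.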
READING THE LOG

When LogSite is run, it generates a series of HTML pages with figures and charts describing your website usage (up to eight pages, as of version 1.1 - later versions will be more complex). All the pages are hyperlinked - main sections are as follows:

1. Log Summary: Front page to the log, with a few summary figures and links to the other pages.

2. Page Analysis: This will show which pages are most popular. It contains:
4. Referral Analysis: This lists the external pages that are referring to your site. It contains a complete list of them, plus a Top 20 graph.

5. Visitor Analysis: For this to work properly, a 'reverse DNS' operation should be performed to provide an ip address/domain cross-reference file (see Appendix 1). Otherwise the program can only tell you how many unique visitors the site has had. Assuming that a reference file has been supplied, the program will supply the following breakdowns:
Hence a simple breakdown into Netscape, MS Internet Explorer and others is possible, as is an analysis that covers every single issue of these browsers. Agents other than browsers (for example robot spiders) can also be scanned for. For a description of the Agent configuration file see Appendix 2. A useful example file is included with the program. Agent Analysis gives a Top 20 graph of agents visiting, plus a table giving figures for all agents in the file.

7. Robot Analysis: It can be helpful to know whether, and how often, the search engines' robot spiders have accessed your site. This page does just that, and relies on a file supplied by the user to identify the spiders. (An example, covering the main search engines, is supplied. For a complete description of the file format see Appendix 3.) For each search engine/robot that has visited, the page gives a 'radar scan' of visits over the period of the log, plus a listing of all pages visited (and, presumably, indexed). Search engine analysis, it should be noted, is unique to LogSite at the time of writing.

INSTALLATION

The zip file that contains this file should contain:
USAGE

LogSite is runnable from the DOS prompt. It also requires a parameter file to run successfully. You may put this wherever you want, but the program defaults to a file called Logsite.ini placed in the same directory as the program. If, on the other hand, you wish to use a different parameter file, the usage is

    logsite [filename]

where [filename] is the full pathname of the parameter file.

Here's an example of a parameter file - all the attributes must be on separate lines. (Bear in mind that not all of these parameters are mandatory.)

    LOGFILE="d:\mysite\logs\site.log"
    DIRECTORY="d:\mysite\analysis\"
    DOMAIN_NAME="http://www.mysite.co.uk/"
    SUB_DOMAIN1="homesites/"
    SUB_DOMAIN2="users/"
    TITLE="My Site"
    HEADER="d:\mysite\analysis\header.cfm"
    FOOTER="d:\mysite\analysis\footer.cfm"
    TABLE_BACK1="##80FFFF"
    TABLE_BACK2="##C0FFC0"
    PATH="d:\mysite\htdocs\"
    IP_FORBID1="255.255.255.255"
    IP_FORBID2="127.0.0.0"
    REFERRALS="Y"
    IP_DOMAIN_NAME="d:\mysite\logs\dns.txt"
    BROWSER_FILE="d:\mysite\logs\agents.txt"
    ROBOT_FILE="d:\mysite\logs\robo.txt"
    VISITOR_FILE="d:\mysite\logs\search.txt"
    LOG_CYCLE="31"

The parameters...

LOGFILE (mandatory). The absolute pathname for the log file you wish to analyse.

DIRECTORY (mandatory). Points to the directory where you wish your log pages to be generated.

DOMAIN (optional, defaults to NULL). The fully qualified domain name for your site. LogSite uses this to determine whether or not a referring page is external to your site. Without it, visit processing will not return any results.

SUB_DOMAIN1 and SUB_DOMAIN2 (both optional, default to NULL). These two strings are also used in the referral processing. Referrals show all sites external to the user domain from which people have visited the site. However, you may have directories inside your site that you wish to regard as 'external' - for example users or homesites that you are hosting within your site.
If you wish to know if you are getting any visits and referrals from these, put a string that uniquely defines the directories you wish excluded in one or both of these fields. With "users/", for example, all references of the form http://www.mysite.co.uk/users/... or http://www.mysite.co.uk/.../users/... will be regarded as external.

TITLE (optional, defaults to NULL). The name of your site. In fact this is only used if you settle for the default page header.

HEADER (optional, defaults to NULL). The path name of an HTML file that will be used as the header for the page. This allows you a great measure of customisation - you can provide your own background, logo and so on. If absent, this defaults to a simple header with the site title.

FOOTER (optional, defaults to NULL). The path name of an HTML file that will be used as the footer for the page. As with the header, this allows you yet more customisation. If absent, this defaults to a simple footer that closes off the <BODY> and <HTML> tags.

TABLE_BACK1 (optional, defaults to FFFF80 (pale yellow)). Background colour for the main labels on the graphs and tables. Note the Cold Fusion double hash.

TABLE_BACK2 (optional, defaults to C0C0C0 (light grey)). Second background colour for the histograms. Note the Cold Fusion double hash.

PATH (optional, defaults to NULL). An entirely cosmetic string, used to clip unnecessary information from the start of the page names.

IP_FORBID1 and IP_FORBID2 (optional, default to NULL). If you are interested in your true log counts, then you probably don't want to count site accesses by the site designer and/or owner. If you know the appropriate IP numbers, you can ensure that these accesses are not included in the analysis here.

REFERRALS (optional, defaults to "Y"). External referral information may not be available for your site, particularly if you are using a converted NCSA log.
If so, a referral page is not generated. You may also wish to switch the option off yourself if you just want to run a skeleton log, in which case an "N" is appropriate here.

IP_DOMAIN (optional, defaults to NULL). This is the pathname of the ip address/domain reference file that the program uses to generate the visitor page. This file will have to be produced by using an external Reverse DNS procedure. More about all this, including the file format, is provided in Appendix 1.

BROWSER_FILE (optional, defaults to NULL). This is the pathname of the browser (agent) reference file that the program uses to generate the agent analysis page (assuming that the log contains user agent information). This file can be as simple or complex as you wish, up to a limit of 120 user-specified browsers or agents. A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 2.

ROBOT_FILE (optional, defaults to NULL). This is the pathname of the robot (spider) reference file that the program uses to generate the robot analysis page (assuming that the log contains user agent information). This file can be as simple or complex as you wish, up to a limit of 120 search engines. A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 3.

VISITOR_FILE (optional, defaults to NULL). The pathname of a user-definable search file that the program uses to generate extra visitor analysis (for example you may be interested in all users from AOL, or from a given university). There is a limit of 200 search strings in this file (warning - overdoing this could seriously affect program performance). A simple example file is provided with the program. More about all this, including the file format, is provided in Appendix 4.

LOG_CYCLE (optional, defaults to NULL).
If you are running LogSite online, perhaps on a daily basis, it is helpful to be able to clip the log to a convenient size, possibly arranged from midnight to midnight. If LOG_CYCLE is specified, the log is clipped to the number of days specified. Extra lines are removed from the start of the log and appended to a file called archive.log in the log directory and, for good measure, the original log is backed up as log.bak.

CUSTOMISING THE DESIGN

LogSite logs can be visually customised in several ways...
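As a recap of the parameter file described under USAGE, here is a minimal Python sketch of reading one KEY="value" pair per line and applying the documented defaults. It is an illustration only, not LogSite's actual parser. The tolerance for surrounding whitespace and the error message are our own assumptions, and the parameter names are taken from the example file above.

```python
import re

# One KEY="value" pair per line, as in the example parameter file.
PARAM_LINE = re.compile(r'^\s*([A-Z0-9_]+)\s*=\s*"(.*)"\s*$')

# Defaults stated in the text; parameters that default to NULL are simply absent.
DEFAULTS = {"TABLE_BACK1": "FFFF80", "TABLE_BACK2": "C0C0C0", "REFERRALS": "Y"}
MANDATORY = ("LOGFILE", "DIRECTORY")

def read_parameters(text):
    """Return a dict of parameters, checking that the mandatory ones exist."""
    params = dict(DEFAULTS)
    for line in text.splitlines():
        match = PARAM_LINE.match(line)
        if match:
            params[match.group(1)] = match.group(2)
    missing = [name for name in MANDATORY if name not in params]
    if missing:
        raise ValueError("missing mandatory parameter(s): " + ", ".join(missing))
    return params

sample = r'''
LOGFILE="d:\mysite\logs\site.log"
DIRECTORY="d:\mysite\analysis\"
TITLE="My Site"
LOG_CYCLE="31"
'''
params = read_parameters(sample)
```

Here params["REFERRALS"] falls back to "Y" because the sample omits it, and a file lacking LOGFILE or DIRECTORY is rejected.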
All visitors to a site are identified by a unique IP address, contained in the logs. No useful information can be obtained from this in itself, but it will correspond to a useful domain name, describing a unique account. Finding that name, the 'reverse DNS' procedure, requires looking up each ip address on the web, which takes time (do not set your web server to perform this, unless you seriously want to slow down your sites). Hence LogSite does not itself perform reverse DNS. It can, however, accept an ip address/domain file that has been prepared earlier in batch mode. This has the simple comma-separated quote-qualified format:

    ip address,domain

LogSite then uses this to prepare statistics. There are many ways available to prepare a file like this - we recommend a Cold Fusion tag called CFX_GetIPHostName from Ben Forta of Stone Age Software.

Besides this, LogSite offers a helping hand with these procedures. For a start, it de-duplicates the DNS-IP file to save processing time for the next run (it is easy to end up with duplicate records). It also outputs a simple file, ip.txt, which consists of a list of 'unresolved' ip addresses that can then be fed into CFX_GetIPHostName. This file is a comma-separated list, with no quote qualifiers and no line returns.

APPENDIX 2 - Tailoring the agent file

Since the world of browsers (and other agents, like robots) is ever-changing, LogSite allows the user to specify which agents/browsers can be searched for. The agent file takes a very simple format:

    search string, display text (quote-qualified)

An example file is supplied. Note that the analysis can be made as fine or coarse as desired.
For example, the demo file runs:

    "MSIE 1","Microsoft Internet Explorer 1.x"
    "MSIE 2","Microsoft Internet Explorer 2.x"
    "MSIE 3","Microsoft Internet Explorer 3.x"
    "MSIE 4","Microsoft Internet Explorer 4.x"
    "Mozilla/1","Netscape 1.x"
    "Mozilla/2","Netscape 2.x"
    "Mozilla/3","Netscape 3.x"
    "Mozilla/4","Netscape 4.x"
    "Lynx","Lynx"

(Note the order here - MSIE must come first, because Microsoft cleverly puts Mozilla-compatible fields in its user agent entry. Reversing MSIE and Netscape here would mean that no MSIE entries would be identified.)

It would be quite possible, however, to get really simple and just do...

    "MSIE","Microsoft Internet Explorer"
    "Mozilla","Netscape"

Or really picky and go for...

    "MSIE 4.01","Microsoft Internet Explorer 4.01"
    "MSIE 4.0","Microsoft Internet Explorer 4.0"
    ...
    "Mozilla/4.04","Netscape 4.04"
    ...

and so on - it's really up to you.

APPENDIX 3 - Spotting search robots

One of the unique features of LogSite is its ability to scan for robot agents - whether these be from the main search engines or those pesky email address thieves. It can therefore give you an idea as to how well your site has been indexed. To achieve this in an ever-changing world, LogSite uses a user-configurable file (an example covering the major search engines is included). This has rather a more complex format than the browser file, since robots can be identified in a number of different ways. The format is:

    Search string 1, Search string 2, Search string 3, Robot Name, Owner's Name, Owner's URL, Comments

Here's an example for Alta Vista's Scooter:

    "scooter.pa-x.dec.com","204.123.9.20","Scooter","Scooter","Alta Vista","http://www.altavista.digital.com/","Digital Equipment's comprehensive web indexer"

The first two search strings are both used to search both the ip addresses and DNS (if present). The third search field is used to search the User Agent field only. The rest of the fields are used to set up the page. Comments are optional, but can be a nice touch.
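The matching rules just described can be sketched in Python as follows. This is only our reading of the text, not LogSite's actual code: the function names are hypothetical, and plain case-sensitive substring matching is an assumption.

```python
import csv

# Column order follows the robot file format given above; identifiers are our own.
ROBOT_FIELDS = ["search1", "search2", "search3", "name", "owner", "url", "comments"]

def load_robots(lines):
    """Read quote-qualified robot entries into a list of dicts."""
    robots = []
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue
        # Comments are optional, so pad short rows with empty strings.
        row = row + [""] * (len(ROBOT_FIELDS) - len(row))
        robots.append(dict(zip(ROBOT_FIELDS, row)))
    return robots

def identify_robot(robots, ip, domain="", agent=""):
    """Return the name of the first robot entry that matches, else None."""
    for robot in robots:
        # Search strings 1 and 2 are tried against both the ip address and the DNS name.
        for needle in (robot["search1"], robot["search2"]):
            if needle and (needle in ip or needle in domain):
                return robot["name"]
        # Search string 3 is tried against the User Agent field only.
        if robot["search3"] and robot["search3"] in agent:
            return robot["name"]
    return None

robot_file = [
    '"scooter.pa-x.dec.com","204.123.9.20","Scooter","Scooter","Alta Vista",'
    '"http://www.altavista.digital.com/","Digital Equipment\'s comprehensive web indexer"'
]
robots = load_robots(robot_file)
```

With this entry, a visit is attributed to Scooter if its ip address or resolved domain contains one of the first two strings, or if its user agent contains "Scooter".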
The reason for the search complexity is to ensure that robots can be identified even for logs that contain no user agent information. Two IP/domain search fields are needed, because some spiders use more than one. The search strings can be partial, of course - for example "204.123.9." would identify all IPs of the form 204.123.9.nnn.

If you are interested in knowing more about spiders and other robots on the web, try http://www.botspot.com/

APPENDIX 4 - Configurable user search

The visitor analysis page of LogSite provides a number of breakdowns, but you may have specific visitors or groups of visitors that you wish to count or be alerted to, whether they be from Sheffield University or America Online. The very simple user search file allows you to do just this. It has the straightforward format:

    search string, display text (both quote-qualified)

An example file is included. The search string can be used to match any part of a domain/host, thus "aol.com" will count all visits from AOL people.

Happy logs...

Fin Fahey
Fiona Daly

LEGAL DISCLAIMER

Neither Albedo Systems Ltd. nor anyone else who has been involved in the creation, production or delivery of this product shall be liable for any direct, indirect, consequential or incidental damages (including damages for loss of business profits, business interruption, loss of business information, and the like) arising out of the use or inability to use this product even if Albedo Systems Ltd. has been advised of the possibility of such damages.