The first program is “Loading the web server logs using a user-specified date range”. Then I have to preprocess these logs to form the sequence database. Preprocessing includes:
1. Data cleaning (remove requests for image files such as .jpg; remove every request whose status code is not 200, i.e. not successful).
2. User identification. I want to use the cs-username field in the IIS log, but there's a problem. *** I use Forms Authentication with no anonymous login, but the cs-username field is still “-” in the logs. I have to solve this problem when I write the BookStore web site. *** Otherwise I have to use the IP address and User-Agent to identify users, but there's a problem with that too, because I use one PC (Windows XP, IIS 5.1, VS2005) to test my web site, so my IP address is always “localhost”. Most other web usage mining theses use logs from many sites, so they don't use cs-username. In my case I built my own web site and I use a login username, so I think the cs-username field should be filled with the login username; but I've heard that if I use IIS authentication, the cs-username field is filled with DOMAIN\USER, like the Windows logon user.
3. Session identification. After user identification (using cs-username, or the IP and User-Agent heuristic), I have to sessionize with a timeout (30 minutes by default). Another way is to use the Referer.
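The three preprocessing steps could be sketched roughly like this in Python. This is only a minimal sketch under assumptions, not my actual program: it assumes the logs are in W3C extended format with a `#Fields:` line and the fields `date`, `time`, `c-ip`, `cs-username`, `cs-uri-stem`, `sc-status` and `cs(User-Agent)` enabled. The function names, the list of image extensions and the sample log lines are all made up for illustration. It only does the timeout method, not the Referer method.

```python
from datetime import date, datetime, timedelta

# Assumed list: the notes only name .jpg, the others are illustrative.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")
SESSION_TIMEOUT = timedelta(minutes=30)

# Made-up sample log, not real data.
SAMPLE_LOG = """\
#Fields: date time c-ip cs-username cs-method cs-uri-stem sc-status cs(User-Agent)
2008-01-05 10:00:00 127.0.0.1 alice GET /Default.aspx 200 Mozilla/4.0
2008-01-05 10:00:01 127.0.0.1 alice GET /images/logo.jpg 200 Mozilla/4.0
2008-01-05 10:05:00 127.0.0.1 alice GET /Books.aspx 200 Mozilla/4.0
2008-01-05 10:06:00 127.0.0.1 alice GET /Missing.aspx 404 Mozilla/4.0
2008-01-05 11:00:00 127.0.0.1 alice GET /Cart.aspx 200 Mozilla/4.0
2008-01-05 10:02:00 127.0.0.1 - GET /Login.aspx 200 Mozilla/4.0
2008-02-01 09:00:00 127.0.0.1 alice GET /Default.aspx 200 Mozilla/4.0
"""


def parse_w3c(lines):
    """Yield one dict per request, taking column names from the '#Fields:' line."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip malformed lines
                yield dict(zip(fields, values))


def clean(records, start, end):
    """Keep status-200, non-image requests dated within [start, end]."""
    for rec in records:
        when = datetime.strptime(rec["date"] + " " + rec["time"], "%Y-%m-%d %H:%M:%S")
        if not (start <= when.date() <= end):
            continue
        if rec["sc-status"] != "200":
            continue
        if rec["cs-uri-stem"].lower().endswith(IMAGE_EXTENSIONS):
            continue
        yield when, rec


def user_key(rec):
    """Use cs-username when it is filled; otherwise fall back to IP + User-Agent."""
    name = rec.get("cs-username", "-")
    if name != "-":
        return name
    return rec.get("c-ip", "-") + "|" + rec.get("cs(User-Agent)", "-")


def sessionize(cleaned, timeout=SESSION_TIMEOUT):
    """Start a new session for a user whenever the gap between requests exceeds the timeout."""
    sessions = []       # list of (user, [pages])
    last_seen = {}      # user -> (index into sessions, time of last request)
    for when, rec in sorted(cleaned, key=lambda pair: pair[0]):
        user = user_key(rec)
        state = last_seen.get(user)
        if state is None or when - state[1] > timeout:
            sessions.append((user, []))
            index = len(sessions) - 1
        else:
            index = state[0]
        sessions[index][1].append(rec["cs-uri-stem"])
        last_seen[user] = (index, when)
    return sessions


if __name__ == "__main__":
    records = parse_w3c(SAMPLE_LOG.splitlines())
    for user, pages in sessionize(clean(records, date(2008, 1, 1), date(2008, 1, 31))):
        print(user, pages)
```

With the sample lines, alice's requests at 10:00 and 10:05 end up in one session and her 11:00 request starts a new one, because the gap is longer than 30 minutes. The request with “-” in cs-username is keyed by IP and User-Agent instead.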
After sessionization, I'll get the sequence database (sid, sequence). I have to show this output to the teacher; this is the first part.
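The (sid, sequence) output could be produced from the sessions along these lines. Again this is just a sketch with assumed details: it assumes each session is a (user, list of pages) pair, that sid is simply a running number starting at 1, and that a tab-separated text line per session is good enough to show. The names `build_sequence_database` and `format_rows` are hypothetical.

```python
def build_sequence_database(sessions):
    """Turn (user, pages) sessions into (sid, sequence) rows, skipping empty sessions."""
    non_empty = [pages for _user, pages in sessions if pages]
    return [(sid, list(pages)) for sid, pages in enumerate(non_empty, start=1)]


def format_rows(database):
    """Render one 'sid<TAB>page page ...' line per session for display."""
    return "\n".join(str(sid) + "\t" + " ".join(sequence) for sid, sequence in database)


if __name__ == "__main__":
    # Made-up sessions for illustration only.
    example = [
        ("alice", ["/Default.aspx", "/Books.aspx"]),
        ("alice", ["/Cart.aspx"]),
    ]
    print(format_rows(build_sequence_database(example)))
```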