Being assigned a data warehousing/data mining project for class sounds like fun, but where am I supposed to get a data set? I can buy a database of all area codes and exchanges with latitude and longitude, but I would still have to simulate a hundred million records to address scalability and query optimization issues. Then I could find out if my estimations of the size of records is within a factor of ten, but the networks I see still wouldn't be "real" and I would have no idea if that's what real social networks looked like. (As an undergrad English Lit major, I was reading 18th Century epistolary novels instead of taking Data Structures like my Computer Science major classmates. The sad part is that Data Strutures would have been more interesting.)
Fortunately, data magically appear on my Linux box every day.
Each morning at four am, logwatch runs on my Fedora Core 4 (Red Hat Linux) box. It tells me how many times nonexistent files on my webserver have been requested, and how many router firewall violation attempts have been logged. It also tells me how many times Apache logged a "method not allowed" 405 code. I have several daily log files that give me useful information on attacks. The problem is that there are so many attacks that if I banned every IP that looked for a web application hole or probed a port I wouldn't have time for anything else.
So it makes sense to look for attack source IP (Internet Protocol) addresses that probe my router AND request holes in web apps. To do this I need three files: my router log from syslog; and two greps of all my Apache logs. (grep -h will suppress file names at the beginning of each line) looking for 404 and 405 errors. This gives me three tables, from which I can do inner joins on source IP in each. Of course, I have do do some tedious data cleanup to get the text log files into Excel and from there Access. (I always underestimate the time it takes to clean up data.) From Access, I'm going to go to SQL 2005, Analysis Services, and build a cube. From there I should be able to "see" the attacks using Pivot Tables in Microsoft Excel.
If I see a source IP in my router log and Apache error logs, then it's probably worth banning. Correlating IP addresses to identify those involved in multiple methods of attack takes me from hundreds of IP addresses down to six.