Grad School: May 2006 Archives

Summer may be the best semester at George Washington University because all the undergrads are gone. The Marvin center is almost completely vacant, and the food court is closed. I am taking two electives this summer, which will give me more electives than I need. Why am I doing this? I’m actually learning something useful. People keep asking me why I don’t go for the CIO Certification. Instead of doing databases, I’d be taking MGT 272 Information Resource Management and MGT 274 Survey of Advanced IT Technologies. I have yet to read a single government job description that even mentions CIO Certification. (The Government Services Administration “invented” the CIO Certification, but the Office of Personnel Management sets job standards.)

I’ve taken academic classes in Information Systems at GWU for a while, and I have also taken classes at Learning Tree in SQL, Exchange, Solaris, Security and the programming language, C. The more advanced my classes get at GWU, the more they resemble a Learning Tree class, with one exception: at GWU, they teach theory and practice. At Learning Tree, it’s just practice. You can learn how Microsoft SQL Server works without learning a thing about normalization. Learning Tree is training database administrators, not database designers. My professor for both classes, John Artz, argues that vendor certifications will include more theory in the future; otherwise the vendor certs will become less relevant over time.

So what are my classes? Data Warehouse Design; and Database and Expert Systems. Database and Expert Systems includes no expert systems (I covered them in Decision Systems), and is mostly T-SQL for Microsoft SQL Server 2005. Data Warehouse Design is mostly theory with some implementation on SQL 2005 Analysis Services. Relational database theory hasn’t changed in about 30 years now, so you’d think I would have learned more relational database management systems earlier, but hey, it’s easy to become distracted with security and email and the web servers and Linux. Not that I haven’t used plenty of relational databases as back-ends to applications – I just didn’t think about the relational algebra that drives my queries.

While I can’t post class notes here, I can tell you which books the professor has chosen. When I’m searching for technology books, it’s hard to tell which ones are good.
Database and Expert Systems:
Dusan Petkovic’s SQL Server 2005: A Beginner’s Guide
Ken England’s Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook. (There isn’t one for SQL 2005 yet.)
Data Warehouse Design:
Ted Lachev’s Applied Microsoft Analysis Services and Microsoft Business Intelligence Platform 2005.
(And the John Artz manuscript.)
Both classes:
Connolly and Begg’s Database Systms. This is the book I wish my database class from last semester used. It covers a lot more than that textbook: Hoffer, Prescott and McFadden’s Modern Database Management.

If you’ve ever seen Being John Malkovich, you should remember what happens when Malkovich himself enters the Malkovich portal: everyone is John Malkovich and all they say is “Malkovich Malkovich.” On the Internet these days, I feel like it’s “Google Google.”

After months of waiting, I got my free Google Analytics invitation. I had set up Urchin reporting on a number of sites years ago, and I was disappointed that I couldn’t buy Urchin anymore because Google had swallowed them whole. Google Analytics is even better than Urchin was, and it includes Urchin’s campaign and e-commerce tracking modules. I also liked Urchin a lot more than Webtrends, which seems to have grown into bloatware since my first (positive) encounter with Webtrends in 1999. The trouble with Webtrends and Omniture is that they are focused on big business, because that’s where the big money is. Google Analytics is simple enough use on this blog, but can scale as large as is needed. GA gives me the same tools that the big guys have. Thus, Google will make its money scaling out rather than scaling up.

Unlike Urchin, Google Analytics doesn’t run on my server. The JavaScript script goes back to Google, and I’m sure they can also see how many people are visiting my sites. On the up side, I didn’t have to remember to reconfigure Apache logging.
This is what I used to have to enter as root into httpd.conf and hope that I didn't make any typos:

LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"" special 

Also, the Googlebot doesn’t show up in Analytics reports. Browsers that reject third-party cookies won’t be track-able. And Google will know a lot more about all of our browsing habits once more sites start implementing Google Analytics.

As far as the markets go, Google Analytics is a shot at other web analytics packages, and it may surpass them because Google can integrate your Google Adwords into your Google Analytics. (Malkovich Malkovich.) This is something that others can’t do, because Omniture and Overture are separate. Of course, Google Analytics, Adwords, and everything else Google integrates with your Gmail account. Before long, Google may be able to offer IP address resolution into Google Earth. (Malkovich Malkovich Malkovich)

I also recently created Google sitemaps for a couple of sites. Google uses sitemaps to find new pages and changed pages on sites faster. The sitemap generator is a python script that looks at your log files, your web file system and creates an XML file that the Googlebot can download. After creating the sitemaps, the two have now gotten a lot of hits from the Googlebot: 1671 for this site, and 5,000 or so for the other site, which gets a lot more traffic than mine. By getting me to do some of the work of indexing, Google indexing operates more efficiently.

Before long, Google will know more about traffic on the web than the NSA knows about traffic on the Public Switched Telephone Network.

Between the phone records, emails, and instant messaging, the NSA can now map the social and business fabric of America and see how it is changing over time. By collecting our phone records, the NSA has created a dataset that can map our social connections. As I've mentioned in my previous post, the NSA can now map the calls we make and the calls our callees make, creating a giant tree of connections. A wrong number might get you placed on a no-fly list, but I'm not going into morality. (The current administration is the de facto law, so legality becomes a moot argument.) I'm just going to explore how they might use this data and what they could find from it.

The phone call data structure (date/time, origin number, destination number, duration, ID) could easily be used to scan emails and instant messaging as well: date/time, origin address, destination address, length, ID). Using the same reflexive queries, the NSA can track our social and business connections through email. They could also tell how those annoying chain letters get spread (Good TImes virus warning: forward this to all your friends). However, the size of each record would be larger, probably more than double the ~48 bytes required for a phone record. The size would slow down the query and require more storage. Also, geocoding the emails would be more difficult, although it could be done based on where the person writing them maintains a physical address, or on the originating Internet Protocol address.

With all this data, the NSA is in a position to know the difference in social networks that use telephones and those that use email. If they've had this program in place for long enough, they could even recognize trends social communication. With the right set of queries, you could spot the difference between neighborhoods where people know each other and neighborhoods where nobody knows each other and plot them in different colors on a map. What percentage of Americans calls their mom on Mother's Day? In which neighborhoods do people call their mothers most often?

It would also be somewhat easy to plot our connections in a GIS tool such as Google Earth. You would just export your target connection tree into an XML kmz/kml file and import it into Google Earth or ESRI's ArcGIS. The graphs derived from the data would by anonymous. (Of course, research ethics might prevent us from using the data.)

Sadly, none of the derived information will ever see the light of day. Academically, it would produce fascinating results that could teach us about how we communicate with each other and how our communications are changing. Then again, no local, state, or federal government information sytem about citizens has ever been immune from misuse. IRS workers read their neighbors' tax returns. Cops run license "plates for dates," and use NCIC (National Criminal Information Center) criminal background checks to help those running for election denounce their opponents. If a dataset about us exists, it will be abused by those with access, but appropriate security controls should prevent that.

Today, USA Today reported that the National Security Agency has been collecting domestic phone records of many of us U.S. citizens. Unlike everyone else blogging on this today, I'm taking no position on the ethicality of this activity. Instead, I'm going to tell you what I would do with those phone records from the perspective of a database geek. There's plenty of other analysis going on elsewhere, and I'm no constitutional lawyer.

I've been using Vonage for a while now, and I have access to my own phone records on the computer. It's easy enough to cut and paste my Vonage call records into Excel and from there into Access. From Access, I can easily export/import them into the Relational Database Mangement System of my choice, which for now is MS SQL 2005. However, there are many more out there.

Each records looks something like this: Date, Time (you can combine these into a LongDate), From phone number, To phone number, Duration, and a unique transaction ID. I get all this for incoming and outgoing calls. It's great for anyone that does billing for phone time. I'm assuming that these are the same kind of records that the NSA gets. Once the NSA gets these records, they do a data transform to make all the fields fit into their system in a uniform manner. Since the data is already fairly simple, they don't have to do much, and even a moderately skilled programmer like me could write something to transfer phone records almost as fast as they could get them.

If I had phone records from other people, I could combine them with my phone records into one massive table (relation, in database-speak). I could then do a reflexive query on them to pull a list of all the people I had contact with, through incoming or outgoing calls. I could then do another query to pull all contacts of all the people who had called me; this would show my my friends' friends. If I had access to more data about the phone numbers, say through geocoding (a fancy way of saying latitude and longitude attached to each phone number), I could create a map and track a phone tree. If I call someone in New York, and they call someone in Paris, and the person in Paris calls someone in Amman, I could draw lines making the connections on a map.

For this level of tracking to work, the NSA has to have absolutely all the phone records they can possibly get their hands on. If they have a target talking to someone and that someone talks to someone else and the NSA's records drop at the first friend of the target, they're lost. It would be a dead end. If they get all the records, the creation of a massive data warehouse that shows connections between people is pretty much academic. The budget for doing all this has dropped dramatically over recent years: you might be able to do it with a couple of Netezza data warehouse appliances. Rumor has it the NSA was Netezza's first customer. All the hardware to do it might cost under a million dollars. The tricky part, as with all data mining projects, is getting good data, and the NSA has that problem solved.

The hardest part left for them is scalability: they're trying to drink from a firehose, but the records aren't that big, which makes it feasable. You might be able to store all the number-only data in a record as short as 40 bytes: LongDate, Number, Number, Number, Number. (I'm not going to get into data types in depth here, but let's assume we can store phone numbers as numbers and not text to save space.) Thus one million phone records would occupy 40 megabytes. If the US makes a hundred million phone calls a day, that's about 4 GB a day of data. Large, but manageable if you have a large budget. Even if you double the key identifier size to 16 bytes (to cover hundreds of millions of calls) you're still only up to 4.8 GB per 100 million calls.

Only after you've identified a target would you want to create a join query that connects names and addresses with phone numbers; this would be far more efficient than attaching names to the phone record tables, and would give the NSA a chance to say they're recording numbers only. If the NSA uses a consumer data company like, say, Acxiom, to get information on phone numbers post-targeting, then they're not even subject to the Freedom of Information Act or US Privacy Law.

The end result is that the NSA has the capability to map our social and business networks; given enough time and hardware, they could even plot them on satellite photos, creating a cool mish-mash of lines across neighborhoods. They could create files on us all like Friendster lists our friends and their connections. Whether the NSA's system actually works efficiently, we'll never know.

The Job Interview

| | Comments (0) | TrackBacks (0)

I used to think it happened only to me, and then I read Dilbert last Saturday. My interviewers are asking more and more real-world technical questions about real problems they have. More often than not, I can solve them in a couple of sentences. And later I wonder if they really had a job to offer or if they were just looking for cheap solutions.

Even months later, I can see users from my interviewers' companies reading my blog entries about how IIS, SQL, AD, and SharePoint work together. Thank you, Google Analytics. And no, I would never name your company here, because it just wouldn't be professional.

I have three finals: one Monday evening and two Wednesday evening. The bad part about having two finals on one evening is that it's two exams for which I have to prepare. The good part is crossover between the two classes: both cover neural networks, although it's definitely on my database exam, my decision support systems class covered it better. Although the Marakas text had problems, including a typo and a cut-off paragraph, it explained neural nets better than the database textbook or the db professor did, and I've gone over his powerpoint several times looking for a good definition. Marakas:

neural networks attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn.

One type of human intelligence is the ability to recognize patterns, and then learn to recognize patterns better. Thus, neural networks are one form of artificial intelligence.

If you search the web, though, you'll find as many definitions of neural networks as you do result pages.

About this Archive

This page is a archive of entries in the Grad School category from May 2006.

Grad School: April 2006 is the previous archive.

Grad School: June 2006 is the next archive.

Find recent content on the main index or look in the archives to find all content.