Wednesday, February 27, 2008

Correlating votes on Digg with website traffic

Anonymous Prof has written a very nice post analyzing the amount of traffic that flows to websites from front page stories on Digg. S/he took website traffic data from Alexa and investigated the change in traffic that resulted from a page on each site making it onto the front page of Digg. For large websites the effects were negligible, but for smaller sites (less than 0.0005% of pageviews on Alexa) the effects were quite apparent. Traffic increased by around 0.1% per vote on Digg, meaning that a story receiving 500 diggs (votes) would temporarily increase traffic to the site hosting the content by around 50%. The effect on very small sites (those not indexed by Alexa) would obviously be much bigger, and such a surge often causes the webserver to go down under load.
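As a quick sanity check on that arithmetic, here is a minimal sketch. It assumes the increase is simply linear in the number of votes at the quoted rate of roughly 0.1% per digg; the function and constant names are mine, not Anonymous Prof's.

```python
# Assumption: the temporary traffic increase grows linearly with votes,
# at roughly 0.1% per digg (the estimate quoted for smaller sites).
INCREASE_PER_DIGG_PERCENT = 0.1


def estimated_traffic_increase(diggs: int,
                               per_digg_percent: float = INCREASE_PER_DIGG_PERCENT) -> float:
    """Return the estimated temporary traffic increase, in percent."""
    if diggs < 0:
        raise ValueError("diggs must be non-negative")
    return diggs * per_digg_percent


if __name__ == "__main__":
    # A story with 500 diggs: 500 * 0.1% = about 50%.
    print(f"{estimated_traffic_increase(500):.0f}% estimated increase")
```

The linear form is only a convenient reading of the per-vote figure; the real relationship may well flatten out or vary from site to site.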

Anonymous Prof's post on visualizing the last.fm social network is also worth a look for the cool pictures.

Tuesday, February 26, 2008

Staying anonymous online: part 2

I wrote recently about the amount of personal information that each of us is making public and putting online, sometimes without even realizing it. In the post, I stated that it should be possible to link the various profiles of a particular user across different social networking sites, even if the user has completely different/anonymous user-ids on the different sites. Turns out that there already exists a company that does exactly that ....

The website is called Spokeo, and its aim is to let you keep track of everything that your friends are up to on a bunch of different social networking sites, including last.fm, Amazon, MySpace, LinkedIn, YouTube, Flickr, Digg and Facebook. According to Phil Bradley, who tried it out, the system is able to link the profiles of your friends on different sites, even if they use different login names on each site.

What kind of information can you see? Well, anything that's public and anything else you have access to, according to the website:
"All users can track information that is publicly available. Only users who have access to private content on a third party site can track that private content in Spokeo."
The thing is, there is quite a lot of publicly available data about pretty much everybody on the net. A service like this, which aggregates the data and makes it easier to find and search, means that the inherent anonymity that comes from there being so much data spread over so many websites no longer applies. Data about you can now be integrated, stored and archived. So what you did a number of years ago on some obscure website may well come back to haunt you in the future.

Sound fanciful? Imagine for example that you are applying for a job and the prospective employer has a look at your profile on LinkedIn and then links to your Digg profile only to discover that you particularly like stories about some ultra-liberal political candidate ... or that you're into guns ... or maybe you've shown a passing interest in a certain religious sect. - All this information wouldn't be hard to find, but it's probably not the stuff you want on your CV!

Also, it looks like Spokeo isn't the only "social content aggregator" and that these sites are becoming quite common, according to this post on TechCrunch.

Friday, February 15, 2008

Most Popular Data Mining Algorithms

Just saw this post on the Data Mining Research blog about a summary paper on the top 10 data mining algorithms as voted by the program committees of KDD, ICDM and SDM. The paper gives a really nice introduction to and summary of a bunch of particularly useful algorithms. It is a great starting point for somebody interested in finding out more about data mining. Some of the sections are better written than others - I particularly liked the discussions regarding support vector machines (SVM) and k-nearest neighbor (kNN) classification.

Friday, February 8, 2008

Predicting Airfares using Machine Learning

Good research ideas can be turned into interesting and successful businesses. Case in point: Farecast, a company that provides an airfare search engine (similar to Orbitz). The website distinguishes itself by using machine learning techniques to predict the future price of airfares based on historical data. It then recommends whether a user should buy immediately (because the price is likely to go up) or wait (because the price is likely to fall).
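To make the buy-or-wait idea concrete, here is a deliberately naive sketch. It is not Farecast's method: I have swapped in a simple moving-average trend check in place of any real learned model, and the function name, window size and price series are all made up for illustration.

```python
from statistics import mean
from typing import Sequence


def recommend(price_history: Sequence[float], window: int = 3) -> str:
    """Return "buy" or "wait" for a fare, given its past prices (oldest first).

    Toy rule (an assumption, not a real fare predictor): compare the average
    of the last `window` prices with the average of the `window` prices
    before that. A flat or rising trend means "buy" now; a falling trend
    means "wait" for a lower price.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(price_history) < 2 * window:
        raise ValueError("need at least 2 * window price observations")
    recent = mean(price_history[-window:])
    earlier = mean(price_history[-2 * window:-window])
    return "buy" if recent >= earlier else "wait"


if __name__ == "__main__":
    rising = [300, 300, 305, 310, 320, 330]   # hypothetical daily fares
    falling = [330, 320, 310, 300, 290, 280]  # hypothetical daily fares
    print(recommend(rising), recommend(falling))
```

A real system would learn from many routes and features (days until departure, airline, and so on) rather than from a single price series, but the output has the same shape: one recommendation per query.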

I'm interested in the company in particular because it is based on technology developed by my thesis advisor Craig Knoblock (together with other researchers at USC and the University of Washington). The original idea was described in this paper "To buy or not to buy: Mining airline fare data to minimize ticket purchase price" by Oren Etzioni, Craig A. Knoblock, Rattapoom Tuchinda, and Alexander Yates. The paper was published at KDD back in 2003, but is still a very good read. - Needless to say, I am a big fan of Craig's research. He has a knack for discovering new research problems that are both interesting and practical (and sometimes even commercializable).

According to a post on TechCrunch, the company has just started predicting airfares on some international routes. This will be very useful for those of us not living in the US and who spend a significant portion of our incomes on airfares. - I might be able to travel home and see the family a bit more often.

Thursday, February 7, 2008

Digging into the workings of social news aggregators

An ex-colleague of mine, Kristina Lerman, wrote a very interesting paper analyzing and modeling the process in which news items are discovered and make their way onto the front page of the social news aggregation site Digg. The paper is called "Social Information Processing in Social News Aggregation".

Interesting points made in the article regarding the social nature of the system include:
  • People subscribe to feeds of news items that are posted or voted for ("dugg") by their friends, increasing greatly the probability of viewing (and thus voting on) news items that their friends found interesting.
  • This "social filtering" phenomenon is necessary in order to deal with the huge number of stories that are posted to Digg each day.
  • Stories propagate very fast through the system: anecdotal evidence suggests that Digg surfaces them faster than clustering-based approaches such as Google News (which makes sense, as a story must break on multiple news sources before it will appear on a clustering-based news aggregator).
  • According to the mathematical model devised to explain the process by which news items are propagated to the front page, a highly relevant/interesting story (relevance=0.9) may not get onto the front page if the original poster is not well connected - i.e. doesn't have a large number of readers.
  • Conversely, relatively uninteresting articles (relevance=0.1) may make it to the front page if the contributor is particularly well connected (i.e. famous).
Reading the article made me think of another article I read recently on using agent-modeling techniques to analyze human social behavior. This time the author was interested in the divergence of opinions rather than the consensus opinion (i.e. newsworthiness on Digg). In his article "Mobility and Social Network Effects on Extremist Opinions", André Martins created agent-based simulations of public opinion so as to analyze the emergence of extremist positions in society. The finding in that case was that more interaction between diverse groups could lower the overall amount of extremism.

(Thanks to the physics arXiv blog for bringing my attention to the second article. - They point out that in this age of terrorism fears, closing borders and restricting travel may well have a negative effect on curbing extremism.)

Friday, February 1, 2008

Gaming in the virtual real world

Google's 3D Warehouse is a repository of user-generated 3D models that can be viewed in Google Earth. Anybody can build models using SketchUp and contribute them to the collection. People have already modeled lots of buildings, bridges, and in some cases entire cities (see Adelaide for example). In true user-generated content style, users can vote on the quality of the models and the best of them are then included by default in Google Earth.

Since the modeling data is free to use/reuse and in an open format (KML), it looks particularly interesting from a data integration perspective. As this collection grows, it should provide a fun dataset for research in spatial data integration. One simple integration problem that comes to mind is:
"find me an office in downtown LA with price less than X and a view out to the ocean"
Answering this query would involve integrating data from different real estate websites along with the 3D building models. Another example integration problem, this time for terrorism defense:
"find all rooftops with a clear line of sight onto a convoy traveling along roads XYZ"
In this case, vector data (street maps) would need to be integrated with the 3D building models.
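The office query above can be sketched as a simple join between two sources. Everything here is hypothetical: the record types, the field names, and in particular the `has_ocean_view` flag, which I assume has already been computed from the 3D models by some line-of-sight analysis that is not shown. Matching on a normalized address string is also a stand-in for proper record linkage.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Listing:
    """Hypothetical record scraped from a real estate website."""
    address: str
    monthly_price: float


@dataclass
class BuildingModel:
    """Hypothetical summary of a 3D building model."""
    address: str
    has_ocean_view: bool  # assumed precomputed from the 3D geometry


def normalize(address: str) -> str:
    """Crude address normalization; real data would need proper linkage."""
    return " ".join(address.lower().split())


def offices_with_view(listings: Iterable[Listing],
                      models: Iterable[BuildingModel],
                      max_price: float) -> List[Listing]:
    """Join listings with building models and keep cheap offices with a view."""
    view_by_address = {normalize(m.address): m.has_ocean_view for m in models}
    return [
        listing for listing in listings
        if listing.monthly_price < max_price
        and view_by_address.get(normalize(listing.address), False)
    ]


if __name__ == "__main__":
    listings = [Listing("100 Main St", 4000.0), Listing("200 Hill St", 2500.0)]
    models = [BuildingModel("100  main st", True), BuildingModel("200 Hill St", False)]
    print(offices_with_view(listings, models, max_price=5000.0))
```

The interesting research problems are exactly the parts this sketch waves away: matching records across sources and deriving the view from the 3D geometry.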

Anyway, the point of this post was to highlight something very cool (and not so research related) that I saw on the blog digital urban. In the post, the writers take the free content from the 3D Warehouse and add it to a video game, so that game play can take place in a real (virtual) city instead of an imagined one. There is a video of an attack helicopter flying around over London! Looks like you'll soon be able to defend (or invade) your favorite tourist destination.