Wednesday, March 19, 2008

Degrees of separation in social networks

It's a common belief that there are six degrees of separation between any two individuals on the planet. In other words, if you consider all the people you know, and then all the people they know, and continue for six iterations, you will cover (almost) everybody on the planet. With the advent of online social networks and the ability to mine them for statistics, it's interesting to see how well this "six degrees" rule really works in practice ....

Researchers at Microsoft and CMU recorded one month of chat conversations from the Microsoft's Instant Messaging (IM) client and analyzed the data - not the content of the conversations, just metadata regarding the users (age, sex, location) and the conversations (length, time, number of messages). The resulting social network is a biggie, with 240 million users and 30 billion conversations. The paper is called "Planetary-Scale Views on an Instant-Messaging Network" by Jure Leskovec and Eric Horvitz. [Thanks to the physics arXiv blog for pointing me to the paper.]

Interestingly they discover the average minimum path length between connected users (99.9% of all users) was 6.6. That is to say, that there was indeed 6 degrees of separation between users over the one month of data collected. Obviously, since IM users are not a representative sample of the whole world population (not many grandparents use instant messaging), we would expect them to be more highly connected than the rest of the population. My guess is that this average path length would drop significantly as the time window was increased from just one month of data.

Other notable statistics from the paper are that almost all conversations (99%) involve only 2 people and that we converse more frequently and longer with members of the opposite sex than they do with their own.

Another interesting study on social networks can be found in "Measurement and Analysis of Online Social Networks" by Alan Mislove, Massilmiliano Marcon, Krishna P. Gummadi, Peter Druschel and Bobby Bhattacharjee. [Thanks to Cyrus Hall for pointing me to the paper.] The interesting thing about this study is that they compare different types of social networks to see which properties they all share in common. The networks investigated are Flickr, YouTube, LiveJournal and Orkut, i.e. a photo-sharing, a video-sharing, a blogging and a friendship network.

The average minimum path length between users (in the main connected component) for the different networks was:
  • Flickr: 5.67 (1.8 million users, 26.9%)
  • LiveJournal: 5.88 (5.2 million users, 95.4%)
  • Orkut: 4.25 (3.0 million users, 11.3%)
  • YouTube: 5.10 (1.1 million users, coverage unknown)
The numbers in brackets are the number of users in the sample and the estimated coverage of the social network. The fact that links in Orkut are bidirectional ("friendship connections" are reciprocal) probably explains why the average path length is lower in that case.

The thing I find intriguing is that the path length for the other three networks are all pretty much within rounding error of one another. And even the values for Orkut and IM aren't that far off. - For comparison, the average path length between pages on the Web is 16 if the direction of hyperlinks is taken into account (and 7 if it isn't).

The paper also comes to other interesting conclusions. They find that for networks with directed connections between users, both the number of incoming connections to a node, and the number of outgoing connections follow power-law distributions. They also find that nodes with high in-degree tend to have high out-degree.

It is well known that hyperlinks in the Web graph also follow power-law distributions (see for example this paper), so this result for social networks was to be expected I guess. A second result regards the importance of "highly connected individuals" in the network and is a little more surprising:
The networks each contain a large, densely connected core. Overall, the network is held together by about 10% of the nodes with highest degree. As a result, path lengths are short, but almost all shortest paths of sufficient length traverse the highly connected core.
I'm wondering if the last result has anything to do with the limited sample made of each social network or the fact that crawling itself biased the network structure detected. - Although from the description of the crawling procedure given in the paper, it sounds like they tried their utmost to make sure their sampling process introduced as little bias as possible.