LOD cloud shows surprisingly lumpy structure

The prototypical Linked Open Data map gives the general impression of a richly interlinked set of bubbles. However, a small experiment showed that this first impression is very wrong! Christophe Gueret from the VU Amsterdam reconstructed the LOD link table as a .net file, which we then displayed using simple stress minimisation in Visione. This revealed some surprises:

Surprise 1: There is not one cloud, but three. As the graph visualisation shows, each of the three has dense internal connections, with only sparse connections between them. The three sub-clouds are also clearly recognisable: one is bio/life-sciences data, one is (surprisingly) academic bibliographic material, and the central cloud is “all the rest”, connecting the other two, with DBpedia as its hub.

Surprise 2: DBLP is as important as DBpedia. Also surprisingly, the combined betweenness centrality of the relatively unknown DBLP datasets is almost as high as that of the widely recognised DBpedia hub. Together, the three DBLP instances account for 25% of the total betweenness, close to DBpedia's 28%. The reason for this high betweenness is that the DBLP sets are the only link between the bibliographic sub-cloud and “the rest”.
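To make the betweenness argument concrete, here is a minimal sketch in Python (standard library only). It runs Brandes' algorithm on a small made-up graph, not on the real LOD link table: the node names ("hub", "bridge", "bib1", and so on) and the edges are purely illustrative assumptions, and links are treated as unweighted and undirected. It only shows why a node that is the sole connection between a cluster and the rest picks up a large share of the betweenness.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph {node: set(neighbours)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy graph: a central hub, a bridge node into a bibliographic
# cluster, a small bio cluster and two miscellaneous datasets.
edges = [
    ("hub", "misc1"), ("hub", "misc2"), ("misc1", "misc2"),
    ("hub", "bio1"), ("bio1", "bio2"),
    ("hub", "bridge"), ("bridge", "bib1"), ("bridge", "bib2"), ("bib1", "bib2"),
]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

bc = betweenness(adj)
total = sum(bc.values())
share = {v: b / total for v, b in bc.items()}
for v in sorted(share, key=share.get, reverse=True):
    print(f"{v:7s} {share[v]:.1%}")
```

In this toy graph the hub carries half of the total betweenness and the bridge about 31%, even though the bridge has only three links. The percentages for the real cloud depend on the actual link table, and on whether links are treated as directed.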

So now the questions are: Is this good or bad? Is this surprising or obvious? Is this long-term structural or just a short-term coincidence? Anybody? (A first experiment would be to take the density of the links between the bubbles into account, and see if this changes anything. The .net file is here for you to experiment with; please share your results.)
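One simple way to start on the link-density experiment is sketched below in Python, under stated assumptions. It assumes the .net file follows the basic Pajek layout (a *Vertices section with one numeric id and one quoted label per line, then *Arcs or *Edges lines of the form "source target weight"), and that the weight column holds the number of links between two datasets. The sample data, the node names and the threshold are made up for illustration; `parse_pajek` and `components` are hypothetical helper names, not part of Christophe's script.

```python
def parse_pajek(text):
    """Parse a minimal Pajek .net file into (node labels, weighted links)."""
    labels, links, section = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if line.startswith("*"):
            section = line.split()[0].lower()
            continue
        parts = line.split()
        if section == "*vertices":
            # Assumes only an id and a quoted label, no layout coordinates.
            labels[parts[0]] = line.split(None, 1)[1].strip('"')
        elif section in ("*arcs", "*edges"):
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            links.append((labels[parts[0]], labels[parts[1]], weight))
    return list(labels.values()), links

def components(nodes, links, min_weight=0.0):
    """Connected components, ignoring direction and links below min_weight."""
    adj = {n: set() for n in nodes}
    for a, b, w in links:
        if w >= min_weight:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Made-up sample in Pajek format; the weights are NOT real link counts.
sample = """*Vertices 5
1 "hub"
2 "bridge"
3 "bib1"
4 "misc1"
5 "bio1"
*Arcs
1 4 5000
2 3 8000
1 2 10
1 5 20
"""
nodes, links = parse_pajek(sample)
print(len(components(nodes, links)))                  # all links kept
print(len(components(nodes, links, min_weight=100)))  # sparse links dropped
```

With the made-up sample, keeping every link gives a single component, while dropping links with fewer than 100 instance-level connections splits it into three. Whether the real cloud falls apart in the same way is exactly what the experiment would have to show.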

15 Responses to “LOD cloud shows surprisingly lumpy structure”

  1. Richard Cyganiak Says:

    Nice work! The three clusters didn’t surprise me—the biomed/HCLS cluster is dominated by datasets resulting from the bio2rdf project, and the academic publications cluster is dominated by the work of Hugh Glaser’s team at RKBExplorer. The third cluster is much more diverse.

    Anyways, don’t read too much into the diagram. Often we were forced to make arbitrary decisions: How many links are enough to warrant an arrow between datasets? Should a given multi-part project be represented as one circle or several circles? It’s just a highly simplified map of the linked data out there on the web, and analysing the map illuminates the choices of the map’s creators more than it illuminates the actual data cloud.

  2. FrankVanHarmelen Says:

    @Richard: the graph is built from the table of linked datasets, not from your diagram. So it should be accurate. There are no judgements on when to include/exclude a link (assuming the data on http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics are correct?)

  3. FrankVanHarmelen Says:

    Nice work by Rinke Hoekstra: http://twitpic.com/17qj1h/full shows the same graph, now using Gephi. Size of nodes indicates betweenness, color indicates cluster (automatically detected). Layout is Force Atlas with a repulsion factor of 10000.

  4. FrankVanHarmelen Says:

    Christophe Gueret’s script for generating the .net file is at http://bit.ly/aho1Ut

  5. Pieter De Leenheer Says:

    Great to see the network effect here !

  6. PaulGroth Says:

    Interesting observation by @Richard: two of the major clusters of the LOD (scholarly data and bio data) are each run by a single organization. So if Southampton and Bio2rdf collapse, these chunks die. It would be nice to see more organizations hosting, or to make sure that these organizations have enough support so that they can ensure uptime.

  7. Dan Brickley Says:

    Re the ‘two orgs’ vulnerability; since this is all open shared data, you would hope that the data would still be useful and used even if the original source drops off the ‘net. Of course RDF and the Linked Data model especially put particular weight on domain-name based HTTP URIs. So I guess we should take care to pick ‘em carefully, cos changing them afterwards is going to be tricky…

  8. Joshua Shinavier Says:

    Also check out Marko Rodriguez’s analysis from March 2009: http://arxiv.org/abs/0903.0194
    This paper likewise describes several distinct clusters of linkage. Some other observations:
    1) the LOD graph is not strongly connected
    2) a diameter of 8 is large given the relatively small size of the cloud
    3) data sets in the cloud have nearly identical incoming and outgoing link patterns (which indicates that the majority of links are reciprocal owl:sameAs)

  9. Pieter De Leenheer Says:

    The formation of critical hubs is typical of scale-free networks. Amazing that knowledge networks organise according to the same pattern as social networks. It would be interesting to compare both networks here and make correlations. This would indeed lead to the conclusion is the main champion driving discussions within the initiative. Where is the SIOC graph :-) ?

  10. Hugh Glaser Says:

    Nice work indeed.
    Hopefully Anja (and others) won’t have to spend their time doing the cloud by hand any more - thanks for the efforts over the years!

    Looks like it is showing how disconnected RKB is from the rest of the world.

    However, it isn’t actually quite that bad, you will be pleased to hear. There are quite a lot of point-wise links from different RKB stores to dbpedia and others. It is just that I haven’t found the time to put in small numbers of links (and didn’t want to make the table overloaded with RKB stuff).
    And actually there is a whole unlocode RKB site missing (with links to geonames and datagovuk I think), and probably some others.

    A great thing about these pictures is that it is now easy for me to see such things, and dataset publishers will feel encouraged to keep the data up to date.

  11. Kingsley Idehen Says:

    Belated comments.

    Nice post.

    The Linked Open Data realm is really a lumpy galaxy. It certainly isn’t a DBpedia based Solar System :-)

    Links:

    1. http://bit.ly/90tUKJ — old post titled State of the Linked Data Web.

  12. Jose Manuel Gomez-Perez Says:

    Nice work. IMHO, this kind of visualization can provide useful insight for improving LOD from an architectural viewpoint, so to say, e.g. to detect scalability issues, maximize domain coverage, support detection of redundancies, etc.

    On the other hand, something that can be drawn from this particular work is a notion of the domains where LOD is currently having the greatest impact: Bio and Biblio, complemented with general-purpose knowledge from the DBPedia cluster. However, this can be seen the other way around: probably these domains are the best represented ones because their respective communities have traditionally had a deeper involvement in LOD.

  13. Tom Morris Says:

    Any chance of a scalable version (SVG?) of these graphics? The Gephi version is better, but some sections are still too dense to read labels.

    The assumption that the starting data table is accurate may not be valid. For the spot checking I did, it was missing significant chunks of stuff. For example, Freebase and DBpedia are basically fully bidirectionally connected for the things they have in common (80%+ of DBpedia?), but this isn’t represented at all. Freebase also links to MusicBrainz and a number of other data sources.

    One other thing worth noting is that it appears the diameter of the nodes now represents in-degree rather than number of triples stored.

  14. CKAN->network exporter for the LOD Cloud « Semantic Web world for you Says:

    […] year ago, we posted on the LarkC blog a first network model of the LOD cloud. Network analysis software can highlight some aspects of the cloud that are not directly visible […]

  15. CKAN->network exporter for the LOD Cloud - Knowledge Representation and Reasoning Group Says:

    […] year ago, we posted on the LarkC blog a first network model of the LOD cloud. Network analysis software can highlight some aspects of the cloud that are not directly visible […]
