Citation analysis: winner takes all

November 7, 2012

A small group of papers (1%) often gets a disproportionate share of attention and citations (17%). This pattern was identified long ago (the Web of Science selection procedure is one example of the trend). A short correspondence by Barabási, Song and Wang, published recently in Nature, revealed that the pattern only emerges after some time: the top 1% of papers are not necessarily cited heavily immediately after publication. The authors argue that the pattern might result from our changing reading habits now that academic publications are so abundant, easily searchable and therefore easily accessible: “Researchers increasingly rely on crowd sourcing to discover relevant work, a process that favours the leading papers at the expense of the remaining 99%”.

Read the full correspondence on the Nature website.

SMiLE: But who is going to read 12,000 tweets?!

July 9, 2012

A second blog post about the SMiLE project I am involved in appeared recently on the London School of Economics website. I wrote about the project’s aims before, when Nicole Beale and Lisa Harris first explained them on the LSE website. This second blog post offers a first glimpse of the results, including a short discussion of Twitter network visualization and analysis. Exciting!

In fact, this second blog post reveals some of the really cool work the project members have been up to. MSc students here in Southampton have been busy using the collected social media data in creative ways for their projects. The project is also working with the Oxford e-Research Centre on a best-practice guide to using social media at conferences. But that’s not all! We are also working on depositing the entire social media archive with the Archaeology Data Service in York, and publishing some of the results in Internet Archaeology.

The rest of the blog post goes on to discuss some of the issues surrounding all this. How does one go about depositing an electronic social media archive? Lisa and Nicole looked into some of the comments of the conference delegates, provided in feedback forms, to get a more qualified picture of the issue and how to proceed. The blog also discusses the issue of developing an interface through which this dataset can be explored. Mark Borkum and I are looking at using network analysis tools for this. More on the network side of things will be revealed in later posts.

Have a look at the original article, definitely worth a read!

Awesome art by Aaron Koblin

May 27, 2011

Just saw this TED talk by Aaron Koblin, a digital artist whose work has inspired me for a while now. His art shows, in stunning examples, that the data available all around us relates in unexpected ways. If we bother to add things up, like he does for hand-drawn sheep, Johnny Cash still images, flight patterns, $100 bills and even voice samples, we see surprising things emerge that you would not expect from looking at a single image or sample. His work on flight patterns is stunning and was recently exhibited at MoMA in New York.

Check out his work online. And check out his talk below. Believe me, you’ll be surprised!

Blog updated

February 16, 2010

I recently updated the entire contents of this blog’s pages, to reflect the new aims of the Archaeological Network Analysis project. The bibliography has been expanded with a long list of archaeological and non-archaeological works on network analysis. I also added my own publications to the bibliography.

All the information on this blog is still very much a work in progress. You can find an outline of the project, the dataset we use, the preliminary methodology, and an explanation of how the resulting networks should be understood.

Data online

September 15, 2009

I made a temporary website where all the project’s networks are available. Feel free to have a look and explore the data yourself. At present the networks can only be viewed in Internet Explorer, but please contact me if even this browser doesn’t work.

Method update: beta-skeletons

July 21, 2009

This second update of the project’s method concerns the distance networks based on beta-skeletons described in an earlier blog post. We mentioned that the reconstruction of ancient trade routes is extremely complex as a number of variables should be taken into account, so our best bet is to focus on one parameter that might have been influential in determining trade routes. Using beta-skeletons and graph theory we will investigate whether the distance between centre of production and site of deposition is reflected in the ceramic evidence and whether it significantly influenced the selection of trade routes.
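
To make the idea of a beta-skeleton more concrete, here is a minimal Python sketch of the lune-based definition for beta ≥ 1 (beta = 1 gives the Gabriel graph). It is an illustration only and not the ArcGIS calculator used in the project: the function name, the site labels and the planar coordinates are all invented, and real site coordinates would first have to be projected onto a plane.

```python
from itertools import combinations
from math import dist


def beta_skeleton(points, beta=1.0):
    """Return the edges of the lune-based beta-skeleton (beta >= 1).

    points maps a site label to planar (x, y) coordinates.
    """
    if beta < 1:
        raise ValueError("this sketch only covers beta >= 1")
    edges = set()
    for (a, pa), (b, pb) in combinations(points.items(), 2):
        radius = beta * dist(pa, pb) / 2
        # Centres of the two discs whose intersection must stay empty.
        c1 = tuple((1 - beta / 2) * u + (beta / 2) * v for u, v in zip(pa, pb))
        c2 = tuple((beta / 2) * u + (1 - beta / 2) * v for u, v in zip(pa, pb))
        blocked = any(
            dist(pr, c1) < radius and dist(pr, c2) < radius
            for r, pr in points.items()
            if r not in (a, b)
        )
        if not blocked:
            edges.add(frozenset((a, b)))
    return edges


# Invented coordinates: C lies between A and B, so the direct A-B link is dropped.
sites = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (1.0, 0.5)}
for edge in sorted(sorted(e) for e in beta_skeleton(sites)):
    print(edge)
```

Two sites are linked only when no third site falls inside the region between them, which is why the skeleton keeps short "neighbourly" connections and drops long ones that pass close to another site.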

Although we mentioned in a previous post that the beta-skeleton would be compared with a reconstruction of trade routes based on the shortest path for every sherd from centre of production to site of deposition over this beta-skeleton, we now have to confess that this is nonsense: we would be comparing the beta-skeleton with a slightly altered version of itself, one that rests on a large number of assumptions concerning the intermediary sites. We realized that these shortest paths actually contain the hypothesis that we are testing, as they represent trade routes based on the ceramic evidence in which distance surpasses all other factors in importance.

To create such a network of trade routes we will make a beta-skeleton in which every site has at least one connection, so that all of them are reachable. This will be done in ArcGIS with a beta-skeleton calculator programmed by Dr. Graeme Earl, applied to all the sites in the database and their geographical coordinates. For every sherd the shortest path in geographical distance from centre of production to site of deposition over this beta-skeleton will be calculated in Pajek (although this can be done in ArcGIS, Pajek is able to calculate geographical as well as graph theoretical shortest paths). The value of an edge will represent the number of sherds passing between two sites, and edges with a value of zero will be discarded.
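
The routing step described above can be sketched in a few lines of Python. This is a hedged illustration using only the standard library, not the Pajek or ArcGIS workflow itself; the site labels, edge lengths and sherd counts are invented, and `build_graph`, `shortest_path` and `sherd_flows` are hypothetical helper names.

```python
import heapq
from collections import Counter


def build_graph(edge_lengths):
    """Turn {(site_a, site_b): distance} into a symmetric adjacency dict."""
    graph = {}
    for (a, b), length in edge_lengths.items():
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of sites on the path, or None."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        travelled, node = heapq.heappop(queue)
        if node == target:
            break
        if travelled > best[node]:
            continue  # stale queue entry
        for neighbour, length in graph[node].items():
            candidate = travelled + length
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


def sherd_flows(graph, sherds):
    """Count sherds per edge; edges that carry no sherds never appear."""
    flows = Counter()
    for origin, destination, count in sherds:
        path = shortest_path(graph, origin, destination)
        if path is None:
            continue
        for a, b in zip(path, path[1:]):
            flows[frozenset((a, b))] += count
    return flows


# Invented skeleton edges (with lengths) and (production, deposition, sherds) records.
skeleton = build_graph({("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0, ("A", "D"): 2.5})
records = [("A", "C", 5), ("A", "D", 2), ("B", "C", 1)]
for edge, value in sherd_flows(skeleton, records).items():
    print(sorted(edge), value)
```

In this toy example the edge between C and D carries no sherds, so it is absent from the result, which mirrors the rule that edges with a value of zero are discarded.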

At this point we have a reconstruction of the trade routes over which the vessels would have been transported if the distance between start and end point had been the only factor taken into consideration by their transporters. This network embodies the hypothesis we want to test, which can be done by comparing it to another network visualisation of the ceramic evidence. The networks of co-presence described in the previous post will provide this basis for comparison, as they do not contain any assumptions of their own (before their analysis, that is).

Now, there is an obvious danger of comparing things with different meanings, so we need to be very clear about which aspects of both networks will be used for comparison. We will focus on a couple of phenomena that we think are represented in both types of networks: bridges and centrality.

A bridge is a line whose removal increases the number of components in the network (de Nooy 2005: 140). In our networks of co-presence a bridge points to a site that forms the connection between two different groups of distribution networks. Such a site should play an important role in dispersing information on the pottery market, as it is linked in with highly differing networks, but it does not necessarily play a central role in the entire network. For the hypothesis to be valid, these sites should play a similar role in connecting different distribution networks on the distance network.
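
The definition of a bridge can be applied quite literally: remove each line in turn and check whether the number of components goes up. The sketch below does exactly that in plain Python. It is a brute-force illustration on an invented network, not the routine Pajek uses, and the function names are hypothetical.

```python
def count_components(nodes, edges):
    """Count connected components with a simple depth-first search."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def find_bridges(nodes, edges):
    """Return every edge whose removal increases the number of components."""
    baseline = count_components(nodes, edges)
    return [
        edge
        for edge in edges
        if count_components(nodes, [e for e in edges if e != edge]) > baseline
    ]


# Invented network: two triangles of sites joined by a single line C-D.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("F", "D")]
print(find_bridges(nodes, edges))
```

Here only the line between C and D is a bridge, and C and D are the sites that tie the two groups together.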

Sites belonging to the centre of a pottery distribution network can be easily reached by new pottery forms from diverse producing centres; they are central to the communications network of the pottery trade as it is represented in the ceramic evidence. This is true for both our shortest path network and our co-presence network, and can be measured using closeness centrality: sites are central in distribution networks if their graph theoretical distance to all other sites is minimal. In network terms: the closeness centrality of a vertex is the number of other vertices divided by the sum of all distances between the vertex and all others (de Nooy 2005: 127). Although this method provides comparable numerical results (a score between 0 and 1), we will not compare these absolute values. Rather, we will check whether sites that are central (or not) in our co-presence network are also central (or not) in our shortest path network.
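
As a small worked example of this definition, the following Python sketch computes closeness centrality with a breadth-first search and then ranks the sites, since it is the ordering rather than the absolute score that we want to compare between networks. The network is invented, the function name is hypothetical, and the sketch assumes a connected, unweighted network (a site that cannot reach all others simply gets a score of 0 here).

```python
from collections import deque


def closeness_centrality(adjacency):
    """Number of other vertices divided by the sum of distances to them."""
    scores = {}
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        reaches_all = len(distance) == len(adjacency)
        # Assumption of this sketch: unreachable sites give a score of 0.
        scores[source] = (len(adjacency) - 1) / total if total and reaches_all else 0.0
    return scores


# Invented chain of four sites: A - B - C - D.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = closeness_centrality(network)
ranking = sorted(scores, key=lambda site: (-scores[site], site))
print(scores)   # B and C score 0.75, A and D score 0.5
print(ranking)  # ['B', 'C', 'A', 'D']
```

Running the same function on a co-presence network and on a shortest path network of the same period gives two rankings that can be placed side by side.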

Pairs of contemporary networks of both types will be compared using these methods in order to provide an answer to our hypothesis “was distance a significant factor in selecting trade routes?”

Method update: co-present forms and wares

July 20, 2009

In a previous post we described how a network analysis of co-present forms and wares might help us understand the distributions evidenced by the ceramic data. Here we will elaborate on this type of network by explaining how we will create the network, what it represents, how we are planning on analysing it and what the results of our analyses actually mean.

At the basis of our analysis lies a two-mode network: a network in which vertices are divided into two sets, and vertices can only be related to vertices in the other set (de Nooy 2005: 103). In plain language, sites are connected with the forms/wares that are present on them, and the forms/wares are in turn connected to the other sites on which they were found. A fictitious example of a two-mode network is given in figure 1. A major benefit of using two-mode networks is that we do not lose any information present in the dataset: the specific forms and numbers of sherds present on specific sites are represented in all their complexity. The data will be extracted from the project’s database to form such two-mode networks.

Two-mode network

Fig. 1: A fictitious two-mode network representing sites connected to pottery forms which are present on the site. The value indicates the number of sherds of a form that have been found.
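
A two-mode network of this kind can be held in a very simple data structure. The Python sketch below builds one from records of the form (site, pottery form, number of sherds). It is only an illustration: the sherd counts are invented and are not those of figure 1, and `build_two_mode` is a hypothetical helper name, not part of the project's database.

```python
from collections import defaultdict


def build_two_mode(records):
    """Map each site to the forms found there and their sherd counts.

    Sites only ever link to forms (never to other sites), which is what
    makes the network two-mode.
    """
    site_to_forms = defaultdict(dict)
    for site, form, sherds in records:
        if sherds <= 0:
            continue  # a form without sherds is not present on the site
        site_to_forms[site][form] = site_to_forms[site].get(form, 0) + sherds
    return dict(site_to_forms)


# Invented records; repeated (site, form) pairs are summed.
records = [
    ("Athens", "EAA1", 12), ("Athens", "EAA2", 3), ("Athens", "EAA1", 3),
    ("Rhodes", "EAA1", 5), ("Rhodes", "EAA2", 7), ("Rhodes", "EAA3", 2),
    ("Sparta", "EAA1", 1), ("Sparta", "EAA2", 4),
]
network = build_two_mode(records)
for site, forms in network.items():
    print(site, forms)
```

Nothing is thrown away at this stage: every form and every sherd count from the records is still present in the structure.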

To facilitate the analysis of the data, however, we need to transform this two-mode network into two distinct one-mode networks. This is done for the example network of figure 1 and represented in figures 2 and 3. Both one-mode networks provide us with a different type of information: the first one (Fig. 2) represents the sites as vertices connected by the number of forms that are present on both sites; the second one (Fig. 3) represents the forms as vertices connected by the number of sites on which both forms are present. The strengths of a visualisation of ceramic distributions as networks should already be apparent in these one-mode networks.

One-mode network 1

Fig. 2: A fictitious one-mode network representing sites connected to sites which have evidence of the same pottery forms (co-presence). The value indicates the number of pottery forms that are co-present.

One-mode network 2

Fig. 3: A fictitious one-mode network representing pottery forms connected to other pottery forms which have been found on the same site (co-presence). The value indicates the number of sites on which both forms are co-present.
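
The transformation from a two-mode network into the two one-mode networks can be sketched as follows. This is a minimal Python illustration on invented data (the sherd counts are not those of the figures), with hypothetical function names; the edge values follow the description above, i.e. the number of shared forms between sites and the number of shared sites between forms.

```python
from collections import Counter
from itertools import combinations


def project_sites(site_to_forms):
    """Sites as vertices; edge value = number of forms present on both sites."""
    weights = {}
    for a, b in combinations(sorted(site_to_forms), 2):
        shared = set(site_to_forms[a]) & set(site_to_forms[b])
        if shared:
            weights[(a, b)] = len(shared)
    return weights


def project_forms(site_to_forms):
    """Forms as vertices; edge value = number of sites where both forms occur."""
    weights = Counter()
    for forms in site_to_forms.values():
        for a, b in combinations(sorted(forms), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Invented two-mode data: site -> {form: number of sherds}.
two_mode = {
    "Athens": {"EAA1": 15, "EAA2": 3},
    "Rhodes": {"EAA1": 5, "EAA2": 7, "EAA3": 2},
    "Sparta": {"EAA1": 1, "EAA2": 4},
}
print(project_sites(two_mode))
print(project_forms(two_mode))
```

Note that the sherd counts do not survive the projection in this sketch, only presence does, which is one reason to keep the two-mode network as the starting point.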

Now, what do these networks actually mean? As it is our goal to shed light on the relationship between ceramics and the dynamics of Roman trade, we should be very critical and clear about this point. We state that when sites have evidence of a specific pottery form in common, they have a connection of some sort. The nature of this connection represents, in its broadest sense, the distribution network of a pottery form. What network analysis allows us to do is to analyse the structure of these distribution networks, which will help us understand the processes that produce, maintain and transform these structures.

A first step in our attempt at understanding the structure of Roman ceramic distributions lies in identifying strong components using m-slices (de Nooy 2005: 109-113): we will look for vertices which are strongly connected to each other and have high edge values (i.e. number of sites or co-present forms). For the first one-mode network (Fig. 2) such a strong component will contain sites that are all part of the distribution networks of a variety of pottery forms. In this fictitious example Athens, Rhodes and Sparta all have evidence of the same two pottery forms (EAA1 and EAA2), which might lead us to conclude that similar processes led to the deposition of these specific sherds on these sites. For the second one-mode network (Fig. 3) the strong components indicate pottery forms that are present on the same sites and, therefore, have a similar distribution pattern.

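
The idea behind m-slices can be sketched briefly: keep only the lines whose value is at least m, together with the vertices they touch, and see which groups remain connected. The Python below is an illustration under that reading of the definition, not Pajek's own routine; the function name is hypothetical and the fourth site, Delos, and all edge values are invented to show the effect of the threshold.

```python
def m_slice_components(weighted_edges, m):
    """Connected groups of vertices using only lines with value >= m."""
    adjacency = {}
    for (a, b), value in weighted_edges.items():
        if value >= m:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    components = []
    seen = set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Invented one-mode site network: edge value = number of co-present forms.
site_network = {
    ("Athens", "Rhodes"): 2,
    ("Athens", "Sparta"): 2,
    ("Rhodes", "Sparta"): 2,
    ("Rhodes", "Delos"): 1,
}
print(m_slice_components(site_network, 1))
print(m_slice_components(site_network, 2))
```

At m = 1 all four sites hang together, while at m = 2 only Athens, Rhodes and Sparta remain, the group that shares two forms.
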
Such an analysis might considerably improve our understanding of ceramic distributions as it allows us to answer questions such as: What pottery forms had a similar distribution? Can this be explained by the proximity of the producing centre to the consuming sites? Is there a significant difference in the distribution of pottery forms made from the same ceramic ware group (i.e. the same producing region)? Is there a similarity between distribution patterns of forms from different wares (which might indicate similar processes of distribution for different producing centres)?

Apart from identifying clusters of sites that form part of similar distribution networks and pottery forms that had a comparable distribution, we can examine the position of individual sites in these networks. When we restrict our attention to the connections in the networks, we get an impression of the diversity of trade relations. Every edge represents the membership of a site or pottery form in a distribution network. Vertices with many edges have access to many and diverse distribution networks, which might indicate better knowledge of trade patterns or a stronger position in the pottery trade, as more information on pottery distribution networks is at their disposal. Such aspects can be studied by focusing solely on the absolute or relative number of edges, using methods to define degree, k-cores, closeness, betweenness, bridges and weak ties. Although we can’t elaborate on their exact application here, these measurements help us understand the position and roles of sites and pottery forms in different distribution networks. We might be able to identify sites which played a dominant or regulating role in the distribution of specific pottery forms or wares. We would like to stress that identifying such sites is crucial in any attempt to reconstruct trade routes, as they might serve to fill in the gaps on a transportation route from producing centres to consuming centres.

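
Two of the measures listed above, degree and k-cores, are simple enough to sketch in Python. This is an illustration on an invented site network (Delos is made up for the example) and the function names are hypothetical; the k-core is found here by repeatedly removing vertices with fewer than k remaining neighbours.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (site_a, site_b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degrees(edges):
    """Number of edges per vertex."""
    return {node: len(nbs) for node, nbs in build_adjacency(edges).items()}


def k_core(edges, k):
    """Vertices of the largest subnetwork in which everyone has degree >= k."""
    adjacency = build_adjacency(edges)
    while True:
        weak = [node for node, nbs in adjacency.items() if len(nbs) < k]
        if not weak:
            return set(adjacency)
        for node in weak:
            for neighbour in adjacency.pop(node):
                if neighbour in adjacency:
                    adjacency[neighbour].discard(node)


# Invented co-presence network between sites.
edges = [("Athens", "Rhodes"), ("Athens", "Sparta"),
         ("Rhodes", "Sparta"), ("Rhodes", "Delos")]
print(degrees(edges))
print(k_core(edges, 2))
```

In this toy network Rhodes has the highest degree, and Delos drops out of the 2-core because it is tied in by a single edge only.
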
Another strength of our approach will lie in the analysis of networks from different time periods, allowing the evolution of distribution patterns to become apparent and threshold periods to be identified.

Networks of this type will form the basis for a comparison with contemporary shortest-path networks, described in the next method update.

The structure of the distribution patterns, as represented in the co-presence networks, will be studied in more detail using hierarchical clustering based on dissimilarity measurements. This refinement of our method will be described in a later blog post.