Citation analysis: winner takes all

A small group of papers (1%) often gets a disproportionate share of attention and citations (17%). This pattern was identified a long time ago (have a look at the Web of Science selection procedure as an example of this trend). A short correspondence by Barabási, Song and Wang, published recently in Nature, revealed that this pattern only emerges after some time and that those top 1% of papers are not necessarily cited a lot immediately after they appear. The authors argue that this pattern might be a result of our changing reading habits now that academic publications are so abundant, easily searchable and, as a result, easily accessible: “Researchers increasingly rely on crowd sourcing to discover relevant work, a process that favours the leading papers at the expense of the remaining 99%”.

Read the full correspondence on the Nature website.

SMiLE: But who is going to read 12,000 tweets?!

A second blog post about the SMiLE project I am involved in appeared recently on the London School of Economics website. I wrote about the project’s aims before, when Nicole Beale and Lisa Harris first explained them on the LSE website. This second blog post offers a first glimpse of the results, including a short discussion of Twitter network visualization and analysis. Exciting!

In fact, this second blog post reveals some of the really cool work the project members have been up to. MSc students here in Southampton have been busy using the collected social media data in creative ways for their projects. The project is also working with the Oxford e-Research Centre on a best-practice guide for using social media at conferences. But that’s not all! We are also working on depositing the entire social media archive with the Archaeology Data Service in York, and publishing some of the results in Internet Archaeology.

The rest of the blog post goes on to discuss some of the issues surrounding all this. How does one go about depositing an electronic social media archive? Lisa and Nicole looked into some of the comments of the conference delegates, provided in feedback forms, to get a more qualified picture of the issue and how to proceed. The blog also discusses the issue of developing an interface through which this dataset can be explored. Mark Borkum and I are looking at using network analysis tools for this. More on the network side of things will be revealed in later posts.

Have a look at the original article, definitely worth a read!

Awesome art by Aaron Koblin

Just saw this TED talk by Aaron Koblin, a digital artist whose work has inspired me for a while now. His art shows stunning examples of the fact that we have so much data available everywhere that relates in unsuspected ways. If we bother to add things up, like he does for hand-drawn sheep, Johnny Cash still images, flight patterns, $100 bills and even voice samples, we see surprising things emerge that you would not expect by just looking at a single image or sample. His work on flight patterns is stunning and was recently exhibited at MoMA in New York.

Check out his work online. And check out his talk below. Believe me, you’ll be surprised!

Blog updated

I recently updated the entire contents of this blog’s pages, to reflect the new aims of the Archaeological Network Analysis project. The bibliography has been expanded with a long list of archaeological and non-archaeological works on network analysis. I also added my own publications to the bibliography.

All the information on this blog is still very much a work in progress. You can find an outline of the project, the dataset we use, the preliminary methodology, and an explanation of how the resulting networks should be understood.

Method update: beta-skeletons

This second update of the project’s method concerns the distance networks based on beta-skeletons described in an earlier blog post. We mentioned that the reconstruction of ancient trade routes is extremely complex as a number of variables should be taken into account, so our best bet is to focus on one parameter that might have been influential in determining trade routes. Using beta-skeletons and graph theory we will investigate whether the distance between centre of production and site of deposition is reflected in the ceramic evidence and whether it significantly influenced the selection of trade routes.

Although we mentioned in a previous post that the beta-skeleton would be compared with a reconstruction of trade routes based on the shortest path for every sherd from centre of production to site of deposition over this beta-skeleton, we now have to confess that this is nonsense: we would be comparing the beta-skeleton with a slightly altered version of itself, one that is based on a large number of assumptions concerning the intermediary sites. We realized that these shortest paths actually contain the hypothesis that we are testing, as they represent trade routes based on the ceramic evidence in which distance surpasses all other factors in importance.

To create such a network of trade routes we will make a beta-skeleton in which every site has at least one connection, so that all of them are reachable. This will be done in ArcGIS with a beta-skeleton calculator programmed by Dr. Graeme Earl, applied to all the sites in the database and their geographical coordinates. For every sherd the shortest path in geographical distance from centre of production to site of deposition over this beta-skeleton will be calculated in Pajek (although this can be done in ArcGIS, Pajek is able to calculate geographical as well as graph theoretical shortest paths). The edge value will represent the number of sherds passing between two sites, and edges with a value of zero will be discarded.
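To make the procedure concrete, here is a minimal sketch in Python rather than ArcGIS and Pajek, so it is not the workflow we actually use. It assumes planar coordinates, takes the Relative Neighbourhood Graph (a beta-skeleton with beta = 2) as the skeleton, and uses four hypothetical sites A to D with made-up sherd counts.

```python
import heapq
from collections import Counter
from itertools import combinations
from math import dist

def relative_neighbourhood_graph(coords):
    """Beta-skeleton with beta = 2: link p and q unless some third site r
    is closer to both of them than they are to each other."""
    edges = {}
    for p, q in combinations(coords, 2):
        d_pq = dist(coords[p], coords[q])
        blocked = any(
            max(dist(coords[p], coords[r]), dist(coords[q], coords[r])) < d_pq
            for r in coords if r not in (p, q)
        )
        if not blocked:
            edges[(p, q)] = d_pq
    return edges

def shortest_path(edges, source, target):
    """Dijkstra over the undirected skeleton; returns the sites on the path."""
    adjacency = {}
    for (p, q), length in edges.items():
        adjacency.setdefault(p, []).append((q, length))
        adjacency.setdefault(q, []).append((p, length))
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            break
        if d > best[node]:
            continue  # stale queue entry
        for neighbour, length in adjacency.get(node, []):
            candidate = d + length
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]

def sherd_flow(edges, sherds):
    """Edge value = number of sherds whose shortest path uses that edge.
    Edges that are never used do not appear at all (value zero)."""
    flow = Counter()
    for origin, destination, count in sherds:
        path = shortest_path(edges, origin, destination)
        if path is None:
            continue
        for a, b in zip(path, path[1:]):
            flow[frozenset((a, b))] += count
    return flow

# Hypothetical sites and sherd records: (production centre, find site, sherds)
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (1, 5)}
skeleton = relative_neighbourhood_graph(coords)
flow = sherd_flow(skeleton, [("A", "C", 3), ("A", "D", 2)])
```

In this toy example the skeleton keeps only the edges A-B, B-C and B-D, so all five sherds pass over A-B before splitting towards C and D.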

At this point we have a reconstruction of the trade routes over which the vessels would have been transported if the distance between starting and ending point had been the only factor taken into consideration by their transporters. This network embodies the hypothesis we want to test, which can be done by comparing it to another network visualisation of ceramic evidence. The networks of co-presence described in the previous post will provide this basis for comparison, as they do not contain any assumptions of their own (before their analysis that is).

Now, there is an obvious danger of comparing things with different meanings, so we need to be very clear about which aspects of both networks will be used for comparison. We will focus on a couple of phenomena that we think are represented in both types of networks: bridges and centrality.

A bridge is a line whose removal increases the number of components in the network (de Nooy et al. 2005: 140). In our networks of co-presence, a site at the end of a bridge forms the connection between two different groups of distribution networks. Such a site should play an important role in dispersing information on the pottery market as it is linked in with highly differing networks, but does not necessarily play a central role in the entire network. On the distance network these sites should play a similar role in connecting different distribution networks, in order for the hypothesis to be valid.
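The definition of a bridge can be checked directly: remove each line in turn and see whether the number of components goes up. The sketch below does exactly that in plain Python. It is a brute-force illustration, not the routine Pajek uses, and apart from Athens, Rhodes and Sparta the site names are hypothetical.

```python
def count_components(vertices, edges):
    """Number of connected components of an undirected network."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components

def find_bridges(vertices, edges):
    """Lines whose removal increases the number of components."""
    baseline = count_components(vertices, edges)
    return [
        edge for edge in edges
        if count_components(vertices, [e for e in edges if e != edge]) > baseline
    ]

# Two tightly knit groups of sites joined by a single line (hypothetical data)
sites = ["Athens", "Rhodes", "Sparta", "Corinth", "Delos", "Knossos"]
lines = [
    ("Athens", "Rhodes"), ("Rhodes", "Sparta"), ("Athens", "Sparta"),
    ("Sparta", "Corinth"),
    ("Corinth", "Delos"), ("Delos", "Knossos"), ("Corinth", "Knossos"),
]
bridges = find_bridges(sites, lines)  # [("Sparta", "Corinth")]
```

Here Sparta and Corinth are the sites that tie the two groups together, which is the role we want to look for in both types of network.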

Sites belonging to the centre of a pottery distribution network can be easily reached by new pottery forms from diverse producing centres: they are central to the communications network of the pottery trade as it is represented in the ceramic evidence. This is true for both our shortest path network and our co-presence network, and can be measured using the closeness centrality method: sites are central in distribution networks if their graph theoretical distance to all other sites is minimal. In network terms: the closeness centrality of a vertex is the number of other vertices divided by the sum of all distances between the vertex and all others (de Nooy et al. 2005: 127). Although this method will provide comparable numerical results (a score between 0 and 1), we will not compare these absolute values. Rather, we will focus on seeing whether sites that are central (or not) in our co-presence network are also central (or not) in our shortest path network.
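As a rough illustration of this measure, the sketch below computes closeness centrality by counting steps with a breadth-first search and then ranks the sites, since it is the ranking rather than the absolute score that we want to compare between networks. It assumes an unweighted, connected network, and the two small networks are hypothetical.

```python
from collections import deque

def closeness_centrality(adjacency):
    """Number of other vertices divided by the sum of graph distances to them.
    Assumes a connected network: unreachable vertices are simply not counted
    in the sum, so scores for disconnected networks would be misleading."""
    scores = {}
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        scores[source] = (len(adjacency) - 1) / total if total else 0.0
    return scores

def ranking(scores):
    """Sites ordered from most to least central (ties broken by name)."""
    return sorted(scores, key=lambda site: (-scores[site], site))

# Hypothetical co-presence network: Rhodes shares forms with both other sites
co_presence = {
    "Athens": {"Rhodes"},
    "Rhodes": {"Athens", "Sparta"},
    "Sparta": {"Rhodes"},
}
# Hypothetical shortest path network over the same sites
shortest_paths = {
    "Athens": {"Rhodes", "Sparta"},
    "Rhodes": {"Athens"},
    "Sparta": {"Athens"},
}

co_presence_rank = ranking(closeness_centrality(co_presence))
shortest_path_rank = ranking(closeness_centrality(shortest_paths))
same_centre = co_presence_rank[0] == shortest_path_rank[0]
```

In this made-up case Rhodes is the most central site in the co-presence network but Athens is the most central in the shortest path network, so the two networks would disagree.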

Pairs of contemporary networks of both types will be compared using these methods in order to provide an answer to our hypothesis “was distance a significant factor in selecting trade routes?”

Method update: co-present forms and wares

In a previous post we described how a network analysis of co-present forms and wares might help us understand the distributions evidenced by the ceramic data. Here we will elaborate on this type of network by explaining how we will create the network, what it represents, how we are planning on analysing it and what the results of our analyses actually mean.

At the basis of our analysis lies a two-mode network: a network in which vertices are divided into two sets, and vertices can only be related to vertices in the other set (de Nooy et al. 2005: 103). In plain language, sites are connected with forms/wares that are present on the sites, and the forms/wares are themselves connected to other sites on which they were found. A fictitious example of a two-mode network is given in figure 1. A major benefit of using two-mode networks is that we do not lose any information present in the dataset: the specific forms and numbers of sherds present in specific sites are represented in all their complexity. The data will be extracted from the project’s database to form such two-mode networks.

Fig. 1: A fictitious two-mode network representing sites connected to pottery forms which are present on the site. The value indicates the number of sherds of a form that have been found.

To facilitate the analysis of the data, however, we need to transform this two-mode network into two distinct one-mode networks. This is done for the example network of figure 1 and represented in figures 2 and 3. Both one-mode networks provide us with a different type of information: the first one (Fig. 2) represents the sites as vertices connected by the number of forms that are present on both sites; the second one (Fig. 3) represents the forms as vertices connected by the number of sites on which both forms are present. The strengths of a visualisation of ceramic distributions as networks should already be apparent in these one-mode networks.
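The transformation itself is simple enough to sketch in a few lines of Python. The two-mode data below are invented for illustration (they are not the values of figure 1), and the dictionary layout is just one convenient assumption about how the database export might look.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical two-mode network: site -> {pottery form: number of sherds}
two_mode = {
    "Athens": {"EAA1": 12, "EAA2": 3},
    "Rhodes": {"EAA1": 5, "EAA2": 8, "EAA3": 1},
    "Sparta": {"EAA1": 2, "EAA2": 4},
    "Corinth": {"EAA3": 6},
}

def project_sites(two_mode):
    """One-mode network of sites; edge value = number of forms on both sites."""
    weights = {}
    for a, b in combinations(sorted(two_mode), 2):
        shared = set(two_mode[a]) & set(two_mode[b])
        if shared:
            weights[(a, b)] = len(shared)
    return weights

def project_forms(two_mode):
    """One-mode network of forms; edge value = number of sites with both forms."""
    sites_by_form = defaultdict(set)
    for site, forms in two_mode.items():
        for form in forms:
            sites_by_form[form].add(site)
    weights = {}
    for a, b in combinations(sorted(sites_by_form), 2):
        shared = sites_by_form[a] & sites_by_form[b]
        if shared:
            weights[(a, b)] = len(shared)
    return weights

site_network = project_sites(two_mode)
form_network = project_forms(two_mode)
```

Note that the sherd counts are dropped in both projections: only presence matters for the edge values, which is why we keep the two-mode network as the primary record.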

Fig. 2: A fictitious one-mode network representing sites connected to sites which have evidence of the same pottery forms (co-presence). The value indicates the number of pottery forms that are co-present.

Fig. 3: A fictitious one-mode network representing pottery forms connected to other pottery forms which have been found on the same site (co-presence). The value indicates the number of sites on which both forms are co-present.

Now, what do these networks actually mean? As it is our goal to shed light on the relationship between ceramics and the dynamics of Roman trade, we should be very critical and clear about this point. We state that when sites have evidence of a specific pottery form in common, they have a connection of some sort. The nature of this connection represents, in its broadest sense, the distribution network of a pottery form. What network analysis allows us to do is to analyse the structure of these distribution networks, which will help us understand the processes that create, maintain and change these structures.

A first step in our attempt at understanding the structure of Roman ceramic distributions lies in identifying strong components using m-slices (de Nooy et al. 2005: 109-113): we will look for vertices which are strongly connected to each other and have high edge values (i.e. number of sites or co-present forms). For the first one-mode network (Fig. 2) such a strong component will contain sites that are all part of the distribution networks of a variety of pottery forms. In this fictitious example Athens, Rhodes and Sparta all have evidence of the same two pottery forms (EAA1 and EAA2), which might lead us to conclude that similar processes led to the deposition of these specific sherds on these sites. For the second one-mode network (Fig. 3) the strong components indicate pottery forms that are present in the same sites and, therefore, have a similar distribution pattern.
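A minimal way to picture an m-slice, as we understand the term, is to keep only the lines with a value of at least m and then see which groups of vertices remain connected. The Python sketch below does this for a hypothetical site network; the edge values are invented and the function is an illustration, not Pajek's implementation.

```python
def m_slice(weights, m):
    """Groups of vertices held together by lines with a value of at least m."""
    adjacency = {}
    for (a, b), value in weights.items():
        if value >= m:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    groups = []
    seen = set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical one-mode site network: value = number of co-present forms
site_network = {
    ("Athens", "Rhodes"): 2,
    ("Athens", "Sparta"): 2,
    ("Rhodes", "Sparta"): 2,
    ("Corinth", "Rhodes"): 1,
}

two_slice = m_slice(site_network, 2)  # [{"Athens", "Rhodes", "Sparta"}]
one_slice = m_slice(site_network, 1)  # one group containing all four sites
```

Raising m peels away the weakly connected sites (Corinth here) and leaves the core of sites that share several forms.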
Such an analysis might considerably improve our understanding of ceramic distributions as it allows us to answer questions such as: What pottery forms had a similar distribution? Can this be explained by the proximity of the producing centre to the consuming sites? Is there a significant difference in the distribution of pottery forms made from the same ceramic ware group (i.e. the same producing region)? Is there a similarity between distribution patterns of forms from different wares (which might indicate similar processes of distribution for different producing centres)?

Apart from identifying clusters of sites that form part of similar distribution networks and pottery forms that had a comparable distribution, we can examine the position of individual sites in these networks. When we restrict our attention to the connections in the networks, we get an impression of the diversity of trade relations. Every edge represents the membership of a site or pottery form to a distribution network. Vertices with many edges have access to many and diverse distribution networks, which might indicate better knowledge of trade patterns or a stronger position in pottery trade, as more information on pottery distribution networks is at their disposal. Such aspects can be studied by focusing solely on the absolute or relative number of edges, using methods to define degree, k-cores, closeness, betweenness, bridges and weak ties. Although we can’t elaborate on their exact application here, these measurements help us understand the position and roles of sites and pottery forms in different distribution networks. We might be able to identify sites which played a dominant or regulating role in the distribution of specific pottery forms or wares. We would like to stress that identifying such sites is crucial in any attempt to reconstruct trade routes, as they might serve to fill in the gaps on a transportation route from producing centres to consuming centres.

Another strength of our approach will lie in the analysis of networks from different time periods, allowing for the evolution of distribution patterns to become apparent, and threshold periods to be identified.

This type of network will form the basis for a comparison with contemporary shortest-path networks, described in the next method update.

The structure of the distribution patterns, as represented in the co-presence networks, will be studied in more detail using hierarchical clustering based on dissimilarity measurements. This refinement of our method will be described in a later blog post.

Geographic interconnections

In this blog post we continue our quest to develop a method for studying trade routes as they are reflected in the ceramic evidence. It provides an alternative, and in some ways a parallel, to our previous post concerning beta-skeletons.

A computerised model was developed by Rihll and Wilson (1991) to study the interconnections between sites based solely on their geographical coordinates, while taking size, importance and interactions between sites into account. The only input the model needs is the location of every site. Other factors are simulated and develop when running the model thanks to three assumptions:

  • Interaction between any two places is proportional to the size of the origin zone and the importance and distance from the origin zone of all other sites in the survey area, which compete as destination zones.
  • The importance of a place is proportional to the interaction it attracts from other places.
  • The size of a place is proportional to its importance.

Through a number of simulations starting from an initially egalitarian state (equal size and importance for all sites), the most likely pattern of interconnections between sites is determined.
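To get a feel for how the three assumptions interact, here is a toy iteration in Python. It is our own simplified reading and not Rihll and Wilson's published formulation: the exponential distance decay, the parameters `alpha` and `beta`, the exclusion of self-interaction and the plain "replace importance by attracted interaction" update are all assumptions made for this sketch, and the three sites are hypothetical.

```python
from math import dist, exp

def simulate(coords, alpha=1.05, beta=0.2, steps=50):
    """Iterate the three assumptions from an egalitarian starting state."""
    sites = list(coords)
    importance = {s: 1.0 for s in sites}  # equal importance for all sites
    size = {s: 1.0 for s in sites}        # equal size for all sites
    for _ in range(steps):
        attracted = {s: 0.0 for s in sites}
        for origin in sites:
            # Destinations compete for the origin's interaction according to
            # their importance and their distance from the origin.
            pull = {
                target: importance[target] ** alpha
                * exp(-beta * dist(coords[origin], coords[target]))
                for target in sites if target != origin
            }
            total = sum(pull.values())
            if total == 0:
                continue  # no destination exerts any pull on this origin
            for target, value in pull.items():
                # Interaction is proportional to the size of the origin.
                attracted[target] += size[origin] * value / total
        importance = attracted      # importance follows attracted interaction
        size = dict(attracted)      # size follows importance
    return importance

# Three hypothetical sites on a line; B lies between A and C
result = simulate({"A": (0, 0), "B": (1, 0), "C": (2, 0)})
most_important = max(result, key=result.get)
```

In this tiny example the middle site B ends up as the most important one purely because of where it sits, which is the kind of geography-only baseline the model is meant to provide.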

Shawn Graham successfully used this model in his analysis of the brick industry in the Tiber valley. Networks of interconnections between sites in the Tiber valley were created to “explore the effects of geography, stripped of all other considerations” (Graham 2006b: 77; Graham 2009: 678-681).

Like the Relative Neighbourhood Graph (RNG), this method uses straight-line distances, which will allow us to study the influence of distance in the distribution of table wares. However, as it is a probabilistic model, its potential for testing hypotheses is far greater.

We could use this method to create a network of all sites included in the distribution patterns of table wares. The network can be analysed to determine the relative positions of all sites, knowing that distance is a significant factor and taking size and importance into account but, most importantly, knowing the exact value of all these factors for a given result.

We should stress that the simulated importance represents the importance of a site in the table ware trade, given that distance is a significant factor (this might require a revision of the mathematics underlying the model). Instead of running an egalitarian simulation, we can therefore enter the values for importance into the model as they are present in the ceramic data. When we rerun the model we will be able to analyse a network of a certain distribution in a certain period knowing that distance is influential and being able to calculate this influence. Moreover, we can compare these ceramic networks with multiple stages in the egalitarian simulation.

Again, this is just an idea that might bring us one step closer to understanding the decisions made by people involved in the distribution of table wares, but it is by no means without its issues:

  • We assume a direct correlation between number of sherds and importance in trade patterns. Should we use the diversity and relative amounts of ceramic forms as an index of importance? As this is a simulation we accept that we enter arbitrary values for something we try to study (the relative position of sites in different ceramic distribution patterns). Still, we should beware of circular reasoning, which will eventually tell us that the things we think are significant turn out to be significant.
  • What about the ‘size’ factor? Should we remove it from the model or can it represent another aspect of ceramic trade?
  • Is it useful to apply this model to the ceramic evidence, or should we just run the analysis without including the number of sherds, to see how sites relate to one another in space? Such an approach might allow us to compare a distance-based simulation with beta-skeletons of ceramic distributions.

Defining networks

As already mentioned in the preliminary method, defining networks (the relationships within ceramic distributions) is of crucial importance as this will dominate the results of the analysis. This should also happen as early on as possible in the project, because it will determine our approach to the data (the database model and overall method). As it is our aim to investigate the relationship between ceramics and Roman trade, we thought it best not to drift too far from the data themselves. We could even question the use of analysing networks that combine the ceramic data and other parameters (like distance, topography or sailing conditions), as the things we think to be significant will also turn out to be structuring factors in the networks.

But what relationships are explicitly present in the data themselves? As we mentioned before, it’s hard to think of networks that include no assumptions (this is why we prefer a methodology that is based on testing hypotheses/assumptions, rather than focusing on one type of network). We noticed that it is hard not to think geographically when thinking about the relationships within a large quantity of ceramics. The first network we came up with actually focused on the transportation of the ceramics, from centre of production to centre of deposition. In this network the points would represent sites and the lines acts of ceramic transportation. Such a network, however, requires assumptions about the junctions between the known starting and ending sites, which made us think about making distance a defining factor. Although it is very tempting to try and reconstruct ancient trade routes, we decided that there were too many factors to take into account (land/sea travel, distance, sailing conditions, topography, Roman roads).

So are there non-geographical networks reflected in ceramic distributions? We might look at the quantities of certain ceramic types, the diversity of pottery types for every site, and the patterns in presence of types at the same period in the same place. It becomes increasingly hard to imagine such networks and what they represent, but it should result in an interesting and innovative view on ceramic distributions. Do these networks inform us on the contacts of producing centres, the popularity of pottery types, or the social networks that traders frequented? Do they reflect trade in other items like staple goods?

It is our aim to discover the structure within a ceramic database by evaluating as many network types as possible as hypotheses. Please share doubts about the above-mentioned networks, and feel free to propose other relationships that could be implied by ceramic distributions.
