Citation Graphs and Methodology

Last week I posted some graphs showing how often various journals cite other journals. Here is one of those graphs, just to remind you what they looked like.

I didn’t say particularly clearly how I got the data, or even exactly what the graphs meant. So here’s the long methodology post showing how I ended up with graphs like this one.

The data all come from Web of Science, and in particular its “Cited Reference Search”. For each of the 32 journals I was looking at, and each of the 40 years in 1976–2015, I searched for all citations to that journal in that year. So, for instance, I did a search under “Cited Reference Search” where the source was Australasian Journal of Philosophy, and the year was 1996. I set the “Timespan” to be 1976–2015, and the Indexes to search to be just the Arts & Humanities Citation Index.

When you do that, you get a list of possible articles that you might want to look at the citations of. (These are, roughly, all the things published in that journal in that year.) I hit select all, because I wanted to see everything that cited at least one of those articles, and then downloaded some information about everything that turned up.

Note three things about doing the search this way:

  1. I didn’t even download any information about which individual articles were being cited. The method got me more information than I needed (or even wanted) about the citing articles, but none at all about what was cited. Obviously there is a lot of interest in which articles are being cited, but that will be left for others to do.
  2. This method will only turn up citations to the article as it appeared in the journal. So if someone cited, say, “Elusive Knowledge” only in its reprint in Papers in Metaphysics and Epistemology, and not in the original AJP publication, it won’t show up. I don’t think this made a huge difference, but it affected some things. For instance, Stephen Barker’s 2011 Noûs paper “Can Counterfactuals Really Be about Possible Worlds?” doesn’t show up in the list of papers citing AJP 1996, because its reference to “Elusive Knowledge” is only to the reprint. So this is one way some noise creeps into the system.
  3. At each of the 1280 stages, I’m doing a disjunctive search: Find all things that cite that journal in that year. The search doesn’t discriminate between things that cite five articles from that year and things that cite one article from that year. I think this is probably a good thing; citations to multiple articles in one year are usually citations to a single thread, and I’d rather treat them as a single citation. But it is a complication.
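To make the third point concrete, here is a minimal sketch of the search grid and of what the disjunctive search does to the counts. It is an illustration, not the process actually used (the searches were run in the Web of Science interface); the journal list is a three-item stand-in, and the function names and the `hits` format are hypothetical.

```python
from itertools import product

# Stand-in list: the real project covered 32 journals.
journals = ["Australasian Journal of Philosophy", "Noûs", "Mind"]
years = range(1976, 2016)  # 1976-2015 inclusive, i.e. 40 years


def search_grid(journals, years):
    """Return every (cited journal, cited year) pair that needs a search."""
    return list(product(journals, years))


def count_citing_papers(hits):
    """Count each citing paper once per search, however many articles it cites.

    `hits` is an iterable of (citing_paper_id, cited_article_id) pairs
    (a hypothetical format) from a single journal-year search.
    """
    return len({citing for citing, _cited in hits})


grid = search_grid(journals, years)
print(len(grid))  # 120 here; 32 journals x 40 years would give 1280

# paper-A cites two articles from the same journal-year, but counts once.
hits = [("paper-A", "art-1"), ("paper-A", "art-2"), ("paper-B", "art-1")]
print(count_citing_papers(hits))  # 2
```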

There was one more annoying wrinkle. Web of Science doesn’t separate out Noûs from Philosophical Issues and (usually) Philosophical Perspectives. As far as I could tell, every issue of Philosophical Issues is coded as a special issue of Noûs, and about half the issues of Philosophical Perspectives are. I wanted to get this noise out of the data. So when I was searching for Noûs I had to go through by hand and search just for those things that cited the real Noûs articles, not the ones citing the special issues. This wasn’t too hard, because the special issues had page numbers and/or issue numbers starting with ‘S’, but it was a bit of a pain. We’ll come back to that.

The result, after these 32 by 40 searches, was a file with roughly 240,000 citations in it. But a lot of these were citations in journals I wasn’t looking at. So I made a restricted file where the citations were just to the 32 journals I was looking at. This was mostly just a matter of filtering the large file by the citing journal, though it was again a bit of a pain to filter out the different things coded as ‘Noûs’. (It wasn’t too hard this time, to be honest, because the downloaded data about citing papers included issue and page number, so searching for things like ‘S’ got rid of most of them.) This was less automated than most of the process, so there was a higher chance of errors creeping in.
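A filter of this kind might look something like the following. This is a sketch under assumptions, not the procedure actually used: the field names (`citing_journal`, `issue`, `page`) are hypothetical rather than the real Web of Science export columns, and `TRACKED` is a three-journal stand-in for the full list of 32.

```python
# Stand-in for the 32 journals being tracked.
TRACKED = {"Noûs", "Mind", "Synthese"}


def is_special_issue(record):
    """Treat an issue or page number starting with 'S' as a special issue."""
    issue = str(record.get("issue", ""))
    page = str(record.get("page", ""))
    return issue.upper().startswith("S") or page.upper().startswith("S")


def keep(record):
    """Keep citations from tracked journals, minus things miscoded as Noûs."""
    if record["citing_journal"] not in TRACKED:
        return False
    if record["citing_journal"] == "Noûs" and is_special_issue(record):
        return False
    return True


records = [
    {"citing_journal": "Noûs", "issue": "4", "page": "541"},
    {"citing_journal": "Noûs", "issue": "S1", "page": "S23"},
    {"citing_journal": "Mind", "issue": "2", "page": "101"},
    {"citing_journal": "Some Other Journal", "issue": "1", "page": "1"},
]
kept = [r for r in records if keep(r)]
print(len(kept))  # 2
```

A prefix test like this is only as good as the coding of the underlying data, which is part of why this step is more error-prone than the rest.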

The result was a file with about 106,000 citations in it. The graph you saw above comes from a slightly smaller file, one that deletes all of the citations to articles in the same journal. That covers a lot of citations, so we’re now down to about 82,000 citations. Journals, it turns out, love publishing papers that cite other things in that very journal. For 24 journals, the journal they most frequently cite is themselves. For 7 others, it is Journal of Philosophy, and for 1, the Journal of Political Philosophy, it is Philosophy & Public Affairs. So we cut out a fair bit here.
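For anyone wanting to reproduce this step, here is a minimal sketch of dropping self-citations and finding each journal's most-cited journal. The `(citing, cited)` pair format, the function names and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy data: (citing journal, cited journal) pairs.
pairs = [
    ("Mind", "Mind"),
    ("Mind", "Mind"),
    ("Mind", "Analysis"),
    ("Analysis", "Analysis"),
    ("Analysis", "Journal of Philosophy"),
    ("Analysis", "Journal of Philosophy"),
]


def most_cited_by(pairs):
    """Map each citing journal to the journal it cites most often."""
    counts = defaultdict(Counter)
    for citing, cited in pairs:
        counts[citing][cited] += 1
    return {citing: c.most_common(1)[0][0] for citing, c in counts.items()}


def drop_self_citations(pairs):
    """Remove citations where a journal cites itself."""
    return [(citing, cited) for citing, cited in pairs if citing != cited]


print(most_cited_by(pairs))
# {'Mind': 'Mind', 'Analysis': 'Journal of Philosophy'}
print(len(drop_self_citations(pairs)))  # 3
```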

I used that 82,000 strong list to build a 32 by 32 table, with the cited journal on the rows, and the citing journal on the columns. Each cell had a count of how often that pair showed up in the data set, from 0 (for any number of pairs) to 1534 (citations by Synthese of Philosophy of Science). These are raw counts, so journals that publish a lot will naturally have bigger numbers (in both the rows and columns) than smaller journals. I’ll come back to this point in later posts; I’ve been working a fair bit this week on ways to address this.
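Building a table like this from a list of citations is simple enough to sketch. The version below assumes the citations are `(citing, cited)` pairs and uses a three-journal stand-in with made-up counts; the function name and the data are hypothetical.

```python
def build_table(journals, pairs):
    """Build a square count table: rows are cited journals, columns are citing."""
    index = {name: i for i, name in enumerate(journals)}
    table = [[0] * len(journals) for _ in journals]
    for citing, cited in pairs:
        table[index[cited]][index[citing]] += 1
    return table


journals = ["Philosophy of Science", "Synthese", "Mind"]
# Toy data: Synthese cites Philosophy of Science three times, Mind cites Synthese once.
pairs = [("Synthese", "Philosophy of Science")] * 3 + [("Mind", "Synthese")]

table = build_table(journals, pairs)
for name, row in zip(journals, table):
    print(name, row)
# Philosophy of Science [0, 3, 0]
# Synthese [0, 0, 1]
# Mind [0, 0, 0]
```

With self-citations already removed, the diagonal of the table is all zeros.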

Then I arranged the journals so that similar journals were nearby. I was using a fairly rough and ready version of similarity, and there were probably better ways to do this. There ended up being a big jump between the philosophy of science journals and the ethics journals, and a jump (though actually a bit smaller) between CJP in the generalist journals and the history journals.

It’s striking that it is possible to go by relatively small steps from the generalist journals to the philosophy of science journals, but not to the ethics journals. A large part of the explanation here is that Synthese exists as a bridge between the two, but no similar journal exists for bridging the ethics journals to the generalist journals. Economics and Philosophy sort of functions as such a bridge, since it connects to the political philosophy journals and the philosophy of science/formal epistemology journals, but it’s too small. Given the importance of ethics-and-epistemology to young philosophers these days, I suspect that situation will change in the next few years.

There ends up being something like a category of ethics, aesthetics and history journals in the data set I have. This is not because these journals are all intrinsically similar. It is rather that they are all linked to Kant Studien.

Because Mind is linked to the other UK/Australian journals, and to Mind and Language, the UK/Australian journals ended up nearer to the specialist journals than the North American journals did. If I get a chance, I’d like to write more about the geographic patterns in the journals, because these are fairly interesting to me.

Then I had to colour code the journals. I went through a lot of options here before settling on what you see. I wanted nearby journals to get similar colours, different categories to get very different colours, and the whole thing to not look terrible. I would also have liked very different colours for each journal, but I ended up having to seriously compromise on that. I landed on green for generalist journals, going through blue-ish greens for specialist journals in areas the generalists cover a lot, into darker blues and purples for philosophy of science, then jumping to reds for ethics and political, and oranges for aesthetics and for history. I alternated light/dark colours around the circle, but the light/dark doesn’t mean anything; it was just to make it easier to detect edges.

And I fed that table into Circos. The result is what you saw. Here’s how to read it.

Between each pair of (distinct) journals, there is a pair of lines. Each line starts on an outer ring and ends on an inner ring. The outer ring is the journal being cited, the inner ring is the journal doing the citing. The colour of the line is the colour of the journal being cited.

Around the edges there are three arcs, each with an array of colours. These represent the outbound citations, the inbound citations, and (on the outside) the sum of these. They are ordered by size. If the colours were more distinct, you could easily see which journal the particular journal interacts with the most. As it stands, looking for the purples, reds and browns on those arcs gives you a bit of a sense of how much the journal interacts with philosophy of science, with ethics, and with aesthetics/history. (Those colours usually come way towards the end of the arc, though Synthese obviously has more purple towards the top end.)

That’s about enough, I think, to show what’s going on. I have four big projects going forward.

  1. Building graphs that just highlight specific journals. It’s impossible to make out anything about the history journals, for instance, at the scale shown here. So I’ll in effect do some magnification.
  2. Building graphs (and perhaps gifs) that show the evolution over time of citation patterns. I might go back to some old-fashioned line graphs to show the change in citations to various journals, and how much more egalitarian it has become.
  3. Looking at ways to highlight the geographical features of the citation patterns. This is something I’m really fascinated by.
  4. Figuring out the best way to normalise the data to account for the fact that some journals are bigger than others. I have some ideas here, but it’s a non-trivial challenge.