Professional Documents
Culture Documents
Edited* by William H. Press, University of Texas at Austin, Austin, TX, and approved November 6, 2014 (received for review August 7, 2014)
We consider the incidence of text reuse by researchers via a sys- [e.g., refs. 810; for instance, all graduate students at Cornell
tematic pairwise comparison of the text content of all articles de- University take online training through the Office of Research
posited to arXiv.org from 1991 to 2012. We measure the global Integrity (www.oria.cornell.edu/rcr/)]. In Supporting Information,
frequencies of three classes of text reuse and measure how chronic section B we provide a brief survey of representative policies.
text reuse is distributed among authors in the dataset. We infer Journal publishers provide effective international guidelines, and
a baseline for accepted practice, perhaps surprisingly permissive com- the American Physical Societys, for example, are unequivocal re-
pared with other societal contexts, and a clearly delineated set of garding text reuse (11): Authors may not . . . incorporate without
aberrant authors. We find a negative correlation between the attribution text from another work (by themselves or others), even
amount of reused text in an article and its influence, as measured by
when summarizing past results or background material. We will
see that arXiv submissions do not always conform to these exacting
subsequent citations. Finally, we consider the distribution of coun-
standards and yet are published by journals, indicating that editors
tries of origin of articles containing large amounts of reused text.
do not systematically use an automated screen.
To be clear, we are careful in what follows to restrict attention
arXiv | plagiarism | text mining | n-grams to simple text overlaps. We make no attempt to detect plagia-
rism in its most general form, which includes unattributed use
COMPUTER SCIENCES
One motivation for undertaking this analysis of arXiv data was purposes of this analysis we have also excluded articles from very
the known incidence of text copying and plagiarism, usually no- large experimental collaborations, because the long lists of au-
ticed by readers, and sometimes reported in the news media. The thor names (and other boilerplate) can masquerade as authors
authors of ref. 4, for example, pointed out unattributed use of reusing their own text. To detect text overlaps between arbitrary
their text in a series of four arXiv articles in 1999. A news article pairs of articles efficiently, we use an extension of the methodology
from 2003 (5) described the case of an unknown author who tried
to establish research credentials by submitting texts largely copied Significance
from other sources. In a 2007 news article (6) discussing the
earlier version (1) of the work here, it was noted that the cases
detected spanned a wide range, from 27 pages of lecture notes by In the modern electronic format it is both easier to reuse text and
another author used verbatim in a thesis, to reuse of common easier to detect reused text. This is the first comprehensive study
introductory material, to text overlaps of benign common phrases. of patterns of text reuse within the full texts of an important
Shortly afterward, as reported in another news article (7), a large large scientific corpus, covering a 20-y timeframe. It provides an
number of articles from a group of coauthors was withdrawn important baseline for what is regarded as standard practice
owing to reuse of text copied from a variety of sources (see arxiv. within the affected research communities, a standard somewhat
org/new/withdrawals.aug.07.html for a list of 65 withdrawals). more lenient than that currently applied to journalists, popular
Practical considerations for running the arXiv site provide an- authors, and public figures.
other motivation, because problematic authors can inconvenience
readers by producing more than their share of articles, reusing Author contributions: P.G. designed research; D.T.C. and P.G. performed research; D.T.C.
large blocks of their own text. Screening for this had been hap- and P.G. analyzed data; and D.T.C. and P.G. wrote the paper.
hazard, and moreover a systematic baseline to identify outliers and The authors declare no conflict of interest.
to provide a principled response to the claim this is common *This Direct Submission article had a prearranged editor.
practice, everyone does it was needed. The current work, reusing See Commentary on page 6.
the methodology of ref. 1, gives a more systematic assessment of 1
To whom correspondence should be addressed. Email: ginsparg@cornell.edu.
the statistics of text reuse in the arXiv dataset and permits iden-
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.
tification of the extremes of the distribution, so that outliers can be 1073/pnas.1415135111/-/DCSupplemental.
publicly flagged (this policy was implemented in the summer of
Commercial resources, such as Ithenticate, use a much larger dataset. See in particular
2011; see arxiv.org/help/overlap).
CrossCheck (13), implementing Ithenticate for research publications, and used by mem-
Although there is no universal standard pertaining to reuse ber publishers to screen journal submissions (14, 15). That coverage is still far from as
of text in scientific publications, many universities and pub- comprehensive as that available via commercial search engines, as assessed by comparing
lishers have established explicit guidelines and provide training to results from the Google custom search API.
The number of article pairs with at least 10 or more 7-grams in common is of order figure from its original context, invisible to the author and reader in the new context but
600,000, about 2 per million of the total possible 757,0002 =2 278 billion total article pairs. nonetheless seen by the pdf-to-text converter and flagged as a large text overlap.
COMPUTER SCIENCES
mulative histogram of the number of authors whose articles
distributed among authors: Is it all of the authors some of the contain a given fraction of significant AU, CI, and UN text
time, or some of the authors all of the time? It might, for ex- overlaps. For example, an author with 10 articles, four of which
ample, be reassuring if only a relatively small group of authors is have significant AU overlap, would contribute to the upper (blue)
responsible for the majority of observed cases. Here we char- line for x-axis values less than or equal to 0.4. Most importantly,
acterize the extent to which text reuse is normal behavior, to we see that the number of authors with articles flagged for each of
be able identify the abnormal behaviors. the three types of overlaps drops significantly as the fraction of
problematic articles increases from 0%. Of the total 127,270
Text Overlap Networks. In general, a network is a collection of authors in the dataset, only 51,060, 11,860, and 1010 have more
nodes linked together by edges, where each node represents an than 4% of their articles contain AU, CI, and UN text overlaps,
object and each edge between two nodes represents a connection respectively. The vast majority of authors, therefore, either never
or relationship between the corresponding objects. Here we in- or only rarely reuse significant amounts of text in new pub-
troduce a text overlap network in which each node represents an lications. In the more problematic region we see only 14,020,
article and each edge a pairwise textual overlap between articles. 1600, and 210 with at least 24% of their articles containing sig-
Because articles published later in time copy from earlier ones nificant AU, CI, and UN overlaps, respectively. We infer that the
(and not vice versa), all edges in the network are directed forward practice of reusing text is uncommon and is restricted to a mi-
in time to represent text transfer. Each edge is weighted according nority of serial offenders, responsible for the heavy tail in Fig. 1.
to the number of 7-grams that the two connected articles have in
common, and edges are color-coded blue, green, and red, re- 4. More Author Sociology
spectively, to indicate AU, CI, and UN modes of text overlap. Text Overlap and Citations. Having seen that the problematic be-
In the text overlap network for articles written by a specific havior is restricted to a small minority of authors, we turn to
author, the density of edges is proportional to the amount of assess the impact of their work. We use the number of citations
reused text, so the network provides a useful visualization of text that each article has received as a proxy for its influence and
reuse and for assessing overlaps among articles by a particular investigate any correlation with the amount of copied content in
author or group of authors. Fig. 3 shows the text overlap net- the article. We focus on a subset of 116,490 articles for which we
works of two authors with vastly different patterns of text reuse. have relatively clean citation data, primarily in astrophysics
Articles by author A have few overlaps: Of 217 coauthored and high-energy physics. The articles selected for this subset
articles, only 6 contain previously published text, whereas author appeared before the start of 2011, giving them time to accu-
Bs text overlap network is far more densely connected. The blue mulate citations. To provide a better proxy for an articles real
edges reveal clusters of articles by that author with material influence, we discard self-citations (i.e., citations by articles with
copied from one another. Furthermore, in contrast to author A, any coincident authors).
We estimate the fraction of copied content in an article by reason that text reuse might go undetected is that the articles
dividing the number of 7-grams that have appeared previously from which the text is copied are also less well-read. In Sup-
by the total number of 7-grams from the article, without re- porting Information, section E, we present data showing that
moving the common 7-grams. Retaining both common and there is as well a negative correlation between the amount of
uncommon 7-grams in this instance gives a better measure of reused content from an article and the number of citations that
the extent to which authors rely on earlier texts. We exclude article received, even after screening for author self-copying.
from the dataset all articles with 95% or more overlap with This may result from authors working in overall less-active sub-
other articles, because these are typically articles erroneously ject areas (e.g., ref. 7) or may be due to a tendency for authors to
submitted more than once to arXiv after minor revisions and borrow text from authors of the same nationality even in more
are not the type of overlap at issue.k We also exclude from the active research areas, where articles by authors of that nation-
dataset all articles containing less than 5% reused content, ality are already correlated with fewer citations.
because these signify likely failure of the pdf-to-text conver-
sion, for example due to font issues, making the estimate of Author Demography. We now investigate whether articles con-
fraction of copied content unreliable. (Because we are retain- taining large amounts of reused text come from a uniform dis-
ing as well the common 7-grams for this purpose, all properly tribution of countries, or only from some more restricted set. To
converted texts will now exhibit some reused content.) We have submit, authors must register an email address with arXiv, and
also excluded review-type articles (conference proceedings, we assign a country of origin to each article using the country
theses, etc., as described in the lead-up to Fig. 2) to avoid code associated with the email address of the submitting author.
creating an artifactual correlation between reused content and
low number of citations.
Fig. 5 shows the number of citations plotted against the frac-
tion of copied content contained in each article. The wedge of
points at the left of the scatter plot shows that there is a higher
variance in the number of citations for articles containing low
amounts of copied content. Qualitatively speaking, it is more
likely for articles with a low fraction of copied content to receive
very many citations, whereas it is relatively rare for articles with
a high fraction of copied content to receive the same number
of citations. To quantify this, we also plot the median number
of citations as a function of fraction of copied content in red
and calculate a Spearman correlation coefficient of r = :739
P = 6:76 109 . This illustrates a strong decreasing trend of
citations for articles with increasing copied content. The pres-
ence or absence of reused text in an article thus serves as an
artifactual quality flag, with articles having large amounts of
unoriginal content cited less frequently.
Because the articles are less frequently cited and presumably
little read, it is tempting to speculate that the reused content in
these articles goes largely unnoticed and undetected. Another
COMPUTER SCIENCES
nationalities for unethical behaviorit is only to give a flavor for statements regarding relatively unambiguous textual overlap of
the demographics of the authors involved. Shedding light on the materials within arXiv. They are informational to readers, who
nature of the problem may help to address it. may find it useful to know when an article draws heavily from
Labeling the articles by country of origin according to email another. They can also be informative to authors from different
domain, we first use the fraction of copied content in each educational backgrounds, unaware that importing large sec-
article, as in Fig. 4. We ignore countries from which fewer than tions of text from their earlier articles, or from articles by
40 articles have been submitted, as providing insufficient data others, is not common practice.
to resolve a clear pattern. Setting the threshold for flagging The reaction of authors has fallen into three classes. (i) No
articles at 20% reused text identifies a group of countries with reaction whatsoever: Some authors even retain the admin note
more than 15% of their submissions flagged. For comparison, when replacing the submission with a new version, seemingly
under 5% of submissions from the United States and United oblivious to its appearance. (ii) Attempted remediation: Other
Kingdom and under 10% of submissions from China, Turkey,
authors try repeatedly to replace the submission with new ver-
and India were flagged by this criterion. Increasing the flagging
sions to remove or minimize the overlapping text. (The admin
threshold to articles with at least 50% reused content gives
note is retained if the amount of text overlap remains above the
roughly the same group of countries with over 5% of their
flagging threshold.) Some authors even politely request some
submissions flagged. For comparison, fewer than 1% of sub-
missions from the United States and United Kingdom are form of itemization of the overlapping text, apparently unable to
flagged by this criterion. Using an alternate criterion of more recall which parts of the text are original and which are reused
than 100 7-grams for AU, and more than 20 for CI or UN text from elsewhere. Although that detailed information is not pro-
reuse, again roughly the same group of countries appears with vided, some determined authors eventually succeed to eliminate
more than 11% of their articles flagged for text overlaps. With the note through successive revision. (iii) Indignant objection:
this criterion, fewer than 4% of articles from the United States Some authors have insisted that there could not possibly be text
and United Kingdom are flagged, and China, Turkey, and India overlap (although the heuristics in place to avoid flagging false
have 78% of their submissions flagged. positives have proven reliable). Other authors have suggested
The countries that consistently, regardless of metric, contain
the highest percentages of flagged submissions are (listed al-
phabetically) Bangladesh, Belarus, Bulgaria, Colombia, Cyprus, As discussed earlier, there is no systematic scan for text copied from sources outside of
arXiv, and no attempt to detect plagiarism as more generally defined, as unattrib-
Egypt, Iran, Jordan, Kazakhstan, Kyrgyzstan, Latvia, Lux-
uted use of ideas independent of copied text. The exceptions described earlier for re-
embourg, Micronesia, Moldova, Pakistan, Saudi Arabia, and view articles, theses, conference proceedings, book contributions, multipart articles,
Uzbekistan. To screen for countries dominated by a small group and so on, are respected, so that common-authored overlaps are not flagged in cases
of highly prolific authors, we also consider countries with high that seem to be accepted as common practice.
1. Sorokina D, Gehrke J, Warner S, Ginsparg P (2006) Plagiarism detection in arXiv. 10. Research Councils UK (2014) RCUK policy and guidelines on governance of good re-
Sixth International Conference on Data Mining. Available at arxiv.org/abs/cs/ search conduct (Research Councils UK, Swindon, UK). Available at www.rcuk.ac.uk/
0702012. Publications/researchers/grc/.
2. Schleimer S, Wilkerson D, Aiken A (2003) Winnowing: Local algorithms for docu- 11. American Physical Society (2014) Editorial policies and practices (Am Physical Soc,
ment fingerprinting. Proceedings of the ACM SIGMOD International Conference on Ridge, NY). Available at prd.aps.org/info/polprocd.html#guide.
Management of Data, pp 7685, implemented as an internet service (Moss: A 12. Biagioli M (2012) Recycling texts or stealing time?: Plagiarism, authorship, and credit
in science. Int J Cult Property 19(3):453476.
system for detecting software plagiarism). Available at theory.stanford.edu/aiken/
13. iParadigms (2014) CrossCheck powered by IThenticate (iParadigms, Oakland, CA).
moss/.
Available at www.ithenticate.com/products/crosscheck/.
3. Ginsparg P (2011) arXiv at 20. Nature 476:145147.
14. Nature Publishing Group (2010) Plagiarism pinioned. Nature 466:159160, 10.1038/
4. Nakanishi N, Ojima I (1999) Notes on unfair papers by Mebarki et al. on quantum
466159b.
nonsymmetric gravity. Available at httparxiv.org/abs/hep-th/9912039.
15. Butler D (2010) Journals step up plagiarism policing. Nature 466:167, 10.1038/466167a.
5. Giles J (2003) Preprint server seeks way to halt plagiarists. Available at www.nature.
16. Errami M, Garner H (2008) A tale of two citations. Nature 451(7177):397399.
com/nature/journal/v426/n6962/full/426007a.html. 17. Stretton S, et al. (2012) Publication misconduct and plagiarism retractions: A sys-
6. Feder T (2007) Experimenting with plagiarism detection on the Arxiv. Available at tematic, retrospective study. Curr Med Res Opin 28(10):15751583.
www.physicstoday.org/resource/1/phtoad/v60/i3/p30_s1?bypassSSO=1. 18. Bohannon J (2013) Whos afraid of peer review? Science 342(6154):6065.
7. Brumfiel G (2007) Turkish physicists face accusations of plagiarism. Available at www. 19. Introna L, Hayes N, Blair L, Wood E (2003) Cultural attitudes towards plagiarism (Univ
nature.com/nature/journal/v449/n7158/full/449008b.html. of Lancaster, Lancaster, UK). Available at https://sites.google.com/site/lucasintrona/
8. Cornell University (2014) Recognizing and avoiding plagiarism (Cornell Univ, Ithaca, home/reports. Accessed June 1, 2014.
NY). Available at plagiarism.arts.cornell.edu/tutorial/index.cfm. 20. Kiesler S, Kraut R, Resnick P, Kittur A (2010) Regulating behavior in on-line commu-
9. Office of Research Integrity (1994) ORI policy on plagiarism (US Dept of Health and nities. Building Successful Online Communities: Evidence-Based Social Design (MIT
Human Services, Washington, DC). Available at ori.hhs.gov/ori-policy-plagiarism. Press, Cambridge, MA).