This article is part of the supplement: The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge


Overview of BioCreative II gene normalization

Alexander A Morgan1, Zhiyong Lu2, Xinglong Wang3, Aaron M Cohen4, Juliane Fluck5, Patrick Ruch6, Anna Divoli7, Katrin Fundel8, Robert Leaman9, Jörg Hakenberg10, Chengjie Sun11, Heng-hui Liu12, Rafael Torres13, Michael Krauthammer14, William W Lau15, Hongfang Liu16, Chun-Nan Hsu17, Martijn Schuemie18, K Bretonnel Cohen19 and Lynette Hirschman19*

Author affiliations

1 Biomedical Informatics, Stanford University, 251 Campus Drive, Stanford, CA, 94305, USA

2 Center for Computational Pharmacology, University of Colorado School of Medicine, PO Box 6511, Aurora, Colorado, 80045, USA

3 School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh, EH8 9LW, UK

4 Oregon Health & Science University, 3181 SW Sam Jackson Park Road, Portland, Oregon, 97239, USA

5 Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, D-53754, Sankt Augustin, Germany

6 University and Hospitals of Geneva, 24 Micheli du Crest, 1201 Geneva, Switzerland

7 School of Information, University of California, Berkeley, 102 South Hall, Berkeley, California, 94720, USA

8 Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstr. 17, 80333 Munich, Germany

9 Department of Computer Science and Engineering, Arizona State University, 699 S. Mill Avenue, Tempe, Arizona, 85281, USA

10 Biotechnological Centre, Technische Universität Dresden, Tatzberg 47-51, 01307 Dresden, Germany

11 School of Computer Science and Technology, Harbin Institute of Technology, West Da-Zhi Street, Mailbox 319, No. 92, Harbin, 150001, China

12 Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan

13 Bioalma, Ronda de Poniente, 4, 2 C-D, Tres Cantos, Madrid, E-28760, Spain

14 Department of Pathology, Yale University School of Medicine, 300 Cedar Street TAC 309, New Haven, Connecticut, 06510, USA

15 Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, 9000 Rockville Pike, Bethesda, Maryland, 20892, USA

16 Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, 4000 Reservoir Rd. NW, Washington, District of Columbia, 20057, USA

17 Institute of Information Science, Academia Sinica, 128 Academic Road, Section 2, Taipei, Taiwan

18 Biosemantics Group, Medical Informatics Department, Erasmus MC University Medical Center, Dr. Molewaterplein 50, 3015GE, Rotterdam, The Netherlands

19 Information Technology Center, The MITRE Corporation, 202 Burlington Road, Bedford, Massachusetts, 01730, USA


Citation and License

Genome Biology 2008, 9(Suppl 2):S3  doi:10.1186/gb-2008-9-s2-s3

Published: 1 September 2008



The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.


Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (the balanced harmonic mean of precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers yielded improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers.
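The scoring and the simple voting idea described above can be illustrated with a minimal sketch. This is not the organizers' scoring script or any participant's system: the function names `precision_recall_f` and `vote`, the micro-averaging over all documents, and the toy document and gene identifiers are all assumptions made for illustration.

```python
from collections import Counter


def precision_recall_f(predicted, gold):
    """Micro-averaged precision, recall, and balanced F-measure.

    Both arguments map a document ID to a set of gene identifiers.
    (Micro-averaging is an assumption of this sketch.)
    """
    # True positives: identifiers predicted for a document that are also in its gold set.
    tp = sum(len(predicted.get(doc, set()) & ids) for doc, ids in gold.items())
    n_pred = sum(len(ids) for ids in predicted.values())
    n_gold = sum(len(ids) for ids in gold.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def vote(runs, min_votes):
    """Keep each (document, identifier) pair proposed by at least min_votes runs."""
    counts = Counter()
    for run in runs:
        for doc, ids in run.items():
            for gene_id in ids:
                counts[(doc, gene_id)] += 1
    combined = {}
    for (doc, gene_id), n in counts.items():
        if n >= min_votes:
            combined.setdefault(doc, set()).add(gene_id)
    return combined


# Toy data: invented placeholders, not real Entrez Gene identifiers.
gold = {"doc1": {"g1", "g2"}, "doc2": {"g3"}}
runs = [
    {"doc1": {"g1", "g2"}, "doc2": {"g3", "g9"}},
    {"doc1": {"g1"}, "doc2": {"g3"}},
    {"doc1": {"g1", "g5"}, "doc2": set()},
]

majority = vote(runs, min_votes=2)  # pairs proposed by at least two of three runs
pooled = vote(runs, min_votes=1)    # union of all runs, a 'maximum recall' pool

print(precision_recall_f(majority, gold))
print(precision_recall_f(pooled, gold))
```

On this toy data the pooled union recovers every gold identifier at lower precision, while majority voting trades recall for precision, which is the same trade-off the 'maximum recall' figures above reflect (763/785 ≈ 0.97 recall at precision 0.23).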


Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results suggest that such systems show promise as tools to link the literature with biological databases.