October 2, 2003
MEDIA CONTACT: Joanna Downer
PHONE: 410-614-5105 / Pager: 410-283-4976
Thorough, Searchable Database of Human Proteins Unveiled
Like expert curators who verify and create catalogs of the world's great art collections, an international team of scientists has developed a human protein database they say will change the way biology is done. The team unveils the online Human Protein Reference Database in the October issue of Genome Research.
The database, which currently contains scientist-compiled entries on the 3,000 most-studied human proteins, including their known roles in health and disease, is expected to hold comprehensive information on 10,000 human proteins by year's end. Importantly, this database includes known interactions between proteins, creating a web that ties separate discoveries together.
"This is the real beginning of systems biology in the human," says principal investigator Akhilesh Pandey, Ph.D., assistant professor in the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins. "We wanted to make the best human protein database ever, so research could go faster and available information could be easier to find and easier to organize."
Pandey says advances in technology have made getting data much easier, but processing it and interpreting observations are now the big hurdles in laboratories.
"It has remained difficult to put together a big picture of biology, to see how one set of observations intersects with and complements others," he says. "With this single database, biologists now will be able to quickly review what is known about the proteins and how they interact, speeding the creation of new hypotheses to test in the lab."
The 3,000 proteins currently in the database are known to interact with anywhere from tens to hundreds of other proteins. Online, a user can pull up a visual web of protein-protein interactions with just the click of a mouse.
"The entries have been critically reviewed, making the information in the database as accurate and complete as possible," says Pandey. "Scientists can even link directly to the scientific paper behind an item, to judge for themselves its validity."
To create the database entries, dozens of trained biologists, most at the Institute of Bioinformatics in India, started with the database Online Mendelian Inheritance in Man (OMIM), the offspring of a paper catalog of disease genes started in 1966 by Victor A. McKusick, M.D., University Professor of Medical Genetics at Hopkins.
Focusing on these genes' proteins, the scientists critically reviewed hundreds of thousands of scientific papers, making connections between papers and resolving inconsistencies -- something automated computer programs cannot do, says Pandey. They also pulled information from smaller, existing databases to complete each protein's entry.
"We believe that manual curation -- lots of scientists poring through the literature -- is the key to building a more accurate and more complete database," says Pandey, who serves as chief scientific adviser to the Institute of Bioinformatics. "Eventually, we hope the database will be managed by the larger community of scientists, because it will be most useful if those who know these proteins best take responsibility for keeping entries up to date and accurate."
The database currently contains everything that's known about proteins involved in diseases, such as so-called breast cancer genes BRCA1 and BRCA2, and proteins in key pathways, such as families of enzymes that modify other proteins. It includes only experimentally proven or widely accepted facts about the proteins, without mixing in computer-generated predictions the way some other databases do, says Pandey.
The online database is also easy to use, in large part because those who designed it are experts in both computer science and biology, he adds. A biologist looking for information about BRCA1, for example, can search by any of its names and get a single entry that contains everything -- its alternative names, structure, function and sequence, how it's modified, other proteins with which it interacts, where it's found in cells, where it's found in the body and links to the papers that say so.
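The lookup described above, where a search by any of a protein's names returns one consolidated entry, can be pictured with a minimal sketch. This is an illustration only, not the database's actual code or schema: the class names `ProteinEntry` and `ProteinIndex`, the field names and the sample values are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    """One curated record. Field names are hypothetical, not HPRD's schema."""
    primary_name: str
    alternative_names: list[str] = field(default_factory=list)
    interactions: set[str] = field(default_factory=set)
    # Open-ended annotations (function, localization, citations, ...).
    attributes: dict[str, str] = field(default_factory=dict)


class ProteinIndex:
    """Maps every known name of a protein to a single shared entry."""

    def __init__(self) -> None:
        self._entries: dict[str, ProteinEntry] = {}
        self._alias_to_primary: dict[str, str] = {}

    def add(self, entry: ProteinEntry) -> None:
        self._entries[entry.primary_name] = entry
        # Register the primary name and each alias, ignoring case.
        for name in [entry.primary_name, *entry.alternative_names]:
            self._alias_to_primary[name.upper()] = entry.primary_name

    def find(self, name: str) -> ProteinEntry | None:
        primary = self._alias_to_primary.get(name.strip().upper())
        return self._entries.get(primary) if primary is not None else None


# Sample data for illustration only; not copied from the database.
index = ProteinIndex()
index.add(
    ProteinEntry(
        primary_name="BRCA1",
        alternative_names=["RNF53"],
        interactions={"BARD1"},
        attributes={"localization": "nucleus"},
    )
)

by_alias = index.find("rnf53")
by_primary = index.find("BRCA1")
```

In this sketch both lookups return the same object, so a user who knows only an alternative name still reaches the full record, including its interaction partners.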
"The richness of the database is astounding, since it was created in such a short time by expert reviews of individual publications," says Aravinda Chakravarti, Ph.D., director of the McKusick-Nathans Institute and a co-author on the paper. "This would have been impossible without scientists to review the literature and computational biologists to make a database that is truly easy to use."
Academic researchers will have free access to the database. Johns Hopkins Licensing and Technology Development is currently establishing licensing criteria for companies interested in using the database. The database has been active for five months and has elicited almost 2 million hits, simply from word-of-mouth and presentations at scientific meetings, says Pandey.
The Human Protein Reference Database was built using freely available, so-called open-source computer code from ZOPE (Z Object Publishing Environment), which experts at the Institute of Bioinformatics adjusted to fit the project's needs. One of the benefits of using an object-oriented structure like ZOPE, Pandey says, is that there's no limit on the number of entries (i.e., proteins) or the number of characteristics that can be included.
Authors on the paper are Pandey, Chakravarti, Suraj Peri, Daniel Navarro, Ramars Amanchy, Troels Kristiansen, Chandra Kiran Jonnalagadda, Mads Gronborg, Nieves Ibarrola, Chi Dang, Joe Garcia, Jonathan Pevsner and Ada Hamosh of Johns Hopkins; Vineeth Surendranath, Vidya Niranjan, Babylakshmi Muthusamy, T.K.B. Ghandi, Nandan Deshpande, K. Shanker, Shiva Sanker H.N., Rashmi Prasad B., Ramya M.A., Chandrika K.N., Padma N., Harsha H.C., Yatish A.J., Kavitha Poovaiah M., Minal Menezes, Dipanwita Roy Choudhury, Shubha Suresh, Neelanjana Ghosh, Saravana R., Sreenath Chandran, Subhalakshmi Krishna, Mary Joy, Sanjeev Anand, Madavan V., Ansamma Joseph and Krishna Deshpande of the Institute of Bioinformatics, Bangalore, India; Guang Wong and Lily Huang of the Whitehead Institute for Biomedical Research; William Schiemann of the National Jewish Medical and Research Center, Denver, Colo.; Stefan Constantinescu of the Ludwig Institute for Cancer Research, Brussels, Belgium; Roya Khosravi-Far, Hanno Steen and Muneesh Tewari of Harvard Medical School; Saghi Ghaffari of Mount Sinai School of Medicine; Gerard Blobe of Duke University; Peri, Kristiansen, Gronborg, Ole Jensen and Peter Roepstorff of the University of Southern Denmark, Odense; and Arul Chinnaiyan of the University of Michigan. Navarro is also affiliated with the University of Navarra, Pamplona, Spain.
Pandey serves as chief scientific adviser to the Institute of Bioinformatics. The terms of this arrangement are being managed by The Johns Hopkins University in accordance with its conflict of interest policies.
On the Web:
Genome Research: http://www.genome.org
Human Protein Reference Database: http://www.hprd.org
Institute of Bioinformatics: http://www.ibioinformatics.org