Indian-American Leads 52 Scientists Creating Human Protein Database
By Francis C. Assisi
Baltimore, October 4 -- A massive international effort, involving a team of 52 scientists, led by Indian Americans, and supported by 26 young scientists in India, as well as others from the United States, Belgium, Denmark and Spain, has developed a human protein database they say will change the way biology is done.
"This is the real beginning of systems biology in the human," says principal investigator Akhilesh Pandey, M.D., Ph.D., assistant professor in the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins. "We wanted to make the best human protein database ever, so research could go faster and available information could be easier to find and easier to organize."
The team unveiled the online Human Protein Reference Database this week in the October issue of Genome Research. The database integrates a wealth of information relevant to the function of human proteins in health and disease. It currently contains scientist-compiled entries on the 3,000 most-studied human proteins, and is expected to hold comprehensive information on 10,000 human proteins by year's end, freely accessible to noncommercial researchers.
"We think this database is the most user-friendly and comprehensive and annotated resource so far for the proteins in it. Most features of proteins that biologists care about and would want to see are in one place here," claims Pandey, who graduated from the Armed Forces Medical College in Pune in 1988 and went on to obtain his Ph.D. from the University of Michigan.
Ease of use was a high priority for the database, Pandey said. For instance, a biologist looking up information on the breast cancer gene BRCA1 can search by any of its names and get a single entry containing everything: its alternate names, structure, function, sequence, how it's modified, known interactions with other proteins, where it's found in cells, and where it's found in the body.
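The search-by-any-name behavior Pandey describes can be pictured with a small sketch. This is only an illustration, not the database's actual design: the `ProteinEntry` fields, the `build_alias_index` and `lookup` helpers, and the placeholder names `ALIAS_A`, `ALIAS_B` and `PARTNER_X` are all hypothetical and do not come from the Human Protein Reference Database itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProteinEntry:
    """One consolidated record per protein (field names are hypothetical)."""
    primary_name: str
    alternate_names: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)


def build_alias_index(entries: List[ProteinEntry]) -> Dict[str, ProteinEntry]:
    """Map every known name, case-insensitively, to its single entry."""
    index: Dict[str, ProteinEntry] = {}
    for entry in entries:
        for name in [entry.primary_name, *entry.alternate_names]:
            index[name.upper()] = entry
    return index


def lookup(index: Dict[str, ProteinEntry], name: str) -> Optional[ProteinEntry]:
    """Return the entry for any of a protein's names, or None if unknown."""
    return index.get(name.strip().upper())


# Placeholder data for illustration only; not taken from the real database.
brca1 = ProteinEntry(
    "BRCA1",
    alternate_names=["ALIAS_A", "ALIAS_B"],
    interactions=["PARTNER_X"],
)
index = build_alias_index([brca1])
```

In this sketch, `lookup(index, "alias_a")` and `lookup(index, "BRCA1")` return the same record, which is the single-entry behavior the article describes.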
"We are providing for the first time a comprehensive picture of protein–protein interactions in humans," Pandey said.
The database includes each protein's known roles in health and disease and direct links to related scientific papers. Only experimentally proven or widely accepted facts about a protein are included, without mixing in unproven computer-generated predictions. In the future, the database team hopes the biology community will help to provide updates on proteins as they come.
Asked why he decided to develop yet another database when others already exist, Pandey explains: “We believe that biological databases are still in their early stages and no protein database can be considered as an established standard. We feel that a variety of databases trying to solve problems in diverse ways provide the biologists the possibility of choosing their favorite. Our approach is radically different from existing databases and we want to offer biologists the possibility of choosing instead of imposing one database by default. Besides, most of the databases are automated and ours is manually curated to avoid errors. We are also trying to provide information that few other databases provide.”
Twenty-six researchers at the Institute of Bioinformatics in Bangalore helped create the database by critically reviewing hundreds of thousands of scientific papers, drawing connections between papers and resolving inconsistencies. Each researcher read an average of at least 10 to 20 papers every day, with every protein reviewed twice.
"The numbers are closer to 50 a day, but people would tell me they don't believe me," Pandey said.
The Bangalore-based Institute of Bioinformatics is a not-for-profit organization established by 'The Genomics Research Trust,' emphasizing cutting-edge research in Databases, Computational Genomics, Proteomics and Comparative Genomics. The initial goal of this Institute is to create a freely available Human Protein Reference Database using open source technologies and to experimentally verify predicted human genes using molecular biology and proteomics-based methods. Pandey is the Chief Scientific Advisor for the Institute.
The Human Protein Reference Database project started off with the Online Mendelian Inheritance in Man database and also pulls information from smaller, existing databases to complete every protein's entry. Pandey feels the Human Protein Reference Database's strength is its more accurate and complete entries due to its emphasis on manual curation of entries, as opposed to the automated computer programs most databases employ.
The database has been under development since May 2002 and active for 5 months, receiving almost 2 million hits just from word of mouth and presentations at scientific meetings, Pandey said. Johns Hopkins Licensing and Technology Development is currently establishing criteria for companies interested in using the database to pay fees under licensing arrangements.
Pandey says advances in technology have made getting data much easier, but processing it and interpreting observations are now the big hurdles in laboratories.
"It has remained difficult to put together a big picture of biology, to see how one set of observations intersects with and complements others," he says. "With this single database, biologists now will be able to quickly review what is known about the proteins and how they interact, speeding the creation of new hypotheses to test in the lab."
The 3,000 proteins currently in the database are known to interact with anywhere from tens to hundreds of other proteins. Online, a user can pull up a visual web of protein-protein interactions with just the click of a mouse.
"The entries have been critically reviewed, making the information in the database as accurate and complete as possible," says Pandey. "Scientists can even link directly to the scientific paper behind an item, to judge for themselves its validity."
"The richness of the database is astounding, since it was created in such a short time by expert reviews of individual publications," says Aravinda Chakravarti, Ph.D., director of the McKusick-Nathans Institute and a co-author on the paper. "This would have been impossible without scientists to review the literature and computational biologists to make a database that is truly easy to use."
Other Indo-American scientists involved in the project, besides Pandey and Chakravarti, are: Suraj Peri and Chandra Jonnalagadda of Johns Hopkins, Muneesh Tewari of Harvard Medical School, and Arun Chinnaiyan of the University of Michigan.