Halldor serenades

8 03 2010





Is inconsistency in biological annotations wrong?

8 03 2010

A wealth of information science research explores consistency between different indexers. Give two professional, well-trained indexers the same item and they will probably apply different terms. From a subject access perspective this might create problems – there may be a mismatch between the terms a user expects and the terms different indexers have applied.

Inconsistency in indexing therefore has implications for information retrieval. Inconsistency is a natural consequence of the way indexers appraise resources and seek to apply what they consider to be the best terms. Can index terms be wrong? Perhaps, but for every term there are likely to be arguments for and against its application.

Social tagging on sites like Delicious can also be explored for consistency. Wolfram et al. demonstrate that vector space modeling (which is traditionally used in information retrieval) can be applied to tag populations taken from a CiteULike dataset to measure consistency between taggers. They found that tagging consistency did not vary between subject areas.
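To make the vector space idea concrete, here is a minimal sketch in Python. It is not the method from the Wolfram et al. paper: it simply assumes each tagger's tags for one item can be treated as a term vector, and scores consistency as the average cosine similarity between each tagger and the pooled (centroid) vector. The function names and the example tags are made up for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tagging_consistency(tag_sets):
    """Average similarity of each tagger's vector to the pooled vector.

    tag_sets: one collection of tags per tagger, all for the same item.
    Cosine ignores vector length, so the summed vector stands in for
    the centroid here.
    """
    vectors = [Counter(tags) for tags in tag_sets]
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)


# Hypothetical tags applied by three taggers to the same article.
taggers = [
    {"ontology", "annotation", "genes"},
    {"ontology", "bioinformatics"},
    {"gene-ontology", "annotation"},
]
print(round(tagging_consistency(taggers), 3))
```

A score near 1 means the taggers largely agree; a lower score means their vocabularies barely overlap. Note that this says nothing about whether the tags are any good, only whether they are alike.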

Sadly, the authors do not go so far as to state whether they felt user tags were consistently good or consistently bad.

I think biologists would consider Gene Ontology annotations to be either right or wrong. Annotations can be critically appraised as valid or invalid, based on the biology.

The human element in GO annotations, the fuzzy, inconsistent element in manually, or even automatically, tagging genes with functional terms, is ignored.

What kind of biology do we have if we accept that functional labels for genes – ‘lipid transporter activity’, ‘cardiac atrium development’ – are not true or false, and that different biologists might apply them within their personal theoretical framework entirely as they feel?

Wolfram, D., Olson, H. A., and Bloom, R. (2009). Measuring consistency for multiple taggers using vector space modeling. Journal of the American Society for Information Science and Technology, 60(10):1995-2003.

dx doi 10.1002/asi.21123





A Wordle image created from a MEDLINE search for Gene Ontology

9 02 2010

Wordle image created from MEDLINE search for gene ontology

Created from http://www.wordle.net/





The GIFtS of gene functional knowledge

9 02 2010

Every year, over 2 million working hours are wasted across the planet trying to come up with acronyms for things (like organisations or software) which also spell out words.

I made this statistic up, but this should not distract us from the GeneCards Inferred Functionality Scores (GIFtS) tool which offers an estimate for the amount of functional knowledge known about different genes. By drawing together data from different biomedical databases, GIFtS scores genes by the number of resources containing any information about said gene. For more information, see the GeneCards website.

For example, as of today BCL2 has a GIFtS score of 76 out of 100, meaning that about 76% of the biomedical and genetic data sources GeneCards draws on currently contain information pertaining to the structure and function of BCL2.
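The scoring idea can be sketched in a few lines of Python. This is an illustration, not the GeneCards implementation: it assumes the score is simply the share of consulted sources that hold any data on the gene, scaled to 100, and the source names below are placeholders.

```python
def gifts_score(sources_with_data, all_sources):
    """Share of sources holding data on a gene, scaled to 0-100.

    Assumption: every source counts equally and mere presence of data
    is enough to score, whatever its quality or quantity.
    """
    all_sources = set(all_sources)
    if not all_sources:
        raise ValueError("at least one source is required")
    present = set(sources_with_data) & all_sources
    return round(100 * len(present) / len(all_sources))


# Placeholder source names, not the real GeneCards source list.
ALL_SOURCES = {"source_a", "source_b", "source_c", "source_d"}
print(gifts_score({"source_a", "source_b", "source_c"}, ALL_SOURCES))
```

Under this reading, a well-studied gene scores highly because it turns up almost everywhere, while a poorly characterised one appears in only a handful of sources.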

The authors of this algorithm and the GeneCards database suggest their score might prove useful in gene analyses. If a biologist performs a gene expression study, GIFtS scores could be used to identify genes with little functional information. Such genes may be interesting to researchers (although I can also imagine they might be avoided like the plague).

It is interesting to note that there is a strong correlation between the number of publications mentioning a gene and these GIFtS scores. This suggests the number of publications provides a pretty good estimate of how much is known about individual genes, although GIFtS scores do reveal unseen patterns for mystery genes with very few publications, such as non-protein-coding genes.

Harel, A., Inger, A., Stelzer, G., Almashanu, L. S., Dalah, I., Safran, M., and Lancet, D. (2009). GIFtS: annotation landscape analysis with GeneCards. BMC Bioinformatics, 10(1):348+.

dx doi 10.1186/1471-2105-10-348 and Full text available





Does not mean not for NOT qualifiers in Gene Ontology files?

25 01 2010

Gene associations between entities and GO terms are saved by the Gene Ontology consortium to large annotation files in a simple file format, available at the link below:

http://www.geneontology.org/GO.current.annotations.shtml

Annotation files are divided up by species and are used to create automatic Gene Ontology links to lots of different external database resources such as Entrez-Gene.

Moreira et al. (2007) criticised this annotation file format, citing semantic ambiguities inherent to the file structure which could be solved using OWL. A big problem, for example, is the use of the NOT qualifier, which marks an annotation, often manually curated, as stating that an ontology term is NOT associated with a gene.

So the gene FMN1 in S. cerevisiae (according to QuickGO) is NOT associated with FMN adenylyltransferase activity.

  • See gene FMN1 (riboflavin kinase) Saccharomyces cerevisiae in QuickGO

The EntrezGene service automatically extracts GO annotations, presumably from recent annotation files. The paper above noted that the FMN1 entry in EntrezGene did not highlight the negative relationship between FMN1 and FMN adenylyltransferase activity because it failed to infer the appropriate meaning from the NOT qualifier.

The entry is still labelled with the GO annotation ‘FMN adenylyltransferase activity’, even though PMID 10887197 suggests FMN1 in S. cerevisiae does not display said activity.

This is a problem.
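A consumer of these files can avoid the mistake by reading the qualifier column before treating a row as a positive association. The Python sketch below assumes the tab-delimited GAF 2.x layout, where the fourth column holds pipe-separated qualifiers. The two sample rows are written by hand for illustration (the object ID and the first reference are placeholders), so they should not be read as real annotation data, and the GO identifiers should be checked against the current ontology.

```python
# 0-based positions of the GAF 2.x columns used in this sketch.
SYMBOL, QUALIFIER, GO_ID, REFERENCE, EVIDENCE = 2, 3, 4, 5, 6


def _row(symbol, qualifier, go_id, reference):
    # Hand-written illustrative row; "S000000000" is a placeholder ID.
    return "\t".join(
        ["SGD", "S000000000", symbol, qualifier, go_id, reference, "IDA", "", "F"]
    )


SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.0",
    _row("FMN1", "", "GO:0008531", "PMID:0000000"),      # placeholder reference
    _row("FMN1", "NOT", "GO:0003919", "PMID:10887197"),  # negated annotation
])


def parse_annotations(lines):
    """Yield one dict per annotation row, flagging NOT-qualified rows."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        qualifiers = {q for q in cols[QUALIFIER].split("|") if q}
        yield {
            "symbol": cols[SYMBOL],
            "go_id": cols[GO_ID],
            "negated": "NOT" in qualifiers,
            "reference": cols[REFERENCE],
            "evidence": cols[EVIDENCE],
        }


def asserted_terms(annotations, symbol):
    """GO terms positively associated with a gene symbol."""
    return {a["go_id"] for a in annotations
            if a["symbol"] == symbol and not a["negated"]}


def negated_terms(annotations, symbol):
    """GO terms explicitly stated NOT to apply to a gene symbol."""
    return {a["go_id"] for a in annotations
            if a["symbol"] == symbol and a["negated"]}


annotations = list(parse_annotations(SAMPLE_GAF.splitlines()))
print("asserted:", asserted_terms(annotations, "FMN1"))
print("negated: ", negated_terms(annotations, "FMN1"))
```

A tool that reads only the gene and GO ID columns would list both terms as functions of FMN1, which is the kind of interpretation error Moreira et al. describe.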

Moreira, D. A., Shah, N. H., and Musen, M. A. (2007). Interpretation errors related to the GO annotation file format. AMIA … Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, pages 538-542.

PMID 18693894 and Full text available HERE




CoLIS 7 is in London this year

22 01 2010

The Seventh International Conference on Conceptions of Library and Information Science is coming to London from 21 to 24 June 2010. The overall theme is:

Integration in the information sciences: unity in diversity

This conference will explore the integration and underlying unity of the information sciences, as both academic disciplines and as work practices.

More information can be found on the CoLIS 7 website:

http://colis.soi.city.ac.uk/





Not too much information when using the Gene Ontology

19 01 2010

I wonder how often results in the biological literature are reproducible. With standards for data description, common analysis tools and the facility to add supplementary information to published articles, it *should* be really easy to provide enough detail to make bioinformatic-type analyses totally transparent.

Rhee et al. (2008) offer suggestions to authors using the Gene Ontology and its annotations on how to avoid certain pitfalls when analysing biological data with GO. They conclude:

“…it is crucial for any analysis to cite data sources (including the version of ontology, date of annotation files, numbers and types of annotations used, versions and parameters of software, and so on) to ensure that results are fully reproducible.”

I don’t think the authors would have put this into their review if biologists were already citing full data sources. What is the point of not fully citing data sources? Is it a recurring oversight or something else?
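As a rough illustration of how little effort this needs, the details listed in the quote fit into one small record saved alongside the results. The Python sketch below is only one possible shape: the class, its field names and every value in the example are hypothetical, not a standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class GOAnalysisProvenance:
    """Details needed to rerun a GO-based analysis (hypothetical fields)."""
    ontology_version: str
    annotation_file: str
    annotation_file_date: str
    annotation_count: int
    evidence_codes: tuple
    software: str
    software_version: str
    parameters: dict = field(default_factory=dict)

    def to_json(self):
        # Sorted keys keep the record stable across runs, so it diffs cleanly.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Every value below is made up for the example.
record = GOAnalysisProvenance(
    ontology_version="2010-01-19",
    annotation_file="gene_association.sgd",
    annotation_file_date="2010-01-16",
    annotation_count=12345,
    evidence_codes=("IDA", "IMP"),
    software="example-enrichment-tool",
    software_version="0.1",
    parameters={"p_value_cutoff": 0.05, "correction": "bonferroni"},
)
print(record.to_json())
```

Attached as supplementary information, a record like this would answer most of the questions a reader needs to ask before trying to reproduce the analysis.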

Rhee, S. Y., Wood, V., Dolinski, K., and Draghici, S. (2008). Use and misuse of the Gene Ontology annotations. Nature Reviews Genetics, 9(7):509-515.

dx doi 10.1038/nrg2363 and Full text available HERE





Can a computer be creative?

13 01 2010

Whilst reading a paper on creativity in the sciences, it occurred to me that e-science methodologies – which depend heavily on computers and informatic solutions to provide leverage on complex scientific problems – do little to factor in the very important idea of research creativity.

Perhaps e-science enables scientists to be more creative, because they are able to handle much larger datasets than they would otherwise be able to. Data produced by CERN could not be analysed on the back of a napkin.

And yet e-science infrastructure is dependent on a certain amount of control and standardisation. The Gene Ontology, for example, both standardises terminology and meaning for biological language, and determines that which is permitted. Biologists are not given free rein to edit the ontology to their own purposes, which on the one hand is good (everyone is analysing bio-data using the same tool) and on the other hand is bad (everyone is analysing bio-data using the same tool).

If the Gene Ontology contains flaws, or weaknesses, or limitations, then these will shape the kind of explanations biologists can infer from results. It is an example of an e-science approach gifting authority and control at the expense of freedom and creativity.

Perhaps one day computers will be creative. Perhaps they will be able to think, and guess, and imagine novel solutions to difficult and mysterious problems.

Until that day though, I think it is important to remember that a tension exists. The thinking scientist, engaged in a human endeavour little different to the composition of music or the act of painting, may crave standardisation and informatic tools to tackle complex problems, yet risks adopting these tools at the expense of the freedom to imagine what is not standard, what is not paradigmatic.

Heinze, T., Shapira, P., Senker, J., and Kuhlmann, S. (2007). Identifying creative research accomplishments: Methodology and results for nanotechnology and human genetics. Scientometrics, 70(1):125-152.

dx doi 10.1007/s11192-007-0108-6 and Full text available HERE





Life scientists just Google it

17 12 2009

Interesting set of case studies recently published by the Research Information Network under the title, ‘Patterns of information use and exchange: case studies of researchers in the life sciences’.

Informal exchange of information was found to be very important in the day-to-day work of life scientists, as were simple search solutions to information needs – i.e., biologists just Google it first.

The serendipitous nature of the results provided by search engines like Google, with the extra context information they can provide, is relevant. Life scientists aren’t so different to the rest of us in our searching: they look for easy options, aim at getting lucky and don’t have the time to learn new tools.

Aggregators and meta-searches might be a good solution to some information needs in the sciences. Simple interfaces, straightforward customisations and ‘pushing’ likely relevant information to desktops, leaving researchers to graze at their leisure, might serve better than complex specialised tools, sites you have to visit and search, or very specific searches on certain topics.

So if you’re interested in the developmental biology of Drosophila, perhaps a simple setup where articles, datasets, blog posts, sequence information and commercial kits are combined in a regularly updated page might be useful.

And of course, a great big conspicuous Google search box at the top of the page. Who am I to deny what biologists want?





The Wayne’s World 2 approach to research data

17 12 2009

Research data infrastructure planning and associated initiatives in the UK (for example) follow the Wayne’s World 2 model.

Wayne asks Jim Morrison what he should do with his life. Jim tells him to put on a concert:

“How will I get the bands to come?”
“If you book them, they will come.”

The approach is top-down. The logic is thus: UK research data is a national resource, so we need national planning consortia to construct a research data management agenda. Create policies, tools, an infrastructure, and researchers will adopt.

Book the bands, says Jim, and they will come.

I am not optimistic this is going to work, and a comment in response to a RIN blog post on this issue reflects this. Can a ‘coherent national framework’ for research data management be imposed on the academic landscape in the UK?

If researchers exist as small communities, with their own idiosyncrasies and habits, a UK-wide strategy for data management may simply not work. Why not go small? Small, local projects, funded to encourage small networks of groups to come up with their own solutions?

The One-Big-Happy-Family image of researchers creating data, curating it and sharing it for the common good could be a mirage. A centralised approach won’t work and, if progress on a national e-infrastructure in the last few years is anything to go by, is not working.

If you create an infrastructure, UK scientists will sit on their hands and refuse to come. Perhaps we should turn the problem round and start small instead.