
05 May 2010

The GNU/Linux Code of Life

After I published Rebel Code in 2001, there was a natural instinct to think about writing another book (a natural masochistic instinct, I suppose, given the work involved). I decided to write about bioinformatics – the use of computers to store, search through, and analyse the billions of DNA letters that started pouring out of the genomics projects of the 1990s, culminating in the sequencing of the human genome in 2001.

One reason I chose this area was the amazing congruence between the battle between free and closed-source software and the fight to place genomic data in the public domain, for all to use, rather than having it locked up in proprietary databases and enclosed by gene patents. As I like to say, Digital Code of Life is really the same story as Rebel Code, with just a few words changed.

Another reason for the similarity between the stories is the fact that genomes can be considered as a kind of program – the “digital code” of my title. As I wrote in the book:

In 1953, computers were so new that the idea of DNA as not just a huge digital store but a fully-fledged digital program of instructions was not immediately obvious. But this was one of the many profound implications of Watson and Crick's work. For if DNA was a digital store of genetic information that guided the construction of an entire organism from the fertilised egg, then it followed that it did indeed contain a preprogrammed sequence of events that created that organism – a program that ran in the fertilised cell, albeit one that might be affected by external signals. Moreover, since a copy of DNA existed within practically every cell in the body, this meant that the program was not only running in the original cell but in all cells, determining their unique characteristics.

That characterisation of the genome is something of a cliché these days, but back in 2003, when I wrote Digital Code of Life, it was less common. Of course, the interesting question is: to what extent is the genome *really* like an operating system? What are the similarities and differences? That's what a bunch of researchers wanted to find out by comparing the Linux kernel's control structure to that of the bacterium Escherichia coli:

The genome has often been called the operating system (OS) for a living organism. A computer OS is described by a regulatory control network termed the call graph, which is analogous to the transcriptional regulatory network in a cell. To apply our firsthand knowledge of the architecture of software systems to understand cellular design principles, we present a comparison between the transcriptional regulatory network of a well-studied bacterium (Escherichia coli) and the call graph of a canonical OS (Linux) in terms of topology and evolution.

We show that both networks have a fundamentally hierarchical layout, but there is a key difference: The transcriptional regulatory network possesses a few global regulators at the top and many targets at the bottom; conversely, the call graph has many regulators controlling a small set of generic functions. This top-heavy organization leads to highly overlapping functional modules in the call graph, in contrast to the relatively independent modules in the regulatory network.

We further develop a way to measure evolutionary rates comparably between the two networks and explain this difference in terms of network evolution. The process of biological evolution via random mutation and subsequent selection tightly constrains the evolution of regulatory network hubs. The call graph, however, exhibits rapid evolution of its highly connected generic components, made possible by designers' continual fine-tuning. These findings stem from the design principles of the two systems: robustness for biological systems and cost effectiveness (reuse) for software systems.
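
The "top-heavy" versus "bottom-heavy" contrast is easier to picture with a toy example. The sketch below is mine, not the paper's: the edge lists and names (crp, fnr, memcpy, kmalloc) are invented purely to show how counting who controls whom in a directed graph separates a network with a few master regulators from one whose many callers funnel into a few heavily reused routines.

```python
# Toy illustration (not from the paper) of the contrast described in the
# abstract. Both networks are directed graphs of (controller, controlled)
# edges; we just count distinct controllers and targets in each.
from collections import defaultdict

def degree_profile(edges):
    """Return (distinct regulators, distinct targets, average targets per
    regulator) for a directed edge list."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    regulators = len(out_deg)
    return regulators, len(in_deg), sum(out_deg.values()) / regulators

# Hypothetical miniature "regulatory network": few master regulators, many targets.
regulatory = [("crp", g) for g in ("g1", "g2", "g3", "g4", "g5", "g6")] + \
             [("fnr", g) for g in ("g7", "g8", "g9")]

# Hypothetical miniature "call graph": many callers share a few generic routines.
call_graph = [(f, "memcpy") for f in ("f1", "f2", "f3", "f4", "f5")] + \
             [(f, "kmalloc") for f in ("f2", "f3", "f6", "f7")]

print("regulatory:", degree_profile(regulatory))   # few regulators, many targets
print("call graph:", degree_profile(call_graph))   # many callers, few shared callees
```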

The paper's well worth reading, but if you find it heavy going (it's really designed for bioinformaticians and their ilk), there's an excellent, easy-to-read summary and analysis by Carl Zimmer in Discover magazine. Alternatively, you could just buy a copy of Digital Code of Life...

Follow me @glynmoody on Twitter or identi.ca.

06 November 2009

Microsoft's Biological Implants

Microsoft's up to its old tricks of offering pretty baubles to the innocent with The Microsoft Biology Foundation:

The bioinformatics community has developed a strong tradition of open development, code sharing, and cross-platform support, and a number of language-specific bioinformatics toolkits are now available. These toolkits serve as valuable nucleation points for the community, promoting the sharing of code and establishing de facto standards.

The Microsoft Biology Foundation (MBF) is a language-neutral bioinformatics toolkit built as an extension to the Microsoft .NET Framework. Currently it implements a range of parsers for common bioinformatics file formats; a range of algorithms for manipulating DNA, RNA, and protein sequences; and a set of connectors to biological Web services such as NCBI BLAST. MBF is available under an open source license, and executables, source code, demo applications, and documentation are freely downloadable from the link below.

Gotta love the segue from "strong tradition of open development, code sharing and cross-platform support" to "here, take these patent-encumbered .NET Framework toys to play with".

The point being, of course, that once you have dutifully installed the .NET framework, with all the patents that Microsoft claims on it, and become locked into it through use and habit, you are part of the Microsoft-controlled ecosystem. And there you are likely to stay, since Microsoft doesn't even pretend any of this stuff will be ported to other platforms.

For, under the misleading heading "Cross-platform and interoperability" it says:

MBF works well on the Windows operating system and with a range of Microsoft technologies.

Yeah? And what about non-Microsoft operating systems and technologies?

We plan to work with the developer community to take advantage of the extensibility of MBF and support an increasing range of Microsoft and non-Microsoft tools as the project develops.

Well, that's a complete irrelevance to being cross-platform: it just says it'll work with other stuff - big deal.

If I were a biologist I'd be insulted at this thinly-disguised attempt to implant such patent-encumbered software into the bioinformatics community, which has a long and glorious tradition of supporting free software that is truly free and truly cross-platform, and thus to enclose one of the most flourishing and vibrant software commons.

Follow me @glynmoody on Twitter or identi.ca.

01 June 2007

Maybe Genomics is Getting a Little Too Personal

So Jim Watson's genome will soon be made public. But not all of it:

the only deliberate omission from Watson's sequence is that of a gene linked to Alzheimer's disease, which Watson, who is now 79, asked not to know about because it is incurable and claimed one of his grandmothers.

The trouble is, the better our bioinformatics gets, the more genes we will be able to analyse usefully, and the better our ability to make statistical predictions from them. Which means that more and more people will be snipping bits out of their public genomes in this way. And which also means that many of us will never put any of our genome online.

12 April 2007

No (Wo)man is a (Genomic) Island

Bioinformatics is wonderful when it comes to elucidating the structure of genomes. But it can also be applied in other, rather less laudable ways, to allow likely matches to be found on DNA databases even when no DNA sample has been given, thanks to

statistical techniques which match DNA on the database to relatives, according to Dr Pounder, a privacy law specialist at Pinsent Masons, the law firm behind OUT-LAW.COM. These techniques use the genetic fact that an individual's DNA sample is related to the DNA of close family members.

Given enough computing power, we are all family. No (wo)man is a (genomic) island.

04 April 2007

Open Genomics, Closed Minds

One of the great things about open genomics - or bioinformatics if you prefer its traditional name - is that it provides a completely objective resolution of all sorts of emotional disputes.

For example, by feeding genomic sequences of various organisms into a computer program, you can produce a tree of life that is remarkably similar to the ones proposed by traditional evolutionary biology. But in this case, there is no subjective judgement: just pure number crunching (although it's worth noting that the trees vary according to the depth of the calculations, so this is not absolute knowledge, only an ever-closer approximation thereto).
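
To make the "pure number crunching" point concrete, here is a minimal sketch of the sort of calculation involved, with made-up sequences: pairwise distances between aligned sequences, clustered step by step (a bare-bones UPGMA). Real phylogenetics uses proper alignments and substitution models, but the principle - the relationships drop out of the arithmetic - is the same.

```python
# Minimal sketch: pairwise Hamming distances between toy aligned sequences,
# clustered with naive UPGMA (merge the closest pair, repeat). The sequences
# are invented for illustration only.
from itertools import combinations

seqs = {
    "human":   "ACGTACGTACGT",
    "chimp":   "ACGTACGTACGA",
    "mouse":   "ACGTTCGAACGA",
    "chicken": "TAGTTCGAAGGC",
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# distances between all pairs of leaves
dist = {frozenset([a, b]): hamming(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
clusters = {name: (name,) for name in seqs}   # cluster label -> member leaves

while len(clusters) > 1:
    # find the pair of clusters with the smallest average leaf-to-leaf distance
    a, b = min(combinations(clusters, 2),
               key=lambda p: sum(dist[frozenset([x, y])]
                                 for x in clusters[p[0]] for y in clusters[p[1]])
                             / (len(clusters[p[0]]) * len(clusters[p[1]])))
    merged = f"({a},{b})"
    clusters[merged] = clusters.pop(a) + clusters.pop(b)
    print("join:", merged)   # ends with (((human,chimp),mouse),chicken)
```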

Another case in point is the closeness of the relationship between the great apes and humans. Indeed, it is only human arrogance that allows that kind of distinction to be made: a computer would lump them all together on the basis of their DNA.

Against this background, it's surprising how much we naked apes cling to our difference from the hairy kinds: perhaps it makes us feel a little better in the face of the genocide that we are waging against them. However, it looks like things here might be changing at last:


He recognises himself in the mirror, plays hide-and-seek and breaks into fits of giggles when tickled. He is also our closest evolutionary cousin.

A group of world leading primatologists argue that this is proof enough that Hiasl, a 26-year-old chimpanzee, deserves to be treated like a human. In a test case in Austria, campaigners are seeking to ditch the 'species barrier' and have taken Hiasl's case to court. If Hiasl is granted human status - and the rights that go with it - it will signal a victory for other primate species and unleash a wave of similar cases.

...

One of their central arguments will be that a chimpanzee's DNA is 96-98.4 per cent similar to that of humans - closer than the relationship between donkeys and horses.

Sadly, there's a terrible race here: which will we see first - apes recognised as near-equals, or apes razed from the face of the earth? (Via Slashdot.)

21 July 2006

First Catch Your Neanderthal

This stuff is getting too easy.

First, find some ancient remains - Croatian Neanderthal bones are great. Next, sequence lots - at least 20 times coverage. Don't worry if all you're getting are tiny fragments with around 100 DNA letters, and the signal is vastly swamped by bacterial noise. Just bung the results into a computer, and tell it (a) to cancel out all bacterial genome sequences (b) to join up all the rest. Result: one Neanderthal genome.
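
For the curious, here is a toy sketch of those two steps, (a) and (b), with invented sequences. Real ancient-DNA pipelines are vastly more sophisticated, but the logic is the same: throw away anything that looks bacterial, then join up what remains by overlap.

```python
# Toy sketch: (a) discard fragments sharing k-mers with a bacterial reference,
# (b) greedily merge the rest on their longest suffix/prefix overlap.
# All sequences below are made up.

def kmers(seq, k=8):
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

bacterial_ref = "GGGTTTCCCAAAGGGTTTCCCAAA"      # stand-in for known bacterial genomes
bacterial_kmers = kmers(bacterial_ref)

reads = [
    "ACGTACGTTTAGGC",    # "Neanderthal" fragment 1
    "TTAGGCCATGATCG",    # fragment 2, overlaps fragment 1
    "GGGTTTCCCAAAGG",    # bacterial contamination
]

# (a) cancel out reads that share any k-mer with the bacterial reference
clean = [r for r in reads if not (kmers(r) & bacterial_kmers)]

# (b) join up the rest on the longest overlap
def merge(a, b):
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return a + b[n:]
    return a + b

contig = clean[0]
for read in clean[1:]:
    contig = merge(contig, read)

print(contig)   # ACGTACGTTTAGGCCATGATCG
```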

There's just one problem:

If the Neanderthal genome were fully recovered, it might in principle be possible to bring the species back from extinction by inserting the Neanderthal genome into a human egg and having volunteers bear Neanderthal infants. There would, however, be great technical and ethical barriers to any such venture.

Understatement of the Year, Number 369.

18 July 2006

The Mega-Important MicroRNAs

Yesterday, when I was writing about the structures found in DNA, I said

Between the genes lie stretches of the main program that calls the subroutines

This is, of course, a gross over-simplification. One of the most interesting discoveries of recent years is that between your common or garden genes there are other structures that do not code for proteins, but for strings of RNA. It turns out that the latter play crucial roles in many biological processes, for example development. Indeed, they are fast emerging as one of genomics' superstars.

So it is only right that Nature Genetics should devote an entire issue to the subject; even better, it's freely available until August 2006. So get downloading now. Admittedly, microRNAs aren't the lightest of subject-matters, but they're mega-important.

02 July 2006

Carnival of the Bioinformaticians

A little while back I wrote about the blog-form of carnivals. At the time, Pedro Beltrão said he was about to start a new one, devoted to bioinformatics, and here it is, Bio::Blogs, with its very own Web site. I really must write something for the next one.

06 May 2006

O Happy, Happy Digital Code

My book Digital Code of Life was partly about the battle to keep genomic and other bioinformatics information open. So it's good to see the very first public genomic database, now EMBL, spreading its wings and mutating into FELICS (Free European Life-science Information and Computational Services) with even more bioinformatics goodies freely available (thanks to a little help from the Swiss Institute of Bioinformatics, the University of Cologne, Germany, and the European Patent Office).

04 April 2006

Coughing Genomic Ink

One of the favourite games of scholars working on ancient texts that have come down to us from multiple sources is to create a family tree of manuscripts. The trick is to look for groups of textual divergences - a word added here, a mis-spelling there - to spot the gradual accretions, deletions and errors wrought by incompetent, distracted or bored copyists. Once the tree has been established, it is possible to guess what the original, founding text might have looked like.

You might think that this sort of thing is on the way out; on the contrary, though, it's an extremely important technique in bioinformatics - hardly a dusty old discipline. The idea is to treat genomes deriving from a common ancestor as a kind of manuscript, written using just the four letters - A, C, G and T - found in DNA.

Then, by comparing the commonalities and divergences, it is possible to work out which manuscripts/genomes came from a common intermediary, and hence to build a family tree. As with manuscripts, it is then possible to hazard a guess at what the original text - the ancestral genome - might have looked like.
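
Here is a crude sketch of that final "guess", with invented descendant sequences: at each position, take the letter the majority agree on. Real ancestral reconstruction works over an explicit tree with probabilistic models, but the column-by-column spirit is similar.

```python
# Crude sketch of guessing the "original text": majority vote per alignment
# column across (invented) descendant sequences.
from collections import Counter

descendants = [
    "ACGTACGTAC",
    "ACGTACCTAC",
    "ACGAACGTAC",
    "ACGTACGTTC",
]

ancestor = "".join(
    Counter(column).most_common(1)[0][0]
    for column in zip(*descendants)
)
print(ancestor)   # ACGTACGTAC
```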

That, broadly, is the idea behind some research that David Haussler at the University of California at Santa Cruz is undertaking, and which is reported on in this month's Wired magazine (freely available thanks to the magazine's enlightened approach to publishing).

As I described in Digital Code of Life, Haussler played an important role in the closing years of the Human Genome Project:

Haussler set to work creating a program to sort through and assemble the 400,000 sequences grouped into 30,000 BACs [large-scale fragments of DNA] that had been produced by the laboratories of the Human Genome Project. But in May 2000, when one of his graduate students, Jim Kent, inquired how the programming was going, Haussler had to admit it was not going well. Kent had been a professional programmer before turning to research. His experience in writing code against deadlines, coupled with a strongly-held belief that the human genome should be freely available, led him to volunteer to create the assembly program in short order.

Kent later explained why he took on the task:

There was not a heck of a lot that the Human Genome Project could say about the genome that was more informative than 'it's got a lot of As, Cs, Gs and Ts' without an assembly. We were afraid that if we couldn't say anything informative, and thereby demonstrate 'prior art', much of the human genome would end up tied up in patents.

Using a cluster of one hundred 800 MHz Pentiums - powerful machines in the year 2000 - running GNU/Linux, Kent was able to lash up a program, assemble the fragments and save the human genome for mankind.

Haussler's current research depends not just on the availability of the human genome, but also on all the other genomes that have been sequenced - the different manuscripts written in DNA that have come down to us. Using bioinformatics and even more powerful hardware than that available to Kent back in 2000, it is possible to compare and contrast these genomes, looking for tell-tale signs of common ancestors.

But the result is no mere dry academic exercise: if things go well, the DNA text that will drop out at the end will be nothing less than the genome of one of our ancient forebears. Even if Wired's breathless speculations about recreating live animals from the sequence seem rather wide of the mark - imagine trying to run a computer program recreated in a similar way - the genome on its own will be treasure enough. Certainly not bad work for those scholars who "cough in ink" in the world of open genomics.

16 March 2006

The Power of Open Genomics

The National Human Genome Research Institute (NHGRI), one of the National Institutes of Health (NIH), has announced the latest round of mega genome sequencing projects - effectively the follow-ons to the Human Genome Project. These are designed to provide a sense of genomic context, and to allow all the interesting hidden structures within the human genome to be teased out bioinformatically by comparing them with other genomes that diverged from our ancestors at various distant times.

Three more primates are getting the NHGRI treatment: the rhesus macaque, the marmoset and the orangutan. But alongside these fairly obvious choices, eight more mammals will be sequenced too. As the press release explains:

The eight new mammals to be sequenced will be chosen from the following 10 species: dolphin (Tursiops truncates), elephant shrew (Elephantulus species), flying lemur (Dermoptera species), mouse lemur (Microcebus murinus), horse (Equus caballus), llama (Llama species), mole (Cryptomys species), pika (Ochotona species), a cousin of the rabbit, kangaroo rat (Dipodomys species) and tarsier (Tarsier species), an early primate and evolutionary cousin to monkeys, apes, and humans.

If you are not quite sure whom to vote for, you might want to peruse a great page listing all the genomes currently being sequenced for the NHGRI, which provides links to a document (.doc, alas, but you can open it in OpenOffice.org) explaining why each is important (there are pix, too).

More seriously, it is worth noting that this growing list makes ever more plain the power of open genomics. Since all of the genomes will be available in public databases as soon as they are completed (and often before), this means that bioinformaticians can start crunching away with them, comparing species with species in various ways. Already, people have done the obvious things like comparing humans with chimpanzees, or mice with rats, but the possibilities are rapidly becoming extremely intriguing (tenrec and elephant, anyone?).

And beyond the simple pairing of genomes, which yields a standard square-law richness, there are even more inventive combinations involving the comparison of multiple genomes that may reveal particular aspects of the Great Digital Tree of Life, since everything may be compared with everything, without restriction. Now imagine trying to do this if genomes had been patented, and groups of them belonged to different companies, all squabbling over their "IP". The case for open genomics is proved, I think.
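
For a sense of that richness, a little arithmetic (the genome counts below are arbitrary, chosen just to show the growth): with n genomes there are n(n-1)/2 pairwise comparisons, and far more multi-genome combinations.

```python
# Back-of-the-envelope combinatorics: pairwise comparisons grow as the square
# of the number of genomes; groups of two or more grow exponentially.
from math import comb

for n in (2, 12, 24, 48):
    pairs = comb(n, 2)
    subsets = 2 ** n - n - 1   # all groups of two or more genomes
    print(f"{n:3d} genomes: {pairs:5d} pairwise comparisons, {subsets} larger combinations")
```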

05 January 2006

Open Data - Good; Open Access - Bad?

Great story in Nature about data mashups - the mixing together of data drawn from disparate sources to create a sum greater than the parts.

This approach is not new: it lies at the heart of open source software - where chunks of code are drawn from the specialised databases known as hackers' brains and stitched together - and open genomics. Indeed, bioinformatics represents a kind of apotheosis of the mashup - see, for example, the way in which data from many researchers is pulled together in a genome browser like Ensembl.

Data mashups are more recent, and have started to gain popularity thanks to Google Earth. This provides a useful and conceptually simple scaffolding for other data to be brought together and displayed - like Nature's own avian flu mashup.
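
Stripped to its essentials, a mashup is just two independent data sets keyed on the same coordinates. The toy sketch below uses invented genomic intervals and variant names; Ensembl and friends do the same thing at vastly greater scale.

```python
# Toy sketch of a data mashup: two independently produced data sets joined on
# shared genomic coordinates. All intervals and names below are made up.

genes = [("chr1", 100, 500, "geneA"), ("chr1", 900, 1400, "geneB")]
variants = [("chr1", 120, "rs001"), ("chr1", 1000, "rs002"), ("chr1", 2000, "rs003")]

for chrom, pos, name in variants:
    hits = [g for (c, start, end, g) in genes if c == chrom and start <= pos <= end]
    print(name, "->", hits or "no overlapping gene")
```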

A pity, then, that this paean to the virtues of open data is not itself freely available under an open access licence. (For the benighted, the indispensable Open Access News has a long quotation that conveys the essence.)

21 December 2005

Intelligent Design ... and Bioinformatics

If you are interested in the background to the recent ruling against the teaching of Intelligent Design alongside Darwinian evolution in science classes, you might want to read a fine article on the subject, which also includes the judge's splendidly wise and perceptive remarks.

Of course, it is sad that the case even needed to be made. The idea that Intelligent Design - which essentially asserts that everything is as it is because, er, everything was made that way - can even be mentioned in the same breath as Darwinian evolution is risible. Not because the latter is sacrosanct and cast-iron truth. But Darwin's theory is a scientific theory, testable and tested. So far, it seems to be a good explanation of the facts. Intelligent Design is simply a restatement of the problem.

Among those facts are the growing number of sequenced genomes. It has always struck me that DNA and bioinformatic analyses of it provide perhaps the strongest evidence for evolution. After all, it is possible to bung a few genomes into a computer, tell it to use some standard mathematical techniques for spotting similarities between abstract data, and out pops what are called phylogenetic trees. These show the likely interrelationships between the genomes. They are not proof of evolution, but the fact that they are generated without direct human intervention (aside from the algorithms employed) is strong evidence in its favour.

One of the most popular ways of producing such trees is to use maximum parsimony. This is essentially an application of Occam's Razor, and prefers simple to complicated solutions.
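
For the curious, here is a minimal sketch of the parsimony calculation, using Fitch's algorithm on a single alignment column and two candidate four-leaf trees (the states and trees are invented): the tree that needs fewer changes is preferred.

```python
# Fitch's small-parsimony algorithm on a nested-tuple tree whose leaves are
# single characters; returns the possible ancestral states and the minimum
# number of substitutions required.

def fitch(node):
    if isinstance(node, str):
        return {node}, 0
    (left_set, left_cost), (right_set, right_cost) = fitch(node[0]), fitch(node[1])
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# one column of an alignment: human=A, chimp=A, mouse=G, rat=G
tree1 = (("A", "A"), ("G", "G"))   # ((human,chimp),(mouse,rat)) -> 1 change
tree2 = (("A", "G"), ("A", "G"))   # ((human,mouse),(chimp,rat)) -> 2 changes
print(fitch(tree1)[1], fitch(tree2)[1])   # parsimony prefers tree1
```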

I'm a big fan of Occam's Razor: it provides another reason why Darwin's theory of natural selection is to be preferred over Intelligent Design. For the former is essentially basic maths applied to organisms: anything that tends to favour the survival of a variant (induced by random variations in the genome) is mathematically more likely to be propagated.

This fact alone overcomes the standard objection that Intelligent Design has to Darwinian evolution: that purely "random" changes could never produce complexity on the time-scales we see. True, but natural selection means that the changes are not purely random: at each stage mathematical laws "pick" those that add to previous advances. In this way, simple light-sensitive cells become eyes, because the advantage of being able to detect light just gets greater the more refined the detection available. Mutations that offer that refinement are preferred, and go forward for further mutations and refinement.
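
A back-of-the-envelope calculation makes the point (the numbers are arbitrary, and this is not a rigorous population-genetics model): even a 1 per cent advantage, compounded generation after generation, takes a variant from obscurity to dominance remarkably quickly.

```python
# Deterministic one-locus selection: a variant with a small reproductive
# advantage s increases its share of the population each generation.
s = 0.01          # selective advantage of the new variant
freq = 0.001      # starting frequency in the population
generation = 0

while freq < 0.99:
    freq = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
    generation += 1

print(f"variant reaches 99% of the population after {generation} generations")
```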

It's the same for Intelligent Design's problem with protein folding. When proteins are produced within the cell from the DNA that codes for them, they are linear strings of amino acids; to become the cellular engines that drive life they must fold up in exactly the right way. It is easy to show that random fluctuations would require far longer than the age of the universe to achieve the correct folding. But the fluctuations are not completely random: at each point there is a move that reduces the overall energy of the protein more than others. Putting together these moves creates a well-defined path towards the folded protein that requires only fractions of a second to traverse.
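
To see the difference in the arithmetic, here is a toy model (the "energy" function is a stand-in, not real biophysics): random search over n binary choices needs on the order of 2^n attempts, whereas always taking the move that lowers the energy most gets there in at most n steps.

```python
# Toy "folding": a conformation is n binary choices, and its invented energy
# counts mismatches against the target state. Greedy downhill moves reach the
# minimum in at most n steps; blind random search would need ~2**n tries.
import random

n = 60
target = [random.randint(0, 1) for _ in range(n)]   # the "correctly folded" state

def energy(state):
    return sum(a != b for a, b in zip(state, target))

state = [random.randint(0, 1) for _ in range(n)]
steps = 0
while energy(state) > 0:
    # pick the single flip that lowers the energy most
    i = min(range(n), key=lambda j: energy(state[:j] + [1 - state[j]] + state[j+1:]))
    state[i] = 1 - state[i]
    steps += 1

print(f"folded in {steps} greedy steps; random search would need ~2**{n} tries")
```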