Search posterous

Search all posts and users. Type a name, type a favorite song title, whatever! See what comes up.
  

More posterous blogs











More recommended blogs »

Here are posterous posts filed under bioinformatics...

hatchethead says...

As a mathematics professor at a 'primarily undergraduate institution' (a.k.a. PUI) who has been deeply involved in some local and national efforts at reforming undergraduate biology and mathematics education to be more interconnected, I delight at coming across papers with titles like this one: "Computing Has Changed Biology—Biology Education Must Catch Up" by Pavel Pevzner and Ron Shamir (Science 31 July 2009: Vol. 325. no. 5940, pp. 541 - 542). Often times, papers like these are written by people who are passionate about STEM education reform, and they share in their writings insights and perspectives that are useful to me.

This is not really one of those papers. In this article, the authors make an ineffectual argument for changing the undergraduate biology curriculum to include more training in computational biology, a.k.a. bioinformatics. It's not that what they want isn't reasonable or that it's not needed. The authors do want to reform the undergraduate biology curriculum in a meaningful way. It's just that this note in the Education Forum of Science is more of an excuse to report on a conference: RECOMB Bioinformatics Education Conference (http://casb.ucsd.edu/bioed/). Mostly, the authors repeat many of lines we've heard from many others since the publication of Bio2010.

This article does give some decent concrete ideas for how a computational biology course would be designed. For example, a course could introduce the computational content/concepts in the context of framing questions such as "Did our ancestors interbreed with Neanderthals?" or "How do we distinguish between different forms of breast cancer and choose the appropriate chemotherapy?" or "How can biomarkers be used to predict clinical outcomes of young breast cancer patients?" But the authors don't go far beyond that. And this kind of suggestion, of teaching new concepts in a context that is immediately meaningful to students, is not innovative or new; educators that specialize in how students learn and in broadening participation in science and mathematics (cf. the BioQuest Curriculum Consortium at http://bioquest.org) have been recommending this approach for many years,

That the authors are ignorant of the work others have done in student learning and pedagogy reform is understandable (though not necessarily excusable, given the aim of this article). That I don't understand is the authors' apparent ignorance of or indifference to the challenge of adding anything to the standard undergraduate biology curriculum. Degree programs in the life sciences are already tight, and requirements set by admissions to medical schools further limit program flexibility (like it or not). Perhaps the recent AAMC/HHMI Committee report “Scientific Foundations for Future Physicians” (see http://www.hhmi.org/news/SFFP20090604.html) will lead medical school admissions committees to raise their expectations of applicant competence in mathematics, statistics, and computation. Such a move would open the possibility of biology programs requiring the type of new course proposed by Pevzner and Shamir. But even if medical schools calls for more quantitative sophistication (or literacy) from medical school candidates, we should ask if a course in bioinformatics should be required of all biology majors. That's a discussion I'll leave for others and for later.

In the end, the authors say that implementing their so-called proposed course "may become a first step toward building the new computational curriculum for biologists" without ever proposing a concrete course. At best, they give the readers the "pedagogical challenge" of designing this course so that it "(i) assumes few computational prerequisites, (ii) assumes no knowledge of programming, and (iii) instills in students a meaningful understanding of computational ideas and ensures that they are able to apply them." At worst, they come across as people who have little knowledge of or appreciation for the tightness of the undergraduate biology curriculum They can parrot the spirit of Bio2010 well, but they're not offering much meat for those of us who are in the trenches trying to make things happen.

Before closing, I'm going to pick a nit with their imperial use of the adjective 'computational' when used to modify 'biology.' While many people have used the terms 'bioinformatics' and 'computational biology' interchangeably. This was reasonable at a time when genomics led the way in leveraging computational and algorithmic power to illuminate questions in biology. But now applications of computation can be found in every corner of in the life sciences. Agent-based models illuminate ecosystem dynamics. CompuCell's pde-based methods are used to model cellular phenomena. Computer-assisted tools use mathematical models of shape and image-analytic techniques to understand medical images. These are just a couple examples that have come up in Truman's undergraduate mathematical biology program. Each of these are examples of 'computational biology', where computational is used in its broadest sense to mean an application of computers and computational methods. Let's relegate the synonymous use of 'bioinformatics' and 'computational biology' to the history books with a nod and a thanks.

With all that said, and as much as the article didn't thrill me like Joel Cohen's "Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better" (see http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0020439), the short article is worth reading, if only to see that there are many people out there who want undergraduate curricular reform. And the references are decent.

Filed under: bioinformatics

paulbarrett says...

Already been done by ChemDoodle

http://web.chemdoodle.com/overview.php

Back to the drawing board...

Filed under: bioinformatics

paulbarrett says...

...trying to think of something to do as a side project (for CV enhancement/future small business possibilities)  which encompasses:

Web technology
UI design
Cheminformatics or bioinformatics

and has a unique selling point or solves a unique problem. In spare moments I have tried to jot down ideas, then ruling them out with searching - quite a few had already been done.

And then it hit me - chemical drawing.

For 2 and a half years this was the bane of my life when I worked at Evotec. We used a Java applet based system for web drawing which, while useful and feature-rich, was a pain in the hole for integrating seamlessly into web apps, users needed Java installed for WebStart, the configuration was a real bitch. There were desktop apps as well, but again these were Java based and had limitations.

What doesn't exist anywhere, to my knowledge, is a platform independent/agnostic, CSS based drawing tool which can also handle input files or SMILES strings and display them. With the CSS and javascript frameworks about, surely this would be achievable with a bit of work?

So basically, does anyone know of a tool/site like this where any user with just a decent web browser (think ChromeOS/Linux here) can rock up and just draw away? Or have I hit on something here?

Filed under: bioinformatics

Sujai says...

In praise of Vmatch


If I could take only one bioinformatics tool with me to a desert island it would be Vmatch.

In addition to being a versatile alignment tool, Vmatch has many lesser-known features to leverage its suffix trees. The -dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.

Originally presented as means to cluster EST sequences, this feature is a very useful addition to process output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.

Why would we want to combine assemblies?


Every kmer is sacred


Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that the best set of contigs is spread out all over the parameter space. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.

Reads held hostage


To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a "contig read hostage" situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.



What is a non-redundant set?


The following sequences have some redundancies. NODE2 and NODE3 do not add any information.
The non-redundant set will have only NODE1 and NODE4.

>NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTA
TGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCCCTTTCTAC
AGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAA
GGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAACAGTGTGTCTTGGAGTT
TAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGA
CGTTATCTGAGACAACTTTGTCTTTAACCGACAGGGAGTTGAGGTAGGAGAGAGTGTCCA
CATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGT
TAAAGGGGTCCATACCTCTATTCTAGTCCCGAAGGTTGGTAGCTAGACTCGGTGACCTAA
AATGAGAACGGAAGAAACGGAGGTGACATCCATGGGGCTTGTCGTATATCCAATCATACC
TTTGGAGAAGGAAGTTAAAGGTCAAAACTTTAAAAACCATGAGGACCATTTTATCCTCGT
CACTGTCGACTATGGTGAAGTGACCTAGCAGGTTTGTGTAGTTACTGTTTTGTAAGTGTA
AAGTGCGTTGCTGGCTATAGGAGTTTCGCATGAAACATGCTCGCTCTTGTGACCCATCGG
TTACCAAGTTTACAGACTAGGTGAGGACTAGGTCACCTCTGTTTGGTAACCAAAAAGTAG
AGAAAGAATATCAACAAGGTATATACAACAACTTAGAGTAGCAAACCTTAAAAGAGTTGT
CGTCAGTCACAAAGGGAGAGTTGTACTAAAGGGAACTTTTGTTCACAGAGCTAAAACTCA
TTAAGACCATTTTAATGTGGTCTACCAAGTCTGTCGAAAAGAAGTGGATGCTACGGCAGA
>NODE2 a substring of NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTA
TGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCCCTTTCTAC
AGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAA
GGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAACAGTGTGTCTTGGAGTT
TAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGA
CGTTATCTGAGACAACTTTGTCTTTAACCGACAGGGAGTTGAGGTAGGAGAGAGTGTCCA
CATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGT
>NODE3 a reverse complement substring of NODE1
CGACAGACTTGGTAGACCACATTAAAATGGTCTTAATGAGTTTTAGCTCTGTGAACAAAA
GTTCCCTTTAGTACAACTCTCCCTTTGTGACTGACGACAACTCTTTTAAGGTTTGCTACT
CTAAGTTGTTGTATATACCTTGTTGATATTCTTTCTCTACTTTTTGGTTACCAAACAGAG
GTGACCTAGTCCTCACCTAGTCTGTAAACTTGGTAACCGATGGGTCACAAGAGCGAGCAT
GTTTCATGCGAAACTCCTATAGCCAGCAACGCACTT
>NODE4 a new sequence
TTACGAACGATAGCATCGATCGAAAACGCTACGCGCATCCGCTAAGCACTAGCATAATGC
ATCGATCGATCGACTACGCCTACGATCGACTAGCTAGCATCGAGCATCGATCAGCATGCA
TCGATCGATCGAT

How do we create a non-redundant set?


Do not use -nonredundant


No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contigs. I'm not sure what makes a sequence representative of a cluster, but we want the longest member of each cluster.

The workaround


To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:
1mkvtree -allout -pl -db mySeqs.fa -dna -indexname myIndex #mkvtree will accept multiple fasta or gzipped fasta files
2mkdir sequenceMatches #this is where vmatch will put each cluster
3vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex > mySeqs.rpt
4mv mySeqs*.match mySeqs*.fna sequenceMatches

What do these options do?


  • -d direct matches (forward strand)

  • -p palindromic matches (reverse strand)

  • -l search length (set this below your shortest sequence)

  • -dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.
    • The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.

  • The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.

  • The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.

  • -v verbose report (redirected to a the .rpt)

  • -s create the individual cluster fasta files and match reports

  • Consult the vmatch manual for fuzzy matches and more examples:
    http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf

    The standard output details the clusters and singlets.


    0:
    761437: NODE_83_length_31_cov_22.451612
    857642: NODE_106_length_31_cov_22.451612
    621409: NODE_152_length_29_cov_27.758621
    749981: NODE_185_length_29_cov_27.758621
    1:
    761861: NODE_531_length_424_cov_26.851416
    805590: NODE_1413_length_320_cov_28.187500
    837480: NODE_1236_length_320_cov_28.187500
    858407: NODE_937_length_320_cov_28.187500
    765510: NODE_1542_length_108_cov_34.870369
    621979: NODE_786_length_425_cov_32.602352
    750915: NODE_1290_length_321_cov_34.392525

    Extract the longest sequence from your cluster files


    Here is a perl script to do that
    01#!/usr/bin/perl
    02#print the longest sequence in a fasta file
    03use strict;
    04my %seqs;
    05my $defline;
    06while () {
    07   chomp;
    08   if ( /^#/ || /^\n/ ) {
    09       #comment line or empty line do nothing
    10   }
    11   elsif (/>/) {
    12       s/>//g;
    13       $defline = $_;
    14       $seqs{$defline} = "";
    15   }
    16   elsif ($defline) {
    17       $seqs{$defline} .= $_;
    18   }
    19   else {
    20       die('invalid FASTA file');
    21   }
    22}
    23my $max = $defline;
    24foreach my $def ( keys %seqs ) {
    25   $max = ( length( $seqs{$def} ) > length( $seqs{$max} ) ) ? $def : $max;
    26}
    27print ">" . $max . "\n" . $seqs{$max} . "\n";

    We want the longest member of each cluster and all sequences in the singletons file:
    1for f in sequenceMatches/*fna;
    2 do
    3   if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
    4     then cat $f >> mySeqs_longest_seqs.fa;
    5     else perl longestSeq.pl $f >> mySeqs_longest_seqs.fa;
    6   fi;
    7done

    That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.

    Filed under: bioinformatics

    D says...

    [late edit (as of Nov 5, 09) - just heard in a talk on data collection/management/interpretation that personalized medicine will never exist. all of this will just go towards better targeting of drugs. maybe it's more than semantics? not so sure]

    A visualization for drug targeting in drug discovery, which is a little bit of a big deal. This + genome sequencing = future of personalized medicine.
    http://en.wikipedia.org/wiki/Computational_biology
    http://en.wikipedia.org/wiki/Drug_discovery

    This image comes from: MJ Keiser et al. Nature 000, 1-7 (2009) doi:10.1038/nature08506

    Filed under: bioinformatics

    BioInfo says...

    PLAST: parallel local alignment search tool for database comparison http://ow.ly/tX0c bioinformatics blast fb

    Filed under: bioinformatics

    BioInfo says...

    Downloads & Upgrades from BioInform | IlluminaCompute, EnsemblGenomes, Skyline…: http://ow.ly/tW7e bioinformatics

    Filed under: bioinformatics

    BioInfo says...

    Tips for de novo bacterial genome assembly http://ow.ly/tbQK velvet illumina bioinformatics

    Filed under: bioinformatics

    Phosphorylation selection. As protein kinases evolved and species complexity increased, the numbers of genomically encoded tyrosine residues [and thus phosphorylated tyrosines (pY)] decreased proportionally. This negative correlation also applies to threonine (pT) but not serine (pS) residues. Positionally conserved phosphorylation sites tend to be located in ordered protein domains, whereas those not conserved are located in intrinsically disordered regions where their exact positions shift with evolutionary time.

    Filed under: Bioinformatics

    echo QSVJRYHFCEXDILPKZTOBWA | grep -o . | sort -r | grep -v ‘[CKQW]‘ | perl -ne ‘chomp; print “$_\t”;’ | cut -f2,4- | perl -ane ‘$x=”KPFFAOJECKNPAIGMBMHBMCLPDD”; foreach $y (split(//,$x)) {print $F[ord($y)-65];} print “\n”;’

    Filed under: Bioinformatics