— the oh no sequences! blog

Hi everyone!

Christmas Eve is almost here and there’s still time for a last-minute present.
Thanks to CloudFormation and this template I’m about to show you, Neo4j Server is now friends with AWS (Amazon Web Services) and together they bring you the opportunity of getting your own fresh Neo4j Server machine running in just a few clicks!

I created the GitHub repository Neo4jAWS, where you can find all the files you need. There aren’t many yet, but I’ll probably be adding more tools for Neo4j and AWS integration soon.

Ok, so what does this CloudFormation template actually do?

  1. Launches an instance of the type you choose in the availability zone you choose (you must also provide your key pair)
  2. Attaches the volume containing your Neo4j DB to the new instance (you must provide your volume ID)
  3. Downloads the latest stable Neo4j release (1.5) and overwrites the server properties file with your own file (you have to provide a public URL where it is available)
  4. Copies your DB folder into the server’s data folder and then starts the Neo4j Server (you have to provide the name of that folder as a template parameter)

And, what do you have to do?

  1. Go to the CloudFormation section of the AWS console
  2. Click the ‘Create New Stack’ button
  3. Download the template file from the GitHub repository and then select the option ‘Upload a Template File’
  4. You should now see the parameters window, where you enter the values for KeyPairName, Neo4jDBFolder, AvailabilityZone, EBSVolumeID, ServerPropertiesFile (this should be a public URL), and InstanceType
  5. Once you’ve checked that everything is OK, just click next and wait for the stack to reach the state CREATE_COMPLETE
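If you would rather script the parameters than type them into the console, here is a minimal sketch of how the six template parameters listed above could be collected and checked. The parameter names come from the template; every value below is a placeholder I made up for illustration, and the helper name `build_stack_parameters` is hypothetical.

```python
# Names of the parameters the template asks for (taken from the steps above).
REQUIRED_PARAMETERS = [
    "KeyPairName",
    "Neo4jDBFolder",
    "AvailabilityZone",
    "EBSVolumeID",
    "ServerPropertiesFile",
    "InstanceType",
]


def build_stack_parameters(values):
    """Turn a plain dict into a list of ParameterKey/ParameterValue pairs.

    Raises ValueError if any of the template parameters is missing or empty.
    """
    missing = [name for name in REQUIRED_PARAMETERS if not values.get(name)]
    if missing:
        raise ValueError("missing template parameters: " + ", ".join(missing))
    return [
        {"ParameterKey": name, "ParameterValue": values[name]}
        for name in REQUIRED_PARAMETERS
    ]


# Placeholder values only: replace them with your own key pair, volume, etc.
example_values = {
    "KeyPairName": "my-key-pair",
    "Neo4jDBFolder": "graph.db",
    "AvailabilityZone": "us-east-1a",
    "EBSVolumeID": "vol-xxxxxxxx",
    "ServerPropertiesFile": "https://example.com/neo4j-server.properties",
    "InstanceType": "m1.small",
}

parameters = build_stack_parameters(example_values)
for entry in parameters:
    print(entry["ParameterKey"], "=", entry["ParameterValue"])
```

The list has the ParameterKey/ParameterValue shape that the CloudFormation API uses for stack parameters, so it could presumably be handed to an SDK call that creates the stack. I have not tested that part here.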

UPDATE –> Here you have the set of screenshots you should see in the process:


CloudFormation tab: click on Create new stack button.


Give a name to your stack and choose the option for uploading a file, browsing to the template file you previously downloaded from Neo4jAWS repository. Click ‘Continue’ then.


You should be seeing something like this by now. It’s time to provide all the parameters!
When you’re done, review the values, click on ‘Continue’, and just wait for the stack to change to the state ‘CREATE_COMPLETE’ ;)

If nothing weird happens, you should be able to see the WebAdmin in your browser by entering the public IP given as the stack output, plus the port you specified in your neo4j-server.properties file, as the URL.

Beware that by default the template only opens port 7474 for communicating with the Server. If you want to use another port number for any reason, you have to change the SecurityGroup manually.
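To make the "public IP plus port" part concrete, here is a small sketch that reads the port from the text of a properties file and builds the URL to open. It assumes the port is stored under the key `org.neo4j.server.webserver.port` and that the WebAdmin lives under `/webadmin/`; both are assumptions about the Neo4j 1.x server layout, so check them against your own file. The IP address is a documentation placeholder.

```python
def webadmin_url(public_ip, properties_text, default_port=7474):
    """Build the WebAdmin URL from the stack's public IP and a properties file.

    Assumes the port key is 'org.neo4j.server.webserver.port' (an assumption,
    verify it in your neo4j-server.properties). Falls back to default_port.
    """
    port = default_port
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if separator and key.strip() == "org.neo4j.server.webserver.port":
            port = int(value.strip())
    # The '/webadmin/' path is also an assumption about the server layout.
    return "http://%s:%d/webadmin/" % (public_ip, port)


sample_properties = "# sample file\norg.neo4j.server.webserver.port=7474\n"
print(webadmin_url("203.0.113.10", sample_properties))
```

If the port you get back is not 7474, that is the case where the SecurityGroup needs the manual change.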

As always, please don’t hesitate to give any kind of feedback or suggestion you may have, as well as pointing to possible issues/bugs (you can use github issues in the repository for that).

Happy Holidays!

@pablopareja


Hi everyone!

A couple of days ago I published a post describing how to obtain cool GO annotation visualizations with Gephi + Bio4j. As an example I used data from one of the first assemblies of the EHEC genome, and today I was wondering: why not take the latest BGI assembly, which we annotated with our great BG7 bacterial genome annotation pipeline, and put together the visualizations for the three sub-ontologies? Here is the result:

Biological Process


(Please click on the image above to check the zoomable version)

Molecular Function


(Please click on the image above to check the zoomable version)

Cellular Component


(Please click on the image above to check the zoomable version)

Have a good weekend! ;)

@pablopareja


Hi !

This afternoon I finished a small project on the identification of microsatellites in DNA sequences. As with every new project I start, I looked for something that:

  • I didn’t try before
  • is worth learning
  • is applicable in order to meet the needs of the specific project

This time it was the chance to get to know and try the visualization tool included in the latest version of the Neo4j WebAdmin dashboard.
I had already heard of it a couple of times from different sources but had not yet had the chance to play with it. After my first contact with it I have to say that, although it’s something Neo4j introduced only in recent versions, it already has a decent GUI and promising functionality.

Apart from GUI considerations, I created the repository MicrosatellitesNeo4jModel with a bunch of node and relationship wrappers that serve as an API for traversing all this data easily.
Here is the domain model I chose:

Microsatellites Neo4j domain model

On the program side, I developed two Java classes: one (CreateMicrosatellitesDB.java) identifies the microsatellites and stores them in the Neo4j DB, and the other (ExtractDataToCSV.java) extracts statistical information for a set of specific parameters such as tuple length.
Both classes are in the repository Microsatellites.
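For readers who have not met microsatellites before, here is a minimal Python sketch of the general idea: finding short tuples repeated in tandem. It is not the code from the Microsatellites repository (that one is Java and stores its results in Neo4j). The function name, the parameters and their defaults are hypothetical choices made for illustration.

```python
import re


def find_microsatellites(sequence, min_unit=1, max_unit=6, min_repeats=3):
    """Return sorted (start, tuple, repeats) triples for tandem repeats.

    'start' is 0-based. Tuples that are themselves repeats of a shorter
    tuple (e.g. 'ACAC') are skipped so each run is reported only once.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    seq = sequence.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured tuple followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            is_compound = any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_compound:
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, repeats))
    return sorted(hits)


print(find_microsatellites("TTACACACACGG"))
```

Each triple maps naturally onto the kind of data described in this post: a sequence, the repeated tuple (and its length), and the number of repeats.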

Once the DB was created, I played a bit with the display profiles in the WebAdmin data browser so that different node types looked different, and this is what I got:

Microsatellites DB data browser screenshot

Here you can find blue circles (sequence IDs), orange boxes (tuples repeated in the microsatellites found), and greenish squares (tuple length nodes).

One feature I missed in the visualization was style rules for relationships as well as for nodes. This was especially important in my case, where relevant information is stored as relationship attributes: I could not visualize the number of tuple repeats in the microsatellites found, just the name of the relationship ‘MICROSATELLITE_FOUND’ everywhere.
However, I posted a question about this on the Neo4j user list and it seems they are already working on it, cool!

As always, everything here is open source and released under AGPLv3.
Cheers,

@pablopareja


Today I found an interesting discussion on the Neo4j user list and found myself in the mood to write down a couple of related thoughts I have had in mind over the last few months.
Here they are (the titles are taken from the guidelines for building your domain model):

- Use reference and subreference nodes to organize entry points

In principle I don’t agree with this. Well, I do in theory (it actually is how I implemented things in the beginning). However, in one of my projects, where I’m dealing with huge numbers of relationships, reference and subreference nodes become supernodes, which in turn are a very problematic bottleneck in most traversals. This is due to the lack of a natively implemented system for distinguishing between relationships of different types (I opened an issue about this). Until this is solved, I’m always tempted/forced to use indexes instead of relationships, but then I wonder how different things are in the end compared to other, non-graph-based DB systems!?
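To illustrate the kind of bottleneck I mean, here is a toy Python model. It is only a sketch of the reasoning and says nothing about how Neo4j actually stores relationships. The two classes and the relationship type names (PROTEIN, ORGANISM) are made up for the example.

```python
from collections import defaultdict


class FlatNode:
    """Toy node that keeps every relationship in one list, whatever its type."""

    def __init__(self):
        self.relationships = []  # list of (relationship type, target) pairs

    def add(self, rel_type, target):
        self.relationships.append((rel_type, target))

    def neighbours(self, rel_type):
        """Return (targets, number of relationships inspected)."""
        inspected = 0
        found = []
        for current_type, target in self.relationships:
            inspected += 1
            if current_type == rel_type:
                found.append(target)
        return found, inspected


class TypedNode:
    """Toy node that groups its relationships by type."""

    def __init__(self):
        self.by_type = defaultdict(list)

    def add(self, rel_type, target):
        self.by_type[rel_type].append(target)

    def neighbours(self, rel_type):
        found = list(self.by_type.get(rel_type, []))
        return found, len(found)


flat, typed = FlatNode(), TypedNode()
for node in (flat, typed):
    for i in range(100000):
        node.add("PROTEIN", i)  # the bulk that turns the node into a supernode
    for i in range(3):
        node.add("ORGANISM", i)  # the few relationships we actually want

print(flat.neighbours("ORGANISM")[1])   # relationships inspected, flat list
print(typed.neighbours("ORGANISM")[1])  # relationships inspected, grouped by type
```

In the flat model, reaching three relationships of one type costs a pass over everything hanging from the supernode. In the grouped model only the three wanted relationships are touched.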

- Use relationship types appropriately

I’m gonna be frank about this: I never understood why relationship types are mandatory but node types are not!?
It just doesn’t make any sense to me. I can understand that this could bring some trouble depending on implementation decisions taken at the core level, but then why do it only halfway? It would have been better to have no restriction for either nodes or relationships than how things are now.

With all this said, I still find Neo4j a very promising DB for the near future and I’m really happy to use it in a lot of different projects/use-cases; however, I think the way for it to get better each day is not to keep saying how cool it is but to actually point at the weak spots it may have.

Pablo Pareja


We’ve just finished the automatic annotation of the assembly of the new strain that the HPA (Health Protection Agency UK) made available yesterday (get the assembly file here http://www.hpa-bioinformatics.org.uk/lgp/resource/454Scaffolds.fna)

Once more we’ve used BG7 and the same set of reference proteins (137,063 proteins in total):

  • The representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
  • All Uniprot proteins from organisms including in their name the terms “EHEC” or “EAEC”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “toxin”
  • All Uniprot proteins from bacteria that have in any Uniprot field  “hemolysin”
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae

Results

We’ve detected 5,916 genes

  • 5,792 protein encoding genes
  • 124 RNA genes

4,912 out of the 5,792 (84.80%) protein encoding genes have canonical start and stop codons and have neither frameshifts nor intragenic stop codons.

615 out of the 5,792 (10.61%) protein encoding genes have frameshifts or intragenic stop codons in their sequences, probably caused by errors inherent to the sequencing technology.
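As a quick sanity check on the figures above, here is a small sketch of the arithmetic. The only inputs are the counts given in this post; the helper name `truncated_percent` is hypothetical. The quoted percentages correspond to truncation to two decimals rather than rounding, so that is what the helper does.

```python
def truncated_percent(part, total, decimals=2):
    """Percentage of part over total, truncated (not rounded) to 'decimals'."""
    factor = 10 ** decimals
    return (part * 100 * factor // total) / factor


protein_genes = 5792
rna_genes = 124
clean_genes = 4912   # canonical start/stop, no frameshifts, no intragenic stops
broken_genes = 615   # frameshifts or intragenic stop codons

total_genes = protein_genes + rna_genes
other_genes = protein_genes - clean_genes - broken_genes

print(total_genes)                                     # 5916
print(truncated_percent(clean_genes, protein_genes))   # 84.8
print(truncated_percent(broken_genes, protein_genes))  # 10.61
print(other_genes)                                     # 265
```

The last number is the count of protein encoding genes that fall into neither of the two groups; the post does not say how those are classified.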

You can get the results of the annotation here https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/H112180280/seqProject/HealthProtectionAgencyUK/annotations/era7bioinformatics/era7_HPA_H112180280_annotations


We’ve finished the automatic annotation of the third BGI assembly of the E. coli TY-2482 strain genome (get the assembly file here ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482/Escherichia_coli_TY-2482.scaffold.20110610.fa.gz)

As in the other annotations we’ve done so far, we used the BG7 system to annotate the genome, with the same set of reference proteins (137,063 proteins in total):

  • The representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
  • All Uniprot proteins from organisms including in their name the terms “EHEC” or “EAEC”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “toxin”
  • All Uniprot proteins from bacteria that have in any Uniprot field  “hemolysin”
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae

Results

We’ve detected 5,936 genes

  • 5,806 protein encoding genes
  • 130 RNA genes

4,881 out of the 5,806 (84.06%) protein encoding genes have canonical start and stop codons and have neither frameshifts nor intragenic stop codons.

533 out of the 5,806 (9.18%) protein encoding genes have frameshifts or intragenic stop codons in their sequences, probably caused by errors inherent to the sequencing technology.

You can get the results of the annotation here https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/TY2482/seqProject/BGI/annotations/era7bioinformatics/BGI_V3


David Studholme detected a region missing from the EHEC TY-2482-v1 assembly (http://www.genomic.org.uk/blog/?p=523) that was also absent from TY-2482-v2.

This is the region with the type VI secretion system that surrounds the missing region detected by Studholme:

contig  Era7 geneID  ini    Era7 tags            Protein name
106     108864       2776   secretion system VI  Putative uncharacterized protein
106     83591        6193   secretion system VI  Putative uncharacterized protein
106     85611        6792   secretion system VI  Putative uncharacterized protein
106     107028       7091   secretion system VI  Putative type VI secretion protein
106     79778        8549   secretion system VI  Putative uncharacterized protein
106     58451        9104   secretion system VI  Putative uncharacterized protein
106     108747       10185  secretion system VI  Putative uncharacterized protein
106     60210        10494  secretion system VI  Putative uncharacterized protein
106     106509       10965  secretion system VI  Putative uncharacterized protein
106     75277        12958  secretion system VI  Putative type VI secretion protein
106     99465        13930  secretion system VI  Putative uncharacterized protein
106     92338        15696  secretion system VI  Putative uncharacterized protein
106     94122        16111  secretion system VI  Putative uncharacterized protein
106     81926        16627  secretion system VI  Putative type VI secretion protein
106     95285        18115  secretion system VI  Putative type VI secretion protein
106     102409       18594  secretion system VI  Putative type VI secretion protein
106     11998        19032                       Transposase


These are the BLASTN results for that region of  BGI_assembly_v2 vs EC 55989:

query id  subject id    % identity  align len  q. start  q. end  s. start  s. end   evalue
106       gi|218350208  99.88       18937      1         18937   3369005   3387941  0.0
106       gi|218350208  100.00      3964       18923     22886   3389265   3393228  0.0
106       gi|218350208  99.92       3832       22885     26716   3397877   3401707  0.0
106       gi|218350208  99.84       2455       26837     29291   3427391   3429843  0.0


These are the EC 55989 genes missing from BGI assembly v2:

hypothetical protein EC55989_3318 3401747 3402160 EC55989_3318
hypothetical protein EC55989_3319 3402305 3402739 EC55989_3319
hypothetical protein EC55989_3320 3402739 3403275 EC55989_3320
hypothetical protein EC55989_3321 3403256 3404356 EC55989_3321
hypothetical protein EC55989_3322 3404311 3406074 EC55989_3322
hypothetical protein; putative membrane protein 3406082 3406879 EC55989_3323
hypothetical protein EC55989_3324 3406776 3408374 EC55989_3324
hypothetical protein EC55989_3325 3408374 3411763 EC55989_3325
hypothetical protein EC55989_3326 3411756 3412904 EC55989_3326
hypothetical protein EC55989_3327 3412908 3413174 EC55989_3327
Conserved hypothetical protein. Putative exported protein 3413206 3413883 EC55989_3328
conserved hypothetical protein; putative exported protein 3414027 3414704 EC55989_3329
hypothetical protein EC55989_3330 3414724 3416406 EC55989_3330
hypothetical protein EC55989_3331 3416403 3418928 EC55989_3331
hypothetical protein EC55989_3333 3419451 3420026 EC55989_3333
putative chaperone clpB 3420014 3422680 EC55989_3334
hypothetical protein EC55989_3335 3422840 3423331 EC55989_3335
hypothetical protein EC55989_3336 3423337 3425163 EC55989_3336
hypothetical protein EC55989_3337 3425070 3425789 EC55989_3337
hypothetical protein EC55989_3338 3425720 3427057 EC55989_3338
hypothetical protein EC55989_3339 3427073 3428617 EC55989_3339
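The link between the BLASTN hits and this gene list can be sketched in a few lines of Python: take the subject (EC 55989) coordinates of the four hits and look for stretches that no hit covers. This is just an illustration of the reasoning, not the procedure we actually ran. The function name `subject_gaps` is hypothetical and the coordinates are copied from the BLASTN results above.

```python
def subject_gaps(hits):
    """Return the uncovered (start, end) intervals between consecutive hits.

    'hits' is a list of (s_start, s_end) pairs on the forward strand,
    with 1-based inclusive coordinates as in tabular BLAST output.
    """
    ordered = sorted(hits)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for s_start, s_end in ordered[1:]:
        if s_start > covered_until + 1:
            gaps.append((covered_until + 1, s_start - 1))
        covered_until = max(covered_until, s_end)
    return gaps


# Subject coordinates of the four hits of contig 106 against EC 55989.
blast_hits = [
    (3369005, 3387941),
    (3389265, 3393228),
    (3397877, 3401707),
    (3427391, 3429843),
]

gaps = subject_gaps(blast_hits)
largest = max(gaps, key=lambda gap: gap[1] - gap[0])
for start, end in gaps:
    print(start, end, end - start + 1)
print("largest gap:", largest)
```

The largest uncovered stretch runs from 3401708 to 3427390, and the first gene in the list above (EC55989_3318) starts at 3401747, just inside it.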


We will review this region in the new annotations that we will do for the two new available genome sequences contributed by HPA and BGI.


As suggested by Kat Holt (http://bacpathgenomics.wordpress.com/2011/06/05/ehec-genomes-snp-locations/) and others, a plasmid very similar to pEC_Bactec is part of the genome of the recently sequenced EHEC H112180280 strain.

The figure displays a simple alignment, obtained using the MAUVE Move Contigs tool, between the pEC_Bactec plasmid (above) and scaffolds 7 and 13 of the H112180280 genome (below):

pEC-Bactec vs HPA UK 454 H112180280 strain

Scaffolds 7 and 13 cover practically the whole pEC-Bactec plasmid sequence. The two white pEC-Bactec regions that have no similar regions in H112180280 correspond to the genes:

- pndC and TnpA OrfB IS66 (the left white region inside the red similarity block)

- TnpA IS26 transposase (the white patch inside the blue block)

The H112180280 sequence regions that flank the two small green blocks and show no similarity connection with pEC-Bactec correspond to N regions that remain undefined in the H112180280 sequence obtained with paired-end 454 technology.


HPA (Health Protection Agency http://www.hpa.org.uk/) has just announced the sequence of an E. coli strain, H112180280.

They have sequenced the strain with 454 and they’ve released:

  • sff files
  • FASTA file with the scaffolds
  • The annotation (done by Anthony Underwood) in GenBank format

Data available here http://www.hpa-bioinformatics.org.uk/lgp/genomes

They got

  • 13 scaffolds
  • 5405081 bp
  • 88748 Ns (1.64%)
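These three numbers are easy to recompute from the scaffolds FASTA file. Here is a minimal sketch, assuming plain FASTA text is already loaded into a string. The function name `assembly_stats` and the toy input are made up for illustration, and the sketch is not the way HPA produced their figures.

```python
def assembly_stats(fasta_text):
    """Count scaffolds, total bases and N bases in FASTA-formatted text."""
    scaffolds = 0
    total_bp = 0
    n_count = 0
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            scaffolds += 1  # every header line starts a new scaffold
        else:
            total_bp += len(line)
            n_count += line.upper().count("N")
    n_percent = 100.0 * n_count / total_bp if total_bp else 0.0
    return {
        "scaffolds": scaffolds,
        "total_bp": total_bp,
        "n_count": n_count,
        "n_percent": n_percent,
    }


# Toy input, not real data.
toy_fasta = ">scaffold1\nACGTNN\nAC\n>scaffold2\nNNGG\n"
print(assembly_stats(toy_fasta))

# The percentage quoted above, recomputed from the two reported counts.
print(round(100.0 * 88748 / 5405081, 2))
```

For the real file you would read its contents and pass them in. Counting both upper and lower case N keeps soft-masked files from being undercounted.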

When we saw the data, with the genome in only 13 scaffolds, we couldn’t help aligning it with the other high-quality de novo assembly we have so far (the BGI version 2 of the TY-2482 strain).

How similar would these two strains be? Could the 454 assembly help scaffold the Illumina-IonTorrent contigs?

Here’s what we got after aligning both genomes using Mauve (http://gel.ahabs.wisc.edu/mauve/)

You can get the results of this Mauve analysis in the GitHub repository https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/comparativeAnalysis/era7bioinformatics/Mauve_H112180280_TY2482

This quick analysis could give us some hints to reduce the number of contigs in both assemblies.

For example, scaffolds 1, 2, 3 and 4 in the HPA assembly (the one above) could be merged into one contig (subject to confirmation with PCR, Sanger sequencing, etc.).

And from the point of view of the TY-2482 assembly, even more contigs could be merged. See for instance the green similarity region at the bottom left (red vertical lines indicate different contigs), as well as the other similarity regions along the whole assembly (the pink, light green, turquoise and purple blocks).


Analysing the automatic annotation we did of the second BGI assembly of the TY-2482 genome (see post here http://blog.ohnosequences.com/2011/06/automatic-annotation-of-the-second-bgi-assembly-of-e-coli-ty-2482-genome/) we have found that this isolate has 3 restriction-modification systems:

  • Type I restriction-modification system encoded in an operon in contig 42: the specificity protein is encoded by gene 79712, the modification protein by gene 84400 and the restriction protein by gene 66267
  • Type II system encoded in an operon in contig 486: the nuclease protein is encoded by gene 21919 and the methyltransferase protein by gene 23135
  • Type III system encoded in contig 493: the nuclease protein is encoded by gene 3634 and the methyltransferase by gene 5265

Type I restriction-modification system encoded in contig 42

Type II restriction-modification system encoded in contig 486

Type III Restriction-modification system encoded in contig 493
