Aligning Paired-End Reads In Single-End Mode

February 15, 2013, 4:28 pm

Hello,

I have a question on how to align paired-end reads.

With very large FASTQ files, aligners like TopHat can crash on a server with limited RAM. I have seen people align one pair mate at a time to work around this. I can't come up with an argument against doing so, but I am tempted to reason that one should always align paired-end reads in "paired-end mode".

My question is: How are downstream analyses affected if one aligns paired-end reads one pair mate at a time, setting BWA or Bowtie/TopHat to single-end mode?

Thanks, G.

↧

Samtools Mpileup And Overlapping Paired-End Reads

November 25, 2013, 11:24 am

I am using samtools mpileup to generate a pileup for paired-end RNA sequencing data. I am curious about how samtools handles pair mates whose read mappings overlap. With regard to the simple example below, for positions 7-11, are both pair mates enumerated?

position     1    6    11   16
reference    ATGCATGCATGCATGC
pair mate 1  ATGCATGCATG
pair mate 2'       GCATGCATGC    (reverse complemented)

I am parsing the pileup output to quantify RNA editing at some positions and do not want to count an 'RNA molecule' twice just because the read pair mates 'overlapped.'

Thanks.
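
One way to avoid double counting when parsing alignments yourself is to record at most one base per read pair at each position. The following is a minimal sketch under simplifying assumptions (sequences already on the reference strand, no indels or clipping); it does not describe what samtools mpileup does internally, and `count_bases_per_fragment` and the tuple layout are hypothetical names chosen for illustration.

```python
from collections import Counter

def count_bases_per_fragment(alignments, position):
    """Count bases at a 1-based reference position, once per read pair.

    alignments: iterable of (pair_name, start, sequence) tuples, where start
    is the 1-based leftmost reference position of the mate and sequence is
    given on the reference strand without indels (a simplifying assumption).
    """
    seen = {}  # pair_name -> base already recorded at this position
    for pair_name, start, sequence in alignments:
        offset = position - start
        if 0 <= offset < len(sequence):
            # setdefault keeps the first mate's base and ignores the second.
            seen.setdefault(pair_name, sequence[offset])
    return Counter(seen.values())

# The two mates from the example above, sharing one pair name.
alignments = [
    ("pair1", 1, "ATGCATGCATG"),  # pair mate 1, positions 1-11
    ("pair1", 7, "GCATGCATGC"),   # pair mate 2', positions 7-16
]
print(count_bases_per_fragment(alignments, 9))
```

Position 9 lies in the overlap, yet the pair contributes a single count. If the two mates disagree at a position, this sketch simply keeps the first one seen; a real analysis would probably want to compare base qualities instead.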

↧

Crossbow Final Step Failing On Emr

October 8, 2013, 12:07 am

Hello, I am trying to run Crossbow via the EMR command line. I managed to complete all the Crossbow steps (alignment with Bowtie, calling SNPs, and postprocessing), but I am getting an error in the final step, Get Counters. Can anyone please help me fix this?

controller

2013-10-08T03:39:40.661Z INFO Fetching jar file.
2013-10-08T03:39:42.169Z INFO Working dir /mnt/var/lib/hadoop/steps/5
2013-10-08T03:39:42.169Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/hadoop-core-0.20.205.jar:/home/hadoop/hadoop-tools-0.20.205.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/5 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/5/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar /home/hadoop/contrib/streaming/hadoop-streaming-0.20.205.jar -D mapred.reduce.tasks=1 -input s3n://crossbow-emr/dummy-input -output s3n://ashwin-test/crossbow-emr-cli_crossbow_counters/ignoreme1 -mapper cat -reducer s3n://crossbow-emr/1.2.1/Counters.pl  --output=S3N://ashwin-test/crossbow-emr-cli_crossbow_counters -cacheFile s3n://crossbow-emr/1.2.1/Get.pm#Get.pm -cacheFile s3n://crossbow-emr/1.2.1/Counters.pm#Counters.pm -cacheFile s3n://crossbow-emr/1.2.1/Util.pm#Util.pm -cacheFi ...

↧

High level of duplicate in one reads of paired-end data

July 21, 2014, 6:52 am

Hi,

We are doing some transcriptomic analysis on bovine immune blood cells, and we seem to have a problem with high levels of duplicates in our data. Our libraries were prepared with the Illumina TruSeq Stranded kit.

First, we sequenced 50 bp single-end reads to test our libraries and found a high level of duplicated reads (~80%) and around 2% stretches of A's and T's. We thought the problem was our libraries, so we sent the RNA to our sequencing facility so they could do all the work (except RNA extraction).

So the sequencing facility did the library preparation and the sequencing. To be sure that our data would be usable, we sequenced these datasets as 100 bp paired-end (3 samples in total). To our surprise, the high level of duplicates we saw in our first sequencing experiment was back, but only in one read, and in the same read for all three samples. The other read had around 10% duplicates in each sample.

Since our RNA seems OK when tested on an Agilent Bioanalyzer (RIN > 9) and the libraries were prepared by a sequencing facility we trust, I'm here to ask: what could be wrong with our data?

Thanks a lot!
Olivier.

↧

Trinity Question

October 1, 2013, 2:18 am

Hi, I'd like to use Trinity to analyze my strand-specific paired-end sequencing data (dUTP), and I am very confused about how to choose the arguments. I have two read files, _1.fa and _2.fa. For the --SS_lib_type argument, I chose RF. As described in the TopHat manual for dUTP (graph below), reads from _1.fa are from the right-most end of the fragment (in transcript coordinates), and reads from _2.fa are from the left-most end. So when I run Trinity, should I use _1.fa for the --right argument and _2.fa for --left? In the command line "Trinity.pl --seqType fa --left left.fa --right right.fa --CPU 4 --JM 4G", does left.fa always mean _1.fa?

[Image: library-type diagram from the TopHat manual, not preserved]

↧

Tool: Trim Adapters Of Paired-End Reads (Fastq)

February 15, 2013, 4:15 am

Trimming adapter sequences in paired-end experiments is sometimes a problem. If you clip the mates in two steps, you might lose one mate but not the corresponding one, resulting in two uneven sets of mates. With the small Perl script clipPairedEndFastq.pl you can clip the adapters of both mates and end up with two correct FASTQ files. If both mates are too short after clipping (<15 nt), both mates are deleted. If one mate is too short after clipping but the other is long enough, there are two possibilities (-n parameter): 1) the mate that is too short is replaced by an "N", or 2) it is replaced by the original (untrimmed) read.

NOTE: cutadapt has to be installed on your machine!

clipPairedEndFastq.pl

usage: clipPairedEndFastq.pl -m1 <file> -m2 <file> -o1 <file> -o2 <file> -s1 <file> -s2 <file>

[INPUT]
 -m1 <file>    raw mates 1
 -m2 <file>    raw mates 2
 -a1 <string>  adapter for mates 1
 -a2 <string>  adapter for mates 2
 -o1 <file>    clipped mates 1
 -o2 <file>    clipped mates 2
 -s1 <file>    clippStat mates 1
 -s2 <file>    clippStat mates 2
 -n  <int>     1: fill mates <15nt with Ns (default)
               0: reset mates <15nt with original mate
 -h <file>     this (usefull) help message

Example: ./clipPairedEndFastq.pl -m1 R1.fq -m2 R2.fq -o1 R1.clipped.fq -o2 R2.clipped.fq -a1 ACGT -a2 ...
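
The pairing rule described above can be sketched in a few lines. This is only an illustration of the stated rule, not the Perl script's actual code: `resolve_pair`, `MIN_LENGTH` and the `fill_with_n` switch (standing in for -n 1 versus -n 0) are hypothetical names, and quality strings are left out for brevity.

```python
MIN_LENGTH = 15  # threshold quoted in the description (<15 nt is too short)

def resolve_pair(raw1, raw2, clipped1, clipped2, fill_with_n=True):
    """Return the (mate1, mate2) sequences to write, or None to drop the pair.

    raw1/raw2 are the untrimmed mates, clipped1/clipped2 the adapter-clipped
    mates. fill_with_n=True mimics "-n 1", fill_with_n=False mimics "-n 0".
    """
    short1 = len(clipped1) < MIN_LENGTH
    short2 = len(clipped2) < MIN_LENGTH
    if short1 and short2:
        return None  # both mates too short: delete the whole pair

    def fix(raw, clipped, too_short):
        if not too_short:
            return clipped
        # Keep the files in sync by writing a stand-in for the short mate.
        return "N" if fill_with_n else raw

    return fix(raw1, clipped1, short1), fix(raw2, clipped2, short2)

raw = "ACGTACGTACGTACGTACGT"            # 20 nt untrimmed mate
print(resolve_pair(raw, raw, raw, "ACGT"))          # second mate becomes "N"
print(resolve_pair(raw, raw, raw, "ACGT", False))   # second mate restored
print(resolve_pair(raw, raw, "ACG", "ACGT"))        # pair dropped
```

Because every decision is made per pair, both output files always contain the same number of records in the same order.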

↧

Mapping Applications On Hadoop Cluster

September 26, 2013, 8:20 pm

Hello,

I would like to know if there are any new mapping applications for paired-end data designed to work on a Hadoop cluster. The mapping applications that I know of are CloudBurst (single-end), Crossbow and DistMap.

PS: I don't have a background in bioinformatics. I am an IT student working on Hadoop infrastructure to assess its performance for an organization.

Thanks

↧

Align Paired End Reads Using Blast

March 24, 2014, 3:17 pm

Hi all,

Has anyone aligned Illumina paired-end reads using BLAST? I used GSNAP to do the alignment first, then used BLAST to align the reads that were not mapped by GSNAP. It seems that BLAST can only align single-end reads, so I aligned the two files separately and got results. But I don't know exactly how to deal with the BLAST results:

1. Should I include only the reads paired across the two files, or include all the reads as results?
2. How do I merge the results with the SAM file?

Because I want to do assembly as the next step, I want to merge the BLAST results with the GSNAP results. Any comments would be appreciated. Thanks.
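
For the first question, one option is to compare the query names that received hits in the two BLAST runs. The sketch below assumes tabular BLAST output, where the first tab-separated column is the query ID, and assumes mate names end in /1 and /2; both are assumptions about the data, and the function names are hypothetical.

```python
def mate_base_name(query_id):
    """Strip a trailing /1 or /2 so both mates share one name (assumed naming)."""
    return query_id[:-2] if query_id.endswith(("/1", "/2")) else query_id

def queries_with_hits(tabular_lines):
    """Collect query names from tabular BLAST lines (first column = query ID)."""
    names = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            names.add(mate_base_name(line.split("\t", 1)[0]))
    return names

def split_by_pairing(hits_mate1, hits_mate2):
    """Return (names where both mates have hits, names where only one does)."""
    both = hits_mate1 & hits_mate2
    single = (hits_mate1 | hits_mate2) - both
    return both, single

# Toy lines with only the first few columns shown; real output has more.
mate1 = ["readA/1\tchr1\t98.0", "readB/1\tchr2\t95.5"]
mate2 = ["readA/2\tchr1\t97.1", "readC/2\tchr3\t99.0"]
both, single = split_by_pairing(queries_with_hits(mate1), queries_with_hits(mate2))
print(sorted(both), sorted(single))
```

Whether to keep the single-mate hits is a separate decision; this only separates the two groups so they can be handled differently.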

↧

Help Needed To Run Seal For Genome Mapping.

October 17, 2013, 10:39 pm

Hello, I managed to build an index file from the reference genome using Seal. Now I am trying to run Seqal but I am running into errors.

./seqal /user/hadoop/seal/synthetic_prq1 /user/hadoop/seal_output /mnt/shared/data/version0.5.x/index.tar
Traceback (most recent call last):
  File "./seqal", line 39, in <module>
    from bl.mr.seq.seqal.seqal_run import SeqalRun
  File "/mnt/shared/seal-0.3.2-src/build/bl/mr/seq/seqal/__init__.py", line 25, in <module>
    from pydoop.pipes import runTask, Factory
ImportError: No module named pydoop.pipes

I have a Hadoop cluster with 1 master and 5 slaves (version 1.0.3). I have installed pydoop in the shared NFS folder on the cluster.

Can anyone please tell me how to get past this error?

Thanks, Ashwin

↧

Assembly Aligned Paired-End Reads

March 20, 2014, 9:44 am

Hi all,

I have a set of mapped paired-end reads and I would like to assemble the ones that overlap respecting the pairing information.

This means assembling pairs only when the two first mates overlap and the two second mates overlap too. The reads are already mapped to a genome, so there is nothing more to do with the sequences, only with the positions.

The goal is to get the extended positions with the count information.

Pairs example:

chr5:1456-1498,+     chr5:1654-1702,+
chr5:958-1012,+      chr5:1318-1388,+
chr5:1423-1478,+     chr5:1612-1667,+

I would like to get:

2     chr5:1423-1498,+     chr5:1612-1702,+
1     chr5:958-1012,+       chr5:1318-1388,+

I can't find any software that works on the positions; all I can find is FLASH, PEAR, etc., which work on the FASTQ files.

Cheers
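
The merging rule described above can be sketched directly on the coordinates. This is a minimal, unoptimized illustration (it compares all clusters pairwise and repeats until nothing changes), not an existing tool; `Interval`, `merge_pairs` and the other names are hypothetical.

```python
import re
from collections import namedtuple

Interval = namedtuple("Interval", "chrom start end strand")

def parse(text):
    """Parse a string such as 'chr5:1456-1498,+' into an Interval."""
    chrom, start, end, strand = re.fullmatch(
        r"(\S+):(\d+)-(\d+),([+-])", text).groups()
    return Interval(chrom, int(start), int(end), strand)

def overlaps(a, b):
    """True if both intervals share chromosome and strand and overlap."""
    return ((a.chrom, a.strand) == (b.chrom, b.strand)
            and a.start <= b.end and b.start <= a.end)

def union(a, b):
    """Smallest interval covering both a and b."""
    return Interval(a.chrom, min(a.start, b.start), max(a.end, b.end), a.strand)

def merge_pairs(pairs):
    """Merge pairs whose first mates overlap AND whose second mates overlap.

    Returns a list of [count, first_interval, second_interval].
    """
    clusters = [[1, first, second] for first, second in pairs]
    changed = True
    while changed:  # repeat until no two clusters can be joined
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if overlaps(a[1], b[1]) and overlaps(a[2], b[2]):
                    clusters[i] = [a[0] + b[0],
                                   union(a[1], b[1]), union(a[2], b[2])]
                    del clusters[j]
                    changed = True
                    break
            if changed:
                break
    return clusters

def fmt(iv):
    return f"{iv.chrom}:{iv.start}-{iv.end},{iv.strand}"

pairs = [
    (parse("chr5:1456-1498,+"), parse("chr5:1654-1702,+")),
    (parse("chr5:958-1012,+"), parse("chr5:1318-1388,+")),
    (parse("chr5:1423-1478,+"), parse("chr5:1612-1667,+")),
]
for count, first, second in merge_pairs(pairs):
    print(count, fmt(first), fmt(second), sep="\t")
```

On the three example pairs this yields the two rows requested above. For large inputs, sorting by chromosome and start position and sweeping once would scale far better than the pairwise comparison used here.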

↧

Combination Of Paired-End And Single-End Samples In Chip-Seq Tf Study

November 22, 2013, 3:33 am

I have 2 batches of ChIP-seq samples:

(A) One biological SE replicate

This batch actually consists of 4 SE ChIP-seq samples: one treatment and one control, both with IP controls. All were SE sequenced at lab A.

(B) Two biological PE replicates

This batch consists of 8 PE ChIP-seq samples: 2 treatments and 2 controls, all with IP controls. All were PE sequenced at lab B.

The idea is to establish differential binding between the treatment and the controls. I plan to use MACS and DiffBind, but I am open to suggestions.

My first specific question is how to treat the PE samples. I see 3 options:

(1) Treat all forward and reverse reads as SE reads

(2) Take just the forward or the reverse reads

(3) Treat each PE sample as, essentially, 2 technical replicates (one consisting of forward reads, one of reverse)

I find (3) intuitively attractive but, again, I am open to suggestions. If I go for it, then the model would need to introduce 2 additional factors: one accounting for the technical replicates, another for the batch effect. So far I have dealt with batch effects only. I presume that adding an additional factor should be straightforward, but please let me know if there are things that I need to pay attention to.

↧

What Better Way To Get The Paired Reads Aligned Against The Reference Genome?

March 23, 2013, 6:37 pm

Hi everybody,

I have a set of paired reads sequenced using SOLiD 4 (50 bp each mate). I discovered that the reads are contaminated with E. coli. My strategy is to align the reads against the reference genome and against the E. coli genomes, and to separate the aligned and unaligned reads, respectively.

My question is: what is the better way to get the paired reads, from the SAM file or during alignment? I use SHRiMP, which allows the parameters --al (aligned reads) and --un (unaligned reads).

Can you help me?

↧

Bfast Match Paired End Reads - Reports Half Total Number Of Reads

January 17, 2013, 4:39 pm

I'm using bfast 0.7.0a and testing on the paired-end data presented in the bfast user manual (Figure 5.4 in bfast-book.pdf). The format shown for this FASTQ file requires that paired reads follow sequentially in the file (read1R1, read1R2, read2R1, read2R2, etc.). The same name is to be used for the sequential reads in a pair. So the user manual data has 4 reads = two pairs. When I run bfast match:

bfast match -A 0 -t -n 16 -f hg19.fa -i 1 -r bfast_book.fastq
************************************************************
Checking input parameters supplied by the user ...
Validating fastaFileName hg19.fa.
Validating readsFileName bfast_book.fastq.
Validating tmpDir path ./.
**** Input arguments look good!
************************************************************
************************************************************
Printing Program Parameters:
programMode:                            [ExecuteProgram]
fastaFileName:                          hg19.fa
mainIndexes                             1
secondaryIndexes                        [Not Using]
readsFileName:                          bfast_book.fastq
offsets:                                [Using All]
loadAllIndexes:                         [Not Using]
compression:                            [Not Using]
space:                                  [NT Space]
startReadNum:                           1
endReadNum:                             2147483647
keySize:                                [Not Using]
maxKe ...

↧

How To Convert A Soapsnp Output File To Soap, Sam Or Bam Formats

October 2, 2013, 6:38 pm

Hello,

I would like to know if there is any tool to convert SOAPsnp to SOAP? Once it is converted to SOAP, I can convert it to SAM using Soap2Sam from reseqtools. Hopefully using reseqtools is not that complicated.

If this is complicated, I wouldn't mind trying the reverse, that is, somehow converting the SAM file to SOAPsnp files.

Background: I ran BWA on an HPC cluster, so I have SAM/BAM files from that test. Now I have to generate SAM/BAM files from the same data by running tests on Hadoop/AWS. I tried using DistMap but ran into some errors, which I am still trying to fix. Now I am thinking of using Crossbow, which seems to be relatively easy to run on AWS. After that I have to compare both sets of results in terms of quality, time, etc.

Hoping to find some help from Biostar members.

Thanks, Ashwin

↧

Wgsim Mutations In Output After Setting Everything To 0

April 5, 2013, 3:55 pm

I was just wondering, is there any useful information on wgsim? Tutorial? Anything? I have been stuck with it for the last 2 weeks. I'm really not sure how to use it. I need it for a project of mine. For example, I downloaded a genome from NCBI. What I do is call wgsim like this:

./wgsim -e 0 -s 0 -N 1000 -1 30 -2 30 -r 0 -R 0 -X 0 -A 0 test_genome_one_row.fa read1.fa read2.fa

With this, I would expect all reads to be identical to parts of the genome, since I set all the error parameters to 0. But somehow I get reads with mutations (or something else, because they don't occur in the original genome). What is going on here? Can somebody please explain wgsim's arguments and how I can really control its behaviour? Thanks!
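
A quick way to check whether simulated reads really differ from the genome is to look up each read, and also its reverse complement, in the reference. The sketch below assumes that one mate of a pair may be written on the reverse strand, in which case a plain string comparison would make it look mutated; `classify_read` and the toy genome are hypothetical and only for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def classify_read(read, genome):
    """Report whether a read matches the genome exactly on either strand."""
    read, genome = read.upper(), genome.upper()
    if read in genome:
        return "forward"
    if reverse_complement(read) in genome:
        return "reverse"
    return "mismatch"

genome = "ATGCCGTAAGCT"  # toy single-sequence reference
for read in ("GCCGT", "ACGGC", "AAAAA"):
    print(read, classify_read(read, genome))
```

Reads reported as "mismatch" on both strands would be the ones worth investigating further, for example by checking the simulator's remaining options.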

↧

Joining Paired-End Illumina Raw Reads

March 26, 2014, 11:53 am

I recently had amplicons sequenced (Illumina PE250) to investigate a microbial community. What quality checks and what kind of trimming of the raw reads, using which programs, should be performed before joining the paired-end raw reads? I think low-quality regions and any barcode or primer sequences present in the raw reads should be removed in advance. Is this right?

↧

Tophat 2 - Both Pairs Map Concordantly

August 24, 2012, 2:15 am

Hi,

I have just run the following command

tophat2 --solexa1.3-quals -p 12 -r 80 --max-multihits 1 --no-mixed --no-discordant /home/Turkey/Index/turkeyindex /home/Turkey/WTCHG24920061sequence1.fa /home/Turkey/WTCHG24920062sequence_2.fa

I understand this is instructing Tophat 2 to map the reads so that no multiple hits are allowed, and both pairs of reads have to map concordantly.

However, when I examine the stats of the output in Bamtools I get the following

Total reads: 33102389
Mapped reads: 33102389 (100%)
Forward strand: 16600010 (50.1475%)
Reverse strand: 16502379 (49.8525%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 33102389 (100%)
'Proper-pairs': 28865628 (87.201%)
Both pairs mapped: 30696760 (92.7328%)
Read 1: 16559354
Read 2: 16543035
Singletons: 2405629 (7.26724%)

Why do only 92.7% of the reads fall under "Both pairs mapped", while 7.3% are singletons? Given the options I specified, I would expect no singleton reads and 100% of reads with both mates mapped.

Any help would be much appreciated.

Thanks

↧

Does The Mapping Of A Read'S Pair Affect Its Own Mapping Score In Bwa?

January 6, 2014, 9:02 am

I would like to know more about how BWA treats the mapping score of paired end reads.

I am familiar with the process used to assign mapping scores by BWA (thanks to this post). I see that the mapping score is affected by the correct mapping of a read's pair. But to what extent?

For instance, if one read in a pair is mapped and the other is not, does the mapped read receive a reduction in its own mapping score?

Further, does anyone know which file and location in the BWA source assigns the mapping score?

Thanks in advance.

↧

Resampling Fastq Sequences Without Replacement

March 15, 2013, 12:35 pm

Hello, I want to extract a random sample (without replacement) of 7.5 million FASTQ sequences from Illumina sequencing data that contains about 30 million sequences in each of the two read files. I want to extract the same sequences from both files (e.g., if I extract sequence no. 31 from read 1, I want to extract sequence no. 31 from read 2 as well). How can I do this? Is there a script or module I can use? Any help would be appreciated. Thanks
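
One way to do this is to draw the record indices once and apply the same set to both files. The sketch below assumes plain 4-line FASTQ records and that both files list the mates in the same order; `choose_indices` and `select_records` are hypothetical helper names, not part of an existing package.

```python
import random

def read_fastq(lines):
    """Group an iterable of FASTQ lines into 4-line records (tuples)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []

def choose_indices(total, k, seed=0):
    """Pick k distinct record indices out of total (without replacement)."""
    return set(random.Random(seed).sample(range(total), k))

def select_records(lines, keep):
    """Yield only the records whose 0-based index is in keep."""
    for index, record in enumerate(read_fastq(lines)):
        if index in keep:
            yield record

def toy_fastq(mate, n):
    """Build a tiny in-memory FASTQ file for demonstration."""
    lines = []
    for i in range(n):
        lines += [f"@read{i}/{mate}\n", "ACGT\n", "+\n", "IIII\n"]
    return lines

keep = choose_indices(total=10, k=4, seed=1)
sample1 = list(select_records(toy_fastq(1, 10), keep))
sample2 = list(select_records(toy_fastq(2, 10), keep))
for rec1, rec2 in zip(sample1, sample2):
    print(rec1[0], rec2[0])
```

With real data, `lines` would be the open file handles for the two read files, streamed one after the other with the same `keep` set, so that neither file has to be loaded into memory. The record count needed for `choose_indices` is the line count divided by four.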

↧

Best way to map paired-end, uniquely mapped reads with Bowtie2?

April 17, 2014, 3:15 am

This has come up before, but I have never been satisfied with, or completely understood, the answers. My issue is how Bowtie2 can be used to map uniquely mapping paired-end reads (Illumina) to a genome for the purpose of ChIP-seq analysis. Unlike the original Bowtie, there is no magical '-m1' flag to enforce the reporting of uniquely mapped reads.

My current strategy is to run Bowtie2 with '-k 2', which reports up to two alignments per read. The resulting SAM file can then be filtered to remove reads containing the 'XS' tag, which reports the alignment score of the second-best alignment.

I am fairly happy with this when using single-end reads, but with paired-end reads, samtools flagstat on the filtered BAM file can show an odd number of properly paired reads.

I would be interested to get an opinion on this method and to hear whether anyone else has tried and tested alternatives.

Thanks.
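
Filtering record by record can remove one mate and keep the other, which is one possible source of an odd proper-pair count. A pair-aware filter avoids that by deciding per read name. The sketch below works on SAM text lines and assumes the optional tags start at the 12th column, as in the SAM format; it holds all records in memory, so it is an illustration rather than a production filter, and `filter_unique_pairs` is a hypothetical name.

```python
from collections import defaultdict

def filter_unique_pairs(sam_lines):
    """Keep a pair only if it has exactly two records and neither has XS:i."""
    header = []
    by_name = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)  # header lines pass through unchanged
        else:
            by_name[line.split("\t", 1)[0]].append(line)

    kept = []
    for records in by_name.values():
        has_xs = any(
            tag.startswith("XS:i:")
            for record in records
            for tag in record.rstrip("\n").split("\t")[11:]  # optional tags
        )
        if len(records) == 2 and not has_xs:
            kept.extend(records)
    return header + kept

example = [
    "@HD\tVN:1.0\n",
    "p1\t99\tchr1\t100\t42\t4M\t=\t300\t204\tACGT\tIIII\tAS:i:0\n",
    "p1\t147\tchr1\t300\t42\t4M\t=\t100\t-204\tACGT\tIIII\tAS:i:0\n",
    "p2\t99\tchr1\t500\t1\t4M\t=\t700\t204\tACGT\tIIII\tAS:i:0\tXS:i:0\n",
    "p2\t147\tchr1\t700\t42\t4M\t=\t500\t-204\tACGT\tIIII\tAS:i:0\n",
]
for line in filter_unique_pairs(example):
    print(line, end="")
```

In the toy input, pair p2 is dropped as a whole even though only one of its mates carries the tag, so the output always contains an even number of alignment records.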

↧