Estimating Insert Size From Paired End Data.

February 27, 2014, 10:30 pm

Hi,

I have paired-end data from Illumina. To estimate the insert size in silico (from scratch), I aligned the reads as single-end reads to the mouse genome, so I now have two alignment files (SAM/BAM). I would like to estimate the insert size (as a distribution plot) from these two SAM/BAM files.

The existing tools look for the "=" in the mate-reference field of the SAM format to identify the corresponding mate, rather than matching on read name. Is anyone aware of a tool that estimates the insert size based on read names from the two SAM files?

Thanks.
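One way to approach this without the "=" mate fields is to pair the records from the two files by read name and measure the outer distance between the two alignments. The following is a minimal sketch, not an existing tool: `parse_sam` and `insert_sizes` are hypothetical helper names, the input is assumed to be uncompressed SAM text, the `max_size` cutoff is an arbitrary assumption, and the read length stands in for the aligned span (so indels and clipping are ignored).

```python
def parse_sam(lines):
    """Map read name -> (reference, 1-based start, read length, is_reverse).

    Only mapped, primary records with a stored sequence are kept.
    """
    hits = {}
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if flag & 0x4 or flag & 0x900:  # unmapped, secondary or supplementary
            continue
        if f[9] == "*":  # no sequence stored, length unknown
            continue
        name = f[0]
        if name.endswith(("/1", "/2")):  # drop mate suffix so names match
            name = name[:-2]
        hits[name] = (f[2], int(f[3]), len(f[9]), bool(flag & 0x10))
    return hits


def insert_sizes(sam1_lines, sam2_lines, max_size=10000):
    """Outer distance between mates that were aligned separately (assumed cutoff: max_size)."""
    first = parse_sam(sam1_lines)
    second = parse_sam(sam2_lines)
    sizes = []
    for name, (ref1, pos1, len1, rev1) in first.items():
        if name not in second:
            continue
        ref2, pos2, len2, rev2 = second[name]
        # Skip pairs on different references or on the same strand.
        if ref1 != ref2 or rev1 == rev2:
            continue
        start = min(pos1, pos2)
        end = max(pos1 + len1, pos2 + len2)
        if end - start <= max_size:
            sizes.append(end - start)
    return sizes
```

Passing the lines of the two SAM files (for example from `open(...)`) returns a list of sizes that can be fed to any histogram function to get the distribution plot.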

↧

Coverage For Pair-End Rna-Seq, Extend Reads Or Not?

February 27, 2013, 8:02 pm

Hi all,

I have a question about computing region coverage for paired-end RNA-Seq data. As shown in the sketch below, should I compute coverage from the reads as actually mapped, or from reads extended across the fragment? The coverage of the impacted region may differ between the two approaches, and I wonder which is more reasonable. I want to compute and compare coverage for small regions of about 100 bp, not regions as large as a gene. In addition, one of the samples being compared was sequenced paired-end and the other single-end. Which way of computing coverage is more suitable in this case?

I have a RIP-Seq data set and want to use this RNA-Seq data as a control to call peaks. The genome would be binned into 100 bp regions, and the coverage of each region would be computed for both RIP-Seq and RNA-Seq. A Fisher's exact test would then be used to select regions significantly enriched in RIP-Seq.

The regions are predefined, so what I want to ask is: "How many fragments were sequenced from this region?" If one of my regions happens to lie between the two ends of a fragment, its RNA-Seq coverage would be 0 if only the mapped tags were used, even though the region was surely covered by the fragment.

Thank you very much!

Reference                       ========--------------------------------------------------===========
Actually mapped reads            ^^^^                                                            $$$$
Extended reads                   ^^^^^^^                                                  $$$$$$$$$$$
Impacted r ...
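The difference between the two counting schemes in the sketch can be made concrete with a few lines of code. This is only an illustration under stated assumptions: intervals are 0-based and half-open, `bin_counts`, `read_intervals` and `fragment_intervals` are hypothetical helper names, and the fragment is taken to be the full span between the outer ends of the two mates. For RNA-Seq that span can include introns that were never sequenced, so fragment-based counts inside such gaps should be treated with caution.

```python
def bin_counts(intervals, bin_size=100):
    """Count how many intervals (0-based, half-open) overlap each bin index."""
    counts = {}
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            counts[b] = counts.get(b, 0) + 1
    return counts


def read_intervals(pairs):
    """Yield each mate as its own interval (the 'actually mapped reads' scheme)."""
    for mate1, mate2 in pairs:
        yield mate1
        yield mate2


def fragment_intervals(pairs):
    """Yield one interval per pair, spanning both mates (the 'fragment' scheme)."""
    for (s1, e1), (s2, e2) in pairs:
        yield (min(s1, s2), max(e1, e2))


# One pair whose mates sit 200 bp apart: bin 1 lies between the two mates.
pairs = [((0, 50), (250, 300))]
by_reads = bin_counts(read_intervals(pairs))
by_fragments = bin_counts(fragment_intervals(pairs))
```

With this toy pair, the middle bin is absent from `by_reads` but counted once in `by_fragments`, which is exactly the discrepancy described for the impacted region.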

↧

20 Mate Alignment With Gaps To Reference Genome

August 22, 2013, 2:24 pm

I'm trying to align a string of twenty 5-mers with gaps of 100 to 400 bp to a reference genome. It's similar to paired-alignment except instead of N=2, N=20. The order of the twenty 5-mers is known.

Is there any software that can handle a data set like this?

Best regards,

-Sam
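For what it is worth, the search itself can be written as a simple chaining problem: find every exact occurrence of each 5-mer, then keep only occurrences that follow a feasible occurrence of the previous 5-mer at a gap within the allowed range. The sketch below is an assumption-laden illustration rather than a recommendation of a tool: `find_all` and `chain_kmers` are hypothetical names, matching is exact and forward-strand only, and only one chain is reported per end position. Since any given 5-mer is expected roughly once per 1,024 bases by chance, a real genome would yield a very large number of candidate chains.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (overlaps allowed)."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def chain_kmers(reference, kmers, min_gap=100, max_gap=400):
    """Return tuples of start positions, one per k-mer, that respect the gap limits."""
    # Chains ending at each occurrence of the first k-mer.
    prev = {p: (p,) for p in find_all(reference, kmers[0])}
    for i in range(1, len(kmers)):
        prev_len = len(kmers[i - 1])
        current = {}
        for p in find_all(reference, kmers[i]):
            for q, chain in prev.items():
                gap = p - (q + prev_len)  # bases between the two k-mers
                if min_gap <= gap <= max_gap:
                    current[p] = chain + (p,)
                    break  # keep the first feasible predecessor only
        prev = current
    return list(prev.values())
```

The inner loop is quadratic in the number of occurrences, which is fine for a sketch but would need an index or a sliding window on a whole genome.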

↧

Need help using Shrimp2 on paired end color-space SOLiD data.

April 15, 2014, 8:20 am

Hi, I have paired-end SOLiD reads (75 bp and 35 bp) in .csfasta and .QV.qual format. I would like to use Shrimp2 to align them, but so far I have been having trouble. I used the following command:

gmapper -1 Sample/F3/reads/Hope_2014_02_20_1_01_13_0502_F3.csfasta -2 Sample/F5-DNA/reads/Hope_2014_02_20_1_01_13_0502_F5-DNA.csfasta $SCRATCH/human_hg19.fa -N 32 -p opp-in  > Sample.sam 2> Logs/Sample.log

This is my log file, with the error shown at the bottom. I'm not sure what it means.

- Processing genome file [/Refs/human_hg19.fa]
- Processing contig chr1
- Processing contig chr2
- Processing contig chr3
- Processing contig chr4
- Processing contig chr5
- Processing contig chr6
- Processing contig chr7
- Processing contig chr8
- Processing contig chr9
- Processing contig chr10
- Processing contig chr11
- Processing contig chr12
- Processing contig chr13
- Processing contig chr14
- Processing contig chr15
- Processing contig chr16
- Processing contig chr17
- Processing contig chr18
- Processing contig chr19
- Processing contig chr20
- Processing contig chr21
- Processing contig chr22
- Processing contig chrX
- Processing contig chrY
- Processing contig chrM

Loaded Genome
note: detected fastq format in input file [Sample/F3/reads/Hope_2014_02_20_1_01_13_0502_F3.csfasta]
- Processing read files [Sample/F3/reads/Hope_2014_02_20_1_01_13_0502_F3.csfasta , Sample/F5-DNA/reads/Hope_2014_02_20_1_01_13_0502_F5-DNA.csfasta]

note: quality v ...

↧

How To Check If Illumina Fastq Is Single Or Paired End With Minimal Sequence Id

March 14, 2014, 8:26 am

Hi all, I am trying to check whether a FASTQ file is single-end or paired-end. According to Wikipedia, the default format of the sequence identifier looks like this:

@HWUSI-EAS100R:6:73:941:1973#0/1

but in my case the sequence ID looks like

@HWUSI-EAS100R:6:73:941:1973

with the # part missing. Can I assume that it is single-end? I could not find a good source to learn about this; can you also point me to one? Thanks
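A header check can at least tell the cases apart where the mate number is recorded. The sketch below assumes two common conventions: a trailing `/1` or `/2` on the read name (as in the first example above), and a second whitespace-separated field starting with `1:` or `2:` (the newer Illumina style). `mate_from_header` is a hypothetical helper name. A result of `None` only means the header carries no mate information; it does not prove the run was single-end.

```python
import re


def mate_from_header(header):
    """Return 1 or 2 if the FASTQ header names a mate, otherwise None."""
    parts = header.split()
    # Older style: "@name#index/1" or "@name/2"
    m = re.search(r"/([12])$", parts[0])
    if m:
        return int(m.group(1))
    # Newer style: "@name 1:N:0:INDEX" (mate number leads the second field)
    if len(parts) > 1:
        m = re.match(r"([12]):[YN]:\d+", parts[1])
        if m:
            return int(m.group(1))
    return None
```

Applied to the identifier without the `#0/1` part, the function returns `None`, so the file name, a matching second file, or the sequencing provider would have to settle the question.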

↧

Understanding Samtools Flagstat Output

October 23, 2013, 10:12 pm

The following is the output of the samtools flagstat command on a paired-end BAM file generated by Picard's MarkDuplicates.

7417232 + 0 in total (QC-passed reads + QC-failed reads)
287618 + 0 duplicates
4534962 + 0 mapped (61.14%:-nan%)
7417232 + 0 paired in sequencing
3708616 + 0 read1
3708616 + 0 read2
4528278 + 0 properly paired (61.05%:-nan%)
4534962 + 0 with itself and mate mapped

I am having difficulty understanding whether the duplicates are counted as pairs or as single reads. If there are a total of 7417232 pairs and 287618 of them are duplicates, that means roughly 3.9% of the reads in my data are duplicates. Is my understanding correct?
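The numbers in the output can be checked directly. flagstat counts individual alignment records (reads), not pairs, which is visible from the fact that read1 + read2 equals the "paired in sequencing" total. The short calculation below just redoes the arithmetic on the values quoted above; `percent` is a hypothetical helper.

```python
# Values copied from the flagstat output above.
flagstat = {
    "total": 7417232,
    "duplicates": 287618,
    "mapped": 4534962,
    "read1": 3708616,
    "read2": 3708616,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The total is a count of reads: it is the sum of read1 and read2.
total_is_reads = flagstat["read1"] + flagstat["read2"] == flagstat["total"]
number_of_pairs = flagstat["read1"]

duplicate_pct = percent(flagstat["duplicates"], flagstat["total"])
mapped_pct = percent(flagstat["mapped"], flagstat["total"])
print(f"pairs: {number_of_pairs}, duplicate reads: {duplicate_pct:.2f}%, mapped: {mapped_pct:.2f}%")
```

So the file holds 3708616 pairs, and the duplicate figure is a number of reads flagged as duplicates out of 7417232 reads.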

↧

Is There An Elegant Way To Extract Only The Properly-Paired Reads In A Sam/Bam File?

April 29, 2013, 7:28 am

I know I should be filtering for the following flags: 99, 163, 83 and 147, and I know that samtools can be used to get all the pairs. For example: samtools view -F 0x99 -b in.bam. I was wondering whether there is a more elegant way to do this than running samtools four times, once per flag. It also occurred to me that I would probably have to sort the BAM files afterward to ensure that the pairs are in the same order, which means I would have to run the sort four times as well.

I would appreciate knowing whether there is a better way to do this.
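Decoding the four values shows what they have in common. The SAM flag is a bit field, and 99, 163, 83 and 147 all have the "properly paired" bit (0x2) set, so a single filter on that bit, for example `samtools view -b -f 2 in.bam`, should select all of them in one pass; because filtering does not reorder records, no extra sorting should be needed. The snippet is a small sketch for inspecting flags; `FLAG_BITS`, `decode_flag` and `is_properly_paired` are hypothetical names. Note also that decimal 99 is 0x63 in hexadecimal, not 0x99.

```python
# The low eight bits of the SAM FLAG field, as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """List the names of the bits set in a SAM flag (low eight bits only)."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def is_properly_paired(flag):
    """True if the aligner marked the read as part of a proper pair."""
    return bool(flag & 0x2)


for value in (99, 163, 83, 147):
    print(value, hex(value), decode_flag(value))
```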

↧

Sorting Fastq Files After Trimming (Orphans And Pe)

December 23, 2012, 2:37 pm

I have a bunch of Illumina PE data that has been run through the FASTX trimmer and clipper. I am ready to map these reads, but I need to create two files for the paired-end reads (left and right reads in separate files) and a third file with the orphaned reads. Of course, the two paired-end files need to have the reads in the same order.

This has to be a common problem, but I can't seem to find a tool that parses fastq files in this way (I swear I searched the Biostar forum).

Any help would be greatly appreciated.
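In case it helps to see the logic, re-pairing can be done by indexing one file by read name and streaming through the other. This is a minimal sketch under several assumptions: records are strictly four lines each, mates share a name up to an optional `/1` or `/2` suffix, and the first file fits in memory. `read_fastq`, `pair_key` and `repair` are hypothetical names, not functions from an existing tool.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input or truncated record
            return
        yield record


def pair_key(header):
    """Read name without the mate suffix or trailing description."""
    name = header.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def repair(lines1, lines2):
    """Return (paired_left, paired_right, orphans); paired lists share one order."""
    left = {pair_key(r[0]): r for r in read_fastq(lines1)}
    paired_left, paired_right, orphans = [], [], []
    matched = set()
    for record in read_fastq(lines2):
        key = pair_key(record[0])
        if key in left:
            paired_left.append(left[key])
            paired_right.append(record)
            matched.add(key)
        else:
            orphans.append(record)
    # Left reads whose mate was dropped by trimming are orphans too.
    orphans.extend(r for k, r in left.items() if k not in matched)
    return paired_left, paired_right, orphans
```

Writing each of the three lists back out, four lines per record, gives the two synchronized paired files and the orphan file.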

↧

Orientation in paired-end sequencing?

June 16, 2014, 3:17 pm

I am new to bioinformatics and currently learning how to use Bowtie 2. As written in the manual:

A pair that aligns with the expected relative mate orientation and with the expected range of distances between mates is said to align "concordantly". If both mates have unique alignments, but the alignments do not match paired-end expectations (i.e. the mates aren't in the expected relative orientation, or aren't within the expected distance range, or both), the pair is said to align "discordantly".

I have read about the basics of paired-end sequencing and orientation (in the molecular biology sense). In summary, my understanding is that in paired-end sequencing we sequence both ends of a DNA fragment at the 5' end and the 3' end (we call them mate 1 and mate 2) and by knowing the expected length between the mates we can better align the fragment to a reference genome.

My question is, what is meant by two mates having an expected relative orientation? If I am using Bowtie 2 and giving it a file with all the first mates and another one with all the corresponding second mates, how can two mates align without first having the expected orientation?
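It may help to see that the input files only say which read is mate 1 and which is mate 2; where each mate lands, and on which strand, is decided by the alignment. The relative orientation is then a property of the two resulting alignments. The sketch below classifies a pair from the leftmost positions and strands of its mates. It is a simplification (read lengths and overlapping mates are ignored), `pair_orientation` is a hypothetical helper, and the labels follow the forward/reverse naming used by Bowtie 2's `--fr`, `--rf` and `--ff` options.

```python
def pair_orientation(pos1, reverse1, pos2, reverse2):
    """Classify two mates aligned to the same reference.

    pos1/pos2 are leftmost alignment positions; reverse1/reverse2 are True
    when the mate aligned to the reverse strand.
    """
    if reverse1 == reverse2:
        return "FF"  # both mates on the same strand
    # Look at the strand of whichever mate lies upstream.
    upstream_reverse = reverse1 if pos1 <= pos2 else reverse2
    # Forward mate upstream: mates point toward each other (FR).
    # Reverse mate upstream: mates point away from each other (RF).
    return "RF" if upstream_reverse else "FR"
```

For a standard Illumina paired-end library the expected result is "FR"; a pair whose alignments come out as "RF" or "FF", or too far apart, is what the manual calls discordant.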

↧

Warnings In Bowtie Mapping

October 16, 2013, 10:23 pm

Hello, I am trying to use bowtie on a small synthetic data set for short-read mapping. My command is ./bowtie -p 8 -t -S hg19 -1 synthetic_sample1.fq -2 synthetic_sample2.fq > bowtie.sam. The alignment stats are really bad.

Seeded quality full-index search: 00:03:12
# reads processed: 1000003
# reads with at least one reported alignment: 129 (0.01%)
# reads that failed to align: 999874 (99.99%)
Reported 129 paired-end alignments to 1 output stream(s)
Time searching: 00:03:17
Overall time: 00:03:17
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

My error log reports the following for almost every read, I think, so it is a long list of similar warnings:

Warning: Exhausted best-first chunk memory for read chrY_25607312_25607822_7:0:0_4:0:0_13d4/1 (patid 986138); skipping read

When I googled the warning, I saw some suggestions to use --chunkmbs when running bowtie. I am not really sure what that option does; I couldn't understand it from the manual. Still, I used it as suggested in one of the forums, with this command: ./bowtie -p 8 -t --chunkmbs 256 -S hg19 -1 synthetic_sample1.fq -2 synthetic_sample2.fq > bowtie.sam. Then my error log says

Time loading reference: 00:00:01
Time loading forward index: 00:00:01
Time loading mirror index: 00:00:02
Error: Could not allocate ChunkPool of 268435456 bytes
Warning: Exhausted best-first chunk memory for read Error: Could not allocate ChunkPool ...

↧

Should I merge bacterial RNA-seq paired-end data?

June 19, 2014, 1:41 am

Hi,

I have bacterial RNA-seq data from paired-end reads (2x75). I'm interested in sense/antisense differential expression.

I would like to know whether it makes sense to merge R1 and R2 into one read (if they overlap), or whether I should work only with the 75 bp reads of one mate (the R2 ones).

Thanks, Bernardo

↧

Mapping Trimmed Paired End Reads With Mapsplice

April 9, 2012, 9:12 pm

Hi! I'm trying to map Illumina paired-end reads using MapSplice. I used a script named sickle to trim the low-quality tails, and the output looks like this:

Paired end 1

@TUPAC_0006:1:1:3062:1473#0/1
TGTATGTATCTATAACAATGGCCTTTTGTGTTGTTTTTCTGGGCATGTTGTATGACAATTTCTATTACAGCATATG
+
fYcfffffffgggggfggggdgggfgfggggfggggggggggfgggggggfgggdgegggggggeggcaaf]cefg
@TUPAC_0006:1:1:3623:1474#0/1
TGGGAGGTTAAGAGTAGCATGAAGAACTTAAGATGAGGATAAGAGTCTAAATTTTTAGTTTCAAGGTTTCAATAGA
+
aa^\Sf``cfgg_cfcfcffeeSfff]ffcfffffWfb`dfefeffcfcc[ffffff_fefaaeffdefaacacdg
@TUPAC_0006:1:1:4325:1480#0/1
CAAGAACCGACTAAAACAGACACCATTATGAACTCAATTAATCACCACAACACCC
+
a\^aQ^\a^ZGSNUQd^ZZ[ZZV[Zfce]c]]W^^]]^^]``_``[aa_acaaa_
@TUPAC_0006:1:1:4592:1480#0/1
CGGTTCCTTAAATCAAGTTACATGATGTTGCCTGAATTCAAGTAAACAGGAATGAATAATCCTCCTGTAGGGAAGA
+
haghgghhghgghhhhhhhghechhghh]cfhhhhhhhg]hghhhfhhhhgghghfhchhfdhgghhghhchecfd
@TUPAC_0006:1:1:10714:1476#0/1
GAGCGATACTGATGATGAAATTTTCAAAAATGACTGCCAACTATTTTGAG
+
VDLTMRTTRVPVPV[\_`aTccc[_ccc_a_aRca]aa]a[c[aS^_]]a
@TUPAC_0006:1:1:4409:1497#0/1
GGGAGGGCGATGAGGACTAGGATGATGGCGGGCAGGATAGTTCAGACGGTTTCTATTTCCTGAGCGTCTG
+
fWfafffcfffddddaafaffWff]bdf`fd```ecadWb]cffff_ffdaabb]c_fa]a`Q`^`W\\^
@TUPAC_0006:1:1:6557:1505#0/1
GGCTTATTGTACAGATTATTTTATCACCCAGGTATTAAGCTTAGTACCCATTAGATATCTTTCCCAATCCTCTACC
+
g\_fggggggffffcgggfgggggggfgcfffbffaffffIXZUSgggggggggggfgcgggggggggcfggaefg

Paired end 2

@TUPAC ...

↧

Help Needed To Run Seal For Genome Mapping.

October 17, 2013, 10:39 pm

Hello, I managed to build an index file from the reference genome using Seal. Now I am trying to run Seqal but I am running into errors.

./seqal /user/hadoop/seal/synthetic_prq1 /user/hadoop/seal_output /mnt/shared/data/version0.5.x/index.tar
Traceback (most recent call last):
  File "./seqal", line 39, in <module>
    from bl.mr.seq.seqal.seqal_run import SeqalRun
  File "/mnt/shared/seal-0.3.2-src/build/bl/mr/seq/seqal/__init__.py", line 25, in <module>
    from pydoop.pipes import runTask, Factory
ImportError: No module named pydoop.pipes

I have a Hadoop cluster (version 1.0.3) with 1 master and 5 slaves. I have installed pydoop in the shared NFS folder on the cluster.

Can anyone please tell me how to get past this error?

Thanks, Ashwin

↧

Dealing With Read Counts Under Pe And Se Scenarios

August 10, 2012, 4:22 am

Hi, I am unsure how to go about analysing RNA-seq data in the following case. Suppose that you have a control and treatment setup with 4 biological replicates each. However, two control and two treatment samples were pooled together and paired-end sequenced, and the other 4 were single-end sequenced. That is:

Condition    Sample_No    Sequence_Type
Control        1,2          paired-end
Control        3,4          single-end
Treatment      1,2          paired-end
Treatment      3,4          single-end

Now, suppose I'd want to perform a differential gene expression analysis, how would you take care of the difference in the reads (due to PE and SE) within conditions?

i) You map the reads as they are, PE as PE and SE as SE libraries. You count the total number of reads that fall under each sample. You then normalize for the difference in library size using edgeR's TMM method and perform the differential expression analysis.

ii) You map the reads as they are, PE as PE and SE as SE libraries. You count only the reads that are first in pair (meaningful for the PE samples) that fall under each sample. You then normalize for the difference in library size using edgeR's TMM method and perform the differential expression analysis.

iii) You discard the second read of each pair altogether and treat the data as two conditions with 8 SE libraries.

Understanding the inherent mess-up in the experimental setup and the possibility of bias and ...
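Option ii) amounts to counting each sequenced fragment once, whether it came from a PE or an SE library. In terms of SAM flags, that rule could be sketched as follows; `counts_as_fragment` and `count_fragments` are hypothetical helpers, and the rule is an assumption about how one might implement the idea, not what any particular counting tool does. One known limitation is that a pair in which only the second mate mapped is not counted.

```python
def counts_as_fragment(flag):
    """Decide whether a SAM record should add one to the fragment count.

    Single-end reads always count when mapped; paired reads count only
    through their first mate, so a pair is never counted twice.
    """
    if flag & 0x4:  # read unmapped
        return False
    if flag & 0x900:  # secondary or supplementary alignment
        return False
    paired = bool(flag & 0x1)
    first_in_pair = bool(flag & 0x40)
    return (not paired) or first_in_pair


def count_fragments(flags):
    """Number of records in flags that count as a fragment."""
    return sum(1 for flag in flags if counts_as_fragment(flag))
```

Counting this way keeps the PE and SE libraries on the same unit (fragments) before any library-size normalization is applied.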

↧

How To Assemble Genome Generated With Bac Clones?

July 30, 2013, 12:44 pm

I have 2 FASTQ files from Illumina with a read length of 250 bases. The sequences in one file were obtained by sequencing from the "right" and those in the other from the "left"; this is paired-end sequencing. As this is a whole-genome shotgun technique, both FASTQ files contain vector sequence, the target sequence in the BAC, E. coli DNA and maybe plasmid DNA. So how can I assemble just my target DNA from the BAC? What tools or assemblers should I use?

↧

Trimming Adapters For Paired-End Sequences

February 6, 2013, 10:41 am

Hi all,

I got Illumina paired-end FASTQ files. I was told to trim ~20 to 30 bp from the beginning of read 2 because of the WGA adapters.

Can we find the adapters by looking at the quality values? Which tool is good for trimming adapters while keeping the paired nature of the sequences? Will any issues arise if I use bowtie2 downstream to align the trimmed sequences (with only one read of each pair trimmed) against the reference?

The following is an example of read 2:

@HWI-ST1162:139:C0H7WACXX:5:1101:1865:1112 2:N:0:CGATGTA
GTCATGGTGTCTCTTCACAACAATGGAAACCCTAACTAAGACAAAGACTAATAGAAGTGTTTTTTTAGGAA
+
<9;>;>?1=>;=9=?########################################################

Thanks, Deepthi
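Trimming a fixed number of bases from the start of read 2 does not disturb the pairing as long as no record is dropped and the order is unchanged, since mates are matched by position in the two files. A minimal sketch of the operation on one record is shown below; `trim_start` is a hypothetical helper and the record is assumed to be a (header, sequence, plus, quality) tuple.

```python
def trim_start(record, n):
    """Remove the first n bases and their quality values from a FASTQ record."""
    header, sequence, plus, quality = record
    # Sequence and quality must be cut identically to stay the same length.
    return (header, sequence[n:], plus, quality[n:])


# Toy record, not taken from the data above.
example = ("@read 2:N:0:ACGT", "ACGTACGT", "+", "IIIIHHHH")
trimmed = trim_start(example, 3)
```

Applying the function to every record of the read 2 file, and writing the records back in the same order, leaves the read 1 file untouched and the two files still synchronized.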

↧

Should We Dump Illumina Pair-End Mapping Results In Sam With Mapq=0, But Good Template Length

December 18, 2012, 5:23 am

Hi everyone! I am working on Illumina paired-end sequencing.

After mapping with bwa, I got a pair of reads with MAPQ=0, with both reads mapped to more than one place. But the template length is OK, and this pair of reads appears only once in the SAM file.

So, should we discard this mapped pair?

SPECIFIC_ID 83 Y 58900709 0 92M = 58900679 -122 AGTGCATTCCATTCCAGTCTCTTCAGTTCGATTCCATTCCATTCGTTTCGATTCCTTTCCATTCCAGCCCATTCCATTCCATTGCATTCCTT DDDCDCDDD DDDDDDDDDDDDEEC?FFFFFHHHHHJJIJIHJIJJJJIJJJJJJIJHGJJIHFJJJIGIJJIHIJIJHJJIJJIJJIFHFFH XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:9 X1:i:16 XM:i:1 XO:i:0 XG:i:0 MD:Z:83C8

SPECIFIC_ID 163 Y 58900679 0 92M = 58900709 122 AGAACCTTCCATTACACTCCCTTCCATTCCAGTGCATTCCATTCCAGTCTCTTCAGTTCGATTCCATTCCATTCGTTTCGATTCCTTTCCAT FHGHHIIJJ JGJIIJIJJIIJJIIJEIGEGHIIJJIJIGIIIFCD XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2 X1:i:18 XM:i:0 XO:i:0 XG:i:0 MD:Z:92

Thanks very much!
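The optional tags in these records already say why the MAPQ is 0. As far as I understand the bwa documentation, XT:A:R marks a read placed in a repeat, X0 is the number of best hits and X1 the number of suboptimal hits. The sketch below pulls those values out of a tab-separated SAM line so that pairs can be kept or discarded by an explicit rule; `parse_tags` and `summarize_bwa_hit` are hypothetical helpers, and the example line used to try it is a shortened, made-up record rather than one of the lines above.

```python
def parse_tags(fields):
    """Turn optional SAM fields like 'X0:i:9' into a dict {'X0': 9}."""
    tags = {}
    for item in fields:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def summarize_bwa_hit(sam_line):
    """Extract MAPQ and the bwa hit-count tags from one tab-separated SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = parse_tags(fields[11:])
    return {
        "mapq": int(fields[4]),
        "type": tags.get("XT"),  # e.g. 'U' unique, 'R' repeat
        "best_hits": tags.get("X0"),
        "suboptimal_hits": tags.get("X1"),
    }


# Hypothetical, shortened record for illustration only.
line = "r1\t83\tY\t58900709\t0\t4M\t=\t58900679\t-122\tACGT\tIIII\tXT:A:R\tNM:i:1\tX0:i:9\tX1:i:16"
summary = summarize_bwa_hit(line)
```

Whether such pairs should be dropped depends on the analysis; the summary only makes the ambiguity explicit (nine equally good placements in the first record above).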

↧

Paired-End Protocol For Micrornaseq

July 5, 2013, 3:42 am

In another post, someone wanted to know how to analyze paired-end data and use it to predict microRNAs.

I have never heard of a paired-end protocol for miRNA-seq and would be interested in some more information. Does anyone know this protocol?

My questions would be:

Since the reads from mature miRNAs are very short, do the two mates completely overlap with each other?
Is this protocol still strand-specific? Is the first mate always the one on the correct strand?
I don't think that recent tools like miRDeep or miRanalyzer can handle this information. Are there tools that can?
...

I would be really thankful for some input! :)

↧

Combination Of Paired-End And Single-End Samples In Chip-Seq Tf Study

November 22, 2013, 3:33 am

I have 2 batches of ChIP-seq samples:

(A) One biological SE replicate

This batch actually consists of 4 SE ChIP-seq samples: one treatment and one control, both with IP controls. All were SE sequenced at lab A.

(B) Two biological PE replicates

This batch consists of 8 PE ChIP-seq samples: 2 treatments and 2 controls, all with IP controls. All were PE sequenced at lab B.

The idea is to establish differential binding between the treatment and the controls. I plan to use MACS and DiffBind, but I am open to suggestions.

My first specific question is how to treat the PE samples. I see 3 options:

(1) Treat all forward and reverse reads as SE reads

(2) Take just the forward or the reverse reads

(3) Treat each PE sample as, essentially, 2 technical replicates (one consisting of forward reads, one of reverse)

I find (3) intuitively attractive but, again, I am open to suggestions. If I go for it, then the model would need to introduce 2 additional factors, one accounting for the technical replicates and another for the batch effect. So far I have dealt with batch effects only. I presume that adding an additional factor should be straightforward, but please let me know if there are things that I need to pay attention to.

↧

How To Count Strand-Specific Paired-End Rna-Seq Reads Overlapping Known Protein Coding Genes ?

December 17, 2013, 5:02 am

Dear Biostars,

Does anyone know how to overlap strand-specific paired-end RNA-Seq reads (BAM) with known protein-coding genes (BED)?

I tried the following, but I don't think it is the correct way. I would appreciate your help!

bamToBed -i ES.bam > ES.bed
intersectBed -a ES.bed -b Ensembl_mm9.bed -wa -s |awk '!a[$4]++' |wc -l
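One thing to watch with this approach is that the two mates of a pair align to opposite strands, so a same-strand test applied read by read will accept one mate and reject the other. A common workaround is to derive a single strand for the fragment from the SAM flag before comparing with the gene strand. The sketch below shows that logic; `fragment_strand` is a hypothetical helper, and the `mate1_is_sense` switch is an assumption that has to be set according to the library protocol (in some protocols mate 1 reads the transcript strand, in others mate 2 does).

```python
def fragment_strand(flag, mate1_is_sense=True):
    """Infer the transcript strand of a fragment from one read's SAM flag."""
    on_plus = not (flag & 0x10)  # strand the read itself aligned to
    if flag & 0x80:
        # Mate 2 is sequenced from the opposite strand of mate 1, so flip it.
        on_plus = not on_plus
    if not mate1_is_sense:
        # Protocols where mate 1 is antisense to the transcript.
        on_plus = not on_plus
    return "+" if on_plus else "-"
```

With the strand normalized this way, both mates of a pair report the same strand, and counting unique read names (without any `/1` or `/2` suffix) gives fragments rather than reads.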

↧