
Efficiently join paired-end read coordinates in the same line?

Dear community,

I have a huge paired-end Hi-C dataset (BAM format) that I want to format like this:

    HWI-D00283:117:C5KKJANXX:2:1101:1139:77789 chr6 153338506 153338556 37 + chr6 153338031 153338081 37 -
    HWI-D00283:117:C5KKJANXX:2:1101:1139:77856 chr6 149915169 149915219 37 - chr6 149914908 149914958 37 +
    HWI-D00283:117:C5KKJANXX:2:1101:1139:79414 chr4 184474969 184475019 37 - chr4 184474811 184474861 37 +
    HWI-D00283:117:C5KKJANXX:2:1101:1139:81280 chr6 153641723 153641773 37 - chr6 153641551 153641601 37 +
    HWI-D00283:117:C5KKJANXX:2:1101:1139:81917 chr8 87070282 87070332 37 - chr8 87069851 87069901 37 +
    HWI-D00283:117:C5KKJANXX:2:1101:1139:82575 chr17 56970884 56970934 37 - chr6 151400450 151400500 37 -
    HWI-D00283:117:C5KKJANXX:2:1101:1139:86642 chr6 150043041 150043091 37 - chr6 150042915 150042965 37 +

I obtained this example by first converting the BAM file to BED, separating the mates into two files, and then joining the mates with an awk command. This is the awk command I used:

    awk 'NR==FNR {h[$4] = $1"\t"$2"\t"$3"\t"$5"\t"$6; next} {OFS="\t"; print $4,$1,$2,$3,$5,$6,h[$4]}' mate1 mate2

This command worked fine with small datasets (1M and 10M reads), but when I tried it on a 200M-read file it crashed, presumably because it ran out of memory. ...
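The awk command above keeps every mate1 record in an in-memory array, so its memory use grows with the number of reads. One possible way around this, sketched below, is to sort both files by read name and merge them with the standard `join` utility, which only streams through the sorted files. The function name `join_mates` is made up for illustration. The sketch assumes tab-separated 6-column BED files (chrom, start, end, name, score, strand) and that both mates carry exactly the same read name in column 4, as the awk command implies.

```shell
# Sketch only: join paired mates by read name without holding a file in memory.
# Assumptions: tab-separated 6-column BED input, identical read names for both
# mates in column 4. join_mates is a hypothetical helper name.
join_mates() {
    mate1=$1
    mate2=$2
    tab=$(printf '\t')
    tmpdir=$(mktemp -d) || return 1

    # Sort both files on the read name. LC_ALL=C forces byte-wise ordering so
    # that sort and join agree on what "sorted" means.
    LC_ALL=C sort -t "$tab" -k4,4 "$mate1" > "$tmpdir/m1.sorted"
    LC_ALL=C sort -t "$tab" -k4,4 "$mate2" > "$tmpdir/m2.sorted"

    # join prints the key first, then the remaining fields of the first file,
    # then those of the second. Passing mate2 first reproduces the column
    # order of the original awk command (name, mate2 fields, mate1 fields).
    LC_ALL=C join -t "$tab" -1 4 -2 4 "$tmpdir/m2.sorted" "$tmpdir/m1.sorted"

    rm -rf "$tmpdir"
}

# Example call (file names as in the post):
# join_mates mate1 mate2 > joined.txt
```

Two differences from the awk version are worth keeping in mind: the output comes out ordered by read name rather than in mate2 order, and reads present in only one of the two files are dropped instead of being printed with empty columns. With GNU sort, the `-S` (buffer size) and `-T` (temporary directory) options can be added to the two `sort` calls to control memory and scratch-space use on very large inputs.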
