我自己大概地將CNV-calling的方法快速瀏覽過一遍了,今天先花了點時間找了一下看有哪些工具我自己不需要重寫過一次,歸納如下:
BioAlignments.jl
BED.jl
及BigBed.jl
可使用但說到要有像Python底下pybedtools
近乎是bedtools
那種程度的操作,在Julia裡頭我倒是還沒有看到,所以今天的目標是先用Julia寫個bedtools裡頭makewindows的功能。
為什麼要說是閹割版的makewindows呢?首先得先來看一下bedtools makewindows的說明:
Tool: bedtools makewindows
Version: v2.23.0
Summary: Makes adjacent or sliding windows across a genome or BED file.
Usage: bedtools makewindows [OPTIONS] [-g <genome> OR -b <bed>]
[ -w <window_size> OR -n <number of windows> ]
Input Options:
-g <genome>
Genome file size (see notes below).
Windows will be created for each chromosome in the file.
-b <bed>
BED file (with chrom,start,end fields).
Windows will be created for each interval in the file.
Windows Output Options:
-w <window_size>
Divide each input interval (either a chromosome or a BED interval)
to fixed-sized windows (i.e. same number of nucleotide in each window).
Can be combined with -s <step_size>
-s <step_size>
Step size: i.e., how many base pairs to step before
creating a new window. Used to create "sliding" windows.
- Defaults to window size (non-sliding windows).
-n <number_of_windows>
Divide each input interval (either a chromosome or a BED interval)
to fixed number of windows (i.e. same number of windows, with
varying window sizes).
-reverse
Reverse numbering of windows in the output, i.e. report
windows in decreasing order
ID Naming Options:
-i src|winnum|srcwinnum
The default output is 3 columns: chrom, start, end .
With this option, a name column will be added.
"-i src" - use the source interval's name.
"-i winnum" - use the window number as the ID (e.g. 1,2,3,4...).
"-i srcwinnum" - use the source interval's name with the window number.
See below for usage examples.
在這邊我們可以看到,bedtools windows所支援分割windows的方式還挺多樣化的,我實在是沒有時間可以花一整天的時間來把所有的功能都寫出來,所以就先以我會在CNV-calling當中用到的切割分式為主—也就是設定window size
來切,這樣每條染色體就會根據我們設定的這個window size被切成固定的大小。
#!/usr/bin/env julia
# Implementation of BEDTools makewindows in Julia
"""
makeWindows(genomeFileInfo::DataFrame, windowSize::Int)
Arguments
---------
- `genomeFileInfo`: a DataFrame object by reading genome file with CSV.read
- `windowSize`: set windown size (bp)
"""
using DataFrames
using CSV
function makeWindows(df::DataFrame, winSize::Int)
for c = 1:nrow(df)
chr = df.chr[c]
chrLen = df.chrlen[c]
winNum = Int(floor(chrLen/winSize))
restLen = chrLen % winSize
for i = 0:winNum
startpos = i * winSize
if i != winNum
endpos = startpos + winSize
else
endpos = startpos + restLen
end
window = join([chr, startpos, endpos], " ")
println(window)
end
end
end
genomeFile = ARGS[1]
windowSize = parse(Int,ARGS[2])
df = CSV.read(genomeFile, header = ["chr", "chrlen"], delim = "\t")
makeWindows(df, windowSize)
使用方式也很容易,首先要準備一個genome file
,裡頭從放著每個chromosome的長度,像這樣子
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262
執行makeWindowsForBED.jl
:
./makeWindowsForBED.jl hg19.txt 1000000 > mywindowns.bed
大概就是這樣子 @@>