Allele - tools for genomics

ALLELE location on the UT cluster: /gpfs/gvgpfs/gvhome/toomasha/SOFT/ALLELE
All others please DOWNLOAD ALLELE HERE

FUNCTION	SHORT DESCRIPTION
ADTOGEN	Conversion of allele dosage file into the Oxford gen format. NOTE: DESCRIPTION NOT POSTED YET.
FAMILY	Find family/group structure using pairwise IBD table.
GENTOAD	Convert gen/impute file to allele dosage file and do statistics.
LDPRUNE	Marker pruning based on LD map. NOTE: DESCRIPTION NOT POSTED YET.
POPSEL	Match controls to cases. Can match any number of controls - exact matching and cohort matching.
RELOUT5	Remove close relatives based on the pairwise IBD file (PI_HAT from PLINK) and optionally other preferences.

FAMILY

This function finds family/group structure in the cohort. It clusters the individuals based on pairwise IBD values (PI_HAT from PLINK) using single-linkage clustering.

./ALLELE -M family -file -header -out -delim -threshold -familysize -showmatrix

-file = input file name (each row holds two IDs followed by their pairwise IBD value). REQUIRED.
-header = file header. Options: "yes", "no". Default = "no".
-out = output file name. Default = input file name + ".out".
-delim = files' delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", "slash", "bslash", "dash", "quote", "squote", b) any symbol or word without spaces. Default = "tab".
-threshold = IBD limit above which there is a connection between two individuals. Default = 0.1.
-familysize = smallest group of individuals considered family. Default = 2.
-showmatrix = whether to show similarity matrices or only list the families by IDs. Options: "yes", "no". Default = "no".

Example:
./ALLELE -M family -file ibd.txt -header yes -out results.txt -delim space -threshold 0.125 -familysize 3 -showmatrix yes
Here the smallest family is 3 individuals and IBD limit is 0.125.
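
The clustering idea can be illustrated with a short Python sketch. This is not ALLELE's own code: the function name find_families, the in-memory list of (ID1, ID2, IBD) tuples and the union-find approach are assumptions made for illustration. It only shows what single-linkage grouping means here: two individuals land in the same family as soon as any chain of pairs with IBD above the threshold connects them.

```python
from collections import defaultdict

def find_families(pairs, threshold=0.1, family_size=2):
    """Group individuals linked by any chain of pairs with IBD > threshold."""
    parent = {}

    def find(x):
        # Return the representative of x's group (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for id1, id2, ibd in pairs:
        if ibd > threshold:  # a connection exists only above the limit
            root1, root2 = find(id1), find(id2)
            if root1 != root2:
                parent[root2] = root1  # merge the two groups

    groups = defaultdict(list)
    for ind in list(parent):
        groups[find(ind)].append(ind)
    # Keep only groups that reach the minimum family size.
    return sorted(sorted(g) for g in groups.values() if len(g) >= family_size)

# Hypothetical pairwise IBD rows: ID1, ID2, PI_HAT
pairs = [("A", "B", 0.5), ("B", "C", 0.25), ("C", "D", 0.05), ("E", "F", 0.3)]
families = find_families(pairs, threshold=0.125, family_size=3)
print(families)
```

In this toy input A and C are never listed as a pair above the limit, yet they end up in one family because both are tied to B. The pair E/F is dropped because it is smaller than the requested family size.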

GENTOAD

This function converts an Oxford .gen/.impute file into an allele dosage file.

./ALLELE -M gentoad -file -out -beg -end -last-bef-geno -rs-loc -to-hard-call -statistics

-file = input file name. REQUIRED.
-out = output file name. Default = input file name + ".out".
-beg = where to start conversion (usually don't need this flag). Options: Any legal row number. Default = 1.
-end = where to end conversion (usually don't need this flag). Options: Any legal row number. Default = max rows.
-last-bef-geno = column number that is last before the genotype triplets start (only need to use if your file has unusual format). Options: Any legal column number. Default = 5.
-rs-loc = column number that contains marker IDs (only need to use if your file has unusual format). Options: Any legal column number. Default = 2.
-to-hard-call = use if you want allele dosages rounded to the nearest integer. Options: "yes", "no". Default = "no".
-statistics = whether you want statistics for each marker. Options: "yes", "no". Default = "no".

Example:
./ALLELE -M gentoad -file myfile.gen -out myfile.ad -statistics yes
Here a typical gen file is converted and statistics are computed.
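
As a rough illustration of the conversion, the Python sketch below turns one line of a .gen file into allele dosages. It is not ALLELE's implementation. It assumes the usual Oxford layout (five leading columns, then one probability triplet P(AA), P(AB), P(BB) per individual) and assumes that the dosage counts the second allele, i.e. dosage = P(AB) + 2*P(BB); which allele ALLELE actually counts is not stated here. The function name and its keyword arguments are hypothetical and merely mirror the flags above.

```python
def gen_line_to_dosages(line, last_bef_geno=5, rs_loc=2, to_hard_call=False):
    """Convert one .gen line into (marker ID, list of allele dosages)."""
    fields = line.split()
    marker = fields[rs_loc - 1]              # column numbers are 1-based
    probs = [float(v) for v in fields[last_bef_geno:]]
    if len(probs) % 3 != 0:
        raise ValueError("genotype columns are not a multiple of three")

    dosages = []
    for i in range(0, len(probs), 3):
        p_aa, p_ab, p_bb = probs[i:i + 3]
        dosage = p_ab + 2.0 * p_bb           # assumed: dosage of the second allele
        # Python's round() sends exact .5 ties to the even integer.
        dosages.append(round(dosage) if to_hard_call else dosage)
    return marker, dosages

# Hypothetical marker with three individuals
line = "snp1 rs123 1000 A G 1 0 0 0 1 0 0 0.25 0.75"
marker, dosages = gen_line_to_dosages(line)
_, hard_calls = gen_line_to_dosages(line, to_hard_call=True)
print(marker, dosages, hard_calls)
```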

POPSEL

This function does case-control matching. It can do exact matching (any number of rounds) or group matching (making sure that the two groups are most similar); or it can do a combination of the two.

./ALLELE -M popsel -file1 -file2 -out -log -delim -missing -header1 -header2 -headers -exact-rounds -exact-individuals -exact-order -group-match -hitnumber

-file1 = cases file name. REQUIRED. The individual IDs are followed by parameters that are used in matching. Header optional (see below). The first two lines (after the optional header) determine the relative weights of the parameters (any positive scale) and the data types (S = string, B = binary, C = continuous).

Example:

ID Sex Age
0 10 2
S B C
ID1 1 62
ID2 2 65
ID3 1 55

-file2 = control candidates' file name. REQUIRED. File structure should be the same as file1 but without the weight and data type rows (rows 2-3 when a header is present).
-out = output file name. Default = input file name + ".out".
-log = log file name. Default = input file name + ".log".
-delim = files' delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", "slash", "bslash", "dash", "quote", "squote", b) any symbol or word without spaces. Default = "tab".
-missing = missing values. Default = "NA".
-header1 = file1 header. Options: "yes", "no". Default = "no".
-header2 = file2 header. Options: "yes", "no". Default = "no".
-headers = file1 + file2 headers; use if both files have headers. Options: "yes", "no". Default = "no".
-exact-rounds = how many best matches are required for each case. Options = any valid integer.
-exact-individuals = how many total best match individuals are required for the cases. Options = any valid integer. Note: Use this only if you want exact matching to be incomplete. If you have 10 cases and you set '-exact-individuals 15' then 5 cases will have 2 matches but 5 will only have one match.
-exact-order = how potential controls are treated when searching for matches. Don't use this flag at all if unsure. Options = "max", "min". They determine if the case most distant to the best control is matched first ("max") or the closest one to the best control is matched first ("min").
-group-match = the number of additional individuals matched in addition to (or instead of) exact matching. New individuals are chosen so that they best "fix" the control group deviations from the cases group (see example below). Options = any valid integer.
-hitnumber = a way to change the number of control options for each case. Don't use if unsure - may lead to memory errors. You may want to use it if computations are very slow. Options = any valid integer. Default: the number of rows in file2.
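
To make the file1 layout more concrete, here is a small Python sketch that reads the weight and data type rows and scores how far a control candidate is from a case. The scoring rule is an assumption made for illustration only (weight times absolute difference for C columns, weight times a 0/1 mismatch for S and B columns, missing values skipped); the distance POPSEL really uses is not described here. The names parse_cases and distance and the control rows K1-K3 are hypothetical.

```python
def parse_cases(text, header=True):
    """Split a file1-style table into weights, data types and data rows."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    if header:
        lines = lines[1:]                    # drop the optional header row
    weights = [float(w) for w in lines[0]]   # relative weights
    types = lines[1]                         # S, B or C per column
    return weights, types, lines[2:]

def distance(case, control, weights, types, missing="NA"):
    """Assumed weighted distance between one case and one control."""
    total = 0.0
    for a, b, w, t in zip(case, control, weights, types):
        if w == 0 or missing in (a, b):
            continue                         # zero-weight or missing: ignored
        if t == "C":
            total += w * abs(float(a) - float(b))
        else:                                # S and B: plain mismatch
            total += w * (a != b)
    return total

cases_text = """
ID Sex Age
0 10 2
S B C
ID1 1 62
ID2 2 65
ID3 1 55
"""
weights, types, cases = parse_cases(cases_text)
controls = [["K1", "1", "60"], ["K2", "2", "65"], ["K3", "1", "40"]]

first_case = cases[0]
best = min(controls, key=lambda c: distance(first_case, c, weights, types))
print(first_case[0], "->", best[0])
```

The ID column carries weight 0 in the example file, so it never contributes to the score.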

Example (only exact matches):
./ALLELE -M popsel -file1 allcases.txt -file2 allcontrols.txt -out results.txt -log results.log -delim space -missing NA -headers yes -exact-rounds 4
Here 4 rounds of exact matches are found.

Example (exact matches + additional individuals):
./ALLELE -M popsel -file1 allcases.txt -file2 allcontrols.txt -out results.txt -log results.log -delim space -missing NA -headers yes -exact-rounds 4 -group-match 10
Here 4 rounds of exact matches are found followed by adding 10 more individuals so that the controls group as a whole becomes more similar to the cases group. For example if more women are needed, most added individuals are women etc.

RELOUT5

This function replaces the original RelOut (v1). It finds the minimal list of individuals that need to be removed to break all kinship ties larger than the given threshold. Every time a pair is broken one individual is removed. One can set preferences to govern which individuals are preferably removed.

./ALLELE -M relout5 -file -pfile -pdefault -out -delim -header -limit -mode -pidsize

-file = filename. Pairwise ID's and IBD values. Ex: "ind1 ind2 0.2498". REQUIRED.
Note1: RELOUT5 uses all individuals in this pairwise IBD file ('-file'); the content of this file alone determines which individuals are used to decide whom to exclude. Individuals whom you want to remove anyway, and whom you don't want to use when deciding the removal of relatives, must therefore not appear in any row. You need to remove all such rows first. For this you can use the LFE function LISTSELECTION. Example:
    LFE64 -M listselection -file IBD.txt -header no -column all -delim space -list inds_to_be_removed.txt -in-or-out out -out IBD_new.txt
If instead you have a list of individuals who need to remain (be left in), you can treat the pairwise IBD file like this (3 separate commands):
    LFE64 -M listselection -file IBD.txt -header no -column 1 -delim space -list inds_to_remain.txt -in-or-out in -out temp.txt
    LFE64 -M listselection -file temp.txt -header no -column 2 -delim space -list inds_to_remain.txt -in-or-out in -out IBD_new.txt
    rm temp.txt
Before you do the LISTSELECTION step above, you may want to remove all ID pairs (rows) from the IBD file that contain an IBD value less than what's specified by the '-limit' flag. This speeds up the listselection process considerably. For this you can use the LFE function EXTRACTIF. Example:
    LFE64 -M extractif -file IBD.txt -header no -if1 col3L0.1 -delim space -out IBD_0.1.txt
Note2: It's OK if this file contains IBD values smaller than determined by the '-limit' flag; they are simply ignored.
-pfile = parameter filename. Each ID is followed by a parameter indicating how preferred this individual is in the final outcome. The larger the value, the less likely the individual is to be removed. THIS FILE IS OPTIONAL, if it exists you need to use '-mode 2'. The pfile score can be a composite score of different parameters. The parameters can be weighted so that their ranges don't overlap to provide clean sequential filtering. Example: parameter = (sex*100 + age) where sex is either 1 or 2 will filter first based on sex and only then the age.
-pdefault = the value to use for an individual whose entry is missing from the pfile. Options: "mean", "zero". Default = "zero". Notes: "mean" uses the mean of all individuals, "zero" uses zero.
-out = output file name. Default = input file name + ".out".
-delim = files' delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", "slash", "bslash", "dash", "quote", "squote", b) any symbol or word without spaces. Default = "tab".
-header = (IBD) file header. Options: "yes", "no". Default = "no".
-limit = IBD limit above which all ties should be broken.
-mode = analysis mode. Options: 0 = always the first individual (left column) is preferably removed; 1 (default) = the individual with the strongest single tie is preferably removed (this is the best option if you don't have a pfile); 2 = the individual with the lower pfile score is preferably removed (use if you have a pfile).
-pidsize = use this only if you want the analysis to terminate half-way (before all individuals are removed). The value indicates how many individual pairs with the IBD > limit (see '-limit') should remain in the IBD file (see '-file') when the analysis terminates. This option allows sequential removal of individuals (see a shell script for sequential removal HERE). Consider using it with very large data sets when IBD limit is below 0.1.
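
The composite score described for '-pfile' can be prepared with a few lines of Python. This is only a sketch: the individuals, the function name composite_score and the weight of 100 are illustrative, and the ranges stay separate only as long as every age is below 100.

```python
def composite_score(sex, age, sex_weight=100):
    """Combine two parameters so that sex always outranks age."""
    return sex * sex_weight + age

# Hypothetical individuals: ID -> (sex coded 1 or 2, age in years)
people = {"ID1": (1, 62), "ID2": (2, 65), "ID3": (1, 55), "ID4": (2, 30)}
scores = {pid: composite_score(sex, age) for pid, (sex, age) in people.items()}

# Lowest score first: these individuals are the most likely to be removed.
ranked = sorted(scores, key=scores.get)

# One "ID score" row per individual, as described for the pfile.
pfile_lines = [f"{pid} {scores[pid]}" for pid in people]
print("\n".join(pfile_lines))
```

Every individual with sex = 1 scores lower than every individual with sex = 2, so sex decides first and age only breaks ties within the same sex.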

Example:
./ALLELE -M relout5 -file IBDtable.txt -pfile preferences.txt -pdefault zero -out results -delim space -header no -limit 0.125 -mode 2
Here pfile exists; if any values are missing, they are replaced with 0. The IBD limit is 0.125.
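
The effect of '-mode 2' can be pictured with the simplified Python sketch below. It is a plain greedy pass and not the algorithm RELOUT5 uses, which is not specified here; in particular it makes no attempt to find a minimal removal list. The function name, the order in which pairs are visited (strongest tie first) and the tie-breaking rule are all assumptions.

```python
def remove_relatives(pairs, limit, scores=None, default=0.0):
    """Greedy sketch: break every tie with IBD > limit, dropping the lower score."""
    scores = scores or {}
    # Only ties above the limit matter; visit the strongest ones first (assumed).
    related = sorted((p for p in pairs if p[2] > limit), key=lambda p: -p[2])
    removed = set()
    for id1, id2, _ in related:
        if id1 in removed or id2 in removed:
            continue                          # this tie is already broken
        s1 = scores.get(id1, default)         # missing entry -> default ("zero")
        s2 = scores.get(id2, default)
        removed.add(id1 if s1 <= s2 else id2) # assumed: equal scores drop the first ID
    return removed

# Hypothetical inputs: pairwise IBD rows and pfile scores (C and D have no entry)
pairs = [("A", "B", 0.5), ("B", "C", 0.25), ("C", "D", 0.05)]
scores = {"A": 265, "B": 155}
removed = remove_relatives(pairs, limit=0.125, scores=scores)

# No tie above the limit may survive among the remaining individuals.
remaining_ties = [p for p in pairs
                  if p[2] > 0.125 and p[0] not in removed and p[1] not in removed]
print(sorted(removed), remaining_ties)
```

Here removing B alone breaks both ties above 0.125, and the C/D pair is ignored because its IBD value is below the limit.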

ToomasHaller.com 2017
