LFE - file exploring commands (to study files)

LFE location on the UT cluster: /gpfs/gvgpfs/gvhome/toomasha/SOFT/LFE64
All others please DOWNLOAD LFE HERE


FUNCTION        SHORT DESCRIPTION
DUPLICATES      Find duplicates and multiplicates in the file.
FILECHECK       Check file quality (columns, rows, missing entries etc.).
MINMAX          Find local minima and maxima for each column-based array (as done in function analysis).
MULTICOUNT      Count occurrences of a certain ID based on a list of ID's. Sub-ID's supported.
REMSEQDUP       Remove duplicates from ordered list. Count how many times each entry occurred.
SEARCH          Search for the specified entries and get their row and column numbers.
SELECTEXTREMES  Find extreme values (min, max) based on thresholds and report in a table.
STATS           Find simple statistics (min, max, mean, count, missing) for the selected columns.

DUPLICATES

This function searches the file for duplicates and multiplicates. The single, double and multiple entries are presented in separate files; the summary is presented in the summary file.

./LFE32 -M duplicates -file -column -delim -header -case -remove -out

-file  = filename. REQUIRED.
-column  = column to search. Options: a) valid column number, b) "all" (searches all columns). Default = "all".
-delim  = file delimiter. Options: a) "tab", "space", b) any symbol or word without spaces. Default = "tab".
-header  = file header. Options: "yes", "no". Default = "no".
-case  = case sensitivity. Options: "yes", "no". Default = "yes". Notes: "no" converts into lower-case, "yes" retains case info.
-remove  = remove symbols. Options: a) "tab", "space", b) "punctuation", c) any symbol or word without spaces. Notes: This is used to clean entries before subjecting them to duplicate search; "punctuation" removes all common punctuation marks. No Default.
-out  = output file name. Default = input file name + the type extension.

Example:
./LFE32 -M duplicates -file myinput.txt -column 1 -delim space -header yes -case no -remove % -out results
Table entries are considered without the % sign. This means that "sign" and "sign%" will be considered the same entry.
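
The cleaning and grouping described above can be sketched in Python. This is only an illustration of the idea, not LFE's own code: the function name classify_entries and its parameters are hypothetical, and it assumes that "double" means exactly two occurrences and "multiple" means three or more.

```python
from collections import Counter

def classify_entries(entries, case_sensitive=True, remove=""):
    """Group cleaned entries into single, double and multiple occurrences."""
    cleaned = []
    for entry in entries:
        if remove:
            entry = entry.replace(remove, "")  # strip the unwanted symbol
        if not case_sensitive:
            entry = entry.lower()              # mimic '-case no'
        cleaned.append(entry)
    counts = Counter(cleaned)
    singles = sorted(e for e, n in counts.items() if n == 1)
    doubles = sorted(e for e, n in counts.items() if n == 2)
    multiples = sorted(e for e, n in counts.items() if n > 2)
    return singles, doubles, multiples

# "sign" and "Sign%" collapse into one entry once "%" is removed and case is ignored.
print(classify_entries(["sign", "Sign%", "abc", "x", "x", "X%"],
                       case_sensitive=False, remove="%"))
```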


FILECHECK

This function checks the file for potential problems and provides statistics such as: column number, row number, missing entries, empty rows, list of entry types etc. Several other LFE functions require an error-free file, so it is a good idea to run FILECHECK before attempting any other function.

./LFE32 -M filecheck -file -delim -header -tellcolumns -proberows > outfile.txt

-file  = filename. REQUIRED.
-delim  = file delimiter. Options: a) "tab", "space", b) any symbol or word without spaces, c) "find". Default = "find"; this means that the delimiter is determined automatically by testing tab, space, comma, semicolon. Note: If you know your file delimiter or you have a non-standard delimiter, please insert it by using '-delim yourdelimiter'.
-header  = file header. Options: "yes", "no". Default = "no".
-tellcolumns  = number of expected columns. Options: any integer. Default = no default. Notes: If this switch is not used, the number of columns is determined automatically by testing the first rows (as specified by the '-proberows' switch, see below).
-proberows  = number of rows to probe for automatic detection. Options: any integer. Notes: When the file delimiter or the number of columns is detected automatically, this switch determines how many rows are scanned. Default: 10 or end of file.
>outfile.txt  = define output. Options: any valid filename. Default: output to screen. Notes: If the output is not specified, the results are only displayed on the screen and an output file is not generated.

Minimal example:
./LFE32 -M filecheck -file myfile.txt
This runs the file check with all default settings. Both the number of columns and the file delimiter are determined automatically based on scanning the first 10 rows of the file (or until the end of file is reached). Results will appear on the screen. Please download myfile.txt here to test FILECHECK.

Example 2:
./LFE32 -M filecheck -file data.txt -delim : -header yes -proberows 20 > output.txt
The file delimiter is colon and the file has a header. The number of columns is automatically detected based on scanning the first 20 rows (excluding the header). The results will go to the file 'output.txt'.

Example 3:
./LFE32 -M filecheck -file input.txt -delim find -proberows 15 -tellcolumns 5 > output.txt
The file delimiter is determined automatically based on scanning the first 15 rows. The expected number of columns is 5 (defined by the user). The results will go to the file 'output.txt'.
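
One way such automatic delimiter detection could work is sketched below in Python. This is an assumption about the general approach, not LFE's actual algorithm: the function name detect_delimiter is hypothetical, and the rule used here (a candidate is accepted if it splits every probed row into the same number of columns, more than one) is a guess.

```python
def detect_delimiter(lines, probe_rows=10):
    """Guess the delimiter from the first probe_rows non-empty lines.

    Returns (delimiter_name, column_count), or None if no candidate fits.
    """
    candidates = {"tab": "\t", "space": " ", "comma": ",", "semicolon": ";"}
    probe = [ln.rstrip("\n") for ln in lines if ln.strip()][:probe_rows]
    best = None
    for name, symbol in candidates.items():
        widths = {len(ln.split(symbol)) for ln in probe}
        # Accept a candidate only if all probed rows have the same width.
        if len(widths) == 1:
            ncols = widths.pop()
            if ncols > 1 and (best is None or ncols > best[1]):
                best = (name, ncols)
    return best

print(detect_delimiter(["id,height,weight\n", "1,180,75\n", "2,165,60\n"]))
```

Passing an explicit delimiter, as with '-delim', avoids the guesswork for files where more than one candidate splits the rows consistently.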

MINMAX

This function searches for maxima and minima (or both) for an array located in a user-specified column. It does not simply search for the smallest and largest values; it treats the numbers as a continuous array and searches for turning points (where the array changes direction). The rows corresponding to minima and maxima are written into an output file and they are prefixed with "MIN" or "MAX". Missing values are supported, empty rows are OK.

./LFE32 -M minmax -file -column -delim -out -choose -header -missing

-file  = filename. REQUIRED.
-column  = column number. Options: valid column number. Default = 1.
-delim  = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", "bslash", "slash", "dash", "quote", "squote", b) any symbol or word without spaces. Default = "tab".
-out  = output file name. Default = input file name + ".out".
-choose  = what is searched for. Options: "min" (find minima), "max" (find maxima), "all" (find both minima and maxima). Default = "all".
-header  = file header. Options: "yes", "no". Default = "no".
-missing  = what is used to indicate the missing values. Options: any symbol or word without spaces. Default = "NA".

Example:
./LFE32 -M minmax -file mydata.txt -column 2 -delim space -out results.txt -choose min -header yes -missing nan
Here the 2nd column is searched for minima, the missing value is "nan", the file delimiter is space and the file has a header.
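
The difference between local turning points and the global smallest and largest values can be illustrated with a short Python sketch. It is not LFE's implementation: the name find_turning_points is hypothetical, and it assumes that missing values and empty rows are skipped, that the first and last values are never reported, and that flat stretches (equal neighbours) are not treated as extremes.

```python
def find_turning_points(values, missing="NA"):
    """Return (row_index, 'MIN' or 'MAX') for each interior turning point."""
    # Keep only rows that hold a number; remember their original positions.
    points = [(i, float(v)) for i, v in enumerate(values)
              if v.strip() and v != missing]
    result = []
    for k in range(1, len(points) - 1):
        prev, cur, nxt = points[k - 1][1], points[k][1], points[k + 1][1]
        if cur < prev and cur < nxt:
            result.append((points[k][0], "MIN"))
        elif cur > prev and cur > nxt:
            result.append((points[k][0], "MAX"))
    return result

column = ["1", "3", "NA", "2", "", "5", "4"]
print(find_turning_points(column))  # [(1, 'MAX'), (3, 'MIN'), (5, 'MAX')]
```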

MULTICOUNT

This function counts the occurrences of ID's in a specific column of a file based on a reference list given in another file. In essence, it attaches a number to each entry in the reference list stating how many times this entry was encountered in the main file. It can also break the entries into fragments based on the presence of an additional delimiter when '-iddelim' is selected. For example "waist/hip" is normally considered to be one ID. However, when '-iddelim /' is set it is considered as two separate ID's: "waist" and "hip".

./LFE32 -M multicount -file -list -out -delim (-iddelim) -column -header

-file = file name. REQUIRED.
-list = this file contains the reference ID's, one per row. REQUIRED.
-out = output file name. Default = input file name + ".out".
-delim = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", b) any symbol or word without spaces. Default = "tab".
-iddelim = additional delimiter used to fragment the ID's in the main file. Default = not used.
-column = number of the column where the ID's are located. Default = 1.
-header = input file header. Options: "yes", "no". Default = "no".

Examples:
./LFE32 -M multicount -file myfile.txt -out results.txt -list references.txt -delim tab -column 2 -header yes
When the entries from column 2 of myfile.txt are compared against the references in references.txt they are treated "as is" (without any modifications).

./LFE32 -M multicount -file myfile.txt -out results.txt -list references.txt -delim tab -iddelim / -column 2 -header yes
When the entries from column 2 of myfile.txt are compared against the references in references.txt they are first fragmented into smaller IDs wherever "/" is encountered. For example "waist/hip" is treated as two separate ID's: "waist" and "hip".
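
The effect of '-iddelim' on the counts can be sketched in Python as follows. This is an illustrative assumption of the counting logic, not LFE's code; the function name multicount and its arguments are hypothetical.

```python
from collections import Counter

def multicount(column_entries, reference_ids, iddelim=None):
    """Count how often each reference ID occurs among the column entries."""
    counts = Counter()
    for entry in column_entries:
        # With an ID delimiter, "waist/hip" becomes "waist" and "hip".
        fragments = entry.split(iddelim) if iddelim else [entry]
        counts.update(fragments)
    # Report every reference ID, including those never encountered (count 0).
    return [(ref, counts[ref]) for ref in reference_ids]

entries = ["waist/hip", "hip", "waist"]
references = ["waist", "hip", "bmi"]
print(multicount(entries, references))       # entries treated "as is"
print(multicount(entries, references, "/"))  # entries fragmented on "/"
```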

REMSEQDUP

This function removes duplicates from an ordered list. The list needs to be ordered so that the identical ID's are below one another. This function also counts how many times each ID was encountered.

./LFE32 -M remseqdup -file -out -delim -column -header

-file = file name. REQUIRED.
-out = output file name. Default = input file name + ".out".
-delim = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", b) any symbol or word without spaces. Default = "tab".
-column = number of the column where the ID's are located. Default = 1.
-header = input file header. Options: "yes", "no". Default = "no".

Example:
./LFE32 -M remseqdup -file myfile.txt -out results.txt -delim tab -column 2 -header yes
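
Why the list has to be ordered can be seen in a small Python sketch of sequential duplicate removal. It assumes that only neighbouring identical ID's are merged, as the description above suggests; the function name is hypothetical and this is not LFE's own code.

```python
from itertools import groupby

def remove_sequential_duplicates(ids):
    """Collapse runs of identical neighbouring IDs and count each run."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(ids)]

print(remove_sequential_duplicates(["a", "a", "b", "b", "b", "c"]))
# In an unordered list the same ID is reported once per run:
print(remove_sequential_duplicates(["a", "a", "b", "a"]))
```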

SEARCH

This function searches the file for the user-specified entry. Either exact matches or partial matches can be returned. Either a single column or all columns can be searched. The locations of the matches as well as the rows containing the matches are returned. This function can be used to find all rows with a certain ID.

./LFE32 -M search -file -delim -id -column -match-type -show-loc -out

-file  = filename. REQUIRED.
-delim  = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", b) any symbol or word without spaces. Default = "tab".
-id  = the text or number that is searched for. Options: any text. Default = no default.
-column  = column to search. Options: a) valid column number, b) "all" (searches all columns). Default = "all".
-match-type  = how the search is conducted (exact match or partial match). Options: "exact" (return entries that exactly match the ID), "partial" (return entries that contain the same parts as the ID). Default = "exact".
-show-loc  = whether to show the row and column numbers of the hits. Options: "yes" (show numbers), "no" (only show hits, do not show any numbers). Default = "no".
-out  = output file name. Default = input file name + ".out".

Example:
./LFE32 -M search -file mydata.txt -delim comma -id experiment -column 1 -match-type partial -show-loc yes -out results.txt
The first column of mydata.txt is searched for partial matches of the word "experiment". The results will contain the row number and column number of each match followed by the row that contained the match(es).
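
The difference between the two match types can be sketched in Python. This is an illustration under assumptions, not LFE's code: the function name search is hypothetical, "partial" is assumed to mean that the ID occurs as a substring of the entry, and row and column numbers are assumed to start at 1.

```python
def search(rows, query, column="all", match_type="exact"):
    """Return (row_number, column_number, entry) for every matching entry."""
    hits = []
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if column != "all" and c != column:
                continue  # restrict the search to the requested column
            matched = (cell == query) if match_type == "exact" else (query in cell)
            if matched:
                hits.append((r, c, cell))
    return hits

table = [["experiment1", "control"], ["experiment", "experiment2"]]
print(search(table, "experiment", column=1, match_type="partial"))
print(search(table, "experiment", column="all", match_type="exact"))
```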

SELECTEXTREMES

This function takes a fractional value (0.0-1.0; corresponding to 0%-100%) and for each row a) lists all min values that are among this fraction starting from the smaller end, b) lists all max values that are among this fraction starting from the larger end. The file MUST HAVE a header. Missing values are supported.

./LFE32 -M selectextreams -file -delim -out -threshold -header -missing

-file  = filename. REQUIRED.
-delim  = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", "bslash", "slash", "dash", "quote", "squote", b) any symbol or word without spaces. Default = "tab".
-out  = output file name. Default = input file name + ".out".
-threshold  = fraction of total values you want to see. Options: number between 0.0 and 1.0. Default = 1.0.
-header  = file header. Options: "yes", "no". Default = "no".
-missing  = what is used to indicate the missing values. Options: any symbol or word without spaces. Default = "NA".

Example:
./LFE32 -M selectextreams -file mydata.txt -delim space -out results.txt -threshold 0.5 -missing nan
Here 50% of the smallest values and 50% of the largest values are found for each row (across the columns) and listed in a tabular form.
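
A possible reading of the threshold is sketched below in Python. The details are assumptions rather than documented LFE behaviour: the fraction is applied to the number of non-missing values in the row, the resulting count is rounded down, and each value is reported together with its column name from the header. The function name select_extremes is hypothetical.

```python
def select_extremes(header, row, threshold=1.0, missing="NA"):
    """Return the smallest and largest fraction of values in one row."""
    present = [(float(v), name) for name, v in zip(header, row) if v != missing]
    present.sort()                      # ascending by value
    k = int(threshold * len(present))   # assumed: count is rounded down
    smallest = present[:k]              # from the smaller end
    largest = present[::-1][:k]         # from the larger end
    return smallest, largest

header = ["s1", "s2", "s3", "s4", "s5"]
row = ["4", "nan", "1", "3", "9"]
print(select_extremes(header, row, threshold=0.5, missing="nan"))
```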

STATS

This function returns simple statistics for one or many columns (as defined by the user). Currently the following parameters are returned: max, min, mean, count, sum, missing, non-missing. Please note that all values must be numbers and only one type of missing value identifier is allowed (see '-missing'). Attn! If your "sum" value exceeds the storage capability of your computer, the mean and sum values may be incorrect. Please check the absolute value of sum to decide whether the limit was exceeded.

./LFE32 -M stats -file -columns -delim -header -missing -out

-file  = filename. REQUIRED.
-columns  = column numbers. REQUIRED. Notes: column numbers should be separated by commas and ranges by dashes (see the example below); there should be no spaces. Default = no default.
-delim  = file delimiter. Options: a) "tab", "space", "comma", "semicolon", "colon", b) any symbol or word without spaces. Default = "tab".
-header  = file header. Options: "yes", "no". Default = "no".
-missing  = what is used to indicate the missing values. Options: a) any symbol or word without spaces, b) "none". Default = "NA". Note: "none" means that the missing entry fields are empty.
-out  = output file name. Default = input file name + the type extension.

Example:
./LFE32 -M stats -file myfile -columns 1,4-7 -delim semicolon -header yes -missing empty -out results.txt
Statistics are returned for columns 1,4,5,6,7.
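
How a column specification such as "1,4-7" expands, and what the per-column statistics amount to, can be sketched in Python. This is not LFE's code: the function names are hypothetical, and "count" is assumed to mean the total number of entries (missing plus non-missing).

```python
def parse_columns(spec):
    """Expand a specification like "1,4-7" into [1, 4, 5, 6, 7]."""
    columns = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            columns.extend(range(int(first), int(last) + 1))
        else:
            columns.append(int(part))
    return columns

def column_stats(values, missing="NA"):
    """Compute simple statistics for one column of text entries."""
    numbers = [float(v) for v in values if v != missing]
    n = len(numbers)
    total = sum(numbers)
    return {
        "count": len(values),            # assumed: all entries
        "missing": len(values) - n,
        "non-missing": n,
        "sum": total,
        "min": min(numbers) if n else None,
        "max": max(numbers) if n else None,
        "mean": total / n if n else None,
    }

print(parse_columns("1,4-7"))
print(column_stats(["2", "NA", "4", "6"]))
```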

ToomasHaller.com 2017