LCD-Composer

Help Page

This page will guide you in the use of the LCD-Composer server. Specific instructions for using the downloadable LCD-Composer scripts for high-throughput analyses are provided on our GitHub page.

Below is a series of guiding questions to help you understand and choose LCD-Composer parameters, as well as practical guidance on actually conducting analyses using the LCD-Composer server.

Citation information
If you use the LCD-Composer webserver or command-line scripts, please cite the following publications:

LCD-Composer: an intuitive, composition-centric method enabling the identification and detailed functional mapping of low-complexity domains. Cascarina et al. (2021) NAR Genom Bioinform. Pubmed
The LCD-Composer webserver: high-specificity identification and functional analysis of low-complexity domains in proteins. Cascarina and Ross (2022) Bioinformatics. Pubmed

Additionally, if you use the GO term analyses generated by the server or command-line scripts, please cite:

GOATOOLS: A Python library for Gene Ontology analyses. Klopfenstein et al. (2018) Sci Rep. Pubmed
Gene ontology: tool for the unification of biology. Ashburner et al. (2000) Nat Genet. Pubmed
The Gene Ontology resource: enriching a GOld mine. Gene Ontology Consortium (2021) Nucleic Acids Res. Pubmed

Recommended browsers
The LCD-Composer webserver should work on most major browsers (Chrome, Safari, Firefox, and Edge). However, when using the "Back" button with the Safari browser, users may be required to re-input all search parameters. This still allows Safari users access to all webserver functionality but is understandably inconvenient. Therefore, we recommend one of the other major browsers whenever possible.

How do I interpret the results of an LCD-Composer analysis?
LCD-Composer search results are delivered in .tsv (tab-separated values) file format. Many users may wish to view these results files in Microsoft Excel. Unless your operating system opens .tsv files in Excel by default, you may be able to: 1) right-click on the file and choose Excel under the "Open with" option, or 2) first open Excel, then open the results file from within Excel.

The main LCD-Composer results file will always include a few lines at the top of the file that indicate the parameters used at runtime.

The results file will have the following column headers:

Protein Description = FASTA header for the protein containing the discovered LCD
UniProt ID = UniProt ID for the protein containing the discovered LCD
Domain Sequence = the sequence of the LCD that was identified
Domain Boundaries = the start and end positions (inclusive) of the identified LCD in the original protein.
Final Domain Composition = the combined composition of all of the amino acid(s) of interest. Note that for searches using multiple amino acids or groups of amino acids, this is the combined composition of all amino acids. Individual amino acid compositions are found in separate columns.
Final Domain Linear Dispersion = the linear dispersion of all amino acids of interest (same as for "Final Domain Composition").
Remaining 20 columns = the composition for each of the 20 canonical amino acids within the LCD sequence.
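
For users who prefer to process results programmatically rather than in Excel, the following is a minimal Python sketch of reading a results file. It assumes (this is not stated above) that the column-header line is the first line beginning with "Protein Description" and that every line before it is a parameter line. The function name read_lcd_results and the example data, including the UniProt ID, are made up for illustration.

```python
import io

def read_lcd_results(handle, first_column="Protein Description"):
    """Split a results file into (parameter_lines, rows).

    Assumption: the header row is the first line that starts with
    `first_column`; all non-empty lines before it are parameter lines.
    """
    parameter_lines = []
    header = None
    rows = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if header is None:
            if line.startswith(first_column):
                header = line.split("\t")
            elif line.strip():
                parameter_lines.append(line)
            continue
        if line.strip():
            # One dictionary per LCD, keyed by column header.
            rows.append(dict(zip(header, line.split("\t"))))
    return parameter_lines, rows

# Made-up example content (not real server output).
example = (
    "#Hypothetical parameter line\n"
    "Protein Description\tUniProt ID\tDomain Sequence\tDomain Boundaries\n"
    "example protein\tP00000\tQQQQAQQQ\t(10-17)\n"
)
params, rows = read_lcd_results(io.StringIO(example))
```

With a real file, pass an open file handle (e.g. open("results.tsv")) instead of the io.StringIO object.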

Can I run LCD-Composer with recommended default parameters?
Yes - if you do not explicitly change parameters used in LCD-Composer, it will run with default parameters. The only thing you MUST provide is a sequence or proteome to analyze! It would also probably make sense to change the amino acid of interest since you likely want to identify a particular type of LCD, but this is not strictly necessary if you are interested in Q-rich LCDs (default).

The defaults for both the LCD-Composer webserver and downloadable script are:
Window size = 20aa
Amino acid of interest = Q
Composition threshold = 40%
Linear dispersion = 0.5
Threshold to ignore dispersion = 70% (halfway between the composition threshold and 100%)

This will search for LCD regions that are around 20aa or larger and at least 40% Q, with moderate spacing of Q's required.

What is the "Window Size" parameter, and how will it affect the results?
The window size is the size of the window that LCD-Composer uses to read sequences. This is effectively the minimum size of LCDs that you are interested in. LCDs that are larger than the window size will also be identified due to the merging of overlapping windows that pass all of the search parameters. In rare cases, identified LCDs can also be shorter than the window size due to a trimming step performed by LCD-Composer - this is intended behavior, and aids in defining the boundaries of LCDs.
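
The idea of scanning with a fixed window and merging overlapping windows can be sketched in Python as follows. This is a simplified illustration, not the LCD-Composer implementation: it checks composition only, and it omits both the linear dispersion check and the trimming step. The function name and the use of 1-based positions are assumptions.

```python
def find_candidate_regions(sequence, residues="Q", window=20, threshold=40.0):
    """Return (start, end) pairs (1-based, inclusive) of merged passing windows.

    Simplified: composition only; no linear dispersion check, no trimming.
    """
    regions = []
    for start in range(len(sequence) - window + 1):
        fragment = sequence[start:start + window]
        percent = 100.0 * sum(fragment.count(r) for r in residues) / window
        if percent < threshold:
            continue
        end = start + window  # exclusive end, 0-based
        if regions and start < regions[-1][1]:
            # This window overlaps the previous region: extend that region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [(s + 1, e) for s, e in regions]

# A 30-residue polyQ tract flanked by 20 alanines on each side.
example = "A" * 20 + "Q" * 30 + "A" * 20
regions = find_candidate_regions(example)
```

In this example the merged region spans positions 9-62, which is longer than the polyQ tract itself because windows that reach into the flanking residues still pass the 40% threshold. This illustrates why a boundary-trimming step is useful.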

What is the "Amino acid(s) of interest" parameter, and how will it affect the results?
The amino acid(s) of interest represent the defining compositional feature(s) of the LCDs that you are interested in and would like to search for. This parameter was designed to maximize user control over the features of LCDs that are important to you. You can search for simple LCDs enriched in a single amino acid, group amino acids that should be treated equivalently, or search for "multi-faceted" LCDs enriched in multiple amino acids to differing degrees. Here are examples highlighting how to construct the parameter for your amino acid of interest:

"G" - a simple search for glycine-rich LCDs.
"DE" - a search for LCDs with a combined aspartic acid/glutamic acid content above your composition threshold.
"DEHKR" - a search for LCDs with a combined charged residue content above your composition threshold.
"QN_Y" - a search for LCDs with a combined glutamine/asparagine content above a composition threshold AND with a tyrosine content above a different composition threshold.
"QN_FWY" - a search for LCDs with a combined glutamine/asparagine content above a composition threshold AND with a combined aromatic residue content above a different composition threshold.

Notice that underscores allow you to separate amino acids or groups of amino acids so that you can specify a unique minimum composition threshold for each amino acid or amino acid group. This allows you to design extremely specific LCD searches.

What is the "Composition threshold", and how will it affect the results?
The composition threshold is the minimum percent composition of your amino acid(s) required for a given protein region to be recognized as an LCD. For example, a composition threshold of 50% with "G" as the amino acid of interest means that a region must be at least 50% glycine for it to be recognized as an LCD. When you specify groups of amino acids (as in the examples above), you can specify a different composition threshold for each group using the same underscore delimiters. For example, with the amino acid groups "QN_FWY", you can set composition thresholds of "50_15": in this case, a protein region must have a combined glutamine/asparagine content of at least 50% AND a combined phenylalanine/tryptophan/tyrosine content of at least 15%. The sum of all composition threshold values must be between 0-100 (inclusive).
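
To make the underscore syntax concrete, here is a short Python sketch that pairs each amino acid group with its composition threshold and checks a single sequence window. The function names are hypothetical, and treating each threshold as an inclusive minimum (>=) is an assumption. The sketch ignores linear dispersion entirely.

```python
def parse_search(amino_acids, thresholds):
    """Pair each underscore-delimited group with its composition threshold."""
    groups = amino_acids.split("_")
    values = [float(v) for v in thresholds.split("_")]
    if len(groups) != len(values):
        raise ValueError("need one composition threshold per amino acid group")
    if not 0 <= sum(values) <= 100:
        raise ValueError("composition thresholds must sum to between 0 and 100")
    return list(zip(groups, values))

def window_passes(fragment, search):
    """True if every group meets its minimum percent composition (assumed >=)."""
    for group, minimum in search:
        percent = 100.0 * sum(fragment.count(aa) for aa in group) / len(fragment)
        if percent < minimum:
            return False
    return True

search = parse_search("QN_FWY", "50_15")
# 10 Q/N residues (50%) and 3 aromatic residues (15%) in a 20-residue window.
passes = window_passes("QQNNQQNNQQYYFSSSSSSS", search)
# Same Q/N content, but no aromatic residues.
fails = window_passes("QQNNQQNNQQSSSSSSSSSS", search)
```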

Note that you can also search for strict homopolymeric regions (e.g. polyQ) using a composition threshold of 100%.

Also, you may set very low composition thresholds, which could be particularly useful for less frequent amino acids such as methionine, cysteine, and tryptophan, though some of the resulting domains may not be classified as "low-complexity" domains by most measures of sequence complexity. It is still perfectly acceptable to use LCD-Composer in this way: just be thoughtful in your use of terminology. It is also worth noting that sequence complexity exists on a continuous spectrum, and there is no absolutely universal complexity threshold that separates "low-complexity" domains from "high-complexity" domains.

What is the "Linear dispersion" parameter, and how will it affect the results?
The linear dispersion parameter is essentially a measure of how evenly spaced your amino acid(s) of interest are within the region being analyzed. High linear dispersion corresponds to sequences with very even spacing of your amino acid(s) relative to all other amino acids. Low linear dispersion corresponds to sequences with very uneven/clustered spacing of your amino acid(s) relative to all other amino acids.

When using the default value, this parameter does not typically affect the identification of LCDs: it simply helps define the boundaries of the identified LCDs. However, if you set a high linear dispersion threshold, fewer LCDs will tend to be identified, as more even spacing of your amino acid(s) is required for a region to be recognized as an LCD.

For most LCD searches, we recommend the default value of 0.5. However, you are free to choose alternative values if the spacing of amino acids is an important feature of the LCDs you wish to identify. Values must be a decimal between 0.0-1.0 (inclusive).

What is the "Threshold to ignore dispersion"?
For regions with very high composition of your amino acid(s) of interest, the linear dispersion parameter (described above) becomes much less important for defining the boundaries of LCDs. Therefore, LCD-Composer is programmed to ignore the linear dispersion parameter when it encounters such a region in a protein sequence. This helps prevent the loss of LCDs that have a very high composition of your amino acid(s) but a low linear dispersion of those amino acids.

By default, LCD-Composer calculates this threshold as the halfway point between your defined minimum composition threshold and 100. For example, if your composition threshold is 50, the threshold to ignore dispersion will be 75.
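
The default calculation described above amounts to a simple midpoint, shown here as a short Python sketch for a single composition threshold. The function name is hypothetical.

```python
def default_ignore_dispersion_threshold(composition_threshold):
    """Halfway point between the composition threshold and 100."""
    return (composition_threshold + 100) / 2

example_50 = default_ignore_dispersion_threshold(50)  # the example given above
example_40 = default_ignore_dispersion_threshold(40)  # the default composition threshold
```

These evaluate to 75 and 70, matching the example above and the default listed earlier on this page.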

The default setting is recommended for most LCD searches. However, if you wish to choose an alternative value, it must be a number between 0-100 (inclusive).

My favorite organism is not listed in the dropdown menu. What can I do?
The dropdown menu only lists a select number of organisms. However, many more (~19000 reference proteomes in total) are available for analysis! You can try typing in the full scientific name of your organism (e.g. "Homo sapiens" or "Saccharomyces cerevisiae") and submitting that for analysis. The server will then try to match that name with a set of proteome files. If you are still receiving an error message, try downloading this list of available organisms and searching for your favorite organism. You should then be able to copy/paste the scientific name of the organism (4th column) into the organism selection box.

Why are there two possible choices for the "Linear dispersion method"?
When specifying multiple amino acid groups, the original LCD-Composer algorithm ("Original method" option) combined all groups prior to calculating the linear dispersion. This typically has little effect on the identified LCDs but, on occasion, can result in LCDs with the amino acid groups asymmetrically distributed across the LCD. Some users wanted additional control over the linear dispersion parameter, so the "New method" option was incorporated into the webserver. For LCD searches with only a single amino acid or group of amino acids, the "New method" and "Original method" will perform identically. However, if multiple groups of amino acids are specified (along with the "New method"), then multiple linear dispersion threshold values also need to be specified (one for each group of amino acids, in the same order). In such cases, the linear dispersion of each amino acid group will be evaluated independently, and only LCDs passing all of the linear dispersion thresholds will be included in the results file.

Why are no LCDs or proteins listed in my results file?
You still receive a results file even if no LCDs were identified using your search parameters. If you receive a file that contains the search parameters and column titles but otherwise appears empty, this means that no LCDs were identified. This may be helpful for some users since each results file has the search parameters in the heading. That way, a user has a record of all LCD searches performed, regardless of whether any LCDs were identified.

Why do some LCDs in my results file actually have final percent composition or linear dispersion values that are lower than my minimum thresholds?
On rare occasions, the final percent composition of an identified LCD will actually be lower than the minimum composition threshold used as a search parameter. The same can happen for the linear dispersion parameter. This is not a bug or a mistake. It occurs due to the merging of overlapping windows that all individually pass the minimum composition and linear dispersion thresholds. Since each of the "building blocks" of the LCD in this situation still passes the criteria, the LCD is included in the results file. If desired, the results can be easily sorted (e.g. in Excel) by final percent composition or linear dispersion, and LCDs with low final values can be removed from results (this should be noted in publications, if necessary).
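
A small worked example may help show how this can happen for composition. The Python sketch below uses a made-up 9-residue sequence, a window size of 5, and a 60% Q threshold (toy values chosen only to keep the arithmetic easy to follow). It considers composition only.

```python
def percent_q(fragment):
    """Percent glutamine in a sequence fragment."""
    return 100.0 * fragment.count("Q") / len(fragment)

sequence = "QAAQQQAAQ"  # made-up toy sequence
window = 5

# Every 5-residue window: QAAQQ, AAQQQ, AQQQA, QQQAA, QQAAQ (3 Q's each).
windows = [sequence[i:i + window] for i in range(len(sequence) - window + 1)]
window_values = [percent_q(w) for w in windows]

# Merging all overlapping windows gives back the full 9-residue sequence.
merged_value = percent_q(sequence)
```

Each window is exactly 60% Q and therefore passes, but the merged region contains 5 Q's in 9 residues (about 55.6%), which is below the 60% threshold.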

How are GO-term analyses performed?
GO-term analyses are performed using the GOATOOLS Python package with default parameters. GO-term analyses are performed solely on the base reference proteome that contains one protein sequence per gene, regardless of whether the user checks the box to include all isoforms in the LCD search: this ensures that the statistics underlying the GO-term analysis are not skewed by LCD-containing proteins that have a large number of isoforms. Each GO-term results file contains both raw and corrected p-values to account for multiple-hypothesis testing. It is the user's responsibility to apply additional p-value correction methods when appropriate (e.g. if GO-term analyses were run on multiple protein sets identified using similar LCD search parameters). In such cases, it is generally advisable to first define a single "best" set of LCD search parameters, then run a single GO-term analysis on the resulting set of proteins.

For each GO-term analysis, statistics for all available GO terms are included in the results file regardless of statistical significance. However, for user convenience, the results are automatically sorted from lowest to highest based on the Sidak-adjusted p-values. Subsequently, GO terms with the same p-value are sorted from highest to lowest based on the odds ratio, which prioritizes GO terms with a high degree of enrichment above those with an identical p-value but lower degree of enrichment.
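
If you re-sort or filter a GO-term results file and want to restore the ordering described above, a two-key sort is sufficient. The Python sketch below assumes each row has been read into a dictionary keyed by the column headers "p_sidak" and "odds_ratio". The function name and the example rows (including the GO IDs) are made up.

```python
def sort_go_rows(rows):
    """Sort by Sidak-adjusted p-value (ascending), then odds ratio (descending)."""
    return sorted(rows, key=lambda r: (float(r["p_sidak"]), -float(r["odds_ratio"])))

# Made-up rows for illustration only.
rows = [
    {"# GO": "GO:0000001", "p_sidak": "0.01", "odds_ratio": "2.0"},
    {"# GO": "GO:0000002", "p_sidak": "0.001", "odds_ratio": "1.5"},
    {"# GO": "GO:0000003", "p_sidak": "0.01", "odds_ratio": "5.0"},
]
ordered_ids = [r["# GO"] for r in sort_go_rows(rows)]
```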

GO-term analysis is only available for the following organisms:

Homo sapiens (Humans)
Mus musculus (Mouse)
Saccharomyces cerevisiae (Yeast)
Caenorhabditis elegans (Worm)
Drosophila melanogaster (Fruit fly)
Arabidopsis thaliana (Plant - Mouse-ear cress)
Danio rerio (Zebrafish)
Bos taurus (Cow)
Canis lupus familiaris (Dog)
Dictyostelium discoideum (Slime mold)
Gallus gallus (Chicken)
Sus scrofa (Pig)
Rattus norvegicus (Rat)

What does each column in the GO-term results file represent?
GO-term results file column headers:

# GO = Gene Ontology ID
NS = GO-term category (BP=Biological Process, CC=Cellular Component, and MF=Molecular Function)
enrichment = Whether the GO term was enriched (e) or purified (p), i.e. depleted, for your protein set
name = Description of the GO term
hits_in_study = # of proteins in your protein set that are associated with the GO term
totalProteins_in_study = total # of proteins in your protein set
ratio_in_study = hits_in_study / totalProteins_in_study
hits_in_population = total # of proteins in the corresponding organism that are associated with the GO term
totalProteins_in_population = total # of proteins in the corresponding organism
ratio_in_pop = hits_in_population / totalProteins_in_population
odds_ratio = the degree to which the GO term is enriched or purified for your protein set
p_uncorrected = p-value that has not been corrected for multiple-hypothesis testing
depth = Depth of the GO term in the directed acyclic graph. GO terms become more specific as depth increases.
p_bonferroni = Adjusted p-value using Bonferroni method for multiple-hypothesis test correction
p_sidak = Adjusted p-value using Sidak method for multiple-hypothesis test correction
p_holm = Adjusted p-value using Holm method for multiple-hypothesis test correction
Length-associated GO term? = Whether the GO term is associated with significantly longer proteins (1=Yes, 0=No)
study_items = Protein IDs of proteins in your set that are associated with the GO term

What is the "Length-associated GO term?" column in my GO-term results file?
For some types of LCDs, proteins with the LCD tend to be longer than proteins without the LCD. Some GO terms are also associated with long proteins, which may result in more frequent enrichment in GO term analyses performed on a set of long proteins. Therefore, for each organism, we identified GO terms whose associated proteins are statistically significantly longer than all other proteins not associated with the GO term, using a Mann-Whitney U test with Holm-Sidak adjustment for multiple-hypothesis test correction. These GO terms are indicated by a "1" in the "Length-associated GO term?" column.

However, it is important to note that the directionality of cause and effect is not clear. It is possible that long proteins are statistically more likely to have LCDs. Conversely, LCDs may be in a specific set of long proteins for biologically important reasons. LCDs also contribute to sequence length, so a set of proteins with LCDs will be longer than the same set of proteins lacking the LCDs.

Therefore, this indicator column is only meant to highlight these instances if the user wishes to consider length effects.

Why do there appear to be dates in my GO-term results file when I open it in Excel?
If you open the GO-term results file in Excel, sometimes fractions are automatically converted to dates by Excel. This issue has plagued Excel users for quite some time, yet Microsoft offers no practical way to permanently disable this behavior (see this page for details). This occurs for the "ratio_in_study" column (which is the fraction of your proteins of interest that are associated with the corresponding function) and the "ratio_in_pop" column (which is the fraction of all proteins in the proteome that are associated with the corresponding function), but only for certain fractions.

For this reason, we have included 4 additional columns not normally found in GO-term results files generated by GOATOOLS. These columns are "hits_in_study", "totalProteins_in_study", "hits_in_population", and "totalProteins_in_population". These values are the values used in the problematic "ratio" columns. For example, "ratio_in_study" is simply the ratio of "hits_in_study"/"totalProteins_in_study". Likewise, "ratio_in_pop" is simply the ratio of "hits_in_population"/"totalProteins_in_population". These added columns ensure that no information is lost even if ratios are converted to dates.

NOTE: if you open the GO-term results file in Excel and subsequently save that Excel file, the fractions will be permanently converted to dates and will not be recoverable. However, the added columns described above ensure that you will always have access to the original values.

You can also open the GO-term results files using a plain-text editor (e.g. Notepad on Windows or TextEdit on Mac) and the fractions will not be converted to dates.
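
Another way to avoid the date-conversion problem is to read the file with a script and recompute the ratios from the added count columns. The Python sketch below uses the standard csv module and the column headers listed on this page. It assumes the column-header line is the first line of the file. The function name and the example row (including the GO ID) are made up.

```python
import csv
import io

def recompute_ratios(handle):
    """Return (GO ID, ratio_in_study, ratio_in_pop) recomputed from the counts.

    Assumption: the first line of the file is the column-header line.
    """
    results = []
    for row in csv.DictReader(handle, delimiter="\t"):
        study = int(row["hits_in_study"]) / int(row["totalProteins_in_study"])
        population = int(row["hits_in_population"]) / int(row["totalProteins_in_population"])
        results.append((row["# GO"], study, population))
    return results

# Made-up example content (not real server output).
example = (
    "# GO\thits_in_study\ttotalProteins_in_study\t"
    "hits_in_population\ttotalProteins_in_population\n"
    "GO:0000000\t3\t12\t50\t2000\n"
)
ratios = recompute_ratios(io.StringIO(example))
```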

How is plotting performed when analyzing single protein sequences?
Plotting is performed using an adapted version of the CompositionPlotter.py script, which is also available as a command-line tool in our GitHub repository (https://github.com/RossLabCSU/IJMS_2021). You can specify the image resolution (in dpi), figure height and width (in mm), location of an optional threshold line (in y-axis units of percent composition), and a file type. You can also specify a custom color palette if desired. Custom color palettes must be specified in hex code format, with each color separated by an underscore (e.g. "#1f77b4_#ff7f0e_#2ca02c").
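
If you build palette strings in a script, it can help to validate them before submitting. The Python sketch below splits a palette on underscores and checks each entry. It assumes 6-digit hex codes with a leading "#", as in the example above. Whether the server accepts other hex formats is not covered here, and the function name is hypothetical.

```python
import re

# Assumption: colors are 6-digit hex codes with a leading "#".
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

def parse_palette(palette):
    """Split an underscore-delimited palette and validate each hex code."""
    colors = palette.split("_")
    for color in colors:
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a 6-digit hex color: {color!r}")
    return colors

colors = parse_palette("#1f77b4_#ff7f0e_#2ca02c")
```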

How does Option 2 (LCD similarity search) work?
Option 2 allows users to submit an LCD sequence of interest in order to find compositionally similar LCDs. In some ways, this is like a BLAST search for LCDs! The LCD-Composer webserver will automatically calculate appropriate search parameters, perform an LCD search in the chosen proteome, and return a file with relevant LCDs. Within this file, you can also see the search parameters used in the search. This option is primarily intended to provide users with an extremely simple search option and to calculate LCD search parameters that can be used as initial guides for designing more sophisticated LCD searches using the other analysis options on the LCD-Composer server.

Search parameters are calculated as follows:

The number of defining features determines the number of distinct amino acid types to use in the LCD search. The features will be ranked in descending order from most-common to least-common within the LCD sequence that the user submitted.*
The percent composition is then calculated for each of the defining amino acid features. The composition threshold used in the LCD search is then the (% composition - 5). The slight decrement when calculating the threshold results in slightly higher sensitivity when detecting LCDs of potential interest to the user.**
The linear dispersion for each of the defining amino acid features is also calculated, and the resulting linear dispersion thresholds used in the LCD search are then the (dispersion - 0.2).
The window size used in the search is simply half of the length of your LCD (or 20, whichever is greater). This intentionally weights LCD searches according to the size of the user-submitted LCD sequence. For example, if a long LCD sequence is submitted, the LCD search will only identify LCDs that are also relatively long. This may reduce the number of LCDs identified, but will likely result in a greater proportion of LCDs likely to be of interest to the user.

*In rare cases there will be ties in the percent compositions of the top features. In those cases, tied residues are combined into a single group and the corresponding composition threshold for that group is set to n * (%comp - 5), where n is the number of tied residues. Tied-residue combining only occurs if the user-specified number of features is not large enough to capture all of the top tied and non-tied residues: in cases where the tied residues are found within the top number of features, each residue is treated separately and receives its own composition threshold.

**If the composition threshold for a search feature falls below 0%, that amino acid is not included as a search feature (regardless of the user-specified value for number of features) since the feature is not a major feature of the query LCD.
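
The composition and window-size rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the server's code: it does not implement the tied-residue rule, it omits the linear dispersion thresholds, and it rounds half the query length down to a whole number. The function name and example sequences are made up.

```python
from collections import Counter

def derive_search_parameters(query, n_features=2):
    """Return ([(amino_acid, composition_threshold), ...], window_size).

    Simplified: ties between residues are NOT combined, and linear
    dispersion thresholds are not calculated.
    """
    ranked = Counter(query).most_common(n_features)  # most common first
    features = []
    for amino_acid, count in ranked:
        threshold = 100.0 * count / len(query) - 5
        if threshold < 0:
            continue  # not a major feature of the query LCD
        features.append((amino_acid, threshold))
    # Half the query length (rounded down: an assumption) or 20, whichever is greater.
    window = max(20, len(query) // 2)
    return features, window

# 50-residue query: 60% Q, 30% N, 10% S.
params_two = derive_search_parameters("Q" * 30 + "N" * 15 + "S" * 5)
# 50-residue query: 80% Q, 16% N, 4% S; the S threshold would be -1, so S is dropped.
params_three = derive_search_parameters("Q" * 40 + "N" * 8 + "S" * 2, n_features=3)
# Short query: the window size falls back to 20.
short_window = derive_search_parameters("Q" * 10 + "N" * 2)[1]
```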

How is compositional similarity to my query LCD sequence calculated?
When using Option 2, LCD search parameters are automatically extracted from an LCD query sequence. LCDs discovered by this search are then ranked based on compositional similarity to the query LCD sequence (from most similar to least similar). But how is "compositional similarity" determined? We define compositional similarity as the normalized Manhattan distance between two sequences (the query LCD sequence and a discovered LCD sequence) in 20-dimensional compositional space. Steps in this calculation are as follows:

1. For each of the 20 canonical amino acids, calculate the absolute difference in percent composition between the query sequence and the discovered LCD sequence.
2. Calculate the sum of absolute differences from Step 1.
3. Normalize the value from Step 2 by dividing by the maximum possible dissimilarity for the given query LCD sequence. If the query LCD sequence does not contain all 20 amino acids (true of most LCDs by definition), then maximum possible dissimilarity is calculated from a model homopolymer of an amino acid that does not appear in the query LCD sequence. If all 20 amino acids are present in the query LCD sequence, maximum possible dissimilarity is calculated from a model homopolymer of the least-common amino acid in the query LCD sequence.
The value from Step 3 is the final similarity score. In principle, this score can range from 0-100, with lower values corresponding to higher similarity (and, therefore, a higher ranking of the discovered LCD in your results file).
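
The calculation above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the server's code: the normalized value is multiplied by 100 to match the stated 0-100 range, sequences are assumed to contain only the 20 canonical amino acids, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Percent composition for each of the 20 canonical amino acids."""
    return {aa: 100.0 * sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS}

def reference_homopolymer(query_comp):
    """Composition of a homopolymer of an absent (else the least-common) residue."""
    rarest = min(AMINO_ACIDS, key=lambda aa: query_comp[aa])
    return {aa: (100.0 if aa == rarest else 0.0) for aa in AMINO_ACIDS}

def manhattan(comp_a, comp_b):
    """Sum of absolute differences in percent composition."""
    return sum(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)

def manhattan_similarity(query, discovered):
    """Normalized Manhattan distance; lower means more similar."""
    query_comp = composition(query)
    max_distance = manhattan(query_comp, reference_homopolymer(query_comp))
    # Scaling by 100 is an assumption made to match the 0-100 range.
    return 100.0 * manhattan(query_comp, composition(discovered)) / max_distance

identical = manhattan_similarity("QQQQ", "QQQQ")
unrelated = manhattan_similarity("QQQQ", "AAAA")
partial = manhattan_similarity("QQNN", "QQQQ")
```

Here an identical sequence scores 0, a homopolymer of an unrelated residue scores 100, and the half-matching pair scores 50.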

Since Euclidean distance may be more familiar to some users, we also calculate the Euclidean distance (with respect to composition) between the query LCD and discovered LCD as follows:

1. For each of the 20 canonical amino acids, calculate the squared difference in percent composition between the query sequence and the discovered LCD sequence.
2. Calculate the sum of squared differences from Step 1.
3. Calculate the square-root of the sum from Step 2.
4. Normalize the value from Step 3 by dividing by the maximum possible dissimilarity (as defined above) for the given query LCD sequence.
The value from Step 4 is the final similarity score. Like the normalized Manhattan distance, this score can range from 0-100, with lower values corresponding to higher similarity.
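
A corresponding Python sketch for the Euclidean version is shown below. It makes two assumptions that are not spelled out above: the maximum possible dissimilarity is taken to be the Euclidean distance between the query and the model homopolymer (of an absent or least-common residue), and the normalized value is multiplied by 100 to match the stated 0-100 range. Function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Percent composition for each of the 20 canonical amino acids."""
    return {aa: 100.0 * sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS}

def euclidean(comp_a, comp_b):
    """Square root of the summed squared differences in percent composition."""
    return math.sqrt(sum((comp_a[aa] - comp_b[aa]) ** 2 for aa in AMINO_ACIDS))

def euclidean_similarity(query, discovered):
    """Normalized Euclidean distance; lower means more similar."""
    query_comp = composition(query)
    # Model homopolymer of an absent residue, or of the least-common one.
    rarest = min(AMINO_ACIDS, key=lambda aa: query_comp[aa])
    homopolymer = {aa: (100.0 if aa == rarest else 0.0) for aa in AMINO_ACIDS}
    # Assumption: the maximum is measured with the same (Euclidean) metric.
    max_distance = euclidean(query_comp, homopolymer)
    return 100.0 * euclidean(query_comp, composition(discovered)) / max_distance

identical = euclidean_similarity("QQQQ", "QQQQ")
unrelated = euclidean_similarity("QQQQ", "AAAA")
partial = euclidean_similarity("QQNN", "QQQQ")
```

Under these assumptions, the identical pair scores 0, the unrelated homopolymer scores 100, and the half-matching pair scores roughly 57.7.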

What does each column in the LCD Similarity Search (Option 2) results file mean?
Results files from Option 2 contain a number of columns not found in results files from other options. Below is a brief description of what each column represents:

LCD Similarity Rank = Order of discovered LCDs from most-similar to least-similar compared to the query LCD sequence. This is based on the Manhattan distance comparing the complete sequence of the discovered LCD to the query LCD sequence.
Protein Description = FASTA header for the protein containing the discovered LCD.
UniProt ID = UniProt ID for the protein containing the discovered LCD.
Domain Sequence = Sequence of the discovered LCD.
LCD Similarity Score (Manhattan Distance) = Compositional similarity between the discovered LCD and the query LCD sequence, based on the Manhattan distance.
LCD Similarity Score (Euclidean Distance) = Compositional similarity between the discovered LCD and the query LCD sequence, based on the Euclidean distance.
Domain Boundaries = Location of the discovered LCD within its protein, expressed as (StartAA-EndAA).
Final Domain Composition = the combined composition of all of the amino acid(s) of interest. Note that for searches using multiple amino acids or groups of amino acids, this is the combined composition of all amino acids. Individual amino acid compositions are found in separate columns.
Final Domain Linear Dispersion = the linear dispersion of all amino acids of interest (same as for "Final Domain Composition").
Best Window Similarity Score, Manhattan Distance = the score (Manhattan distance) of the best-scoring fragment within the discovered LCD sequence when scanned with the window size used in the initial LCD similarity search.
Best Window Sequence, Manhattan Distance = the sequence of the best-scoring fragment within the discovered LCD sequence based on Manhattan distance.
Best Window Similarity Score, Euclidean Distance = the score (Euclidean distance) of the best-scoring fragment within the discovered LCD sequence when scanned with the window size used in the initial LCD similarity search.
Best Window Sequence, Euclidean Distance = the sequence of the best-scoring fragment within the discovered LCD sequence based on Euclidean distance.
Remaining 20 columns = the composition for each of the 20 canonical amino acids within the LCD sequence.

I received a message saying that I exceeded my limit for analyses: what can I do?
Computational resources are limited, and this limit ensures that all users who wish to use LCD-Composer are able to run LCD searches and receive results in a timely manner. If you wish to run a large number of LCD searches, please use the command-line scripts available on our lab GitHub page. Please note that limits may be subject to change if the number and sizes of submissions become excessive.

Our goal is to make LCD-Composer accessible to EVERYONE and used by as many people as possible. If you are having trouble analyzing protein sequences using either the LCD-Composer webserver or the LCD-Composer scripts, feel free to send Sean an email at Sean.Cascarina@colostate.edu and he will try to give additional guidance.

LCD-Composer doesn't have a feature that I need! Can I suggest incorporating it into LCD-Composer?
YES! Feel free to suggest any ideas you have for improving LCD-Composer by emailing Sean at Sean.Cascarina@colostate.edu. While we may not be able to incorporate every suggestion, we are interested in considering anything that could make LCD-Composer more useful to a broad range of users.