Gigwa v2.5.x – Documentation
Learn how to use Gigwa like a pro in a few minutes!
Watch demonstration videos on the project homepage:
http://www.southgreen.fr/content/gigwa
A/ USER DOCUMENTATION
A1/ IMPORTING DATA
A1.1/ IMPORTING GENOTYPING DATA
Choosing "Manage data" then "Import data" from the main horizontal menu leads to a page dedicated to data imports.
Anonymous users and users with no particular permissions are limited to importing genotyping data into temporary databases that remain accessible for 24h. These databases are hidden (only visible to people knowing their precise URL and to administrators).
From a second tab, users may import metadata for a database's individuals. Metadata imported by users with write permissions (administrators, database supervisors, project managers) is considered "official" and is therefore made available to all users by default. Metadata provided by any other user is available only to that user (thus completing / overriding the default metadata).
Genotyping data may be provided in various formats (VCF, HapMap, PLINK, FlapJack, Intertek, BrAPI) and in various ways:
By specifying an absolute path on the webserver filesystem (convenient for administrators managing a production instance used as data portal);
By uploading files from the client computer (with an adjustable size limit: see section B7.2);
By providing an HTTP URL, linking either to data files or to a BrAPI v1.1 base-url.
On the genotyping data import page, 3 more fields are required:
Database (not available to anonymous users since a "disposable" database is automatically generated for them): a database may contain one or several projects as long as they all rely on the same reference assembly. In the case of very large datasets, for performance-related matters it is however advised to have a single project per database;
Project: a project may contain one or several runs;
Run: each import process ends up writing a run into a project. Allowing multiple runs in a project is a way of supporting incremental data loading.
Specifying a ploidy level is optional (most of the time, the system is able to guess it) but recommended for the Flapjack and HapMap formats, as it speeds up imports. A descriptive text can be provided for each project. A tooltipped lightbulb icon explains how to add a how-to-cite text that would then be exported along with any data extracted from the given project.
Genotyping data import progress can be watched in real time from the upload page, or via a dedicated asynchronous progress page (convenient for large datasets), as imports run as background processes and may not be interrupted by users.
A1.2/ IMPORTING METADATA
Providing individual metadata is only supported for existing databases and aims at enabling users to select them by filtering on that metadata. This is convenient for cases where the individual list is long and / or individual names are not meaningful.
Metadata may be provided as a simple tabulated file containing user-defined columns (only one is enforced, named "individual").
Importing metadata via BrAPI v1 or v2 is also supported. To do so, first import a file containing at least the following 3 columns (in addition to the "individual" column):
extRefId which contains the germplasmDbId or sampleDbId (of the BrAPI server);
extRefSource which contains the BrAPI url (e.g. https://test-server.brapi.org/brapi/v2);
extRefType which contains either "germplasm" or "sample" (depending on the type of resource you want to retrieve information about).
Then return to the metadata import form and click on submit in order to retrieve the data from the BrAPI server(s).
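As a minimal sketch of the file described above, the following Python snippet builds a tab-separated metadata file with the four expected columns. The individual names and external identifiers are hypothetical placeholders, and the validation of extRefType is an illustrative convenience rather than something Gigwa requires you to do beforehand.

```python
import csv
import io

# Column names taken from the documentation above.
COLUMNS = ["individual", "extRefId", "extRefSource", "extRefType"]


def build_metadata_tsv(rows):
    """Return tab-separated text with a header line and one line per individual."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Only these two resource types are mentioned in the documentation.
        if row["extRefType"] not in ("germplasm", "sample"):
            raise ValueError(f"unsupported extRefType: {row['extRefType']}")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical individuals and identifiers, for illustration only.
rows = [
    {
        "individual": "IND001",
        "extRefId": "germplasm1",
        "extRefSource": "https://test-server.brapi.org/brapi/v2",
        "extRefType": "germplasm",
    },
    {
        "individual": "IND002",
        "extRefId": "sample7",
        "extRefSource": "https://test-server.brapi.org/brapi/v2",
        "extRefType": "sample",
    },
]

tsv_text = build_metadata_tsv(rows)
print(tsv_text)
```

The resulting text can be saved to a file and submitted through the metadata import tab.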
A2/ WORKING WITH GENOTYPING DATA
From the home page, select a database and a project. Note the presence of an "Enable browse and export" checkbox that toggles between a mode where only variant counts are displayed, and one where users may browse, visualize and export selected data.
A2.1/ GENERAL FILTERING FEATURES
By default, a single grey filter panel appears, providing means to select variants based on their inherent attributes:
variant type (i.e. SNP, INDEL…);
sequence;
position;
if applicable, number of known alleles (when several are represented);
if applicable, functional annotations (if the data was provided as a SnpEff or VEP-annotated VCF file).
From version 2.5, users may also select variants by ID, by switching to a mode that replaces the combination of other filters (accessible from a checkbox in the hamburger menu). Variant selection is then supported via a dropdown with text lookup, clipboard pasting, or a local file.
As a general rule, a filter widget where no selection has been made will behave as if all its items were selected (no filtering applied on that field).
A2.2/ GROUP COMPARISON
At the bottom of this panel, a dropdown button may be used to display one or two additional panels that allow for more advanced filtering features based on one or two subsets of individuals referred to as groups.
Both of these panels have the same contents and each one lets users select a list of individuals (thus defining groups 1 and 2). Additionally, some handy tools are available on the right side of the "Individuals" drop down menu as tooltipped icons:
Activating the disk icon allows the current selection of individuals to be memorized within the web browser;
The magnifier icon (only present when metadata are available) helps select individuals according to the metadata attached to them;
The copy icon adds currently selected individuals to clipboard;
Clicking the paste icon opens a textbox for pasting / editing a list of individuals to select.
All other widgets in the same panel will then let users apply genotype-level filters on the selected list of individuals:
In the case where data was provided in VCF format containing numeric genotype-level fields (e.g. depth, genotype quality), a minimum acceptable value may be provided for each of them. Any genotype that does not satisfy the constraints thus defined is treated as missing for the rest of the query;
"Max missing data" ratio defines how many individuals (among the selected ones) may have a missing genotype;
"Minor allele frequency" can be provided as a range (only supported for bi-allelic data);
The "Genotype pattern" dropdown provides a list of genotype patterns that may be applied within the group. Descriptions for those are available via the question-mark icon.
Useful tip: To identify variants for which genotypes are steady within each group but different between them, set both groups' genotype pattern to "All or mostly the same". In this case, in each group panel a similarity ratio lets users specify how many of the current group's selected individuals must have the same genotype. Additionally, an extra pink panel appears between both group panels, containing a checkbox labelled "Discriminate groups". Checking this box will ensure that the most frequent genotype in group 1 is different from that in group 2. See online video number 4 for a demonstration.
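The logic of the tip above can be illustrated with a small Python sketch. It is only a conceptual model, not Gigwa's actual implementation: the function names are hypothetical, genotypes are represented as plain strings, and the similarity ratio is assumed to be computed over non-missing genotypes.

```python
from collections import Counter


def most_frequent_genotype(genotypes, min_ratio):
    """Return the dominant genotype if it reaches min_ratio among called genotypes, else None."""
    called = [g for g in genotypes if g is not None]  # None stands for a missing genotype
    if not called:
        return None
    genotype, count = Counter(called).most_common(1)[0]
    return genotype if count / len(called) >= min_ratio else None


def discriminates(group1, group2, min_ratio=0.9):
    """True when both groups are 'mostly the same' internally but differ from each other."""
    g1 = most_frequent_genotype(group1, min_ratio)
    g2 = most_frequent_genotype(group2, min_ratio)
    return g1 is not None and g2 is not None and g1 != g2


# Toy data for one variant: group 1 is mostly A/A, group 2 is entirely T/T.
group1 = ["A/A", "A/A", "A/A", "A/T"]
group2 = ["T/T", "T/T", "T/T", "T/T"]
print(discriminates(group1, group2, min_ratio=0.75))
```

With a stricter ratio, group 1 would no longer count as "mostly the same" and the variant would be rejected.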
Note that genotype-level filters are applied in the order they appear: the maximum missing data filter applies first, taking into account truly missing data and genotypes treated as missing because of low quality. MAF and genotype pattern queries are then applied at the same time, on the remaining (non-missing) genotypes only.
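The order of operations described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Gigwa's code: the data layout (a list of dictionaries with "gt" and "DP" keys), the function name, and the inclusive boundaries are all hypothetical, and only bi-allelic variants with alleles coded 0 and 1 are handled.

```python
def variant_passes(calls, min_depth, max_missing, maf_range):
    """Apply genotype-level filters to one bi-allelic variant, in the documented order."""
    # Step 1: genotypes below the minimum depth are treated as missing (None).
    genotypes = [
        c["gt"] if c["gt"] is not None and c["DP"] >= min_depth else None
        for c in calls
    ]

    # Step 2: the missing-data filter counts truly missing and low-quality genotypes.
    missing_ratio = genotypes.count(None) / len(genotypes)
    if missing_ratio > max_missing:
        return False

    # Step 3: MAF is computed on the remaining (non-missing) genotypes only.
    alleles = [allele for g in genotypes if g is not None for allele in g]
    if not alleles:
        return False
    alt_freq = sum(1 for allele in alleles if allele == 1) / len(alleles)
    maf = min(alt_freq, 1 - alt_freq)
    low, high = maf_range
    return low <= maf <= high


# Toy calls for four individuals; the last one has no genotype at all.
calls = [
    {"gt": (0, 0), "DP": 20},
    {"gt": (0, 1), "DP": 15},
    {"gt": (1, 1), "DP": 3},
    {"gt": None, "DP": 0},
]
print(variant_passes(calls, min_depth=10, max_missing=0.5, maf_range=(0.1, 0.5)))
```

In this toy case the low-depth genotype of the third individual is discarded before the missing-data ratio is computed, which is why raising the depth threshold can cause a variant to fail the missing-data filter.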
A2.3/ BOOKMARKING QUERIES
After executing a query, any logged-in user may bookmark it from the hamburger menu. Each bookmarked query needs to be given a name and thus users may:
consult their list of bookmarked queries;
load one of them at any time to be able to re-run it;
rename them;
discard them.
A2.4/ BROWSING / VISUALIZING VARIANTS
If the "Enable browse and export" box is ticked when clicking the Search button, then one may browse the selection online (i.e. the list of variants that match the query). Clicking a variant line opens a dialog providing variant details along with individuals' genotypes and optional complementary information like quality data or annotations.
Above the variant list:
Gigwa embeds a JavaScript version of the Integrative Genomics Viewer, IGV.js (© Broad Institute), which lets users conveniently view the current variant selection along with genotypes of interest, within its genomic context, without the need to go through the export process. A number of genomes are provided in default (yet configurable) lists, and users working on non-model organisms may also point the genome browser to any local genome track files.
A chart icon leads to a dialog in which, for each sequence represented in the current selection, various kinds of series charts (variant distribution, Fst, Tajima's D) may be computed, inspected online and downloaded. In the case where data was provided in VCF format containing numeric genotype-level fields (e.g. depth, genotype quality) an additional series can be displayed for each of these fields, on top of the main series;
An "External tools" box provides means to set up the application for interacting with / pushing data into external tools: an external online genome browser (e.g., GBrowse, JBrowse) can be configured for viewing each variant in its genomic context (via an extra icon at the end of each table row); a running standalone instance of IGV can be fed with a VCF export file (refer to tooltip for details); other online tools (e.g. Galaxy, SNiPlay) can also be fed using exported files (click online-output-tools icon for details);
NB: The use of external genome browsers or standalone IGV is now discouraged in favor of the newly embedded IGV.js solution.
Clicking the download button opens a panel where users may select an output format, refine the list of individuals to export, select individual metadata fields to include (if available), and choose between directly downloading the output, or creating a file on the server (in which case its URL may be used later, shared or passed to external tools). A prototype version of Flapjack-Bytes (© JHI) is embedded and may be switched to after exporting into Flapjack format with the "Keep files on server" option (similarly to the way in which a standalone IGV instance can be invoked with VCF exports).
A3/ WORKING WITH REST APIs (for advanced users)
Any data imported into Gigwa is automatically interfaced via the following standard REST APIs, documented in a Swagger page available from the main menu:
The GA4GH v0.6.0a5 implementation has by design a single base-url. Available databases can be listed by posting an empty body to /rest/ga4gh/referencesets/search. The values thus obtained can then be passed to other calls as referencesetId or datasetId.
The BrAPI v2.0 implementation has by design a single base-url. Available databases can be listed by posting an empty body to /gigwa2/rest/brapi/v2/programs or /gigwa2/rest/brapi/v2/trials. The values thus obtained can then be passed to other calls as programDbId or trialDbId.
The BrAPI v1.1 implementation has by design a separate base-url for each database, constructed as follows: /{database}/brapi/v1/token. Each database's base-url can be deduced from the responses to the above-mentioned calls and, for convenience, the main Gigwa interface provides a link to the corresponding BrAPI base-url when a new database is selected.
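As a hedged illustration of the GA4GH call mentioned above, the following Python sketch prepares a POST request to the referencesets search endpoint using only the standard library. The host and context path (http://localhost:8080/gigwa) are assumptions to be replaced with those of your own instance, and the "empty body" is assumed here to be an empty JSON object; the Swagger page remains the authoritative reference.

```python
import json
import urllib.request


def build_search_request(base_url, path):
    """Build (without sending) a POST request carrying an empty JSON object as body."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical base URL: adapt it to the instance you are querying.
request = build_search_request(
    "http://localhost:8080/gigwa", "/rest/ga4gh/referencesets/search"
)
print(request.get_method(), request.full_url)

# To actually send the request against a running instance:
# with urllib.request.urlopen(request) as response:
#     payload = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL and body before contacting the server.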
The table below lists terminology correspondences:
Gigwa entity | GA4GH entity | BrAPI v1 entity | BrAPI v2 entity |
---|---|---|---|
database or module | referenceSet or dataset | database or map | program or trial |
project | variantSet | genotyping study | genotyping study or referenceSet |
run | - | - | variantSet |
sequence | reference | linkageGroup | reference |
variant | variant | marker | variant |
individual | callSet | germplasm | germplasm |
sample | - | sample or markerprofile | sample or callSet |
Please refer to http://ga4gh-schemas.readthedocs.io/en/latest/ and https://brapi.org/ for more details about each API.
B/ ADMINISTRATOR DOCUMENTATION
By default, a fresh instance of Gigwa comes with a single pre-defined administrator account (login: gigwadmin, password: nimda). It is of course strongly advised to change this password upon first connection (see section B6 below).
B1/ TOMCAT CONFIGURATION
Ready-to-use bundled packages should not require any changes in Tomcat configuration since the settings below have already been applied to them. However, if you install Gigwa in a production environment from fresh Tomcat binaries, it is necessary to apply the following modifications:
The bin/setenv.bat or bin/setenv.sh (depending on the platform) script must contain a line as follows in order to dedicate enough RAM to Tomcat:
export CATALINA_OPTS="$CATALINA_OPTS -Xms512m -Xmx2048m" (Ubuntu / OSX)
set "JAVA_OPTS=%JAVA_OPTS% -Xms512m -Xmx2048m" (Windows)
(This setting may of course be adapted to get the best out of the hardware configuration: we found -Xmx8192m to be a good compromise on production servers with large amounts of RAM). If MongoDB and Tomcat are running on the same host (default for bundle archive installations) you should leave most RAM available for MongoDB though.
Adding the following line to that same setenv script works around a Log4j2 vulnerability (CVE-2021-44228):
export CATALINA_OPTS="$CATALINA_OPTS -Dlog4j2.formatMsgNoLookups=true" (Ubuntu / OSX)
set "JAVA_OPTS=%JAVA_OPTS% -Dlog4j2.formatMsgNoLookups=true" (Windows)
In the conf/server.xml file, the main Connector element must be configured as below:
maxHttpHeaderSize="65536"
maxParameterCount="-1" maxPostSize="-1"
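One possible way to verify these settings is the short Python sketch below, which parses a server.xml document and reports which of the three attributes are absent or set to a different value. It assumes that the "main" Connector is the first Connector element found in the file; the function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute values taken from the documentation above.
REQUIRED = {
    "maxHttpHeaderSize": "65536",
    "maxParameterCount": "-1",
    "maxPostSize": "-1",
}


def missing_connector_settings(server_xml_text):
    """Return the required attributes that the first Connector lacks or sets differently."""
    root = ET.fromstring(server_xml_text)
    connector = root.find(".//Connector")  # assumption: first Connector is the main one
    if connector is None:
        return dict(REQUIRED)
    return {
        name: value
        for name, value in REQUIRED.items()
        if connector.get(name) != value
    }


# Minimal, hypothetical server.xml content for illustration.
sample = """<Server>
  <Service name="Catalina">
    <Connector port="8080" protocol="HTTP/1.1"
               maxHttpHeaderSize="65536"
               maxParameterCount="-1" maxPostSize="-1"/>
  </Service>
</Server>"""

print(missing_connector_settings(sample))
```

An empty result means that all three attributes carry the expected values.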
B2/ APACHE CONFIGURATION
In production environments, Tomcat often runs behind an Apache proxy. If such is your case, you must include the following line in your VirtualHost configuration:
ProxyTimeout 86400
Otherwise, when Gigwa is running a long-lasting process, Apache may stop waiting for Tomcat to respond, and the interface will fail to display results.
B3/ ENABLING PASSWORD ENCRYPTION
User information is stored in WEB-INF/classes/users.properties. By default, passwords are not encoded. Administrators may enable password encoding to enhance security by:
B4/ CENTRAL AUTHENTICATION SERVICE COMPATIBILITY
Single-Sign-On is supported via the implementation of the CAS protocol. If your organization is using CAS, making users of your Gigwa instance able to authenticate via their institutional account is as simple as defining the following parameters in WEB-INF/classes/config.properties:
B5/ MANAGING DATA
The visibility of a database is defined using two flags (default values in bold):
public / private: if public, anyone (even anonymous users) can search this database; if private, only administrators and users who were explicitly granted permissions may do so;
hidden / exposed: if hidden, only administrators will see the database in the main menu list; if exposed, any entitled user (that is, anyone if database is public, otherwise any user with permissions on at least one of the database's projects) will see it in the list.
In other words, the first flag defines visibility on the server side while the second defines exposure on the client side. Typically, a temporary database created by an anonymous user or a user without any management permissions will be public and hidden (searchable by anyone, listed to the administrators only), and will be made accessible to its creator via a specific URL referring to the database name (thus accessible to anyone if shared).
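The interplay between the two flags can be summarised with a small Python model. It only restates the rules described above and is not taken from Gigwa's source code: the class and function names are hypothetical, and "has_permission" stands for a user holding permissions on the database or on at least one of its projects.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    public: bool  # False means private
    hidden: bool  # False means exposed


def can_search(db, is_admin, has_permission):
    """Server-side visibility: who may query the database."""
    return db.public or is_admin or has_permission


def is_listed(db, is_admin, has_permission):
    """Client-side exposure: who sees the database in the main menu list."""
    if is_admin:
        return True
    if db.hidden:
        return False
    return db.public or has_permission


# A temporary database created by an anonymous user: public and hidden.
temp_db = Database(name="tmp_example", public=True, hidden=True)
print(can_search(temp_db, is_admin=False, has_permission=False))
print(is_listed(temp_db, is_admin=False, has_permission=False))
```

This reflects why a temporary database is reachable through its URL by anyone, while only administrators see it listed.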
Administrators can see all databases (even if private and/or hidden) and have all privileges on them. Only administrators may create permanent databases. This can be done either at import time, or via the main menu's "Manage data" link, by subsequently clicking on "Manage databases". A simple interface then makes it possible to create an empty database on a selected MongoDB host, set the public and hidden flags on existing databases, delete existing projects and databases, and, for administrators and database supervisors, manage dumps.
The dump/restore functionality was implemented in order to provide means to easily generate a copy of an entire database and thus prevent data loss in case of server crash. In the database list, a color code tells whether or not an up to date dump exists for each of them. Dump/restore operations can be launched with a few clicks, logs can be viewed while the process is running and are kept available for subsequent reference. For this functionality to be enabled, the system running Tomcat must have MongoDB Command Line Database Tools installed, and a parameter named "dumpFolder" in config.properties must point to a location Tomcat has permissions to write to (which may lie on a remote mounted filesystem, ideally backed-up periodically). Dump files may be downloaded by administrators and database supervisors who may thus manage an extra copy for even more safety.
B6/ MANAGING USER ACCOUNTS AND PERMISSIONS
By choosing the "Administer existing data and user permissions" link from the "Manage data" menu item and subsequently clicking on "Manage users and permissions", administrators may access an interface for creating / deleting users, setting their password (even their own) and permissions:
at the database level by granting the SUPERVISOR role which provides all permissions on a given database;
at the project level by granting either the READER role (only makes sense for projects in private databases), which allows searching the given project's data, or the MANAGER role, which additionally allows importing further genotyping runs or individual metadata, and granting roles to other users on that project.
Thus, a user with the MANAGER role on a project can administer that project in the same way as an administrator, via the main menu's "Manage data" item, which is also available to them after authentication. The same applies to users with the SUPERVISOR role on a database, who can manage all of its contents.
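The role hierarchy described in this section can be modelled with the following Python sketch, given for a project located in a private database. The action names ("search", "import_run", "import_metadata", "grant_roles") and the function signature are hypothetical labels chosen for illustration; they do not correspond to identifiers in Gigwa.

```python
# Actions granted by each project-level role, as described above.
ROLE_ACTIONS = {
    "READER": {"search"},
    "MANAGER": {"search", "import_run", "import_metadata", "grant_roles"},
}


def is_allowed(action, is_admin=False, supervises_database=False, project_role=None):
    """Tell whether a user may perform an action on a project of a private database."""
    # Administrators and database supervisors hold all permissions.
    if is_admin or supervises_database:
        return True
    return action in ROLE_ACTIONS.get(project_role, set())


print(is_allowed("import_run", project_role="READER"))
print(is_allowed("import_run", project_role="MANAGER"))
```

For projects in public databases, searching is open to anyone, so the READER role adds nothing there.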
B7/ CONFIGURING ADVANCED SETTINGS (FOR SYSTEM ADMINISTRATORS: REQUIRES WRITE PERMISSIONS ON INSTALLED FILES)
Although a Gigwa instance installed via a distribution package is functional out of the box, some configuration settings can only be adjusted by editing text files. Most of them only need to be set once.
B7.1/ Managing data hosts
Declaring MongoDB hosts is done via the WEB-INF/classes/applicationContext-data.xml file following provided examples. Only hosts running with authentication enabled (refer to MongoDB documentation if needed) must be declared along with a UserCredentials bean. Note that Gigwa associates them internally using their IDs: for example, a host named myMongoHost will expect a UserCredentials bean named myMongoHostCredentials. Those credentials must be provided for a user declared in MongoDB's admin collection, who has readWriteAnyDatabase and dbAdminAnyDatabase roles. The web-application needs to be reloaded for such changes to be taken into account (please refer to Tomcat documentation if needed).
B7.2/ Setting configuration properties
The WEB-INF/classes/config.properties file may be used to set values for the following parameters:
dbServerCleanup - Under this property you may specify a CSV list of hosts for which this instance will drop temporary variant collections on startup (e.g. 127.0.0.1:27017, another.server.com:27018). Temporary variant collections are often used once a search has been completed, for browsing/exporting results. They are normally dropped upon user interface unload, but some may remain if the web browser is exited ungracefully or the application goes down while someone is using the search interface. If this property does not exist, the instance will drop all temporary collections it finds; if it exists but is empty, none will be dropped.
adminEmail - If Gigwa is being used as a multi-user data-portal you may specify via this property an email address for users to be able to contact your administrator, including for applying for account creation.
igvDataLoadPort - Defines the port at which IGV listens for data submission. No IGV connection if missing / invalid.
igvGenomeListUrl - Defines the URL from which to get the list of genomes that are available for IGV. No IGV connection if missing / invalid.
sessionTimeout - Web session timeout in seconds. Default: 3600 (1h)
forbidMongoDiskUse - MongoDB's allowDiskUse option will be set to the opposite of this parameter's value when launching aggregation queries. Default: false
tempDbHost - Tells the system which MongoDB host to use when importing temporary databases (for anonymous users). Only used when several hosts have been configured in applicationContext-data.xml. If unspecified all connected hosts will be available for use. If invalid, no import will be possible for users without specific permissions.
maxImportSize - Defines the default maximum allowed size (in megabytes) for genotyping data file imports (capped by the maxUploadSize value set in applicationContext-MVC.xml). Default: 500 MB. NB: Does not apply to administrators (administrators are only limited by maxUploadSize for uploads and are not limited when importing via local or http files).
maxImportSize_anonymousUser - Defines the maximum allowed size (in megabytes) granted to anonymous users for genotyping data file imports. Default: maxImportSize
maxImportSize_USERNAME - Defines the maximum allowed size (in megabytes) granted to the USERNAME user for data file imports. Default: maxImportSize
serversAllowedToImport - CSV list of external servers that are allowed to import genotyping data.
genomeBrowser-DATABASE_NAME - Any property named genomeBrowser-DATABASE_NAME is a way for defining a default genome browser URL for a database called DATABASE_NAME. This is optional as users may define their own genome browser URL, thus overriding the default one if it exists.
onlineOutputTool_N - Any property named onlineOutputTool_N with N being an integer >= 1 is a way of defining an online output tool for datasets exported to server. N accepts consecutive values (if only onlineOutputTool_1 and onlineOutputTool_3 exist then only onlineOutputTool_1 will be taken into account). The property value must consist of semicolon-separated values. The first one is the label to display for this tool, the second one is the tool URL (in which any * character will be replaced at run time with the exported file's URL). The third value is optional and may contain a comma-separated list of file formats (must match some of those that the Gigwa instance is able to export: BED, DARWIN, EIGENSTRAT, FLAPJACK, GFF3, HAPMAP, PLINK, VCF), thus defining those accepted by the tool (if unspecified, files in any format will be made available for this tool).
maxSearchableBillionGenotypes - Defines the maximum estimated size (in billions) of the genotype matrix (#individuals * #markers) within which genotype-level filters may be applied. This property may be tuned according to server performance. #markers is estimated by calculating an average marker count per sequence. Whatever value is set here, Gigwa will at least allow searching on one sequence for all individuals. Default: 1 billion
maxExportableBillionGenotypes - Defines the maximum size (in billions) of the genotype matrix (#individuals * #markers) that may be exported. This property may be tuned according to server performance. It aims at limiting system overhead in situations where numerous users may be working on very large databases. Default: 1 billion
maxExportableBillionGenotypes_anonymousUser - Defines the maximum size (in billions) of the genotype matrix (#individuals * #markers) that may be exported by anonymous users. Set to 0 to prevent genotype export. Default: maxExportableBillionGenotypes
maxExportableBillionGenotypes_USERNAME - Defines the maximum size (in billions) of the genotype matrix (#individuals * #markers) that may be exported by the USERNAME user. Set to 0 to prevent genotype export. Default: maxExportableBillionGenotypes
googleAnalyticsId - If set, a Google Analytics tag is automatically added into the main page.
enforcedWebapRootUrl - In some situations the system needs to provide externally visible file URLs for remote applications to download. In most cases it is able to figure out which base URL to use, but it might also be impossible (for example when a proxy is used to add a https layer). This parameter may then be used to enforce a base-URL. (Required for CAS authentication) Example values: https://secured.server.com/gigwa or http://unsecured.gigwa.server:59395
casServerURL - If defined, enables CAS authentication by defining the CAS server URL (enforcedWebapRootUrl is also required for the CAS server to know how to redirect to Gigwa after login)
casOrganization - Defines the name of the organization providing the CAS authentication. Optional
variantIdLookupMaxSize - Defines the maximum number of variant IDs returned by the text lookup functionality of the "Variant IDs" widget. Default: 50
igvGenomeConfig_N - Properties named igvGenomeConfig_N with N being an integer >= 1 provide sets of online genomes to be displayed as default genome lists for the embedded IGV.js browser. Each property must be in the form igvGenomeConfig_N = Name;URL
dumpFolder - Enables the dump management functionality by defining the location where dumps will be stored on the application-server's filesystem (Requires installing MongoDB Command Line Database Tools on the machine running Tomcat)
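To make the onlineOutputTool_N rules more concrete, here is a Python sketch of how such properties could be interpreted: consecutive numbering starting at 1, semicolon-separated values, an optional format list, and substitution of the * character. It is an illustration based solely on the description above, not Gigwa's actual parsing code; the function names, the tool labels and the example.org URLs are hypothetical, and the substitution is assumed to be a plain text replacement.

```python
def parse_online_output_tools(properties):
    """Read onlineOutputTool_1, _2, ... and stop at the first missing number."""
    tools = []
    n = 1
    while f"onlineOutputTool_{n}" in properties:
        parts = [p.strip() for p in properties[f"onlineOutputTool_{n}"].split(";")]
        label, url_template = parts[0], parts[1]
        # The third value is optional: None means "any format is accepted".
        formats = None
        if len(parts) > 2 and parts[2]:
            formats = [f.strip().upper() for f in parts[2].split(",")]
        tools.append({"label": label, "url": url_template, "formats": formats})
        n += 1
    return tools


def tool_url(tool, exported_file_url):
    """Replace every * in the tool URL with the exported file's URL."""
    return tool["url"].replace("*", exported_file_url)


# Hypothetical configuration: number 2 is missing, so number 3 is ignored.
properties = {
    "onlineOutputTool_1": "My viewer;https://tools.example.org/load?file=*;VCF,PLINK",
    "onlineOutputTool_3": "Ignored tool;https://other.example.org/?u=*",
}

tools = parse_online_output_tools(properties)
print(tools)
print(tool_url(tools[0], "https://gigwa.example.org/exports/data.vcf"))
```

Numbering tools without gaps avoids silently losing entries, as the second property in this example shows.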