Uploading sequence data to SRA
When submitting a manuscript to a journal, you are typically required to upload your data (usually fastq files) to an online repository, so that other researchers can download your dataset and replicate your analysis.
SRA (Sequence Read Archive) is one of the largest of these repositories.
In this guide, we step through the process of uploading data files that are stored on the HPC to SRA.
Getting started
There is a useful guide and links to uploading your data to SRA here:
You should have your sequence data (usually fastq files, but they can also be bam or other formats) in one directory on the HPC. If the files are spread across several directories, or sit in a directory with other data files of the same type (e.g. other fastq files), create a new directory and copy them there.
The files should be gzipped. Use the command line gzip tool to create *.gz files, if they are not already gzipped.
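As a rough sketch of the two preparation steps above (collecting the files into one directory, then gzipping them), something like the following could be used. The directory and file names (run1, run2, sra_upload, sampleA_R1.fastq) are made up for illustration, and the script builds two dummy fastq files in a temporary directory so that it can be run as-is; substitute your own paths on the HPC.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: two run directories holding uncompressed fastq files.
# Dummy files are created in a temporary directory so this sketch runs as-is.
work=$(mktemp -d)
mkdir -p "$work/run1" "$work/run2"
printf '@read1\nACGT\n+\nIIII\n' > "$work/run1/sampleA_R1.fastq"
printf '@read1\nTGCA\n+\nIIII\n' > "$work/run2/sampleA_R2.fastq"

# 1. Create a dedicated upload directory and copy the files into it.
upload_dir="$work/sra_upload"
mkdir -p "$upload_dir"
cp "$work"/run1/*.fastq "$work"/run2/*.fastq "$upload_dir"/

# 2. Compress every uncompressed fastq file (gzip replaces each file with file.gz).
gzip "$upload_dir"/*.fastq

# 3. Check that each archive is intact before uploading.
gzip -t "$upload_dir"/*.fastq.gz

ls "$upload_dir"
```

Working on copies in a separate directory leaves the original files untouched, which is useful because gzip removes the uncompressed file once it has written the .gz version.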
1. Log in to submission portal wizard
The portal wizard will step you through the upload process.
You will need an NCBI account, and you must be logged in with it to use the wizard. If you don’t have an account, create one now, log in with it, and then click on the above link.
2. Create new submission
Click on the ‘New Submission’ button.
In the ‘Submitter’ tab, enter or update your details (name, organization, etc.).
Click ‘continue’
In the ‘General Information’ tab, select ‘No’ for both the ‘Did you already register a BioProject for this research..?’ and the ‘Did you already register a BioSample for this sample’ questions. You will register your BioProject and BioSample as part of the submission process.
For the ‘When should this submission be released to the public?’ question, you can choose to release the data publicly immediately, or to hold it until a specific date, such as the date the work is published. The right choice depends on the sensitivity of your data.
Click ‘continue’
In the ‘Project Info’ tab, give your project a title and a description. Usually the data is attached to a draft manuscript, so you can give the manuscript title and the abstract.
Most of the other sections in this tab are optional, so fill them in as you see fit.
Click ‘continue’
In the ‘Sample Type’ tab, you need to select a ‘package’ for the organism your samples are from. Choose one of plant, invertebrate, model organism, etc.
Click ‘continue’
In the ‘Biosample attributes’ tab, you can choose to enter your sample information in a built-in table editor, or to upload it as a tab-delimited text file.
Some fields are individually required, while others belong to a group of fields from which you must fill in one. For example, you must fill in the ‘Organism’ field, and you must fill in either the ‘age’ or the ‘development stage’ field.
When you submit this table, you will often get a ‘Your table upload failed because multiple BioSamples cannot have identical attributes’ error. The best way to fix this is to add a new column at the end of the table (‘Add column’), call it ‘Replicates’, and enter any replicate information you have there.
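If you are preparing the table as a tab-delimited text file, a quick check like the one sketched below may help to spot the offending rows before you upload. It assumes, for illustration only, that the first line is a header, that the first column holds the sample name, and that the sample name is not counted when attributes are compared. The column names and values are dummy data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dummy tab-delimited attribute table (hypothetical columns and values).
tsv=$(mktemp)
printf 'sample_name\torganism\ttissue\n' >  "$tsv"
printf 'S1\tZea mays\tleaf\n'            >> "$tsv"
printf 'S2\tZea mays\tleaf\n'            >> "$tsv"
printf 'S3\tZea mays\troot\n'            >> "$tsv"

# Skip the header, drop the sample name column, and report any
# combination of attribute values that occurs more than once.
dups=$(tail -n +2 "$tsv" | cut -f2- | sort | uniq -d)

if [ -n "$dups" ]; then
    echo "Rows with identical attributes:"
    echo "$dups"
else
    echo "No identical attribute rows found."
fi
```

Here S1 and S2 share the same attribute values, so they are reported; giving them different values in a ‘Replicates’ column would make the rows distinct.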
Click ‘continue’
In the ‘SRA metadata’ tab, you can again either submit a tab-delimited text file or fill in a built-in table.
This is where you fill in your sequence data information: for each sample you provided in the previous section, you need to provide the data files (usually fastq.gz files) associated with that sample, along with any library prep and sequencing platform information. Your sequencing provider should be able to provide you with this information.
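To avoid typing file names by hand, you could generate a per-sample listing of the data files and paste it into the metadata table. The sketch below assumes a hypothetical paired-end naming scheme of sample_R1.fastq.gz and sample_R2.fastq.gz; the header names in the listing are placeholders rather than the wizard's own column names, and the empty files exist only so that the script can be run as-is.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dummy upload directory with empty placeholder files (hypothetical names).
dir=$(mktemp -d)
touch "$dir/sampleA_R1.fastq.gz" "$dir/sampleA_R2.fastq.gz" \
      "$dir/sampleB_R1.fastq.gz" "$dir/sampleB_R2.fastq.gz"

# Write one tab-delimited line per sample: sample name, R1 file, R2 file.
listing="$dir/file_listing.tsv"
printf 'sample\tfile1\tfile2\n' > "$listing"
for r1 in "$dir"/*_R1.fastq.gz; do
    sample=$(basename "$r1" _R1.fastq.gz)
    r2="$dir/${sample}_R2.fastq.gz"
    if [ -f "$r2" ]; then
        printf '%s\t%s\t%s\n' "$sample" "$(basename "$r1")" "$(basename "$r2")" >> "$listing"
    else
        # No mate file found: leave the second file column empty.
        printf '%s\t%s\t\n' "$sample" "$(basename "$r1")" >> "$listing"
    fi
done

cat "$listing"
```

The file names written to the listing are base names only, without the directory path.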
Click ‘continue’
3. Uploading files
In the ‘Files’ tab, you can choose your method of uploading. In this guide we assume your data files are on the QUT HPC, so you’ll be uploading them via FTP from the Linux command line.
For the question ‘How do you want to provide files for this submission?’, choose ‘FTP or Aspera Command Line file preload’.
Click on ‘FTP upload instructions’ for details on setting up your upload directory on the SRA servers.
Here you will find the FTP server name, user name and access password provided for your submission.
Usually the FTP server is called ‘ftp-private.ncbi.nlm.nih.gov’, so on the command line connect to it by typing:
ftp ftp-private.ncbi.nlm.nih.gov