Download bacterial genome sequences from the International Nucleotide Sequence Database Collaboration
The genome download service in the Assembly resource makes it easy to download data for multiple genomes without having to write scripts. To use the download service, run a search in Assembly, use facets to refine the set of genome assemblies of interest, open the "Download Assemblies" menu, choose the source database (GenBank or RefSeq), choose the file type, then click the Download button to start the download. An archive file will be saved to your computer that can be expanded into a folder containing the genome data files from your selections.
Simple variations on these steps can be used to obtain different file types or data for different sets of genome assemblies. If "All file types (including assembly structure directory)" is selected from the "File type" menu, the "ncbi-genomes-YYYY-MM-DD" folder will contain a folder for each of the selected genome assemblies containing all the content from the FTP directory for that assembly.
The genome download service is best for small to moderately sized data sets. Selecting very large numbers of genome assemblies may result in a download that takes a very long time, depending on the speed of your internet connection. For very large data sets, scripted downloads using rsync are recommended (see below).
We recommend using the rsync file transfer program from a Unix command line to download large data files because it is much more efficient than older protocols. The next best options for downloading multiple files are to use the HTTPS protocol, or the even older FTP protocol, using a command line tool such as wget or curl. Web browsers are very convenient options for downloading single files even though they will use the FTP protocol because of how our URLs are constructed. Other FTP clients are also widely available but do not all correctly handle the symbolic links used widely on the genomes FTP site (see below).
Replace the "ftp:" at the beginning of the FTP path with "rsync:". E.g. If the FTP path is _001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following rsync command:
Replace the "ftp:" at the beginning of the FTP path with "https:". Also append a '/' to the path if it is a directory. E.g. If the FTP path is _001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:
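The two substitutions described above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function names and the example host and path (`ftp.example.org/...`) are hypothetical placeholders, and the resulting URL would still need to be passed to rsync, wget, or curl.

```python
def to_rsync_url(ftp_path: str) -> str:
    """Replace the leading 'ftp:' with 'rsync:'."""
    if not ftp_path.startswith("ftp:"):
        raise ValueError("expected a path beginning with 'ftp:'")
    return "rsync:" + ftp_path[len("ftp:"):]


def to_https_url(ftp_path: str, is_directory: bool = True) -> str:
    """Replace the leading 'ftp:' with 'https:'; append '/' for directories."""
    if not ftp_path.startswith("ftp:"):
        raise ValueError("expected a path beginning with 'ftp:'")
    url = "https:" + ftp_path[len("ftp:"):]
    if is_directory and not url.endswith("/"):
        url += "/"
    return url


# Hypothetical placeholder path, not a real assembly directory.
example = "ftp://ftp.example.org/genomes/some_assembly_dir"
print(to_rsync_url(example))
print(to_https_url(example))
```

The trailing slash is added only for directories, matching the instruction above; pass `is_directory=False` when the path points at a single file.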
Historically, the genomes FTP site had been populated by different process flows and NCBI working groups leading to undesirable differences in available content and file formats. Also, data for GenBank genomes and RefSeq genomes were located in different areas of the NCBI FTP site that had different organization.
NCBI redesigned the genomes FTP site to expand the content and facilitate data access through an organized predictable directory hierarchy with consistent file names and formats. The site now provides greater support for downloading assembled genome sequences and/or corresponding annotation data with more uniformity across species. The current FTP site structure provides a single entry point to access content representing either GenBank or RefSeq data.
The content of most of the old directories on the site, and the content previously at is no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. Most of the remaining old directories were moved to the archive in March 2020. Details of what FTP directories and files were moved are as follows.
Files for old versions of assemblies will not usually be updated; consequently, most users will want to download data only for the latest version of each assembly. For more information, see "How can I download only the current version of each assembly?".
For some assemblies, both GenBank and RefSeq content may be available. RefSeq genomes are a copy of the submitted GenBank assembly. In some cases the assemblies are not completely identical as RefSeq has chosen to add a non-nuclear organelle unit to the assembly or to drop very small contigs or reported contaminants. Equivalent RefSeq and GenBank assemblies, whether or not they are identical, and RefSeq to GenBank sequence ID mapping, can be found in the assembly report files available on the FTP site or by download from the Assembly resource.
The base structure of the genomes FTP site includes several main directory areas that provide sequence and annotation content, or report files. Sequence and annotation content is further organized by major taxonomic groupings, then by species, then by assembly. Sequence content is defined by the Assembly resource. The genomes FTP site provides directories for:
Assembly directories for all current assemblies, and for many previous assembly versions, include a core set of files and formats plus additional files relevant to the data content of the specific assembly. Directories for old assembly versions that predate the genomes FTP site reorganization contain only the assembly report, assembly stats, and assembly status files. All data files are named according to the pattern: [assembly accession.version]_[assembly name]_content.[format]
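As a rough illustration of the naming pattern, the sketch below assembles a file name from its four parts. The accession, assembly name, content label, and format used here are made-up placeholders chosen only to show the shape of the result; the function name is hypothetical.

```python
def assembly_file_name(accession_version: str, assembly_name: str,
                       content: str, fmt: str) -> str:
    """Build '[assembly accession.version]_[assembly name]_content.[format]'."""
    return f"{accession_version}_{assembly_name}_{content}.{fmt}"


# Placeholder values for illustration only; not a real assembly.
name = assembly_file_name("GCA_000000000.1", "ExampleAsm1", "genomic", "fna.gz")
print(name)
```

Because the content label can itself contain underscores, building a name from known parts is more reliable than trying to split an existing file name back into its components.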
Tab-delimited text file reporting locations and attributes for a subset of annotated features. Included feature types are: gene, CDS, RNA (all types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt and .rnt format files that were provided in the old genomes FTP directories.
GenBank flat file format of the genomic sequence(s) in the assembly. This file includes both the genomic sequence and the CONTIG description (for CON records); hence, it replaces both the .gbk and .gbs format files that were provided in the old genomes FTP directories.
FASTA format of accessioned RNA products annotated on the genome assembly. Provided for RefSeq assemblies as relevant. (Note: RNA and mRNA products are not instantiated as separate accessioned records in GenBank; they are provided for some RefSeq genomes, most notably the eukaryotes.)
Tab-delimited text file reporting hash values for different aspects of the annotation data. The hashes are useful to monitor for when annotation has changed in a way that is significant for a particular use case and warrants downloading the updated records.
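One way to use such hashes is to keep the previously downloaded copy and compare it with the current one. The sketch below assumes a layout of one header line followed by one row of values; that layout and the column names (`aspect_a`, `aspect_b`) are hypothetical, so the parsing would need adjusting to the real file.

```python
def parse_hash_table(text: str) -> dict:
    """Map column names to values from a header line plus one data row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lstrip("# ").split("\t")
    values = lines[1].split("\t")
    return dict(zip(header, values))


def changed_aspects(old: dict, new: dict) -> list:
    """Return the names of columns whose values differ between two copies."""
    return sorted(key for key in new if old.get(key) != new[key])


# Made-up contents standing in for an older and a newer copy of the file.
OLD = "# aspect_a\taspect_b\nhash1\thash2\n"
NEW = "# aspect_a\taspect_b\nhash1\thash9\n"
print(changed_aspects(parse_hash_table(OLD), parse_hash_table(NEW)))
```

A download script could then re-fetch an assembly only when a column relevant to its use case appears in the returned list.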
Assembly directories for RefSeq genomes annotated by the NCBI Eukaryotic Genome Annotation Pipeline include extra sub-directories and files in addition to the standard set of files and formats. All data files are named according to the pattern: [assembly accession.version]_[assembly name]_content.[format]
Alignments of the annotated Known RefSeq transcripts (identified with accessions prefixed with NM_ and NR_) to the genome in BAM format [not all annotation releases have Known RefSeq transcripts]. For more information about the BAM format see: -specs/SAMv1.pdf.
Alignments of the annotated Model RefSeq transcripts (identified with accessions prefixed with XM_ and XR_) to the genome in BAM format. For more information about the BAM format see: -specs/SAMv1.pdf.
Genome Workbench project file for visualization and search of differences between the current and previous annotation releases. The NCBI Genome Workbench web site provides help on downloading and using the 64-bit version of Genome Workbench.
This file is the XML version of the HTML report for the organism, e.g. www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/108/. It contains information on the annotation release, including:
Important dates associated with the annotation
Assemblies
Gene and feature statistics
Masking results
Transcript and protein alignments used for the annotation
Assembly-assembly alignments used to track genes from the previous assembly to the current, or from the reference to an alternate assembly if relevant
Assembly directory: One directory for each genome assembly that was annotated in the release. Named as [assembly accession.version]_[assembly name]. This directory contains the files provided for all genome assemblies plus those additional files provided for organisms annotated by the NCBI Eukaryotic Genome Annotation Pipeline.
There can be many different genome assemblies available for species with medical, agricultural or scientific relevance. The Genus_species directories under the "genbank" and "refseq" directory trees each contain an assembly_summary.txt file that provides general information on all assembly versions included in the directory, such as release date, submitter organization, assembly level and status. See for example _islandicus/assembly_summary.txt
Alternatively, any assemblies that the NCBI Reference Sequence (RefSeq) group has selected to be reference or representative genomes can be readily accessed via the directories named "reference" or "representative" in the Genus_species directories under the "genbank" and "refseq" directory trees.
Only FTP files for the "latest" version of an assembly are updated when annotation is updated, new file formats are added or improvements to existing formats are released. Consequently, most users will want to download data only for the latest version of each assembly. You can select data from only the latest assemblies in several ways:
Alternatively, the assembly summary report files provide information that can be used to identify a set of assemblies of interest along with their FTP file paths. For example, to obtain the GenBank flat file format annotation for all complete bacterial genomes in the NCBI Reference Sequences collection (RefSeq):
Variants of these instructions can be used to download all draft bacterial genomes in RefSeq (assembly_level is not "Complete Genome"), all RefSeq reference or representative bacterial genomes (refseq_category (column 5) is "reference genome" or "representative genome"), etc.
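The selection logic can be sketched in Python as follows. This assumes the summary file is tab-delimited with a `#`-prefixed header line naming its columns; `assembly_level` and `refseq_category` come from the text above, while `version_status` and `ftp_path` are assumed column names. The sample rows and paths are invented placeholders, and the real file has many more columns.

```python
def read_summary(text: str) -> list:
    """Parse a tab-delimited summary into dicts keyed by header column name."""
    header, rows = None, []
    for line in text.splitlines():
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_level" in fields:  # the comment line that names columns
                header = fields
            continue
        if header is None or not line.strip():
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def complete_latest_paths(rows: list) -> list:
    """Keep FTP paths of latest-version assemblies at 'Complete Genome' level."""
    return [r["ftp_path"] for r in rows
            if r.get("assembly_level") == "Complete Genome"
            and r.get("version_status") == "latest"]


# Invented sample with a reduced column set, for illustration only.
SAMPLE = (
    "# assembly_accession\trefseq_category\tversion_status\tassembly_level\tftp_path\n"
    "GCF_000000001.1\treference genome\tlatest\tComplete Genome\tftp://ftp.example.org/a\n"
    "GCF_000000002.1\tna\tlatest\tScaffold\tftp://ftp.example.org/b\n"
    "GCF_000000003.1\tna\treplaced\tComplete Genome\tftp://ftp.example.org/c\n"
)
print(complete_latest_paths(read_summary(SAMPLE)))
```

Looking columns up by header name rather than by position keeps the filter working if columns are added; swapping the test on `assembly_level` for one on `refseq_category` gives the reference or representative variant.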
NCBI has traditionally used a compound FASTA sequence identifier string in which multiple IDs were separated by '|' characters. This format provides more information but requires that the individual sequence identifiers be parsed out of the compound string. The FASTA files on the redesigned genomes FTP site have a simple sequence identifier string that is just the sequence accession.version, for example:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome
>NC_000001.11 Homo sapiens chromosome 1, GRCh38 Primary Assembly
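With the simple identifier, reading the accession.version from a definition line needs no special parsing. A minimal sketch, assuming the identifier is everything between `>` and the first space (the function name is hypothetical):

```python
def parse_defline(line: str) -> tuple:
    """Split a FASTA definition line into (accession.version, description)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    seq_id, _, description = line[1:].strip().partition(" ")
    return seq_id, description


defline = ">U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome"
seq_id, description = parse_defline(defline)
accession, _, version = seq_id.rpartition(".")
print(seq_id, accession, version)
```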