PrimerDigital


Import sequence(s), about Sequence Formats

Sequence Entry

Sequence data files are prepared using a text editor (Notepad, WordPad, MS Word), and saved in ASCII as text/plain format (.txt) or in .rtf.
The programme takes either a single sequence or accepts multiple separate DNA sequences in FASTA, tabulated format (two columns from MS Excel sheet or MS Word table), EMBL, MEGA, GenBank, MSF, DIALIGN, simple alignment, or BLAST Queue web alignment formats. The template length is not limited.
The FastPCR clipboard allows the user to copy and paste text or tables from MS Word documents or MS Excel worksheets or other programmes and to paste them into another Office document. It is important that all target sequences are prepared in the same format. Users can type or import from file(s) into "General Sequence(s)" or "Additional sequence(s) or pre-designed primers (probes) list" editors.

FastPCR allows files to opened in several ways: the original file can opened as read-only for editing with text editors; files can be opened to memory without using text editors, which allows larger file(s), up to 200Mb, to be analysed; files within a folder can be selected and the files opened during task execution without the use of text editor programme. Additionally, the programme can open files within a selected folder in order to join all these files in a text editor. For example, this feature can be applied to convert all files from a selected folder into a single file of FASTA sequences. Alternatively this feature allows splitting FASTA sequences to individual files in a particular selected folder.

When a sequence file is opened, FastPCR displays the information about the opened sequence and its format. The information status bar shows the number of sequences, the total sequence length (in nucleotides), the nucleotide composition and the purine, pyrimidine, CG percentage and the melting temperature. Files can be saved various formats including .rtf, .xls, or .txt from the text editor in use.

  • openOpen … (Ctrl-O) – open text or RTF file with DNA or protein sequence(s) at Fasta, GenBank, Mega, Blast etc. formats into current open TAB editor (general, additional or result)
  • open Open Large FASTA File into Memory (Ctrl-G) – is quickest way to open huge chromosome size single (like human chromosome 3, max 200Mb) or multiple sequence(s) to memory
  • Load All Files from Folder (Ctrl-J) – this is useful way opening all text files from one folder at once. You need click on any file from folder and will see result. Especially this is reliable for read all sequences from different files. Each file will converted to FASTA format with name of the file.
Software opened any text file with command line: fastpcr.exe filename.fasta





FastPCR normally expect to read sequence files in FASTA format.
Import sequences are not different for two editors: both editors take all ways, with keyboard (Shit-Insert, Ctrl-V) and allowed work with right-click mouse displays a contextual menu. Importing sequences must be at the same format.

Degenerate DNA sequences are accepted as IUPAC code is an extended vocabulary of 11 letters which allows the description of ambiguous DNA code.

Each letter represents a combination of one or several nucleotides: M=(A/C) R=(A/G) W=(A/T) S=(G/C) Y=(C/T) K=(G/T) V=(A/G/C) H=(A/C/T) D=(A/G/T) B=(C/G/T) N=(A/G/C/T), U=T and I (Inosine).

Accepted amino acid codes: A(Ala), C(Cys), D(Asp), E(Glu), F(Phe), G(Cly), H(His), I(Ile), K(Lys), L(Leu), M(Met), N(Asn), P(Pro), Q(Gln), R(Arg), S(Ser), T(Thr), U(Sec), V(Val), W(Trp), Y(Tyr).

Sequence formats are simply the way in which the amino acid or DNA sequence is recorded in a computer file.

When sequences are imported you may edit the sequences in general or additional editors and immediately visualize the result of editing. You can modify a nucleotide sequences by inserting, deleting and replacing sequence fragments.


Raw format (ASCII)

Like a text/plain format without white space and TABs. It read only standard IUB/IUPAC amino acid or nucleic acid codes characters and rejects anything else, low- and upper-case insensitive. Digits or else are removed and ignored (but Tab and space characters with combination end line character (Enter press) can be interpreted as column format). Here are some examples of raw formatted sequence:
ataaattcttattttgacactcaccaaaatagtcacctggaaaacccgctttttgtgaca


FASTA format description

FASTA format have a highest priority and is simple as the raw sequence proceeded by definition line. The definition line begins with a “>” sign and optionally followed immediately by a name for the sequence with using any length and amount of words. Many sequences can be listed in the file, the format indicating a new sequence at each new “ >” symbol found. It is important to press Enter at the end of each line after “>” to help FastPCR recognize the end and beginning of sequence and sequence’s name. Make sure the first line starts with a “>” and has (has not) a header description.
The description must be contained within one line and not run into 2 or more lines. The sequence starts directly on next line. As for the previous raw data format, sequences must be in the standard IUB/IUPAC amino acid or nucleic acid codes, any other characters - digits, spaces, TAB characters or else are ignored, low- and upper-case insensitive:
>
cggccgagatcaggcgatgcatg

>
acgacgacgcagctatattacag


the alignment sequences in FASTA format will read only standard IUB/IUPAC amino acid or nucleic acid codes characters and rejects anything else, low- and upper-case insensitive. Sequence alignment (PHYLIP, NEXUS and else) is not necessary to reformat into FASTA format.


Tables format description

You can directly import the table from text file or from the clipborad via copy and paste operations from Microsoft Word or Excel sheet (or OpenOffice), or primer’s list from FastPCR's "PCR design result", or the table with TAB or whitespace separators. Software reads only first two columns with names and sequences:
FastPCR
To check the correct format was read, look at the information under text editor in the status bar:
As for the previous raw data format, sequences must be in the standard IUB/IUPAC amino acid or nucleic acid codes, any other characters - digits, spaces or else are ignored, low- and upper-case insensitive. Tab character or spaces are used for recognition columns. Other simple table format is with or without name for primers (probes or else); name is replaced by single space (space inside sequence not allowed) and the end of each sequence, press Enter is necessary:
  •  acgaatcgtattcaagcctgc
  •  gcgtcatctggctgctacctcga
  •  cgagcttagtcttcaacgccaa
  •  agaggacgctcgtgtctttcggac
  •  gctcacgtcaaagtcttgtccgag
In case using sequence’s name, no space inside names and sequences are allowed:
  1. acgaatcgtattcaagcctgc
  2. gcgtcatctggctgctacctcga
  3. cgagcttagtcttcaacgccaa
  4. agaggacgctcgtgtctttcggac

Software always indexing each sequence from 1 to N, therefore doesn’t matter if some sequence’s name are the same or absent:
1 acg aat cgt att caa gcc tgc
 ccg tca tct ggc tgc tac ctc ga
 cga gct agt ctt caa cgc caa
1 aga gga cgc tcg tgt ctt tcg gac
  •  Press at tool bar for converting sequences into FASTA format, for checking correct format reading. The user can individually determine which of the sequences to be translated into complementary. For it is necessary to insert @ character at the beginning of the sequence name.
  •  Press at tool bar for converting sequences to IUB/IUPAC FASTA format; at tool bar for converting sequences to original format with FASTA “>” sign.

GCG/MSF Format

The file may begin with as many lines of comment or description as required. The comments are terminated with a line containing only two slashes.
The first mandatory line that is recognised as part of the MSF file contains the text "MSF:"; this line also includes the sequence length and type, the date and an internal check sum value. There then follows one line per sequence describing the sequence name, length, checksum and a weight value. Only one name per line is allowed; the qualifier "Name: " is followed by the sequence name. Extra characters, between the sequence names and "Len: " are acceptable if they contain no blank characters. Another blank line is added followed by a line starting with two slashes "//", this indicates the end of the name list. There then follows another blank line. Sequences are interleaved on separate lines with gaps represented by periods. Each sequence line starts with the sequence name which is separated from the aligned sequence residues by white space.


Simple Alignment format description

The first mandatory line that is recognised as part of the Simple Alignment file contains the text "Alignment” (case not sensitive):

alignment
 
Uni_5SrRNA        1  GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Trifo      1  AGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Ginkg      1  GGGTGCGATC ATACCAGCGT TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Metas      1  GGGTGCGATC ATACCAGCGT TAGTGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Spina      1  GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Goss2      1  GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCACA
5SrRNA_Gossy      1  GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Gnetu      1  GGGTGCGATA ATACCACCGC TAACGTATCG GATCCGATCA GAACTCCGTA
5SrRNA_Citru      1  GGGTGCGACC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Lactu      1  GTGTGCGATC ATACCAGCAC TAATGCACTG AATCCCATCA GAACTCTGCA
5SRRNA_ALFAL      1  AGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA


GenBank Format and EMBL Format

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Although there are daily exchanges of information with the EMBL Nucleotide Sequence Database, it has its own sequence format. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features identifying coding regions and other sites of biological significance (such as transcription units, sites of mutations or modifications, and repeats). Protein translations for coding regions are included in the feature table. Bibliographic references are included along with a link to the Medline unique identifier for all published sequences. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry.


MEGA format description

All input data files are basic ASCII-text files, which may contain DNA sequence, protein sequence, evolutionary distance, or phylogenetic tree data. Most word processing packages (Notepad) allow you to edit and save ASCII text files. These are usually marked with a .TXT or .MEG extensions.
However, there are a number of features that are common to all MEGA data files, which are as follows.
The first line must contain the keyword #MEGA to indicate that the data file is in the MEGA format.



DIALIGN format description

The first mandatory line that is recognised as part of the Simple Alignment file contains the text " DIALIGN” (case not sensitive).



Blast Queue WEB alignments result format

This format allowed reading and joining of all “Sbjct” sequences from BLAST result at Internet browsers.
First, you need “select all” (Ctrl-A) text with graphics in the Internet browser page, then copy and paste to FastPCR. If FastPCR don’t recognize directly the format of Internet browser page, you need “select all” (Ctrl-A) text with graphics in the Internet browser page, then copy and paste to Notepad and then repeat the same for this text from Notepad: select all and copy, and after this paste to FastPCR.

Press at tool bar for converting sequences to “intelligent” FASTA format in with saving “-“ from alignment. Sequences in Blast Query Web format are NOT analysed and all non IUB/IUPAC codes are conserved. Sequences in this formatted Web or text files are preceded by a line starting with a ‘>’ symbol, containing the name and labels of the sequence.