PrimerDigital


Import sequence(s), about Sequence Formats


Software can open work with .RTF or .TXT or .FASTA (raw text) formatted files. You can associate a file type; make a file type always open by FastPCR.
  • Open … (Ctrl-O) – open text or RTF file with DNA or protein sequence(s) at Fasta, GenBank, Mega, Blast etc. formats into current open TAB editor (general, additional or result)
  •  Open Large FASTA File into Memory (Ctrl-G) – is quickest way to open huge chromosome size single (like human chromosome 3, max 200Mb) or multiple sequence(s) to memory
  • Load All Files from Folder (Ctrl-J) – this is useful way opening all text files from one folder at once. You need click on any file from folder and will see result. Especially this is reliable for read all sequences from different files. Each file will converted to FASTA format with name of the file.
Software opened any text file with command line: fastpcr.exe filename.fasta




FastPCR normally expect to read sequence files in FASTA format.
Prepare your sequence data file using a text editor (Notepad, WordPad, Word), and save in ASCII text format (plain text) or Rich Text Format (.RTF).
You can type in FastPCR editors or import nucleotide (protein) sequence(s) from file or from the clipboard as simple text, RTF text, or FASTA format, MEGA, or alignment for BLAST, EMBL result searching, Dialign or MSF alignment, Excel sheet or Word table (two columns), the table with TAB or whitespace separators to general text editor.

Import sequences are not different for two editors: both editors take all ways, with keyboard (Shit-Insert, Ctrl-V) and allowed work with right-click mouse displays a contextual menu. Importing sequences must be at the same format.

Degenerate DNA sequences are accepted as IUPAC code is an extended vocabulary of 16 letters which allows the description of ambiguous DNA code.
Each letter represents a combination of one or several nucleotides: M=(A/C) R=(A/G) W=(A/T) S=(G/C) Y=(C/T) K=(G/T) V=(A/G/C) H=(A/C/T) D=(A/G/T) B=(C/G/T) N=(A/G/C/T), U=T and I (Inosine).

Accepted amino acid codes: A(Ala), C(Cys), D(Asp), E(Glu), F(Phe), G(Cly), H(His), I(Ile), K(Lys), L(Leu), M(Met), N(Asn), P(Pro), Q(Gln), R(Arg), S(Ser), T(Thr), V(Val), W(Trp), Y(Tyr).

Sequence formats are simply the way in which the amino acid or DNA sequence is recorded in a computer file.

When sequences are imported you may edit the sequences in general or additional editors and immediately visualize the result of editing. Press F12 switch on-typing sequence interpretation to on or off. You can modify a nucleotide sequences by inserting, deleting and replacing sequence fragments.


Raw format (ASCII)

Like a text/plain format without white space and TABs. It read only standard IUB/IUPAC amino acid or nucleic acid codes characters and rejects anything else, low- and upper-case insensitive. Digits or else are removed and ignored (but Tab and space characters with combination end line character (Enter press) can be interpreted as column format). Here are some examples of raw formatted sequence: ataaattcttattttgacactcaccaaaatagtcacctggaaaacccgctttttgtgaca


FASTA format description

FASTA format have a highest priority and is simple as the raw sequence proceeded by definition line. The definition line begins with a “>” sign and optionally followed immediately by a name for the sequence with using any length and amount of words. Many sequences can be listed in the file, the format indicating a new sequence at each new “>” symbol found. It is important to press Enter at the end of each line after “>” to help FastPCR recognize the end and beginning of sequence and sequence’s name. Make sure the first line starts with a “>” and has (has not) a header description.
The description must be contained within one line and not run into 2 or more lines. The sequence starts directly on next line. As for the previous raw data format, sequences must be in the standard IUB/IUPAC amino acid or nucleic acid codes, any other characters - digits, spaces, TAB characters or else are ignored, low- and upper-case insensitive:
>
cggccgagatcaggcgatgcatg

>
acgacgacgcagctatattacag


the alignment sequences in FASTA format will read only standard IUB/IUPAC amino acid or nucleic acid codes characters and rejects anything else, low- and upper-case insensitive. Sequence alignment (PHYLIP, NEXUS and else) is not necessary to reformat into FASTA format.


Tables format description

You can directly import the table from text file or from the clipborad via copy and paste operations from Microsoft Word or Excel sheet (or OpenOffice), or primer’s list from FastPCR's "PCR design result", or the table with TAB or whitespace separators. Software reads only first two columns with names and sequences:
1F1_234-253 agggagtagcttacctcgct
2F1_263-283 gcgaaaaccaagtgcttacct
3F1_290-313 tcctcaagcgaaaaccaatccaca
4F1_318-338 tgttcacatgtttggggacga
5F1_545-564 gcttgtaaggcaaacccaca
6F1_606-625 acgtggtactcatggtgtca
7F1_668-689 cccaacggtttacctcaagggt
8F1_1071-1093 tcgcgaccttatgagaacgctgc
9F1_1112-1131 aagcagcgaccgacgaaacc

To check the correct format was read, look at the information under text editor in the status bar:
 
As for the previous raw data format, sequences must be in the standard IUB/IUPAC amino acid or nucleic acid codes, any other characters - digits, spaces or else are ignored, low- and upper-case insensitive. Tab character or spaces are used for recognition columns. Other simple table format is with or without name for primers (probes or else); name is replaced by single space (space inside sequence not allowed) and the end of each sequence, press Enter is necessary:
  •  acgaatcgtattcaagcctgc
  •  gcgtcatctggctgctacctcga
  •  cgagcttagtcttcaacgccaa
  •  agaggacgctcgtgtctttcggac
  •  gctcacgtcaaagtcttgtccgag
In case using sequence’s name, no space inside names and sequences are allowed:
  1. acgaatcgtattcaagcctgc
  2. gcgtcatctggctgctacctcga
  3. cgagcttagtcttcaacgccaa
  4. agaggacgctcgtgtctttcggac

Software always indexing each sequence from 1 to N, therefore doesn’t matter if some sequence’s name are the same or absent:
1 acg aat cgt att caa gcc tgc
 ccg tca tct ggc tgc tac ctc ga
 cga gct agt ctt caa cgc caa
1 aga gga cgc tcg tgt ctt tcg gac

  •  Press at tool bar for converting sequences into FASTA format, for checking correct format reading.
  •  Press at tool bar for converting sequences to IUB/IUPAC FASTA format; at tool bar for converting sequences to original format with FASTA “>” sign.

GCG/MSF Format

The file may begin with as many lines of comment or description as required. The comments are terminated with a line containing only two slashes.
The first mandatory line that is recognised as part of the MSF file contains the text "MSF:"; this line also includes the sequence length and type, the date and an internal check sum value. There then follows one line per sequence describing the sequence name, length, checksum and a weight value. Only one name per line is allowed; the qualifier "Name: " is followed by the sequence name. Extra characters, between the sequence names and "Len: " are acceptable if they contain no blank characters. Another blank line is added followed by a line starting with two slashes "//", this indicates the end of the name list. There then follows another blank line. Sequences are interleaved on separate lines with gaps represented by periods. Each sequence line starts with the sequence name which is separated from the aligned sequence residues by white space.


Simple Alignment format description

The first mandatory line that is recognised as part of the Simple Alignment file contains the text "Alignment” (case not sensitive):

alignment
 
Uni_5SrRNA         1   GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Trifo       1   AGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Ginkg       1   GGGTGCGATC ATACCAGCGT TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Metas       1   GGGTGCGATC ATACCAGCGT TAGTGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Spina       1   GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Goss2       1   GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCACA
5SrRNA_Gossy       1   GGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Gnetu       1   GGGTGCGATA ATACCACCGC TAACGTATCG GATCCGATCA GAACTCCGTA
5SrRNA_Citru       1   GGGTGCGACC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA
5SrRNA_Lactu       1   GTGTGCGATC ATACCAGCAC TAATGCACTG AATCCCATCA GAACTCTGCA
5SRRNA_ALFAL       1   AGGTGCGATC ATACCAGCAC TAATGCACCG GATCCCATCA GAACTCCGCA


GenBank Format and EMBL Format

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Although there are daily exchanges of information with the EMBL Nucleotide Sequence Database, it has its own sequence format. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features identifying coding regions and other sites of biological significance (such as transcription units, sites of mutations or modifications, and repeats). Protein translations for coding regions are included in the feature table. Bibliographic references are included along with a link to the Medline unique identifier for all published sequences. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry.


MEGA format description

All input data files are basic ASCII-text files, which may contain DNA sequence, protein sequence, evolutionary distance, or phylogenetic tree data. Most word processing packages (Notepad) allow you to edit and save ASCII text files. These are usually marked with a .TXT or .MEG extensions.
However, there are a number of features that are common to all MEGA data files, which are as follows.
The first line must contain the keyword #MEGA to indicate that the data file is in the MEGA format.



DIALIGN format description

The first mandatory line that is recognised as part of the Simple Alignment file contains the text " DIALIGN” (case not sensitive).



Blast Queue WEB alignments result format

This format allowed reading and joining of all “Sbjct” sequences from BLAST result at Internet browsers.
First, you need “select all” (Ctrl-A) text with graphics in the Internet browser page, then copy and paste to FastPCR. If FastPCR don’t recognize directly the format of Internet browser page, you need “select all” (Ctrl-A) text with graphics in the Internet browser page, then copy and paste to Notepad and then repeat the same for this text from Notepad: select all and copy, and after this paste to FastPCR.

Press at tool bar for converting sequences to “intelligent” FASTA format in with saving “-“ from alignment. Sequences in Blast Query Web format are NOT analysed and all non IUB/IUPAC codes are conserved. Sequences in this formatted Web or text files are preceded by a line starting with a ‘>’ symbol, containing the name and labels of the sequence.