skbio.sequence.Protein.iter_contiguous

Protein.iter_contiguous(included, min_length=1, invert=False)

Yield contiguous subsequences based on included.

State: Stable as of 0.4.0.

Parameters:

included : 1D array_like (bool) or iterable (slices or ints)

included is transformed into a flat boolean vector in which each position is either included or skipped. Each run of contiguous included positions is yielded as a single region.

min_length : int, optional

The minimum length of a subsequence for it to be yielded. Default is 1.

invert : bool, optional

Whether to invert included such that it describes what should be skipped instead of included. Default is False.

Notes

If slices provide adjacent ranges (one slice stops where the next starts), they are treated as a single contiguous subsequence.
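The behavior described above can be illustrated with a minimal pure-Python sketch. This is not scikit-bio's implementation; the helper names to_mask and contiguous_spans are hypothetical, and the sketch works on plain lists and strings rather than sequence objects. It only shows how a boolean vector or a collection of slices and ints can be flattened into one mask, optionally inverted, and split into contiguous runs that are filtered by min_length.

```python
def to_mask(included, length):
    """Flatten `included` into a list of booleans (hypothetical helper)."""
    items = list(included)
    # Assumption: a full-length list of bools is already a boolean vector.
    if len(items) == length and all(isinstance(x, bool) for x in items):
        return items
    mask = [False] * length
    for item in items:
        if isinstance(item, slice):
            for i in range(*item.indices(length)):
                mask[i] = True
        else:
            mask[item] = True  # a single integer position
    return mask


def contiguous_spans(included, length, min_length=1, invert=False):
    """Yield (start, stop) for each run of included positions."""
    mask = to_mask(included, length)
    if invert:
        mask = [not m for m in mask]
    start = None
    # The trailing False is a sentinel that closes a run ending at the last position.
    for i, keep in enumerate(mask + [False]):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            if i - start >= min_length:
                yield (start, i)
            start = None


seq = 'AAA--TT-CCCC-G-'
gaps = [c == '-' for c in seq]

# invert=True turns "positions that are gaps" into "positions to skip".
ungapped = [seq[a:b] for a, b in
            contiguous_spans(gaps, len(seq), min_length=2, invert=True)]

# slice(0, 3) and slice(3, 5) are adjacent, so they form one run.
adjacent = list(contiguous_spans([slice(0, 3), slice(3, 5), slice(7, 9)],
                                 len(seq)))
```

In this sketch, ungapped holds the runs of at least two ungapped characters, and adjacent holds two spans rather than three because the first two slices touch.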

Examples

Here we use iter_contiguous to find all of the contiguous ungapped sequences using a boolean vector derived from our DNA sequence.

>>> from skbio import DNA
>>> s = DNA('AAA--TT-CCCC-G-')
>>> no_gaps = ~s.gaps()
>>> for ungapped_subsequence in s.iter_contiguous(no_gaps,
...                                               min_length=2):
...     print(ungapped_subsequence)
AAA
TT
CCCC

Note how the last potential subsequence (G) was skipped because its length of 1 is smaller than our min_length, which was set to 2.

We can also use iter_contiguous on a generator of slices, such as the one produced by find_motifs (and find_with_regex).

>>> from skbio import Protein
>>> s = Protein('ACDFNASANFTACGNPNRTESL')
>>> for subseq in s.iter_contiguous(s.find_motifs('N-glycosylation')):
...     print(subseq)
NASANFTA
NRTE

Note how the first subsequence contains two N-glycosylation sites. This happened because the two sites are adjacent, so they were merged into one contiguous subsequence.
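To see why the two sites end up in one subsequence, the following sketch reproduces the result with the standard library only. It assumes the common N-glycosylation pattern N-{P}-[ST]-{P}, written here as the regular expression N[^P][ST][^P]; the exact pattern scikit-bio uses for the 'N-glycosylation' motif may differ, and the merging loop is an illustration rather than the library's code.

```python
import re

seq = 'ACDFNASANFTACGNPNRTESL'

# Assumed motif pattern: N, anything but P, S or T, anything but P.
pattern = re.compile(r'N[^P][ST][^P]')

# (start, stop) of each non-overlapping match, in order of appearance.
spans = [m.span() for m in pattern.finditer(seq)]

# Merge spans that touch or overlap into single contiguous regions.
merged = []
for start, stop in spans:
    if merged and start <= merged[-1][1]:
        merged[-1] = (merged[-1][0], max(stop, merged[-1][1]))
    else:
        merged.append((start, stop))

regions = [seq[a:b] for a, b in merged]
print(spans)    # three motif matches; the first two share a boundary
print(regions)  # the adjacent matches collapse into one region
```

Under that assumed pattern, the first match stops at the position where the second one starts, which is why the two are reported as one region.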