leetcode 187 Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

1
2
3

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

substring + set

直观的来讲，可以直接使用substring来截取每一个subsequence，然后和放入set1 和 set2，直接用(!s1.add(substring) && s2.add(substring)) 来判断。
- 当第一次处理substring时，由于set1中没有substring，此时set1.add(subtring)返回true；!s1.add(substring)返回false，判断的后面一部分就不会执行，那么set2中并没有放入substring。
- 当第二次处理substring时， !s1.add(subtring) 返回 true，执行 s1.add(substring) 返回true。此时判断总体返回true。

public class Solution {
  public List<String> findRepeatedDnaSequences(String s) {
    HashSet<String> set = new HashSet<>();
    List<String> ans = new ArrayList<>();
    for (int i=0; i+9<s.length(); i++) {
      String part = s.substring(i, i+10);
      if (!set.add(part) && set.add() {
        ans.add(part);
      }
    }
    return ans;
  }
}

但是这种方法，使用substring的时间复杂度为$O(s.length())$.总体的复杂度实际为$ O(n ^ 2)$

HashSet + bit Operation

由于总共的字符只有四个字符，用二进制编码可以用两比特来表示，因此，可以先对字符进行编码，然后每十个字符转换成一个整数，用这个整数来代替上面的substring操作，这样一来，时间复杂度为$O(kn)$。

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        HashSet<String> first = new HashSet<>();
        HashSet<String> second = new HashSet<>();
        List<String> ans = new ArrayList<>();
        for (int i=0; i+9<s.length(); i++) {
          String part = s.substring(i, i+10);
          if (!first.add(part) && second.add(part)) {
            ans.add(part);
          }
        }
        return ans;
    }
}