leetcode 187 Repeated DNA Sequences
z

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

1
2
3
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

substring + set

  • 直观的来讲,可以直接使用substring来截取每一个subsequence,然后和放入set1set2,直接用(!s1.add(substring) && s2.add(substring)) 来判断。
    • 当第一次处理substring时, 由于set1中没有substring,此时set1.add(subtring)返回true!s1.add(substring)返回false,判断的后面一部分就不会执行,那么set2中并没有放入substring
    • 当第二次处理substring时, !s1.add(subtring) 返回 true,执行 s1.add(substring) 返回true。此时判断总体返回true
1
2
3
4
5
6
7
8
9
10
11
12
13
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
HashSet<String> set = new HashSet<>();
List<String> ans = new ArrayList<>();
for (int i=0; i+9<s.length(); i++) {
String part = s.substring(i, i+10);
if (!set.add(part) && set.add() {
ans.add(part);
}
}
return ans;
}
}
  • 但是这种方法,使用substring的时间复杂度为$O(s.length())$.总体的复杂度实际为$ O(n ^ 2)$

HashSet + bit Operation

  • 由于总共的字符只有四个字符,用二进制编码可以用两比特来表示,因此,可以先对字符进行编码,然后每十个字符转换成一个整数,用这个整数来代替上面的substring操作,这样一来,时间复杂度为$O(kn)$。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
HashSet<String> first = new HashSet<>();
HashSet<String> second = new HashSet<>();
List<String> ans = new ArrayList<>();
for (int i=0; i+9<s.length(); i++) {
String part = s.substring(i, i+10);
if (!first.add(part) && second.add(part)) {
ans.add(part);
}
}
return ans;
}
}