使用字典树进行关键词查找

2021-03-19 数据结构和算法 0 评论

起因

要写一个敏感词过滤，开始想着就写个循环就行了，但发现敏感词还挺多，而且会越来越多，循环的效率就太差了。于是就写个简单的字典树来做关键词查找。可以用于敏感词查询，也可以用于大量关键词查找场景，比如直播弹幕的关键词查找。

代码

目前只实现了正序查找第一个匹配的关键词，后面再增加其它。

代码放到了github上：https://github.com/ligb888/trie-tree

实现代码(包含了返回找到关键词和只判断不返回)：

public class TrieTreeUtil {

    private final HashMap<Character, TrieNode> children = new HashMap<>();

    public void add(String word){
        if(StringUtils.isBlank(word) || StringUtils.isBlank(word.trim())){
            return;
        }
        word = word.trim().replace(" ", "");

        char[] keys = word.toCharArray();
        HashMap<Character, TrieNode> point = children;
        for (int i = 0; i < keys.length; i++) {
            TrieNode node = point.get(keys[i]);
            if (node == null) {
                node = new TrieNode();
                point.put(keys[i], node);
            }
            if (i == (keys.length - 1)) {
                node.setWord(word);
                break;
            }
            if (node.getChildren() == null) {
                node.setChildren(new HashMap<>());
            }
            point = node.getChildren();
        }
    }

    public String contains(String word){
        if(StringUtils.isBlank(word) || StringUtils.isBlank(word.trim())){
            return null;
        }
        word = word.trim().replace(" ", "");
        char[] keys = word.toCharArray();
        for(int i = 0; i < keys.length; i++){
            String ret = contains(keys, i, children);
            if(ret != null){
                return ret;
            }
        }
        return null;
    }

    private String contains(char[] keys, int begin, HashMap<Character, TrieNode> children){
        for(int i = begin; i < keys.length; i++){
            TrieNode node = children.get(keys[i]);
            if (node == null) {
                return null;
            }
            if(node.getWord() != null){
                return node.getWord();
            }
            children = node.getChildren();
        }
        return null;
    }

    public HashMap<Character, TrieNode> getChildren() {
        return children;
    }
}

TrieNode：

public class TrieNode implements Serializable {

    private String word;
    private HashMap<Character, TrieNode> children;

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public HashMap<Character, TrieNode> getChildren() {
        return children;
    }

    public void setChildren(HashMap<Character, TrieNode> children) {
        this.children = children;
    }
}

效率对比

测试设备：2014年macbook pro，i5 4258U，8G

测试参数1：2500关键词，10万次查询：

样本类型	for循环	字典树
前匹配-长度20	1308毫秒	96毫秒
不匹配-长度20	14186毫秒	149毫秒
不匹配-长度100	32715毫秒	556毫秒

测试参数2：56000关键词，10万次查询：

样本类型	for循环	字典树
前匹配-长度20	73088毫秒	132毫秒
不匹配-长度20	225387毫秒	210毫秒
不匹配-长度100	452902毫秒	549毫秒

可以看到在关键词较多时，循环的效率下降非常快，字典树的效率几乎不受关键词数量影响，但他们都受到文本长度影响。

本文链接： http://16384.net/20210319140024/

版权声明： 本博客所有文章除特别声明外，均采用 CC BY 4.0 CN协议许可协议。转载请注明出处！

李广彬前后端 & Linux

技术无止境