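This post grew out of a bug in Kylin. While analyzing and fixing it, I tried several other frameworks for parsing SQL. I am recording that here, along with a summary of ways to strip comments from SQL.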

The problem in Kylin

In its query module, Kylin tries to remove comments from SQL with regular expressions. The source is:

public static String removeCommentInSql(String sql1) {
    // match two patterns, one is "-- comment", the other is "/* comment */"	
    final String[] commentPatterns = new String[]{"--(?!.*\\*/).*?[\r\n]", "/\\*(.|\r|\n)*?\\*/"};
    for (int i = 0; i < commentPatterns.length; i++) {
        sql1 = sql1.replaceAll(commentPatterns[i], "");
    }
    sql1 = sql1.trim();
    return sql1;
}

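This method does not account for quoted strings in SQL, delimited by `, " or ', which may themselves contain text that looks like a comment. For example: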

select * from a where price='2012--12-14'
select * from a where price="/* this is not comment */"
SELECT * from a WHERE `--test` is not null
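
To make the misfire concrete, here is a small self-contained sketch that runs Kylin's method (copied from above) over the three statements. The class name RemoveCommentRegexDemo is only for illustration and is not part of Kylin. One detail worth noting: the first pattern only matches when a line break follows the "--", so the inputs below are given a second line or a trailing newline to trigger it.

```java
final class RemoveCommentRegexDemo {

    // Copied from the Kylin snippet above, unchanged in behavior.
    static String removeCommentInSql(String sql1) {
        final String[] commentPatterns =
                new String[]{"--(?!.*\\*/).*?[\r\n]", "/\\*(.|\r|\n)*?\\*/"};
        for (int i = 0; i < commentPatterns.length; i++) {
            sql1 = sql1.replaceAll(commentPatterns[i], "");
        }
        return sql1.trim();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "select * from a where price='2012--12-14'\nand id = 1",
            "select * from a where price=\"/* this is not comment */\"",
            "SELECT * from a WHERE `--test` is not null\n"
        };
        for (String sql : inputs) {
            String out = removeCommentInSql(sql);
            // Compare against the trimmed input, because the method trims its result.
            String verdict = out.equals(sql.trim()) ? "unchanged" : "damaged";
            System.out.println(verdict + ": " + out);
        }
    }
}
```

All three statements come back altered even though none of them contains a comment.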

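None of these three test cases contains a comment, yet the function above can mistake part of each one for a comment (for the two -- cases, whenever a line break follows). There are two ways to fix this:
(1) Keep patching the regular expression
We only need to exclude the text inside `, " and ' and apply the regular expression to what remains. But because one kind of quote can appear inside another, such a regular expression is hard to write, and honestly, for someone like me who has not used regular expressions much in a long time, it is very hard.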

I will come back later to how this regular expression could be written.

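(2) Write a simple parser
A regular expression is itself just a parser that follows a fixed set of rules, so implementing a parser by hand according to the same rules also meets the requirement.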
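
For approach (1), one possible shape is to let a single regular expression match quoted strings and comments as alternatives, scanning left to right, and to put back only the quoted strings. Because a quoted string is consumed as a whole, comment markers inside it are never seen as the start of a comment. This is a sketch, not the fix that went into Kylin: the class name RegexCommentStripper and the exact pattern are assumptions made for illustration, and it does not handle backslash escapes or unterminated quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCommentStripper {

    // Group 1: a quoted string ('...', "..." or `...`), which must be kept.
    // Group 2: a comment (-- to end of line, or /* ... */), which is dropped.
    private static final Pattern TOKEN = Pattern.compile(
            "('[^']*'|\"[^\"]*\"|`[^`]*`)"
            + "|(--[^\\r\\n]*|/\\*[\\s\\S]*?\\*/)");

    static String strip(String sql) {
        Matcher m = TOKEN.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the match if it was a quoted string, otherwise replace it with nothing.
            String keep = m.group(1) != null ? m.group(1) : "";
            // quoteReplacement stops '$' and '\' in the SQL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(keep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The line break that ends a -- comment is left in place, so the following line does not get glued onto the previous one.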

The problem with Alibaba Druid

I had used Alibaba's Druid to parse SQL before, so that is what I reached for first. The code is as follows:

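There is a catch, though: this also removes comments that start with #. That is probably not a big problem, but it does not match the method's own comment: match two patterns, one is "-- comment", the other is "/* comment */".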

The problem with Calcite

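Later, a Kylin PMC member suggested using Calcite to solve this. A small aside: when I first looked at the class Calcite uses to parse SQL, org.apache.calcite.sql.parser.impl.SqlParserImpl, I was baffled. The code is very hard to read, and only later did I learn that it is generated by JavaCC.
So I looked for another class that parses SQL and found org.apache.calcite.sql.advise.SqlSimpleParser, whose core method is nextToken. While trying to solve the problem with it, I found that this class has bugs of its own, so I modified the class itself.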

Solutions

A hand-written parser

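None of the approaches above solves the problem well. Besides, all we need here is to locate the comments, whereas the methods provided by Druid and Calcite also parse other SQL structures, which is unnecessary. So, modeled on SqlSimpleParser.nextToken, I wrote a simple comment parser. The code is as follows: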

/**
 * SqlCommentParser locates the comments in a SQL string.
 *
 * It treats a SQL string as a sequence of three kinds of segments:
 * quoted segments, comment segments, and ID segments.
 *
 * E.g.:
 *  select * from a where column_b='this is not comment'-- this is comment
 *  in this SQL,
 *      the quoted segment is [ 'this is not comment' ]
 *      the comment segment is [ -- this is comment ]
 *      the ID segment is [ select * from a where column_b= ]
 *
 * A quoted segment is delimited by a pair of ', " or `, and may contain any text.
 * A comment segment starts with -- or /*
 * An ID segment is any text that is neither of the above.
 *
 * Note:
 * (1) SqlCommentParser does not check whether the string is valid SQL.
 * (2) To support more quote or comment styles, add more cases to nextComment.
 *
 */
public class SqlCommentParser {

    private final String sql;
    private int pos;
    private int start = 0;

    public SqlCommentParser(String sql) {
        this.sql = sql;
        this.pos = 0;
    }

    public String removeCommentSql() {
        StringBuilder newSQL = new StringBuilder();
        int startIndex = 0;
        while (true) {
            Comment comment = this.nextComment();
            // no more comments: the whole string has been scanned
            if (comment == null) {
                // append whatever follows the last comment (the entire string if there was none)
                if (startIndex != sql.length()) {
                    newSQL.append(sql, startIndex, sql.length());
                }
                break;
            } else {
                newSQL.append(sql, startIndex, comment.startIndex);
                startIndex = comment.endIndex;
            }
        }
        return newSQL.toString();

    }

    /**
     * Get the next comment in the SQL.
     *
     * Only two comment styles are supported: -- and /*
     * @return the next comment, or null once the whole string has been scanned
     */
    public Comment nextComment() {
        while (pos < sql.length()) {
            char c = sql.charAt(pos);
            Integer startIndex = null;
            switch (c) {
                // ignore the type of quote
                case '\'':
                    start = pos;
                    ++pos;
                    while (pos < sql.length()) {
                        c = sql.charAt(pos);
                        ++pos;
                        if (c == '\'') {
                            break;
                        }
                    }
                    break;

                case '`':
                    start = pos;
                    ++pos;
                    while (pos < sql.length()) {
                        c = sql.charAt(pos);
                        ++pos;
                        if (c == '`') {
                            break;
                        }
                    }
                    break;

                case '\"':
                    start = pos;
                    ++pos;
                    while (pos < sql.length()) {
                        c = sql.charAt(pos);
                        ++pos;
                        if (c == '\"') {
                            break;
                        }
                    }
                    break;

                // parse the type of comment
                case '/':
                    // possible start of '/*'
                    if (pos + 1 < sql.length()) {
                        char c1 = sql.charAt(pos + 1);
                        if (c1 == '*') {
                            startIndex = pos;
                            int end = sql.indexOf("*/", pos + 2);
                            if (end < 0) {
                                end = sql.length();
                            } else {
                                end += "*/".length();
                            }
                            pos = end;
                            Integer endIndex = pos;
                            return new Comment(startIndex, endIndex);
                        }
                    }
                    // not a block comment: intentionally falls through (via case '-') to default

                case '-':
                    // possible start of '--' comment
                    if (c == '-' && pos + 1 < sql.length() && sql.charAt(pos + 1) == '-') {
                        startIndex = pos;
                        pos = indexOfLineEnd(sql, pos + 2);
                        Integer endIndex = pos;
                        return new Comment(startIndex, endIndex);
                    }
                    // a single '-': intentionally falls through to default

                default:
                    if (isOpenQuote(c)) {
                        break;
                    } else {
                        // parse the type of ID
                        ++pos;
                        loop:
                        while (pos < sql.length()) {
                            c = sql.charAt(pos);
                            switch (c) {
                                case '\'':
                                case '`':
                                case '\"':
                                case '/':
                                    break loop;
                                case '-':
                                    // possible start of '--' comment
                                    if (c == '-' && pos + 1 < sql.length() && sql.charAt(pos + 1) == '-') {
                                        break loop;
                                    }
                                    // a single '-': falls through to default and is consumed as ID text
                                default:
                                    ++pos;
                            }
                        }
                    }
            }
        }
        return null;
    }

    private boolean isOpenQuote(char character) {
        if (character == '\"') {
            return true;
        } else if (character == '`') {
            return true;
        } else if (character == '\'') {
            return true;
        }
        return false;
    }

    private int indexOfLineEnd(String sql, int i) {
        int length = sql.length();
        while (i < length) {
            char c = sql.charAt(i);
            switch (c) {
                case '\r':
                case '\n':
                    return i;
                default:
                    ++i;
            }
        }
        return i;
    }

    public static class Comment {

        private Integer startIndex;
        private Integer endIndex;

        Comment(Integer startIndex, Integer endIndex) {
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }
}

This approach also solves the problem, and all the test cases pass.
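
For quickly checking the three problem statements without copying the whole class, the same scanning idea can be condensed. The following is a compact sketch, not the class above: CompactCommentStripper is a hypothetical name, and like the original it treats an unterminated quote or block comment as running to the end of the string.

```java
final class CompactCommentStripper {

    static String strip(String sql) {
        StringBuilder out = new StringBuilder();
        int n = sql.length();
        int i = 0;
        while (i < n) {
            char c = sql.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // Quoted segment: copy through the matching quote, or to the end if unterminated.
                int close = sql.indexOf(c, i + 1);
                int end = close < 0 ? n : close + 1;
                out.append(sql, i, end);
                i = end;
            } else if (sql.startsWith("--", i)) {
                // Line comment: skip up to, but not including, the line break.
                while (i < n && sql.charAt(i) != '\r' && sql.charAt(i) != '\n') {
                    i++;
                }
            } else if (sql.startsWith("/*", i)) {
                // Block comment: skip through the closing */, or to the end if unterminated.
                int close = sql.indexOf("*/", i + 2);
                i = close < 0 ? n : close + 2;
            } else {
                // Ordinary (ID) text: copy one character.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Unlike Kylin's original method, neither this sketch nor SqlCommentParser trims the result, so a caller that relies on trimming needs to call trim() itself.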

JavaCC

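I did not come up with the JavaCC solution myself; the PMC member wrote it and shared it with me. The idea should be much the same as above: settle the parsing rules, then generate the code.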

https://github.com/wangjie-fourth/SqlParser

wangjie_fourth

How to correctly remove comments from SQL