Porting SLRE to Java

From Carls wiki

Jump to: navigation, search

Ok, so as an exercise in C I'm porting SLRE to Java. It's not a big program, but still makes use of many common C constructs. I will try to balance faithfulness to the C code and Java conventions.

C in blue, Java in green.

02008-07-20

Today I tackled the main method, er, function.

int main(int argc, char *argv[])
{
        struct slre	slre;
        struct cap	caps[20];
        char		data[1 * 1024 * 1024];
        FILE		*fp;
        int		i, count, res, len;
        if (argc < 3) {
                printf("Usage: %s 'slre' <file> [count]\n", argv[0]);
        } else if ((fp = fopen(argv[2], "rb")) == NULL) {
                printf("Error: cannot open %s:%s\n", argv[2], strerror(errno));
        } else if (!slre_compile(&slre, argv[1])) {
                printf("Error compiling slre: %s\n", slre.err_str);
        } else {
                slre_dump(&slre, stderr);
                (void) memset(caps, 0, sizeof(caps));
                /* Read first 128K of file */
                len = fread(data, 1, sizeof(data), fp);
                (void) fclose(fp);
                res = 0;
                count = argc > 3 ? atoi(argv[3]) : 1;
                for (i = 0; i < count; i++)
                        res = slre_match(&slre, data, len, caps);
                printf("Result: %d\n", res);
                for (i = 0; i < 20; i++)
                        if (caps[i].len > 0)
                                printf("Substring %d: [%.*s]\n", i,
                                    caps[i].len, caps[i].ptr);
        }
        return (0);
}

It became this:

    public static void main(String[] args) {
        if ( args.length < 2 ) {
          System.out.println( "Usage: java Slre 'slre' <file> [count]" );
          return;
        }
        try {
            FileInputStream fis = new FileInputStream(args[1]);
            BufferedInputStream bis = new BufferedInputStream(fis);
            DataInputStream dis = new DataInputStream(bis);
            Slre slre = Slre.compile( args[0] );
            /* Read first 128K of file */
            byte[] data = new byte[1 * 1024 * 1024];
            int len = dis.read( data );
            dis.close();
            boolean res = false;
            int count = args.length > 3 ? Integer.parseInt(args[2]) : 1;
            Cap[] caps = new Cap[20];
            for (int i = 0; i < count; i++)
                res = slre.match(data, len, caps);
            System.out.printf("Result: %b\n", res);
            for (int i = 0; i < 20; i++)
                if (caps[i] != null && caps[i].length() > 0)
                  System.out.printf( "Substring %d: [%.*s]\n", i,
                                     caps[i].length(), caps[i].pointer() );
        }
        catch (FileNotFoundException fnfe) {
            System.out.printf( "Error: cannot open %s: %s\n",
                               args[1], fnfe.getMessage() );
        }
        catch (IllegalArgumentException iae) {
            System.out.printf( "Error compiling slre: %s\n",
                               iae.getMessage() );
        }
        catch (IOException ioe) {
            System.out.printf( "An I/O error occurred reading %s: %s",
                               args[1], ioe.getMessage() );
        }
    }

Observations:

  • The difference in length is negligible. C has 38 to Java's 41.
  • C handles errors with return codes and if statements. Java uses try, which changes the order around a bit. (And not necessarily for the worse: error handling is now at the bottom of the method.)
  • Also note how we handle one more error (IOException) in the Java code. It wasn't my idea, it was Java's! And not all that unreasonable, either, to imagine that something can go wrong during dis.read or dis.close. It's an interesting exercise to think what would happen in the C code in such a case.
  • In C, argv[0] contains the program name, with the added benefit that the actual arguments are 1-based and thus easier to think about. In Java, I had to subtract 1 everywhere.
  • In Java, I declare variables when I first use them. That makes me all warm and tingly inside.
  • As a special case of the above, note that the C version uses i for two different for loops. In the Java version, there are two variables i with the same name but separate scopes.
  • I realize that I've been using printf too little in my ordinary coding. Wonderful function, er, method.

02008-07-24

I have now finished the compiler step:

int
slre_compile(struct slre *r, const char *re)
{
        r->err_str = NULL;
        r->code_size = r->data_size = r->num_caps = r->anchored = 0;
        if (*re == '^')
                r->anchored++;
        emit(r, OPEN);    /* This will capture what matches full RE */
        emit(r, 0);
        while (*re != '\0')
                compile(r, &re);
        if (r->code[2] == BRANCH)
                fixup_branch(r, 4);
        emit(r, CLOSE);
        emit(r, 0);
        emit(r, END);
        return (r->err_str == NULL ? 1 : 0);
}

It became this:

    private Slre() {
        this.code_size = this.data_size = this.num_caps = 0;
        this.anchored = false;
    }
    public static Slre compile( String re ) {
       Slre r = new Slre();
       r.anchored = re.charAt(0) == '^';
       r.emit(Opcode.OPEN);        // This will capture what matches full RE
       r.emit(0);
       int pos = 0;
       while (pos < re.length())
           pos = r.compile(re, pos);
       if (r.code[2] == Opcode.BRANCH.getVal())
           r.fixup_branch(4);
       r.emit(Opcode.CLOSE);
       r.emit(0);
       r.emit(Opcode.END);
       return r;
   }

Notes:

  • r has turned from a struct into an object.
  • err_str is gone, replaced by exception handling.
  • anchored becomes a boolean, and the if statement evaporates
  • Since Java Strings are not zero-terminated like C strings, we have to check against the length instead.
  • Also, note how re is a reference to (within) a string. We cannot do that, so I had to introduce the int variable pos. Arguably, the C pointer style is more powerful than an integer index.
  • Primitives are sent by value in Java, so any changes inside the callee to a parameter only modify a copy of the original variable. Because of this, I had to turn the return type void of all functions modifying pos into int so that they can return the new pos value.
  • The opcodes became an enum in Java as well as in C. The Java way of accessing them is slightly more cumbersome.

I also implemented all methods called (directly or indirectly) by compile. They all exhibit more or less the same properties as the above method. Only one is especially noteworthy:

static void
compile(struct slre *r, const char **re)
{
        int	op, esc, branch_start, last_op, fixup, cap_no, level;
        fixup = 0;
        level = r->num_caps;
        branch_start = last_op = r->code_size;
        for (;;)
                switch (*(*re)++) {
                case '\0':
                        (*re)--;
                        return;
                        /* NOTREACHED */
                        break;
                case '^':
                        emit(r, BOL);
                        break;
                case '$':
                        emit(r, EOL);
                        break;
                case '.':
                        last_op = r->code_size;
                        emit(r, ANY);
                        break;
                case '[':
                        anyof(r, re);
                        break;
                case '\\':
                        last_op = r->code_size;
                        esc = get_escape_char(re);
                        if (esc & 0xff00) {
                                emit(r, esc >> 8);
                        } else {
                                exact_one_char(r, esc);
                        }
                        break;
                case '(':
                        last_op = r->code_size;
                        cap_no = ++r->num_caps;
                        emit(r, OPEN);
                        emit(r, cap_no);
                        compile(r, re);
                        if (*(*re)++ != ')') {
                                r->err_str = "No closing bracket";
                                return;
                        }
                        emit(r, CLOSE);
                        emit(r, cap_no);
                        break;
                case ')':
                        (*re)--;
                        fixup_branch(r, fixup);
                        if (level == 0) {
                                r->err_str = "Unbalanced brackets";
                                return;
                        }
                        return;
                        /* NOTREACHED */
                        break;
                case '+':
                case '*':
                        op = (*re)[-1] == '*' ? STAR: PLUS;
                        if (**re == '?') {
                                (*re)++;
                                op = op ? STARQ : PLUSQ;
                        }
                        quantifier(r, last_op, op);
                        break;
                case '?':
                        quantifier(r, last_op, QUEST);
                        break;
                case '|':
                        fixup_branch(r, fixup);
                        relocate(r, branch_start, 3);
                        r->code[branch_start] = BRANCH;
                        set_jump_offset(r, branch_start + 1, branch_start);
                        fixup = branch_start + 2;
                        r->code[fixup] = 0xff;
                        break;
                default:
                        (*re)--;
                        last_op = r->code_size;
                        exact(r, re);
                        break;
                }
}

...into...

   int compile(String re, int pos) {
       int fixup        = 0,
           level        = this.num_caps,
           branch_start = this.code_size,
           last_op      = this.code_size;
       while (pos < re.length())
           switch ( re.charAt(pos++) ) {
               case '^':
                   emit(Opcode.BOL);
                   break;
               case '$':
                   emit(Opcode.EOL);
                   break;
               case '.':
                   last_op = this.code_size;
                   emit(Opcode.ANY);
                   break;
               case '[':
                   pos = anyof(re, pos);
                   break;
               case '\\':
                   last_op = this.code_size;
                   int esc = get_escape_char(re, pos);
                   if ((esc & 0xff00) != 0) {
                       emit(esc >> 8);
                   }
                   else {
                       exact_one_char(esc);
                   }
                   break;
               case '(':
                   last_op = this.code_size;
                   int cap_no = ++this.num_caps;
                   emit(Opcode.OPEN);
                   emit(cap_no);
                   compile(re, pos);
                   if (re.charAt(pos++) != ')')
                       throw new IllegalArgumentException(
                           "No closing bracket"
                       );
                   emit(Opcode.CLOSE);
                   emit(cap_no);
                   break;
               case ')':
                   pos--;
                   fixup_branch(fixup);
                   if (level == 0)
                       throw new IllegalArgumentException(
                           "Unbalanced brackets"
                       );
                   return pos;
               case '+':
               case '*':
                   Opcode op = re.charAt(pos-1) == '*'
                               ? Opcode.STAR
                               : Opcode.PLUS;
                   if (re.charAt(pos) == '?') {
                       pos++;
                       op = op == Opcode.STAR ? Opcode.STARQ : Opcode.PLUSQ;
                   }
                   quantifier(last_op, op);
                   break;
               case '?':
                   quantifier(last_op, Opcode.QUEST);
                   break;
               case '|':
                   fixup_branch(fixup);
                   relocate(branch_start, 3);
                   this.code[branch_start] = (char) Opcode.BRANCH.getVal();
                   set_jump_offset(branch_start + 1, branch_start);
                   fixup = branch_start + 2;
                   this.code[fixup] = 0xff;
                   break;
               default:
                   pos--;
                   last_op = this.code_size;
                   pos = exact(re, pos);
                   break;
           }
       return re.length();
   }

Note:

  • Did you notice the bug I fixed on the way? I have highlighted it for you. Actually, the Java compiler gets all the praise for finding it.
$ javac Slre.java 
Slre.java:213: incompatible types
found   : Opcode
required: boolean
                        op = op ? Opcode.STARQ : Opcode.PLUSQ;
                             ^
1 error