Porting SLRE to Java
From Carls wiki
Ok, so as an exercise in C I'm porting SLRE to Java. It's not a big program, but still makes use of many common C constructs. I will try to balance faithfulness to the C code and Java conventions.
C in blue, Java in green.
02008-07-20
Today I tackled the main method, er, function.
int main(int argc, char *argv[])
{
struct slre slre;
struct cap caps[20];
char data[1 * 1024 * 1024];
FILE *fp;
int i, count, res, len;
if (argc < 3) {
printf("Usage: %s 'slre' <file> [count]\n", argv[0]);
} else if ((fp = fopen(argv[2], "rb")) == NULL) {
printf("Error: cannot open %s:%s\n", argv[2], strerror(errno));
} else if (!slre_compile(&slre, argv[1])) {
printf("Error compiling slre: %s\n", slre.err_str);
} else {
slre_dump(&slre, stderr);
(void) memset(caps, 0, sizeof(caps));
/* Read first 128K of file */
len = fread(data, 1, sizeof(data), fp);
(void) fclose(fp);
res = 0;
count = argc > 3 ? atoi(argv[3]) : 1;
for (i = 0; i < count; i++)
res = slre_match(&slre, data, len, caps);
printf("Result: %d\n", res);
for (i = 0; i < 20; i++)
if (caps[i].len > 0)
printf("Substring %d: [%.*s]\n", i,
caps[i].len, caps[i].ptr);
}
return (0); }
It became this:
public static void main(String[] args) {
if ( args.length < 2 ) {
System.out.println( "Usage: java Slre 'slre' <file> [count]" );
return;
}
try {
FileInputStream fis = new FileInputStream(args[1]);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
Slre slre = Slre.compile( args[0] );
/* Read first 128K of file */
byte[] data = new byte[1 * 1024 * 1024];
int len = dis.read( data );
dis.close();
boolean res = false;
int count = args.length > 3 ? Integer.parseInt(args[2]) : 1;
Cap[] caps = new Cap[20];
for (int i = 0; i < count; i++)
res = slre.match(data, len, caps);
System.out.printf("Result: %b\n", res);
for (int i = 0; i < 20; i++)
if (caps[i] != null && caps[i].length() > 0)
System.out.printf( "Substring %d: [%.*s]\n", i,
caps[i].length(), caps[i].pointer() );
}
catch (FileNotFoundException fnfe) {
System.out.printf( "Error: cannot open %s: %s\n",
args[1], fnfe.getMessage() );
}
catch (IllegalArgumentException iae) {
System.out.printf( "Error compiling slre: %s\n",
iae.getMessage() );
}
catch (IOException ioe) {
System.out.printf( "An I/O error occurred reading %s: %s",
args[1], ioe.getMessage() );
}
}
Observations:
- The difference in length is negligible. C has 38 to Java's 41.
- C handles errors with return codes and
ifstatements. Java usestry, which changes the order around a bit. (And not necessarily for the worse: error handling is now at the bottom of the method.) - Also note how we handle one more error (
IOException) in the Java code. It wasn't my idea, it was Java's! And not all that unreasonable, either, to imagine that something can go wrong duringdis.readordis.close. It's an interesting exercise to think what would happen in the C code in such a case. - In C,
argv[0]contains the program name, with the added benefit that the actual arguments are 1-based and thus easier to think about. In Java, I had to subtract 1 everywhere. - In Java, I declare variables when I first use them. That makes me all warm and tingly inside.
- As a special case of the above, note that the C version uses
ifor two different for loops. In the Java version, there are two variablesiwith the same name but separate scopes. - I realize that I've been using
printftoo little in my ordinary coding. Wonderful function, er, method.
02008-07-24
I have now finished the compiler step:
int
slre_compile(struct slre *r, const char *re)
{
r->err_str = NULL;
r->code_size = r->data_size = r->num_caps = r->anchored = 0;
if (*re == '^')
r->anchored++;
emit(r, OPEN); /* This will capture what matches full RE */
emit(r, 0);
while (*re != '\0')
compile(r, &re);
if (r->code[2] == BRANCH)
fixup_branch(r, 4);
emit(r, CLOSE);
emit(r, 0);
emit(r, END);
return (r->err_str == NULL ? 1 : 0); }
It became this:
private Slre() {
this.code_size = this.data_size = this.num_caps = 0;
this.anchored = false;
}
public static Slre compile( String re ) {
Slre r = new Slre();
r.anchored = re.charAt(0) == '^';
r.emit(Opcode.OPEN); // This will capture what matches full RE
r.emit(0);
int pos = 0;
while (pos < re.length())
pos = r.compile(re, pos);
if (r.code[2] == Opcode.BRANCH.getVal())
r.fixup_branch(4);
r.emit(Opcode.CLOSE);
r.emit(0);
r.emit(Opcode.END);
return r; }
Notes:
-
rhas turned from astructinto an object. -
err_stris gone, replaced by exception handling. -
anchoredbecomes aboolean, and theifstatement evaporates - Since Java
Strings are not zero-terminated like C strings, we have to check against the length instead. - Also, note how
reis a reference to (within) a string. We cannot do that, so I had to introduce theintvariablepos. Arguably, the C pointer style is more powerful than an integer index. - Primitives are sent by value in Java, so any changes inside the callee to a parameter only modify a copy of the original variable. Because of this, I had to turn the return type
voidof all functions modifyingposintointso that they can return the newposvalue. - The opcodes became an
enumin Java as well as in C. The Java way of accessing them is slightly more cumbersome.
I also implemented all methods called (directly or indirectly) by compile. They all exhibit more or less the same properties as the above method. Only one is especially noteworthy:
static void
compile(struct slre *r, const char **re)
{
int op, esc, branch_start, last_op, fixup, cap_no, level;
fixup = 0;
level = r->num_caps;
branch_start = last_op = r->code_size;
for (;;)
switch (*(*re)++) {
case '\0':
(*re)--;
return;
/* NOTREACHED */
break;
case '^':
emit(r, BOL);
break;
case '$':
emit(r, EOL);
break;
case '.':
last_op = r->code_size;
emit(r, ANY);
break;
case '[':
anyof(r, re);
break;
case '\\':
last_op = r->code_size;
esc = get_escape_char(re);
if (esc & 0xff00) {
emit(r, esc >> 8);
} else {
exact_one_char(r, esc);
}
break;
case '(':
last_op = r->code_size;
cap_no = ++r->num_caps;
emit(r, OPEN);
emit(r, cap_no);
compile(r, re);
if (*(*re)++ != ')') {
r->err_str = "No closing bracket";
return;
}
emit(r, CLOSE);
emit(r, cap_no);
break;
case ')':
(*re)--;
fixup_branch(r, fixup);
if (level == 0) {
r->err_str = "Unbalanced brackets";
return;
}
return;
/* NOTREACHED */
break;
case '+':
case '*':
op = (*re)[-1] == '*' ? STAR: PLUS;
if (**re == '?') {
(*re)++;
op = op ? STARQ : PLUSQ;
}
quantifier(r, last_op, op);
break;
case '?':
quantifier(r, last_op, QUEST);
break;
case '|':
fixup_branch(r, fixup);
relocate(r, branch_start, 3);
r->code[branch_start] = BRANCH;
set_jump_offset(r, branch_start + 1, branch_start);
fixup = branch_start + 2;
r->code[fixup] = 0xff;
break;
default:
(*re)--;
last_op = r->code_size;
exact(r, re);
break;
}
}
...into...
int compile(String re, int pos) {
int fixup = 0,
level = this.num_caps,
branch_start = this.code_size,
last_op = this.code_size;
while (pos < re.length())
switch ( re.charAt(pos++) ) {
case '^':
emit(Opcode.BOL);
break;
case '$':
emit(Opcode.EOL);
break;
case '.':
last_op = this.code_size;
emit(Opcode.ANY);
break;
case '[':
pos = anyof(re, pos);
break;
case '\\':
last_op = this.code_size;
int esc = get_escape_char(re, pos);
if ((esc & 0xff00) != 0) {
emit(esc >> 8);
}
else {
exact_one_char(esc);
}
break;
case '(':
last_op = this.code_size;
int cap_no = ++this.num_caps;
emit(Opcode.OPEN);
emit(cap_no);
compile(re, pos);
if (re.charAt(pos++) != ')')
throw new IllegalArgumentException(
"No closing bracket"
);
emit(Opcode.CLOSE);
emit(cap_no);
break;
case ')':
pos--;
fixup_branch(fixup);
if (level == 0)
throw new IllegalArgumentException(
"Unbalanced brackets"
);
return pos;
case '+':
case '*':
Opcode op = re.charAt(pos-1) == '*'
? Opcode.STAR
: Opcode.PLUS;
if (re.charAt(pos) == '?') {
pos++;
op = op == Opcode.STAR ? Opcode.STARQ : Opcode.PLUSQ;
}
quantifier(last_op, op);
break;
case '?':
quantifier(last_op, Opcode.QUEST);
break;
case '|':
fixup_branch(fixup);
relocate(branch_start, 3);
this.code[branch_start] = (char) Opcode.BRANCH.getVal();
set_jump_offset(branch_start + 1, branch_start);
fixup = branch_start + 2;
this.code[fixup] = 0xff;
break;
default:
pos--;
last_op = this.code_size;
pos = exact(re, pos);
break;
}
return re.length(); }
Note:
- Did you notice the bug I fixed on the way? I have highlighted it for you. Actually, the Java compiler gets all the praise for finding it.
$ javac Slre.java
Slre.java:213: incompatible types
found : Opcode
required: boolean
op = op ? Opcode.STARQ : Opcode.PLUSQ;
^
1 error
