retrosegfault - Tumblr blog

retrosegfault · 1 year

Text

ktre

A regular expression engine in pure C (3339 lines, 2755 sloc).

I call this engine ktre, which means “Krok’s Tiny Regex Engine.” It implements a backtracking implementation of PCRE-flavored regular expressions in a single header for easy embedding. A short feature list:

• Submatching • Backreferencing • Named capture groups • Subroutine calls • Recursion • Positive and negative lookahead assertions • Arbitrary positive and negative lookbehind assertions

The basic algorithm for the execution of the generated bytecode is based on “threads” of execution. When a conditional metacharacter (like `?`) is executed, the VM splits execution into two paths, one path of execution attempts to find a match that includes the character being matched conditionally, and the other path attempting to find a match that does not include the character being matched conditionally. In fact, all of the basic quantifiers are implemented by different permutations of branching instructions interwoven with the actual text to match. The instruction that implements this “splitting” behavior is called INSTR_BRANCH, if you’re following along with the code.

Here’s what the pre-runtime debugging info for `foo?` looks like:

In this image, “regexpr:” indicated the expression being compiled. It’s followed by a list of the options passed to the engine for this particular compile. The next item is an abstract syntax tree (AST) representing the expression. The last item is the bytecode that the AST was compiled into. `group 0:` indicates that the location it marks is the beginning of the 0th group (the entire match). When a text editor or general purpose language like Perl supports regular expressions in the syntax the flags are often passed as single-character flags as some part of the string-like syntax required. For example, the expression we’re looking at now would be written in Perl like `/foo?/gi`. The `g` passes the global option to the engine, and `i` passes the insensitive option.

You might notice some odd things about the bytecode. The first three instructions are outside of the 0th group, so it will not be part of the match. Why are we matching stuff that’s not in the expression? These first three instructions implement the unanchored matching passed to the engine as a flag. If you following the execution with your mind you may realize what’s happening. The first instruction has us attempt to match the regex at the first position in the subject string. If that fails, we’ll fall back to the 1st instruction, which consumes a single character and then tries to match the regex again. If that fails, it consumes another character &c. The behavior of these instructions is identical to the behavior of `.*?` - a so-called “non-greedy” quantifier.

This simple example demonstrates how backtracking works. Here’s what the execution looks like:

Because ktre uses backtracking to find matches, it is essentially an implementation of a kind of nondeterministic finite automaton (NFA). This means that in a worst case scenario it may take exponential time to complete its execution.

An interesting ktre feature is arbitrary (variable-width) lookbehind assertions; most engines support only lookbehind assertions that match text of one length.

Unfortunately, ktre does not support Unicode. Basic Unicode strings will work as intended, but applying a quantifier like `?` to a multi-byte Unicode character can cause erroneous matches with seemingly unrelated character because the engine is not aware of code-points - it works only on individual bytes.

0 notes

retrosegfault · 1 year

Text

bfm

bfm (on github) is a macro language that generates brainfuck. You can see what the language looks like here. The 16-bit deadfish and interactive cellular automata examples are particularly interesting.

The first version of bfm was previewed in my post about brainfuck from last year. Since then about a month of work was done. It looks much different now, and has many more features.

In bfm there are only two language constructs: statements and operations. A statement is a keyword followed by an argument (I can’t recall any keyword which read more than one argument). An operation is recognized as some valid identifier (a variable) followed by an operator (like + - / *) and either a constant expression or another identifier. This makes the placement of semicolons rather arbitrary looking, but they are very reasonable.

Here’s the 16-bit deadfish example in action:

Variables are simple aliases for cells, there is no stack. Thus, expressions of arbitrary complexity are not supported, only constant expressions may be used in the following way:

field[WIDTH + WIDTH / 2] = 1;

Variables die when parsing hits the end of the block in which they were declared. This is important for two reasons: namespace pollution, and efficient pointer movements. The first of these problems is obvious, the second is slightly more involved with the way bfm maps variables to cells.

While handwritten brainfuck programs often have clever memory layouts that significantly reduce the size or complexity of the algorithm being implemented, recognizing the potential for such a layout would be no trivial task, and would have very rapidly diminishing returns for more general programs. Thus, bfm simply places the variables at zero, the temporary cells at the end of the variables, and the arrays at the end of the temporary cells.

An excellent example of how memory layout can make brainfuck programs much smaller is dbfi.b, a brainfuck interpreter written in brainfuck by Daniel B Cristofani, who described the program in a short draft on his website.

The next important point is that every time a significant operation is performed on a variable the pointer must be moved to and from the temporary cells (probably many times). This means that using cells conservatively is important. If we only killed the name of the variable and abandoned the cell it had been assigned to, every pointer movement from the beginning of the program to the end would have to traverse that cell. So, instead, we calculate the number of cells we will need ahead of time (the maximum number of cells in use at any one point in the program).

Macros were implemented rather late into the language’s development. They are declared like:

macro macro_name (arg1, arg2, ...) // stuff end

When a macro expansion is found, the arguments that are being passed to the macro are aliased to the corresponding arguments in the macro definition and parsing moves to the macro body. Macros cannot be recursive, of course.

Writing code in this language is strange and interesting.

The actual bfm code suffered along the way from lack of foresight. If I were to do it over again I would write a real parser for the language, split it into a number of files, and do the brainfuck generation in a compiler pass outside of the main language parsing, probably with some kind of intermediate language that maps well to brainfuck.

0 notes

retrosegfault · 7 years

Photo

A few images from a simple raytracer in C++ almost directly copied from a short book about raytracing I read a while ago.

#computers #3d #code #processing #graphics #art #programming

0 notes

retrosegfault · 7 years

Text

on github

If you ever want to check out the code for the projects on this blog, they’re probably going to be available on github! I just got a fresh new development environment, so I thought I’d clean up some old code and collect it all in one place online. The brainfuck interpreter is there, the macro language will probably be on there soon, and my latest WIP is currently being developed there.

#gamedev #development #code #github

0 notes

retrosegfault · 7 years

Text

brainfuck

A brainfuck interpreter in pure C (C89 / ANSI C).

brainfuck is a minimalistic programming language that consists entirely of eight commands: . , < > + - [ and ]. When writing brainfuck you visualize the computational environment as a long grid of 8-bit cells which you must use to perform all of your computations. Cells are accessed and modified with a pointer that you control with the < and > commands. The + and - commands will add or subtract from the value of the current cell. [ will jump to the corresponding ] if the current cell is 0. The , command will set the value of the current cell to whatever key the user presses, and the . command will print the ASCII representation of the cell.

I actually wrote two versions of the interpreter: the first was a very simple thing that loaded the entire program text into memory and ran it directly. It suffered from performance issues because of it’s simplicity.

The second version goes through a “compilation” process. Instead of simply holding the program and looking at each character individually, it converts the program into an internal representation, which moves some of the computational cost from runtime to compilation. This means that when we encounter something like a [ command, we don’t need to find the corresponding ] every time we look at it. This approach drastically reduces the run time of most programs.

The switch also made it much easier to add specialized commands based on common brainfuck idioms, like the ones describe on this page. My interpreter contracts commands, detects multiplication loops, clearloops, dead code, and redundant commands, which makes it fast enough to run hanoi.bf in less than a second.

So after running many of the popular bf programs, I started experimenting with non-standard features and add-ons. A (relatively) famous program written in brainfuck is mandelbrot.bf, which generates an ascii visualization of the Mandelbrot set. I searched a bit and found several other versions of the program (which was generated with C preprocessor macros, not actually written by hand), including a “titannic” version that generated a 512x196 fractal:

It looks pretty high-res as text, so I thought it would be cool to interpret it as an image:

Kinda cool! But it’s very very squashed because it was intended to be printed as text, and text characters aren’t squares (like pixels.) The idea of graphics with brainfuck was appealing, so I threw an “image mode” into the interpreter, turned on with an “-img” commandline argument. Once image mode is on it will expect the program being run to print the rgb value of each pixel row by row, and to print the number -1 at the end of every row. This allows it to deduce the resolution of the image without any extra information. Once the program has finished the output is saved to the output file (specified with the “-o” parameter.) Here’s a program that generates a gradient (with 16 bit cells):

>>++++++++++++++++++++++++++++++++[<++++++++++++++++++++++++++++++++<++++++++++++++++++++++++++++++++>>-]<<[>[>><<..<.>>><<-]>++++++++++++++++++++++++++++++++[<++++++++++++++++++++++++++++++++>-]<<-]

I tried to write some other things with bf, but anything nontrivial is pretty tedious to write. I’m not sure what I was expecting. But from what I did write, I learned that many things are very simple and repetitive to write by hand, and could easily be automated. So I did the natural thing, and created a macro language that generates brainfuck. Here’s the gradient program:

var x x = 128 var y y = 128

var end_row end_row = -1

while y while x printc x printc y printc x x - 1 end printc end_row x = 128 y - 1 end

which generates this (relatively bad) brainfuck (which works with 8-bit cells because of the image resolution):

><[-]++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>>[-]<[-]<[>+>+<<-]>>[<<+>>-]<>>>>>>>>>><[-]++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>>[-]<[-]<[>+>+<<-]>>[<<+>>-]<>>>>>>>>>><[-]->>[-]<[-]<[>+>+<<-]>>[<<+>>-]<<<<<<<<<<<[<<<<<<<<<<[.>>>>>>>>>>.<<<<<<<<<<.<[-]+>>[-]<<[>->+<<-]>>[<<+>>-]<]>>>>>>>>>>>>>>>>>>>>.<<<<<<<<<<<<<<<<<<<<<[-]++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>>[-]<[-]<[>+>+<<-]>>[<<+>>-]<>>>>>>>>>><[-]+>>[-]<<[>->+<<-]>>[<<+>>-]<]

I eventually threw out the entire image mode thing because it was weird, but kept the ability to dump the program’s output to a file. The macro language was cool, so I kept working on it (it’s also in pure ANSI C), and it is now capable of generating relatively complex programs. I’ve now spent faaaar more time on it than on the interpreter. (Most of the brainfuck algorithms were stolen and customized from the EsoLang wiki page)

Here’s a one-dimensional cellular automata program generated with it. To run it, just press the “raw” button on that page, select the entire thing, and paste it into an interpreter like Le Brainfuck.

The macro language is easy enough to write that a graphical variation (that actually creates the image data from scratch) was trivial:

Here’s fizzbuzz, and here’s the macro program that generated it:

var i var max i = 1 max = 100

var i_not_done i_not_done = 1

while i_not_done var should_print_num should_print_num = 1 var is_divizz is_divizz = i is_divizz % 3 is_divizz == 0 if is_divizz should_print_num = 0 print fizz endif is_divizz = i is_divizz % 5 is_divizz == 0 if is_divizz should_print_num = 0 print buzz endif if should_print_num printv i endif i_not_done = i i_not_done == max not i_not_done printc 10 i + 1 endwhile

(naming conventions on point)

#programming #creative coding #code #generative art #gif #computer science #hacking #computers

3 notes · View notes

retrosegfault · 8 years

Photo