Camen Design Forum

Kroc, why did you bestow upon me this nightmare?

JJ

I've been porting ReMarkable to JS (Markdown was ported, and I felt ReMarkable should be represented in the JS world as well), so I'm actually mad at JS's horrible implementation of regexes. ...but damn you and your conditional subpatterns & lookbehinds! :)


Replies

#1. Kroc

I can assure you it was all necessary to get it to work. I was once writing an article that fully explained how the regex strings work, but I’m holding off on that whilst ReMarkable is still being developed; plus, it takes time and I don’t have that.

You think you have it hard? I hear someone is trying to port it to Perl! :)

I should write a conformance suite so that people making ports can ensure that the code is correct, but I am so cripplingly busy with work. At the moment, Camen Design is my conformance suite: if my website doesn’t publish, there’s a bug. :P

If you need any help understanding the internals of ReMarkable and its regex, just ask, I’d be happy to explain any particular part.

#2. JJ

~~~
I should write a conformance suite so that people making ports can ensure that the code is correct
~~~

I'm using your articles as the test suite at the moment, hehe. It's coming along and is mostly done; I just haven't gotten the indentation or wrapping fully working, and ordered and unordered lists are buggy. This is all because JS regexes suck (no conditional patterns, no lookbehinds, no possessives, etc.) and I couldn't use yours, so I had to rewrite those parts completely.

Although, not counting the regex part, it's a lot easier to get it working in JS given the fact that you can actually use real callbacks in JS. Very luxurious when writing something like this.
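As a rough illustration of the callback point: `String.prototype.replace` accepts a function as its second argument, so per-match logic can live in a closure rather than in the pattern. The `*emphasis*` syntax and the `emphasise` name below are made up for this sketch; they are not ReMarkable's actual markup or code.

```javascript
// Hypothetical example: turn *word* into <em>word</em> and count the matches.
function emphasise(text) {
  let count = 0;
  const html = text.replace(/\*([^*\n]+)\*/g, function (match, inner) {
    count += 1; // the closure can keep state between matches
    return '<em>' + inner + '</em>';
  });
  return { html: html, count: count };
}

const result = emphasise('The *quick* brown *fox*.');
console.log(result.html); // The <em>quick</em> brown <em>fox</em>.
```

The same pattern scales to callbacks that look things up in an array or call other functions, which is the part that is awkward without real closures.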

Anyway, I could definitely use a hand if anyone wants to help with it: epsilon-not.net/…

I included all of Kroc's original comments so it will be easy to orient oneself when juxtaposing the two different versions.

#3. Kroc

This is really nice stuff. Can you release the testing rig you are using to run this, so I can validate the output or write a testing script to work with it?

#4. JJ

Ah, it's just a simple html file with:

<script src="unremarkable.js"></script>
<script src="ajax.googleapis.com/…"></script>
<script>
    window.onload = function() {
        $.get('article.rem', function(text) {
            document.body.textContent = unremarkable(text);
            document.body.style.whiteSpace = 'pre-wrap';
        });
    };
</script>

And then I swap any of your articles in or out to see if they work properly.

#5. Kroc

Something that immediately strikes me is that since JS doesn't support recursive regexes, the PRE block is not going to work. You should be feeding your script documentation.rem as a more robust test. Also, don't change the placeholder style! You appear to have added : and - to it, and this won't work. The placeholder is the way it is because it is always the exact size of the content it is replacing, which is necessary for word-wrapping. By adding extra characters to it, you will break the word-wrapping, and ReMarkable will also crash in the "placeholders already in the markup" test and in the title-casing code.
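To make the "exact size" point concrete, here is a minimal sketch of an equal-length placeholder scheme. It is not ReMarkable's real placeholder format: the filler character, the `protect` / `restore` names and the `<code>` pattern are all assumptions chosen for illustration. The idea is that the masked text has the same length as the source, so any width calculation done on it still holds once the originals are put back.

```javascript
// Assumption: a private-use character that never appears in the source text.
const FILLER = '\uE000';

// Replace every match of `pattern` (which must have the g flag) with a run of
// FILLER of the same length, remembering the originals in order.
function protect(text, pattern) {
  const stored = [];
  const masked = text.replace(pattern, function (match) {
    stored.push(match);
    return FILLER.repeat(match.length);
  });
  return { masked: masked, stored: stored };
}

// Put the originals back. A run may cover several adjacent placeholders,
// so keep consuming stored items until the run's length is accounted for.
function restore(masked, stored) {
  let next = 0;
  return masked.replace(/\uE000+/g, function (run) {
    let out = '';
    let remaining = run.length;
    while (remaining > 0) {
      const original = stored[next++];
      out += original;
      remaining -= original.length;
    }
    return out;
  });
}

const source = 'Use <code>a  b</code> and <code>c</code> here.';
const demo = protect(source, /<code>.*?<\/code>/g);
console.log(demo.masked.length === source.length);
console.log(restore(demo.masked, demo.stored) === source);
```

Because the originals are matched up by order rather than by an ID embedded in the placeholder, nothing extra has to be squeezed into the placeholder text itself.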

ReMarkable was written to complement PHP's functionality and would be an entirely different beast if I were to write it in JS. Much of the complexity in the regexes (especially the look-behinds) comes from the basic principle of how ReMarkable splits the content into chunks. In essence, ReMarkable ensures that every paragraph has a blank line before and after it, and that any block-level elements (such as OL / DL) are on a separate line, not directly on the paragraph text. This allows something like this to have paragraphs added quickly and reliably:

~~~
<ol>
<li>

The quick brown fox.

Jumps over the lazy dog.

</li>
</ol>
~~~

The regexes have to do various checks to find the blank lines where each chunk starts and stops, and must also check where a list comes to an end and some other content follows. The lists are very complicated because ReMarkable has to assess whether the line before and after a list item is the beginning or end of the whole list or just an optional blank line between list items, and whether this blank line belongs to one list item or another. It was a real headache, and that regex alone took something like a week to write.

Because these regexes operate on the whole source text, they have to be careful not to step on other content. This is where a JavaScript port is going to suffer badly from the regex limitations.

If I were doing ReMarkable from scratch in JS (where closures are cheap), I would instead chunk the content into multi-level object arrays, based on type.

For example, if you use regex to recognise a whole list, you would remove the whole list from the source text and put it into an array of removed chunks with everything else, and then use a sub-array for each of the list items (so that the list-item regex would not have to be aware of what was above / below the list). With all the source text broken into a nested array that represents actual depth, doing the word-wrapping, output and indenting would be very cheap.
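A minimal sketch of that chunking idea, under heavy assumptions: blocks are separated by blank lines, a list is any block whose first line starts with `- `, and the `chunk` function and the `type` / `items` / `text` field names are invented here. ReMarkable's real list syntax is considerably more involved, as described above.

```javascript
// Split the source into typed chunks; lists get a sub-array of their items,
// so the per-item logic never sees what is above or below the list.
function chunk(source) {
  return source
    .split(/\n{2,}/) // blank lines separate top-level blocks
    .filter(function (block) { return block.trim() !== ''; })
    .map(function (block) {
      if (/^- /.test(block)) {
        // Split at each line that starts a new item; drop the empty lead-in.
        const items = block.split(/^- /m).slice(1).map(function (item) {
          return { type: 'item', text: item.trim() };
        });
        return { type: 'list', items: items };
      }
      return { type: 'paragraph', text: block };
    });
}

const chunks = chunk('Intro paragraph.\n\n- one\n- two\n\nOutro.');
console.log(JSON.stringify(chunks, null, 2));
```

With the text in this shape, wrapping and indenting become a walk over the nested arrays, with the nesting depth available directly instead of being re-derived by regex.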

This is also how I would probably do it when (if ever) my host upgrades to PHP 5.3, where I can use cheap closures.

As it stands, I cannot see any way that JS can recursively recognise the PRE fences, and this syntax will have to be changed to better support weak regex engines.

#6. JJ

Yeah, I changed the placeholder style so I could readably add an ID number that ties each placeholder to a value in the array. I did make a point of adjusting the length so that the proper amount of padding was added, though.

~~~
If I were doing ReMarkable from scratch in JS (where closures are cheap), I would instead chunk the content into multi-level object arrays, based on type.
~~~

This is a good idea. I'll try to get to work on that.

I definitely realize the major flaws that you're pointing out. My main goal was just to get it working at first. I rewrote it line-by-line originally, and then ran it for the first time. After fixing the slew of errors, I went back and started revising it, seeing how I could do things better for JS. I actually ended up rewriting it again (after writing it the first time, I had a better understanding of how it worked). I will go back and work on it some more soon. I need to “recover” after that one night though, hehe. (I sometimes think it might be easier or more foolproof to write a real parser for this when it comes to JS.)

#7. JJ

I've been trying to get better at writing parsers lately, so I think I'm gonna try to write a scanner and a recursive descent parser for the complicated stuff and leave whatever's left to plain regexes (for speed). It's the easy way out: recursive descent won't be the absolute fastest, but it's what I know (the easiest to write that I know of). I don't see much of a way to avoid this without the sheer power of a full PCRE implementation.
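For anyone following along, the scanner plus recursive descent approach could look roughly like this for nested lists. Everything here is an assumption made for the sketch: the `- ` bullet syntax, nesting by leading spaces, and the `scan` / `parseList` / `parse` names. It is not taken from ReMarkable or from the port.

```javascript
// Scanner: one token per non-blank line, recording indentation and item text.
function scan(source) {
  return source.split('\n')
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) {
      const m = /^( *)- (.*)$/.exec(line);
      if (!m) throw new Error('Not a list item: ' + line);
      return { indent: m[1].length, text: m[2] };
    });
}

// Parser, grammar: list := item+ ; item := text list*
// Consumes tokens at exactly `indent`; deeper tokens recurse into a sub-list.
function parseList(tokens, state, indent) {
  const items = [];
  while (state.pos < tokens.length && tokens[state.pos].indent === indent) {
    const item = { text: tokens[state.pos].text, children: [] };
    state.pos += 1;
    while (state.pos < tokens.length && tokens[state.pos].indent > indent) {
      const nested = parseList(tokens, state, tokens[state.pos].indent);
      item.children = item.children.concat(nested);
    }
    items.push(item);
  }
  return items;
}

function parse(source) {
  const tokens = scan(source);
  const state = { pos: 0 };
  let items = [];
  // Keep going until every token is consumed, whatever the starting indent.
  while (state.pos < tokens.length) {
    items = items.concat(parseList(tokens, state, tokens[state.pos].indent));
  }
  return items;
}

const tree = parse('- a\n  - a1\n  - a2\n- b');
console.log(JSON.stringify(tree));
```

The recursion in `parseList` does the job that a recursive or conditional pattern would do in PCRE, while the scanner's per-line regex stays simple enough for the JS engine.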

