Syntax highlighting inside existing <pre> tags

Zap · Wednesday, 14 March 2007, 19:45

Hi everyone, I wanted to write a small script
that adds syntax highlighting to the <pre> blocks
in existing HTML files and then saves the files again.

The HTML files look roughly like this:

Code:

<html><head><title>Test</title></head>
<body>

<h1>Test</h1>

<p>Hier ist einfach ein wenig Text 
</p>
<pre>#!python
import sys

_names = sys.builtin_module_names

# Note:  more names are added to __all__ later.
__all__ = ["altsep", "curdir", "pardir", "sep", "pathsep", "linesep",
           "defpath", "name", "path", "devnull",
           "SEEK_SET", "SEEK_CUR", "SEEK_END"]

def _get_exports_list(module):
    try:
        return list(module.__all__)
    except AttributeError:
        return [n for n in dir(module) if n[0] != '_']
</pre>

<p>Hier ist auch noch Text 
</p><p>Und noch ein codeblock:
</p>
<pre>int main () {
   printf ("Hello World");
   return 0;
}
</pre>

</body>
</html>

To try it out, I put together this script:

Code:

import re

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

with open("test.html", "r") as infile:
    content = infile.read()

# Create a list of code-blocks in file
pre_blocks = re.findall("<pre>(.*)</pre>", content, re.S)

for code in pre_blocks:
    if "#!python" in code:
        lexer = get_lexer_by_name("python")
    else:
        lexer = get_lexer_by_name("c++")

    # using css later
    formatter = HtmlFormatter(noclasses=True)

    content_out = content.replace("<pre>%s</pre>" % code,
                                  highlight(code, lexer, formatter))

with open("out.html", "w") as outfile:
    outfile.write(content_out)

Unfortunately I'm not really fluent in regular expressions, and as written the example above only works with a single <pre> block per file.

The way I'm doing it at the moment, the <pre> of the Python code at the top is treated as the start and the </pre> of the C code as the end.

The output file:

Code:

<html><head><title>Test</title></head>
<body>

<h1>Test</h1>

<p>Hier ist einfach ein wenig Text 
</p>
<div class="highlight"><pre><span style="border: 1px solid #FF0000">!</span>python
<span style="color: #AA22FF; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">sys</span>

_names <span style="color: #666666">=</span> sys<span style="color: #666666">.</span>builtin_module_names

<span style="color: #008800; font-style: italic"># Note:  more names are added to __all__ later.</span>
__all__ <span style="color: #666666">=</span> [<span style="color: #BB4444">"altsep"</span>, <span style="color: #BB4444">"curdir"</span>, <span style="color: #BB4444">"pardir"</span>, <span style="color: #BB4444">"sep"</span>, <span style="color: #BB4444">"pathsep"</span>, <span style="color: #BB4444">"linesep"</span>,
           <span style="color: #BB4444">"defpath"</span>, <span style="color: #BB4444">"name"</span>, <span style="color: #BB4444">"path"</span>, <span style="color: #BB4444">"devnull"</span>,
           <span style="color: #BB4444">"SEEK_SET"</span>, <span style="color: #BB4444">"SEEK_CUR"</span>, <span style="color: #BB4444">"SEEK_END"</span>]

<span style="color: #AA22FF; font-weight: bold">def</span> <span style="color: #00A000">_get_exports_list</span>(module):
    <span style="color: #AA22FF; font-weight: bold">try</span>:
        <span style="color: #AA22FF; font-weight: bold">return</span> <span style="color: #AA22FF">list</span>(module<span style="color: #666666">.</span>__all__)
    <span style="color: #AA22FF; font-weight: bold">except</span> <span style="color: #D2413A; font-weight: bold">AttributeError</span>:
        <span style="color: #AA22FF; font-weight: bold">return</span> [n <span style="color: #AA22FF; font-weight: bold">for</span> n <span style="color: #AA22FF; font-weight: bold">in</span> <span style="color: #AA22FF">dir</span>(module) <span style="color: #AA22FF; font-weight: bold">if</span> n[<span style="color: #666666">0</span>] <span style="color: #666666">!=</span> <span style="color: #BB4444">'_'</span>]
<span style="color: #666666"></</span>pre<span style="color: #666666">></span>

<span style="color: #666666"><</span>p<span style="color: #666666">></span>Hier ist auch noch Text 
<span style="color: #666666"></</span>p<span style="color: #666666">><</span>p<span style="color: #666666">></span>Und noch ein codeblock:
<span style="color: #666666"></</span>p<span style="color: #666666">></span>
<span style="color: #666666"><</span>pre<span style="color: #666666">></span><span style="color: #AA22FF">int</span> main () {
   printf (<span style="color: #BB4444">"Hello World"</span>);
   <span style="color: #AA22FF; font-weight: bold">return</span> <span style="color: #666666">0</span>;
}
</pre></div>


</body>
</html>

Unfortunately I have no idea how else to do this.

Maybe someone has an idea what the regular expression would have to look like to work correctly? A long time ago I used DOM for XML files; would that be another approach?

BlackJack · Wednesday, 14 March 2007, 20:10

Quantifiers in regular expressions are greedy by default: they try to match as much as possible. Yours therefore matches everything between the first '<pre>' and the last '</pre>', which of course includes any opening and closing ``pre`` tags in between.

To match less, put a question mark after the asterisk: '<pre>(.*?)</pre>'.
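
A minimal sketch of the difference, using only the standard `re` module. The sample string `page` is made up for illustration:

```python
import re

# Hypothetical sample with two separate <pre> blocks.
page = "<pre>first</pre><p>text</p><pre>second</pre>"

# Greedy: .* runs on to the LAST </pre>, so both blocks land in one match.
greedy = re.findall(r"<pre>(.*)</pre>", page, re.S)

# Non-greedy: .*? stops at the FIRST </pre> that follows each <pre>.
lazy = re.findall(r"<pre>(.*?)</pre>", page, re.S)

print(greedy)
print(lazy)
```

The `re.S` flag is still needed in both cases so that the dot also matches the newlines inside a code block.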

However, I would use a module that can handle HTML or XML, because, for example, a less-than sign in the source code is encoded in the HTML as 'if x &lt; 10:'. And `pygments` probably won't recognize that as-is. The same applies at least to '&' -> '&amp;', which comes up fairly often in C source code.
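
If the regular-expression route is kept anyway, the entities could be decoded before the text reaches the lexer. A small sketch, assuming a current Python 3 where the standard `html` module is available (it did not exist in this form in 2007). The sample line is made up:

```python
import html

# What a <pre> block in the HTML file contains (entities still encoded).
encoded = "if x &lt; 10 &amp;&amp; y &gt; 0:"

# What the lexer should see: real '<', '>' and '&' characters.
source = html.unescape(encoded)

# Text going back into an HTML page has to be escaped again.
roundtrip = html.escape(source, quote=False)

print(source)
print(roundtrip)
```

As far as I know, Pygments' `HtmlFormatter` escapes these characters itself when it writes its output, so only the decoding step should be needed in front of `highlight()`.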

Zap · Wednesday, 14 March 2007, 21:48

First of all, thanks for the tip.

BlackJack wrote: because, for example, a less-than sign in the source code is encoded in the HTML as 'if x &lt; 10:'. And `pygments` probably won't recognize that as-is. The same applies at least to '&' -> '&amp;', which comes up fairly often in C source code.

Oh, that's bad, of course. I hadn't thought of that at all...
Which highlighting module could you recommend for this, or can I
prepare the string beforehand somehow so that pygments doesn't choke on it?

At the moment I have this problem:

Code:

in highlight_pre
    highlight(code, lexer, formatter)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 691: ordinal not in range(128)

It's caused by this string...

Code:

/*******************************
fopen() is used to open a file and associate it with a stream. filename must be a valid pathname.
Pathnames are absolute if they start with ‘/’, otherwise they are relative to the

It doesn't like the ` character...
Does anyone have an idea?
Can I achieve something with formatting options?
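
Byte 0x91 is the left curly quote ‘ in the Windows-1252 encoding, which would fit the quote characters in the comment shown above. A hedged sketch of decoding the bytes to text explicitly before they reach the highlighter. That the file really is Windows-1252 is an assumption, and the sample bytes are made up:

```python
# Made-up sample containing the cp1252 curly quotes 0x91 and 0x92.
raw = b"start with \x91/\x92, otherwise"

# Decoding as ASCII fails on 0x91, the same error as in the traceback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Assumption: the file was saved as Windows-1252 (cp1252).
text = raw.decode("cp1252")

print(ascii_ok)
print(text)
```

With the text decoded up front, nothing downstream has to fall back on the ASCII codec.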

PmanX · Wednesday, 14 March 2007, 22:55

Feed only the match from

Code:

<pre>(.*?)</pre>

to the highlighter.
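
One way to do that is `re.sub` with a callback, so that each block is replaced on its own in a single pass. This is only a sketch: `fake_highlight` is a made-up stand-in for the `highlight(code, lexer, formatter)` call from the first post, so that the example runs without Pygments installed.

```python
import html
import re

PRE_BLOCK = re.compile(r"<pre>(.*?)</pre>", re.S)

def fake_highlight(code):
    # Hypothetical stand-in for pygments.highlight(code, lexer, formatter).
    escaped = html.escape(code, quote=False)
    return '<div class="highlight"><pre>%s</pre></div>' % escaped

def replace_block(match):
    # match.group(1) is only the text between <pre> and </pre>.
    code = html.unescape(match.group(1))
    return fake_highlight(code)

page = "<p>a</p><pre>x &lt; 1</pre><p>b</p><pre>y &amp; 2</pre>"
result = PRE_BLOCK.sub(replace_block, page)
print(result)
```

Because `sub` builds the new page itself, no earlier replacement gets overwritten the way `content_out` does inside the loop of the original script.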

Y0Gi · Thursday, 15 March 2007, 12:24

As an alternative to regular expressions, you can also use BeautifulSoup, for example, to find the tags and replace the text node inside them.
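
BeautifulSoup is a third-party package, so as a stand-in the sketch below uses the standard library's `html.parser` to illustrate the same parser-based idea: the parser finds the `pre` elements and hands over their text with entities already decoded. This is not BeautifulSoup's API, and the class name `PreCollector` is made up for this example.

```python
from html.parser import HTMLParser

class PreCollector(HTMLParser):
    """Collect the decoded text of every <pre> element."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities arrive decoded.
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while outside a <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

collector = PreCollector()
collector.feed("<p>intro</p><pre>if x &lt; 10:\n    pass</pre><pre>a &amp; b</pre>")
collector.close()
print(collector.blocks)
```

The collected strings could then be passed to the highlighter directly, without a separate unescape step.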