Getting an idea of the structure of an XML document

__blackjack__

I recently had to deal with a web service that returns a fairly large (143 KiB, about 5,000 elements), undocumented XML document. For that I wrote the following small helper. It checks whether my assumption holds that there are no attributes and that only leaf elements contain text, and it prints all paths to leaf elements.

If an element has several child elements of the same type, the name is marked with a "*" in the path. It's not perfect, but it was helpful for writing code that converts the data relevant to me into a JSON structure.

Code:

#!/usr/bin/env python3
"""
Little helper to analyse the result of a webservice.

The service query returns an XML document with wrong XML declaration encoding
and an undocumented, quite large, and nested XML structure.

The elements have no attributes, and only leaf elements contain text other than
the occasional line break (or other whitespace).  The `visit()` function tests
for both constraints and raises a `ValueError` on violation.

The output is a sorted list of unique paths to leaf elements found in the
document.  Where a node has more than one child of a particular type, a "*" is
added to that name.

Here is a small example which also demonstrates two shortcomings:

>>> root = ET.fromstring('''\\
... <root>
...   <item>
...     <a>one a</a>
...     <a>another a</a>
...     <b>a single b</b>
...   </item>
...   <item>
...     <a>in this item there is just one a</a>
...   </item>
... </root>
... ''')
>>> count_elements(root)
7
>>> get_paths(root)
['./item*/a', './item*/a*', './item*/b']

The first two paths show not all ``<item>`` elements do contain multiple
``<a>`` elements.  And the result doesn't show that not *all* ``<item>``
elements contain a ``<b>`` element, i.e. that ``<b>`` is optional.
"""
from collections import Counter
from pathlib import Path
from xml.etree import ElementTree as ET


def count_elements(node):
    return 1 + sum(map(count_elements, node))


def visit(node, path="."):
    if node.attrib:
        raise ValueError(f"element has attribute(s): {node.attrib!r}")

    # ``len(node)`` instead of ``if node:`` because testing the truth value of
    # an element is deprecated.
    if len(node):
        # Text directly inside this element is its own ``text`` plus the
        # ``tail`` of each child.  Either may be `None`.
        texts = [node.text, *(sub_node.tail for sub_node in node)]
        text = "".join(text.strip() for text in texts if text)
        if text:
            sub_node_tags = ", ".join(f"<{sub_node.tag}>" for sub_node in node)
            raise ValueError(
                f"text content ({text!r}) in an element"
                f" (<{node.tag}>) with subelement(s) ({sub_node_tags})"
            )

        tag_to_count = Counter(sub_node.tag for sub_node in node)
        for sub_node in node:
            multiple = "*" if tag_to_count[sub_node.tag] > 1 else ""
            yield from visit(sub_node, f"{path}/{sub_node.tag}{multiple}")
    else:
        yield path


def get_paths(root):
    return sorted(set(visit(root)))


def main():
    root = ET.fromstring(Path("group.xml").read_text(encoding="utf-8"))
    print(count_elements(root), "elements.")
    for path in get_paths(root):
        print(path)


if __name__ == "__main__":
    main()
Example output:

Code:

5046 elements.
./Group/BaseCountry
./Group/FoundMonth
./Group/FoundYear
./Group/Founder/Handle/ID
./Group/Grouptypes/Grouptype*
./Group/ID
./Group/Member*/Handle/AKA
./Group/Member*/Handle/CurrentlyUsedHandle
./Group/Member*/Handle/FreelanceFunctions/Function
./Group/Member*/Handle/FreelanceFunctions/Function*
./Group/Member*/Handle/Handle
./Group/Member*/Handle/HandleStory
./Group/Member*/Handle/ID
./Group/Member*/Handle/Scener/Country
./Group/Member*/Handle/Scener/Handles/Handle*/AKA
./Group/Member*/Handle/Scener/Handles/Handle*/CurrentlyUsedHandle
./Group/Member*/Handle/Scener/Handles/Handle*/FreelanceFunctions/Function
./Group/Member*/Handle/Scener/Handles/Handle*/FreelanceFunctions/Function*
./Group/Member*/Handle/Scener/Handles/Handle*/Handle
./Group/Member*/Handle/Scener/Handles/Handle*/HandleStory
./Group/Member*/Handle/Scener/Handles/Handle*/ID
./Group/Member*/Handle/Scener/Handles/Handle*/Scener/ID
./Group/Member*/Handle/Scener/Handles/Handle/ID
./Group/Member*/Handle/Scener/ID
./Group/Member*/Handle/Scener/Trivia
./Group/Member*/JoinDay
./Group/Member*/JoinMonth
./Group/Member*/JoinYear
./Group/Member*/LeaveDay
./Group/Member*/LeaveMonth
./Group/Member*/LeaveYear
./Group/Member*/Profession
./Group/Member*/Profession*
./Group/Member*/Status
./Group/Name
./Group/Rating
./Group/Release*/Release/AKA
./Group/Release*/Release/Achievement/Compo
./Group/Release*/Release/Achievement/Place
./Group/Release*/Release/GfxType
./Group/Release*/Release/ID
./Group/Release*/Release/Name
./Group/Release*/Release/Rating
./Group/Release*/Release/ReleaseDay
./Group/Release*/Release/ReleaseMonth
./Group/Release*/Release/ReleaseYear
./Group/Release*/Release/ReleasedAt/Event/AKA
./Group/Release*/Release/ReleasedAt/Event/Address
./Group/Release*/Release/ReleasedAt/Event/City
./Group/Release*/Release/ReleasedAt/Event/Country
./Group/Release*/Release/ReleasedAt/Event/EndDay
./Group/Release*/Release/ReleasedAt/Event/EndMonth
./Group/Release*/Release/ReleasedAt/Event/EndYear
./Group/Release*/Release/ReleasedAt/Event/EventType
./Group/Release*/Release/ReleasedAt/Event/EventType*
./Group/Release*/Release/ReleasedAt/Event/ID
./Group/Release*/Release/ReleasedAt/Event/Name
./Group/Release*/Release/ReleasedAt/Event/StartDay
./Group/Release*/Release/ReleasedAt/Event/StartMonth
./Group/Release*/Release/ReleasedAt/Event/StartYear
./Group/Release*/Release/ReleasedAt/Event/State
./Group/Release*/Release/ReleasedAt/Event/Tagline
./Group/Release*/Release/ReleasedAt/Event/Website
./Group/Release*/Release/ReleasedAt/Event/Zip
./Group/Release*/Release/ScreenShot
./Group/Release*/Release/Type
./Group/Release*/Release/Website
./Group/Short
./Group/Slogan
./Group/Trivia
./Group/Website
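
The docstring above names two shortcomings: a path can show up both with and without "*", and optional elements are not recognisable. One possible way around both, as a minimal sketch rather than a finished tool, is to record for every element how often each child tag occurs and then report the minimum and maximum per path. The names `collect_child_counts()` and `get_cardinalities()` are made up for this illustration and are not part of the helper above. The attribute and text checks are left out here.

```python
from collections import Counter, defaultdict
from xml.etree import ElementTree as ET


def collect_child_counts(node, path, path_to_counters):
    """Store one child tag counter per element, grouped by element path."""
    path_to_counters[path].append(Counter(child.tag for child in node))
    for child in node:
        collect_child_counts(child, f"{path}/{child.tag}", path_to_counters)


def get_cardinalities(root):
    """Map each element path to ``(minimum, maximum)`` occurrences per parent."""
    path_to_counters = defaultdict(list)
    collect_child_counts(root, ".", path_to_counters)
    result = {}
    for path, counters in path_to_counters.items():
        # All child tags seen under *any* element with this path.
        tags = set().union(*counters)
        for tag in tags:
            # A missing tag counts as 0, which is what marks it as optional.
            occurrences = [counter[tag] for counter in counters]
            result[f"{path}/{tag}"] = (min(occurrences), max(occurrences))
    return result


if __name__ == "__main__":
    example = ET.fromstring(
        "<root>"
        "<item><a>one a</a><a>another a</a><b>a single b</b></item>"
        "<item><a>in this item there is just one a</a></item>"
        "</root>"
    )
    for path, (minimum, maximum) in sorted(get_cardinalities(example).items()):
        print(f"{path} {{{minimum},{maximum}}}")
```

For the small example from the docstring this prints `./item {2,2}`, `./item/a {1,2}` and `./item/b {0,1}`, so a minimum of 0 marks an optional element and a maximum above 1 a repeated one.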
kbr

Nice example of using generators recursively.
nezzcarth

For this kind of thing I use a feature of my XML editor, or 'xmlstarlet el' on the command line (https://xmlstar.sourceforge.net/doc/UG/ch03s02.html). The latter doesn't show text nodes, though. Nice one. :)
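
For readers without that tool at hand, a plain element listing of this kind can be approximated in a few lines of Python. This is only a rough sketch of the idea, not a reimplementation: the exact output format of the command line tool may differ, and `iter_element_paths()` is a name chosen just for this example.

```python
from xml.etree import ElementTree as ET


def iter_element_paths(node, parent_path=""):
    """Yield one slash separated path per element, in document order."""
    path = f"{parent_path}/{node.tag}" if parent_path else node.tag
    yield path
    for child in node:
        # Recursive generator: hand through everything the child yields.
        yield from iter_element_paths(child, path)


if __name__ == "__main__":
    example = ET.fromstring("<root><item><a>x</a><b>y</b></item></root>")
    for path in iter_element_paths(example):
        print(path)
```

Wrapping the call in `sorted(set(...))` would reduce the listing to unique paths, similar to what `get_paths()` in the first post does for leaf elements.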
__blackjack__

Ah, I had used ``xmlstarlet`` here too: to format (``fo``) the response before writing it to a file. I wasn't aware of ``el``.
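
If the formatting step should happen in Python as well, `ET.indent()` from the standard library (Python 3.9 or newer) can do something similar. A minimal sketch, assuming the response is already available as a string; `format_xml()` is a hypothetical helper name, and details such as indentation width or the XML declaration will not necessarily match what ``xmlstarlet fo`` writes.

```python
from xml.etree import ElementTree as ET


def format_xml(xml_text, space="  "):
    """Return *xml_text* re-serialised with nested elements indented."""
    root = ET.fromstring(xml_text)
    # Adjusts whitespace-only ``text`` and ``tail`` values in place.
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(format_xml("<root><item><a>x</a></item></root>"))
```

Because only whitespace between elements is changed, the paths reported by the helper in the first post should be the same before and after formatting.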