Regex - word-boundary

cuddlePanda · Thursday, 26 August 2021, 15:47

Dear forum,
I want to find out whether a string contains the character sequence 'id' either on its own or set off by a separator, but not as part of a word. So
'id' should match, and likewise e.g. 'customer_id', but not 'middle'.
First I tried '\bid\b', which did not work (see the code example below and its output).
Then I built a word boundary of my own: '(.*[^a-zA-Z])?id([^a-zA-Z].*)?'
The interesting phenomenon: with re.fullmatch it works for all the cases I looked at, but with plain re.match, 'identity' for example is wrongly reported as a match (well, wrongly by my understanding; I may be making a reasoning error here, and perhaps someone can help me with that).

My code looks like this:
--------------------------------------------------------------------------
import re
import pytest

test_string_01 = 'id'
test_string_02 = 'mid'
test_string_03 = 'middle'
test_string_04 = 'identity'
test_string_05_b = 'customer id'
test_string_05 = 'customer_id'
test_string_06_b = 'id validation'
test_string_06 = 'id_validation'
test_string_07 = 'some_id_included'
test_string_07_b = 'some id included'
test_string_08 = 'in_the_middle_id_more'
test_string_08_b = 'in the middle id more'

match_pattern = '\bid\b'
print("matching test_string_01 '" + test_string_01 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_01)))
print("matching test_string_02 '" + test_string_02 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_02)))
print("matching test_string_03 '" + test_string_03 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_03)))
print("matching test_string_04 '" + test_string_04 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_04)))
print("matching test_string_05 '" + test_string_05 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05)))
print("matching test_string_05_b '" + test_string_05_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05_b)))
print("matching test_string_06 '" + test_string_06 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06)))
print("matching test_string_06_b '" + test_string_06_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06_b)))
print("matching test_string_07 '" + test_string_07 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07)))
print("matching test_string_07_b '" + test_string_07_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07_b)))
print("matching test_string_08 '" + test_string_08 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08)))
print("matching test_string_08_b '" + test_string_08_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08_b)))

match_pattern = '(.*[^a-zA-Z])?id([^a-zA-Z].*)?'

print("matching test_string_01 '" + test_string_01 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_01)))
print("matching test_string_02 '" + test_string_02 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_02)))
print("matching test_string_03 '" + test_string_03 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_03)))
print("matching test_string_04 '" + test_string_04 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_04)))
print("matching test_string_05 '" + test_string_05 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05)))
print("matching test_string_05_b '" + test_string_05_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05_b)))
print("matching test_string_06 '" + test_string_06 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06)))
print("matching test_string_06_b '" + test_string_06_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06_b)))
print("matching test_string_07 '" + test_string_07 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07)))
print("matching test_string_07_b '" + test_string_07_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07_b)))
print("matching test_string_08 '" + test_string_08 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08)))
print("matching test_string_08_b '" + test_string_08_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08_b)))

print("\n ### +++ Now the same thing with 'fullmatch'... +++ ###\n")
print("matching test_string_01 '" + test_string_01 + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_01)))
print("matching test_string_02 '" + test_string_02 + "' (exp.: No Match) " + str(re.fullmatch(match_pattern, test_string_02)))
print("matching test_string_03 '" + test_string_03 + "' (exp.: No Match) " + str(re.fullmatch(match_pattern, test_string_03)))
print("matching test_string_04 '" + test_string_04 + "' (exp.: No Match) " + str(re.fullmatch(match_pattern, test_string_04)))
print("matching test_string_05 '" + test_string_05 + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_05)))
print("matching test_string_05_b '" + test_string_05_b + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_05_b)))
print("matching test_string_06 '" + test_string_06 + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_06)))
print("matching test_string_06_b '" + test_string_06_b + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_06_b)))
print("matching test_string_07 '" + test_string_07 + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_07)))
print("matching test_string_07_b '" + test_string_07_b + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_07_b)))
print("matching test_string_08 '" + test_string_08 + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_08)))
print("matching test_string_08_b '" + test_string_08_b + "' (exp.: Match) " + str(re.fullmatch(match_pattern, test_string_08_b)))
--------------------------------------------------------------------------
and the output like this:
======================================
matching test_string_01 'id' (exp.: Match) None
matching test_string_02 'mid' (exp.: No Match) None
matching test_string_03 'middle' (exp.: No Match) None
matching test_string_04 'identity' (exp.: No Match) None
matching test_string_05 'customer_id' (exp.: Match) None
matching test_string_05_b 'customer id' (exp.: Match) None
matching test_string_06 'id_validation' (exp.: Match) None
matching test_string_06_b 'id validation' (exp.: Match) None
matching test_string_07 'some_id_included' (exp.: Match) None
matching test_string_07_b 'some id included' (exp.: Match) None
matching test_string_08 'in_the_middle_id_more' (exp.: Match) None
matching test_string_08_b 'in the middle id more' (exp.: Match) None
matching test_string_01 'id' (exp.: Match) <re.Match object; span=(0, 2), match='id'>
matching test_string_02 'mid' (exp.: No Match) None
matching test_string_03 'middle' (exp.: No Match) None
matching test_string_04 'identity' (exp.: No Match) <re.Match object; span=(0, 2), match='id'>
matching test_string_05 'customer_id' (exp.: Match) <re.Match object; span=(0, 11), match='customer_id'>
matching test_string_05_b 'customer id' (exp.: Match) <re.Match object; span=(0, 11), match='customer id'>
matching test_string_06 'id_validation' (exp.: Match) <re.Match object; span=(0, 13), match='id_validation'>
matching test_string_06_b 'id validation' (exp.: Match) <re.Match object; span=(0, 13), match='id validation'>
matching test_string_07 'some_id_included' (exp.: Match) <re.Match object; span=(0, 16), match='some_id_included'>
matching test_string_07_b 'some id included' (exp.: Match) <re.Match object; span=(0, 16), match='some id included'>
matching test_string_08 'in_the_middle_id_more' (exp.: Match) <re.Match object; span=(0, 21), match='in_the_middle_id_more'>
matching test_string_08_b 'in the middle id more' (exp.: Match) <re.Match object; span=(0, 21), match='in the middle id more'>

### +++ Now the same thing with 'fullmatch'... +++ ###

matching test_string_01 'id' (exp.: Match) <re.Match object; span=(0, 2), match='id'>
matching test_string_02 'mid' (exp.: No Match) None
matching test_string_03 'middle' (exp.: No Match) None
matching test_string_04 'identity' (exp.: No Match) None
matching test_string_05 'customer_id' (exp.: Match) <re.Match object; span=(0, 11), match='customer_id'>
matching test_string_05_b 'customer id' (exp.: Match) <re.Match object; span=(0, 11), match='customer id'>
matching test_string_06 'id_validation' (exp.: Match) <re.Match object; span=(0, 13), match='id_validation'>
matching test_string_06_b 'id validation' (exp.: Match) <re.Match object; span=(0, 13), match='id validation'>
matching test_string_07 'some_id_included' (exp.: Match) <re.Match object; span=(0, 16), match='some_id_included'>
matching test_string_07_b 'some id included' (exp.: Match) <re.Match object; span=(0, 16), match='some id included'>
matching test_string_08 'in_the_middle_id_more' (exp.: Match) <re.Match object; span=(0, 21), match='in_the_middle_id_more'>
matching test_string_08_b 'in the middle id more' (exp.: Match) <re.Match object; span=(0, 21), match='in the middle id more'>
======================================
(By the way: is there a customary way in this community to set code and output snippets clearly apart from the surrounding text? I simply used --- and === here, but maybe there is something better.)

To summarize once more:
(1) Why does it not work with \b, so that I get no matches at all?
(2) With the given match_pattern I would have expected re.match and re.fullmatch to behave the same (after all, I am offering a pattern that could in principle cover the whole string). I could still understand re.match failing to match something that re.fullmatch matches, but here it is exactly the other way round. What have I misunderstood?
I am grateful for any hint.
cuddle Panda

cuddlePanda · Thursday, 26 August 2021, 15:57

I just noticed that I forgot to 'double-escape' the word boundary...
so I changed the pattern in the code to
'\\bid\\b'
That brought some changes, but it still did not really produce the expected result...
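
A note on the escaping, as a minimal sketch: in an ordinary Python string literal, '\b' is the backspace character, so the first pattern was looking for a literal backspace before and after 'id'. A raw string (r'...') should be equivalent to the doubled backslashes and is usually easier to read. The variable names below are only illustrative.

```python
import re

plain = '\bid\b'       # '\b' is the backspace character (U+0008) here
escaped = '\\bid\\b'   # backslash + 'b': the regex word-boundary token
raw = r'\bid\b'        # raw string: same pattern, no doubling needed

print(repr(plain))     # shows the two backspace characters as '\x08'
print(escaped == raw)  # the escaped and the raw spelling are the same string

# With the real word-boundary token, 'id' is found between the spaces.
found = re.search(raw, 'some id included')
print(found)
```

This only explains why the very first run produced no matches at all; it does not yet address the underscore cases.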

---------------------------------------------------------------------
match_pattern = '\\bid\\b'
print("matching test_string_01 '" + test_string_01 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_01)))
print("matching test_string_02 '" + test_string_02 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_02)))
print("matching test_string_03 '" + test_string_03 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_03)))
print("matching test_string_04 '" + test_string_04 + "' (exp.: No Match) " + str(re.match(match_pattern, test_string_04)))
print("matching test_string_05 '" + test_string_05 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05)))
print("matching test_string_05_b '" + test_string_05_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_05_b)))
print("matching test_string_06 '" + test_string_06 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06)))
print("matching test_string_06_b '" + test_string_06_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_06_b)))
print("matching test_string_07 '" + test_string_07 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07)))
print("matching test_string_07_b '" + test_string_07_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_07_b)))
print("matching test_string_08 '" + test_string_08 + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08)))
print("matching test_string_08_b '" + test_string_08_b + "' (exp.: Match) " + str(re.match(match_pattern, test_string_08_b)))
---------------------------------------------------------------------
now yields:
======================================
matching test_string_01 'id' (exp.: Match) <re.Match object; span=(0, 2), match='id'>
matching test_string_02 'mid' (exp.: No Match) None
matching test_string_03 'middle' (exp.: No Match) None
matching test_string_04 'identity' (exp.: No Match) None
matching test_string_05 'customer_id' (exp.: Match) None
matching test_string_05_b 'customer id' (exp.: Match) None
matching test_string_06 'id_validation' (exp.: Match) None
matching test_string_06_b 'id validation' (exp.: Match) <re.Match object; span=(0, 2), match='id'>
matching test_string_07 'some_id_included' (exp.: Match) None
matching test_string_07_b 'some id included' (exp.: Match) None
matching test_string_08 'in_the_middle_id_more' (exp.: Match) None
matching test_string_08_b 'in the middle id more' (exp.: Match) None
======================================
so there is a match if id stands alone, or if it is at the beginning and separated from the rest of the text by a space. There are no matches if it is not at the beginning of the text or if it is set off by an underscore _.
Maybe this sheds more light (or even more confusion) on the matter?

Thanks again in advance for an answer.

Cuddle Panda

sparrow · Thursday, 26 August 2021, 16:09

So, if I understand you correctly:

Code:

text == "id" or text.startswith("id ") or text.endswith(" id") or " id " in text

Instead of bringing lots of "tests" into the code via copy and paste, you should use loops. Numbered variable names are a clear sign of this: when you do that, what you actually want is a data structure.

Do not piece strings together with +. Use string formatting.
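
A possible sketch of both suggestions, assuming the test strings are kept as (text, expected) pairs in a list and the output lines are built with f-strings. The names CASES, PATTERN and report are made up for illustration; the pattern and the use of re.fullmatch are taken from the original post.

```python
import re

# (text, expected) pairs in one list instead of numbered variables
CASES = [
    ("id", True),
    ("mid", False),
    ("middle", False),
    ("identity", False),
    ("customer_id", True),
    ("some id included", True),
]

# The self-built pattern from the original post
PATTERN = re.compile(r"(.*[^a-zA-Z])?id([^a-zA-Z].*)?")


def report(cases, pattern):
    """Return one formatted output line per test string."""
    lines = []
    for text, expected in cases:
        found = pattern.fullmatch(text) is not None
        lines.append(f"matching {text!r} (exp.: {expected}) found: {found}")
    return lines


for line in report(CASES, PATTERN):
    print(line)
```

Adding a further test string then only means adding one more pair to the list.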

narpfel · Thursday, 26 August 2021, 16:42

@cuddlePanda: You are looking for `re.search` instead of `re.match`. `match` only looks at the beginning of the string (so it has an implicit `^` in the regex); `fullmatch` additionally has an implicit `$` at the end. See also `search` vs. `match`.
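
The difference can be tried out with a short sketch like the following, which reuses one test string and the self-built pattern from the original post. The variable names are only illustrative.

```python
import re

pattern = r"\bid\b"
text = "some id included"

at_start = re.match(pattern, text)   # only tries position 0
anywhere = re.search(pattern, text)  # scans the whole string
print(at_start)  # None: the string does not start with 'id'
print(anywhere)  # match object for the 'id' between the spaces

# Both groups of the self-built pattern are optional, so match() is
# satisfied with the prefix 'id' of 'identity'; fullmatch() also has to
# consume the rest of the string, which the pattern cannot do.
own = r"(.*[^a-zA-Z])?id([^a-zA-Z].*)?"
prefix = re.match(own, "identity")
whole = re.fullmatch(own, "identity")
print(prefix)
print(whole)  # None
```

That would also explain question (2): `match` accepts a match that ends before the end of the string, while `fullmatch` does not.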

For code there are code blocks (the </> button, or [ code][ /code] without the spaces).

If these are tests, I would not write a loop for them but use `pytest.mark.parametrize`.

cuddlePanda · Thursday, 26 August 2021, 16:54

Dear Sparrow,
first of all, many thanks for your quick reply...
First, regarding your suggestion with text == etc...
In principle yes, except that spaces, underscores, dashes etc. should also be possible. Well, that could be put together too, but at the moment it seems a bit complicated to me... Maybe it is less bad than it looks...
Regarding the copy and paste in the code: I threw this together quickly because I just wanted to try it out briefly. Yes, it may not be entirely professional; I will do better.
And regarding the string concatenation... Unfortunately I am still fairly new to Python (I used to work with other languages; in FORTRAN that was the only way at all, and in Java I always did it that way too)... here too I hope to become more professional and more 'pythonic' over time. I hope these teething troubles will be forgiven a little.
(I already realized in an earlier question that I still have a lot to learn about strings.)

Maybe I will still manage to get my example into somewhat better shape, but I also have to think of my 'main tasks'
In any case, thanks again.
Cuddle Panda

cuddlePanda · Thursday, 26 August 2021, 16:57

Thanks to narpfel as well; I will also look through the suggestions more closely.
Have a nice evening
Cuddle Panda

__blackjack__ · Thursday, 26 August 2021, 18:42

@cuddlePanda: The problem with \b is that the underscore does not count as a word boundary there, because `_` is itself a word character (`\w`).
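
A small sketch of that behaviour, assuming `re.search` is used as suggested above. The strings with a space and an underscore come from the original post; 'customer-id' is an added example for the dashes mentioned earlier.

```python
import re

pattern = re.compile(r"\bid\b")

# Space and dash are non-word characters, so \b matches next to them.
# The underscore belongs to \w, so there is no boundary between '_' and 'i'.
results = {
    text: pattern.search(text) is not None
    for text in ["customer id", "customer-id", "customer_id"]
}
for text, found in results.items():
    print(f"{text!r}: {found}")

underscore_is_word_char = re.fullmatch(r"\w", "_") is not None
print(underscore_is_word_char)
```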

This is how it works:

Code:

import re

import pytest


@pytest.mark.parametrize(
    "text, expected",
    [
        ("id", True),
        ("mid", False),
        ("middle", False),
        ("identity", False),
        ("customer id", True),
        ("customer_id", True),
        ("id validation", True),
        ("id_validation", True),
        ("some_id_included", True),
        ("some id included", True),
        ("in_the_middle_id_more", True),
        ("in the middle id more", True),
    ],
)
def test_id_pattern(text, expected):
    assert bool(re.search(r"(^|[^a-zA-Z])id([^a-zA-Z]|$)", text)) == expected

Sirius3 · Thursday, 26 August 2021, 19:23

@__blackjack__: but that now only works for English words as well; I cannot think of a word right now that has umlauts before and after it, but who knows every language.
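
To illustrate the point with a made-up token (not a real word as far as I know): since 'ä' is not in a-zA-Z, the character class treats it like a separator. A minimal sketch:

```python
import re

# Pattern from the post above
ascii_pattern = re.compile(r"(^|[^a-zA-Z])id([^a-zA-Z]|$)")

# 'äidä' is an invented example; 'aida' is its ASCII counterpart.
umlaut_hit = ascii_pattern.search("äidä") is not None
ascii_hit = ascii_pattern.search("aida") is not None
print(umlaut_hit)  # True: 'ä' counts as a non-letter for [^a-zA-Z]
print(ascii_hit)   # False: 'id' is surrounded by ASCII letters
```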

narpfel · Thursday, 26 August 2021, 20:08

`(\b|_)id(\b|_)`?
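
A quick check of this pattern against the test strings from the thread, as a sketch with `re.search`. The last entry, 'äidä', is an invented token added only to show that `\b` on `str` patterns also treats non-ASCII letters as word characters.

```python
import re

pattern = re.compile(r"(\b|_)id(\b|_)")

cases = [
    ("id", True),
    ("mid", False),
    ("middle", False),
    ("identity", False),
    ("customer id", True),
    ("customer_id", True),
    ("id validation", True),
    ("id_validation", True),
    ("some_id_included", True),
    ("some id included", True),
    ("in_the_middle_id_more", True),
    ("in the middle id more", True),
    ("äidä", False),  # invented token, not from the thread
]

# Collect every case where the pattern disagrees with the expectation.
failures = [
    text
    for text, expected in cases
    if (pattern.search(text) is not None) != expected
]
print(failures)
```

Note that, unlike `[^a-zA-Z]`, this variant treats digits as part of a word, so a string such as 'id2' would not match.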

__blackjack__ · Thursday, 26 August 2021, 21:41

@Sirius3: I took the "[^a-zA-Z]" over from the OP as is and assumed that he knows what he wants.

@narpfel: That is of course also a possibility.

cuddlePanda · Friday, 27 August 2021, 09:13

Many thanks to everyone.
@Sirius3: my data sets are currently in English only. Should I ever need something else, I would have to take another look at this anyway.

cuddlePanda · Friday, 27 August 2021, 10:17

@narpfel: Many thanks also for the pointer to pytest.mark.parametrize. That makes life (and testing) considerably easier, of course...