https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
RegEx match open tags except XHTML self-contained tags
Asked 11 years, 3 months ago. Active 3 months ago. Viewed 3.1m times.
1675 6674
Locked. Comments on this question have been disabled, but it is still accepting new answers and other interactions.
I need to match all of these opening tags:
But not these:
I came up with this and wanted to make sure I've got it right. I am
only capturing the a-z.
<([a-z]+) *[^/]*?>
I believe it says:
* Find a less-than, then
* Find (and capture) a-z one or more times, then
* Find zero or more spaces, then
* Find any character zero or more times, greedy, except /, then
* Find a greater-than
Do I have that right? And more importantly, what do you think?
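One way to check this reading is to run the pattern against a few inputs. The sketch below uses Python's re module; the sample strings and the helper name match_tags are illustrative assumptions, not taken from the question. Note that `*?` is the lazy form of the quantifier, so the pattern as written stops at the first `>` it can reach rather than the last.

```python
import re

# The pattern from the question, copied verbatim.
PATTERN = re.compile(r'<([a-z]+) *[^/]*?>')

def match_tags(text):
    """Return (full match, captured tag name) pairs for every match in text."""
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]

# Hypothetical sample inputs, chosen only for illustration.
samples = [
    '<p>',                 # plain opening tag
    '<br />',              # self-closing tag: the "/" blocks the match
    '<b>x<i>',             # lazy "*?" yields two short matches, not one long one
    '<a href="/home">',    # opening tag that is missed because its value contains "/"
]

for sample in samples:
    print(repr(sample), '->', match_tags(sample))
```

The last sample suggests one weakness of excluding `/` everywhere: an opening tag whose attribute value contains a slash is rejected along with the self-closing tags.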
html regex xhtml
edited May 26 '12 at 20:37
community wiki
11 revs, 7 users 58%
Jeff
36 Answers
4413
Locked. There are disputes about this answer's content being resolved
at this time. It is not currently accepting new interactions.
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML.
As I have answered in HTML-and-regex questions here so many times
before, the use of regex will not allow you to consume HTML. Regular
expressions are a tool that is insufficiently sophisticated to
understand the constructs employed by HTML. HTML is not a regular
language and hence cannot be parsed by regular expressions. Regex
queries are not equipped to break down HTML into its meaningful
parts. so many times but it is not getting to me. Even enhanced
irregular regular expressions as used by Perl are not up to the task
of parsing HTML. You will never make me crack. HTML is a language of
sufficient complexity that it cannot be parsed by regular
expressions. Even Jon Skeet cannot parse HTML using regular
expressions. Every time you attempt to parse HTML with regular
expressions, the unholy child weeps the blood of virgins, and Russian
hackers pwn your webapp. Parsing HTML with regex summons tainted
souls into the realm of the living. HTML and regex go together like
love, marriage, and ritual infanticide. The
'.match(/<.([^r>][^>]*)?>/g) - imma May 22
'14 at 16:14
511
There are people that will tell you that the Earth is round (or
perhaps that the Earth is an oblate spheroid if they want to use
strange words). They are lying.
There are people that will tell you that Regular Expressions
shouldn't be recursive. They are limiting you. They need to subjugate
you, and they do it by keeping you in ignorance.
You can live in their reality or take the red pill.
Like Lord Marshal (is he a relative of the Marshal .NET class?), I
have seen the [S:Underverse:S] Stack Based Regex-Verse and returned
with [S:powers:S] knowledge you can't imagine. Yes, I think there
were an Old One or two protecting them, but they were watching
football on the TV, so it wasn't difficult.
I think the XML case is quite simple. The RegEx (in the .NET syntax),
deflated and coded in base64 to make it easier to comprehend by your
feeble mind, should be something like this:
7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=
The option to set is RegexOptions.ExplicitCapture. The capture group
you are looking for is ELEMENTNAME. If the capture group ERROR is not
empty then there was a parsing error and the Regex stopped.
If you have problems reconverting it to a human-readable regex, this
should help:
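using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Decodes a Base64 string, inflates the raw DEFLATE data, and returns it as UTF-8 text.
static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);
    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream())
    {
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress))
        {
            ds.CopyTo(msOut);
        }
        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}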
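For readers without a .NET toolchain, the same decoding step can be sketched in Python. This is a minimal sketch, assuming the blob is a raw DEFLATE stream (which is what .NET's DeflateStream works with); the function names from_base64 and to_base64 are hypothetical, and to_base64 exists only so the round trip can be checked.

```python
import base64
import zlib

def from_base64(data: str) -> str:
    """Base64-decode, inflate as raw DEFLATE, and decode as UTF-8."""
    raw = base64.b64decode(data)  # line breaks in the input are ignored by default
    # wbits=-15 selects a raw DEFLATE stream with no zlib or gzip header.
    return zlib.decompress(raw, -15).decode("utf-8")

def to_base64(text: str) -> str:
    """Inverse helper: deflate as raw DEFLATE, then Base64-encode."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")

# Round trip on a small string; the blob from the answer can be passed
# to from_base64 in the same way.
print(from_base64(to_base64("<a>text</a>")))
```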
If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It
WILL work. I've built tons of unit tests to test it, and I have even
used (part of) the conformance tests. It's a tokenizer, not a
full-blown parser, so it will only split the XML into its component
tokens. It won't parse/integrate DTDs.
Oh... if you want the source code of the regex, with some auxiliary
methods:
regex to tokenize an xml or the full plain regex
edited Aug 14 '20 at 6:32
community wiki
12 revs, 10 users 70%
xanatos
30
* 70
Good Lord, it's massive. My biggest question is why? You realize
that all modern languages have XML parsers, right? You can do all
that in like 3 lines and be sure it'll work. Furthermore, do you
also realize that pure regex is provably unable to do certain
things? Unless you've created a hybrid regex/imperative code
parser, but it doesn't look like you have. Can you compress
random data as well? - Justin Morgan Mar 8 '11 at 15:23
* 117
@Justin I don't need a reason. It could be done (and it wasn't
illegal/immoral), so I have done it. There are no limitations to
the mind except those we acknowledge (Napoleon Hill)... Modern
languages can parse XML? Really? And I thought that THAT was
illegal! :-) - xanatos Mar 8 '11 at 15:31
* 83
Sir, I'm convinced. I'm going to use this code as part of the
kernel for my perpetual-motion machine--can you believe those
fools at the patent office keep rejecting my application? Well,
I'll show them. I'll show them all! - Justin Morgan Mar 8 '11 at
17:55
* 31
@Justin So an Xml Parser is by definition bug free, while a Regex
isn't? Because if an Xml Parser isn't bug free by definition
there could be an xml that make it crash and we are back to step
0. Let say this: both the Xml Parser and this Regex try to be
able to parse all the "legal" XML. They CAN parse some "illegal"
XML. Bugs could crash both of them. C# XmlReader is surely more
tested than this Regex. - xanatos Mar 9 '11 at 15:08
* 32
No, nothing is bug free: 1) All programs contain at least one
bug. 2) All programs contain at least one line of unnecessary
source code. 3) By #1 and #2 and using logical induction, it's a
simple matter to prove that any program can be reduced to a
single line of code with a bug. (from Learning Perl) - Scott
Weaver Feb 16 '12 at 0:53
304
In shell, you can parse HTML using sed:
1. Turing.sed
2. Write HTML parser (homework)
3. ???
4. Profit!
---------------------------------------------------------------------
Related (why you shouldn't use regex match):
* If You Like Regular Expressions So Much, Why Don't You Marry
Them?
* Regular Expressions: Now You Have Two Problems
* Hacking stackoverflow.com's HTML sanitizer
edited Apr 23 '19 at 16:44
community wiki
10 revs, 7 users 43%
kenorb
7
* 3
I'm afraid you did not get the joke, @kenorb. Please, read the
question and the accepted answer once more. This is not about
HTML parsing tools in general, nor about HTML parsing shell
tools, it's about parsing HTML via regexes. - Palec Oct 13 '15 at
8:12
* 1
No, @Abdul. It is completely, provably (in the mathematical
sense) impossible. - Palec Mar 24 '17 at 13:24
* 4
Yes, that answer summarizes it well, @Abdul. Note that, however,
regex implementations are not really regular expressions in the
mathematical sense -- they have constructs that make them
stronger, often Turing-complete (equivalent to Type 0 grammars).
The argument breaks with this fact, but is still somewhat valid
in the sense that regexes were never meant to be capable of doing
such a job, though. - Palec Mar 24 '17 at 14:24
* 2
And by the way, the joke I referred to was the content of this
answer before kenorb's (radical) edits, specifically revision 4,
@Abdul. - Palec Mar 24 '17 at 14:26
* 5
The funny thing is that OP never asked to parse html using regex.
He asked to match text (which happens to be HTML) using regex.
Which is perfectly reasonable. - Paralife Mar 29 '18 at 15:29
279
I agree that the right tool to parse XML and especially HTML is a
parser and not a regular expression engine. However, like others have
pointed out, sometimes using a regex is quicker, easier, and gets the
job done if you know the data format.
Microsoft actually has a section of Best Practices for Regular
Expressions in the .NET Framework and specifically talks about
Consider[ing] the Input Source.
Regular Expressions do have limitations, but have you considered the
following?
The .NET framework is unique when it comes to regular expressions in
that it supports Balancing Group Definitions.
* See Matching Balanced Constructs with .NET Regular Expressions
* See .NET Regular Expressions: Regex and Balanced Matching
* See Microsoft's docs on Balancing Group Definitions
For this reason, I believe you CAN parse XML using regular
expressions. Note however, that it must be valid XML (browsers are
very forgiving of HTML and allow bad XML syntax inside HTML). This is
possible since the "Balancing Group Definition" will allow the
regular expression engine to act as a PDA.
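To make the pushdown idea concrete: Python's re module has no balancing groups, so the sketch below keeps the stack explicitly instead of inside the regex engine. It illustrates what the stack buys you, not the .NET feature itself. The tag pattern and the name is_balanced are assumptions made for this example, and the pattern does not handle a ">" inside an attribute value, comments, or CDATA.

```python
import re

# Simplified tag recognizer: optional leading "/", a name, then anything up
# to ">" with an optional trailing "/" for self-closing tags.
TAG = re.compile(r'<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>')

def is_balanced(xml: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for match in TAG.finditer(xml):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                      # <br/> neither pushes nor pops
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)            # push on every opening tag
    return not stack                      # leftovers mean unclosed tags

print(is_balanced('<a><b></b></a>'))      # properly nested
print(is_balanced('<a><b></a></b>'))      # crossed nesting
```

A balancing group plays the role of `stack.append` and `stack.pop` here, and the final `not stack` check corresponds to asserting that the group is empty at the end of the match.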
Quote from article 1 cited above:
.NET Regular Expression Engine
As described above properly balanced constructs cannot be
described by a regular expression. However, the .NET regular
expression engine provides a few constructs that allow balanced
constructs to be recognized.
+ (?)
(?>
|
<[^>]*/> |
(?
) # match start with
# atomic group / don't backtrack (faster)
| # match xml / html comment
<[^>]*/> | # self closing tag
(?
although it actually came out like this:
Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The
Cthulhu Way. Funny enough, it cites the answer to this question that
currently has over 4k votes.
edited Feb 23 '20 at 4:43
Callum Watkins
answered Sep 27 '11 at 4:01
Sam
5
* 18
System.Text is not part of C#. It's part of .NET. - John Saunders
Feb 2 '12 at 19:07
* 8
In the first line of your regex ((?=
) # match start with
- Scheintod Sep 27 '13 at 17:05
* 3
@Scheintod Thank you for the comment. I updated the code. The
previous expression failed for self closing tags that had a /
somewhere inside which failed for your
html. - Sam Sep 27 '13 at 19:00
261
I suggest using QueryPath for parsing XML and HTML in PHP. It's
basically much the same syntax as jQuery, only it's on the server
side.
edited May 12 '15 at 18:54
community wiki
4 revs, 4 users 57%
John Fiala
4
* 8
@Kyle--jQuery does not parse XML, it uses the client's built-in
parser (if there is one). Therefore you do not need jQuery to do
it, but as little as two lines of plain old JavaScript. If there
is no built-in parser, jQuery will not help. - RobG Oct 31 '13 at
6:25
* 1
@RobG Actually jQuery uses the DOM, not the built-in parser. -
Qix - MONICA WAS MISTREATED Sep 22 '14 at 3:49
* 11
@Qix--you'd better tell the authors of the documentation then: "
jQuery.parseXML uses the native parsing function of the browser...
". Source: jQuery.parseXML() - RobG Sep 22 '14 at 5:01
* 6
Having come here from the meme question (meta.stackexchange.com/
questions/19478/the-many-memes-of-meta/...), I love that one of the
answers is 'Use jQuery' - Jorn Apr 1 '16 at 21:09
224
While the answers that you can't parse HTML with regexes are correct,
they don't apply here. The OP just wants to parse one HTML tag with
regexes, and that is something that can be done with a regular
expression.
The suggested regex is wrong, though:
<([a-z]+) *[^/]*?>
If you add something to the regex, by backtracking it can be forced
to match silly things like >, [^/] is too permissive. Also note
that
and check out.
title
Fine, thanks.
';
// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
// print the result:
print_r($matches[0]);
?>
To test it thoroughly, I included self-closing tags in the string, such as:
1.
2.
3.
I also entered tags with:
1. one attribute
2. more than one attribute
3. attributes whose values are enclosed in either single quotes or
double quotes
4. attributes containing single quotes when the delimiter is a
double quote and vice versa
5. "unpretty" attributes with a space before the "=" symbol, after
it and both before and after it.
Should you find something that does not work in the proof of concept
above, I am happy to analyze the code to improve my skills.
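The attribute cases listed in this answer can also be exercised with a small Python sketch. This is not the pattern from the PHP proof of concept; OPEN_TAG and find_open_tags are hypothetical names, and the pattern is an assumption that covers only quoted attribute values and valueless attributes. It ignores comments, CDATA, script contents, and unquoted values.

```python
import re

OPEN_TAG = re.compile(r"""
    <([A-Za-z][A-Za-z0-9]*)         # tag name (captured)
    (?:
        \s+[^\s"'<>/=]+             # attribute name
        (?:
            \s*=\s*                 # optional spaces around the equals sign
            (?:"[^"]*"|'[^']*')     # double-quoted or single-quoted value
        )?
    )*
    \s*>                            # closing bracket with no slash before it
""", re.VERBOSE)

def find_open_tags(html: str) -> list:
    """Return the names of opening tags that are not self-closing."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]

# Hypothetical inputs mirroring the cases in the list above.
cases = [
    '<p>',
    '<br />',
    '<hr class="foo" />',
    '<a href="x/y" title=\'t\'>',
    "<p title = 'say \"hi\"' >",
    '<a title="it\'s">',
]
for case in cases:
    print(repr(case), '->', find_open_tags(case))
```

Because each quoted value is consumed as a unit, a slash or the other kind of quote inside a value does not affect whether the tag counts as self-closing.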