https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

RegEx match open tags except XHTML self-contained tags

Asked 11 years, 3 months ago. Active 3 months ago. Viewed 3.1m times.

Locked. Comments on this question have been disabled, but it is still accepting new answers and other interactions.

I need to match all of these opening tags:

But not these:


I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

* Find a less-than, then
* Find (and capture) a-z one or more times, then
* Find zero or more spaces, then
* Find any character zero or more times, greedy, except /, then
* Find a greater-than

Do I have that right? And more importantly, what do you think?

Tags: html regex xhtml

- Jeff (community wiki, 11 revs, 7 users; 58% Jeff), edited May 26 '12 at 20:37

36 Answers

Answer (score 4413)

Locked. There are disputes about this answer's content being resolved at this time. It is not currently accepting new interactions.

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The
cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Regex-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a child ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c[?]ho[?]mrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg ex parsers for HTML will ins tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes[S:, the pestilent sl:S]ithy regex-infection wil l devour your HT ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi ght he comes, hi s unho[?]ly radiance de stro*1.000.000*ying all enlightenment, HTML tags leakin[?]g from yo [?]ur eyes lik[?]e liq uid pain, the song of regular exp re[S:ssion parsing :S]will exti nguish the voices of mor tal man from the sp here I can see it can you see [?][?]it[?] it is beautiful t he final snuffing of the lie s of Man ALL IS LOS[?]T ALL I S LOST the pony he comes he com[S:es he co:S][S:me :S]s the ich or permeates all MY FACE MY FACE h god no NO NOOO O NTh stop the an *[?] m[?]g[?]mlmic[?]es a[?]r[?][?]e n ot reuacle ZA[?]LGO ISv*1.000.000*[?] TONy THE[?] PO NY HcE[?]h[?] 
urCtiv[?][?]Ov[?]Mhdr[?]Ed[?][?]Sx

---------------------------------------------------------------------

Have you tried using an XML parser instead?

---------------------------------------------------------------------

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

- bobince (community wiki, 10 revs, 6 users; 24% bobince), edited Nov 12 '20 at 14:21

* 179 Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death. - bobince Nov 13 '09 at 23:18
* 27 Is it possible to use RegEx to parse this answer? - Chris Porter Nov 17 '09 at 18:26
* 2 If you can't see this post, here's a screencapture of it in all its glory: imgur.com/gOPS2.png - Andrew Keeton Nov 19 '09 at 14:37

Answer (score 3341, +50)

While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use regexes for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

- Kaitlin Duck Sherwood (community wiki, 10 revs, 10 users; 36% Kaitlin Duck Sherwood), edited Sep 19 '19 at 15:30

* 136 Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser.
And if you are comfortable with coding regexes, way faster to code than coding xpaths. And almost certainly less fragile to changes in what you are scraping. So bleh. - Michael Johnston Apr 17 '12 at 20:47
* 277 @MichaelJohnston "Less fragile"? Almost certainly not. Regexes care about text-formatting details that an XML parser can silently ignore. Switching between &foo; encodings and CDATA sections? Using an HTML minifier to remove all whitespace in your document that the browser doesn't render? An XML parser won't care, and neither will a well-written XPath statement. A regex-based "parser", on the other hand... - Charles Duffy Jul 11 '12 at 16:03
* 41 @CharlesDuffy for a one-time job it's ok, and for spaces we use \s+ - quantum Jul 12 '12 at 13:50
* 72 @xiaomao indeed, if having to know all the gotchas and workarounds to get an 80% solution that fails the rest of the time "works for you", I can't stop you. Meanwhile, I'm over on my side of the fence using parsers that work on 100% of syntactically valid XML. - Charles Duffy Jul 12 '12 at 16:07
* 394 I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke, and all their styling was inline or with etc.: no classes or IDs to help navigate the DOM. After fighting all day with the "right" approach, I finally switched to a regex solution and had it working in an hour. - Paul A Jungwirth Sep 7 '12 at 7:14

Answer (score 2120)

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with a regular expression.

But many will try, and some will even claim success, but only until others find the fault and totally mess you up.
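
The nesting argument above can be illustrated with a minimal Python sketch (an illustration only, not code from the thread). It assumes Python's standard `re` and `html.parser` modules; `OPEN_TAG` is the asker's pattern, while `NAIVE_DIV`, `DepthTracker`, and the sample strings are hypothetical names and inputs chosen for the example. Start tags do not nest, so a flat pattern can find them, but a lazy pattern for a whole element stops at the first closing tag, whereas a parser can keep track of depth.

```python
import re
from html.parser import HTMLParser

# The asker's pattern: open tags whose body contains no "/".
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

# A naive attempt to match one whole <div>...</div> element (hypothetical).
NAIVE_DIV = re.compile(r"<div>.*?</div>", re.DOTALL)


class DepthTracker(HTMLParser):
    """Tracks element nesting depth with a plain counter."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1


nested = "<div><div>inner</div>outer</div>"

# Start tags do not contain other start tags, so a flat pattern finds them.
start_tags = OPEN_TAG.findall('<p><a href="foo">')

# The lazy pattern stops at the first </div> and cuts the outer element short.
naive_match = NAIVE_DIV.search(nested).group(0)

# The parser sees both levels and returns to depth zero at the end.
tracker = DepthTracker()
tracker.feed(nested)
tracker.close()

print(start_tags, naive_match, tracker.max_depth, tracker.depth)
```

Here `start_tags` is `['p', 'a']`, while `naive_match` is `<div><div>inner</div>`, which drops `outer</div>`. This matches LarsH's comment below: the subset the question asks about (start tags) is within reach of a pattern, and arbitrary nesting is where the flat approach stops working.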
- Vlad Gudim (community wiki, 9 revs, 8 users; 20% Vlad Gudim), edited Aug 14 '20 at 15:50

* 238 The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG. - LarsH Mar 2 '12 at 8:43
* 106 In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/... describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish. - Adam Mihalcin Mar 19 '12 at 23:50
* 13 @eyelidlessness: the same "only if" applies to all CFGs, does it not? I.e. if the (X)HTML input is not well-formed, not even a full-blown XML parser will work reliably. Maybe if you give examples of the "(X)HTML syntax errors implemented in real world user agents" you're referring to, I'll understand what you're getting at better. - LarsH May 22 '12 at 5:09
* 86 @AdamMihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (e.g. non-greedy matching, backrefs). Some regex engines (such as Perl's) are Turing complete. It's true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why. - dubiousjim May 31 '12 at 13:44
* 28 This is the most "full and short" answer here. It leads people to learn basics of formal grammars and languages and hopefully some maths so they will not waste time on hopeless things like solving NP-tasks in polynomial time - mishmashru Apr 19 '13 at 12:15

Answer (score 1383)

Don't listen to these guys.
You totally can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

1. Solve the Halting Problem.
2. Square a circle.
3. Work out the Traveling Salesman Problem in O(log n) or less. If it's any more than that, you'll run out of RAM and the engine will hang.
4. The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
5. Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't quite finished the last part myself, but I know I'm getting close. It keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions for some reason, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

- Justin Morgan (community wiki, 10 revs), edited Aug 30 '18 at 16:52

* 85 Division by zero is a much easier problem than the others you mention. If you use intervals, rather than plain floating point arithmetic (which everyone should be but nobody is), you can happily divide something by [an interval containing] zero. The result is simply an interval containing plus and minus infinity. - rjmunro Jun 14 '12 at 10:53
* 158 Fermat's small margin problem has been solved by soft margins in modern text-editing software. - kd4ttc Mar 1 '13 at 20:24
* 55 Fermat's small margin problem has been solved by Randall Munroe by setting the fontsize to zero: xkcd.com/1381 - heltonbiker Oct 16 '14 at 19:55
* 34 FYI: Fermat's problem has actually been solved in 1995, and it only took mathematicians 358 years to do so.
- jmiserez Jan 22 '15 at 18:40
* 11 I was able to bypass that sticky divide-by-zero step by instead using Brownian ratchets yielded from cold fusion...though it only works when I remove the cosmological constant. - Tim Lehner Mar 9 '16 at 18:52

Answer (score 1100)

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like , which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

- itsadok (community wiki, 5 revs, 2 users; 92% itsadok), edited May 23 '17 at 12:34

* 106 I would go with something that works on sane things than weep about not being universally perfect :-) - prajeesh kumar May 10 '12 at 3:44
* 57 Is someone using CDATA inside HTML? - Danubian Sailor Mar 2 '13 at 7:51
* 20 so you do not actually solve the parsing problem with regexp only but as a part of the parser this may work. PS: working product doesn't mean good code. No offence, but this is how industrial programming works and gets their money - mishmashru Apr 19 '13 at 12:18
* 35 Your regex starts fail on the very shortest possible, valid HTML: <. Simple ' <'.match(/<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g) returns ["", "", "<"] while should ["", ""]. - user1180790 May 1 '14 at 16:48
* 2 if we're just trying to match & not match the examples given, /<.
([^r>][^>]*)?>/g works :-) // javascript: '



'.match(/<.([^r>][^>]*)?>/g) - imma May 22 '14 at 16:14 | Show 5 more comments 511 There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying. There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance. You can live in their reality or take the red pill. Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the [S:Underverse:S] Stack Based Regex-Verse and returned with [S:powers:S] knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult. I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this: 7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28 995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F 86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169 OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7 O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y 
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52 MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU 1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY 12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37 R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn 3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25 D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8 DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3 zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX /ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj 4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6 mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K 
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z 0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26 7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29 7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9 r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa 2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8 fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+ +fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx +r/vD34mUADO1P4/AQAA//8= The options to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped. If you have problems reconverting it to a human-readable regex, this should help: static string FromBase64(string str) { byte[] byteArray = Convert.FromBase64String(str); using (var msIn = new MemoryStream(byteArray)) using (var msOut = new MemoryStream()) { using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) { ds.CopyTo(msOut); } return Encoding.UTF8.GetString(msOut.ToArray()); } } If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs. Oh... 
if you want the source code of the regex, with some auxiliary methods: regex to tokenize an xml or the full plain regex

- xanatos (community wiki, 12 revs, 10 users; 70% xanatos), edited Aug 14 '20 at 6:32

* 70 Good Lord, it's massive. My biggest question is why? You realize that all modern languages have XML parsers, right? You can do all that in like 3 lines and be sure it'll work. Furthermore, do you also realize that pure regex is provably unable to do certain things? Unless you've created a hybrid regex/imperative code parser, but it doesn't look like you have. Can you compress random data as well? - Justin Morgan Mar 8 '11 at 15:23
* 117 @Justin I don't need a reason. It could be done (and it wasn't illegal/immoral), so I have done it. There are no limitations to the mind except those we acknowledge (Napoleon Hill)... Modern languages can parse XML? Really? And I thought that THAT was illegal! :-) - xanatos Mar 8 '11 at 15:31
* 83 Sir, I'm convinced. I'm going to use this code as part of the kernel for my perpetual-motion machine--can you believe those fools at the patent office keep rejecting my application? Well, I'll show them. I'll show them all! - Justin Morgan Mar 8 '11 at 17:55
* 31 @Justin So an Xml Parser is by definition bug free, while a Regex isn't? Because if an Xml Parser isn't bug free by definition there could be an xml that make it crash and we are back to step 0. Let say this: both the Xml Parser and this Regex try to be able to parse all the "legal" XML. They CAN parse some "illegal" XML. Bugs could crash both of them. C# XmlReader is surely more tested than this Regex. - xanatos Mar 9 '11 at 15:08
* 32 No, nothing is bug free: 1) All programs contain at least one bug. 2) All programs contain at least one line of unnecessary source code. 3) By #1 and #2 and using logical induction, it's a simple matter to prove that any program can be reduced to a single line of code with a bug.
(from Learning Perl) - Scott Weaver Feb 16 '12 at 0:53

Answer (score 304)

In shell, you can parse HTML using sed:

1. Turing.sed
2. Write HTML parser (homework)
3. ???
4. Profit!

---------------------------------------------------------------------

Related (why you shouldn't use regex match):

* If You Like Regular Expressions So Much, Why Don't You Marry Them?
* Regular Expressions: Now You Have Two Problems
* Hacking stackoverflow.com's HTML sanitizer

- kenorb (community wiki, 10 revs, 7 users; 43% kenorb), edited Apr 23 '19 at 16:44

* 3 I'm afraid you did not get the joke, @kenorb. Please, read the question and the accepted answer once more. This is not about HTML parsing tools in general, nor about HTML parsing shell tools, it's about parsing HTML via regexes. - Palec Oct 13 '15 at 8:12
* 1 No, @Abdul. It is completely, provably (in the mathematical sense) impossible. - Palec Mar 24 '17 at 13:24
* 4 Yes, that answer summarizes it well, @Abdul. Note that, however, regex implementations are not really regular expressions in the mathematical sense -- they have constructs that make them stronger, often Turing-complete (equivalent to Type 0 grammars). The argument breaks with this fact, but is still somewhat valid in the sense that regexes were never meant to be capable of doing such a job, though. - Palec Mar 24 '17 at 14:24
* 2 And by the way, the joke I referred to was the content of this answer before kenorb's (radical) edits, specifically revision 4, @Abdul. - Palec Mar 24 '17 at 14:26
* 5 The funny thing is that OP never asked to parse html using regex. He asked to match text (which happens to be HTML) using regex. Which is perfectly reasonable. - Paralife Mar 29 '18 at 15:29

Answer (score 279)

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine.
However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

* See Matching Balanced Constructs with .NET Regular Expressions
* See .NET Regular Expressions: Regex and Balanced Matching
* See Microsoft's docs on Balancing Group Definitions

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

+ (?<group>) - pushes the captured result on the capture stack with the name group.
+ (?<-group>) - pops the top most capture with the name group off the capture stack.
+ (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter.
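
The increment, decrement, and compare-to-zero behaviour described in the quote can be sketched outside .NET as well. The following is a rough Python illustration, not the .NET pattern from the answer: Python's `re` module has no balancing groups, so the counter lives in ordinary code. The names `TOKEN` and `is_balanced` and the tokenizing pattern are assumptions made for this example, and the check only counts tags; it does not verify that tag names match.

```python
import re

# Hypothetical tokenizer: a self-closing tag, a close tag, an open tag, or text.
TOKEN = re.compile(r"<[^<>]*/>|</[^<>]*>|<[^<>]*>|[^<>]+")


def is_balanced(markup: str) -> bool:
    """Return True if open and close tags balance, using only a counter."""
    depth = 0
    pos = 0
    for match in TOKEN.finditer(markup):
        if match.start() != pos:
            return False  # a stray "<" or ">" was skipped by the tokenizer
        pos = match.end()
        token = match.group(0)
        if token.startswith("</"):
            depth -= 1  # "pop": a close tag decrements the counter
            if depth < 0:
                return False  # close tag with nothing open
        elif token.startswith("<") and not token.endswith("/>"):
            depth += 1  # "push": an open tag increments the counter
    # "empty": everything consumed and the counter is back at zero
    return pos == len(markup) and depth == 0


print(is_balanced("<div><p>hi</p><br/></div>"))
print(is_balanced("<div><p></div>"))
```

Because only a counter is kept, an input such as `<div></p>` is also accepted, which is the limitation the quote points at: a simple counter recognizes balance, not matching names.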
This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs. Consider the following regular expression: (?=) (?> | <[^>]*/> | (?<(?!/)[^>]*[^/]>) | (?<-opentag>]*[^/]>) | [^<>]* )* (?(opentag)(?!)) Use the flags: * Singleline * IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace) * IgnoreCase (not necessary) Regular Expression Explained (inline) (?=) # match start with