html - Haskell: Why isn't my parser backtracking correctly?

Tags: html parsing haskell backtracking parsec

I decided to teach myself how to use Parsec, and I've hit a roadblock in the toy project I assigned myself.

I'm trying to parse HTML, specifically:

<html>
  <head>
    <title>Insert Clever Title</title>
  </head>
  <body>
    What don't you like?
    <select id="some stuff">
      <option name="first" font="green">boilerplate</option>
      <option selected name="second" font="blue">parsing HTML with regexes</option>
      <option name="third" font="red">closing tags for option elements
    </select>
    That was short.
  </body>
</html>

My code is:

{-# LANGUAGE FlexibleContexts, RankNTypes #-}
module Main where

import System.Environment (getArgs)
import Data.Map hiding (null)
import Text.Parsec hiding ((<|>), label, many, optional)
import Text.Parsec.Token
import Control.Applicative

data HTML = Element { tag :: String, attributes :: Map String (Maybe String), children :: [HTML] }
          | Text { contents :: String }
  deriving (Show, Eq)

type HTMLParser a = forall s u m. Stream s m Char => ParsecT s u m a

htmlDoc :: HTMLParser HTML
htmlDoc = do
  spaces
  doc <- html
  spaces >> eof
  return doc

html :: HTMLParser HTML
html = text <|> element

text  :: HTMLParser HTML
text = Text <$> (many1 $ noneOf "<")

label :: HTMLParser String
label = many1 . oneOf $ ['a' .. 'z']  ++ ['A' .. 'Z']

value :: HTMLParser String
-- noneOf stops at the closing quote; many anyChar would consume it and run to end of input
value = between (char '"') (char '"') (many $ noneOf "\"") <|> label

attribute :: HTMLParser (String, Maybe String)
attribute = (,) <$> label <*> (optionMaybe $ spaces >> char '=' >> spaces >> value)

element :: HTMLParser HTML
element = do
  char '<' >> spaces
  tag <- label
  -- at least one space between each attribute and what was before
  attributes <- fromList <$> many (space >> spaces >> attribute)
  spaces >> char '>' 
  -- nested html
  children <- many html
  optional $ string "</" >> spaces >> string tag >> spaces >> char '>'
  return $ Element tag attributes children

main = do
  source : _ <- getArgs
  result <- parse htmlDoc source <$> readFile source
  print result

The problem seems to be that my parser doesn't like closing tags. As far as I can tell, it greedily assumes that < always begins an opening tag:

% HTMLParser temp.html
Left "temp.html" (line 3, column 32):
unexpected "/"
expecting white space

I've been at this for a while, and I'm not sure why it doesn't backtrack past the char '<' match.

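Best answer

As ehird said, I needed to use try. In Parsec, p <|> q only runs q when p fails without consuming any input, and many, optional and optionMaybe give up in the same way once their argument has consumed something. Wrapping a parser in try makes a failure rewind the input, so the alternative gets a chance: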

attribute = (,) <$> label <*> (optionMaybe . try $ spaces >> char '=' >> spaces >> value) 
--...
attributes <- fromList <$> many (try $ space >> spaces >> attribute)
--...
children <- many $ try html
optional . try $ string "</" >> spaces >> string tag >> spaces >> char '>'
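
To see the effect in isolation, here is a minimal sketch, not part of the original code. The names openTag, closeTag, withoutTry, withTry and run are hypothetical, and the two tag parsers are cut-down stand-ins for the element parser and its closing tag. It assumes the parsec package is installed.

```haskell
import Text.Parsec (ParseError, char, letter, many1, parse, try, (<|>))
import Text.Parsec.String (Parser)

-- Hypothetical miniature of `element`: '<' followed by a tag name.
openTag :: Parser String
openTag = char '<' *> many1 letter

-- Hypothetical miniature of a closing tag: "</" followed by a tag name.
closeTag :: Parser String
closeTag = char '<' *> char '/' *> many1 letter

-- openTag consumes '<' before failing on '/', so (<|>) never runs closeTag.
withoutTry :: Parser String
withoutTry = openTag <|> closeTag

-- try rewinds the input when openTag fails, so closeTag gets its turn.
withTry :: Parser String
withTry = try openTag <|> closeTag

-- Helper for running either parser on a string.
run :: Parser String -> String -> Either ParseError String
run p = parse p "(demo)"
```

In GHCi, run withoutTry "</b" returns a Left (the same kind of unexpected "/" failure as in the question), while run withTry "</b" returns Right "b". Both parsers accept "<b". Since try can cause input to be re-parsed, it is usually worth wrapping only the part of a parser that can fail after consuming input.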

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/9947837/
