regex - Extracting MLA citations from HTML with Enlive and Clojure

Tags: regex macros clojure enlive

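My goal is to extract and parse a list of references from a web page so that I can later load them into a database. The references are all in MLA format. This should be a general solution that works for any MLA-formatted bibliography, not just the page shown below.

Here is my attempt, which doesn't work: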

(use '[net.cgrand.enlive-html])

(def ^:dynamic *base-url* "https://www.impacttest.com/research/?Clinical-Research-Database-4")
(def ^:dynamic *ref-selector*     [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex*    #"\s([A-Z]{1}[\w|\s]+)[,|\.]")
(def ^:dynamic *ref-modifier* `(remove :content))

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

(extract-re references *ref-regex* *ref-modifier*)

(macroexpand-1 '(extract-re references *ref-regex* *ref-modifier*))

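I want the extract-re macro to create a doseq over all the Enlive nodes. Two things need to vary: the regular expression itself, and a modifier that alters the Enlive node before it is processed. Without the modifier, the regex matches the authors and some of the titles. I tried writing a function but couldn't get it to work in the general case, so I figured a macro was the way to go.

For MLA citations, I think applying a modifier to the Enlive nodes is easier than doing all of the extraction with regular expressions, though I may be wrong. I can't work out how to write a regex that matches only the title or only the authors.

So, how do I pass the modifier to the macro and have it execute correctly? I don't fully understand the quoting details of macros, so I may not know how to start writing one, or even whether a macro is necessary.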

Best answer

There are a number of problems with this code.

'(use [net.cgrand.enlive-html])

This does not load the library. It creates a literal list and does nothing with it:

user> (class '(use [net.cgrand.enlive-html]))
clojure.lang.PersistentList

It is effectively a no-op.
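To make the difference concrete, here is a minimal sketch of quoting the whole form versus quoting only the library spec. It assumes `clojure.set` as a stand-in for Enlive so that it runs without extra dependencies; the name `quoted-form` is made up for illustration.

```clojure
;; Quoting the whole form yields data: a list holding the symbol `use`
;; and a vector. Nothing is loaded.
(def quoted-form '(use [net.cgrand.enlive-html]))

(class quoted-form)
;=> clojure.lang.PersistentList

(first quoted-form)
;=> use   (a symbol, not a function call)

;; Quoting only the argument actually calls `use`.
;; clojure.set stands in for Enlive here (an assumption for the demo).
(use '[clojure.set :only [union]])

(union #{1} #{2})
;=> #{1 2}
```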

(def ^:dynamic *ref-modifier* `(remove :content))

This creates a two-element list, not a "modifier" of any kind.
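A small sketch of what the syntax-quoted form evaluates to, compared with a real function built with `partial`. The names `modifier-form` and `modifier-fn` and the sample data are hypothetical.

```clojure
;; Syntax-quote returns the form itself as data, with symbols
;; namespace-qualified.
(def modifier-form `(remove :content))

(count modifier-form)
;=> 2

(first modifier-form)
;=> clojure.core/remove

;; A sequence is not callable, so (modifier-form xs) would throw.
(ifn? modifier-form)
;=> false

;; partial returns a function that can be applied to a collection.
(def modifier-fn (partial remove :content))

;; Strings have no :content key, so only the element map is removed.
(modifier-fn ["text" {:tag :a :content ["link"]} " more"])
;=> ("text" " more")
```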

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

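Here you use syntax-quote, but you never unquote anything inside it. The macro does not use any of its arguments in any way.

You seem to want to apply modifier as if it were a function (which does not even begin to happen; see the quoting problem above), but as the actual call shows, modifier is a two-element list, and calling it would throw an error.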
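The effect of a missing unquote can be seen with `macroexpand-1`. This is a minimal sketch using two made-up macros, `broken-twice` and `fixed-twice`, rather than the macro from the question.

```clojure
;; No unquote: inside the syntax-quote, `x` is just a symbol that gets
;; namespace-qualified. The macro argument is never used.
(defmacro broken-twice [x]
  `(* 2 x))

;; ~x inserts the form that was passed to the macro.
(defmacro fixed-twice [x]
  `(* 2 ~x))

(macroexpand-1 '(fixed-twice 21))
;=> (clojure.core/* 2 21)

(fixed-twice 21)
;=> 42

;; The last element of this expansion is a qualified symbol named x
;; (for example user/x), not 21, so using the macro would not compile.
(macroexpand-1 '(broken-twice 21))
```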

Finally, doseq exists only for side effects and always returns nil. The doseq block does not use the value produced by re-find, so the doseq body is effectively a no-op.
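A short sketch of the difference, using a made-up `words` vector: `doseq` throws away what `re-find` returns, while `for` (forced with `doall`) or `mapv` collects it.

```clojure
(def words ["alpha" "beta" "gamma"])

;; doseq runs the body for side effects and discards each result.
(def doseq-result
  (doseq [w words]
    (re-find #"[aeiou]" w)))

doseq-result
;=> nil

;; for builds a lazy sequence of the body's results; doall realizes it.
(def for-result
  (doall
    (for [w words]
      (re-find #"[aeiou]" w))))

for-result
;=> ("a" "e" "a")

;; mapv is an eager alternative that returns a vector.
(mapv #(re-find #"[aeiou]" %) words)
;=> ["a" "e" "a"]
```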

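In addition, I find it of doubtful use to declare vars as dynamic when their values are going to be supplied as explicit function arguments anyway.

With all of these problems fixed, I think we get closer to something workable: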
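As an illustration of that point, here is a sketch with hypothetical names (`*separator*`, `join-implicit`, `join-explicit`). A dynamic var pays off when a function reads it implicitly and callers rebind it with `binding`; when the value is passed as an argument, a plain `def` or a literal does the same job.

```clojure
(def ^:dynamic *separator* ", ")

;; Reads the dynamic var implicitly; callers override it with binding.
(defn join-implicit [names]
  (apply str (interpose *separator* names)))

;; Takes the same value as an ordinary argument; no dynamic var needed.
(defn join-explicit [separator names]
  (apply str (interpose separator names)))

(binding [*separator* "; "]
  (join-implicit ["Schatz P" "Moser RS"]))
;=> "Schatz P; Moser RS"

(join-explicit "; " ["Schatz P" "Moser RS"])
;=> "Schatz P; Moser RS"
```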

(use 'net.cgrand.enlive-html)

(def ^:dynamic *base-url*
  "https://www.impacttest.com/research/?Clinical-Research-Database-4")

(def ^:dynamic *ref-selector* [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex* #"\s([A-Z]{1}[\w|\s]+)[,|\.]")

(def ^:dynamic *ref-modifier* (partial remove :content))

(defn extract-re [node re modifier]
  (doall
    (for [sq (map :content (node))]
      (re-find re (apply str (modifier sq))))))
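The same logic can be checked without a network call or the Enlive dependency. The sketch below assumes hand-written node maps shaped like the ones Enlive's select returns (a `:content` vector mixing strings and nested element maps). The citations, `sample-nodes`, `text-only`, and `extract-from-nodes` are invented for illustration, and unlike extract-re above, this variant takes the nodes directly instead of a function that fetches them.

```clojure
;; Hypothetical stand-ins for selected <li> nodes.
(def sample-nodes
  [{:tag :li
    :content [" Schatz P, Moser RS. "
              {:tag :a :attrs {:href "#"} :content ["A made-up title"]}
              " Journal Name. 2011."]}
   {:tag :li
    :content [" Lovell MR. "
              {:tag :em :content ["Another made-up title"]}]}])

(def ref-regex #"\s([A-Z]{1}[\w|\s]+)[,|\.]")

;; (:content "some string") is nil, so only nested element maps are
;; removed and the bare text pieces are kept.
(def text-only (partial remove :content))

(defn extract-from-nodes [nodes re modifier]
  (doall
    (for [sq (map :content nodes)]
      (re-find re (apply str (modifier sq))))))

(extract-from-nodes sample-nodes ref-regex text-only)
;=> ([" Schatz P," "Schatz P"] [" Lovell MR." "Lovell MR"])
```

Each result is the vector re-find returns for a pattern with one group: the whole match first, then the captured name. Only the first match per citation is returned, so with this regex only the first author comes back.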

In action:

user> (extract-re references *ref-regex* *ref-modifier*)

([" Dambinova SA," "Dambinova SA"] [" Zuckerman SL," "Zuckerman SL"] [" Conklin HM," "Conklin HM"] [" Covassin T," "Covassin T"] [" Maerlender A," "Maerlender A"] [" Fedor A," "Fedor A"] [" Resch J," "Resch J"] [" Elbin RJ," "Elbin RJ"] [" Rabinowitz AR," "Rabinowitz AR"] [" Kinnaman KA," "Kinnaman KA"] [" Tsushima WT," "Tsushima WT"] [" Amonette WE," "Amonette WE"] [" Lovell MR," "Lovell MR"] [" Schatz P," "Schatz P"] [" McGrath N," "McGrath N"] [" Kontos AP," "Kontos AP"] [" AB," "AB"] [" Meehan WP," "Meehan WP"] [" Rieger BP," "Rieger BP"] [" Solomon GS," "Solomon GS"] [" Sandel NK," "Sandel NK"] [" Schatz P," "Schatz P"] [" Schatz P," "Schatz P"] [" Lebrun CM," "Lebrun CM"] [" Brooks B," "Brooks B"] [" Meehan WP," "Meehan WP"] [" Fakhran S," "Fakhran S"] [" Cole WR," "Cole WR"] [" Tsushima M," "Tsushima M"] [" Zuckerman SL," "Zuckerman SL"] [" JK," "JK"] [" Covassin T," "Covassin T"] [" Moser RS," "Moser RS"] [" Mayers LB," "Mayers LB"] [" McAllister TW," "McAllister TW"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Neal MT," "Neal MT"] [" Lau BC," "Lau BC"] [" Kontos AP," "Kontos AP"] [" Gardner A," "Gardner A"] [" Elbin RJ," "Elbin RJ"] [" Wolf EG," "Wolf EG"] [" Reddy CC," "Reddy CC"] [" Moser RS," "Moser RS"] [" Guerriero RM," "Guerriero RM"] [" Deibert E," "Deibert E"] [" Wiebe DJ," "Wiebe DJ"] [" Baillargeon A," "Baillargeon A"] [" Erdal K." "Erdal K"] [" Maugans TA," "Maugans TA"] [" Iverson GL," "Iverson GL"] [" Ponsford J," "Ponsford J"] [" Schatz P," "Schatz P"] [" Mulligan I," "Mulligan I"] [" Echlin PS," "Echlin PS"] [" McLeod TC," "McLeod TC"] [" Zuckerman SL," "Zuckerman SL"] [" Kontos AP," "Kontos AP"] [" Zuckerman SL," "Zuckerman SL"] [" Schatz P," "Schatz P"] [" Kontos AP," "Kontos AP"] [" Covassin T," "Covassin T"] [" Covassin T," "Covassin T"] [" Duhaime AC," "Duhaime AC"] [" Echemendia RJ," "Echemendia RJ"] [" Ramanathan DM," "Ramanathan DM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Krol AL," "Krol AL"] [" Turgeon C," "Turgeon C"] [" Randolph C." 
"Randolph C"] [" Barlow M," "Barlow M"] [" Schatz P," "Schatz P"] [" Moser RS," "Moser RS"] [" Broglio SP," "Broglio SP"] [" Thomas DG," "Thomas DG"] [" Allen BJ," "Allen BJ"] [" Solomon GS," "Solomon GS"] [" Ponsford J," "Ponsford J"] [" Johnson EW," "Johnson EW"] [" Randolph C," "Randolph C"] [" Elbin RJ," "Elbin RJ"] [" Broglio SP," "Broglio SP"] [" Kontos AP," "Kontos AP"] [" Lau BC," "Lau BC"] [" Lau BC," "Lau BC"] [" Hettich T," "Hettich T"] [" Elbin T," "Elbin T"] [" Maerlender A," "Maerlender A"] [" Kontos AP," "Kontos AP"] [" Talavage TM," "Talavage TM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Lange RT," "Lange RT"] [" Covassin T," "Covassin T"] [" Schatz P." "Schatz P"] [" Lange RT," "Lange RT"] [" Pardini JE," "Pardini JE"] [" Echlin PS," "Echlin PS"] [" Schatz P," "Schatz P"] [" Echlin PS," "Echlin PS"] [" Keightley ML," "Keightley ML"] [" McGrath N." "McGrath N"] [" Covassin T," "Covassin T"] [" Pontifex MB," "Pontifex MB"] [" AB," "AB"] [" Casson IR," "Casson IR"] [" McCrory P," "McCrory P"] [" Covassin T," "Covassin T"] [" Bruce JM," "Bruce JM"] [" Covassin T," "Covassin T"] [" Lovell M." "Lovell M"] [" Lau B," "Lau B"] [" Nance ML," "Nance ML"] [" Peterson SE," "Peterson SE"] [" Lovell M." "Lovell M"] [" Broglio SP," "Broglio SP"] [" Broglio SP," "Broglio SP"] [" Colvin AC," "Colvin AC"] [" Reddy CC," "Reddy CC"] [" Solomon GS," "Solomon GS"] [" Covassin T," "Covassin T"] [" Majerske CW," "Majerske CW"] [" Lovell MR," "Lovell MR"] [" AB," "AB"] [" Tsushima WT," "Tsushima WT"] [" Miller JR," "Miller JR"] [" Slobounov S," "Slobounov S"] [" Mihalik JP," "Mihalik JP"] [" Covassin T," "Covassin T"] [" Lovell MR," "Lovell MR"] [" Stoller KP." "Stoller KP"] [" Broglio SP," "Broglio SP"] [" Moser RS," "Moser RS"] [" Iverson G." 
"Iverson G"] [" Fazio VC," "Fazio VC"] [" Swanik CB," "Swanik CB"] [" Broglio SP," "Broglio SP"] [" Covassin T," "Covassin T"] [" Broglio SP," "Broglio SP"] [" Chen JK," "Chen JK"] [" Van Kampen DA," "Van Kampen DA"] [" Broglio SP," "Broglio SP"] [" Pellman EJ," "Pellman EJ"] [" Pellman EJ," "Pellman EJ"] [" Schatz P," "Schatz P"] [" Biasca N," "Biasca N"] [" Collins M," "Collins M"] [" Lovell MR," "Lovell MR"] [" Lovell MR," "Lovell MR"] [" Iverson GL," "Iverson GL"] [" Cantu RC," "Cantu RC"] [" McClincy MP," "McClincy MP"] [" Schatz P," "Schatz P"] [" Iverson GL," "Iverson GL"] [" Van Kampen DA," "Van Kampen DA"] [" Lovell M," "Lovell M"] [" Mihalik JP," "Mihalik JP"] [" Moser RS," "Moser RS"] [" Broshek DK," "Broshek DK"] [" Grove R," "Grove R"] [" McCrea M," "McCrea M"] [" McCrory P," "McCrory P"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Bruce JM," "Bruce JM"] [" Pellman EJ," "Pellman EJ"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Kontos A," "Kontos A"] [" Collins MW," "Collins MW"] [" Iverson GL," "Iverson GL"] [" Lovell M," "Lovell M"] [" Field M," "Field M"] [" Covassin T," "Covassin T"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Maroon JC," "Maroon JC"] [" Lovell MR," "Lovell MR"] [" Lovell MR." "Lovell MR"] [" Aubry M," "Aubry M"] [" Grindel SH," "Grindel SH"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"])

Regarding regex - Extracting MLA citations from HTML with Enlive and Clojure, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24679176/
