add Utility.HtmlDetect

This will be used in youtube-dl integration, to tell when a html page has been downloaded by addurl, in which case it is worth running youtube-dl to see if it can extract media from it. tagsoup is an almost free dependency, because yesod depends on it. So, this only really adds a dep when git-annex is built without the webapp. I'd like this to as closely as possible match how browsers decide if a page is html or not. Unfortunately, that is fairly heuristic, in order to support malformed html. And, we don't want to falsely detect something as html just because it has something that looks like a html tag embedded somewhere in it. Probably any major video hosting site is going to be serving html documents that at least start with a <html> tag, so requiring that or a DOCTYPE should be good enough. This commit was sponsored by Jeff Goeke-Smith on Patreon.
author: Joey Hess <joeyh@joeyh.name> 2017-11-28 12:50:30 -0400
committer: Joey Hess <joeyh@joeyh.name> 2017-11-28 13:03:11 -0400
commit: ede2198520dded21d580a9c199a0909c2b04923a (patch)
tree: dcc6f2fb2bfc40278c53ffac93bf61773a7553a4 /Utility
parent: 938c89ec2a3b28c577dffd18a301915a609d6cae (diff)
1 files changed, 35 insertions, 0 deletions
diff --git a/Utility/HtmlDetect.hs b/Utility/HtmlDetect.hs
new file mode 100644
index 000000000..ca516e960
--- /dev/null
+++ b/Utility/HtmlDetect.hs
@@ -0,0 +1,35 @@
+{- html detection
+ -
+ - Copyright 2017 Joey Hess <id@joeyh.name>
+ -
+ - License: BSD-2-clause
+ -}
+
+module Utility.HtmlDetect where
+
+import Text.HTML.TagSoup
+import Data.Char
+
+-- | Detect if a string is a html document.
+--
+-- The document many not be valid, and will still be detected as html,
+-- as long as it starts with a "<html>" or "<!DOCTYPE html>" tag.
+--
+-- Html fragments like "<p>this</p>" are not detected as being html,
+-- although some browsers may chose to render them as html.
+isHtml :: String -> Bool
+isHtml = evaluate . canonicalizeTags . parseTags . truncate
+  where
+	-- We only care about the beginning of the file,
+	-- so although tagsoup parses lazily anyway, truncate it.
+	truncate = take 16384
+	evaluate (TagOpen "!DOCTYPE" ((t, _):_):_) = map toLower t == "html"
+	evaluate (TagOpen "html" _:_) = True
+	-- Allow some leading whitespace before the tag.
+	evaluate (TagText t:rest)
+		| all isSpace t = evaluate rest
+		| otherwise = False
+	-- It would be pretty weird to have a html comment before the html
+	-- tag, but easy to allow for.
+	evaluate (TagComment _:rest) = evaluate rest
+	evaluate _ = False
author	Joey Hess <joeyh@joeyh.name>	2017-11-28 12:50:30 -0400
committer	Joey Hess <joeyh@joeyh.name>	2017-11-28 13:03:11 -0400
commit	ede2198520dded21d580a9c199a0909c2b04923a (patch)
tree	dcc6f2fb2bfc40278c53ffac93bf61773a7553a4 /Utility
parent	938c89ec2a3b28c577dffd18a301915a609d6cae (diff)