You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

369 lines
11 KiB

# Copyright 2010-2019 Kurt McKee <contactme@kurtmckee.org>
# Copyright 2002-2008 Mark Pilgrim
# All rights reserved.
#
# This file is a part of feedparser.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
from __future__ import absolute_import
from __future__ import unicode_literals
import re
Change core system to improve performance and facilitate multi TV info sources. Change migrate core objects TVShow and TVEpisode and everywhere that these objects affect. Add message to logs and disable ui backlog buttons when no media provider has active and/or scheduled searching enabled. Change views for py3 compat. Change set default runtime of 5 mins if none is given for layout Day by Day. Add OpenSubtitles authentication support to config/Subtitles/Subtitles Plugin. Add &#34;Enforce media hash match&#34; to config/Subtitles Plugin/Opensubtitles for accurate subs if enabled, but if disabled, search failures will fallback to use less reliable subtitle results. Add Apprise 0.8.0 (6aa52c3). Add hachoir_py3 3.0a6 (5b9e05a). Add sgmllib3k 1.0.0 Update soupsieve 1.9.1 (24859cc) to soupsieve_py2 1.9.5 (6a38398) Add soupsieve_py3 2.0.0.dev (69194a2). Add Tornado_py3 Web Server 6.0.3 (ff985fe). Add xmlrpclib_to 0.1.1 (c37db9e). Remove ancient Growl lib 0.1 Remove xmltodict library. Change requirements.txt for Cheetah3 to minimum 3.2.4 Change update sabToSickBeard. Change update autoProcessTV. Change remove Twitter notifier. Update NZBGet Process Media extension, SickGear-NG 1.7 → 2.4 Update Kodi addon 1.0.3 → 1.0.4 Update ADBA for py3. Update Beautiful Soup 4.8.0 (r526) to 4.8.1 (r531). Update Send2Trash 1.3.0 (a568370) to 1.5.0 (66afce7). Update soupsieve 1.9.1 (24859cc) to 1.9.5 (6a38398). Change use GNTP (Growl Notification Transport Protocol) from Apprise. Change add multi host support to Growl notifier. Fix Growl notifier when using empty password. Change update links for Growl notifications. Change deprecate confg/Notifications/Growl password field as these are now stored with host setting. Fix prevent infinite memoryError from a particular jpg data structure. Change subliminal for py3. Change enzyme for py3. Change browser_ua for py3. Change feedparser for py3 (sgmlib is no longer available on py3 as standardlib so added ext lib) Fix Guessit. Fix parse_xml for py3. Fix name parser with multi eps for py3. Fix tvdb_api fixes for py3 (search show). Fix config/media process to only display &#34;pattern is invalid&#34; qtip on &#34;Episode naming&#34; tab if the associated field is actually visible. Also, if the field becomes hidden due to a setting change, hide any previously displayed qtip. Note for Javascript::getelementbyid (or $(&#39;tag[id=&#34;&lt;name&gt;&#34;&#39;)) is required when an id is being searched in the dom due to &#34;:&#34; used in a shows id name. Change download anidb xml files to main cache folder and use adba lib folder as a last resort. Change create get anidb show groups as centralised helper func and consolidate dupe code. Change move anidb related functions to newly renamed anime.py (from blacklistandwhitelist.py). Change str encode hex no longer exits in py3, use codecs.encode(...) instead. Change fix b64decode on py3 returns bytestrings. Change use binary read when downloading log file via browser to prevent any encoding issues. Change add case insensitive ordering to anime black/whitelist. Fix anime groups list not excluding whitelisted stuff. Change add Windows utf8 fix ... see: ytdl-org/youtube-dl#820 Change if no qualities are wanted, exit manual search thread. Fix keepalive for py3 process media. Change add a once a month update of tvinfo show mappings to the daily updater. Change autocorrect ids of new shows by updating from -8 to 31 days of the airdate of episode one. Add next run time to Manage/Show Tasks/Daily show update. Change when fetching imdb data, if imdb id is an episode id then try to find and use real show id. Change delete diskcache db in imdbpie when value error (due to change in Python version). Change during startup, cleanup any _cleaner.pyc/o to prevent issues when switching python versions. Add .pyc cleaner if python version is switched. Change replace deprecated gettz_db_metadata() and gettz. Change rebrand &#34;SickGear PostProcessing script&#34; to &#34;SickGear Process Media extension&#34;. Change improve setup guide to use the NZBGet version to minimise displayed text based on version. Change NZBGet versions prior to v17 now told to upgrade as those version are no longer supported - code has actually exit on start up for some time but docs were outdated. Change comment out code and unused option sg_base_path. Change supported Python version 2.7.9-2.7.18 inclusive expanded to 3.7.1-3.8.1 inclusive. Change pidfile creation under Linux 0o644. Make logger accept lists to output continuously using the log_lock instead of split up by other processes. Fix long path issues with Windows process media.
6 years ago
from six import PY2
try:
from html.entities import name2codepoint
except ImportError:
# Python 2
# noinspection PyUnresolvedReferences
from htmlentitydefs import name2codepoint
Change core system to improve performance and facilitate multi TV info sources. Change migrate core objects TVShow and TVEpisode and everywhere that these objects affect. Add message to logs and disable ui backlog buttons when no media provider has active and/or scheduled searching enabled. Change views for py3 compat. Change set default runtime of 5 mins if none is given for layout Day by Day. Add OpenSubtitles authentication support to config/Subtitles/Subtitles Plugin. Add &#34;Enforce media hash match&#34; to config/Subtitles Plugin/Opensubtitles for accurate subs if enabled, but if disabled, search failures will fallback to use less reliable subtitle results. Add Apprise 0.8.0 (6aa52c3). Add hachoir_py3 3.0a6 (5b9e05a). Add sgmllib3k 1.0.0 Update soupsieve 1.9.1 (24859cc) to soupsieve_py2 1.9.5 (6a38398) Add soupsieve_py3 2.0.0.dev (69194a2). Add Tornado_py3 Web Server 6.0.3 (ff985fe). Add xmlrpclib_to 0.1.1 (c37db9e). Remove ancient Growl lib 0.1 Remove xmltodict library. Change requirements.txt for Cheetah3 to minimum 3.2.4 Change update sabToSickBeard. Change update autoProcessTV. Change remove Twitter notifier. Update NZBGet Process Media extension, SickGear-NG 1.7 → 2.4 Update Kodi addon 1.0.3 → 1.0.4 Update ADBA for py3. Update Beautiful Soup 4.8.0 (r526) to 4.8.1 (r531). Update Send2Trash 1.3.0 (a568370) to 1.5.0 (66afce7). Update soupsieve 1.9.1 (24859cc) to 1.9.5 (6a38398). Change use GNTP (Growl Notification Transport Protocol) from Apprise. Change add multi host support to Growl notifier. Fix Growl notifier when using empty password. Change update links for Growl notifications. Change deprecate confg/Notifications/Growl password field as these are now stored with host setting. Fix prevent infinite memoryError from a particular jpg data structure. Change subliminal for py3. Change enzyme for py3. Change browser_ua for py3. Change feedparser for py3 (sgmlib is no longer available on py3 as standardlib so added ext lib) Fix Guessit. Fix parse_xml for py3. Fix name parser with multi eps for py3. Fix tvdb_api fixes for py3 (search show). Fix config/media process to only display &#34;pattern is invalid&#34; qtip on &#34;Episode naming&#34; tab if the associated field is actually visible. Also, if the field becomes hidden due to a setting change, hide any previously displayed qtip. Note for Javascript::getelementbyid (or $(&#39;tag[id=&#34;&lt;name&gt;&#34;&#39;)) is required when an id is being searched in the dom due to &#34;:&#34; used in a shows id name. Change download anidb xml files to main cache folder and use adba lib folder as a last resort. Change create get anidb show groups as centralised helper func and consolidate dupe code. Change move anidb related functions to newly renamed anime.py (from blacklistandwhitelist.py). Change str encode hex no longer exits in py3, use codecs.encode(...) instead. Change fix b64decode on py3 returns bytestrings. Change use binary read when downloading log file via browser to prevent any encoding issues. Change add case insensitive ordering to anime black/whitelist. Fix anime groups list not excluding whitelisted stuff. Change add Windows utf8 fix ... see: ytdl-org/youtube-dl#820 Change if no qualities are wanted, exit manual search thread. Fix keepalive for py3 process media. Change add a once a month update of tvinfo show mappings to the daily updater. Change autocorrect ids of new shows by updating from -8 to 31 days of the airdate of episode one. Add next run time to Manage/Show Tasks/Daily show update. Change when fetching imdb data, if imdb id is an episode id then try to find and use real show id. Change delete diskcache db in imdbpie when value error (due to change in Python version). Change during startup, cleanup any _cleaner.pyc/o to prevent issues when switching python versions. Add .pyc cleaner if python version is switched. Change replace deprecated gettz_db_metadata() and gettz. Change rebrand &#34;SickGear PostProcessing script&#34; to &#34;SickGear Process Media extension&#34;. Change improve setup guide to use the NZBGet version to minimise displayed text based on version. Change NZBGet versions prior to v17 now told to upgrade as those version are no longer supported - code has actually exit on start up for some time but docs were outdated. Change comment out code and unused option sg_base_path. Change supported Python version 2.7.9-2.7.18 inclusive expanded to 3.7.1-3.8.1 inclusive. Change pidfile creation under Linux 0o644. Make logger accept lists to output continuously using the log_lock instead of split up by other processes. Fix long path issues with Windows process media.
6 years ago
if PY2:
from .sgml import *
else:
import sgmllib3k as sgmllib
_cp1252 = {
128: '\u20ac', # euro sign
130: '\u201a', # single low-9 quotation mark
131: '\u0192', # latin small letter f with hook
132: '\u201e', # double low-9 quotation mark
133: '\u2026', # horizontal ellipsis
134: '\u2020', # dagger
135: '\u2021', # double dagger
136: '\u02c6', # modifier letter circumflex accent
137: '\u2030', # per mille sign
138: '\u0160', # latin capital letter s with caron
139: '\u2039', # single left-pointing angle quotation mark
140: '\u0152', # latin capital ligature oe
142: '\u017d', # latin capital letter z with caron
145: '\u2018', # left single quotation mark
146: '\u2019', # right single quotation mark
147: '\u201c', # left double quotation mark
148: '\u201d', # right double quotation mark
149: '\u2022', # bullet
150: '\u2013', # en dash
151: '\u2014', # em dash
152: '\u02dc', # small tilde
153: '\u2122', # trade mark sign
154: '\u0161', # latin small letter s with caron
155: '\u203a', # single right-pointing angle quotation mark
156: '\u0153', # latin small ligature oe
158: '\u017e', # latin small letter z with caron
159: '\u0178', # latin capital letter y with diaeresis
}
class _BaseHTMLProcessor(sgmllib.SGMLParser, object):
special = re.compile("""[<>'"]""")
bare_ampersand = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")
elements_no_end_tag = {
'area',
'base',
'basefont',
'br',
'col',
'command',
'embed',
'frame',
'hr',
'img',
'input',
'isindex',
'keygen',
'link',
'meta',
'param',
'source',
'track',
'wbr',
}
def __init__(self, encoding=None, _type='application/xhtml+xml'):
if encoding:
self.encoding = encoding
self._type = _type
self.pieces = []
super(_BaseHTMLProcessor, self).__init__()
def reset(self):
self.pieces = []
super(_BaseHTMLProcessor, self).reset()
def _shorttag_replace(self, match):
"""
:type match: Match[str]
:rtype: str
"""
tag = match.group(1)
if tag in self.elements_no_end_tag:
return '<' + tag + ' />'
else:
return '<' + tag + '></' + tag + '>'
# By declaring these methods and overriding their compiled code
# with the code from sgmllib, the original code will execute in
# feedparser's scope instead of sgmllib's. This means that the
# `tagfind` and `charref` regular expressions will be found as
# they're declared above, not as they're declared in sgmllib.
def goahead(self, i):
raise NotImplementedError
# Replace goahead with SGMLParser's goahead() code object.
try:
goahead.__code__ = sgmllib.SGMLParser.goahead.__code__
except AttributeError:
# Python 2
# noinspection PyUnresolvedReferences
goahead.func_code = sgmllib.SGMLParser.goahead.func_code
def __parse_starttag(self, i):
raise NotImplementedError
# Replace __parse_starttag with SGMLParser's parse_starttag() code object.
try:
__parse_starttag.__code__ = sgmllib.SGMLParser.parse_starttag.__code__
except AttributeError:
# Python 2
# noinspection PyUnresolvedReferences
__parse_starttag.func_code = sgmllib.SGMLParser.parse_starttag.func_code
def parse_starttag(self, i):
j = self.__parse_starttag(i)
if self._type == 'application/xhtml+xml':
if j > 2 and self.rawdata[j-2:j] == '/>':
self.unknown_endtag(self.lasttag)
return j
def feed(self, data):
"""
:type data: str
:rtype: None
"""
data = re.sub(r'<!((?!DOCTYPE|--|\[))', r'&lt;!\1', data, re.IGNORECASE)
data = re.sub(r'<([^<>\s]+?)\s*/>', self._shorttag_replace, data)
data = data.replace('&#39;', "'")
data = data.replace('&#34;', '"')
super(_BaseHTMLProcessor, self).feed(data)
super(_BaseHTMLProcessor, self).close()
@staticmethod
def normalize_attrs(attrs):
"""
:type attrs: List[Tuple[str, str]]
:rtype: List[Tuple[str, str]]
"""
if not attrs:
return attrs
# utility method to be called by descendants
# Collapse any duplicate attribute names and values by converting
# *attrs* into a dictionary, then convert it back to a list.
attrs_d = {k.lower(): v for k, v in attrs}
attrs = [
(k, k in ('rel', 'type') and v.lower() or v)
for k, v in attrs_d.items()
]
attrs.sort()
return attrs
def unknown_starttag(self, tag, attrs):
"""
:type tag: str
:type attrs: List[Tuple[str, str]]
:rtype: None
"""
# Called for each start tag
# attrs is a list of (attr, value) tuples
# e.g. for <pre class='screen'>, tag='pre', attrs=[('class', 'screen')]
uattrs = []
strattrs = ''
if attrs:
for key, value in attrs:
value = value.replace('>', '&gt;')
value = value.replace('<', '&lt;')
value = value.replace('"', '&quot;')
value = self.bare_ampersand.sub("&amp;", value)
uattrs.append((key, value))
strattrs = ''.join(
' %s="%s"' % (key, value)
for key, value in uattrs
)
if tag in self.elements_no_end_tag:
self.pieces.append('<%s%s />' % (tag, strattrs))
else:
self.pieces.append('<%s%s>' % (tag, strattrs))
def unknown_endtag(self, tag):
"""
:type tag: str
:rtype: None
"""
# Called for each end tag, e.g. for </pre>, tag will be 'pre'
# Reconstruct the original end tag.
if tag not in self.elements_no_end_tag:
self.pieces.append("</%s>" % tag)
def handle_charref(self, ref):
"""
:type ref: str
:rtype: None
"""
# Called for each character reference, e.g. '&#160;' will extract '160'
# Reconstruct the original character reference.
ref = ref.lower()
if ref.startswith('x'):
value = int(ref[1:], 16)
else:
value = int(ref)
if value in _cp1252:
self.pieces.append('&#%s;' % hex(ord(_cp1252[value]))[1:])
else:
self.pieces.append('&#%s;' % ref)
def handle_entityref(self, ref):
"""
:type ref: str
:rtype: None
"""
# Called for each entity reference, e.g. '&copy;' will extract 'copy'
# Reconstruct the original entity reference.
if ref in name2codepoint or ref == 'apos':
self.pieces.append('&%s;' % ref)
else:
self.pieces.append('&amp;%s' % ref)
def handle_data(self, text):
"""
:type text: str
:rtype: None
"""
# called for each block of plain text, i.e. outside of any tag and
# not containing any character or entity references
# Store the original text verbatim.
self.pieces.append(text)
def handle_comment(self, text):
"""
:type text: str
:rtype: None
"""
# Called for HTML comments, e.g. <!-- insert Javascript code here -->
# Reconstruct the original comment.
self.pieces.append('<!--%s-->' % text)
def handle_pi(self, text):
"""
:type text: str
:rtype: None
"""
# Called for each processing instruction, e.g. <?instruction>
# Reconstruct original processing instruction.
self.pieces.append('<?%s>' % text)
def handle_decl(self, text):
"""
:type text: str
:rtype: None
"""
# called for the DOCTYPE, if present, e.g.
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
# "http://www.w3.org/TR/html4/loose.dtd">
# Reconstruct original DOCTYPE
self.pieces.append('<!%s>' % text)
_new_declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9:]*\s*').match
def _scan_name(self, i, declstartpos):
"""
:type i: int
:type declstartpos: int
:rtype: Tuple[Optional[str], int]
"""
rawdata = self.rawdata
n = len(rawdata)
if i == n:
return None, -1
m = self._new_declname_match(rawdata, i)
if m:
s = m.group()
name = s.strip()
if (i + len(s)) == n:
return None, -1 # end of buffer
return name.lower(), m.end()
else:
self.handle_data(rawdata)
# self.updatepos(declstartpos, i)
return None, -1
@staticmethod
def convert_charref(name):
"""
:type name: str
:rtype: str
"""
return '&#%s;' % name
@staticmethod
def convert_entityref(name):
"""
:type name: str
:rtype: str
"""
return '&%s;' % name
def output(self):
"""Return processed HTML as a single string.
:rtype: str
"""
return ''.join(self.pieces)
def parse_declaration(self, i):
"""
:type i: int
:rtype: int
"""
try:
return sgmllib.SGMLParser.parse_declaration(self, i)
except sgmllib.SGMLParseError:
# Escape the doctype declaration and continue parsing.
self.handle_data('&lt;')
return i+1