git @ Cat's Eye Technologies / Feedmark / 0.2
Merge pull request #2 from catseye/develop-0.2: Develop 0.2. Chris Pressey authored 4 years ago; GitHub committed 4 years ago.
8 changed file(s) with 489 addition(s) and 41 deletion(s).

README.md

 Feedmark
 ========
 
-*Version 0.1. Subject to change in backwards-incompatible ways without notice.*
+*Version 0.2. Subject to change in backwards-incompatible ways without notice.*
 
 Feedmark is a format for embedding entities in Markdown files with
 accompanying metadata in a way which is both human-readable and
 machine-extractable.
 
 To this end, it is not dissimilar to [Falderal][]; however, it has
 different goals. It is more oriented toward "curational" tasks.
-[The Dossier][] is (nominally) written in Feedmark format.
+[The Dossier][] is written in Feedmark format.
 
 Informally, the format says that every `h3`-level heading in the
 Markdown file gives the title of an entity, and may be followed
 immediately by the entity's "plaque", which is a bullet list
 where every item is prefixed by an identifier and a colon.
 
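(For illustration, here is a minimal sketch of an entry; the title and
property names are hypothetical, but the shape follows the description
above: an `h3` heading, then a plaque of `identifier: value` bullets,
then free Markdown body text.)

    ### Llama sighted at local zoo
    
    *   seen at: local zoo
    *   wikipedia: [Llama](https://en.wikipedia.org/wiki/Llama)
    
    A llama was spotted near the petting zoo enclosure.
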
-This repository contains a Python program, `feedmark`, which is an
-implementation of an extractor for the Feedmark format. It is
-currently able to:
+Example Feedmark documents can be found in the `eg/` directory.
+Further examples can be found in [The Dossier][].
 
-* parse a set of Feedmark documents and extract entries from them
-* dump a summary of the parsed entries and their properties
-* dump an inverted index of each property found, and its entries
-* write out an Atom (née RSS) feed containing the parsed entries
-* parse all of the "Items of Note" lists in The Dossier
+Implementation
+--------------
 
-Example Feedmark documents can be found in the `eg/` directory.
+This repository contains a Python program, `feedmark`, which is a
+reference implementation of a processor for the Feedmark format.
+It is currently able to do the following things:
 
-[Falderal]: http://catseye.tc/node/Falderal
-[The Dossier]: https://github.com/catseye/The-Dossier/
+### Parse Feedmark documents
 
-Example Usage
--------------
+This will check that the documents are minimally well-formed.
+
+    bin/feedmark eg/*.md
+
+### Archive all web objects linked to from the documents
+
+    bin/feedmark --archive-links-to=downloads eg/Recent\ Llama\ Sightings.md
+
+If it is only desired that the links be checked, `--check-links` will
+make `HEAD` requests and will not save any of the responses.
+
+### Convert Feedmark documents to an Atom (née RSS) feed
 
     bin/feedmark "eg/Recent Llama Sightings.md" --output-atom=feed.xml
     python -m SimpleHTTPServer 7000 &
     python -m webbrowser http://localhost:7000/feed.xml
 
+### Check entries against a schema
+
+A Feedmark schema is simply another Feedmark document, one in which
+each entry describes a property that entries should have.
+
+    bin/feedmark eg/Video\ games.md --check-against-schema=eg/Video\ games\ schema.md
+
+Note that this facility is still under development.
+
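(A sketch of what a schema entry might look like; judging from the
`Schema` checker in `feedmark/checkers.py` below, an entry's property is
treated as required unless the schema entry carries `optional: true`.)

    ### wikipedia
    
    *   optional: true
    
    If the entity has a Wikipedia article, a link to it.
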
+### Rewrite documents in-place
+
+They will be parsed as Feedmark, and then output as Markdown, to the
+same files that were read in as input. (This is destructive, but it
+is recommended that the original files be kept under a version control
+system such as `git`, which will easily allow the changes to be reverted.)
+
+    bin/feedmark --rewrite-markdown eg/*.md
+
+Note that this facility is still under development.
+
+### Interlink documents
+
+Markdown supports "reference-style" links, which are not inline
+with the text.
+
+`feedmark` can rewrite reference-style links that match the name of
+an entry in a previously-created "refdex", so that they can be kept
+current and point to the canonical document in which the entry exists,
+since an entry may exist in multiple documents, or be moved over time.
+
+    bin/feedmark eg/*.md --output-refdex >refdex.json
+    bin/feedmark --input-refdex=refdex.json --rewrite-markdown eg/*.md
+
+Note that this facility is still under development.
+
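(Judging from how `--output-refdex` is implemented in `feedmark/main.py`
below, the refdex is a JSON object mapping each entry title to the
filename and anchor where that entry lives; a hypothetical `refdex.json`:)

    {
        "Llama sighted at local zoo": {
            "anchor": "llama-sighted-at-local-zoo",
            "filename": "eg/Recent Llama Sightings.md"
        }
    }
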
+### Write out to miscellaneous formats
+
+Output entries as JSON, indexed by entry, or by property:
+
+    bin/feedmark --dump-entries eg/*.md
+    bin/feedmark --by-property eg/*.md
+
+Output entries as Markdown, or HTML, or a snippet of HTML:
+
+    bin/feedmark --output-markdown eg/*.md
+    bin/feedmark --output-html eg/*.md
+    bin/feedmark --output-html-snippet eg/*.md
+
 Motivation
 ----------
 
-Why is this desirable? Because if your structured data format is
+Why is Feedmark desirable? Because if your structured data format is
 a subset of Markdown, the effort to format it into something
 nicely human-readable is very small. YAML and Markdown are both
 fairly easy to read as raw text, but GitHub, for example,
 automatically formats Markdown as HTML, making it that much nicer.
+
+Or, if you like the transitivity: in the same way that a Markdown
+file is still a readable text file, which is nice, a Feedmark file
+is still a readable Markdown file, which is still a readable text
+file, which is nice.
+
+TODO
+----
+
+"Common" properties on a document, which all entries within it inherit.
+
+Sub-entries. Somehow. For individual games in a series, implementations
+or variations on a programming language, etc.
+
+Allow trailing `###` on h3-level headings.
+
+Index creation from refdex, for permalinks.
+
+[Falderal]: http://catseye.tc/node/Falderal
+[The Dossier]: https://github.com/catseye/The-Dossier/

eg/Video games schema.md

+Video Games Schema
+==================
+
+This is an example schema which defines the properties an entry should have,
+if it is an entry for a video game. This is not official or anything.
+
+### available for
+
+The platform that the video game was available for.
+
+### published by
+
+The entity which published the video game.
+
+### genre
+
+The genre (nominal) of the video game.
+
+### wikipedia
+
+Optional. If it has an entry on Wikipedia, a link to that.
+
+### controls
+
+What controls are used by the player when playing the video game.
+
+### written by
+
+The author(s) of the video game.
+
+### entry
+
+Multiple may occur. Gives the link for an entry in a video games database.
+
+### play online
+
+Multiple may occur.
+
+### walkthrough
+
+Multiple may occur.
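
(For illustration, a hypothetical entry that would satisfy this schema;
note that, per the parser below, a plaque item may use `@` instead of `:`
for properties that may occur multiple times.)

    ### Attack of the Zorgs
    
    *   available for: Commodore 64
    *   published by: Zorgco
    *   genre: Shoot-em-up
    *   wikipedia: [Attack of the Zorgs](https://example.com/wiki/zorgs)
    *   controls: joystick
    *   written by: A. N. Author
    *   entry @ [Example games database](https://example.com/db/zorgs)
    *   play online @ [Example emulator site](https://example.com/play/zorgs)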

requirements.txt

 atomize==0.2.0
 Markdown==2.6.8
+beautifulsoup4==4.6.0
+requests==2.17.3

feedmark/checkers.py

+import os
+from time import sleep
+import urllib
+
+from bs4 import BeautifulSoup
+import markdown
+import requests
+
+try:
+    from tqdm import tqdm
+except ImportError:
+    def tqdm(x, **kwargs): return x
+
+
+class Schema(object):
+    def __init__(self, document):
+        self.document = document
+        self.property_rules = {}
+        self.property_priority_order = []
+        for section in self.document.sections:
+            self.property_rules[section.title] = section
+            self.property_priority_order.append(section.title)
+
+    def check(self, section):
+        results = []
+        for key, value in section.properties.iteritems():
+            if key not in self.property_rules:
+                results.append(['extra', key])
+        for key, value in self.property_rules.iteritems():
+            optional = value.properties.get('optional', 'false') == 'true'
+            if optional:
+                continue
+            if key not in section.properties:
+                results.append(['missing', key])
+        return results
+
+    def get_property_priority_order(self):
+        return self.property_priority_order
+
+
+def extract_links(html_text):
+    links = []
+    soup = BeautifulSoup(html_text, 'html.parser')
+    for link in soup.find_all('a'):
+        url = link.get('href')
+        links.append(url)
+    return links
+
+
+def extract_links_from_documents(documents):
+    links = []
+    for document in documents:
+        for section in document.sections:
+            for (name, url) in section.images:
+                links.append((url, section))
+            for key, value in section.properties.iteritems():
+                if isinstance(value, list):
+                    for subitem in value:
+                        links.extend([(url, section) for url in extract_links(markdown.markdown(subitem))])
+                else:
+                    links.extend([(url, section) for url in extract_links(markdown.markdown(value))])
+            links.extend([(url, section) for url in extract_links(markdown.markdown(section.body))])
+    return links
+
+
+def url_to_dirname_and_filename(url):
+    parts = url.split('/')
+    parts = parts[2:]
+    domain_name = parts[0]
+    domain_name = urllib.quote_plus(domain_name)
+    parts = parts[1:]
+    filename = '/'.join(parts)
+    filename = urllib.quote_plus(filename)
+    if not filename:
+        filename = 'index.html'
+    return (domain_name, filename)
+
+
+def download(url, filename):
+    response = requests.get(url, stream=True)
+    part_filename = filename + '_part'
+    with open(part_filename, "wb") as f:
+        for data in response.iter_content():
+            f.write(data)
+    os.rename(part_filename, filename)
+    return response
+
+
+delay_between_fetches = 0
+
+
+def archive_links(documents, dest_dir):
+    """If dest_dir is None, links will only be checked for existence, not downloaded."""
+    links = extract_links_from_documents(documents)
+
+    failures = []
+    for url, section in tqdm(links, total=len(links)):
+        try:
+            if not url.startswith(('http://', 'https://')):
+                raise ValueError('Not http: {}'.format(url))
+            if dest_dir is not None:
+                dirname, filename = url_to_dirname_and_filename(url)
+                dirname = os.path.join(dest_dir, dirname)
+                if not os.path.exists(dirname):
+                    os.makedirs(dirname)
+                filename = os.path.join(dirname, filename)
+                response = download(url, filename)
+            else:
+                response = requests.head(url)
+            status = response.status_code
+        except Exception as e:
+            status = str(e)
+        if status not in (200, 301, 302, 303):
+            failures.append({
+                'status': status,
+                'url': url,
+                'section': str(section)
+            })
+        if delay_between_fetches > 0:
+            sleep(delay_between_fetches)
+    return failures
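
A couple of concrete (hypothetical) cases for `url_to_dirname_and_filename`
above, which decides where an archived copy of each web object is stored;
the path portion is `quote_plus`-escaped into a single filename:

    # Python 2 sketch, matching the module above; URLs are hypothetical.
    url_to_dirname_and_filename('http://example.com/zorgs/sightings')
    # => ('example.com', 'zorgs%2Fsightings')
    url_to_dirname_and_filename('http://example.com/')
    # => ('example.com', 'index.html')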

feedmark/feeds.py

     if 'link-to-anchors-on' not in section.document.properties:
         return None
 
-    title = re.sub(r"[':,.!]", '', section.title)
-    anchor = (title.replace(u' ', u'-').lower()).encode('utf-8')
-    return '{}#{}'.format(section.document.properties['link-to-anchors-on'], quote_plus(anchor))
+    return '{}#{}'.format(section.document.properties['link-to-anchors-on'], quote_plus(section.anchor))
 
 
 def extract_feed_properties(document):
...
     sections = []
     for document in documents:
         for section in document.sections:
-            section.document = document  # TODO: maybe the parser should do this for us
             sections.append(section)
     sections.sort(key=lambda section: section.publication_date, reverse=True)
     return sections
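
The `section.anchor` used here comes from the `anchor` property added in
`feedmark/parser.py` below; a worked (hypothetical) example of that
computation:

    # Python 2 sketch of the anchor rule: strip ' : , . ! characters,
    # then hyphenate spaces and lowercase.
    import re
    title = re.sub(r"[':,.!]", '', u"Attack of the Zorgs!")
    anchor = (title.replace(u' ', u'-').lower()).encode('utf-8')
    # anchor == 'attack-of-the-zorgs'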

feedmark/htmlizer.py

     return text
 
 
-def render_section(section):
+def render_section_snippet(section):
     date = section.publication_date.strftime('%b %-d, %Y')
     if 'summary' in section.properties:
         summary = strip_outer_p(markdown.markdown(section.properties['summary']))
...
     return '{}: {} {}'.format(date, summary, read_more)
 
 
-def feedmark_htmlize(documents, limit=None):
+def feedmark_htmlize_snippet(documents, limit=None):
     properties = {}
 
     sections = extract_sections(documents)
...
     for (n, section) in enumerate(sections):
         if limit is not None and n >= limit:
             break
-        s += u'<li>{}</li>\n'.format(render_section(section))
+        s += u'<li>{}</li>\n'.format(render_section_snippet(section))
     s += u'</ul>'
 
     return s
+
+
+def items_in_priority_order(di, priority):
+    for key in priority:
+        if key in di:
+            yield key, di[key]
+    for key, item in sorted(di.iteritems()):
+        if key not in priority:
+            yield key, item
+
+
+def markdownize_properties(properties, property_priority_order):
+    if not properties:
+        return ''
+    md = ''
+    for key, value in items_in_priority_order(properties, property_priority_order):
+        if isinstance(value, list):
+            for subitem in value:
+                md += u'* {} @ {}\n'.format(key, subitem)
+        else:
+            md += u'* {}: {}\n'.format(key, value)
+    md += '\n'
+    return md
+
+
+def markdownize_reference_links(reference_links):
+    if not reference_links:
+        return ''
+    md = ''
+    md += '\n'
+    for name, url in reference_links:
+        md += '[{}]: {}\n'.format(name, url)
+    return md
+
+
+def feedmark_markdownize(document, schema=None):
+    property_priority_order = []
+    if schema is not None:
+        property_priority_order = schema.get_property_priority_order()
+
+    md = u'{}\n{}\n\n'.format(document.title, '=' * len(document.title))
+    md += markdownize_properties(document.properties, property_priority_order)
+    md += u'\n'.join(document.preamble)
+    md += markdownize_reference_links(document.reference_links)
+    for section in document.sections:
+        md += u'\n'
+        md += u'### {}\n\n'.format(section.title)
+        if section.images:
+            for name, url in section.images:
+                md += u'![{}]({})\n'.format(name, url)
+            md += u'\n'
+        md += markdownize_properties(section.properties, property_priority_order)
+        md += section.body
+        md += markdownize_reference_links(section.reference_links)
+        md += '\n'
+    return md
+
+
+def feedmark_htmlize(document, *args, **kwargs):
+    return markdown.markdown(feedmark_markdownize(document, *args, **kwargs))
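
The ordering rule in `items_in_priority_order` above: schema-listed keys
come out first, in schema order, and any remaining keys follow in sorted
order. A small (hypothetical) Python 2 sketch:

    di = {'genre': 'Shoot-em-up', 'controls': 'joystick', 'available for': 'Commodore 64'}
    list(items_in_priority_order(di, ['available for', 'controls']))
    # => [('available for', 'Commodore 64'), ('controls', 'joystick'),
    #     ('genre', 'Shoot-em-up')]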

feedmark/main.py

 from argparse import ArgumentParser
 import codecs
+import json
 import sys
 
 from feedmark.atomizer import feedmark_atomize
-from feedmark.htmlizer import feedmark_htmlize
 from feedmark.feeds import extract_sections
 from feedmark.parser import Parser
 
...
 argparser.add_argument('input_files', nargs='+', metavar='FILENAME', type=str,
     help='Markdown files containing the embedded entries'
 )
+
 argparser.add_argument('--by-property', action='store_true',
-    help='Display a list of all properties found and list the entries they were found on'
+    help='Output JSON containing a list of all properties found and the entries they were found on'
 )
 argparser.add_argument('--dump-entries', action='store_true',
     help='Display a summary of the entries on standard output'
 )
+
+argparser.add_argument('--archive-links-to', metavar='DIRNAME', type=str, default=None,
+    help='Download a copy of all web objects linked to from the entries'
+)
+argparser.add_argument('--check-links', action='store_true',
+    help='Check if web objects linked to from the entries exist'
+)
+argparser.add_argument('--check-against-schema', metavar='FILENAME', type=str, default=None,
+    help='Check if entries have the properties specified by this schema. This schema will '
+         'also provide hints (such as ordering of properties) when outputting Markdown or HTML.'
+)
+
 argparser.add_argument('--output-atom', metavar='FILENAME', type=str,
     help='Construct an Atom XML feed from the entries and write it out to this file'
 )
+argparser.add_argument('--output-markdown', action='store_true',
+    help='Reconstruct a Markdown document from the entries and write it to stdout'
+)
+argparser.add_argument('--output-html', action='store_true',
+    help='Construct an HTML5 article element from the entries and write it to stdout'
+)
 argparser.add_argument('--output-html-snippet', action='store_true',
     help='Construct a snippet of HTML from the entries and write it to stdout'
 )
+
+argparser.add_argument('--rewrite-markdown', action='store_true',
+    help='Rewrite all input Markdown documents in-place. Note!! Destructive!!'
+)
+
+argparser.add_argument('--input-refdex', metavar='FILENAME', type=str,
+    help='Load this JSON file as the reference-style links index before processing'
+)
+argparser.add_argument('--output-refdex', action='store_true',
+    help='Construct reference-style links index from the entries and write it to stdout as JSON'
+)
+
 argparser.add_argument('--limit', metavar='COUNT', type=int, default=None,
     help='Process no more than this many entries when making an Atom or HTML feed'
 )
...
 
 documents = []
 
-for filename in options.input_files:
+### helpers
+
+def read_document_from(filename):
     with codecs.open(filename, 'r', encoding='utf-8') as f:
         markdown_text = f.read()
     parser = Parser(markdown_text)
     document = parser.parse_document()
-    documents.append(document)
+    document.filename = filename
+    return document
 
 def write(s):
     print(s.encode('utf-8'))
+
+### input
+
+for filename in options.input_files:
+    document = read_document_from(filename)
+    documents.append(document)
+
+refdex = {}
+if options.input_refdex:
+    with codecs.open(options.input_refdex, 'r', encoding='utf-8') as f:
+        refdex = json.loads(f.read())
+
+### processing
+
+if options.check_links or options.archive_links_to is not None:
+    from feedmark.checkers import archive_links
+    result = archive_links(documents, options.archive_links_to)
+    write(json.dumps(result, indent=4, sort_keys=True))
+
+schema = None
+if options.check_against_schema is not None:
+    from feedmark.checkers import Schema
+    schema_document = read_document_from(options.check_against_schema)
+    schema = Schema(schema_document)
+    results = []
+    for document in documents:
+        for section in document.sections:
+            result = schema.check(section)
+            if result:
+                results.append({
+                    'section': section.title,
+                    'document': document.title,
+                    'result': result
+                })
+    if results:
+        write(json.dumps(results, indent=4, sort_keys=True))
+        sys.exit(1)
+
+### processing: collect refdex phase
+# NOTE: we only run this if we were asked to output a refdex -
+# this is to prevent scurrilous insertion of refdex entries when rewriting.
+
+if options.output_refdex:
+    for document in documents:
+        for section in document.sections:
+            refdex[section.title] = {
+                'filename': document.filename,
+                'anchor': section.anchor
+            }
+
+### processing: rewrite references phase
+
+def rewrite_reference_links(refdex, reference_links):
+    from urllib import quote
+
+    new_reference_links = []
+    for (name, url) in reference_links:
+        if name in refdex:
+            url = '{}#{}'.format(quote(refdex[name]['filename']), quote(refdex[name]['anchor']))
+        new_reference_links.append((name, url))
+    return new_reference_links
+
+if refdex:
+    for document in documents:
+        document.reference_links = rewrite_reference_links(refdex, document.reference_links)
+        for section in document.sections:
+            section.reference_links = rewrite_reference_links(refdex, section.reference_links)
+
+### output
+
+if options.output_refdex:
+    write(json.dumps(refdex, indent=4, sort_keys=True))
 
 if options.dump_entries:
     for document in documents:
...
             for key, value in section.properties.iteritems():
                 if isinstance(value, list):
                     key = u'{}@'.format(key)
-                by_property.setdefault(key, set()).add(section.title)
-    for property_name, entry_set in sorted(by_property.iteritems()):
-        write(property_name)
-        for entry_name in sorted(entry_set):
-            write(u'    {}'.format(entry_name))
+                by_property.setdefault(key, {}).setdefault(section.title, value)
+    write(json.dumps(by_property, indent=4))
+
+if options.output_markdown:
+    from feedmark.htmlizer import feedmark_markdownize
+    for document in documents:
+        s = feedmark_markdownize(document, schema=schema)
+        write(s)
+
+if options.rewrite_markdown:
+    from feedmark.htmlizer import feedmark_markdownize
+    for document in documents:
+        s = feedmark_markdownize(document, schema=schema)
+        with open(document.filename, 'w') as f:
+            f.write(s.encode('UTF-8'))
+
+if options.output_html:
+    from feedmark.htmlizer import feedmark_htmlize
+    for document in documents:
+        s = feedmark_htmlize(document, schema=schema)
+        write(s)
 
 if options.output_html_snippet:
-    s = feedmark_htmlize(documents, limit=options.limit)
+    from feedmark.htmlizer import feedmark_htmlize_snippet
+    s = feedmark_htmlize_snippet(documents, limit=options.limit)
     write(s)
 
 if options.output_atom:
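
From the `--by-property` branch above, the emitted JSON maps each
property name (with `@` appended when the property is list-valued) to
the entries that carry it and their values; hypothetical output:

    {
        "genre": {
            "Attack of the Zorgs": "Shoot-em-up"
        },
        "play online@": {
            "Attack of the Zorgs": ["[Example emulator site](https://example.com/play/zorgs)"]
        }
    }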

feedmark/parser.py

 
         self.sections = []
 
+    def __str__(self):
+        return "document '{}'".format(self.title.encode('utf-8'))
+
 
 class Section(object):
     def __init__(self, title):
+        self.document = None
         self.title = title
         self.properties = {}
 
         self.lines = []
 
     def __str__(self):
-        return "section '{}'".format(self.title.encode('utf-8'))
+        s = "section '{}'".format(self.title.encode('utf-8'))
+        if self.document:
+            s += " of " + str(self.document)
+        return s
 
     def add_line(self, line):
         self.lines.append(line.rstrip())
...
         except ValueError:
             pass
         raise NotImplementedError
+
+    @property
+    def anchor(self):
+        title = re.sub(r"[':,.!]", '', self.title)
+        return (title.replace(u' ', u'-').lower()).encode('utf-8')
 
 
 class Parser(object):
...
         return re.match(r'^\*\s+(.*?)\s*(\:|\@)\s*(.*?)\s*$', self.line)
 
     def is_heading_line(self):
-        return re.match(r'^\#.*?$', self.line)
+        return re.match(r'^\#\#\#\s+(.*?)\s*$', self.line)
+
+    def is_reference_link_line(self):
+        return re.match(r'^\[(.*?)\]\:\s*(.*?)\s*$', self.line)
 
     def parse_document(self):
         # Feed ::= :Title Properties Body {Section}.
...
         title = self.parse_title()
         document = Document(title)
         document.properties = self.parse_properties()
-        document.preamble = self.parse_body()
+        preamble, reference_links = self.parse_body()
+        document.preamble = preamble
+        document.reference_links = reference_links
         while not self.eof():
             section = self.parse_section()
+            section.document = document
             document.sections.append(section)
         return document
 
     def parse_title(self):
-        match = re.match(r'^\#\s*([^#].*?)\s*$', self.line)
+        match = re.match(r'^\#\s+(.*?)\s*$', self.line)
         if match:
             title = match.group(1)
             self.scan()
...
         while self.is_blank_line():
             self.scan()
 
-        match = re.match(r'^\#\#\#\s*([^#].*?)\s*$', self.line)
+        match = re.match(r'^\#\#\#\s+(.*?)\s*$', self.line)
         if not match:
             raise ValueError('Expected section, found "{}"'.format(self.line))
 
...
         self.scan()
         section.images = self.parse_images()
         section.properties = self.parse_properties()
-        section.lines = self.parse_body()
+        lines, reference_links = self.parse_body()
+        section.lines = lines
+        section.reference_links = reference_links
         return section
 
     def parse_images(self):
...
 
     def parse_body(self):
         lines = []
-        while not self.eof() and not self.is_heading_line():
+        reference_links = []
+        while not self.eof() and not self.is_heading_line() and not self.is_reference_link_line():
             lines.append(self.line)
             self.scan()
-        return lines
+        while not self.eof() and (self.is_reference_link_line() or self.is_blank_line()):
+            if self.is_reference_link_line():
+                match = re.match(r'^\[(.*?)\]\:\s*(.*?)\s*$', self.line)
+                if match:
+                    reference_links.append((match.group(1), match.group(2)))
+            self.scan()
+        return (lines, reference_links)
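
Putting the parser pieces together, mirroring `read_document_from()` in
`feedmark/main.py` above (Python 2; the filename comes from the README's
examples):

    import codecs
    from feedmark.parser import Parser
    
    with codecs.open('eg/Recent Llama Sightings.md', 'r', encoding='utf-8') as f:
        markdown_text = f.read()
    document = Parser(markdown_text).parse_document()
    for section in document.sections:
        print('{} -> {}'.format(section.title.encode('utf-8'), section.anchor))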