Commit ca47f9c

release: 2.1.0 (#373)
1 parent 3614e31 commit ca47f9c

4 files changed: 303 additions, 12 deletions
CHANGELOG.md

Lines changed: 47 additions & 0 deletions
@@ -1,5 +1,52 @@
 # Mercury Parser Changelog

+### 2.1.0 (Apr 10, 2019)
+
+##### Commits
+
+- [[`3614e31abc`](https://github.com/postlight/mercury-parser/commit/3614e31abc)] - **fix**: skip absolutizing empty hrefs (#372) (Toufic Mouallem)
+- [[`73be0c5a10`](https://github.com/postlight/mercury-parser/commit/73be0c5a10)] - **feat**: add www.jnsa.org custom parser (#346) (kik0220)
+- [[`eacd1ee97f`](https://github.com/postlight/mercury-parser/commit/eacd1ee97f)] - **feat**: custom genius parser. (#284) (Adam Pash)
+- [[`c389c966d7`](https://github.com/postlight/mercury-parser/commit/c389c966d7)] - **feat**: add jvndb.jvn.jp custom parser (#345) (kik0220)
+- [[`8493d05cb5`](https://github.com/postlight/mercury-parser/commit/8493d05cb5)] - **feat**: add scan.netsecurity.ne.jp custom parser (#347) (kik0220)
+- [[`2a76c6c212`](https://github.com/postlight/mercury-parser/commit/2a76c6c212)] - **feat**: add www.elecom.co.jp custom parser (#348) (kik0220)
+- [[`a9e010b718`](https://github.com/postlight/mercury-parser/commit/a9e010b718)] - **feat**: add www.sanwa.co.jp custom parser (#349) (kik0220)
+- [[`1639eae324`](https://github.com/postlight/mercury-parser/commit/1639eae324)] - **feat**: add www.asahi.com custom parser (#350) (kik0220)
+- [[`21f7de70c1`](https://github.com/postlight/mercury-parser/commit/21f7de70c1)] - **feat**: add buzzap.jp custom parser (#351) (kik0220)
+- [[`f3a7e393a3`](https://github.com/postlight/mercury-parser/commit/f3a7e393a3)] - **feat**: add www.ossnews.jp custom parser (#352) (kik0220)
+- [[`c309bdb373`](https://github.com/postlight/mercury-parser/commit/c309bdb373)] - **feat**: add otrs.com custom parser (#353) (kik0220)
+- [[`71c4d05037`](https://github.com/postlight/mercury-parser/commit/71c4d05037)] - **chore**: Include "src/shims" for webpack builds for web (#302) (Alexsander Akers)
+- [[`a3fe02678c`](https://github.com/postlight/mercury-parser/commit/a3fe02678c)] - **chore**: small CoC typofix (#358) (Frankie Simms)
+- [[`437f50a5c8`](https://github.com/postlight/mercury-parser/commit/437f50a5c8)] - **fix**: Initialize Content-Type as empty string if not present (#359) (John Holdun)
+- [[`da9a836eab`](https://github.com/postlight/mercury-parser/commit/da9a836eab)] - **chore**: remove unneeded import (#357) (Frankie Simms)
+- [[`bafa764000`](https://github.com/postlight/mercury-parser/commit/bafa764000)] - **chore**: set up ciftr for failed test reports (#343) (Frankie Simms)
+- [[`262dda94b3`](https://github.com/postlight/mercury-parser/commit/262dda94b3)] - **fix**: explicity reject non-200 status codes (#342) (Toufic Mouallem)
+- [[`b6c82f2b16`](https://github.com/postlight/mercury-parser/commit/b6c82f2b16)] - **docs**: fix extend typo in README (#340) (Drew Bell)
+- [[`144a797564`](https://github.com/postlight/mercury-parser/commit/144a797564)] - **feat**: Support passing custom headers in requests (#337) (Toufic Mouallem)
+- [[`3ed778b53e`](https://github.com/postlight/mercury-parser/commit/3ed778b53e)] - **fix**: Adapt CNBC extractor to article redesign (#336) (Toufic Mouallem)
+- [[`da9606a4cb`](https://github.com/postlight/mercury-parser/commit/da9606a4cb)] - **docs**: Add parsing custom HTML to README.md (#326) (Toufic Mouallem)
+- [[`b3e2a0ffd1`](https://github.com/postlight/mercury-parser/commit/b3e2a0ffd1)] - **feat**: extract custom types with extend option (#313) (Drew Bell)
+- [[`136d6df798`](https://github.com/postlight/mercury-parser/commit/136d6df798)] - **feat**: Return specific errors on failed parse attempts (Toufic Mouallem)
+- [[`a250f403f5`](https://github.com/postlight/mercury-parser/commit/a250f403f5)] - **fix**: Preserve whitespace in certain HTML elements (#333) (Toufic Mouallem)
+- [[`2a3ade706d`](https://github.com/postlight/mercury-parser/commit/2a3ade706d)] - **fix**: run parser preview (Adam Pash)
+- [[`a7e4c67d1d`](https://github.com/postlight/mercury-parser/commit/a7e4c67d1d)] - **feat**: Extract content from GitHub repos. (#306) (Ben Ubois)
+- [[`6e66887048`](https://github.com/postlight/mercury-parser/commit/6e66887048)] - **docs**: add content formats to README.md (#318) (Matthew Watkins)
+- [[`0940971069`](https://github.com/postlight/mercury-parser/commit/0940971069)] - **fix**: better handling for responsive images (#312) (Toufic Mouallem)
+- [[`785a22245f`](https://github.com/postlight/mercury-parser/commit/785a22245f)] - **feat**: switch from forked request to postman-request (#319) (Drew Bell)
+- [[`7844129fda`](https://github.com/postlight/mercury-parser/commit/7844129fda)] - **feat**: Add custom parser for Reddit (#307) (Toufic Mouallem)
+- [[`13581cd899`](https://github.com/postlight/mercury-parser/commit/13581cd899)] - **feat**: upgrade watchify to remove vulnerable hoek dep (#320) (Drew Bell)
+- [[`91fb0dfb46`](https://github.com/postlight/mercury-parser/commit/91fb0dfb46)] - **fix**: update parse signature in tests (#315) (Drew Bell)
+- [[`ffb25f34d7`](https://github.com/postlight/mercury-parser/commit/ffb25f34d7)] - **docs**: add usage gif (#308) (Adam Pash)
+- [[`9714cb70c5`](https://github.com/postlight/mercury-parser/commit/9714cb70c5)] - **feat**: Use Deadspin parser for all Kinja websites (#304) (Toufic Mouallem)
+- [[`83d1c2401b`](https://github.com/postlight/mercury-parser/commit/83d1c2401b)] - **feat**: add custom extractor for blisterreview.com (#299) (Jordan Hotmann)
+- [[`d9a1e7b22b`](https://github.com/postlight/mercury-parser/commit/d9a1e7b22b)] - **feat**: add news.mynavi.jp custom parser (#287) (kik0220)
+- [[`44a7ec791d`](https://github.com/postlight/mercury-parser/commit/44a7ec791d)] - **docs**: typofix (#300) (Olli Sulopuisto)
+- [[`0a15a37f04`](https://github.com/postlight/mercury-parser/commit/0a15a37f04)] - **fix**: ci artifact paths (#301) (Adam Pash)
+- [[`9698d9a0c4`](https://github.com/postlight/mercury-parser/commit/9698d9a0c4)] - **dx**: comment on custom parser pr fix (#278) (Adam Pash)
+- [[`ed14203e97`](https://github.com/postlight/mercury-parser/commit/ed14203e97)] - **fix**: return early if creating the resource failed. (#285) (Ben Ubois)
+- [[`52dfdda553`](https://github.com/postlight/mercury-parser/commit/52dfdda553)] - **deps**: Update mocha to the latest version 🚀 (#282) (greenkeeper[bot])
+- [[`b044cfa958`](https://github.com/postlight/mercury-parser/commit/b044cfa958)] - **release**: 2.0.0 (#275) (Adam Pash)
+
 ### 2.0.0 (Feb 13, 2019)

 ##### Commits

dist/mercury.js

Lines changed: 254 additions & 10 deletions
@@ -220,13 +220,13 @@ function get(options) {
     });
   });
 } // Evaluate a response to ensure it's something we should be keeping.
-// This does not validate in the sense of a response being 200 level or
-// not. Validation here means that we haven't found reason to bail from
+// This does not validate in the sense of a response being 200 or not.
+// Validation here means that we haven't found reason to bail from
 // further processing of this url.


 function validateResponse(response) {
-  var parseNon2xx = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : false;
+  var parseNon200 = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : false;

   // Check if we got a valid status code
   // This isn't great, but I'm requiring a statusMessage to be set
@@ -237,8 +237,8 @@ function validateResponse(response) {
   if (response.statusMessage && response.statusMessage !== 'OK' || response.statusCode !== 200) {
     if (!response.statusCode) {
       throw new Error("Unable to fetch content. Original exception was ".concat(response.error));
-    } else if (!parseNon2xx) {
-      throw new Error("Resource returned a response status code of ".concat(response.statusCode, " and resource was instructed to reject non-2xx level status codes."));
+    } else if (!parseNon200) {
+      throw new Error("Resource returned a response status code of ".concat(response.statusCode, " and resource was instructed to reject non-200 status codes."));
     }
   }
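
The rename from `parseNon2xx` to `parseNon200` matches what the condition actually tests: anything other than exactly 200 is rejected unless the caller opts in. The following is a minimal sketch of only the status check visible in this hunk. The name `checkStatus`, the `true` return value, and the sample response objects are assumptions made for illustration; the rest of `validateResponse` is not shown in the diff and is left out.

```javascript
// Sketch of the status check from the hunk above (not the full validateResponse).
// `checkStatus` is a hypothetical name used only for this example.
function checkStatus(response, parseNon200 = false) {
  const notOk =
    (response.statusMessage && response.statusMessage !== 'OK') ||
    response.statusCode !== 200;

  if (notOk) {
    if (!response.statusCode) {
      throw new Error(`Unable to fetch content. Original exception was ${response.error}`);
    } else if (!parseNon200) {
      throw new Error(
        `Resource returned a response status code of ${response.statusCode} and resource was instructed to reject non-200 status codes.`
      );
    }
  }
  return true; // assumption: the real function's return value is not visible here
}

// Sample responses, made up for illustration.
const okResponse = { statusCode: 200, statusMessage: 'OK' };
const createdResponse = { statusCode: 201, statusMessage: 'Created' };

console.log(checkStatus(okResponse));            // true
console.log(checkStatus(createdResponse, true)); // true, the caller opted in

try {
  checkStatus(createdResponse); // 201 is 2xx, but it is not 200
} catch (err) {
  console.log(err.message);
}
```

Under these assumptions a 201 response is rejected by default even though it is a 2xx code, which is the behavior the new error message describes.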

@@ -1248,6 +1248,7 @@ function absolutize($, rootUrl, attr) {
   $("[".concat(attr, "]")).each(function (_, node) {
     var attrs = getAttrs(node);
     var url = attrs[attr];
+    if (!url) return;
     var absoluteUrl = URL.resolve(baseUrl || rootUrl, url);
     setAttr(node, attr, absoluteUrl);
   });
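
A short sketch of why the guard matters: resolving an empty reference against a base yields the base itself, so an element with an empty attribute would otherwise be rewritten to point at the page being parsed. The example uses the standard `URL` constructor as a stand-in for the `URL.resolve` call in the hunk, on the assumption that the two treat an empty reference the same way. The helper name `absolutizeAll` and the example.com URLs are hypothetical.

```javascript
const base = 'https://example.com/articles/post.html';

// An empty reference resolves to the base URL itself.
console.log(new URL('', base).href);
// A relative reference resolves as expected.
console.log(new URL('../img/a.png', base).href);

// Hypothetical helper mirroring the guard: leave falsy values alone,
// resolve everything else against the root URL.
function absolutizeAll(values, rootUrl) {
  return values.map(value => (value ? new URL(value, rootUrl).href : value));
}

console.log(absolutizeAll(['', '/about', 'https://other.example/x'], base));
```

With the guard in place the empty string stays empty instead of turning into a self-link.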
@@ -1646,7 +1647,8 @@ var Resource = {
   generateDoc: function generateDoc(_ref) {
     var content = _ref.body,
         response = _ref.response;
-    var contentType = response.headers['content-type']; // TODO: Implement is_text function from
+    var _response$headers$con = response.headers['content-type'],
+        contentType = _response$headers$con === void 0 ? '' : _response$headers$con; // TODO: Implement is_text function from
     // https://github.com/ReadabilityHoldings/readability/blob/8dc89613241d04741ebd42fa9fa7df1b1d746303/readability/utils/text.py#L57

     if (!contentType.includes('html') && !contentType.includes('text')) {
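
The `=== void 0 ? '' : ...` pattern in the compiled output is what a destructuring default typically transpiles to, so the uncompiled source probably looks something like the sketch below. This is an assumption based on the build output, not a quote from the source tree. The helper name `isTextLike` is hypothetical, and what the real code does inside the branch is not shown in this hunk.

```javascript
// Hypothetical helper: decide whether a response looks like text or HTML.
function isTextLike(headers) {
  // Default to '' when the header is absent, as the compiled code above does.
  const { 'content-type': contentType = '' } = headers;
  return contentType.includes('html') || contentType.includes('text');
}

console.log(isTextLike({ 'content-type': 'text/html; charset=utf-8' })); // true
console.log(isTextLike({ 'content-type': 'image/png' }));                // false
console.log(isTextLike({}));                                             // false, no TypeError
```

Without the default, a response lacking the header would make `contentType` undefined and the `.includes` call would throw a `TypeError`.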
@@ -4832,6 +4834,236 @@ var WwwRedditComExtractor = {
   }
 };

+var OtrsComExtractor = {
+  domain: 'otrs.com',
+  title: {
+    selectors: ['#main article h1']
+  },
+  author: {
+    selectors: ['div.dateplusauthor a']
+  },
+  date_published: {
+    selectors: [['meta[name="article:published_time"]', 'value']]
+  },
+  dek: {
+    selectors: [['meta[name="og:description"]', 'value']]
+  },
+  lead_image_url: {
+    selectors: [['meta[name="og:image"]', 'value']]
+  },
+  content: {
+    selectors: ['#main article'],
+    defaultCleaner: false,
+    transforms: {},
+    clean: ['div.dateplusauthor', 'div.gr-12.push-6.footershare', '#atftbx', 'div.category-modul']
+  }
+};
+
+var WwwOssnewsJpExtractor = {
+  domain: 'www.ossnews.jp',
+  title: {
+    selectors: ['#alpha-block h1.hxnewstitle']
+  },
+  author: null,
+  date_published: null,
+  dek: null,
+  lead_image_url: {
+    selectors: [['meta[name="og:image"]', 'value']]
+  },
+  content: {
+    selectors: ['#alpha-block .section:has(h1.hxnewstitle)'],
+    defaultCleaner: false,
+    transforms: {},
+    clean: []
+  }
+};
+
+var BuzzapJpExtractor = {
+  domain: 'buzzap.jp',
+  title: {
+    selectors: ['h1.entry-title']
+  },
+  author: null,
+  date_published: {
+    selectors: [['time.entry-date', 'datetime']]
+  },
+  dek: null,
+  lead_image_url: {
+    selectors: [['meta[name="og:image"]', 'value']]
+  },
+  content: {
+    selectors: ['div.ctiframe'],
+    defaultCleaner: false,
+    transforms: {},
+    clean: []
+  }
+};
4901+
4902+
var WwwAsahiComExtractor = {
4903+
domain: 'www.asahi.com',
4904+
title: {
4905+
selectors: ['.ArticleTitle h1']
4906+
},
4907+
author: {
4908+
selectors: [['meta[name="article:author"]', 'value']]
4909+
},
4910+
date_published: {
4911+
selectors: [['meta[name="pubdate"]', 'value']]
4912+
},
4913+
dek: null,
4914+
excerpt: {
4915+
selectors: [['meta[name="og:description"]', 'value']]
4916+
},
4917+
lead_image_url: {
4918+
selectors: [['meta[name="og:image"]', 'value']]
4919+
},
4920+
content: {
4921+
selectors: ['#MainInner div.ArticleBody'],
4922+
defaultCleaner: false,
4923+
transforms: {},
4924+
clean: ['div.AdMod', 'div.LoginSelectArea']
4925+
}
4926+
};
4927+
4928+
var WwwSanwaCoJpExtractor = {
4929+
domain: 'www.sanwa.co.jp',
4930+
title: {
4931+
selectors: ['#newsContent h1']
4932+
},
4933+
author: null,
4934+
date_published: null,
4935+
dek: {
4936+
selectors: [['meta[name="og:description"]', 'value']]
4937+
},
4938+
lead_image_url: {
4939+
selectors: [['meta[name="og:image"]', 'value']]
4940+
},
4941+
content: {
4942+
selectors: ['#newsContent'],
4943+
defaultCleaner: false,
4944+
transforms: {},
4945+
clean: ['#smartphone', 'div.sns_box', 'div.contentFoot']
4946+
}
4947+
};
4948+
4949+
var WwwElecomCoJpExtractor = {
4950+
domain: 'www.elecom.co.jp',
4951+
title: {
4952+
selectors: ['title']
4953+
},
4954+
author: null,
4955+
date_published: null,
4956+
dek: null,
4957+
lead_image_url: null,
4958+
content: {
4959+
selectors: ['td.TableMain2'],
4960+
defaultCleaner: false,
4961+
transforms: {
4962+
table: function table($node) {
4963+
$node.attr('width', 'auto');
4964+
}
4965+
},
4966+
clean: []
4967+
}
4968+
};
4969+
4970+
var ScanNetsecurityNeJpExtractor = {
4971+
domain: 'scan.netsecurity.ne.jp',
4972+
title: {
4973+
selectors: ['header.arti-header h1.head']
4974+
},
4975+
author: null,
4976+
date_published: {
4977+
selectors: [['meta[name="article:modified_time"]', 'value']]
4978+
},
4979+
dek: {
4980+
selectors: ['header.arti-header p.arti-summary']
4981+
},
4982+
lead_image_url: {
4983+
selectors: [['meta[name="og:image"]', 'value']]
4984+
},
4985+
content: {
4986+
selectors: ['div.arti-content.arti-content--thumbnail'],
4987+
defaultCleaner: false,
4988+
transforms: {},
4989+
clean: ['aside.arti-giga']
4990+
}
4991+
};
4992+
4993+
var JvndbJvnJpExtractor = {
4994+
domain: 'jvndb.jvn.jp',
4995+
title: {
4996+
selectors: ['title']
4997+
},
4998+
author: null,
4999+
date_published: null,
5000+
dek: null,
5001+
lead_image_url: null,
5002+
content: {
5003+
selectors: ['#news-list'],
5004+
defaultCleaner: false,
5005+
transforms: {},
5006+
clean: []
5007+
}
5008+
};
5009+
5010+
var GeniusComExtractor = {
5011+
domain: 'genius.com',
5012+
title: {
5013+
selectors: ['h1']
5014+
},
5015+
author: {
5016+
selectors: ['h2 a']
5017+
},
5018+
date_published: {
5019+
selectors: [['meta[itemprop=page_data]', 'value', function (res) {
5020+
var json = JSON.parse(res);
5021+
return json.song.release_date;
5022+
}]]
5023+
},
5024+
dek: {
5025+
selectors: [// enter selectors
5026+
]
5027+
},
5028+
lead_image_url: {
5029+
selectors: [['meta[itemprop=page_data]', 'value', function (res) {
5030+
var json = JSON.parse(res);
5031+
return json.song.album.cover_art_url;
5032+
}]]
5033+
},
5034+
content: {
5035+
selectors: ['.lyrics'],
5036+
// Is there anything in the content you selected that needs transformed
5037+
// before it's consumable content? E.g., unusual lazy loaded images
5038+
transforms: {},
5039+
// Is there anything that is in the result that shouldn't be?
5040+
// The clean selectors will remove anything that matches from
5041+
// the result
5042+
clean: []
5043+
}
5044+
};
5045+
5046+
var WwwJnsaOrgExtractor = {
5047+
domain: 'www.jnsa.org',
5048+
title: {
5049+
selectors: ['#wgtitle h2']
5050+
},
5051+
author: null,
5052+
date_published: null,
5053+
dek: null,
5054+
excerpt: {
5055+
selectors: [['meta[name="og:description"]', 'value']]
5056+
},
5057+
lead_image_url: {
5058+
selectors: [['meta[name="og:image"]', 'value']]
5059+
},
5060+
content: {
5061+
selectors: ['#main_area'],
5062+
transforms: {},
5063+
clean: ['#pankuzu', '#side']
5064+
}
5065+
};
5066+
48355067

48365068

48375069
var CustomExtractors = /*#__PURE__*/Object.freeze({
@@ -4931,7 +5163,17 @@ var CustomExtractors = /*#__PURE__*/Object.freeze({
   BlisterreviewComExtractor: BlisterreviewComExtractor,
   NewsMynaviJpExtractor: NewsMynaviJpExtractor,
   GithubComExtractor: GithubComExtractor,
-  WwwRedditComExtractor: WwwRedditComExtractor
+  WwwRedditComExtractor: WwwRedditComExtractor,
+  OtrsComExtractor: OtrsComExtractor,
+  WwwOssnewsJpExtractor: WwwOssnewsJpExtractor,
+  BuzzapJpExtractor: BuzzapJpExtractor,
+  WwwAsahiComExtractor: WwwAsahiComExtractor,
+  WwwSanwaCoJpExtractor: WwwSanwaCoJpExtractor,
+  WwwElecomCoJpExtractor: WwwElecomCoJpExtractor,
+  ScanNetsecurityNeJpExtractor: ScanNetsecurityNeJpExtractor,
+  JvndbJvnJpExtractor: JvndbJvnJpExtractor,
+  GeniusComExtractor: GeniusComExtractor,
+  WwwJnsaOrgExtractor: WwwJnsaOrgExtractor
 });

 var Extractors = _Object$keys(CustomExtractors).reduce(function (acc, key) {
@@ -6389,14 +6631,16 @@ function select(opts) {
     // extract the attr

     if (_Array$isArray(matchingSelector)) {
-      var _matchingSelector = _slicedToArray(matchingSelector, 2),
+      var _matchingSelector = _slicedToArray(matchingSelector, 3),
           selector = _matchingSelector[0],
-          attr = _matchingSelector[1];
+          attr = _matchingSelector[1],
+          transform = _matchingSelector[2];

       $match = $(selector);
       $match = transformAndClean($match);
       result = $match.map(function (_, el) {
-        return $(el).attr(attr).trim();
+        var item = $(el).attr(attr).trim();
+        return transform ? transform(item) : item;
       });
     } else {
       $match = $(matchingSelector);
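
This hunk is what lets an array selector carry an optional third element, a function applied to the extracted attribute value, which the `GeniusComExtractor` added in this commit uses to pull fields out of a JSON blob. Below is a minimal, self-contained sketch of that branch. The `fakeDom` lookup table stands in for the real DOM query, the helper name `selectAttr` is hypothetical, and the sample JSON and date are made up for illustration.

```javascript
// Stand-in for the DOM lookup: maps a selector to the attributes of the
// elements it matches. A real implementation would query the parsed document.
const fakeDom = {
  'meta[itemprop=page_data]': [
    { value: ' {"song":{"release_date":"2019-04-10"}} ' }, // made-up sample data
  ],
};

// Hypothetical helper mirroring the updated array-selector branch of select().
function selectAttr([selector, attr, transform], dom) {
  const matches = dom[selector] || [];
  return matches.map(el => {
    const item = el[attr].trim();
    // The third tuple element is optional; without it the value passes through.
    return transform ? transform(item) : item;
  });
}

// Two-element selector: returns the trimmed attribute value unchanged.
const raw = selectAttr(['meta[itemprop=page_data]', 'value'], fakeDom);

// Three-element selector: the transform parses the JSON and picks one field.
const date = selectAttr(
  ['meta[itemprop=page_data]', 'value', res => JSON.parse(res).song.release_date],
  fakeDom
);

console.log(raw);
console.log(date); // ['2019-04-10']
```

Because the transform is optional, existing two-element selectors keep working unchanged.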

dist/mercury.web.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@postlight/mercury-parser",
-  "version": "2.0.0",
+  "version": "2.1.0",
   "description": "Mercury transforms web pages into clean text. Publishers and programmers use it to make the web make sense, and readers use it to read any web article comfortably.",
   "author": "Postlight <[email protected]>",
   "homepage": "https://mercury.postlight.com",
