feat(): Introduce domain-specific parsers #605
Conversation
Thanks so much for fixing the types! I haven't thoroughly reviewed yet, but my first thought (and what I was getting at in the issue discussion) is that H&M and Ikea are not "custom". Other sites might benefit from those same rules, considering you're just extracting from the JSON-LD, so I think those should just be part of the default rule set. bol.com seems to be mostly custom, at least in its use of the variant id, so a truly custom rule set makes sense to me there. Other than that, again at first glance, I like the domain framework that you've set up. I'll do a thorough review when I have some more time.
cmintey
left a comment
Again, thank you for your work on this and cleaning things up! I have a few comments I'd like to see addressed.
const name = jsonld ? getProperty(jsonld, "name") : undefined;
return typeof name === "string" ? name : undefined;
You use this pattern a lot, and you've actually already created a helper in domain-helpers, so you should use that instead of repeating this code everywhere.
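The refactor the reviewer is asking for could look like this. This is a minimal sketch: the helper name getStringProperty and the getProperty shape are assumptions for illustration, not the actual domain-helpers API.

```typescript
// Hypothetical sketch; getStringProperty / getProperty are assumed names,
// not the repo's actual domain-helpers exports.
type JsonLd = Record<string, unknown>;

const getProperty = (data: JsonLd, key: string): unknown => data[key];

// Shared helper: look up a property and return it only if it is a string.
const getStringProperty = (
    data: JsonLd | undefined,
    key: string,
): string | undefined => {
    const value = data ? getProperty(data, key) : undefined;
    return typeof value === "string" ? value : undefined;
};

// Each call site then collapses to a single line:
// const name = getStringProperty(jsonld, "name");
```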
const formatted = toPriceFormat($('[property="og:price:amount"]').attr("content"));
return formatted !== undefined ? String(formatted) : undefined;
It would probably be better to move this logic into the toPriceFormat method, so the cast to a string happens there.
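A sketch of that suggestion, assuming toPriceFormat currently normalizes the raw value to a number (the real signature isn't shown in this diff):

```typescript
// Hedged sketch: toPriceFormat's real implementation is not in the diff.
// Casting inside the helper lets call sites drop their own String(...) wrappers.
const toPriceFormat = (raw: string | undefined): string | undefined => {
    if (raw === undefined) return undefined;
    // Assumed normalization: accept comma decimals, reject non-numeric input.
    const parsed = Number.parseFloat(raw.replace(",", "."));
    if (Number.isNaN(parsed)) return undefined;
    // Return a ready-to-use string instead of a number.
    return parsed.toFixed(2);
};

// The call site above becomes:
// return toPriceFormat($('[property="og:price:amount"]').attr("content"));
```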
 * Extract price from JSON-LD offers with proper formatting.
 */
export const extractPrice = (data: unknown): string | undefined => {
    return extractOffersField(data, "price");
This should make use of toPriceFormat
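Wiring that up might look like the following. Both extractOffersField and toPriceFormat are assumed shapes here, sketched only to show extractPrice routing its result through the formatter rather than returning the raw offers value:

```typescript
// Assumed shape: walk data.offers and pull out one field.
const extractOffersField = (data: unknown, field: string): unknown => {
    if (typeof data !== "object" || data === null) return undefined;
    const offers = (data as Record<string, unknown>)["offers"];
    if (typeof offers !== "object" || offers === null) return undefined;
    return (offers as Record<string, unknown>)[field];
};

// Assumed formatter: normalize any raw value into a two-decimal price string.
const toPriceFormat = (value: unknown): string | undefined => {
    const parsed = Number.parseFloat(String(value));
    return Number.isNaN(parsed) ? undefined : parsed.toFixed(2);
};

// extractPrice now applies the formatter instead of returning the raw field.
const extractPrice = (data: unknown): string | undefined =>
    toPriceFormat(extractOffersField(data, "price"));
```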
 */
const domains = ["ikea.com"];

export const ikeaComRules: ShoppingDomainRules = rulesForDomain(domains, {
Why does Ikea need a custom parser? This site works for me with the base parser. Also, the rules here are not unique to Ikea and are basically the same as the base parser
I've had pages that didn't load for me, where the current parser did not pick up the metadata.
 */
const domains = ["hm.com"];

export const hmComRules: ShoppingDomainRules = rulesForDomain(domains, {
I don't think this needs to be a custom parser. There are likely several sites that also use this ProductGroup schema with a variant. I'd prefer these rules just be added to the base parser
One of the reasons I considered it as a separate parser is performance. Every rule you add to the "base parser" is more it needs to process, even if the page doesn't contain such an element. It still needs to be parsed (albeit memoized, but still), where it could've been skipped altogether.
Also, I used hm, ikea & bol as examples. Happy to omit one or more if that's preferred.
I don't think the performance hit is very large. I'd rather the parsing take just a few ms longer to have a robust set of standard rules, and then only have a few custom rules for outliers. For example, I think it makes sense for Amazon to have its own parser, since, as you pointed out, there are several Amazon-only rules in the base parser.
As discussed here: #594
I suggested a domain-specific parser there. Yes, hm.com gave access denied due to Akamai protection, but it did load a few times during testing.
As you can see, it can easily be extended by the community with domain-specific parsing. I've tried to keep it as simple and reproducible as possible, while keeping the original parser intact.
I also introduced typings in line with the metascraper package, since Check wasn't exposed anymore in the installed version.
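The dispatch idea behind the framework can be sketched as follows. The names rulesForDomain and ShoppingDomainRules come from the diff, but their shapes here are assumptions for illustration, as is the registry lookup:

```typescript
// Assumed shapes, mirroring names from the PR diff but not its actual types.
interface ShoppingDomainRules {
    domains: string[];
    rules: Record<string, (html: string) => string | undefined>;
}

const rulesForDomain = (
    domains: string[],
    rules: ShoppingDomainRules["rules"],
): ShoppingDomainRules => ({ domains, rules });

// A registry lets the parser skip rule sets whose domain doesn't match,
// instead of running every custom rule against every page.
const registry: ShoppingDomainRules[] = [
    rulesForDomain(["bol.com"], {
        // Placeholder: bol.com's variant-id extraction would go here.
        name: (_html) => undefined,
    }),
];

const rulesFor = (hostname: string): ShoppingDomainRules | undefined =>
    registry.find((r) => r.domains.some((d) => hostname.endsWith(d)));
```

Matching on the hostname up front keeps the base parser's cost independent of how many domain-specific rule sets the community contributes.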