Quickstart
This library supports .NET Standard 2.0. The core algorithm is a port of the Mozilla Readability library, which is stable and used in production inside Firefox. Building on a library maintained by Mozilla lets SmartReader benefit from robust, well-tested work.
SmartReader also adds improvements to the original library, extracting more and better metadata, including:
- Site name
- Author and publication date
- Language
- Article excerpt
- Featured image
- List of images found (can optionally be downloaded and stored as data URIs)
- Estimated time needed to read the article
Feel free to suggest new features.
Installation
Installation is straightforward using the NuGet package.
PM> Install-Package SmartReader
Usage
There are two main ways to use the library. The first involves creating a new Reader object, using the URI as the argument, and then calling the GetArticle method to obtain the extracted Article. The second uses the static method ParseArticle of the Reader class directly to return an Article. Both approaches are also available via async methods, named GetArticleAsync and ParseArticleAsync respectively.
The advantage of using an object, rather than the static method, is that it allows you to configure specific options.
You also have the option to directly parse a String or Stream obtained via other means. This is available either through the ParseArticle methods or by using the appropriate Reader constructor. In either case, you must provide the original URI. The library will not re-download the text, but it requires the URI to perform checks and fix relative links present on the page. If you cannot provide the original URI, you can use a placeholder, such as https://localhost.
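A minimal sketch of parsing text obtained by other means, using a placeholder URI. It uses the Reader constructor rather than the static method. The sample HTML is hypothetical, and the argument order (URI first, then the text) is an assumption to check against the version of the library you have installed.

```csharp
using System;
using SmartReader;

// HTML obtained by other means (hypothetical sample; real pages are much longer).
string html = "<html><head><title>Sample</title></head>"
            + "<body><article><p>Some text.</p></article></body></html>";

// Assumption: the constructor takes the URI first and the text second.
// Nothing is downloaded here; the placeholder URI is only used for checks
// and to resolve relative links.
Article article = new Reader("https://localhost", html).GetArticle();

Console.WriteLine(article.IsReadable
    ? $"Extracted: {article.Title}"
    : "No article could be extracted.");
```

With a placeholder URI, any relative link in the page is resolved against https://localhost, so use the real URI whenever it is known.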
If the extraction fails, the returned Article object will have the IsReadable field set to false.
The content of the article is unstyled, but it is wrapped in a div with the id readability-content which you can style yourself.
The library attempts to detect the correct encoding of the text, provided the correct tags are present.
Getting Images
You can call GetImagesAsync on the Article object to obtain a Task that returns a list of Image objects, representing the images found in the extracted article. This method is async because it makes HEAD requests to determine the size of the images; it only returns those larger than the specified size. The default size is 75KB. This filtering is done to exclude elements such as UI icons.
You can also call ConvertImagesToDataUriAsync on the Article object to inline the images found in the article using the data URI scheme. The method is async. This inserts the images into the Content property of the Article, which may significantly increase its size.
The data URI scheme is not efficient because it uses Base64 to encode the image bytes. Base64 encoded data is approximately 33% larger than the original data. The purpose of this method is to provide an offline article suitable for long-term storage. This is useful if the original article becomes inaccessible. The method only converts images larger than the specified size (default 75KB) to exclude UI elements.
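The 33% figure follows from Base64 mapping every 3 bytes to 4 characters. A small self-contained check, using only the .NET base class library (the 75,000-byte buffer is an arbitrary stand-in for image data, not something the library produces):

```csharp
using System;

// Stand-in for the bytes of an image (75,000 zero bytes).
byte[] raw = new byte[75_000];

// Base64 emits 4 characters for every 3 input bytes.
string encoded = Convert.ToBase64String(raw);

// 75,000 / 3 * 4 = 100,000 characters, i.e. one third more than the input.
double overhead = (double)encoded.Length / raw.Length - 1.0;
Console.WriteLine($"{raw.Length} bytes -> {encoded.Length} characters, overhead {overhead:F3}");
```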
Note that ConvertImagesToDataUriAsync will not store external elements that are not images, such as embedded videos.
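The two image methods might be combined as in the following sketch. The URL is hypothetical, and both methods are called without arguments so that the default minimum size applies. The code needs network access to run.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using SmartReader;

class ImagesExample
{
    static async Task Main()
    {
        // Hypothetical URL; replace it with a real article.
        var reader = new Reader("https://example.com/some-article");
        Article article = await reader.GetArticleAsync();

        if (!article.IsReadable)
        {
            Console.WriteLine("No article found.");
            return;
        }

        // Only images above the default minimum size are returned.
        var images = await article.GetImagesAsync();
        Console.WriteLine($"Images found: {images.Count()}");

        // Inline the images as data URIs; Content grows as a result.
        int before = article.Content.Length;
        await article.ConvertImagesToDataUriAsync();
        Console.WriteLine($"Content length: {before} -> {article.Content.Length}");
    }
}
```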
Examples
Using the GetArticle method:
SmartReader.Reader sr = new SmartReader.Reader("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");
sr.Debug = true;
sr.LoggerDelegate = Console.WriteLine;
SmartReader.Article article = sr.GetArticle();
if(article.IsReadable)
{
    // GetImagesAsync returns a Task, so await it (this requires an async method)
    var images = await article.GetImagesAsync();
    // do something with the article and its images
}
Using the ParseArticle static method:
SmartReader.Article article = SmartReader.Reader.ParseArticle("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");
if(article.IsReadable)
{
    Console.WriteLine($"Article title {article.Title}");
}
Settings
The following settings on the Reader class can be modified:
- int MaxElemsToParse
  Maximum number of nodes supported by this parser.
  Default: 0 (no limit)
- int NTopCandidates
  The number of top candidates to consider when analyzing how tight the competition is among candidates.
  Default: 5
- bool Debug
  Sets the Debug option. If set to true, the library writes data to the logger.
  Default: false
- Action<string> LoggerDelegate
  A delegate function that accepts a string as an argument; it will receive log messages.
  Default: does nothing
- ReportLevel Logging
  Level of information written with the LoggerDelegate. Valid values are those of the ReportLevel enum: Issue or Info. The first level logs only errors or issues that could prevent correctly obtaining an article. The second level logs all information needed to debug a problematic article.
  Default: ReportLevel.Issue
- bool ContinueIfNotReadable
  The library attempts to determine whether it will find an article before actually trying to do so. This option decides whether to continue if the library heuristics fail. This value is ignored if Debug is set to true.
  Default: true
- int CharThreshold
  The minimum number of characters an article must have to return a result.
  Default: 500
- bool KeepClasses
  Whether to preserve or clean CSS classes.
  Default: false
- String[] ClassesToPreserve
  The CSS classes that must be preserved in the article, if we opt not to keep all of them.
  Default: ["page"]
- bool DisableJSONLD
  The library looks first at JSON-LD to determine metadata. This setting gives you the option to disable it.
  Default: false
- Dictionary<string, int> MinContentLengthReadearable
  The minimum node content length used to decide if the document is readerable (i.e., the library will find something useful). You can provide a dictionary with values based on language.
  Default: 140
- int MinScoreReaderable
  The minimum cumulative 'score' used to determine if the document is readerable.
  Default: 20
- Func<IElement, bool> IsNodeVisible
  The function used to determine if a node is visible. Used in the process of determining if the document is readerable.
  Default: NodeUtility.IsProbablyVisible
- bool ForceHeaderEncoding
  Whether to force the encoding provided in the response header. This will convert the stream to the encoding set in the header before passing it to the HTML parser.
  Default: false
- int AncestorsDepth
  The default level of depth a node must have to be used for scoring. Nodes without as many ancestors as this level are not counted.
  Default: 5
- int ParagraphThreshold
  The default number of characters a node must have to be used for scoring.
  Default: 25
- double LinkDensityModifier
  A number that is added to the base link density threshold during the shadiness checks. This can be used to penalize nodes with a high link density or vice versa.
  Default: 0.0
- bool PreCleanPage
  Some pages have structural issues that harm performance, such as hundreds of thousands of empty paragraph nodes. This flag activates heuristics to pre-clean the page before it is analyzed by the library. In practice, the current implementation just eliminates empty paragraph nodes.
  Default: false
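As a rough illustration, several of these settings could be set with an object initializer before extracting the article. The URL is hypothetical and the values are arbitrary examples, not recommendations. The sketch assumes every setting is a public settable property, as the list above suggests.

```csharp
using System;
using SmartReader;

// Hypothetical URL; the values below are illustrative only.
var reader = new Reader("https://example.com/some-article")
{
    // Send detailed log messages to the console.
    Debug = true,
    Logging = ReportLevel.Info,
    LoggerDelegate = Console.WriteLine,

    // Accept shorter articles than the default of 500 characters.
    CharThreshold = 300,

    // Strip CSS classes, except the ones listed here.
    KeepClasses = false,
    ClassesToPreserve = new[] { "page", "caption" },

    // Remove empty paragraph nodes before the analysis.
    PreCleanPage = true
};

Article article = reader.GetArticle();
Console.WriteLine(article.IsReadable ? article.Title : "No article found.");
```

Because Debug is set to true here, ContinueIfNotReadable is ignored, as noted in its description.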
Settings Notes
The settings MinScoreReaderable, CharThreshold, and MinContentLengthReadearable are used in the process of determining if an article is readerable or if the result found is valid.
The scoring algorithm assigns a score to each valid node, then determines the best node based on its relationships (i.e., the score of the node's ancestors and descendants). The settings NTopCandidates, AncestorsDepth, and ParagraphThreshold allow you to customize this process. It is useful to change them if you are targeting sites that use a specific coding style or design.
The settings ParagraphThreshold, MinContentLengthReadearable, and CharThreshold should be customized for content written in non-alphabetical languages.
Article Model
A brief overview of the Article model returned by the library:
- Uri Uri
  Original URI
- String Title
  Title
- String Byline
  Byline of the article, usually containing author and publication date
- String Dir
  Direction of the text
- String FeaturedImage
  The main image of the article
- String Content
  HTML content of the article
- String TextContent
  The plain text of the article with basic formatting
- String Excerpt
  A summary of the article, based on metadata or first paragraph
- String Language
  Language string (e.g., 'en-US')
- Dictionary<string, Uri> AlternativeLanguageUris
  Contains URIs for pages in alternative languages, where the key is the language code (e.g., 'en-US': 'https://www.example.com/en')
- String Author
  Author of the article
- String SiteName
  Name of the site that hosts the article
- int Length
  Length of the text of the article
- TimeSpan TimeToRead
  Average time needed to read the article
- DateTime? PublicationDate
  Date of publication of the article
- bool IsReadable
  Indicates whether an article was successfully found
- bool Completed
  Indicates whether the process completed without an Exception (for instance, the HTTP request returned 403 Forbidden)
- List<Exception> Errors
  The list of errors generated during the process
It's important to be aware that the fields Byline, Author, and PublicationDate are found independently of each other. Consequently, there might be inconsistencies or unexpected data. For instance, Byline may be a string in the form "@Date by @Author", "@Author, @Date", or any other combination used by the publication.
The TimeToRead calculation is based on research found in Standardized Assessment of Reading Performance: The New International Reading Speed Texts IReST. It should be accurate if the article is written in one of the languages covered by the research, but it is an educated guess for other languages. If you can point to any scientific research for missing languages, please open an issue.
The FeaturedImage property holds the image indicated by the Open Graph or Twitter meta tags. If neither of these is present, and you called the GetImagesAsync method, it will be set to the first image found.
The TextContent property is based on the pure text content of the HTML (i.e., the concatenation of text nodes). We then apply basic formatting, such as removing double spaces or newlines left by HTML formatting. We also add meaningful newlines for P and BR nodes.
The IsReadable property will be false if no article was extracted, whatever the reason (e.g., the algorithm did not find anything valuable or the request failed). The Completed property only indicates whether the process completed without exceptions. Previously, we left it to the user of the library to handle exceptions, but now we try to handle them ourselves.
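One way to tell the three outcomes apart is to check Completed first and IsReadable second. The sketch below parses a hypothetical inline HTML string so that it does not need network access. The constructor argument order (URI first, then the text) is an assumption.

```csharp
using System;
using SmartReader;

// Hypothetical inline HTML; real pages are much longer.
string html = "<html><head><title>Sample</title></head>"
            + "<body><p>Short text.</p></body></html>";

// Assumption: the constructor takes the URI first and the text second.
Article article = new Reader("https://localhost", html).GetArticle();

if (!article.Completed)
{
    // The process was interrupted by one or more exceptions.
    foreach (Exception error in article.Errors)
        Console.WriteLine($"Error: {error.Message}");
}
else if (!article.IsReadable)
{
    // No exception occurred, but nothing valuable was found.
    Console.WriteLine("The process completed, but no article was found.");
}
else
{
    Console.WriteLine($"{article.Title} ({article.TimeToRead.TotalMinutes:F1} min read)");
}
```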