Advanced Usage

This page contains information to use optional features of the library.

Customizing Regular Expressions

You can customize the regular expressions that are used to determine whether a part of the document will be inside the article. There are two methods to do this:

void AddOptionToRegularExpression(RegularExpressions expression, string option)
Add an option (i.e., usually a CSS class name) to the regular expression.
void ReplaceRegularExpression(RegularExpressions expression, string newExpression)
Replace the selected regular expression.

The type RegularExpression is an enum that can have one of the following values, corresponding to a regular expression:

UnlikelyCandidates
PossibleCandidates
Positive (increases chances to keep the element)
Negative (decreases chances to keep the element)
Byline
Videos
ShareElements
Extraneous (note: this regular expression is not used anywhere at the moment. We keep it for alignment with the original library)

Except for the Videos regular expression, they all represent values of attributes, classes, etc. of tags. You should look at the code to understand how each of the regular expression is used. For instance, if you add the string "niceContent" to RegularExpression.Positive, a div containing niceContent in its class attribute, or id, will be more likely to be included in the final output.

The Videos regular expression represents a domain of origin of an embedded video. Since this is a string representing a regular expression, you have to remember to escape any dot present. This option is used to determine if an embed should be maintained in the article, because people generally want to see videos. If an embed matches one of the domains of this option is maintained, otherwise it is not.

// how to add the domain example.com
AddOptionToRegularExpression(RegularExpressions.Videos, "example\.com");

Adding Custom Operations

The library allows the user to add custom operations. I.e., to perform arbitrary modifications to the article before is processed or it is returned to the user. A custom operation receives as argument the article (an IElement). For custom operations at the beginning, the element is the entire document; for custom operations executed after the processing is complete, the element is the article extracted.

// example of custom operation
void AddInfo(AngleSharp.Dom.IElement element)
{       
    // we add a paragraph to the first div we find
	element.QuerySelector("div").LastElementChild.InnerHtml += "<p>Article parsed by SmartReader</p>";
}

static void RemoveElement(AngleSharp.Dom.IElement element)
{
    // we remove the first element with class removeable
    element.QuerySelector(".removeable")?.Remove();
}

[..]
Reader reader = // ..

// add a custom operation at the start
reader.AddCustomOperationStart(RemoveElement);

// add a custom operation at the end
reader.AddCustomOperationEnd(AddInfo);

As you can see, the custom operation works on an IElement and it would normally rely on the AngleSharp API. AngleSharp is the library that SmartReader uses to parse and manipulate HTML. The API of the library follows the standard structure that you can also see in JavaScript, so it feels intuitive and natural. If you need any help to work with it, consult their documentation.

Preserving CSS Classes

Normally the library strips all classes of the elements except for page. This is done because classes are used to govern the display of the article, but they are irrelevant to the content itself. However, there is an option to preserve other classes. This is mostly useful if you want to perform custom operations on certain elements and you need CSS classes to identify them.

You can preserve some classes using the property ClassesToPreserve which is an array of class names that will be preserved. Note that this has no effect if an element that contains the class is eliminated from the extracted article. This means that the option does not maintain the element itself, it just maintains the class, if the element is kept in the extracted article.

Reader reader = // ..

// preserve the class info
reader.ClassesToPreserve = new string[] { "info" };

The class page is always kept, no matter the value you assign to this option.

Setting Custom User Agent and Custom HttpMessageHandler

By default all web requests made by the library use the User Agent SmartReader Library. This can be changed by using the function SetCustomUserAgent(string).

Reader.SetCustomUserAgent("SuperAwesome App - for any issue contact admin@example.com");

This function will change the user agent for all subsequent web requests with any object of the class.

If you need to use a custom HttpMessageHandler, you can replace the default one, with the function SetBaseHttpClientHandler(HttpMessageHandler).

HttpMessageHandler majesticHandler = GetMyTailorMadeHttpMessageHandler();
// ..
Reader.`SetBaseHttpClientHandler(majesticHandler);

Manipulating Content Returned by Article

By default the HTML content returned in the Article object it is just the property InnerHTML of the extracted HTML. You can change that by changing the static property Serializer of the Article class. This property is a Func<IElement, string> method, that is is a method that accepts as an argument an IElement and returns a string.

Since you have access to the AngleSharp library, this can be quite useful. For instance, in case you want to ensure that the HTML is well-formed or to transform the content in a way that is useful to your application. This is also the ideal place where to make changes to the content, such as formatting text or removing content.

What follows is a terrible example.

// example of an alternative serializer that remove spaces outside and inside (unless they are pre) tags
string RemoveSpace(AngleSharp.Dom.IElement element)
{
	return Regex.Replace(
		Regex.Replace(element.InnerHtml, @"(?<endBefore></.*?>)\s+(?<startAfter><[^/]>)", "${endBefore}${startAfter}"),
		@"(?<endBefore><((?!pre).)*?>)\s+",
		"${endBefore}").Trim();
}

[..]

Article.Serializer = RemoveSpace;

You can also replace the standard function used to convert HTML content in plain text, that is the function called to calculate the TextContent property of an Article object. You can do that by changing the property Converter which is also of type Func<IElement, string>, that is a method that accepts as an argument an IElement and returns a string.

Since you have access to the AngleSharp library, this can be quite useful. For example, the standard converter function convert <p> and <br> tags in corresponding line breaks in text.

// example of an alternative converter
string MagicConverter(AngleSharp.Dom.IElement element)
{
	[..]
}

[..]

Article.Converter = MagicConverter;

Natural Language Processing

Often the webpage containing the article will contain metadata about the language used in the article. Sometimes this is not true, so the library comes with a delegate LanguageIdentification in Article, that can be used to identify the language based on the text itself. This delegate will be called on the TextContent of the Article. The default method just returns the language in the metadata. This is the old, standard behavior.

You can use the delegate to implement your own method to identify the language. We also provide a decent implementation, that actually does something, using FastText. This implementation is distributed in a separate nuget package, SmartReader.NaturalLanguageProcessing. To use the language identification feature you will need to call the Enable method.

NLP.Enable();

This will change the delegate LanguageIdentification, so it will automatically identify the language and set the property Language in every Article object created by the library.

To restore the default behavior, instead use RestoreDefaults.

NLP.RestoreDefaults();

The delegate accepts two arguments:

the first one will receive the text of the article
the second one will receive the language indicates in the metadata, if any. You can use this argument as hint or as a default value in case your method cannot identify the language

You can use the delegate to implement your own method to identify the language. We also provide a decent implementation, that actually does something, using FastText.

There is also a delegate to create a summary of the article : CreateSummary. Also in this case the default implementation returns the summary provided by the metadata. In practice this is usually a short summary meant for social media sharing. At this moment we do not provide any better implementation.