Module org.elasticsearch.xcore
Class FeatureUtils
java.lang.Object
org.elasticsearch.xpack.core.ml.inference.preprocessing.customwordembedding.FeatureUtils
A collection of messy feature extractors
Method Summary
Modifier and Type | Method | Description
static String | cleanAndLowerText(String text) | Cleans up text and lower-cases it. NOTE: This does not do any string compression by removing duplicate tokens.
static String | truncateToNumValidBytes(String text, int maxLength) | Truncates a string to the number of characters that fit in the given number of bytes, without cutting a multi-byte character in half at the cut-off point.
Method Details
truncateToNumValidBytes
Truncates a string to the number of characters that fit in the given number of bytes (maxLength), without cutting a multi-byte character in half at the cut-off point. Also handles surrogate pairs, where two chars in the string together represent a single character. Based on: https://stackoverflow.com/a/35148974/1818849
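The following is a minimal sketch of the technique described above, not the Elasticsearch implementation. It assumes the byte limit is measured in UTF-8 and that the input is well-formed UTF-16. The class name TruncateSketch and the method truncateToBytes are hypothetical names introduced for illustration only.

```java
// Illustrative sketch only; names are hypothetical and the UTF-8 encoding is an assumption.
final class TruncateSketch {

    // Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes,
    // never splitting a multi-byte character or a surrogate pair.
    static String truncateToBytes(String text, int maxBytes) {
        if (text == null) {
            return null;
        }
        int usedBytes = 0;
        int index = 0; // position in UTF-16 chars
        while (index < text.length()) {
            // Reads a full code point, combining a surrogate pair if present.
            int codePoint = text.codePointAt(index);
            // UTF-8 width of this code point: 1 to 4 bytes.
            int width = codePoint < 0x80 ? 1
                : codePoint < 0x800 ? 2
                : codePoint < 0x10000 ? 3
                : 4;
            if (usedBytes + width > maxBytes) {
                break; // the next character would not fit whole, so stop before it
            }
            usedBytes += width;
            // Advance by 1 char, or by 2 chars for a surrogate pair.
            index += Character.charCount(codePoint);
        }
        return text.substring(0, index);
    }

    public static void main(String[] args) {
        System.out.println(truncateToBytes("héllo", 2)); // prints "h": the 2-byte "é" does not fit
        System.out.println(truncateToBytes("héllo", 3)); // prints "hé"
    }
}
```

Walking the string by code point rather than by char is what keeps a surrogate pair together: both halves are counted as one 4-byte character and are either kept or dropped as a unit.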
cleanAndLowerText
Cleans up text and lower-cases it. NOTE: This does not do any string compression by removing duplicate tokens.