java.lang.Object
org.elasticsearch.xpack.core.ml.inference.preprocessing.customwordembedding.FeatureUtils

public final class FeatureUtils extends Object
A collection of messy feature extractors.
  • Method Details

    • truncateToNumValidBytes

      public static String truncateToNumValidBytes(String text, int maxLength)
      Truncates a string to the number of characters that fit in maxLength bytes, without cutting a multi-byte character in half at the cutoff point. Also handles surrogate pairs, where two chars in the string together represent a single character. Based on: https://stackoverflow.com/a/35148974/1818849
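      The truncation described above can be illustrated with a minimal sketch. It is not the implementation of FeatureUtils; it assumes UTF-8 as the byte encoding and a non-null input, and the names TruncateSketch, truncateToUtf8Bytes and utf8Length are hypothetical. It walks the string one code point at a time and stops before the first code point that would exceed the byte budget, so neither a multi-byte sequence nor a surrogate pair is ever split.

```java
// Hypothetical sketch, not the FeatureUtils implementation.
// Assumption: bytes are counted in UTF-8 and text is non-null.
final class TruncateSketch {

    // Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes.
    static String truncateToUtf8Bytes(String text, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        int usedBytes = 0;
        int index = 0; // index in chars (UTF-16 code units)
        while (index < text.length()) {
            // codePointAt combines a valid surrogate pair into one code point.
            int codePoint = text.codePointAt(index);
            int needed = utf8Length(codePoint);
            if (usedBytes + needed > maxBytes) {
                break; // the next character would be cut in half, so stop here
            }
            usedBytes += needed;
            // Advance by 2 chars for a surrogate pair, 1 otherwise.
            index += Character.charCount(codePoint);
        }
        return text.substring(0, index);
    }

    // Number of bytes UTF-8 uses for a code point.
    // A lone surrogate is counted as 3 bytes, a conservative over-estimate.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) {
            return 1;
        }
        if (codePoint < 0x800) {
            return 2;
        }
        if (codePoint < 0x10000) {
            return 3;
        }
        return 4;
    }
}
```

      In this sketch, a character that does not fit entirely within the budget is dropped rather than partially encoded, so the result may occupy fewer bytes than the limit allows.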
    • cleanAndLowerText

      public static String cleanAndLowerText(String text)
      Cleans up the text and lower-cases it. NOTE: This does not do any string compression by removing duplicate tokens.