Interface Segments

All Known Implementing Classes:
SegmentsImpl

public interface Segments
An interface that represents the segmentation results, including the APIs for iteration therein, that are yielded from passing an input CharSequence to a Segmenter.

The segmentation results can be provided either as the segmentation boundary indices (ints) or as segments, which are represented by the Segment class. In turn, a Segment object can also provide the subsequence of the original input that it represents.

Example:

Segmenter wordSeg =
    LocalizedSegmenter.builder()
        .setLocale(ULocale.forLanguageTag("de"))
        .setSegmentationType(SegmentationType.WORD)
        .build();

Segments segments = wordSeg.segment("Das 21ste Jahrh. ist das beste.");

List<CharSequence> words = segments.subSequences().collect(Collectors.toList());
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Interface
    Description
    static enum 
     
  • Method Summary

    Modifier and Type
    Method
    Description
    default IntStream
    boundaries()
    Returns all segmentation boundaries, starting from the beginning and moving forwards.
    IntStream
    boundariesAfter(int i)
    Returns all segmentation boundaries after the provided index.
    IntStream
    boundariesBackFrom(int i)
    Returns all segmentation boundaries on or before the provided index.
    boolean
    isBoundary(int i)
    Returns whether offset i is a segmentation boundary.
    Segment
    segmentAt(int i)
    Returns the segment that contains index i.
    default Stream<Segment>
    segments()
    Returns a Stream of all Segments in the source sequence.
    Stream<Segment>
    segmentsBefore(int i)
    Returns a Stream of all Segments in the source sequence where all segment limits l satisfy l ≤ i.
    Stream<Segment>
    segmentsFrom(int i)
    Returns a Stream of all Segments in the source sequence where all segment limits l satisfy i < l.
    default Stream<CharSequence>
    subSequences()
    Returns a Stream of the CharSequences for all of the segments in the source sequence.
  • Method Details

    • subSequences

      default Stream<CharSequence> subSequences()
      Returns a Stream of the CharSequences for all of the segments in the source sequence. Iteration starts from the beginning of the sequence and moves forwards until the end.
      Returns:
      a Stream of the CharSequences of all segments in the source sequence.
    • segmentAt

      Segment segmentAt(int i)
      Returns the segment that contains index i. Containment is inclusive of the start index and exclusive of the limit index.

      Specifically, the containing segment is defined as the segment with start s and limit l such that s ≤ i < l.

      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      A segment that either starts at or contains index i
      Throws:
      IndexOutOfBoundsException - if i is less than 0 or greater than or equal to the length of the input CharSequence to the Segmenter
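
      The containment rule s ≤ i < l can be illustrated with a minimal sketch. It does not use the Segmenter API itself. As an assumption made for the sake of a self-contained example, it uses the JDK's java.text.BreakIterator as a stand-in word segmenter, and the class and method names (SegmentAtSketch, segmentAt) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration only: mimics the documented containment rule
// with the JDK BreakIterator, not with the Segmenter implementation.
public class SegmentAtSketch {

    /** Returns {start, limit} of the segment satisfying start <= i < limit. */
    static int[] segmentAt(String text, int i) {
        if (i < 0 || i >= text.length()) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int limit = it.following(i); // first boundary strictly after i
        int start = it.previous();   // boundary immediately before that limit, so start <= i
        return new int[] {start, limit};
    }

    public static void main(String[] args) {
        int[] seg = segmentAt("Das ist gut", 5);
        System.out.println(seg[0] + ".." + seg[1]);
    }
}
```

      Note that an index equal to a boundary belongs to the segment that starts there, not to the segment that ends there, because the limit is exclusive.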
    • segments

      default Stream<Segment> segments()
      Returns a Stream of all Segments in the source sequence. Iteration starts with the first segment and moves forwards until the end of the sequence.

      This is equivalent to segmentsFrom(0).

      Returns:
      a Stream of all Segments in the source sequence.
    • segmentsFrom

      Stream<Segment> segmentsFrom(int i)
      Returns a Stream of all Segments in the source sequence where all segment limits l satisfy i < l. Iteration moves forwards.

      This means that the first segment in the stream is the same as what is returned by segmentAt(i).

      The word "from" is used here to mean "at or after", with the semantics of "at" for a Segment defined by segmentAt(int). We cannot describe all of the segments as being "after" i, since the first segment might contain i in its middle, in which case its start position precedes i.

      segmentsFrom and segmentsBefore(int) create a partitioning of the space of all Segments.

      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      a Stream of all Segments at or after i
    • segmentsBefore

      Stream<Segment> segmentsBefore(int i)
      Returns a Stream of all Segments in the source sequence where all segment limits l satisfy l ≤ i. Iteration moves backwards.

      This means that all segments in the stream come before the one that is returned by segmentAt(i). A segment is not considered to contain index i if i is equal to its limit l. Thus, "before" encapsulates the invariant l ≤ i.

      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      a Stream of all Segments before i
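
      The partitioning performed by segmentsFrom and segmentsBefore can be sketched as follows. This is an illustration under assumptions, not the Segmenter implementation: it relies on the JDK's java.text.BreakIterator as a stand-in word segmenter, returns plain substrings instead of Segment objects, and uses hypothetical names (SegmentPartitionSketch and its two methods).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: shows the limit-based partition
// (i < l versus l <= i) using the JDK BreakIterator.
public class SegmentPartitionSketch {

    /** Segments whose limit l satisfies i < l, in forward order. */
    static List<String> segmentsFrom(String text, int i) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int limit = it.next(); limit != BreakIterator.DONE;
                start = limit, limit = it.next()) {
            if (i < limit) {
                out.add(text.substring(start, limit));
            }
        }
        return out;
    }

    /** Segments whose limit l satisfies l <= i, in backward order. */
    static List<String> segmentsBefore(String text, int i) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int limit = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE;
                limit = start, start = it.previous()) {
            if (limit <= i) {
                out.add(text.substring(start, limit));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Das ist gut";
        System.out.println(segmentsFrom(text, 5));   // starts with the segment containing index 5
        System.out.println(segmentsBefore(text, 5)); // nearest earlier segment first
    }
}
```

      Every segment appears in exactly one of the two results, which is the sense in which the two methods partition the space of all Segments.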
    • isBoundary

      boolean isBoundary(int i)
      Returns whether offset i is a segmentation boundary. Throws an exception when i is not a valid index position for the source sequence.
      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      true if offset i is a segmentation boundary, false otherwise
      Throws:
      IllegalArgumentException - if i is less than 0 or greater than the length of the input CharSequence to the Segmenter
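
      The accepted range for isBoundary differs from that of segmentAt: an offset equal to the input length is valid here, since the end of the input is itself a boundary. A minimal sketch of that range check follows. It assumes the JDK's java.text.BreakIterator as a stand-in word segmenter, and IsBoundarySketch is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration only: the range check mirrors the documented
// contract (0 <= i <= length); the boundary test uses the JDK BreakIterator.
public class IsBoundarySketch {

    static boolean isBoundary(String text, int i) {
        if (i < 0 || i > text.length()) {
            throw new IllegalArgumentException("offset " + i + " is out of range");
        }
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        return it.isBoundary(i);
    }

    public static void main(String[] args) {
        String text = "Das ist gut";
        System.out.println(isBoundary(text, 3));
        System.out.println(isBoundary(text, 1));
        System.out.println(isBoundary(text, text.length()));
    }
}
```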
    • boundaries

      default IntStream boundaries()
      Returns all segmentation boundaries, starting from the beginning and moving forwards.

      Note: boundaries() != boundariesAfter(0). This difference naturally results from the strict inequality condition in boundariesAfter, and the fact that 0 is the first boundary returned from the start of an input sequence.

      Returns:
      An IntStream of all segmentation boundaries, starting at the first boundary with index 0, and moving forwards in the input sequence.
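
      The difference between boundaries() and boundariesAfter(0) can be seen in a small sketch. As before, this is an assumption-laden illustration rather than the Segmenter implementation: it uses the JDK's java.text.BreakIterator as a stand-in word segmenter, and BoundariesSketch and its methods are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.IntStream;

// Hypothetical illustration only: shows why the strict inequality b > i
// drops the leading boundary 0 from boundariesAfter(0).
public class BoundariesSketch {

    /** All boundaries in forward order, starting with 0. */
    static IntStream boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        IntStream.Builder builder = IntStream.builder();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            builder.add(pos);
        }
        return builder.build();
    }

    /** All boundaries b with b > i, in forward order. */
    static IntStream boundariesAfter(String text, int i) {
        return boundaries(text).filter(b -> b > i);
    }

    public static void main(String[] args) {
        String text = "Das ist gut";
        System.out.println(Arrays.toString(boundaries(text).toArray()));
        System.out.println(Arrays.toString(boundariesAfter(text, 0).toArray()));
    }
}
```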
    • boundariesAfter

      IntStream boundariesAfter(int i)
      Returns all segmentation boundaries after the provided index. Iteration moves forwards.
      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      An IntStream of all boundaries b such that b > i
    • boundariesBackFrom

      IntStream boundariesBackFrom(int i)
      Returns all segmentation boundaries on or before the provided index. Iteration moves backwards.

      The phrase "back from" indicates both that 1) the boundaries are "on or before" the input index, and 2) the direction of iteration is backwards (towards the beginning). "On or before" means that the result set is all b where b ≤ i, which is a weak inequality, whereas "before" alone might suggest the strict inequality b < i.

      boundariesBackFrom and boundariesAfter(int) create a partitioning of the space of all boundaries.

      Parameters:
      i - index in the input CharSequence to the Segmenter
      Returns:
      An IntStream of all boundaries b such that b ≤ i
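
      The weak inequality b ≤ i, the backward iteration order, and the partition with boundariesAfter can be sketched together. This sketch assumes the JDK's java.text.BreakIterator as a stand-in word segmenter and is not the Segmenter implementation. BackFromSketch and its methods are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.IntStream;

// Hypothetical illustration only: boundariesBackFrom keeps b <= i and walks
// backwards; boundariesAfter keeps b > i and walks forwards.
public class BackFromSketch {

    private static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        return it;
    }

    /** All boundaries b with b <= i, from the nearest one back towards 0. */
    static IntStream boundariesBackFrom(String text, int i) {
        BreakIterator it = wordIterator(text);
        IntStream.Builder builder = IntStream.builder();
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            if (pos <= i) {
                builder.add(pos);
            }
        }
        return builder.build();
    }

    /** All boundaries b with b > i, in forward order. */
    static IntStream boundariesAfter(String text, int i) {
        BreakIterator it = wordIterator(text);
        IntStream.Builder builder = IntStream.builder();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            if (pos > i) {
                builder.add(pos);
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        String text = "Das ist gut";
        System.out.println(Arrays.toString(boundariesBackFrom(text, 7).toArray()));
        System.out.println(Arrays.toString(boundariesAfter(text, 7).toArray()));
    }
}
```

      When i is itself a boundary, it appears in the boundariesBackFrom result and not in the boundariesAfter result, so no boundary is counted twice or omitted.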