How to split by character

This is the simplest method. It splits text on a given character sequence, which defaults to "\n\n", and measures chunk length by number of characters.

How the text is split: by a single separator, which may be more than one character.
How the chunk size is measured: by number of characters.
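
The two points above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the function name splitByCharacter and its behavior for oversized pieces are assumptions made for illustration. It splits on the separator, then greedily merges adjacent pieces while the merged chunk stays within chunkSize characters.

```typescript
// Simplified sketch, NOT the implementation in @langchain/textsplitters.
// `splitByCharacter` is a hypothetical helper written for this example.
function splitByCharacter(
  text: string,
  separator: string = "\n\n",
  chunkSize: number = 1000
): string[] {
  // Step 1: split on the separator and drop empty pieces.
  const pieces = text.split(separator).filter((p) => p.length > 0);

  // Step 2: greedily merge pieces, measuring length in characters.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // The piece fits, or it starts a new chunk. In this sketch a single
      // piece longer than chunkSize is kept whole rather than cut.
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "aaaa\n\nbbbb\n\ncccc";
const sampleChunks = splitByCharacter(sample, "\n\n", 10);
console.log(sampleChunks); // [ "aaaa\n\nbbbb", "cccc" ]
```

With chunkSize set to 10, the first two pieces (4 + 2 + 4 characters, counting the separator) fit in one chunk, and the third starts a new one.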

To obtain the string content directly, use .splitText.

To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments.

import { CharacterTextSplitter } from "@langchain/textsplitters";

// Load an example document
const stateOfTheUnion = await Deno.readTextFile(
  "../../../../examples/state_of_the_union.txt"
);

const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}
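
The example above sets chunkOverlap to 200, which lets consecutive chunks share some trailing text. The sketch below shows one way such an overlap can be produced at the level of whole pieces. It is an illustration under stated assumptions, not the library's code: mergeWithOverlap and its exact carry-over rule are hypothetical.

```typescript
// Simplified sketch, NOT the implementation in @langchain/textsplitters.
// `mergeWithOverlap` is a hypothetical helper written for this example.
function mergeWithOverlap(
  pieces: string[],
  separator: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  // Length in characters of the pieces once joined by the separator.
  const lengthOf = (parts: string[]): number => parts.join(separator).length;

  const chunks: string[] = [];
  let carried: string[] = [];
  for (const piece of pieces) {
    if (carried.length > 0 && lengthOf([...carried, piece]) > chunkSize) {
      // The current chunk is full: emit it.
      chunks.push(carried.join(separator));
      // Keep only a tail of at most chunkOverlap characters that still
      // leaves room for the incoming piece.
      while (
        carried.length > 0 &&
        (lengthOf(carried) > chunkOverlap ||
          lengthOf([...carried, piece]) > chunkSize)
      ) {
        carried.shift();
      }
    }
    carried.push(piece);
  }
  if (carried.length > 0) chunks.push(carried.join(separator));
  return chunks;
}

const overlapChunks = mergeWithOverlap(["aaaa", "bbbb", "cccc", "dddd"], " ", 9, 4);
console.log(overlapChunks); // [ "aaaa bbbb", "bbbb cccc", "cccc dddd" ]
```

Each chunk repeats the last piece of the previous one, because a 4-character tail fits within the chunkOverlap of 4. In this sketch the overlap is always made of whole pieces, so it can be shorter than chunkOverlap or absent when pieces are long.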

Use .createDocuments to propagate metadata associated with each document to the output chunks:

const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
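
The propagation shown above amounts to copying the i-th metadata object onto every chunk produced from the i-th input text. A minimal sketch of that pairing follows. The SketchDocument type and the attachMetadata helper are hypothetical stand-ins, not part of LangChain, and the sketch omits the loc line information that appears in the real output.

```typescript
// Hypothetical stand-in for a LangChain Document, used only in this sketch.
interface SketchDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Copy the i-th metadata object onto every chunk produced from the i-th text.
function attachMetadata(
  chunksPerText: string[][],
  metadatas: Record<string, unknown>[]
): SketchDocument[] {
  const docs: SketchDocument[] = [];
  chunksPerText.forEach((chunks, i) => {
    for (const chunk of chunks) {
      // Shallow-copy so that chunks do not share one mutable metadata object.
      docs.push({ pageContent: chunk, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return docs;
}

const sketchDocs = attachMetadata(
  [["first chunk", "second chunk"], ["third chunk"]],
  [{ document: 1 }, { document: 2 }]
);
console.log(sketchDocs.map((d) => d.metadata.document)); // [ 1, 1, 2 ]
```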

Use .splitText to obtain the string content directly:

(await textSplitter.splitText(stateOfTheUnion))[0];

"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters
