How to split by character
This is the simplest method. It splits text on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.
- How the text is split: by a single character separator.
- How the chunk size is measured: by number of characters.
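The two points above can be sketched without the library. The following is a simplified illustration only, not the implementation used by CharacterTextSplitter: the function name splitByCharacter is hypothetical, chunk overlap is ignored, and a piece longer than the chunk size is kept whole rather than split further.

```typescript
// Hypothetical helper for illustration; not part of @langchain/textsplitters.
// Splits on one separator, then greedily merges adjacent pieces while the
// merged chunk stays within chunkSize characters. Overlap is not modeled.
function splitByCharacter(
  text: string,
  separator: string = "\n\n",
  chunkSize: number = 1000
): string[] {
  const pieces = text.split(separator).filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // Either the chunk is empty (always accept the first piece, even if it
      // is oversized) or the merged text still fits within chunkSize.
      current = candidate;
    } else {
      // Adding this piece would exceed chunkSize: close the current chunk.
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") {
    chunks.push(current);
  }
  return chunks;
}

// Three 4-character paragraphs with a 10-character limit: the first two fit
// together (4 + 2 + 4 = 10 characters), the third starts a new chunk.
const sampleChunks = splitByCharacter("aaaa\n\nbbbb\n\ncccc", "\n\n", 10);
console.log(sampleChunks.length); // 2
```

Size is counted with the string's length, so the separator characters retained inside a merged chunk count toward the limit in this sketch.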
To obtain the string content directly, use .splitText.
To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments.
import { CharacterTextSplitter } from "@langchain/textsplitters";

// Load an example document
const stateOfTheUnion = await Deno.readTextFile(
  "../../../../examples/state_of_the_union.txt"
);

// Split on blank lines into chunks of at most 1000 characters,
// with 200 characters of overlap between chunks
const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}
Use .createDocuments to propagate metadata associated with each document to the output chunks:
// One metadata object per input text, in the same order
const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
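As a rough mental model of metadata propagation, each chunk receives a copy of the metadata belonging to the input text it came from. The sketch below is an assumption-laden illustration, not the library's code: the Chunk interface and attachMetadata function are hypothetical names, the split is a plain separator split with no size limit, and the loc line information shown in the output above is not reproduced.

```typescript
// Hypothetical types and helper for illustration; not the library's API.
interface Chunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Pairs texts[i] with metadatas[i] and copies that metadata onto every
// chunk produced from texts[i].
function attachMetadata(
  texts: string[],
  metadatas: Record<string, unknown>[],
  separator: string = "\n\n"
): Chunk[] {
  const chunks: Chunk[] = [];
  texts.forEach((text, i) => {
    for (const piece of text.split(separator)) {
      if (piece.length === 0) continue; // skip empty pieces
      // Shallow copy so that chunks do not share one metadata object.
      chunks.push({ pageContent: piece, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return chunks;
}

const exampleChunks = attachMetadata(
  ["first\n\nsecond", "third"],
  [{ document: 1 }, { document: 2 }]
);
console.log(exampleChunks.map((c) => c.metadata.document)); // 1, 1, 2
```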
Use .splitText to obtain the string content directly:
(await textSplitter.splitText(stateOfTheUnion))[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters