Anomaly detection with embeddings

Overview

This tutorial demonstrates how to use embeddings from the Gemini API to detect potential outliers in your dataset. You will visualize a subset of the 20 Newsgroups dataset using t-SNE and flag as outliers the points that lie outside a chosen radius of the central point of each category's cluster.
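
The radius idea can be sketched on toy 2D points before any embeddings are involved. This is a minimal illustration, not the tutorial's final implementation: the helper names (centroid, euclidean, findOutliers), the sample points, and the radius of 5 are all assumptions chosen for the example.

```typescript
// Hypothetical helpers for illustration; not part of the Gemini SDK.
type Point = number[];

// Mean of a set of equal-length vectors (the "central point" of a cluster).
function centroid(points: Point[]): Point {
  const dim = points[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const p of points) {
    for (let d = 0; d < dim; d++) sum[d] += p[d];
  }
  return sum.map((s) => s / points.length);
}

// Straight-line (Euclidean) distance between two vectors.
function euclidean(a: Point, b: Point): number {
  return Math.sqrt(a.reduce((acc, v, i) => acc + (v - b[i]) ** 2, 0));
}

// Indices of the points that lie farther than `radius` from the centroid.
function findOutliers(points: Point[], radius: number): number[] {
  const center = centroid(points);
  return points
    .map((p, i) => ({ i, dist: euclidean(p, center) }))
    .filter(({ dist }) => dist > radius)
    .map(({ i }) => i);
}

// Four points near the origin and one far away (made-up data).
const cluster: Point[] = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]];
console.log(findOutliers(cluster, 5)); // only index 4 is farther than 5 from the centroid
```

Note that the far point pulls the centroid toward itself, so the choice of radius matters: too small a radius would also flag ordinary points.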

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Set up your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case the .env file is one directory up from the notebook, hence we need to use ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.

│
├── .env
└── examples
    └── Anomaly_detection_with_embeddings.ipynb

Initialize SDK Client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is now set in each call.

const tslab = require("tslab") as typeof import("tslab");
const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });

Prepare dataset

The 20 Newsgroups Text Dataset from the open-source scikit-learn project contains about 18,000 newsgroup posts on 20 topics divided into training and test sets. The split between the training and test datasets is based on messages posted before and after a specific date. This tutorial uses the training subset.

const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");
const danfo = require("danfojs-node") as typeof import("danfojs-node");

// URL of the scikit-learn 20 Newsgroups dataset
const DATA_URL = "https://ndownloader.figshare.com/files/5975967";
const EXTRACT_PATH = "../assets/anomaly_detection";

async function downloadAndExtractDataset(): Promise<void> {
  if (fs.existsSync(EXTRACT_PATH)) {
    console.log("Dataset already exists. Skipping download.");
    return;
  }

  console.log("Downloading 20 Newsgroups dataset...");
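  const response = await fetch(DATA_URL);
  // Fail early instead of writing an error page to disk as if it were the archive
  if (!response.ok) {
    throw new Error(`Download failed with status ${response.status}`);
  }
  const buffer = await response.arrayBuffer();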

  console.log("Extracting dataset...");
  await fs.promises.mkdir(EXTRACT_PATH, { recursive: true });

  const zipPath = path.join(EXTRACT_PATH, "20news-bydate.tar.gz");
  fs.writeFileSync(zipPath, Buffer.from(buffer));

  const tar = require("tar") as typeof import("tar");
  await tar.x({
    file: zipPath,
    cwd: EXTRACT_PATH,
  });

  console.log("Dataset extracted.");
}

function loadTextFilesFromDir(dirPath: string): {
  data: string[];
  target: string[];
} {
  const categories = fs.readdirSync(dirPath);
  const data: string[] = [];
  const target: string[] = [];

  for (const category of categories) {
    const categoryPath = path.join(dirPath, category);
    if (fs.lstatSync(categoryPath).isDirectory()) {
      const files = fs.readdirSync(categoryPath);
      for (const file of files) {
        const filePath = path.join(categoryPath, file);
        const content = fs.readFileSync(filePath, "utf-8");
        data.push(content);
        target.push(category);
      }
    }
  }

  return { data, target };
}

await downloadAndExtractDataset();

const trainDir = path.join(EXTRACT_PATH, "20news-bydate-train");
const { data, target } = loadTextFilesFromDir(trainDir);

const df = new danfo.DataFrame({ data, target });

console.log("Sample Data:");
df.head().print();
Dataset already exists. Skipping download.
Sample Data:
╔════════════╤═══════════════════╤═══════════════════╗
║            │ data              │ target            ║
╟────────────┼───────────────────┼───────────────────╢
║ 0          │ From: mathew <m…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 1          │ From: mathew <m…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 2          │ From: I3150101@…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 3          │ From: mathew <m…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 4          │ From: strom@Wat…  │ alt.atheism       ║
╚════════════╧═══════════════════╧═══════════════════╝
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call */
const classNames = df.target.unique().values as string[];
console.log("Class names:", classNames);
Class names: [
  'alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'
]

Here is the first example in the training set.

/* eslint-disable @typescript-eslint/no-unsafe-member-access */
const firstDoc = df.loc({ rows: [0], columns: ["data"] });
const firstText = firstDoc.data.values[0] as string;

const idx = firstText.indexOf("Lines");

if (idx !== -1) {
  console.log(firstText.slice(idx));
} else {
  console.log('"Lines" not found in the first document.');
}
Lines: 290

Archive-name: atheism/resources
Alt-atheism-archive-name: resources
Last-modified: 11 December 1992
Version: 1.0

                              Atheist Resources

                      Addresses of Atheist Organizations

                                     USA

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

Write to:  FFRF, P.O. Box 750, Madison, WI 53701.
Telephone: (608) 256-8900

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

Write to:  Evolution Designs, 7119 Laurel Canyon #4, North Hollywood,
           CA 91605.

People in the San Francisco Bay area can get Darwin Fish from Lynn Gold --
try mailing <figmo@netcom.com>.  For net people who go to Lynn directly, the
price is $4.95 per fish.

AMERICAN ATHEIST PRESS

AAP publish various atheist books -- critiques of the Bible, lists of
Biblical contradictions, and so on.  One such book is:

"The Bible Handbook" by W.P. Ball and G.W. Foote.  American Atheist Press.
372 pp.  ISBN 0-910309-26-4, 2nd edition, 1986.  Bible contradictions,
absurdities, atrocities, immoralities... contains Ball, Foote: "The Bible
Contradicts Itself", AAP.  Based on the King James version of the Bible.

Write to:  American Atheist Press, P.O. Box 140195, Austin, TX 78714-0195.
      or:  7215 Cameron Road, Austin, TX 78752-2973.
Telephone: (512) 458-1244
Fax:       (512) 467-9525

PROMETHEUS BOOKS

Sell books including Haught's "Holy Horrors" (see below).

Write to:  700 East Amherst Street, Buffalo, New York 14215.
Telephone: (716) 837-2475.

An alternate address (which may be newer or older) is:
Prometheus Books, 59 Glenn Drive, Buffalo, NY 14228-2197.

AFRICAN-AMERICANS FOR HUMANISM

An organization promoting black secular humanism and uncovering the history of
black freethought.  They publish a quarterly newsletter, AAH EXAMINER.

Write to:  Norm R. Allen, Jr., African Americans for Humanism, P.O. Box 664,
           Buffalo, NY 14226.

                                United Kingdom

Rationalist Press Association          National Secular Society
88 Islington High Street               702 Holloway Road
London N1 8EW                          London N19 3NL
071 226 7251                           071 272 1266

British Humanist Association           South Place Ethical Society
14 Lamb's Conduit Passage              Conway Hall
London WC1R 4RH                        Red Lion Square
071 430 0908                           London WC1R 4RL
fax 071 430 1271                       071 831 7723

The National Secular Society publish "The Freethinker", a monthly magazine
founded in 1881.

                                   Germany

IBKA e.V.
Internationaler Bund der Konfessionslosen und Atheisten
Postfach 880, D-1000 Berlin 41. Germany.

IBKA publish a journal:
MIZ. (Materialien und Informationen zur Zeit. Politisches
Journal der Konfessionslosesn und Atheisten. Hrsg. IBKA e.V.)
MIZ-Vertrieb, Postfach 880, D-1000 Berlin 41. Germany.

For atheist books, write to:

IBDK, Internationaler B"ucherdienst der Konfessionslosen
Postfach 3005, D-3000 Hannover 1. Germany.
Telephone: 0511/211216


                               Books -- Fiction

THOMAS M. DISCH

"The Santa Claus Compromise"
Short story.  The ultimate proof that Santa exists.  All characters and 
events are fictitious.  Any similarity to living or dead gods -- uh, well...

WALTER M. MILLER, JR

"A Canticle for Leibowitz"
One gem in this post atomic doomsday novel is the monks who spent their lives
copying blueprints from "Saint Leibowitz", filling the sheets of paper with
ink and leaving white lines and letters.

EDGAR PANGBORN

"Davy"
Post atomic doomsday novel set in clerical states.  The church, for example,
forbids that anyone "produce, describe or use any substance containing...
atoms". 

PHILIP K. DICK

Philip K. Dick Dick wrote many philosophical and thought-provoking short 
stories and novels.  His stories are bizarre at times, but very approachable.
He wrote mainly SF, but he wrote about people, truth and religion rather than
technology.  Although he often believed that he had met some sort of God, he
remained sceptical.  Amongst his novels, the following are of some relevance:

"Galactic Pot-Healer"
A fallible alien deity summons a group of Earth craftsmen and women to a
remote planet to raise a giant cathedral from beneath the oceans.  When the
deity begins to demand faith from the earthers, pot-healer Joe Fernwright is
unable to comply.  A polished, ironic and amusing novel.

"A Maze of Death"
Noteworthy for its description of a technology-based religion.

"VALIS"
The schizophrenic hero searches for the hidden mysteries of Gnostic
Christianity after reality is fired into his brain by a pink laser beam of
unknown but possibly divine origin.  He is accompanied by his dogmatic and
dismissively atheist friend and assorted other odd characters.

"The Divine Invasion"
God invades Earth by making a young woman pregnant as she returns from
another star system.  Unfortunately she is terminally ill, and must be
assisted by a dead man whose brain is wired to 24-hour easy listening music.

MARGARET ATWOOD

"The Handmaid's Tale"
A story based on the premise that the US Congress is mysteriously
assassinated, and fundamentalists quickly take charge of the nation to set it
"right" again.  The book is the diary of a woman's life as she tries to live
under the new Christian theocracy.  Women's right to own property is revoked,
and their bank accounts are closed; sinful luxuries are outlawed, and the
radio is only used for readings from the Bible.  Crimes are punished
retroactively: doctors who performed legal abortions in the "old world" are
hunted down and hanged.  Atwood's writing style is difficult to get used to
at first, but the tale grows more and more chilling as it goes on.

VARIOUS AUTHORS

"The Bible"
This somewhat dull and rambling work has often been criticized.  However, it
is probably worth reading, if only so that you'll know what all the fuss is
about.  It exists in many different versions, so make sure you get the one
true version.

                             Books -- Non-fiction

PETER DE ROSA

"Vicars of Christ", Bantam Press, 1988
Although de Rosa seems to be Christian or even Catholic this is a very
enlighting history of papal immoralities, adulteries, fallacies etc.
(German translation: "Gottes erste Diener. Die dunkle Seite des Papsttums",
Droemer-Knaur, 1989)

MICHAEL MARTIN

"Atheism: A Philosophical Justification", Temple University Press,
 Philadelphia, USA.
A detailed and scholarly justification of atheism.  Contains an outstanding
appendix defining terminology and usage in this (necessarily) tendentious
area.  Argues both for "negative atheism" (i.e. the "non-belief in the
existence of god(s)") and also for "positive atheism" ("the belief in the
non-existence of god(s)").  Includes great refutations of the most
challenging arguments for god; particular attention is paid to refuting
contempory theists such as Platinga and Swinburne.
541 pages. ISBN 0-87722-642-3 (hardcover; paperback also available)

"The Case Against Christianity", Temple University Press
A comprehensive critique of Christianity, in which he considers
the best contemporary defences of Christianity and (ultimately)
demonstrates that they are unsupportable and/or incoherent.
273 pages. ISBN 0-87722-767-5

JAMES TURNER

"Without God, Without Creed", The Johns Hopkins University Press, Baltimore,
 MD, USA
Subtitled "The Origins of Unbelief in America".  Examines the way in which
unbelief (whether agnostic or atheistic)  became a mainstream alternative
world-view.  Focusses on the period 1770-1900, and while considering France
and Britain the emphasis is on American, and particularly New England
developments.  "Neither a religious history of secularization or atheism,
Without God, Without Creed is, rather, the intellectual history of the fate
of a single idea, the belief that God exists." 
316 pages. ISBN (hardcover) 0-8018-2494-X (paper) 0-8018-3407-4

GEORGE SELDES (Editor)

"The great thoughts", Ballantine Books, New York, USA
A "dictionary of quotations" of a different kind, concentrating on statements
and writings which, explicitly or implicitly, present the person's philosophy
and world-view.  Includes obscure (and often suppressed) opinions from many
people.  For some popular observations, traces the way in which various
people expressed and twisted the idea over the centuries.  Quite a number of
the quotations are derived from Cardiff's "What Great Men Think of Religion"
and Noyes' "Views of Religion".
490 pages. ISBN (paper) 0-345-29887-X.

RICHARD SWINBURNE

"The Existence of God (Revised Edition)", Clarendon Paperbacks, Oxford
This book is the second volume in a trilogy that began with "The Coherence of
Theism" (1977) and was concluded with "Faith and Reason" (1981).  In this
work, Swinburne attempts to construct a series of inductive arguments for the
existence of God.  His arguments, which are somewhat tendentious and rely
upon the imputation of late 20th century western Christian values and
aesthetics to a God which is supposedly as simple as can be conceived, were
decisively rejected in Mackie's "The Miracle of Theism".  In the revised
edition of "The Existence of God", Swinburne includes an Appendix in which he
makes a somewhat incoherent attempt to rebut Mackie.

J. L. MACKIE

"The Miracle of Theism", Oxford
This (posthumous) volume contains a comprehensive review of the principal
arguments for and against the existence of God.  It ranges from the classical
philosophical positions of Descartes, Anselm, Berkeley, Hume et al, through
the moral arguments of Newman, Kant and Sidgwick, to the recent restatements
of the classical theses by Plantinga and Swinburne.  It also addresses those
positions which push the concept of God beyond the realm of the rational,
such as those of Kierkegaard, Kung and Philips, as well as "replacements for
God" such as Lelie's axiarchism.  The book is a delight to read - less
formalistic and better written than Martin's works, and refreshingly direct
when compared with the hand-waving of Swinburne.

JAMES A. HAUGHT

"Holy Horrors: An Illustrated History of Religious Murder and Madness",
 Prometheus Books
Looks at religious persecution from ancient times to the present day -- and
not only by Christians.
Library of Congress Catalog Card Number 89-64079. 1990.

NORM R. ALLEN, JR.

"African American Humanism: an Anthology"
See the listing for African Americans for Humanism above.

GORDON STEIN

"An Anthology of Atheism and Rationalism", Prometheus Books
An anthology covering a wide range of subjects, including 'The Devil, Evil
and Morality' and 'The History of Freethought'.  Comprehensive bibliography.

EDMUND D. COHEN

"The Mind of The Bible-Believer", Prometheus Books
A study of why people become Christian fundamentalists, and what effect it
has on them.

                                Net Resources

There's a small mail-based archive server at mantis.co.uk which carries
archives of old alt.atheism.moderated articles and assorted other files.  For
more information, send mail to archive-server@mantis.co.uk saying

   help
   send atheism/index

and it will mail back a reply.


mathew
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-assignment */

df.data = df.data.values.map((d: string) => {
  let cleaned = d;

  // Remove emails
  cleaned = cleaned.replace(/[\w.-]+@[\w.-]+/g, "");

  // Remove the first line, which holds the sender's name from the "From:" header.
  // Without the "m" flag, ^ matches only at the start of the string, so only the first line is removed.
  cleaned = cleaned.replace(/^(.*?)(?=\n)/g, "");

  // Remove "From: "
  cleaned = cleaned.replace(/From: /g, "");

  // Remove "\nSubject: "
  cleaned = cleaned.replace(/\nSubject: /g, "");

  // Truncate to 5000 characters
  if (cleaned.length > 5000) {
    cleaned = cleaned.slice(0, 5000);
  }

  return cleaned;
});

df.head().print();
╔════════════╤═══════════════════╤═══════════════════╗
║            │ data              │ target            ║
╟────────────┼───────────────────┼───────────────────╢
║ 0          │ Alt.Atheism FAQ…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 1          │ Alt.Atheism FAQ…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 2          │ Re: Gospel Dati…  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 3          │ Re: university …  │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────╢
║ 4          │ Re: [soc.motss,…  │ alt.atheism       ║
╚════════════╧═══════════════════╧═══════════════════╝
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-assignment */

const texts = df.data.values as string[];
const classNameToLabelMap: Record<string, number> = df.target
  .unique()
  .values.reduce((acc: Record<string, number>, className: string, index: number) => {
    acc[className] = index + 1; // Start labels from 1
    return acc;
  }, {});

const classNames = df.target.values as string[];

const labels = classNames.map((name) => classNameToLabelMap[name]);

const df_train = new danfo.DataFrame({
  Text: texts,
  Label: labels,
  "Class Name": classNames,
});
df_train.head().print();
console.log("Training DataFrame created with", df_train.shape[0], "rows.");
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ Text              │ Label             │ Class Name        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alt.Atheism FAQ…  │ 1                 │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Alt.Atheism FAQ…  │ 1                 │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Re: Gospel Dati…  │ 1                 │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Re: university …  │ 1                 │ alt.atheism       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Re: [soc.motss,…  │ 1                 │ alt.atheism       ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Training DataFrame created with 11314 rows.

Next, sample the data by taking 150 data points from each category in the training dataset, then keep only a few categories. This tutorial uses the science categories.

/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument, @typescript-eslint/no-unsafe-assignment */

import { DataFrame } from "danfojs-node";

const SAMPLE_SIZE = 150;

const uniqueLabels = df_train.Label.unique().values;
const sampledGroups = [];
for (const label of uniqueLabels) {
  const labelGroup = df_train.query(df_train.Label.eq(label)).resetIndex();
  const groupSize = labelGroup.shape[0];

  if (groupSize > 0) {
    const sampledGroup = await labelGroup.sample(SAMPLE_SIZE, { seed: 42 });
    sampledGroups.push(sampledGroup);
  }
}
const df_train_sampled = danfo.concat({ dfList: sampledGroups, axis: 0 }) as DataFrame;
const df_train_final = df_train_sampled.query(df_train_sampled["Class Name"].str.includes("sci")).resetIndex();
console.log(`Sampled ${sampledGroups.length} groups from the training DataFrame.`);
console.log("Sampled DataFrame shape:", df_train_final.shape);
Sampled 20 groups from the training DataFrame.
Sampled DataFrame shape: [ 600, 3 ]
/* eslint-disable no-control-regex, @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument, @typescript-eslint/no-unsafe-assignment */
const cleanedText = df_train_final.Text.values.map((text: string) => text.replace(/[\x00-\x1F\x7F]/g, " "));
df_train_final.addColumn("Text", cleanedText, { inplace: true });
df_train_final.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ Text              │ Label             │ Class Name        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Cryptography FA…  │ 12                │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ text of White H…  │ 12                │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Re: White House…  │ 12                │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Cryptography FA…  │ 12                │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Re: How do they…  │ 12                │ sci.crypt         ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call */

import { DataFrame } from "danfojs-node";

const valueCounts = df_train_final["Class Name"].valueCounts() as DataFrame;
valueCounts.print();
╔═════════════════╤═════╗
║ sci.crypt       │ 150 ║
╟─────────────────┼─────╢
║ sci.electronics │ 150 ║
╟─────────────────┼─────╢
║ sci.med         │ 150 ║
╟─────────────────┼─────╢
║ sci.space       │ 150 ║
╚═════════════════╧═════╝

Create the embeddings

In this section, you will generate embeddings for the texts in the dataframe using the Gemini API.

API changes to Embeddings with model text-embedding-004

For the embeddings model, text-embedding-004, there is a task type parameter and an optional title (only valid with the RETRIEVAL_DOCUMENT task type).

These parameters apply only to the embeddings models. The task types are:

Task Type            Description
RETRIEVAL_QUERY      Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT   Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY  Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION       Specifies that the embeddings will be used for classification.
CLUSTERING           Specifies that the embeddings will be used for clustering.
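
The rule that a title only makes sense for RETRIEVAL_DOCUMENT can be enforced before a request is sent. The sketch below is an illustration under assumptions: buildEmbedConfig and EmbedConfigSketch are hypothetical names, and the shape of the config (a taskType field plus an optional title field) is assumed to mirror what the SDK's embedContent call accepts in its config object. It makes no network calls.

```typescript
// Task types listed in the table above.
type TaskType =
  | "RETRIEVAL_QUERY"
  | "RETRIEVAL_DOCUMENT"
  | "SEMANTIC_SIMILARITY"
  | "CLASSIFICATION"
  | "CLUSTERING";

// Local stand-in for the config fields used here (assumed shape, not the SDK's own type).
interface EmbedConfigSketch {
  taskType: TaskType;
  title?: string;
}

// Hypothetical helper: builds a config and rejects a title on any
// task type other than RETRIEVAL_DOCUMENT.
function buildEmbedConfig(taskType: TaskType, title?: string): EmbedConfigSketch {
  if (title !== undefined && taskType !== "RETRIEVAL_DOCUMENT") {
    throw new Error(`title is only valid with RETRIEVAL_DOCUMENT, got ${taskType}`);
  }
  return title === undefined ? { taskType } : { taskType, title };
}

console.log(buildEmbedConfig("CLUSTERING"));
console.log(buildEmbedConfig("RETRIEVAL_DOCUMENT", "Atheist Resources"));
```

This tutorial groups documents by topic, so CLUSTERING is the task type used in the request that follows.
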
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-assignment */

const MODEL_ID = "models/text-embedding-004";
const BATCH_SIZE = 100;
const embeddings: number[][] = [];
const display = tslab.newDisplay();
display.text("Progress: 0%");

for (let i = 0; i < df_train_final.shape[0]; i += BATCH_SIZE) {
  const batch = df_train_final.Text.values.slice(i, i + BATCH_SIZE);
  const embeddingResponse = await ai.models.embedContent({
    model: MODEL_ID,
    contents: batch,
    config: {
      taskType: "CLUSTERING",
    },
  });
  const batchEmbeddings = embeddingResponse.embeddings?.map((e) => e.values ?? []) ?? [];
  embeddings.push(...batchEmbeddings);
  display.text(`Progress: ${((Math.min(i + BATCH_SIZE, df_train_final.shape[0]) / df_train_final.shape[0]) * 100).toFixed(2)}%`);
}
df_train_final.addColumn("Embedding", new danfo.Series(embeddings), { inplace: true });
Progress: 100.00%
df_train_final.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ Text              │ Label             │ Class Name        │ Embedding         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Cryptography FA…  │ 12                │ sci.crypt         │ 0.024656007,0.0…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ text of White H…  │ 12                │ sci.crypt         │ 0.029945772,0.0…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Re: White House…  │ 12                │ sci.crypt         │ 0.00047821522,0…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Cryptography FA…  │ 12                │ sci.crypt         │ 0.015821712,0.0…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Re: How do they…  │ 12                │ sci.crypt         │ 0.025664028,0.0…  ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Dimensionality reduction

The document embedding vectors have 768 dimensions. To visualize how the embedded documents group together, you need to apply dimensionality reduction, because the embeddings can only be plotted in 2D or 3D space. Contextually similar documents should lie closer together in that space than documents that are less similar.

console.log((df_train_final.at(0, "Embedding") as string).split(",").length, "dimensions in the embedding vector.");
768 dimensions in the embedding vector.
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call */
const X = df_train_final.Embedding.values.map((e: string) => e.split(",").map(Number)) as number[][];
console.log("Shape of the embedding matrix:", X.length, "x", X[0].length);
Shape of the embedding matrix: 600 x 768
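
The claim that similar documents sit close together can also be checked directly in the original space, without any dimensionality reduction. One common measure is cosine similarity. The sketch below is an illustration only: cosineSimilarity is a hypothetical helper, and the three toy vectors stand in for real 768-dimensional embeddings.

```typescript
// Cosine similarity: 1 for the same direction, 0 for orthogonal vectors, -1 for opposite directions.
// Returns NaN if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors (made-up values, not real embeddings).
const docA = [1, 2, 3];
const docB = [2, 4, 6.5]; // points in nearly the same direction as docA
const docC = [-3, 0, 1]; // orthogonal to docA
console.log(cosineSimilarity(docA, docB) > cosineSimilarity(docA, docC)); // true
```

With the matrix built above, cosineSimilarity(X[0], X[1]) would compare the first two sampled documents in the same way.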

You will apply the t-Distributed Stochastic Neighbor Embedding (t-SNE) approach to perform dimensionality reduction. This technique reduces the number of dimensions, while preserving clusters (points that are close together stay close together). For the original data, the model tries to construct a distribution over which other data points are “neighbors” (e.g., they share a similar meaning). It then optimizes an objective function to keep a similar distribution in the visualization.
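
To make the "distribution over neighbors" concrete, the sketch below computes, for one point, the probability of picking each other point as its neighbor using a Gaussian kernel on squared Euclidean distance. It is a simplified illustration and not the internals of any particular library: neighborProbabilities is a hypothetical helper, and it uses one fixed bandwidth sigma, whereas t-SNE tunes a separate bandwidth per point so that each distribution matches the chosen perplexity.

```typescript
// Probability that point `i` picks each other point as a neighbor,
// using a Gaussian kernel with a fixed bandwidth `sigma` (an assumption of this sketch).
function neighborProbabilities(points: number[][], i: number, sigma: number): number[] {
  const weights = points.map((p, j) => {
    if (j === i) return 0; // a point is never its own neighbor
    const sqDist = p.reduce((acc, v, d) => acc + (v - points[i][d]) ** 2, 0);
    return Math.exp(-sqDist / (2 * sigma * sigma));
  });
  const total = weights.reduce((a, b) => a + b, 0);
  // Normalize so the probabilities sum to 1.
  return weights.map((w) => w / total);
}

// Point 1 is close to point 0; point 2 is far away (made-up data).
const pts = [[0, 0], [0.5, 0], [5, 5]];
const probs = neighborProbabilities(pts, 0, 1.0);
console.log(probs); // nearly all of the probability mass lands on index 1
```

t-SNE then places the points in 2D so that an analogous distribution computed in the low-dimensional space stays as close as possible to this one.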

/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-assignment */
const seedrandom = require("seedrandom") as typeof import("seedrandom");

Math.random = seedrandom("0") as () => number; // Set a fixed seed for reproducibility

const TSNE = require("tsne-js") as typeof import("tsne-js");

const tsne = new TSNE({
  dim: 2, // output dimension (2D)
  perplexity: 50, // typical value, you can tune it
  earlyExaggeration: 12.0,
  learningRate: 50.0,
  nIter: 1000, // max iterations like max_iter=1000 in sklearn
  metric: "euclidean",
});

// Initialize and run
tsne.init({
  data: X,
  type: "dense",
});
tsne.run();

// Get the 2D embeddings
const tsneResults = tsne.getOutput(); // number[][] with shape (n_samples, 2)

console.log(tsneResults);
[
  [ -2.75719412213166, -2.4167579925325673 ],
  [ -3.4650509728517473, 0.19940904059927403 ],
  [ -3.1753115790018804, -0.1284964597750167 ],
  [ -2.732233765019833, -2.2738909389327677 ],
  [ -3.640960248614705, 1.592031448202725 ],
  [ -2.913274910232553, -2.0117842994764934 ],
  [ -3.732565872724042, 0.8929089650821408 ],
  [ -2.6001813552425275, -0.4345021259512707 ],
  [ -3.4235476754065806, 0.26750584730012095 ],
  [ -2.9094321056136345, -0.20967248547410808 ],
  [ -2.670751980293048, -0.259668319473064 ],
  [ -3.0984366169794093, 0.4478399191644772 ],
  [ -3.730983535910628, 1.663066747665998 ],
  [ -2.813190116009361, -2.2528771229509585 ],
  [ 0.14338377138146707, -4.050573473374774 ],
  [ -2.7730442343509845, -1.2572076195272872 ],
  [ -2.812856057009155, -0.6776512329441536 ],
  [ -3.2991438028844975, 0.4214064649960356 ],
  [ -2.847791802459114, 0.17132417448921827 ],
  [ -3.5114410469589976, 0.09457726442168582 ],
  [ -2.481084649210149, -2.2765942963368118 ],
  [ -4.119281664184095, 0.003326814908289964 ],
  [ 1.343627471949905, 0.13043929240223734 ],
  [ -3.4626831652854917, -1.5854509352096195 ],
  [ -4.022288633352748, 0.05933391126826325 ],
  [ -3.6075551233322045, -1.5551137270731281 ],
  [ -3.3990454528907543, -0.38329348308808925 ],
  [ -3.152273955390189, 0.05964823178148446 ],
  [ -4.554734811959803, 0.3621894284256046 ],
  [ -3.0322653921150478, 1.3267324132866969 ],
  [ -3.205014886307019, -1.685280712335292 ],
  [ -3.487612826803247, -0.13739846073977646 ],
  [ -3.621165903878491, 1.6764088523801437 ],
  [ -3.781258093597947, 0.9654437402332678 ],
  [ -3.263296624713932, -1.3043939423227202 ],
  [ -3.643671735136128, 0.7982166673727821 ],
  [ -1.9825516177133382, -2.9852993115048188 ],
  [ -2.796726194971425, -2.342798334799461 ],
  [ -3.418304950905879, 0.2520139109718462 ],
  [ -2.272932997009172, 0.665897544406455 ],
  [ -2.959277201168116, -0.4808951298144854 ],
  [ -3.5191170044388858, -1.5822375384169851 ],
  [ -3.6718846973933026, 1.6863084763169696 ],
  [ -0.10717139922240557, -0.7924637079721915 ],
  [ -4.416387283334934, 0.3036821901396999 ],
  [ -2.8376079576156084, -2.6851472681872948 ],
  [ -2.660493358067436, 0.3700642061629209 ],
  [ 0.1501392101717233, -4.044177967414137 ],
  [ -2.73412399679903, 2.132187977748074 ],
  [ -3.0751542691818248, 0.19801504783368584 ],
  [ -2.798942324727005, -2.2859973785641095 ],
  [ -3.092624085378025, 1.645195741005908 ],
  [ -1.959198190367589, -3.0415080170307887 ],
  [ -3.279382709490505, -1.7013211135796653 ],
  [ -2.7963320260563385, -0.07420727048992252 ],
  [ -3.265040108356337, 0.3171497174201787 ],
  [ -2.8973113567790842, -2.625503845312865 ],
  [ -3.4538424335641627, 0.26775004726638857 ],
  [ -3.6700320446069106, -1.4383323128505636 ],
  [ -3.1424779746761344, -2.336497124051821 ],
  [ -2.877066266085532, -0.18796038228802414 ],
  [ -2.954830448950433, -0.14175695859932466 ],
  [ -3.0050465370063004, 0.7051067387086039 ],
  [ 0.1588881630864866, -4.074795399394153 ],
  [ -2.341704983858917, -1.5671266092446112 ],
  [ -1.616773198414981, -2.7917638574934625 ],
  [ -3.1447294707922113, -2.2654588936050417 ],
  [ -3.390896928228023, -0.0879412177360309 ],
  [ -3.8654507203512596, 1.251887995243783 ],
  [ -2.870202263845239, -2.4721063596229347 ],
  [ -3.071020333068617, 0.030423787825393008 ],
  [ -2.7666024657126247, -2.2475269075162094 ],
  [ -3.090652547446768, -0.03515393665120045 ],
  [ -2.4584769656123537, 3.6194066338497706 ],
  [ -2.734340337387575, -0.66535319727941 ],
  [ 0.14569857896490673, -4.055534036802551 ],
  [ -2.997590681428589, -1.7723923439324571 ],
  [ -2.7308760527460136, 2.1038448229239215 ],
  [ -3.664635622273039, 0.8521478576385247 ],
  [ -3.609407854798383, 0.7662698287432311 ],
  [ -3.3258877246982386, 0.4375709037472294 ],
  [ -3.193734103004041, -1.3182277404257015 ],
  [ -2.490288865402754, 0.6036089243887675 ],
  [ -2.345400420805133, 0.6653508522849211 ],
  [ -3.449591142002654, -0.2281816441482124 ],
  [ -4.629871864325621, 0.37992808921999927 ],
  [ -3.1252282755023177, 1.654393676174113 ],
  [ -3.5841664560558173, 0.7256331199477916 ],
  [ -3.618310046119403, 0.3931151799798502 ],
  [ -2.346985602095025, -0.22166610030234346 ],
  [ -3.3001389381673696, 0.46921670905010815 ],
  [ -2.9162838182777415, -0.12780013707644988 ],
  [ -2.9978278464029025, -0.02477620462808609 ],
  [ -2.2815192119337753, -0.1786052332304423 ],
  [ -3.110707401695051, 1.5275108022386112 ],
  [ -3.010200680865807, 1.3559566761181823 ],
  [ -2.969744852231584, 0.6886056283458636 ],
  [ -2.7206445190844826, 0.15633192640302912 ],
  [ -3.165763610103645, 0.9415403988846237 ],
  [ -2.003595489918228, -2.95971959822669 ],
  ... 500 more items
]
/* eslint-disable @typescript-eslint/no-unsafe-argument */
const df_tsne = new danfo.DataFrame(tsneResults, { columns: ["TSNE1", "TSNE2"] });
df_tsne.addColumn("Class Name", df_train_final["Class Name"], { inplace: true });
df_tsne.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ TSNE1             │ TSNE2             │ Class Name        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ -2.757194122131…  │ -2.416757992532…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ -3.465050972851…  │ 0.1994090405992…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ -3.175311579001…  │ -0.128496459775…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ -2.732233765019…  │ -2.273890938932…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ -3.640960248614…  │ 1.5920314482027…  │ sci.crypt         ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝
// toJSON() returns one object per row, so each field holds a single value rather than an array.
const rawData = df_tsne.toJSON() as { TSNE1: number; TSNE2: number; "Class Name": string }[];

// Step 1: Get unique class names
const classNames = [...new Set(rawData.map((row) => row["Class Name"]))];

// Step 2: Generate one trace per class
const traces = classNames.map((className) => {
  const classData = rawData.filter((row) => row["Class Name"] === className);
  return {
    x: classData.map((row) => row.TSNE1),
    y: classData.map((row) => row.TSNE2),
    mode: "markers",
    type: "scatter",
    name: className,
    marker: {
      size: 6,
    },
    text: classData.map((row) => row["Class Name"]),
    hoverinfo: "text",
  };
});

const html = `
<div style="width: 100%; height: 600px;">
  <div id="scatter-plot" style="width: 100%; height: 100%;"></div>
  <script src="https://cdn.jsdelivr.net/npm/plotly.js-dist@latest/plotly.min.js"></script>
  <script>
    const traces = ${JSON.stringify(traces)};
    
    const layout = {
      title: { text: 'Scatter plot of news using t-SNE', font: { size: 20 } },
      xaxis: { title: { text: 'TSNE1' } },
      yaxis: { title: { text: 'TSNE2' } },
      showlegend: true,
      height: 600,
      width: 800
    };

    Plotly.newPlot('scatter-plot', traces, layout);
  </script>
</div>
`;

tslab.display.html(html);

Outlier detection

To find the anomalous points, you need to separate the inliers from the outliers. Start by finding the centroid, the location that represents the center of each cluster, then use each point's distance from its centroid to decide which points are outliers.
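
As a minimal sketch of that idea, the snippet below computes a centroid and flags distant points on plain arrays. The helper names (centroidOf, euclidean, outlierIndicesOf) and the toy points are assumptions made for illustration; they are not part of the notebook or of danfojs-node.

```typescript
// Hypothetical helpers: a plain number[][] stands in for the embedding column.
function centroidOf(points: number[][]): number[] {
  if (points.length === 0) throw new Error("need at least one point");
  const dims = points[0].length;
  const sums: number[] = [];
  for (let d = 0; d < dims; d++) sums.push(0);
  // Add up every coordinate, then divide by the number of points.
  for (const p of points) {
    for (let d = 0; d < dims; d++) sums[d] += p[d];
  }
  return sums.map((s) => s / points.length);
}

function euclidean(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((acc, v, i) => acc + (v - b[i]) * (v - b[i]), 0));
}

// Returns the indices of the points that lie farther than `radius` from the centroid.
function outlierIndicesOf(points: number[][], radius: number): number[] {
  const c = centroidOf(points);
  const indices: number[] = [];
  points.forEach((p, i) => {
    if (euclidean(p, c) > radius) indices.push(i);
  });
  return indices;
}

// Toy data: three points near the origin and one far away.
const toyPoints = [[0, 0], [0.1, 0], [0, 0.1], [3, 3]];
console.log(outlierIndicesOf(toyPoints, 2)); // [ 3 ]
```

The far point pulls the centroid toward itself, which is one reason the radius usually needs tuning rather than a fixed guess.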

Start by getting the centroid of each category.

const centroids = df_tsne.groupby(["Class Name"]).mean();
console.log("Centroids of each class:");
centroids.print();
Centroids of each class:
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ Class Name        │ TSNE1_mean        │ TSNE2_mean        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ sci.crypt         │ -2.957136843703…  │ -0.219018097539…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ sci.electronics   │ 0.1989684993404…  │ -1.092874851044…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ sci.med           │ 0.4772668588016…  │ 1.9774906322719…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ sci.space         │ 2.3621191261015…  │ -0.698502340616…  ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝
/* eslint-disable @typescript-eslint/no-unsafe-member-access */

import { DataFrame } from "danfojs-node";

function getEmbeddingCentroids(df: DataFrame): Record<string, number[]> {
  const embCentroids: Record<string, number[]> = {};
  const grouped = df.groupby(["Class Name"]);
  const uniqueClasses = Object.keys(grouped.groups);
  for (const c of uniqueClasses) {
    const subDf = grouped.getGroup([c]);
    const embeddings = (subDf.Embedding.values as string[]).map((emb) => emb.split(",").map(Number));
    const centroid = embeddings[0].map(
      (_, dim) => embeddings.reduce((sum, emb) => sum + emb[dim], 0) / embeddings.length
    );
    embCentroids[c] = centroid;
  }
  return embCentroids;
}
const embeddingCentroids = getEmbeddingCentroids(df_train_final);
console.log("Embedding centroids for each class:");
for (const [className, centroid] of Object.entries(embeddingCentroids)) {
  console.log(`${className}: [${centroid.slice(0, 5).join(", ")}, ...]`); // Display first 5 dimensions
}
Embedding centroids for each class:
sci.crypt: [0.022761533203466654, 0.03083421185486666, -0.04401745468546671, 0.028704286586933327, 0.022358004574199997, ...]
sci.electronics: [0.014366566139066662, 0.008163063420626665, -0.04703755094199998, 0.03413747522600001, 0.0053475741199999986, ...]
sci.med: [0.027153817067799988, 0.02117220626106667, -0.04637992124000001, 0.028582068277333336, 0.009880412332133338, ...]
sci.space: [0.05165711565620001, 0.019747807417133327, -0.03648269307839999, 0.0462917145776, 0.019814320328666667, ...]

Plot each centroid you have found against the rest of the points.

const centroidData = (
  danfo.toJSON(centroids) as { "Class Name": string; TSNE1_mean: number; TSNE2_mean: number }[]
).map((row) => ({
  className: row["Class Name"],
  x: row.TSNE1_mean,
  y: row.TSNE2_mean,
}));

const centroidTrace = {
  x: centroidData.map((d) => d.x),
  y: centroidData.map((d) => d.y),
  mode: "markers+text",
  type: "scatter",
  name: "Centroids",
  marker: {
    size: 14,
    color: "black",
    symbol: "x",
  },
  text: centroidData.map((d) => d.className),
  textposition: "top center",
  hoverinfo: "text",
};

const html = `
<div style="width: 100%; height: 600px;">
  <div id="scatter-plot-centroids" style="width: 100%; height: 100%;"></div>
  <script src="https://cdn.jsdelivr.net/npm/plotly.js-dist@latest/plotly.min.js"></script>
  <script>
    const allTraces = ${JSON.stringify([...traces, centroidTrace])};

    const centroidLayout = {
      title: { text: 'Scatter plot of news using t-SNE with centroids', font: { size: 20 } },
      xaxis: { title: { text: 'TSNE1' } },
      yaxis: { title: { text: 'TSNE2' } },
      showlegend: true,
      legend: { x: 1.05, y: 1 },
      height: 600,
      width: 800
    };

    Plotly.newPlot('scatter-plot-centroids', allTraces, centroidLayout);
  </script>
</div>
`;

tslab.display.html(html);

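Choose a radius. Any point farther than this distance from the centroid of its category is considered an outlier. Note that the distance is measured in the original embedding space, not in the 2D t-SNE projection.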

/* eslint-disable @typescript-eslint/no-unsafe-member-access */

import { DataFrame } from "danfojs-node";

function calculateEuclideanDistance(p1: number[], p2: number[]): number {
  return Math.sqrt(p1.reduce((sum, val, idx) => sum + Math.pow(val - p2[idx], 2), 0));
}

function detectOutliers(df: DataFrame, embCentroids: Record<string, number[]>, radius: number): number[] {
  const outlierFlags: boolean[] = [];
  for (let i = 0; i < df.shape[0]; i++) {
    const row = df.iloc({ rows: [i] });
    const className = row["Class Name"].values[0] as string;
    const embedding = (row.Embedding.values[0] as string).split(",").map(Number);
    const centroid = embCentroids[className];
    const distance = calculateEuclideanDistance(embedding, centroid);
    outlierFlags.push(distance > radius);
  }
  df.addColumn("Outlier", new danfo.Series(outlierFlags), { inplace: true });
  return outlierFlags.map((flag, index) => (flag ? index : -1)).filter((index) => index !== -1);
}
const range_ = Array.from({ length: 23 }, (_, i) => (0.3 + i * 0.02).toFixed(2));
const numOutliers: number[] = [];
for (const radius of range_) {
  // detectOutliers returns the row indices of the outliers, so the array length is the outlier count.
  const indices = detectOutliers(df_train_final, embeddingCentroids, parseFloat(radius));
  numOutliers.push(indices.length);
}
const barTrace = {
  x: range_,
  y: numOutliers,
  type: "bar",
  text: numOutliers.map(String), // bar labels
  textposition: "outside", // display on top of bars
  marker: { color: "#1f77b4" },
};

const html = `
<div style="width: 100%; height: 600px;">
  <div id="bar-plot" style="width: 100%; height: 100%;"></div>
  <script src="https://cdn.jsdelivr.net/npm/plotly.js-dist@latest/plotly.min.js"></script>
  <script>
    const barTrace = ${JSON.stringify(barTrace)};

    const barLayout = {
      title: {
        text: "Number of outliers vs. distance of points from centroid",
        font: { size: 20 }
      },
      xaxis: {
        title: { text: "Distance" },
        tickangle: -45
      },
      yaxis: {
        title: { text: "Number of outliers" }
      },
      height: 600,
      width: 1000,
      margin: { t: 80, l: 80, r: 40, b: 100 }
    };

    Plotly.newPlot("bar-plot", [barTrace], barLayout);
  </script>
</div>
`;

tslab.display.html(html);

Depending on how sensitive you want your anomaly detector to be, you can choose which radius you would like to use. For now, 0.62 is used, but you can change this value.
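
One way to pick the radius less arbitrarily is to derive it from the distances themselves, for example by keeping a chosen fraction of points inside the bound. The sketch below is an assumption layered on top of the tutorial, not its method: radiusForQuantile is a hypothetical helper, and the sample distances are made-up toy values rather than results from the dataset.

```typescript
// Hypothetical helper: returns the distance at or below which a fraction `q` of the points fall.
function radiusForQuantile(distances: number[], q: number): number {
  if (distances.length === 0) throw new Error("need at least one distance");
  if (q < 0 || q > 1) throw new Error("q must be between 0 and 1");
  // Sort a copy so the caller's array is left untouched.
  const sorted = distances.slice().sort((a, b) => a - b);
  // Nearest-rank quantile: the smallest value with at least q * n values at or below it.
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Toy distances from points to their class centroid (illustrative values only).
const sampleDistances = [0.41, 0.45, 0.48, 0.5, 0.52, 0.55, 0.58, 0.6, 0.66, 0.71];
const suggestedRadius = radiusForQuantile(sampleDistances, 0.75);
const flaggedCount = sampleDistances.filter((d) => d > suggestedRadius).length;
console.log(suggestedRadius, flaggedCount); // 0.6 2
```

A lower q makes the detector more sensitive, since more points fall outside the bound.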

const RADIUS = 0.62;
const outlierIndices = detectOutliers(df_train_final, embeddingCentroids, RADIUS);
const df_outliers = new danfo.DataFrame(
  // @ts-expect-error false positive, expression is not callable
  df_train_final.values.filter((row, index) => outlierIndices.includes(index)),
  { columns: df_train_final.columns }
);
df_outliers.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ Text              │ Label             │ Class Name        │ Embedding         │ Outlier           ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Cryptography FA…  │ 12                │ sci.crypt         │ 0.024656007,0.0…  │ true              ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Re: Once tapped…  │ 12                │ sci.crypt         │ 0.015565537,-0.…  │ true              ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Re: How do they…  │ 12                │ sci.crypt         │ 0.040472295,0.0…  │ true              ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Cryptography FA…  │ 12                │ sci.crypt         │ 0.009976089,0.0…  │ true              ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Re: Source of r…  │ 12                │ sci.crypt         │ -0.024744896,0.…  │ true              ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝
const outliers_projected = df_tsne.loc({
  rows: outlierIndices,
  columns: df_tsne.columns,
});
outliers_projected.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ TSNE1             │ TSNE2             │ Class Name        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ -2.757194122131…  │ -2.416757992532…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 7          │ -2.600181355242…  │ -0.434502125951…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 12         │ -3.730983535910…  │ 1.6630667476659…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 13         │ -2.813190116009…  │ -2.252877122950…  │ sci.crypt         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 14         │ 0.1433837713814…  │ -4.050573473374…  │ sci.crypt         ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Plot the outliers and mark them in a semi-transparent red.

/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-assignment */
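const outlierTrace = {
  x: outliers_projected.TSNE1.values,
  y: outliers_projected.TSNE2.values,
  mode: "markers",
  type: "scatter",
  name: "Outliers",
  marker: {
    size: 10,
    color: "red",
    opacity: 0.5,
  },
  // hoverinfo: "text" needs a text array; show the class name of each outlier on hover
  text: outliers_projected["Class Name"].values,
  hoverinfo: "text",
};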

const html = `
<div style="width: 100%; height: 600px;">
  <div id="scatter-plot-outliers" style="width: 100%; height: 100%;"></div>
  <script src="https://cdn.jsdelivr.net/npm/plotly.js-dist@latest/plotly.min.js"></script>
  <script>
    const outlierTrace = ${JSON.stringify(outlierTrace)};
    const outlierLayout = {
      title: { text: 'Scatter plot of news with outliers projected with t-SNE', font: { size: 20 } },
      xaxis: { title: { text: 'TSNE1' } },
      yaxis: { title: { text: 'TSNE2' } },
      showlegend: true,
      height: 600,
      width: 800
    };
    Plotly.newPlot('scatter-plot-outliers', [...allTraces, outlierTrace], outlierLayout);
    </script>
</div>
`;
tslab.display.html(html);

Use the index values of the DataFrames to print a few examples of what outliers can look like in each category. Here, the first outlier from each category is printed. Explore other points in each category to see more of the data flagged as outliers, or anomalies.

/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument */
const sciCryptOutliers = df_outliers.query(df_outliers["Class Name"].eq("sci.crypt"));
console.log(sciCryptOutliers.Text.values[0]);
Cryptography FAQ 07/10 - Digital Signatures Organization: The Crypt Cabal Lines: 85 Expires: 22 May 1993 04:00:07 GMT Reply-To:  NNTP-Posting-Host: pad-thai.aktis.com Summary: Part 7 of 10 of the sci.crypt FAQ, Digital Signatures and  Hash Functions.  Theory of one-way hash functions, distinctions of  terms. MD4 and MD5. Snefru. X-Last-Updated: 1993/04/16  Archive-name: cryptography-faq/part07 Last-modified: 1993/4/15   FAQ for sci.crypt, part 7: Digital Signatures and Hash Functions  This is the seventh of ten parts of the sci.crypt FAQ. The parts are mostly independent, but you should read the first part before the rest. We don't have the time to send out missing parts by mail, so don't ask. Notes such as ``[KAH67]'' refer to the reference list in the last part.  The sections of this FAQ are available via anonymous FTP to rtfm.mit.edu  as /pub/usenet/news.answers/cryptography-faq/part[xx].  The Cryptography  FAQ is posted to the newsgroups sci.crypt, sci.answers, and news.answers every 21 days.   Contents:  * What is a one-way hash function? * What is the difference between public, private, secret, shared, etc.? * What are MD4 and MD5? * What is Snefru?   * What is a one-way hash function?    A typical one-way hash function takes a variable-length message and   produces a fixed-length hash. Given the hash it is computationally   impossible to find a message with that hash; in fact one can't   determine any usable information about a message with that hash, not   even a single bit. For some one-way hash functions it's also   computationally impossible to determine two messages which produce the   same hash.    A one-way hash function can be private or public, just like an   encryption function. Here's one application of a public one-way hash   function, like MD5 or Snefru. Most public-key signature systems are   relatively slow. To sign a long message may take longer than the user   is willing to wait. 
Solution: Compute the one-way hash of the message,   and sign the hash, which is short. Now anyone who wants to verify the   signature can do the same thing.    Another name for one-way hash function is message digest function.  * What is the difference between public, private, secret, shared, etc.?    There is a horrendous mishmash of terminology in the literature for a   very small set of concepts. When an algorithm depends on a key which   isn't published, we call it a private algorithm; otherwise we call it   a public algorithm. We have encryption functions E and decryption   functions D, so that D(E(M)) = M for any message M. We also have   hashing functions H and verification functions V, such that V(M,X) = 1   if and only if X = H(M).    A public-key cryptosystem has public encryption and private   decryption. Checksums, such as the application mentioned in the   previous question, have public hashing and public verification.   Digital signature functions have private hashing and public   verification: only one person can produce the hash for a message,   but everyone can verify that the hash is correct.    Obviously, when an algorithm depends on a private key, it's meant to   be unusable by anyone who doesn't have the key. There's no real   difference between a ``shared'' key and a private key: a shared key   isn't published, so it's private. If you encrypt data for a friend   rather than ``for your eyes only'', are you suddenly doing   ``shared-key encryption'' rather than private-key encryption? No.  * What are MD4 and MD5?    MD4 and MD5 are message digest functions developed by Ron Rivest.   Definitions appear in RFC 1320 and RFC 1321 (see part 10). Code is   available from [FTPMD].    Note that a transcription error was found in the original MD5 draft   RFC. The corrected algorithm should be called MD5a, though some   people refer to it as MD5.  * What is Snefru?    Snefru is a family of message digest functions developed by Ralph   Merkle. 
Snefru-8 is an 8-round function, the newest in the family.   Definitions appear in Merkle's paper [ME91a]. Code is available from   [FTPSF]. 
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument */
const sciElectronicsOutliers = df_outliers.query(df_outliers["Class Name"].eq("sci.electronics"));
console.log(sciElectronicsOutliers.Text.values[0]);
Re: Conductive Plastic, what happened? Organization: Litton Systems, Toronto ONT Lines: 7  If you're thinking of reactive polymers they're making ESD safe contau iners out of it. As far as being conductive goes anything with a resistance less than 10 to the fouth  rth power ohms per cubic measure is classed as conductive per MIL-STD-1686 for ESD protection. My $0.02 ($0.016 US).  Bob. 
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument */
const sciMedOutliers = df_outliers.query(df_outliers["Class Name"].eq("sci.med"));
console.log(sciMedOutliers.Text.values[0]);
Is an oral form of Imitrex(sumatriptan) available in CA Article-I.D.: vela.1psee5$c3t Distribution: na Organization: Oakland University, Rochester MI. Lines: 9 NNTP-Posting-Host: ouchem.chem.oakland.edu  Sumatriptan(Imitrex) just became available in the US in a subcutaneous injectable form.  Is there an oral form available in CA?  A friend(yes really not me!)  has severe migranes about 2-3 times per week.  We live right by the CA border and he has gotten drugs for GERD prescribed by a US physician and filled in a CA pharmacy, but not yet FDA approved in the US.  What would be the cost of the oral form in CA$ also if anyone would have that info?      Thanks 
/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call, @typescript-eslint/no-unsafe-argument */
const sciSpaceOutliers = df_outliers.query(df_outliers["Class Name"].eq("sci.space"));
console.log(sciSpaceOutliers.Text.values[0]);
Re: pushing the envelope Article-I.D.: rave.1psogpINNksq Reply-To:  (CLAUDIO OLIVEIRA EGALON) Distribution: world Organization: NASA Langley Research Center, Hampton, VA  USA Lines: 11 NNTP-Posting-Host: tahiti.larc.nasa.gov   > flight tests are generally carefully coreographed and just what  > is going to be  'pushed' and how > far is precisely planned (despite occasional deviations from plans, > such as the 'early' first flight of the F-16 during its high-speed > taxi tests).  .. and Chuck Yeager earlier flights with the X-1...     

Next steps

You’ve now created an anomaly detector using embeddings! Try embedding your own textual data, visualizing the embeddings, and choosing a bound that lets you detect outliers. You can perform dimensionality reduction to complete the visualization step. Note that t-SNE is good at revealing clusters in the inputs, but it can take a long time to converge or get stuck in local minima. If you run into this issue, another technique to consider is principal component analysis (PCA).
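
To give a feel for what PCA does, here is a minimal sketch that projects rows onto their top principal components using power iteration. It rests on several assumptions: pcaProject is a hypothetical helper written for this illustration, the start vector is an arbitrary choice, and the covariance matrix it builds grows quadratically with the embedding size, so a linear-algebra library is probably a better fit for full-size embeddings.

```typescript
// Illustrative PCA via power iteration; pcaProject is a hypothetical helper,
// not part of danfojs-node or the Gemini SDK.
function pcaProject(data: number[][], k: number, iterations = 200): number[][] {
  const n = data.length;
  const dims = data[0].length;

  // Center every column on its mean.
  const mean: number[] = [];
  for (let d = 0; d < dims; d++) {
    let s = 0;
    for (const row of data) s += row[d];
    mean.push(s / n);
  }
  const centered = data.map((row) => row.map((v, d) => v - mean[d]));

  // Sample covariance matrix (dims x dims).
  const cov: number[][] = [];
  for (let i = 0; i < dims; i++) {
    const covRow: number[] = [];
    for (let j = 0; j < dims; j++) {
      let s = 0;
      for (const row of centered) s += row[i] * row[j];
      covRow.push(s / Math.max(1, n - 1));
    }
    cov.push(covRow);
  }

  const dot = (a: number[], b: number[]) => a.reduce((acc, x, i) => acc + x * b[i], 0);

  const components: number[][] = [];
  for (let c = 0; c < k; c++) {
    // Deterministic, asymmetric start vector (an arbitrary choice for this sketch).
    let v: number[] = [];
    for (let d = 0; d < dims; d++) v.push(Math.sin(d + 1));
    for (let it = 0; it < iterations; it++) {
      // Multiply by the covariance matrix, then remove the directions already found.
      let w = cov.map((covRow) => dot(covRow, v));
      for (const prev of components) {
        const proj = dot(w, prev);
        w = w.map((x, i) => x - proj * prev[i]);
      }
      const norm = Math.sqrt(dot(w, w));
      if (norm < 1e-12) break; // no variance left in this direction
      v = w.map((x) => x / norm);
    }
    components.push(v);
  }

  // Coordinates of each centered row along the components.
  return centered.map((row) => components.map((comp) => dot(row, comp)));
}

// Toy data lying on the line y = 2x: one component captures all of the variance.
const linePoints = [[1, 2], [2, 4], [3, 6], [4, 8]];
const projected1d = pcaProject(linePoints, 1);
console.log(projected1d.map((p) => p[0].toFixed(3)));
```

Calling pcaProject with k set to 2 on the embedding rows would give coordinates that could be plotted in place of TSNE1 and TSNE2. Unlike t-SNE, the projection is linear and deterministic, so it may separate the categories less sharply.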