Instruction Embedding

Concept

Instruction embedding is the representation of individual assembly or machine instructions (or sequences of them) as continuous, low-dimensional vectors in a learned vector space. This representation lets neural network models, particularly transformer language models, consume assembly or disassembled code for downstream tasks such as binary analysis, similarity detection, and coverage-guided hardware verification.

First seen 6/12/2026

Last seen 6/12/2026

Evidence 3 chunks

Wiki v1

WIKI

Instruction Embedding

Definition

An instruction embedding is a fixed-length, continuous, low-dimensional vector representation of an assembly or machine instruction (or of a sequence of such instructions) that is suitable as input to a neural network model. The concept is borrowed from natural language processing (NLP), where words are commonly mapped to dense vectors (word embeddings), and it is applied to binary code analysis by treating instructions analogously to words.
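As a minimal sketch of the "instructions as words" idea, the snippet below looks up fixed-length vectors for a few instructions and compares them with cosine similarity. The table `EMBEDDINGS` and its 3-dimensional values are made up for illustration; a real model would learn the vectors from a corpus and use far more dimensions.

```python
import math

# Toy embedding table. The vectors are invented for illustration only;
# they are not taken from any trained model.
EMBEDDINGS = {
    "add x1, x2, x3": [0.9, 0.1, 0.0],
    "sub x1, x2, x3": [0.8, 0.2, 0.1],
    "lw x5, 8(x2)": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


add_vec = EMBEDDINGS["add x1, x2, x3"]
sub_vec = EMBEDDINGS["sub x1, x2, x3"]
load_vec = EMBEDDINGS["lw x5, 8(x2)"]

# With these toy values the two arithmetic instructions score as more
# similar to each other than the arithmetic instruction and the load.
print(cosine_similarity(add_vec, sub_vec) > cosine_similarity(add_vec, load_vec))
```

In a learned embedding space the same comparison is what lets downstream models treat semantically related instructions as close neighbours.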

RELATIONSHIPS

2 connections

DeepVerifier ← uses 100% 2e

DeepVerifier creates instruction embeddings as continuous vector representations for downstream coverage prediction.

transformer language model for instruction sequences ← uses 100% 2e

The transformer language model produces instruction embeddings as continuous vector representations.

LINKED ENTITIES

2 links

DeepVerifier USES Extracted graph relationship

transformer language model for instruction sequences USES Extracted graph relationship

CITATIONS

9 sources

[1] Instruction embeddings represent instructions as continuous, low-dimensional vectors for neural network models in binary analysis tasks such as function boundary detection, binary code search, function prototype inference, and value set analysis. PalmTree: Learning an Assembly Language Model for Instruction Embedding

[2] Earlier instruction-embedding schemes that rely on control-flow context are noisy and sensitive to compiler optimizations, and they tend to ignore complex intra-instruction structures, motivating learned approaches such as PalmTree's self-supervised pre-training on large-scale unlabeled binary corpora. PalmTree: Learning an Assembly Language Model for Instruction Embedding

[3] Treating instructions as 'words' in an NLP-inspired pipeline, a joint learning approach produces instruction embeddings that place similar instructions from different architectures near each other, enabling cross-architecture binary code analysis and semantics-based basic-block comparison. A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

[4] DeepVerifier customizes a language model to tokenize and transform RISC-V assembly instructions into continuous, low-dimensional embedded vectors, which are then consumed by a coverage score predictor and used (via gradients) to update test sequences for higher coverage. DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification

[5] RISC-V instructions are tokenized into opcodes, register operands, and special role tokens (<srcs>, <dsts>, <const>, <mem>…</mem>, <addr>…</addr>); sequence boundaries are marked with <s> and </s>, and intra-instruction commas are replaced with semicolons to facilitate sequence processing. DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification

[6] The token vocabulary integrates opcode categories (Integer Register-Register, Integer Immediate, Floating-Point, Multiplication/Division, Compressed), register symbols (x0…x31, f0…f31), and special role tokens to form a RISC-V-specific tokenizer. DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification

[7] Input tokens are mapped through an embedding matrix TE ∈ ℝ^(vocab_size × d_model) with d_model = 768 and combined with sinusoidal positional encodings via e_i = te_i + pe_pos to form position-aware token representations consumed by transformer blocks. DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification

[8] Transformer blocks composed of multi-head self-attention, feed-forward networks, and add-and-normalize layers contextualize the position-aware token embeddings; the final block's output is the representation of the input instruction (or instruction sequence). DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification

[9] The transformer-based framework is extensible to other ISA families by adapting the tokenization layer and ISA-specific architectural constraints; the self-attention mechanism supports sequences well beyond the 128–256-token range used in the current RISC-V implementation. DeepVerifier: Learning to Update Test Sequences for Coverage-Guided Verification
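The tokenization, vocabulary, and embedding steps summarized in citations [5] to [7] can be sketched roughly as follows. This is an illustrative approximation, not DeepVerifier's implementation: every function name is hypothetical, the opcode list is a small subset, the `<pad>` and `<unk>` tokens, the replacement of immediates by `<const>`, and the wrapping of `offset(base)` operands in `<mem>…</mem>` are assumptions, and the positional encoding uses the standard sinusoidal formula. The demo uses d_model = 16 rather than 768 to keep the output small.

```python
import math
import random
import re

# Assumed vocabulary layout: special/role tokens, a few opcodes, all registers.
SPECIAL = ["<pad>", "<unk>", "<s>", "</s>", "<srcs>", "<dsts>", "<const>",
           "<mem>", "</mem>", "<addr>", "</addr>", ";"]
OPCODES = ["add", "sub", "addi", "lw", "sw", "mul", "div", "fadd.s", "c.add"]
REGISTERS = [f"x{i}" for i in range(32)] + [f"f{i}" for i in range(32)]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + OPCODES + REGISTERS)}

_MEM = re.compile(r"^(-?\w+)\((\w+)\)$")          # offset(base), e.g. 8(x2)
_NUM = re.compile(r"^-?(0x[0-9a-fA-F]+|\d+)$")    # decimal or hex immediate


def tokenize_instruction(line):
    """Split one instruction into tokens, with commas replaced by semicolons."""
    opcode, _, rest = line.strip().replace(",", ";").partition(" ")
    tokens = [opcode]
    operands = [op.strip() for op in rest.split(";") if op.strip()]
    for i, op in enumerate(operands):
        if i > 0:
            tokens.append(";")
        mem = _MEM.match(op)
        if mem:  # assumption: memory operands are wrapped in <mem> ... </mem>
            offset, base = mem.groups()
            tokens += ["<mem>", "<const>" if _NUM.match(offset) else offset,
                       base, "</mem>"]
        elif _NUM.match(op):  # assumption: immediates collapse to <const>
            tokens.append("<const>")
        else:
            tokens.append(op)
    return tokens


def tokenize_sequence(lines):
    """Tokenize a list of instructions and mark the sequence boundaries."""
    tokens = ["<s>"]
    for line in lines:
        tokens += tokenize_instruction(line)
    return tokens + ["</s>"]


def encode(tokens):
    """Map tokens to integer ids, falling back to <unk> for unknown tokens."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]


def sinusoidal_pe(pos, d_model):
    """Standard sinusoidal positional encoding for one position."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe


def make_embedding_matrix(vocab_size, d_model, seed=0):
    """Randomly initialised stand-in for the learned matrix TE."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]


def embed(token_ids, TE):
    """Compute e_i = te_i + pe_pos for every token in the sequence."""
    d_model = len(TE[0])
    return [[t + p for t, p in zip(TE[tid], sinusoidal_pe(pos, d_model))]
            for pos, tid in enumerate(token_ids)]


tokens = tokenize_sequence(["addi x1, x0, 4", "lw x5, 8(x2)"])
TE = make_embedding_matrix(len(VOCAB), d_model=16)
E = embed(encode(tokens), TE)
print(tokens)
print(len(E), len(E[0]))
```

The resulting rows of `E` are the position-aware token representations that a stack of transformer blocks would then contextualize, as described in citation [8].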