[HN Gopher] New and Improved Embedding Model for OpenAI
       ___________________________________________________________________
        
       New and Improved Embedding Model for OpenAI
        
       Author : craigkerstiens
       Score  : 62 points
       Date   : 2022-12-15 18:13 UTC (4 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | tibbar wrote:
       | Note that this model is not able to write code.
       | 
       | Prompt: "Write a C program to RSA encrypt a string."
       | 
       | text-embedding-ada-002:
       | 
        | %-C C C C C C C C
        | 
        | C
        | 
        | C
        | 
        | [...dozens more lines of stray "C"s and "C C" runs omitted...]
        | 
        | C C C aC C C
        | 
        | C
        | 
        | m m m m m m m m m m m m m m m m m m mm m C mC
       | 
       | (Note: a relatively short max length)
       | 
       | Compare text-davinci-003:
       | 
        | #include <stdio.h>
        | #include <math.h>
        | 
        | // Utility function to find GCD of 'a' and 'b'
        | int gcd(int a, int b) {
        |     if (a == 0)
        |         return b;
        |     return gcd(b % a, a);
        | }
        | 
        | // Find the Modular Multiplicative Inverse of 'a' under modulo 'm'
        | int modInverse(int a, int m) {
        |     a = a % m;
        |     for (int x = 1; x < m; x++)
        |         if ((a * x) % m == 1)
        |             return x;
        |     return -1; // no inverse exists
        | }
        | 
        | // Compute a/b under modulo m
        | int modDivide(int a, int b, int m) {
        |     a = a % m;
        |     int inv = modInverse(b, m);
        |     return (inv * a) % m;
        | }
        | 
        | // Calculate (a^b) % c
        | long long int power(int a, int b, int c) {
        |     int res = 1;
        |     a = a % c;
        |     while (b > 0) {
        |         if (b & 1)
        |             res = modDivide(res * a, 1, c);
        |         b = b >> 1;
        |         a = modDivide(a * a, 1, c);
        |     }
        |     return res;
        | }
        | 
        | // C function for extended Euclidean Algorithm
        | int gcdExtended(int a, int b, int *x, int *y) {
        |     if (a == 0) {
        |         *x = 0;
        |         *y = 1;
        |         return b;
        |     }
        |     int x1, y1;
        |     int g = gcdExtended(b % a, a, &x1, &y1);
        |     *x = y1 - (b / a) * x1;
        |     *y = x1;
        |     return g;
        | }
        | 
        | // Function to generate public and private keys
        | void generateKeys(int p, int q, int *e, int *d, int *n) {
        |     // n = p * q
        |     *n = p * q;
        |     // Computing phi(n) = (p-1)*(q-1)
        |     int phi = (p - 1) * (q - 1);
        |     // Choose an integer e such that e and phi(n) are co-prime
        |     *e = 2;
        |     while (*e < phi) {
        |         if (gcd(*e, phi) == 1)
        |             break;
        |         else
        |             ++*e;
        |     }
        |     // Compute d such that it satisfies d*e = 1 + k * phi(n)
        |     int k = 2; // Fixed value
        |     *d = (1 + (k * phi)) / *e;
        | }
        | 
        | // Encryption Function: c = (msg ^ e) % n
        | long int RSA_Encrypt(int msg, int e, int n) {
        |     return power(msg, e, n);
        | }
        | 
        | // Decryption Function: msg = (c ^ d) % n
        | long int RSA_Decrypt(int c, int d, int n) {
        |     return power(c, d, n);
        | }
        | 
        | // Driver program
        | int main() {
        |     // Two small prime numbers
        |     int p = 3;
        |     int q = 7;
        |     // Message to be encrypted
        |     int msg = 15;
        |     // Keys (e, n) and (d, n)
        |     int e, d, n;
        |     generateKeys(p, q, &e, &d, &n);
        |     long int c = RSA_Encrypt(msg, e, n);
        |     printf("Encrypted message is: %ld\n", c);
        |     long int m = RSA_Decrypt(c, d, n);
        |     printf("Original Message is: %ld\n", m);
        |     return 0;
        | }
        
          | varunkmohan wrote:
          | It's an embedding model, so it generates vector embeddings,
          | not text. That's to be expected.
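          | 
          | A minimal sketch of what the endpoint actually returns, using
          | the 2022-era openai Python client (the API key is a
          | placeholder):
          | 
          |     import openai
          | 
          |     openai.api_key = "sk-..."  # placeholder key
          | 
          |     # The embeddings endpoint maps text to a vector; it does
          |     # not generate text, so "write a program" prompts are moot.
          |     resp = openai.Embedding.create(
          |         model="text-embedding-ada-002",
          |         input="Write a C program to RSA encrypt a string.",
          |     )
          |     vector = resp["data"][0]["embedding"]
          |     print(len(vector))  # 1536 dimensions for ada-002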
        
       | lee101 wrote:
       | Also check out the embedding model from https://text-generator.io
       | 
        | It supports some things that OpenAI's model can't do: it
        | retrieves any images linked from web pages and analyses them
        | (including images with text inside) to help the embedding model.
        
       | IanCal wrote:
        | Once I've got embeddings, my naive next step would be to compute
        | cosine similarity for comparisons/search/anything that requires a
        | distance. I see they do that in some examples.
       | 
       | Is that the standard approach these days? Are there newer default
       | approaches that tend to work better?
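        | 
        | For reference, a minimal sketch of that cosine-similarity step
        | with numpy, assuming the embeddings are equal-length float
        | vectors:
        | 
        |     import numpy as np
        | 
        |     def cosine_similarity(a, b):
        |         # cos(theta) = (a . b) / (|a| * |b|)
        |         a, b = np.asarray(a), np.asarray(b)
        |         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        | 
        | OpenAI's embeddings are normalized to unit length, so the
        | denominator is ~1 and this reduces to a plain dot product.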
        
         | visarga wrote:
         | You can also apply clustering or train a classification model
         | based on embeddings.
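          | 
          | A minimal sketch with scikit-learn, assuming `embeddings` is a
          | list of ada-002 vectors and `labels` holds known categories:
          | 
          |     import numpy as np
          |     from sklearn.cluster import KMeans
          |     from sklearn.linear_model import LogisticRegression
          | 
          |     X = np.array(embeddings)  # shape: (n_docs, 1536), assumed
          | 
          |     # Unsupervised: group similar documents together.
          |     clusters = KMeans(n_clusters=5).fit_predict(X)
          | 
          |     # Supervised: treat the embedding as a feature vector.
          |     clf = LogisticRegression(max_iter=1000).fit(X, labels)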
        
       | evergreener wrote:
        | Does anyone know how OpenAI (and others) extend the context
        | windows of things like ChatGPT so far? E.g. if you exceed
        | 2048/8192 (subword) tokens, does the model just chunk the inputs
        | and evaluate separately on the chunks? Is context/state
        | maintained across chunks? I've never seen anyone actually explain
        | this.
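        | 
        | OpenAI hasn't published details, but a common client-side
        | workaround is overlapping chunks, so some context carries across
        | boundaries. A minimal sketch:
        | 
        |     def chunk_tokens(tokens, max_len=2048, overlap=128):
        |         # Overlapping windows preserve some cross-chunk context,
        |         # but no model state is shared between chunks.
        |         step = max_len - overlap
        |         return [tokens[i:i + max_len]
        |                 for i in range(0, len(tokens), step)]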
        
       | bcjordan wrote:
        | The embeddings/search API seems super powerful; I've been meaning
        | to play around with it more. I wonder how its performance
        | compares to ElasticSearch and other text search/classification
        | offerings out there.
        
         | gk1 wrote:
         | The Search and Classification (and Answers) APIs were
         | deprecated last week.[1]
         | 
         | They were never in serious competition with Elastic, as far as
         | search goes. If you wanted to build a semantic search
         | application using OpenAI embeddings, the more common (and
         | scalable) method is to index those embeddings in a vector
         | database like Pinecone.[2] In fact that's what OpenAI
         | recommends to anyone who needs to transition off their Search
         | API.
         | 
         | [1] https://help.openai.com/en/articles/6272952-search-
         | transitio...
         | 
         | [2] https://docs.pinecone.io/docs/openai
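          | 
          | Roughly, with Pinecone's 2022-era Python client (the index
          | name, environment, and vectors are placeholders):
          | 
          |     import pinecone
          | 
          |     pinecone.init(api_key="...", environment="us-east1-gcp")
          |     index = pinecone.Index("semantic-search")  # placeholder
          | 
          |     # Store document embeddings alongside their ids.
          |     index.upsert(vectors=[("doc-1", doc_vector)])
          | 
          |     # Search with the embedding of the user's query string.
          |     results = index.query(vector=query_vector, top_k=5)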
        
           | thirdtrigger wrote:
            | Agreed - one can also use Weaviate, which comes with an OOTB
            | OpenAI module leveraging the embeddings endpoint:
            | https://weaviate.io/developers/weaviate/current/retriever-
           | ve...
        
       | dr_dshiv wrote:
        | What effect will this have on connecting concepts between books,
        | either through summarization or topic mapping?
        
       | gok wrote:
       | > Longer context. The context length of the new model is
       | increased by a factor of four, from 2048 to 8192, making it more
       | convenient to work with long documents.
       | 
        | 8192 words is getting into the range of short stories or a
        | master's thesis, which opens the door to some interesting
        | applications.
        
         | drusepth wrote:
          | Important to note that these tokens _can_ be words, but
          | oftentimes a word will comprise multiple tokens, so 8192
          | tokens = 8192 words isn't strictly correct.
          | 
          | That said, your point stands. Most short stories run to low-to-
          | mid four-digit word counts, and the jump from 2048 tokens to
          | 8192 squarely covers that range.
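          | 
          | The gap is easy to check with OpenAI's tiktoken tokenizer
          | (assuming cl100k_base is the encoding ada-002 uses):
          | 
          |     import tiktoken
          | 
          |     enc = tiktoken.get_encoding("cl100k_base")
          |     text = "antidisestablishmentarianism"
          |     # One word, but several subword tokens.
          |     print(len(text.split()), "word ->",
          |           len(enc.encode(text)), "tokens")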
         | 
         | As someone who's been working on multi-layered approaches to
         | using GPT-like models for long text generation (e.g. synopsis
         | -> outline -> paragraph expansions) to get around the limited
         | context window, it'll be interesting to see if people will keep
         | working towards that end or if it'll all become a moot point as
         | the effective context window continues to scale up.
        
       ___________________________________________________________________
       (page generated 2022-12-15 23:01 UTC)