Text Pre-Processing to Latent Semantic Indexing

Slide 2:

  • Text preprocessing
    • Stopwords removal
    • Stemming
      • Basic stemming methods
        • Removing endings
        • Transforming words
    • Digits
    • Hyphens
    • Punctuation Marks
    • Case of Letters
    • Identifying different text fields
      • Identifying anchor text
      • Removing HTML tags
      • Identifying main content blocks
        • Partitioning based on visual cues
        • Tree matching
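
The preprocessing steps in the list above can be sketched in a few lines of Python. This is only an illustration: the stopword list, the suffix rules, and the names `strip_html`, `simple_stem`, and `preprocess` are hypothetical choices, not taken from the slides, and the stemmer is a naive "remove ending" rule set rather than a full algorithm such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

# Naive suffix rules for "remove ending" stemming (hypothetical rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def simple_stem(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return the list of index terms for one document."""
    text = strip_html(text).lower()        # HTML tags, case of letters
    text = text.replace("-", " ")          # hyphens: split into separate words
    tokens = re.findall(r"[a-z]+", text)   # drops digits and punctuation marks
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("<p>The State-of-the-art indexing of 3 Documents!</p>")` yields `["state", "art", "index", "document"]`. Whether digits and hyphens should really be dropped depends on the application; this sketch simply picks one option for each.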

Slide 9: Duplicate detection: n-grams
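
One common way to use n-grams for duplicate detection is to compare the sets of word n-grams (often called shingles) of two documents with the Jaccard coefficient. The sketch below assumes that approach. The function names and the choice of n = 3 are illustrative, and the slides may use a different similarity measure or threshold.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


d1 = "the quick brown fox jumps"
d2 = "the quick brown fox sleeps"
similarity = jaccard(shingles(d1), shingles(d2))  # 2 shared of 4 distinct shingles
```

Two documents would then be treated as near-duplicates when the similarity exceeds a chosen threshold.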

Slide 11: Inverted index

Slide 14: Search using inverted index

Slide 16: Index construction
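
As a rough illustration of the inverted index slides, the sketch below builds an in-memory index that maps each term to a sorted list of document IDs and answers Boolean AND queries by merging postings lists. The names `build_index`, `intersect`, and `search` are hypothetical, and a real index would also store positions or frequencies and would be built on disk for large collections.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping the IDs present in both."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return common


def search(index, query):
    """Return IDs of documents that contain every query term (Boolean AND)."""
    postings = [index.get(term, []) for term in query.lower().split()]
    if not postings:
        return []
    postings.sort(key=len)  # start with the shortest list to keep merges small
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result


docs = {1: "web mining and search", 2: "web search engine", 3: "data mining"}
index = build_index(docs)
```

With this toy collection, `search(index, "web search")` returns `[1, 2]` and `search(index, "web mining")` returns `[1]`.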

Slide 21:

  • Inverted Index Compression
    • Variable-bit scheme
      • Unary coding
      • Elias gamma coding
      • Elias delta coding
      • Golomb coding
    • Variable-byte scheme

Slide 24: How document IDs are stored using gaps
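
Gap storage can be sketched as follows: instead of the sorted document IDs themselves, the postings list keeps the first ID and then the difference between each ID and its predecessor, which gives smaller numbers that compress better. The function names are illustrative.

```python
def to_gaps(doc_ids):
    """Convert sorted document IDs to the first ID followed by successive gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover the original document IDs by taking running sums of the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


gaps = to_gaps([4, 10, 300, 305])  # [4, 6, 290, 5]
```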

Slide 25: Unary Coding
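
A minimal sketch of unary coding, assuming the convention that an integer x ≥ 1 is written as x − 1 zero bits followed by a single one bit. Some texts use the opposite bit polarity, so this may differ from the slides. Bit strings are modelled as Python strings of "0" and "1" for readability.

```python
def unary_encode(x):
    """Encode x >= 1 as (x - 1) zeros followed by a terminating one."""
    if x < 1:
        raise ValueError("unary coding is defined here for x >= 1")
    return "0" * (x - 1) + "1"


def unary_decode(bits, pos=0):
    """Decode one unary code starting at pos; return (value, next position)."""
    end = bits.index("1", pos)  # position of the terminating one bit
    return end - pos + 1, end + 1
```

Under this convention, `unary_encode(5)` gives `"00001"`.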

Slide 27: Elias Gamma Coding

Slide 28: Elias Gamma Decoding
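
Elias gamma coding and decoding can be sketched as below. The sketch assumes the usual formulation: for x ≥ 1 with N = ⌊log2 x⌋, write N zero bits and then the (N + 1)-bit binary form of x, which always starts with a one. This is the same as coding N + 1 in unary and appending the remaining N bits of x. Function names are illustrative.

```python
def gamma_encode(x):
    """Encode x >= 1 as N zeros followed by the (N + 1)-bit binary form of x."""
    if x < 1:
        raise ValueError("Elias gamma coding is defined for x >= 1")
    binary = bin(x)[2:]  # binary digits of x, leading digit is always 1
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma code starting at pos; return (value, next position)."""
    n = 0
    while bits[pos + n] == "0":  # count leading zeros to learn the length
        n += 1
    start = pos + n
    value = int(bits[start:start + n + 1], 2)
    return value, start + n + 1
```

For example, `gamma_encode(9)` gives `"0001001"`: three zeros because ⌊log2 9⌋ = 3, then `1001`.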

Slide 29: Elias Delta Coding

Slide 30: Elias Delta Coding
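
A sketch of Elias delta coding under the standard definition: the bit length of x (that is, 1 + ⌊log2 x⌋) is written in Elias gamma code, followed by the binary form of x without its leading one bit. A small gamma helper is repeated here so that the block runs on its own. All function names are illustrative.

```python
def gamma_encode(x):
    """Elias gamma code of x >= 1 (helper used for the length field)."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """Encode x >= 1 as gamma(bit length of x) plus x without its leading 1."""
    if x < 1:
        raise ValueError("Elias delta coding is defined for x >= 1")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary[1:]


def delta_decode(bits, pos=0):
    """Decode one delta code starting at pos; return (value, next position)."""
    # First decode the gamma-coded bit length of the value.
    n = 0
    while bits[pos + n] == "0":
        n += 1
    start = pos + n
    length = int(bits[start:start + n + 1], 2)
    pos = start + n + 1
    # Then restore the implicit leading 1 and read the remaining length - 1 bits.
    value = int("1" + bits[pos:pos + length - 1], 2)
    return value, pos + length - 1
```

For example, `delta_encode(9)` gives `"00100001"`: the length 4 in gamma code (`00100`) followed by `001`.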

Slide 31: Golomb Coding

Slide 33: Golomb Decoding

Slide 34: Golomb Decoding Example
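
The following sketch shows one common formulation of Golomb coding with parameter b, and it may not match the slides exactly. It assumes q = ⌊x / b⌋ and r = x − q·b, codes q + 1 in unary (q zeros then a one), and codes the remainder r in truncated binary: with i = ⌊log2 b⌋ and d = 2^(i+1) − b, remainders below d use i bits and the rest use i + 1 bits after adding d. Some texts divide x − 1 instead of x. Function names are illustrative.

```python
def golomb_encode(x, b):
    """Encode x >= 0 with Golomb parameter b >= 1 (assumed convention, see above)."""
    if x < 0 or b < 1:
        raise ValueError("need x >= 0 and b >= 1")
    q, r = divmod(x, b)
    code = "0" * q + "1"          # q + 1 in unary
    i = b.bit_length() - 1        # floor(log2 b)
    d = (1 << (i + 1)) - b        # number of short remainder codewords
    if r < d:
        if i > 0:
            code += f"{r:0{i}b}"  # short codeword: r in i bits
    else:
        code += f"{r + d:0{i + 1}b}"  # long codeword: r + d in i + 1 bits
    return code


def golomb_decode(bits, b, pos=0):
    """Decode one Golomb code starting at pos; return (value, next position)."""
    end = bits.index("1", pos)    # unary part ends at the first one bit
    q = end - pos
    pos = end + 1
    i = b.bit_length() - 1
    d = (1 << (i + 1)) - b
    r = int(bits[pos:pos + i], 2) if i > 0 else 0
    pos += i
    if r >= d:
        # Long codeword: read one more bit and undo the offset d.
        r = ((r << 1) | int(bits[pos])) - d
        pos += 1
    return q * b + r, pos
```

Under these assumptions, `golomb_encode(9, 3)` gives `"00010"` (quotient 3, remainder 0) and `golomb_encode(10, 3)` gives `"000110"`. When b is a power of two, every remainder uses exactly log2 b bits.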

Slide 35: Variable-Byte Coding

Slide 36: Variable-Byte Decoding
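
Variable-byte coding can be sketched as below. Conventions differ between texts on which bit is the flag and whether it marks the last byte or a continuation. This sketch assumes seven payload bits per byte, most significant group first, with the high bit set only on the final byte of each number. The function names are illustrative.

```python
def vb_encode(x):
    """Encode x >= 0 into bytes with 7 payload bits each; high bit marks the last byte."""
    if x < 0:
        raise ValueError("variable-byte coding is defined for x >= 0")
    groups = [x & 0x7F]
    x >>= 7
    while x:
        groups.append(x & 0x7F)
        x >>= 7
    groups.reverse()      # most significant 7-bit group first
    groups[-1] |= 0x80    # flag the final byte of this number
    return bytes(groups)


def vb_decode(data):
    """Decode a byte string holding any number of concatenated codes."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:   # final byte of the current number
            values.append(current)
            current = 0
    return values
```

With this convention, 5 fits in the single byte `0x85`, while 824 needs two bytes, `0x06 0xB8`. Decoding works on whole bytes, which is why this scheme is usually faster to decode than the bit-level codes, at the cost of some space.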

Slide 38: Space vs. time trade-off

Slide 39: Latent Semantic Indexing

Slide 42: Singular Value Decomposition

Slide 44: LSI organization

Slide 45: Query and Retrieval
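
The core LSI relations can be summarized in the following commonly used form. The notation is an assumption and may differ from the slides: A is the t × d term-document matrix, k is the number of retained singular values, q is a query written as a column vector over the t terms, and d_k is the k-dimensional vector of one document (the corresponding row of V_k).

```latex
% Rank-k approximation of A obtained by truncating its singular value decomposition
A \approx A_k = U_k \Sigma_k V_k^{T}

% Folding the query q into the k-dimensional concept space
q_k = \Sigma_k^{-1} U_k^{T} q

% Ranking documents by cosine similarity in the concept space
\operatorname{sim}(q_k, d_k) = \frac{q_k \cdot d_k}{\lVert q_k \rVert \, \lVert d_k \rVert}
```

Applying the same folding to a column of A returns that document's row of V_k, which is why queries and documents can be compared directly in the reduced space.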

Slide 52: LSI disadvantages
