
vector question

In our data set of line items we intend to categorize rows based on the
similarity of their textual descriptions. One mechanism for this is a
bag-of-words (BoW) approach, which creates a vector space stored
in a term-document matrix (TDM) or document-term matrix (DTM). Assuming lines
that include a National Stock Number (NSN) are correctly identified as such, we
could create labelled data by gathering all lines with the same NSN and building a
comprehensive TDM/DTM that captures their textual descriptions to the best of our current capability.

Given the size of our data set (24 million rows) and the number of NSNs in our library (7 million),
it would be useful to store these vector space representations (TDM/DTM) for later
use instead of recalculating them. One way to do this would be to create a directory structure
based on the codes (the first four digits of an NSN) and write the TDM/DTM there. R allows
the creation of a binary file of the data structure (typically an RDS or RDA file).

Background on these codes:
http://www.acquisition.gov/psc-manual
http://support.outreachsystems.com/resources/tables/pscs/
http://dodprocurementtoolbox.com/site-pages/psc-tool
A good explanation: http://www.gsa.gov/cdnstatic/4%20-%20PPDS%20FSC%20Best%20Practice%20508.docx
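
The storage idea above can be sketched roughly as follows. This is only an illustration in Python, with JSON files standing in for R's RDS files; the layout `<root>/<first four digits>/<nsn>.json` and the helper names `vector_path`, `save_vector` and `load_vector` are assumptions made for the example, not something fixed by the approach.

```python
import json
from pathlib import Path


def vector_path(root, nsn):
    """Hypothetical layout: <root>/<first four digits of the NSN>/<nsn>.json"""
    return Path(root) / nsn[:4] / f"{nsn}.json"


def save_vector(root, nsn, counts):
    """Write the term counts for one NSN so they need not be recalculated."""
    path = vector_path(root, nsn)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(counts))
    return path


def load_vector(root, nsn):
    """Read back the term counts previously stored for one NSN."""
    return json.loads(vector_path(root, nsn).read_text())
```

In R the same round trip would typically go through `saveRDS()` and `readRDS()`.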

The only way to meaningfully perform matrix calculations on these TDM/DTM later is for
the matrices to have the same dimensions, with the terms in the same order. To achieve this we would need to fix
the terms we use in advance and then consistently use only those in our TDM/DTM.
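
A minimal sketch of what a fixed word list buys us, assuming plain whitespace tokenization and raw term counts. The vocabulary and the sample descriptions are made up for illustration, and `count_vector` and `document_term_matrix` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical fixed word list, agreed on in advance.
VOCABULARY = ["lamp", "incandescent", "bulb", "volt"]


def count_vector(description, vocabulary=VOCABULARY):
    """Return term counts for one description, in vocabulary order."""
    counts = Counter(description.lower().split())
    # Terms outside the vocabulary are dropped; absent terms count as 0.
    return [counts[term] for term in vocabulary]


def document_term_matrix(descriptions, vocabulary=VOCABULARY):
    """One row per description, one column per vocabulary term."""
    return [count_vector(d, vocabulary) for d in descriptions]


dtm = document_term_matrix(["LAMP INCANDESCENT 28 VOLT", "bulb lamp lamp"])
print(dtm)
```

Because every row is built against the same ordered word list, any two matrices produced this way have the same columns, whichever descriptions they came from.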

My question is the following: if I create one word list and consistently
use it for every TDM/DTM, is it correct to assume that I can simply add the matrices derived from it together?

http://www.wingovernmentcontracts.com/national-stock-number-nsn.htm

Take for example:
The National Stock Number is a 13-digit number.

Example: 6240-00-027-2059

Federal Supply Group (FSG): Positions 1-2 (62)
Federal Supply Class (FSC): Positions 1-4 (6240)
NATO Country Code: Positions 5-6 (00)
National Item Identification Number (NIIN): Positions 5-13 (000272059)
Serial Number: Positions 7-13 (0272059)
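
The breakdown above can be expressed as a small parser. This is a sketch that assumes the NSN is always written in the hyphenated 4-2-3-4 form shown in the example; `parse_nsn` and the dictionary keys are names chosen for illustration.

```python
import re

# Assumes the hyphenated 4-2-3-4 layout used in the example above.
NSN_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{3})-(\d{4})$")


def parse_nsn(nsn):
    """Split a hyphenated NSN into the components listed above."""
    match = NSN_PATTERN.match(nsn.strip())
    if match is None:
        raise ValueError(f"not a hyphenated 13-digit NSN: {nsn!r}")
    fsc, country, part_a, part_b = match.groups()
    return {
        "fsg": fsc[:2],                      # positions 1-2
        "fsc": fsc,                          # positions 1-4
        "country_code": country,             # positions 5-6
        "niin": country + part_a + part_b,   # positions 5-13
        "serial": part_a + part_b,           # positions 7-13
    }


parts = parse_nsn("6240-00-027-2059")
print(parts)
```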

Now find all rows with “6240-00-027-2059”,
grab all their textual descriptions, and create a TDM/DTM.
Then put it in a /6240/00/027/2059 directory.
If I gather (and sum) all the matrices in the subdirectories of
/6240/, will I have the correct TDM/DTM for 6240?
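
One way to check the idea on a toy case, as a sketch rather than a definitive answer. It assumes each NSN's matrix has been collapsed to a vector of raw term totals over a shared, fixed vocabulary. The vocabulary, the descriptions and the second NSN are made up, and `term_totals` and `add_vectors` are hypothetical helpers.

```python
# Hypothetical fixed vocabulary and per-NSN descriptions (illustrative data only).
VOCABULARY = ["lamp", "incandescent", "bulb"]
ROWS = {
    "6240-00-027-2059": ["lamp incandescent", "lamp bulb"],
    "6240-00-111-2222": ["bulb bulb lamp"],  # made-up NSN
}


def term_totals(descriptions):
    """Raw term counts over all descriptions, in vocabulary order."""
    totals = [0] * len(VOCABULARY)
    for text in descriptions:
        words = text.lower().split()
        for i, term in enumerate(VOCABULARY):
            totals[i] += words.count(term)
    return totals


def add_vectors(a, b):
    """Element-wise sum; only defined when both use the same vocabulary."""
    if len(a) != len(b):
        raise ValueError("vectors were built from different vocabularies")
    return [x + y for x, y in zip(a, b)]


# Per-NSN vectors, as they might be stored under /6240/...
per_nsn = {nsn: term_totals(texts) for nsn, texts in ROWS.items()}

# Sum everything whose NSN starts with 6240.
summed = [0] * len(VOCABULARY)
for nsn, vector in per_nsn.items():
    if nsn.startswith("6240-"):
        summed = add_vectors(summed, vector)

# Counting directly over all 6240 descriptions at once.
direct = term_totals([text for texts in ROWS.values() for text in texts])
print(summed, direct)
```

With raw counts the two results agree, because counting is additive. Weighted or normalized values (for example TF-IDF) would not add up this way. Matrices that keep one row per description have differing row counts, so they would be stacked by rows instead of summed element by element.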
