Generating Descriptors Workflow

From Rasulev Lab Wiki

Latest revision as of 21:14, 26 September 2022

Procedure document for using ChemSketch, Avogadro, Dragon, Excel, HyperChem, and BuildQSAR to obtain descriptors.

Purpose: Use the ChemSketch, ChemDraw, HyperChem, Avogadro, Dragon5, BuildQSAR, and QSARINS programs to generate structures, descriptors, and models. OpenBabel is used to convert files from one format to another.


ChemSketch

ChemSketch can convert SMILES to a structure and a structure to SMILES.

  1. Open ChemSketch and close the small popup windows by clicking the X at the top right of each one.
  2. Draw the desired molecule in skeletal format.
  3. Once the molecule is drawn, go to “Tools” and click “Clean Structure.” This makes the structure easier for the other programs to read.
  4. Go to “Tools” again and then “Generate.” Generate the “Name of Structure,” “Smiles Notation,” and “InChi for Structure.”
  5. Open a blank Excel document and paste in the three items you generated for the molecule. These help keep your data organized; you may not need all three again, but it is best to have them on hand. You will use the “Name of Structure” for naming the files. ** Assign a number to each molecule: if you have 10 molecules, number them 1-10. Organization matters a great deal in this work!
  6. Once the three items are in Excel and the molecule has a number, delete the names from the ChemSketch drawing. Keep the structure and save it as “MDL Molfiles (*.mol).” Any file name works, but the preferred format is “Assigned # Name of Structure,” for example “33 2-ethenyl-5-nitrofuran.mol”. The “.mol” extension should be added automatically once you select “MDL Molfiles (*.mol).”
  7. Exit ChemSketch.
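
The file-naming convention from step 6 can be sketched in a few lines of Python. This is only an illustration: the helper name `mol_filename` is hypothetical, and replacing characters that Windows does not allow in file names is an added assumption, not part of the original procedure.

```python
import re

def mol_filename(number: int, name: str) -> str:
    """Build '<assigned number> <name of structure>.mol' as in step 6."""
    # Assumption: swap characters Windows forbids in file names for "_".
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", name.strip())
    return f"{number} {safe_name}.mol"

print(mol_filename(33, "2-ethenyl-5-nitrofuran"))
```

Keeping the assigned number at the front of the name means the files sort in the same order as the rows in the Excel sheet.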
ChemDraw

ChemDraw can convert the name of a structure to a structure.

  1. ChemDraw has a special feature: given the name of a polymer, it will generate the monomer unit for you.
  2. Open ChemDraw Professional.
  3. Go to “Structure.”
  4. Select “Convert Name to Structure.”
  5. Type the name of the polymer without “poly.” For example, for “Polybutane” type only “butane.”
  6. ChemDraw will return a generalized structure. This gives you the correct monomer unit, or at least a generalized version of it.
  7. *** ONLY HAVE THE STRUCTURE IN THE “*.MOL” FILE *** Delete the text after the model is generated, because the text will cause issues later.
  8. THE NEXT STEPS ARE FOR CREATING THE STRUCTURE AND NAME THE REGULAR WAY, BY DRAWING.
  9. Draw the desired molecule.
  10. Select the model by pressing the icon near the top left of the screen.
  11. Once selected, go to “Structure” and click “Clean up Structure.” This makes the structure easier for the other programs to read.
  12. To generate the name of this structure, go to “Structure” and press “Convert structure to name.”
  13. Open a blank Excel document and paste in the name you generated for the molecule. This helps keep your data organized.
  14. Once the model is generated and the name is in Excel, assign the molecule a number in Excel. Then delete the names from the ChemDraw drawing. Keep the structure and save it as “MDL Molfiles (*.mol).” Any file name works, but the preferred format is “Assigned # Name of Structure,” for example “33 2-ethenyl-5-nitrofuran.mol”. The “.mol” extension should be added automatically once you select “MDL Molfiles (*.mol).”
  15. *** ONLY HAVE THE STRUCTURE IN THE “*.MOL” FILE *** Delete the text after the model is generated, because the text will cause issues later.
  16. Exit ChemDraw.
Avogadro
  1. After completing these steps for every molecule you want to create, open Avogadro.

  2. Go to “File” and “Open” to bring a molecule you created in ChemSketch into Avogadro. Find the molecule and select it to open it. Click “Yes” when a small window pops up about 3D coordinates and a rough sketch.

  3. Once the molecule is in Avogadro, go to “Extensions” and select “Optimize geometry.”

  4. * Note: there is another way to optimize geometry in Avogadro; it is not yet documented here.

  5. Save the file as “Sybyl Mol2 (*.mol2).”

  6. Exit Avogadro.

Dragon5

It is fine to use *.mol, *.hin, or any other format that Dragon accepts.

  1. Open “Dragon5.exe” and close the small windows that pop up.

  2. Select “Calculate Descriptors” and then select all the “.mol2” files that you want to calculate. Hold the ‘Ctrl’ key while selecting so that all desired files can be selected.

  3. Press the green check mark “OK” when all desired files are selected.

  4. Choose the descriptors you want calculated for these molecule files (“X” means checked). Then press “RUN.”

  5. Press Continue when the small window pops up.

  6. A yellow window pops up with information about the calculations. If it lists no errors, close it.

  7. Select “Save Descriptors.”

  8. Make sure “Constant Variables” and “Near-Constant Variables” are checked (“X”). Select “Pair Correlation” and enter 0.95 or a value close to it.

  9. Press “Save” and save as “.txt” file.

  10. Leave Dragon open for future use.

  11. Find the file you saved and open it with Notepad++: right-click the file and select “Edit with Notepad++.”

  12. In Notepad++, select all the information and copy it into an Excel worksheet. If desired, you can delete the first 2 rows of the information in Notepad++. Before deleting them, write down the number of descriptors, which is listed in the 2nd row, 3rd column; you will need it later.

  13. Once all the information is saved in Excel, close the Notepad++ file.
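
For readers who prefer a script to the copy-and-paste in steps 11-13, the sketch below reads a descriptor text file with Python's standard `csv` module. It rests on assumptions that should be checked against your own Dragon output: that the file is tab-delimited, that the first two rows are header information with the descriptor count in the 2nd row, 3rd column (as noted in step 12), and that the next row holds the column names. The function name and the sample text are hypothetical.

```python
import csv
import io

def read_descriptor_table(text, skip_rows=2, delimiter="\t"):
    """Return (number of descriptors, column names, data rows)."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    # Per step 12: the descriptor count sits in the 2nd row, 3rd column.
    n_descriptors = int(rows[1][2])
    header = rows[skip_rows]          # assumed: column names follow the 2 header rows
    data = rows[skip_rows + 1:]
    return n_descriptors, header, data

# Hypothetical sample, NOT real Dragon output.
sample = (
    "hypothetical title line\n"
    "info\t2\t3\n"
    "No.\tNAME\tMW\tAMW\tSv\n"
    "1\tmol_a\t139.11\t9.27\t9.10\n"
    "2\tmol_b\t98.10\t8.18\t7.45\n"
)
n_descriptors, header, data = read_descriptor_table(sample)
print(n_descriptors, header, len(data))
```

For a real file, the text could be obtained with `open(path, encoding="utf-8").read()`; the encoding is also an assumption.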

Splitting to Training and Test sets

In the Excel worksheet, split the whole set into training and test sets: sort your molecules by the “Experimental values” (“Y”) column in ascending or descending order. After sorting, select every 5th compound (row) for a 20% test set, or every 4th for a 25% test set, and copy those rows to another Excel sheet named Test set. Remove the test compounds from the initial set to make the training set.

Use the obtained training set to build a dataset in BuildQSAR.
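
The sort-then-pick-every-nth split described above can be sketched in Python as follows. The compound names and Y values are hypothetical, and `split_every_nth` is an illustrative helper rather than part of any of the programs used here.

```python
def split_every_nth(compounds, n=5):
    """Sort (name, y) pairs by y, then move every n-th one into the test set."""
    ordered = sorted(compounds, key=lambda c: c[1])
    test = ordered[n - 1::n]  # 5th, 10th, ... compound after sorting
    training = [c for i, c in enumerate(ordered, start=1) if i % n != 0]
    return training, test

# Hypothetical experimental values for 10 compounds.
y_values = [3.2, 1.1, 4.8, 2.5, 0.7, 5.9, 2.9, 3.7, 1.8, 4.1]
compounds = [(f"mol{i}", y) for i, y in enumerate(y_values, start=1)]

training, test = split_every_nth(compounds, n=5)
print(len(training), len(test))
print(test)
```

Using `n=4` gives the 25% test set. Because the compounds are sorted first, the test set spans the same range of Y values as the training set.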

BuildQSAR
  1. Open “BuildQSAR.”

  2. Go to “File” then “New.”

  3. Add a Title to “Dataset Title.”

  4. Set “Compounds” to the number of compounds in your data and “Descriptors” to the number of descriptors you have. The number of descriptors was obtained from the Notepad++ file.

  5. Click “Ok.”

  6. Input observed data into the yellow column. The yellow column is for the observed information from other sources such as research papers.

  7. ** The observed data should already be in Excel at this point.

  8. Copy and paste the descriptor information from Excel into the blue cells of BuildQSAR.

  9. Go to “QSAR,” then “Variable Selection,” then “Systematic Search” or “Genetic Algorithm.” (Note: choose Genetic Algorithm only when you need 4, 5, or more variables in the model.)

  10. A small popup window will appear. Make sure the 2 boxes under “Cross Validation” are checked. The correlation criterion can vary; if you are unsure, use 0.6 as the default.

  11. “Descriptors per Model” is usually set using the 5-1 rule, which relates the number of molecules you have to the number of variables (AKA descriptors) in your model (AKA equation): at most one descriptor per 5 molecules. Example: with 24 molecules, 24 / 5 = 4.8, so enter 4 in “Descriptors per Model.” ** DON’T ROUND UP **

  12. “No. of generations” can vary (200-500), but 200 is a reasonable default.

  13. “Models per Generation” should be at least 3 (better to have between 5-10).

  14. Press “Run.”

  15. Double Click on any of the cells in the first row.

  16. A popup window with a model (AKA equation) will appear.

  17. Copy and paste the model and the information in parentheses “()” into Excel.

  18. Close out of the window with the model information.

  19. Repeat for the other rows, so that the model and “()” information from all three rows is copied.

  20. These rows are the different models that BuildQSAR generated for you.
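
The 5-1 rule from step 11 amounts to integer division that always rounds down. A minimal sketch, with a hypothetical helper name:

```python
def max_descriptors(n_molecules: int, ratio: int = 5) -> int:
    """Largest number of descriptors allowed by the 5-1 rule (never rounded up)."""
    return n_molecules // ratio

print(max_descriptors(24))  # 24 / 5 = 4.8, rounded down to 4
```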

Descriptor Information
  1. Each model (AKA equation) has a certain number of descriptors in it. In this example there are 3:
     Y1 = -1.6198 (± 1.1586) X269 + 0.0110 (± 0.0016) X631 - 0.0336 (± 0.0094) X634 + 0.8483 (± 0.3386)

  2. To find the meaning of these descriptors, go to your Excel file and, above your descriptor data, create a row labelled X1 to X#, where # is the number of descriptors in one row. X1 goes above the first descriptor, usually MW.

  3. In Excel, find the descriptors that are listed in the model and read the abbreviation under each X value. Example: X269 corresponded to the abbreviation “MATS3e.”

  4. Once you have all the descriptor abbreviations, go back to Dragon.

  5. Select “Descriptor Search.”

  6. Type an abbreviation you found from the Excel file.

  7. Press “Search”

  8. This will give the descriptor information.
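
The X1 to X# labelling in steps 2-3 is a position-to-name lookup, which can be sketched as below. The three-name list is a hypothetical stand-in; with a full descriptor table, the same lookup is how a label such as X269 would resolve to its abbreviation.

```python
def x_labels(descriptor_names):
    """Map 'X1', 'X2', ... to descriptor abbreviations in column order."""
    return {f"X{i}": name for i, name in enumerate(descriptor_names, start=1)}

# Hypothetical, shortened header row; a real one has many more columns.
names = ["MW", "AMW", "Sv"]
labels = x_labels(names)
print(labels["X1"])
```

The list passed in must contain only the descriptor columns, in the same order as they were pasted into BuildQSAR, or the numbering will be shifted.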

Test Set predictions

Use the obtained model to calculate “Predicted Experimental values” for the Test set.

To do that, identify the descriptors selected by the model, locate them among the test set descriptors, and calculate the “Predicted Experimental values” by applying the mathematical model from the training set to the values of the selected descriptors for each compound in the test set.

Do the same for all models, from 1 to 5 variables. Calculate predictive R2 values from the correlation of the “Experimental values” and “Predicted Experimental values” in Excel (the CORREL( ) function gives R, which is squared to get R2; RSQ( ) gives R2 directly).

Build a graph for all models from 1 to 5 variables, with the model number on the X axis and the R2 value of each model on the Y axis (separate lines for the training and test sets).

Select the best model by finding the maximum R2 value among the test set results.
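
The prediction and R2 steps above can be sketched in Python. The coefficients are those of the example three-descriptor model from this workflow (Y1 = -1.6198 X269 + 0.0110 X631 - 0.0336 X634 + 0.8483, uncertainties omitted). The test set values and the per-model R2 numbers are hypothetical. R2 is computed as the squared Pearson correlation, matching the Excel approach described above.

```python
COEFFICIENTS = {"X269": -1.6198, "X631": 0.0110, "X634": -0.0336}
INTERCEPT = 0.8483

def predict(descriptors):
    """Apply the training-set model to one compound's descriptor values."""
    return INTERCEPT + sum(c * descriptors[name] for name, c in COEFFICIENTS.items())

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Hypothetical test set: (descriptor values, experimental Y).
test_set = [
    ({"X269": 0.10, "X631": 120.0, "X634": 15.0}, 1.50),
    ({"X269": -0.05, "X631": 80.0, "X634": 30.0}, 0.70),
    ({"X269": 0.20, "X631": 200.0, "X634": 10.0}, 2.40),
]
observed = [y for _, y in test_set]
predicted = [predict(d) for d, _ in test_set]
r2 = r_squared(observed, predicted)
print(r2)

# Hypothetical test-set R2 for models with 1 to 5 variables; pick the maximum.
test_r2_by_model = {1: 0.61, 2: 0.72, 3: 0.78, 4: 0.74, 5: 0.70}
best_model = max(test_r2_by_model, key=test_r2_by_model.get)
print(best_model)
```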

Open Babel GUI

This section applies when one file type needs to be converted to another. Usually this is done to get HOMO and LUMO (extra descriptors) from HyperChem (another 3D software), which uses “*.hin” files. It is also used to convert a file to SMILES notation if needed. In this example we convert “*.mol” files to SMILES notation and to “*.hin” format (the HyperChem format).

  1. Open “Open Babel GUI”

  2. At the top left of the window that opens, choose the file type that you have. For example, “mol -- MDL MOL format” is a common one.

  3. Once chosen, you can select how OpenBabel reads the data. In this example, “Use this format for all input files (ignore file extensions)” is checked.

  4. Click the “…” box on the left side to choose the file you would like converted.

  5. Once the file is chosen, it should appear in the bar to the left of the “…” box.

  6. Under “OUTPUT FORMAT” at the very top right, choose the format you want to convert to. In this example choose “smiles - - SMILES format.”

  7. Check “Add Hydrogens (make explicit)” in the middle column.

  8. Check “Generate 3D coordinates” in the middle column.

  9. Press “Convert”; the result should be listed in the right column.

  10. Copy the SMILES notation and save it in another file, such as Excel or Notepad. Here is one example of a SMILES notation: “OC(F)(F)[C@H](F)C(F)(F)F”

  11. The same procedure can be used to create the “*.hin” format, by choosing “hin - - HyperChem HIN format” under OUTPUT FORMAT.

  12. Make sure “Output below only (no output file)” and “Display in firefox” are NOT checked. These are in the right-hand column (AKA the output column).

  13. Before pressing “Convert,” specify an output file and location. A good idea is to give it exactly the same name as the input file, but with the “.mol” ending changed to “.hin” (THE “.hin” MUST BE INCLUDED IN THE OUTPUT FILE NAME).

  14. Press “Convert.”

  15. The output file should be in the location with the name specified.
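
When many files need converting, the GUI steps above have a command-line counterpart in Open Babel's `obabel` tool. The sketch below only builds the command. It assumes `obabel` is installed and on the PATH, and the helper name is hypothetical. The `-O` (output file), `-h` (add hydrogens), and `--gen3d` (generate 3D coordinates) options correspond to the choices made in steps 6-8 and 13.

```python
from pathlib import Path

def obabel_command(input_path, output_ext="hin", add_hydrogens=True, gen3d=True):
    """Build an obabel command that keeps the input name and swaps the extension."""
    src = Path(input_path)
    out = src.with_suffix("." + output_ext)  # same name, ".mol" -> ".hin"
    cmd = ["obabel", str(src), "-O", str(out)]
    if add_hydrogens:
        cmd.append("-h")
    if gen3d:
        cmd.append("--gen3d")
    return cmd

cmd = obabel_command("33 2-ethenyl-5-nitrofuran.mol")
print(cmd)
# To actually run it (requires Open Babel to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems with the spaces in the preferred file names.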

HyperChem

This section depends on the need for quantum descriptors such as HOMO, LUMO, and Total energy.

  1. Open “HyperChem”

  2. Open a file that has “.hin” extension

  3. Press “Setup” on the tool bar

  4. Press “Semi-empirical”

  5. Press “RM1”

  6. Press “OK”

  7. Press “Compute” on the tool bar

  8. Select “Geometry Optimization”

  9. For regular work make sure “Polak-Ribiere (Conjugate gradient)” is selected.

  10. Enter 0.1 in the top field (RMS gradient) and 600 in the bottom field.

  11. Have “In vacuo” selected

  12. For the “Screen refresh period” use 1 cycle.

  13. The settings in steps 9-12 should be the defaults

  14. Press “OK”

  14. Press “OK”

  15. Once the optimization has stopped (the numbers at the bottom right or left have stopped changing), select “Compute”

  16. Select “Properties”

  17. All the information in “Properties” can be used as quantum descriptors if needed

  18. Pressing “Details” gives more information, which can also be used as descriptors if needed.

  19. Exit out of “Properties”

  20. Press “Compute”

  21. Select “Orbitals”

  22. The screen that pops up shows 2-4 different columns, the left being Alpha orbitals and the right being Beta orbitals. The pink/purple column shows the orbitals above the HOMO-LUMO gap, and the green column shows those below it.

  23. Each line in these columns gives you the energy in “eV” for that orbital.

  24. The bottommost line of the pink/purple column is the LUMO (lowest unoccupied molecular orbital); zoom in if needed. Select that line and its value appears in the “Energy” bar in “eV”. That is the LUMO energy.

  25. The topmost line of the green column is the HOMO (highest occupied molecular orbital); zoom in if needed. Select that line and its value appears in the “Energy” bar in “eV”. That is the HOMO energy.

  26. The difference between these energies is the HOMO-LUMO gap.

  27. If desired, you can plot isosurfaces of the selected orbital by pressing the “Plot” button at the bottom left of the Orbitals panel.

  28. Press “OK” when done with this screen.

  29. To get further descriptors:

  30. Select “Compute”

  31. Select “QSAR Properties” at the bottom

  32. In this window you can compute any of the properties by selecting the property and pressing “Compute.” The value is shown at the bottom of the “QSAR Properties” window.

  33. Scripts can also be written for HyperChem if desired; to do this, follow Lesson 17 or 18 in this manual: http://www.chemistry-software.com/pdf/Hyperchem_full_manual.pdf

  34. If the manual doesn’t come up, search the web for “Hyperchem Manual”

  35. If HyperChem reports an issue with your molecule before geometry optimization, such as a valence error, go to “Select”

    1. Select All

    2. Select again

    3. “Add H & Model Build”

  36. This corrects the error

  37. Additionally, if molecules are large and “Semi-empirical” geometry optimization takes too much time, molecular mechanics is needed

    1. When using molecular mechanics, make sure the atom “type” is correct.

    2. To check whether the atom type is correct, go to “Display”

      1. “Labels”

      2. “type”

    3. If you see “**” on the atoms, HyperChem needs to recalculate the atom types to get C, H, O, and other atoms.

      1. Go to “Setup”

      2. “Molecular Mechanics”

      3. Select “MM+”

      4. Click on “Options”

      5. Press OK

      6. Click on “Components”

      7. Press OK

      8. Then press “OK” on the same window you selected “MM+”

      9. You should get a popup window saying the atom types are going to be recalculated

      10. Press “OK”

    4. This should change the “**” to letters such as C, H, O

    5. If it doesn’t at first, try going to Labels and pressing “None.” Then repeat sub-steps 1-2 above

    6. If that doesn’t work, select “Semi-empirical.” Once selected, redo sub-steps 1-4 above

    7. Use RHF, not UHF
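
Once the orbital energies from steps 21-26 are copied out of the Orbitals window, the HOMO, LUMO, and gap are simple to compute. In this sketch the energies are hypothetical values in eV, and the helper name is illustrative. The gap is taken as LUMO minus HOMO.

```python
def homo_lumo_gap(occupied_ev, unoccupied_ev):
    """Return (HOMO, LUMO, gap) from lists of orbital energies in eV."""
    homo = max(occupied_ev)    # topmost occupied (green) level
    lumo = min(unoccupied_ev)  # bottommost unoccupied (pink/purple) level
    return homo, lumo, lumo - homo

# Hypothetical orbital energies in eV, NOT taken from a real calculation.
occupied = [-14.2, -11.7, -10.3]
unoccupied = [-0.9, 1.4]

homo, lumo, gap = homo_lumo_gap(occupied, unoccupied)
print(homo, lumo, round(gap, 3))
```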