The Excel file includes two tabs, one for monohulls and one for catamarans. Within each tab, the columns are labeled Make, Variant, Length (feet), Geographic Region, Country/Region/State, Listing Price (USD) and Year (of manufacture). For a given make, variant and year, many sources beyond the provided Excel file can supply detailed descriptions of the characteristics of a particular sailboat. You may supplement the provided dataset with any other data of your choice; however, your modeling must include the data in "2023_MCM_Problem_Y_Boats.xlsx". Please ensure that the source of any supplementary data is fully identified and documented.
1. Question 1
Build a mathematical model to explain the listing price of each sailboat in the spreadsheet provided. **Include any predictors you find useful.** You can use other sources to learn about other characteristics of a given sailboat (such as beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, electronics, etc.), as well as economic data by year and by region. **Identify and describe all data sources used.** **Discuss the accuracy of your price estimates for each type of sailboat.**
The first question breaks down into the following sub-tasks: collect the data considered useful, indicate and describe the sources of that data, model the price as the dependent variable, and discuss the accuracy of the predictions.
1.1 Collect useful data
The Variant variable given in the problem statement describes each sailboat's brand in detail. For these brands and manufacturers, further information about the sailboats can be found on the Internet. The figure above shows the kind of information that can be collected online by searching for a sailboat brand (a VPN may be needed for access); it covers some or all of the characteristics the problem statement suggests: "beam, draft, displacement, rigging, sail area, hull material, engine hours, sleeping capacity, headroom, electronics, etc." According to the Excel file, there are 398 different brands of Monohulled Sailboats and 111 different brands of Catamarans. If no ready-made dataset is available, we recommend that a team assign one member to spend half a day looking up these 500-plus brands one by one.
(Our studio will provide the collected data later; for now, we use the existing data to demonstrate the algorithm.)
1.2 Indicate and describe the data sources used
For any data you find, you must describe its source and show that it is accurate and reliable. (Sources will be marked when our studio provides the data.)
1.3 Data preprocessing
The raw data should be preprocessed before modeling. As shown in the figure above, the original data consists largely of strings: only length and year are numeric, and the remaining variables cannot be fed directly into a mathematical model. We therefore need to convert the raw data into the format that machine learning algorithms require.
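One common way to turn a string column into numeric inputs is one-hot encoding. The sketch below is a minimal, standard-library illustration; the rows, the placeholder make names, and the column name "Geographic Region" are assumptions made for the example, not values taken from the spreadsheet.

```python
# Hypothetical rows that mimic the spreadsheet layout (all values are made up).
rows = [
    {"Make": "MakeA", "Geographic Region": "Europe", "Length (ft)": 40.0, "Year": 2010},
    {"Make": "MakeB", "Geographic Region": "USA", "Length (ft)": 45.0, "Year": 2015},
    {"Make": "MakeA", "Geographic Region": "Caribbean", "Length (ft)": 38.0, "Year": 2008},
]


def one_hot(rows, column):
    """Replace a string column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}={c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded


encoded = one_hot(rows, "Geographic Region")
for row in encoded:
    print(row)
```

For a linear regression, one indicator per column is usually dropped to serve as the baseline category; tree models can generally use all of them. With pandas, `pandas.get_dummies` performs the same transformation.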
**Dealing with missing values:** According to a preliminary statistical analysis, the Monohulled Sailboats dataset, for example, has 3 missing values in Country/Region/State.
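A minimal sketch of how missing values might be counted and then handled is given below. The rows are invented, and filling with a placeholder category is only one option (dropping the rows is another); the text above does not prescribe either.

```python
# Hypothetical rows; None and empty strings stand in for blank spreadsheet cells.
rows = [
    {"Make": "MakeA", "Country/Region/State": "France", "Listing Price (USD)": 250000},
    {"Make": "MakeB", "Country/Region/State": None, "Listing Price (USD)": 180000},
    {"Make": "MakeC", "Country/Region/State": "  ", "Listing Price (USD)": 320000},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def count_missing(rows):
    """Count missing entries per column."""
    counts = {}
    for r in rows:
        for column, value in r.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


def fill_missing(rows, column, placeholder="Unknown"):
    """Return copies of the rows with missing entries in `column` replaced."""
    return [
        {**r, column: placeholder if is_missing(r[column]) else r[column]}
        for r in rows
    ]


print(count_missing(rows))
filled = fill_missing(rows, "Country/Region/State")
print(count_missing(filled))
```

Because only a handful of rows are affected, either choice should have little influence on the fitted model, but whichever one is used should be stated in the paper.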
**Dealing with outliers:** Outliers come in two kinds: values that were entered incorrectly by hand, and data points that differ markedly from the rest. Outliers can cause problems in machine learning and statistical models, so they need to be screened and handled before modeling. In this problem, outliers are most likely to appear in Listing Price, so the variable being predicted deserves particular attention.
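One simple screening rule is the interquartile-range (IQR) fence. The sketch below applies it to the logarithm of the price, which is an assumption made here because listing prices tend to be right-skewed; the price list itself is made up, and flagged points should be inspected rather than deleted automatically.

```python
import math
from statistics import quantiles

# Hypothetical listing prices in USD; the last one is deliberately extreme.
prices = [95_000, 120_000, 150_000, 180_000, 210_000,
          240_000, 260_000, 300_000, 9_500_000]


def flag_outliers(prices, k=1.5):
    """Return prices whose log10 lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    logs = [math.log10(p) for p in prices]  # prices must be positive
    q1, _, q3 = quantiles(logs, n=4)        # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p, x in zip(prices, logs) if x < low or x > high]


flagged = flag_outliers(prices)
print(flagged)
```

A flagged price may be a data-entry error or simply a genuinely expensive boat, so it is worth checking the listing before deciding whether to correct, keep, or exclude it.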
**Standardization:** Because price values fluctuate widely, standardization is a common processing technique. Scaling the data to a mean of 0 and a standard deviation of 1 removes the influence of units and makes variables comparable. It can also improve algorithm performance, speed up convergence, and make the influence of each independent variable on the dependent variable easier to interpret.
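The z-score transformation described above can be sketched in a few lines with the standard library. The prices are hypothetical, and the population standard deviation is used here as an assumption; the sample version would serve equally well.

```python
from statistics import mean, pstdev

# Hypothetical listing prices in USD.
prices = [95_000.0, 150_000.0, 210_000.0, 260_000.0, 480_000.0]


def standardize(values):
    """Return z-scores together with the mean and standard deviation used."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values], mu, sigma


z, mu, sigma = standardize(prices)
print(z)
```

When the data is split into training and test sets, `mu` and `sigma` should be computed on the training set only and then reused to transform the test set, so that no information from the test set leaks into training.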
1.4 Modeling price predictions and explaining accuracy
Because the second question asks you to "explain the impact of the region on the listing price", the price prediction model built in the first question is particularly important. For a statistical approach with strong interpretability, you can choose multiple linear regression; for a machine learning approach with high predictive performance, you can train an ensemble model such as XGBoost, AdaBoost or LightGBM, or a BP neural network. Whichever you choose, the workflow must be complete: model building, model training, model testing, and performance measurement. A statistical model requires a clear and detailed account of the statistical tests, along with the estimated values and confidence intervals of the parameters. A machine learning model requires attention to the performance metrics, the dataset split, and hyperparameter tuning. Only a complete modeling process can be expected to earn a high score and improve the chances of winning an award.
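The regression workflow described above (split the data, fit, report parameter estimates with confidence intervals, measure test performance) can be sketched with NumPy as follows. The data is synthetic, the predictors `length` and `age` and the log-price target are assumptions chosen for illustration, and the intervals use the normal approximation (1.96 standard errors) rather than exact t quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for the real spreadsheet (all values are made up).
n = 200
length = rng.uniform(30, 60, n)   # boat length in feet
age = rng.uniform(0, 30, n)       # years since manufacture
log_price = 10 + 0.05 * length - 0.03 * age + rng.normal(0, 0.1, n)

# Design matrix with an intercept column, then an 80/20 train/test split.
X = np.column_stack([np.ones(n), length, age])
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]

# Ordinary least squares fit on the training set.
beta, *_ = np.linalg.lstsq(X[train], log_price[train], rcond=None)

# Standard errors and approximate 95% confidence intervals.
resid = log_price[train] - X[train] @ beta
dof = len(train) - X.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X[train].T @ X[train])
se = np.sqrt(np.diag(cov))
ci_low, ci_high = beta - 1.96 * se, beta + 1.96 * se

# Out-of-sample performance on the held-out test set.
pred = X[test] @ beta
rmse = float(np.sqrt(np.mean((log_price[test] - pred) ** 2)))
ss_res = np.sum((log_price[test] - pred) ** 2)
ss_tot = np.sum((log_price[test] - log_price[test].mean()) ** 2)
r2 = float(1 - ss_res / ss_tot)

for name, b, lo, hi in zip(["intercept", "length", "age"], beta, ci_low, ci_high):
    print(f"{name}: {b:.4f}  (95% CI {lo:.4f} to {hi:.4f})")
print(f"test RMSE = {rmse:.4f}, test R^2 = {r2:.4f}")
```

The same split and the same test metrics can be reused for a tree ensemble or a neural network, which makes the accuracy comparison between model families straightforward to present.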
2. Question 2
Use your model to explain the **effect of region on listing prices** (if any). **Discuss whether any regional effects are consistent across all sailboat variants.** Discuss **the practical and statistical significance of any regional effects.**
2.1 The impact of regions on listing prices
The region column takes 73 distinct values in the Monohulled Sailboats dataset and 52 in the Catamarans dataset. There are generally two ways to explain the effect of region on price. The first is to use the regression coefficients from a regression analysis: once the variables are standardized, a coefficient directly reflects how strongly an independent variable influences the dependent variable. The second is to use the feature importance of a tree model; in XGBoost, for example, the gain indicates how strongly a feature influences the dependent variable. **What follows is a partial preview of the complete approach**
![insert image description here](https://img-blog.csdnimg.cn/27c753f80dc3437c8ce84bdb17c9fddf.png#pic_center)
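As a minimal sketch of the first approach, the example below fits a regression with region indicator variables and reads the regional effect off their coefficients. The data is synthetic, and the three region names, the baseline choice, and the log-price target are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic listings with a known regional effect on log price (values made up).
n = 300
regions = np.array(["Europe", "Caribbean", "USA"])
true_effect = {"Europe": 0.0, "Caribbean": -0.10, "USA": 0.15}

region = rng.choice(regions, size=n)
length = rng.uniform(30, 60, n)
effect = np.array([true_effect[r] for r in region])
log_price = 11 + 0.04 * length + effect + rng.normal(0, 0.1, n)

# "Europe" is the baseline, so each indicator coefficient is measured against it.
X = np.column_stack([
    np.ones(n),
    length,
    (region == "Caribbean").astype(float),
    (region == "USA").astype(float),
])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)

# With a log-price target, exp(coefficient) - 1 is the relative price difference.
premium = {
    "Caribbean": float(np.expm1(beta[2])),
    "USA": float(np.expm1(beta[3])),
}
for name, p in premium.items():
    print(f"{name} vs Europe: {p:+.1%}")
```

Fitting the same specification separately for monohulls and catamarans, or adding region-by-variant interaction terms, is one way to check whether the regional effect is consistent across boat types.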
3. Question 3
Discuss how your modeling of the given geographic regions can be useful in the Hong Kong (SAR) market. Select an informative subset of sailboats, divided into monohulls and catamarans, from the spreadsheet provided. **Find comparable listing price data for this subset from the Hong Kong (SAR) market.** Model the regional effect of Hong Kong (SAR), if any, on the price of each sailboat in your subset. **Is the effect the same for catamarans and monohulls?**
3.1 Collect listing price data from the Hong Kong SAR market
As shown in the figure above, you can search the Internet for sailboat prices in the Hong Kong Special Administrative Region. After matching these listings to the Make and Variant values in the dataset, you can extend the dataset to obtain the data needed for Question 3.
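The matching step can be sketched as a join on the (Make, Variant) pair. Everything in the example is hypothetical: the make and variant names are placeholders, the prices are invented, and the case-insensitive matching is an assumption about how hand-collected listings might need to be cleaned.

```python
# Hypothetical rows from the provided spreadsheet (placeholder names and prices).
dataset = [
    {"Make": "MakeA", "Variant": "Variant 40", "Listing Price (USD)": 250_000},
    {"Make": "MakeB", "Variant": "Variant 45", "Listing Price (USD)": 410_000},
]

# Hypothetical listings collected by hand from Hong Kong (SAR) broker sites.
hk_listings = [
    {"Make": "makea ", "Variant": "variant 40", "Listing Price (USD)": 275_000},
    {"Make": "MakeC", "Variant": "Variant 50", "Listing Price (USD)": 600_000},
]


def key(row):
    """Normalize make and variant so that small formatting differences still match."""
    return (row["Make"].strip().lower(), row["Variant"].strip().lower())


def match_listings(base_rows, hk_rows):
    """Pair each spreadsheet row with the Hong Kong listings of the same make and variant."""
    hk_by_key = {}
    for r in hk_rows:
        hk_by_key.setdefault(key(r), []).append(r)
    pairs = []
    for b in base_rows:
        for h in hk_by_key.get(key(b), []):
            pairs.append({
                "Make": b["Make"],
                "Variant": b["Variant"],
                "Base price (USD)": b["Listing Price (USD)"],
                "HK price (USD)": h["Listing Price (USD)"],
            })
    return pairs


pairs = match_listings(dataset, hk_listings)
print(pairs)
```

In practice the year of manufacture would usually be part of the comparison as well, since boats of the same variant but different ages are not directly comparable.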
4. Question 4
Identify and discuss any other interesting and informative inferences or conclusions your team draws from the data.
This question is relatively open. In general, we need to revisit our research questions and consider carefully whether each variable holds any information worth mining. In the first three questions, we built a regression model, analyzed the influence of the independent variables on the dependent variable, analyzed interactions, collected new data for analysis, and so on. In addition, based on the information in the problem itself and the data we have collected, we can also start from the following angles: **What follows is a partial preview of the complete approach**
5. Question 5
Prepare a one- to two-page report for sailboat brokers in Hong Kong (SAR). Include some well-chosen graphics to help brokers understand your conclusions.
There is not much to watch out for in this question; generally, keep the following points in mind. The problem asks for a report and explicitly states that graphics are required, so put the most important, conclusive figures from the modeling process into the report. The paper should ideally fill 25 pages, leaving only one page for references after the report, and the report itself should ideally fill two pages.